Section 0 Preface for Imports, Data Handling, & Methodologies
Section 0.1 Preface for Write-Up Interpretation & Acknowledgements
Code was used from my own Github repository, found at www.github.com/sunnydigital/IDS_F21
, including code derived from Stephen Spivak from Introduction to Data Science, Fall 2021. Most of the code falling under the aforementioned two categories surrounds the PCA
and k-means
analysis plots.
Section 0.2 Imports & Installation of Packages, Libraries, Seaborn Settings, and Dataset(s)
Below we set the random seed to the numeric portion of my ID: N12345678
import packages & libraries as well as set the settings for seaborn
plots
!pip install xgboost
!pip install impyute
!pip install missingno
!pip install eli5
!pip install scikit-optimize
!pip install tune-sklearn ray[tune]
!pip install scikit-plot
!pip install colorcet==3.0.0
!pip install yellowbrick
Requirement already satisfied: xgboost in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (1.7.3)
Requirement already satisfied: numpy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from xgboost) (1.21.5)
Requirement already satisfied: scipy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from xgboost) (1.7.3)
Requirement already satisfied: impyute in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (0.0.8)
Requirement already satisfied: scipy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from impyute) (1.7.3)
Requirement already satisfied: scikit-learn in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from impyute) (1.0.2)
Requirement already satisfied: numpy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from impyute) (1.21.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn->impyute) (2.2.0)
Requirement already satisfied: joblib>=0.11 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn->impyute) (1.1.0)
Requirement already satisfied: missingno in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (0.5.1)
Requirement already satisfied: numpy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from missingno) (1.21.5)
Requirement already satisfied: seaborn in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from missingno) (0.11.2)
Requirement already satisfied: scipy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from missingno) (1.7.3)
Requirement already satisfied: matplotlib in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from missingno) (3.5.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (1.4.2)
Requirement already satisfied: packaging>=20.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (21.3)
Requirement already satisfied: pyparsing>=2.2.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (3.0.4)
Requirement already satisfied: cycler>=0.10 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (0.11.0)
Requirement already satisfied: pillow>=6.2.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (9.0.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (4.25.0)
Requirement already satisfied: python-dateutil>=2.7 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: six>=1.5 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.16.0)
Requirement already satisfied: pandas>=0.23 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from seaborn->missingno) (1.4.2)
Requirement already satisfied: pytz>=2020.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from pandas>=0.23->seaborn->missingno) (2021.3)
Requirement already satisfied: eli5 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (0.13.0)
Requirement already satisfied: six in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (1.16.0)
Requirement already satisfied: numpy>=1.9.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (1.21.5)
Requirement already satisfied: attrs>17.1.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (21.4.0)
Requirement already satisfied: scikit-learn>=0.20 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (1.0.2)
Requirement already satisfied: jinja2>=3.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (3.0.3)
Requirement already satisfied: scipy in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (1.7.3)
Requirement already satisfied: graphviz in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (0.20.1)
Requirement already satisfied: tabulate>=0.7.7 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from eli5) (0.8.9)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (2.2.0)
Requirement already satisfied: joblib>=0.11 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (1.1.0)
Requirement already satisfied: scikit-optimize in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (0.9.0)
Requirement already satisfied: scikit-learn>=0.20.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-optimize) (1.0.2)
Requirement already satisfied: pyaml>=16.9 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-optimize) (21.10.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-optimize) (1.21.5)
Requirement already satisfied: joblib>=0.11 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-optimize) (1.1.0)
Requirement already satisfied: scipy>=0.19.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-optimize) (1.7.3)
Requirement already satisfied: PyYAML in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from pyaml>=16.9->scikit-optimize) (6.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.20.0->scikit-optimize) (2.2.0)
zsh:1: no matches found: ray[tune]
Requirement already satisfied: scikit-plot in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (0.3.7)
Requirement already satisfied: scikit-learn>=0.18 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-plot) (1.0.2)
Requirement already satisfied: matplotlib>=1.4.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-plot) (3.5.1)
Requirement already satisfied: scipy>=0.9 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-plot) (1.7.3)
Requirement already satisfied: joblib>=0.10 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-plot) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (3.0.4)
Requirement already satisfied: pillow>=6.2.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (9.0.1)
Requirement already satisfied: numpy>=1.17 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (1.21.5)
Requirement already satisfied: fonttools>=4.22.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=1.4.0->scikit-plot) (1.4.2)
Requirement already satisfied: six>=1.5 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib>=1.4.0->scikit-plot) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.18->scikit-plot) (2.2.0)
Requirement already satisfied: colorcet==3.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (3.0.0)
Requirement already satisfied: param>=1.7.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from colorcet==3.0.0) (1.12.0)
Requirement already satisfied: pyct>=0.4.4 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from colorcet==3.0.0) (0.4.6)
Requirement already satisfied: yellowbrick in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (1.5)
Requirement already satisfied: scikit-learn>=1.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from yellowbrick) (1.0.2)
Requirement already satisfied: numpy>=1.16.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from yellowbrick) (1.21.5)
Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from yellowbrick) (3.5.1)
Requirement already satisfied: cycler>=0.10.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from yellowbrick) (0.11.0)
Requirement already satisfied: scipy>=1.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from yellowbrick) (1.7.3)
Requirement already satisfied: packaging>=20.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (21.3)
Requirement already satisfied: fonttools>=4.22.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.4.2)
Requirement already satisfied: python-dateutil>=2.7 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.8.2)
Requirement already satisfied: pillow>=6.2.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (9.0.1)
Requirement already satisfied: pyparsing>=2.2.1 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (3.0.4)
Requirement already satisfied: six>=1.5 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.16.0)
Requirement already satisfied: joblib>=0.11 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=1.0.0->yellowbrick) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=1.0.0->yellowbrick) (2.2.0)
Below we set the random seed to the numerical portion of our NYU ID (as per the spec sheet).
import random
random.seed(12345678)
import sys
import os
import time
import re
import numpy as np
import pandas as pd
from statsmodels.distributions.empirical_distribution import ECDF
import statsmodels.api as sm
import scipy.stats as st
import statsmodels.api as sm ## Need revision for Windows use
from scipy.stats import zscore
from scipy.spatial.distance import squareform
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression, Perceptron
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_validate, KFold, cross_val_score, RepeatedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer, confusion_matrix, plot_confusion_matrix, mean_squared_error as mse, f1_score
from sklearn import tree, ensemble, metrics, calibration
from sklearn.svm import SVC, LinearSVC
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_samples
from sklearn.impute import KNNImputer
from sklearn.ensemble import AdaBoostClassifier
from IPython import display
from eli5.sklearn import PermutationImportance
from eli5 import show_prediction
import eli5
from skopt.space import Real, Categorical, Integer
from skopt.plots import plot_objective
from skopt import BayesSearchCV
from tune_sklearn import TuneSearchCV, TuneGridSearchCV
import ray.tune as tune
from tqdm import tqdm
from matplotlib import pyplot as plt
from matplotlib import mlab
from matplotlib.pyplot import figure
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
import scikitplot as skplt
import seaborn as sns; sns.set_theme(color_codes=True); sns.set_style("whitegrid")
import graphviz
import colorcet as cc
from impyute.imputation.cs import fast_knn
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
from yellowbrick.classifier import ROCAUC
Section 0.3 NaN
Data Visual Inspection
Below we view the potentially missing data out of the dataset we obtained. Missing data (yellow) are greatly contrasted with the present data (blue).
This was done to visually inspect the proportion of data missing versus those present.
data = 'musicData.csv'
df = pd.read_csv(data)
colors = ['#000099', '#ffff00']
fig, ax = plt.subplots(figsize=(100,10))
sns.heatmap(df.isnull(), cmap=sns.color_palette(colors), ax=ax)
<AxesSubplot:>
Section 0.4 string
Data Removal
Through our goal of classification/unsupervised learning, we will not include string
data, as in the scope of this analysis, we will not be considering tthe potentially lingusitic properties of artist_name
, track_name
, obtained_date
. Furthermore, instance_id
is simply the unique classifier of an individual song, and given we are not considering the properties associated with a song, we will not need the specific label.
df_map = df[['instance_id', 'artist_name', 'track_name', 'obtained_date']]
df = df.loc[:, ~df.columns.isin(['instance_id', 'artist_name','track_name','obtained_date'])]
df.head(10)
popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | valence | music_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.00468 | 0.652 | -1.0 | 0.941 | 0.792000 | A# | 0.1150 | -5.201 | Minor | 0.0748 | 100.889 | 0.759 | Electronic |
1 | 31.0 | 0.01270 | 0.622 | 218293.0 | 0.890 | 0.950000 | D | 0.1240 | -7.043 | Minor | 0.0300 | 115.00200000000001 | 0.531 | Electronic |
2 | 28.0 | 0.00306 | 0.620 | 215613.0 | 0.755 | 0.011800 | G# | 0.5340 | -4.617 | Major | 0.0345 | 127.994 | 0.333 | Electronic |
3 | 34.0 | 0.02540 | 0.774 | 166875.0 | 0.700 | 0.002530 | C# | 0.1570 | -4.498 | Major | 0.2390 | 128.014 | 0.270 | Electronic |
4 | 32.0 | 0.00465 | 0.638 | 222369.0 | 0.587 | 0.909000 | F# | 0.1570 | -6.266 | Major | 0.0413 | 145.036 | 0.323 | Electronic |
5 | 47.0 | 0.00523 | 0.755 | 519468.0 | 0.731 | 0.854000 | D | 0.2160 | -10.517 | Minor | 0.0412 | ? | 0.614 | Electronic |
6 | 46.0 | 0.02890 | 0.572 | 214408.0 | 0.803 | 0.000008 | B | 0.1060 | -4.294 | Major | 0.3510 | 149.995 | 0.230 | Electronic |
7 | 43.0 | 0.02970 | 0.809 | 416132.0 | 0.706 | 0.903000 | G | 0.0635 | -9.339 | Minor | 0.0484 | 120.008 | 0.761 | Electronic |
8 | 39.0 | 0.00299 | 0.509 | 292800.0 | 0.921 | 0.000276 | F | 0.1780 | -3.175 | Minor | 0.2680 | 149.94799999999998 | 0.273 | Electronic |
9 | 22.0 | 0.00934 | 0.578 | 204800.0 | 0.731 | 0.011200 | A | 0.1110 | -7.091 | Minor | 0.1730 | 139.933 | 0.203 | Electronic |
Section 0.5 Handling Null/String & Categorical Data
Below, we check the presence of NA
’s and any non-numeric values in the dataframe and afterwards output a description of the dataframe.
First, we create a lambda
function to attempt to coerce
all columns into a numeric
format, and if that is impossible, they are instead turned into NaN
’s. Furthermore, incorrect (missing) values in the duration_ms
column are represented as -1
and also converted into NaN
’s.
This was done to convert all potentially non-numeric values (entries that cannot be worked with) into a consolidated format.
Next, we impute using the sklearn.impute.KNNImputer
to find missing values given other available labels in a sample. This incorporates all of the converted NaN
values in the previous step.
This was done to numerically solve for missing values, using a k-Nearest Neighbor imputation technique, given the dearth of missing data (Section 0.3) that makes this technique particularly potent for the case.
lambda_apply = lambda x: pd.to_numeric(x, errors='coerce')
df[['popularity', 'acousticness', 'danceability', 'duration_ms',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'valence']] = df[['popularity', 'acousticness', 'danceability', 'duration_ms',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'valence']].apply(lambda_apply).replace(-1, np.NaN)
to_impute = df[['popularity', 'acousticness', 'danceability', 'duration_ms',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'valence']]
imputer = KNNImputer(n_neighbors=5)
impute_fit = imputer.fit_transform(to_impute)
df_drop = df.copy()
df_drop[['popularity', 'acousticness', 'danceability', 'duration_ms',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'valence']] = impute_fit
df_drop.dropna(inplace=True)
Section 0.6 key
& mode
Column Combination
Below, we create a combined column out of key
and mode
, signifying the gradable key, mode (e.g. A#_Major
or C#_Minor
) combination in music, denoting the colloquial key
a song is written in.
This was done in obervance of classical music theory and application of the Circle of Fifths[1] showcasing the cyclical continuity of the key
feature explored in this analysis.
df_drop['key'] = df_drop['key'] + '_' + df_drop['mode']
df_drop = df_drop.drop(labels=['mode'], axis='columns').reset_index(drop=True)
We then check the number of NaN
’ there are in a given column.
df_drop['tempo'].isnull().sum()
0
Section 0.7 LabelEncoder
Application to Obtain Numeric Feature Classes
In this section we use the LabelEncoder
package from sklearn.preprocessing
to obtain numerical data for features with string values that are continuous. We fit key
(modified from the previous section) and the label classes in music_genre
against LabelEncoder
s.
This was done to enable the PCA
and XGBoost
algorithms to be applied later on the be usable.
key_le = LabelEncoder().fit(df_drop['key'])
key_encode = pd.DataFrame(data=key_le.transform(df_drop['key']), columns=['key'])
genre_le = LabelEncoder().fit(df_drop['music_genre'])
genre_encode = pd.DataFrame(data=genre_le.transform(df_drop['music_genre']), columns=['music_genre'])
key_inverse = lambda x: key_le.inverse_transform(x)
genre_inverse = lambda x: genre_le.inverse_transform(x)
In the following line we create a copy of df_encode
and replace the original key
& music_genre
columns with the LabelEncode
transformed outputs.
df_encode = df_drop.copy()
df_encode['key'] = key_encode['key']
df_encode['music_genre'] = genre_encode['music_genre']
We then inspect the entire dataframe to determine any missing (NaN
) values. There fortunately seems to be none.
print(key_encode.isnull().sum())
print(genre_encode.isnull().sum())
df_encode.isnull().sum()
key 0
dtype: int64
music_genre 0
dtype: int64
popularity 0
acousticness 0
danceability 0
duration_ms 0
energy 0
instrumentalness 0
key 0
liveness 0
loudness 0
speechiness 0
tempo 0
valence 0
music_genre 0
dtype: int64
Section 0.8 Train Test Split Using Stratification
In the section below we utilize train_test_split
from sklearn.model_selection
to obtain a test size 0.1
the size of the entire dataframe, and stratified against each class of the label (y = df_encode['music_genre']
). This means we obtain proportional 0.1
amounts from each class label as opposed to obtaining from the total dataset, where it would be highly unlikely to obtain similar proportions of each class.
We do this in accordance to the Spec Sheet, but also to reduce sampling bias given the lack of equality in presence of all classes.
X = df_encode.loc[:,:'valence']
y = df_encode.loc[:,'music_genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
Section 0.9 Standardization & Normalization
For this dataset, (and the applicable functions of dimensionality reduction through PCA and clustering through k-means), I have chosen to standardize through the below mapping:
$$ \vec{X}{\textrm{Z-Score}} = \frac{\vec{X} - \mu{\vec{X}}}{\sigma_{\vec{X}}} $$
where each $\vec{X}$ is a given column of data.
X_train = st.zscore(X_train)
X_test = st.zscore(X_test)
Section 1.0 PCA Fitting and Principal Component Analysis
We run a Principal Component Analysis using the PCA
package from the sklearn
library. When, we determine the eig_vals
as the amount of explained variance, the loadings
as the components of the output and rotated_data
as the transformed data. Note the * -1
for all lines of code: this was done due to a fundamental mistake in Python and makes outputs more interpretable. We then cycle through eig_vals
to determine the percent of variance explained by each principal component.
This was done to initialize the PCA (with correct assumptions inclusive of Python flaws) and display the amount of variation attributable.
ordinal = lambda n: "%d%s" % (n,"tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4])
pca, pca_t = PCA().fit(X_train), PCA().fit(X_test)
eig_vals, eig_vals_t = pca.explained_variance_, pca_t.explained_variance_
loadings, loadings_t = pca.components_ * -1, pca_t.components_ * -1
rotated_data, rotated_data_t = pca.fit_transform(X_train) * -1, pca_t.fit_transform(X_test) * -1
covar_explained, covar_explained_t = eig_vals / sum(eig_vals) * 100, eig_vals_t / sum(eig_vals_t) * 100
for n in range(len(covar_explained)):
explained = covar_explained[n].round(5)
print(f'{ordinal(n+1)} PC explains {explained}% of variance')
1st PC explains 31.04849% of variance
2nd PC explains 11.28439% of variance
3rd PC explains 8.84077% of variance
4th PC explains 8.32752% of variance
5th PC explains 8.04052% of variance
6th PC explains 7.78231% of variance
7th PC explains 6.787% of variance
8th PC explains 6.03472% of variance
9th PC explains 4.94202% of variance
10th PC explains 3.85229% of variance
11th PC explains 2.14013% of variance
12th PC explains 0.91984% of variance
Section 1.1 Scree Plot
Below we show a scree plot of the eigenvalues and the amount of total explainability (out of the total number of eig_vals
) attributed to each PC. We then draw a line at Eigenvalue = 1
to show how many eig_vals
are above 1.
This was done to visually inspect the scree plot and determine the cutoff point for the Kaiser criterion (Eigenvalue = 1
).
num_col = len(X_train.columns)
x = np.linspace(1,num_col, num_col)
plt.bar(x, eig_vals, color='gray')
plt.plot([0,num_col], [1,1], color='orange')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.title('musicData.csv Principal Components')
plt.show()
Section 1.2 Kaiser, Elbow, 90% Variance
Below we show code regarding the number of PCs to choose around each criterion.
- For the Kaiser criterion, we count the number of PCs with
eigenvalue > 1
. This was determined to be3 PCs. - For the Elbow criterion I eyeballed the data and determined the cutoff point at 2 PCs
- For the number of factors to account for more than 90% variance of the data, we ran
np.cumsum
on thecovar_explained
(as defined in Section 1.0) and count the number needed. This was determined to be 9 PCs
We do so due to these being the main methods of determining the number of PCs to choose, and run three methods to exhaustively optimize for the number.
Code from Introduction to Data Science, Fall 2021
threshold = 1
print('Number of factors selected by Kaiser criterion:', np.count_nonzero(eig_vals > threshold))
print('Number of factors selected by Elbow criterion: 2')
threshold = 90
eig_sum = np.cumsum(covar_explained)
print('Number of factors to account for at least 90% variance:', np.count_nonzero(eig_sum < threshold) + 1)
Number of factors selected by Kaiser criterion: 3
Number of factors selected by Elbow criterion: 2
Number of factors to account for at least 90% variance: 9
Section 1.3 Principal Component Loadings Plots
Below (two cells) we cycle through the first 9 PCs and plot each attributable loadings using seaborn
. Next, we output each predictor and the associated label.
This was done to provide an initial visual inspection of the loadings plots of each PC.
look_at = 9
for i in range(0,look_at):
plt.subplot(3,3,i+1)
plt.bar(x, loadings[i,:])
plt.title(f'Principal Component {i+1}', fontsize=10)
plt.xlabel('Predictor', fontsize=8)
plt.ylabel('Loading', fontsize=8)
plt.tight_layout()
for index in range(len(X_train.columns)):
print(f'Predictor {index+1}: {X_train.columns[index]}')
Predictor 1: popularity
Predictor 2: acousticness
Predictor 3: danceability
Predictor 4: duration_ms
Predictor 5: energy
Predictor 6: instrumentalness
Predictor 7: key
Predictor 8: liveness
Predictor 9: loudness
Predictor 10: speechiness
Predictor 11: tempo
Predictor 12: valence
Section 1.4 Scatterplot
We scatter a 2d plot of the first 2 PCs against each other, with PC1 on the x-axis and PC2 on the y-axis. We then color the scattered points associated with the appropriate label class.
This provides an ultimate visual representation of the associated PC loadings for later visual/clustering analysis, as well as a visual (colored) representation of the label calss each point belongs in.
Code modified from Introduction to Data Science Fall 2021
sns.scatterplot(x=rotated_data[:,0], y=rotated_data[:,1], hue=y_train, s=10, marker='x', palette='Paired')
plt.title('Comparison of First Two PCs')
# Set x-axis label
plt.xlabel('Principal Component 1')
# Set y-axis label
plt.ylabel('Principal Component 2')
Text(0, 0.5, 'Principal Component 2')
Section 1.5 3-D Plot of First 3 Principal Components.
We plot in 3d projections using seaborn
’s Axes3D
package the first 3 PCs.
This was done to obtain a better understanding of the data. We see that PC3 does not seem to provide added variance, but rather a distribution horizontally around equal above and below the mean. However, in the 3rd dimension striations seem to result from the label classes presented.
%matplotlib widget
fig = plt.figure()
ax = fig.gca(projection='3d')
fig.add_axes(ax)
cmap = ListedColormap(sns.color_palette('Paired', 10).as_hex())
sc = ax.scatter(rotated_data[:,0],
rotated_data[:,1],
rotated_data[:,2],
s=40, marker='.', c=y_train.to_list(), cmap=cmap, alpha=0.05)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()
/var/folders/75/nxb38mxd16q4m2ylbprp28s00000gn/T/ipykernel_78524/1430082577.py:4: MatplotlibDeprecationWarning: Calling gca() with keyword arguments was deprecated in Matplotlib 3.4. Starting two minor releases later, gca() will take no keyword arguments. The gca() function should only be used to get the current axes, or if no axes exist, create new axes with default keyword arguments. To create a new axes with non-default arguments, use plt.axes() or plt.subplot().
ax = fig.gca(projection='3d')
%matplotlib inline
Section 1.6 Results & Discussion
From the Section 1.1 scree plot & 1.3 analytical determination, we see that 4 PCs have eigenvalue > 1
(adhering to the Kaiser criterion). This means that the number of PCs that have greater explanability of the total dataset greater than an individual feature is 3.
From the Section 1.0 original fitting and cumsum
of eig_vals
analysis, we see that the first PC accounts for “31.04386%
of variance” and the second PC accounts for “11.29876%
of variance” (the third PC accounting for 8.84217%
). Together, they (the first two) account for 42.34262%
of the variance in the data (with 3 PCs accounting for 51.18479%
). This means that with just 2 PCs, almost half of the variance in the data is accounted for (and with 3 PCs, slightly more than half is accounted for). In 2-D this would be a well-suited method for visual/future-clustering analysis, but in scope of this analysis (applying the Kaiser)
From viewing the loadings
of the first two PCs, we see that:
-
PC1: loadings points very positively towards
popularity
,danceability
,energy
,loudness
, andvalence
but very negatively towardsacousticness
andinstrumentalness
. Unlisted loadings were smaller/near-zero in magnitude. -
PC2: strongly positive towards
popularity
,acousticness
,danceability
,speechiness
, negatively towardsduration_ms
,energy
,liveliness
,loudness
, andtempo
. Unlisted loadings were smaller/near-zero in magnitude.
In interpretation, we see that PC1 is accute towards features able to be interpreted as having qualities of a “pop song” (with popularity
, danceability
, and loudness
all influencing the quality), and PC2 as being accute in features positioned toward dance music: danceability
, popularity
, and speechiness
.
Section 2.0 K-Means Clustering
For this Section, we realized the importance of using a distance-preserving method of dimensionality reduction, given k-means’ dependance on distance metrics to create clusters. This is why I chose PCA as the form of dimensionality reduction to be used, as opposed to t-SNE, which only preserves local similarities. I did not choose MDS nor t-SNE due to the lack of open interpretability of factors in both (loadings).
In the two blocks of code below, we call and create a np.array
of the first two PCs of the PCA analysis done in Section 1. We then create a np.array
of NaN
s of num_clusters
to look at, with num_clusters=9
to store each sum of silhouette scores for every number of clusters tried.
We then loop through the 9 total clusters, using the KMeans
class from sklearn.cluster
to calculate the clusters for each number of clusters, and create subplots using matplotlib
of associated silhouette score (found using silhouette_samples
from sklearn.metrics
) vs. count of occurrence. Each plot displays the number of clusters and the sum of the silhouette scores as the title.
This was done to output subplots showing the silhouette score vs. the count of occurrence, with titles bearing the sum of silhouette scores.
The code below is modified from Introduction to Data Science, Fall 2021
x = np.column_stack((rotated_data[:,0], rotated_data[:,1]))
num_clusters = 9 # how many clusters are we looping over? (from 2 to 10)
Q = np.empty([num_clusters,1])*np.NaN # init container to store sums
# Compute kMeans:
for i in range(2, 11): # Loop through each cluster (from 2 to 10!)
kMeans = KMeans(n_clusters = int(i)).fit(x) # compute kmeans using scikit
cId = kMeans.labels_ # vector of cluster IDs that the row belongs to
cCoords = kMeans.cluster_centers_ # coordinate location for center of each cluster
s = silhouette_samples(x, cId) # compute the mean silhouette coefficient of all samples
Q[i-2] = sum(s) # take the sum
# Plot data:
plt.subplot(3, 3, i-1)
plt.hist(s,bins=20)
plt.xlim(-0. 2,1)
plt.ylim(0, 10000)
plt.xlabel('Silhouette score')
plt.ylabel('Count')
plt.title(f'Clust: {i}, Sum: {int(Q[i-2])}') # sum rounded to nearest integer
plt.tight_layout() # adjusts subplot padding
Section 2.1 Plot of Number of Clusters vs. Sum of Silhouette
Below we show on the x-axis the number of clusters vs. the sum of silhouette scores on the y-axis. We use matplotlib
to plot the associated np.linspace
object, set between 2 to 10 with 9 in-between divisions against the sum of silhouette scores obtained from the filled Q
array in the previous section.
The code below is modified from Introduction to Data Science, Fall 2021
plt.plot(np.linspace(2,10,9), Q)
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Scores')
plt.title('Sum of Silhouette Scores per Cluster Number')
plt.show()
Section 2.2 Plot of First Two PCs
Below we plot the the first two PCs, colored by individual clusters (the number of clusters obtained by optimal silhouette value obtained above through np.argmax
of the Q
silhouette sum array added to 2 (np.argmax(Q) + 2
because we start at index 0 to store the lowest number of clusters, 2). We first hard code the optimal number of clusters into num_clusters
, obtained from the silhouette plot (Section 2.1) above (which is 2
). We instance a KMeans
class object and fit it with two dimensions of the loadings
from PCA analysis (Section 1), as requested by the spec sheet to produce visualizations in 2-D. Next, looping through the number of clusters to be plotted (num_clusters
), we plot (in green
, blue
, orange
respectively), associated clusters by determining the indices of associated labels in rotated_data
. Finally, we plot the associated centroids.
This was done to visualize in two PC dimensions individual clusters, alongside their associated centroids.
Note: if code does not output clusters ALONGSIDE centroids, please disregard this section in grading and use the 3-D analogy of Section 2.3 instead.
Code adapted from Introduction to Data Science, Fall 2021
%matplotlib inline
num_clusters = np.argmax(Q) + 2
kMeans = KMeans(n_clusters=num_clusters)
kMeans = kMeans.fit(np.column_stack((rotated_data[:,0], rotated_data[:,1])))
cId = kMeans.labels_
cCoords = kMeans.cluster_centers_
for i in range(num_clusters):
plotIndex = np.argwhere(cId == int(i))
plt.plot(rotated_data[plotIndex,0], rotated_data[plotIndex,1], '.', markersize=10, alpha=0.2)
plt.plot(cCoords[i,0], cCoords[i,1],'o', markersize=10, color='black', alpha=1.0)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Comparison of First Two PCs')
Section 2.3 3-D Plot of 3 Principal Component Clustering
This section maps the two chosen optimal number of clusters onto a 3-D space with 3 PCs.
This was done to visualize what k-means would look like in higher-dimensions with more variation in the data (explained by more PCs) included.
%matplotlib notebook
num_clusters = np.argmax(Q) + 2
kMeans = KMeans(n_clusters = num_clusters).fit(np.column_stack((rotated_data[:,0],
rotated_data[:,1],
rotated_data[:,2])))
cId = kMeans.labels_
cCoords = kMeans.cluster_centers_
fig = plt.figure(figsize=(6,6))
ax = Axes3D(fig, auto_add_to_figure=False)
fig.add_axes(ax)
# Plot the color-coded data:
for i in range(num_clusters):
plotIndex = np.argwhere(cId == int(i))
sc = ax.scatter(rotated_data[plotIndex,0],
rotated_data[plotIndex,1],
rotated_data[plotIndex,2],
label=f'Cluster {i+1}',
s=40, marker='.', alpha=0.05)
sc = ax.scatter(cCoords[int(i-1),0],
cCoords[int(i-1),1],
cCoords[int(i-1),2],
color='black', s=70, alpha=1.0)
ax.legend(loc='best')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
Warning: Cannot change to a different GUI toolkit: notebook. Using widget instead.
Text(0.5, 0, 'Principal Component 3')
%matplotlib inline
Section 2.4 Results & Discussion
From the above, we find that by implementing k-means on a PCA reduced musicData.csv
, with a sum of silhouette scores of 24558
, and an optimal number of clusters of 2
(Sections 2.0 & 2.1), we obtain the visualization shown in Section 2.2.
This however, could potentially not be accurate given a brief viewing of the dataset: shown in Section 2.1, we can see isses with the dataset being potentially unable to be broken up into individual “spherical clusters” (due to the singular, already clustered blob resulting from the PCA). Potential outliers are also included in the k-means analysis, such as the point around (-3.9
, -15.25
) in the graph of Section 2.2.
However, this could also mean that using the first two and three PCs respectively, explaining 42.34262%
and 51.18479%
of the variation in the data respectively (Section 1.6), 2
clusters do exist, allowing for further analysis through unsupervised classification (not proceeded with in the scope of this analysis).
Section 3.0 Xtreme Gradient Boost (XGBoost
) Function Definitions
Below we create a dictionary of preprocessed training_data
inclusive of X_train
, X_test
, y_train
, & y_test
.
This was done for ease of accessibility.
In the next block of code, we define fitXgb
to output the fitting progress (mlogloss
and merror
, m
meaning multiclass for each test
and train
sets) every 100 epochs. At the end of fitting, we output the accuracy
, micro f1
score, and macro f1
score.[4]
Each micro
and macro
score utilize different aggregation methods of determining a combined score, with micro
calculating each precision, recall by aggregating then dividing, and macro
calculating each precision and recall individually, then finding the combined average.
We then define plot_compare
to output visualization plots of the mlogloss
and merror
for each train
and test
set.
This was done to give more quantitative (for the former) and qualitative (latter plot) information regarding the fitting of the XGBoost
model.
I choose XGBoost
due to it being the “silver bullet” of Machine Learning at the moment, as well as the combinatorially positive effects of mixing together the plus-sides of random forests (bagging) as well as the adherence to an objective (loss) function through “Gradient Boosting.” Gradient Boosted Decision Trees (GBDTs) formalizes additively generating weak models (akin to AdaBoost) but over an objective function. Each successive, iterative model trains a “shallow decision tree,.. over the error residuals of a previous model to fit the next.” The model then weighs the sum of all predictions.[5]
training_data = {'X_train': rotated_data[:,:4],
'X_test': rotated_data_t[:,:4],
'y_train': y_train,
'y_test': y_test}
#allow logloss and classification error plots for each iteration of xgb model
def plot_compare(metrics, eval_results, epochs):
for m in metrics:
test_score = eval_results['test'][m]
train_score = eval_results['train'][m]
rang = range(0, epochs)
plt.rcParams["figure.figsize"] = [6,6]
plt.plot(rang, test_score,"c", label="Test")
plt.plot(rang, train_score,"orange", label="Train")
title_name = m + " plot"
plt.title(title_name)
plt.xlabel('Iterations')
plt.ylabel(m)
lgd = plt.legend()
plt.show()
def fitXgb(sk_model, training_data=training_data, epochs=100):
print('Fitting model...')
sk_model.fit(training_data['X_train'], training_data['y_train'].values.reshape(-1, 1))
print('Fitting done!')
train = xgb.DMatrix(training_data['X_train'], label=training_data['y_train'])
test = xgb.DMatrix(training_data['X_test'], label=training_data['y_test'])
params = sk_model.get_xgb_params()
metrics = ['mlogloss','merror']
params['eval_metric'] = metrics
store = {}
evallist = [(test, 'test'), (train,'train')]
xgb_model = xgb.train(params, train, epochs, evallist, evals_result=store, verbose_eval=100)
print('-- Model Report --')
accuracy = accuracy_score(sk_model.predict(training_data['X_test']), training_data['y_test'])
f1_micro = f1_score(sk_model.predict(training_data['X_test']), training_data['y_test'], average='micro')
f1_macro = f1_score(sk_model.predict(training_data['X_test']), training_data['y_test'], average='macro')
print(f'XGBoost Accuracy: {accuracy}')
print(f'XGBoost F1-Score (Micro): {f1_micro}')
print(f'XGBoost F1-Score (Macro): {f1_macro}')
plot_compare(metrics, store, epochs)
Section 3.1 XGBoost
Initial Fitting & Analysis
Below, we fit out initial XGBoost
model, xgb_clf1
with objective=softproba
to determine AUC ROC in later sections.
This was done to be able to determine AUC ROC (also, as specified by the spec sheet). We specify which hyperparameters will be tuned later on within the code, and output graphics and text using the functions defined in Section 3.0 above.
## Computer/dataset specifics about running the XGBoost model
num_threads = 8
num_classes = 10
#initial model
xgb_clf1 = XGBClassifier(learning_rate=0.1,
n_estimators=1000,
max_depth=5, #tunable
min_child_weight=1, #tunable
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softproba',
nthread=num_threads,
num_class=num_classes,
seed=1234567)
fitXgb(xgb_clf1, training_data)
Fitting model...
Fitting done!
[0] test-mlogloss:2.24280 test-merror:0.71080 train-mlogloss:2.23709 train-merror:0.68424
/Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py:617: FutureWarning: Pass `evals` as keyword args.
warnings.warn(msg, FutureWarning)
[99] test-mlogloss:1.72184 test-merror:0.67020 train-mlogloss:1.52533 train-merror:0.58136
-- Model Report --
XGBoost Accuracy: 0.3096
XGBoost F1-Score (Micro): 0.3096
XGBoost F1-Score (Macro): 0.29959849702613905
Section 3.2 Tuning max_depth
& min_child_weight
Direcly below, we define getTrainScores
to output more detail about the fully tuned model shown further below.
Two blocks below, we create a param_test
dictionary of hyperparameters to tune, the two being max_depth
& min_child_weight
. We then use GridSearchCV
from sklearn.model_selection
to fit different cardinal products of the param_test
hyperparameters to maximize the ovr_auc_roc
function.
The aforementioned loss function is a heuristic for using individual, binary classifications to proxy a multi-class classification on the AUC ROC of the target class versus the combination of all other classes for the label.
This was done to induce hyperparameter tuning of the two mentioned above, but in a way directed towards maximizing AUC ROC.
def getTrainScores(gs):
results = {}
runs = 0
for x,y in zip(list(gs.cv_results_['mean_test_score']), gs.cv_results_['params']):
results[runs] = 'mean:' + str(x) + 'params' + str(y)
runs += 1
best = {'best_mean': gs.best_score_, "best_param":gs.best_params_}
return results, best
param_test1 = {
'max_depth': [3, 6, 9],
'min_child_weight': [1, 3, 5]
}
# metrics to consider: f1_micro, f1_macro, roc_auc_ovr
gsearch1 = GridSearchCV(estimator=xgb_clf1, param_grid=param_test1, scoring='roc_auc_ovr', verbose=10, cv=2)
gsearch1.fit(training_data['X_train'], training_data['y_train'])
Fitting 2 folds for each of 9 candidates, totalling 18 fits
[CV 1/2; 1/9] START max_depth=3, min_child_weight=1.............................
[CV 1/2; 1/9] END max_depth=3, min_child_weight=1;, score=0.815 total time= 1.0min
[CV 2/2; 1/9] START max_depth=3, min_child_weight=1.............................
[CV 2/2; 1/9] END max_depth=3, min_child_weight=1;, score=0.816 total time= 1.2min
[CV 1/2; 2/9] START max_depth=3, min_child_weight=3.............................
[CV 1/2; 2/9] END max_depth=3, min_child_weight=3;, score=0.815 total time= 1.1min
[CV 2/2; 2/9] START max_depth=3, min_child_weight=3.............................
[CV 2/2; 2/9] END max_depth=3, min_child_weight=3;, score=0.816 total time= 57.7s
[CV 1/2; 3/9] START max_depth=3, min_child_weight=5.............................
[CV 1/2; 3/9] END max_depth=3, min_child_weight=5;, score=0.815 total time= 51.7s
[CV 2/2; 3/9] START max_depth=3, min_child_weight=5.............................
[CV 2/2; 3/9] END max_depth=3, min_child_weight=5;, score=0.816 total time= 1.0min
[CV 1/2; 4/9] START max_depth=6, min_child_weight=1.............................
[CV 1/2; 4/9] END max_depth=6, min_child_weight=1;, score=0.801 total time= 2.0min
[CV 2/2; 4/9] START max_depth=6, min_child_weight=1.............................
[CV 2/2; 4/9] END max_depth=6, min_child_weight=1;, score=0.801 total time= 1.9min
[CV 1/2; 5/9] START max_depth=6, min_child_weight=3.............................
[CV 1/2; 5/9] END max_depth=6, min_child_weight=3;, score=0.798 total time= 1.4min
[CV 2/2; 5/9] START max_depth=6, min_child_weight=3.............................
[CV 2/2; 5/9] END max_depth=6, min_child_weight=3;, score=0.800 total time= 1.3min
[CV 1/2; 6/9] START max_depth=6, min_child_weight=5.............................
[CV 1/2; 6/9] END max_depth=6, min_child_weight=5;, score=0.798 total time= 2.0min
[CV 2/2; 6/9] START max_depth=6, min_child_weight=5.............................
[CV 2/2; 6/9] END max_depth=6, min_child_weight=5;, score=0.799 total time= 1.4min
[CV 1/2; 7/9] START max_depth=9, min_child_weight=1.............................
[CV 1/2; 7/9] END max_depth=9, min_child_weight=1;, score=0.790 total time= 2.5min
[CV 2/2; 7/9] START max_depth=9, min_child_weight=1.............................
[CV 2/2; 7/9] END max_depth=9, min_child_weight=1;, score=0.791 total time= 2.2min
[CV 1/2; 8/9] START max_depth=9, min_child_weight=3.............................
[CV 1/2; 8/9] END max_depth=9, min_child_weight=3;, score=0.788 total time= 2.2min
[CV 2/2; 8/9] START max_depth=9, min_child_weight=3.............................
[CV 2/2; 8/9] END max_depth=9, min_child_weight=3;, score=0.789 total time= 1.7min
[CV 1/2; 9/9] START max_depth=9, min_child_weight=5.............................
[CV 1/2; 9/9] END max_depth=9, min_child_weight=5;, score=0.787 total time= 1.9min
[CV 2/2; 9/9] START max_depth=9, min_child_weight=5.............................
[CV 2/2; 9/9] END max_depth=9, min_child_weight=5;, score=0.788 total time= 2.2min
GridSearchCV(cv=2,
estimator=XGBClassifier(base_score=None, booster=None,
callbacks=None, colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=0.8,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
feature_types=None, gamma=0, gpu_id=None,
grow_policy=None, importance_type=None,
interaction_constraints=None,
learning_rate=0.1, max_bin=None,
max_cat_threshold=None,
max_cat_to_onehot=None,
max_delta_step=None, max_depth=5,
max_leaves=None, min_child_weight=1,
missing=nan, monotone_constraints=None,
n_estimators=1000, n_jobs=None, nthread=8,
num_class=10, num_parallel_tree=None, ...),
param_grid={'max_depth': [3, 6, 9], 'min_child_weight': [1, 3, 5]},
scoring='roc_auc_ovr', verbose=10)
getTrainScores(gsearch1)
({0: "mean:0.8156157459533608params{'max_depth': 3, 'min_child_weight': 1}",
1: "mean:0.8152964554183812params{'max_depth': 3, 'min_child_weight': 3}",
2: "mean:0.8153314858710563params{'max_depth': 3, 'min_child_weight': 5}",
3: "mean:0.8009061569272977params{'max_depth': 6, 'min_child_weight': 1}",
4: "mean:0.799073213717421params{'max_depth': 6, 'min_child_weight': 3}",
5: "mean:0.7985815017832647params{'max_depth': 6, 'min_child_weight': 5}",
6: "mean:0.7903423593964335params{'max_depth': 9, 'min_child_weight': 1}",
7: "mean:0.7885236373113854params{'max_depth': 9, 'min_child_weight': 3}",
8: "mean:0.7873373580246914params{'max_depth': 9, 'min_child_weight': 5}"},
{'best_mean': 0.8156157459533608,
'best_param': {'max_depth': 3, 'min_child_weight': 1}})
Section 3.2.1 Optimal Parameter Selection
Below we select the optimal parameters for opt_min_child
and opt_max_depth
based on hyperparameter tuning done previously.
opt_min_child = 1
opt_max_depth = 3
Section 3.2.2 Output of Model with Optimal max_depth
& min_child_weight
Below we output the model with the determined optimal max_depth
and min_child_height
found above. We evoke the functions defined in Section 3.0 to visually and numerically break down mlogloss
and merror
.
xgb_clf2 = XGBClassifier(learning_rate=0.1,
n_estimators=1000,
max_depth=opt_max_depth, # tuned
min_child_weight=opt_min_child, # tuned
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softproba',
nthread=num_threads,
num_class=num_classes,
seed=1234567)
fitXgb(xgb_clf2, training_data)
Fitting model...
Fitting done!
[0] test-mlogloss:2.24656 test-merror:0.71280 train-mlogloss:2.24377 train-merror:0.69944
/Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py:617: FutureWarning: Pass `evals` as keyword args.
warnings.warn(msg, FutureWarning)
[99] test-mlogloss:1.71226 test-merror:0.67100 train-mlogloss:1.61737 train-merror:0.62084
-- Model Report --
XGBoost Accuracy: 0.3162
XGBoost F1-Score (Micro): 0.3162
XGBoost F1-Score (Macro): 0.3060725450187838
Section 3.3 Tuning reg_alpha
Direcly below, we define getTrainScores
to output more detail about the fully tuned model shown further below.
Two blocks below, we create a param_test
dictionary of hyperparameters to tune, the two being reg_alpha
. We then use GridSearchCV
from sklearn.model_selection
to fit different cardinal products of the param_test
hyperparameters to maximize the ovr_auc_roc
function.
The aforementioned loss function is a heuristic for using individual, binary classifications to proxy a multi-class classification on the AUC ROC of the target class versus the combination of all other classes for the label.
This was done to induce hyperparameter tuning of the two mentioned above, but in a way directed towards maximizing AUC ROC.
param_test2 = {
'reg_alpha':[1e-2, 0.1, 1]
}
gsearch2 = GridSearchCV(estimator = xgb_clf2, param_grid = param_test2, scoring='roc_auc_ovr', verbose = 10, cv=2)
gsearch2.fit(training_data['X_train'], training_data['y_train'])
Fitting 2 folds for each of 3 candidates, totalling 6 fits
[CV 1/2; 1/3] START reg_alpha=0.01..............................................
[CV 1/2; 1/3] END ...............reg_alpha=0.01;, score=0.815 total time= 35.1s
[CV 2/2; 1/3] START reg_alpha=0.01..............................................
[CV 2/2; 1/3] END ...............reg_alpha=0.01;, score=0.816 total time= 42.9s
[CV 1/2; 2/3] START reg_alpha=0.1...............................................
[CV 1/2; 2/3] END ................reg_alpha=0.1;, score=0.815 total time= 37.3s
[CV 2/2; 2/3] START reg_alpha=0.1...............................................
[CV 2/2; 2/3] END ................reg_alpha=0.1;, score=0.816 total time= 34.3s
[CV 1/2; 3/3] START reg_alpha=1.................................................
[CV 1/2; 3/3] END ..................reg_alpha=1;, score=0.815 total time= 45.6s
[CV 2/2; 3/3] START reg_alpha=1.................................................
[CV 2/2; 3/3] END ..................reg_alpha=1;, score=0.816 total time= 55.2s
GridSearchCV(cv=2,
estimator=XGBClassifier(base_score=None, booster=None,
callbacks=None, colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=0.8,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
feature_types=None, gamma=0, gpu_id=None,
grow_policy=None, importance_type=None,
interaction_constraints=None,
learning_rate=0.1, max_bin=None,
max_cat_threshold=None,
max_cat_to_onehot=None,
max_delta_step=None, max_depth=3,
max_leaves=None, min_child_weight=1,
missing=nan, monotone_constraints=None,
n_estimators=1000, n_jobs=None, nthread=8,
num_class=10, num_parallel_tree=None, ...),
param_grid={'reg_alpha': [0.01, 0.1, 1]}, scoring='roc_auc_ovr',
verbose=10)
getTrainScores(gsearch2)
({0: "mean:0.8156626792866941params{'reg_alpha': 0.01}",
1: "mean:0.8156025240054869params{'reg_alpha': 0.1}",
2: "mean:0.8154581849108367params{'reg_alpha': 1}"},
{'best_mean': 0.8156626792866941, 'best_param': {'reg_alpha': 0.01}})
opt_reg_alpha = 0.01
Section 3.3.1 Final Model Output & Loss Visualization (w/ Optimized max_depth
, min_child_weight
, & reg_alpha
)
Below we output the model with the determined optimal max_depth
, min_child_height
, and reg_alpha
found above. We evoke the functions defined in Section 3.0 to visually and numerically break down mlogloss
and merror
.
xgb_clf3 = XGBClassifier(learning_rate=0.1,
n_estimators=1000,
max_depth=opt_max_depth, # tuned (previously)
min_child_weight=opt_min_child, # tuned (previously)
reg_alpha=opt_reg_alpha, # tuned
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softproba',
nthread=10,
num_class=10,
seed=1234567)
fitXgb(xgb_clf3, training_data)
Fitting model...
Fitting done!
[0] test-mlogloss:2.24656 test-merror:0.71280 train-mlogloss:2.24377 train-merror:0.69944
/Users/sunnyson/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py:617: FutureWarning: Pass `evals` as keyword args.
warnings.warn(msg, FutureWarning)
[99] test-mlogloss:1.71240 test-merror:0.67180 train-mlogloss:1.61757 train-merror:0.62124
-- Model Report --
XGBoost Accuracy: 0.3142
XGBoost F1-Score (Micro): 0.3142
XGBoost F1-Score (Macro): 0.3034744675861638
Section 3.4 Results & Discussion
From the above, we can see that an XGBoost
model initiated with the hyperparameters max_depth=3
, min_child_weight=1
, reg_alpha=0.01
makes for the lowest mlogloss=1.66697
and merror=0.65900
with a micro f1=0.345
, a macro f1=0.33689
, and accuracy=0.345
.
We can interpret this as meaning XGBoost
as not having a too-distinctive ability to differentiate music_genre
based on the first 3 Principal Components of the musicData.csv
dataset. For a real-world dataset, achieveing ~0.25
is impressive enough, so the model is not without merit. However, further results (such as the AUC ROC plot showin the section below) will moreso elucidate predictive ability versus guessing.
Section 4.0 AUC ROC of Fitted & Tuned XGBoost
Model
Below we define a function plot_ROC_curve
that takes in the parameters model
, xtrain
, ytrain
, xtest
, and ytest
to output the AUC ROC of each multiclass (multiple classes but each sample can only belong to one class) entry. Furthermore, different aggregations (micro
and macro
averages) of the AUC ROC curve of the model as a whole are presented.
This was done as per the spec sheet, to visually output not only the AUC ROC curve of every class, but also the AUC ROC performance of the model as a whole through different consolidation methods (micro
and macro
).
def plot_ROC_curve(model, xtrain, ytrain, xtest, ytest):
# Creating visualization with the readable labels
visualizer = ROCAUC(model, encoder={0: 'Alternative',
1: 'Anime',
2: 'Blues',
3: 'Classical',
4: 'Country',
5: 'Electronic',
6: 'Hip-Hop',
7: 'Jazz',
8: 'Rap',
9: 'Rock'})
# Fitting to the training data first then scoring with the test data
visualizer.fit(xtrain, ytrain)
visualizer.score(xtest, ytest)
visualizer.show()
return visualizer
clf_choice = xgb_clf3
plot_ROC_curve(clf_choice, training_data['X_train'], y_train, training_data['X_test'], y_test)
ROCAUC(ax=<AxesSubplot:title={'center':'ROC Curves for XGBClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>,
encoder={0: 'Alternative', 1: 'Anime', 2: 'Blues', 3: 'Classical',
4: 'Country', 5: 'Electronic', 6: 'Hip-Hop', 7: 'Jazz',
8: 'Rap', 9: 'Rock'},
estimator=XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_by...
feature_types=None, gamma=0, gpu_id=None,
grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1,
max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=3, max_leaves=None, min_child_weight=1,
missing=nan, monotone_constraints=None,
n_estimators=1000, n_jobs=None, nthread=10,
num_class=10, num_parallel_tree=None, ...))
Section 4.1 Results & Discussion
From the above analysis, we see that even given the low accuracy
and f1
metrics, we obtain above-random performance for the AUC ROC.
With a range of min=(Alternative=0.71)
to max=(Classical=0.96)
, the scores of individual features ovr
(one versus rest comparison mentioned earlier) are displayed.
Also shown are the micro=0.83
and macro=0.81
averages of the AUC ROC curve of the entire model as a whole.
I interpret these numbers for the AUC ROC as showing that our model best classifies Classical
music, and worstly does so for Alternative
.
The micro
and macro
aggregations individually and combined show the efficacy of this model to perform better than random guessing.
From this, I think the most important factor of classification in this dataset is the availability of predictors that are able to better differentiate music (such as popularity).
Section 5.0 Works Cited
[1] https://ledgernote.com/columns/music-theory/circle-of-fifths-explained/
[2] https://towardsdatascience.com/xgboost-for-multi-class-classification-799d96bcd368
[3] https://medium.com/swlh/how-to-create-an-auc-roc-plot-for-a-multiclass-model-9e13838dd3de
[5] https://www.nvidia.com/en-us/glossary/data-science/xgboost/
Section 6.0 Appendix
Section 6.1 Depreciated Code
df_drop = df.drop(index=df[(df['duration_ms'] == -1)].index, axis=0).reset_index(drop=True)
string_data = ['key']
categorical_data = ['mode', 'music_genre']
smap = lambda x: re.search('\w+(?!_)', x)[0]
cmap = lambda x: re.search('\w+(?=_name)|mode|((?<=music_)\w+)', x)[0]
encode_string = {}
for i, col in zip(range(len(string_data)), string_data):
encode_string[f'encode_{smap(col)}'] = LabelEncoder().fit(df[[col]])
encode_categorical = {}
for i, col in zip(range(len(categorical_data)), categorical_data):
encode_string[f'encode_{cmap(col)}'] = LabelEncoder().fit(df[[col]])
df_process = pd.get_dummies(data=df_encode, prefix='mode', columns=['mode'])
df_process = df_process[['popularity', 'acousticness', 'danceability', 'duration_ms',
'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode_Major', 'mode_Minor',
'speechiness', 'tempo', 'valence', 'music_genre']]
df_process[['tempo', 'mode_Major', 'mode_Minor']] = df_process[['tempo', 'mode_Major', 'mode_Minor']].apply(pd.to_numeric)
tts_dict = {}
for i in range(10):
tts = train_test_split(df_encode.loc[(df_encode['music_genre'] == i), :'music_genre'],
df_encode.loc[(df_encode['music_genre'] == i), 'music_genre'],
test_size=0.1)
tts_dict[f'genre_{i}_tts'] = tts