Project: Creating Customer Segments

In this project, I analyze a dataset containing data on various customers' annual spending amounts (reported in monetary units) of diverse product categories for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with, and find costumer segments around which specific costumer groups can be found. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (14,8)

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

This section is a basic exploration of the dataset to understand how each feature is related to the others and the relevance of each feature.

The dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'.

In [2]:
# Display a description of the dataset
display(data.describe())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000
In [3]:
data.hist(bins=30, layout = (2,3), figsize = (12,5))
Out[3]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x106b936a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x106c12cc0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x106c75cc0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x106cf0978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x106d572e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x106d90710>]],
      dtype=object)

Implementation: Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code below, I look at three costumers whose spending habits vary significantly from each other.

In [4]:
# Select three indices of to sample from the dataset
numbers = np.random.randint(0,439,3)
print(numbers)
indices = numbers
# After several iterations, I settled on indices a sample of 3 costumers that are different 
# from each other. Updated indices with chosen costumers.
indices = [30, 402, 200]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
[155 372 346]
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 18815 3610 11107 1148 2134 2963
1 26539 4753 5091 220 10 340
2 3067 13240 23127 3941 9959 731

Sample customer preferences

The figure below shows where each of the sampled costumers stands with respect with to the rest in terms of amount spent in each of the categories I am analyzing. The color shade represents the percentile at which the customer (row) stands for that category (column).

In [5]:
import seaborn as sns

percentiles_data = 100*data.rank(pct=True)
percentiles_samples = percentiles_data.iloc[indices]
mpl.pyplot.figure(figsize=(12, 4))
sns.heatmap(percentiles_samples, annot=True, yticklabels = ["0","1","2"],vmin=0, vmax=100)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x106e5eac8>

The three samples vary in where they stand in relation to most values of each category. On a very general especulation, I would guess that:

0 is likely a family oriented customer who is health conscious. Their spenditure history puts them at the 80th percentile for Fresh, the 75th percentile for Grocery, the 65th percentile for Detergents-paper, and over 90th percentile for Delicatessen, a top spender for all these categories. For Milk and Frozen, consumption is slightly above and below average.

  • With this in mind, I hypothesize that this a customer that patronizes supermarkets and retail stores often, including health supermarkets and food stores, as well as co-ops.

1 is the least likely to be buying for a family, perhaps a single person, or someone who travels often. Their spenditure on Detergents-paper is at the 1st percentile so is likely a person who does not spend much time at home or invests in its general maintance; Milk and Grocery expenses puts them at the 50-55th and 55-60th percentile respectively, very average; lower than 10th percentile for Frozen, and Delicatessen slightly over 20th percentile; notably, this customer is around the 90th percentile for Fresh expenditure.

  • This suggests that the customer could be a frequent patron of cafes, restaurants, and resorts.

2 looks like a family oriented customer who is not as health-conscious. They are at lower than 25% of consumers on Fresh expenditure and slightly below average on Delicatessen (40-45th percentile), yet spend more than 90% of customers on Milk, more than around 95% of customers on Grocery, more than 75-80% of customers on Frozen, and more than 90-95% of customers on Detergents-paper. Like customer 0, they seem invested on domestic chores, but unlike them, they spend little on Fresh and Delicaressen.

  • This customer could be expected to visit regular supermarkets and retailers, and, when eating out, more likely to visit fast-food venues.

There are many caveats and other potential explanations for the behavior of each customer sampled, and should keep in mind it is just a sample with few variables. For example, buying few products from fresh, as #3, could also be due to being a farmer and sourcing all fresh food needs from their own.

Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In the code block below, I do this for all six features of the dataset.

In [6]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

columns = list(data.columns)

for col in columns:

    # Make a copy of the DataFrame, using the 'drop' function to drop the given feature
    new_data = data.drop([col], axis=1)

    # Split the data into training and testing sets(0.25) using the given feature as the target
    # Set a random state.
    X_train, X_test, y_train, y_test = train_test_split(new_data, data[col],test_size=0.25,random_state=4)

    # Create a decision tree regressor and fit it to the training set
    regressor = DecisionTreeRegressor(random_state=0)
    regressor.fit(X_train, y_train)
    
    # Create a linear regressor and fit it to the training set
    linear_m = LinearRegression()
    linear_m.fit(X_train, y_train)

    # Report the score of the prediction using the testing set
    score = regressor.score(X_test, y_test)    
    score_linear_m = linear_m.score(X_test, y_test)

    print("Decission tree R^2 score when "+ col + " is target: " + str(round(score,2)))
    print("Linear model R^2 score when "+ col + " is target: " + str(round(score_linear_m,2)) + "\n")

    
    
/Users/amui/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Decission tree R^2 score when Fresh is target: -0.4
Linear model R^2 score when Fresh is target: 0.04

Decission tree R^2 score when Milk is target: 0.58
Linear model R^2 score when Milk is target: 0.65

Decission tree R^2 score when Grocery is target: 0.72
Linear model R^2 score when Grocery is target: 0.93

Decission tree R^2 score when Frozen is target: -0.18
Linear model R^2 score when Frozen is target: 0.13

Decission tree R^2 score when Detergents_Paper is target: 0.69
Linear model R^2 score when Detergents_Paper is target: 0.9

Decission tree R^2 score when Delicatessen is target: -9.5
Linear model R^2 score when Delicatessen is target: -1.06

It seems that Milk, Grocery, and Detergents_Paper are the most easily predictable features from spending habits in other categories. Conversely, Fresh, Frozen, and Delicatessen are difficult to predict from the other categories -- with negative R^2, these models fit the data very poorly, worse than a simple average.

A feature which cannot be predicted by other features should not be removed from the dataset for the sake of reducing the dimensionality of our dataset, since its' information content, however useful that might eventually prove to be, is not contained in the rest of the features. On the other hand, a feature that can be predicted from other features would not really give us much additional information and thus, would be a fit candidate for removal, if we ever need it to make the dataset more manageable.

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. This will reveal whether there are any visible correlations between any two given variables.

In [7]:
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (10,10), diagonal = 'kde');

The heatmap below corresponds to the correlation coefficients between each pair of variables.

In [8]:
import seaborn as sns
mpl.pyplot.figure(figsize=(6, 5))
sns.heatmap(data.corr(), annot=True)
data.corr()
Out[8]:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Fresh 1.000000 0.100510 -0.011854 0.345881 -0.101953 0.244690
Milk 0.100510 1.000000 0.728335 0.123994 0.661816 0.406368
Grocery -0.011854 0.728335 1.000000 -0.040193 0.924641 0.205497
Frozen 0.345881 0.123994 -0.040193 1.000000 -0.131525 0.390947
Detergents_Paper -0.101953 0.661816 0.924641 -0.131525 1.000000 0.069291
Delicatessen 0.244690 0.406368 0.205497 0.390947 0.069291 1.000000

There are three pairs of features that exhibit strong correlation between each other:

- Grocery - Detergents_Paper .92
- Milk - Grocery .73
- Milk - Detergents_paper .66


The most significant correlation is between Grocery and Detergents_Paper. This implies that both are weakly relevant as they contain approximately the same information, but if we remove one of them, the other becomes strongly relevant. Milk is also correlated with both these features, but the correlation is relatively mild. This mildness of correlation explains, to some extent, the relatively lower R^2-score obtained for Milk in the previous section.

Correlation for other pairs of features is somewhat insignificant, which explains their low/negative R^2-scores in the previous section.

Data Preprocessing

All six variables are skewed to the right, with most consumers spending smaller quantities on them. Overall, "Delicatessen" is the more tightly distributed, but also the one with the smaller range. Futhermore, all variables present a few customers with values extraordinarily high, what some would call outliers.

Implementation: Feature Scaling

Clustering algorithms discussed in this project work under the assumption that the data features are (roughly) normally distributed. Significant deviation from zero skewness indicates that we must apply some kind of normalisation to make the features normally distributed. In this section I apply the natural logarithm to the values for each feature, in order to achieve an approximately normal distribution for all of them.

Transforming the data by taking the logarithm gives us normal distributions for each variable. The correlations and lack of correlations between variables remain the same, as expected.

In [9]:
# Scale the data using the natural logarithm
log_data = np.log(data)

# Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (10,10), diagonal = 'kde');
In [10]:
mpl.pyplot.figure(figsize=(6, 5))
sns.heatmap(log_data.corr(), annot=True)
log_data.corr()
Out[10]:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Fresh 1.000000 -0.019834 -0.132713 0.383996 -0.155871 0.255186
Milk -0.019834 1.000000 0.758851 -0.055316 0.677942 0.337833
Grocery -0.132713 0.758851 1.000000 -0.164524 0.796398 0.235728
Frozen 0.383996 -0.055316 -0.164524 1.000000 -0.211576 0.254718
Detergents_Paper -0.155871 0.677942 0.796398 -0.211576 1.000000 0.166735
Delicatessen 0.255186 0.337833 0.235728 0.254718 0.166735 1.000000

Implementation: Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. Here, I we will use Tukey's Method for identfying outliers: An outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

In [11]:
# Create an array of all outliers
all_outliers = np.array([], dtype='int64')

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = log_data[feature].quantile(q=.25)
    
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = log_data[feature].quantile(q=.75)
    
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5*(Q3-Q1)
    
    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    indices = ~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))
    display(log_data[indices])
    outlier_points = log_data[indices]
    
    # OSelect the indices for data points to remove    
    all_outliers = np.append(all_outliers, outlier_points.index.values.astype('int64'))

# Count the unique elements in the all_outliers array
all_outlier, indices = np.unique(all_outliers, return_inverse=True)
counts = np.bincount(indices)

# These are the "outliers" that appeared in several variables
repeated_outliers = all_outlier[counts>1]
print("Repeated outliers: " + str(repeated_outliers))
    
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[repeated_outliers]).reset_index(drop = True)
Data points considered outliers for the feature 'Fresh':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
Data points considered outliers for the feature 'Milk':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
Data points considered outliers for the feature 'Grocery':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
Data points considered outliers for the feature 'Frozen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
Data points considered outliers for the feature 'Detergents_Paper':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
Data points considered outliers for the feature 'Delicatessen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
Repeated outliers: [ 65  66  75 128 154]
In [12]:
print("Outlier percentage of total data: " + str(round(all_outliers.shape[0]*100/log_data.shape[0],2)) + "%")
print("Repeated outliers percentage of total data: " + str(round(repeated_outliers.shape[0]*100/log_data.shape[0],2)) + "%")
Outlier percentage of total data: 10.91%
Repeated outliers percentage of total data: 1.14%

By looking at this procedure, ~11% of the original data is deemed "outliers". However, this being such a large proportion of the data, it is unadvisable to remove it without particularly strong reasons. Doing so would impact the results of the algorithm we use.

A stronger case can be made for datapoints that are outliers in more than one feature. They represent only ~1% of the dataset, and removing them could actually improve the performance of the chosen model.

Feature Transformation

In this section you will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Implementation: PCA

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can apply PCA to the good_data to discover which dimensions about the data best maximize the variance of features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone. A component (dimension) from PCA can be considered a new "feature" of the space, however it is a composition of the original features present in the data.

In [13]:
from sklearn.decomposition import PCA

# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA(n_components=good_data.shape[1])
pca.fit(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
In [14]:
pca_results
Out[14]:
Explained Variance Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Dimension 1 0.4430 0.1675 -0.4014 -0.4381 0.1782 -0.7514 -0.1499
Dimension 2 0.2638 -0.6859 -0.1672 -0.0707 -0.5005 -0.0424 -0.4941
Dimension 3 0.1231 -0.6774 0.0402 -0.0195 0.3150 -0.2117 0.6286
Dimension 4 0.1012 -0.2043 0.0128 0.0557 0.7854 0.2096 -0.5423
Dimension 5 0.0485 -0.0026 0.7192 0.3554 -0.0331 -0.5582 -0.2092
Dimension 6 0.0204 0.0292 -0.5402 0.8205 0.0205 -0.1824 0.0197
In [15]:
print(pca_results['Explained Variance'].cumsum())
Dimension 1    0.4430
Dimension 2    0.7068
Dimension 3    0.8299
Dimension 4    0.9311
Dimension 5    0.9796
Dimension 6    1.0000
Name: Explained Variance, dtype: float64

It is important to notice that by adding the first four components, they together account form 93% of the variance. Importantly, just the first two components get us well over half of the variance in the dataset, 70%.

The high absolute values of certain features in each component speak of the relative importance of the given feature for the component, especially relevant for the present objective of separating costumers in different categories.

  • The first component is driven mainly by purchases of Detergents_paper, and to a lesser extent, of Groceries and Milk, as can be seen from their high absolute values. Differences along this dimension could indicate a separation between customers with families and children, or customers highly involved with household activities, and customers who are less involved in domestic chores. This distinction accounts for 50% of the variance in the dataset and is thus very important.
  • The second component is mostly driven by Fresh, Frozen, and Delicatessen purchases. A hypothesis could be made that high values along this axis could indicate childless customers or less domestically-minded customers, narrowing to demographic that favors convenience but is also health conscious (Dresh) and wordly (Delicatessen). It accounts for almost a quarter of variance in the data, 22%.
  • The third component accounts for just 10% of the variance. It is driven mainly by spenditure in Fresh and Delicatessen (in opposite amounts). It could separate vegetarians from omnivores. Since both categories require little preparation, this dimension would not be particularly useful in separating household oriented vs. not, as the first and second components do.
  • The fourth component accounts for another ~10% of the variance. It would be useful in dividing customers between, on one end, those who spend a lot on Frozen (and little Delicatessen), and the opposite on the other.
  • The fifth component only accounts for 5% of the variance. It is driven mainly by Milk, and opposite behavior in Detergents_paper.
  • Accounting for only 2% of the variance, the the sixth and last component is mostly Grocery, with opposite values for Milk spenditure.

Observation

The code below to displays how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions. The numerical value for the first four dimensions of the sample points shows that this is consistent with the initial interpretation of the sample points.

In [16]:
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
Dimension 1 Dimension 2 Dimension 3 Dimension 4 Dimension 5 Dimension 6
0 -1.1156 -1.3483 -0.1973 -0.9135 -0.4000 0.5608
1 3.2335 0.5493 -1.1499 -2.2714 3.0208 0.6836
2 -2.9903 -0.3645 0.2521 1.5653 0.1922 0.1244

Implementation: Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a signifiant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.

In [17]:
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components = 2)
pca.fit(good_data)

# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

Observation

The code below shows how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions. The values for the first two dimensions remain unchanged when compared to a PCA transformation in six dimensions.

In [18]:
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
Dimension 1 Dimension 2
0 -1.1156 -1.3483
1 3.2335 0.5493
2 -2.9903 -0.3645

Visualizing a Biplot

A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

In [19]:
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a12f09470>

Observation

Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.

From the biplot we see that Detergents_paper, Grocery, and Milk are the original features most strongly correlated with the first component. For the second component, the most relevant original features are Fresg, Frozen, and Delicatessen. These observations agree with the bar plots of each component in an earlier figure.

Clustering

In this section, I will use the first two principal components to attempt to cluster the customers from the dataset into relevant groups according to their spending habits. The two clustering algorihtms I consider are K-Means clustering and Gaussian Mixture Models clustering.

K-Means clustering is fast and can produce tight clusters, it is great for cases in which the underlying clusters are neatly separated from each other, tightly wound together, and spherical along all dimensions. However, for cases in which members of a cluster do not conform to a circular or spherical shape and consequently have varying distances to its center, K-means fails, since it relies on distance-to-centroid as definition for clustering. Furthermore, it is very sensitive to initialization values.

Gaussian Mixture Model clustering, on the other hand, has a slower convergence rate, but it has several advantages. This algorithm allows for greater shape flexibility, including the possibility that a cluster contains another clustering within it. And the soft-clustering approach allows for membership of datapoints to several clusters.

For the present dataset, a soft-clustering approach such as what is afforded with Gaussian Mixture Models makes more sense, especially considering that there are no visually separable clusters in the biplot. The dataset is quite small and scalability is not an issue.

For large datasets, an alternative strategy could be to use the faster K-Means for preliminary analysis, and if upon revision we determine that the results could be significantly improved, use Gaussian Mixture Model in the next step while using the cluster assignments and centres obtained from K-Means as the initialisation for Gaussian Mixture Model.

Implementation: determining optimal number of clusters

Depending on the problem, the number of clusters expected to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.

In [20]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

clusters = np.linspace(2,20,19, dtype=int)
scores = list()

for cluster_number in clusters:

    # Apply your clustering algorithm of choice to the reduced data 
    clusterer = GaussianMixture(n_components= cluster_number)
    clusterer.fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.means_

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds)
    scores.append(score)

    #print("Cluster number: " + str(cluster_number) + "  Score: " + str(round(score,2)))
    
import matplotlib.pyplot as plt
plt.figure(figsize=(7,3))
plt.plot(clusters, scores)
plt.ylabel('Silhouettescore')
plt.xlabel('Number of clusters')
plt.xticks(clusters)
plt.show()  
print("Highest score: " + str(round(max(scores),2)))    
Highest score: 0.42

As the figure above shows, the highest silhouette score, 0.42, is for when we use two clusters.

Implementation: clustering algorithm

Once the optimal number of clusters has been chosen using the scoring metric above, we can initialize and fit the model.

In [21]:
# Initialize model with optimal number of clusters
clusterer = GaussianMixture(n_components= 2)
clusterer.fit(reduced_data)

#Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

#Find the cluster centers
centers = clusterer.means_

Function to visualize contours of predicted TWO clusters

The following function allows to visualize the contours of the predicted two clusters, overlaid over the colored scatterplots.

In [22]:
import matplotlib.mlab as mlab

# Reset preds to outcome from a GMM model of 2 clusters
clusterer = GaussianMixture(n_components= 2)
clusterer.fit(reduced_data)
preds = clusterer.predict(reduced_data)

def draw_contours():
    # --- Contours of each cluster ---
    # Subset of one cluster
    ones = reduced_data[preds.astype(bool)]

    # Subset of another cluster
    nopreds = abs(preds-1)
    zeros = reduced_data[nopreds.astype(bool)]

    # Prepare grid
    x = np.arange(-7.0, 7.0,0.025)
    y = np.arange(-7.0, 7.0,0.025)
    X, Y = np.meshgrid(x, y)

    for subtable in [zeros, ones]:

        tab = subtable

        a = np.var(tab["Dimension 1"])
        b = np.var(tab["Dimension 2"]) 
        c = np.mean(tab["Dimension 1"]) 
        d = np.mean(tab["Dimension 2"])
        Z_0 = mlab.bivariate_normal(X, Y, a,b,c,d)

        plt.contour(X, Y, Z_0)        
        plt.scatter(tab["Dimension 1"], tab["Dimension 2"], 5)
        plt.xlabel("Dimension 1")
        plt.ylabel("Dimension 2")
In [23]:
draw_contours()

Observation

From the figure above we can see the two customer clusters along with their area of overlap. This is an advantage of using a Gaussian Mixture Model, for the possibility of soft clustering datapoints that could belong to more than one cluster.

The superimposed biplot below illustrates the mapping of the real world variables onto the principal components. This aids in the interpretation of the clustering in terms of the original features of the dataset.


The two clusters are divided mainly along the direction of tendency to buy Detergents_paper, Grocery, and Milk. This would indicate a split between customers that are more involved in household maintenance (cleaning and cooking), and customers who are not. The first group would be more commonly expected at supermarkets, while the second group is more likely at restaurants.

The second principal component adds to this: it conveys a tendency to buy more frozen, fresh and delicatessen. The second cluster occupies more of the higher values of the second principal component than the domestic cluster does. Customers that are less involved in domestic chores are more likely to resort to ready-made foods like those found in the deli, fresh fruits, and ready-to-eat ones.

Including other principal components could potentially help fine tune the overlapping areas between dimension 1 and dimension 2.

In [24]:
vs.biplot(good_data, reduced_data, pca)
draw_contours()
plt.show()

Implementation: Data Recovery

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

In [25]:
# Inverse transform the centers -- to return to original number of variables
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers -- to descale from log to original values
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Segment 0 3552.0 7837.0 12219.0 870.0 4696.0 962.0
Segment 1 8953.0 2114.0 2765.0 2075.0 353.0 732.0

The table above displays the actual values in monetary units for each of the original, real world variables.

Examining predictions the 3 sample customers chosen in terms of their assigned clusters

If we look at what cluster each of the chosen distinctive sample customers have been assigned, we find consistency with our earlier predictions.

In [26]:
display(samples)

# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)

# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 18815 3610 11107 1148 2134 2963
1 26539 4753 5091 220 10 340
2 3067 13240 23127 3941 9959 731
Sample point 0 predicted to be in Cluster 0
Sample point 1 predicted to be in Cluster 0
Sample point 2 predicted to be in Cluster 1

Sample point 1 spends large amounts on Fresh abd very little on Detergents_paper and Grocery (by comparison to the others). This is enough to offset that this particular customer does not spend much on Delicatessen.

Similarly, the large spenditure on Detergents_paper and Grocery by sample points 0 and 2, place them clearly in the domestically inclined cluster 1. That, even though the customer from sample point 0 spends fairly little on Milk.

In [27]:
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)

Conclusion and suggestions

This final section presents two ways in which the clustered data could be used.

First, we will consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme.

Next, we will consider how giving a label to each customer (which segment that customer belongs to) can provide for additional features about the customer data.

Case 1

Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.

  • How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?*

Customers in cluster 0 are likely to be the most affected by reduced delivery days, since they are the ones that purchase more from the Fresh category. Reduced delivery days will cause either depletion of Fresh products, or reduction in quality of Fresh items stored longer. On the other hand, if frequency of deliveries decreases, the amounts of Detergents_paper, Grocery, and Milk can simply be made larger per delivery, to make sure there is enough in stock to satisfy customers from cluster 1.

Based on this, should the wholesale distributor decide to A/B test the change in delivery, it would be advisable that they do it on cluster 1 customers only. They would need to use a subset from cluster 1 as the control, and another subset as the treatment (test) group.

Case 2

If the wholesale distributor acquires new customers and each provides estimates for anticipated annual spending of each product category, knowing these estimates, the wholesale distributor can classify each new customer to a customer segment to determine the most appropriate delivery service.

  • How could this be done? That is, how could the wholesale distributor label the new customers using only their estimated product spending and the customer segment data?

The present project has found a systematic grouping within the existing pool of customers. They can be divided into belonging to cluster 0 or cluster 1 type of customer -- the dependent variable.

With this existing data which has now been labeled in a manner that is useful to the wholesale seller, a supervised learning algorithm could be trained and used to make predictions about what type the new customers belong to. I believe that a classifier decission tree, multivariate regression, or a neural network, would be useful supervised algorithms for this purpose.