diff --git a/projects/customer_segments/customer_segments.ipynb b/projects/customer_segments/customer_segments.ipynb
index 88901630a6..abb9a9fb07 100755
--- a/projects/customer_segments/customer_segments.ipynb
+++ b/projects/customer_segments/customer_segments.ipynb
@@ -114,8 +114,18 @@
    "source": [
     "### Question 1\n",
     "Consider the total purchase cost of each product category and the statistical description of the dataset above for your sample customers. \n",
-    "*What kind of establishment (customer) could each of the three samples you've chosen represent?* \n",
-    "**Hint:** Examples of establishments include places like markets, cafes, and retailers, among many others. Avoid using names for establishments, such as saying *\"McDonalds\"* when describing a sample customer as a restaurant."
+    "\n",
+    "* What kind of establishment (customer) could each of the three samples you've chosen represent?\n",
+    "\n",
+    "**Hint:** Examples of establishments include places like markets, cafes, delis, and wholesale retailers, among many others. Avoid using names for establishments, such as saying *\"McDonalds\"* when describing a sample customer as a restaurant. You can use the mean values as a reference to compare your samples against. The mean values are as follows:\n",
+    "\n",
+    "* Fresh: 12000.2977\n",
+    "* Milk: 5796.2\n",
+    "* Grocery: 7951.3\n* Frozen: 3071.9\n",
+    "* Detergents_Paper: 2881.4\n",
+    "* Delicatessen: 1524.8\n",
+    "\n",
+    "Knowing this, how do your samples compare? Does that help inform your insight into what kind of establishments they might be? 
\n"
    ]
   },
   {
@@ -151,7 +161,8 @@
     "# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature\n",
     "new_data = None\n",
     "\n",
-    "# TODO: Split the data into training and testing sets using the given feature as the target\n",
+    "# TODO: Split the data into training and testing sets (test size of 0.25) using the given feature as the target\n",
+    "# Set a random state.\n",
     "X_train, X_test, y_train, y_test = (None, None, None, None)\n",
     "\n",
     "# TODO: Create a decision tree regressor and fit it to the training set\n",
@@ -166,8 +177,12 @@
    "metadata": {},
    "source": [
     "### Question 2\n",
-    "*Which feature did you attempt to predict? What was the reported prediction score? Is this feature necessary for identifying customers' spending habits?* \n",
-    "**Hint:** The coefficient of determination, `R^2`, is scored between 0 and 1, with 1 being a perfect fit. A negative `R^2` implies the model fails to fit the data."
+    "\n",
+    "* Which feature did you attempt to predict? \n",
+    "* What was the reported prediction score? \n",
+    "* Is this feature necessary for identifying customers' spending habits?\n",
+    "\n",
+    "**Hint:** The coefficient of determination, `R^2`, is scored between 0 and 1, with 1 being a perfect fit. A negative `R^2` implies the model fails to fit the data. If you get a low score for a particular feature, that leads us to believe that the feature is hard to predict using the other features, which makes it an important feature to consider when assessing relevance."
    ]
   },
   {
@@ -202,8 +217,12 @@
    "metadata": {},
    "source": [
     "### Question 3\n",
-    "*Are there any pairs of features which exhibit some degree of correlation? Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict? How is the data for those features distributed?* \n",
-    "**Hint:** Is the data normally distributed? Where do most of the data points lie? 
" + "* Using the scatter matrix as a reference, discuss the distribution of the dataset, specifically talk about the normality, outliers, large number of data points near 0 among others. If you need to sepearate out some of the plots individually to further accentuate your point, you may do so as well.\n", + "* Are there any pairs of features which exhibit some degree of correlation? \n", + "* Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict? \n", + "* How is the data for those features distributed?\n", + "\n", + "**Hint:** Is the data normally distributed? Where do most of the data points lie? You can use [corr()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) to get the feature correlations and then visualize them using a [heatmap](http://seaborn.pydata.org/generated/seaborn.heatmap.html)(the data that would be fed into the heatmap would be the correlation values, for eg: `data.corr()`) to gain further insight." ] }, { @@ -326,7 +345,11 @@ "metadata": {}, "source": [ "### Question 4\n", - "*Are there any data points considered outliers for more than one feature based on the definition above? Should these data points be removed from the dataset? If any data points were added to the `outliers` list to be removed, explain why.* " + "* Are there any data points considered outliers for more than one feature based on the definition above? \n", + "* Should these data points be removed from the dataset? \n", + "* If any data points were added to the `outliers` list to be removed, explain why.\n", + "\n", + "** Hint: ** If you have datapoints that are outliers in multiple categories think about why that may be and if they warrant removal. Also note how k-means is affected by outliers and whether or not this plays a factor in your analysis of whether or not to remove them." 
    ]
   },
   {
@@ -380,7 +403,11 @@
    "metadata": {},
    "source": [
     "### Question 5\n",
-    "*How much variance in the data is explained* ***in total*** *by the first and second principal component? What about the first four principal components? Using the visualization provided above, discuss what the first four dimensions best represent in terms of customer spending.* \n",
+    "\n",
+    "* How much variance in the data is explained **in total** by the first and second principal components? \n",
+    "* How much variance in the data is explained by the first four principal components? \n",
+    "* Using the visualization provided above, discuss each dimension and the cumulative variance explained by each, emphasizing which features are well represented by each dimension (considering both positive and negative feature weights). Discuss what the first four dimensions best represent in terms of customer spending.\n",
+    "\n",
     "**Hint:** A positive increase in a specific dimension corresponds with an *increase* of the *positive-weighted* features and a *decrease* of the *negative-weighted* features. The rate of increase or decrease is based on the individual feature weights."
    ]
   },
@@ -512,7 +539,12 @@
    "metadata": {},
    "source": [
     "### Question 6\n",
-    "*What are the advantages to using a K-Means clustering algorithm? What are the advantages to using a Gaussian Mixture Model clustering algorithm? Given your observations about the wholesale customer data so far, which of the two algorithms will you use and why?*"
+    "\n",
+    "* What are the advantages to using a K-Means clustering algorithm? \n",
+    "* What are the advantages to using a Gaussian Mixture Model clustering algorithm? \n",
+    "* Given your observations about the wholesale customer data so far, which of the two algorithms will you use and why?\n",
+    "\n",
+    "**Hint:** Think about the differences between hard clustering and soft clustering and which would be appropriate for our dataset."
    ]
   },
   {
@@ -567,7 +599,9 @@
    "metadata": {},
    "source": [
     "### Question 7\n",
-    "*Report the silhouette score for several cluster numbers you tried. Of these, which number of clusters has the best silhouette score?* "
+    "\n",
+    "* Report the silhouette score for several cluster numbers you tried. \n",
+    "* Of these, which number of clusters has the best silhouette score?"
    ]
   },
   {
@@ -635,8 +669,10 @@
    "metadata": {},
    "source": [
     "### Question 8\n",
-    "Consider the total purchase cost of each product category for the representative data points above, and reference the statistical description of the dataset at the beginning of this project. *What set of establishments could each of the customer segments represent?* \n",
-    "**Hint:** A customer who is assigned to `'Cluster X'` should best identify with the establishments represented by the feature set of `'Segment X'`."
+    "\n",
+    "* Consider the total purchase cost of each product category for the representative data points above, and reference the statistical description of the dataset at the beginning of this project (specifically looking at the mean values for the various features). What set of establishments could each of the customer segments represent?\n",
+    "\n",
+    "**Hint:** A customer who is assigned to `'Cluster X'` should best identify with the establishments represented by the feature set of `'Segment X'`. Think about what each segment represents in terms of its values for the chosen features. Compare these values with the mean values to get some perspective on what kind of establishment each segment represents."
    ]
   },
   {
@@ -651,7 +687,9 @@
    "metadata": {},
    "source": [
     "### Question 9\n",
-    "*For each sample point, which customer segment from* ***Question 8*** *best represents it? Are the predictions for each sample point consistent with this?*\n",
+    "\n",
+    "* For each sample point, which customer segment from **Question 8** best represents it? 
\n",
+    "* Are the predictions for each sample point consistent with this?\n",
     "\n",
     "Run the code block below to find which cluster each sample point is predicted to be."
    ]
@@ -697,7 +735,10 @@
    },
    "source": [
     "### Question 10\n",
-    "Companies will often run [A/B tests](https://en.wikipedia.org/wiki/A/B_testing) when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively. *How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?* \n",
+    "Companies will often run [A/B tests](https://en.wikipedia.org/wiki/A/B_testing) when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively. \n",
+    "\n",
+    "* How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?\n",
+    "\n",
     "**Hint:** Can we assume the change affects all customers equally? How can we determine which group of customers it affects the most?"
    ]
   },
@@ -714,7 +755,8 @@
    "source": [
     "### Question 11\n",
     "Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a ***customer segment*** it best identifies with (depending on the clustering algorithm applied), we can consider *'customer segment'* as an **engineered feature** for the data. 
Assume the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a ***customer segment*** to determine the most appropriate delivery service. \n",
-    "*How can the wholesale distributor label the new customers using only their estimated product spending and the* ***customer segment*** *data?* \n",
+    "* How can the wholesale distributor label the new customers using only their estimated product spending and the **customer segment** data?\n",
+    "\n",
     "**Hint:** A supervised learner could be used to train on the original customers. What would be the target variable?"
    ]
   },
@@ -754,7 +796,10 @@
    "metadata": {},
    "source": [
     "### Question 12\n",
-    "*How well does the clustering algorithm and number of clusters you've chosen compare to this underlying distribution of Hotel/Restaurant/Cafe customers to Retailer customers? Are there customer segments that would be classified as purely 'Retailers' or 'Hotels/Restaurants/Cafes' by this distribution? Would you consider these classifications as consistent with your previous definition of the customer segments?*"
+    "\n",
+    "* How well do the clustering algorithm and number of clusters you've chosen compare to this underlying distribution of Hotel/Restaurant/Cafe customers to Retailer customers? \n",
+    "* Are there customer segments that would be classified as purely 'Retailers' or 'Hotels/Restaurants/Cafes' by this distribution? \n",
+    "* Would you consider these classifications as consistent with your previous definition of the customer segments?"
    ]
   },
   {