Understand how different customer groups (cohorts) behave in a retail dataset, and track their engagement with the business over time.
This is a transactional dataset containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer.
Downloaded from here
Variable Name | Role | Type | Units | Missing Values | Description |
---|---|---|---|---|---|
InvoiceNo | ID | Categorical | | no | a 6-digit integral number uniquely assigned to each transaction; if this code starts with the letter 'c', it indicates a cancellation |
StockCode | ID | Categorical | | no | a 5-digit integral number uniquely assigned to each distinct product |
Description | Feature | Categorical | | no | product name |
Quantity | Feature | Integer | | no | the quantities of each product (item) per transaction |
InvoiceDate | Feature | Date | | no | the day and time when each transaction was generated |
UnitPrice | Feature | Continuous | sterling | no | product price per unit |
CustomerID | Feature | Categorical | | no | a 5-digit integral number uniquely assigned to each customer |
Country | Feature | Categorical | | no | the name of the country where each customer resides |
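Since cancellation invoices (those prefixed with 'c') carry negative quantities in this dataset, the Quantity > 0 filter applied during cleaning below removes them as well. A quick way to inspect them (a sketch):

-- Cancellations (sketch): invoice numbers prefixed with 'c' mark cancelled
-- transactions and carry negative Quantity values
SELECT COUNT(*) AS CancelledLines
FROM [Portfolio DBs].[dbo].[Online_Retail]
WHERE InvoiceNo LIKE 'C%';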
For this step I used SSIS, a powerful ETL tool designed to handle large data-migration tasks and to automate data flows easily.
I created a package that runs an Execute SQL task to delete any old data, then uses a Data Flow task to import the data from an Excel file into a SQL Server database.
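The delete step itself is a single statement; a minimal sketch of what the Execute SQL task might run (assuming a full reload of the destination table on each run):

-- Sketch of the Execute SQL task's statement (assumption: full reload each run)
DELETE FROM [Portfolio DBs].[dbo].[Online_Retail];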
-- Total Records = 541,909
-- 135,080 records have no CustomerId
-- 406,829 records have CustomerId
----------------------------------------------------------------------------------------------------------------------------------------
-- ---------Create CTEs that focus on records with a CustomerID, keep only rows with Quantity and UnitPrice > 0, and remove duplicate records
---------------------------------------------------------------------------------------------------------------------------------------
with
Online_Retail as -- 406,829 records have CustomerId
(
SELECT [InvoiceNo]
,[StockCode]
,[Description]
,[Quantity]
,[InvoiceDate]
,[UnitPrice]
,[CustomerID]
,[Country]
FROM [Portfolio DBs].[dbo].[Online_Retail]
WHERE CustomerID IS NOT NULL
)
,Quantity_UnitPrice AS -- 397,884 records with Quantity and UnitPrice > 0
(
SELECT *
FROM Online_Retail
WHERE Quantity > 0 AND UnitPrice > 0
)
,Duplicates AS
(
SELECT
*,
-- number duplicate rows so we can keep only the first occurrence and drop the rest (not just count duplicates)
ROW_NUMBER() OVER (PARTITION BY InvoiceNo, StockCode, Quantity ORDER BY InvoiceDate) Duplicated
FROM Quantity_UnitPrice
)
-----------------------------------------------------------------------------------------------------------
-------------Save the CTE results into a temp table so they don't have to be re-run each time----------------
-----------------------------------------------------------------------------------------------------------
SELECT *
INTO #Temp_Online_Retail
FROM Duplicates
WHERE Duplicated = 1 --> clean data = 392,669
-- WHERE Duplicated > 1 --> duplicated records = 5,215
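As a quick sanity check (optional), the temp table should now hold only first occurrences:

-- Sanity check (sketch): expect 392,669 rows after deduplication
SELECT COUNT(*) AS CleanRecords
FROM #Temp_Online_Retail;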
To begin, we need to identify three things:
- Unique Identifier (Customer ID)
- Initial Start Date (Invoice Date)
- Revenue Data
-----------------------------------------------------------------------------------------------------------
----------------------------------------- COHORT ANALYSIS -------------------------------------------------
-----------------------------------------------------------------------------------------------------------
-- 1. Unique Identifier (CustomerID)
-- 2. Initial Start Date (InvoiceDate)
-- 3. Revenue Data
SELECT
CustomerID,
MIN(InvoiceDate) AS First_Purchase_Date,
DATEFROMPARTS(YEAR(MIN(InvoiceDate)), MONTH(MIN(InvoiceDate)), 1) AS CohortDate -- day = 1 because the analysis is at Year & Month granularity, not day level
INTO #Retention_Cohort
FROM #Temp_Online_Retail
GROUP BY CustomerID;
SELECT * FROM #Retention_Cohort;
-------------------------------------------------------------------------------------------------------------------------------
----------------- Create a Cohort Index that represents the number of months passed since their first purchase -----------------
----------------- when cohort_month_index = 0, the customer made a purchase in the same month as their first purchase -----------------
-------------------------------------------------------------------------------------------------------------------------------
SELECT
q2.*,
cohort_month_index = (year_diff * 12) + month_diff
FROM(
-- after extracting year & month, we calculate the year and month differences between the InvoiceDate and the CohortDate aka first purchase
SELECT
q1.*,
year_diff = q1.Invoice_Year - q1.Cohort_Year,
month_diff = q1.Invoice_Month - q1.Cohort_Month
FROM(
-- first extract both year & month from both temp tables #Temp_Online_Retail & #Retention_Cohort
SELECT
T.*,
R.CohortDate,
YEAR(T.InvoiceDate) AS Invoice_Year,
MONTH(T.InvoiceDate) AS Invoice_Month,
YEAR(R.CohortDate) AS Cohort_Year, -- 1st year_purchase
MONTH(R.CohortDate) AS Cohort_Month -- 1st month_purchase
FROM #Temp_Online_Retail AS T
LEFT JOIN #Retention_Cohort AS R
ON T.CustomerID = R.CustomerID
) AS q1
)AS q2;
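As a side note, the same index can be computed in one step: DATEDIFF(MONTH, start, end) counts calendar-month boundaries crossed, which is exactly (year_diff * 12) + month_diff. A sketch of the equivalent query:

-- Equivalent one-step alternative (sketch): DATEDIFF(MONTH, ...) returns the
-- same value as (year_diff * 12) + month_diff
SELECT
    T.*,
    R.CohortDate,
    DATEDIFF(MONTH, R.CohortDate, T.InvoiceDate) AS cohort_month_index
FROM #Temp_Online_Retail AS T
LEFT JOIN #Retention_Cohort AS R
    ON T.CustomerID = R.CustomerID;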
After that, I stored the whole pipeline in a VIEW so it can be imported for visualization:
--------------------------------------Bringing them all together------------------------------------------------------
------------------------------------------- Data Cleaning -----------------------------------------------------
-------------------------------------------------------------------------------------------------
CREATE OR ALTER VIEW CleanedData
AS
WITH
Online_Retail AS (
SELECT
[InvoiceNo], [StockCode], [Description], [Quantity],[InvoiceDate], [UnitPrice], [CustomerID], [Country]
FROM [Portfolio DBs].[dbo].[Online_Retail]
WHERE CustomerID IS NOT NULL
),
Quantity_UnitPrice AS (
SELECT *
FROM Online_Retail
WHERE Quantity > 0 AND UnitPrice > 0
),
Duplicates AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY InvoiceNo, StockCode, Quantity ORDER BY InvoiceDate) AS Duplicated
FROM Quantity_UnitPrice
),
CleanedData AS (
SELECT *
FROM Duplicates
WHERE Duplicated = 1
),
FirstPurchase AS (
SELECT
CustomerID,
MIN(InvoiceDate) AS First_Purchase_Date,
DATEFROMPARTS(YEAR(MIN(InvoiceDate)), MONTH(MIN(InvoiceDate)), 1) AS CohortDate
FROM CleanedData
GROUP BY CustomerID
),
SUB1 AS (
SELECT
c.*,
f.CohortDate,
YEAR(c.InvoiceDate) AS Invoice_Year,
MONTH(c.InvoiceDate) AS Invoice_Month,
YEAR(f.CohortDate) AS Cohort_Year, -- 1st year_purchase
MONTH(f.CohortDate) AS Cohort_Month -- 1st month_purchase
FROM CleanedData AS c
LEFT JOIN FirstPurchase AS f
ON c.CustomerID = f.CustomerID
),
SUB2 AS (
SELECT
SUB1.*,
year_diff = Invoice_Year - Cohort_Year,
month_diff = Invoice_Month - Cohort_Month
FROM SUB1
),
CohortTable AS(
SELECT
SUB2.*,
cohort_month_index = (year_diff * 12) + month_diff
FROM SUB2
)
SELECT * FROM CohortTable
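Once the view exists, the retention matrix numbers can be pulled straight from it. For example (a sketch), counting distinct customers per cohort and month index:

-- Example usage (sketch): the numbers behind the retention matrix
SELECT
    CohortDate,
    cohort_month_index,
    COUNT(DISTINCT CustomerID) AS Customers
FROM CleanedData
GROUP BY CohortDate, cohort_month_index
ORDER BY CohortDate, cohort_month_index;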
I also created a dynamic pivot table for the cohort analysis and saved it in a stored procedure, in case it is needed in the future.
-------------------------------------------------------------------------------------------------
--------------------------- DYNAMIC PIVOT TABLE -----------------------------------------------
-------------------------------------------------------------------------------------------------
GO
CREATE OR ALTER PROCEDURE sp_CohortPivot
AS
BEGIN
-- Dynamic pivot over the cohort output, read from the global temp table ##CTEs
-- (assumed to be populated beforehand; see the note after the procedure)
IF OBJECT_ID('tempdb..#PivotTemp') IS NOT NULL
DROP TABLE #PivotTemp;
DECLARE
@pvt AS NVARCHAR(MAX),
@pvt_cols AS NVARCHAR(MAX);
-- Generate the dynamic columns
SELECT @pvt_cols = COALESCE(@pvt_cols + ',' , '') + QUOTENAME(cohort_month_index)
FROM (SELECT DISTINCT cohort_month_index FROM ##CTEs) AS sub
ORDER BY cohort_month_index;
-- Generate the pivot query
SET @pvt =
N'
SELECT *
INTO #PivotTemp
FROM (
SELECT
DISTINCT CustomerID,
CohortDate,
cohort_month_index
FROM ##CTEs
) AS q1
PIVOT
(
COUNT(CustomerID)
FOR cohort_month_index IN (' + @pvt_cols + ')
) AS dynamic_pvt;
SELECT * FROM #PivotTemp ORDER BY CohortDate;';
EXEC sp_executesql @pvt;
END
GO
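Note that the procedure reads from a global temp table named ##CTEs, which must exist before the call. One simple way to populate it (a sketch, using the CleanedData view created above):

-- Populate the global temp table the procedure expects (sketch)
SELECT * INTO ##CTEs FROM CleanedData;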
EXEC sp_CohortPivot
The dataset consists of 1 Fact Table and 2 Dimension Tables
- Retail Data as the Fact Table
- Date Data as Dimension Table
- Customer Dimension Table --> created the customer table using DAX
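The customer dimension itself was built with DAX inside the model; for reference, an equivalent table could be sketched in SQL (assuming Country is stable per customer):

-- SQL analogue of the DAX-built customer dimension (sketch only; the actual
-- table was created with DAX). Assumes one Country per customer.
SELECT DISTINCT
    CustomerID,
    Country
FROM CleanedData;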
In this dashboard I used many measures and DAX functions to calculate different types of customers, such as:
- **New customers** are individuals who make their first purchase from the business. They have no prior purchase history with the company.
- **Returned customers** are those who have made at least one previous purchase and come back to make additional purchases. They may not have a consistent buying pattern, but choose to return based on satisfaction with previous experiences or products.
- **Retained customers** are those who continue to make purchases from the business over a longer period, indicating a higher level of loyalty and satisfaction. Retention is often measured over a specific time period, such as a year.
- **Recovered customers** are those who were previously considered lost (i.e., they had stopped purchasing for a significant period) but have returned to make a purchase again.
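The actual classification lives in DAX measures, but the idea translates to SQL. Below is an illustrative sketch against the CleanedData view; the 6-month "lost" threshold for Recovered is an arbitrary assumption for illustration, not the dashboard's exact rule.

-- Illustrative SQL analogue of the customer-type logic (the dashboard uses DAX;
-- this is only a sketch). 'Recovered' uses a hypothetical 6-month gap threshold.
WITH MonthlyActivity AS (
    SELECT DISTINCT
        CustomerID,
        DATEFROMPARTS(YEAR(InvoiceDate), MONTH(InvoiceDate), 1) AS ActivityMonth
    FROM CleanedData
),
WithGap AS (
    SELECT
        CustomerID,
        ActivityMonth,
        DATEDIFF(MONTH,
                 LAG(ActivityMonth) OVER (PARTITION BY CustomerID ORDER BY ActivityMonth),
                 ActivityMonth) AS MonthsSinceLast
    FROM MonthlyActivity
)
SELECT
    CustomerID,
    ActivityMonth,
    CASE
        WHEN MonthsSinceLast IS NULL THEN 'New'       -- first-ever purchase month
        WHEN MonthsSinceLast = 1     THEN 'Retained'  -- bought in consecutive months
        WHEN MonthsSinceLast > 6     THEN 'Recovered' -- came back after a long gap
        ELSE 'Returned'                               -- repeat purchase, irregular pattern
    END AS CustomerType
FROM WithGap;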
(Demo video: cohort.retention.mp4)
- In Dec 2010, a total of 855 customers made their first purchase with the company; these new customers spent 407.26 sterling on average, which is 3.4% less than the overall spend average.
- 1 month later:
  - only 324 (36.6%) of those customers were retained and made another purchase with the company; they spent 533.3 sterling on average, which is 29% more than the overall spend average.
  - meanwhile, 417 new customers made their first purchase with us, spending 622.06 sterling on average.
On the right-hand side of the dashboard, we have a “Customer Retention by Month” stacked area chart and a “Customer Retention Breakdown” 100% stacked bar chart.
Both present essentially the same concept, and each corresponds one-to-one with the cohort analysis matrix table.
- There is a peak in the bar chart around November 2011, likely associated with Christmas or other holiday shopping, which is why total spending peaks at ~1.1M.
- Looking at the line chart, I noticed that although December shows the highest average spend, it also has the lowest number of customers, meaning those customers spend more than average per transaction. One possible explanation is that other retailers had run out of stock and customers were trying to purchase as much as they could.