Margin Clustering over Each Category of Products with Python and MDX

Fakhredin Khorasani
3 min readJan 18, 2020

--

In our retail store, we wanted to find a way to discover three margin boundaries dynamically for three low, medium, and high profit clusters that have been made by brands in one of the category levels of sale cube. Moreover, we has a cube that contains sales and margins up to the barcode level which has been built from the data warehouse. Hence, in python, we fetch the required data with MDX from SSAS cube which contains category, brand, and margin. Then we fit the k-means cluster for each category on margin numbers and get brand names and its clusters as the results.

In the beginning, python needs pyodbc package to be connected to SQL Server which helps us to run the MDX query via OPENQUERY against Microsoft Analysis Server. In order to show the implementation of this idea, we use adventure works cube dimensions as [Category] and [Model Name], and [Internet Sales Amount] as a measure instead of category, brand, and margin. You can find out how to configure a linked server for the SSAS database in this article. We have fetched data with this query:

import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
'Server=.;' #IP or PC Name
'Database=master;'
'Trusted_Connection=yes;')

sql = """
SELECT
[Category] = "[Product].[Category].[Category].[MEMBER_CAPTION]",
[Model] = "[Product].[Model Name].[Model Name].[MEMBER_CAPTION]",
[Sale] = "[Measures].[Internet Sales Amount]"
FROM OPENQUERY([AW],
'SELECT NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS,
NON EMPTY { ([Product].[Category].[Category].ALLMEMBERS
* [Product].[Model Name].[Model Name].ALLMEMBERS ) } ON ROWS
FROM [Adventure Works]'
)
"""

Afterward, we run the query and store it in a pandas data frame. We also apply a filter to keep only the categories with at least three items, because we want a k-means cluster with three centers. Then we change the sale type into the integer data type for sorting, and turning off an option to avoid warnings.

import pandas as pd

dataset = pd.read_sql(sql, conn)
dataset = dataset.groupby('Category').filter(lambda x: len(x) >= 3)
dataset['Sale'] = dataset['Sale'].astype(int)
pd.options.mode.chained_assignment = None

Now, we build a loop over categories to fit each of them into a K-Means cluster to find boundaries and give a rank to each cluster.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
for category in dataset['Category'].unique():
df = dataset.loc[dataset['Category'] == category]
kmeans = KMeans(n_clusters=3)
kmeans.fit(df['Sale'].ravel().reshape(-1, 1))
y_km = kmeans.fit_predict(df['Sale'].ravel().reshape(-1, 1))
df.loc[y_km == 0, 'Cluster'] = 'Cluster 1'
df.loc[y_km == 1, 'Cluster'] = 'Cluster 2'
df.loc[y_km == 2, 'Cluster'] = 'Cluster 3'
df = df.sort_values(by='Sale', ascending=False)
rank = df.groupby('Cluster')['Sale'].max().rank(ascending=False).astype(int)
rank = pd.DataFrame({'Cluster': rank.index, 'Rank': rank.values})
df = pd.merge(df, rank, on='Cluster')
result.append(df) x1 = df[df['Rank'] == 1]
x2 = df[df['Rank'] == 2]
x3 = df[df['Rank'] == 3]
fig = plt.figure()
plt.scatter(x1['Sale'], np.zeros(len(x1['Sale'])), c='green')
plt.scatter(x2['Sale'], np.zeros(len(x2['Sale'])), c='blue')
plt.scatter(x3['Sale'], np.zeros(len(x3['Sale'])), c='red')

fig.suptitle('Category: '+ category, y=0.9)
plt.xlabel('Sale')
plt.show()
print(pd.concat(result))

Then we plot three clusters by sale as x-axis for each category as below.

And the result is shown in an excel sheet in the following figure.

Conclusion

Clustering helps us to find the boundaries for each subgroup dynamically in different cases. We choose a simple K-Means algorithm for hard clustering in order to assign each product model to only one center and find out the borders.

The GitHub gist is available below.

--

--

No responses yet