1. The data from multiple Excel sheets was aggregated into a single file.
2. The data was cleaned. Some redundant attributes were dropped from the final data set: for example, Store ID and the unique store name carry the same information, and the visit and household attributes held almost redundant data. An attribute called discount was added to the data set. It is the base price minus the current price, so it is negative when the price of the product has been raised above its base price.
3. The data was converted to CSV and exported to IBM SPSS for further analysis (we used the knowledge flow feature of IBM).
4. The missing values were handled and, after Z-score normalization, the data was fed into a K-means clustering algorithm with the number of clusters set to 6. This is a blind clustering approach: we cluster the data set first and then inspect it to find meaning among the clustered points. Three of the six resulting clusters had good intra-cluster distance and a good silhouette coefficient, so they were not split further. The other three clusters were each split into 5 clusters using the k-means algorithm. The idea is to keep splitting the bad clusters until every cluster ends up with a good silhouette coefficient. This resembles the Bisecting K-means algorithm, except that each bad cluster is split into K clusters instead of 2.
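The steps above can be sketched in plain Python. This is only an illustration, not the SPSS workflow that was actually used: the function names, the silhouette threshold of 0.5, and the fixed random seed are all assumptions made for the example, and the k-means here is a bare-bones Lloyd's iteration rather than the SPSS implementation. The sketch performs one round of splitting, matching the single pass described above.

```python
import random
from statistics import mean, pstdev


def discount(base_price, price):
    """Base price minus current price; negative when the price was raised."""
    return base_price - price


def zscore_columns(rows):
    """Z-score normalize each column of a list of equal-length numeric rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns a list of non-empty clusters."""
    centers = rng.sample(points, k)
    groups = [points]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[idx].append(p)
        groups = [g for g in groups if g]  # drop clusters that lost all points
        new_centers = [[mean(c) for c in zip(*g)] for g in groups]
        if new_centers == centers:
            break
        centers = new_centers
    return groups


def cluster_silhouette(cluster, others):
    """Mean silhouette coefficient of the points in one cluster."""
    if len(cluster) < 2 or not others:
        return 0.0
    scores = []
    for p in cluster:
        # a: mean distance to the other points of the same cluster
        a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean(dist(p, q) for q in o) for o in others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def split_until_good(points, k_first=6, k_split=5, threshold=0.5, seed=0):
    """Cluster into k_first groups, then split each 'bad' group into k_split.

    The threshold is an assumed cut-off for a 'good' silhouette coefficient;
    the report does not state the value that was used.
    """
    rng = random.Random(seed)
    clusters = kmeans(points, k_first, rng)
    final = []
    for i, c in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        if cluster_silhouette(c, others) >= threshold or len(c) <= k_split:
            final.append(c)  # good (or too small to split): keep as-is
        else:
            final.extend(kmeans(c, k_split, rng))  # bad: split into k_split
    return final
```

Calling `split_until_good(zscore_columns(rows))` on a list of numeric rows returns the final clusters as lists of normalized points. Repeating the check on the newly created sub-clusters would turn this single pass into the "split until good" loop described in step 4.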
- After blind clustering, the clusters were investigated. The findings and trends are listed below.