Spark for Business
Introduction to Business 360°
Functionalities of Business 360°
Product Analytics - Daily Sales & Predictions
Product Analytics - Anomaly Detection & Customer Segmentation
Marketing Analytics - Recommendation for Cross-Selling & Customer Clustering for Targeted Offers
Marketing Analytics - RFM Modelling
Complaints Analytics - Monthly Trends
Complaints Analytics - Segmentation by Demographic
Complaints Analytics - Segmentation by Product, Medium, Resolution & Timely Response
Feedback Analytics - Keywords Detection using Natural Language Processing
Enterprises generate large volumes of transactional data every day. This data is rich in insights and, if mined correctly, can help businesses grow and expand. These insights lead to data-driven decisions, which can take businesses to new heights.
We believe that Apache Spark is a robust framework for data analytics and can support business decisions effectively. We wanted to showcase what Spark can do for businesses using machine learning algorithms and analytics. We also wanted to present the analytics in a user-friendly manner, so that non-technical business owners can understand them easily and take appropriate action.
Along with showcasing the strength of Spark, we also wanted to propose a technical architecture that can be deployed in practical scenarios and used by businesses in day-to-day operations.
To realise this vision, we present Business 360°, a web-based application with the power of Apache Spark.
What it does
The four core aspects that every organisation needs to focus on are Products/Services, Customers, Team and Competitors. Business 360° analyses your business data across these four aspects by running machine learning algorithms, and provides decision-support insights through a smart user interface in a web application.
Through Business 360°, you can get insights that help increase sales and profit and improve your products and services. For customers, you can focus on targeted offers and campaigns based on segmentation, and find ways to grow and retain your customer base. For the team members of your organisation, Business 360° helps analyse team performance and supports resource planning and allocation. Lastly, competitors can be analysed to understand their product features and improve your own product accordingly.
Sales Analytics - Provides details of the top-selling product categories based on number of transactions and Gross Merchandise Value. It also lists the product categories whose sales are trending up or down, so that appropriate action can be taken. Products on an increasing trend can be used to up-sell or cross-sell. For products on a decreasing trend, the reason can be investigated and sales increased through promotions.
Product Analytics - Provides daily sales trends for products, which can be used to infer trends in customer spending. It also provides sales predictions based on weather and day of the week, which can be used to plan inventory or promotions. It detects anomalies in sales, for which the business can run a causal analysis and act appropriately. It also identifies which customer segments prefer a particular product category, based on their age and income.
Marketing Analytics - Provides product recommendations by identifying which products customers buy together. This can help in cross-selling or up-selling. Using cluster analysis, it identifies the customer segments that are more likely to buy a product, based on previously used offers for that product category. Using RFM modelling, it provides a list of customers who are more likely to buy the product, which can be used for targeted offers and campaigns.
Feedback Analytics - Provides textual analysis of customer reviews, which can help identify why products sell more or less. It also surfaces the key features of your competitors' products that customers prefer.
Complaint Analytics - Provides monthly trends of customer complaints, which can help in resource planning. It also segments complaints by demographics, product type, medium and resolution. Team performance can be judged by how promptly complaints are responded to.
How we built it
The IBM Bluemix platform was used to build and host the Business 360° application. The development cycle consisted of the following steps:
- Input Data (Retail Transaction & Master Data, Historical Weather Data and Amazon Customer Reviews) was imported into IBM Object Storage
- Using IBM Analytics for Apache Spark, the input data was analysed and machine learning algorithms were run in a Python Jupyter Notebook. The results were stored in Cloudant DB; document-based storage was chosen for easy retrieval of data
- The Open Weather Map API was used to get weather predictions for the next 10 days, which feed the sales forecasting models
Machine Learning Algorithms:
The following machine learning algorithms were used to derive the analytics:
RFM (recency, frequency, monetary) analysis is a technique used to determine quantitatively which customers are the best ones and can be targeted through customized offers, by examining how recently a customer has purchased (recency), how often they purchase (frequency), and how much the customer spends (monetary).
Technical Implementation: DataFrame analytics using pyspark SQL
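As a rough illustration of the RFM idea, the sketch below computes recency, frequency and monetary values per customer in plain Python. It is not the project's pyspark code: the function names, the sample transactions and the simple best-first ordering are all assumptions made for this example (RFM scoring is often done with quantile buckets instead).

```python
from collections import defaultdict
from datetime import date

def rfm_table(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, day, amount in transactions:
        # Track the most recent purchase date per customer.
        if customer_id not in last_seen or day > last_seen[customer_id]:
            last_seen[customer_id] = day
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        c: ((as_of - last_seen[c]).days, frequency[c], round(monetary[c], 2))
        for c in last_seen
    }

def rank_customers(rfm):
    """Order customers best-first: most recent, then most frequent, then highest spend."""
    return sorted(rfm, key=lambda c: (rfm[c][0], -rfm[c][1], -rfm[c][2]))

# Hypothetical sample transactions: (customer_id, purchase_date, amount)
transactions = [
    ("A", date(2016, 1, 5), 20.0),
    ("A", date(2016, 3, 1), 35.5),
    ("B", date(2016, 2, 10), 120.0),
    ("C", date(2015, 11, 20), 15.0),
    ("C", date(2015, 12, 2), 10.0),
    ("C", date(2016, 1, 15), 12.5),
]
rfm = rfm_table(transactions, as_of=date(2016, 3, 31))
ranking = rank_customers(rfm)
```

In pyspark the same aggregation would typically be a group-by on the customer key with max, count and sum aggregates.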
Likelihood to Purchase Modelling
Propensity models, also called likelihood to purchase or response models, help predict the likelihood of a certain type of customer behavior and purchase patterns. This helps marketers optimize marketing strategies like promotional email/app notification frequency, discounts or offers/campaigns.
Technical Implementation: Use of KMeans Clustering
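To make the clustering step concrete, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm). It is only an illustration: the project does not state which KMeans implementation or features it used, so the two features (offers redeemed, purchases in the category), the sample points and the hand-picked starting centroids are assumptions.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - q) ** 2 for p, q in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Hypothetical customers: (offers redeemed, purchases in the category)
customers = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
```

Customers that land in the high-redemption cluster would be the natural audience for a targeted offer. Real implementations usually pick starting centroids randomly (for example with k-means++) rather than by hand.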
Natural Language Processing - Keyword Extraction
Keyword extraction is the automatic identification of the terms that best describe the subject of a text-based input.
Technical Implementation: Frequency distribution and keyword determination using the NLTK library. The Rapid Automatic Keyword Extraction (RAKE) method was implemented using frequency distributions in order to get the key terms that represent the reviews.
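The RAKE scoring idea can be sketched in a few lines of plain Python. This is an illustrative approximation, not the project's NLTK-based code: the tiny stopword list, the function names and the sample review are assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; NLTK ships a fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "for",
             "but", "very", "i", "my", "in", "with", "was", "are"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation too."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_keywords(text, top_n=3):
    """Score each phrase by summing degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

review = ("The stainless steel blade is sharp. "
          "The blade is easy to clean, but the handle is flimsy.")
keywords = rake_keywords(review)
```

Longer phrases made of words that co-occur with many others score highest, which is why "stainless steel blade" outranks the single word "blade" here.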
Linear Regression is an approach used to model the relationship between a scalar dependent variable and one or more explanatory (or independent) variables.
Technical Implementation: Linear regression models from the scikit-learn machine learning library
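For intuition, the sketch below fits a one-variable least-squares line by hand and uses it to predict sales from a forecast temperature. It is a simplified stand-in for the scikit-learn model: the single feature, the function names and the sample figures are assumptions, and the project's actual model also used the day of the week.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

# Hypothetical history: daily temperature (deg C) and units sold
temperatures = [10, 15, 20, 25, 30]
units_sold = [150, 175, 200, 225, 250]

slope, intercept = fit_line(temperatures, units_sold)
forecast = predict(slope, intercept, 22)  # e.g. a forecast temperature from a weather API
```

With several explanatory variables (weather plus weekday indicators) the same least-squares principle applies, but the fit is solved as a matrix problem, which is what a library such as scikit-learn handles.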
Through Anomaly detection, also known as outlier detection, one can identify the items, events or observations which do not conform to an expected pattern or other items in a dataset.
Technical Implementation: Use of the luminol package to detect anomalies
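The idea of flagging observations that do not fit the expected pattern can be illustrated with a simple z-score rule. Note that this is a swapped-in technique for illustration only, not luminol's algorithm; the threshold, function name and sample sales figures are assumptions.

```python
from statistics import mean, pstdev

def zscore_anomalies(series, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [(i, v) for i, v in enumerate(series) if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales with one unusual spike
daily_sales = [100, 102, 98, 101, 99, 180, 100, 97]
anomalies = zscore_anomalies(daily_sales)
```

Each flagged day is a candidate for causal analysis, for example checking whether a promotion or a holiday explains the spike.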
Recommendation Engine - Association Rules
Association rule mining, used here as the basis for recommendations, is a technique to discover interesting relations between variables in large databases.
Technical Implementation: Market basket analysis using association rules to determine which products are frequently bought together
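A minimal sketch of market basket analysis over item pairs is shown below, using the standard support and confidence measures. The thresholds, function name and sample baskets are assumptions for illustration; the project's own rule-mining code is not shown in this writeup.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs in basket | lhs in basket)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Strongest rules first; ties broken by item name for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

# Hypothetical baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
    ["bread", "butter", "jam"],
]
rules = pair_rules(baskets)
```

A rule such as jam -> butter with confidence 1.0 means every basket containing jam also contained butter, which makes butter a reasonable cross-sell suggestion for jam buyers.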
Data Analytics is the science of examining raw data to gather insights, based on data segmentation, aggregation, etc.
Technical Implementation: Insights were derived from the data using pyspark SQL and stored in Cloudant DB for easy retrieval
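As an example of this kind of aggregation, the plain-Python sketch below ranks product categories by number of transactions and by Gross Merchandise Value. It only mirrors the logic of a group-by query; the category names, amounts and function names are assumptions, and the project itself used pyspark SQL.

```python
from collections import defaultdict

def category_summary(rows):
    """Aggregate (category, amount) rows into transaction counts and GMV."""
    summary = defaultdict(lambda: {"transactions": 0, "gmv": 0.0})
    for category, amount in rows:
        summary[category]["transactions"] += 1
        summary[category]["gmv"] += amount
    return dict(summary)

def top_categories(summary, metric, n=2):
    """Return the n categories with the highest value of `metric`."""
    return sorted(summary, key=lambda c: (-summary[c][metric], c))[:n]

# Hypothetical transaction rows: (product category, sale amount)
rows = [
    ("GROCERY", 12.0), ("GROCERY", 8.0), ("GROCERY", 5.0),
    ("DRUG GM", 40.0),
    ("PRODUCE", 6.0), ("PRODUCE", 4.0),
]
summary = category_summary(rows)
by_transactions = top_categories(summary, "transactions")
by_gmv = top_categories(summary, "gmv")
```

The two rankings can differ, as they do here: a category with few but high-value transactions leads on GMV while trailing on transaction count.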
Data Sets Used:
- Retail Transaction Data (more than 1.4 million records), which contains household-level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer, along with customer information, the product catalog, marketing offers run, and coupons distributed and claimed. Source: dunnhumby.com - The Complete Journey DataSet
- Amazon Reviews Dataset (with more than 550K records) for Home & Kitchen category. Contains Products, Ratings and Text Reviews. Source: Amazon product data by Julian McAuley, UCSD
- Consumer Complaints Dataset for financial products. Contains complaints with Product Category, Medium & Resolution. Source: Consumer Financial Protection Bureau
Challenges we ran into
Finding the right open data set in order to showcase the power of Spark on business data.
What we learned
Apache Spark is a robust platform for business analytics. In our experience, machine learning algorithms that take a long time to execute on other platforms ran much faster on Apache Spark.
What's next for Business360
We plan to add more types of modelling and machine learning algorithms, such as sentiment analysis, customer sentiment prediction, and targeted offers based on sentiment on social media.