Single User Activity over Time. Fraud instances can be visualized for the admin.
Global Transition Matrix. Probability that an activity (row) is followed by another activity (column).
Activity Time distribution - Activities have well-behaved distributions, except for the daily exercise, which was not recorded correctly.
Distribution of the number of activities per user - The data shows an exponential distribution, although a power law would be expected for this kind of data.
Age-hour of usage correlation matrices
We were quite intrigued by the current reward system that Helsana has in place, as well as by the problems they face with this program. We wanted to take advantage of the real dataset provided to add value for both the customer and the company, using our favourite data analytics techniques!
What it does
This client-facing web app shows a history of the activities that the user has completed. To enrich this experience, we used data analytics and statistical inference to recommend which activity the user should do next, based on their past preferences. We also mark suspicious activities uploaded by the user that could be fraudulent, using the entire database to detect abnormal activities. Once fraud is detected, we can deduct some points, in the hope that the user will avoid providing false information in the future.
This way, we want to encourage users to be honest, as well as give them a chance to correct an entry or provide additional proof. This mechanism promotes transparency and trust between the client and the company. Having an automatic way of checking for potential fraud makes the system more scalable and sustainable in the long term, so that more clients can benefit from the program.
Apart from the main functionalities mentioned above, we also implemented a simple login page where the login form is fully validated as well.
How I built it
Front end development
The back end was implemented with Django and Django REST framework, connecting directly to an MSSQL database.
Modelling each participant's sequence of activity choices as a Markov chain, we obtain a transition matrix of the activities. By aggregating all of these transition matrices, we create a global activity transition matrix.
We take into account a user's last activity type and focus on its corresponding row in the transition matrix.
We apply a softargmax (softmax) function to that row, as in multinomial logistic regression, and use the resulting probabilities to recommend one of the activities to the user.
Note 1: The daily exercise activity is excluded from the matrix, since it is recommended to the user independently every day.
Note 2: Bonus-achieving activities are included in the matrix, as they are correlated with other types of activities, but they are not recommended to users (they are excluded from the columns).
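The recommendation steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the activity names, the `temperature` parameter, and the choice to sample from the softmax weights (rather than always pick the top-ranked activity) are assumptions made for the example.

```python
import math
import random
from collections import defaultdict


def transition_counts(sequences):
    """Count, over all users, how often one activity directly follows another."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def softmax(values, temperature=1.0):
    """Numerically stable softargmax; temperature is an assumed tuning knob."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def recommendation_weights(counts, last_activity, excluded=frozenset(), temperature=1.0):
    """Softmax weights over the recommendable columns of the row for last_activity."""
    row = counts.get(last_activity, {})
    row_total = sum(row.values())
    # Excluded columns (e.g. bonus activities) stay in the row total but are never offered.
    candidates = [a for a in sorted(row) if a not in excluded]
    if not candidates:
        return {}
    probs = [row[a] / row_total for a in candidates]
    return dict(zip(candidates, softmax(probs, temperature)))


def recommend(counts, last_activity, excluded=frozenset(), temperature=1.0, rng=random):
    """Sample one activity according to the softmax weights, or None if nothing is known."""
    weights = recommendation_weights(counts, last_activity, excluded, temperature)
    if not weights:
        return None
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


# Hypothetical activity histories for two users (daily exercise already removed).
histories = [
    ["gym", "walk", "gym", "walk"],
    ["gym", "bonus", "walk"],
]
counts = transition_counts(histories)
print(recommend(counts, "gym", excluded={"bonus"}))
```

A lower `temperature` would make the recommendation lean more strongly towards the most frequent follow-up activity, while a higher one would spread recommendations more evenly.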
We also used statistical analysis to detect outlier activities for the fraud detection feature.
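One common way to flag outliers without labelled fraud cases is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below illustrates that general idea only; the write-up does not say which statistic was actually used, and the feature (a duration in minutes), the sample values, and the 3.5 threshold are assumptions chosen for the example.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, a conventional
    robust alternative to the mean/standard-deviation z-score.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All values are (nearly) identical, so no spread to measure against.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical durations (in minutes) recorded for one activity type across all users.
durations = [30, 32, 29, 31, 30, 33, 28, 300]
suspicious = flag_outliers(durations)
print([durations[i] for i in suspicious])
```

Because the median and MAD are barely affected by a few extreme entries, this kind of check tends to hold up on small datasets where a handful of suspicious values would distort a mean-based score.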
Challenges I ran into
The first challenge was to be able to coordinate between the different team members, as three out of four of us were participating remotely in three different time zones! On the technical side, making use of the data provided to us was quite a challenge, as the dataset was very heterogeneous and the number of features given to us was limited. It was also quite challenging to come up with the fraud detection system as no activities were already labelled fraudulent or not.
Accomplishments that I'm proud of
Being able to run statistical analysis on a relatively small dataset with few entries for each participant.
What I learned
We learned how to deal with raw data and extract value out of it. We also perfected our frontend, backend, devops and database analysis skills!
What's next for DataSoundsNicetoMe
Suggestions for data-collection processes:
Group activities into categories, for example nutrition, fitness, recreation, and loyalty (including bonus programs), so that the data can be analyzed effectively.
Create a labelled dataset of fraudulent activities
Deployment and login
The application was deployed to a micro EC2 instance on AWS. Since it's a demo project, it runs on Django's built-in development server (hence port 8000).
When logging in, use the following user IDs to log in (the password is always
63803693: This user has a lot of activities
586550651: This user has two fraudulent activities