Our Tragic Story
The biggest problem was figuring out how to manage the data (understanding and preprocessing it). We didn't make much progress because of the curious behaviour of the different ML models we tried for finding anomalies (DBSCAN, One-Class SVM, LOF, Isolation Forest, etc.). This went on for ~8 hours, and then we noticed that our dataset was messed up: it had mixed rows, because we had assumed that inodes were unique for every filename.
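A quick sanity check would have caught the inode problem early. The sketch below is not what we ran; it only illustrates the idea, and the `rows` data and the `inodes_with_many_names` helper are made up for this example:

```python
from collections import defaultdict

# Hypothetical (inode, filename) pairs; the real table has more columns.
rows = [
    (1001, "/home/a/report.doc"),
    (1002, "/home/a/notes.txt"),
    (1001, "/tmp/cache.bin"),      # same inode, different filename
    (1002, "/home/a/notes.txt"),
]

def inodes_with_many_names(rows):
    """Return {inode: sorted filenames} for inodes seen with more than one filename."""
    names = defaultdict(set)
    for inode, filename in rows:
        names[inode].add(filename)
    return {inode: sorted(fns) for inode, fns in names.items() if len(fns) > 1}

conflicts = inodes_with_many_names(rows)
print(conflicts)
```

If `conflicts` is not empty, grouping or joining by inode alone will mix rows that belong to different files.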
The Models' Behaviour
- DBSCAN: didn't work (the dataset was too big, so we ran out of memory)
- One-Class SVM: gave us the best overall results
- LOF: the algorithm from the example; has the best precision
- Isolation Forest: the worst one; it is too sensitive to anomalies, which is a disadvantage in our case
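For reference, here is a minimal sketch of how the three models that did run can be compared side by side with scikit-learn. The data is synthetic, and the 5% `nu` / `contamination` value and `n_neighbors=20` are assumptions for illustration, not the settings we actually used:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the real feature matrix (hypothetical data).
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([inliers, outliers])

# The 5% nu / contamination value is an assumption, not a tuned setting.
models = {
    "One-Class SVM": OneClassSVM(nu=0.05, gamma="scale"),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
}

# fit_predict returns +1 for inliers and -1 for outliers in all three models.
flagged = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    flagged[name] = set(np.flatnonzero(labels == -1).tolist())
    print(f"{name}: {len(flagged[name])} rows flagged")
```

Comparing the `flagged` sets (for example, intersecting them) is one way to see where the models agree and where one of them is noticeably more trigger-happy than the others.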
Some Ideas (We Have No Time To Implement Them :D)
After getting all the suspicious files (their inodes and filenames), we can use our 2 remaining tables (the timeline and evt) to understand what is actually "suspicious" about those files. It would also be nice to find relationships between the "sus-files", users, the datetime, and many other interesting things, and to have a visualization of those relationships.
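A rough sketch of that first step, assuming the timeline can be loaded as rows with `inode`, `filename`, `user`, and `datetime` fields (these column names and all the sample values are hypothetical):

```python
from collections import Counter

# Hypothetical suspicious files as (inode, filename) pairs.
suspicious = {(1001, "/tmp/cache.bin"), (2002, "/home/b/tool.exe")}

# Hypothetical timeline rows; the real table's columns may differ.
timeline = [
    {"inode": 1001, "filename": "/tmp/cache.bin", "user": "alice", "datetime": "2020-01-01T10:00:00"},
    {"inode": 1001, "filename": "/home/a/report.doc", "user": "bob", "datetime": "2020-01-01T10:05:00"},
    {"inode": 2002, "filename": "/home/b/tool.exe", "user": "alice", "datetime": "2020-01-02T23:30:00"},
    {"inode": 3003, "filename": "/etc/hosts", "user": "root", "datetime": "2020-01-03T08:00:00"},
]

# Match on the (inode, filename) pair, since an inode alone may map to several filenames.
hits = [r for r in timeline if (r["inode"], r["filename"]) in suspicious]

# Count suspicious events per user as a first look at relationships.
events_per_user = Counter(r["user"] for r in hits)
print(events_per_user)
```

The same `hits` list could then feed a graph or heatmap of files vs. users vs. time.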
Problems We Are Facing and A Possible Solution
Even though we only have 48 suspicious files, they still account for a lot of rows in the timeline. Maybe we can use One-Class SVM again to reduce the number of files.