Data loss/liability protection using ML and Azure Sentinel

OLAF data ingested into Azure Sentinel

Inspiration

The recent Waymo-Uber lawsuit and the almost-daily stream of news stories on large-scale data breaches and leaks have highlighted the necessity of auditing and protecting the information that flows both into and out of an organization's networks and computers. Both data exfiltration and infiltration can make organizations liable for lawsuits and criminal charges over intellectual property theft or confidential data disclosure. Protecting information such as blueprints, engineering drawings, source code, or sensitive financial or health data, which may reside in files that users can download or take home on a USB drive, is now an essential part of any organization-wide security plan.

Data Loss Prevention (DLP) systems are now offered by most large enterprise security vendors and cloud services providers like Google and Microsoft. But most DLP solutions available today are commercial proprietary systems that are closed-source and require centralized user and device management like Active Directory. In addition to the considerable initial costs such systems incur, a closed-source security solution does not allow its code to be audited or tested for potential bugs, security vulnerabilities, or insecure methods of storing data. A security hole or data breach in an enterprise-wide DLP system could be disastrous for an organization.

What it does

OLAF (Online Automated Forensics) is an open-source digital forensics tool for public-facing PCs, business desktops, and other endpoints that can classify and audit, in near real-time, the documents and images users download or copy to removable storage. OLAF continuously monitors a PC for activity like downloading files or mounting external storage, then extracts the text or image content of the transferred files using open-source libraries. Content analysis is then performed on each artifact using different methods, including analysis of images by ML services like Azure Cognitive Services Vision. The results, together with metadata like the date and time a document was downloaded or copied and the user and machine name where the operation occurred, are logged to Azure Sentinel. This provides a permanent event log of the movement of potential intellectual property or sensitive or PII data, which can then serve as evidence of a breach of an organization's information protection policies.

OLAF helps an organization determine how information flows into and out of its systems when users download files using their web browser or copy files to USB and other removable storage devices, and can help enforce policies on using and transferring intellectual property, trade secrets, PII, or other sensitive and confidential information. It can also detect when sexual or other inappropriate content from the Internet enters an organization's systems, which may violate computer use policies. OLAF is an open-source standalone program designed to be modular and extendable. It can log to any available log target, from simple text files to cloud-based SIEM services like Azure Sentinel, without locking an organization into any particular user management or log collection technology.

I have written previously about the image classification capabilities of OLAF in a Code Project article, which I entered into their Image Classification Challenge contest last year and which won 2nd place. For this hackathon I added the following capabilities to OLAF:

Detect storage volumes mounted and unmounted in Windows and monitor the files transferred to them.
Use Apache Tika for cracking and analyzing the content of files downloaded or copied to removable storage.
Use dictionaries and regular expressions to detect when a file may contain PII or confidential or sensitive information like a competitor's intellectual property.
Add metadata like file hashes and the current Windows application that has the user's focus when files are created or copied.
Log file analysis results and metadata to Azure Log Analytics and create queries in Azure Sentinel to surface potential data-loss events from the OLAF log data.
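
As a rough illustration of the dictionary-and-regular-expression check described in the list above, here is a minimal sketch. OLAF itself is written in C#; this Python version is only a stand-in for the idea, and the pattern set, the term list, and the names `PATTERNS`, `SENSITIVE_TERMS`, `Finding` and `scan_text` are all hypothetical rather than taken from the OLAF source.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical patterns, chosen for illustration only. A real deployment
# would need a much larger, locale-aware set.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical dictionary of terms that suggest confidential material.
SENSITIVE_TERMS = {"confidential", "proprietary", "trade secret"}


@dataclass
class Finding:
    kind: str   # which pattern or dictionary produced the hit
    match: str  # the matched text or dictionary term


def scan_text(text: str) -> List[Finding]:
    """Return every regex match and dictionary term found in the text."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append(Finding(kind, m.group(0)))
    lowered = text.lower()
    for term in sorted(SENSITIVE_TERMS):
        if term in lowered:
            findings.append(Finding("dictionary", term))
    return findings
```

A caller would pass in the text extracted from a file and attach the returned findings to the log entry for that file. Simple substring and regex checks like these produce false positives, so they are best treated as signals to review rather than proof of a leak.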

How I built it

Design

OLAF is a .NET Framework app written in C# targeting Windows desktops. Although some platform-specific detectors use functions from the Windows API, the base OLAF libraries are cross-platform and should run on .NET Core, or on Mono for Linux or Mac. OLAF currently runs as a console app for development but can easily be converted to a Windows service for production use.

OLAF is designed around a multi-threaded message queue where the different components can listen to and respond to messages without blocking execution of other components. Since OLAF is designed to run continuously as a background service, a key factor in the design is performing this work without affecting the responsiveness of the user's other running applications.
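
The queue-based design can be sketched in a few lines. This is a Python illustration of the general producer/consumer pattern, not OLAF's C# implementation; the message shape and the names `run_demo`, `detector` and `monitor` are assumptions made for the example.

```python
import queue
import threading


def run_demo(paths):
    """Send file events from a detector thread to a monitor thread via a queue."""
    messages = queue.Queue()   # thread-safe FIFO shared by both components
    processed = []
    stop = object()            # sentinel telling the monitor to shut down

    def detector():
        # Stand-in for an OS notification handler: it only enqueues messages
        # and never does heavy work itself.
        for path in paths:
            messages.put({"type": "FileCreated", "path": path})
        messages.put(stop)

    def monitor():
        # Blocks on the queue, so it uses no CPU while idle.
        while True:
            msg = messages.get()
            if msg is stop:
                break
            processed.append(msg["path"])

    threads = [threading.Thread(target=monitor), threading.Thread(target=detector)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Because the detector only enqueues and returns, a burst of notifications does not stall on slow analysis work; the monitor drains the queue at its own pace.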

Modularity and extendability are also key factors in the design. OLAF components are designed to be plugged in and out without difficulty with each category of component deriving from a common base class. There are 3 main classes of components: ActivityDetector, Monitor, and Service:

Activity detectors are lightweight components designed to receive notifications from the operating system and then place messages on the OLAF queue notifying of file or other activity that OLAF components are interested in. Activity Detectors do not do any significant processing on their own and are only there to quickly send messages to monitor components when for instance the user plugs in or removes a USB drive. Activity Detectors must be able to handle a large number of notifications coming in from the operating system and place activity messages on the queue where Monitor components process them serially.
Monitors are the components which handle activity detection messages and identify the image and document artifacts that will be preserved and analyzed by OLAF. For instance, the DirectoryChanges monitor receives file system activity messages about new files being created at a particular storage location. It first copies the file to an internal data folder to preserve it, then reads the file and creates a FileArtifact message that will be processed by a Service.
Services process file and image artifacts and implement file cracking, text and metadata extraction, image and text analysis, classification, PII detection, and event log storage. A service will process an Artifact, enrich it with additional information and metadata, and then place it back on the queue where it is available for processing by other services. For instance, the Tesseract OCR service will extract text from an ImageArtifact and create a TextArtifact, which is placed on the queue and can be processed by a text Classifier service and finally by a Storage service which stores the artifact and metadata as a log entry in Azure Log Analytics. Service components can be further broken down into Extractor, Classifier and Storage components.
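
The way services enrich an artifact and hand it back to the queue can be sketched as follows. This is a simplified Python model of the idea, not OLAF's C# class hierarchy: `FileExtractor`, `KeywordClassifier`, `MemoryStorage`, `LogEntry` and `drain` are hypothetical names, and plain UTF-8 decoding stands in for a real extractor such as Tika.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Artifact:
    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class FileArtifact(Artifact):
    data: bytes = b""


@dataclass
class TextArtifact(Artifact):
    text: str = ""


@dataclass
class LogEntry(Artifact):
    pass


class Service:
    accepts = Artifact  # artifact type this service handles

    def process(self, artifact):
        raise NotImplementedError


class FileExtractor(Service):
    accepts = FileArtifact

    def process(self, artifact):
        # Enrich with a file hash, then emit the extracted text.
        digest = hashlib.sha256(artifact.data).hexdigest()
        meta = dict(artifact.metadata, sha256=digest)
        text = artifact.data.decode("utf-8", errors="replace")
        return TextArtifact(artifact.name, meta, text)


class KeywordClassifier(Service):
    accepts = TextArtifact

    def process(self, artifact):
        flagged = "confidential" in artifact.text.lower()
        return LogEntry(artifact.name, dict(artifact.metadata, sensitive=flagged))


class MemoryStorage(Service):
    accepts = LogEntry

    def __init__(self):
        self.entries = []

    def process(self, artifact):
        self.entries.append(artifact)  # end of the line: nothing re-queued
        return None


def drain(pending, services):
    """Process artifacts until the queue is empty, re-queuing each result."""
    while pending:
        artifact = pending.popleft()
        for service in services:
            if isinstance(artifact, service.accepts):
                result = service.process(artifact)
                if result is not None:
                    pending.append(result)
```

Each service only declares which artifact type it accepts, so adding a new extractor or classifier means adding a class rather than editing the dispatch loop.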

OLAF is designed so that developers can quickly assemble a pipeline of services to process artifacts. A Pipeline object is a set of OLAF services that process an artifact in sequential fashion, e.g. the Document pipeline, which is used for DLP.
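
A pipeline in this sense is essentially an ordered list of stages applied one after another. The sketch below shows that shape in Python; the `Pipeline` class here and the three stage functions are hypothetical stand-ins, not the actual OLAF Document pipeline.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]


class Pipeline:
    """An ordered list of stages, each taking and returning an artifact dict."""

    def __init__(self, name: str, stages: List[Stage]):
        self.name = name
        self.stages = stages

    def run(self, artifact: Dict) -> Dict:
        for stage in self.stages:
            artifact = stage(artifact)  # output of one stage feeds the next
        return artifact


def extract_text(artifact: Dict) -> Dict:
    # Stand-in for a real extractor: treat the bytes as UTF-8 text.
    return {**artifact, "text": artifact["data"].decode("utf-8", errors="replace")}


def detect_keywords(artifact: Dict) -> Dict:
    terms = ("confidential", "secret")  # illustrative term list
    lowered = artifact["text"].lower()
    return {**artifact, "flags": [t for t in terms if t in lowered]}


def to_log_entry(artifact: Dict) -> Dict:
    # Keep only the fields worth logging.
    return {"file": artifact["name"], "flags": artifact["flags"]}


document_pipeline = Pipeline("Document", [extract_text, detect_keywords, to_log_entry])
```

Calling `document_pipeline.run({"name": "q3.txt", "data": b"Top SECRET figures"})` passes the artifact through each stage in order and returns the final log entry.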

Libraries and Services

OLAF makes use of several open-source libraries as well as SDKs for Azure machine learning services:

Extractors

Name	Description
Tesseract	Open-source OCR engine which can extract text from images. OLAF uses a wrapper called tesseractdotnet which also includes a .NET wrapper for the leptonica image processing library.
Apache Tika	Open-source document processing library that can extract full text and metadata from many common document file types like `.PDF`, `.DOC` and `.XLS`.

Classifiers

Name	Description
Accord.NET	Open-source .NET machine learning library which contains implementations of many algorithms like the Viola-Jones object detection framework which can detect if an image contains human faces.
Azure Computer Vision	Azure's ML-powered image analysis service which provides powerful classification and recognition capabilities. For images that users download or copy, Azure Computer Vision can classify images using categories, captions, and tags, and can also determine if an image contains racy or sexual content.
Azure Text Analytics	Azure's ML-powered text analytics service. Recognizes and extracts entities like people or place names from free text.

In the near future I also want to try out Google's Cloud DLP service, which has both a free tier and an official .NET client, for text analysis and classification.

Storage

Name	Description
Azure Blob Storage	Azure's bulk storage service which can store both structured and unstructured data. OLAF's `AzureBlobStorage` component can store both text and image artifacts for preservation.
Azure Log Analytics	I used Tobias Zimmergren's Log Analytics wrapper from his terrific article on using the Log Analytics Data Collector API to create custom tables and event types. OLAF creates a log entry for every document and image artifact it analyzes.
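
For readers unfamiliar with the Log Analytics Data Collector API mentioned in the table above, the sketch below shows roughly how a request to it is signed and assembled, following the scheme in Microsoft's documentation for that API. It is written in Python for illustration and is not the C# wrapper OLAF uses. The function names and the `OLAF` log type are assumptions, and the code only builds the request without sending it.

```python
import base64
import hashlib
import hmac
import json
from email.utils import formatdate


def build_signature(workspace_id, shared_key, date, content_length,
                    method="POST", content_type="application/json",
                    resource="/api/logs"):
    """Build the SharedKey Authorization header value for one request."""
    string_to_sign = (
        f"{method}\n{content_length}\n{content_type}\n"
        f"x-ms-date:{date}\n{resource}"
    )
    key = base64.b64decode(shared_key)  # the workspace key is base64 text
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}"


def build_request(workspace_id, shared_key, log_type, records):
    """Return (url, headers, body) for posting records to a custom log table."""
    body = json.dumps(records).encode("utf-8")
    date = formatdate(usegmt=True)  # RFC 1123 date, e.g. "Mon, 01 Jan ... GMT"
    headers = {
        "Content-Type": "application/json",
        "Authorization": build_signature(workspace_id, shared_key, date, len(body)),
        "Log-Type": log_type,  # becomes the custom table name with a _CL suffix
        "x-ms-date": date,
    }
    url = (f"https://{workspace_id}.ods.opinsights.azure.com"
           "/api/logs?api-version=2016-04-01")
    return url, headers, body
```

The returned tuple can be handed to any HTTP client to perform the POST. The workspace ID and shared key come from the Log Analytics workspace settings and should be kept out of source control.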

Challenges I ran into

When OLAF detects that a storage volume has been added to the system it spawns a DirectoryMonitor to monitor files written to the volume. However, if you attempt to remove the volume using the Windows tray icon program then Windows will complain that the drive is still in use...because OLAF has a monitor running on it :o I've started the code to register and process the Win32 messages but it isn't all hooked up yet, so for now, to remove a USB drive when OLAF is running, you'll just have to yank it out the old-fashioned way.

Accomplishments that I'm proud of

I haven't been able to find an open-source DLP solution except for MyDLP which seems to have been discontinued. OLAF DLP is an end-to-end open-source DLP that is both free and capable when combined with Azure Sentinel and I hope to continue development on it.

What I learned

Computer security is always an opportunity to learn many new things, and in developing OLAF I learnt everything from Windows API calls to working with a SIEM.