(Best Digital Forensics Hack)


When trying to locate data through logical or physical approaches, it can be difficult to decides which key words and phrases to use when parsing through data sets. Some tools provide similar ideas but require large amounts of time to prepare the do background tasks due to their overall robust nature. This introduces the need to parse a device in the logical or physical manner with less amount of extra computing or additional tasks/preprocessing involved.

What it does

The purpose of this program is to use the Natural Language Toolkit to analyze a set of text. The NLTK provides sanitized and specifcally grouped text segments. These Text segments allow for the program to search accross raw data for phrases as key words. The Program can parse files and physical disks in binary formart block by block in search for the data. When the data is found the program outputs the block number index and what key word was found in a text file. It is essentially a raw preview and parse program of logical and physical data This becomes exponentially more useful when using words found images against hundreds of Gigabytes of data or even a couple Terabytes of data. The program was tested on one large text sample broken in to several text files.

How I built it

I used Python as my base programming language. I then utilized the Gooey library which is a UI wrapper for Python's ArgParse library and provides visual verbose functionality.

I then broke down the sub requirements into:

installation of NLTK and corresponding libraries

generate and obtain test data

build algorithm/functionality of Natural Language Processing NLTK to read from a text file and parse accordingly

Build algorithm and functionality to check and read physical disks and logical files in binary format.

Create robust and simple UI to allow the user to complete their task

Combine NLTK generated data set and binary parsing of data to sequentially iterate block by block through data

Create a simple logging/reporting functionality for use to look back into when task was complete so user can make on the fly triaging decisions.

Challenges I ran into

Physical access to disks and binary parsing of files and physical disks

IO issues with opening closing writing to and from files

Parsing arguments given by user and handling exceptions

Accomplishments that I'm proud of

Accessing and Parsing Raw Physical disks in Python with minimal overhead

Using NLP and NLTK to generate clean keys words and phrases

What I learned

Using Natural Language Processing NLTK to read and tokenize and sanitize text( stop words)

combining IO of Raw physical disks

reading binary block by block

decoding binary to readable text.

improving error handling from binary IO

What's next for NLTKParser

Improved performance through multithreading

Improved NLP and NLTK implementation

More Robust Reporting features

Share this project: