Inspiration

  • The increasing use of encrypted messaging apps by organized crime.
  • Real cases where offenders were identified through writing-style analysis (e.g., the Unabomber).
  • The challenge of profiling when traditional metadata is absent or anonymized.
  • The opportunity to combine linguistic style, emotional tone, and behavioral cues for deeper analysis.

What it does

How we built it

  • Python for processing and ML
  • spaCy, NLTK for NLP preprocessing
  • Empath, Text2Emotion for psycholinguistic feature extraction
  • scikit-learn, XGBoost, BERT for classification
  • Streamlit for the interactive user interface
  • Neo4j / NetworkX (optional) for visualizing criminal networks

🧱 Workflow

  1. Data Collection: Collected and generated chat datasets in the style of encrypted messaging apps (synthetic and anonymized).
  2. Preprocessing: Cleaned and normalized texts, interpreted emojis, handled coded slang.
  3. Feature Extraction:
    • Stylometric: word length, POS tags, punctuation, etc.
    • Psycholinguistic: emotion scores, power/dominance cues, social language.
  4. Role Classification: Trained ML models to predict criminal roles using labeled feature sets.
  5. Visualization: Developed an easy-to-use dashboard for real-time analysis and insights.
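
The stylometric part of step 3 can be illustrated with a minimal sketch. It uses only the Python standard library rather than spaCy or NLTK, so it omits POS tags, and the function name and the exact feature set are assumptions chosen for illustration, not the project's actual code.

```python
import re
import string
from statistics import mean


def stylometric_features(message: str) -> dict[str, float]:
    """Compute a few simple writing-style features for one chat message."""
    # Words are runs of letters and apostrophes; digits and emojis are ignored here.
    words = re.findall(r"[A-Za-z']+", message)
    n_chars = max(len(message), 1)  # avoid division by zero on empty input
    n_words = len(words)
    return {
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
        "punctuation_ratio": sum(ch in string.punctuation for ch in message) / n_chars,
        "uppercase_ratio": sum(ch.isupper() for ch in message) / n_chars,
        # Distinct lowercase words divided by total words (vocabulary richness).
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
    }


features = stylometric_features("Drop is set. Same place, same time!")
```

Each message becomes a small dictionary of numbers that a classifier in step 4 could consume alongside the psycholinguistic scores.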

Challenges we ran into

🚫 Lack of Real Labeled Data: Legal and ethical barriers made it hard to access actual encrypted criminal chat datasets.

🧠 Obfuscated & Coded Language: Use of slang, emojis, and indirect phrasing made language interpretation difficult.
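
One simple way to approach this is a lookup-based normalization pass before feature extraction. The sketch below is an assumption about how such a pass could look; the emoji and slang tables are hypothetical examples, not the project's real lexicon.

```python
import re

# Hypothetical lookup tables for illustration only.
EMOJI_MAP = {"📦": " package ", "💰": " money ", "🚗": " car "}
SLANG_MAP = {"pkg": "package", "tmrw": "tomorrow", "u": "you"}


def normalize_message(message: str) -> str:
    """Lowercase a message, spell out known emojis, and expand known slang tokens."""
    text = message.lower()
    for emoji, word in EMOJI_MAP.items():
        text = text.replace(emoji, word)
    # Split into word tokens and single punctuation/symbol tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(SLANG_MAP.get(tok, tok) for tok in tokens)


normalized = normalize_message("U got the 📦? Bring 💰 tmrw")
```

A static table like this only covers known codes; unseen slang and indirect phrasing would still pass through unchanged, which is why the problem remained difficult.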

🔄 Feature Fusion Complexity: Merging stylometric, psycholinguistic, and contextual features into one model was non-trivial.
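
As a rough illustration of the fusion step, the sketch below concatenates named feature groups into one vector with a stable column order. The function name, group names, and values are hypothetical, and the source does not describe how the project actually merged its features.

```python
def fuse_features(groups: dict[str, dict[str, float]]) -> tuple[list[str], list[float]]:
    """Flatten named feature groups into parallel lists of column names and values."""
    names: list[str] = []
    values: list[float] = []
    # Sorting keeps the column order identical across messages.
    for group_name in sorted(groups):
        for feature_name in sorted(groups[group_name]):
            # Prefix with the group name so equal feature names cannot collide.
            names.append(f"{group_name}.{feature_name}")
            values.append(float(groups[group_name][feature_name]))
    return names, values


names, vector = fuse_features({
    "style": {"avg_word_length": 3.7, "punctuation_ratio": 0.09},
    "psych": {"anger": 0.1, "dominance": 0.6},
})
```

Part of the difficulty is that the groups live on different scales (word lengths versus ratios versus emotion scores), so in practice the fused columns would usually be standardized before training.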

📉 Model Interpretability: Explaining why a user was classified as a smuggler or supplier was crucial, but hard without transparency tools.
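
For a linear model, one basic form of explanation is to rank features by their contribution (weight times value) to the score. The sketch below assumes such a linear scorer with made-up weights; it is not the project's method, and tree-based or transformer models would need dedicated attribution tools instead.

```python
def explain_linear_score(
    weights: dict[str, float],
    features: dict[str, float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k features ranked by absolute contribution to a linear score."""
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_k]


# Hypothetical weights for a "supplier" score and one user's feature values.
weights = {"psych.dominance": 2.0, "style.punctuation_ratio": -1.0, "style.avg_word_length": 0.1}
features = {"psych.dominance": 0.6, "style.punctuation_ratio": 0.09, "style.avg_word_length": 3.7}
top = explain_linear_score(weights, features, top_k=2)
```

Here the dominance cue contributes most to the score, which is the kind of statement an investigator could check against the underlying messages.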

🌐 Domain Adaptation: Generic NLP models struggled to adapt to criminal lingo without fine-tuning on domain-specific data.

Accomplishments that we're proud of

🧠 Role Prediction Using Language Only: Accurately classified roles like supplier, smuggler, or middleman based purely on chat patterns.

✍️ Combined Stylometry & Psycholinguistics: Successfully fused writing style and psychological cues into a unified profiling system.

🔍 Decoded Coded Communication Patterns: Handled slang, emojis, and metaphorical language to extract real behavioral signals.

📊 Built a Real-Time Profiling Dashboard: Created an interactive interface for investigators to visualize roles, risk levels, and linguistic fingerprints.

🧪 Created a Domain-Specific NLP Dataset: Generated a synthetic but realistic criminal chat dataset tailored for stylometric and behavioral analysis.

What we learned

  1. Stylometry matters: Writing style can help distinguish roles and even individuals.
  2. Psycholinguistic signals are subtle: Emotional cues and cognitive markers help expose a user’s intent.
  3. Coded language is prevalent: Emojis, slang, and metaphors are used to hide meaning, yet patterns still emerge.
  4. Combined feature sets improve accuracy: Merging text structure with emotional and semantic signals yielded better predictions.

What's next for NeoNarcoNLP

Multilingual Support – Extend to regional languages and dark-web slang.

Real-World Dataset Integration – Collaborate with law enforcement (where ethical and legal) to obtain real data from encrypted platforms.

Role & Risk Scoring – Add features for threat level and behavioral intent classification.

Chatbot Integration – Enable real-time suspect profiling via chat interfaces.

Deployment as API – Offer as a secure tool for digital forensic teams and investigators.

Built With

  • empath
  • huggingface-transformers
  • natural-language-processing
  • nltk
  • python
  • scikit-learn
  • spacy
  • streamlit
  • text2emotion