Inspiration

In Turkey, 3.6 million people rely on Turkish Sign Language (TID) as their primary means of communication. That left me with a simple question: why can't hearing people just speak to a TID translator instead of typing?

I analyzed 596 user reviews across four major TID mobile applications (Sesim Elim, TID3B Avatar, TID Sozluk, Isaret Dili Hareketli) and found that none of them supported real-time speech input. Beyond that, 52% of all complaints concerned technical instability: apps that would not open or that crashed on launch. The gap between what users needed and what existed was clear, and EKTA was born from that gap.

What it does

EKTA (Erişilebilir Konuşma Tercüme Asistanı) is a real-time, emotion-aware speech-to-Turkish Sign Language translation system.

A hearing user speaks naturally in Turkish → EKTA transcribes the speech, detects the emotional tone, and displays the corresponding TID sign GIF sequences — all in under 3 seconds.

What makes EKTA unique is its emotion layer. The phrase "Gel buraya" (Come here) means something very different when spoken warmly versus angrily. Existing systems ignore this entirely. EKTA doesn't.

How we built it

EKTA uses a four-module architecture:

1. Speech Recognition
OpenAI Whisper Small (~460 MB, runs fully offline) transcribes Turkish audio captured at 16 kHz in 8-second windows, achieving 94.7% word accuracy.
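As a rough illustration of the 8-second windowing at 16 kHz described above, the sketch below splits a stream of samples into consecutive windows. The function name `iter_windows` and the choice of non-overlapping windows are assumptions made for this example; the write-up does not say how EKTA buffers audio.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # capture rate stated in the write-up
WINDOW_SECONDS = 8       # window length stated in the write-up


def iter_windows(samples: Sequence[float],
                 sample_rate: int = SAMPLE_RATE_HZ,
                 window_seconds: int = WINDOW_SECONDS) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping windows (assumption: no overlap).

    The final window may be shorter than the others.
    """
    window_len = sample_rate * window_seconds
    for start in range(0, len(samples), window_len):
        yield list(samples[start:start + window_len])
```

Each window would then be handed to the recognizer. With the `openai-whisper` package that would plausibly mean loading the model once with `whisper.load_model("small")` and calling `model.transcribe(...)` with `language="tr"` on a float32 array per window, though the exact integration EKTA uses is not shown here.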

2. Three-Layer Multimodal Emotion Analysis
The core contribution — a weighted fusion of three modalities:

$$E_{final} = \alpha \cdot E_{audio} + \beta \cdot E_{text} + \gamma \cdot E_{rule}$$

$$\alpha = 0.20, \quad \beta = 0.50, \quad \gamma = 0.30$$

  • Layer 1 (Audio Prosody, α=0.20): librosa extracts RMS energy, zero-crossing rate, and mel-spectrograms to detect vocal intensity patterns
  • Layer 2 (BERT Turkish Sentiment, β=0.50): savasy/bert-base-turkish-sentiment-cased captures contextual semantic emotion from transcribed text
  • Layer 3 (Rule-Based Lexicon, γ=0.30): High-precision keyword matching across 40+ Turkish emotion terms (precision: 0.92)
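The fusion rule above can be sketched in a few lines. This is a minimal illustration, assuming each layer outputs a dictionary of per-emotion scores. The emotion labels, the function names `fuse` and `top_emotion`, and the final renormalization step are hypothetical and not taken from EKTA's code; only the three weights come from the write-up.

```python
from typing import Dict

# Fusion weights from the write-up: alpha (audio), beta (text), gamma (rule).
WEIGHTS = {"audio": 0.20, "text": 0.50, "rule": 0.30}


def fuse(audio: Dict[str, float],
         text: Dict[str, float],
         rule: Dict[str, float],
         weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Weighted sum of per-emotion scores from the three layers."""
    labels = set(audio) | set(text) | set(rule)
    fused = {
        label: (weights["audio"] * audio.get(label, 0.0)
                + weights["text"] * text.get(label, 0.0)
                + weights["rule"] * rule.get(label, 0.0))
        for label in labels
    }
    # Assumption: renormalize so the fused scores sum to 1 for display.
    total = sum(fused.values())
    if total > 0:
        fused = {label: score / total for label, score in fused.items()}
    return fused


def top_emotion(fused: Dict[str, float]) -> str:
    """Return the label with the highest fused score."""
    return max(fused, key=fused.get)
```

With these weights, a confident lexicon hit can outvote a mild text-sentiment score even when the audio layer is undecided, which matches the intent of giving the high-precision rule layer more weight than prosody.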

3. TID Translation
A 2,000+ sign dictionary with Turkish character normalization, suffix stripping, and fuzzy matching (difflib, threshold=0.60).
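The lookup pipeline described above (character normalization, suffix stripping, then fuzzy matching) could look roughly like the sketch below. The suffix list, the tiny dictionary in the usage note, and the helper names are hypothetical placeholders; only `difflib` and the 0.60 threshold come from the write-up.

```python
import difflib
from typing import Dict, Optional

# Fold Turkish-specific letters to ASCII before lowercasing, so that
# "İ" does not turn into "i" plus a combining dot.
_TR_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosucgiosu")

# Hypothetical, deliberately tiny suffix list (already ASCII-folded).
_SUFFIXES = ("iyorum", "ecegim", "acagim", "tim", "dim", "lar", "ler")


def normalize(word: str) -> str:
    """ASCII-fold Turkish characters and lowercase the word."""
    return word.translate(_TR_MAP).lower().strip()


def strip_suffix(word: str, min_stem: int = 2) -> str:
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def lookup(word: str, sign_dict: Dict[str, str],
           cutoff: float = 0.60) -> Optional[str]:
    """Try exact match, then stem match, then fuzzy match on the stem."""
    key = normalize(word)
    stem = strip_suffix(key)
    for candidate in (key, stem):
        if candidate in sign_dict:
            return sign_dict[candidate]
    matches = difflib.get_close_matches(stem, list(sign_dict), n=1, cutoff=cutoff)
    return sign_dict[matches[0]] if matches else None
```

For example, with `{"git": "git.gif"}` as the dictionary, "gittim" reduces to the exact stem "git", while "gidiyorum" reduces to "gid" and is recovered by the fuzzy match, since the softened consonant still leaves a similarity ratio above 0.60.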

4. Web Interface
Flask + Socket.IO for real-time bidirectional communication, with live emotion probability bar charts and synchronized GIF playback.

Challenges we ran into

  • Multimodal weight calibration: Finding optimal fusion weights (α, β, γ) required grid search over a manually labeled 100-sample validation set. Audio prosody alone performed poorly (45% accuracy) due to inter-speaker variability.

  • Turkish NLP specifics: Turkish is an agglutinative language — "gidiyorum," "gideceğim," and "gittim" all stem from "git." Suffix removal for dictionary matching required custom normalization beyond standard stemming tools.

  • Offline reliability vs. performance: Choosing Whisper Small over larger variants was a deliberate tradeoff — user feedback showed that existing apps failed due to network dependency. Local inference was non-negotiable.

  • Emotion ambiguity: Surprise and fear share similar prosodic profiles (high ZCR, variable energy), which gave these two classes the lowest per-emotion F1-scores (0.70–0.72).
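The weight calibration mentioned in the first challenge can be illustrated with a small grid search over weight triples that sum to 1. This is a sketch under assumptions: the 0.05 step size, the sample format (three per-layer score dictionaries plus a gold label), and the function names are hypothetical, and it is not claimed to reproduce the reported weights.

```python
from typing import Dict, Sequence, Tuple

Scores = Dict[str, float]
Sample = Tuple[Scores, Scores, Scores, str]  # (audio, text, rule, gold label)
Weights = Tuple[float, float, float]         # (alpha, beta, gamma)


def predict(audio: Scores, text: Scores, rule: Scores, w: Weights) -> str:
    """Pick the label with the highest weighted score (ties: alphabetical)."""
    labels = sorted(set(audio) | set(text) | set(rule))
    return max(labels, key=lambda l: (w[0] * audio.get(l, 0.0)
                                      + w[1] * text.get(l, 0.0)
                                      + w[2] * rule.get(l, 0.0)))


def grid_search(samples: Sequence[Sample],
                step: float = 0.05) -> Tuple[Weights, float]:
    """Try every (alpha, beta, gamma) on the grid with alpha+beta+gamma = 1."""
    n = round(1 / step)
    best_w: Weights = (1 / 3, 1 / 3, 1 / 3)
    best_acc = -1.0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            k = n - i - j
            w = (i * step, j * step, k * step)
            correct = sum(predict(a, t, r, w) == y for a, t, r, y in samples)
            acc = correct / len(samples)
            if acc > best_acc:  # keep the first best triple found
                best_w, best_acc = w, acc
    return best_w, best_acc
```

On a 100-sample set, many weight triples are likely to tie on accuracy, so a real calibration would probably need a tie-breaking rule or a secondary metric such as macro F1.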

Accomplishments that we're proud of

  • 78% emotion recognition accuracy (F1: 0.76): 33 percentage points above the audio-only baseline (45%) and 8 points above text-only
  • Sub-3-second end-to-end latency on consumer hardware
  • To our knowledge, the first Turkish real-time speech-to-TID system with integrated emotion analysis, based on a systematic literature review
  • Empirical user research foundation: 596 reviews, sentiment analysis, keyword frequency mapping — not just a technical demo

What we learned

  • User feedback is a goldmine for system design. The 25% "calismiyor" (not working) mention rate in TID3B Avatar reviews directly shaped our offline-first architecture decision.
  • Multimodal fusion genuinely outperforms any single modality — but only when weights reflect each modality's actual reliability, not equal distribution.
  • Sign language carries emotion through non-manual markers (facial expressions, body posture) that text-only systems completely discard. Bridging this gap is both a technical and a linguistic challenge.

What's next for EKTA

  • Expanded vocabulary: Partnership with TID linguists to grow beyond the current 2,000-sign dictionary (~10–15% of estimated TID vocabulary)
  • Emotion-driven sign parameters: Modifying signing velocity and intensity based on detected emotion, grounded in TID linguistic research
  • Bidirectional translation: Sign language recognition (camera input → spoken Turkish) to enable full two-way conversation
  • Real-world user testing: Evaluation with Turkish deaf community members — the most important validation step not yet completed
  • Submission to SIU 2026 (34th IEEE Signal Processing and Communications Applications Conference, Piri Reis University, Istanbul)
