Tribal
Inspiration
Elephant researchers collect thousands of field recordings, but much of the audio is unusable. Calls are buried under airplane rumble, vehicle traffic, and generator hum, all of which overlap the calls' own 10–1000 Hz frequency range.
The core challenge is that elephant calls and mechanical noise look nearly identical in the frequency domain. Traditional denoising fails because it treats sound as static, when the real signal lies in how sound changes over time: elephant calls are dynamic and burst-like, while engines are steady and continuous.
What it does
Tribal is a real-time AI system that runs entirely on a Raspberry Pi for localized field deployment, allowing researchers to process recordings directly at the source without relying on cloud infrastructure.
It removes 4–5 dB of background noise on average, reconstructs missing harmonics, detects and separates overlapping callers, and clusters calls into groups that reveal communication patterns.
It works fully offline, requires no training data, and processes a 10-second call in about 1.5–2 seconds using less than 200 MB of memory.
The result is clean, structured data that turns previously unusable recordings into meaningful insight about elephant communication and behavior.
How we built it
Tribal is built as a lightweight, multi-agent pipeline optimized for edge devices. Each stage runs as an independent process and passes messages through queues. This avoids shared memory, keeps a failure in one stage from bringing down the whole system, and keeps performance stable on low-power hardware.
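As a rough illustration of that queue-based structure (the stage names, the `identity` placeholder, and the sentinel convention below are simplified assumptions, not the repo's actual code):

```python
import multiprocessing as mp

def identity(item):
    # Placeholder for a real stage's work (preprocessing, NMF separation, ...)
    return item

def stage(name, work_fn, in_q, out_q):
    # Each stage is an isolated process: an exception here drops one item
    # instead of corrupting shared state or killing the whole pipeline.
    for item in iter(in_q.get, None):            # None is the shutdown sentinel
        try:
            out_q.put(work_fn(item))
        except Exception as exc:
            print(f"[{name}] dropped one item: {exc}")
    out_q.put(None)                              # propagate shutdown downstream

if __name__ == "__main__":
    q1, q2, q3 = mp.Queue(), mp.Queue(), mp.Queue()
    procs = [
        mp.Process(target=stage, args=("preprocess", identity, q1, q2)),
        mp.Process(target=stage, args=("separate", identity, q2, q3)),
    ]
    for p in procs:
        p.start()
    q1.put("clip_001.wav")                       # hypothetical work item
    q1.put(None)
    print(q3.get())                              # result from the last stage
    for p in procs:
        p.join()
```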
The pipeline
Preprocessing
Audio is downsampled from 44.1 kHz to 4 kHz and band-pass filtered to the 10–1000 Hz range. This removes irrelevant frequencies and reduces compute load. Noise-only reference segments are also extracted.
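A minimal sketch of this step with SciPy; the filter order and zero-phase filtering are assumptions rather than the project's exact settings:

```python
from scipy.signal import resample_poly, butter, sosfiltfilt

def preprocess(audio_44k):
    # 44.1 kHz -> 4 kHz via rational resampling (44100 * 40 / 441 = 4000)
    audio = resample_poly(audio_44k, up=40, down=441)
    # 4th-order Butterworth band-pass, 10–1000 Hz, applied zero-phase
    sos = butter(4, [10, 1000], btype="bandpass", fs=4000, output="sos")
    return sosfiltfilt(sos, audio)
```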
Noise fingerprinting
We compute power spectral density to classify noise sources like generators, airplanes, and vehicles. This informs how aggressively different frequencies should be suppressed.
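Roughly, the fingerprinting idea looks like this with Welch's method; the band boundary, threshold, and labels are hypothetical:

```python
import numpy as np
from scipy.signal import welch

def fingerprint_noise(noise_segment, sr=4_000):
    # Power spectral density of a noise-only reference segment
    freqs, psd = welch(noise_segment, fs=sr, nperseg=1024)
    peak = freqs[np.argmax(psd)]
    # Hypothetical rule of thumb: a generator shows a narrow low-frequency hum,
    # while airplanes and vehicles spread energy over a broader band.
    if peak < 120 and psd.max() > 10 * np.median(psd):
        return "generator", freqs, psd
    return "broadband (airplane/vehicle)", freqs, psd
```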
Core separation (NMF)
We convert audio into a spectrogram and apply Non-negative Matrix Factorization to decompose it into components.
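In sketch form, using scikit-learn's NMF on the magnitude spectrogram (the window size, overlap, and component count are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def decompose(audio, sr=4_000, n_components=8):
    # Complex STFT; NMF needs a non-negative matrix, so we factor the magnitude
    freqs, times, Z = stft(audio, fs=sr, nperseg=512, noverlap=384)
    S = np.abs(Z)
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(S)   # spectral bases   (freq_bins x components)
    H = model.components_        # time activations (components x frames)
    return Z, W, H
```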
Each component is scored using:
- Harmonic structure (fundamental frequency and integer multiples)
- Temporal variation (coefficient of variation)
Elephant calls show high temporal variation (around 1.5 or higher), while engines remain low (around 0.1–0.2). This becomes the key discriminator.
We select the top components and apply a soft Wiener mask to separate signal from noise without introducing artifacts.
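A simplified version of the selection and masking step; the top-K of 3 and the exact form of the CV score are assumptions:

```python
import numpy as np

def coefficient_of_variation(H):
    # Temporal variation per component: std / mean of its activation over time.
    # Calls are bursty (CV around 1.5+); engine hum is steady (CV around 0.1–0.2).
    return H.std(axis=1) / (H.mean(axis=1) + 1e-12)

def soft_wiener_mask(W, H, top_k=3):
    keep = np.argsort(coefficient_of_variation(H))[-top_k:]
    S_all = W @ H                        # full NMF reconstruction
    S_call = W[:, keep] @ H[keep, :]     # call-like components only
    return S_call / (S_all + 1e-12)      # soft mask in [0, 1]

# mask = soft_wiener_mask(W, H)
# Z_clean = mask * Z                     # apply to the complex STFT, then inverse STFT
```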
Reconstruction
Lost harmonics are recovered using an exponential decay model, preserving the natural structure of the call.
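One way to express the exponential decay idea; the decay factor of 0.6 is a placeholder, not a fitted value:

```python
import numpy as np

def reconstruct_harmonics(f0, measured, n_harmonics=8, decay=0.6):
    """Fill in harmonics that the mask removed.

    measured maps harmonic index (1 = fundamental) to amplitude; any missing
    harmonic k is modeled as A1 * decay**(k - 1), i.e. exponential fall-off.
    """
    a1 = measured.get(1, max(measured.values()))
    freqs = np.array([k * f0 for k in range(1, n_harmonics + 1)])
    amps = np.array([measured.get(k, a1 * decay ** (k - 1))
                     for k in range(1, n_harmonics + 1)])
    return freqs, amps
```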
Overlap detection
We detect multiple callers by identifying separate fundamental frequencies at least 5 Hz apart with sufficient harmonic support. If present, we split them into separate tracks using competitive masking.
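A sketch of the dual-caller check on an averaged spectrum. The 5 Hz separation rule comes from the write-up; the assumed fundamental range, the peak-prominence threshold, and the omission of the harmonic-support check are simplifications:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_overlapping_callers(spectrum, freqs, min_separation=5.0):
    # Look for two strong fundamentals at least 5 Hz apart in the rumble band
    band = (freqs >= 10) & (freqs <= 60)          # assumed fundamental range
    peaks, _ = find_peaks(spectrum[band], prominence=spectrum.max() * 0.1)
    f0s = np.sort(freqs[band][peaks])
    for i in range(len(f0s) - 1):
        if f0s[i + 1] - f0s[i] >= min_separation:
            return True, (f0s[i], f0s[i + 1])
    return False, None
```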
Quality scoring
We compute tonal SNR, harmonic completeness, and a validity flag. If quality drops, we revert to the original audio to avoid degradation.
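A simplified tonal SNR, treating narrow bands around each harmonic as signal and everything else as noise; the ±2 Hz bandwidth is an assumption:

```python
import numpy as np

def tonal_snr(psd, freqs, harmonic_freqs, half_width=2.0):
    # Signal = energy near each harmonic; noise = all remaining bins.
    signal_mask = np.zeros_like(freqs, dtype=bool)
    for h in harmonic_freqs:
        signal_mask |= np.abs(freqs - h) <= half_width
    signal = psd[signal_mask].sum()
    noise = psd[~signal_mask].sum() + 1e-12
    return 10 * np.log10(signal / noise)

# If the cleaned clip scores worse than the original, keep the original audio.
```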
Clustering and analysis
We embed cleaned calls into a low-dimensional space and cluster them into groups called tribes. These clusters do more than group similar sounds; they begin to form a relational structure between calls. The system builds a knowledge base that behaves like a connected network of acoustic relationships, where proximity in the graph reflects behavioral and contextual similarity, letting us move toward inference through relationships rather than isolated signals.
This clustered structure becomes a reference layer for future reasoning and pattern discovery, not just storage.
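As a rough sketch, the embedding could be as small as a handful of per-call features fed to an unsupervised clusterer; the feature set and cluster count here are illustrative, not the project's actual choices:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

def cluster_calls(features, n_tribes=5):
    # features: (n_calls, n_features), e.g. [F0, duration, CV, harmonic count, ...]
    X = StandardScaler().fit_transform(features)
    return AgglomerativeClustering(n_clusters=n_tribes).fit_predict(X)

# Distances between cluster centroids can then seed the relational graph:
# nearby tribes share acoustic (and, we hypothesize, behavioral) context.
```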
Why it works
- Temporal variation separates elephant calls from mechanical noise even when frequencies overlap
- Harmonic structure ensures we preserve biologically meaningful signals
- Soft probabilistic masking avoids artifacts from hard filtering
- Physics-based reconstruction restores lost signal components
- Unsupervised learning adapts to any environment without training data
- Relational clustering turns isolated detections into structured behavioral context
Challenges we ran into
- Threshold-based selection collapsed, letting every component through
- Spectral subtraction introduced musical noise artifacts
- Generator harmonics overlapped with elephant harmonics
- Masking removed real signal along with noise
- Dual-caller detection caused false positives
- Standard SNR metrics failed for tonal signals
- Memory and CPU limits on Raspberry Pi constrained model size
- Thermal throttling reduced performance over time
We solved these with top-K selection, soft masking, harmonic reconstruction, stricter dual-caller rules, tonal SNR metrics, and hardware cooling.
Accomplishments that we're proud of
- Fully unsupervised system with no training data required
- 4–5 dB average noise reduction without degrading signal quality
- Real-time performance on a Raspberry Pi deployed in-field
- 1.5–2 second processing time per 10-second audio segment
- <200 MB memory footprint on edge hardware
- Reliable separation of overlapping elephant calls
- Physics-based reconstruction of degraded harmonics
- Robust multi-process pipeline with graceful failure handling
- Emergent relational knowledge base from acoustic clustering
- Ability to recover usable data from previously unusable recordings
What we learned
- Time-based patterns are critical for separating natural signals from mechanical noise
- Domain knowledge can outperform generic machine learning approaches
- Soft probabilistic methods preserve quality better than hard thresholds
- System architecture is as important as model choice on constrained hardware
- Unsupervised quality evaluation requires domain-specific metrics
- Structured relationships between signals unlock deeper inference than isolated classification
What's next
- Deploy multiple Raspberry Pi units across field sites for wider coverage and passive long-term monitoring
- Add optional camera modules for multimodal correlation between audio and visual behavior
- Enable on-device parameter tuning for researchers in the field
- Extend the system to other species with similar acoustic patterns
- Improve clustering into richer behavioral and individual identity graphs
Tagline
Hear the herd
Repository
https://github.com/toosh-legacy/HACKSMU_26
Slide Deck: https://canva.link/0ezeouqd178z8zq
Built With
- nextjs
- python
- tailwind
- typescript

