Inspiration

Every language carries a unique way of seeing the world: place names, ecological knowledge, family history, oral tradition, humor, and memory. But I noticed that endangered-language resources are scattered across PDFs, dictionary sites, academic papers, videos, audio collections, and small community archives. Even when the information exists, it's hard to find, compare, verify, and teach from.

I was inspired by that gap between preservation and actual revitalization. I wanted to build something that does more than collect words. My goal was to help communities turn fragile fragments into structured knowledge, then into learning material that speakers, elders, teachers, and heritage learners can actually use.

What It Does

LangSafe is a computational linguistics platform for endangered-language preservation and revitalization. It can:

  • Discover endangered-language resources across the web and archive-like sources.
  • Extract vocabulary, definitions, grammar patterns, source provenance, audio metadata, and cultural notes.
  • Cross-reference related entries so duplicate or conflicting information can be compared.
  • Let users browse endangered languages worldwide, with Jejueo as the main LingHacks demo language.
  • Search the preservation archive by English gloss, native script, romanization, or semantic concept.
  • Show language health, source coverage, and a map-based language overview.
  • Provide a Community Review Queue where speakers, teachers, and linguists can verify entries, request elder notes, or flag sensitive content.
  • Generate classroom, family, or fieldwork lesson packs from verified archive entries.
  • Use Featherless.ai through an OpenAI-compatible API route to create live lesson-pack drafts, while keeping a local fallback so the demo still works reliably.
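
The cross-referencing step in the list above could be sketched roughly as follows. This is a minimal illustration, not LangSafe's actual implementation: the `ArchiveEntry` and `CrossReference` shapes, the field names, and the `crossReference` function are all assumptions, and the matching rule (exact match after Unicode and case normalization) is a deliberately simple stand-in for whatever the real agents do.

```typescript
// Hypothetical entry shape; field names are illustrative, not LangSafe's schema.
interface ArchiveEntry {
  headword: string; // native-script or romanized form
  gloss: string;    // English definition
  sourceId: string; // identifier of the source the entry came from
}

interface CrossReference {
  headword: string;
  glosses: string[];    // distinct normalized glosses, in first-seen order
  sourceIds: string[];  // distinct sources that mention the headword
  conflicting: boolean; // true when sources disagree on the gloss
}

// Normalize Unicode form, whitespace, and case so trivially different
// spellings of the same string compare equal.
function normalizeText(text: string): string {
  return text.normalize("NFC").trim().toLowerCase();
}

// Group entries by normalized headword and flag groups whose glosses differ.
function crossReference(entries: ArchiveEntry[]): CrossReference[] {
  const groups = new Map<string, CrossReference>();
  for (const entry of entries) {
    const key = normalizeText(entry.headword);
    let group = groups.get(key);
    if (!group) {
      group = { headword: key, glosses: [], sourceIds: [], conflicting: false };
      groups.set(key, group);
    }
    const gloss = normalizeText(entry.gloss);
    if (group.glosses.indexOf(gloss) === -1) group.glosses.push(gloss);
    if (group.sourceIds.indexOf(entry.sourceId) === -1) group.sourceIds.push(entry.sourceId);
    group.conflicting = group.glosses.length > 1;
  }
  return Array.from(groups.values());
}
```

A group with `conflicting: true` is a natural candidate for the Community Review Queue, since only a speaker or linguist can say whether two glosses are genuine variants or an extraction error.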

How I Built It

I built the LangSafe frontend with Next.js 16 and React 19 in TypeScript, using Tailwind CSS and shadcn/ui components for styling and Framer Motion for animation. The interface uses a clean, blue visual system and includes a dashboard, language browser, judge brief, archive views, map views, and the Revitalization Studio.

On the backend, I used Next.js API routes, Node.js services, Socket.io for live agent updates, and a demo-safe Jejueo dataset so the app works even without every production credential. The broader architecture supports Elasticsearch for structured vocabulary and grammar retrieval, Jina AI embeddings and reranking for semantic search, Perplexity and BrightData for discovery, Browserbase/Stagehand for crawling dynamic sources, and Claude-based agents for extraction and cross-referencing.

For the LingHacks sponsor integration, I added a server-side Featherless.ai route at /api/featherless/lesson. The Studio sends selected vocabulary, grammar focus, audience, and lesson options to Featherless using its OpenAI-compatible chat completions API. The response is normalized into a lesson pack with a title, summary, activities, oral-history prompt, and quick check.
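
The request and normalization steps described above might look roughly like the sketch below. Everything named here is an assumption for illustration: the `LessonRequest` and `LessonPack` shapes, the prompt wording, the default values, and the helper names are not taken from the actual route. The body returned by `buildChatBody` follows the general OpenAI-compatible chat completions shape (`model` plus a `messages` array); the route would POST it server-side with the API key read from an environment variable.

```typescript
// Hypothetical shapes; the real route's types are not shown in this writeup.
interface LessonRequest {
  vocabulary: string[];
  grammarFocus: string;
  audience: string;
}

interface LessonPack {
  title: string;
  summary: string;
  activities: string[];
  oralHistoryPrompt: string;
  quickCheck: string[];
}

// Build the JSON body for an OpenAI-compatible chat completions request.
function buildChatBody(req: LessonRequest, model: string) {
  return {
    model,
    messages: [
      {
        role: "system",
        content:
          "Return one JSON object with keys title, summary, activities, " +
          "oralHistoryPrompt, quickCheck. Use only the supplied vocabulary.",
      },
      { role: "user", content: JSON.stringify(req) },
    ],
  };
}

function asString(value: unknown, fallback: string): string {
  return typeof value === "string" && value.trim() !== "" ? value.trim() : fallback;
}

function asStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

// Turn raw model text into a LessonPack, tolerating stray text around the
// JSON object and filling defaults for missing or malformed fields.
function normalizeLessonPack(raw: string): LessonPack {
  let data: Record<string, unknown> = {};
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        data = parsed as Record<string, unknown>;
      }
    } catch {
      // Leave data empty so every field falls back to its default.
    }
  }
  return {
    title: asString(data.title, "Untitled lesson pack"),
    summary: asString(data.summary, ""),
    activities: asStringArray(data.activities),
    oralHistoryPrompt: asString(data.oralHistoryPrompt, ""),
    quickCheck: asStringArray(data.quickCheck),
  };
}
```

Normalizing defensively matters here because model output is not guaranteed to be valid JSON; a pack with empty defaults is easier for the Studio to handle than a thrown parse error.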

Challenges I Ran Into

One challenge was designing around uncertainty. Endangered-language data can be incomplete, contradictory, or culturally sensitive, so the product could not simply "auto-generate truth." I built provenance, confidence, source counts, and human review into the workflow.
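
One way to picture how provenance, confidence, source counts, and human review fit together is the sketch below. It is an illustration under stated assumptions, not the project's real logic: the `EntryEvidence` fields, the three status names, and the queue ordering are all hypothetical. The one rule it is meant to show is that machine confidence alone never marks an entry as verified.

```typescript
type ReviewStatus = "verified" | "needs_review" | "restricted";

// Hypothetical evidence record for one archive entry.
interface EntryEvidence {
  id: string;
  sourceCount: number;       // how many independent sources attest the entry
  confidence: number;        // extraction confidence in the range 0..1
  humanVerified: boolean;    // a speaker, teacher, or linguist approved it
  flaggedSensitive: boolean; // a reviewer marked it culturally sensitive
}

// Sensitivity flags win over everything else; otherwise only a human
// decision can produce "verified", regardless of the confidence score.
function reviewStatus(entry: EntryEvidence): ReviewStatus {
  if (entry.flaggedSensitive) return "restricted";
  return entry.humanVerified ? "verified" : "needs_review";
}

// Order the review queue so the least certain entries surface first:
// lowest confidence, then fewest sources.
function reviewQueue(entries: EntryEvidence[]): EntryEvidence[] {
  return entries
    .filter((e) => reviewStatus(e) === "needs_review")
    .sort((a, b) => a.confidence - b.confidence || a.sourceCount - b.sourceCount);
}
```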

Another challenge was demo reliability. A hackathon demo has to work even if search services, databases, or model APIs are slow. I built realistic Jejueo fallback data for search, grammar, graph, sources, stats, language overview, and lesson generation, then added live Featherless.ai generation on top.
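
The fallback pattern described above can be sketched as a small wrapper that races a live call against a timeout and returns local data when the live call fails or is too slow. The `withFallback` name, its signature, and the `source` tag are assumptions made for this illustration; the writeup does not show how the actual routes implement this.

```typescript
// Result tagged with where the value came from, so the UI can show
// whether it is displaying live or fallback data.
interface Sourced<T> {
  value: T;
  source: "live" | "fallback";
}

// Try the live call first; if it throws, rejects, or exceeds timeoutMs,
// return the provided fallback value instead.
async function withFallback<T>(
  live: () => Promise<T>,
  fallback: T,
  timeoutMs: number
): Promise<Sourced<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("live call timed out")), timeoutMs);
  });
  try {
    const value = await Promise.race([live(), timeout]);
    return { value, source: "live" };
  } catch {
    return { value: fallback, source: "fallback" };
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

In this shape, a lesson route could wrap its model call with the locally generated lesson pack as the fallback value, so the Studio always receives a usable pack.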

I also had to make a complex system understandable quickly. The judges need to see creativity, impact, technology, and UX in a few minutes, so I created a focused Judge Brief page and a Studio flow that shows the product's real community value.

Accomplishments I'm Proud Of

I'm proud that LangSafe connects the whole preservation loop: discovery, archive, verification, and teaching. A lot of language-tech tools stop at retrieval or translation, but LangSafe treats community review and revitalization as first-class parts of the product.

I'm also proud of the Revitalization Studio. It makes the project feel less like a backend pipeline and more like something a teacher, speaker, or linguist could actually sit down and use.

Finally, I'm proud of the Featherless.ai integration. It gives the project a live AI generation layer for lesson packs while keeping the API key server-side and preserving a fallback path so lesson generation still works when the API is unavailable.

What I Learned

I learned that language preservation is not just a data problem. It is a trust, consent, design, and access problem. AI can help find patterns and speed up organization, but communities still need control over what is verified, taught, restricted, or revised.

I also learned how important product framing is. The same technical pipeline feels much more impactful when the UI clearly shows who it helps and what action they can take next.

On the technical side, I learned how to combine retrieval, structured data, fallback datasets, agent pipelines, and open-weight model inference into a demo that is both ambitious and stable.

What's Next for LangSafe

Next, I plan to expand LangSafe in several directions:

  • Add community accounts and role-based permissions for elders, teachers, linguists, and learners.
  • Add consent and cultural-sensitivity controls for restricted words, recordings, and stories.
  • Support exports to printable lesson plans, Anki decks, CSVs, and community archive formats.
  • Improve audio workflows with pronunciation review, transcription alignment, and speaker-approved clips.
  • Add more endangered-language demo packs beyond Jejueo.
  • Build evaluation tools that compare model-generated lesson packs against verified community guidelines.
  • Deploy the full pipeline so communities can preserve and teach from their own language resources.

