Apertutus: Multilingual Safety Testing System

Inspiration

The rapid rise of sovereign AI efforts worldwide highlights a critical issue: multilingual safety. While LLMs are tested extensively in English, they remain vulnerable to jailbreaks in other languages. Manual multilingual evaluation is costly and unscalable, motivating us to build Apertutus.

How we built it

Translated 387 multi-turn jailbreak prompts into 16 languages.
Sent these prompts to client models and collected outputs.
Evaluated responses with StrongReject and aggregated safety scores.
Generated reports highlighting vulnerabilities and improvements.
Optionally fine-tuned models using provided training settings.

Challenges we ran into

Managing API cost, latency, and rate limits.

Accomplishments that we're proud of

Built a fully automated multilingual safety testing pipeline for LLMs, scaling to 16 languages — nearly double the 9 “canary languages” used in FineWeb2 experiments.
Designed and implemented the StrongReject metric, quantifying refusal, specificity, and convincingness into a single score.
Processed 387 multi-turn jailbreak prompts per language, generating large-scale evaluations across models.
Produced comprehensive safety reports that highlight vulnerabilities and actionable improvements.