Inspiration

Legal benchmarks punish generic LLMs: they miss carve-outs, mishandle negation, and hallucinate citations. I wanted a compact model that actually knows contract law, respects binary label formats, and still fits in a serverless footprint. Google’s new Gemma‑3 family plus Cloud Run GPUs looked like the perfect combo: open weights, lightweight LoRA updates, and on-demand GPU inference.

What it does

GemmaLaw ingests clauses, policies, and question prompts, then returns tightly formatted answers: Relevant/Irrelevant, Yes/No, Entailment/Contradiction, or MCQ letters exactly as LegalBench expects. It exposes a single FastAPI endpoint (/v1/legal/answer) guarded by an API key and streams responses so latency stays low. Every request is logged with request_id, tokens, and latency so judges can see what happened in Cloud Logging.
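The per-request logging described above can be sketched with the standard library alone; the helper and field names here are illustrative, not the service's actual code:

```python
import json
import time
import uuid

def log_record(prompt_tokens: int, completion_tokens: int, started: float) -> str:
    """Build one JSON log line per /v1/legal/answer request,
    carrying the fields Cloud Logging queries filter on."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "tokens": prompt_tokens + completion_tokens,
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
    })

t0 = time.monotonic()
rec = json.loads(log_record(128, 4, t0))
```

Emitting one JSON object per line is what lets Cloud Logging parse the fields automatically without a custom agent.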

How we built it

Dataset curation. Split the raw privacy/contract corpus into per-task shards, flipped mislabeled “permissions” items, annotated evidence spans, balanced the classes, and merged everything into legal_sft_combined.jsonl (~24 K rows).
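The flip-and-balance step can be sketched as pure Python; the function name, row schema, and flip mechanism are illustrative assumptions, not the actual pipeline:

```python
import random

# Binary relevance labels; known-bad rows get their label inverted.
FLIP = {"Relevant": "Irrelevant", "Irrelevant": "Relevant"}

def merge_shards(shards, flip_ids=frozenset(), seed=0):
    """Merge curated shards, flip known-mislabeled rows, then
    downsample each class to the size of the smallest one."""
    rows = []
    for shard in shards:
        for row in shard:
            if row["id"] in flip_ids:
                row = {**row, "label": FLIP[row["label"]]}
            rows.append(row)
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    n = min(len(v) for v in by_label.values())
    rng = random.Random(seed)
    # Each surviving row would then be written as one JSON line
    # of legal_sft_combined.jsonl.
    return [r for v in by_label.values() for r in rng.sample(v, n)]
```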

Training. Ran a 12 K-step LoRA SFT of google/gemma-3-4b-it on a Vast.ai 2× RTX 5090 box inside tmux—bf16, gradient checkpointing, AdamW, explicit token-type handling for Gemma 3. Checkpoints live at /workspace/AMD/runs/gemma3-legal-lora.
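A minimal sketch of wrapping the base model with a LoRA adapter via PEFT; the rank, alpha, and target modules below are illustrative defaults, not the run's actual hyperparameters, and the AdamW/bf16 training loop is omitted:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it", torch_dtype="bfloat16")
model.gradient_checkpointing_enable()  # trade compute for memory, as in the run

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of the 4B base
```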

Evaluation (in progress). Preparing the compare_base_vs_lora flow to score LegalBench smoke sets and MMLU Professional Law for before/after deltas.
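The core of a base-vs-LoRA comparison reduces to exact-match accuracy on the formatted labels; a minimal sketch, with the function names as assumptions since the flow is still being prepared:

```python
def accuracy(preds, golds):
    """Exact-match accuracy over formatted labels (Yes/No, MCQ letters, ...)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def compare(base_preds, lora_preds, golds):
    """Return (base_acc, lora_acc, delta) for one LegalBench smoke set."""
    b, l = accuracy(base_preds, golds), accuracy(lora_preds, golds)
    return b, l, l - b
```

Reporting the delta per shard is what makes the before/after story legible in a README table.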

Inference stack. FastAPI + Uvicorn, transformers 5.x, PEFT. Loads Gemma base plus LoRA on startup, enforces X-API-Key, exposes /healthz, emits JSON logs.
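The startup-load and key-check pieces can be sketched as follows; the environment variable names and adapter path are placeholders, and the generation endpoint itself is omitted:

```python
import os

from fastapi import FastAPI, Header, HTTPException
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-3-4b-it"
ADAPTER_DIR = os.environ.get("ADAPTER_DIR", "runs/gemma3-legal-lora")

app = FastAPI()

# Load base + LoRA once at process start so requests pay no load cost.
tok = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16"),
    ADAPTER_DIR,
)

def require_key(x_api_key: str = Header(...)):
    # FastAPI maps the X-API-Key request header onto this parameter.
    if x_api_key != os.environ["API_KEY"]:
        raise HTTPException(status_code=401, detail="invalid API key")

@app.get("/healthz")
def healthz():
    return {"status": "ok"}
```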

Deployment. Built Docker image locally, pushed to Artifact Registry (europe-west4), deployed on Cloud Run with an NVIDIA L4 (4 CPU / 16 GiB, concurrency 1). Base weights optionally pulled from GCS at boot to keep the image lean.
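The deploy step is a single command along these lines; the service name and image path are placeholders, and exact GPU flags can vary with gcloud version and region support:

```shell
# Sketch of the Cloud Run GPU deploy (4 CPU / 16 GiB, concurrency 1).
gcloud run deploy gemmalaw \
  --image=europe-west4-docker.pkg.dev/PROJECT/repo/gemmalaw:latest \
  --region=europe-west4 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=4 --memory=16Gi --concurrency=1 \
  --no-cpu-throttling
```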

Challenges we ran into

Getting Gemma‑3 production-ready meant wrestling bleeding-edge transformers/accelerate builds just to keep use_cache and token_type_ids from crashing. We scrubbed the privacy-relevance dataset (roughly 40 % noisy labels) with heuristics and manual flips, and worked around Vast.ai’s auto‑tmux SSH quirks, hf_transfer, CUDA 13 changes, and bitsandbytes OOMs when training LoRA on mixed ROCm/CUDA hosts. All the while we tuned the Cloud Run GPU image so cold starts stayed under SLA, debating whether to bake weights into the image or pull them from GCS at boot.
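The token_type_ids crash is the kind of thing a one-line input scrub works around; this helper is a hypothetical illustration of the pattern, not the project's actual fix:

```python
def sanitize_inputs(inputs: dict) -> dict:
    """Drop tokenizer outputs that Gemma-3's forward() may reject
    before passing the batch to model.generate()."""
    return {k: v for k, v in inputs.items() if k != "token_type_ids"}
```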

Accomplishments that we're proud of

Brought Gemma‑3‑4B LoRA training loss down to ≈0.13 on a mixed LegalBench-style dataset while keeping the adapters small enough for easy deployment. Automated deployment: a single gcloud run deploy spins up the GPU service, with logs and metrics visible immediately. Laid the groundwork for a live before/after demo comparing base vs. LoRA accuracy on real LegalBench tasks.

What we learned

Instruction-formatting consistency is everything: the model must emit exactly the label strings the benchmark expects, nothing more. Gemma‑3’s new tokenizer and config fields require fresh tooling; pinned nightly transformers builds are a must. Fine-tuning on 40 GB of the Pile of Law was a mistake: it led to copy-pasted outputs instead of reasoning, and to catastrophic forgetting.
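Enforcing exact label strings is cheap to do at the output boundary; a minimal sketch, where the task keys and helper name are assumptions rather than the project's actual code:

```python
# Exact label strings the benchmark scorer accepts, per task family.
LABELS = {
    "relevance": ("Relevant", "Irrelevant"),
    "yes_no": ("Yes", "No"),
    "nli": ("Entailment", "Contradiction"),
}

def normalize(task: str, raw: str) -> str:
    """Snap a free-form completion onto the exact label the scorer expects."""
    text = raw.strip().lower()
    for label in LABELS[task]:
        if text.startswith(label.lower()):
            return label
    raise ValueError(f"unparseable answer for {task}: {raw!r}")
```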

What's next for GemmaLaw: Scalable Domain-Aware AI on Cloud Run

Complete evaluation coverage. Finish the base-vs-LoRA suite across all LegalBench shards and add MMLU Professional Law scores so the README shows both domain lift and general retention. Task-specialized adapters. Broaden training data with confidentiality QA and LegalBench summaries, then experiment with task-specific LoRA adapters or a lightweight MoE router to keep strengths without spreading regressions. We may also prototype Google ADK agents (a retrieval agent fetching snippets plus an answer agent running GemmaLaw) to qualify for the AI Agents bonus track.
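Task-specific adapters could build on PEFT's multi-adapter support rather than a full MoE; a sketch under that assumption, with the second adapter's path being hypothetical:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
# The first adapter loaded becomes the default; others register by name.
model = PeftModel.from_pretrained(
    base, "runs/gemma3-legal-lora", adapter_name="contracts")
model.load_adapter("runs/gemma3-privacy-lora", adapter_name="privacy")  # hypothetical

def route(task_name: str) -> None:
    """Activate the adapter matching the incoming task before generation."""
    model.set_adapter("privacy" if task_name.startswith("privacy") else "contracts")
```

A router this simple keeps each adapter's strengths isolated, at the cost of one `set_adapter` call per request.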

Built With

  • artifact-registry
  • docker
  • fastapi
  • google-cloud-run
  • hugging-face
  • logging
  • lora
  • nvidia-l4-gpu
  • python
  • pytorch
  • transformers
  • vast.ai