Inspiration

As AI tools become more widely used, the environmental cost behind every prompt and computation multiplies. According to EESI, large data centers can consume up to 5 million gallons of water per day, equivalent to the water use of a town of 10,000 to 50,000 people. Most users don't realize that inefficient prompts contribute to this problem: they trigger unnecessary computation, which drives up consumption.

What it does

A pipeline that takes messy user prompts and rewrites them into tight, well-framed prompts — using two small local models and a retrieval layer for style hints.

Stack:

  • Extractor: qwen2.5:3b — pulls a 6-field skeleton (INTENT, TASK, SUBJECT, OUTPUT, CONSTRAINTS, PROMPT)
  • Reviser: gemma3:4b — rewrites using the skeleton + retrieved exemplars
  • Retrieval: HumanDelta vector DB over fka/awesome-chatgpt-prompts + a custom corpus
  • Scorer: heuristic rubric (concision, preservation, no-leak, no-headers, compression) over 12 labeled test prompts
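
The extraction step, as a minimal sketch: it assumes the ollama Python client, and the system prompt and parser here are simplified stand-ins for the real ones (which carry in-context examples and the anti-copy rules described under Challenges).

```python
# Simplified sketch of the extractor call; prompt wording is illustrative.
import ollama

SKELETON_FIELDS = ["INTENT", "TASK", "SUBJECT", "OUTPUT", "CONSTRAINTS", "PROMPT"]

def extract_skeleton(raw_prompt: str) -> dict:
    """Ask qwen2.5:3b for the 6-field skeleton and parse KEY: value lines."""
    system = (
        "Extract these fields from the user's prompt, one per line, in this "
        "exact order: " + ", ".join(SKELETON_FIELDS) + ". Use 'none' for "
        "fields that don't apply. Output nothing else."
    )
    resp = ollama.chat(
        model="qwen2.5:3b",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": raw_prompt}],
    )
    skeleton = {field: "none" for field in SKELETON_FIELDS}
    for line in resp["message"]["content"].splitlines():
        key, _, value = line.partition(":")
        if key.strip().upper() in skeleton and value.strip():
            skeleton[key.strip().upper()] = value.strip()
    return skeleton
```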

How It Works

  1. User gives a raw prompt (could be 3 words or 3 paragraphs)
  2. Extractor distills it into a structured skeleton
  3. Gating logic decides whether to retrieve (length, intent, similarity threshold; sketched after this list)
  4. Reviser rewrites — borrowing structure only from retrieved examples, never nouns
  5. Scorer compares "with retrieval" vs "without retrieval" across categories
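
Step 3's gate deserves a sketch, since gating turned out to matter more than retrieval quality (see below). The cutoffs and the trivial-intent set here are illustrative stand-ins for the tuned values:

```python
# Hypothetical gate for step 3: retrieve only when it is likely to help.
def should_retrieve(raw_prompt: str, skeleton: dict, top_similarity: float) -> bool:
    if len(raw_prompt.split()) <= 5:          # trivial input: templates tend to hurt
        return False
    if skeleton["INTENT"].lower() in {"casual", "everyday"}:
        return False
    return top_similarity >= 0.55             # only borrow from close exemplars
```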

What We Learned

  • Small models need scaffolding, not trust. A 3B model won't follow a schema just because you hand it one. It needs concrete examples, canonical order enforcement, and a regex safety net.
  • RAG is portable leverage. Adding new style patterns = dropping text into a corpus. No fine-tuning, no retraining.
  • Retrieval isn't always better. For trivial inputs ("tie a tie"), retrieved "act as a stylist" templates actively hurt. Gating matters more than retrieval quality.
  • Structure-only borrowing beats content borrowing. Pulling "Act as a " from an exemplar is safe. Pulling nouns, verbs, or domain words leaks contamination (Ethereum prompts drifting into messenger-app territory); a crude leak check is sketched after this list.
  • Deterministic fallbacks cover for LLM weaknesses. A regex constraint sweep caught the "3-4 hours a week" budget that the extractor kept dropping.
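
The structure-only rule is cheap to enforce mechanically. A crude sketch of the leak check, assuming whitespace tokenization and an illustrative stopword list:

```python
# Crude no-leak check: content words in the revised prompt should come from
# the user's original prompt, never from the retrieved exemplar.
STOPWORDS = {"a", "an", "the", "act", "as", "to", "and", "of", "for", "in", "with"}

def leaked_words(original: str, exemplar: str, revised: str) -> set[str]:
    def tokens(s: str) -> set[str]:
        return {w.strip(".,!?").lower() for w in s.split()} - STOPWORDS
    return (tokens(revised) & tokens(exemplar)) - tokens(original)
```

On the Ethereum failure, domain nouns from the exemplar show up in the returned set, while the structural "act as a" prefix does not.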

Challenges

1. The extractor kept copying its own examples

Give qwen an in-context example containing "write a python function..." and ask it to process "tie a tie", and the skeleton comes back with PROMPT: write a python function. Fixed by: two diverse examples, an explicit "NEVER copy values verbatim" rule, and a short-input rule ("if ≤5 words, copy to PROMPT verbatim").
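
The fixed in-context section reads roughly like this (wording reconstructed, not verbatim):

```python
# Reconstructed shape of the anti-echo fix: two deliberately different
# examples, an explicit no-copy rule, and the short-input rule.
EXTRACTOR_EXAMPLES = """
NEVER copy values from the examples verbatim; they show format only.
If the input is 5 words or fewer, copy it to PROMPT verbatim.

EXAMPLE 1 (code task)
Input: write a python function that merges two sorted lists
PROMPT: write a python function that merges two sorted lists

EXAMPLE 2 (everyday task)
Input: plan a week of cheap dinners
PROMPT: plan a week of cheap dinners
"""
```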

2. The reviser over-prompted trivial tasks

"Act as a professional stylist to tie a tie" — nobody wants that. Added a QUICK TEST to Rule 1: role prefix only for tasks needing specialized professional expertise. Everyday tasks get step-by-step framing instead.

3. Synonym leaks

The extractor would turn "tie" into "necktie" in the SUBJECT field, and the reviser would propagate it. Fixed by hard-coding "NEVER substitute synonyms" into the extraction prompt and "if the original says X, don't write Y" into the revision prompt.

4. Dropped load-bearing constraints

"I can work out 3-4 hours a week" → skeleton CONSTRAINTS: none → revised prompt drops the budget. Fixed in layers:

  • Added CONSTRAINTS field to skeleton
  • Added SCAN-FOR-CONSTRAINTS list to extractor prompt
  • Added a _sweep_constraints() regex safety net that injects missed constraints before revision (sketched below)
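
A minimal sketch of that safety net; the pattern list is illustrative (the real one covers more constraint types):

```python
import re

# Illustrative patterns; the real list also covers budgets, deadlines, tools, etc.
CONSTRAINT_PATTERNS = [
    r"\b\d+(?:-\d+)?\s*(?:hours?|minutes?|days?)\s+(?:a|per)\s+(?:day|week|month)\b",
    r"\bunder\s+\$?\d+\b",
    r"\bwithout\s+\w+\b",
]

def _sweep_constraints(raw_prompt: str, skeleton: dict) -> dict:
    """Inject constraint phrases the extractor missed before revision."""
    hits = []
    for pattern in CONSTRAINT_PATTERNS:
        hits += re.findall(pattern, raw_prompt, flags=re.IGNORECASE)
    missed = [h for h in hits if h.lower() not in skeleton["CONSTRAINTS"].lower()]
    if missed:
        base = "" if skeleton["CONSTRAINTS"] == "none" else skeleton["CONSTRAINTS"] + "; "
        skeleton["CONSTRAINTS"] = base + "; ".join(missed)
    return skeleton
```

On the failing test, this recovers "3-4 hours a week" even when the skeleton comes back with CONSTRAINTS: none.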

The Loop That Shipped It

Every failure mode followed the same rhythm:

  1. Run test prompt
  2. Spot a new failure (schema echo, noun leak, dropped constraint)
  3. Tighten the rule
  4. Re-run the full test set — does it regress anything else?
  5. Add a safety net in code if the rule can't be trusted

The scorer made step 4 cheap. Without it, we'd have been guessing whether each fix actually helped.
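
The rubric itself is a handful of cheap boolean checks per prompt, averaged per category. A hypothetical shape, with illustrative thresholds:

```python
import re

# Hypothetical shape of the heuristic rubric, run over the 12 labeled tests
# with retrieval on and off; thresholds are illustrative.
def score(original: str, revised: str, exemplar_words: set[str]) -> dict:
    ow, rw = original.split(), revised.split()
    figures = re.findall(r"\d+(?:-\d+)?", original)        # load-bearing numbers
    leaks = ({w.lower() for w in rw} & exemplar_words) - {w.lower() for w in ow}
    return {
        "concision":    len(rw) <= max(len(ow), 40),           # didn't balloon
        "preservation": all(f in revised for f in figures),    # constraints kept
        "no_leak":      not leaks,
        "no_headers":   not revised.lstrip().startswith(("INTENT", "#")),
        "compression":  len(ow) < 15 or len(rw) < len(ow),     # long inputs shrink
    }
```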

Takeaway

Small local models + RAG can produce strong results, but only with defensive engineering. The LLM is one component in a pipeline — schema cleaners, regex sweepers, retrieval gates, and scoring loops are the other 70%.

Built With

  • fast-api
  • human-delta
  • jupyter-notebook
  • neon-db
  • next-js
  • ollama
  • python