Inspiration
Bias in the news isn't abstract; it's something people feel every time they open a browser or scroll their feed. I noticed that the same event gets reported completely differently depending on which outlet you read. A rate hike becomes either "the Fed's tough stance on inflation" or "a threat to working families," depending on the framing. I wanted to build a tool that lets users see how different outlets are covering the same story, spot the framing differences, and decide for themselves what's really happening. The Elasticsearch Agent Builder gave me the foundation I needed: hybrid search for finding relevant articles, ES|QL for aggregating coverage stats, and an LLM orchestrator to reason over the data.
What it does
The News Bias Analyzer is a two-part system:
Ingestion Pipeline (ingest.py): Fetches articles from NewsAPI across 13+ major outlets (Reuters, BBC, Fox News, Al Jazeera, etc.), enriches each article with NLP (sentiment analysis, named entity recognition, semantic embeddings), and indexes everything into Elasticsearch.
Agent Builder Interface (agent_config.yaml): An LLM-powered agent that answers user queries by:
- Planning: decomposing a question into search terms
- Retrieving: running hybrid (BM25 + vector) searches across the index
- Aggregating: using ES|QL to compare coverage by outlet and sentiment
- Reasoning: cross-referencing articles to spot contradictions
- Reporting: surfacing primary sources and structured bias analysis
Users ask questions like "How are different outlets reporting on the Federal Reserve's latest interest rate decision?" and get back a side-by-side comparison of headlines, framing, sentiment scores, and contradictions, all with links to the original articles.
How I built it
On day one I started with a simple Python script to pull articles and index them, then realized I needed to make it configurable and robust (smaller models to save disk space, pagination for more coverage). By day two I had the full pipeline working end-to-end: fetch → enrich → index → search → agent.
Key decisions:
- Used Hugging Face transformers for lightweight NLP (DistilBERT for sentiment, dslim/bert-base-NER for entities, sentence-transformers for embeddings).
- Implemented hybrid search by normalizing BM25 and cosine-similarity scores: hybrid = 0.5 * bm25 + 0.5 * vector_sim.
- Made everything configurable via .env so the agent can swap models or run offline without code changes.
- Added Docker support so the pipeline can be deployed anywhere.
- Wrote unit tests with mocked HTTP calls so I could test without hitting rate limits.
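The hybrid formula above only makes sense once both scores share a scale, since raw BM25 scores are unbounded while cosine similarity is not. The following is a minimal sketch of one way to do that, assuming min-max normalization per result set; the write-up does not say which normalization was used, and the function names and the `weight` parameter are hypothetical.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1]; if every score is equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float],
    vector_sim: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Combine normalized keyword and vector scores per document id.

    A document missing from one result set contributes 0.0 for that side
    (an assumption; other fallbacks are possible).
    """
    bm25_n = min_max_normalize(bm25)
    vec_n = min_max_normalize(vector_sim)
    doc_ids = set(bm25_n) | set(vec_n)
    return {
        d: weight * bm25_n.get(d, 0.0) + (1 - weight) * vec_n.get(d, 0.0)
        for d in doc_ids
    }


# Example: "b" ranks first because it scores well on both signals.
ranked = sorted(
    hybrid_scores({"a": 10.0, "b": 5.0, "c": 0.0}, {"a": 0.2, "b": 0.8}).items(),
    key=lambda item: item[1],
    reverse=True,
)
```

With `weight=0.5` this reduces to the equal-weight formula in the list above; changing the weight shifts the balance toward keyword or semantic matches.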
Challenges I ran into
- Model size: The default NER model was 1.3 GB; I switched to a 400 MB version and added an OFFLINE flag so people could pre-cache models.
- Disk space errors: Halfway through a demo run, Hugging Face tried to download a model and hit the disk limit. I learned to set TRANSFORMERS_OFFLINE=1 after pre-warming the cache.
- Rate limits: NewsAPI only returns 100 results per page and has strict throttling. I added pagination and deduplication, but more sophisticated retry logic is still a TODO.
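The pagination and deduplication step can be sketched roughly as follows. This is an illustration only, not the code in ingest.py: `fetch_page` is a hypothetical callable standing in for one NewsAPI request, and deduplicating on a URL stripped of its query string and fragment is an assumption that could merge distinct articles on sites that identify stories by query parameter.

```python
from typing import Callable, Dict, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop query string, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def fetch_all(
    fetch_page: Callable[[int], List[Dict]],
    max_pages: int = 5,
) -> List[Dict]:
    """Walk pages 1..max_pages, stopping at the first empty page.

    Articles whose normalized URL was already seen (or is empty) are skipped.
    """
    seen = set()
    articles: List[Dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        for article in batch:
            key = normalize_url(article.get("url", ""))
            if not key or key in seen:
                continue
            seen.add(key)
            articles.append(article)
    return articles
```

Passing the page fetcher in as a callable also makes it easy to substitute a stub in unit tests, so no real HTTP request is needed.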
Accomplishments that I am proud of
- End-to-end working demo: From article fetch to agent reasoning in ~48 hours, with tests and Docker support.
- Hybrid search: Combining keyword and semantic search proved effective at finding relevant articles even when outlets use different terminology.
- Practical output: The agent returns structured, actionable insights (sentiment deltas, contradictions, primary sources) rather than raw search results.
What I learned
- How transformers.pipeline() lazily downloads models on first use and how to control that behavior.
- Hybrid search (combining BM25 and vector) is surprisingly effective for news; neither alone would have been as good.
- Elasticsearch's ES|QL is powerful for time-series and aggregate queries; a single query can replace pages of Python code.
- Orchestrating an LLM with multiple tools is less about the LLM and more about clear tool definitions and structured I/O.
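To make the ES|QL point concrete, here is a rough sketch of the kind of aggregate query that compares coverage by outlet. The index name `news_articles` and the fields `published_at`, `sentiment_score`, and `outlet` are assumptions for illustration; the actual mapping used by the project is not shown in this write-up. The helper only builds the query string and the JSON body for Elasticsearch's ES|QL query endpoint (`POST /_query`); it does not contact a cluster.

```python
from typing import Dict


def build_coverage_query(index: str = "news_articles", days: int = 7) -> str:
    """Build an ES|QL query: article count and mean sentiment per outlet.

    Field and index names are hypothetical placeholders.
    """
    return (
        f"FROM {index} "
        f"| WHERE published_at >= NOW() - {days} days "
        "| STATS article_count = COUNT(*), avg_sentiment = AVG(sentiment_score) BY outlet "
        "| SORT article_count DESC"
    )


def build_request_body(index: str = "news_articles", days: int = 7) -> Dict[str, str]:
    """Wrap the query in the JSON body shape the ES|QL endpoint expects."""
    return {"query": build_coverage_query(index, days)}
```

The equivalent in plain Python would mean scrolling through every matching document, grouping by outlet, and computing the counts and averages by hand, which is the saving the bullet above refers to.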
What's next for News Bias Agent
- Asynchronous fetching: Replace serial NewsAPI calls with asyncio for 10x faster ingestion.
- Alternative data sources: Add RSS feeds, GDELT, or a web scraper to bypass NewsAPI's limits and reach paywalled outlets.
- Real-time ingestion: Hook into Elastic Workflows to keep the index fresh 24/7 instead of running ad-hoc.
- Fact-checking integration: Add a tool that queries ClaimBuster or Google Fact Check to flag misleading claims.
- Dashboard / UI: Build a Streamlit or React frontend so non-technical users can explore bias patterns interactively.
- Bias scoring: Compute metrics (e.g., sentiment delta, entity salience per outlet) and expose them as structured fields for deeper analysis.
- Multi-language support: Extend ingestion to non-English outlets and add translation pipelines.