Allocaid

Inspiration

Every year, billions of dollars flow into humanitarian crises through the United Nations' (UN) Country-Based Pooled Funds (CBPF), but not all crises receive funding proportional to their need. Some emergencies dominate headlines and attract massive support, while others with equally severe or worse conditions quietly go underfunded.

The UN's founding charter calls for international cooperation in solving humanitarian problems. Our team took that mission and asked how we could contribute directly. This led us to build a machine learning model that predicts what each crisis should receive based on humanitarian indicators, then measures where reality falls short. The result is a tool that makes funding gaps visible to the general public and helps decision makers within the UN understand the complex data behind the thousands of programs they are involved in.

What it does

Allocaid is an interactive geo-analytics tool that surfaces potential blind spots in humanitarian pooled fund allocations. It learns funding patterns from historical crisis data using a time-aware XGBoost model trained on UN-aligned risk and vulnerability indicators (e.g., INFORM Risk, vulnerability, conflict probability, food security, population proxies, and socioeconomic context).

The model estimates the expected CBPF allocation for each country-year and compares it to actual funding. The output is a normalized "funding gap" score covering 25 countries from 2020 to 2025, which highlights contexts that may be systematically underfunded relative to comparable crises.
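As a rough illustration of the gap score, the sketch below assumes the gap is the shortfall of actual against expected funding, expressed as a fraction of expected funding. The exact normalization is not specified in this writeup, so the formula, the `funding_gap` name, and the sample figures are all hypothetical.

```python
def funding_gap(expected: float, actual: float) -> float:
    """Return the shortfall as a fraction of expected funding.

    Positive values mean the crisis received less than expected;
    negative values mean it received more. Assumed formula for illustration.
    """
    if expected <= 0:
        raise ValueError("expected funding must be positive")
    return (expected - actual) / expected


# Hypothetical country-year records (USD millions), not real CBPF data.
records = [
    {"country": "A", "year": 2024, "expected": 80.0, "actual": 40.0},
    {"country": "B", "year": 2024, "expected": 50.0, "actual": 55.0},
    {"country": "C", "year": 2024, "expected": 20.0, "actual": 15.0},
]

# Attach a gap score to every record.
scored = [
    {**r, "gap": funding_gap(r["expected"], r["actual"])} for r in records
]

# Largest relative shortfall first.
ranked = sorted(scored, key=lambda r: r["gap"], reverse=True)
```

Dividing by the expected amount keeps small and large crises on the same scale, so a country missing half of a modest allocation ranks above one missing a tenth of a large allocation.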

Allocaid also includes a peer benchmarking engine that identifies structurally similar crises and contrasts their funding outcomes, and surfaces efficiency signals based on beneficiary reach where available. To supplement the quantitative analysis, we built a RAG-powered chatbot that draws from IATI project data indexed through Actian VectorAI to generate additional insights on how funding gaps might be addressed. These insights are presented through an interactive Streamlit application with maps and dashboards to support exploratory analysis and decision support.
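One plausible way to find "structurally similar" crises is a nearest-neighbor search over standardized indicators. The sketch below assumes Euclidean distance on z-scored features; the similarity measure Allocaid actually uses is not stated here, and the function names, indicator choices, and values are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(rows: dict) -> dict:
    """Z-score each indicator column so no single scale dominates distance."""
    names = list(rows)
    columns = list(zip(*rows.values()))  # one tuple per indicator
    scaled = {name: [] for name in names}
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        for name, value in zip(names, col):
            scaled[name].append((value - mu) / sigma if sigma else 0.0)
    return scaled


def nearest_peers(rows: dict, target: str, k: int = 2) -> list:
    """Return the k crises closest to `target` in standardized indicator space."""
    scaled = standardize(rows)
    distances = [
        (math.dist(scaled[target], vec), name)
        for name, vec in scaled.items()
        if name != target
    ]
    return [name for _, name in sorted(distances)[:k]]


# Hypothetical indicators per crisis: [risk score, share of population food insecure]
indicators = {
    "A": [8.0, 0.60],
    "B": [7.8, 0.58],
    "C": [3.0, 0.10],
    "D": [7.0, 0.50],
}

peers_of_a = nearest_peers(indicators, "A", k=2)
```

Once peers are identified, their funding outcomes can be compared side by side with the target crisis.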

How we built it

Allocaid is built around a time-aware XGBoost model trained on UN-aligned indicators such as INFORM Risk, vulnerability, conflict probability, food security, and population to estimate the CBPF funding each country-year would be expected to receive. The data pipeline runs on Databricks with Delta Lake tables, MLflow experiment tracking, and Unity Catalog governance, and the model is evaluated with walk-forward validation and explained with SHAP. Comparing predictions to actual allocations across 25 countries from 2020 to 2025 produces a normalized "funding gap" score that highlights crises that may be systematically underfunded. A peer benchmarking engine identifies structurally similar crises and contrasts their funding outcomes, and efficiency signals are derived from beneficiary reach where available. To supplement the quantitative analysis, we built a Groq-powered RAG chatbot that draws from IATI project data indexed through Actian VectorAI to generate additional insights on how funding gaps might be addressed. All insights are presented through an interactive Streamlit application with maps and dashboards.

Challenges we ran into

Working with Databricks Free Edition came with unexpected constraints. Local filesystem access is restricted on serverless compute, so we had to find alternative approaches for exporting data. We also ran into NumPy version conflicts between the pre-installed runtime packages and the newer versions pulled in by SHAP and XGBoost, which required careful dependency pinning. On the data side, building a reliable Need Proxy for countries missing People in Need figures required chaining multiple fallback signals, and getting SHAP to work with our one-hot encoded pipeline needed specific handling to produce valid explanations.
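The fallback chain for the Need Proxy can be pictured as an ordered list of signals, tried in turn until one yields a usable value. The sketch below is only an illustration of that pattern: the field names, the order of the fallbacks, and the risk-scaled population estimate are assumptions, not the signals Allocaid actually uses.

```python
from typing import Callable, Optional

Signal = Callable[[dict], Optional[float]]


def reported_pin(rec: dict) -> Optional[float]:
    # Preferred signal: the reported People in Need figure, when present.
    return rec.get("people_in_need")


def food_insecure(rec: dict) -> Optional[float]:
    # Hypothetical second choice: population facing food insecurity.
    return rec.get("food_insecure_population")


def risk_scaled_population(rec: dict) -> Optional[float]:
    # Hypothetical last resort: population scaled by INFORM Risk (scored 0-10).
    pop, risk = rec.get("population"), rec.get("inform_risk")
    if pop is None or risk is None:
        return None
    return pop * risk / 10.0


# Order matters: earlier signals are trusted more than later ones.
FALLBACKS = [
    ("reported_pin", reported_pin),
    ("food_insecure", food_insecure),
    ("risk_scaled_population", risk_scaled_population),
]


def need_proxy(rec: dict):
    """Return (value, source_name) from the first signal that is usable."""
    for name, signal in FALLBACKS:
        value = signal(rec)
        if value is not None and value > 0:
            return value, name
    return None, None
```

Returning the name of the signal alongside the value makes it possible to audit later how many country-years relied on a weaker fallback.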

Fetching IATI API data introduced problems around scale and data inconsistency. The API limits results to 1,000 rows per request, requiring controlled pagination with rate-limit handling. Many fields were stored as nested lists or stringified JSON, requiring careful parsing before analysis. On the vector search side, generating embeddings for thousands of projects required batching to avoid memory spikes, and building the RAG chatbot with Groq required careful prompt structuring to ground responses in retrieved results rather than allowing the LLM to hallucinate. Aligning API ingestion, vector indexing, and LLM reasoning into one reliable pipeline was the overarching challenge.
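The pagination and parsing steps described above can be sketched as follows. This is a minimal illustration, not the project's ingestion code: `fetch_page` stands in for whatever function performs the actual HTTP request, and `RateLimited`, `fetch_all`, and `as_list` are hypothetical names. Only the 1,000-row page size comes from the text.

```python
import json
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetcher when the API asks us to slow down."""


def fetch_all(fetch_page, page_size=1000, max_retries=3, base_delay=1.0,
              sleep=time.sleep):
    """Collect every row by paging, backing off exponentially on rate limits."""
    rows, start = [], 0
    while True:
        for attempt in range(max_retries + 1):
            try:
                page = fetch_page(start, page_size)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(base_delay * 2 ** attempt)
        rows.extend(page)
        if len(page) < page_size:  # a short page means we reached the end
            return rows
        start += page_size


def as_list(value):
    """Normalize a field that may be a list, a JSON-encoded string, or a scalar."""
    if value is None:
        return []
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return [value]  # plain text, not JSON
    return value if isinstance(value, list) else [value]
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays, and `as_list` gives downstream code a single shape to work with regardless of how a field was stored.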

Accomplishments that we're proud of

We are proud of building a complete end-to-end ML pipeline that goes from raw UN data all the way to a deployed web application. The model produces genuinely useful insights: for example, in 2025 our model flags Yemen, Burkina Faso, and Pakistan as receiving significantly less funding than their humanitarian indicators would suggest. We are also proud of the intentional design choice to exclude prior-year funding as a model feature. It would have been easy to include it and boost our accuracy metrics, but doing so would have let the model learn "countries that got funded before get funded again," which just reinforces the exact bias we are trying to surface. Our Databricks pipeline follows real production patterns with Delta Lake tables, MLflow experiment tracking, and Unity Catalog governance, not just notebooks running scripts.

We are also proud of the aid intelligence layer we built alongside the core model. We took messy, nested IATI project records and transformed them into structured financial indicators and semantic embeddings indexed in Actian VectorAI, enabling a Groq-powered RAG chatbot that generates grounded, explainable funding insights. This allows real-time benchmarking of beneficiary-to-budget ratios, sector allocations, and donor patterns across crisis countries like Sudan and Bangladesh.

What we learned

This project taught us how much of humanitarian funding allocation is driven by factors beyond raw need. When you build a model based purely on humanitarian indicators, the gaps between predicted and actual funding are striking and consistent. We also learned the practical side of building on Databricks, from Medallion Architecture design to managing MLflow experiments to navigating the constraints of Free Edition serverless compute. On the data science side, the importance of walk-forward validation for time-series problems became very clear, since naive cross-validation would have leaked future information into our training data and given us misleadingly strong results.
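To make the walk-forward idea concrete, the sketch below generates expanding-window splits in which every training year precedes the test year. The minimum training window and the function name are assumptions for illustration; the writeup does not state the exact fold layout that was used.

```python
def walk_forward_splits(years, min_train_years=2):
    """Yield (train_years, test_year) pairs with an expanding training window.

    Each fold trains only on years strictly before the test year, so no
    future information leaks into training.
    """
    ordered = sorted(set(years))
    for i in range(min_train_years, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical country-year rows; only the "year" field matters for splitting.
rows = [{"country": c, "year": y} for y in range(2020, 2026) for c in ("A", "B")]

folds = []
for train_years, test_year in walk_forward_splits(r["year"] for r in rows):
    train = [r for r in rows if r["year"] in train_years]
    test = [r for r in rows if r["year"] == test_year]
    folds.append((train, test))
```

With data from 2020 to 2025 and a two-year minimum window, this produces four folds, the first training on 2020 and 2021 and testing on 2022, and the last training on 2020 through 2024 and testing on 2025.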

What's next for Allocaid

Going forward, we want to expand Allocaid to visualize and analyze other aspects of international aid, such as the classification of partners by nationality and size. We would also like to add short-horizon forecasting to go from highlighting current underfunding to predicting imminent funding dips in a way that helps ensure the safety of everyone involved. On the platform side, connecting the Streamlit app to live Databricks tables through a scheduled pipeline would allow Allocaid to update automatically as new CBPF data is released. Ultimately, we envision this as a monitoring tool that could alert UN program officers when a crisis begins to slip below its expected funding trajectory, before it becomes a full-blown gap.