Inspiration
The idea came from one of our teammates, who noticed that when companies upload messy documents (invoices, contracts, CSVs, or PDFs) into AI pipelines, the results are slow, expensive, and inconsistent. Teams end up spending huge amounts of money on token usage, retraining, and manual cleanup. So we set out to make this process cost-effective.
What it does
CleanData is a scalable system that transforms messy files, such as .csv, .docx, and PDFs, into clean, structured JSON or CSV outputs that are ready for analysis or model training. When a new file is uploaded, an LLM processes the unstructured content to generate structured data, while repeated or similar files are served instantly from cache to reduce costs and latency. Along the way, CleanData normalizes formats like addresses and dates, removes duplicates, validates important fields such as tax IDs or contract clauses, and flags errors for user confirmation. For scalability, we've implemented AWS Kinesis with sharding, and the system is being built out to handle many clients in parallel. Our multiple agents produce results that are, in some cases, cleaner and more consistent than ChatGPT's. The result is faster, greener, and more cost-effective data pipelines that minimize token usage while ensuring reliable outputs.
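To make the normalization and duplicate-removal steps concrete, here is a minimal TypeScript sketch of the kind of logic involved. The function names and the use of JavaScript's built-in `Date` parsing are illustrative assumptions, not the actual CleanData implementation:

```typescript
// Normalize a date string into ISO 8601 "YYYY-MM-DD"; return null for
// unparseable values so they can be flagged for user confirmation.
// (Built-in Date parsing is a simplistic stand-in for real normalization.)
function normalizeDate(raw: string): string | null {
  const parsed = new Date(raw);
  if (isNaN(parsed.getTime())) return null;
  return parsed.toISOString().slice(0, 10);
}

// Drop exact-duplicate records by comparing a canonical JSON form.
function dedupe<T>(records: T[]): T[] {
  const seen = new Set<string>();
  return records.filter((record) => {
    const key = JSON.stringify(record);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

In the real pipeline, values that fail normalization would be surfaced in the confirmation screen rather than silently dropped.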
How we built it
We started by designing a simple drag-and-drop interface so users can upload .csv, .docx, or PDF files. On the backend, we built a pipeline using AWS Kinesis to route incoming files, preparing for future scalability across multiple clients. Each file is routed by an agent manager: if it's new, it's sent to an LLM (Cerebras) that structures the unformatted content into clean JSON or CSV; if it's similar to a previous upload, the result is pulled directly from cache for speed and efficiency. We integrated validation logic so the system can normalize formats, remove duplicates, and flag anomalies via a judge agent for user confirmation. Processed outputs are stored in AWS S3, and we optimized prompts to minimize token usage and reduce energy costs. Together, these components form a system that is both functional now and architected for enterprise-level scalability.
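The agent manager's cache-or-LLM decision can be sketched roughly as follows. This is a simplified assumption of the design: it hashes the file's content and only calls the model on a cache miss, and the `callLlm` stand-in just wraps the text rather than calling the real Cerebras API:

```typescript
import { createHash } from "node:crypto";

type CleanResult = { source: "cache" | "llm"; json: string };

// In-memory cache keyed by a SHA-256 of the file content.
// (The real system could back this with Redis.)
const cache = new Map<string, string>();

// Stand-in for the actual Cerebras call.
function callLlm(content: string): string {
  return JSON.stringify({ cleaned: content.trim() });
}

function processFile(content: string): CleanResult {
  const key = createHash("sha256").update(content).digest("hex");
  const hit = cache.get(key);
  if (hit !== undefined) return { source: "cache", json: hit };

  const json = callLlm(content); // new content: spend tokens once
  cache.set(key, json);          // repeated uploads are free from here on
  return { source: "llm", json };
}
```

An exact-hash lookup only catches identical files; handling "similar" uploads, as described above, would need a fuzzier key such as a normalized or embedded representation.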
Individual Contributions
Our team split the work to move fast and cover every part of the build. Two teammates focused on the frontend, creating the drag-and-drop upload interface, the dashboard to display raw versus cleaned data, and the user confirmation screen for flagged errors. The other two teammates built the backend pipeline, wiring together AWS Kinesis and S3, integrating Cerebras to transform unstructured files into structured JSON or CSV, and adding caching, validation, and judge agents for error detection. To support this, one teammate generated realistic test datasets like messy CSVs, contracts, and invoices, while another dug into scalability, implementing sharding and agent orchestration so the system can handle many client requests in parallel. Together, these contributions gave us a working prototype with a clean UI, reliable backend, and a path to scale.
Challenges we ran into
One of the biggest challenges we ran into was LLM hallucination when cleaning noisy data. For example, when we uploaded a messy DOCX contract with missing fields, the model sometimes "filled in the blanks" with information that wasn't in the original document, like inventing a ZIP code or tax ID. That was dangerous because we needed the system to stay truthful to the input, not generate false data. We found the LLM hallucinated most when the input tokens exceeded the context window, so we added a verifier agent that splits the data into smaller batches before sending it to the LLM, which greatly reduces the chance of errors. We also added a judge agent that validates outputs, flags suspicious or fabricated fields, and asks the user for confirmation before finalizing.
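The batching idea can be sketched like this. The ~4-characters-per-token estimate and the token budget are assumptions for illustration; the real system would use the model's tokenizer and context limit:

```typescript
const CHARS_PER_TOKEN = 4; // rough heuristic, not a real tokenizer

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Greedily pack records into batches whose estimated token count stays
// under the budget, so no single LLM call overruns the context window.
function batchRecords(records: string[], tokenBudget: number): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const record of records) {
    const cost = estimateTokens(record);
    if (current.length > 0 && used + cost > tokenBudget) {
      batches.push(current); // close the full batch
      current = [];
      used = 0;
    }
    current.push(record);
    used += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each batch would then be sent to the LLM separately, with the judge agent checking every batch's output against the source records.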
Accomplishments that we're proud of
Honestly, at the start of the competition we didn't know whether we'd be able to submit a solid idea, as we ran into so many issues. But over the course of the night we stayed steadfast, worked through every struggle, and finally produced a product capable of so many things. We are most proud of the team effort it took to finish this product.
What we learned
From idea generation to spending 24 hours together, we not only developed team camaraderie but also learned new ways to implement and brainstorm. For example, a mentor came over and suggested we use a cache and keep storing data so we could save energy and cost.
What's next for our project
We're really looking forward to improving the product's efficiency, both in response times and in becoming much more eco-friendly.
Built With
- amazon-web-services
- cerebras
- claude
- kinesis
- react
- redis
- s3
- typescript