Inspiration
Sodacan was born from our shared frustration with modern data engineering. Simple tasks felt needlessly complicated, requiring repetitive scripts, clunky UIs, and constant context-switching between tools. We watched non-technical teammates get completely blocked, unable to access or combine data without an engineer's help.
Our inspiration came from a simple question: "What if we could just tell the data what to do?"
We realized the massive leap in LLM capabilities would finally let us build the tool we always wanted: a smart, CLI-first platform that understands plain English and acts as an AI-powered data engineer right in our terminals.
What it does
Sodacan is a CLI-first data engineering platform designed to act as an AI-powered data engineer that lives in your terminal.
It lets anyone, from engineers to non-technical analysts, ingest data from multiple sources (local CSV or PDF files, AWS S3, or databases like Snowflake and MySQL), transform that data with plain-English commands (e.g., "drop the 'test_user' column and rename 'id' to 'customer_id'"), and load the cleaned data into any destination, such as Google Sheets, a Snowflake table, or a new file.
It uses a two-stage AI pipeline to reliably understand your request and convert it into executable code, allowing for a conversational, multi-step data workflow without you having to write a single line of Python or SQL.
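For the example command above, the code the pipeline ultimately produces and runs is plain Pandas. A rough illustration with a hypothetical input file (the actual generated code varies with the request and schema):

```python
import pandas as pd

# Hypothetical source file; Sodacan loads every source into a DataFrame.
df = pd.read_csv("customers.csv")

# "drop the 'test_user' column and rename 'id' to 'customer_id'"
df = df.drop(columns=["test_user"]).rename(columns={"id": "customer_id"})

# Write the cleaned result to the chosen destination.
df.to_csv("customers_clean.csv", index=False)
```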
How we built it
We built Sodacan as a Python-native CLI using Typer. The "brain" is a two-stage AI pipeline built on the Google Gemini 2.5 Flash model, which we found far more reliable than a single prompt:
Analyzer: This first AI agent classifies the user's intent (e.g., pandas_transform) and generates a structured JSON instruction.
Executor: A second AI agent receives this JSON and focuses on one task: writing clean, executable Pandas code.
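A minimal sketch of that two-stage flow, assuming the google-generativeai Python client and heavily abbreviated prompts (our real prompts carry far more constraints):

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # supplied via config/env in practice
model = genai.GenerativeModel("gemini-2.5-flash")

def analyze(request: str, schema: str) -> dict:
    """Stage 1 (Analyzer): classify intent and emit a structured JSON instruction."""
    prompt = (
        "Classify this data request and respond with JSON only, e.g. "
        '{"intent": "pandas_transform", "instruction": "..."}\n'
        f"Schema: {schema}\nRequest: {request}"
    )
    # In practice we strip markdown fences before parsing (see Challenges below).
    return json.loads(model.generate_content(prompt).text)

def execute(instruction: dict, schema: str) -> str:
    """Stage 2 (Executor): turn the instruction into executable Pandas code."""
    prompt = (
        "Write only Python/Pandas code that operates on a DataFrame named `df`.\n"
        f"Schema: {schema}\nInstruction: {json.dumps(instruction)}"
    )
    return model.generate_content(prompt).text
```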
We use Pandas as our universal data format. All sources (files, S3, databases) are loaded into a DataFrame, transformed, and then written to any sink. The interactive build mode features session-based memory (remembering the schema and conversation) and an undo/redo stack. Connectors were built using standard libraries like boto3 (AWS), SQLAlchemy (databases), and gspread (Google Sheets).
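Because every source lands in a DataFrame, a full run is conceptually a read, the AI step, and a write. A simplified sketch with hypothetical bucket, key, and connection details:

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Source: pull a CSV out of S3 into the universal DataFrame format.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-bucket", Key="raw/orders.csv")  # hypothetical names
df = pd.read_csv(obj["Body"])

# ... AI-generated transformations run against `df` here ...

# Sink: write the cleaned frame to a database through SQLAlchemy.
engine = create_engine("postgresql://user:pass@host:5432/warehouse")  # hypothetical URL
df.to_sql("orders_clean", engine, if_exists="replace", index=False)
```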
Challenges we ran into
Taming the LLM: Our biggest challenge was prompt engineering. Getting the AI to consistently return only a JSON object or only Python code, without any conversational fluff ("Sure, here's the code!"), took dozens of iterations.
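Part of the fix was defensive parsing on our side rather than prompts alone; roughly the kind of cleanup we apply before trusting the Analyzer's output:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Strip markdown fences and chatter, then parse the first JSON object found."""
    cleaned = re.sub(r"`{3}(?:json)?", "", raw).strip()
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if not match:
        raise ValueError("Analyzer did not return a JSON object")
    return json.loads(match.group(0))
```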
Unstructured Data (PDFs): Extracting tables from PDFs is hard. Our hybrid solution uses pdfplumber to get all the text, then feeds that text to Gemini to re-format it as a CSV. It's a huge improvement but is still the most brittle part of our system.
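The hybrid path, roughly (the model object is the same Gemini client used by the pipeline; the real prompt includes more guardrails):

```python
import io
import pandas as pd
import pdfplumber

def pdf_to_dataframe(path: str, model) -> pd.DataFrame:
    """Extract raw text with pdfplumber, then ask Gemini to reshape it as CSV."""
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    prompt = (
        "Re-format the tabular data in the following text as CSV. "
        "Output only the CSV, including a header row.\n\n" + text
    )
    csv_text = model.generate_content(prompt).text
    return pd.read_csv(io.StringIO(csv_text))
```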
Safety and "Hallucinations": We had to build safeguards, like running the AI-generated code in an isolated namespace, to prevent destructive "hallucinated" commands from breaking the session.
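The isolated-namespace safeguard is conceptually simple: generated code only ever sees a copy of the DataFrame in a scratch namespace, so a bad command can't touch session state. A simplified sketch:

```python
import pandas as pd

def run_generated_code(code: str, df: pd.DataFrame) -> pd.DataFrame:
    """Run AI-generated Pandas code against a copy of the data, in its own namespace."""
    namespace = {"pd": pd, "df": df.copy()}  # no access to session globals
    exec(code, namespace)
    result = namespace.get("df")
    if not isinstance(result, pd.DataFrame):
        raise ValueError("Generated code did not leave a DataFrame named `df`")
    return result
```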
Scope Creep: The temptation to add "just one more" data source was immense. We had to be disciplined and focus on a core, end-to-end workflow (file -> build -> database) to deliver a polished product.
Accomplishments that we're proud of
The Two-Stage AI Pipeline: We built a reliable system where one AI ("Analyzer") figures out what you want, and a second AI ("Executor") writes the code. This separation makes it far more accurate than a single prompt.
True Conversational Workflow: The interactive build mode remembers your conversation and data's current state (its schema). This allows for complex, multi-step transformations and includes features like undo/redo.
Graceful Degradation: The tool is resilient. If database credentials are missing, it generates a .sql file instead of crashing. If the AI is confused, it falls back to a simpler mode.
Solving PDF Extraction: We're proud of our hybrid AI solution that uses pdfplumber and Gemini to intelligently extract tables from PDFs, a notoriously difficult task.
Empowering Non-Technical Users: We successfully built a tool that lowers the barrier to data engineering, allowing non-technical teammates to access and work with data independently.
What we learned
AI Needs Guardrails: Our two-stage pipeline was our most important discovery. Constraining the AI at each step was the key to getting reliable, structured output.
Context is King: The "magic" of our interactive mode isn't just the AI; it's feeding the AI the current DataFrame schema and conversation history with every single prompt.
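Concretely, every build-mode prompt is prefixed with the live schema and the recent conversation, along these lines (simplified):

```python
import pandas as pd

def build_context(df: pd.DataFrame, history: list[str], request: str) -> str:
    """Prefix every prompt with the current schema and the conversation so far."""
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.items())
    recent = "\n".join(history[-10:])  # keep only the most recent exchanges
    return (
        f"Current DataFrame columns: {schema}\n"
        f"Conversation so far:\n{recent}\n"
        f"New request: {request}"
    )
```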
Graceful Degradation is Key: We built the system to fail gracefully. If the AI pipeline gets confused, it falls back to a simpler prompt. If database credentials are missing, it generates a .sql file instead of crashing.
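The database fallback is as plain as it sounds; a sketch with a hypothetical helper (the real version handles types and quoting more carefully):

```python
import pandas as pd
from sqlalchemy import create_engine

def write_or_fallback(df: pd.DataFrame, table: str, db_url: str | None) -> None:
    """Load into the database when credentials exist; otherwise emit a .sql file."""
    if db_url:
        df.to_sql(table, create_engine(db_url), if_exists="replace", index=False)
        return
    # No credentials: degrade gracefully to a script the user can run later.
    cols = ", ".join(f"{c} TEXT" for c in df.columns)
    statements = [f"CREATE TABLE {table} ({cols});"]
    for row in df.itertuples(index=False):
        values = ", ".join(repr(v) for v in row)
        statements.append(f"INSERT INTO {table} VALUES ({values});")
    with open(f"{table}.sql", "w") as f:
        f.write("\n".join(statements) + "\n")
```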
Connectors are 90% of the Work: Building reliable connectors to handle authentication, rate limits, and data quirks for each source was the most significant traditional engineering effort.
What's next for Sodacan?
A Specialized Transformer Model: Our main goal is to train our own transformer on a large dataset of data-engineering commands. This will create a smaller, faster, and more accurate model specialized for this exact task.
More Connectors: We will expand our connector library to include tools like BigQuery, Kafka, Airtable, and Salesforce.
Enhanced AI Guardrails: We plan to build smarter safety features, like having the AI pre-validate its own code or warn users before running a destructive command.
Deeper Unstructured Data Handling: We will continue to improve our AI-hybrid approach for PDFs and expand it to other common unstructured formats like server logs or web pages.
Built With
- amazon-web-services
- google-cloud
- google-gemini-api
- google-sheets-api
- mysql
- pandas
- pdfplumber
- postgresql
- python
- pyyaml
- rich
- snowflake
- sqlalchemy
- sqlite
- typer


