Inspiration
Sodacan was born from our shared frustration with modern data engineering. Simple tasks felt needlessly complicated, requiring repetitive scripts, clunky UIs, and constant context-switching between tools. We watched non-technical teammates get completely blocked, unable to access or combine data without an engineer's help.
Our inspiration came from a simple question: "What if we could just tell the data what to do?"
We realized the massive leap in LLM capabilities would finally let us build the tool we always wanted: a smart, CLI-first platform that understands plain English and acts as an AI-powered data engineer right in our terminals.
What it does
Sodacan is a CLI-first data engineering platform designed to act as an AI-powered data engineer that lives in your terminal.
It lets anyone, from engineers to non-technical analysts, ingest data from multiple sources (local CSV or PDF files, AWS S3, or databases like Snowflake and MySQL), transform that data with plain-English commands (e.g., "drop the 'test_user' column and rename 'id' to 'customer_id'"), and load the cleaned data into any destination, such as Google Sheets, a Snowflake table, or a new file.
It uses a two-stage AI pipeline to reliably understand your request and convert it into executable code, allowing for a conversational, multi-step data workflow without you having to write a single line of Python or SQL.
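For the example command above, the code the pipeline ultimately produces and runs is plain Pandas. A rough illustration with a hypothetical input file (the actual generated code varies with the request and schema):

```python
import pandas as pd

# Hypothetical source file; Sodacan loads every source into a DataFrame.
df = pd.read_csv("customers.csv")

# "drop the 'test_user' column and rename 'id' to 'customer_id'"
df = df.drop(columns=["test_user"]).rename(columns={"id": "customer_id"})

# Write the cleaned result to the chosen destination.
df.to_csv("customers_clean.csv", index=False)
```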
How we built it
We built Sodacan as a Python-native CLI using Typer. The "brain" is a two-stage AI pipeline built on the Google Gemini 2.5 Flash model, which we found far more reliable than a single prompt:
Analyzer: This first AI agent classifies the user's intent (e.g., pandas_transform) and generates a structured JSON instruction.
Executor: A second AI agent receives this JSON and focuses on one task: writing clean, executable Pandas code.
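A minimal sketch of that two-stage flow, assuming the google-generativeai Python client and heavily abbreviated prompts (our real prompts carry far more constraints):

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # supplied via config/env in practice
model = genai.GenerativeModel("gemini-2.5-flash")

def analyze(request: str, schema: str) -> dict:
    """Stage 1 (Analyzer): classify intent and emit a structured JSON instruction."""
    prompt = (
        "Classify this data request and respond with JSON only, e.g. "
        '{"intent": "pandas_transform", "instruction": "..."}\n'
        f"Schema: {schema}\nRequest: {request}"
    )
    # In practice we strip markdown fences before parsing (see Challenges below).
    return json.loads(model.generate_content(prompt).text)

def execute(instruction: dict, schema: str) -> str:
    """Stage 2 (Executor): turn the instruction into executable Pandas code."""
    prompt = (
        "Write only Python/Pandas code that operates on a DataFrame named `df`.\n"
        f"Schema: {schema}\nInstruction: {json.dumps(instruction)}"
    )
    return model.generate_content(prompt).text
```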
We use Pandas as our universal data format. All sources (files, S3, databases) are loaded into a DataFrame, transformed, and then written to any sink. The interactive build mode features session-based memory (remembering the schema and conversation) and an undo/redo stack. Connectors were built using standard libraries like boto3 (AWS), SQLAlchemy (databases), and gspread (Google Sheets).
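Because every source lands in a DataFrame, a full run is conceptually a read, the AI step, and a write. A simplified sketch with hypothetical bucket, key, and connection details:

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Source: pull a CSV out of S3 into the universal DataFrame format.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-bucket", Key="raw/orders.csv")  # hypothetical names
df = pd.read_csv(obj["Body"])

# ... AI-generated transformations run against `df` here ...

# Sink: write the cleaned frame to a database through SQLAlchemy.
engine = create_engine("postgresql://user:pass@host:5432/warehouse")  # hypothetical URL
df.to_sql("orders_clean", engine, if_exists="replace", index=False)
```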
Challenges we ran into
Taming the LLM: Our biggest challenge was prompt engineering. Getting the AI to consistently return only a JSON object or only Python code, without any conversational fluff ("Sure, here's the code!"), took dozens of iterations.
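Part of the fix was defensive parsing on our side rather than prompts alone; roughly the kind of cleanup we apply before trusting the Analyzer's output:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Strip markdown fences and chatter, then parse the first JSON object found."""
    cleaned = re.sub(r"`{3}(?:json)?", "", raw).strip()
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if not match:
        raise ValueError("Analyzer did not return a JSON object")
    return json.loads(match.group(0))
```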
Unstructured Data (PDFs): Extracting tables from PDFs is hard. Our hybrid solution uses pdfplumber to get all the text, then feeds that text to Gemini to re-format it as a CSV. It's a huge improvement but is still the most brittle part of our system.
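The hybrid path, roughly (the model object is the same Gemini client used by the pipeline; the real prompt includes more guardrails):

```python
import io
import pandas as pd
import pdfplumber

def pdf_to_dataframe(path: str, model) -> pd.DataFrame:
    """Extract raw text with pdfplumber, then ask Gemini to reshape it as CSV."""
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    prompt = (
        "Re-format the tabular data in the following text as CSV. "
        "Output only the CSV, including a header row.\n\n" + text
    )
    csv_text = model.generate_content(prompt).text
    return pd.read_csv(io.StringIO(csv_text))
```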
Safety and "Hallucinations": We had to build safeguards, like running the AI-generated code in an isolated namespace, to prevent destructive "hallucinated" commands from breaking the session.
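The isolated-namespace safeguard is conceptually simple: generated code only ever sees a copy of the DataFrame in a scratch namespace, so a bad command can't touch session state. A simplified sketch:

```python
import pandas as pd

def run_generated_code(code: str, df: pd.DataFrame) -> pd.DataFrame:
    """Run AI-generated Pandas code against a copy of the data, in its own namespace."""
    namespace = {"pd": pd, "df": df.copy()}  # no access to session globals
    exec(code, namespace)
    result = namespace.get("df")
    if not isinstance(result, pd.DataFrame):
        raise ValueError("Generated code did not leave a DataFrame named `df`")
    return result
```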
Scope Creep: The temptation to add "just one more" data source was immense. We had to be disciplined and focus on a core, end-to-end workflow (file -> build -> database) to deliver a polished product.
Accomplishments that we're proud of
The Two-Stage AI Pipeline: We built a reliable system where one AI ("Analyzer") figures out what you want, and a second AI ("Executor") writes the code. This separation makes it far more accurate than a single prompt.
True Conversational Workflow: The interactive build mode remembers your conversation and data's current state (its schema). This allows for complex, multi-step transformations and includes features like undo/redo.
Graceful Degradation: The tool is resilient. If database credentials are missing, it generates a .sql file instead of crashing. If the AI is confused, it falls back to a simpler mode.
Solving PDF Extraction: We're proud of our hybrid AI solution that uses pdfplumber and Gemini to intelligently extract tables from PDFs, a notoriously difficult task.
Empowering Non-Technical Users: We successfully built a tool that lowers the barrier to data engineering, allowing non-technical teammates to access and work with data independently.
What we learned
AI Needs Guardrails: Our two-stage pipeline was our most important discovery. Constraining the AI at each step was the key to getting reliable, structured output.
Context is King: The "magic" of our interactive mode isn't just the AI; it's feeding the AI the current DataFrame schema and conversation history with every single prompt.
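Concretely, every build-mode prompt is prefixed with the live schema and the recent conversation, along these lines (simplified):

```python
import pandas as pd

def build_context(df: pd.DataFrame, history: list[str], request: str) -> str:
    """Prefix every prompt with the current schema and the conversation so far."""
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.items())
    recent = "\n".join(history[-10:])  # keep only the most recent exchanges
    return (
        f"Current DataFrame columns: {schema}\n"
        f"Conversation so far:\n{recent}\n"
        f"New request: {request}"
    )
```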
Graceful Degradation is Key: We built the system to fail gracefully. If the AI pipeline gets confused, it falls back to a simpler prompt. If database credentials are missing, it generates a .sql file instead of crashing.
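The database fallback is as plain as it sounds; a sketch with a hypothetical helper (the real version handles types and quoting more carefully):

```python
import pandas as pd
from sqlalchemy import create_engine

def write_or_fallback(df: pd.DataFrame, table: str, db_url: str | None) -> None:
    """Load into the database when credentials exist; otherwise emit a .sql file."""
    if db_url:
        df.to_sql(table, create_engine(db_url), if_exists="replace", index=False)
        return
    # No credentials: degrade gracefully to a script the user can run later.
    cols = ", ".join(f"{c} TEXT" for c in df.columns)
    statements = [f"CREATE TABLE {table} ({cols});"]
    for row in df.itertuples(index=False):
        values = ", ".join(repr(v) for v in row)
        statements.append(f"INSERT INTO {table} VALUES ({values});")
    with open(f"{table}.sql", "w") as f:
        f.write("\n".join(statements) + "\n")
```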
Connectors are 90% of the Work: Building reliable connectors to handle authentication, rate limits, and data quirks for each source was the most significant traditional engineering effort.
What's next for Sodacan?
A Specialized Transformer Model: Our main goal is to train our own transformer on a large dataset of data-engineering commands. This will create a smaller, faster, and more accurate model specialized for this exact task.
More Connectors: We will expand our connector library to include tools like BigQuery, Kafka, Airtable, and Salesforce.
Enhanced AI Guardrails: We plan to build smarter safety features, like having the AI pre-validate its own code or warn users before running a destructive command.
Deeper Unstructured Data Handling: We will continue to improve our AI-hybrid approach for PDFs and expand it to other common unstructured formats like server logs or web pages.
Built With
- amazon-web-services
- google-cloud
- google-gemini-api
- google-sheets-api
- mysql
- pandas
- pdfplumber
- postgresql
- python
- pyyaml
- rich
- snowflake
- sqlalchemy
- sqlite
- typer


