The problem
Questions like these are easy to ask and hard to answer from a pile of raw product feedback and events:
- What was the percentage increase in payment failures this month compared to last month?
- What percentage of users who unsubscribed raised a pricing concern but never opened a Support ticket before they left?
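Once the data lives in typed rows, the first question above reduces to ordinary SQL. A minimal sketch using sqlite and a hypothetical `events` table (the table and column names are assumptions for illustration, not DataBoss's actual schema):

```python
import sqlite3

# Hypothetical typed events table; names and sample data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_type TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "payment_failure", "2024-05"),
        ("u2", "payment_failure", "2024-05"),
        ("u1", "payment_failure", "2024-06"),
        ("u2", "payment_failure", "2024-06"),
        ("u3", "payment_failure", "2024-06"),
    ],
)

# Month-over-month percentage increase in payment failures:
# a deterministic aggregate query, not another LLM pass.
row = conn.execute(
    """
    WITH monthly AS (
        SELECT month, COUNT(*) AS failures
        FROM events
        WHERE event_type = 'payment_failure'
        GROUP BY month
    )
    SELECT 100.0 * (cur.failures - prev.failures) / prev.failures AS pct_increase
    FROM monthly cur
    JOIN monthly prev ON prev.month = '2024-05' AND cur.month = '2024-06'
    """
).fetchone()
print(row[0])  # 50.0 (failures went from 2 to 3)
```

The point is that once rows are typed, the answer is a reproducible query rather than a similarity-ranked guess.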
Review and product data are unstructured and always changing: new fields, new event types, new tools. A common response is to put everything in a vector database, because you can append new documents without touching a schema. That helps search and similarity, but it does not reliably support aggregate reasoning—percentages, cohorts, joins across entities, and “never did X” style logic. For trustworthy answers you usually want typed rows in tables and SQL.
The catch: if the shape of the data keeps moving, the relational model has to keep moving too. Hand-maintaining DDL and migrations for every shift is expensive and error-prone, so teams feel stuck between flexible-but-weak-on-aggregates (vectors) and accurate-but-high-maintenance (SQL).
Introducing DataBoss
DataBoss is an autonomous unstructured-to-structured pipeline: new blobs land in storage, an agent proposes a schema and migration story, you review it in git, and deterministic SQL (not another LLM pass) can back the answers to questions like the ones above—once the model matches the data.
How it works
Ingestion and event-driven trigger
Clients upload unstructured data to Google Cloud Storage. GCS emits notifications (e.g. via Pub/Sub); the service receives them and knows new data is ready to process.

Staging
The server loads incoming payloads into a staging area in the database. A threshold (e.g. number of pending rows) avoids running a full cycle on every tiny file. When the buffer is "full enough," the pipeline starts.

Inspector (schema and plans)
An Inspector agent compares what's in staging to existing production-oriented structures (in the codebase this shows up as PROD_* vs. proposed DEV_* tables on the same Postgres). It produces an execution-oriented artifact set: DDL for new/changed dev tables, dbt staging models where appropriate, a migration script for existing data, an injection script from staging into the new model, and human-readable summary/metadata (accepted vs. rejected rows, design notes). It validates and tightens this loop (including dbt runs against the dev side) so the plan is workable before you rely on it.

Early injection (fast feedback)
Even before production changes are final, an Injection step can apply the approved plans to the dev side so analysts get real rows in the new shape quickly—useful for demos, dashboards, and SQL checks—while the pull request is still open.

Pull request and review
The Inspector opens a PR containing the execution plan (schema, migration, injection, dbt where relevant) plus documentation: what changed, why, notes on rejected staging rows, and space for ERD / diff-style views if you wire those in. Merge can be manual (stakeholders comment, ask for fixes, re-run) or automated, depending on policy. If the PR is rejected, you roll dev back to match prod conceptually (how you implement "reset" is up to your ops; the intent is no half-applied model left lying around). You can also edit the branch yourself—clone, commit, merge—so the next cycle starts from your model.

Deterministic production migration
After merge, CI (e.g. GitHub Actions) should run the same migration and load scripts that were already tested in dev—execution, not a fresh round of model inference. That keeps prod behavior predictable and auditable.

Reset for the next cycle
When prod is updated and stable, dev is brought back in sync with prod, ephemeral demo/injection state is cleared, staging is cleaned up for successfully processed rows, storage hygiene follows your rules, and the system returns to listen mode for the next event.
What required a team of engineers can now be managed autonomously by DataBoss.
Built With
- langchain
- python