Inspiration

A number of use-cases shared by Shinhan Bank revolve around building custom data applications, including: SB5 (AI for internal reporting and BI automation), SL1 (data analytics and BI platform), SF4 (AI-based customer behaviour prediction), SF1 (intelligent document processing), and SB9 (AI-driven SME credit scoring).

Our motivation is therefore to build a platform that lets enterprises create custom data models and tools that infer business logic from operational data, automating data-heavy business operations without hallucination risk or extensive manual review. In traditional vibe-coding, the LLM generates the final output directly, which carries hallucination risk and requires human evaluation. Our approach instead uses coding agents to infer symbolic formulas that are deterministic and human-understandable, so they can be executed in a production environment without any risk of hallucination and without any need for human review.

In the demo, we show an example of building an AI-based customer behaviour prediction model (SF4) and an example of AI-powered dashboarding and BI (SB5). The same approach can be applied to other data-related use-cases, e.g. SME credit scoring or intelligent document processing.

What it does

Our data modelling agent takes a real-world dataset, consisting of any number of files, as input, and learns a symbolic model for use-cases such as building ETL pipelines, generating forecasts, or performing deterministic calculations that encode business logic.

In ByteGenie, you can ingest data via manual uploads: create a unique dataset ID and upload all the relevant files for that dataset. For example, you could upload files related to your portfolio to create a portfolio dataset, or files related to customer information to create a customer dataset.

In addition to manual uploads, you can ingest data from your existing enterprise systems using data connectors. For example, you can pull data from Snowflake via warehouse connections, or from industry-specific enterprise systems such as Salesforce Financial Services Cloud.

Once you have a suitable dataset for a use-case, you can perform ad-hoc analysis on it through the Data Chat functionality: select a dataset and ask questions about it in natural language. For example, you can select a portfolio dataset, view the contents of its files to confirm you are using the right data, and then ask for any analysis, such as calculating total portfolio value by industry and highlighting the top 3 industries.
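Behind a question like "total portfolio value by industry, top 3", the generated analysis would amount to something like the following pandas snippet (the column names and figures here are illustrative, not from an actual ByteGenie dataset):

```python
import pandas as pd

# Illustrative portfolio dataset; columns "industry" and "value" are hypothetical
portfolio = pd.DataFrame({
    "industry": ["Tech", "Tech", "Retail", "Energy", "Retail", "Finance"],
    "value":    [120.0,  80.0,   60.0,     150.0,    40.0,     90.0],
})

# Total portfolio value by industry, largest first, keeping the top 3
totals = portfolio.groupby("industry")["value"].sum().sort_values(ascending=False)
top3 = totals.head(3)
print(top3)
```

Data Chat produces and runs this kind of code for you, so the question can stay in natural language.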

For more repeatable workflows, where you want to derive certain outcomes from your dataset on an ongoing basis, you can learn symbolic models that produce a desired output from a set of inputs. Existing functionality supports learning symbolic models for ETL pipelines, forecasting, and business-logic discovery. These models can be translated deterministically into a programming language (such as Python), into mathematics, and into natural language. This ensures that the model: is executable; can be analyzed mathematically; can be interpreted by humans.
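To illustrate what "deterministic translation into code, mathematics, and natural language" can look like, here is a minimal sketch; the JSON-like formula schema and translation functions are hypothetical stand-ins, not ByteGenie's actual symbolic language:

```python
# One symbolic formula, rendered into two targets by simple deterministic walks
formula = {"op": "*", "args": [{"var": "price"}, {"var": "quantity"}]}

def to_python(node):
    """Render the formula as a Python expression string."""
    if "var" in node:
        return node["var"]
    return "(" + f" {node['op']} ".join(to_python(a) for a in node["args"]) + ")"

def to_text(node):
    """Render the same formula as plain English."""
    if "var" in node:
        return node["var"]
    word = {"*": "multiplied by", "+": "plus"}[node["op"]]
    return f" {word} ".join(to_text(a) for a in node["args"])

print(to_python(formula))  # (price * quantity)
print(to_text(formula))    # price multiplied by quantity

# Because the Python rendering is deterministic, it is directly executable:
fn = eval("lambda price, quantity: " + to_python(formula))
assert fn(price=3.0, quantity=4) == 12.0
```

The key property is that all renderings come from the same symbolic source, so the executable form, the mathematical form, and the human-readable form can never drift apart.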

The live app can be tried at https://app.byte-genie.com using the following credentials (see the demo video for a user guide): Username: demo-genie Password: $d6YA*arSvgNceQp

How we built it

We built it using Qwen Coder and our proprietary symbolic language, which can be translated simultaneously into programming languages (to ensure it is executable), mathematics (to facilitate mathematical reasoning), and natural language (to facilitate human review).

We use Qwen Coder in an agentic loop: Qwen Coder proposes a suitable symbolic program schema, which is then converted into a correct executable program with the help of symbolic-language tools. These symbolic programs are executed to move the data from one state to the next. After each execution, the difference between the current state of the data and the desired output is calculated, and based on that difference Qwen Coder produces the next program. The loop continues until the inputs reach the desired output state.
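The loop above can be sketched as follows; `propose_program`, `compile_symbolic`, and `diff_states` are hypothetical stand-ins for Qwen Coder and the symbolic-language tooling, not our real interfaces:

```python
def learn_symbolic_model(inputs, desired_output, propose_program,
                         compile_symbolic, diff_states, max_steps=10):
    """Iterate: propose a program, execute it, compare to the target state."""
    state = inputs
    for _ in range(max_steps):
        delta = diff_states(state, desired_output)
        if not delta:                            # state matches the desired output
            return state
        schema = propose_program(state, delta)   # LLM proposes a program schema
        program = compile_symbolic(schema)       # tooling makes it executable
        state = program(state)                   # move data to the next state
    raise RuntimeError("did not converge within max_steps")

# Toy demonstration with numbers standing in for data states:
result = learn_symbolic_model(
    inputs=0, desired_output=5,
    propose_program=lambda state, delta: {"add": delta},
    compile_symbolic=lambda schema: (lambda s: s + schema["add"]),
    diff_states=lambda s, d: d - s,
)
assert result == 5
```

In the real system the "state" is a dataset rather than a number, and the diff is computed over data, but the propose-compile-execute-compare cycle is the same.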

Challenges we ran into

The main challenge we ran into was getting Qwen Coder to correctly understand and use a symbolic language. Because the language has different syntax and rules from existing programming languages and mathematics, and the model has not been trained on it, the model had considerable difficulty constructing even simple programs correctly.

We resolved this by creating a CLI tool that lets Qwen Coder retrieve details about the symbolic language as and when needed, and correct its guesses accordingly. With this tool, the base language model could construct valid symbolic programs with a correct JSON schema, which significantly improved its ability to generate correct and meaningful programs.
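A doc-lookup CLI of this kind can be very small. The sketch below is illustrative only; the topic names and doc strings are made up and do not reflect the real tool:

```python
import sys

# Hypothetical topic -> documentation mapping for the symbolic language
DOCS = {
    "operators": "Supported operators: + - * / map filter reduce",
    "schema":    "A program is a JSON object: {'op': str, 'args': [...]}",
}

def lookup(topic: str) -> str:
    """Return documentation for a topic, or list the known topics."""
    return DOCS.get(topic, f"Unknown topic '{topic}'. Known topics: {sorted(DOCS)}")

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. `python symdocs.py schema` (script name is hypothetical)
    print(lookup(sys.argv[1]))
```

The agent calls the tool whenever it is unsure of the syntax, reads the returned documentation, and retries, which is what let the untrained base model converge on correct programs.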

Accomplishments that we're proud of

We were able to learn complex models for realistic use-cases, including a complex ETL pipeline, a forecasting model, and a budget planning model, without providing the agent any custom instructions.

What we learned

We learned the usefulness of building CLI and symbolic tools that augment LLM agents.

What's next for ByteGenie

The next step for us is to validate the solution with real customers, and to let users build custom user interfaces for each model.
