Inspiration
The process of building, training, and deploying a machine learning model is often complex, requiring significant expertise in coding, MLOps, and framework-specific details. We were inspired to drastically simplify this workflow. Our goal was to create a tool that empowers anyone to turn a simple idea, expressed in plain English, into a fully trained and operational ML model without writing a single line of ML code. We wanted to automate the entire pipeline from prompt to prediction.
What it does
ML Agent is an AI-powered platform that automatically builds and deploys machine learning models from natural language prompts. A user simply provides a high-level goal (e.g., "create a sentiment classifier for imdb dataset") and the system handles the rest.
The platform:
- Analyzes Data: It first fetches and programmatically analyzes the specified Hugging Face dataset to understand its structure, features, and splits (see the sketch after this list).
- Generates Code: It feeds this dataset summary, along with the user's prompt, into a large language model (GPT-5) to generate a complete, high-quality PyTorch training script.
- Trains in the Cloud: The generated code is immediately executed in a secure, sandboxed environment on a GPU-powered cloud instance (NVIDIA H200).
- Monitors & Deploys: It provides a real-time dashboard where users can monitor training progress, view loss/accuracy charts, inspect the generated code, and, once training is complete, instantly use the model via a generated API endpoint.
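To give a flavor of the first step, here is a minimal sketch of how a Hugging Face dataset can be summarized into the JSON the LLM receives. The function name and exact fields are illustrative, not our actual `dataset_summarizer.py`:

```python
import json
from datasets import load_dataset_builder

def summarize_dataset(dataset_id: str) -> dict:
    """Build a compact, JSON-serializable summary of a Hugging Face dataset."""
    info = load_dataset_builder(dataset_id).info
    return {
        "dataset": dataset_id,
        "features": {name: str(feature) for name, feature in info.features.items()},
        "splits": {name: split.num_examples for name, split in (info.splits or {}).items()},
    }

print(json.dumps(summarize_dataset("imdb"), indent=2))
```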
How we built it
Our stack is designed for rapid development, scalability, and security, leveraging modern AI and cloud tools.
- Backend and Infrastructure: We used Modal as the backbone of our application. It handles all our serverless compute needs, including running the `train` function, hosting `FastAPI` endpoints for the frontend to call, managing shared file storage for models with `modal.Volume`, and securely executing untrusted code with `modal.Sandbox` (a rough sketch of this wiring follows this list).
- AI Code Generation: The "brain" of our agent is the OpenAI API. We use a powerful model (like GPT-5) with carefully engineered prompts. These prompts give the LLM a clear role (expert ML scientist), strict constraints, and crucial context, most importantly the JSON summary of the target dataset, to ensure the generated code is accurate and tailored to the data.
- Machine Learning Framework: The generated code is standardized on PyTorch and the Hugging Face ecosystem (`transformers`, `datasets`, `tqdm`), which provides robust tools for building and training state-of-the-art models.
- Interactive Frontend: We built the user-facing dashboard with Streamlit. It communicates with the Modal backend to start jobs, poll for status updates, and run inference. It dynamically displays the generated code, plots metrics using Pandas, and provides a simple interface for testing the final model.
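A rough sketch of how this wiring can look with Modal. The app, function, and volume names are illustrative stand-ins rather than our exact code, and the GPU type is just an example:

```python
import modal

app = modal.App("ml-agent-sketch")
image = modal.Image.debian_slim().pip_install("torch", "transformers", "datasets", "tqdm")
models = modal.Volume.from_name("ml-agent-models", create_if_missing=True)

@app.function(image=image, gpu="A100", volumes={"/models": models}, timeout=2 * 60 * 60)
def train(script: str, job_id: str):
    """Persist the LLM-generated script to the shared volume and execute it."""
    import pathlib, subprocess
    job_dir = pathlib.Path("/models") / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "train.py").write_text(script)
    subprocess.run(["python", str(job_dir / "train.py")], check=True)
    models.commit()  # make artifacts (losses.csv, meta.json, ...) visible to reader functions

@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint(method="POST")
def start_job(payload: dict):
    """HTTP endpoint the Streamlit frontend calls to kick off an async training run."""
    call = train.spawn(payload["script"], payload["job_id"])
    return {"call_id": call.object_id}
```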
Challenges we ran into
- Ensuring Code Reliability: Getting an LLM to generate Python code that runs perfectly without any modifications is difficult. We overcame this through iterative prompt engineering, where we explicitly defined the required libraries, file paths (`MODELS_DIR`), output formats (like `losses.csv` and `meta.json`), and computational constraints (e.g., A100 GPU).
- Executing Generated Code Securely: Running code written by an LLM poses a security risk. We leveraged Modal's `Sandbox` feature, which executes the training script in a completely isolated environment. This prevents the code from accessing anything outside its designated directory in the shared volume, ensuring the safety of the system (a sketch follows this list).
- Real-time State Management: We needed a way for our Streamlit frontend to get live updates from the asynchronous training job. We solved this by using `modal.Volume` as the single source of truth. The training sandbox writes artifacts like `status.json` and `losses.csv` directly to the volume, and the frontend polls simple reader functions to fetch and display this data in real time.
- Handling Diverse Datasets: Hugging Face datasets vary in their structure. Our `dataset_summarizer.py` script had to be robust enough to parse different config formats, feature types, and split definitions to create a standardized JSON summary that the LLM could reliably understand.
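To make the sandboxing idea concrete, here is a hedged sketch of isolating an LLM-generated script with `modal.Sandbox`. The exact arguments and names our system uses may differ; this is only meant to show the pattern:

```python
import modal

app = modal.App.lookup("ml-agent-sketch", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("torch", "transformers", "datasets", "tqdm")
models = modal.Volume.from_name("ml-agent-models", create_if_missing=True)

def run_generated_script(job_id: str) -> int:
    """Run the generated train.py in an isolated sandbox that only sees the shared volume."""
    sandbox = modal.Sandbox.create(
        "python", f"/models/{job_id}/train.py",
        app=app,
        image=image,
        gpu="A100",                     # illustrative; the post mentions both A100 and H200
        volumes={"/models": models},
        timeout=2 * 60 * 60,
    )
    sandbox.wait()                      # block until the training script exits
    return sandbox.returncode           # non-zero means the run failed
```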
Accomplishments that we're proud of
- True End-to-End Automation: We successfully built a system that goes from a simple text prompt to a deployed, usable ML model with zero manual intervention. This full automation is our biggest accomplishment.
- Context-Aware Code Generation: Our agent doesn't just generate generic code. By first summarizing the dataset and including that summary in the prompt, it produces code that is specifically tailored to the data's columns and structure, making it far more effective (a simplified sketch follows this list).
- Seamless and Intuitive UX: The Streamlit dashboard provides a clean, interactive window into the entire process. Users can watch their model come to life, from the initial code generation to the live-updating performance charts, creating a "magical" user experience.
- Efficient Use of Modern Tooling: We are proud of how we effectively combined Modal, OpenAI, and Streamlit. Modal, in particular, allowed us to build a complex, distributed system with features like sandboxing and shared storage without getting bogged down in traditional infrastructure management.
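As an illustration of that context-aware generation, a prompt can be assembled roughly like this. The system prompt wording here is a simplified stand-in for our actual prompts:

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an expert ML scientist. Write a single, complete, runnable PyTorch "
    "training script using transformers/datasets/tqdm. Save the model under "
    "MODELS_DIR and write losses.csv and meta.json when training finishes."
)

def generate_training_script(user_goal: str, dataset_summary: dict) -> str:
    """Ask the LLM for a training script tailored to the summarized dataset."""
    user_message = (
        f"{user_goal}\n\n"
        f"Dataset summary (JSON):\n{json.dumps(dataset_summary, indent=2)}"
    )
    response = client.chat.completions.create(
        model="gpt-5",  # the model named in this write-up; any capable model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```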
What we learned
- Prompt Engineering is Everything: The success of an LLM-based agent hinges on the quality of its prompts. Providing the LLM with a clear role, constraints, and structured context (like our dataset summary) is non-negotiable for achieving reliable and accurate results.
- Declarative Infrastructure Accelerates Development: Using a framework like Modal, where infrastructure is defined as code alongside the application logic, was a game-changer. It saved us countless hours we would have spent on Docker, Kubernetes, and cloud provider configurations.
- Shared State is Crucial for Asynchronous Systems: For a decoupled architecture with a web client and an async backend worker, a simple and reliable shared state mechanism is essential. `modal.Volume` and `modal.Dict` served this purpose perfectly, acting as the bridge between our components (a minimal sketch follows this list).
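A minimal sketch of that shared-state pattern, assuming a thin reader function the frontend polls; the function name and file paths are illustrative:

```python
import json
import modal

app = modal.App("ml-agent-sketch")
models = modal.Volume.from_name("ml-agent-models", create_if_missing=True)

@app.function(volumes={"/models": models})
def read_status(job_id: str) -> dict:
    """Cheap reader the Streamlit frontend polls for live progress."""
    models.reload()  # pick up the training sandbox's latest writes
    try:
        with open(f"/models/{job_id}/status.json") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"state": "pending"}

# From the Streamlit side, something like:
#   read_status = modal.Function.from_name("ml-agent-sketch", "read_status")
#   status = read_status.remote(job_id)
```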
What's next for ML Agent
- Support for More ML Tasks: We plan to expand beyond classification and support more complex tasks like image segmentation, text generation, summarization, and translation, which will require more advanced prompt strategies.
- Model Evaluation and Comparison: We want to add a feature allowing users to run several training jobs for the same prompt (e.g., with different base models) and then compare their performance metrics and inference speeds side-by-side in the dashboard.
- Interactive Code Refinement: Give users the option to review the LLM-generated code and make minor edits within the UI before kicking off the training run. This would provide a "human-in-the-loop" capability for advanced users.
- Cost and Performance Optimization: Implement logic to have the agent reason about the user's prompt and dataset size to suggest or automatically choose the most cost-effective model architecture and training configuration.
What's next for ML Agent: Abusing Parallelism 🚀
The current architecture is perfect for leveraging massive parallelism to unlock more advanced capabilities. Instead of running one job at a time, we can "abuse" the serverless nature of Modal to run hundreds of jobs concurrently.
Automated Hyperparameter Sweeps
A user could provide a prompt and a search space, like "find the best learning rate between $10^{-5}$ and $10^{-3}$ for this text classifier." The agent would then:
- Generate a base training script.
- Use Modal's `.map()` functionality to spawn dozens or hundreds of parallel training runs, each with a different learning rate (a rough sketch follows this list).
- A final function would automatically collect the metrics from all runs, identify the best-performing model, and present it to the user.
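A hedged sketch of what such a sweep could look like; `train_with_lr` is a placeholder for running the generated script with the learning rate injected:

```python
import modal

app = modal.App("ml-agent-sweep-sketch")

@app.function(gpu="A100", timeout=60 * 60)
def train_with_lr(lr: float) -> dict:
    """Placeholder: run the generated training script with this learning rate
    and return its final validation metrics."""
    val_accuracy = 0.0  # in the real system this comes from the trained model
    return {"lr": lr, "val_accuracy": val_accuracy}

@app.local_entrypoint()
def sweep():
    # 20 learning rates spaced logarithmically between 1e-5 and 1e-3.
    lrs = [10 ** (-5 + 2 * i / 19) for i in range(20)]
    results = list(train_with_lr.map(lrs))  # one container per learning rate
    best = max(results, key=lambda r: r["val_accuracy"])
    print("best run:", best)
```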
Ensemble Modeling on Demand
Why settle for one model when you can have many? The agent could automatically train a diverse set of models for the same task.
- It could train different architectures (e.g., DistilBERT, a GRU-based model, and a simple TF-IDF classifier) in parallel.
- Once all models are trained, another function would automatically create a final "ensemble" model that combines their predictions, often leading to significantly higher accuracy and robustness (a minimal soft-voting sketch follows this list).
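A minimal sketch of the combination step, assuming each parallel run has already produced class probabilities on the same evaluation set (soft voting):

```python
import numpy as np

def ensemble_predict(per_model_probs: list[np.ndarray]) -> np.ndarray:
    """Average class probabilities from several models and take the argmax.

    per_model_probs: one (n_samples, n_classes) array per trained model,
    e.g. from DistilBERT, a GRU model, and a TF-IDF classifier.
    """
    averaged = np.mean(np.stack(per_model_probs, axis=0), axis=0)
    return averaged.argmax(axis=1)
```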
Generative Neural Architecture Search (NAS)
We can take code generation to the next level. For a given prompt, the agent could ask the LLM to propose not one, but 50 different small variations of a neural network architecture. It would then train all 50 in parallel and report back on which architecture performed best on the validation set. This turns our tool into an automated architecture discovery engine.
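A rough sketch of how the proposal step could work; the prompt wording, delimiter, and count are illustrative, and the returned snippets would feed the same parallel training fan-out sketched above:

```python
from openai import OpenAI

def propose_architectures(task_description: str, n: int = 50) -> list[str]:
    """Ask the LLM for n small architecture variants as separate nn.Module definitions."""
    client = OpenAI()
    prompt = (
        f"Propose {n} small PyTorch model variants for this task: {task_description}. "
        "Return one complete nn.Module definition per variant, separated by '###'."
    )
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return [chunk.strip() for chunk in response.choices[0].message.content.split("###") if chunk.strip()]
```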
Dashboard for Parallel Jobs
The Streamlit dashboard would be updated to manage these "meta-jobs." It would provide a top-level view of the entire parallel run, showing a ranked list of all child jobs by their final validation accuracy, allowing the user to easily identify the champion model from the crowd.
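A small sketch of that leaderboard view in Streamlit; the rows below are placeholders for the per-run metrics that would be collected from each child job's artifacts:

```python
import pandas as pd
import streamlit as st

# Placeholder rows; in practice these would come from each child job's meta.json.
runs = pd.DataFrame([
    {"job_id": "run-01", "lr": 1e-4, "val_accuracy": 0.91},
    {"job_id": "run-02", "lr": 3e-4, "val_accuracy": 0.89},
    {"job_id": "run-03", "lr": 1e-3, "val_accuracy": 0.86},
])

st.subheader("Parallel run leaderboard")
st.dataframe(runs.sort_values("val_accuracy", ascending=False))
```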