Inspiration
Precise, high-quality datasets are critical to AI development. We saw an opportunity to use Microsoft Fabric’s ecosystem to build a solution that automates and customizes real-world multimodal dataset generation, helping developers save time and improve the quality of their AI models.
What it does
InfraGen is a cloud-native data pipeline that automatically generates diverse, user-specific image datasets. It combines structured and unstructured data, AI-enhanced labeling, and multimodal integration to produce high-quality datasets ready for model training, testing, and validation.
How we built it
Using Microsoft Fabric, we integrated a data lake to store tables and datasets, and used Azure OpenAI GPT-4 for advanced data processing. By using Microsoft Fabric’s data science features, such as notebooks and machine learning models, alongside multimodal AI tools like CLIP, we implemented classification and detection for accurate image categorization. With integrated Power BI and AI Copilot, we can extend multimodal insights and perform real-time visual analysis on the generated datasets. The pipeline is designed to be modular, scalable, and adaptable to diverse data needs.
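To illustrate the kind of CLIP-style categorization described above, here is a minimal sketch of zero-shot labeling by embedding similarity. It is not InfraGen's actual code: the label names and the three-dimensional vectors are hypothetical toy values standing in for real image and text embeddings, and the function names are our own for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=1.0):
    """Turn similarity scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_label(image_embedding, label_embeddings, temperature=0.01):
    """Pick the label whose text embedding is closest to the image embedding."""
    labels = list(label_embeddings)
    sims = [cosine(image_embedding, label_embeddings[name]) for name in labels]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))

# Hypothetical toy embeddings; a real model would produce much longer vectors.
LABEL_EMBEDDINGS = {
    "bridge": [0.9, 0.1, 0.0],
    "road": [0.1, 0.9, 0.1],
    "power line": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, probs = zero_shot_label(image_embedding, LABEL_EMBEDDINGS)
print(label)
```

In a real pipeline the embeddings would come from a multimodal model, and the candidate labels would come from the user's dataset specification; only the similarity and ranking step is shown here.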
Challenges we ran into
We faced challenges integrating different AI models smoothly into a single pipeline, because each model requires a different environment to run. Ensuring real-time data retrieval and processing for efficient performance was also a challenge. Making the pipeline modular and adaptable across diverse AI models was another complex task we had to tackle.
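One common way to approach this kind of modularity is to hide every model behind the same small interface, so stages can be swapped or reordered without touching the rest of the pipeline. The sketch below shows that idea under our own assumptions; the `Stage` and `Pipeline` names and the placeholder stage functions are hypothetical and are not taken from InfraGen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]

@dataclass
class Stage:
    """A named pipeline step that maps one record to an enriched record."""
    name: str
    run: Callable[[Record], Record]

@dataclass
class Pipeline:
    """Runs stages in order; each stage sees the previous stage's output."""
    stages: List[Stage] = field(default_factory=list)

    def add(self, stage: Stage) -> "Pipeline":
        self.stages.append(stage)
        return self

    def process(self, record: Record) -> Record:
        out = dict(record)  # copy so the caller's record is not mutated
        for stage in self.stages:
            out = stage.run(out)
        return out

# Placeholder stages; a real stage could call a model hosted in its own environment.
def describe(record: Record) -> Record:
    return {**record, "caption": f"placeholder caption for {record['image_id']}"}

def classify(record: Record) -> Record:
    return {**record, "label": "unlabeled"}

pipeline = Pipeline().add(Stage("describe", describe)).add(Stage("classify", classify))
result = pipeline.process({"image_id": "img-001"})
print(result)
```

Because each stage only receives and returns a plain record, a stage's `run` callable could wrap a call to a separate process or service, which is one way to keep models with conflicting environments apart.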
Accomplishments that we're proud of
We’re proud of using machine learning models and the Fabric platform to create a robust, end-to-end pipeline that simplifies dataset generation for AI developers. By incorporating multimodal AI, retrieval-augmented generation, and a scalable architecture into Microsoft Fabric, we made InfraGen a valuable tool for handling diverse and specialized data needs.
What we learned
We gained a deeper understanding of Microsoft Fabric’s ecosystem and the flexibility it offers for integrating AI and data science tools. We also learned the importance of modular design and adaptability in building solutions that can meet a wide range of data requirements.
What's next for InfraGen
We plan to expand InfraGen’s capabilities by incorporating more data types and adding features like automated data labeling. Future development includes optimizing the pipeline for real-time data updates and improving interoperability with other AI tools such as Azure Machine Learning, making InfraGen more versatile across AI applications.