Inspiration
Le Tigre grew out of the desire to build a compact yet powerful system similar to GPT-4V, leveraging the strengths of Mistral's models. We aimed to create a versatile model capable of real-time reasoning across audio, vision, and text, making Mistral more accessible and efficient. We named it Le Tigre as an upgrade to Le Chat.
What it does
Le Tigre integrates speech recognition, vision, and text-to-speech capabilities to offer a comprehensive multimodal AI solution. It can process and interpret audio inputs, detect and analyze visual elements, and generate descriptive and contextual text outputs, all in real time.
How we built it
- Audio: Utilized a Speech-to-Text model native to Apple devices, with inference performed on the frontend using a SwiftUI iOS app.
- Vision:
  - YOLO (Object Detection): Implemented a YOLO v7 model fine-tuned on OpenImages v7, capable of recognizing 600 classes.
  - OCR (Text Extraction): Leveraged the OCR API from Google Vision for text extraction, especially for mathematical content.
  - Combined the outputs of these models to feed into LLM_1.
- LLM_1: Used Mistral 7B Instruct v0.3, fine-tuned to generate advanced image descriptions from JSON inputs.
- LLM_2: A second Mistral 7B Instruct v0.3 model, fine-tuned to produce more natural conversation. We kept the two models separate because each one is specialized for a single task.
- Fine-tuning: Developed synthetic data for specific use cases through a process called ft_chain_1.
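
To make the vision step more concrete, here is a minimal sketch of how detector and OCR outputs could be merged into a single JSON input for LLM_1. The names `Detection`, `TextBlock`, and `build_scene_json`, the JSON keys, and the `min_conf` threshold are all assumptions for illustration; the schema actually used in the project is not described here.

```python
import json
from dataclasses import dataclass


@dataclass
class Detection:
    label: str          # class name from the object detector (hypothetical field)
    confidence: float   # detector score in [0, 1]
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class TextBlock:
    text: str           # string returned by the OCR step (hypothetical field)
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels


def build_scene_json(detections, text_blocks, width, height, min_conf=0.3):
    """Merge detector and OCR outputs into one JSON string for the describing LLM."""

    def normalize(box):
        # Scale pixel coordinates to [0, 1] so the prompt is resolution independent.
        x0, y0, x1, y1 = box
        return [round(x0 / width, 3), round(y0 / height, 3),
                round(x1 / width, 3), round(y1 / height, 3)]

    # Drop low-confidence detections and list the most confident ones first.
    kept = sorted(
        (d for d in detections if d.confidence >= min_conf),
        key=lambda d: d.confidence,
        reverse=True,
    )
    objects = [
        {"label": d.label, "confidence": round(d.confidence, 2), "box": normalize(d.box)}
        for d in kept
    ]
    texts = [{"text": t.text, "box": normalize(t.box)} for t in text_blocks]
    return json.dumps({"objects": objects, "texts": texts}, ensure_ascii=False)
```

Normalizing the boxes keeps the LLM input independent of the camera resolution, and filtering weak detections is one simple way to limit the noise the language model has to reason over.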

Challenges we ran into
- Finding suitable open-weight OCR models for our specific needs.
- Fine-tuning the LLM to accurately reconstruct image scenes from coordinates.
- Generating synthetic data for our very specific use case to ensure the model performed well in real-world applications.
- Balancing model performance, size, and hallucinations.
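
As an illustration of the synthetic-data challenge above, the sketch below generates (JSON input, target description) pairs from random bounding boxes, the kind of pair a model could be fine-tuned on to describe a scene from coordinates. It is not our actual fine-tuning pipeline: the label list, the `make_sample` helper, and the description template are hypothetical.

```python
import json
import random

LABELS = ["person", "dog", "laptop", "cup"]  # hypothetical subset of detector classes


def horizontal_position(box):
    """Map a normalized box to a coarse position phrase using its center x."""
    center_x = (box[0] + box[2]) / 2
    if center_x < 1 / 3:
        return "on the left"
    if center_x < 2 / 3:
        return "in the center"
    return "on the right"


def make_sample(rng, n_objects=2):
    """Create one (JSON input, target description) pair from random boxes."""
    objects = []
    for _ in range(n_objects):
        # Sample a top-left corner and a size so the box stays inside [0, 1].
        x0, y0 = rng.uniform(0, 0.7), rng.uniform(0, 0.7)
        w, h = rng.uniform(0.1, 0.3), rng.uniform(0.1, 0.3)
        box = [round(x0, 3), round(y0, 3), round(x0 + w, 3), round(y0 + h, 3)]
        objects.append({"label": rng.choice(LABELS), "box": box})

    # Template-based target text derived only from the coordinates.
    parts = [f"a {o['label']} {horizontal_position(o['box'])}" for o in objects]
    description = "The image shows " + " and ".join(parts) + "."
    return {"input": json.dumps({"objects": objects}), "target": description}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated dataset is reproducible
    for _ in range(3):
        print(make_sample(rng))
```

Because the target text is derived directly from the coordinates, each pair is consistent by construction. A purely templated dataset is also repetitive, so in practice it would likely need more varied phrasing to generalize to real scenes.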
Accomplishments that we're proud of
- Successfully integrating multiple modalities (audio, vision, text) into a cohesive and efficient system. Not bad for an MVP.
- Fine-tuning a compact model to achieve high accuracy in generating contextual image descriptions.
- Building a functional and aesthetic iOS application with Xcode.
- Creating a versatile AI model that performs well in real-time applications while being resource-efficient.
What we learned
- The importance of tailored fine-tuning to improve model performance for specific tasks.
- Effective strategies for combining outputs from different models to enhance overall system capabilities.
- Techniques for generating synthetic data to support the fine-tuning process and improve model accuracy.
What's next for Le Tigre
- Further refinement of the model to reduce hallucinations and improve stability.
- Expanding the dataset for fine-tuning to cover more diverse and complex scenarios.
- Enhancing the user interface and experience of the SwiftUI iOS app to make it more intuitive and user-friendly.
- Exploring additional applications and industries where Le Tigre can be effectively deployed.