Inspiration

Current Vision Language Models (VLMs), despite their ability to learn "common sense" from internet data and handle long-tail cases, fall short in autonomous driving applications due to two critical limitations: they can't process high frame-rate video inputs, and they can't meet real-time latency requirements. This gap has significant implications for real-world applications:

  • Autonomous vehicle development needs better analysis of edge cases
  • Driving schools require objective assessment tools
  • Insurance companies seek efficient claim verification methods
  • Law enforcement needs automated traffic violation analysis

What it does

TeslaQA automatically analyzes traffic videos and answers questions about driver behavior, traffic rules, and road safety. It first converts video content into descriptive text using BLIP, then leverages ChatGPT to answer questions about those descriptions, so the system can provide detailed, grounded explanations of traffic scenarios.

How we built it

  • Used BLIP to convert video frames into detailed text descriptions
  • Employed ChatGPT API to analyze these descriptions and answer specific questions
  • Created a pipeline that processes 5-second traffic videos
  • Developed custom prompt engineering to ensure accurate and relevant responses
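The steps above can be sketched as a small pipeline: sample frames from the clip, caption each with BLIP, then hand the captions plus a question to the ChatGPT API. This is an illustrative sketch, not our exact code; the one-frame-per-second rate, the `Salesforce/blip-image-captioning-base` checkpoint, and the helper names are assumptions.

```python
# Sketch of the frames -> BLIP captions -> ChatGPT answer pipeline.
# Helper names, frame rate, and model choices are illustrative assumptions.

def caption_frames(video_path: str, seconds: int = 5) -> list:
    """Grab one frame per second of the clip and caption each with BLIP."""
    import cv2                                    # third-party: opencv-python
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    captions = []
    for sec in range(seconds):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(sec * fps))
        ok, frame = cap.read()
        if not ok:
            break
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    cap.release()
    return captions

def build_question(captions: list, question: str) -> str:
    """Assemble the per-second captions and the user question into one prompt."""
    scene = "\n".join(f"t={i}s: {c}" for i, c in enumerate(captions))
    return (
        "Frame-by-frame description of a 5-second traffic video:\n"
        f"{scene}\n\nQuestion: {question}\nAnswer concisely, citing the frames."
    )

def answer(captions: list, question: str) -> str:
    """Send the assembled prompt to the ChatGPT API (assumes OPENAI_API_KEY is set)."""
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_question(captions, question)}],
    )
    return resp.choices[0].message.content
```

Indexing captions by second (`t=0s`, `t=1s`, ...) lets the language model refer back to specific moments when explaining its answer.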

Challenges we ran into

  • Initially struggled to get the TimeSformer and BLIP models to handle video understanding
  • Faced issues with different video frame rates and lengths
  • Had to optimize the text descriptions to be both concise and informative
  • Needed to carefully design prompts to get consistent, accurate answers
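One way to handle the varying frame rates and lengths is to sample a fixed number of evenly spaced frames from each clip, so every video reduces to the same amount of captioning work. A minimal sketch (the function name and the 8-frame budget are our illustrative choices):

```python
# Uniform frame sampling to normalize videos of different frame rates
# and lengths to a fixed number of caption frames.

def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list:
    """Pick `num_samples` evenly spaced frame indices from a video."""
    if total_frames <= 0:
        return []
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each segment so the sampling is stable
    # regardless of the source frame rate.
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 5-second clip at 30 fps (150 frames) and at 24 fps (120 frames)
# both reduce to 8 representative frames.
print(sample_frame_indices(150))  # [9, 28, 46, 65, 84, 103, 121, 140]
```

Midpoint sampling avoids always picking the first frame of each segment, which can be a duplicate of the previous segment's last frame at low frame rates.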

Accomplishments that we're proud of

  • Successfully combined computer vision and language models
  • Created an interpretable system that can explain its reasoning
  • Achieved accurate answers for complex traffic scenarios
  • Built a scalable solution that can handle various traffic situations

What we learned

  • The importance of model selection for specific tasks
  • How to effectively combine multiple AI models
  • The value of converting visual data to text for better interpretability
  • Techniques for prompt engineering with GPT models
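To make the prompt-engineering lesson concrete, the template below shows the kind of structure that pushed answers toward consistency: a fixed role, an explicit instruction not to guess beyond the descriptions, and a required answer format. The wording here is a simplified stand-in, not our production prompt.

```python
# Illustrative prompt template; the role text, answer format, and field
# names are simplified stand-ins for the actual project prompt.

TRAFFIC_QA_TEMPLATE = """You are a traffic-safety analyst.
You will receive text descriptions of video frames, one per second.
Only use facts stated in the descriptions; if something is not
mentioned, say "not visible" rather than guessing.

Descriptions:
{descriptions}

Question: {question}

Answer in two parts:
1. Verdict: yes / no / not visible
2. Reasoning: cite the relevant seconds (e.g. "at t=2s ...")."""

def render_prompt(descriptions: str, question: str) -> str:
    return TRAFFIC_QA_TEMPLATE.format(descriptions=descriptions, question=question)
```

Forcing a verdict-plus-reasoning format makes the answers easy to parse and keeps the model's explanation tied to specific frames.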

What's next for TeslaQA

  1. Fine-tune BLIP model for specialized traffic understanding:

    • Generate more precise descriptions of vehicle movements, signals, and road conditions
    • Better capture critical driving behaviors and safety moments
    • Develop traffic-specific visual attention mechanisms
  2. Enhance text summary quality through:

    • Adding detailed spatial relationships between vehicles
    • Incorporating multi-frame temporal understanding
    • Building specialized traffic vocabulary and context
  3. Technical Improvements:

    • Integrate with existing traffic monitoring systems
    • Develop APIs for easy integration
    • Create efficient frame selection algorithms
    • Optimize for real-time processing
  4. Model Enhancement:

    • Train on larger, more diverse traffic datasets
    • Implement more sophisticated prompting strategies
    • Add multi-camera support
    • Develop hierarchical summarization capabilities

Our goal is to make roads safer by providing better tools for understanding and analyzing traffic scenarios, ultimately contributing to the development of more reliable autonomous driving systems and better driver education.

The Kaggle link is listed below, and our team name is 'passionfruit'.

Built With

  • api
  • blip
  • jupyternotebook
  • openai
  • python
  • tesla
  • vlm