Inspiration

I have been scaling a logistics business over the past year and working closely with fleets, warehouses, and real world operations. Being on the ground made me realize how much valuable data exists in everyday physical activity but is never captured or structured in a usable way. When AI for the internet was built, it had the web as a clean, organized data layer. Physical AI has no such foundation. That gap became the inspiration. This led to capturing real world sensor data and using models like Gemini to make it searchable and understandable, with the goal of building the internet for Physical AI.

What it does

The platform works on multimodal data from real-world environments, including synchronized camera, LiDAR, telemetry, and in-cabin audio streams. This data is aligned on a single timeline and structured into searchable intelligence.
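As a rough illustration of what aligning streams on a single timeline can look like, here is a minimal sketch that matches each camera timestamp to the nearest sample from every other stream. The function names, the `max_gap_s` tolerance, and the data layout are assumptions made for this example, not the platform's actual pipeline.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the timestamp closest to t in a sorted list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t (ties go to the earlier sample).
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_reference(reference_ts, streams, max_gap_s=0.05):
    """Build one row per reference timestamp with the nearest sample per stream.

    streams maps a stream name to a time-sorted list of (timestamp_s, payload).
    A stream contributes None when its nearest sample is further away than
    max_gap_s (a hypothetical tolerance chosen for this sketch).
    """
    stream_ts = {name: [ts for ts, _ in samples] for name, samples in streams.items()}
    rows = []
    for t in reference_ts:
        row = {"t": t}
        for name, samples in streams.items():
            if not samples:
                row[name] = None
                continue
            ts, payload = samples[nearest_index(stream_ts[name], t)]
            row[name] = payload if abs(ts - t) <= max_gap_s else None
        rows.append(row)
    return rows
```

Using the camera as the reference clock is only one possible choice; any stream with a steady rate could play that role.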

Using models like Gemini, engineers can describe a situation in natural language and instantly retrieve the exact moments across hours of footage. It enables analysis of why a vehicle reacted a certain way, where edge cases occur, and what patterns led to specific decisions.
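To make the retrieval idea concrete, here is a minimal sketch of ranking time-stamped moments against a query by cosine similarity. It assumes each moment already has an embedding vector; the three-dimensional vectors in the usage example are toy values, not real model embeddings, and `search` is a hypothetical helper rather than the platform's actual query path.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, top_k=3):
    """Rank indexed moments by similarity to the query embedding.

    index is a list of (start_s, end_s, description, embedding) tuples.
    Returns up to top_k tuples of (score, start_s, end_s, description).
    """
    scored = [
        (cosine(query_vec, emb), start, end, desc)
        for start, end, desc, emb in index
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]


# Toy index: in practice the embeddings would come from an embedding model.
index = [
    (12.0, 15.5, "motorbike cuts in", [0.9, 0.1, 0.0]),
    (40.0, 44.0, "driver brakes hard", [0.1, 0.9, 0.0]),
    (80.0, 83.0, "pedestrian crossing", [0.0, 0.1, 0.9]),
]
hits = search([1.0, 0.0, 0.0], index, top_k=2)
```

A vector database would typically replace the linear scan once the index grows beyond a few thousand moments.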

It also supports behavior understanding, intent extraction from speech, and frame-by-frame scene analysis, making large volumes of multimodal data usable for teams building Physical AI systems.

How we built it

To begin with, I wrote all of the code using a combination of Google AI Studio’s vibe-coding features and Google Antigravity.

For each product, I gave Gemini the most “useful” data that I could:

1) For the audio, I used a diffusion model to denoise it before handing it over to Gemini for translation and transcription.

2) For the video, I started by detecting all objects with YOLO bounding boxes, then algorithmically tracking them to estimate their velocity vectors. I then passed these to Gemini as pairs so that it could draw inferences by treating them as time-series data.

This lets Gemini interpret a rapid change in a motorbike’s velocity as the bike cutting in, or as dangerous driving. Combined with the audio, this gives near-perfect semantic retrieval.
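The velocity-vector step in the video pipeline can be sketched roughly as follows. This assumes a tracker has already grouped detections into per-object tracks of `(timestamp_s, (x1, y1, x2, y2))` boxes; the function names and the text format passed to the model are hypothetical, and velocities are in pixels per second rather than real-world units.

```python
def centroid(box):
    """Centre point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def velocity_series(track):
    """Finite-difference velocity of a tracked object's centroid.

    track is a time-sorted list of (timestamp_s, box).
    Returns a list of (timestamp_s, (vx, vy)) in pixels per second.
    """
    series = []
    for (t0, box0), (t1, box1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order frames
        (cx0, cy0), (cx1, cy1) = centroid(box0), centroid(box1)
        series.append((t1, ((cx1 - cx0) / dt, (cy1 - cy0) / dt)))
    return series


def to_prompt_lines(track_id, label, series):
    """Serialise a velocity series as plain text lines for a model prompt."""
    return [
        f"t={t:.2f}s id={track_id} class={label} vx={vx:.1f} vy={vy:.1f}"
        for t, (vx, vy) in series
    ]
```

A sudden jump between consecutive `vx` values in the serialised lines is the kind of signal the paragraph above describes as a cut-in.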

3) Delby Cortex was the most innovative goal: I tried to condition a predictive model on Gemini’s output to predict how an object would behave. To do this, I used a U-Net with a transformer at the bottleneck (the bend of the U). While this is too computationally expensive for now, it gave me the best results when annotating our video.

Challenges we ran into

The ultimate goal is to pass annotated data to Gemini and use its outputs to condition subsequent models.

Labelling the data before passing it to Gemini is the easy part; using Gemini’s output to condition the later models is very tricky.

An approach that worked very well was to downsample the data as much as possible through convolutional layers, apply self-attention at the lowest resolution, and then upsample back.

However, inference with this model proved too slow to be practical. I am currently experimenting with a “wake word” based approach, where a smaller quantized model identifies the frames of interest based on Gemini’s output (Gemini provides a confidence score along with its answer), and we wake up the U-Net only intermittently.
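The gating idea can be sketched as a simple loop in which a cheap scorer decides when the expensive model runs. Everything here is illustrative: `cheap_score` stands in for the small quantized model, `heavy_model` for the U-Net, and the `threshold` and `cooldown` parameters are assumptions for this example. How Gemini's confidence score feeds the small model is still being worked out, so it is left abstract.

```python
def run_gated(frames, cheap_score, heavy_model, threshold=0.6, cooldown=0):
    """Run heavy_model only on frames flagged by a cheap scorer.

    cheap_score(frame) returns a value in [0, 1]; a frame scoring at or above
    threshold wakes the heavy model for that frame plus `cooldown` following
    frames. Returns a list of (frame_index, heavy_output) for processed frames.
    """
    outputs = []
    frames_left_awake = 0
    for i, frame in enumerate(frames):
        if cheap_score(frame) >= threshold:
            frames_left_awake = cooldown + 1
        if frames_left_awake > 0:
            outputs.append((i, heavy_model(frame)))
            frames_left_awake -= 1
    return outputs
```

With `cooldown=0` the heavy model sees only the flagged frames; a small positive cooldown keeps it awake briefly so that it can follow an event after the trigger.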

This was a very challenging technical problem that was a delight to experiment with.

Accomplishments that we're proud of

I am extremely proud that I could get Gemini’s scene understanding to be perfect by passing in annotated data. This lets me make Gemini truly multimodal in such applications.

While the other end, using Gemini’s output in subsequent models, is not yet practical, I am extremely happy that it achieves 7-8% better accuracy at predicting the next frame’s velocity vectors. This is a major step up from the state of the art.

What we learned

We use the internet so much that we take its speed for granted, but building this made me truly appreciate how much engineering goes into it.

AI is a useful tool, but the problem solving behind building such a high-scale application taught me a lot:

1) I learned how to use layer embeddings to accelerate models.

2) I became very proficient with TensorFlow.

3) I can diagnose the training-time behaviour of a transformer.

4) I can use advanced data structures to store and retrieve multimodal data (I even have some ideas on how to adapt spanning trees for this purpose).

What's next for DELBY INTELLIGENCE

The next step is scaling the data layer and building more advanced applications on top of it. Starting with semantic search across multimodal data, the platform will expand into behavior prediction, intent understanding, and decision support tools for teams building Physical AI systems.

As more real world datasets come online across fleets, roads, factories, and logistics networks, the goal is to create a foundational intelligence layer that helps machines better understand the physical world. Over time, this can contribute to safer roads, more reliable mobility, smarter manufacturing, and more efficient supply chains, ultimately supporting systems that improve safety and productivity, reduce accidents, and enhance quality of life at scale.

Built With

  • aistudio
  • antigravity
  • chromadb
  • gemini
  • python
  • ros2
  • streamlit
  • tensorflow