Gemini-Powered Edge-to-Cloud Hierarchy for Autonomous Disaster Response

When disaster strikes, every second counts, yet most rescue drones still rely on manual human monitoring. Our project transforms disaster response by integrating Google Gemini with edge-based vision-language models to create an end-to-end autonomous monitoring system. The drone operates on a dual-layer AI architecture. At the edge, it runs a fine-tuned YOLOE model on a Raspberry Pi for real-time object detection. The breakthrough happens at the server level, where we leverage Gemini's multimodal capabilities. While the drone identifies targets like "collapsed bridges" or "medical kits" via natural language prompts, Gemini acts as the central reasoning engine: it analyzes high-resolution aerial feeds to identify complex hazards, generates automated scene descriptions, and dynamically updates the drone's mission parameters based on visual cues it "sees" in the disaster zone. Through Gemini's advanced reasoning, the system moves beyond simple detection to true situational awareness, mapping flood zones and suggesting rescue paths in real time. This project demonstrates how Gemini can bridge the gap between raw aerial data and actionable, life-saving intelligence, providing a scalable, low-cost solution for search and rescue teams navigating the world's most dangerous environments.

Inspiration

Every second counts after a natural disaster. In 2024 alone, natural hazards affected 167 million people. While drones are often used for aerial surveys, they typically rely on human eyes to spot trouble, or on rigid AI models that only recognize a handful of classes. We were inspired to build a "thinking" drone system with Gemini as the brain: one that doesn't just see pixels but understands natural language, allowing rescue teams to find specific, unpredictable items like "a red medical kit in the mud" or "a person waving from a roof", or potential hazards such as "gas tanks" and "flooded areas", without needing to retrain the AI.

What it does

The project is a dual-layer search and rescue system:

**The Edge (YOLOE):** A Raspberry Pi on the drone runs a custom-trained, open-vocabulary model. It filters live video for "trigger objects" (e.g., medical kits, life vests, structural damage) to save bandwidth.

**The Brain (Google Gemini):** When a trigger is hit, the high-res frame is sent to our server. Gemini performs multimodal analysis to:

**Reason about Hazards:** "I see a person on a roof; the water level is rising near the power lines. This is a high-priority rescue." Gemini generates comprehensive natural-language descriptions of the scene, specifically identifying and evaluating potential hazards to keep the rescue team safe.

**Generate Dynamic Prompts:** Gemini tells the drone to hold position or to "Zoom in on the north-west corner of the building" to check for structural cracks. It goes beyond simple detection by reasoning about contextual clues; for instance, if it identifies a tent or a backpack in a disaster zone, it autonomously commands the drone to prioritize searching the immediate vicinity for survivors or vehicles.

**Automated Mapping:** Gemini outputs structured JSON containing GPS coordinates and hazard levels to instantly populate a Google Map, marking every detected object so rescue teams can locate targets at a glance (see the sketch below).
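
As an illustration of the mapping step, here is a minimal sketch of requesting structured output from the Gemini API via the Google GenAI SDK. The `DetectedObject` schema, file name, and prompt are our assumptions for the example, not fixed parts of the pipeline:

```python
# Minimal sketch: ask Gemini for structured JSON suitable for map markers.
# The DetectedObject schema below is illustrative, not the project's exact one.
from google import genai
from google.genai import types
from pydantic import BaseModel

class DetectedObject(BaseModel):  # hypothetical schema for map markers
    label: str        # e.g. "person on roof", "collapsed bridge"
    latitude: float
    longitude: float
    hazard_level: int  # 1 (low) to 10 (critical)

client = genai.Client()  # reads the API key from the environment

with open("frame_0042.jpg", "rb") as f:
    frame = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

response = client.models.generate_content(
    model="gemini-1.5-flash",
    contents=[frame, "List every survivor and hazard visible in this aerial frame."],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[DetectedObject],  # forces schema-valid JSON
    ),
)
markers: list[DetectedObject] = response.parsed  # ready for the map layer
```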

How we built it

**Hardware Stack:** Custom quadcopter powered by a Pixhawk Flight Controller (ArduPilot) and a Raspberry Pi 5 for edge processing.
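
For context on how the Pi talks to the Pixhawk, below is a minimal pymavlink sketch assuming a serial MAVLink link; the port and baud rate depend on the actual wiring:

```python
# Minimal sketch: Raspberry Pi companion computer talking to the Pixhawk
# over MAVLink. Serial port and baud rate are assumptions about the wiring.
from pymavlink import mavutil

master = mavutil.mavlink_connection("/dev/ttyAMA0", baud=57600)
master.wait_heartbeat()  # block until the Pixhawk announces itself
print(f"Connected to system {master.target_system}, component {master.target_component}")

# Read GPS so detections can be geotagged before upload to the server.
msg = master.recv_match(type="GLOBAL_POSITION_INT", blocking=True)
lat, lon = msg.lat / 1e7, msg.lon / 1e7  # MAVLink reports degrees * 1e7
print(f"Drone position: {lat:.6f}, {lon:.6f}")
```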

**Vision-Language Model:** We fine-tuned YOLOE on the AIDER dataset, enabling the drone to recognize disaster-specific objects from a top-down (90-degree) aerial perspective.
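
Open-vocabulary prompting is what lets us swap trigger objects without retraining. Here is a minimal sketch using the Ultralytics YOLOE interface; the checkpoint name, prompt list, and confidence threshold are illustrative:

```python
# Minimal sketch: open-vocabulary trigger filtering on the Raspberry Pi.
# Checkpoint and trigger prompts are illustrative choices.
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")  # small variant suits edge inference

# Natural-language trigger classes: change these without any retraining.
triggers = ["medical kit", "life vest", "collapsed bridge", "person waving"]
model.set_classes(triggers, model.get_text_pe(triggers))

results = model.predict("aerial_frame.jpg", conf=0.35)
for box in results[0].boxes:
    label = triggers[int(box.cls)]
    print(f"Trigger hit: {label} ({float(box.conf):.2f}) -> upload frame to server")
```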

**Gemini Integration:** We used the Google GenAI SDK to connect our server to the Gemini API. We leveraged Gemini's 1-million-token context window to keep a "mission log": Gemini remembers every image the drone has seen during the flight, allowing it to spot trends (like rising water levels over time), as sketched below.
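
A minimal sketch of that mission log, modeled as a multi-turn chat so every prior frame stays in context; the prompt wording and helper signature are our own illustrative choices:

```python
# Minimal sketch: the "mission log" as a multi-turn chat. Because each frame
# stays in the conversation history, Gemini can compare new frames to old
# ones (e.g., rising water levels). Prompt text is illustrative.
from google import genai
from google.genai import types

client = genai.Client()
mission = client.chats.create(model="gemini-1.5-flash")

def log_frame(jpeg_bytes: bytes, gps: tuple[float, float]) -> str:
    part = types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg")
    reply = mission.send_message([
        part,
        f"New frame at GPS {gps}. Compare with earlier frames: "
        "has the water level or any hazard changed?",
    ])
    return reply.text
```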

Gemini: The Server Brain

The "magic" happens on the server. While the drone sees a "box" around a person, Gemini understands the context. We use Gemini for two critical tasks:

**Multimodal Reasoning:** We send the image with a system instruction: "You are an emergency response coordinator. Analyze this aerial drone footage. Identify survivors, assess environmental threats (fire, water, electrical), and provide a priority score (1-10)."

**Command & Control:** Gemini acts as the pilot's assistant. If it sees a "collapsed bridge," it automatically triggers a follow-up mission, such as "monitor flooding" or "look for downed power lines," by generating new natural-language prompts for the YOLOE edge model.
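
Here is a sketch of how these two tasks could fit together on the server, assuming a hypothetical `push_prompts_to_drone()` helper and an illustrative response schema:

```python
# Minimal sketch of the server loop: Gemini reasons over a frame under an
# emergency-coordinator system instruction, then its suggested follow-up
# prompts are pushed back down to the YOLOE edge model. The Assessment
# schema and push_prompts_to_drone() helper are hypothetical.
from google import genai
from google.genai import types
from pydantic import BaseModel

class Assessment(BaseModel):      # hypothetical response schema
    scene_description: str
    priority_score: int           # 1 (routine) to 10 (critical)
    next_edge_prompts: list[str]  # new trigger classes for YOLOE

SYSTEM = (
    "You are an emergency response coordinator. Analyze this aerial drone "
    "footage. Identify survivors, assess environmental threats (fire, water, "
    "electrical), and provide a priority score (1-10). Suggest what the edge "
    "detector should look for next."
)

client = genai.Client()

def analyze(jpeg_bytes: bytes) -> Assessment:
    frame = types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg")
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=[frame],
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM,
            response_mime_type="application/json",
            response_schema=Assessment,
        ),
    )
    assessment: Assessment = response.parsed
    # Close the loop: e.g. a "collapsed bridge" yields prompts like
    # "flooding" or "downed power line" for the next YOLOE pass.
    push_prompts_to_drone(assessment.next_edge_prompts)  # hypothetical helper
    return assessment
```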

Challenges we ran into

**Latency vs. Intelligence:** We initially tried larger models, but the latency was too high for a drone moving at 10 m/s. Switching to Gemini 1.5 Flash gave us the sub-second reasoning we needed.

**Aerial Perspective Bias:** Standard models often fail to recognize objects from the top down. We overcame this by prompting Gemini specifically to "look for silhouettes and shadows" typical of aerial photography.

Accomplishments that we're proud of

**Zero-Shot Versatility:** Our drone found a "yellow backpack" in the snow despite the model never being explicitly trained on backpacks, thanks to the open-vocabulary nature of the vision-language system.

The "Gemini Insight": In one test, Gemini correctly identified that a "fallen tree" wasn't just debris, but a blockage preventing an ambulance from reaching a marked hospital entrance.

What we learned

We learned that Context is King. A drone that can see is helpful, but a drone that can think via Gemini is a force multiplier. Using Gemini's System Instructions, we were able to turn raw pixels into actionable intelligence that could save lives.

What's next

**Gemini 1.5 Pro Long-Term Memory:** Using the 2-million-token window to analyze hours of footage to track the path of a wildfire.

**Voice Interactivity:** Allowing rescuers to talk to the drone via Gemini: "Hey Gemini, find the safest path for the boat to reach those survivors."

**On-Device Reasoning:** Moving toward Gemini Nano to keep the brain alive even when the cloud is out of reach.
