Inspiration
The 2021 Perseverance landing proved autonomous rover navigation works at planetary scale, but current systems still require hand-coded waypoints for every target. We wanted to flip that: give the rover a natural language goal like "find the skull-shaped rock" and let vision-language AI handle the rest. Mars has 144 million square kilometers of surface. You cannot map it waypoint by waypoint.
What it does
ARES Mars Scout is an autonomous rover agent that navigates a hand-built Mars terrain scene using natural language goals and Gemini Vision AI.
You type a target description, for example "dark olivine boulder" or "polygon crack network", and the agent drives the rover to it using a finite state machine (SEARCHING, APPROACHING, VERIFYING, COMPLETED). Gemini 1.5 Flash evaluates camera frames at each tick and returns a confidence score plus a bounding box. The Pure Pursuit controller converts that detection into a 3D waypoint and drives the rover toward it.
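A minimal sketch of how the four states could transition on each tick. The threshold values, the `next_state` function, and the exact transition rules are assumptions for illustration, not the project's actual code; the only transition taken directly from this writeup is the walk back from VERIFYING to APPROACHING when verification fails.

```python
from enum import Enum, auto


class State(Enum):
    SEARCHING = auto()
    APPROACHING = auto()
    VERIFYING = auto()
    COMPLETED = auto()


# Hypothetical thresholds, chosen only for illustration.
DETECT_CONF = 0.5      # confidence needed to start approaching a candidate
VERIFY_CONF = 0.8      # confidence needed to confirm the target up close
ARRIVAL_DIST_M = 1.5   # distance at which the rover stops to verify


def next_state(state, confidence, distance_m):
    """Return the next FSM state for one tick.

    confidence: VLM score in [0, 1] for the current frame.
    distance_m: estimated distance to the candidate, or None if there is none.
    """
    if state is State.SEARCHING:
        if confidence >= DETECT_CONF and distance_m is not None:
            return State.APPROACHING
        return State.SEARCHING
    if state is State.APPROACHING:
        if confidence < DETECT_CONF or distance_m is None:
            return State.SEARCHING       # lost the candidate, resume search
        if distance_m <= ARRIVAL_DIST_M:
            return State.VERIFYING
        return State.APPROACHING
    if state is State.VERIFYING:
        if confidence >= VERIFY_CONF:
            return State.COMPLETED
        return State.APPROACHING         # verification failed, walk back
    return State.COMPLETED               # COMPLETED is terminal
```

Keeping the transition logic in a pure function like this makes it easy to replay logged confidence values against it offline.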
Every mission writes live telemetry into three backends at once:
- MongoDB Atlas stores the full spatial catalog with 2dsphere geospatial indexes and a 384-dimension vector search index on terrain feature embeddings. A TTL-indexed alerts collection receives an alert document when consecutive low-confidence detections signal a problem.
- Elasticsearch indexes every observation with dense vector fields for hybrid keyword + semantic search. A full Kibana Ground Control dashboard refreshes every 5 seconds and shows FSM state distribution, VLM confidence over time, inference latency, and false positive rates.
- Arize Phoenix receives every VLM inference as an OpenTelemetry span using OpenInference semantic conventions, then runs post-mission hallucination and relevance evaluations automatically.
How we built it
Mars terrain: We built the entire scene from scratch in Isaac Sim. HiRISE elevation GeoTIFFs drive the displacement mesh. Rocks are procedurally generated with vesicular basalt geometry, 3-channel color models (dust cap / base / rust shadow), and regolith skirts at the base. A custom sky shader reproduces the Martian atmospheric scattering.
Agent core: Python FSM wrapping ROS2 control nodes. Gemini 1.5 Flash handles visual grounding. Pure Pursuit converts 2D detections to 3D waypoints via ray-depth intersection.
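The 2D-to-3D step can be illustrated with a standard pinhole camera back-projection. This is a sketch under stated assumptions: the intrinsics `fx`, `fy`, `cx`, `cy`, the bounding box layout `(x_min, y_min, x_max, y_max)`, and both function names are hypothetical, and the depth is assumed to be measured along the optical axis. The result is in the camera frame; the transform into the world frame is omitted.

```python
def bbox_center(bbox):
    """Return the pixel center (u, v) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with known depth into a camera-frame 3D point.

    Assumes a pinhole camera with focal lengths fx, fy and principal point
    (cx, cy) in pixels, and depth_m measured along the optical (z) axis.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


if __name__ == "__main__":
    u, v = bbox_center((400, 220, 440, 260))
    # Hypothetical intrinsics for a 640x480 camera.
    print(backproject(u, v, depth_m=5.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```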
Telemetry: MongoDB Atlas with compound indexes, vector search, and Change Streams for real-time alerting. Elasticsearch with geo_point and dense_vector mappings. Arize Phoenix via OTLP export with a custom Gemini span attribute mapper (no first-party instrumentation package exists yet).
Batch seeder: A kinematic differential drive simulator generates 50 complete synthetic missions (~6,000 observations) so the dashboards have a realistic volume of data to visualize even before a live GPU session.
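For readers unfamiliar with the kinematic model, one integration step of a differential drive can be sketched as below. The function name, parameters, and the forward-Euler integration are assumptions for illustration; the seeder's actual integrator is not described here.

```python
import math


def diff_drive_step(x, y, theta, v_left, v_right, wheel_base, dt):
    """Advance a differential drive pose by one forward-Euler step.

    v_left, v_right: wheel linear speeds in m/s.
    wheel_base: distance between the two wheels in m.
    Returns the new (x, y, theta) with theta wrapped to [-pi, pi).
    """
    v = (v_left + v_right) / 2.0               # forward speed of the body
    omega = (v_right - v_left) / wheel_base    # yaw rate, positive is counterclockwise
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta = (theta + omega * dt + math.pi) % (2.0 * math.pi) - math.pi
    return x, y, theta
```

Calling this in a loop with a fixed `dt` and logging a pose per tick is enough to produce a synthetic trajectory for a seeded mission.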
Challenges we ran into
Getting three independent telemetry backends to receive data from the same tick without any one silently swallowing errors was harder than expected. The design rule was: no silent fallbacks. If MongoDB is down, the mission crashes loudly. That constraint forced proper connection validation at startup rather than scattered try/except blocks.
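One way to express the "no silent fallbacks" rule is a single startup gate that runs a health check per backend and aborts on the first failure. This is a sketch, not the project's code: `BackendUnavailable`, `validate_backends`, and the shape of the `checks` mapping are hypothetical names introduced for illustration.

```python
class BackendUnavailable(RuntimeError):
    """Raised at startup when a telemetry backend fails its health check."""


def validate_backends(checks):
    """Run every health check once and fail loudly on the first error.

    checks: mapping of backend name -> zero-argument callable that raises
    if the backend is unreachable.
    """
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            # Re-raise with the backend name instead of swallowing the error.
            raise BackendUnavailable(f"{name} failed startup check: {exc}") from exc
```

With pymongo, for example, a check could be as small as `lambda: client.admin.command("ping")`, assuming `client` is an already constructed `MongoClient`. The single except clause lives at startup, so the per-tick write path needs no scattered error handling.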
Atlas M0 free tier does not support programmatic Vector Search index creation via the Data API. We detect that error at runtime and print exact manual steps rather than pretending the index exists.
Writing OpenInference-compliant span attributes for Gemini required a custom mapper since there is no official Gemini instrumentation package.
Accomplishments we're proud of
The adaptive confidence threshold surprised us most. After enough missions, the system notices that certain query types consistently produce high-confidence detections that turn out to be false positives when the FSM walks back from VERIFYING to APPROACHING. It raises the minimum confidence threshold for those query types automatically on future missions. That is calibration from production data feeding back into agent behavior, no human in the loop.
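The feedback loop described above could look roughly like this. Everything here is an assumption made for illustration: the class name, the default values, and the rule of raising the threshold by a fixed step once the false positive rate in the current window exceeds a limit.

```python
from collections import defaultdict


class AdaptiveThreshold:
    """Per-query-type minimum confidence, raised when false positives pile up."""

    def __init__(self, base=0.7, step=0.05, max_threshold=0.95,
                 min_samples=5, max_fp_rate=0.3):
        self.base = base
        self.step = step
        self.max_threshold = max_threshold
        self.min_samples = min_samples
        self.max_fp_rate = max_fp_rate
        self._stats = defaultdict(lambda: [0, 0])  # type -> [false positives, total]
        self._thresholds = {}

    def threshold(self, query_type):
        """Return the current minimum confidence for this query type."""
        return self._thresholds.get(query_type, self.base)

    def record(self, query_type, was_false_positive):
        """Record one high-confidence detection outcome and adapt if needed."""
        stats = self._stats[query_type]
        stats[0] += int(was_false_positive)
        stats[1] += 1
        fp, total = stats
        if total >= self.min_samples and fp / total > self.max_fp_rate:
            raised = self.threshold(query_type) + self.step
            self._thresholds[query_type] = min(self.max_threshold, raised)
            self._stats[query_type] = [0, 0]       # start a fresh window
```

A detection would count as a false positive here when the FSM walks back from VERIFYING to APPROACHING, matching the signal described above.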
The terrain memory catalog compounds across missions: features discovered on mission 1 are available as prior knowledge for mission 50. The planner queries nearby known features before departing, reducing redundant exploration.
What we learned
Observability is not optional for autonomous agents. Without Phoenix traces we had no idea the VLM was systematically overconfident on texture-ambiguous terrain. Without Elasticsearch we could not query "all VERIFYING states where confidence > 0.7 but target was not confirmed." Together, those two views identified exactly which terrain types needed tuned thresholds.
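The quoted Elasticsearch query can be written as a bool filter in the query DSL. The field names `fsm_state`, `confidence`, and `target_confirmed` are assumptions; the actual index mapping is not shown in this writeup.

```python
import json


def unconfirmed_verifying_query(min_confidence=0.7):
    """Build a query body for VERIFYING observations that were confident but unconfirmed.

    Field names are hypothetical and must match the real index mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"fsm_state": "VERIFYING"}},
                    {"range": {"confidence": {"gt": min_confidence}}},
                    {"term": {"target_confirmed": False}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(unconfirmed_verifying_query(), indent=2))
```

Using `filter` rather than `must` skips relevance scoring, which is appropriate for an exact-match diagnostic query like this one.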
MongoDB Change Streams for real-time alerting on consecutive low-confidence observations turned out to be one of the cleanest architectural decisions. Writing to the alerts collection and having the fleet monitor watch it is simpler and more reliable than any polling loop.
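The producer side of that pattern, deciding when a streak of low-confidence observations warrants an alert document, can be sketched without any database. The class name, default values, and alert fields below are hypothetical.

```python
class LowConfidenceMonitor:
    """Track consecutive low-confidence observations and emit one alert per streak."""

    def __init__(self, threshold=0.4, max_consecutive=3):
        self.threshold = threshold
        self.max_consecutive = max_consecutive
        self.streak = 0

    def observe(self, confidence):
        """Return an alert dict when the streak first reaches the limit, else None."""
        if confidence < self.threshold:
            self.streak += 1
        else:
            self.streak = 0                  # any confident frame resets the streak
        if self.streak == self.max_consecutive:
            return {
                "type": "low_confidence_streak",
                "count": self.streak,
                "threshold": self.threshold,
            }
        return None
```

In the design described above, a returned alert would be inserted into the alerts collection, and the fleet monitor would pick it up through a change stream (in pymongo, `Collection.watch()`) instead of polling.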
What's next
Live HiRISE imagery integration: NASA's HiRISE camera produces 25 cm/pixel Mars surface images. The terrain builder already loads real GeoTIFF tiles as Isaac Sim displacement meshes, so the rover would navigate terrain matching an actual latitude and longitude on Mars.
After that: multi-rover coordination. The fleet monitor already tracks multiple rover states in MongoDB. Routing rovers to non-overlapping terrain sectors is mostly a scheduling problem at that point, with the spatial infrastructure already in place.
Built With
- arize-phoenix
- aws-ec2
- docker
- elasticsearch
- gemini-1.5-flash
- isaac-sim
- kibana
- mongodb-atlas
- numpy
- opencv
- openinference
- opentelemetry
- python
- ros2
- sentence-transformers