Infrastructure & Deployment Stabilization
The initial Cloud Run deployment surfaced several critical issues that required immediate attention:
- Celery Worker Memory Crashes: Workers were OOM-killed on Cloud Run's default 512 MB limit. All services (API, workers, beat scheduler) were upgraded to 1 Gi memory, and Gunicorn workers were reduced to prevent memory contention in containerized environments.
- Celery Task Discovery: Workers failed to find tasks due to an incorrect app name configuration in `celery.py`. Fixed the autodiscover path to match the Django project structure.
- Cloud Run Job Argument Parsing: The `--args` flag in Cloud Run uses commas as delimiters, which conflicted with Celery queue names containing commas. Switched to semicolon delimiters with entrypoint parsing to preserve argument integrity.
- Dependency Upgrades: Upgraded `django-celery-beat` (2.1.0 → 2.8.1) and `django-timezone-field` (4.2.3 → 7.2.1) for Python 3.12 compatibility. Removed all AWS legacy references and adapted the full stack to GCP-native services.
- SSL & CORS Configuration: Resolved staging redirect loops caused by Cloudflare's Flexible SSL mode conflicting with Django's `SECURE_SSL_REDIRECT`. Configured Full (Strict) SSL mode and aligned `ALLOWED_HOSTS`, `CORS_ALLOWED_ORIGINS`, and `CSRF_TRUSTED_ORIGINS` across environments.
- Cloud Run Jobs Pipeline: Built a reusable `cloud_run_job.sh` runner that creates/updates Cloud Run Jobs on-the-fly, enabling one-command scenario seeding to staging and production.
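The semicolon-delimiter workaround for `--args` can be sketched as a small entrypoint helper. This is a minimal illustration, not the project's actual entrypoint: the function name `split_job_args` and the example command and queue names are assumptions. The idea is that the whole command travels as one semicolon-joined string, so commas inside an individual argument are never seen as separators.

```python
def split_job_args(raw: str, delimiter: str = ";") -> list[str]:
    """Split one delimiter-joined string back into individual arguments.

    Hypothetical helper: empty segments (for example from a trailing
    delimiter) are dropped and surrounding whitespace is stripped.
    """
    return [part.strip() for part in raw.split(delimiter) if part.strip()]


# Example value only; the app and queue names are made up for illustration.
raw = "celery;-A;config;worker;-Q;default,telemetry"
argv = split_job_args(raw)

# The comma-separated queue list survives as a single argument.
print(argv)
```

The trade-off is that a literal semicolon can no longer appear inside an argument, which is usually acceptable for worker and management commands.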
Staging Environment Protection
- Cloudflare Access: Configured Cloudflare Access Application with email-based policies to restrict staging access to authorized developers only.
- DNS Proxying: Ensured all staging subdomains are proxied through Cloudflare for DDoS protection and access control.
Frontend Real-Time Improvements
Events & Live Mode Overhaul
The real-time simulation experience received significant fixes to make the live telemetry streaming actually usable:
- Event Reactivity Fix: The Events Timeline component used `BehaviorSubject.getValue()` inside Angular `computed()` signals, which cannot be reactively tracked. Migrated to `toSignal()` from `@angular/core/rxjs-interop` so `hasActiveRun` and `isContinuousRun` now update reactively.
- Click-to-Seek: Events in the timeline are now clickable; selecting an event seeks the viewer to the exact simulation frame where the event occurred.
- Reload Protection: Added proper cleanup and re-initialization when navigating between scenarios, preventing stale telemetry from previous sessions from leaking into new views.
- Polling Guards: Guarded event polling and telemetry fetches behind authentication checks to prevent unnecessary API calls (and associated costs) for unauthenticated viewers.
Authentication Guards
Several interactive features were exposed to unauthenticated users, causing redirect errors when the API returned 401/403:
- Feedback Form: Wrapped in auth check — unauthenticated users see a snackbar with a "Login" action instead of a broken form.
- Resume Continuous Mode: The "Go Live" button now checks auth before attempting to resume real-time simulation.
- Project Future Button: The trajectory projection feature (`PROJECT FUTURE +15min`) now requires authentication; unauthenticated users receive a descriptive snackbar prompt.
- Pattern: All guards use the same pattern: `auth.isAuthenticated$.pipe(take(1))` → snackbar with `Login` action → `router.navigate(['/login'])`.
API Contract Alignment
- Events Interface Fix: Renamed `min_separation_lt` to `min_separation_km_lt` across the full stack (backend serializer, frontend API service, and component queries) to match the actual backend filter parameter.
- Graceful 503 Fallback: Frontend now handles `503 Service Unavailable` responses gracefully (e.g., when Celery workers are temporarily down) instead of showing raw error screens.
Simulation Engine Fixes
False Collision Events
The proximity detection system was generating false COLLISION_IMPACT and SURFACE_IMPACT events in several scenarios:
- Ring Systems: Saturn's rings, classified as `RING_SYSTEM`, were triggering collision events with moons passing through them. Fixed by adding `RING_SYSTEM` to the `is_visual_only()` category filter in `ProximityService`.
- NRHO Gateway Station: The Lunar Gateway in the cislunar scenario generated false impacts. Adjusted proximity thresholds for NRHO (Near Rectilinear Halo Orbit) entities.
- Barycenter Collisions: In the Three-Body Choreography scenario, bodies passing through the coordinate-frame barycenter triggered false collisions. Fixed by introducing a `skip_proximity_check` flag in entity `logical_properties` and extending `ProximityService` to honor it. Also added container category (`GALAXY`, `STAR_SYSTEM`, `UNIVERSE`) exclusion from proximity checks.
- External ID Validation: Fixed `scenario_09_planetary_defense` where an incorrect `external_id` caused entity lookup failures during catalog hydration.
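The three exclusion rules above (visual-only categories, container categories, and the `skip_proximity_check` flag) can be combined into a single eligibility check. The sketch below is an assumption about how such a filter might look, not the actual `ProximityService` code: the `Entity` dataclass, `is_proximity_candidate`, `candidate_pairs`, and the `PLANET`, `MOON`, and `REFERENCE_POINT` categories are hypothetical, and only `RING_SYSTEM` is listed as visual-only because it is the only one named here.

```python
from dataclasses import dataclass, field

# Category names come from the write-up; grouping them into sets is an assumption.
VISUAL_ONLY_CATEGORIES = {"RING_SYSTEM"}
CONTAINER_CATEGORIES = {"GALAXY", "STAR_SYSTEM", "UNIVERSE"}


@dataclass
class Entity:
    """Hypothetical stand-in for a simulation entity."""
    name: str
    category: str
    logical_properties: dict = field(default_factory=dict)


def is_visual_only(entity: Entity) -> bool:
    """True for entities that are rendered but have no collidable body."""
    return entity.category in VISUAL_ONLY_CATEGORIES


def is_proximity_candidate(entity: Entity) -> bool:
    """Apply the three exclusion rules in order."""
    if is_visual_only(entity):
        return False
    if entity.category in CONTAINER_CATEGORIES:
        return False
    if entity.logical_properties.get("skip_proximity_check", False):
        return False
    return True


def candidate_pairs(entities: list[Entity]) -> list[tuple[str, str]]:
    """Return the unordered name pairs that should be distance-checked."""
    eligible = [e for e in entities if is_proximity_candidate(e)]
    return [
        (a.name, b.name)
        for i, a in enumerate(eligible)
        for b in eligible[i + 1:]
    ]


entities = [
    Entity("Saturn", "PLANET"),
    Entity("Saturn Rings", "RING_SYSTEM"),
    Entity("Mimas", "MOON"),
    Entity("Solar System", "STAR_SYSTEM"),
    Entity("Barycenter", "REFERENCE_POINT", {"skip_proximity_check": True}),
]
pairs = candidate_pairs(entities)
print(pairs)
```

Filtering before pairing keeps excluded entities out of the distance computation entirely, instead of discarding their events afterwards.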
Numerical Stability
- Three-Body Choreography (Scenario 16): The original 1-day time step was far too coarse for the figure-8 choreographic solution, causing RK4 integrator divergence after ~4.3 simulated years. Reduced step size and tuned duration for stable propagation across multiple orbital periods.
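The sensitivity of a fixed-step RK4 integrator to step size can be seen even in a much simpler setting than the figure-8 solution. The sketch below is a toy two-body problem in normalized units (GM = 1, circular orbit of radius 1), not the project's integrator or the three-body configuration. It measures relative energy drift for a coarse and a fine step over the same time span. All function names and step values are illustrative assumptions.

```python
import math


def accel(x: float, y: float) -> tuple[float, float]:
    """Gravitational acceleration toward the origin with GM = 1."""
    r3 = (x * x + y * y) ** 1.5
    return -x / r3, -y / r3


def rk4_step(state, h):
    """Advance (x, y, vx, vy) by one classical RK4 step of size h."""
    def deriv(s):
        x, y, vx, vy = s
        ax, ay = accel(x, y)
        return (vx, vy, ax, ay)

    def shifted(s, k, scale):
        return tuple(si + scale * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, h / 2))
    k3 = deriv(shifted(state, k2, h / 2))
    k4 = deriv(shifted(state, k3, h))
    return tuple(
        s + h / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def energy(state) -> float:
    """Specific orbital energy: kinetic plus potential."""
    x, y, vx, vy = state
    return 0.5 * (vx * vx + vy * vy) - 1.0 / math.hypot(x, y)


def energy_drift(h: float, t_end: float) -> float:
    """Relative energy error after integrating to t_end with step h."""
    state = (1.0, 0.0, 0.0, 1.0)  # circular orbit, period 2*pi
    e0 = energy(state)
    for _ in range(round(t_end / h)):
        state = rk4_step(state, h)
    return abs((energy(state) - e0) / e0)


coarse = energy_drift(h=0.5, t_end=20.0)   # about 13 steps per orbit
fine = energy_drift(h=0.05, t_end=20.0)    # about 126 steps per orbit
print(f"coarse drift: {coarse:.3e}, fine drift: {fine:.3e}")
```

RK4 is not symplectic, so the energy error accumulates rather than oscillating; a step that looks acceptable over one period can still diverge over many, which is consistent with the failure showing up only after several simulated years.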
Scenario Step Timing Audit
Performed a comprehensive audit of all 16 scenario templates to ensure smooth visualization. Scenarios with low step counts (< 500 frames) produce "jumpy" animations.
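The frame-count criterion reduces to simple arithmetic: frames equal duration divided by step size, and anything under 500 is flagged. A minimal sketch follows, assuming fixed-size steps. The helper names and the 60 s and 30 s steps are hypothetical and are not the values used in the actual templates.

```python
MIN_SMOOTH_FRAMES = 500  # threshold quoted in the audit


def frame_count(duration_s: float, step_s: float) -> int:
    """Number of whole frames produced by stepping through the duration."""
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    return int(duration_s // step_s)


def is_jumpy(duration_s: float, step_s: float) -> bool:
    """True when the scenario falls below the smooth-playback threshold."""
    return frame_count(duration_s, step_s) < MIN_SMOOTH_FRAMES


def max_smooth_step(duration_s: float, min_frames: int = MIN_SMOOTH_FRAMES) -> float:
    """Largest step size that still yields at least min_frames frames."""
    return duration_s / min_frames


six_hours = 6 * 3600  # e.g. a 6-hour scenario
frames_at_60s = frame_count(six_hours, 60)
frames_at_30s = frame_count(six_hours, 30)
largest_step = max_smooth_step(six_hours)
print(frames_at_60s, frames_at_30s, largest_step)
```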
Scenario Consolidation & Expansion
Consolidation: 29 → 16 Templates
The original codebase contained 29 scenario files, many of which were duplicates, incomplete prototypes, or absorbed into other scenarios. A full audit consolidated these into 16 production-ready templates, each with validated physics configurations and proper documentation.
Current Scenario Catalog
| # | Scenario | Duration | Physics |
|---|---|---|---|
| 01 | Real-Time Simulation | Continuous | Kepler |
| 02 | LEO Operations (TLEs) | 6 hours | J2 + Atmo |
| 03 | Earth-Moon Cislunar | 30 days | Cowell N-Body |
| 04 | Inner Solar System | 2 years | Kepler |
| 05 | JWST at Lagrange L2 | 1 year | Cowell N-Body |
| 06 | Outer Solar System | 165 years | Kepler |
| 07 | Voyager — Furthest Spacecraft | 50 years | Kepler |
| 08 | Solar Dynamics | 25 years | Kepler |
| 09 | Planetary Defense | 1 year | J2 |
| 10 | Space Hazards (Debris) | 6 hours | J2 + Atmo + Mag |
| 11 | TRAPPIST-1 Exoplanet System | 20 days | Cowell N-Body |
| 12 | Alpha Centauri System | 80 years | Cowell N-Body |
| 13 | Galactic Center (Sgr A*) | 20 years | Cowell N-Body |
| 14 | Local Group (Galaxy Collisions) | 1.5 Byr | Cowell N-Body |
| 15 | Stellar Evolution (Sirius AB) | 50 years | Cowell N-Body + Mag |
| 16 | Three-Body Choreography | 6 years | Cowell N-Body |
Backend Architecture Improvements
- Django Admin Registration: Registered all simulator and core models in Django admin for operational visibility and manual data management.
- Timezone Configuration: Set `TIME_ZONE = 'UTC'` globally and suppressed `ErfaWarning` for dubious year calculations in Astropy (common in deep-time scenarios like Local Group).
- Email Configuration: Migrated from console email backend to a production-grade SMTP provider for staging/production transactional emails.
- Internationalization: Generated Django locale translations (pending review) for future multi-language support.
Testing & Coverage
- Full backend test suite maintained across all changes (unit + integration + e2e)
- Dedicated proximity/collision test coverage validating the event detection pipeline: visual-only entity filtering, container exclusion, `skip_proximity_check` flag, threshold calculations, and event classification