CloudSafe Product Document

Document Status: Updated to match repository state
Version: 2.0
Authors: Nathan Chiu, Jeet Lad
Project: CloudSafe
Last Updated: March 28, 2026

1. Product Summary

CloudSafe is a Python-based cross-cloud failover demonstration that monitors a primary AWS endpoint, fails over to a standby GCP endpoint after repeated health-check failures, and presents system state through a local web dashboard.

The current repository implements a working demo platform, not a production traffic management system. The codebase is optimized for hackathon demonstration value and clarity of behavior rather than infrastructure completeness, security hardening, or operational durability.

2. Current Product Scope

CloudSafe currently provides:

  • Continuous polling of an AWS primary endpoint over HTTP on port 8080
  • Health validation based on HTTP 200 plus an expected response signature
  • Failover to a GCP standby endpoint after a configurable consecutive failure threshold
  • Post-failover standby verification against the GCP endpoint
  • Automatic failback to AWS when the primary becomes healthy again
  • A local Flask dashboard with:
    • start and stop controls
    • active target and failure counters
    • server-sent event log streaming
    • visual status display for AWS and GCP
  • A CLI monitoring loop that can inject simulated primary failure for demo use
  • Simple HTTP server stubs for deployment on AWS and GCP VMs

CloudSafe does not currently provide:

  • DNS failover
  • load balancer or reverse proxy switching
  • cloud control plane automation
  • instance startup or shutdown orchestration
  • persistent state or event history
  • authentication or multi-user access control
  • production-grade alerting, retries, backoff, or circuit breaking

3. Product Positioning

CloudSafe should be positioned as a lightweight failover orchestration demo that demonstrates monitoring, failover decision logic, visibility, and state transitions across two cloud providers. It shows the control logic required for cross-cloud resilience without claiming to be a production-ready availability platform.

4. Target Users

  • Hackathon judges evaluating technical feasibility and clarity of execution
  • Student teams demonstrating cloud resilience concepts
  • Engineers who want a compact reference implementation for active/standby monitoring behavior

5. Technology Stack

The repository currently uses:

  • Python for all application logic
  • Flask for the local dashboard server
  • requests for health checks
  • standard-library threading, queue, and event primitives for monitor execution and log fanout
  • server-sent events for browser log streaming
  • static HTML, CSS, and vanilla JavaScript for the dashboard UI
  • simple Python http.server responders on cloud VMs to return service signatures

This is no longer a Tkinter desktop application. Any documentation that refers to Tkinter is outdated and should be treated as incorrect.

6. System Behavior

6.1 Health Check Model

The monitoring loop polls the primary AWS endpoint at a fixed interval and treats the service as healthy only when all of the following hold:

  • the HTTP request succeeds within the configured timeout
  • the response status is 200
  • the response body contains one of the accepted signatures:
    • OK-AWS
    • OK-GCP
    • OK-AZURE (legacy compatibility only)
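
The three conditions above can be sketched as a small pure function plus a wrapper that treats any request error as unhealthy. This is a minimal illustration, not the repository's code: the names ACCEPTED_SIGNATURES, is_healthy, check_endpoint, and the injectable fetch callable are assumptions made for the example.

```python
# Illustrative sketch only; names are hypothetical, not taken from the repository.
ACCEPTED_SIGNATURES = ("OK-AWS", "OK-GCP", "OK-AZURE")  # OK-AZURE: legacy compatibility only


def is_healthy(status_code, body):
    """Healthy only if the status is 200 and the body contains an accepted signature."""
    return status_code == 200 and any(sig in body for sig in ACCEPTED_SIGNATURES)


def check_endpoint(fetch, url, timeout=3.0):
    """Run one health check.

    `fetch` is an assumed callable (url, timeout) -> (status_code, body_text).
    Any exception (timeout, connection refused, ...) counts as unhealthy.
    """
    try:
        status_code, body = fetch(url, timeout)
    except Exception:
        return False
    return is_healthy(status_code, body)
```

With requests, fetch could wrap requests.get(url, timeout=timeout) and return (response.status_code, response.text). Injecting it as a parameter keeps the decision logic testable without network access.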

6.2 Failover Model

Failover occurs when the AWS primary fails the configured number of consecutive checks. On failover:

  • the active target changes from AWS to GCP
  • the failure counter is reset
  • the system logs the failover event
  • the GCP endpoint is immediately verified
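
A minimal sketch of the failover steps listed above, assuming a simple in-memory state object. FailoverState, record_primary_check, and the verify_standby callable are hypothetical names chosen for illustration; the repository's actual structure may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverState:
    """Hypothetical shared state; field names are assumptions."""
    threshold: int = 2
    active_target: str = "AWS"
    consecutive_failures: int = 0
    failed_over: bool = False
    log: list = field(default_factory=list)


def record_primary_check(state, healthy, verify_standby=lambda: True):
    """Apply one AWS health-check result. Returns True if failover fired on this check."""
    if state.failed_over:
        return False  # already on standby; recovery is handled elsewhere
    if healthy:
        state.consecutive_failures = 0  # a healthy check resets the counter
        return False
    state.consecutive_failures += 1
    state.log.append(f"AWS check failed ({state.consecutive_failures}/{state.threshold})")
    if state.consecutive_failures < state.threshold:
        return False
    # Threshold reached: switch target, reset counter, log, then verify standby.
    state.active_target = "GCP"
    state.failed_over = True
    state.consecutive_failures = 0
    state.log.append("FAILOVER: active target switched from AWS to GCP")
    standby_ok = verify_standby()
    state.log.append("GCP standby verification: " + ("healthy" if standby_ok else "UNHEALTHY"))
    return True
```

Because a healthy result resets the counter, only consecutive failures count toward the threshold; an isolated failed poll does not trigger failover.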

6.3 Failback Model

After failover, the monitoring loop keeps running and continues to check whether AWS has recovered. If AWS becomes healthy again:

  • routing state switches back to AWS
  • failure counters are reset
  • the system logs failback completion

This automatic failback behavior is implemented in the codebase today and must be reflected in product documentation.
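
The failback steps can be sketched as follows. This example assumes that a single healthy AWS check is enough to fail back, which this document does not specify, and it uses a plain dictionary with hypothetical keys for the shared state.

```python
def check_failback(state, aws_healthy):
    """Switch back to AWS if currently failed over and AWS is healthy again.

    `state` is an assumed dict with keys: active_target, consecutive_failures,
    failed_over, log. Returns True if failback happened on this check.
    """
    if state["active_target"] != "GCP":
        return False  # not failed over, nothing to do
    if not aws_healthy:
        return False  # primary still down, stay on standby
    state["active_target"] = "AWS"
    state["consecutive_failures"] = 0
    state["failed_over"] = False
    state["log"].append("FAILBACK: AWS recovered, active target switched back to AWS")
    return True
```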

6.4 Demo Failure Injection

The CLI supports a --simulate mode that locally injects AWS failure into the monitoring loop. The current implementation intentionally allows one healthy poll before simulated failures begin, which makes the dashboard and logs easier to interpret during a live demo.

This simulation mode:

  • is local to the orchestrator
  • does not require cloud-side failure to occur
  • represents a hard health-check failure path

It does not simulate:

  • latency
  • partial packet loss
  • slow degradation
  • split-brain behavior
  • provider API outages
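
One way to picture the simulation behavior is a wrapper around the real health check. The sketch below assumes the first poll is passed through to the real check and every later poll is forced to fail; make_simulated_check and its healthy_polls parameter are hypothetical names, not the repository's API.

```python
def make_simulated_check(real_check, healthy_polls=1):
    """Wrap a zero-argument health check so it hard-fails after `healthy_polls` calls."""
    calls = 0

    def check():
        nonlocal calls
        calls += 1
        if calls <= healthy_polls:
            return real_check()  # let the first poll(s) reflect real state
        return False  # injected hard failure; no latency or degradation is modeled

    return check
```

For example, wrapping a check that always succeeds yields one healthy result followed by failures, so with a threshold of 2 failover would fire on the third poll.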

7. Current Configuration

The repository currently defines these defaults in config.py:

  • primary cloud: AWS
  • secondary cloud: GCP
  • AWS primary IP: 52.8.170.166
  • GCP failover IP: 35.236.12.27
  • failure threshold: 2
  • health check interval: 2.5 seconds
  • health check timeout: 3 seconds

Important note: the UI currently displays a target RTO strip that still says Poll: 5s, but the code uses 2.5 seconds. Documentation should follow the code, and the UI should be corrected separately.
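
For orientation, the defaults above might look like this as module-level constants. Only HEALTH_CHECK_INTERVAL is a name confirmed elsewhere in this document; every other constant name, the health_url helper, and the "/" path are assumptions made for the example.

```python
# Hypothetical layout of the defaults; constant names other than
# HEALTH_CHECK_INTERVAL are assumptions.
PRIMARY_CLOUD = "AWS"
SECONDARY_CLOUD = "GCP"
AWS_PRIMARY_IP = "52.8.170.166"
GCP_FAILOVER_IP = "35.236.12.27"
SERVICE_PORT = 8080            # both endpoints serve HTTP on port 8080
FAILURE_THRESHOLD = 2          # consecutive failures before failover
HEALTH_CHECK_INTERVAL = 2.5    # seconds between polls
HEALTH_CHECK_TIMEOUT = 3       # seconds before a request counts as failed


def health_url(ip, port=SERVICE_PORT):
    """Build the health-check URL for an endpoint (the '/' path is an assumption)."""
    return f"http://{ip}:{port}/"
```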

8. Dashboard Requirements

The implemented dashboard is served from Flask and exposes these routes:

  • / for the main dashboard
  • /api/status for current state
  • /api/start to begin monitoring
  • /api/stop to request monitoring stop
  • /api/stream for server-sent event log streaming

Functional expectations for the dashboard:

  • Users can start a monitoring session from the browser
  • Users can stop a monitoring session from the browser
  • Users can see the currently active cloud target
  • Users can see failure count and threshold
  • Users can see whether the system is failed over
  • Users can watch live event logs without reloading the page
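
The live log requirement relies on fanning out monitor log lines to every connected browser and framing them as server-sent events. The following is a standard-library sketch under assumed names (LogFanout, sse_frame); it illustrates the pattern rather than the repository's implementation.

```python
import queue
import threading


class LogFanout:
    """Deliver each published log line to every subscriber's own queue."""

    def __init__(self):
        self._subscribers = []
        self._lock = threading.Lock()

    def subscribe(self):
        q = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def unsubscribe(self, q):
        with self._lock:
            if q in self._subscribers:
                self._subscribers.remove(q)

    def publish(self, message):
        with self._lock:
            subscribers = list(self._subscribers)  # copy so put() runs outside the lock
        for q in subscribers:
            q.put(message)


def sse_frame(message):
    """Format a message as one SSE event: 'data: ' per line, blank line terminator."""
    lines = message.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

In Flask, a generator that reads from a subscriber queue and yields sse_frame(...) strings can be returned in a Response with mimetype "text/event-stream". Giving each client its own queue means a slow browser does not block the monitor thread.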

9. CLI Requirements

The CLI monitoring entrypoint supports:

  • normal monitoring mode
  • --simulate to inject AWS failure
  • --iterations to cap loop execution for testing or demo control

The CLI remains part of the product surface and should be treated as a first-class demo interface, not just an internal helper.
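
A minimal argparse sketch of the two flags described above. The flag names come from this document; the integer type for --iterations, the default of None (meaning run until stopped), and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Build an illustrative parser for the monitoring entrypoint."""
    parser = argparse.ArgumentParser(description="CloudSafe monitoring loop (illustrative)")
    parser.add_argument(
        "--simulate",
        action="store_true",
        help="inject simulated AWS failure into the monitoring loop",
    )
    parser.add_argument(
        "--iterations",
        type=int,
        default=None,  # assumption: None means run until interrupted
        help="cap the number of loop iterations for testing or demo control",
    )
    return parser
```

Parsing ["--simulate", "--iterations", "5"] with this parser yields simulate=True and iterations=5; with no arguments it yields normal monitoring mode.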

10. Cloud Endpoint Requirements

The demo assumes:

  • one publicly reachable AWS endpoint on port 8080
  • one publicly reachable GCP endpoint on port 8080
  • each endpoint returns a unique text signature identifying the active platform
  • the standby endpoint is already running when failover occurs

The sample server implementations under server_app_aws.py and server_app_gcp.py satisfy the current health-check contract.
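
For illustration, an endpoint that satisfies this contract can be written with the standard-library http.server module. The sketch below is not a copy of server_app_aws.py or server_app_gcp.py; make_handler and make_server are hypothetical helpers, and answering every GET path with the signature is an assumption.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(signature):
    """Create a handler class that answers every GET with the given signature."""

    class SignatureHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = signature.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging for the demo

    return SignatureHandler


def make_server(signature, host="0.0.0.0", port=8080):
    """Bind a server on the given address; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), make_handler(signature))

# On the AWS VM this could be started with: make_server("OK-AWS").serve_forever()
# On the GCP VM:                            make_server("OK-GCP").serve_forever()
```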

11. Functional Requirements

FR-01 Monitoring

  • The system must poll the AWS primary endpoint at a fixed interval.
  • The system must treat failed HTTP requests, non-200 responses, and missing signatures as unhealthy.
  • The system must reset the failure counter after a successful AWS health check.

FR-02 Failover

  • The system must trigger failover after the configured consecutive failure threshold is reached.
  • The system must log failover activation clearly.
  • The system must switch the active target from AWS to GCP in shared state.
  • The system must verify the GCP target immediately after failover.

FR-03 Failback

  • The system must continue monitoring after failover.
  • The system must fail back to AWS when the primary recovers.
  • The system must log failback clearly.

FR-04 Observability

  • The system must expose current state over /api/status.
  • The dashboard must stream log events in near real time over SSE.
  • The UI must show whether monitoring is idle, active, or failed over.

FR-05 Demo Readiness

  • The system must support local failure injection through the CLI.
  • The system must be understandable from logs alone during a live demo.
  • The product documentation must describe this as an orchestration demo, not real network rerouting.

12. Non-Goals

The current version is not intended to implement:

  • production routing control
  • managed DNS updates
  • BGP or anycast behavior
  • secure secret management
  • autoscaling policies
  • persistent audit trails
  • SLA enforcement
  • multi-region or multi-standby scheduling
  • reconciliation against cloud provider APIs

13. Architecture Summary

Current repository structure:

14. Known Gaps and Codebase Drift

The repository review surfaced several accuracy and maintenance issues that the PRD must now acknowledge:

  • The previous PRD described a Tkinter application. The implementation is Flask-based.
  • The previous PRD described one-way failover. The implementation performs automatic failback.
  • The previous PRD and parts of the code still carry Azure-era compatibility naming even though the actual secondary provider is GCP.
  • The dashboard hardcodes Poll: 5s in the UI, while the configuration uses 2.5 seconds.
  • The current unit test suite is partially stale:
    • tests/test_app.py still targets a deleted Tkinter CloudSafeApp interface
    • tests/test_monitor.py expects HEALTH_CHECK_INTERVAL == 5, which no longer matches config.py

These are documentation and maintenance problems, not just cosmetic issues. They directly affect product accuracy and demo credibility.

15. Testing Status

As of March 28, 2026, the repository tests were run and reviewed with unittest.

Observed state:

  • most failover engine and monitor tests pass
  • tests/test_app.py fails because it targets an obsolete Tkinter app structure
  • one configuration assertion fails because the expected health-check interval is outdated
  • pytest is not currently installed in the virtual environment

This means the core failover logic is reasonably covered, but the web dashboard path is under-documented and under-tested relative to the current architecture.

16. Risks

| Risk | Current Reality | Impact |
|---|---|---|
| Orchestrator is a single point of failure | Monitoring and decision logic run in one local process | No failover action if that process stops |
| Hardcoded public IPs | IPs are static values in source code | Environment drift breaks the demo |
| No persistence | State resets on restart | No historical visibility |
| No authentication | Local dashboard has no auth layer | Unsafe beyond trusted local use |
| Test drift | Part of the suite no longer matches implementation | False confidence and slower iteration |
| UI and config mismatch | Dashboard shows outdated poll timing | Demo inconsistency |

17. Recommended Next Steps

Priority order for the next development cycle:

  1. Align the test suite with the Flask dashboard and current configuration values.
  2. Remove or clearly isolate Azure legacy aliases unless backward compatibility is still required.
  3. Move IPs and timing values into environment-based configuration instead of hardcoding them in source.
  4. Update the dashboard to display live configuration values for poll interval and timeout.
  5. Add dashboard-focused tests covering /api/status, /api/start, /api/stop, and /api/stream.
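
Step 3 could take a shape like the sketch below, which reads overrides from environment variables and falls back to the current defaults. The CLOUDSAFE_* variable names, the load_config function, and the dictionary keys are all hypothetical; none of them exist in the repository today.

```python
import os


def load_config(env=None):
    """Read settings from environment variables, falling back to current defaults.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    return {
        "aws_primary_ip": env.get("CLOUDSAFE_AWS_IP", "52.8.170.166"),
        "gcp_failover_ip": env.get("CLOUDSAFE_GCP_IP", "35.236.12.27"),
        "failure_threshold": int(env.get("CLOUDSAFE_FAILURE_THRESHOLD", "2")),
        "health_check_interval": float(env.get("CLOUDSAFE_HEALTH_CHECK_INTERVAL", "2.5")),
        "health_check_timeout": float(env.get("CLOUDSAFE_HEALTH_CHECK_TIMEOUT", "3")),
    }
```

If the dashboard read its poll interval and timeout from the same loaded values (step 4), the displayed timing could no longer drift from the configuration.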

18. Product Statement

CloudSafe is currently a local orchestration and observability demo for AWS-to-GCP active/standby failover. Its value is in showing clear monitoring, state transition, and recovery behavior across clouds with a lightweight Python implementation. The PRD, tests, and UI should all describe that same product consistently.
