CloudSafe Product Document
Document Status: Updated to match repository state
Version: 2.0
Authors: Nathan Chiu, Jeet Lad
Project: CloudSafe
Last Updated: March 28, 2026
1. Product Summary
CloudSafe is a Python-based cross-cloud failover demonstration that monitors a primary AWS endpoint, fails over to a standby GCP endpoint after repeated health-check failures, and presents system state through a local web dashboard.
The current repository implements a working demo platform, not a production traffic management system. The codebase is optimized for hackathon demonstration value and clarity of behavior rather than infrastructure completeness, security hardening, or operational durability.
2. Current Product Scope
CloudSafe currently provides:
- Continuous polling of an AWS primary endpoint over HTTP on port 8080
- Health validation based on HTTP 200 plus an expected response signature
- Failover to a GCP standby endpoint after a configurable consecutive failure threshold
- Post-failover standby verification against the GCP endpoint
- Automatic failback to AWS when the primary becomes healthy again
- A local Flask dashboard with:
- start and stop controls
- active target and failure counters
- server-sent event log streaming
- visual status display for AWS and GCP
- A CLI monitoring loop that can inject simulated primary failure for demo use
- Simple HTTP server stubs for deployment on AWS and GCP VMs
CloudSafe does not currently provide:
- DNS failover
- load balancer or reverse proxy switching
- cloud control plane automation
- instance startup or shutdown orchestration
- persistent state or event history
- authentication or multi-user access control
- production-grade alerting, retries, backoff, or circuit breaking
3. Product Positioning
CloudSafe should be positioned as a lightweight failover orchestration demo that proves monitoring, decisioning, visibility, and state transitions across two cloud providers. It demonstrates the control logic required for cross-cloud resilience without claiming to be a production-ready availability platform.
4. Target Users
- Hackathon judges evaluating technical feasibility and clarity of execution
- Student teams demonstrating cloud resilience concepts
- Engineers who want a compact reference implementation for active/standby monitoring behavior
5. Technology Stack
The repository currently uses:
- Python for all application logic
- Flask for the local dashboard server
- requests for health checks
- standard-library threading, queue, and event primitives for monitor execution and log fanout
- server-sent events for browser log streaming
- static HTML, CSS, and vanilla JavaScript for the dashboard UI
- simple Python http.server responders on cloud VMs to return service signatures
This is no longer a Tkinter desktop application. Any documentation that refers to Tkinter is outdated and should be treated as incorrect.
6. System Behavior
6.1 Health Check Model
The monitoring loop checks the primary AWS IP at a fixed interval and treats the service as healthy only when:
- the HTTP request succeeds within the configured timeout
- the response status is 200
- the response body contains one of the accepted signatures: OK-AWS, OK-GCP, or OK-AZURE (legacy compatibility only)
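The three conditions above can be sketched as a small check function. This is a minimal sketch, not the repository's monitor.py: the names ACCEPTED_SIGNATURES, response_is_healthy, and check_endpoint are hypothetical, the request path / is an assumption, and the sketch uses the standard-library urllib in place of requests so that it runs without third-party packages.

```python
import http.client
import urllib.request

# Signatures listed in section 6.1; OK-AZURE is kept for legacy compatibility only.
ACCEPTED_SIGNATURES = ("OK-AWS", "OK-GCP", "OK-AZURE")


def response_is_healthy(status_code: int, body: str) -> bool:
    """Healthy only if the status is 200 and the body carries a known signature."""
    return status_code == 200 and any(sig in body for sig in ACCEPTED_SIGNATURES)


def check_endpoint(ip: str, port: int = 8080, timeout: float = 3.0) -> bool:
    """Return True when the endpoint answers in time with a healthy response."""
    url = f"http://{ip}:{port}/"  # assumed path; the real responder may differ
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body)
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, DNS errors, and 4xx/5xx statuses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False
```

Keeping the status and signature test in a pure function makes the health rule testable without any network access.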
6.2 Failover Model
Failover occurs when the AWS primary fails the configured number of consecutive checks. On failover:
- the active target changes from AWS to GCP
- the failure counter is reset
- the system logs the failover event
- the GCP endpoint is immediately verified
6.3 Failback Model
After failover, the monitoring loop continues running. It checks whether AWS has recovered. If AWS becomes healthy again:
- routing state switches back to AWS
- failure counters are reset
- the system logs failback completion
This automatic failback behavior is implemented in the codebase today and must be reflected in product documentation.
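The failover and failback transitions in sections 6.2 and 6.3 can be sketched as a small state object. The class name FailoverState, its fields, and the log strings are hypothetical and are not taken from failover_engine.py. The sketch leaves out the standby verification step, and it assumes that failed AWS checks are not counted while GCP is already active.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailoverState:
    threshold: int = 2            # consecutive failures required for failover
    active_target: str = "AWS"
    failure_count: int = 0
    failed_over: bool = False
    events: List[str] = field(default_factory=list)

    def record_primary_check(self, healthy: bool) -> None:
        """Apply one AWS health-check result to the state."""
        if healthy:
            self.failure_count = 0
            if self.failed_over:
                # Failback: AWS has recovered while GCP was active.
                self.failed_over = False
                self.active_target = "AWS"
                self.events.append("failback complete: active target is AWS")
            return
        if self.failed_over:
            return  # assumption: already on GCP, keep waiting for AWS recovery
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            # Failover: switch target, reset the counter, log the event.
            self.failed_over = True
            self.active_target = "GCP"
            self.failure_count = 0
            self.events.append("failover activated: active target is GCP")


state = FailoverState(threshold=2)
for result in (False, False, False, True):
    state.record_primary_check(result)
```

After this sequence the state has failed over once and failed back once, so active_target is AWS again and events holds two entries.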
6.4 Demo Failure Injection
The CLI supports a --simulate mode that locally injects AWS failure into the monitoring loop. The current implementation intentionally allows one healthy poll before simulated failures begin, which makes the dashboard and logs easier to interpret during a live demo.
This simulation mode:
- is local to the orchestrator
- does not require cloud-side failure to occur
- represents a hard health-check failure path
It does not simulate:
- latency
- partial packet loss
- slow degradation
- split-brain behavior
- provider API outages
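One way to express this kind of local injection is to wrap the real health check. This is a sketch under assumptions: the function name with_simulated_failure and the healthy_polls parameter are hypothetical, and the first poll is assumed to pass through to the real check rather than being forced healthy.

```python
from typing import Callable


def with_simulated_failure(
    real_check: Callable[[], bool], healthy_polls: int = 1
) -> Callable[[], bool]:
    """Wrap a health check so it reports hard failure after `healthy_polls` calls."""
    calls = 0

    def check() -> bool:
        nonlocal calls
        calls += 1
        if calls <= healthy_polls:
            return real_check()  # let the first poll(s) reflect the real endpoint
        return False             # simulated hard failure, no network call made

    return check


check = with_simulated_failure(lambda: True)
results = [check() for _ in range(4)]  # [True, False, False, False]
```

Because the wrapper only ever returns False once the simulation starts, it models the hard failure path and nothing subtler, which matches the limits listed above.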
7. Current Configuration
The repository currently defines these defaults in config.py:
- primary cloud: AWS
- secondary cloud: GCP
- AWS primary IP: 52.8.170.166
- GCP failover IP: 35.236.12.27
- failure threshold: 2
- health check interval: 2.5 seconds
- health check timeout: 3 seconds
Important note: the UI currently displays a target RTO strip that still says Poll: 5s, but the code uses 2.5 seconds. Documentation should follow the code, and the UI should be corrected separately.
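For reference, the defaults above could be written as module constants, together with a rough estimate of how long an outage takes to trigger failover. Only HEALTH_CHECK_INTERVAL is a name attested elsewhere in this document; the other constant names and the detection_window helper are hypothetical. The estimate assumes the loop sleeps for the full interval after each check finishes, which may not match the actual loop.

```python
# Defaults from section 7. Only HEALTH_CHECK_INTERVAL is a confirmed name;
# the other identifiers are illustrative.
PRIMARY_CLOUD = "AWS"
SECONDARY_CLOUD = "GCP"
AWS_PRIMARY_IP = "52.8.170.166"
GCP_FAILOVER_IP = "35.236.12.27"
FAILURE_THRESHOLD = 2
HEALTH_CHECK_INTERVAL = 2.5  # seconds
HEALTH_CHECK_TIMEOUT = 3     # seconds


def detection_window(threshold: int, interval: float, timeout: float):
    """Return (fastest, slowest) seconds from outage start to the failover decision.

    Assumes the loop sleeps `interval` after each check completes.
    """
    # Fastest: outage starts just before a poll and every check fails instantly.
    fastest = (threshold - 1) * interval
    # Slowest: outage starts just after a poll and every check runs to timeout.
    slowest = threshold * (interval + timeout)
    return fastest, slowest


window = detection_window(FAILURE_THRESHOLD, HEALTH_CHECK_INTERVAL, HEALTH_CHECK_TIMEOUT)
# window == (2.5, 11.0)
```

Under these assumptions the failover decision lands between 2.5 and 11 seconds after the outage begins, so any poll timing shown in the UI should be derived from the configured values.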
8. Dashboard Requirements
The implemented dashboard is served from Flask and exposes these routes:
- / for the main dashboard
- /api/status for current state
- /api/start to begin monitoring
- /api/stop to request monitoring stop
- /api/stream for server-sent event log streaming
Functional expectations for the dashboard:
- Users can start a monitoring session from the browser
- Users can stop a monitoring session from the browser
- Users can see the currently active cloud target
- Users can see failure count and threshold
- Users can see whether the system is failed over
- Users can watch live event logs without reloading the page
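Live log streaming of this kind is commonly built from a per-client queue plus server-sent event framing. The sketch below shows that pattern with the standard library only. LogBroadcaster and format_sse are hypothetical names, not the contents of web/server.py, and dropping messages for slow clients is an assumed policy.

```python
import queue
import threading
from typing import List


class LogBroadcaster:
    """Fan out each log message to every subscribed client queue."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subscribers: List[queue.Queue] = []

    def subscribe(self) -> queue.Queue:
        q: queue.Queue = queue.Queue(maxsize=100)
        with self._lock:
            self._subscribers.append(q)
        return q

    def unsubscribe(self, q: queue.Queue) -> None:
        with self._lock:
            if q in self._subscribers:
                self._subscribers.remove(q)

    def publish(self, message: str) -> None:
        with self._lock:
            subscribers = list(self._subscribers)
        for q in subscribers:
            try:
                q.put_nowait(message)
            except queue.Full:
                pass  # assumed policy: drop messages for clients that fall behind


def format_sse(message: str) -> str:
    """Frame a message as one server-sent event (data lines plus a blank line)."""
    lines = message.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

In a Flask route, a generator could yield format_sse(q.get()) from a subscriber queue and be returned with the text/event-stream content type. That wiring is left out so the sketch stays dependency-free.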
9. CLI Requirements
The CLI monitoring entrypoint supports:
- normal monitoring mode
- --simulate to inject AWS failure
- --iterations to cap loop execution for testing or demo control
The CLI remains part of the product surface and should be treated as a first-class demo interface, not just an internal helper.
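A command-line surface with these two flags could be declared with argparse as sketched below. The flag names come from this document. The integer type for --iterations, its unbounded default, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="CloudSafe monitoring loop (sketch)")
    parser.add_argument(
        "--simulate",
        action="store_true",
        help="inject simulated AWS failure into the monitoring loop",
    )
    parser.add_argument(
        "--iterations",
        type=int,
        default=None,  # assumption: no cap unless one is given
        help="stop after this many polls",
    )
    return parser


# Example: a short simulated run capped at five polls.
args = build_parser().parse_args(["--simulate", "--iterations", "5"])
```

With no arguments the parser yields normal monitoring mode: simulate is False and iterations is None.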
10. Cloud Endpoint Requirements
The demo assumes:
- one publicly reachable AWS endpoint on port 8080
- one publicly reachable GCP endpoint on port 8080
- each endpoint returns a unique text signature identifying the active platform
- the standby endpoint is already running when failover occurs
The sample server implementations in cloudsafe/server_apps/ (server_app_aws.py and server_app_gcp.py) satisfy the current health-check contract.
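A responder that meets this contract can be very small. The following is a sketch built on the standard-library http.server, not a copy of the repository's server files. make_handler and make_server are hypothetical names, and answering every GET path with the signature is an assumption.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(signature: str):
    """Build a handler class that answers every GET with the given signature."""

    class SignatureHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            body = signature.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep the VM console quiet during the demo

    return SignatureHandler


def make_server(signature: str, host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Create (but do not start) a server bound to the demo port."""
    return HTTPServer((host, port), make_handler(signature))


# On the AWS VM:  make_server("OK-AWS").serve_forever()
# On the GCP VM:  make_server("OK-GCP").serve_forever()
```

Binding to 0.0.0.0 makes the responder reachable from outside the VM, which still depends on the cloud firewall allowing inbound traffic on port 8080.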
11. Functional Requirements
FR-01 Monitoring
- The system must poll the AWS primary endpoint at a fixed interval.
- The system must treat failed HTTP requests, non-200 responses, and missing signatures as unhealthy.
- The system must reset the failure counter after a successful AWS health check.
FR-02 Failover
- The system must trigger failover after the configured consecutive failure threshold is reached.
- The system must log failover activation clearly.
- The system must switch the active target from AWS to GCP in shared state.
- The system must verify the GCP target immediately after failover.
FR-03 Failback
- The system must continue monitoring after failover.
- The system must fail back to AWS when the primary recovers.
- The system must log failback clearly.
FR-04 Observability
- The system must expose current state over /api/status.
- The dashboard must stream log events in near real time over SSE.
- The UI must show whether monitoring is idle, active, or failed over.
FR-05 Demo Readiness
- The system must support local failure injection through the CLI.
- The system must be understandable from logs alone during a live demo.
- The product documentation must describe this as an orchestration demo, not real network rerouting.
12. Non-Goals
The current version is not intended to implement:
- production routing control
- managed DNS updates
- BGP or anycast behavior
- secure secret management
- autoscaling policies
- persistent audit trails
- SLA enforcement
- multi-region or multi-standby scheduling
- reconciliation against cloud provider APIs
13. Architecture Summary
Current repository structure:
- cloudsafe/app.py: top-level launcher for the Flask dashboard
- cloudsafe/web/server.py: Flask app, monitor thread lifecycle, SSE streaming, browser API routes
- cloudsafe/failover.py: CLI orchestration loop and simulation support
- cloudsafe/failover_engine.py: shared failover state and transition logic
- cloudsafe/monitor.py: HTTP health checks and signature validation
- cloudsafe/config.py: static configuration values
- cloudsafe/server_apps/server_app_aws.py: AWS demo responder
- cloudsafe/server_apps/server_app_gcp.py: GCP demo responder
14. Known Gaps and Codebase Drift
The repository review surfaced several accuracy and maintenance issues that the PRD must now acknowledge:
- The previous PRD described a Tkinter application. The implementation is Flask-based.
- The previous PRD described one-way failover. The implementation performs automatic failback.
- The previous PRD and parts of the code still carry Azure-era compatibility naming even though the actual secondary provider is GCP.
- The dashboard hardcodes Poll: 5s in the UI, while the configuration uses 2.5 seconds.
- The current unit test suite is partially stale:
- tests/test_app.py still targets a deleted Tkinter CloudSafeApp interface
- tests/test_monitor.py expects HEALTH_CHECK_INTERVAL == 5, which no longer matches config.py
These are documentation and maintenance problems, not just cosmetic issues. They directly affect product accuracy and demo credibility.
15. Testing Status
As of March 28, 2026, repository tests were reviewed via unittest.
Observed state:
- most failover engine and monitor tests pass
- tests/test_app.py fails because it targets an obsolete Tkinter app structure
- one configuration assertion fails because the expected health-check interval is outdated
- pytest is not currently installed in the virtual environment
This means the core failover logic is reasonably covered, but the web dashboard path is under-documented and under-tested relative to the current architecture.
16. Risks
| Risk | Current Reality | Impact |
|---|---|---|
| Orchestrator is a single point of failure | Monitoring and decision logic run in one local process | No failover action if that process stops |
| Hardcoded public IPs | IPs are static values in source code | Environment drift breaks the demo |
| No persistence | State resets on restart | No historical visibility |
| No authentication | Local dashboard has no auth layer | Unsafe beyond trusted local use |
| Test drift | Part of the suite no longer matches implementation | False confidence and slower iteration |
| UI and config mismatch | Dashboard shows outdated poll timing | Demo inconsistency |
17. Recommended Next Steps
Priority order for the next development cycle:
- Align the test suite with the Flask dashboard and current configuration values.
- Remove or clearly isolate Azure legacy aliases unless backward compatibility is still required.
- Move IPs and timing values into environment-based configuration instead of hardcoding them in source.
- Update the dashboard to display live configuration values for poll interval and timeout.
- Add dashboard-focused tests covering /api/status, /api/start, /api/stop, and /api/stream.
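The environment-based configuration step in the list above could look roughly like the following. This is a sketch only: the Settings class, the load_settings function, and every CLOUDSAFE_* variable name are hypothetical, and the fallback values are the current defaults from section 7.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    aws_primary_ip: str
    gcp_failover_ip: str
    failure_threshold: int
    health_check_interval: float  # seconds
    health_check_timeout: float   # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to the current defaults."""
    return Settings(
        aws_primary_ip=env.get("CLOUDSAFE_AWS_IP", "52.8.170.166"),
        gcp_failover_ip=env.get("CLOUDSAFE_GCP_IP", "35.236.12.27"),
        failure_threshold=int(env.get("CLOUDSAFE_FAILURE_THRESHOLD", "2")),
        health_check_interval=float(env.get("CLOUDSAFE_HEALTH_CHECK_INTERVAL", "2.5")),
        health_check_timeout=float(env.get("CLOUDSAFE_HEALTH_CHECK_TIMEOUT", "3")),
    )


settings = load_settings({"CLOUDSAFE_HEALTH_CHECK_INTERVAL": "1"})
```

Passing the environment as a mapping keeps the loader easy to test, and the same Settings object could feed the dashboard so that displayed poll and timeout values always match what the monitor uses.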
18. Product Statement
CloudSafe is currently a local orchestration and observability demo for AWS-to-GCP active/standby failover. Its value is in showing clear monitoring, state transition, and recovery behavior across clouds with a lightweight Python implementation. The PRD, tests, and UI should all describe that same product consistently.