CloudSafe Product Document

Document Status: Updated to match repository state
Version: 2.0
Authors: Nathan Chiu, Jeet Lad
Project: CloudSafe
Last Updated: March 28, 2026

1. Product Summary

CloudSafe is a Python-based cross-cloud failover demonstration that monitors a primary AWS endpoint, fails over to a standby GCP endpoint after repeated health-check failures, and presents system state through a local web dashboard.

The current repository implements a working demo platform, not a production traffic management system. The codebase is optimized for hackathon demonstration value and clarity of behavior rather than infrastructure completeness, security hardening, or operational durability.

2. Current Product Scope

CloudSafe currently provides:

Continuous polling of an AWS primary endpoint over HTTP on port 8080
Health validation based on HTTP 200 plus an expected response signature
Failover to a GCP standby endpoint after a configurable consecutive failure threshold
Post-failover standby verification against the GCP endpoint
Automatic failback to AWS when the primary becomes healthy again
A local Flask dashboard with:
- start and stop controls
- active target and failure counters
- server-sent event log streaming
- visual status display for AWS and GCP
A CLI monitoring loop that can inject simulated primary failure for demo use
Simple HTTP server stubs for deployment on AWS and GCP VMs

CloudSafe does not currently provide:

DNS failover
load balancer or reverse proxy switching
cloud control plane automation
instance startup or shutdown orchestration
persistent state or event history
authentication or multi-user access control
production-grade alerting, retries, backoff, or circuit breaking

3. Product Positioning

CloudSafe should be positioned as a lightweight failover orchestration demo that proves monitoring, decisioning, visibility, and state transitions across two cloud providers. It demonstrates the control logic required for cross-cloud resilience without claiming to be a production-ready availability platform.

4. Target Users

Hackathon judges evaluating technical feasibility and clarity of execution
Student teams demonstrating cloud resilience concepts
Engineers who want a compact reference implementation for active/standby monitoring behavior

5. Technology Stack

The repository currently uses:

Python for all application logic
Flask for the local dashboard server
requests for health checks
standard-library threading, queue, and event primitives for monitor execution and log fanout
server-sent events for browser log streaming
static HTML, CSS, and vanilla JavaScript for the dashboard UI
simple Python http.server responders on cloud VMs to return service signatures

This is no longer a Tkinter desktop application. Any documentation that refers to Tkinter is outdated and should be treated as incorrect.

6. System Behavior

6.1 Health Check Model

The monitoring loop checks the primary AWS IP at a fixed interval and treats the service as healthy only when:

the HTTP request succeeds within the configured timeout
the response status is 200
the response body contains one of the accepted signatures:
- OK-AWS
- OK-GCP
- OK-AZURE (legacy compatibility only)
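
The three conditions above can be sketched as a single pure function. This is a minimal illustration, not the code in cloudsafe/monitor.py: the function name, the constant name, and the convention of passing None as the status for a failed or timed-out request are all assumptions made for this example.

```python
# Signatures listed in section 6.1; OK-AZURE is accepted for legacy compatibility only.
ACCEPTED_SIGNATURES = ("OK-AWS", "OK-GCP", "OK-AZURE")


def is_healthy_response(status_code, body, signatures=ACCEPTED_SIGNATURES):
    """Return True only for HTTP 200 with an accepted signature in the body.

    Hypothetical convention: status_code is None when the request failed
    or exceeded the configured timeout.
    """
    if status_code != 200:
        return False
    return any(sig in body for sig in signatures)
```

A caller would typically obtain the status and body from requests.get(url, timeout=...) and pass None for the status when the call raises requests.RequestException.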

6.2 Failover Model

Failover occurs when the AWS primary fails the configured number of consecutive checks. On failover:

the active target changes from AWS to GCP
the failure counter is reset
the system logs the failover event
the GCP endpoint is immediately verified

6.3 Failback Model

After failover, the monitoring loop keeps running and continues to check whether AWS has recovered. If AWS becomes healthy again:

routing state switches back to AWS
failure counters are reset
the system logs failback completion

This automatic failback behavior is implemented in the codebase today and must be reflected in product documentation.
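
The failover and failback rules in 6.2 and 6.3 amount to a small state machine. The sketch below is one possible reading of those rules, not the code in cloudsafe/failover_engine.py. The class, field, and method names are hypothetical, the post-failover GCP verification is omitted, and the choice to leave the counter untouched while AWS stays down after failover is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverState:
    threshold: int = 2            # consecutive failures required for failover
    active_target: str = "AWS"
    failure_count: int = 0
    events: list = field(default_factory=list)  # stands in for the event log

    @property
    def failed_over(self):
        return self.active_target == "GCP"

    def record_primary_check(self, healthy):
        """Apply one AWS health-check result and return the active target."""
        if healthy:
            self.failure_count = 0
            if self.failed_over:
                # Failback: AWS recovered while GCP was active.
                self.active_target = "AWS"
                self.events.append("failback")
            return self.active_target
        if not self.failed_over:
            self.failure_count += 1
            if self.failure_count >= self.threshold:
                # Failover: switch target, reset the counter, log the event.
                self.active_target = "GCP"
                self.failure_count = 0
                self.events.append("failover")
        return self.active_target
```

With the default threshold of 2, two consecutive unhealthy results move the active target to GCP, and the next healthy result moves it back to AWS.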

6.4 Demo Failure Injection

The CLI supports a --simulate mode that locally injects AWS failure into the monitoring loop. The current implementation intentionally allows one healthy poll before simulated failures begin, which makes the dashboard and logs easier to interpret during a live demo.

This simulation mode:

is local to the orchestrator
does not require cloud-side failure to occur
represents a hard health-check failure path

It does not simulate:

latency
partial packet loss
slow degradation
split-brain behavior
provider API outages
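
One way to get the "one healthy poll, then hard failure" behavior is to wrap the real health check. The sketch below is illustrative only: the function and parameter names are hypothetical, and it assumes the first poll still calls the real check rather than returning a canned healthy result.

```python
def make_simulated_check(real_check, healthy_polls=1):
    """Wrap a health check so it hard-fails after `healthy_polls` real polls."""
    calls = 0

    def check():
        nonlocal calls
        calls += 1
        if calls <= healthy_polls:
            return real_check()  # let the first poll(s) reflect real state
        return False             # hard failure: no latency or degradation modeled

    return check
```

Because the wrapper lives in the orchestrator process, nothing on the AWS side has to fail for the failover path to be exercised.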

7. Current Configuration

The repository currently defines these defaults in config.py:

primary cloud: AWS
secondary cloud: GCP
AWS primary IP: 52.8.170.166
GCP failover IP: 35.236.12.27
failure threshold: 2
health check interval: 2.5 seconds
health check timeout: 3 seconds

Important note: the UI currently displays a target RTO strip that still says Poll: 5s, but the code uses 2.5 seconds. Documentation should follow the code, and the UI should be corrected separately.
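
These defaults also bound how quickly a failure is detected. The sketch below works through that arithmetic under one assumption that the repository may not share: the loop sleeps for the full interval after each check completes. All names except HEALTH_CHECK_INTERVAL are hypothetical.

```python
# Values from section 7; names other than HEALTH_CHECK_INTERVAL are hypothetical.
FAILURE_THRESHOLD = 2
HEALTH_CHECK_INTERVAL = 2.5  # seconds between polls
HEALTH_CHECK_TIMEOUT = 3.0   # seconds allowed per request


def detection_window(threshold, interval, timeout):
    """Return (best, worst) seconds from the first failing poll to failover.

    Assumes the loop sleeps `interval` after each check completes.
    """
    gaps = threshold - 1
    best = gaps * interval                         # failing polls return instantly
    worst = gaps * interval + threshold * timeout  # every failing poll times out
    return best, worst


print(detection_window(FAILURE_THRESHOLD, HEALTH_CHECK_INTERVAL, HEALTH_CHECK_TIMEOUT))
# (2.5, 8.5)
```

Under that assumption, failover triggers between 2.5 and 8.5 seconds after the first failing poll starts.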

8. Dashboard Requirements

The implemented dashboard is served from Flask and exposes these routes:

/ for the main dashboard
/api/status for current state
/api/start to begin monitoring
/api/stop to request monitoring stop
/api/stream for server-sent event log streaming
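
The log streaming behind /api/stream can be pictured as a fanout from the monitor thread to one queue per connected browser, with each message framed as a server-sent event. The sketch below uses only the standard library and is not the code in cloudsafe/web/server.py. The class and function names are hypothetical.

```python
import queue
import threading


class LogBroadcaster:
    """Fan out each log line to every subscribed client queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = []

    def subscribe(self):
        q = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def unsubscribe(self, q):
        with self._lock:
            if q in self._subscribers:
                self._subscribers.remove(q)

    def publish(self, message):
        # Copy under the lock so slow consumers never block subscription changes.
        with self._lock:
            subscribers = list(self._subscribers)
        for q in subscribers:
            q.put(message)


def sse_frame(message):
    """Format one message as a server-sent event frame."""
    lines = message.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

A Flask route could then return a Response wrapping a generator that yields sse_frame(q.get()) with the text/event-stream mimetype, unsubscribing the queue when the client disconnects.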

Functional expectations for the dashboard:

Users can start a monitoring session from the browser
Users can stop a monitoring session from the browser
Users can see the currently active cloud target
Users can see failure count and threshold
Users can see whether the system is failed over
Users can watch live event logs without reloading the page
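
To support these expectations, a status endpoint needs to expose the active target, the failure count and threshold, and whether the system is idle, active, or failed over. The sketch below shows one possible payload shape. The field names and the function are hypothetical and may not match what /api/status returns today.

```python
import json


def build_status(active_target, failure_count, threshold, running):
    """Build a JSON-serializable status snapshot (field names are hypothetical)."""
    failed_over = active_target != "AWS"
    if not running:
        mode = "idle"
    elif failed_over:
        mode = "failed_over"
    else:
        mode = "active"
    return {
        "running": running,
        "mode": mode,
        "active_target": active_target,
        "failed_over": failed_over,
        "failure_count": failure_count,
        "failure_threshold": threshold,
    }


print(json.dumps(build_status("GCP", 0, 2, True)))
```

Deriving the mode on the server keeps the browser code limited to rendering.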

9. CLI Requirements

The CLI monitoring entrypoint supports:

normal monitoring mode
--simulate to inject AWS failure
--iterations to cap loop execution for testing or demo control

The CLI remains part of the product surface and should be treated as a first-class demo interface, not just an internal helper.
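
The two flags above map naturally onto argparse. The sketch below is a minimal illustration rather than the parser in cloudsafe/failover.py. The flag names come from this document, while the help text, the integer type, and the default of None for --iterations are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="CloudSafe monitoring loop")
    parser.add_argument(
        "--simulate",
        action="store_true",
        help="inject simulated AWS failure into the monitoring loop",
    )
    parser.add_argument(
        "--iterations",
        type=int,
        default=None,  # assumption: None means run until interrupted
        help="stop after this many polls",
    )
    return parser
```

For example, build_parser().parse_args(["--simulate", "--iterations", "5"]) yields simulate=True and iterations=5.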

10. Cloud Endpoint Requirements

The demo assumes:

one publicly reachable AWS endpoint on port 8080
one publicly reachable GCP endpoint on port 8080
each endpoint returns a unique text signature identifying the active platform
the standby endpoint is already running when failover occurs

The sample server implementations in cloudsafe/server_apps/server_app_aws.py and cloudsafe/server_apps/server_app_gcp.py satisfy the current health-check contract.
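
A responder that meets this contract can be very small. The sketch below uses the standard-library http.server module and returns the AWS signature. It is an illustration, not the contents of the repository's server files. The handler and helper names are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SIGNATURE = b"OK-AWS"  # a GCP responder would return b"OK-GCP" instead


class SignatureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer 200 with the platform signature as plain text.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(SIGNATURE)))
        self.end_headers()
        self.wfile.write(SIGNATURE)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep demo output quiet


def make_server(port=8080):
    """Bind on all interfaces so the orchestrator can reach the VM."""
    return HTTPServer(("0.0.0.0", port), SignatureHandler)
```

Calling make_server().serve_forever() on each VM would keep the endpoint up, which matters for the standby because nothing in the demo starts it at failover time.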

11. Functional Requirements

FR-01 Monitoring

The system must poll the AWS primary endpoint at a fixed interval.
The system must treat failed HTTP requests, non-200 responses, and missing signatures as unhealthy.
The system must reset the failure counter after a successful AWS health check.

FR-02 Failover

The system must trigger failover after the configured consecutive failure threshold is reached.
The system must log failover activation clearly.
The system must switch the active target from AWS to GCP in shared state.
The system must verify the GCP target immediately after failover.

FR-03 Failback

The system must continue monitoring after failover.
The system must fail back to AWS when the primary recovers.
The system must log failback clearly.

FR-04 Observability

The system must expose current state over /api/status.
The dashboard must stream log events in near real time over SSE.
The UI must show whether monitoring is idle, active, or failed over.

FR-05 Demo Readiness

The system must support local failure injection through the CLI.
The system's behavior must be understandable from logs alone during a live demo.
The product documentation must describe this as an orchestration demo, not real network rerouting.

12. Non-Goals

The current version is not intended to implement:

production routing control
managed DNS updates
BGP or anycast behavior
secure secret management
autoscaling policies
persistent audit trails
SLA enforcement
multi-region or multi-standby scheduling
reconciliation against cloud provider APIs

13. Architecture Summary

Current repository structure:

cloudsafe/app.py: top-level launcher for the Flask dashboard
cloudsafe/web/server.py: Flask app, monitor thread lifecycle, SSE streaming, browser API routes
cloudsafe/failover.py: CLI orchestration loop and simulation support
cloudsafe/failover_engine.py: shared failover state and transition logic
cloudsafe/monitor.py: HTTP health checks and signature validation
cloudsafe/config.py: static configuration values
cloudsafe/server_apps/server_app_aws.py: AWS demo responder
cloudsafe/server_apps/server_app_gcp.py: GCP demo responder

14. Known Gaps and Codebase Drift

The repository review surfaced several accuracy and maintenance issues that the PRD must now acknowledge:

The previous PRD described a Tkinter application. The implementation is Flask-based.
The previous PRD described one-way failover. The implementation performs automatic failback.
The previous PRD and parts of the code still carry Azure-era compatibility naming even though the actual secondary provider is GCP.
The dashboard hardcodes Poll: 5s in the UI, while the configuration uses 2.5 seconds.
The current unit test suite is partially stale:
- tests/test_app.py still targets a deleted Tkinter CloudSafeApp interface
- tests/test_monitor.py expects HEALTH_CHECK_INTERVAL == 5, which no longer matches config.py

These are documentation and maintenance problems, not just cosmetic issues. They directly affect product accuracy and demo credibility.

15. Testing Status

As of March 28, 2026, the repository tests were run with unittest.

Observed state:

most failover engine and monitor tests pass
tests/test_app.py fails because it targets an obsolete Tkinter app structure
one configuration assertion fails because the expected health-check interval is outdated
pytest is not currently installed in the virtual environment

This means the core failover logic is reasonably covered, but the web dashboard path is under-documented and under-tested relative to the current architecture.

16. Risks

Risk	Current Reality	Impact
Orchestrator is a single point of failure	Monitoring and decision logic run in one local process	No failover action if that process stops
Hardcoded public IPs	IPs are static values in source code	Environment drift breaks the demo
No persistence	State resets on restart	No historical visibility
No authentication	Local dashboard has no auth layer	Unsafe beyond trusted local use
Test drift	Part of the suite no longer matches implementation	False confidence and slower iteration
UI and config mismatch	Dashboard shows outdated poll timing	Demo inconsistency

17. Recommended Next Steps

Priority order for the next development cycle:

Align the test suite with the Flask dashboard and current configuration values.
Remove or clearly isolate Azure legacy aliases unless backward compatibility is still required.
Move IPs and timing values into environment-based configuration instead of hardcoding them in source.
Update the dashboard to display live configuration values for poll interval and timeout.
Add dashboard-focused tests covering /api/status, /api/start, /api/stop, and /api/stream.
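
The third step, environment-based configuration, could look like the sketch below. It keeps the section 7 values as fallbacks so the demo still runs with no environment set. The CLOUDSAFE_* variable names, the function name, and the dictionary keys are all hypothetical.

```python
import os


def load_config(env=None):
    """Read settings from environment variables, falling back to current defaults.

    The CLOUDSAFE_* variable names are hypothetical.
    """
    env = os.environ if env is None else env
    return {
        "aws_primary_ip": env.get("CLOUDSAFE_AWS_IP", "52.8.170.166"),
        "gcp_failover_ip": env.get("CLOUDSAFE_GCP_IP", "35.236.12.27"),
        "failure_threshold": int(env.get("CLOUDSAFE_FAILURE_THRESHOLD", "2")),
        "health_check_interval": float(env.get("CLOUDSAFE_CHECK_INTERVAL", "2.5")),
        "health_check_timeout": float(env.get("CLOUDSAFE_CHECK_TIMEOUT", "3")),
    }
```

Accepting an optional mapping instead of always reading os.environ makes the loader easy to exercise in tests, and the same dictionary could feed the dashboard so that the displayed poll interval always matches the running configuration.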

18. Product Statement

CloudSafe is currently a local orchestration and observability demo for AWS-to-GCP active/standby failover. Its value is in showing clear monitoring, state transition, and recovery behavior across clouds with a lightweight Python implementation. The PRD, tests, and UI should all describe that same product consistently.

Updates

Nathan C started this project — Mar 28, 2026 09:42 PM EDT
