Hello, Sea! — Digital Tether: Semantic Communication for the Underwater Frontier

What inspired us

We photographed the far side of the Moon — but we still can't send a single photo through 100 meters of ocean. The sea is the last digital frontier, and the industry that runs on it pays dearly for the blackout: marine shipping loses an estimated $200B+ every year to a problem that sounds almost too simple — nobody can tell, in time, when a ship's hull needs cleaning.

Hull biofouling — the marine organisms that colonize a ship's underside — increases drag and wastes 10–20% of a vessel's fuel. The fix is straightforward: inspect, then clean. The catch is communication. A vessel sits at anchorage for a short window, an inspection is done, and the results take ~1 week to travel back to the operator — by which time the ship has already sailed for its next voyage, dirty, burning excess fuel the whole way. Across ~80,000 vessels, that's ~$2.6M per vessel per year.

The problem isn't cleaning. It's knowing when to clean — and getting that knowledge ashore before the ship leaves.

That's the gap our team set out to close.

Why this is hard: physics, not laziness

Underwater, our entire wireless playbook breaks:

  • Radio (5G, Wi-Fi, Bluetooth) is absorbed by seawater within meters. At these frequencies there is no practical underwater RF link.
  • Acoustic (sonar) penetrates water, but it's brutally slow — on the order of hundreds of kbps near the surface, collapsing to ~1,200 effective bits/s on a practical long-range link, with high latency, multipath echoes, Doppler, and bit errors.

To put that in perspective: a one-minute HD inspection clip (~60 MB) takes ~1 minute over RF and ~40 minutes over raw sonar, even at the short-range rate of roughly 200 kbps. At ~1,200 bits/s the same clip would take days. That kills real-time decision-making.
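The arithmetic behind those figures can be checked in a few lines. This is a rough sketch: the 8 Mbps RF rate and the 200 kbps short-range acoustic rate are assumptions back-solved from the ~1 minute and ~40 minute figures, not measured values, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(size_bytes: float, bits_per_second: float) -> float:
    """Ideal transfer time, ignoring latency, retransmissions and protocol overhead."""
    return size_bytes * 8 / bits_per_second


CLIP_BYTES = 60e6  # ~60 MB one-minute HD clip (from the text)

# Assumed link rates (illustrative, back-solved from the quoted times).
RF_BPS = 8e6             # ~8 Mbps radio link
SONAR_SHORT_BPS = 200e3  # "hundreds of kbps" short-range acoustic
SONAR_LONG_BPS = 1200    # ~1,200 effective bits/s long-range acoustic

rf_s = transfer_seconds(CLIP_BYTES, RF_BPS)
short_s = transfer_seconds(CLIP_BYTES, SONAR_SHORT_BPS)
long_s = transfer_seconds(CLIP_BYTES, SONAR_LONG_BPS)

print(f"RF: {rf_s / 60:.0f} min")                  # RF: 1 min
print(f"short-range sonar: {short_s / 60:.0f} min")  # short-range sonar: 40 min
print(f"long-range sonar: {long_s / 3600:.0f} h")    # long-range sonar: 111 h
```

Under these assumptions the 40-minute figure corresponds to the fast, short-range end of the acoustic range; the long-range link is slower by more than two orders of magnitude.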

Our idea: transmit meaning, not data

Classical compression asks "how few bits reproduce these pixels?" Semantic communication asks a sharper question: "how few bits reproduce what this image means for the inspection?"

For a hull inspection, pixel-perfect fidelity is almost never the point. The operator needs to know what is on the hull (biofouling, corrosion, a damaged anode, a diver), where it is, and how severe it is. That payload is a tiny fraction of the photo that carries it. By sending the meaning and reconstructing the image on shore, we hit 50–100× compression, turning that 40-minute sonar transfer into roughly 24 to 48 seconds.
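To make "what, where, how severe" concrete, here is a minimal sketch of how one finding could be packed into a fixed 10-byte record. The `Finding` class, the label IDs, the 0 to 255 severity scale, and the byte layout are all hypothetical illustrations, not the project's actual wire format. Masks and a text description would add more bytes on top.

```python
import struct
from dataclasses import dataclass

# Hypothetical class IDs for the findings named in the text.
LABELS = {"biofouling": 0, "corrosion": 1, "damaged_anode": 2, "diver": 3}
NAMES = {v: k for k, v in LABELS.items()}
RECORD = struct.Struct(">B4HB")  # label, x, y, w, h, severity (big-endian, 10 bytes)


@dataclass
class Finding:
    label: str     # what is on the hull
    x: int         # bounding box in pixels: where it is
    y: int
    w: int
    h: int
    severity: int  # assumed scale: 0 (clean) to 255 (severe)

    def pack(self) -> bytes:
        return RECORD.pack(LABELS[self.label], self.x, self.y, self.w, self.h, self.severity)

    @classmethod
    def unpack(cls, data: bytes) -> "Finding":
        label_id, x, y, w, h, severity = RECORD.unpack(data)
        return cls(NAMES[label_id], x, y, w, h, severity)


findings = [
    Finding("biofouling", 120, 340, 400, 220, 180),
    Finding("corrosion", 900, 60, 80, 50, 90),
]
payload = b"".join(f.pack() for f in findings)
print(len(payload))  # 20
```

Two findings fit in 20 bytes, which an ideal 1,200 bits/s link would carry in well under a second.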

It is a shift in the same spirit as telecom's move from circuit-switching to packet-switching, applied here to the underwater link.

The technical core

Formally, we don't minimize pixel distortion $\mathbb{E}\,[\,d(x,\hat{x})\,]$. For a frame $x$ and a downstream inspection task $T$, with codec parameters $\theta$, transmitted code $z$, and shore-side task estimate $\hat{T}$, we optimize a rate–task tradeoff under a hard channel budget:

$$ \min_{\theta}\; \underbrace{\mathbb{E}\big[\,d_{T}\!\left(T(x),\,\hat{T}\right)\big]}_{\text{task fidelity}} \quad\text{s.t.}\quad \underbrace{R(z)}_{\text{bits on the wire}} \;\le\; C_{\text{acoustic}}, \qquad C_{\text{acoustic}} \approx 1.2\,\text{kbps}. $$
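
A hard constraint like this is awkward to train against directly. One common relaxation in learned compression, offered here as a sketch rather than as the team's stated method, folds the rate into the objective with a multiplier $\lambda \ge 0$ (a symbol not used elsewhere in this write-up):

$$ \min_{\theta}\; \mathbb{E}\big[\,d_{T}\!\left(T(x),\,\hat{T}\right)\big] \;+\; \lambda\, R(z) $$

Raising $\lambda$ pushes the encoder toward fewer bits, so $\lambda$ would be swept until the measured rate satisfies $R(z) \le C_{\text{acoustic}}$.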

We approach it with a two-stream architecture:

  • Priority label stream — an edge segmentation + captioning model (SegFormer / Mask2Former + BLIP-2) extracts the semantic content (entities, masks, severity, a structured description) and sends it first, under strong forward error correction. This is the stream the mission actually depends on.
  • Generative bulk stream — a compact structural latent lets the shore side regenerate a faithful image (ControlNet-seg + Stable Diffusion), with an optional anchor patch of real pixels for human verification where it matters.
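
The ordering between the two streams can be sketched with a small priority queue. This is an illustration under stated assumptions: `TwoStreamQueue` and the stream constants are hypothetical names, and the triple-repetition code is a toy stand-in for "strong forward error correction", not the code the project uses.

```python
import heapq
from itertools import count

LABEL, BULK = 0, 1  # lower number is sent first


class TwoStreamQueue:
    """Always drains label packets before bulk packets."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within a stream

    def push(self, stream: int, payload: bytes) -> None:
        heapq.heappush(self._heap, (stream, next(self._seq), payload))

    def pop(self):
        stream, _, payload = heapq.heappop(self._heap)
        return stream, payload

    def __len__(self) -> int:
        return len(self._heap)


def repeat3_encode(data: bytes) -> bytes:
    """Toy FEC: transmit every byte three times."""
    return bytes(b for byte in data for b in (byte, byte, byte))


def repeat3_decode(coded: bytes) -> bytes:
    """Bitwise majority vote over each group of three received bytes."""
    out = bytearray()
    for i in range(0, len(coded), 3):
        a, b, c = coded[i:i + 3]
        out.append((a & b) | (a & c) | (b & c))
    return bytes(out)


q = TwoStreamQueue()
q.push(BULK, b"latent-0")
q.push(LABEL, repeat3_encode(b"fouling"))
q.push(BULK, b"latent-1")
order = [q.pop()[0] for _ in range(len(q))]
print(order)  # [0, 1, 1]
```

The label packet leaves first even though it was queued second, and the repetition code survives one corrupted copy of any byte at the cost of tripling the label stream's size.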

The longer-term codec is Joint Source–Channel Coding (JSCC): a neural encoder–decoder trained end-to-end with a physics-based acoustic channel model (Rician fading, multipath), with adaptive coding that trades compression rate against real-time SNR so semantic fidelity holds even on a noisy channel. On the classical side, CompressAI gives us an honest learned-codec baseline.
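
As a rough illustration of the channel side, the sketch below draws flat Rician fading gains and adds Gaussian noise using only the standard library. It is a single-tap simplification: real multipath would need several delayed taps, and training a JSCC codec end-to-end would need a differentiable version in a deep-learning framework. The function names and the K-factor of 3 are assumptions.

```python
import math
import random


def rician_gain(k_factor: float, rng: random.Random) -> complex:
    """One complex channel gain with unit mean power.

    k_factor is the ratio of line-of-sight power to scattered power.
    """
    los = math.sqrt(k_factor / (k_factor + 1))
    sigma = math.sqrt(1 / (2 * (k_factor + 1)))
    return complex(los + rng.gauss(0, sigma), rng.gauss(0, sigma))


def channel(symbols, k_factor: float, snr_db: float, rng: random.Random):
    """Apply flat Rician fading plus complex Gaussian noise to unit-power symbols."""
    noise_sigma = math.sqrt(10 ** (-snr_db / 10) / 2)  # per real dimension
    received = []
    for s in symbols:
        h = rician_gain(k_factor, rng)
        n = complex(rng.gauss(0, noise_sigma), rng.gauss(0, noise_sigma))
        received.append(h * s + n)
    return received


rng = random.Random(0)
gains = [rician_gain(3.0, rng) for _ in range(20000)]
mean_power = sum(abs(g) ** 2 for g in gains) / len(gains)
rx = channel([1 + 0j, -1 + 0j, 1 + 0j, -1 + 0j], k_factor=3.0, snr_db=10.0, rng=rng)
```

The average of `abs(g) ** 2` should sit close to 1, which confirms the gains are normalized so that `snr_db` keeps its usual meaning.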

How we built it — and what's real today

We refused to demo something we couldn't actually run, so we built the link end to end:

  1. The acoustic link is real. We stood up a 2-node UnetStack 3.4.4 RealTime network — an edge/AUV node and a shore node (net_rt.groovy). Frames are fragmented to the acoustic MTU, transmitted node-to-node, and reassembled on the far side. Fragmentation, latency, and reassembly all behave like the real channel.
  2. Edge transmitter (tx_edge.py) compresses each inspection frame and pushes fragments across the link; prep_demo.py precomputes payloads with an honest manifest — real byte counts, real compression ratio, real PSNR.
  3. Shore station (shore_server.py) decodes each frame the instant its bytes complete and streams the result to a live web console over Server-Sent Events — showing incoming data, the reconstructed frame, and a hull map, then auto-advancing to a structured inspection report when the reel finishes.
  4. The whole stack runs inference-only on Apple M4 (MPS, no CUDA); the production edge target is an NVIDIA Jetson with an OFDM-modulated acoustic modem.
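
The fragment-and-reassemble step can be sketched as follows. This is a minimal illustration, not the code in tx_edge.py or shore_server.py: the 6-byte header layout, the 64-byte MTU, and the `fragment` and `Reassembler` names are all hypothetical.

```python
import struct
from typing import Dict, List, Optional

# Hypothetical header: frame_id, fragment index, fragment count (big-endian uint16 each).
HEADER = struct.Struct(">HHH")


def fragment(frame_id: int, payload: bytes, mtu: int) -> List[bytes]:
    """Split payload into fragments of at most mtu bytes, header included."""
    chunk = mtu - HEADER.size
    if chunk <= 0:
        raise ValueError("mtu too small for header")
    parts = [payload[i:i + chunk] for i in range(0, len(payload), chunk)] or [b""]
    return [HEADER.pack(frame_id, i, len(parts)) + p for i, p in enumerate(parts)]


class Reassembler:
    """Collects fragments and returns the payload only once a frame is complete."""

    def __init__(self) -> None:
        self._frames: Dict[int, Dict[int, bytes]] = {}

    def feed(self, fragment_bytes: bytes) -> Optional[bytes]:
        frame_id, index, total = HEADER.unpack_from(fragment_bytes)
        parts = self._frames.setdefault(frame_id, {})
        parts[index] = fragment_bytes[HEADER.size:]
        if len(parts) < total:
            return None  # still waiting for more fragments
        del self._frames[frame_id]
        return b"".join(parts[i] for i in range(total))


frame = bytes(range(256)) * 4        # 1024-byte stand-in for a compressed frame
frags = fragment(7, frame, mtu=64)   # 58 payload bytes per fragment
rx = Reassembler()
results = [rx.feed(f) for f in reversed(frags)]  # deliver out of order
print(len(frags), results[-1] == frame)  # 18 True
```

Keying fragments by index rather than arrival order means the shore side tolerates reordering, and returning `None` until the last fragment arrives mirrors the "decode the instant its bytes complete" behavior described above.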

We were deliberate about honesty: the codec on stage today is a real JPEG baseline — real bytes, real ratio, real PSNR — and the reconstruction you see is literally the bytes that crossed the link, decoded. The neural semantic codec (JSCC + the two-stream generative path) is our active research direction, not a fake we dressed up.
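
For reference, the two manifest metrics can be computed as below. This is a generic sketch of the standard definitions over raw 8-bit sample buffers, not the code in prep_demo.py, and the function names are assumptions.

```python
import math


def compression_ratio(raw_bytes: int, coded_bytes: int) -> float:
    """Uncompressed size divided by the size that actually crossed the link."""
    return raw_bytes / coded_bytes


def psnr(original: bytes, decoded: bytes, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length 8-bit buffers."""
    if len(original) != len(decoded):
        raise ValueError("buffers must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical buffers
    return 10 * math.log10(peak ** 2 / mse)


original = bytes([100] * 16)
decoded = bytes([101] * 16)  # every sample off by one, so MSE = 1
print(f"{psnr(original, decoded):.2f} dB")  # 48.13 dB
print(compression_ratio(3_000_000, 60_000))  # 50.0
```

PSNR measures pixel fidelity only, which is why it suits the JPEG baseline but would understate a semantic codec that preserves labels while regenerating pixels.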

What we learned

  • The constraint is the product. Once you internalize ~150 bytes/second, every design decision inverts — you stop thinking in megabytes and start asking which bits earn their place on the wire.
  • "Reconstruction" is not one thing. Splitting machine-task fidelity (labels, masks, severity) from human-verification fidelity (a few anchor pixels) is what makes the bit budget tractable.
  • Acoustic networking has real teeth. Multipath, latency, and MTU fragmentation aren't footnotes — they shape the protocol. Reassembling a frame cleanly across the link taught us more than any throughput number.
  • An honest baseline beats an impressive fake. Shipping real JPEG-over-acoustic kept us grounded and made the gap the neural codec must close measurable.

Challenges we faced

  • A brutal channel budget. Designing anything visual for ~1.2 kbps forced the two-stream split and the "send meaning, regenerate pixels" approach. There's no shortcut around the physics.
  • Toolchain archaeology. UnetStack 3.4.4 needs Java 8 (Java 21 crashes its bundled Groovy), so just booting the acoustic network reliably was a battle. We pinned the JVM and documented it for reproducibility.
  • No CUDA. Making the generative stack run inference-only on Apple M4 (MPS) meant choosing models and settings that fit the hardware.
  • Real-time without stutter. Streaming partial frames to the browser as fragments arrive — decoding only on completion and advancing the console's state machine cleanly — took careful coordination across receiver, SSE channel, and front-end.
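
The "progress per fragment, frame only on completion" pattern can be sketched with the Server-Sent Events wire format, where each message is a few `field: value` lines ended by a blank line. The event names `progress` and `frame` and both function names are hypothetical, not taken from shore_server.py.

```python
import json
from typing import Iterator, Optional


def sse_event(event: str, data: dict, event_id: Optional[int] = None) -> str:
    """Format one Server-Sent Events message; the trailing blank line terminates it."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def console_events(frame_id: int, received: int, total: int) -> Iterator[str]:
    """Yield a progress event for every fragment and a frame event only on completion."""
    yield sse_event("progress", {"frame": frame_id, "received": received, "total": total})
    if received == total:
        yield sse_event("frame", {"frame": frame_id}, event_id=frame_id)


partial = list(console_events(7, 3, 18))
complete = list(console_events(7, 18, 18))
print(len(partial), len(complete))  # 1 2
```

Separating the two event types lets the front-end animate incoming data without ever trying to render a half-received frame.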

Hello, Sea! — connecting the last digital frontier, one semantic frame at a time.
