# 🎯 Sandokan

*Train with reasoning, optimize with precision.*

A CPU-only, header-only C++ neural network training engine built for on-device learning: no Python, no PyTorch, no GPU required.
## 💡 Inspiration
Most neural network training assumes a GPU and a Python runtime. That assumption breaks in the places where learning matters most:
- A microcontroller updating a sensor model in the field
- A robot adapting its controller between episodes
- An embedded vision system that must improve on the device it runs on
The trend toward giant GPU-trained models obscures a different class of problem: systems that must keep learning after deployment, with local data, on hardware that has no network connection or power budget for a GPU. Sandokan was built for those environments.
## 🔧 What It Does
Drop a single header into any C++ project and get a complete training pipeline:
| Feature | Details |
|---|---|
| Architectures | Fully connected networks, residual blocks |
| Optimizers | SGD, Adam (with bias correction) |
| Schedulers | LinearLR |
| Loss functions | CrossEntropy, BCE, MSE |
| Datasets | IDX image data (EMNIST, MNIST), numeric CSVs |
| Persistence | Custom .sand binary format with normalization |
| Inference | Top-k predictions, ASCII art previews |
Networks are defined by composing typed submodules: `Submodule<T>` auto-registers on construction, so you can't accidentally forget a `register_module` call:
```cpp
struct LetterNet : Module {
    Submodule<Linear> proj { *this, 784, 64 };
    ReLU relu;
    Submodule<Linear> head { *this, 64, 26 };

    Eigen::MatrixXf forward(const Eigen::MatrixXf& x) override {
        return head.forward(relu.forward(proj.forward(x)));
    }
};
```
## ⚙️ How We Built It
### PMAD Slab Allocator: zero-fragmentation gradient memory
The standard allocator story for neural network training is painful: thousands of separate heap allocations per epoch, fragmentation over long runs, and `malloc` unavailable on some embedded targets.
PMAD (Pre-allocated Memory Arena for Derivatives) solves this at the layer level. Before training begins, it walks the network topology, computes the exact size class for every gradient buffer, and satisfies them all from one contiguous slab. During training there are zero malloc/free calls.
```cpp
LetterNet net;
init_pmad_for(net); // walks topology → computes sizes → allocates slab → migrates pointers
```
Benefits:
- Cache-friendly: all gradient buffers for a pass are packed contiguously, so L2/L3 cache holds them together
- Deterministic latency: no allocator lock contention, no OS page-fault surprises mid-epoch
- Topology-aware: add a layer, re-call `init_pmad_for()`, and the slab is rebuilt automatically
### mmap-backed Datasets: bounded memory regardless of dataset size
Image data is memory-mapped from disk. Pages are faulted in on demand during batch assembly, so the working set stays bounded even over 100k+ samples. The OS page cache also shares reads across runs, so a warm second training run incurs essentially no disk I/O.
```cpp
ImageDataset train = load_emnist_letters("data/Emnist Letters", /*train=*/true);
ImageDataset test  = load_emnist_letters("data/Emnist Letters", /*train=*/false);

train.compute_normalization();
test.apply_normalization_from(train); // never leaks the test distribution
```
### Apple AMX Acceleration
Batched GEMM is routed through Eigen + Apple Accelerate on Apple Silicon. Combined with PMAD's cache-friendly layout, this is the primary driver of the training speedups below.
## 📊 Performance
**Architecture:** 784 → 64 → 64 → 26 | batch = 128 | Apple Silicon (M-series)

**EMNIST Letters** (124,800 training samples)
| Backend | ms / epoch | samples / sec |
|---|---|---|
| Sandokan single-sample | 1,508 | 82,757 |
| Eigen single-sample | 1,851 | 67,408 |
| Sandokan batched + parallel | 77 | 1,615,666 |
| Eigen batched | 123 | 1,015,951 |
Sandokan's batched path is 19.5× faster than its single-sample path and about 1.6× faster than plain Eigen's batched path.
**Fashion MNIST** (60,000 training samples)
| Backend | ms / epoch | samples / sec |
|---|---|---|
| Sandokan batched + parallel | 34.4 | 1,742,000 |
| Eigen batched | 40.9 | 1,464,000 |
## 🎯 Accuracy
| Dataset | Architecture | Optimizer | Result |
|---|---|---|---|
| EMNIST Letters | 784 → 64 → ResBlock(64) → 26 | Adam + LinearLR | 88.25% test accuracy |
| Fashion MNIST | 784 โ 64 โ 64 โ 10 | SGD | ~85% |
## 🧱 Tech Stack
- C++17: no runtime dependencies beyond the standard library
- Eigen 3: linear algebra backend
- Apple Accelerate / AMX: hardware BLAS on Apple Silicon
- CMake ≥ 3.15: build system
## 🚧 Challenges
- Designing PMAD's size-class inference to work generically across arbitrary network topologies, without requiring the user to annotate buffer sizes manually
- Folding the Softmax Jacobian into the CrossEntropy backward pass correctly: Softmax's own `backward()` becomes a passthrough, which avoids materializing the full Jacobian matrix while still producing the right gradient
- Keeping the API ergonomic (single-header, no registration boilerplate) while giving the allocator enough topology information at init time
## 🏆 Accomplishments
- A complete training pipeline with zero external runtime dependencies beyond Eigen
- A roughly 1.6× speedup over plain Eigen batched training on EMNIST Letters
- 88.25% test accuracy on a 26-class letter recognition task, trained entirely on CPU
- Deterministic, allocation-free gradient memory during training via PMAD
## 📚 What We Learned
- Slab allocation is surprisingly portable and powerful for neural network workloads: the key insight is that gradient buffer sizes are statically determined by the network architecture, so they can be committed upfront
- mmap is underused for ML datasets; bounded RSS matters enormously on devices where RAM is the constraint, not disk
## 🔮 What's Next
- Convolutional layers for on-device vision
- ARM NEON / CMSIS-NN support for non-Apple embedded targets
- INT8 quantization for microcontroller deployment
- ONNX export for interoperability with inference runtimes
## 🔗 Links

Built with C++17 · Eigen · Apple Accelerate · CMake