Perforated TravelPlanner — Dendritic Optimization + W&B Sweeps

Intro

We integrated Perforated AI's Dendritic Optimization into a PyTorch training workflow on the TravelPlanner benchmark and ran Weights & Biases sweeps to tune hyperparameters, improving PR-AUC / F1 with validation-based threshold selection.

Sweep ID: 46sgkuvo | Platform: Google Colab | Tracking: W&B

What We Built

A reproducible train + eval loop with W&B logging for imbalanced binary classification
W&B sweep agent (ID 46sgkuvo) runs for hyperparameter exploration over learning rate and weight decay
Best-threshold-on-validation selection: pick threshold on validation to maximize criterion, then report test metrics at that threshold
PR-AUC / Average Precision (AP) as primary metric (robust for rare positives)
Explicit tracking of threshold calibration effects (best_thr ranged from 0.00145 to 0.99975)

Dataset: TravelPlanner

Paper: TravelPlanner: A Benchmark for Real-World Planning with Language Agents (Xie et al., 2024)

Hugging Face: osunlp/TravelPlanner

We converted this into an imbalanced binary classification task:

Validation positive rate ≈ 0.111
Loss reweighting: pos_weight = 8.0
Evaluated using PR-AUC, F1@0.5, and F1 at validation-chosen threshold

Citation

@article{Xie2024TravelPlanner,
  author    = {Jian Xie and Kai Zhang and Jiangjie Chen and Tinghui Zhu and Renze Lou and Yuandong Tian and Yanghua Xiao and Yu Su},
  title     = {TravelPlanner: A Benchmark for Real-World Planning with Language Agents},
  journal   = {arXiv preprint arXiv:2402.01622},
  year      = {2024}
}

Method

Training + Evaluation Loop

Train: Standard PyTorch loop logging train/loss
Validate/Test: Compute PR-AUC, F1@0.5, and collect predictions for threshold optimization
Threshold Selection: Find best_thr on validation (to maximize validation criterion), then evaluate test at that threshold

Baseline vs Dendrites

We compared standard modules vs dendritic optimization enabled. Key observation: dendrites strongly affected score calibration, with optimal thresholds ranging from 0.00145 to 0.99975, making threshold-aware evaluation critical.

Results

Baseline vs Dendrites (Head-to-Head Comparison)

Setting	Test AP (PR-AUC)	Val-chosen Threshold	TP	FP	TN	FN
Baseline	0.187	0.869	4	25	39	4
Dendrites	0.168	0.479	1	8	56	7

Interpretation: Dendrites reduced false positives (25 → 8) but also reduced true positives (4 → 1), lowering AP at this operating point. This underscores the importance of calibrated threshold selection.

Additional Baseline Snapshot

Single baseline run (baseline_adamw):

Best validation PR-AUC: 0.15237 (epoch 8, best threshold: 0.87788)
Test PR-AUC: 0.24924 | Test F1@0.5: 0.20

Sweep Results (Headline Findings)

Best validation PR-AUC observed: 0.41344 (run tdif3p2t, best_thr=0.75876)
Highest test PR-AUC observed: 0.44663 (run 3ei24bcq) and 0.44634 (run hpmq62fe)
Largest calibration extremes: best_thr ranges from 0.00145 (gp84dgsz) to 0.99975 (y8krklgw)

Top-Performing Sweep Runs

Run ID	Learning Rate	Weight Decay	Best Val AP	Best Threshold	Test AP	F1@best_thr	Precision / Recall
hpmq62fe	0.0010658	0.0002224	0.40934	0.69435	0.44634	0.42857	0.50 / 0.375
3ei24bcq	0.0001149	0.00002155	0.29750	0.55035	0.44663	0.35294	0.333 / 0.375
tdif3p2t	0.0005784	0.0019845	0.41344	0.75876	0.25857	0.31579	0.273 / 0.375
jeg08kfr	0.0019701	0.000002321	0.29750	0.39074	0.39453	0.46154	0.60 / 0.375

Key Findings

Threshold calibration is critical: Dendrites shifted score distributions substantially, changing optimal thresholds from ~0.87 (baseline) to ~0.48-0.70 (dendrites). PR-AUC alone doesn't capture this behavior.
Sweep efficiency: W&B sweep (46sgkuvo) explored learning rate and weight decay systematically, discovering multiple runs with test AP > 0.44, compared to baseline AP ≈ 0.19-0.25.
Class imbalance handling: With pos_weight=8.0 and PR-AUC as primary metric, the model learned useful ranking even with ~11% positive rate. Thresholded F1 at best_thr reached up to 0.43.
Dendrites' practical value: In agent/planning systems with asymmetric FP/FN costs, dendrites' ability to shift decision boundaries can be leveraged to match business operating points (e.g., "precision ≥ 0.5").

Why It Matters (Business Need)

Travel planning and agent systems operate under tight constraints:

False positives trigger wasted downstream tool calls (search, constraint solvers, bookings)
False negatives miss valid itinerary options and reduce user trust
Latency & cost are critical; dendrites can reduce parameter count while maintaining/improving ranking quality

Our results demonstrate dendrites can meaningfully change calibration and decision behavior, offering value when tuned to a target operating point (precision-first vs recall-first trade-off).

Engineering Recap

What We Built

Colab-friendly training pipeline with explicit train/val/test splits
Consistent evaluation suite reporting PR-AUC, F1@0.5, and validation-chosen thresholds
W&B sweep configuration (46sgkuvo) exploring learning rate and weight decay

Major Debugging Milestones

Resolved regex patching issue in Colab (escape character handling)
Diagnosed PerforatedAI package structure in Colab:
- perforatedai imported successfully but exposed no top-level names
- Located internal modules (tracker_perforatedai.py, modules_perforatedai.py)
- Classes like PAINeuronModule and PAIDendriteModule were available for integration

Framework Exploration: NanoGPT (Attempted)

Explored dendritic integration into pure PyTorch NanoGPT for bonus points
Reached blockers: ImportError on custom entry points and unknown config keys (dend_activation)
Prioritized completing TravelPlanner experiments and sweep analysis

Limitations and Next Steps

Limitations

Threshold instability: Some runs picked extreme thresholds (near 0 or 1), inflating/deflating F1 even when AP stable
Small positive counts: Confusion matrix deltas swing with few examples; multiple seeds would improve confidence
Matched sweep needed: Baseline and dendrites should be swept under identical search spaces and seeds for fair attribution

Next Steps

Run paired sweeps (baseline vs dendrites) with identical ranges and 3–5 random seeds; report mean±std of AP
Choose a business operating point (e.g., "precision ≥ 0.5") and compare models at that constraint
Add efficiency metrics (params, wall-clock/step, memory) to quantify compute/size wins from dendrites

Tech Stack

PyTorch – training framework
PerforatedAI (Dendritic Optimization) – neural efficiency module
Weights & Biases (W&B) – experiment tracking, sweeps, dashboards
Google Colab – compute environment
scikit-learn – PR-AUC, threshold optimization utilities

References

Xie, J., Zhang, K., Chen, J., Zhu, T., Lou, R., Tian, Y., Xiao, Y., & Su, Y. (2024). TravelPlanner: A Benchmark for Real-World Planning with Language Agents. arXiv preprint arXiv:2402.01622.

Built With

cloud
googlecolab
huggingface
matplotlib
numpy
perforatedai
plotly
pytorch
regex
tqdm
wandb

Updates

Stephen Nguyen started this project — Jan 02, 2026 07:52 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.