What we've tried, what we haven't, and how the model will work.
Last updated: 29 March 2026 — Phase 9B complete
📊 Current Status
Best Clean AUC: 0.686
Walk-Forward AUC: 0.637
SP Baseline AUC: 0.636
Experiments Run: 120+
Races w/ Complete SP: 0%
Runners w/ Any SP: 61%
Key finding: The conclusion "SP prices everything" was built on incomplete data. Only 61% of runners have SP. Zero races have complete SP coverage. The benchmark itself is broken. We cannot claim market efficiency when we only observed 61% of the market.
What's real: With strict walk-forward features, model AUC = 0.6366 vs SP AUC = 0.6361. Delta = +0.0005. On the observable subset, SP does nearly all the work.
Silver lining: Free Betfair BSP data available back to 2008. 100% runner coverage for BAGS races. This repairs the benchmark.
📅 What We Tried
Phase 6A-B: Feature Cache Build (Invalidated)
Built cached_features_v2.parquet — 130 features, 951k rows. Baseline AUC 0.7747.
Problem: Features computed using full dog careers, not walk-forward. Leaked future data. AUC 0.77 was fake.
Phase 6C: 37 vs 130 Feature Comparison (Invalidated)
130 features beat 37 by +0.015 AUC. But both built on leaked cache. ROI still deeply negative (-33%).
Wave 0: Selector Surgery (No-Go)
7 predeclared selector strategies on frozen OOF predictions. All failed.
Key finding: EV curve is inverted — higher EV thresholds = worse ROI. The model's strongest disagreements with SP are its worst bets.
Phase 8: Feature Cluster Ladder D0-D6 (No-Go)
Built from the raw DB with clean walk-forward features. Conditional logit + LightGBM.
| Cluster | Features | Holdout AUC | vs D0 | Verdict |
|---|---|---|---|---|
| D0 SP Baseline | 4-9 | 0.686 | — | Baseline |
| D1 Evidence/Reliability | +11 | 0.6846 | 0.000 | No-Go |
| D2 Clean Form | +8 | 0.6843 | 0.000 | No-Go |
| D3 Race Comparison | +6 | 0.6843 | 0.000 | No-Go |
| D4 Context Fit | +5 | 0.6843 | 0.000 | No-Go |
| D5 Interference/Draw | +6 | 0.6843 | 0.000 | No-Go |
| D6 LightGBM All | 34 | 0.6148 | -0.070 | Harmful |
Phase 9A: Leakage Audit (Confirmed)
Date-shuffle test: shuffled AUC 0.594 ≈ normal 0.598. Temporal structure doesn't matter → features leaked future info. P1's AUC 0.77 was fake. True ceiling ~0.69.
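The date-shuffle test can be sketched as follows. `build_features` and `fit_predict` are hypothetical stand-ins for the project's pipeline, and the rank-based AUC helper avoids an sklearn dependency (it assumes untied scores):

```python
import numpy as np

def auc(y_true, y_score):
    """Mann-Whitney rank AUC; assumes no tied scores."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)
    ranks = np.empty(len(order))
    ranks[order] = np.arange(1, len(order) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def date_shuffle_test(df, build_features, fit_predict, rng=None):
    """Refit after destroying temporal order. If AUC barely moves,
    the features never depended on time -> suspect leakage.
    `build_features` / `fit_predict` stand in for the real pipeline."""
    rng = rng or np.random.default_rng(0)
    shuffled = df.copy()
    shuffled["race_date"] = rng.permutation(shuffled["race_date"].values)
    normal = auc(df["won"], fit_predict(build_features(df)))
    broken = auc(df["won"], fit_predict(build_features(shuffled)))
    return normal, broken
```

A leak-free walk-forward pipeline should see `broken` collapse toward 0.5 relative to `normal`; 0.594 vs 0.598 is the smoking gun described above.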
Overnight V2-V6: 120+ Experiments (No Edge)
Architecture variants, segment specialists, novice deep-dives, regularization sweeps, DART, LambdaRank, ablations. All ROI was FLB arbitrage, not model skill. V01 SP-only produced +31% ROI — same as complex models.
D6 Proper: Compound Interactions (No Edge)
9 predeclared interactions from Phase 8 plan. All deltas < 0.001 on full population. Ablations show no individual interaction matters. Only EXP6 novice showed +4.57pp but was holdout contamination.
Phase 9B: Novice Blind Test (Small Signal)
Clean walk-back test. Model AUC 0.6463 vs SP 0.6334 on novice subset. Delta = +1.3pp. Signal is real but modest. Original +4.57pp was holdout contamination.
Comprehensive Audit (Bugs Found)
12-point audit + independent GPT review found multiple bugs invalidating results. Walk-forward-only model AUC = SP AUC to 4 decimal places.
SP Data Gap Discovery (New Direction)
Zero races have complete SP. 735k trial runners polluting analysis. 551k graded runners finished races without SP (GBGB pipeline failure). Betfair BSP available free back to 2008.
🐛 Bugs Found in Audit
Subset Normalization Bug (Broad Impact)
Race probabilities normalized only within filtered subset (e.g. novice runners), not full field. Inflated all specialist/pocket AUCs. Affected: phase8_d6_proper subsets, overnight v2-v6 subsets.
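A minimal numeric illustration of the bug, using made-up scores for a hypothetical six-dog race with two novice runners:

```python
import numpy as np

# Hypothetical 6-runner race: raw model scores, two "novice" runners.
scores = np.array([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])
novice = np.array([True, False, False, True, False, False])

# Correct: normalize over the FULL field, then select the subset.
full_probs = scores / scores.sum()
correct = full_probs[novice]                     # sums to < 1, as it should

# Bug: filter first, then normalize over the subset only.
buggy = scores[novice] / scores[novice].sum()    # forced to sum to 1

print(round(correct.sum(), 6))  # 0.4  (novices' true share of the race)
print(round(buggy.sum(), 6))    # 1.0  (inflated: subset treated as a race)
```

Any AUC computed on the `buggy` probabilities compares distorted numbers, which is how the specialist/pocket results got inflated.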
Phase 9B Reverse-Time Split (Invalid)
Trained on 2024-2026, tested on 2020-2023. This is backwards — future data predicting the past. Both phase9b scripts affected.
Positional Feature Assignment Bug (Invalid)
Sorted a copy of the dataframe, computed features, then assigned .values back positionally to the original unsorted dataframe. Features attached to the wrong dogs.
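A toy reproduction of the failure mode in pandas, with a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"dog": ["B", "A"], "time": [30.5, 29.8]})

# Bug: sort a copy, compute on it, then assign .values back positionally.
by_dog = df.sort_values("dog")        # order is now A, B
feat = by_dog["time"].rank()          # A (29.8) -> 1.0, B (30.5) -> 2.0
df["rank_buggy"] = feat.values        # positional: B receives A's value

# Fix: assign the Series itself; pandas aligns on the index.
df["rank_ok"] = feat                  # index-aligned, lands on right rows
```

`.values` strips the index, so the ordering produced by the sort is pasted back onto rows that are in a different order; assigning the Series (or merging on the index) keeps features attached to the right dogs.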
LambdaRank Group Bug (Invalid)
Rows not sorted by race_id before passing group arrays to LightGBM ranker. Phase8 D6 catastrophic 0.6148 AUC was probably this bug.
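A sketch of the fix: sort by `race_id` first, then derive contiguous group sizes. Column names are assumptions about the project's schema; the returned `sizes` array is the shape a LightGBM-style ranker expects for its group argument.

```python
import pandas as pd

def make_ranker_groups(df, group_col="race_id"):
    """Sort rows by race, then emit contiguous group sizes.
    Passing group sizes over UNSORTED rows silently mixes dogs
    from different races into the same ranking group."""
    df = df.sort_values(group_col, kind="stable").reset_index(drop=True)
    sizes = df.groupby(group_col, sort=False).size().to_numpy()
    assert sizes.sum() == len(df)  # every row in exactly one group
    return df, sizes
```

The labels/features must be taken from the *returned* sorted frame, not the original, or the same misalignment reappears one step later.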
D0 Baseline Drift (Comparison Invalid)
D0 started as 9 features, rerun used 4. D1-D5 compared against 4-feature version. Not apples-to-apples.
Isotonic Calibration In-Sample (Signal Loss)
Calibration fitted in-sample, then race renormalization breaks it. SP-only model (0.6854) underperforms raw SP (0.6857) — the pipeline destroys even pure SP signal.
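For illustration, here is a minimal pool-adjacent-violators routine, the core of what `sklearn.isotonic.IsotonicRegression` fits. The repair is not the algorithm but where it is fitted: on a disjoint calibration fold, never on the same rows the model trained on, and with any race renormalization applied before, not after, the calibration check.

```python
import numpy as np

def pav(y):
    """Pool-adjacent-violators: non-decreasing fit to a sequence y.
    Fit this on a held-out calibration fold, then apply to test scores."""
    blocks = [[float(v), 1] for v in y]  # [block mean, block size]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:   # violator: pool the pair
            m0, w0 = blocks[i]
            m1, w1 = blocks[i + 1]
            blocks[i] = [(m0 * w0 + m1 * w1) / (w0 + w1), w0 + w1]
            del blocks[i + 1]
            i = max(i - 1, 0)                 # pooling can create new violators
        else:
            i += 1
    out = []
    for m, w in blocks:
        out.extend([m] * w)
    return np.array(out)
```

An in-sample fit makes the calibrator memorize noise in the training scores, which is one way a calibrated SP-only model can end up scoring worse than raw SP.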
💾 Data Reality
Database: greyhound_full.db
Total Races: 951k
Total Runners: 3.74M
Runners w/ SP: 2.29M
Runners w/o SP: 1.45M
Where the Missing SP Actually Is
Trial Races (No SP Expected): 735k
T1/T2/T3/T4 schooling runs. No bookmaker market. No SP should exist. Use for form learning only.
Graded — SP Missing (Pipeline Failure): 551k
Actual starters who finished. GBGB API lost their SP data. Betfair BSP will fill ~250k of these.
Non-Runners / Voids: 161k
finish_pos=0 (handslips, no-race) + NULL finish_pos (reserve dogs never ran). Exclude from everything.
Critical: "5 out of 6 runners have SP, 1 doesn't" is the dominant pattern across 353k races. Even market favourites (550k market_pos=1 runners) are missing SP. This is a systematic GBGB database pipeline failure.
Non-Runner Breakdown
| Type | Count | Has SP? | Comments |
|---|---|---|---|
| finish_pos = 0 | 56,514 | 2,307 (4%) | Handslip (23k), NoRace (7k), NoTrial (2k), misc |
| finish_pos = NULL | 147,138 | 123 (0.08%) | No comments. Spread across A1-D3 grades. Reserve dogs. |
| Total non-finishers | 203,652 | — | These should never be in betting analysis |
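Given those counts, the exclusion rule is simple to express. The `finish_pos` column name is from the source; the function itself is a sketch:

```python
import pandas as pd

def flag_non_runners(runners: pd.DataFrame) -> pd.Series:
    """True for rows that should never enter betting analysis:
    finish_pos == 0 (handslips / no-race) or NULL (reserve dogs
    who never actually ran)."""
    return (runners["finish_pos"] == 0) | runners["finish_pos"].isna()
```

Applying `runners[~flag_non_runners(runners)]` before any modelling keeps the 204k non-finishers out of both training and evaluation.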
🔍 What We Haven't Tried
Complete Market Data (Never Had)
Every experiment used partial SP (61% of runners). Race-level normalization divided by partial sums. We have never tested against a complete market benchmark.
Betfair BSP as Benchmark (Available Free)
BSP is the closing exchange consensus — harder and better benchmark than bookmaker SP. 100% coverage for BAGS races. Free CSVs back to 2008.
Full-Field Race Evaluation (Never Done)
All evaluation was on runners-with-SP subset. Never evaluated model on the full field of actual starters in a race.
Model-Only vs Market-Only vs Combined (Not Properly Tested)
Clean experiment: (1) fundamental model (no SP input), (2) market-only (BSP), (3) combined. Tests whether fundamentals add value beyond market. Requires complete pricing.
Trial Races as Form Input (Not Done)
735k trial runs contain real dog performance data (times, events, interference). Should feed into dog form calculations but not into betting evaluation. Currently ignored or polluting.
Proper Data Segmentation (Not Done)
No separation of: trials (form only), BAGS with complete data (model + bet), non-BAGS graded (form only), non-runners (exclude). All mixed together.
🏗️ How the Model Will Work
The proposed architecture for the next phase. Four layers, each testable independently.
🐕 Layer 1: Dog Assessment (walk-forward form)
Interference-adjusted form: Strip bumps/baulks from time calculations. A dog averaging 30.2s because of bumps looks worse on paper than it runs.
Speed residuals: Track/distance/grade-adjusted times. Walk-forward only.
Experience proxy: Career runs, recency, track/distance familiarity.
Breeding prior with decay: Sire/dam progeny stats for novices, weighted inversely to the dog's own run count. High value for first-timers, fades with experience.
Trial data feeds in here: Performance from trial runs (times, events) is included in form history, even though trials aren't bet on.
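The breeding prior with decay can be written as a single shrinkage weight. `k` here is a hypothetical pseudo-count, not a tuned value:

```python
def blended_ability(own_form, breeding_prior, n_runs, k=5.0):
    """Shrink toward the sire/dam prior when the dog has few runs.
    k is a hypothetical pseudo-count controlling how fast the
    prior fades as the dog accumulates its own form."""
    w = k / (k + n_runs)   # 1.0 for a first-timer, -> 0 with experience
    return w * breeding_prior + (1.0 - w) * own_form
```

A first-timer (`n_runs=0`) is scored entirely from breeding; after `k` runs the prior and the dog's own form carry equal weight.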
🏁 Layer 2: Race Comparison (relative gaps within field)
Relative ability: Gap between this dog's assessment score and the field average/best.
Field strength: How strong is the competition? Mean/spread of Layer 1 scores.
Class dog flag: Is this the standout in the field?
All computed within the race at prediction time.
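Layer 2 is naturally a within-race groupby. Column names (`race_id`, `layer1_score`) and the exact features are illustrative, not the project's actual schema:

```python
import pandas as pd

def race_relative(df: pd.DataFrame, score_col: str = "layer1_score"):
    """Relative-gap features computed within each race at prediction
    time. Sketch only; column names are assumptions."""
    g = df.groupby("race_id")[score_col]
    out = df.copy()
    out["gap_to_mean"] = df[score_col] - g.transform("mean")   # relative ability
    out["gap_to_best"] = df[score_col] - g.transform("max")
    out["field_spread"] = g.transform("std")                   # field strength
    out["is_class_dog"] = (out["gap_to_best"] == 0).astype(int)
    return out
```

Because everything is a `transform` over `race_id`, the features use only same-race, same-time information, so they stay walk-forward-safe by construction.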
📍 Layer 3: Context (track/trap/conditions)
Track affinity: Dog's history at this specific track.
Trap/distance fit: Performance by trap position and race distance.
Forward bump risk: Is this dog bump-prone at this trap/track combination? (Pattern flag, not standalone feature.)
Crowding/draw risk: Multiple early-pace dogs in traps 3-6 = traffic risk.
🔗 Layer 4: Compound Signal (interactions)
Thesis: SP prices individual factors well. The edge (if it exists) is in combinations that SP misprices, e.g. strong dog + right draw + weak field + low interference risk.
Pre-declared interactions only. No data-mined combos. Only tested after Layers 1-3 prove value individually.
📊 Evaluation Framework
Primary: Race log-loss, selected-bet ECE
Secondary: AUC, top-1 hit rate
Decisive: Flat-stake ROI after commission
Benchmark: Model vs BSP vs combined
Walk-forward only. Pre-registered segments.
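The primary metric might be computed along these lines, renormalizing within each race's *full* field before scoring. Column names are assumptions:

```python
import numpy as np
import pandas as pd

def race_log_loss(df: pd.DataFrame) -> float:
    """Mean negative log-probability assigned to each race winner,
    with probabilities renormalized over the whole field per race.
    Assumes columns: race_id, prob (model output), won (0/1)."""
    total = 0.0
    races = df.groupby("race_id")
    for _, race in races:
        p = race["prob"].to_numpy()
        p = p / p.sum()                         # normalize over the full field
        winner_p = p[race["won"].to_numpy().astype(bool)][0]
        total += -np.log(winner_p)
    return total / races.ngroups
```

Unlike AUC, this metric punishes miscalibration directly, which is why it sits above AUC in the hierarchy here.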
💰 Betting Decision
Only on: BAGS races with complete BSP
Gate: Model conviction threshold (TBD from eval)
SP band: Constraint from Phase 7 pocket (likely SP 2-6)
Stake: Flat stake first, Kelly only if flat-stake profitable
FLB correction: Compare model ROI vs raw BSP ROI at same gate
Key principle: Predict winners first, then figure out betting strategy. Don't optimise betting rules on a model that hasn't been validated bottom-up.
Bump dual-use: Backward = strip interference from form calculations (Layer 1, data quality). Forward = flag bump-prone trap/track patterns (Layer 3, risk signal). Two separate uses of the same raw data.
Betfair BSP
100% BSP coverage for BAGS races. Free, daily CSVs. Match by track + date + trap + dog name. Coverage: 2008-present.
greyhound_full.db — Segmented
| Segment | Runners | Use | Bet? |
|---|---|---|---|
| BAGS + Complete BSP | ~1.5-2M | Full model training + evaluation | ✅ Yes |
| Trials (T1-T4) | 735k | Dog form history (times, events) | ❌ Never |
| Non-BAGS Graded | ~500k | Dog form history only | ❌ No pricing data |
| Non-Runners / Voids | 204k | Exclude entirely | ❌ Exclude |
Trial race value: Trials are novice/schooling runs — exactly the population where breeding priors matter most and form data is thinnest. Including trial performances in Layer 1 dog assessment gives the model earlier data on young dogs, especially first-timers.
🎯 Next Phase: Data Repair + Decisive Test
Step 1: Betfair BSP Backfill (Immediate)
Download daily BSP CSVs (2015-2026, ~4,000 files). Match to DB by track + date + trap + dog name. Store in bsp_price column. Expected: ~250k additional runners get pricing data. All BAGS races become complete.
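The matching join could look like the sketch below. Column names on both sides are assumptions, and real BSP files will need more aggressive name normalization (punctuation, apostrophes, sponsor prefixes) than shown:

```python
import pandas as pd

def attach_bsp(runners: pd.DataFrame, bsp: pd.DataFrame) -> pd.DataFrame:
    """Left-join Betfair BSP onto DB runners by track + date + trap +
    dog name. Key column names are assumptions about both schemas."""
    keys = ["track", "race_date", "trap", "dog_name"]
    r, b = runners.copy(), bsp.copy()
    for df in (r, b):
        # crude name normalization so 'fast lad ' matches 'Fast Lad'
        df["dog_name"] = df["dog_name"].str.strip().str.upper()
    return r.merge(b[keys + ["bsp_price"]], on=keys, how="left")
```

A `how="left"` merge keeps every DB runner, so unmatched rows surface as NULL `bsp_price` and can be counted to verify the expected ~250k fill rate.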
Step 2: Data Segmentation (Immediate)
Flag every race: is_trial, is_bags, has_complete_bsp, is_void. Only races with complete BSP enter the modelling/betting universe. Trials + non-BAGS feed dog form history only.
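A sketch of the flagging pass. Input column names (`grade`, `is_bags`, `bsp_missing`) are assumptions about the repaired schema:

```python
import pandas as pd

def segment_flags(races: pd.DataFrame) -> pd.DataFrame:
    """Per-race flags gating the modelling/betting universe.
    Trials and non-BAGS races still feed form history, but only
    rows with in_betting_universe=True are modelled or bet."""
    out = races.copy()
    out["is_trial"] = out["grade"].isin(["T1", "T2", "T3", "T4"])
    out["has_complete_bsp"] = out["bsp_missing"] == 0
    out["in_betting_universe"] = (
        out["is_bags"] & out["has_complete_bsp"] & ~out["is_trial"]
    )
    return out
```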
Step 3: Clean Baseline (After Data Repair)
On the complete-BSP universe: (a) BSP-only baseline (AUC, race log-loss, ROI), (b) Fundamental-only model (no market input), (c) Combined model. If fundamentals don't beat BSP on race log-loss → stop.
Step 4: Layer-by-Layer Build (If Step 3 Shows Signal)
Build Layers 1-4 one at a time. Each layer must improve race log-loss ≥0.5% AND not worsen selected-bet ECE AND not worsen model-BSP disagreement correlation. Kill rules enforced at each stage.
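The kill rule can be encoded as a gate function. The 0.5% threshold comes from the plan above; the signature is a sketch, and the model-BSP disagreement-correlation check would be a third input in the same shape:

```python
def layer_passes(prev_logloss, new_logloss, prev_ece, new_ece,
                 min_gain=0.005):
    """A new layer survives only if it cuts race log-loss by at least
    min_gain (relative, 0.005 = 0.5%) AND does not worsen selected-bet
    ECE. Sketch of the enforcement, not the project's actual harness."""
    rel_gain = (prev_logloss - new_logloss) / prev_logloss
    return rel_gain >= min_gain and new_ece <= prev_ece
```

Running this after each layer makes the kill decision mechanical rather than judgment-based, which is the point of pre-registering it.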
Step 5: Execution Realism (Final Gate)
ROI after Betfair commission (5% on net winnings). Liquidity check — are these bets actually executable at the prices shown? Only flat-stake ROI after friction counts.
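The decisive number, flat-stake ROI after commission, might be computed as below. Liquidity is not modelled here; `results` is a hypothetical list of settled bets:

```python
def flat_stake_roi(results, commission=0.05, stake=1.0):
    """Flat-stake ROI after Betfair-style commission on net winnings.
    `results` = [(bsp_price, won), ...]. Execution slippage and
    liquidity limits are deliberately out of scope for this sketch."""
    staked = profit = 0.0
    for price, won in results:
        staked += stake
        if won:
            gross = stake * (price - 1.0)          # net winnings on the bet
            profit += gross * (1.0 - commission)   # commission on wins only
        else:
            profit -= stake
    return profit / staked
```

For example, one winner at BSP 3.0 and one loser at flat £1 stakes yields (2 × 0.95 − 1) / 2 = +45% before any liquidity haircut.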
Stop conditions: If, after complete market data, the combined model does not beat BSP on race log-loss and calibration, OR ROI disappears after commission and realistic liquidity — stop cleanly. No more feature mining on partial data.