🐕 Greyhound Model — Project Map

What we've tried, what we haven't, and how the model will work.

Last updated: 29 March 2026 — Phase 9B complete

📊 Current Status

Best Clean AUC: 0.686
Walk-Forward AUC: 0.637
SP Baseline AUC: 0.636
Experiments Run: 120+
Races w/ Complete SP: 0%
Runners w/ Any SP: 61%
Key finding: The conclusion "SP prices everything" was built on incomplete data. Only 61% of runners have SP. Zero races have complete SP coverage. The benchmark itself is broken. We cannot claim market efficiency when we only observed 61% of the market.
What's real: With strict walk-forward features, model AUC = 0.6366 vs SP AUC = 0.6361. Delta = +0.0005. On the observable subset, SP does nearly all the work.
Silver lining: Free Betfair BSP data available back to 2008. 100% runner coverage for BAGS races. This repairs the benchmark.
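The headline coverage numbers reduce to a per-race aggregation. A minimal sketch with toy rows; the column names (`race_id`, `sp`, with `None` meaning missing SP) are illustrative, not the actual schema:

```python
import pandas as pd

# Toy runner rows: two 3-dog races, neither with complete SP coverage.
runners = pd.DataFrame({
    "race_id": [1, 1, 1, 2, 2, 2],
    "dog":     ["A", "B", "C", "D", "E", "F"],
    "sp":      [3.5, 2.0, None, 4.0, None, None],
})

# Per-race SP coverage, then the two headline numbers.
coverage = runners.assign(has_sp=runners["sp"].notna()).groupby("race_id")["has_sp"].mean()
complete_races = int((coverage == 1.0).sum())   # races with every runner priced
runner_coverage = runners["sp"].notna().mean()  # share of runners with any SP
print(complete_races, round(runner_coverage, 2))   # 0 0.5
```

Run against the real DB, the same three lines produce the 0% / 61% figures above.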

📅 What We Tried

Phase 6A-B: Feature Cache Build (Invalidated)
Built cached_features_v2.parquet — 130 features, 951k rows. Baseline AUC 0.7747.
Problem: Features computed using full dog careers, not walk-forward. Leaked future data. AUC 0.77 was fake.
Phase 6C: 37 vs 130 Feature Comparison (Invalidated)
130 features beat 37 by +0.015 AUC. But both built on leaked cache. ROI still deeply negative (-33%).
Wave 0: Selector Surgery (No-Go)
7 predeclared selector strategies on frozen OOF predictions. All failed.
Key finding: EV curve is inverted — higher EV thresholds = worse ROI. The model's strongest disagreements with SP are its worst bets.
Wave 1a: Feature Pruning (C1 + C2) (No-Go)
C1: removed EV-correlated features → worsened. C2: removed time_rank_5 + dog_distance_speed_rank → worsened (-1.88pp ROI).
Key finding: Features carry needed discrimination. Removing them hurts even though they correlate with EV inflation.
Wave 1b: Three Feature Bundles (All No-Go)
Bundle 4 (Data Quality Gates): Quality filters hurt ROI. Edge lives in uncertain/sparse races.
Bundle 2 (Speed-in-Context): +0.39pp delta, needed +3pp. SP prices speed.
Bundle 1 (Pace Dynamics): 10 new features, ROI worsened -1.07pp. SP prices pace.
Phase 7: Blind Holdout on P1 Champion (Mixed)
44,797 races, 185k runners. AUC 0.7703 (model generalises). Only positive line: SP[2-6] gate=0.40 → 532 bets, ROI +2.4%.
Note: This was tested on the leaked P1 model. The +2.4% pocket is built on inflated AUC.
Phase 8: Clean Rebuild — All 7 Clusters (SP Prices Everything)
Built from raw DB with clean walk-forward features. Conditional logit + LightGBM.
Cluster                 | Features | Holdout AUC | vs D0  | Verdict
D0 SP Baseline          | 4-9      | 0.686       |        | Baseline
D1 Evidence/Reliability | +11      | 0.6846      | 0.000  | No-Go
D2 Clean Form           | +8       | 0.6843      | 0.000  | No-Go
D3 Race Comparison      | +6       | 0.6843      | 0.000  | No-Go
D4 Context Fit          | +5       | 0.6843      | 0.000  | No-Go
D5 Interference/Draw    | +6       | 0.6843      | 0.000  | No-Go
D6 LightGBM All         | 34       | 0.6148      | -0.070 | Harmful
Phase 9A: Leakage Audit (Confirmed)
Date-shuffle test: shuffled AUC 0.594 ≈ normal 0.598. Temporal structure doesn't matter → features leaked future info. P1's AUC 0.77 was fake. True ceiling ~0.69.
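The date-shuffle logic can be shown in miniature: if a feature is built from future information, scrambling temporal order barely changes its AUC. A self-contained sketch with synthetic data and a hand-rolled AUC (not the actual Phase 9A script):

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(y, score):
    # Mann-Whitney AUC: probability a positive outranks a negative.
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

n = 10_000
y = rng.integers(0, 2, n)
# "Leaked" feature: constructed from the (future) outcome itself plus noise.
leaked = y + rng.normal(0, 0.5, n)
half = n // 2

a_chrono = auc(y[half:], leaked[half:])            # chronological holdout
shuf = rng.permutation(n)                          # date-shuffled variant
a_shuffled = auc(y[shuf][half:], leaked[shuf][half:])
print(round(a_chrono, 3), round(a_shuffled, 3))
```

Both splits score essentially the same inflated AUC, which is exactly the signature observed (0.594 ≈ 0.598): temporal order is irrelevant to a leaked feature.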
Overnight V2-V6: 120+ Experiments (No Edge)
Architecture variants, segment specialists, novice deep-dives, regularization sweeps, DART, LambdaRank, ablations. All ROI was FLB arbitrage, not model skill. V01 SP-only produced +31% ROI — same as complex models.
D6 Proper: Compound Interactions (No Edge)
9 predeclared interactions from the Phase 8 plan. All deltas < 0.001 on the full population. Ablations show no individual interaction matters. Only the EXP6 novice segment showed +4.57pp, and that turned out to be holdout contamination.
Phase 9B: Novice Blind Test (Small Signal)
Clean walk-back test. Model AUC 0.6463 vs SP 0.6334 on novice subset. Delta = +1.3pp. Signal is real but modest. Original +4.57pp was holdout contamination.
Comprehensive Audit (Bugs Found)
12-point audit + independent GPT review found multiple bugs invalidating results. Walk-forward-only model AUC = SP AUC to 4 decimal places.
SP Data Gap Discovery (New Direction)
Zero races have complete SP. 735k trial runners polluting analysis. 551k graded runners finished races without SP (GBGB pipeline failure). Betfair BSP available free back to 2008.

🐛 Bugs Found in Audit

Subset Normalization Bug (Broad Impact)
Race probabilities normalized only within filtered subset (e.g. novice runners), not full field. Inflated all specialist/pocket AUCs. Affected: phase8_d6_proper subsets, overnight v2-v6 subsets.
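A minimal reproduction of the bug, using invented scores and a hypothetical `novice` flag: normalising after filtering inflates the subset's probabilities relative to the true full-field values.

```python
import pandas as pd

# One 6-dog race with unnormalised model scores; `novice` is a toy flag.
field = pd.DataFrame({
    "dog":    list("ABCDEF"),
    "score":  [0.30, 0.25, 0.20, 0.10, 0.10, 0.05],
    "novice": [True, True, False, False, False, False],
})

# Correct: normalise over the FULL field, then filter to the subset.
field["p_full"] = field["score"] / field["score"].sum()

# Buggy: filter first, then normalise within the subset; probabilities inflate.
sub = field[field["novice"]].copy()
sub["p_sub"] = sub["score"] / sub["score"].sum()

print(field.loc[field["novice"], "p_full"].round(3).tolist())   # [0.3, 0.25]
print(sub["p_sub"].round(3).tolist())                           # [0.545, 0.455]
```

Any AUC computed on the buggy `p_sub` values is measured against distorted probabilities, which is why the specialist/pocket results were inflated.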
Phase 9B Reverse-Time Split (Invalid)
Trained on 2024-2026, tested on 2020-2023. This is backwards — future data predicting the past. Both phase9b scripts affected.
Overnight V3-V6 Feature Misalignment (Data Corruption)
Sorted a copy of the dataframe, computed features, then assigned .values back positionally to the original unsorted dataframe. Features attached to wrong dogs.
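A minimal reproduction: `.values` discards the index, so positional assignment from a sorted copy scrambles rows, while assigning the Series lets pandas realign on the index.

```python
import pandas as pd

# Minimal reproduction of the bug. Original row order: B, A, C.
df = pd.DataFrame({"dog": ["B", "A", "C"], "time": [30.5, 29.8, 31.0]})

tmp = df.sort_values("dog")        # sorted COPY: A, B, C
tmp["rank"] = tmp["time"].rank()

# Buggy: `.values` drops the index, so ranks land on the wrong dogs.
df["rank_buggy"] = tmp["rank"].values
# Correct: assign the Series and let pandas realign on the index.
df["rank_ok"] = tmp["rank"]

print(df["rank_buggy"].tolist())   # [1.0, 2.0, 3.0]: B wrongly gets A's rank
print(df["rank_ok"].tolist())      # [2.0, 1.0, 3.0]
```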
LambdaRank Group Bug (Invalid)
Rows not sorted by race_id before passing group arrays to LightGBM ranker. Phase8 D6 catastrophic 0.6148 AUC was probably this bug.
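A sketch of the correct preparation: LightGBM's `group` argument is a list of contiguous block sizes, so rows must be sorted by `race_id` first or runners from different races get silently mixed into one ranking group. (The final `fit` line is indicative only.)

```python
import pandas as pd

# Toy frame with interleaved races, as in the buggy script.
df = pd.DataFrame({"race_id": [2, 1, 2, 1, 1], "feat": [0.1, 0.4, 0.2, 0.3, 0.5]})

# Sort so each race occupies one contiguous block, THEN build group sizes.
df = df.sort_values("race_id", kind="stable").reset_index(drop=True)
group = df.groupby("race_id", sort=False).size().tolist()
print(group)   # [3, 2]: three runners in race 1, two in race 2

# ranker.fit(X, y, group=group)   # sketch only: `ranker` would be lgb.LGBMRanker
```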
D0 Baseline Drift (Comparison Invalid)
D0 started as 9 features, rerun used 4. D1-D5 compared against 4-feature version. Not apples-to-apples.
Isotonic Calibration In-Sample (Signal Loss)
Calibration fitted in-sample, then race renormalization breaks it. SP-only model (0.6854) underperforms raw SP (0.6857) — the pipeline destroys even pure SP signal.

💾 Data Reality

Database: greyhound_full.db

Total Races: 951k
Total Runners: 3.74M
Runners w/ SP: 2.29M
Runners w/o SP: 1.45M

Where the Missing SP Actually Is

Trial Races (No SP Expected): 735k
T1/T2/T3/T4 schooling runs. No bookmaker market. No SP should exist. Use for form learning only.
Graded — SP Missing (Pipeline Failure): 551k
Actual starters who finished. GBGB API lost their SP data. Betfair BSP will fill ~250k of these.
Non-Runners / Voids: 161k
finish_pos=0 (handslips, no-race) + NULL finish_pos (reserve dogs never ran). Exclude from everything.
Critical: "5 out of 6 runners have SP, 1 doesn't" is the dominant pattern across 353k races. Even market favourites (550k market_pos=1 runners) are missing SP. This is a systematic GBGB database pipeline failure.

Non-Runner Breakdown

Type                | Count   | Has SP?     | Comments
finish_pos = 0      | 56,514  | 2,307 (4%)  | Handslip (23k), NoRace (7k), NoTrial (2k), misc
finish_pos = NULL   | 147,138 | 123 (0.08%) | No comments. Spread across A1-D3 grades. Reserve dogs.
Total non-finishers | 203,652 |             | These should never be in betting analysis
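The segmentation implied by the tables above could be captured in one classification function. A hedged sketch with hypothetical field names (`race_class`, `finish_pos`, `sp`):

```python
def classify_runner(race_class, finish_pos, sp):
    """Segment one runner row per the rules above (field names hypothetical)."""
    if race_class in {"T1", "T2", "T3", "T4"}:
        return "trial_form_only"        # schooling run: form input, never bet
    if finish_pos is None or finish_pos == 0:
        return "non_runner_exclude"     # reserves, handslips, voids
    if sp is None:
        return "graded_sp_missing"      # GBGB pipeline gap: BSP backfill target
    return "graded_priced"

print(classify_runner("T1", 3, None))   # trial_form_only
print(classify_runner("A1", 0, None))   # non_runner_exclude
print(classify_runner("A1", 2, None))   # graded_sp_missing
print(classify_runner("A1", 1, 2.5))    # graded_priced
```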

🔍 What We Haven't Tried

Complete Market Data (Never Had)
Every experiment used partial SP (61% of runners). Race-level normalization divided by partial sums. We have never tested against a complete market benchmark.
Betfair BSP as Benchmark (Available Free)
BSP is the closing exchange consensus — harder and better benchmark than bookmaker SP. 100% coverage for BAGS races. Free CSVs back to 2008.
Full-Field Race Evaluation (Never Done)
All evaluation was on runners-with-SP subset. Never evaluated model on the full field of actual starters in a race.
Model-Only vs Market-Only vs Combined (Not Properly Tested)
Clean experiment: (1) fundamental model (no SP input), (2) market-only (BSP), (3) combined. Tests whether fundamentals add value beyond market. Requires complete pricing.
Trial Races as Form Input (Not Done)
735k trial runs contain real dog performance data (times, events, interference). Should feed into dog form calculations but not into betting evaluation. Currently ignored or polluting.
Proper Data Segmentation (Not Done)
No separation of: trials (form only), BAGS with complete data (model + bet), non-BAGS graded (form only), non-runners (exclude). All mixed together.

🏗️ How the Model Will Work

The proposed architecture for the next phase. Four layers, each testable independently.

📥 Data Ingestion & Segmentation

GBGB API → greyhound_full.db. Betfair BSP CSV → bsp_price column.
Segment: Trial (form only) | BAGS + complete BSP (model + bet) | Non-BAGS graded (form only) | Non-runners (exclude)

🐕 Layer 1: Dog Assessment (race-independent)

Interference-adjusted form: Strip bumps/baulks from time calculations. A dog averaging 30.2s because of bumps looks slower on raw form than it really is.
Speed residuals: Track/distance/grade-adjusted times. Walk-forward only.
Experience proxy: Career runs, recency, track/distance familiarity.
Breeding prior with decay: Sire/dam progeny stats for novices, weighted inversely to dog's own runs. High value for first-timers, fades with experience.
Trial data feeds in here: Dog performance from trial runs (times, events) included in form history, even though trials aren't bet on.
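The backward use of bump data (Layer 1) can be illustrated as dropping event-flagged runs before averaging times. Toy numbers; the event codes (`Bmp1`, `Crd2`) are illustrative, not the exact GBGB vocabulary:

```python
# One dog's recent runs, some flagged with interference events.
runs = [
    {"time": 30.20, "events": ["Bmp1"]},   # bumped at the first bend
    {"time": 29.80, "events": []},
    {"time": 29.90, "events": ["Crd2"]},   # crowded
    {"time": 29.85, "events": []},
]

clean = [r["time"] for r in runs if not r["events"]]
naive = sum(r["time"] for r in runs) / len(runs)    # interference inflates this
adjusted = sum(clean) / len(clean)                  # trouble-free runs only
print(round(naive, 3), round(adjusted, 3))
```

The adjusted average (29.825s) is over a tenth faster than the naive one, which is exactly the "inflated bad form" the layer is meant to remove.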

🏁 Layer 2: Race Comparison (relative gaps within field)

Relative ability: Gap between this dog's assessment score and field average/best.
Field strength: How strong is the competition? Mean/spread of Layer 1 scores.
Class dog flag: Is this the standout in the field?
All computed within race at prediction time.
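The Layer 2 features reduce to within-race arithmetic over Layer 1 scores. A toy sketch (the scores are invented; in practice they would come from the Layer 1 model):

```python
# Layer 1 assessment scores for one 4-dog field (invented numbers).
scores = {"A": 72.0, "B": 65.0, "C": 64.0, "D": 60.0}

field_mean = sum(scores.values()) / len(scores)
field_best = max(scores.values())
field_spread = field_best - min(scores.values())    # field-strength proxy

features = {
    dog: {
        "gap_to_mean": s - field_mean,      # relative ability
        "gap_to_best": s - field_best,
        "is_class_dog": s == field_best,    # standout-in-field flag
    }
    for dog, s in scores.items()
}
print(features["A"]["gap_to_mean"], features["A"]["is_class_dog"])   # 6.75 True
```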

📍 Layer 3: Context (track/trap/conditions)

Track affinity: Dog's history at this specific track.
Trap/distance fit: Performance by trap position and race distance.
Forward bump risk: Is this dog bump-prone at this trap/track combination? (Pattern flag, not standalone feature.)
Crowding/draw risk: Multiple early-pace dogs in traps 3-6 = traffic risk.

🔗 Layer 4: Compound Signal (interactions)

Thesis: SP prices individual factors well. The edge (if it exists) is in combinations that SP misprices — e.g. strong dog + right draw + weak field + low interference risk.
Pre-declared interactions only. No data-mined combos.
Only tested after Layers 1-3 prove value individually.

📊 Evaluation Framework

Primary: Race log-loss, selected-bet ECE
Secondary: AUC, top-1 hit rate
Decisive: Flat-stake ROI after commission
Benchmark: Model vs BSP vs combined
Walk-forward only. Pre-registered segments.
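As one concrete reading of the primary and secondary metrics, race log-loss and top-1 hit rate on toy races (probabilities assumed already normalised per race):

```python
import math

# Two toy races with per-race-normalised win probabilities (invented numbers).
races = [
    {"probs": {"A": 0.5, "B": 0.3, "C": 0.2}, "winner": "A"},
    {"probs": {"A": 0.25, "B": 0.25, "C": 0.5}, "winner": "B"},
]

# Primary metric: mean negative log probability assigned to the actual winner.
race_log_loss = -sum(math.log(r["probs"][r["winner"]]) for r in races) / len(races)
# Secondary: did the model's top pick win?
top1_hits = sum(max(r["probs"], key=r["probs"].get) == r["winner"] for r in races)
print(round(race_log_loss, 4), top1_hits / len(races))   # 1.0397 0.5
```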

💰 Betting Decision

Only on: BAGS races with complete BSP
Gate: Model conviction threshold (TBD from eval)
SP band: Constraint from Phase 7 pocket (likely SP 2-6)
Stake: Flat stake first, Kelly only if flat-stake profitable
FLB correction: Compare model ROI vs raw BSP ROI at same gate

Key principle: Predict winners first, then figure out betting strategy. Don't optimise betting rules on a model that hasn't been validated bottom-up.
Bump dual-use: Backward = strip interference from form calculations (Layer 1, data quality). Forward = flag bump-prone trap/track patterns (Layer 3, risk signal). Two separate uses of the same raw data.
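One way the decisive flat-stake metric could be computed, with Betfair's 5% commission applied to net winnings on each winning bet. The bets and prices are invented; this is a sketch, not the actual evaluation code:

```python
COMMISSION = 0.05   # Betfair commission on net winnings (assumed rate)

def flat_stake_roi(bets, stake=1.0):
    """bets: iterable of (bsp, won) pairs. Returns ROI after commission."""
    staked = returned = 0.0
    for bsp, won in bets:
        staked += stake
        if won:
            profit = stake * (bsp - 1.0)
            returned += stake + profit * (1.0 - COMMISSION)
    return (returned - staked) / staked

# Four illustrative flat-stake bets: one winner at 4.0, three losers.
bets = [(4.0, True), (3.0, False), (5.0, False), (2.5, False)]
print(round(flat_stake_roi(bets), 4))   # -0.0375
```

Note how commission turns a break-even sequence slightly negative: a gross-profitable pocket can still fail Step 5.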

🔧 Data Pipeline & Segmentation

GBGB API

Race results, dog info, times, SP (partial), events, comments. Source: api.gbgb.org.uk. Coverage: 2015-present.

Betfair BSP CSV (NEW)

100% BSP coverage for BAGS races. Free, daily CSVs. Match by track + date + trap + dog name. Coverage: 2008-present.

greyhound_full.db — Segmented

Segment             | Runners | Use                              | Bet?
BAGS + Complete BSP | ~1.5-2M | Full model training + evaluation | ✅ Yes
Trials (T1-T4)      | 735k    | Dog form history (times, events) | ❌ Never
Non-BAGS Graded     | ~500k   | Dog form history only            | ❌ No pricing data
Non-Runners / Voids | 204k    | Exclude entirely                 | ❌ Exclude
Trial race value: Trials are novice/schooling runs — exactly the population where breeding priors matter most and form data is thinnest. Including trial performances in Layer 1 dog assessment gives the model earlier data on young dogs, especially first-timers.

🎯 Next Phase: Data Repair + Decisive Test

Step 1: Betfair BSP Backfill (Immediate)
Download daily BSP CSVs (2015-2026, ~4,000 files). Match to DB by track + date + trap + dog name. Store in bsp_price column. Expected: ~250k additional runners get pricing data. All BAGS races become complete.
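A sketch of the proposed join, with made-up column names: real Betfair CSVs need their event and selection strings parsed into these keys first, and name normalisation matters before matching.

```python
import pandas as pd

# Hypothetical, simplified frames: one DB runner row, one BSP row.
db = pd.DataFrame({
    "track": ["Romford"], "race_date": ["2024-05-01"],
    "trap": [3], "dog_name": ["SWIFT LAD"],
})
bsp = pd.DataFrame({
    "track": ["Romford"], "race_date": ["2024-05-01"],
    "trap": [3], "dog_name": ["Swift Lad"], "bsp": [4.2],
})

# Normalise the fuzzy key (case, whitespace) on both sides before matching.
for frame in (db, bsp):
    frame["dog_name"] = frame["dog_name"].str.upper().str.strip()

merged = db.merge(bsp, on=["track", "race_date", "trap", "dog_name"], how="left")
print(merged["bsp"].tolist())   # [4.2]
```

A left join keeps DB rows without a BSP match visible, so the backfill's actual coverage can be audited afterwards.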
Step 2: Data Segmentation (Immediate)
Flag every race: is_trial, is_bags, has_complete_bsp, is_void. Only races with complete BSP enter the modelling/betting universe. Trials + non-BAGS feed dog form history only.
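The race-level flags reduce to a groupby over runners. An illustrative sketch showing only `is_trial` and `has_complete_bsp` (column names are assumptions, not the actual schema):

```python
import pandas as pd

# Toy runner table: race 1 is a priced graded race, race 2 is a trial.
runners = pd.DataFrame({
    "race_id":  [1, 1, 2, 2],
    "is_trial": [False, False, True, True],
    "bsp":      [2.0, 3.5, None, None],
})

races = runners.groupby("race_id").agg(
    is_trial=("is_trial", "any"),
    has_complete_bsp=("bsp", lambda s: s.notna().all()),
)
# Only non-trial races with complete BSP enter the betting universe.
races["bettable"] = ~races["is_trial"] & races["has_complete_bsp"]
print(races["bettable"].tolist())   # [True, False]
```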
Step 3: Clean Baseline (After Data Repair)
On the complete-BSP universe: (a) BSP-only baseline (AUC, race log-loss, ROI), (b) Fundamental-only model (no market input), (c) Combined model. If fundamentals don't beat BSP on race log-loss → stop.
Step 4: Layer-by-Layer Build (If Step 3 Shows Signal)
Build Layers 1-4 one at a time. Each layer must improve race log-loss ≥0.5% AND not worsen selected-bet ECE AND not worsen model-BSP disagreement correlation. Kill rules enforced at each stage.
Step 5: Execution Realism (Final Gate)
ROI after Betfair commission (5% on net winnings). Liquidity check — are these bets actually executable at the prices shown? Only flat-stake ROI after friction counts.
Stop conditions: If, after complete market data, the combined model does not beat BSP on race log-loss and calibration, OR ROI disappears after commission and realistic liquidity — stop cleanly. No more feature mining on partial data.