Nicholas Harris — Portfolio Project
This project ingests historical + current-season data, engineers playoff-relevant regular-season features, trains survival and matchup models, simulates the bracket, and serves outputs in a polished analytics interface.
Pipeline Layers
4
Model Family
Survival + Matchup
Warehouse
DuckDB
Frontend
Streamlit + Portfolio
Ingestion scripts load game and player logs, normalize seasons, and validate completeness before modeling.
Regular-season feature builders create team-level predictors like net rating, efficiency, and top-team performance.
Survival model estimates title progression risk; matchup model predicts series outcomes for bracket simulation.
Analytics dashboard highlights title odds, play-in paths, and projected winners with explainability hooks.
Project Anatomy
`pipeline/ingestion/*` pulls raw logs and writes warehouse tables.
`pipeline/features/*` builds reusable feature tables with controlled season alignment.
`pipeline/models/*` fits survival + matchup artifacts and exports diagnostics.
Monte Carlo simulation projects current-season bracket and play-in outcomes.
`app/main.py` renders analytics views and an AI analyst chat layer.
Model Architecture
All features are regular-season only — no playoff data leaks into any input. Validated with Leave-One-Year-Out backtesting across 15 seasons (2010–2024).
Survival Model (CoxPH)
Predicts playoff depth — treating rounds reached as a survival time and estimating each team's hazard of early elimination.
CoxPH treats each playoff round as a discrete survival step. A team's hazard ratio reflects how quickly the model expects it to be eliminated — high survival score = projected deeper run. This framing captures ordinal depth (R1 vs. R4 exit) that a binary classifier would collapse into a single label.
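The discrete-step framing can be illustrated without the fitted model: each round carries an elimination hazard, and projected depth is the running product of per-round survival. The hazard numbers below are hypothetical — the actual CoxPH model derives them from regular-season features.

```python
def survival_curve(hazards):
    """P(still alive after each round), given per-round elimination hazards."""
    curve, alive = [], 1.0
    for h in hazards:
        alive *= 1.0 - h        # survive this round's elimination risk
        curve.append(alive)
    return curve

# Hypothetical hazards -- the fitted CoxPH model estimates these from features.
strong = survival_curve([0.10, 0.25, 0.35, 0.45])   # projected deep run
weak = survival_curve([0.60, 0.55, 0.50, 0.50])     # likely early exit
```

The curve is monotone decreasing, so a single survival score still encodes ordinal depth: a strong team's end-of-bracket survival dominates a weak team's at every round.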
Matchup Model (Logistic Regression)
Binary classifier — given two teams in a series, who wins? Outputs calibrated win probabilities for Monte Carlo sampling.
Features are pairwise differentials of the same five regular-season metrics. Logistic regression keeps outputs well-calibrated, which matters when downstream Monte Carlo draws treat them as true probabilities. By-round accuracy: R1 76.7% · R2 65.0% · CF 70.0% · Finals 80.0%.
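A minimal sketch of the differential-feature logistic. The weights and feature names below are hypothetical placeholders — the fitted coefficients and exact metric list are not reproduced here.

```python
import math

# Hypothetical coefficients for illustration only.
WEIGHTS = {"net_rating": 0.25, "off_eff": 0.10, "def_eff": -0.12}

def series_win_prob(team_a, team_b, weights=WEIGHTS):
    """P(team_a beats team_b) from pairwise feature differentials."""
    z = sum(w * (team_a[k] - team_b[k]) for k, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))   # logistic link keeps output in (0, 1)

a = {"net_rating": 6.0, "off_eff": 118.0, "def_eff": 110.0}
b = {"net_rating": 2.0, "off_eff": 114.0, "def_eff": 113.0}
```

One property falls out of the differential construction for free: swapping the two teams flips the sign of every feature, so P(A beats B) + P(B beats A) = 1 exactly — a consistency the Monte Carlo layer relies on.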
Bracket Simulation (Monte Carlo)
Propagates matchup win probabilities through 10,000 full bracket draws to produce title odds and round-by-round probabilities.
At each round, outcomes are Bernoulli draws from the matchup model's win probabilities. Title probability = fraction of runs where each team wins the Finals. Path-dependence is captured naturally — a tough projected R2 matchup suppresses title odds even for a team favoured in Round 1.
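The Bernoulli-draw propagation can be sketched on a toy four-team bracket. Team names, strength ratings, and the Bradley–Terry stand-in for the matchup model are all hypothetical.

```python
import random

def simulate_bracket(teams, win_prob, n_sims=10_000, seed=0):
    """Monte Carlo title odds for a single-elimination bracket.

    teams: list (length a power of two) in bracket seeding order.
    win_prob(a, b): P(a beats b) in a series -- stand-in for the
    matchup model's calibrated output.
    """
    rng = random.Random(seed)
    titles = {t: 0 for t in teams}
    for _ in range(n_sims):
        field = list(teams)
        while len(field) > 1:
            # Bernoulli draw per series, advancing the sampled winner
            field = [a if rng.random() < win_prob(a, b) else b
                     for a, b in zip(field[::2], field[1::2])]
        titles[field[0]] += 1
    return {t: n / n_sims for t, n in titles.items()}

# Toy field with hypothetical strength ratings:
strength = {"BOS": 8.0, "DEN": 6.0, "MIL": 5.0, "OKC": 4.0}
p = lambda a, b: strength[a] / (strength[a] + strength[b])
odds = simulate_bracket(["BOS", "OKC", "DEN", "MIL"], p)
```

Path-dependence appears exactly as described above: a team's title odds depend on every opponent it might draw en route, not just its first-round probability.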
Series Length (Negative Binomial)
Given a series win probability, derives the full distribution P(series = 4, 5, 6, or 7 games) with no additional parameters.
Back-calculates the implied per-game win probability via numerical root-finding on the negative binomial CDF, then applies the PMF analytically. A 70% series edge implies a 4-game sweep ~14% of the time vs. a 7-game series ~30% — the app surfaces the full distribution, not just the modal outcome.
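The back-calculation can be sketched directly with the standard library: bisect the best-of-7 win curve to recover the per-game probability, then apply the series-length PMF analytically. Illustrative only — the production code path may differ.

```python
from math import comb

def series_win_from_game(p):
    """P(winning a best-of-7) given per-game win probability p."""
    return sum(comb(3 + k, k) * p**4 * (1 - p) ** k for k in range(4))

def implied_game_prob(series_p, tol=1e-10):
    """Invert series_win_from_game by bisection (it is monotone in p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if series_win_from_game(mid) < series_p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def length_distribution(series_p):
    """P(series lasts 4, 5, 6, 7 games), counting either team as winner."""
    p = implied_game_prob(series_p)
    q = 1 - p
    # Winner takes 3 of the first 3+k games, then clinches game 4+k.
    return {4 + k: comb(3 + k, 3) * (p**4 * q**k + q**4 * p**k)
            for k in range(4)}

dist = length_distribution(0.70)
```

Because the series-win curve is strictly monotone in p, plain bisection suffices — no optimizer dependency is needed, and the resulting four probabilities sum to one by construction.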
Standings Simulation
Projects final standings for all 30 teams by simulating every remaining regular-season game with logistic win probabilities.
Per-game win probability uses a calibrated logistic model on net-rating differentials with a home-court advantage parameter (HOME_ADV = 0.10 → equal-strength teams win at home 52.5%). 5,000 vectorised Bernoulli draws produce seed distributions, auto-bid probability, and play-in odds per team.
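A scalar sketch of the per-game model and the Bernoulli draws (the real pipeline vectorises this across 5,000 runs). HOME_ADV = 0.10 comes from the write-up; the slope on the net-rating differential is a hypothetical placeholder.

```python
import math
import random

HOME_ADV = 0.10    # from the write-up: equal-strength teams win at home 52.5%
NET_SCALE = 0.15   # hypothetical slope on the net-rating differential

def home_win_prob(net_home, net_away):
    """Per-game home win probability from a logistic on net-rating diff."""
    z = NET_SCALE * (net_home - net_away) + HOME_ADV
    return 1.0 / (1.0 + math.exp(-z))

def simulate_wins(game_probs, n_sims=5000, seed=1):
    """Bernoulli-draw win totals across a list of remaining games."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for p in game_probs) for _ in range(n_sims)]

# Toy schedule: a +2.0 net-rating team hosting four opponents.
games = [home_win_prob(2.0, d) for d in (-3.0, 0.0, 4.0, 1.0)]
totals = simulate_wins(games)
```

Note that sigmoid(0.10) ≈ 0.525, which is exactly the quoted equal-strength home win rate; the distribution of `totals` is what feeds seed and play-in odds downstream.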
Live Artifacts
Run Locally
```bash
# Rebuild warehouse tables, features, and model artifacts (skip raw fetch)
python -m pipeline.run_pipeline --skip-fetch --with-model

# Score the current season and run the bracket / standings simulations
python -m pipeline.models.predict_current
python -m pipeline.models.simulation

# Launch the analytics dashboard
streamlit run app/main.py

# Serve the static portfolio page
python -m http.server
# open /portfolio/index.html
```