Nicholas Harris — Portfolio Project

From NBA API to playoff forecast app, fully local and reproducible.

This project ingests historical + current-season data, engineers playoff-relevant regular-season features, trains survival and matchup models, simulates the bracket, and serves outputs in a polished analytics interface.


Pipeline Layers

Model Family

Survival + Matchup

Warehouse

DuckDB

Frontend

Streamlit + Portfolio

Data Engineering

Ingestion scripts load game and player logs, normalize seasons, and validate completeness before modeling.
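The completeness check mentioned above could look roughly like the sketch below. It is illustrative only: the row layout (`season`, `team` keys), the function name, and the `expected_games` mapping are assumptions, not the project's actual schema.

```python
from collections import Counter


def find_incomplete_teams(game_rows, expected_games):
    """Return {(season, team): games_found} for team-seasons with an unexpected game count.

    game_rows: iterable of dicts with hypothetical keys "season" and "team",
               one row per team per game.
    expected_games: dict mapping season -> games each team should have played.
    """
    counts = Counter((row["season"], row["team"]) for row in game_rows)
    problems = {}
    for (season, team), found in counts.items():
        if found != expected_games[season]:
            problems[(season, team)] = found
    return problems
```

This only flags teams that appear in the logs with the wrong count; a team missing from the logs entirely would need a separate check against a roster of expected teams.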

Feature Layer

Regular-season feature builders create team-level predictors like net rating, efficiency, and top-team performance.
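As a rough illustration of the kind of team-level predictors described here, the sketch below computes net rating, effective field goal percentage, and free-throw attempts per game from season totals. The `TeamSeasonTotals` fields are hypothetical, and net rating is approximated with a single possession count for both offence and defence.

```python
from dataclasses import dataclass


@dataclass
class TeamSeasonTotals:
    """Hypothetical regular-season totals for one team (field names are assumptions)."""
    points_for: int
    points_against: int
    possessions: float
    fgm: int
    fg3m: int
    fga: int
    fta: int
    games: int


def net_rating(t: TeamSeasonTotals) -> float:
    """Point differential per 100 possessions."""
    return 100.0 * (t.points_for - t.points_against) / t.possessions


def efg_pct(t: TeamSeasonTotals) -> float:
    """Effective FG%: made threes count 1.5x a made two."""
    return (t.fgm + 0.5 * t.fg3m) / t.fga


def fta_per_game(t: TeamSeasonTotals) -> float:
    """Free-throw attempts per game."""
    return t.fta / t.games
```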

Modeling

A survival model estimates how far each team is likely to advance; a matchup model predicts series outcomes for the bracket simulation.

Decision UI

Analytics dashboard highlights title odds, play-in paths, and projected winners with explainability hooks.

Project Anatomy

System flow and why this structure scales

Ingest

`pipeline/ingestion/*` pulls raw logs and writes warehouse tables.

Engineer

`pipeline/features/*` builds reusable feature tables with controlled season alignment.

Train

`pipeline/models/*` fits survival + matchup artifacts and exports diagnostics.

Simulate

Current-season bracket and play-in outcomes are Monte Carlo simulated.

Serve

`app/main.py` renders analytics views and an AI analyst chat layer.

Model Architecture

Five models, one end-to-end forecast

All features are regular-season only — no playoff data leaks into any input. Validated with Leave-One-Year-Out backtesting across 15 seasons (2010–2024).
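Leave-One-Year-Out backtesting holds out one season at a time and trains on the rest. A minimal sketch of the fold construction, assuming seasons are identified by an integer year (the function name is illustrative):

```python
def leave_one_year_out(seasons):
    """Yield (train_seasons, held_out_season) pairs, one fold per season."""
    ordered = sorted(set(seasons))
    for held_out in ordered:
        train = [s for s in ordered if s != held_out]
        yield train, held_out


# 2010 through 2024 inclusive gives 15 folds, each trained on the other 14 seasons.
folds = list(leave_one_year_out(range(2010, 2025)))
```

Each fold's model is fit only on `train` and scored only on `held_out`, so a season never contributes to its own prediction.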

Survival Analysis

Cox Proportional Hazards

Predicts playoff depth — treating rounds reached as a survival time and estimating each team's hazard of early elimination.

0.796 LOYO C-index (±0.074)

80% Champion in top 3 (12/15 seasons)

100% Champion in top 5 (15/15 seasons)

CoxPH treats each playoff round as a discrete survival step. A team's hazard ratio reflects how quickly the model expects it to be eliminated: a lower hazard (a higher survival score) means a deeper projected run. This framing captures ordinal depth (R1 vs. R4 exit) that a binary classifier would collapse into a single label.

Net Rating · vs-Top Win% · Close Game Win% · eFG% · FTA/g
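The C-index quoted for this model measures how often the model ranks a pair of teams in the same order as their actual playoff depth. The sketch below is a simplified version that ignores censoring and compares rounds reached against a survival score (higher meaning a deeper projected run); it is not the project's evaluation code, and the function name is an assumption.

```python
from itertools import combinations


def concordance_index(depths, scores):
    """Fraction of comparable team pairs ranked in the right order.

    depths: rounds reached by each team (larger = deeper run).
    scores: model survival scores (larger = deeper projected run).
    Pairs with equal depth are not comparable; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(depths)), 2):
        if depths[i] == depths[j]:
            continue
        comparable += 1
        deeper, shallower = (i, j) if depths[i] > depths[j] else (j, i)
        if scores[deeper] > scores[shallower]:
            concordant += 1.0
        elif scores[deeper] == scores[shallower]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.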

Logistic Regression

Series Matchup Model

Binary classifier — given two teams in a series, who wins? Outputs calibrated win probabilities for Monte Carlo sampling.

72.9% LOYO series accuracy

0.802 ROC-AUC out-of-sample

225 Series validated

Features are pairwise differentials of the same 5 RS metrics. Logistic regression keeps outputs well-calibrated, which matters when downstream Monte Carlo draws treat them as true probabilities. By-round: R1 76.7% · R2 65.0% · CF 70.0% · Finals 80.0%.

Δ Net Rating · Δ vs-Top Win% · Δ Close Game Win% · Δ eFG% · Δ FTA/g
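To make the pairwise-differential idea concrete, here is a minimal sketch of how a fitted logistic model would turn two teams' feature rows into a series win probability. The feature keys and the coefficients passed in are placeholders, not the trained model's values.

```python
import math

# Hypothetical feature keys mirroring the five regular-season metrics.
FEATURES = ("net_rating", "vs_top_win_pct", "close_game_win_pct", "efg_pct", "fta_per_game")


def differentials(team_a, team_b):
    """Feature vector for a series: team A's value minus team B's, per metric."""
    return [team_a[f] - team_b[f] for f in FEATURES]


def series_win_prob(team_a, team_b, coefs, intercept=0.0):
    """P(team A wins the series) under a logistic model on the differentials."""
    z = intercept + sum(c * d for c, d in zip(coefs, differentials(team_a, team_b)))
    return 1.0 / (1.0 + math.exp(-z))
```

With a zero intercept the model is symmetric: swapping the two teams flips every differential, so the two probabilities sum to 1.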

Monte Carlo

Bracket Simulation

Propagates matchup win probabilities through 10,000 full bracket draws to produce title odds and round-by-round probabilities.

10,000 Full bracket simulations

15 Series per bracket

4 Rounds modelled

At each round, outcomes are Bernoulli draws from the matchup model's win probabilities. Title probability = fraction of runs where each team wins the Finals. Path-dependence is captured naturally — a tough projected R2 matchup suppresses title odds even for a team favoured in Round 1.

Matchup win probs · Bracket seeding · Play-in winners
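The simulation loop described above can be sketched as follows. This is an illustrative version, assuming the bracket is given as a list of teams in bracket order (adjacent entries meet in Round 1) and that `win_prob(a, b)` returns the probability that `a` beats `b` in a series; both names are hypothetical.

```python
import random
from collections import Counter


def simulate_bracket(bracket, win_prob, rng):
    """Play one full bracket; each series is a Bernoulli draw. Returns the champion."""
    alive = list(bracket)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        alive = next_round
    return alive[0]


def title_odds(bracket, win_prob, n_sims=10_000, seed=0):
    """Title probability per team = share of simulated brackets that team wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(bracket, win_prob, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in bracket}
```

Because later-round opponents are whoever survived the earlier draws, a team's title odds automatically reflect the strength of its likely path, not just its first matchup. With 16 teams the loop plays 8 + 4 + 2 + 1 = 15 series over 4 rounds.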

Negative Binomial

Series Length Model

Given a series win probability, derives the full distribution P(series = 4, 5, 6, or 7 games) with no additional parameters.

4 Game-count outcomes

Brent's Root-finding for p_game

0 Extra free parameters

Back-calculates the implied per-game win probability via numerical root-finding on the negative binomial CDF, then applies the PMF analytically. A 70% series edge implies a 4-game sweep ~14% of the time vs. a 7-game series ~30% — the app surfaces the full distribution, not just the modal outcome.

p_series → p_game · P(4g) P(5g) P(6g) P(7g) · Expected games
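A minimal sketch of the back-calculation, under two stated simplifications: every game is treated as an independent draw with the same win probability (no home-court effect), and plain bisection stands in for Brent's method to keep the example free of dependencies. Function names are illustrative.

```python
from math import comb


def series_win_prob(p_game):
    """P(win a best-of-7) when each game is won independently with p_game."""
    q = 1.0 - p_game
    # Win the 4th game after losing k of the previous 3 + k games, k = 0..3.
    return sum(comb(3 + k, k) * p_game**4 * q**k for k in range(4))


def implied_p_game(p_series, tol=1e-12):
    """Invert series_win_prob by bisection (it is increasing on [0, 1])."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if series_win_prob(mid) < p_series:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def series_length_dist(p_series):
    """P(series lasts n games) for n = 4..7, counting wins by either team."""
    p = implied_p_game(p_series)
    q = 1.0 - p
    return {n: comb(n - 1, 3) * (p**4 * q**(n - 4) + q**4 * p**(n - 4))
            for n in (4, 5, 6, 7)}


def expected_games(p_series):
    """Mean series length implied by the distribution above."""
    return sum(n * pr for n, pr in series_length_dist(p_series).items())
```

The only input is the series win probability; the per-game probability and the four length probabilities all follow from it, which is why no extra parameters need to be fitted.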

Monte Carlo

Remaining Schedule Simulation

Projects final standings for all 30 teams by simulating every remaining regular-season game with logistic win probabilities.

5,000 Simulations per run

30 Teams projected

P10–P90 Win range output

Per-game win probability uses a calibrated logistic model on net-rating differentials with a home-court advantage parameter (HOME_ADV = 0.10 on the logit scale, so equal-strength teams win at home 52.5% of the time). 5,000 vectorised Bernoulli draws produce seed distributions, auto-bid probability, and play-in odds per team.

Net rating differential · Home court (+0.10) · Seed distribution
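The sketch below illustrates the remaining-schedule simulation in plain Python. `HOME_ADV = 0.10` comes from the description above; `NET_RATING_COEF` is a made-up placeholder for the fitted slope, and the loop is written game by game for readability rather than vectorised.

```python
import math
import random

HOME_ADV = 0.10          # logit-scale home-court bump (from the text above)
NET_RATING_COEF = 0.15   # hypothetical slope; the fitted value is not given


def home_win_prob(home_net_rating, away_net_rating):
    """Logistic win probability for the home team."""
    z = NET_RATING_COEF * (home_net_rating - away_net_rating) + HOME_ADV
    return 1.0 / (1.0 + math.exp(-z))


def simulate_final_wins(current_wins, net_ratings, remaining_games, n_sims=5000, seed=0):
    """Return {team: [final win total in each simulation]}.

    remaining_games: list of (home_team, away_team) pairs still to be played.
    """
    rng = random.Random(seed)
    probs = [(h, a, home_win_prob(net_ratings[h], net_ratings[a]))
             for h, a in remaining_games]
    results = {team: [] for team in current_wins}
    for _ in range(n_sims):
        wins = dict(current_wins)
        for home, away, p in probs:
            wins[home if rng.random() < p else away] += 1
        for team, total in wins.items():
            results[team].append(total)
    return results


def p10_p90(samples):
    """Simple index-based 10th and 90th percentiles of the simulated totals."""
    ordered = sorted(samples)
    n = len(ordered)
    return ordered[int(0.10 * (n - 1))], ordered[int(0.90 * (n - 1))]
```

Ranking the simulated win totals within each run would then give the seed distribution per team; tiebreak rules are left out of this sketch.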

Run Locally

Commands to launch the full experience

Backend refresh

python -m pipeline.run_pipeline --skip-fetch --with-model
python -m pipeline.models.predict_current
python -m pipeline.models.simulation

Frontend launch

streamlit run app/main.py
python -m http.server
# open /portfolio/index.html