Executive Overview

A quick orientation for non-technical readers. The dataset follows 303 patients evaluated for heart disease at the Cleveland Clinic and uses their vitals and test results to flag likely disease.

Important: This is a historical modelling exercise meant for analytics practice. It is not medical advice or a diagnostic tool.

Snapshot

We cleaned the Cleveland cohort, converted the target to a simple yes/no heart disease indicator, and compared two baseline models: a logistic regression and a decision tree. The scaled logistic regression pipeline produced the strongest balance of sensitivity and specificity.

Accuracy

0.869

Share of correct predictions on the 61-person test set.

Split Size

242 / 61

Training vs. testing rows after an 80/20 stratified split.

Precision

0.93 / 0.81

When the model says "no disease" or "disease", this is how often it is right.

Recall

0.82 / 0.93

How many real negatives/positives the model successfully captured.

Precision and recall are listed as no disease / disease. Logistic regression kept false negatives (missed disease) to just 2 cases, an improvement over the decision tree’s 5.

Think of the confusion matrix this way: 27 true negatives, 26 true positives, with 6 false alarms and only 2 missed cases. Those totals come straight from the evaluation file below.

Figure: Correlation heatmap for the Cleveland heart disease features.
Heatmap from the cleaned training set shows chest pain type (cp), thallium test result (thal), and ST depression (oldpeak) as leading indicators.

Feature Guide

Each row is a person who received a heart disease work-up. Use this glossary to decode the variables and interpret the heatmap.

age
Age in years at the time of the visit.
sex
Biological sex (1 = male, 0 = female).
cp
Chest pain type (1 typical angina → 4 asymptomatic).
trestbps
Resting blood pressure (mm Hg).
chol
Serum cholesterol (mg/dL).
fbs
Fasting blood sugar > 120 mg/dL (1 = true).
restecg
Resting electrocardiogram (0 = normal, 1 = ST-T abnormality, 2 = probable LV hypertrophy).
thalach
Maximum heart rate achieved during exercise.
exang
Exercise-induced angina (1 = yes).
oldpeak
ST depression induced by exercise relative to rest.
slope
Slope of the peak exercise ST segment (1 upsloping, 2 flat, 3 downsloping).
ca
Count of major vessels (0–3) shown via fluoroscopy.
thal
Thallium stress-test result (3 normal, 6 fixed defect, 7 reversible defect).
target
Binary label: 1 = heart disease present, 0 = no heart disease.

1. Workflow Summary

Use these talking points when you need to walk stakeholders through the approach.

2. Technical Appendix

For analysts, here are the holdout metrics for each model. Precision and recall values are shown as no disease / disease.

Model               | Accuracy | Precision   | Recall      | Confusion matrix   | Notes
Logistic regression | 0.869    | 0.93 / 0.81 | 0.82 / 0.93 | [[27, 6], [2, 26]] | Standardised features, balanced class weights (liblinear solver).
Decision tree       | 0.770    | 0.83 / 0.72 | 0.73 / 0.82 | [[24, 9], [5, 23]] | Depth 4, min leaf 5 with class balancing; higher variance, more missed cases.
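
The accuracy, precision, and recall figures can be re-derived from the confusion matrices alone, which is a useful sanity check when reviewing the numbers. The sketch below assumes the layout implied by the overview (rows are actual classes, columns are predicted classes, ordered no disease then disease); the function name metrics_from_confusion is illustrative and is not part of the project files.

```python
def metrics_from_confusion(matrix):
    """Derive accuracy, precision, and recall from a 2x2 confusion matrix.

    Assumed layout: [[TN, FP], [FN, TP]], i.e. rows are actual classes and
    columns are predicted classes, ordered (no disease, disease).
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Precision: of everything predicted as a class, how much was right.
        "precision": (tn / (tn + fn), tp / (tp + fp)),
        # Recall: of everything truly in a class, how much was found.
        "recall": (tn / (tn + fp), tp / (tp + fn)),
    }


# Confusion matrices copied from the table above.
MATRICES = {
    "Logistic regression": [[27, 6], [2, 26]],
    "Decision tree": [[24, 9], [5, 23]],
}

for name, matrix in MATRICES.items():
    m = metrics_from_confusion(matrix)
    precision = " / ".join(f"{p:.2f}" for p in m["precision"])
    recall = " / ".join(f"{r:.2f}" for r in m["recall"])
    print(f"{name}: accuracy={m['accuracy']:.3f} "
          f"precision={precision} recall={recall}")
```

Both rows of the table should be reproduced when the values are rounded to the precision shown, with each pair listed as no disease / disease.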

Recreate the run by loading heart_disease/heart_disease_model.pkl with joblib.load and calling model.predict on the cleaned dataset.
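
A minimal sketch of that reload step follows. Only the model path comes from the text. The CSV location, the column order in FEATURES, and the idea that the pickled object accepts a pandas DataFrame with these columns are assumptions, so adjust them to match the actual artifacts.

```python
from pathlib import Path

# Path given in the text above.
MODEL_PATH = Path("heart_disease/heart_disease_model.pkl")

# Assumption: the model was trained on the 13 Feature Guide columns in this
# order, with "target" held out as the label.
FEATURES = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]


def predict_from_csv(csv_path, model_path=MODEL_PATH):
    """Load the serialized model and predict on a cleaned CSV file.

    csv_path is supplied by the caller; the cleaned dataset's real file name
    is not stated in the source.
    """
    # Imported here so this module can be imported without the heavier
    # dependencies installed.
    import joblib
    import pandas as pd

    # joblib.load unpickles arbitrary objects: only load files you trust.
    model = joblib.load(model_path)
    frame = pd.read_csv(csv_path)
    return model.predict(frame[FEATURES])


# Example call (hypothetical file name):
# predictions = predict_from_csv("heart_disease/cleaned.csv")
```

If the pickled pipeline was fitted on a NumPy array rather than a DataFrame, passing `frame[FEATURES].to_numpy()` instead may be needed.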

3. Heart Risk Estimator (Beta)

Plug in the same inputs used in the model to see the logistic regression probability. This demo runs entirely in your browser; nothing is stored or transmitted.

Educational use only. Consult a clinician for medical decisions.

4. Download the Analysis Files

These are the same artifacts we used internally, ideal for replicating the workflow or reviewing the numbers in detail.

5. Implementation Roadmap

To graduate the in-browser demo into a production experience, follow this path.

  1. Expose an API (FastAPI or Flask) that loads the serialized logistic regression pipeline and validates user inputs.
  2. Design a lightweight intake form (age, sex, chest pain type, vitals, exercise response) that mirrors the model features.
  3. Return risk probabilities with guardrails: flag inputs outside training ranges, display calibration notes, and surface the confusion matrix for context.
  4. Add analytics tracking to gauge engagement and iterate on the question flow before promoting broadly.
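
The guardrail in step 3 (flagging inputs outside training ranges) could be sketched as follows. The categorical codes are taken from the Feature Guide; the choice of which features to treat as continuous, and the helper names fit_ranges and check_inputs, are assumptions for illustration. Ranges are computed from whatever training rows are passed in rather than hard-coded.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

# Valid codes per categorical feature, as listed in the Feature Guide.
ALLOWED_CODES = {
    "sex": {0, 1},
    "cp": {1, 2, 3, 4},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {1, 2, 3},
    "ca": {0, 1, 2, 3},
    "thal": {3, 6, 7},
}

# Assumption: these are the features treated as continuous measurements.
CONTINUOUS = ("age", "trestbps", "chol", "thalach", "oldpeak")


def fit_ranges(
    training_rows: Sequence[Mapping[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Record the min and max of each continuous feature in the training rows."""
    ranges = {}
    for name in CONTINUOUS:
        values = [row[name] for row in training_rows]
        ranges[name] = (min(values), max(values))
    return ranges


def check_inputs(
    record: Mapping[str, float],
    ranges: Mapping[str, Tuple[float, float]],
) -> List[str]:
    """Return human-readable warnings; an empty list means no flags."""
    warnings = []
    for name, allowed in ALLOWED_CODES.items():
        if name not in record:
            warnings.append(f"{name}: missing")
        elif record[name] not in allowed:
            warnings.append(
                f"{name}: {record[name]!r} is not one of {sorted(allowed)}"
            )
    for name, (low, high) in ranges.items():
        if name not in record:
            warnings.append(f"{name}: missing")
        elif not low <= record[name] <= high:
            warnings.append(
                f"{name}: {record[name]} is outside the training range "
                f"[{low}, {high}]"
            )
    return warnings
```

In an API built along the lines of step 1, the warnings could be returned alongside the probability instead of rejecting the request, so that users still see a result but know it is an extrapolation.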

The trained pipeline lives in the working repo at heart_disease/heart_disease_model.pkl. Wire it into a microservice when you are ready to test the interaction.