Exploring Predictive Indicators of Diabetes in Female Patients

Project Goal

Analyze a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and explore how certain diagnostic factors affect the diabetes outcome of female patients.

Process

Data Exploration in Python (EDA)

Checked for missing or zero values in critical features (like insulin and skin thickness).
Plotted distributions and correlations for features like BMI, glucose, and age.
Analyzed relationships between features and the Outcome variable (diabetes diagnosis: 0 = No, 1 = Yes).

Visualization in Tableau

Created a dashboard to surface the most insightful relationships:
BMI vs. Diabetes Outcome: higher BMI is strongly associated with positive diagnoses
Blood Pressure, Glucose, Insulin by Age: used bubble scatter plots to explore variation across lifespan
Diabetes Pedigree Function by Age: older age combined with a higher pedigree function is often linked to diagnoses
Boxplot of BMI by Diagnosis: clearly shows higher median BMI in diagnosed individuals

Part 1 -

Data Exploration in Python

Data Inspection and Cleaning

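import pandas as pd
import numpy as np

# Load the dataset and take a first look at types, summary statistics, and sample rows
df = pd.read_csv('diabetes.csv')
df.info()
print(df.describe())
print(df.head())

# Zeros in these columns are treated as missing values
# (a zero BMI or blood pressure is not a plausible measurement)
cols_with_zero_issues = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((df[cols_with_zero_issues] == 0).sum())

# Mark those zeros as NaN, then drop every row with at least one missing value
df_cleaned = df.copy()
df_cleaned[cols_with_zero_issues] = df_cleaned[cols_with_zero_issues].replace(0, np.nan)
df_cleaned.dropna(inplace=True)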
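
Because dropna() removes every row with at least one missing value, it can be worth checking how much data the cleaning step discards. The sketch below shows one way to do that. It runs on a tiny made-up sample (the values and the names sample, marked, missing_share, and rows_kept are illustrative assumptions, not part of the original notebook), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with the same column names as diabetes.csv (illustrative values only)
sample = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin":       [0, 0, 0, 94, 168],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome":       [1, 0, 1, 0, 1],
})

cols_with_zero_issues = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Treat zeros as missing, mirroring the cleaning step
marked = sample.copy()
marked[cols_with_zero_issues] = marked[cols_with_zero_issues].replace(0, np.nan)

# Share of missing values per column, and number of rows that survive dropna()
missing_share = marked[cols_with_zero_issues].isna().mean()
rows_kept = len(marked.dropna())

print(missing_share)
print(f"Rows kept after dropna: {rows_kept} of {len(sample)}")
```

Running the same two calculations on df and df_cleaned would show which columns drive the row loss.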

Distribution by Outcome

import seaborn as sns
import matplotlib.pyplot as plt

for col in ['Glucose', 'BMI', 'Age', 'DiabetesPedigreeFunction']:
    plt.figure(figsize=(6, 4))
    sns.histplot(data=df_cleaned, x=col, hue='Outcome', kde=True, element="step")
    plt.title(f'Distribution of {col} by Outcome')
    plt.show()

Glucose levels were noticeably higher in diagnosed individuals, with the whole distribution shifted to the right.
BMI was also shifted toward higher values among diagnosed patients, though less markedly than glucose.
Diabetes Pedigree Function showed wider spread in diagnosed patients, with several outliers above 1.0, suggesting stronger family history linkage.
Age distribution showed that non-diagnosed participants skew younger, while diagnosed participants were more uniformly distributed across age brackets.
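
The visual impressions above can be backed with a per-group numeric summary. The following is a minimal sketch, assuming the same column names as the dataset. The sample values and the name summary are made up for illustration and are not results from the real data.

```python
import pandas as pd

# Made-up rows for illustration only; column names match the dataset
sample = pd.DataFrame({
    "Glucose": [85, 89, 116, 148, 137, 183],
    "BMI":     [26.6, 28.1, 25.6, 33.6, 43.1, 23.3],
    "Age":     [31, 21, 30, 50, 33, 32],
    "DiabetesPedigreeFunction": [0.351, 0.167, 0.201, 0.627, 2.288, 0.672],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

features = ["Glucose", "BMI", "Age", "DiabetesPedigreeFunction"]

# Median shows the shift between groups; standard deviation shows the spread
summary = sample.groupby("Outcome")[features].agg(["median", "std"])
print(summary.round(2))
```

Applied to df_cleaned, the median columns would quantify the shift in glucose and BMI, and the std columns the wider spread of the pedigree function.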

Feature Correlation

numeric_df = df_cleaned.select_dtypes(include='number')

plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Matrix")
plt.show()

Glucose had the strongest positive correlation with diabetes diagnosis.
BMI and DiabetesPedigreeFunction followed as moderately correlated.
Pregnancies showed moderate correlation overall and was more predictive when paired with higher age groups.
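
Reading a ranking off a heatmap is error-prone, so the correlations with Outcome can also be pulled out and sorted directly. This is a small sketch on made-up values (the data and the name outcome_corr are illustrative assumptions). On the real data, numeric_df would take the place of sample.

```python
import pandas as pd

# Made-up rows for illustration only; column names match the dataset
sample = pd.DataFrame({
    "Glucose":     [85, 89, 116, 148, 137, 183],
    "BMI":         [26.6, 28.1, 25.6, 33.6, 43.1, 23.3],
    "Pregnancies": [1, 1, 5, 6, 0, 8],
    "Outcome":     [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with Outcome, strongest positive first
outcome_corr = (
    sample.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)
print(outcome_corr.round(2))
```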

Part 2 -

Visualization in Tableau

Blood Pressure, Glucose, Insulin, Skin Thickness Measures by Age

Glucose increases more visibly with age in diagnosed patients
Insulin values show extreme variance and inconsistency (possible data quality issue)

Diabetes Diagnosis vs BMI

Clear visual split: median BMI is higher in diagnosed patients
Diagnosed group contains more outliers with BMI > 40
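
The two boxplot observations can also be checked numerically in pandas. Below is a minimal sketch. The sample values and the names median_bmi and high_bmi_counts are illustrative assumptions, and the BMI > 40 threshold is taken from the observation above.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "BMI":     [26.6, 28.1, 25.6, 41.0, 33.6, 43.1, 45.8, 23.3],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Median BMI per diagnosis group
median_bmi = sample.groupby("Outcome")["BMI"].median()

# Number of patients with BMI above 40 in each group
high_bmi_counts = sample[sample["BMI"] > 40].groupby("Outcome").size()

print(median_bmi)
print(high_bmi_counts)
```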

Diabetes Pedigree Function by Age

Diagnosed individuals aged 50–70 tend to have higher pedigree scores
Pedigree function peaks in the 50–60 age group → suggests increased genetic risk in older patients
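
The age-group comparison could be reproduced outside Tableau by binning Age and averaging the pedigree function per bin and diagnosis. This is a sketch under assumptions: the ten-year bin edges, the labels, the sample values, and the name pedigree_by_age are all hypothetical, since the bins used in the dashboard are not stated.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Age": [22, 35, 41, 52, 58, 63, 29, 55],
    "DiabetesPedigreeFunction": [0.2, 0.35, 0.4, 0.9, 1.1, 0.6, 0.25, 0.3],
    "Outcome": [0, 0, 1, 1, 1, 1, 0, 0],
})

# Assumed ten-year bins; right=False makes each bin include its lower edge only
bins = [20, 30, 40, 50, 60, 70]
labels = ["20-29", "30-39", "40-49", "50-59", "60-69"]
sample["AgeGroup"] = pd.cut(sample["Age"], bins=bins, labels=labels, right=False)

# Mean pedigree function per age group, one column per diagnosis value
pedigree_by_age = (
    sample.groupby(["AgeGroup", "Outcome"], observed=True)["DiabetesPedigreeFunction"]
    .mean()
    .unstack("Outcome")
)
print(pedigree_by_age.round(2))
```

Cells with no patients show up as NaN, which makes sparsely populated age groups easy to spot before reading a trend into them.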

Summary

Glucose, BMI, and pedigree score are the top indicators for diabetes in this group.
There’s a need for better handling of missing biometric values like insulin.
Age trends show the impact of family history and metabolic changes over time.
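
One possible alternative to dropping incomplete rows is simple median imputation. The sketch below is not the method used in this project, only an illustration of the idea. The sample values, the name imputed, and the *_was_missing flag columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Insulin":       [0, 94, 168, 0, 88],
    "SkinThickness": [35, 0, 23, 35, 32],
    "Outcome":       [1, 0, 1, 0, 1],
})

cols = ["Insulin", "SkinThickness"]

# Mark zeros as missing, as in the cleaning step
imputed = sample.copy()
imputed[cols] = imputed[cols].replace(0, np.nan)

# Keep a flag so filled-in values can be told apart from measured ones
for col in cols:
    imputed[f"{col}_was_missing"] = imputed[col].isna()

# Fill each column with its own median, computed from the observed values
imputed[cols] = imputed[cols].fillna(imputed[cols].median())
print(imputed)
```

This keeps every row, at the cost of flattening the variance of heavily imputed columns, so the flag columns are useful when interpreting later plots.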

Ideas from data visualization

Early screening tools using just glucose, BMI, and family history

Data cleanup methods or clinician training to reduce gaps in insulin/skin thickness tracking

Predictive modeling apps for providers to assess diabetes risk without relying on all 9 variables
