Exploring Predictive Indicators of Diabetes in Female Patients
Project Goal
Analyze a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and explore how certain diagnostic factors relate to the diabetes outcome of female patients.
Process
- Data Exploration in Python (EDA)
Checked for missing or zero values in critical features (like insulin and skin thickness).
Plotted distributions and correlations for features like BMI, glucose, and age.
Analyzed relationships between features and the Outcome variable (diabetes diagnosis: 0 = No, 1 = Yes).
- Visualization in Tableau
Created a dashboard to surface the most insightful relationships:
BMI vs. Diabetes Outcome: higher BMI is strongly associated with positive diagnoses
Blood Pressure, Glucose, Insulin by Age: used bubble scatter plots to explore variation across lifespan
Diabetes Pedigree Function by Age: older age + higher pedigree function often linked to diagnoses
Boxplot of BMI by Diagnosis: clearly shows higher median BMI in diagnosed individuals
Part 1 - Data Exploration in Python
Data Inspection and Cleaning
    import pandas as pd
    import numpy as np

    # Load the dataset and inspect its structure
    df = pd.read_csv('diabetes.csv')
    df.info()
    df.describe()
    df.head()

    # Zeros in these columns are treated as missing values
    cols_with_zero_issues = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
    (df[cols_with_zero_issues] == 0).sum()

    # Mark zeros as NaN, then drop incomplete rows
    df_cleaned = df.copy()
    df_cleaned[cols_with_zero_issues] = df_cleaned[cols_with_zero_issues].replace(0, np.nan)
    df_cleaned.dropna(inplace=True)
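Because the cleaning step drops every row with at least one missing value, it can help to check how much data that removes. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `diabetes.csv`, with only three of the columns) to show one way to measure the share of missing values per column and the number of rows that survive `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diabetes.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "Insulin": [0, 0, 0, 94, 168, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 0.0],
    "Outcome": [1, 0, 1, 0, 1, 0],
})
cols_with_zero_issues = ["Glucose", "Insulin", "BMI"]

# Treat zeros as missing, mirroring the cleaning step above.
marked = toy.copy()
marked[cols_with_zero_issues] = marked[cols_with_zero_issues].replace(0, np.nan)

# Share of missing values per column, and rows that survive dropna().
missing_share = marked[cols_with_zero_issues].isna().mean()
rows_kept = len(marked.dropna())

print(missing_share)
print(f"rows kept: {rows_kept} of {len(marked)}")
```

Running the same two checks on the real `df` before calling `dropna()` would show whether a single sparse column, such as insulin, accounts for most of the lost rows.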
Distribution by Outcome
    import seaborn as sns
    import matplotlib.pyplot as plt

    # One histogram per feature, split by diagnosis
    for col in ['Glucose', 'BMI', 'Age', 'DiabetesPedigreeFunction']:
        plt.figure(figsize=(6, 4))
        sns.histplot(data=df_cleaned, x=col, hue='Outcome', kde=True, element="step")
        plt.title(f'Distribution of {col} by Outcome')
        plt.show()
Glucose levels were markedly higher in diagnosed individuals, with the distribution clearly shifted to the right.
BMI also showed a noticeable right-skew among diagnosed patients, though less pronounced than glucose.
Diabetes Pedigree Function showed wider spread in diagnosed patients, with several outliers above 1.0, suggesting stronger family history linkage.
Age distribution showed that non-diagnosed participants skew younger, while diagnosed participants were more uniformly distributed across age brackets.
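The visual differences described above can be backed by a per-group summary. This is a minimal sketch on made-up values (`toy` is hypothetical and not taken from the dataset); on the real data the same `groupby` call would be applied to `df_cleaned`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Glucose": [85, 89, 116, 148, 137, 183],
    "BMI": [26.6, 28.1, 25.6, 33.6, 43.1, 23.3],
    "Age": [31, 21, 30, 50, 33, 32],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Median of each feature within each diagnosis group (0 = No, 1 = Yes).
summary = toy.groupby("Outcome")[["Glucose", "BMI", "Age"]].median()
print(summary)
```

Medians are used here rather than means because they are less sensitive to the outliers noted in the distributions.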
Feature Correlations
    # Pairwise correlations between all numeric columns
    numeric_df = df_cleaned.select_dtypes(include='number')
    plt.figure(figsize=(10, 8))
    sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
    plt.title("Feature Correlation Matrix")
    plt.show()
Glucose had the strongest positive correlation with diabetes diagnosis.
BMI and DiabetesPedigreeFunction followed as moderately correlated.
Pregnancies showed moderate correlation overall and was more predictive when paired with higher age groups.
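A ranking like the one above can be read off the heatmap, but it may be easier to pull the `Outcome` column out of the correlation matrix and sort it. The sketch below does this on a small made-up frame (`toy` is hypothetical); the resulting order for the toy data says nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Glucose": [85, 89, 116, 148, 137, 183],
    "BMI": [26.6, 28.1, 25.6, 33.6, 43.1, 23.3],
    "Age": [31, 21, 30, 50, 33, 32],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with Outcome, strongest positive first.
corr_with_outcome = (
    toy.corr()["Outcome"]
    .drop("Outcome")  # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_outcome)
```

On the real data, `numeric_df.corr()["Outcome"]` would take the place of `toy.corr()["Outcome"]`.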
Part 2 - Visualization in Tableau
Blood Pressure, Glucose, Insulin, Skin Thickness Measures by Age
Glucose increases more visibly with age in diagnosed patients
Insulin values show extreme variance and inconsistency (possible data quality issue)
Diabetes Diagnosis vs BMI
Clear visual split: median BMI is higher in diagnosed patients
Diagnosed group contains more outliers with BMI > 40
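The numbers behind a boxplot like this one can be reproduced in pandas. The sketch below, on made-up values (`toy` is hypothetical), computes the quartiles per diagnosis group and the share of patients with BMI above 40; the threshold of 40 comes from the observation above.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "BMI": [26.6, 28.1, 25.6, 31.0, 33.6, 43.1, 23.3, 45.8],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Quartiles per group: the box edges and median line of a boxplot.
box_stats = toy.groupby("Outcome")["BMI"].quantile([0.25, 0.5, 0.75]).unstack()

# Share of each group with BMI above 40.
share_over_40 = (toy["BMI"] > 40).groupby(toy["Outcome"]).mean()

print(box_stats)
print(share_over_40)
```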
Diabetes Pedigree Function by Age
Age 50–70 tends to have higher pedigree scores among diagnosed individuals
Pedigree function peaks in the 50–60 age group, which may suggest increased genetic risk in older patients
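An age-bracket comparison like this can also be checked outside Tableau by binning age and averaging the pedigree score per bracket and diagnosis. This is a sketch on made-up values; the frame `toy`, the ten-year bin edges, and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Age": [22, 27, 34, 38, 45, 52, 58, 63],
    "DiabetesPedigreeFunction": [0.25, 0.35, 0.30, 0.60, 0.45, 0.95, 0.40, 0.70],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Ten-year brackets; right=False makes each bin include its lower edge, e.g. [50, 60).
bins = [20, 30, 40, 50, 60, 70]
labels = ["20-29", "30-39", "40-49", "50-59", "60-69"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Mean pedigree score per age bracket, one column per diagnosis group.
pedigree_by_age = (
    toy.groupby(["AgeGroup", "Outcome"], observed=True)["DiabetesPedigreeFunction"]
    .mean()
    .unstack()
)
print(pedigree_by_age)
```

Brackets with no patients in one of the groups appear as NaN, which is a useful reminder that the older brackets may rest on very few observations.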
Summary
Glucose, BMI, and pedigree score are the top indicators for diabetes in this group.
There’s a need for better handling of missing biometric values like insulin.
Age trends show the impact of family history and metabolic changes over time.
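One possible alternative to dropping incomplete rows is to fill missing values with the column median. This is only a sketch of that option, not the method used in this project; the frame `toy` and its values are made up, and median imputation is one choice among several.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Insulin": [0, 94, 168, 0, 88, 0],
    "SkinThickness": [35, 29, 0, 23, 0, 32],
    "Outcome": [1, 0, 1, 0, 1, 0],
})
cols = ["Insulin", "SkinThickness"]

# Mark zeros as missing.
marked = toy.copy()
marked[cols] = marked[cols].replace(0, np.nan)

# Fill each column's gaps with that column's median of the observed values.
imputed = marked.copy()
imputed[cols] = marked[cols].fillna(marked[cols].median())

print(imputed)
```

Unlike `dropna()`, this keeps every row, at the cost of pulling the imputed columns toward their median, so any distribution or correlation computed afterwards should be read with that in mind.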
Ideas from data visualization
Early screening tools using just glucose, BMI, and family history
Data cleanup methods or clinician training to reduce gaps in insulin/skin thickness tracking
Predictive modeling apps for providers to assess diabetes risk without relying on all 8 predictor variables