Predicting Diabetes Using Machine Learning
Myriam Zilles

Executive Summary


  • Which subset of predictors is most indicative of diabetes, and what are the top two risk factors?
  • Are high BP and/or high cholesterol indicative?
  • Do factors like BMI, smoking, and eating fruits and vegetables matter?
  • Do income and education levels have an impact?

Target Audience/Use Case

  • Individuals can become aware of the risks and proactively manage risk factors through preventative interventions and/or lifestyle changes. This will help them improve their overall health by leading an active lifestyle. These behavioral indicators are easy for individuals to use for self-care and self-assessment.
  • Public health authorities and state health departments can use the results to create policies and procedures and to determine areas of research.
  • Doctors, diabetes care specialists, community-based organizations and employers can use this information to manage and educate communities on healthy behaviors that prevent diabetes, by providing diabetes self-management education and support.
  • Insurers and health care management organizations can use this predictive model to help high-risk individuals with preventative or maintenance care and to improve initial management practices overall.
Graph 1

Data Analysis / Important Predictors & Relationship with Response Variable

Graph 2

Graph 3 below shows the relationship between BMI and diabetes: those with a higher BMI tend to have a higher rate of diabetes.

Graph 3

Graph 4 indicates that those with diabetes more often have high cholesterol, high BP and difficulty walking. The higher the age, the higher the chance of diabetes.

Graph 4

Graph 5 indicates that individuals with lower income and lower education levels have higher rates of diabetes.

Graph 5

Data Preparation and Transformation

  • All 21 features will be used.
  • The response variable Diabetes_binary is categorical with levels 0/1 and is converted to a factor variable.
  • BMI, MentHlth (days) and PhysHlth (days) are numeric features. They will not be scaled beforehand; caret function parameters will be used for scaling and centering.
  • The remaining 18 categorical features, with 2 to 13 levels each, will be converted to factor variables.
  • Age is pre-binned with 13 levels.
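
The centering, scaling and factor conversion described in the list above can be illustrated with a small sketch. The article relies on caret (R) for this step; the plain-Python version below is only a stand-in to show the arithmetic. The function names and the sample values are hypothetical and are not taken from the data set.

```python
from statistics import mean, stdev


def center_and_scale(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def to_factor(values):
    """Map each distinct value to an integer code, loosely mimicking a factor."""
    levels = sorted(set(values))
    codes = {level: i for i, level in enumerate(levels)}
    return levels, [codes[v] for v in values]


# Hypothetical BMI values, for illustration only.
bmi = [22.0, 27.0, 31.0, 36.0]
bmi_scaled = center_and_scale(bmi)

# Hypothetical pre-binned Age codes, for illustration only.
age_levels, age_codes = to_factor([9, 3, 13, 9])
```

After this transformation the scaled BMI column has mean 0 and standard deviation 1, while the Age bins are treated as category labels and not as quantities.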
Graph 6

Model Development

  • Random Forest: the tuning parameter, the number of variables available for splitting at each node, was set to 2, 3, 4, 5, 6, 7, 8 and 22, based on the number of features used.
  • Support Vector Machine (SVM): the tuning parameter cost was set to 0.001, 0.01, 0.1, 1, 5, 10 and 100.
Graph 7

The predictor variable combinations were tested with both learning methods to predict the response variable for all 10 model formulas, cross-validating the fit 10 times on each subset of data. The best model chosen during the model-fitting process was the Support Vector Machine on all 21 features. We can see this in Graphs 7 and 8, where the highlighted red vertical line marks the lowest error rate, 25% (an accuracy of 75%), compared with the rest of the models.
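
To make the cross-validation step concrete, here is a minimal sketch of 10-fold cross-validation in plain Python. It is not the caret code behind the results above. The majority-class learner, the function names and the toy data are all assumptions chosen so the example runs without any external library.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_error(fit, predict, X, y, k=10, seed=0):
    """Average misclassification rate over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


# Stand-in learner (an assumption): always predict the majority training class.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, row):
    return model


# Toy data, for illustration only: 32 negatives and 8 positives.
X = [[i] for i in range(40)]
y = [0] * 32 + [1] * 8
error = cv_error(fit_majority, predict_majority, X, y, k=10)
```

In a setup like the one described, the same loop would be repeated for every model formula and every tuning value, and the combination with the lowest average error would be kept.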

Graph 8

In Graph 9, we see that a cost of 0.01 is the best value of the tuning hyperparameter.

Graph 9

Model Assessment

The second-best model was also a Support Vector Machine, one that is less complex in terms of the number of predictor variables (Diabetes_binary ~ HighBP + HighChol + BMI + GenHlth + Age + DiffWalk). I did not select this model, though, because I think that other features such as income, education and smoking will be indicative of diabetes, especially when the model is deployed at scale across various ethnic groups. An added benefit of the full model is that these behavioural traits can be evaluated in routine health exams and screenings. Identifying risk factors early will expedite medical care and improve individuals' long-term quality of life by helping them better prevent and manage chronic conditions.

The following graph illustrates the error rate for all 50 models. We can see that, overall, SVM on the selected model has a lower error rate than the Random Forest method. The dotted blue lines separate the 10 groups of model formulas for SVM and Random Forest for each fold, with the 5 lowest CV error rates ranging from 24% to 30% and accuracy ranging from 70% to 76%.

Graph 9

Final Model Selection

Graph 10

We are more concerned with the recall rate, because false negatives are important to manage in a timely fashion. This additional visual shows the proportion of true positives, true negatives, false positives and false negatives. We could optimize the model to lower false negatives so that individuals can receive timely health advice.
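
As a reminder of how these quantities relate, the sketch below derives accuracy, error rate, recall and precision from the four confusion-matrix counts. The labels and predictions are hypothetical and chosen only for illustration; they are not the model's actual output.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as 'diabetes'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def summarize(tp, tn, fp, fn):
    """Derive the headline metrics from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "recall": tp / (tp + fn),      # share of true diabetics that were caught
        "precision": tp / (tp + fp),   # share of flagged people who are diabetic
    }


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = summarize(*confusion_counts(y_true, y_pred))
```

Recall is the metric that penalizes false negatives directly, which is why it matters more than raw accuracy for a screening use case.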

There is not much risk in a healthy person being flagged as diabetic. However, a false positive could lead to patient dissatisfaction if additional medical tests have to be ordered, cause a financial burden to the individual, and reduce the effectiveness of the ecosystem.
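
One common way to trade false negatives against false positives is to lower the decision threshold applied to the model's predicted score. The sketch below assumes the classifier can output a probability-like score per person, which the article does not confirm for the chosen SVM. The scores and labels are hypothetical.

```python
def classify(scores, threshold):
    """Flag a person as diabetic (1) when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def fn_fp(y_true, y_pred):
    """Return (false negatives, false positives)."""
    pairs = list(zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return fn, fp


# Hypothetical labels and predicted scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.45, 0.35, 0.6, 0.4, 0.1]

# Compare the default threshold with a more sensitive one.
results = {t: fn_fp(y_true, classify(scores, t)) for t in (0.5, 0.3)}
```

In this toy case, moving the threshold from 0.5 to 0.3 removes both false negatives but adds one false positive, which is the trade-off between missed diagnoses and unnecessary follow-up tests.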

Graph 11

Model Interpretation

Next Steps

  • Make the model less complex by eliminating weakly indicative predictors, so that the selected set of behavioral traits can be more easily managed by individuals and health care providers.
  • Try KNN, Logistic Regression and ANN with tuning parameters.
  • Experiment with tuning parameters on Random Forest and SVM on a combination of features.
  • Gather additional data by race, sex and age.
  • Gather additional data on individuals under 18 years of age, with and without diabetes. Prediction models could differ for children versus adults.




Poonam Rao

Exec Director StratEx - I bring to the table blend of data science, finance and strategy management skills with 20+ years of experience in insurance & fintech.