Predicting Diabetes Using Machine Learning
Diabetes is a prevalent issue in many countries, with 537 million adults living with this chronic condition. In the US, approximately 34M people have diabetes and another 88M have pre-diabetes. Roughly one third of the American population therefore either has diabetes or is at high risk of type 2 diabetes. Not only does this disease have a financial impact on families, it also leads to various health complications and ailments. The CDC forecasts that the prevalence of this disease is rising. Research by the IDF (International Diabetes Federation) indicates that 3 in 4 people with diabetes live in low- to middle-income countries, and it forecasts that the total will reach 643 million by 2030 and 783 million by 2045. There is therefore great interest in predicting diabetes.
The primary purpose is to predict, with high accuracy, whether a person has diabetes based on behavioral and physical features. The analysis also addresses the following questions:
- Which subset of predictors is most indicative of diabetes, and what are the top two risk factors?
- Are high BP and/or high cholesterol indicative?
- Do factors like BMI, smoking, and eating fruits and vegetables matter?
- Do income and education levels have an impact?
Target Audience/Use Case
Predicting diabetes in individuals and turning those predictions into actionable insights can help manage the far-reaching consequences of this slow-killing disease. It can also reduce pressure on health care systems and improve individuals' overall quality of life. The results of this analysis will be helpful to the following audiences:
- Individuals, who can become aware of the risks and proactively manage risk factors through preventative interventions and/or lifestyle changes. This will help them improve their overall health by leading an active lifestyle. The behavioral indicators are easy for individuals to use for self-care and self-assessment.
- Public Health authorities and State Health Departments, to create policies and procedures and determine areas of research.
- Doctors, Diabetes Care Specialists, Community Based Organizations and Employers, who can use this information to manage and educate communities on healthy behaviors that prevent diabetes by providing diabetes self-management education and support.
- Insurers and Health Care Management organizations can use this predictive model to help high-risk individuals with preventative or maintenance care and overall improve initial management practices.
Data Analysis / Important Predictors & Relationship with Response Variable
Graphs 1 and 2 show the correlation between features in our dataset. Ellipses in shades of red signify a positive correlation (for example, high cholesterol and high BP are correlated), while shades of blue signify a negative correlation (for example, higher fruit and vegetable consumption is associated with a lower chance of diabetes). We can see that BMI, Age, Difficulty Walking, high BP and high Cholesterol are the features most strongly correlated with diabetes. The two most important predictors are BMI and high BP, followed by high cholesterol and Age.
In graph 3 below we can see the relationship between BMI and diabetes. Those with higher BMI tend to have a higher rate of diabetes.
Graph 4 indicates that those with diabetes have higher rates of high cholesterol, high BP and difficulty walking. The higher the age, the higher the chance of diabetes.
Graph 5 indicates that individuals with lower income and lower education levels have higher rates of diabetes.
Data Preparation and Transformation
The missingness map graph 6 indicates no missing data. Additional details below on data preparation:
- All the 21 features will be used.
- The response variable Diabetes_binary is categorical with levels 0/1 and is converted to a factor variable.
- BMI, MentHlth days and PhysHlth days are numeric features. They are not scaled manually; instead, Caret function parameters are used for scaling and centering.
- The remaining 18 categorical features, with 2 to 13 levels each, are converted to factor variables.
- Age is pre-binned with 13 levels.
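The centering and scaling step for the numeric features can be sketched as follows. This is a minimal illustration in Python, not the Caret call used in the analysis; the function name and the BMI values are hypothetical, and it assumes the usual definition of standardization (subtract the mean, divide by the sample standard deviation).

```python
from statistics import mean, stdev

def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical BMI values, for illustration only (not from the dataset)
bmi = [22.0, 27.5, 31.0, 35.5, 24.0]
scaled_bmi = center_and_scale(bmi)
```

When the fit is cross-validated, the mean and standard deviation would typically be computed on the training rows only and then applied to the held-out rows, so that no information leaks from the test fold.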
The following two machine learning classification algorithms were used for this analysis.
- Random Forest: The tuning parameter is the number of variables available for splitting at each node, with candidate values of 2, 3, 4, 5, 6, 7, 8 and 22 based on the number of features used.
- Support Vector Machine (SVM): The tuning parameter cost was given candidate values of 0.001, 0.01, 0.1, 1, 5, 10 and 100.
The predictor variable combinations were tested on both learning methods: each of the 10 model formulas was fit to predict the response variable diabetes, cross-validating the fit 10 times on each subset of data. The best model chosen during the model fitting process was the Support Vector Machine on all 21 features. We can see this in graphs 7 and 8, where the highlighted red vertical line marks its error rate of 25% (accuracy of 75%), the lowest among the models.
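The mechanics of a cross-validated error rate can be sketched as below. This is a minimal illustration under stated assumptions rather than the procedure used in this analysis: the helper names are hypothetical, the labels are synthetic, and a toy majority-class predictor stands in for the Random Forest and SVM fits.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_label(labels):
    """Toy 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)

def cross_validated_error(labels, k):
    """Average misclassification rate over k held-out folds."""
    folds = k_fold_indices(len(labels), k)
    fold_errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on every row that is not in the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out_set]
        prediction = majority_label(train)
        wrong = sum(1 for i in held_out if labels[i] != prediction)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k

# Synthetic labels for illustration only: 60 non-diabetic, 40 diabetic
labels = [0] * 60 + [1] * 40
cv_error = cross_validated_error(labels, k=10)
cv_accuracy = 1 - cv_error  # accuracy is the complement of the error rate
```

In a real comparison, the same folds would be reused for every model formula and tuning value so that the resulting error rates are directly comparable.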
In graph 9, we see that a cost of 0.01 is the best value for the tuning hyper-parameter.
During the model assessment phase, the models were run on different training and test data subsets. Support Vector Machine was consistently selected as the machine learning method: 5 out of 5 times, the SVM model on all 21 predictors was chosen. It also has the lowest overall error rate.
The 2nd best model was also a Support Vector Machine, one that is less complex in terms of the number of predictor variables (Diabetes_binary ~ HighBP + HighChol + BMI + GenHlth + Age + DiffWalk). This model was not selected, though, as I think that other features like income, education, smoking, etc. will be indicative of diabetes, especially when the model is deployed at scale for various ethnic groups. An added perk of this approach is that the behavioural traits can be evaluated in routine health exams and screenings. Identifying risk factors early will expedite medical care and improve individuals' long-term quality of life by helping them better prevent and manage chronic conditions.
The following graph illustrates the error rate for all 50 models. We can see that, overall, SVM on the selected model has a lower error rate than the Random Forest method. The dotted blue lines separate the 10 groups of model formulas for SVM and Random Forest for each fold, with the 5 lowest CV error rates ranging from 24% to 30% and accuracy ranging from 70% to 76%.
Final Model Selection
The final model was validated on a dataset of 6K observations and also fit on the full dataset of 72K observations. It is a Support Vector Machine model using all 21 predictor variables with the tuning hyper-parameter cost set to 0.01. This is the same model that was selected during the model fitting and model assessment phases. After fitting on the complete dataset of 72K observations, it has an error rate of 25% and accuracy of 75%, very close to the validation dataset results of a 24.5% error rate and 75.5% accuracy. The confusion matrix in graph 10 gives a visual summary, with precision of 73% (how many of those labeled as diabetic are truly diabetic?) and recall (sensitivity) of 79% (how many of the truly diabetic individuals were correctly identified?). The F1 score, which balances precision and recall, is pretty decent at 76%.
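The relationship between these metrics can be checked with a short sketch. The function names and the confusion-matrix counts below are hypothetical and chosen only to illustrate the formulas; the one calculation tied to this analysis is the F1 score, recomputed from the reported precision and recall.

```python
def precision(tp, fp):
    """Share of positive predictions that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of truly positive cases that were predicted positive."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts, for illustration only (not from the confusion matrix)
example_precision = precision(tp=8, fp=2)  # 8 / 10
example_recall = recall(tp=8, fn=4)        # 8 / 12

# F1 recomputed from the precision and recall reported in the text
reported_f1 = f1_score(0.73, 0.79)
print(f"F1 = {reported_f1:.2f}")  # prints "F1 = 0.76"
```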
We are more concerned with the recall rate, because false negatives are important to catch and manage in a timely fashion. This additional visual shows the proportion of true positives, true negatives and false positives/negatives. We could optimize the model to lower false negatives so individuals can receive timely health advice.
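One common way to trade false negatives for false positives is to lower the decision threshold. The sketch below is only an illustration of that idea, not something done in this analysis: it assumes the classifier can output a score or probability for the diabetic class, and the function name, labels and scores are all hypothetical.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when predicting 1 for scores >= threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            counts["tp"] += 1
        elif predicted == 1 and truth == 0:
            counts["fp"] += 1
        elif predicted == 0 and truth == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical labels (1 = diabetic) and model scores, for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.45, 0.2, 0.35, 0.55, 0.1]

default = confusion_counts(y_true, scores, threshold=0.5)
lenient = confusion_counts(y_true, scores, threshold=0.3)
```

In this toy data, lowering the threshold from 0.5 to 0.3 removes both false negatives at the price of one additional false positive, which is the trade-off to weigh when recall matters more than precision.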
There is not much health risk when a healthy person is flagged as diabetic. However, this could lead to patient dissatisfaction if they have to order additional medical tests, cause a financial burden to the individual and reduce the effectiveness of the ecosystem.
This disease, once named the “disease of the rich” and the “slow killer”, is now prevalent across all income and age levels. Narayan et al. elaborate on this as well. This could be explained by cheaper access to junk food and higher consumption of foods that cause diabetes. With improving socio-economic conditions, dietary patterns and habits could change for families. More income could sometimes mean less physical activity and exercise, with household chores outsourced that would typically have been done by the individual. This could have a ripple effect, with higher rates of obesity, stress-related smoking and the adoption of other unhealthy lifestyles, all of which could impact the prevalence of diabetes. Diabetes also increases the risk of high BP by damaging the blood vessels (CanopyHealth.com, Dr. Hatipoglu). Many times, those with high BP have not yet been diagnosed with diabetes. With this, it makes sense that BMI and high BP are the two most important predictors, and that income, education, physical activity, etc. are also predictors of diabetes.
For this model to be robust and used worldwide, it would be important to train it on datasets gathered from individuals of diverse racial and ethnic backgrounds, so that it performs well in those scenarios and does not have a higher error rate. Different dietary and genetic factors could lead to different body compositions and genetic predispositions. It would also be worth evaluating how these factors impact men and women. A few things I would try in the future include:
- Make the model less complex by eliminating weakly indicative predictors, so the selected set of behavioral traits can be more easily managed by individuals and health care providers.
- Try KNN, Logistic Regression and ANN with tuning parameters.
- Experiment with tuning parameters on Random Forest and SVM on a combination of features.
- Gather additional data by race, sex and age.
- Gather additional data on individuals under 18 years with and without diabetes. Prediction models could be different for children vs adults.