Gone are the days when athlete impression and judge psychology determined athlete selection! With the rapid evolution of the sports science industry, there is a growing focus on selecting athletes scientifically and improving their performance. One area of innovation gaining momentum is predictive analytics. Could we predict a sport group for an athlete from their body and/or blood composition using machine learning methods?
Purpose & Goals
The goals of this predictive data analytics project are to:
- Understand how the model’s predictive capabilities may benefit individuals.
- Identify the most influential predictors for the sports category.
- Make reasonably accurate predictions.
- Compare accuracy / error rate between the models.
Target Audience & Potential Uses of Model
This data analysis and predictive model could benefit athletes as well as professionals such as Sports Medical Doctors, Sports Haematologists, Physical Education Teachers, Physiotherapists/Exercise Physiologists, Dieticians or Nutritionists, Performance Analysts, Sports Coaches, Sports Therapists, Fitness Center Managers, Sports Administrators, Strength and Conditioning Specialists, Sports Food Manufacturers and Retail Managers. They could apply the model output in ways suited to their scale and grow their business by improving their product or service development. Potential uses include:
- Strength and conditioning training for high performance athletes, to prepare them for competitions and help them achieve the body and blood compositions of high performing athletes.
- Sports coaches and physical fitness trainers could use this information to train athletes toward ideal body compositions for the sport the athlete has chosen.
- Talent identification scouts could use it to find trainable athletes.
- Sports Nutritionists could use this information to help manage dietary needs and caloric recommendations for athletes. This can also be helpful to Physical Therapists to help injured athletes recover to their pre-injury levels.
- This analysis can be leveraged by Sports Science and Sports Medicine departments, practitioners and coaches to conduct what-if analysis.
- Athletes can use these predictions to manage their individual health metrics based on the sports they are playing. This can be helpful not only during the active sports years but also after athletes retire to manage long-term health goals.
About the Dataset & Key Stats
Physical and blood measurements from high performance athletes at the AIS (Australian Institute of Sport) were collected in 1991, giving 202 observations (102 male and 100 female athletes) and 13 variables. There are no missing values in the dataset. The original dataset has many sports categories. For this analysis, some categories have been combined:
- “ball” combines “tennis”, “netball”, and “basket ball”
- “track” combines “field”, “t_400m”, and “t_sprnt”
- “water/gym” combines “gym”, “row”, “swim”, and “w_polo”.
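The grouping above can be sketched as a simple lookup table. This is a minimal illustration in Python, not the code used for the analysis; the label spellings are copied from the list above, and the names `SPORT_TO_GROUP` and `to_sport_group` are hypothetical.

```python
# Hypothetical lookup from the original sport labels (as quoted in the
# list above) to the three combined groups used in this analysis.
SPORT_TO_GROUP = {
    "tennis": "ball",
    "netball": "ball",
    "basket ball": "ball",
    "field": "track",
    "t_400m": "track",
    "t_sprnt": "track",
    "gym": "water/gym",
    "row": "water/gym",
    "swim": "water/gym",
    "w_polo": "water/gym",
}


def to_sport_group(sport: str) -> str:
    """Return the combined group for an original sport label."""
    key = sport.strip().lower()
    if key not in SPORT_TO_GROUP:
        raise ValueError(f"unknown sport label: {sport!r}")
    return SPORT_TO_GROUP[key]


print(to_sport_group("Netball"))
print(to_sport_group("w_polo"))
```

Raising an error on an unknown label, rather than silently assigning a default group, makes it obvious if a sport category was missed when combining.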
The variables include:
- Sex: 0 = male, 1 = female
- Ht: height (cm)
- Wt: weight (kg)
- LBM: lean body mass
- RCC: red cell count
- WCC: white cell count
- Ferr: plasma ferritin concentration
- BMI: body mass index, weight / height^2 (kg/m^2)
- SSF: sum of skin folds
- Bfat: percent body fat
- Sport_group: 3 categories (ball, track and water/gym)
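As a small worked example of the BMI definition, the sketch below assumes that height is recorded in centimetres (as for Ht) and converted to metres before squaring. That is the usual BMI convention, but the variable list does not state it explicitly. The helper name `bmi` is hypothetical.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by squared height in metres."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0  # assumption: Ht is stored in centimetres
    return weight_kg / height_m ** 2


# An 80 kg athlete who is 200 cm tall: 80 / (2.0 ** 2)
print(bmi(80.0, 200.0))  # 20.0
```

Because BMI is computed entirely from Ht and Wt, it carries no information beyond those two predictors, which is worth keeping in mind when choosing model formulas.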
The following figure shows the mean and median of each variable for the sports categories.
The following figure shows the mean and median of SSF, Bfat, BMI, Hg and LBM for the sports categories, grouped by gender (0 as male and 1 as female).
The following graph shows sport_group density by sex.
The following graph shows the correlation between variables in our dataset. Red ellipses signify a positive correlation between predictor variables (for example, when body fat is higher, skin folds are higher), while blue ellipses signify a negative correlation (for example, when body fat is lower, red cell count is higher). We can see that skin folds are highly correlated with body fat.
The following graph shows the correlation between variables for each sports group (ball, track and water/gym). We can distinctly see that skin folds and body fat have a positive relationship, as do LBM and weight, and LBM and height.
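To make the idea of positive and negative correlation concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The text does not say which correlation measure the plots use, so Pearson is an assumption. The numbers are invented toy values, not measurements from the AIS dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy values only (not from the dataset).
rising = [1.0, 2.0, 3.0, 4.0, 5.0]
also_rising = [2.0, 4.0, 6.0, 8.0, 10.0]
falling = [10.0, 8.0, 6.0, 4.0, 2.0]

print(pearson(rising, also_rising))  # close to +1: positive correlation
print(pearson(rising, falling))      # close to -1: negative correlation
```

Values near +1 correspond to the strongly positive pairs in the plots, values near -1 to the strongly negative ones, and values near 0 to pairs with little linear relationship.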
In the graph below we can see the relationship between skin folds and sports group, and how it compares between males and females. From the data we observe that water/gym accounts for the most observations (80), followed by track and then ball.
As no data is missing, no imputation is required. The sex variable is stored as an integer but is categorical, so it is converted to a factor variable with two levels (0 and 1). Sport_group is a categorical character variable and has been converted to a factor variable with 3 levels. These conversions let the machine learning methods treat the predictor and response variables correctly as categories rather than as numbers or free text.
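The conversion to a factor can be illustrated with a language-neutral sketch. The text does not name the software used for the analysis, so this Python example only mimics the idea: a factor stores a fixed set of levels plus an integer code per observation. The helper `as_factor` is hypothetical.

```python
def as_factor(values):
    """Return (levels, codes): sorted distinct levels and one code per value."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    codes = [index[value] for value in values]
    return levels, codes


# Toy values only (not rows from the dataset).
levels, codes = as_factor(["track", "ball", "water/gym", "ball"])
print(levels)  # ['ball', 'track', 'water/gym']
print(codes)   # [1, 0, 2, 0]
```

The codes are only identifiers for the levels; a classifier should not read an ordering or a distance into them.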
Important Predictors and Relationship with Response Variable
From the graph below, we see that sum of skin folds (SSF) and hemoglobin (Hg) are the two most important predictors of a sports group for an athlete. Based on my limited domain knowledge and additional reading of published material, it is plausible that skin folds and hemoglobin are the most important predictors for sport groups: the type of training an athlete undergoes makes a difference to blood and body composition. This matters for high performing athletes, who need not only to train in their sport but also to optimize the body and blood composition factors that could increase their chances of success.
However, these predictors do raise a red flag. It would be worth evaluating how skin folds, body fat, height, weight, BMI and the blood features compare for sportsmen and sportswomen of different ethnic backgrounds. Do the patterns observed in Australian athletes hold uniformly for a more diverse population of sportspersons? Do genetics, diet or other factors play into blood and body composition? What are the outliers?
For this model to be robust and usable for athletes on a larger scale, in other countries or globally, it would be important to train it on datasets gathered from athletes of multiple racial and ethnic backgrounds, so that it performs well in those scenarios and does not have a higher error rate. Davidson et al. note that skin fold tests and body fat have not been evaluated across racial and ethnic groups.
The other concern with skin fold evaluation is that the accuracy of the test relies heavily on the tester's experience, since the measurement is taken by hand. For skin folds to be used as a predictor, it is important that the sample population is measured with a consistent method.
Model Development, Assessment & Interpretation
The following two machine learning classification algorithms were used for this analysis.
- KNN (K Nearest Neighbors): the tuning parameter k was tried at values from 1 to 20.
- Random Forest: the tuning parameter, the number of variables available for splitting at each node, was tried at 2, 3, 4, 5, 6, 7, 8, 9 and 12.
All 22 model formulas (combinations of predictor variables) were tested with both learning methods to predict the response variable, sport group, with each fit cross-validated 10 times on each subset of data. The best model chosen during the model fitting process was Random Forest. We can see this in the graph below, where the highlighted red vertical line marks the lowest error rate among all the models, 28% (an accuracy of 72%). The confusion matrix is shown below.
During the model assessment phase, the models were run on different training and test data subsets. Random Forest was consistently selected as the machine learning method: 4 out of 5 times the model using all 12 predictors was selected, and 1 out of 5 times a model using 5 predictor variables (Bfat + SSF + Wt + Ht + Hg), which is less complex in terms of the number of predictor variables. The following graph illustrates the error rate for all 110 models (22 per fold across 5 folds). We can see that, overall, Random Forest on the selected model has a lower error rate than the KNN method. The dotted blue lines separate the 22 model formulas for KNN and Random Forest in each fold. The 5 lowest CV error rates range from 26% to 35%, with accuracy ranging from 65% to 74%.
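The tuning procedure described in this section can be sketched for the KNN case. This is a minimal pure-Python illustration under stated assumptions, not the pipeline used for the analysis: it uses synthetic two-cluster data (not the AIS measurements), Euclidean distance, 10 folds, and the same grid of k from 1 to 20. Random Forest is left out for brevity. The names `knn_predict` and `cv_error` are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    neighbours = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def cv_error(X, y, k, n_folds=10, seed=0):
    """Pooled misclassification rate of k-NN over n_folds held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    mistakes = 0
    for fold in folds:
        held_out = set(fold)
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        mistakes += sum(
            knn_predict(train_X, train_y, X[i], k) != y[i] for i in fold
        )
    return mistakes / len(X)


# Synthetic data: two Gaussian clusters with made-up labels "a" and "b".
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
X += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(30)]
y = ["a"] * 30 + ["b"] * 30

# Evaluate every k in the grid and keep the one with the lowest CV error.
errors_by_k = {k: cv_error(X, y, k) for k in range(1, 21)}
best_k = min(errors_by_k, key=errors_by_k.get)
print(best_k, errors_by_k[best_k], 1 - errors_by_k[best_k])
```

Accuracy is simply one minus the error rate, which is why an error rate of 28% is reported alongside an accuracy of 72%. Using the same fold assignment (the fixed `seed`) for every k keeps the comparison between tuning values fair.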
Final Model Selection
The final model selected is Random Forest using all 12 predictor variables and a tuning parameter of 4. This is the same model that was selected during the model fitting and model assessment phases. After fitting on the complete dataset, it has an error rate of 29% and an accuracy of 71%. The graph levels out, indicating that 500 trees are adequate. From the graph below we can see the error rate for each of the sport group classes and the out-of-bag error rate. The prediction for track is the most accurate, followed by water/gym. The error rate in the ball category is significantly higher, which shows that the model performs poorly for this category. More data could be collected for athletes in this category to make a robust model that does well on new data.
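Per-class and overall error rates like those discussed above are read off a confusion matrix. The sketch below shows the arithmetic in Python. The counts are invented for illustration and are not the results of this analysis; the function names are hypothetical.

```python
def per_class_error(confusion):
    """confusion[actual][predicted] = count. Returns error rate per class."""
    rates = {}
    for actual, row in confusion.items():
        total = sum(row.values())
        rates[actual] = 1 - row.get(actual, 0) / total
    return rates


def overall_error(confusion):
    """Share of all observations that were assigned to the wrong class."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    return 1 - correct / total


# Made-up counts (NOT the article's results): rows are actual classes.
toy = {
    "ball": {"ball": 6, "track": 2, "water/gym": 2},
    "track": {"ball": 1, "track": 8, "water/gym": 1},
    "water/gym": {"ball": 1, "track": 1, "water/gym": 8},
}

print(per_class_error(toy))
print(overall_error(toy))
```

A class can have a much higher error rate than the overall figure when it has fewer observations or overlaps with other classes, which is the pattern described for the ball category.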
A few things I would try in the future include:
- Use BMI in place of the Height and Weight predictors, since it is derived from them.
- Make the model less complex by removing redundant predictors.
- Try Gradient Boosted Trees or Naive Bayes.
- Remove the tuning parameter values 9 and 12 from Random Forest and use fewer variables.
- Gather more data on athletes in the ball category, as the error rate is 41% for this class.
Dataset: Telford, R. D. and Cunningham, R. B. (1991). Sex, sport, and body-size dependency of hematology in highly trained athletes. Medicine and Science in Sports and Exercise, 23(7):788–794.