Predictive Analytics for Managing Demand of Cardiology Medical Examinations

ARIMA Model, Holt-Winters Double Exponential Smoothing, Bayesian Linear Regression

Photo Credits: Hush Naidoo. Source: https://unsplash.com/photos/yo01Z-9HQAw

Preface

1 Problem Statement

  • Health Center (HC) staffing is not optimal to meet the overall demand. HCs do not have the capacity to meet the deadline of <30 days for examinations due to lack of examining physicians and human resource constraints.
  • A significant amount of revenue is lost due to inefficiencies in the processing of disability examinations.
  • A small portion of the lost revenue is due to the Fargo Health Centers' (HCs') inability to process a request in under 30 days, which results in a $200 fine paid to the Regional Office of Health Oversight.
  • The majority of the lost revenue arises when an HC sends a request back to the Local Office (LO) because it knows in advance that staffing constraints will prevent it from meeting the 30-day deadline.
  • If an LO is not able to meet the demand, it outsources to an out-of-network Outpatient Clinic (OC) at a cost of $1,250 per request.
  • In addition to the high expense, OCs make no commitment on turnaround time because they do not follow the 30-day deadline, which further creates reputational and patient attrition risks.
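
The cost figures above lend themselves to a quick back-of-the-envelope calculation. The following Python sketch is illustrative only: the function name `lost_revenue` and the request counts are hypothetical, and only the $200 fine and the $1,250 outsourcing price come from the problem statement.

```python
# Figures taken from the problem statement.
LATE_FINE = 200        # USD fine per request processed after the 30-day deadline
OUTSOURCE_COST = 1250  # USD per request sent to an out-of-network OC


def lost_revenue(late_requests: int, outsourced_requests: int) -> int:
    """Return the direct cost in USD of late and outsourced requests."""
    if late_requests < 0 or outsourced_requests < 0:
        raise ValueError("request counts must be non-negative")
    return late_requests * LATE_FINE + outsourced_requests * OUTSOURCE_COST


# Hypothetical month: 10 late requests and 40 outsourced requests.
print(lost_revenue(10, 40))  # 10 * 200 + 40 * 1250 = 52000
```

Even with made-up volumes, the calculation shows why outsourcing dominates the fine: one outsourced request costs as much as roughly six late ones.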

1.1 Initiative Value Proposition
The resulting data-driven model serves as a decision-making tool for the Abbeville Health Center (HC), helping it manage staff levels appropriately for the expected volume of heart-related medical examinations.

1.2 Key Activities
Key project activities included:

  1. Analyzing raw dataset provided
  2. Cleansing, imputing and analyzing the final dataset
  3. Creating two forecasting models to forecast demand for 2014
  4. Selecting the best model
  5. Sharing results with project stakeholders and recommending a forecasting model

2 Key Benefits
The key benefits that Fargo Health Center could realize fall into the following categories:

2.1 Financial benefits:

  1. Generating maximum revenue from the incoming demand.
  2. Reducing outsourcing costs by reducing the number of costly examinations performed out of network and adequately staffing based on anticipated demand.
  3. Preventing patient attrition, which eventually results in lost revenue.
  4. Reducing operational costs.

2.2 Reputational & Patient-Satisfaction benefits:

  1. Reducing uncertainty about whether an out-of-network examination will be performed on time, and with it the reputational risk.
  2. Preventing patient satisfaction levels from going down due to long wait times.
  3. Ensuring staff have the emotional capacity to handle their work.
  4. Delivering patient-centered care.

2.3 Operational benefits:

  1. Proactively planning outsourcing for situations where the staffing plan is known to be inadequate, and accounting for any variances from model predictions.
  2. Operating at optimal capacity.
  3. Facilitating better matching of resources with clinical volume.
  4. Planning mitigation for increased demand from walk-ins.

2.4 Employee benefits:

  1. Aiding in effective scheduling. Staff vacations and time off can be planned around anticipated patient volume without impacting the business, contingent staffing needs can be determined, and an optimal patient-to-staff ratio can be maintained.
  2. A byproduct of optimal staffing is satisfied physicians who are not overworked.

3 Data Analytics Approach

The initial Proof of Concept (POC) focuses on a small dataset to predict demand for heart examinations at Abbeville HC. The Fargo Health dataset ranges from Jan 2006 to Dec 2013, which is sufficient to build a time series model and identify trends and irregularities that need to be considered for forecasting. The initial data, available in Excel format, will be inspected for quality issues. Data will be imputed using R. The cleaned output will be verified iteratively to ensure it is fit to feed into the two time series models, ARIMA and Holt-Winters. The results of the two models will be compared statistically using MAD (Mean Absolute Deviation), MAPE (Mean Absolute Percent Error) and MSE (Mean Squared Error) to select the best model. The following sections lay out the approach in detail.
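
For readers unfamiliar with the three comparison metrics named above, here is a minimal Python sketch using their standard definitions. The original analysis was done in R; the function names and the sample numbers below are hypothetical and are not taken from the Fargo Health data.

```python
def _check(actual, forecast):
    """Validate that both sequences are non-empty and of equal length."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")


def mad(actual, forecast):
    """Mean Absolute Deviation: average of |actual - forecast|."""
    _check(actual, forecast)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Mean Squared Error: average of (actual - forecast) squared."""
    _check(actual, forecast)
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean Absolute Percent Error, in percent. Actual values must be non-zero."""
    _check(actual, forecast)
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical hold-out values and forecasts from one model.
actual = [100, 200]
forecast = [110, 180]
print(mad(actual, forecast), mse(actual, forecast), mape(actual, forecast))
```

The model with the lower values across these metrics on the same hold-out period would be preferred. MSE penalizes large misses more heavily than MAD, so the two can disagree.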

3.1 Available Data
Data was received as a tabbed Excel spreadsheet: one tab for Abbeville and tabs for four other HCs, for Jan 2006 to Dec 2013. In addition, there is a tab listing heart-related conditions. An explanation of the dataset served as the primary data dictionary describing the contents of the various tabs. Key observations include:

  • It is unknown how the original data was captured and imported into Excel.
  • Abbeville data contains the fields incoming examinations, year and month.
  • The 2007 tabs for Violet, New Orleans, Lafayette and Baton Rouge are in a different format from the Abbeville tab, with the fields original hospital location, type of examination, date and request id.
  • Requests include different examination types.

3.2 Raw Data Analysis
Built-in Excel features, such as filtering and sorting for the Abbeville location and changing dates to a readable format where needed, were used to inspect the data. Preliminary analysis identified data issues such as:

  • outliers
  • duplicates
  • placeholders with no meaning
  • missing/blank data
  • text instead of numeric values
  • incorrect formats
  • data types not consistent
  • date formats not consistent
  • inconsistent naming conventions for examinations (capitalization, acronyms)
  • misspelled examination names
  • special characters

3.2.1 Additional observations

  • Of the 96 rows in Abbeville, 11 have missing values denoted by special characters, 999999 or a text comment. Sorting the Abbeville tab by year and then by month shows duplicate month records.
  • Oct 2008 Abbeville data is abnormally high due to New Orleans LO being closed.
  • Violet LO May 2007 data includes 2007 info only with original hospital location as Abbeville, Baton Rouge, Lafayette and New Orleans.
  • New Orleans May 2007, Lafayette, Baton Rouge tabs include 2013 examinations as well for Abbeville.
  • Dec 2013 data has 10481 requests in all, some of which belong to Abbeville based on SYSID and heart conditions.
  • Data will need to be imputed for 2006, 2008, part of 2009, part of 2010, and 2011.
  • The data is complete for 2007, 2012, and 2013.
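
The placeholder and non-numeric issues noted above can be screened with a small parsing helper. The original cleansing was done in R; the Python sketch below is only an illustration, and the function name `parse_count` and its exact rules (for example, rejecting negative or fractional counts) are assumptions rather than the authors' method. Only the 999999 placeholder comes from the observations above.

```python
import math

PLACEHOLDER = 999999  # placeholder value observed in the Abbeville tab


def parse_count(raw):
    """Return a monthly examination count as int, or None if the cell is unusable."""
    if raw is None:
        return None
    text = str(raw).strip().replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return None  # special characters, text comments, blank cells
    if not math.isfinite(value) or value < 0 or not value.is_integer():
        return None  # assumption: counts are non-negative whole numbers
    if int(value) == PLACEHOLDER:
        return None  # placeholder with no meaning
    return int(value)


# Hypothetical raw cells as they might arrive from a spreadsheet export.
cells = ["875", 3110.0, 999999, "***", "see note", ""]
print([parse_count(c) for c in cells])
```

Cells that come back as `None` are the candidates for imputation. Genuine but abnormal values, such as the Oct 2008 spike, pass through and need a separate judgement.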

3.3 Data Cleansing, Imputation — Methodology/Strategy
“Imputing one value for a missing datum cannot be correct in general, because we don’t know what value to impute with certainty (if we did, it wouldn’t be missing).” (Donald B. Rubin) Imputation is an art as well as a science. The following steps were implemented to cleanse and impute data.

  • Excel data was imported into R. The first step was to partially clean Abbeville data by removing non-numeric and placeholder information.
  • Each location’s data was read into R and filtered for Abbeville and heart-related conditions. Data was cleaned and imputed on a year-by-year basis. This allowed abnormalities and anomalies within a year to be handled appropriately without affecting other years’ data.
  • Examinations not in 2007 were imputed where necessary to complete the dataset. The majority of the missing data was imputed using the MICE package in R with Bayesian linear regression (the norm method), running 5 linear regressions and imputing the median value.
  • The October 2008 volume is abnormally high at 3110. This spike was the result of the New Orleans location being impacted by a hurricane. The volume was imputed based on a MICE random sample. A judgement call was made to impute the value based on trends for the past 2 months, which is close to the 3-period moving average.
  • December 2008 data was imputed using MICE random sampling functions to obtain a value that seemed plausible based on the volume seen for that year and for previous years. A value of 875 was chosen for December.
  • December 2013 data was used in conjunction with Heart-Related Condition Codes to determine which examinations are heart related.
  • For Dec 2009 to Feb 2010, the monthly volumes were not known, but the total volume for the 3-month period was available as 5129 examinations. The mean and standard deviation of the monthly percentages were computed to impute the missing monthly values.
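
The last step above, splitting a known 3-month total across individual months, can be sketched as a proportional allocation. This Python example is an assumption about how such a split might work, not the authors' R code: the function name `allocate_total` and the monthly shares are hypothetical, and only the total of 5129 comes from the text.

```python
def allocate_total(total, shares):
    """Split an integer total across periods in proportion to the given shares.

    Uses largest-remainder rounding so the parts always sum to the total.
    """
    if total < 0 or not shares or any(s < 0 for s in shares) or sum(shares) <= 0:
        raise ValueError("total must be >= 0 and shares must be non-negative with a positive sum")
    weight = sum(shares)
    exact = [total * s / weight for s in shares]
    counts = [int(x) for x in exact]  # floor, since all values are non-negative
    leftover = total - sum(counts)
    # Give the remaining units to the periods with the largest fractional parts.
    by_fraction = sorted(range(len(shares)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical mean monthly shares for Dec, Jan and Feb (not from the dataset).
shares = [0.30, 0.36, 0.34]
monthly = allocate_total(5129, shares)
print(monthly, sum(monthly))
```

In practice the shares would be the mean monthly percentages computed from years where Dec, Jan and Feb are all known, and their standard deviation indicates how much to trust the split.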

3.4 Data Related Assumptions for Model
A total of 19 examination types were selected as heart related examinations based on the 2007 dataset for Violet, New Orleans, Lafayette and Baton Rouge:

  1. Angina
  2. Aortic Valve Stenosis
  3. Arrhythmia
  4. CAD
  5. Cardiac
  6. Cardiovascular
  7. Coronary Artery Disease (CAD)
  8. Cur Pulmonale
  9. Cor Pulmonale
  10. Endocarditis
  11. Heart
  12. Heart Palpitations
  13. Ischemic Heart Disease
  14. Myocardial Infraction
  15. Myocardial Ischemia
  16. Myocarditis
  17. Premature Ventricular Contraction
  18. Ventricular Septal Defect (VSD)
  19. VSD

“Stress Test” and “Chest Pain” examination categories were not considered as heart related as other conditions could cause chest pain and stress test could be done for various reasons, including routine tests for athletes or to advise patients of the best level of physical activity.

3.5 Partially Cleansed Data
The following chart shows the Abbeville data, partially cleaned by removing non-numeric values and placeholders.

3.6 Final Cleansed & Imputed Data
This file has the same format as the original datafile including column headers. The following plot is based on cleansed and imputed data for 2006–2013. This removes the spikes seen for Oct 2008.

4 Forecasting Model & Results Analysis

Holt-Winters Double Exponential Smoothing:

This model accounts for the increasing trend, is reliable for long-term forecasting and uses a regression approach. The chart depicts 80% (dark gray) and 95% (light gray) confidence intervals (CI).
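
To make the mechanics of double exponential smoothing concrete, here is a minimal Python sketch of Holt's linear trend method. The original model was fit in R, so this is not the authors' code. The initialization (level from the first point, trend from the first difference), the smoothing parameters and the sample series are all assumptions, and no interval calculation is included.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's double exponential smoothing.

    alpha smooths the level, beta smooths the trend; both must lie in (0, 1].
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    if not (0 < alpha <= 1 and 0 < beta <= 1):
        raise ValueError("alpha and beta must be in (0, 1]")
    level = series[0]
    trend = series[1] - series[0]  # assumed initial trend: first difference
    for y in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical monthly volumes with a steady upward trend.
history = [10, 12, 14, 16, 18]
print(holt_forecast(history, alpha=0.5, beta=0.5, horizon=3))
```

A fitting routine would normally choose alpha and beta by minimizing the squared one-step-ahead errors rather than fixing them by hand as done here.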

ARIMA (Auto Regressive Integrated Moving Average)

The model was fit automatically to predict the trend and was analyzed using ACF and PACF. The chart shows 80% (dark gray) and 95% (light gray) confidence intervals (CI). The selected model is ARIMA(1,1,1). The auto.arima() function in R was used so that choosing the parameters is not time-consuming. With auto.arima(), the manual tasks of making the series stationary and determining the p and q values are eliminated.
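
The two manual steps that automatic fitting replaces, differencing the series and inspecting its autocorrelation, can be illustrated in a few lines. The Python sketch below is a simplified stand-in for R's tooling: it computes a first difference (the "1" in the middle of ARIMA(1,1,1)) and the sample ACF only, does not compute the PACF or fit a model, and uses a hypothetical series.

```python
def difference(series):
    """First difference: the change between consecutive observations."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    if n < 2 or not (0 <= max_lag < n):
        raise ValueError("need at least two points and 0 <= max_lag < len(series)")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)) / denominator
        for lag in range(max_lag + 1)
    ]


# Hypothetical trending series: differencing removes the upward drift.
volumes = [100, 104, 111, 115, 124, 128, 137]
changes = difference(volumes)
print(changes)
print(acf(changes, max_lag=2))
```

A slowly decaying ACF on the raw series suggests differencing is needed. Once the differenced series looks stationary, the ACF and PACF patterns guide the choice of q and p.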

4.1 Model Assumptions

Following are the assumptions for the model:

  • Demand for examinations is not affected by a downtrend on Fargo Health Center reputation.
  • Assume pricing of the tests remains the same.
  • No change in any other economic factors.
  • Consistent health profile of patients similar to the current profile.
  • No pandemics or epidemics that could increase the volume of health examinations in a certain region where HCs are located.
  • Marketing spend by Fargo Health remains the same.
  • Overall heart disability rate in the area stays at the same trajectory.
  • The model does not account for turning points (i.e. sudden periods of slow or rapid demand).

4.2 Model Selection

4.3 Model Analysis
The graph generated with cleaned and imputed dataset shows an increasing trend in heart examinations over the years. A comparison of 2013 Abbeville demand with 2014 forecasted demand shows a 16% increase and expected income growth if the demand can be sustained. 2015 shows 11% growth over 2014. This information will help Fargo Health staff appropriately.
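
The year-over-year growth figures quoted above are simple percent changes between annual totals. As a minimal Python sketch, with hypothetical annual totals chosen only to reproduce a 16% increase (the actual Abbeville totals are not given here):

```python
def percent_change(previous, current):
    """Return the percent change from `previous` to `current`."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is zero")
    return 100.0 * (current - previous) / previous


# Hypothetical annual totals, not taken from the dataset.
demand_2013 = 10000
forecast_2014 = 11600
print(percent_change(demand_2013, forecast_2014))
```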

5 Ethical Analysis & Implications

Context: What was the original purpose of the collection? How close is the new use to its original purpose?

The original purpose of data collection was to gather patient examination details for each Health Center (HC). It is assumed that the HC was doing some kind of manual forecasting and using it for scheduling staff. Fargo Health Group had various challenges with demand over the span of 8 years and would have had at least a rudimentary form of data analysis. With the introduction of a formal data analytics process, the purpose of the data remains unchanged. Ultimately, whether the approach is manual or data-driven, the goal is to optimize operations and scale revenue.

Consent: Was informed consent necessary from affected patients before data collection? If so, did they provide informed consent prior to data collection? Did they have an opportunity to decline? Are patients aware of their data being used?

While patient data is not directly identifiable from the dataset, it can be traced back to a patient using the request id. It is important that patients have signed a HIPAA waiver and understand how their data will be used. The case study does not mention whether patients signed the HIPAA waiver. It is assumed that they did. However, an ongoing process needs to be established to ensure that patients who did not sign the HIPAA waiver do not have their data transferred in a form that can be traced back to them. This data should be scrubbed. This is especially crucial because data is being passed to a third party for data analytics. If the analysis had been done in-house, a HIPAA waiver would not have been required, as the data would have been used in ways already permitted under HIPAA.

Patients often sign the waiver without reading it. It is the moral responsibility of the HC to explain the waiver and ensure that the patient understands what they are signing. Some patients don’t sign the HIPAA waiver, as it is not mandatory. In this case their data can only be used to the extent HIPAA allows.

It is critical that the key considerations in predictive analytics (respect, privacy, autonomy and doing no harm) are accepted as key ethical principles and that moral hazards are identified.

Reasonability: Is the depth and breadth of the dataset reasonable for the forecast?

The breadth and depth of data is definitely apt for the time series analysis for the scope laid out in the case study. However, if the scope of data analytics project is expanded to include additional slicing and dicing by other key measures then additional data would be needed. In that case a large diverse dataset across HCs would be needed. It would also be interesting to compare trends for each HC and determine tangible and intangible factors that lead to high/low trends in demand.

Fairness: Will the results be equitable for all parties (patients, Fargo Health, public health agencies, Fargo Health employees, etc.) when your forecasting model is deployed?

The results of the forecast can be utilized by many parties in ways specific to their business needs. The initial POC leverages the results to identify trends in demand for heart examinations at Abbeville HC. The HR team can use these results for staffing, and the Head of Heart Specialty can use the monthly predicted values for scheduling. Fargo Health employees can benefit by planning front desk coverage and logistics for peak times and months. Public health agencies could use the model to determine the overall trend in a community, rolled up to a national level to determine the overall trend of heart disease across demographics. The Logistics team could use this information for upgrading equipment used for heart examinations, and other hospitals could use it to manage bed utilization and prevent bed shortages.

With the current scope, no disadvantages are identified. However, if the scope changes to predict potential exams at the individual patient level, or to analyze household medical examination needs, the insights into individual risk profiles could be used destructively, for example to discriminate based on pre-existing conditions, charge more for health insurance or deny coverage, among many other undesired consequences.

As the healthcare industry expands, it is imperative that all concerned parties are aware of the risks and agree upon standards. Clear regulations are important as predictive analytics becomes more and more pervasive. Self-regulation is key in the absence of clear regulatory practices on healthcare predictive analytics.

Ownership: Who owns the dataset, analysis, and insights gleaned from data analysis? Is there a moral obligation for Fargo Health to act based on the forecasting model?

Fargo Health Group owns the dataset as well as the results of the analysis. The contracting company does not have the right to share the data with other third parties unless specifically stated in the NDA contract. Individually identifiable patient records are owned by patients. The contracting team will need to handle the data responsibly based on the NDA contract. Clauses for data destruction by the contracting company need to be laid out for when the project is complete and closed or transitioned to Fargo Health Group. All parties need to exercise due diligence as they handle data. Necessary training should be given to Fargo Health employees as well as contracting company employees. Maintaining data privacy is as much a cultural matter as a technical one. Teams should be trained on how to handle any printed versions of the data, including work-from-home clauses, to ensure data privacy and confidentiality. The insights and knowledge gained from the data are owned by Fargo Health Group. Although contracting companies in general gain business knowledge as a result of the analysis, they cannot reveal the health clinic organization.

Accountability: Who is accountable for mistakes and unintended consequences in data collection and analysis? Can the affected parties check the results that affect them? Security measures for data safeguarding.

The contracting company will need to sign an NDA (Non-Disclosure Agreement) and have clauses for breach of data. Fargo Health Group is ultimately responsible for breach of their patient data. Security measures need to be considered on how data is collected, transferred between LOs, transferred between HCs and out-of-network LOs. Additional steps include:

  • The data transmission process needs to be determined between HC and the contracting company and a secure pipeline created.
  • Implement administrative safeguards
  • Ensure contractors and unauthorized personnel cannot copy data on BYOD and portable devices. Implement BYOD policies and limit unauthorized data movement to external storage devices.

Fargo Health group needs to provide training to ensure data quality at the point of data entry and as it is stored and processed in the data warehouse. Based on the concept of GIGO, bad data will yield erroneous results thereby resulting in bad/expensive business decisions. The data analytics project team is responsible for cleaning and imputing data and being transparent to Fargo Health authorities on the imputation methods and any underlying assumptions as well as modeling assumptions.

While the model predicts demand it should not be solely relied upon as macro and micro factors play a key role in the demand of examinations. Human intervention in decision-making is key so machine learning algorithm biases are not over-relied upon.

It is critical that the dataset that is used for training the model is accurate and complete.

6 Suggested Next Steps after POC Approval

  • Refine the model based on how accurately it is predicting demand.
  • The following would be required for any kind of data analysis. This would also ensure that any future data science modeling projects have high-quality data to work with.
    - Establish data quality standards and practices such as naming conventions, grouping and relating examination types to conditions, date formats.
    - Establish data entry time frame and process on how data will be collected.
    - Train staff on established standards. For example, the request id field length ranges from 9 to 12. It should be standardized to a consistent length.
    - Establish automated data quality checks and validate if they are consistent across Health Centers.
    - Establish standards and process for validating data.
    - Establish data audit process.
    - Hold data compliance reviews.
  • Ensure HIPAA compliance.
  • Evaluate moral hazards and ethical issues as scope of predictive analytics expands.
  • Ensure transparency of underlying model assumptions.
  • Ensure analysis is done in a fair and socially acceptable way.
  • Establish a process so burden of proof can be established to prevent data misuse and aid in digital forensic investigation.
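
The request id example in the list above is the kind of rule that an automated data quality check can enforce. The Python sketch below is a minimal illustration under stated assumptions: the function name `nonconforming_ids` and the target length of 12 are hypothetical, and the sample ids are made up.

```python
def nonconforming_ids(request_ids, expected_length):
    """Return the request ids whose trimmed length differs from expected_length."""
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    return [rid for rid in request_ids if len(str(rid).strip()) != expected_length]


# Hypothetical ids: one has only 9 characters, one has stray whitespace.
sample = ["123456789012", "123456789", " 123456789012 "]
print(nonconforming_ids(sample, expected_length=12))
```

Running the same check against every Health Center's data would show whether the standard is applied consistently.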

7 Future Applications

  • If the proof of concept is deemed fit, expand the above practices for other types of examinations and leverage existing code for newer models.
  • Fargo Health Center could compare the internally developed model with the overall industry benchmark on heart examination predictions.
  • Fargo Health Center could outsource the model to other medical centers.
  • Public Health agencies could use the results to analyze past and future trends across the population.
  • The model could be further enhanced to analyze daily and weekly demand, as well as the days of the week or times of day when patient visits are higher.

8 Conclusion

9 References

10 Glossary

ARIMA Forecasting model for time series

HC (Health Center) Fargo Health Group clinic

Holt-Winters forecasting Forecasting model for time series

LO (Local Office) Fargo Health Group local offices. Data is available from 4 of them. In all there are 34 offices.

MAE Average of all absolute values of error

MAPE Percentage magnitude of error

MICE, package in R (Multivariate Imputations by Chained Equations) The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.

Source: R documentation

Appendix

Exec Director StratEx - I bring to the table blend of data science, finance and strategy management skills with 20+ years of experience in insurance & fintech.