Predictive Analytics and Machine Learning for Healthcare - Diabetes

DATA SCIENCE IN HEALTHCARE & LIFE SCIENCES
APPLIED CLINICAL ANALYTICS

DATA ANALYTICS IN HEALTHCARE & LIFE SCIENCES
1. VITAL BUSINESS PROBLEMS:
Many different problems exist, with varying degrees of complexity:
- What impacts favorable clinical outcomes
- Drivers of adverse events
- Factors impacting cost of care
- Earlier diagnosis of cancers and chronic diseases
Understanding these different business problems is critical for generating possible solutions.
2. POTENTIAL DATA SOURCES:
Huge amounts of data are now being generated by different sources that are capable of capturing information:
- Electronic Health Records
- Healthcare claims from Insurance companies
- Pharmacies – claims and medication reviews
- Lab tests and Imaging results
- Population health data – Social Determinants of Health
- Genomics (and later Proteomics and Metabolomics)
- Wearable and other devices
- Other sources (Surveys, Patient Reported Outcomes)
The volume, velocity, variety, and veracity of the data being generated are staggering: a typical Big Data problem.
3. DATA PROCESSING, MANAGEMENT AND ANALYSIS:
Making sense of these varied sources of data and processing them so that they are useful for analysis is a data engineering challenge.
Structured data needs to be cleaned and curated; data from different sources needs to be matched to get a complete 360-degree view of the patient.
Semi-structured and unstructured data sources (physician notes, imaging data) are difficult to curate and store in a way that allows the information to be retrieved and analyzed at scale and speed.
Various Big Data technologies have been developed to tackle this problem, both for storing the data (Hadoop ecosystem, Spark) and for analyzing semi-structured and unstructured data (text mining, NLP, deep learning for image and video analytics).
4. SOLUTIONS TO THE PROBLEMS:
At the end of the day, all the analysis should be able to generate actionable insights. Interpretation of the results and their implementation to solve the problem are key.

HOW ML/DL CAN AUGMENT THE DECISION-MAKING PROCESS FOR CLINICIANS
PROGNOSIS
• A machine-learning model can learn the patterns of health trajectories of vast numbers of patients. This facility can help physicians to anticipate future events at an expert level, drawing from information well beyond the individual physician’s practice experience. For example, how likely is it that a patient will be able to return to work, or how quickly will the disease progress?
DIAGNOSIS
• A diagnostic error will occur in the care of nearly every patient in his or her lifetime, and receiving the right diagnosis is critical to receiving appropriate care. This problem is not limited to rare conditions. Cardiac chest pain, TB, dysentery, and complications of childbirth are commonly not detected in developing countries.
TREATMENT
• In a large health care system with tens of thousands of physicians treating tens of millions of patients, there is variation in when and why patients present for care and how patients with similar conditions are treated. Can a model sort through these natural variations to help physicians identify when the collective experience points to a preferred treatment pathway?
CLINICAL WORKFLOW
• The same machine-learning techniques that are used in many consumer products can be used to make clinicians more efficient. Machine learning that drives search engines can help expose the required information in a patient’s chart for a clinician without multiple clicks. Data entry of forms and text fields can be improved with the use of machine-learning techniques.
REMOTE AREAS
• There is no way for physicians to individually interact with all the patients who may need care. Can machine learning extend the reach of clinicians to provide expert-level medical assessment without their personal involvement? For example, patients with new rashes may be able to obtain a diagnosis by sending a picture that they take on their smartphones, thereby averting unnecessary urgent-care visits.
REFERENCE: https://www.nejm.org/doi/full/10.1056/NEJMra1814259

COMPONENTS OF ELECTRONIC HEALTH RECORDS
[Diagram: EMR at the center, linked to Demographics & History, Drugs, Allergies, Visits, Admissions, Diagnoses, Lab Results, Procedure]
ADDITIONAL DATA FACTORS (normally not present)
• GENOMICS
• SOCIAL DETERMINANTS OF HEALTH
• IMAGING DATA – X-RAY/USG/CT/MRI
• PATIENT REPORTED OUTCOMES - PRO
STANDARD EMR/EHR DATA COMPONENTS
• DEMOGRAPHICS – Age, Gender, Race, Language, Religion, Insurance, Location
• CLINICAL HISTORY – Habits, Past Dx and Observations
• MEDICATIONS – Drug NDC, Quantity, Refills, Route, Rx dates
• FOOD AND DRUG ALLERGIES – Allergen, Reaction Desc., Severity, Dates
• VISITS TO ER AND OPD – Date/Time, Encounter Type, Provider Info
• INPATIENT ADMISSIONS – Date/Time, Source, Discharge Code
• PRIMARY DIAGNOSES AND COMORBIDITIES – ICD9/10, SNOMED
• PROCEDURES AND SURGERIES – Procedure codes and ICD codes
• LABORATORY RESULTS – LOINC, Date/Time, Reference Range, Value, UOM
Standard dictionaries: ICD9/10, SNOMED-CT, NDC, LOINC, NPI

DIABETES – THE MAGNITUDE OF THE PROBLEM
Diabetes is the world's eighth biggest killer, accounting for some 1.5 million deaths each year. A major World Health Organization report has revealed that the number of cases around the world nearly quadrupled, from 108 million in 1980 to 422 million in 2014. The Eastern Mediterranean region had the biggest increase in cases during that time frame. Diabetes now affects one in 11 adults, with high blood sugar levels linked to 3.8 million deaths every year.
REFERENCE: https://www.statista.com/chart/4617/the-unrelenting-global-march-of-diabetes/

WHAT HAPPENS IN DIABETES MELLITUS
• https://youtu.be/qn2dhw0NJxo
Type 1 diabetes (T1DM)
In people with type 1 diabetes, the body does not make insulin. The immune system attacks and destroys the cells in the pancreas that make insulin. Type 1 diabetes is usually diagnosed in children and young adults, although it can appear at any age. People with type 1 diabetes need to take insulin every day to stay alive.
Type 2 diabetes (T2DM)
In people with type 2 diabetes, the body does not make or use insulin well. Type 2 diabetes can develop at any age, even during childhood. However, this type of diabetes occurs most often in middle-aged and older people. Type 2 is the most common type of diabetes.
COURTESY: NIDDK
https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes
IMAGE COURTESY: KHAN ACADEMY

HOW MACHINE LEARNING CAN HELP IN DIABETES
• Predicting risk of heart failure for diabetes patients with help from machine learning
• Identification of Type 2 Diabetes Risk Factors Using Phenotypes Consisting of Anthropometry and Triglycerides based on Machine Learning
• Use of a Machine Learning Algorithm Improves Prediction of Progression to Diabetes
• Predicting Future Glucose Fluctuations Using Machine Learning and Wearable Sensor Data
• Predicting Diabetes Mellitus With Machine Learning Techniques
• Machine-learning to stratify diabetic patients using novel cardiac biomarkers and integrative genomics
• Predicting diabetic retinopathy and identifying interpretable biomedical features using machine learning algorithms
• Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records
• Data-Driven Blood Glucose Pattern Classification and Anomalies Detection: Machine-Learning Applications in Type 1 Diabetes

APPROACH FOR DM READMISSION PREDICTIVE MODEL
• DMT2 risk prediction using clinical data and statistical and machine learning
algorithms/models
Predictor Variables (total 44 variables)
• Demographic
  - Age
  - Gender
  - Ethnicity
• Diagnosis
  - Type of condition (DM T1/T2) diagnosis
  - # of comorbidities
  - Position (primary, secondary, etc.) of diagnosis
• Encounter
  - IP, OP, AE visits
• Medications
  - Dosage, frequency, route
• Lab results
  - Test names, dates, UOM, value
  - Normal/abnormal result
• Admission
  - Length of stay
  - Admission method (elective, non-elective)
  - Discharge destination
• Procedure
  - Count of procedures
  - Cost of procedures
Response Variable
• Readmission within 30 days
[Figure: INPUT, MODEL, and OUTPUT panels; timeline with observation window, performance window, and validation window; durations shown: 4 years, 1 year]
1 Data split into time windows
2 Models built using following algorithms (data from observation and performance windows)
• Logistic regression model (LOG)
• Decision tree model (DT)
• Random forest model (RF)
• Model ensembles
3 In-time validation (within performance window)
MODEL   GINI    AUC     KS      WORST DECILE CAPTURE
LOG     48.6%   74.3%   34.9%   29.4%
DT      37.3%   68.7%   38.5%   28.2%
RF      53.5%   76.7%   39.8%   33.7%
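The validation metrics reported above can be computed directly from a model's predicted scores. The sketch below is one plain-Python way to do it, not the presenter's code. It assumes binary labels (1 = readmitted) and that "worst decile capture" means the share of all readmissions found in the 10% of patients with the highest scores; that reading, and all function names, are assumptions. The reported GINI and AUC values are consistent with Gini = 2 x AUC - 1 (for example, 2 x 74.3% - 100% = 48.6%).

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Gini coefficient derived from AUC."""
    return 2.0 * auc(labels, scores) - 1.0


def ks_statistic(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        cum_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= t) / n_pos
        cum_neg = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= t) / n_neg
        best = max(best, abs(cum_pos - cum_neg))
    return best


def top_decile_capture(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Share of all positives that fall in the highest-scoring 10% of patients."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, len(ranked) // 10)
    return sum(y for _, y in ranked[:k]) / sum(labels)


# Toy data for illustration only (not from the presentation).
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(auc(toy_labels, toy_scores), gini(toy_labels, toy_scores))
```

The pairwise AUC is quadratic in the number of patients, which is fine for a worked example; a rank-based implementation would be preferable on a full EMR extract.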
4 Out-of-time validation (in validation window)
All three models provided accuracy of ~80% in the out-of-time validation scenario.
The RF model, with ~76% AUC, indicates a reasonably good fit.
5 Significant variables (major drivers of readmission)
• Severity of DM
• # of DM spells in past 1 year
• ED LOS in past 1 year
• # of procedures undergone
• # of OPD visits in past 1 year
• # of ED visits in past 1 year
• # of IP visits in past 1 year
• # of comorbidities
• Distance from hospital
• DM LOS in past 1 year
• Time since last ED visit
• Total ED cost in past 1 year
• Age of patient
6 Patient category based on risk score (scale from High to Low)
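Step 1 of the approach splits the data into observation, performance, and validation windows. A minimal sketch of such a split follows. The calendar dates are hypothetical, and it is only an assumption that the slide's "4 years" and "1 year" labels refer to the observation and performance windows; the function and variable names are illustrative.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# Hypothetical window boundaries, half-open [start, end).
# Assumed: 4-year observation window followed by a 1-year performance window.
WINDOWS: Dict[str, Tuple[date, date]] = {
    "observation": (date(2010, 1, 1), date(2014, 1, 1)),
    "performance": (date(2014, 1, 1), date(2015, 1, 1)),
    "validation": (date(2015, 1, 1), date(2016, 1, 1)),
}


def assign_window(event_date: date) -> Optional[str]:
    """Return the name of the window containing event_date, or None if outside all."""
    for name, (start, end) in WINDOWS.items():
        if start <= event_date < end:
            return name
    return None


def split_by_window(encounters: List[Tuple[str, date]]) -> Dict[str, List[str]]:
    """Group (encounter_id, date) pairs by window; out-of-range encounters are dropped."""
    groups: Dict[str, List[str]] = {name: [] for name in WINDOWS}
    for encounter_id, when in encounters:
        window = assign_window(when)
        if window is not None:
            groups[window].append(encounter_id)
    return groups


example = split_by_window([
    ("e1", date(2012, 6, 1)),
    ("e2", date(2014, 3, 15)),
    ("e3", date(2015, 11, 30)),
    ("e4", date(2017, 1, 1)),
])
print(example)
```

Splitting by time rather than at random keeps the out-of-time validation honest: the model never sees encounters from the period it is later evaluated on.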

RISK PREDICTION MODEL: DESIGN, EVALUATION
Missing imputation
• Mean/Median
• Regression
• KNN
Feature Selection
• Feature Imp
• RFE
• WoE and IV
Model Build
• Tree based (DT, RF, GBT)
• Others (SVM, NN, NB)
Model Evaluation
• K-fold cross validation
• ROC curve
Patient cohorts are created based on ICD 9/10 codes for defined chronic disease (e.g. DMT2) and also on the time of
diagnosis to separate already diagnosed patients from those who will potentially develop the disease.
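The cohort logic described above can be sketched as follows. This is an illustrative assumption rather than the presenter's implementation: it uses only the ICD-10 category E11 (type 2 diabetes mellitus), ignores ICD-9 codes, and the `Diagnosis` record, `build_cohorts` function, and index date are hypothetical names.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Diagnosis(NamedTuple):
    patient_id: str
    icd10: str
    diagnosed_on: date


T2DM_PREFIX = "E11"  # ICD-10 category for type 2 diabetes mellitus


def build_cohorts(
    diagnoses: Iterable[Diagnosis],
    all_patients: Iterable[str],
    index_date: date,
) -> Tuple[Set[str], Set[str]]:
    """Split patients into (already diagnosed by index_date, prospective cohort)."""
    first_dx: Dict[str, date] = {}
    for dx in diagnoses:
        if dx.icd10.upper().startswith(T2DM_PREFIX):
            previous = first_dx.get(dx.patient_id)
            if previous is None or dx.diagnosed_on < previous:
                first_dx[dx.patient_id] = dx.diagnosed_on
    diagnosed = {pid for pid, when in first_dx.items() if when <= index_date}
    prospective = set(all_patients) - diagnosed
    return diagnosed, prospective


records = [
    Diagnosis("p1", "E11.9", date(2013, 5, 1)),   # diagnosed before the index date
    Diagnosis("p2", "I10", date(2012, 1, 1)),     # hypertension only
    Diagnosis("p3", "E11.65", date(2015, 2, 1)),  # diagnosed after the index date
]
diagnosed, prospective = build_cohorts(records, ["p1", "p2", "p3"], date(2014, 1, 1))
print(diagnosed, prospective)
```

Patients first coded after the index date stay in the prospective cohort, which is what lets the model learn from people who later developed the disease.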
Prospective Cohort - Scoring Dataset
Feature selection mechanisms help to focus on the most important variables that drive the outcome variable; the methods mentioned above have been used.
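Of the feature selection methods listed, Weight of Evidence (WoE) and Information Value (IV) are the least self-explanatory, so a small sketch may help. It assumes a categorical (or pre-binned) feature and a binary outcome, uses the convention WoE = ln(share of events / share of non-events), and adds a smoothing constant of 0.5 to avoid division by zero. The convention, the smoothing, and the function name are assumptions, not details from the presentation.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def woe_iv(
    bins: Sequence[Hashable],
    labels: Sequence[int],
    smoothing: float = 0.5,
) -> Tuple[Dict[Hashable, float], float]:
    """Return per-bin Weight of Evidence and the feature's total Information Value."""
    events: Dict[Hashable, float] = defaultdict(float)
    nonevents: Dict[Hashable, float] = defaultdict(float)
    for b, y in zip(bins, labels):
        if y == 1:
            events[b] += 1
        else:
            nonevents[b] += 1
    categories = set(events) | set(nonevents)
    # Smoothed totals so that every bin has a non-zero share in both classes.
    total_events = sum(events.values()) + smoothing * len(categories)
    total_nonevents = sum(nonevents.values()) + smoothing * len(categories)
    woe: Dict[Hashable, float] = {}
    iv = 0.0
    for c in categories:
        share_events = (events.get(c, 0.0) + smoothing) / total_events
        share_nonevents = (nonevents.get(c, 0.0) + smoothing) / total_nonevents
        woe[c] = math.log(share_events / share_nonevents)
        iv += (share_events - share_nonevents) * woe[c]
    return woe, iv


# Toy feature: prior ED visits binned as "high" or "low" (illustrative data only).
toy_bins = ["high"] * 4 + ["low"] * 4
toy_outcome = [1, 1, 1, 0, 0, 0, 0, 1]
toy_woe, toy_iv = woe_iv(toy_bins, toy_outcome)
print(toy_woe, toy_iv)
```

A feature whose bins all have WoE near zero contributes an IV near zero and is a candidate for removal; the cut-off used to decide that is a modelling choice.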
EMR data has many dimensions, which also means that many values are missing; imputation methods help keep most of the features usable.
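As a minimal illustration of the simplest option, median imputation, the sketch below fills missing numeric values with the column median. The row format (dictionaries with `None` for missing values) and the column names are hypothetical; regression or KNN imputation would replace the `median` call with a fitted model.

```python
from statistics import median
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[float]]


def impute_median(rows: List[Row], columns: Sequence[str]) -> Tuple[List[Row], Dict[str, Optional[float]]]:
    """Fill None values in the given columns with each column's observed median."""
    fill: Dict[str, Optional[float]] = {}
    for col in columns:
        observed = [r[col] for r in rows if r.get(col) is not None]
        fill[col] = median(observed) if observed else None
    imputed: List[Row] = []
    for r in rows:
        new_row = dict(r)  # copy so the input rows are left untouched
        for col in columns:
            if new_row.get(col) is None:
                new_row[col] = fill[col]
        imputed.append(new_row)
    return imputed, fill


# Illustrative rows only; hba1c and los are hypothetical column names.
raw_rows: List[Row] = [
    {"hba1c": 7.0, "los": 3},
    {"hba1c": None, "los": 5},
    {"hba1c": 9.0, "los": None},
]
filled_rows, fill_values = impute_median(raw_rows, ["hba1c", "los"])
print(filled_rows)
```

The fill values are returned so that medians learned on the training window can be reused unchanged when scoring later data.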
The basic task is classification, which is done by computing the probability of the outcome for each patient and then applying thresholds.
Multiple models were created and then validated on accuracy metrics to select the best model. Cross validation and area under the ROC curve were utilized.
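K-fold cross validation can be sketched without any modelling library: shuffle the row indices once, cut them into k folds, and use each fold in turn as the held-out set. The function below is an illustrative assumption about the mechanics (the name, the fixed seed, and the unstratified split are not from the presentation).

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), sorted(test_idx))
```

Each candidate model would be fitted on `train_idx` and scored on `test_idx` (for example with area under the ROC curve), and the per-fold scores averaged before models are compared. With a rare outcome such as 30-day readmission, a stratified split that preserves the outcome rate in every fold is usually the safer choice.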
Scoring was done on the prospective cohort to group patients into high-risk, medium-risk, and low-risk groups. The high-risk group was to be targeted for interventions.
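Grouping scored patients into risk bands amounts to applying two cut-offs to the predicted probability. The sketch below shows the idea; the cut-off values 0.3 and 0.6, the function names, and the patient scores are hypothetical, since the presentation does not state the thresholds that were used.

```python
from typing import Dict, List


def risk_band(probability: float, medium_cutoff: float = 0.3, high_cutoff: float = 0.6) -> str:
    """Map a predicted probability to 'low', 'medium', or 'high' (cut-offs are assumed)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_cutoff:
        return "high"
    if probability >= medium_cutoff:
        return "medium"
    return "low"


def band_cohort(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group patient ids by risk band, highest-risk patients listed first in each band."""
    groups: Dict[str, List[str]] = {"high": [], "medium": [], "low": []}
    for patient_id, p in sorted(scores.items(), key=lambda item: item[1], reverse=True):
        groups[risk_band(p)].append(patient_id)
    return groups


# Hypothetical model outputs for a prospective cohort.
banded = band_cohort({"p1": 0.82, "p2": 0.45, "p3": 0.10, "p4": 0.61})
print(banded["high"])  # patients to target for interventions
```

In practice the cut-offs would be chosen from the validation results, for example so that the high-risk band matches the capacity of the intervention program.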

PRACTICAL USE CASE AND CODE DEMO
USE CASE
• Risk Prediction for Diabetes
DATASET
• Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of Clinical Database Patient Records
• UCI MACHINE LEARNING REPOSITORY - Description
• About 100,000 diabetic patient encounters from 130 US hospitals; CERNER HEALTH FACTS
OUTCOME
• How likely is a patient to be diagnosed with DM in the near future?
• How likely is a T2DM patient to be readmitted to the hospital within 30 days of discharge, or more than 30 days after discharge?
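For the readmission outcome, the UCI diabetes readmission dataset records the result in a `readmitted` column with the values `<30`, `>30`, and `NO`. A minimal sketch of turning that column into model targets is shown below; the function name and the choice of two binary targets are illustrative assumptions, not the approach taken in the linked code.

```python
from typing import Dict

VALID_VALUES = {"<30", ">30", "NO"}


def readmission_labels(readmitted: str) -> Dict[str, int]:
    """Convert a raw `readmitted` value into two binary targets."""
    value = readmitted.strip().upper()
    if value not in VALID_VALUES:
        raise ValueError(f"unexpected readmitted value: {readmitted!r}")
    return {
        "within_30_days": int(value == "<30"),   # early readmission
        "any_readmission": int(value != "NO"),   # readmitted at any time
    }


for raw in ["<30", ">30", "NO"]:
    print(raw, readmission_labels(raw))
```

Treating `<30` as the positive class gives the 30-day readmission model; keeping all three values instead would turn the task into a three-class problem.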
METHODS
Multiple ML models generated and compared
Individual Classifiers: DT, LOGREG, SVC
Ensemble Classifiers: RF, GBC
GitHub Link
