Statistical Modeling Final Project
Data350 Final Modeling Project
brian avery
1 August 2018
library(car)
library(broom)
library(mosaic)
Statistical Modeling Final Project
Introduction
Start with a brief introduction that explains the data set you’ve chosen, giving some background information about the provenance and structure of the data. Do a little exploratory data analysis (graphs, charts, tables, etc.) to learn about the data.
This data set is a subset from a single site (Cleveland Clinic) of a multisite a cardiology study to develop a new probabilistic algorithm to diagnose coronary artery disease (Detrano et al., International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease, Am J Cardiol 1989;64:304-310.).
The data was collected in the early 1980s and at the time there was little data and few models to predict the probability of coronary artery disease from diagnostic data.
The authors collected data from 303 patients that had no previous history of major heart disease (heart attack, valve problems, etc.)
codebook
info from - https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names The 14 variables below (included in the dataset used) are a subset of variables from the larger, complete dataset (the original variable numbers are shown at the end after #). Most f this info is from the link above, but I have added some info from the original paper and the google.
1. age - age in years (#3)
2. sex - sex (1 = male; 0 = female) (#4)
3. cp - chest pain type - Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic (#9)
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital) (#10)
5. chol - serum cholestoral in mg/dl (#12)
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false) (#16)
7. restecg - resting electrocardiographic results - Value 0: normal, Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria (#19), both 1 and 2 are diagnostic of heart attack/problems.
8. thalach - maximum heart rate achieved (#32), during the stress test (generally higher is more healthy, but it is also age related, the quick calc for mx HR is 220-age).
9. exang - exercise induced angina (1 = yes; 0 = no) (#38), this is chest pain that is a symptom of coronary artery disease.
10. oldpeak - ST depression induced by exercise relative to rest (#40) and higher is an indicator of heart attack (and a few other problems).
11. slope - the slope of the peak exercise ST segment - Value 1: upsloping, Value 2: flat, Value 3: downsloping (#41), flat or downsloping indicates low coronary artery function (and results in lack of oxygen in the heart muscle and damage).
12. ca - number of major vessels (0-3) colored by flouroscopy (#44) indicating that they contain calcium deposits which is indicator of hardening plaques.
13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect (#51) in an exercise thallium scintigraphic stress test. fixed defects are permanent heart damage, where as reversible are from ischemia and less severe.
14. num - diagnosis of heart disease (angiographic disease status) - Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 are vessels) - 0 is no disease, other values are disease (#58), I’ll be honest, this variable is a bit suspect as it appears to have been a clear 0=no disease, 1=disease variable at some point, but now has poorly explained other values.
The context of the original data collection of this data appears to be what is called a “nuclear thallium stress test” for each patient that is used to assess damage to the coronary arteries after heart attack or to judge the success of treatment for blocked coronary arteries. So it seems reasonable to try to use some of these “diagnostic” variables to predict if a person has a heart disease
info links/references:
http://ptca.org/imaging/thallium_stress_test.html
https://www.hopkinsmedicine.org/healthlibrary/test_procedures/cardiovascular/myocardial_perfusion_scan_stress_92,P07979
https://www.webmd.com/heart-disease/coronary-calcium-scan
https://en.wikipedia.org/wiki/ST_segment
https://www.nhlbi.nih.gov/health-topics/angina
Exploratory Data Analysis
get and prep data
my data source was this file converted to csv (in LibreOffice):
https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
First, we need to bring in the data:
cleveland <- read.csv("processed.cleveland.csv", na.strings="?")
str(cleveland)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : int 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : int 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : int 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : int 6 3 7 3 3 3 3 3 7 7 ...
## $ num : int 0 2 1 0 0 0 3 0 2 1 ...
I want to convert a few of the variables to factors in R based on the codebook descriptions.
cleveland$sex <- factor(cleveland$sex, levels=c(1,0), labels=c('M','F'))
cleveland$cp <- factor(cleveland$cp, levels=c(4,1,2,3), labels=c('none','t_ang','a_ang','non-ang'))
cleveland$fbs <- factor(cleveland$fbs, levels=c(0,1), labels=c('low','high'))
cleveland$restecg <- factor(cleveland$restecg, levels=c(0,1,2), labels=c('normal','wave_abn','lv_hyp'))
cleveland$exang <- factor(cleveland$exang, levels=c(0,1), labels=c('N', 'Y'))
cleveland$slope <- factor(cleveland$slope, levels=c(1,2,3), labels=c('pos', 'flat', 'neg'))
cleveland$thal <- factor(cleveland$thal, levels=c(3,6,7), labels=c('normal','fixed','reversible'))
# num could be a factor, but in next step, I make a new factor out of it
# cleveland$num <- factor(cleveland$num)
str(cleveland)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 2 2 1 1 ...
## $ cp : Factor w/ 4 levels "none","t_ang",..: 2 1 1 4 3 3 1 1 1 1 ...
## $ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "low","high": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "normal","wave_abn",..: 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : int 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "N","Y": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "pos","flat","neg": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : Factor w/ 3 levels "normal","fixed",..: 2 1 3 1 1 1 1 1 3 3 ...
## $ num : int 0 2 1 0 0 0 3 0 2 1 ...
Now I want a categorical variable that tells us whether the patient was diagnosed with a disease or not, I’m going to make a categorical variable here that does that. I’m basing my decision on the description of the num
variable, which says “0 is no disease, other values are disease”.
cleveland$disease <- factor(ifelse(cleveland$num!=0, "disease", "OK"))
I’m still somewhat suspicious of this variable since there are 2 similar variables in the research article.
Also, as we see below, oldpeak
has a lot of zeros and from the description and digging I did, zero is normal and all numerical values are abnormal, so to play with later, I’m also making a factor called STdep
that separates the zeros and values, but later I found evidence (https://www.ncbi.nlm.nih.gov/pubmed/16015437) that values greater than 2 were considered an indication of damage, so I made a factor called twommstd
that has any oldpeak
values larger than 2 coded as greater
and all others coded as less
.
cleveland$STdep <- factor(ifelse(cleveland$oldpeak>0, "depr", "none"))
cleveland$twommstd <- factor(ifelse(cleveland$oldpeak>2, "greater", "less"))
str(cleveland)
## 'data.frame': 303 obs. of 17 variables:
## $ age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 2 2 1 1 ...
## $ cp : Factor w/ 4 levels "none","t_ang",..: 2 1 1 4 3 3 1 1 1 1 ...
## $ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "low","high": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "normal","wave_abn",..: 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : int 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "N","Y": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "pos","flat","neg": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : Factor w/ 3 levels "normal","fixed",..: 2 1 3 1 1 1 1 1 3 3 ...
## $ num : int 0 2 1 0 0 0 3 0 2 1 ...
## $ disease : Factor w/ 2 levels "disease","OK": 2 1 1 2 2 2 1 2 1 1 ...
## $ STdep : Factor w/ 2 levels "depr","none": 1 1 1 1 1 1 1 1 1 1 ...
## $ twommstd: Factor w/ 2 levels "greater","less": 1 2 1 1 2 2 1 2 2 1 ...
Just to be sure I made the disease
variable right, I want to compare the number of zeros in the num
variable with the number of OK
values in the disease
variable.
tally(cleveland$num)
## X
## 0 1 2 3 4
## 164 55 36 35 13
Matches up.
tally(cleveland$disease)
## X
## disease OK
## 139 164
They are the same, and 55+36+35+13 = 139 which is the same as the disease
values, so seems legit.
tally(cleveland$disease, format="perc")
## X
## disease OK
## 45.875 54.125
and some good news, when I look at table 1 of the original research article (Detrano, Am J Cardid 1989;64:304-310) it lists 46% of the Cleveland sample as having “Disease”, so that lines up with my new variable!
legit EDA (finally)
Here is a summary of the data by variable:
summary(cleveland)
## age sex cp trestbps chol fbs restecg
## Min. :29.0 M:206 none :144 Min. : 94 Min. :126 low :258 normal :151
## 1st Qu.:48.0 F: 97 t_ang : 23 1st Qu.:120 1st Qu.:211 high: 45 wave_abn: 4
## Median :56.0 a_ang : 50 Median :130 Median :241 lv_hyp :148
## Mean :54.4 non-ang: 86 Mean :132 Mean :247
## 3rd Qu.:61.0 3rd Qu.:140 3rd Qu.:275
## Max. :77.0 Max. :200 Max. :564
##
## thalach exang oldpeak slope ca thal num
## Min. : 71 N:204 Min. :0.00 pos :142 Min. :0.000 normal :166 Min. :0.000
## 1st Qu.:134 Y: 99 1st Qu.:0.00 flat:140 1st Qu.:0.000 fixed : 18 1st Qu.:0.000
## Median :153 Median :0.80 neg : 21 Median :0.000 reversible:117 Median :0.000
## Mean :150 Mean :1.04 Mean :0.672 NA's : 2 Mean :0.937
## 3rd Qu.:166 3rd Qu.:1.60 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :202 Max. :6.20 Max. :3.000 Max. :4.000
## NA's :4
## disease STdep twommstd
## disease:139 depr:204 greater: 50
## OK :164 none: 99 less :253
##
##
##
##
##
Now let’s look at some variables by plotting:
First, what is the distribution of ages in the sample?
ggplot(cleveland, aes(age)) +
geom_histogram(binwidth=5) + theme_minimal() +
labs(title="age distribution", x="age in years", y="count")
age
has a bit of left skew to it, but it’s not too bad.
What do the ages look like split by sex?
ggplot(cleveland, aes(sex,age)) + geom_boxplot() + theme_minimal() +
labs(title="age distribution by sex", y="age in years")
what is the distribution of values of blood pressure?
ggplot(cleveland, aes(x = trestbps)) +
geom_histogram(binwidth=10) + theme_minimal() +
labs(title="distribution of patient resting BP", x="resting blood pressure (in mm Hg)", y="count")
What is the distribution of serum cholesterol values in the sample?
ggplot(cleveland, aes(x = chol)) +
geom_histogram(binwidth=20) + theme_minimal() +
labs(title="distribution of patient serum cholesterol", x="serum cholesterol (mg/dL)", y="count")
What is the distribution of mx heart rates during the stress test?
ggplot(cleveland, aes(thalach)) +
geom_histogram(binwidth=10) + theme_minimal() +
labs(title="maximum heart rate during stress test", x="max HR (bpm)")
thalach
isn’t quite normal either, it has some left skew to it.
oldpeak
has a lot of zeros which are “normal” and all other values are “abnormal” - I made a factor variable to reflect this status. It does remove some info from the non-zero values, however. Then I made a second factor variable (twommstd
) to better reflect that values greater than 2 had higher risk.
ggplot(cleveland, aes(oldpeak))+
geom_histogram(binwidth=.20) + theme_minimal() +
labs(title="ST depression induced by exercise", x="ST depression (mm)", y="count")
Now we should look more at thalach
since it is a likely response variable. What is the distribution by sex?
ggplot(cleveland, aes(x=sex, y=thalach)) + geom_boxplot() + theme_minimal() +
labs(title="maximum stress test heart rate by sex", y="max HR (bpm)")
We should also look to see if any of these variable distributions are different based on disease status.
first, cholesterol:
ggplot(cleveland, aes(x=disease, y=chol)) + geom_boxplot() + theme_minimal() +
labs(title="cholesterol level by disease status", y="cholesterol (mg/dL)")
Does resting blood pressure differe by disease status?
ggplot(cleveland, aes(x=disease, y=trestbps)) + geom_boxplot() + theme_minimal() +
labs(title="resting blood pressure by disease status", y="resting BP (mmHg)")
What about maximum stress test heart rate (thalach
) - does its distrubtion vary by disease status?
ggplot(cleveland, aes(x=disease, y=thalach)) + geom_boxplot() + theme_minimal() +
labs(title="maximum stress test heart rate by disease status", y="max HR (bpm)")
Yes, it appears that it does. It looks like individuals with coronary artery disease might have lower max HR in the stress test.
What about ST slope? Flat or negative ST slope is supposed to be trouble (normal is slight positive slope):
ggplot(cleveland, aes(x=slope, y=thalach)) + geom_boxplot() + theme_minimal() +
labs(title="maximum stress test heart rate by ST slope", y="max HR (bpm)", x="ST slope")
Might be a difference there.
What about chest pain type? Having chest pain is often bad…
ggplot(cleveland, aes(x=cp, y=thalach)) + geom_boxplot() + theme_minimal() +
labs(title="stress test max heart rate by chest pain type", y="max HR (bpm)", x="chest pain type")
Since max heart rate in the stress test (thalach
) is going to be my response variable, I’m going to make some scatterplots with possible predictor variables to look for linear patterns.
ggplot(cleveland, aes(x=trestbps, y=thalach)) + geom_point() + theme_minimal() +
labs(title="maximum stress test heart rate by resting blood pressure",
x="resting BP (mmHg)", y="max HR (bpm)")
ggplot(cleveland, aes(x=oldpeak, y=thalach)) + geom_point() + theme_minimal() +
labs(title="maximum stress test heart rate by ST depression",
x="ST depression during exercise (mm)", y="max HR (bpm)")
oldpeak
has all those zeroes that makes it a bit hard to work with.
ggplot(cleveland, aes(x=age, y=thalach)) + geom_point() + theme_minimal() +
labs(title="maximum stress test heart rate by age",
x="age (years)", y="max HR (bpm)")
age
has a negative linear trend.
ggplot(cleveland, aes(x=chol, y=thalach)) + geom_point() + theme_minimal() +
labs(title="maximum stress test heart rate by cholesterol",
x="serum cholesterol (mg/dL)", y="max HR (bpm)")
age
has a negative linear trend. That makes sense from my background research, as you age you have a lower maximum heart rate during exercise. None of the other numerical variables show anything obvious.
Mechanics
In the next part of your project, you will go through some “mechanical” tasks that demonstrate your ability to run basic types of analysis in R. These tasks will not necessarily be “good” models and you don’t need to worry about the conclusions expressing anything meaningful. Here are the tasks you need to complete:
Multiple regression
Use a numerical response variable, one numerical explanatory variable, and one categorical explanatory variable. (Allow R to use the default dummy coding with the categorical variable.)
Check scatterplots and residual plots. But even if the scatterplots and residual plots show that a multiple regression analysis is inappropriate, go ahead and run the analysis anyway.
Test for interaction both graphically and statistically.
Interpret the coefficients of the model, including the intercept (even when such coefficients might not make any sense for a variety of reasons).
additive model
My response variable will be thalach
(maximum HR in the stress test) and the explanatory variables I’ll use are age
(numerical) and sex
(categorical).
HR_as <- lm(thalach ~ age + sex, data=cleveland)
summary(HR_as)
##
## Call:
## lm(formula = thalach ~ age + sex, data = cleveland)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.44 -11.82 3.32 15.29 46.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203.667 7.379 27.60 < 2e-16
## age -1.018 0.134 -7.58 4.3e-13
## sexF 4.303 2.598 1.66 0.099
##
## Residual standard error: 21 on 300 degrees of freedom
## Multiple R-squared: 0.163, Adjusted R-squared: 0.157
## F-statistic: 29.2 on 2 and 300 DF, p-value: 2.69e-12
scatter plot:
ggplot(cleveland, aes(age,thalach)) + geom_point() + geom_smooth(method="lm") +
labs(title="stress test max heart rate by age", x="age (yrs)", y="max HR (bpm)") +
theme_minimal()
check the residuals:
HR_as_aug <- augment(HR_as)
ggplot(HR_as_aug, aes(sample=.std.resid)) + geom_qq() + theme_minimal()
ggplot(HR_as_aug, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept=0) +
theme_minimal()
The residuals for age
look reasonable - pretty normal and no obvious patterns (like heteroskedasticity) although there are more large negative values than positive.
model with interaction
HR_as_int <- lm(thalach ~ age * sex, data=cleveland)
summary(HR_as_int)
##
## Call:
## lm(formula = thalach ~ age * sex, data = cleveland)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.32 -13.19 3.47 15.63 46.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 208.245 9.058 22.99 < 2e-16
## age -1.103 0.166 -6.64 1.4e-10
## sexF -9.273 15.785 -0.59 0.56
## age:sexF 0.247 0.283 0.87 0.38
##
## Residual standard error: 21 on 299 degrees of freedom
## Multiple R-squared: 0.165, Adjusted R-squared: 0.156
## F-statistic: 19.7 on 3 and 299 DF, p-value: 1.15e-11
Anova(HR_as_int)
doesn’t appear to be an interaction.
ggplot(cleveland, aes(age,thalach, group=sex)) + geom_point(aes(color=sex)) +
geom_smooth(method="lm", se=F, aes(color=sex)) +
labs(title="stress test max heart rate by age", x="age (yrs)", y="max HR (bpm)") +
theme_minimal()
Based on both the statistical model and the scatterplot, there does not appear to be any interaction between age
and sex
.
interpretation
The model:
\[\widehat{maxHR}=203.7 - 1.018 \cdot age + 4.303 \cdot sexF\]
The model predicts that a patient of age 0 would have a maximum stress test heart rate of 203.7 bpm, and that is reduced by -1.018 bpm for every year of life, and 4.303 bpm are added if the patient is female (since male is the reference category). All other variables held constant of course.
The overall model is significant (P < 0.001 in ANOVA) and we have sufficient evidence to reject the null hypothesis that \(\beta_{age}=0\) (P < 0.001) but insufficient evidence to reject the null hypothesis that \(\beta_{sex}=0\) (P > 0.05).
The \(R^2\) for the additive model was 0.163, which means that we are able to explain 16.3% of the variation in stress test maximum heart rate using age
and sex
.
There did not appear to be any interaction between age
and sex
.
ANOVA
Use a numerical response variable, and two categorical explanatory variables. Use deviance coding (with “contr.sum”).
Test for the presence of interaction between the two explanatory variables both graphically and statistically.
Interpret the coefficients of the model, including the intercept (even when such coefficients might not make any sense for a variety of reasons).
additive model
I’m still using thalach
as my response variable, but will use cp
(chest pain) and rcg
(resting ECG abnormalities, derived from restecg
, see below).
Set the contrasts:
chest pain (cp
)…
contrasts(cleveland$cp) <- "contr.sum"
contrasts(cleveland$cp)
## [,1] [,2] [,3]
## none 1 0 0
## t_ang 0 1 0
## a_ang 0 0 1
## non-ang -1 -1 -1
I want to use the resting ECG data (restecg
) but one category has very few observations:
tally(cleveland$restecg)
## X
## normal wave_abn lv_hyp
## 151 4 148
so I’m going to collapse wave_abn
and lv_hyp
into a new category called abnormal
in a new variable called rcg
:
cleveland$rcg <- factor(ifelse(cleveland$restecg=="normal", "normal", "abnormal"),
levels=c("normal", "abnormal"))
now set the contrasts for rcg
:
contrasts(cleveland$rcg) <- "contr.sum"
contrasts(cleveland$rcg)
## [,1]
## normal 1
## abnormal -1
make the model:
HR_cr <- lm(thalach ~ cp + rcg, data=cleveland)
summary(HR_cr)
##
## Call:
## lm(formula = thalach ~ cp + rcg, data = cleveland)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.83 -12.01 1.75 15.72 46.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.61 1.52 101.07 <2e-16
## cp1 -12.79 1.98 -6.47 4e-10
## cp2 2.57 3.49 0.74 0.4626
## cp3 8.57 2.63 3.26 0.0012
## rcg1 1.01 1.24 0.82 0.4132
##
## Residual standard error: 21.2 on 298 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.139
## F-statistic: 13.2 on 4 and 298 DF, p-value: 6.38e-10
Anova(HR_cr)
interaction model
Look for interaction graphically:
# group means
gr_means <- cleveland %>%
group_by(cp, rcg) %>%
summarize(mean_thalach = mean(thalach))
gr_means
# interaction plot
ggplot() +
geom_point(data=cleveland, aes(cp, thalach, color=rcg),
position=position_jitterdodge(jitter.width=0.1, dodge.width=0.45), alpha=0.4) +
geom_point(data=gr_means, aes(cp, mean_thalach, color=rcg), shape=18, size=4) +
geom_line(data=gr_means, aes(cp, mean_thalach, group=rcg, color=rcg), size=1.2) +
scale_colour_hue(l=40) + theme_minimal() +
labs(title="interaction between resting ECG and chest pain", x="chest pain type",
y="max HR (bpm)") + guides(color=guide_legend(title="resting ECG"))
doesn’t look like any interaction from the plot.
build model with interaction:
HR_cr_int <- lm(thalach ~ cp * rcg, data=cleveland)
summary(HR_cr_int)
##
## Call:
## lm(formula = thalach ~ cp * rcg, data = cleveland)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.97 -11.98 1.59 15.59 46.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.6333 1.5619 98.36 < 2e-16
## cp1 -12.7958 2.0125 -6.36 7.7e-10
## cp2 2.5096 3.5805 0.70 0.4839
## cp3 8.6749 2.6957 3.22 0.0014
## rcg1 0.9042 1.5619 0.58 0.5631
## cp1:rcg1 0.2260 2.0125 0.11 0.9107
## cp2:rcg1 -0.0471 3.5805 -0.01 0.9895
## cp3:rcg1 -0.4382 2.6957 -0.16 0.8710
##
## Residual standard error: 21.3 on 295 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.131
## F-statistic: 7.48 on 7 and 295 DF, p-value: 0.0000000271
Anova((HR_cr_int))
doesn’t look like any interaction from the statistical model either.
interpretation
The model:
\[\widehat{maxHR}=153.61 -12.79 \cdot cp1 + 2.57 \cdot cp2 + 8.57 \cdot cp3 + 1.01 \cdot rcg1 \]
The model predicts that a patient with no chest pain and a normal resting ECG would have a stress test max HR of \(153.61 - 12.79 + 1.01 =\) 141.83., which is very close to the group mean for that group. All other variables held constant of course.
The results of the ANOVA table for this model suggest that there is sufficient evidence to support thinking the mean thalach
values for the different cp
groups are not equal, while there is insufficient evidence to support a difference in thalach
means in the different rcg
groups.
The \(R^2\) value of this model indicates that it can explain 51.1% of the variation in the max heart rate of a patient in a stress test (thalach
). That’s not the best model ever.
Finally, to look at pairwise comparisons:
TukeyHSD(HR_cr)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = x)
##
## $cp
## diff lwr upr p adj
## t_ang-none 15.27597 2.9637 27.5882 0.00810
## a_ang-none 21.73944 12.7391 30.7398 0.00000
## non-ang-none 14.72642 7.2541 22.1988 0.00000
## a_ang-t_ang 6.46348 -7.3511 20.2780 0.62174
## non-ang-t_ang -0.54954 -13.4209 12.3218 0.99952
## non-ang-a_ang -7.01302 -16.7643 2.7382 0.24852
##
## $rcg
## diff lwr upr p adj
## abnormal-normal -1.9701 -6.7689 2.8286 0.41977
each of t_ang
, a_ang
, and non-ang
are significnly different from none
, which is what the plot looks like as well.
Model and Hypothesis
For the final part of the project, go through the process of specifying a contextually meaningful model that tests a reasonable hypothesis. In addition to the steps you have done above, be sure to transform any variables that need transforming, do a thorough search for outliers and test them for influence, and run any other model diagnostics that are appropriate for your specific data. Use your model to come to a meaningful conclusion and learn something from your analysis.
linear model of max heart rate
Since the first seven variables are routinely collected for patients with possible heart trouble and an exercise stress test is a relatively common diagnostic test, I thought I would try to model the max heart rate achieved in the stress test (thalach
) using all of the variables collected before the test (age
+ sex
+ cp
+ trestbps
+ chol
+ fbs
+ rcg
) but my preliminary analysis (mostly above in the EDA and other models) indicates that many of the variables are crap for this, so I decided to make a model called the old guy with chest pain model using age
and cp
as explanatory variables.
hypotheses:
\(H_{0}: \beta_{age} = \beta_{cptang} = \beta_{cpaang} = \beta_{cpnonang}\)
\(H_{a}: one \beta \neq others\)
first, switch back to dummy regressors for cp
:
contrasts(cleveland$cp) <- "contr.treatment"
contrasts(cleveland$cp)
## t_ang a_ang non-ang
## none 0 0 0
## t_ang 1 0 0
## a_ang 0 1 0
## non-ang 0 0 1
make the model:
oldguy_cp <- lm(thalach ~ age + cp, data=cleveland)
summary(oldguy_cp)
##
## Call:
## lm(formula = thalach ~ age + cp, data = cleveland)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.85 -11.43 2.65 13.00 44.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.258 7.316 25.87 < 2e-16
## age -0.872 0.128 -6.82 5.2e-11
## cpt_ang 15.404 4.438 3.47 0.00059
## cpa_ang 17.937 3.292 5.45 1.1e-07
## cpnon-ang 12.961 2.706 4.79 2.6e-06
##
## Residual standard error: 19.8 on 298 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.254
## F-statistic: 26.7 on 4 and 298 DF, p-value: <2e-16
ANOVA table for the model:
Anova(oldguy_cp)
Graph to look for possible interaction between age
and `cp:
ggplot(cleveland, aes(age,thalach, group=cp)) + geom_point(aes(color=cp)) +
geom_smooth(method="lm", se=F, aes(color=cp)) +
labs(title="stress test max heart rate by age", x="age (yrs)", y="max HR (bpm)") +
guides(color=guide_legend(title="chest pain")) + theme_minimal()
The three versions of answering “yes” to chest pain cluster tightly together, while none is somewhat different. If I had more time, I’d make yet another variable of my own combining the different kinds of pain into a “yes” category… However, there doesn’t look like there is an interaction (and running an interaction model added nothing , data not shown).
check residuals
oldguy_cp_aug <- augment(oldguy_cp)
ggplot(oldguy_cp_aug, aes(sample=.std.resid)) + geom_qq() + theme_minimal()
ggplot(oldguy_cp_aug, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept=0) +
theme_minimal()
the residuals are generally well behaved (normal and constant variance) - there is a bit of funny business at the ends where there are few data points and there are larger negative residuals, but nothing too alarming.
interpretation
The model:
\[\widehat{maxHR}=189.3 - 0.872 \cdot age + 15.4 \cdot cp_{tang} + 17.93 \cdot cp_{aang} + 12.96 \cdot cp_{nonang}\] The model predicts that a patient that is zero years old would have a stress test maximum heart rate of 189.3 bpm and it would decrease by 0.872 bpm for each year they age. That’s all she wrote if there is no chest pain, but if they have t_ang chest pain, we would add 15.4 bpm to their max heart rate, if they have a_ang chest pain then we would add 17.93 bpm and if they have non_ang chest pain, we would add 12.96 bpm. All other variables held constant of course.
Each of the variables included in the model, age
and chest pain (cp
), were statistically significant (P <0.001) in ANOVA, and each of the \(\beta\)s was also significant (P<0.001) so we have sufficient evidence that the \(\beta\)s are not equal to zero. These variables add information to our model.
However, the \(R^2\) value is 0.264, so the model only accounts for 26.4% of the variation in stress test maximum heart rate.
bonus: logistic regression models
Since the research article from whence the data came used logistic regression to try to classify patients based on disease probability, I just wanted to fiddle with these models a bit. I split the entire 13 variable model into 2 simpler ones as well.
# rcg back to treatment contrasts
contrasts(cleveland$rcg) <- "contr.treatment"
This model has the routine, easy to gather medical data only.
routine <- glm(disease ~ age + sex + cp + trestbps + chol + fbs + rcg,
family = binomial(link = "logit"), data=cleveland)
summary(routine)
##
## Call:
## glm(formula = disease ~ age + sex + cp + trestbps + chol + fbs +
## rcg, family = binomial(link = "logit"), data = cleveland)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.300 -0.712 0.237 0.731 2.410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.81940 1.44631 3.33 0.00086
## age -0.05360 0.01851 -2.90 0.00377
## sexF 1.93639 0.36941 5.24 1.6e-07
## cpt_ang 2.51576 0.56975 4.42 1.0e-05
## cpa_ang 2.51342 0.45728 5.50 3.9e-08
## cpnon-ang 2.37788 0.36488 6.52 7.2e-11
## trestbps -0.01731 0.00874 -1.98 0.04762
## chol -0.00396 0.00295 -1.35 0.17862
## fbshigh -0.02734 0.41285 -0.07 0.94721
## rcgabnormal -0.45673 0.30148 -1.51 0.12979
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 283.84 on 293 degrees of freedom
## AIC: 303.8
##
## Number of Fisher Scoring iterations: 5
This model is just the data gathered from the stress test:
stress <- glm(disease ~ thalach + exang + twommstd + slope + ca + thal,
family = binomial(link = "logit"), data=cleveland)
summary(stress)
##
## Call:
## glm(formula = disease ~ thalach + exang + twommstd + slope +
## ca + thal, family = binomial(link = "logit"), data = cleveland)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.486 -0.479 0.280 0.558 2.722
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.62617 1.56421 -1.04 0.2985
## thalach 0.01738 0.00875 1.99 0.0470
## exangY -1.18548 0.38078 -3.11 0.0019
## twommstdless 1.73252 0.64660 2.68 0.0074
## slopeflat -0.84839 0.38563 -2.20 0.0278
## slopeneg 0.17360 0.80875 0.21 0.8300
## ca -1.15844 0.22165 -5.23 0.000000173
## thalfixed -0.64221 0.67426 -0.95 0.3409
## thalreversible -2.00087 0.36287 -5.51 0.000000035
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 221.47 on 288 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 239.5
##
## Number of Fisher Scoring iterations: 5
This is the full model described in the research article:
full <- glm(disease ~ age + sex + cp + trestbps + chol + fbs + rcg + thalach + exang +
twommstd + slope + ca + thal, family = binomial(link = "logit"), data=cleveland)
summary(full)
##
## Call:
## glm(formula = disease ~ age + sex + cp + trestbps + chol + fbs +
## rcg + thalach + exang + twommstd + slope + ca + thal, family = binomial(link = "logit"),
## data = cleveland)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.733 -0.341 0.129 0.505 2.878
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.56012 2.68729 0.21 0.83489
## age 0.01379 0.02504 0.55 0.58176
## sexF 1.54697 0.53495 2.89 0.00383
## cpt_ang 1.98645 0.67641 2.94 0.00332
## cpa_ang 0.82735 0.55927 1.48 0.13905
## cpnon-ang 1.83220 0.50586 3.62 0.00029
## trestbps -0.02299 0.01137 -2.02 0.04311
## chol -0.00569 0.00403 -1.41 0.15818
## fbshigh 0.67319 0.60650 1.11 0.26701
## rcgabnormal -0.40613 0.38589 -1.05 0.29259
## thalach 0.01925 0.01104 1.74 0.08123
## exangY -0.75711 0.44091 -1.72 0.08595
## twommstdless 1.61199 0.72150 2.23 0.02547
## slopeflat -1.23950 0.45672 -2.71 0.00665
## slopeneg -0.29834 0.95146 -0.31 0.75386
## ca -1.33425 0.27758 -4.81 0.0000015
## thalfixed 0.23863 0.79473 0.30 0.76397
## thalreversible -1.43079 0.43346 -3.30 0.00096
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 188.57 on 279 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 224.6
##
## Number of Fisher Scoring iterations: 6
Anova(full)
I would love to have more time to mess with these logistic regression models and do some calculations or look at the other test data that came from the paper (I think the data from other sites is in the repo too), but sadly, I’m outta time.