Statistical Modeling Final Project
Data350 Final Modeling Project
brian avery
1 August 2018
library(car)
library(broom)
library(mosaic)Statistical Modeling Final Project
Introduction
Start with a brief introduction that explains the data set you’ve chosen, giving some background information about the provenance and structure of the data. Do a little exploratory data analysis (graphs, charts, tables, etc.) to learn about the data.
This data set is a subset from a single site (Cleveland Clinic) of a multisite a cardiology study to develop a new probabilistic algorithm to diagnose coronary artery disease (Detrano et al., International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease, Am J Cardiol 1989;64:304-310.).
 The data was collected in the early 1980s and at the time there was little data and few models to predict the probability of coronary artery disease from diagnostic data.
 The authors collected data from 303 patients that had no previous history of major heart disease (heart attack, valve problems, etc.)
codebook
info from - https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names The 14 variables below (included in the dataset used) are a subset of variables from the larger, complete dataset (the original variable numbers are shown at the end after #). Most f this info is from the link above, but I have added some info from the original paper and the google.
 1. age - age in years (#3)
 2. sex - sex (1 = male; 0 = female) (#4)
 3. cp - chest pain type - Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic (#9)
 4. trestbps - resting blood pressure (in mm Hg on admission to the hospital) (#10)
 5. chol - serum cholestoral in mg/dl (#12)
 6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false) (#16)
 7. restecg - resting electrocardiographic results - Value 0: normal, Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria (#19), both 1 and 2 are diagnostic of heart attack/problems.
 8. thalach - maximum heart rate achieved (#32), during the stress test (generally higher is more healthy, but it is also age related, the quick calc for mx HR is 220-age).
 9. exang - exercise induced angina (1 = yes; 0 = no) (#38), this is chest pain that is a symptom of coronary artery disease.
 10. oldpeak - ST depression induced by exercise relative to rest (#40) and higher is an indicator of heart attack (and a few other problems).
 11. slope - the slope of the peak exercise ST segment - Value 1: upsloping, Value 2: flat, Value 3: downsloping (#41), flat or downsloping indicates low coronary artery function (and results in lack of oxygen in the heart muscle and damage).
 12. ca - number of major vessels (0-3) colored by flouroscopy (#44) indicating that they contain calcium deposits which is indicator of hardening plaques.
 13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect (#51) in an exercise thallium scintigraphic stress test. fixed defects are permanent heart damage, where as reversible are from ischemia and less severe.
 14. num - diagnosis of heart disease (angiographic disease status) - Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 are vessels) - 0 is no disease, other values are disease (#58), I’ll be honest, this variable is a bit suspect as it appears to have been a clear 0=no disease, 1=disease variable at some point, but now has poorly explained other values.
The context of the original data collection of this data appears to be what is called a “nuclear thallium stress test” for each patient that is used to assess damage to the coronary arteries after heart attack or to judge the success of treatment for blocked coronary arteries. So it seems reasonable to try to use some of these “diagnostic” variables to predict if a person has a heart disease
info links/references:
 http://ptca.org/imaging/thallium_stress_test.html
 https://www.hopkinsmedicine.org/healthlibrary/test_procedures/cardiovascular/myocardial_perfusion_scan_stress_92,P07979
 https://www.webmd.com/heart-disease/coronary-calcium-scan
 https://en.wikipedia.org/wiki/ST_segment
 https://www.nhlbi.nih.gov/health-topics/angina
Exploratory Data Analysis
get and prep data
my data source was this file converted to csv (in LibreOffice):
 https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
First, we need to bring in the data:
cleveland <- read.csv("processed.cleveland.csv", na.strings="?")
str(cleveland)## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : int  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : int  6 3 7 3 3 3 3 3 7 7 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...I want to convert a few of the variables to factors in R based on the codebook descriptions.
cleveland$sex <- factor(cleveland$sex, levels=c(1,0), labels=c('M','F'))
cleveland$cp <- factor(cleveland$cp, levels=c(4,1,2,3), labels=c('none','t_ang','a_ang','non-ang'))
cleveland$fbs <- factor(cleveland$fbs, levels=c(0,1), labels=c('low','high'))
cleveland$restecg <- factor(cleveland$restecg, levels=c(0,1,2), labels=c('normal','wave_abn','lv_hyp'))
cleveland$exang <- factor(cleveland$exang, levels=c(0,1), labels=c('N', 'Y'))
cleveland$slope <- factor(cleveland$slope, levels=c(1,2,3), labels=c('pos', 'flat', 'neg'))
cleveland$thal <- factor(cleveland$thal, levels=c(3,6,7), labels=c('normal','fixed','reversible'))
# num could be a factor, but in next step, I make a new factor out of it
# cleveland$num <- factor(cleveland$num)
str(cleveland)## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 2 2 1 1 ...
##  $ cp      : Factor w/ 4 levels "none","t_ang",..: 2 1 1 4 3 3 1 1 1 1 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "low","high": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg : Factor w/ 3 levels "normal","wave_abn",..: 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "N","Y": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "pos","flat","neg": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : Factor w/ 3 levels "normal","fixed",..: 2 1 3 1 1 1 1 1 3 3 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...Now I want a categorical variable that tells us whether the patient was diagnosed with a disease or not, I’m going to make a categorical variable here that does that. I’m basing my decision on the description of the num variable, which says “0 is no disease, other values are disease”.
cleveland$disease <- factor(ifelse(cleveland$num!=0, "disease", "OK"))I’m still somewhat suspicious of this variable since there are 2 similar variables in the research article.
Also, as we see below, oldpeak has a lot of zeros and from the description and digging I did, zero is normal and all numerical values are abnormal, so to play with later, I’m also making a factor called STdep that separates the zeros and values, but later I found evidence (https://www.ncbi.nlm.nih.gov/pubmed/16015437) that values greater than 2 were considered an indication of damage, so I made a factor called twommstd that has any oldpeak values larger than 2 coded as greater and all others coded as less.
cleveland$STdep <- factor(ifelse(cleveland$oldpeak>0, "depr", "none"))
cleveland$twommstd <- factor(ifelse(cleveland$oldpeak>2, "greater", "less"))
str(cleveland)## 'data.frame':    303 obs. of  17 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 2 2 1 1 ...
##  $ cp      : Factor w/ 4 levels "none","t_ang",..: 2 1 1 4 3 3 1 1 1 1 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "low","high": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg : Factor w/ 3 levels "normal","wave_abn",..: 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "N","Y": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "pos","flat","neg": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : Factor w/ 3 levels "normal","fixed",..: 2 1 3 1 1 1 1 1 3 3 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...
##  $ disease : Factor w/ 2 levels "disease","OK": 2 1 1 2 2 2 1 2 1 1 ...
##  $ STdep   : Factor w/ 2 levels "depr","none": 1 1 1 1 1 1 1 1 1 1 ...
##  $ twommstd: Factor w/ 2 levels "greater","less": 1 2 1 1 2 2 1 2 2 1 ...Just to be sure I made the disease variable right, I want to compare the number of zeros in the num variable with the number of OK values in the disease variable.
tally(cleveland$num)## X
##   0   1   2   3   4 
## 164  55  36  35  13Matches up.
tally(cleveland$disease)## X
## disease      OK 
##     139     164They are the same, and 55+36+35+13 = 139 which is the same as the disease values, so seems legit.
tally(cleveland$disease, format="perc")## X
## disease      OK 
##  45.875  54.125and some good news, when I look at table 1 of the original research article (Detrano, Am J Cardid 1989;64:304-310) it lists 46% of the Cleveland sample as having “Disease”, so that lines up with my new variable!
legit EDA (finally)
Here is a summary of the data by variable:
summary(cleveland)##       age       sex           cp         trestbps        chol       fbs          restecg   
##  Min.   :29.0   M:206   none   :144   Min.   : 94   Min.   :126   low :258   normal  :151  
##  1st Qu.:48.0   F: 97   t_ang  : 23   1st Qu.:120   1st Qu.:211   high: 45   wave_abn:  4  
##  Median :56.0           a_ang  : 50   Median :130   Median :241              lv_hyp  :148  
##  Mean   :54.4           non-ang: 86   Mean   :132   Mean   :247                            
##  3rd Qu.:61.0                         3rd Qu.:140   3rd Qu.:275                            
##  Max.   :77.0                         Max.   :200   Max.   :564                            
##                                                                                            
##     thalach    exang      oldpeak      slope           ca                thal          num       
##  Min.   : 71   N:204   Min.   :0.00   pos :142   Min.   :0.000   normal    :166   Min.   :0.000  
##  1st Qu.:134   Y: 99   1st Qu.:0.00   flat:140   1st Qu.:0.000   fixed     : 18   1st Qu.:0.000  
##  Median :153           Median :0.80   neg : 21   Median :0.000   reversible:117   Median :0.000  
##  Mean   :150           Mean   :1.04              Mean   :0.672   NA's      :  2   Mean   :0.937  
##  3rd Qu.:166           3rd Qu.:1.60              3rd Qu.:1.000                    3rd Qu.:2.000  
##  Max.   :202           Max.   :6.20              Max.   :3.000                    Max.   :4.000  
##                                                  NA's   :4                                       
##     disease     STdep        twommstd  
##  disease:139   depr:204   greater: 50  
##  OK     :164   none: 99   less   :253  
##                                        
##                                        
##                                        
##                                        
## Now let’s look at some variables by plotting:
 First, what is the distribution of ages in the sample?
ggplot(cleveland, aes(age)) + 
  geom_histogram(binwidth=5) + theme_minimal() + 
  labs(title="age distribution", x="age in years", y="count")age has a bit of left skew to it, but it’s not too bad.
What do the ages look like split by sex?
ggplot(cleveland, aes(sex,age)) + geom_boxplot() + theme_minimal() + 
  labs(title="age distribution by sex", y="age in years")what is the distribution of values of blood pressure?
ggplot(cleveland, aes(x = trestbps)) +
  geom_histogram(binwidth=10) + theme_minimal() + 
  labs(title="distribution of patient resting BP", x="resting blood pressure (in mm Hg)", y="count")What is the distribution of serum cholesterol values in the sample?
ggplot(cleveland, aes(x = chol)) + 
  geom_histogram(binwidth=20) + theme_minimal() + 
  labs(title="distribution of patient serum cholesterol", x="serum cholesterol (mg/dL)", y="count")What is the distribution of mx heart rates during the stress test?
ggplot(cleveland, aes(thalach)) + 
  geom_histogram(binwidth=10) + theme_minimal() + 
  labs(title="maximum heart rate during stress test", x="max HR (bpm)")thalach isn’t quite normal either, it has some left skew to it.
oldpeak has a lot of zeros which are “normal” and all other values are “abnormal” - I made a factor variable to reflect this status. It does remove some info from the non-zero values, however. Then I made a second factor variable (twommstd) to better reflect that values greater than 2 had higher risk.
ggplot(cleveland, aes(oldpeak))+
  geom_histogram(binwidth=.20) + theme_minimal() + 
  labs(title="ST depression induced by exercise", x="ST depression (mm)", y="count")Now we should look more at thalach since it is a likely response variable. What is the distribution by sex?
ggplot(cleveland, aes(x=sex, y=thalach)) + geom_boxplot() + theme_minimal() + 
  labs(title="maximum stress test heart rate by sex", y="max HR (bpm)")We should also look to see if any of these variable distributions are different based on disease status.
first, cholesterol:
ggplot(cleveland, aes(x=disease, y=chol)) + geom_boxplot() + theme_minimal() + 
  labs(title="cholesterol level by disease status", y="cholesterol (mg/dL)")Does resting blood pressure differe by disease status?
ggplot(cleveland, aes(x=disease, y=trestbps)) + geom_boxplot() + theme_minimal() + 
  labs(title="resting blood pressure by disease status", y="resting BP (mmHg)")What about maximum stress test heart rate (thalach) - does its distrubtion vary by disease status?
ggplot(cleveland, aes(x=disease, y=thalach)) + geom_boxplot() + theme_minimal() + 
  labs(title="maximum stress test heart rate by disease status", y="max HR (bpm)")Yes, it appears that it does. It looks like individuals with coronary artery disease might have lower max HR in the stress test.
What about ST slope? Flat or negative ST slope is supposed to be trouble (normal is slight positive slope):
ggplot(cleveland, aes(x=slope, y=thalach)) + geom_boxplot() + theme_minimal() + 
  labs(title="maximum stress test heart rate by ST slope", y="max HR (bpm)", x="ST slope")Might be a difference there.
What about chest pain type? Having chest pain is often bad…
ggplot(cleveland, aes(x=cp, y=thalach)) + geom_boxplot() + theme_minimal() + 
  labs(title="stress test max heart rate by chest pain type", y="max HR (bpm)", x="chest pain type")Since max heart rate in the stress test (thalach) is going to be my response variable, I’m going to make some scatterplots with possible predictor variables to look for linear patterns.
ggplot(cleveland, aes(x=trestbps, y=thalach)) + geom_point() + theme_minimal() + 
  labs(title="maximum stress test heart rate by resting blood pressure", 
       x="resting BP (mmHg)", y="max HR (bpm)")ggplot(cleveland, aes(x=oldpeak, y=thalach)) + geom_point() + theme_minimal() + 
  labs(title="maximum stress test heart rate by ST depression", 
       x="ST depression during exercise (mm)", y="max HR (bpm)")oldpeak has all those zeroes that makes it a bit hard to work with.
ggplot(cleveland, aes(x=age, y=thalach)) + geom_point() + theme_minimal() + 
  labs(title="maximum stress test heart rate by age", 
       x="age (years)", y="max HR (bpm)")age has a negative linear trend.
ggplot(cleveland, aes(x=chol, y=thalach)) + geom_point() + theme_minimal() + 
  labs(title="maximum stress test heart rate by cholesterol", 
       x="serum cholesterol (mg/dL)", y="max HR (bpm)")age has a negative linear trend. That makes sense from my background research, as you age you have a lower maximum heart rate during exercise. None of the other numerical variables show anything obvious.
Mechanics
In the next part of your project, you will go through some “mechanical” tasks that demonstrate your ability to run basic types of analysis in R. These tasks will not necessarily be “good” models and you don’t need to worry about the conclusions expressing anything meaningful. Here are the tasks you need to complete:
Multiple regression
Use a numerical response variable, one numerical explanatory variable, and one categorical explanatory variable. (Allow R to use the default dummy coding with the categorical variable.)
 Check scatterplots and residual plots. But even if the scatterplots and residual plots show that a multiple regression analysis is inappropriate, go ahead and run the analysis anyway.
 Test for interaction both graphically and statistically.
 Interpret the coefficients of the model, including the intercept (even when such coefficients might not make any sense for a variety of reasons).
additive model
My response variable will be thalach (maximum HR in the stress test) and the explanatory variables I’ll use are age (numerical) and sex (categorical).
HR_as <- lm(thalach ~ age + sex, data=cleveland)
summary(HR_as)## 
## Call:
## lm(formula = thalach ~ age + sex, data = cleveland)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64.44 -11.82   3.32  15.29  46.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  203.667      7.379   27.60  < 2e-16
## age           -1.018      0.134   -7.58  4.3e-13
## sexF           4.303      2.598    1.66    0.099
## 
## Residual standard error: 21 on 300 degrees of freedom
## Multiple R-squared:  0.163,  Adjusted R-squared:  0.157 
## F-statistic: 29.2 on 2 and 300 DF,  p-value: 2.69e-12scatter plot:
ggplot(cleveland, aes(age,thalach)) + geom_point() + geom_smooth(method="lm") + 
  labs(title="stress test max heart rate by age", x="age (yrs)", y="max HR (bpm)") + 
  theme_minimal()check the residuals:
HR_as_aug <- augment(HR_as)
ggplot(HR_as_aug, aes(sample=.std.resid)) + geom_qq() + theme_minimal()
ggplot(HR_as_aug, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept=0) + 
  theme_minimal()The residuals for age look reasonable - pretty normal and no obvious patterns (like heteroskedasticity) although there are more large negative values than positive.
model with interaction
HR_as_int <- lm(thalach ~ age * sex, data=cleveland)
summary(HR_as_int)## 
## Call:
## lm(formula = thalach ~ age * sex, data = cleveland)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -63.32 -13.19   3.47  15.63  46.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  208.245      9.058   22.99  < 2e-16
## age           -1.103      0.166   -6.64  1.4e-10
## sexF          -9.273     15.785   -0.59     0.56
## age:sexF       0.247      0.283    0.87     0.38
## 
## Residual standard error: 21 on 299 degrees of freedom
## Multiple R-squared:  0.165,  Adjusted R-squared:  0.156 
## F-statistic: 19.7 on 3 and 299 DF,  p-value: 1.15e-11Anova(HR_as_int)doesn’t appear to be an interaction.
ggplot(cleveland, aes(age,thalach, group=sex)) + geom_point(aes(color=sex)) + 
  geom_smooth(method="lm", se=F, aes(color=sex)) + 
  labs(title="stress test max heart rate by age", x="age (yrs)", y="max HR (bpm)") + 
  theme_minimal()Based on both the statistical model and the scatterplot, there does not appear to be any interaction between age and sex.
interpretation
The model:
 \[\widehat{maxHR}=203.7 - 1.018 \cdot age + 4.303 \cdot sexF\]
The model predicts that a patient of age 0 would have a maximum stress test heart rate of 203.7 bpm, and that is reduced by -1.018 bpm for every year of life, and 4.303 bpm are added if the patient is female (since male is the reference category). All other variables held constant of course.
The overall model is significant (P < 0.001 in ANOVA) and we have sufficient evidence to reject the null hypothesis that \(\beta_{age}=0\) (P < 0.001) but insufficient evidence to reject the null hypothesis that \(\beta_{sex}=0\) (P > 0.05).
The \(R^2\) for the additive model was 0.163, which means that we are able to explain 16.3% of the variation in stress test maximum heart rate using age and sex.
There did not appear to be any interaction between age and sex.
ANOVA
Use a numerical response variable, and two categorical explanatory variables. Use deviance coding (with “contr.sum”).
 Test for the presence of interaction between the two explanatory variables both graphically and statistically.
 Interpret the coefficients of the model, including the intercept (even when such coefficients might not make any sense for a variety of reasons).
additive model
I’m still using thalach as my response variable, but will use cp (chest pain) and rcg (resting ECG abnormalities, derived from restecg, see below).
Set the contrasts:
 chest pain (cp)…
contrasts(cleveland$cp) <- "contr.sum"
contrasts(cleveland$cp)##         [,1] [,2] [,3]
## none       1    0    0
## t_ang      0    1    0
## a_ang      0    0    1
## non-ang   -1   -1   -1I want to use the resting ECG data (restecg) but one category has very few observations:
tally(cleveland$restecg)## X
##   normal wave_abn   lv_hyp 
##      151        4      148so I’m going to collapse wave_abn and lv_hyp into a new category called abnormal in a new variable called rcg:
cleveland$rcg <- factor(ifelse(cleveland$restecg=="normal", "normal", "abnormal"), 
                        levels=c("normal", "abnormal"))now set the contrasts for rcg:
contrasts(cleveland$rcg) <- "contr.sum"
contrasts(cleveland$rcg)##          [,1]
## normal      1
## abnormal   -1make the model:
HR_cr <- lm(thalach ~ cp + rcg, data=cleveland)
summary(HR_cr)## 
## Call:
## lm(formula = thalach ~ cp + rcg, data = cleveland)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70.83 -12.01   1.75  15.72  46.19 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   153.61       1.52  101.07   <2e-16
## cp1           -12.79       1.98   -6.47    4e-10
## cp2             2.57       3.49    0.74   0.4626
## cp3             8.57       2.63    3.26   0.0012
## rcg1            1.01       1.24    0.82   0.4132
## 
## Residual standard error: 21.2 on 298 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.139 
## F-statistic: 13.2 on 4 and 298 DF,  p-value: 6.38e-10Anova(HR_cr)interaction model
Look for interaction graphically:
# group means
gr_means <- cleveland %>%
    group_by(cp, rcg) %>%
    summarize(mean_thalach = mean(thalach))
gr_means# interaction plot
ggplot() + 
  geom_point(data=cleveland, aes(cp, thalach, color=rcg), 
             position=position_jitterdodge(jitter.width=0.1, dodge.width=0.45), alpha=0.4) + 
  geom_point(data=gr_means, aes(cp, mean_thalach, color=rcg), shape=18, size=4) +
  geom_line(data=gr_means, aes(cp, mean_thalach, group=rcg, color=rcg), size=1.2) + 
  scale_colour_hue(l=40) + theme_minimal() + 
  labs(title="interaction between resting ECG and chest pain", x="chest pain type", 
       y="max HR (bpm)") + guides(color=guide_legend(title="resting ECG"))doesn’t look like any interaction from the plot.
build model with interaction:
HR_cr_int <- lm(thalach ~ cp * rcg, data=cleveland)
summary(HR_cr_int)## 
## Call:
## lm(formula = thalach ~ cp * rcg, data = cleveland)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70.97 -11.98   1.59  15.59  46.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.6333     1.5619   98.36  < 2e-16
## cp1         -12.7958     2.0125   -6.36  7.7e-10
## cp2           2.5096     3.5805    0.70   0.4839
## cp3           8.6749     2.6957    3.22   0.0014
## rcg1          0.9042     1.5619    0.58   0.5631
## cp1:rcg1      0.2260     2.0125    0.11   0.9107
## cp2:rcg1     -0.0471     3.5805   -0.01   0.9895
## cp3:rcg1     -0.4382     2.6957   -0.16   0.8710
## 
## Residual standard error: 21.3 on 295 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.131 
## F-statistic: 7.48 on 7 and 295 DF,  p-value: 0.0000000271Anova((HR_cr_int))doesn’t look like any interaction from the statistical model either.
interpretation
The model:
 \[\widehat{maxHR}=153.61 -12.79 \cdot cp1 + 2.57 \cdot cp2 + 8.57 \cdot cp3 + 1.01 \cdot rcg1 \]
The model predicts that a patient with no chest pain and a normal resting ECG would have a stress test max HR of \(153.61 - 12.79 + 1.01 =\) 141.83., which is very close to the group mean for that group. All other variables held constant of course.
The results of the ANOVA table for this model suggest that there is sufficient evidence to support thinking the mean thalach values for the different cp groups are not equal, while there is insufficient evidence to support a difference in thalach means in the different rcg groups.
The \(R^2\) value of this model indicates that it can explain 51.1% of the variation in the max heart rate of a patient in a stress test (thalach). That’s not the best model ever.
Finally, to look at pairwise comparisons:
TukeyHSD(HR_cr)##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = x)
## 
## $cp
##                   diff      lwr     upr   p adj
## t_ang-none    15.27597   2.9637 27.5882 0.00810
## a_ang-none    21.73944  12.7391 30.7398 0.00000
## non-ang-none  14.72642   7.2541 22.1988 0.00000
## a_ang-t_ang    6.46348  -7.3511 20.2780 0.62174
## non-ang-t_ang -0.54954 -13.4209 12.3218 0.99952
## non-ang-a_ang -7.01302 -16.7643  2.7382 0.24852
## 
## $rcg
##                    diff     lwr    upr   p adj
## abnormal-normal -1.9701 -6.7689 2.8286 0.41977each of t_ang, a_ang, and non-ang are significnly different from none, which is what the plot looks like as well.
Model and Hypothesis
For the final part of the project, go through the process of specifying a contextually meaningful model that tests a reasonable hypothesis. In addition to the steps you have done above, be sure to transform any variables that need transforming, do a thorough search for outliers and test them for influence, and run any other model diagnostics that are appropriate for your specific data. Use your model to come to a meaningful conclusion and learn something from your analysis.
linear model of max heart rate
Since the first seven variables are routinely collected for patients with possible heart trouble and an exercise stress test is a relatively common diagnostic test, I thought I would try to model the max heart rate achieved in the stress test (thalach) using all of the variables collected before the test (age + sex + cp + trestbps + chol + fbs + rcg) but my preliminary analysis (mostly above in the EDA and other models) indicates that many of the variables are crap for this, so I decided to make a model called the old guy with chest pain model using age and cp as explanatory variables.
hypotheses:
 \(H_{0}: \beta_{age} = \beta_{cptang} = \beta_{cpaang} = \beta_{cpnonang}\)
 \(H_{a}: one \beta \neq others\)
first, switch back to dummy regressors for cp:
contrasts(cleveland$cp) <- "contr.treatment"
contrasts(cleveland$cp)##         t_ang a_ang non-ang
## none        0     0       0
## t_ang       1     0       0
## a_ang       0     1       0
## non-ang     0     0       1make the model:
oldguy_cp <- lm(thalach ~ age + cp, data=cleveland)
summary(oldguy_cp)## 
## Call:
## lm(formula = thalach ~ age + cp, data = cleveland)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.85 -11.43   2.65  13.00  44.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  189.258      7.316   25.87  < 2e-16
## age           -0.872      0.128   -6.82  5.2e-11
## cpt_ang       15.404      4.438    3.47  0.00059
## cpa_ang       17.937      3.292    5.45  1.1e-07
## cpnon-ang     12.961      2.706    4.79  2.6e-06
## 
## Residual standard error: 19.8 on 298 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.254 
## F-statistic: 26.7 on 4 and 298 DF,  p-value: <2e-16ANOVA table for the model:
Anova(oldguy_cp)Graph to look for possible interaction between age and `cp:
ggplot(cleveland, aes(age,thalach, group=cp)) + geom_point(aes(color=cp)) + 
  geom_smooth(method="lm", se=F, aes(color=cp)) + 
  labs(title="stress test max heart rate by age", x="age (yrs)", y="max HR (bpm)") + 
  guides(color=guide_legend(title="chest pain")) + theme_minimal()The three versions of answering “yes” to chest pain cluster tightly together, while none is somewhat different. If I had more time, I’d make yet another variable of my own combining the different kinds of pain into a “yes” category… However, there doesn’t look like there is an interaction (and running an interaction model added nothing , data not shown).
check residuals
oldguy_cp_aug <- augment(oldguy_cp)
ggplot(oldguy_cp_aug, aes(sample=.std.resid)) + geom_qq() + theme_minimal()
ggplot(oldguy_cp_aug, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept=0) + 
  theme_minimal()the residuals are generally well behaved (normal and constant variance) - there is a bit of funny business at the ends where there are few data points and there are larger negative residuals, but nothing too alarming.
interpretation
The model:
 \[\widehat{maxHR}=189.3 - 0.872 \cdot age + 15.4 \cdot cp_{tang} + 17.93 \cdot cp_{aang} + 12.96 \cdot cp_{nonang}\] The model predicts that a patient that is zero years old would have a stress test maximum heart rate of 189.3 bpm and it would decrease by 0.872 bpm for each year they age. That’s all she wrote if there is no chest pain, but if they have t_ang chest pain, we would add 15.4 bpm to their max heart rate, if they have a_ang chest pain then we would add 17.93 bpm and if they have non_ang chest pain, we would add 12.96 bpm. All other variables held constant of course.
Each of the variables included in the model, age and chest pain (cp), were statistically significant (P <0.001) in ANOVA, and each of the \(\beta\)s was also significant (P<0.001) so we have sufficient evidence that the \(\beta\)s are not equal to zero. These variables add information to our model.
However, the \(R^2\) value is 0.264, so the model only accounts for 26.4% of the variation in stress test maximum heart rate.
bonus: logistic regression models
Since the research article from whence the data came used logistic regression to try to classify patients based on disease probability, I just wanted to fiddle with these models a bit. I split the entire 13 variable model into 2 simpler ones as well.
# rcg back to treatment contrasts  
contrasts(cleveland$rcg) <- "contr.treatment"This model has the routine, easy to gather medical data only.
routine <- glm(disease ~ age + sex + cp + trestbps + chol + fbs + rcg, 
               family = binomial(link = "logit"), data=cleveland)
summary(routine)## 
## Call:
## glm(formula = disease ~ age + sex + cp + trestbps + chol + fbs + 
##     rcg, family = binomial(link = "logit"), data = cleveland)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.300  -0.712   0.237   0.731   2.410  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  4.81940    1.44631    3.33  0.00086
## age         -0.05360    0.01851   -2.90  0.00377
## sexF         1.93639    0.36941    5.24  1.6e-07
## cpt_ang      2.51576    0.56975    4.42  1.0e-05
## cpa_ang      2.51342    0.45728    5.50  3.9e-08
## cpnon-ang    2.37788    0.36488    6.52  7.2e-11
## trestbps    -0.01731    0.00874   -1.98  0.04762
## chol        -0.00396    0.00295   -1.35  0.17862
## fbshigh     -0.02734    0.41285   -0.07  0.94721
## rcgabnormal -0.45673    0.30148   -1.51  0.12979
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 283.84  on 293  degrees of freedom
## AIC: 303.8
## 
## Number of Fisher Scoring iterations: 5This model is just the data gathered from the stress test:
stress <- glm(disease ~ thalach + exang + twommstd + slope + ca + thal, 
              family = binomial(link = "logit"), data=cleveland)
summary(stress)## 
## Call:
## glm(formula = disease ~ thalach + exang + twommstd + slope + 
##     ca + thal, family = binomial(link = "logit"), data = cleveland)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.486  -0.479   0.280   0.558   2.722  
## 
## Coefficients:
##                Estimate Std. Error z value    Pr(>|z|)
## (Intercept)    -1.62617    1.56421   -1.04      0.2985
## thalach         0.01738    0.00875    1.99      0.0470
## exangY         -1.18548    0.38078   -3.11      0.0019
## twommstdless    1.73252    0.64660    2.68      0.0074
## slopeflat      -0.84839    0.38563   -2.20      0.0278
## slopeneg        0.17360    0.80875    0.21      0.8300
## ca             -1.15844    0.22165   -5.23 0.000000173
## thalfixed      -0.64221    0.67426   -0.95      0.3409
## thalreversible -2.00087    0.36287   -5.51 0.000000035
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 221.47  on 288  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 239.5
## 
## Number of Fisher Scoring iterations: 5This is the full model described in the research article:
full <- glm(disease ~ age + sex + cp + trestbps + chol + fbs + rcg + thalach + exang + 
              twommstd + slope + ca + thal, family = binomial(link = "logit"), data=cleveland)
summary(full)## 
## Call:
## glm(formula = disease ~ age + sex + cp + trestbps + chol + fbs + 
##     rcg + thalach + exang + twommstd + slope + ca + thal, family = binomial(link = "logit"), 
##     data = cleveland)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.733  -0.341   0.129   0.505   2.878  
## 
## Coefficients:
##                Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     0.56012    2.68729    0.21   0.83489
## age             0.01379    0.02504    0.55   0.58176
## sexF            1.54697    0.53495    2.89   0.00383
## cpt_ang         1.98645    0.67641    2.94   0.00332
## cpa_ang         0.82735    0.55927    1.48   0.13905
## cpnon-ang       1.83220    0.50586    3.62   0.00029
## trestbps       -0.02299    0.01137   -2.02   0.04311
## chol           -0.00569    0.00403   -1.41   0.15818
## fbshigh         0.67319    0.60650    1.11   0.26701
## rcgabnormal    -0.40613    0.38589   -1.05   0.29259
## thalach         0.01925    0.01104    1.74   0.08123
## exangY         -0.75711    0.44091   -1.72   0.08595
## twommstdless    1.61199    0.72150    2.23   0.02547
## slopeflat      -1.23950    0.45672   -2.71   0.00665
## slopeneg       -0.29834    0.95146   -0.31   0.75386
## ca             -1.33425    0.27758   -4.81 0.0000015
## thalfixed       0.23863    0.79473    0.30   0.76397
## thalreversible -1.43079    0.43346   -3.30   0.00096
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 188.57  on 279  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 224.6
## 
## Number of Fisher Scoring iterations: 6Anova(full)I would love to have more time to mess with these logistic regression models and do some calculations or look at the other test data that came from the paper (I think the data from other sites is in the repo too), but sadly, I’m outta time.
