Chapter3 Survival Analysis
3.1 Notation
In survival analysis, there is an important concept called “censoring”. Censoring is a kind of missing data problem where an event is not completely observed, for reasons like termination of study or loss of communication with the participant. And censoring will cause the observed time data inaccurate. For instance, for left-censored data, we don’t know the exact value of the data point, but only an upper bound for it; And for right-censored data, we only knows a lower bound for it.
In our dataset rotterdam, for the time variable dtime and rtime that we are curious about, death and recur are the corresponding variable that records if censoring occured. Since if someone is dead in this study, this means that the survival time we observed is exact; But for someone is still alive in the end of the study, the survival time we observed is just a lower bound. And similarly, if someone suffers from recurrent breast cancer during the study, our rtime should be accurate; But for someone did not encounter recurrence of breast cancer in the study, we are not sure if she will encounter in the future, so the data point is also a lower point.
For all the survival analysis below, we will include death and recur to examine the relationship between dtime and rtime vs. the diagnostic information and treatment received.
3.2 Kaplan-Miere estimator of the entire dataset
Generally speaking, Kaplan-Miere curve is a non-parametric estimator of the survival function which takes censoring into account. Its y-axis measures \(P(X \leq k)\) for various values of \(k\).
3.2.0.1 Death Time
KM_d <- survfit(Surv(dtime, death) ~ 1, data = rotterdam)
plot(KM_d, conf.int = TRUE, xlab="Days", ylab="Survival")
Kaplan-Meier curve also helps us finding the median and mean estimate. For median, we can directly find it from the survfit object KM_d. For mean, we need to calculate the area under the Kaplan-Meier curve.
## Call: survfit(formula = Surv(dtime, death) ~ 1, data = rotterdam)
##
## n events median 0.95LCL 0.95UCL
## 2982 1272 4033 3888 4309
# mean
# AUCKM stands for "Area Under Curve Kaplan Meier":
AUCKM = function(survobj,duration)
{
base=c(0,summary(survobj)$time,max(duration))
heights=c(1,summary(survobj)$surv)
new=c()
for(i in 1:length(heights)) { new=c(new,(base[i+1]-base[i])*heights[i]) }
c(sum(new))
}
AUCKM(KM_d,rotterdam$dtime)## [1] 4099.795
The overall mean survival time till death for breast cancer is 4099.795 days, and the overall median survival time till death for breast cancer is 4033 days. Both of them are approximately 7 years.
We can find that the mean and median we calculated above is much longer than the mean and median of the variable dtime itself, because a lot of data point are right-censored:
## [1] 2605.34
## [1] 2638.5
3.2.0.2 Recurrence Time
KM_r <- survfit(Surv(rtime, recur) ~ 1, data = rotterdam)
plot(KM_r, conf.int = TRUE, xlab="Days", ylab="Survival")
We can find the mean and median in a similar way:
## Call: survfit(formula = Surv(rtime, recur) ~ 1, data = rotterdam)
##
## n events median 0.95LCL 0.95UCL
## 2982 1518 2983 2719 3193
## [1] 3588.535
The overall mean survival time till recurrence for breast cancer is 3588.535 days, which is approximately 10 years. The overall median survival time till recurrence for breast cancer is 2983 days, which is approximately 8 years. Notice here our mean is much greater than the median, and this indicates that our rtime should be right-skewed.

3.2.0.3 Survival time after recurrence
KM_dr <- survfit(Surv(drecurtime, death) ~ 1, data = rotterdam_recur)
plot(KM_dr, conf.int = TRUE, xlab="Days", ylab="Survival")
## Call: survfit(formula = Surv(drecurtime, death) ~ 1, data = rotterdam_recur)
##
## n events median 0.95LCL 0.95UCL
## 1518 1077 894 815 949
## [1] 1407.135
The overall mean survival time after rucurrence till death for breast cancer is 1407.135 days, which is approximately a little less than 4 years. The overall median survival time after rucurrence till death for breast cancer is 894 days, which is approximately 2 years and a half. Here, the mean is also greater than the median, also indicating right-skewness:

Since a Kaplan-Miere estimator is unbiased, we could view the mean and median as being very close to the true value of survival time.
3.3 Kaplan-Miere estimator on different variables in rotterdam
There are 16 variables in the rotterdam dataset. Of course we could have fit each variable with a KM estimator, but it would be meaningless to do them all. We will stick to the Diagnostics and Treatment we mentioned in Chapter 2 and fit size, Nodes_level(we are not using nodes because a Kaplan-Miere estimator does not work well with quantitative variables), grade and Treatment each with KM estimators with respect to dtime, rtime, and drecurtime to grasp the survival time within each categories of the variables.
3.3.1 size vs. Survival Times
size vs. dtime
KM_None_Death <- survfit(Surv(dtime, death) ~ size, data = rotterdam)
plot(KM_None_Death, conf.type = "plain", col = c("black","red","blue"), xlab="Days", ylab="Survival")
legend(6000, 1, legend=c("<=20", "20-50", ">50"),
col=c("black", "red", "blue"), lty=1, cex=0.8,
title="Tumor Size", text.font=6)
size vs. rtime
KM_None_Recur <- survfit(Surv(rtime, recur) ~ size, data = rotterdam)
plot(KM_None_Recur, conf.type = "plain", col = c("black","red","blue"), xlab="Days", ylab="Survival")
legend(6000, 1, legend=c("<=20", "20-50", ">50"),
col=c("black", "red", "blue"), lty=1, cex=0.8,
title="Tumor Size", text.font=6)
In general, patients with smaller tumor at diagnosis enjoys longer survival for both death and recurrence.
KM_None_drecur <- survfit(Surv(drecurtime, death) ~ size, data = rotterdam_recur)
plot(KM_None_drecur, conf.type = "plain", col = c("black","red","blue"), xlab="Days", ylab="Survival")
legend(4500, 1, legend=c("<=20", "20-50", ">50"),
col=c("black", "red", "blue"), lty=1, cex=0.8,
title="Tumor Size", text.font=6)
The trend is still the same as patients with smaller tumor size enjoy longer survival of death after recurrence, but the survival time now decreases much faster for all groups.
3.3.2 Nodes_level vs. Survival Times
Nodes_level vs. dtime
KM_None_Death <- survfit(Surv(dtime, death) ~ Nodes_level, data = rotterdam)
plot(KM_None_Death, conf.type = "plain", col = c("black","red","blue","orange"), xlab="Days", ylab="Survival")
legend(6000, 1, legend=c("N0", "N1","N2", "N3"),
col=c("black", "red", "blue", "orange"), lty=1, cex=0.8,
title="Nodes_level", text.font=6)
Nodes_level vs. rtime
KM_None_Recur <- survfit(Surv(rtime, recur) ~ Nodes_level, data = rotterdam)
plot(KM_None_Recur, conf.type = "plain", col = c("black","red","blue","orange"), xlab="Days", ylab="Survival")
legend(6000, 1, legend=c("N0", "N1","N2", "N3"),
col=c("black", "red", "blue", "orange"), lty=1, cex=0.8,
title="Nodes_level", text.font=6)
In general, patients with less nodes tested positive will enjoy longer survival for both death and recurrence.
Nodes_level vs. drecurtime
KM_None_drecur <- survfit(Surv(drecurtime, death) ~ Nodes_level, data = rotterdam_recur)
plot(KM_None_drecur, conf.type = "plain", col = c("black","red","blue","orange"), xlab="Days", ylab="Survival")
legend(4500, 1, legend=c("N0", "N1","N2", "N3"),
col=c("black", "red", "blue", "orange"), lty=1, cex=0.8,
title="Nodes_level", text.font=6)
Similarly, the trend is still the same as patients with fewer nodes tested positive enjoy longer survival of death after recurrence, but the survival time now decreases much faster for all groups, and the difference is small in groups N1, N2 and N3.
3.3.3 grade vs. Survival Times
grade vs. dtime
KM_None_Death <- survfit(Surv(dtime, death) ~ grade, data = rotterdam)
plot(KM_None_Death, conf.type = "plain", col = c("black","red"), xlab="Days", ylab="Survival")
legend(6000, 1, legend=c("2", "3"),
col=c("black", "red"), lty=1, cex=0.8,
title="grade", text.font=6)
grade vs. rtime
KM_None_Recur <- survfit(Surv(rtime, recur) ~ grade, data = rotterdam)
plot(KM_None_Recur, conf.type = "plain", col = c("black","red"), xlab="Days", ylab="Survival")
legend(6000, 1, legend=c("2", "3"),
col=c("black", "red"), lty=1, cex=0.8,
title="Nodes_level", text.font=6)
grade vs. drecurtime
KM_None_drecur <- survfit(Surv(drecurtime, death) ~ grade, data = rotterdam_recur)
plot(KM_None_drecur, conf.type = "plain", col = c("black","red"), xlab="Days", ylab="Survival")
legend(4500, 1, legend=c("2", "3"),
col=c("black", "red"), lty=1, cex=0.8,
title="Nodes_level", text.font=6)
Similarly, we find that patients with grade 2 breast cancer show clear distinction in both the survival of death and recurrence. However, such distinct becomes less obvious for the survival of death after recurrence has occurred.
3.3.4 Treatment vs. Survival Times
Treatment vs. dtime
KM_Treatment_Death <- survfit(Surv(dtime, death) ~ Treatment, data = rotterdam)
plot(KM_Treatment_Death, conf.int = FALSE, col = c("black", "red", "blue", "orange"), xlab="Days", ylab="Survival")
legend(1, 0.4, legend=c("Both", "Chemo","Hormon", "NaN/Other Treatment"),
col=c("black", "red", "blue", "orange"), lty=1, cex=0.8,
title="Treatment Group", text.font=6)
Treatment vs. rtime
KM_Treatment_Recur <- survfit(Surv(rtime, recur) ~ Treatment, data = rotterdam)
plot(KM_Treatment_Recur, conf.int = FALSE, col = c("black", "red", "blue", "orange"), xlab="Days", ylab="Survival")
legend(1, 0.4, legend=c("Both", "Chemo","Hormon", "NaN/Other Treatment"),
col=c("black", "red", "blue", "orange"), lty=1, cex=0.8,
title="Treatment Group", text.font=6)
As we have discussed in Chapter 2, we know that generally chemotherapy is used on patients with age lower than 50 years old and hormontherapy is used on patients with age higher than 50 years old. Based on the difference of treatment, we could see that chemotherapy has a better effect than hormontherapy with respect to death time and a smaller yet still better effect regarding the recurrence time.
Treatment vs. drecurtime
KM_Treatment_drecur <- survfit(Surv(drecurtime, death) ~ Treatment, data = rotterdam_recur)
plot(KM_Treatment_drecur, conf.int = FALSE, col = c("black", "red", "blue", "orange"), xlab="Days", ylab="Survival")
legend(3700, 1, legend=c("Both", "Chemo","Hormon", "NaN/Other Treatment"),
col=c("black", "red", "blue", "orange"), lty=1, cex=0.8,
title="Treatment Group", text.font=6)
However, sadly enough, from the above plot we could see that no matter what treatment a patient use, it does not make a difference for the death survival time after cancer cells have recurred. This matches our conclusion from the visualization in Chapter 2.
3.4 Parametric Models
Another point that we are going to explore is if we would be able to fit our data to a parametric model. This matters since if we could fit any parametric model, then we should have a model good enough to generate predictions of breast cancer patients’ survival and would have nice and interpretable coefficients to work with.
To do so, we will begin by checking if any of Exponential, Weibull, or Log-normal distribution would be adequate parametric assumption to cast on our data. We will verify the adequacy by checking the Cox-Snell residual plot. We will be fitting models using variables: Treatment, size, nodes, grade, age (we have shown in Chapter 2 that age is a confounder for categories in Treatment).
3.4.1 User-defined Cox-Snell function
If we fit Exponential model on the survival time, we have the function \(S(t) = e^{-\lambda t}\), and we can transform the survival function into an expression for the “complementary log-log”:
\[\begin{align*} S(t) &= e^{-\lambda t} \\ \log[S(t)] &= -\lambda t \\ -\log[S(t)] &= \lambda t \\ \log\big(-\log[S(t)] \big) &= \log\lambda + \log t \end{align*}\]
Then, if we let \(y=\log\big(-\log[S(t)] \big)\), \(m=1\), \(b=\log(\lambda)\), and \(x=\log(t)\), then we can view the equation in the form \(y=mx+b = 1*x+b\).
This means that if Exponential model is adequate, its complementary log-log plot should be a linear line with slope=1 and intercept=0.
And for any other parametric model, we can define a Cox-Snell residual as \(CS_i = -log(\hat{S}_p(t_i|x_i))\) for \(i=1,\dots, n\), and \(CS_i\)s should behave like data drawn from an Exponential distribution (\(\lambda=1\)).
Thus, we can check the adequacy of any model by checking the Exponential-ness of its Cox-Snell residuals. We can graph \(CS_i\) vs \(log(CS_i)\), and the graph should also look linear with slope =1 and intercept=0, if the parametric model is adequate for the data.
# The Cox-Snell function takes as inputs
# 1. A vector of Cox-Snell residuals created by the user based on the model being evaluated,
# 2. A status vector
# 3. Optional x- and y- limits for the resulting plot
CoxSnell = function(cs,status,xlim=NULL,ylim=NULL)
{
kmcs = survfit(Surv(jitter(cs,amount=(max(cs)-min(cs))/1000),status) ~ 1)$surv
plot(log(-log(kmcs)) ~ sort(log(cs)) ,
xlab="log(Cox-Snell)", ylab="log(-log(S(Cox-Snell)))", xlim=xlim, ylim=ylim )
abline(0,1,col='red')
}3.4.2 Exponential and Weibull
We first tried to examine the adquacy of Exponential and Weibull model, but none of them are adquate for our dataset:
dtime, death ~ Treatment + size + nodes + grade + ageunder Exponential:
Dexp <- survreg(Surv(dtime, death) ~ Treatment + size + nodes + grade + age, dist='exponential', data=rotterdam)
Dexp## Call:
## survreg(formula = Surv(dtime, death) ~ Treatment + size + nodes +
## grade + age, data = rotterdam, dist = "exponential")
##
## Coefficients:
## (Intercept) TreatmentChemo
## 10.72043480 -0.51333347
## TreatmentHormon TreatmentNaN/Other Treatment
## -0.41382108 -0.45674929
## size20-50 size>50
## -0.44229181 -0.76874022
## nodes grade3
## -0.06922296 -0.32925464
## age
## -0.01424763
##
## Scale fixed at 1
##
## Loglik(model)= -12125.2 Loglik(intercept only)= -12360.4
## Chisq= 470.37 on 8 degrees of freedom, p= <2e-16
## n= 2982

rtime, recur ~ Treatment + size + nodes + grade + ageunder Exponential:
Rexp <- survreg(Surv(rtime, recur) ~ Treatment + size + nodes + grade + age, dist='exponential', data=rotterdam)
Rexp## Call:
## survreg(formula = Surv(rtime, recur) ~ Treatment + size + nodes +
## grade + age, data = rotterdam, dist = "exponential")
##
## Coefficients:
## (Intercept) TreatmentChemo
## 9.081159515 -0.373811470
## TreatmentHormon TreatmentNaN/Other Treatment
## -0.507332960 -0.485387357
## size20-50 size>50
## -0.372059978 -0.654240460
## nodes grade3
## -0.080821343 -0.393920415
## age
## 0.008095379
##
## Scale fixed at 1
##
## Loglik(model)= -13897.4 Loglik(intercept only)= -14153.7
## Chisq= 512.6 on 8 degrees of freedom, p= <2e-16
## n= 2982

dtime, death ~ Treatment + size + nodes + grade + ageunder Weibull:
Dweibull <- survreg(Surv(dtime, death) ~ Treatment + size + nodes + grade + age, dist='weibull', data=rotterdam)
Dweibull## Call:
## survreg(formula = Surv(dtime, death) ~ Treatment + size + nodes +
## grade + age, data = rotterdam, dist = "weibull")
##
## Coefficients:
## (Intercept) TreatmentChemo
## 10.10735484 -0.39504003
## TreatmentHormon TreatmentNaN/Other Treatment
## -0.37941684 -0.35423500
## size20-50 size>50
## -0.33487166 -0.61672126
## nodes grade3
## -0.05500626 -0.26093659
## age
## -0.01130402
##
## Scale= 0.7370449
##
## Loglik(model)= -12055.3 Loglik(intercept only)= -12322.7
## Chisq= 534.68 on 8 degrees of freedom, p= <2e-16
## n= 2982

rtime, recur ~ Treatment + size + nodes + grade + ageunder Weibull:
Rweibull <- survreg(Surv(rtime, recur) ~ Treatment + size + nodes + grade + age, dist='weibull', data=rotterdam)
Rweibull## Call:
## survreg(formula = Surv(rtime, recur) ~ Treatment + size + nodes +
## grade + age, data = rotterdam, dist = "weibull")
##
## Coefficients:
## (Intercept) TreatmentChemo
## 9.092880859 -0.376425745
## TreatmentHormon TreatmentNaN/Other Treatment
## -0.508855839 -0.489527042
## size20-50 size>50
## -0.376473401 -0.659532060
## nodes grade3
## -0.081625087 -0.398251643
## age
## 0.008216064
##
## Scale= 1.014145
##
## Loglik(model)= -13897.2 Loglik(intercept only)= -14145.7
## Chisq= 497.1 on 8 degrees of freedom, p= <2e-16
## n= 2982

3.4.3 Log-normal models
Finally, we will examine the adequacy of Log-normal model.
Dlnorm <- survreg(Surv(dtime, death) ~ Treatment + size + nodes + grade + age , dist='lognormal', data=rotterdam)
summary(Dlnorm)##
## Call:
## survreg(formula = Surv(dtime, death) ~ Treatment + size + nodes +
## grade + age, data = rotterdam, dist = "lognormal")
## Value Std. Error z p
## (Intercept) 9.91513 0.27975 35.44 < 2e-16
## TreatmentChemo -0.42503 0.26292 -1.62 0.106
## TreatmentHormon -0.31448 0.26896 -1.17 0.242
## TreatmentNaN/Other Treatment -0.41617 0.26011 -1.60 0.110
## size20-50 -0.34787 0.05108 -6.81 9.8e-12
## size>50 -0.60916 0.08067 -7.55 4.3e-14
## nodes -0.07773 0.00544 -14.29 < 2e-16
## grade3 -0.30866 0.05490 -5.62 1.9e-08
## age -0.01005 0.00197 -5.10 3.3e-07
## Log(scale) 0.06670 0.02162 3.08 0.002
##
## Scale= 1.07
##
## Log Normal distribution
## Loglik(model)= -12018.2 Loglik(intercept only)= -12286.5
## Chisq= 536.66 on 8 degrees of freedom, p= 9.5e-111
## Number of Newton-Raphson Iterations: 4
## n= 2982
CS_LnormD <- -log(1 - plnorm(rotterdam$dtime, 9.91512622-0.42502769*(rotterdam$Treatment=="Chemo")
-0.31448004*(rotterdam$Treatment=="Hormon")
-0.41616843*(rotterdam$Treatment=="NaN/Other Treatment")
-0.34787441*(rotterdam$size=="20-50")
-0.60916461*(rotterdam$size==">50")
-0.07772719*rotterdam$nodes
-0.30866307*(rotterdam$grade=="3")
-0.01004650*rotterdam$age,
1.06897))
# Make appropriate graph using CoxSnell function
CoxSnell(CS_LnormD, rotterdam$death)
Rlnorm <- survreg(Surv(rtime, recur) ~ Treatment + size + nodes + grade + age, dist='lognormal', data=rotterdam)
summary(Rlnorm)##
## Call:
## survreg(formula = Surv(rtime, recur) ~ Treatment + size + nodes +
## grade + age, data = rotterdam, dist = "lognormal")
## Value Std. Error z p
## (Intercept) 8.78912 0.30877 28.46 < 2e-16
## TreatmentChemo -0.36139 0.29098 -1.24 0.21425
## TreatmentHormon -0.41809 0.29943 -1.40 0.16263
## TreatmentNaN/Other Treatment -0.58078 0.28741 -2.02 0.04331
## size20-50 -0.41866 0.05992 -6.99 2.8e-12
## size>50 -0.67179 0.09845 -6.82 8.9e-12
## nodes -0.10590 0.00662 -16.00 < 2e-16
## grade3 -0.43987 0.06470 -6.80 1.1e-11
## age 0.00892 0.00238 3.75 0.00018
## Log(scale) 0.28360 0.01980 14.33 < 2e-16
##
## Scale= 1.33
##
## Log Normal distribution
## Loglik(model)= -13780.4 Loglik(intercept only)= -14045.8
## Chisq= 530.65 on 8 degrees of freedom, p= 1.9e-109
## Number of Newton-Raphson Iterations: 4
## n= 2982
CS_LnormR <- -log(1 - plnorm(rotterdam$rtime, 8.78912009-0.36138607*(rotterdam$Treatment=="Chemo")
-0.41809135*(rotterdam$Treatment=="Hormon")
-0.58077716*(rotterdam$Treatment=="NaN/Other Treatment")
-0.41865655*(rotterdam$size=="20-50")
-0.67178824*(rotterdam$size==">50")
-0.10590430*rotterdam$nodes
-0.43987388*(rotterdam$grade=="3")
+0.00892277*rotterdam$age,
1.327904))
# Make appropriate graph using CoxSnell function
CoxSnell(CS_LnormR, rotterdam$recur)
We could see that the Log-normal parametric model is an adequate model for both the dtime and rtime vs. Treatment + size + nodes + grade + age.
3.4.3.1 Interpretation of the coefficients
Lognormal is an example of a class of models called Accelerated Failure Time (AFT) models, in which every 1-unit increase in \(x_i\) is associated with a scaling of time by \(e^{c_i}\), where \(c\) is the coefficient of \(x\) in the regression model. The quantity \(e^{c_i}\) is referred to as a time ratio (TR), and thus each \(c_i\) represents a log(TR).
Thus, the results of our lognormal model can be interpreted in the following way:
3.4.3.1.1 For dtime:
## [1] 0.6537517
## [1] 0.7301685
## [1] 0.6595692
## [1] 0.7061876
## [1] 0.543805
## [1] 0.9252168
## [1] 0.7344282
## [1] 0.9900038
Having chemotherapy, hormon-therapy and NaN/Other Treatment is associated with a scaling of mean survival time by 0.6537517, 0.7301685 and 0.6595692 respectively, compared to having both chemo and hormon therapies. However, the relationships are all insignificant.
Compared to tumor size <=20mm, having tumor size 20-50mm and >50mm will make the expected survival time scaled by 0.7061876 and 0.543805.
For node and age, each extra positive lymph nodes and every 1-year increase in age is associated with a scaling of mean survival time by 0.9252168 and 0.9900038.
And compared to grade II, grade III of cancer cell makes the expected survival time multiplied by 0.7344282.
3.4.3.1.2 For rtime:
## [1] 0.69671
## [1] 0.6583021
## [1] 0.5594634
## [1] 0.6579301
## [1] 0.5107943
## [1] 0.8995107
## [1] 0.6441177
## [1] 1.008963
Having chemotherapy, hormon-therapy and NaN/Other Treatment is associated with a scaling of mean time till recurrence by 0.69671, 0.6583021 and 0.5594634 respectively, compared to having both chemo and hormon therapies. And the relations here are also insignificant.
Compared to tumor size <=20mm, having tumor size 20-50mm and >50mm will make the expected survival time scaled by 0.6579301 and 0.5107943.
For node and age, each extra positive lymph nodes and every 1-year increase in age is associated with a scaling of mean survival time by 0.8995107 and 1.008963.
And compared to grade II, grade III of cancer cell makes the expected survival time multiplied by 0.6441177.
3.4.3.2 Why log-normal
As to why the Log-normal model would be suitable for the data, we are not sure. One thing to consider is that the non-monotonicity of the hazard function, which is one main characteristic of log-normal model compared to Exponential and Weibull.
The plot below graphs the general hazard functions of our dtime and rtime in a non-parametric way, and we can find that both of them shows non-monotonicity. Though we cannot graph the exact hazard functions conditioning on the diagnostic and treatment information, the non-parametric hazard can still partially explain the adquacy of log-normal.


3.4.3.3 Positive coefficient of age
In the rtime ~ Treatment + size + nodes + age models, it is surprised to see that the coefficient of age is positive, though the value is small. One possible explanation could be the idea of “competing events” and “competing risk”.
In this scenario, having recurred breast cancer and being dead could be somewhat “competing events”. Though they are not completely “cannot happen on one person at the same time”, it is still reasonable to think that for older patients, it is more likely to die from breast cancer or other complications than being cancer-free for years and then having breast cancer recurred; whereas for younger people, the risk of having recurrent breast cancer could be higher than being dead from the first breast cancer.
rotterdam_new <- rotterdam %>%
mutate(state = ifelse(death == 1 & recur == 0, "death",
ifelse(death == 0 & recur == 1, "recur",
ifelse(death == 1 & recur == 1, "both", "neither"))))
ggplot(rotterdam_new, aes(x=age, color=state, fill=state)) +
geom_density(alpha=0.3)
The plot above also helps verifying the idea. We can find that the group who has been dead during the study but has never had breast cancer recurred (the green one) tends to be older than others.
The positive coefficient of age in the log-normal model may seem indicating a protective effect of being old against having breast cancer recurred, but this should not be the true case. It is very likely that this is caused by the competing risk between being dead and having breast cancer recurred for people in different age group. The elderly patients are less likely to suffer from recurrent breast cancer because they are more likely to die from breast cancer or other complications during the treatment after the first diagnosis.
3.5 Cox-PH model:
Another very important part of a survival analysis is looking for the hazard ratio by building and interpreting the coxph model.
Cox’s PH model assumes that the hazard function for any subject can be written as: \(h(t) = h_0(t)e^{b_1 x_1 + b_2 x_2 + \dots + b_k x_k}\), where \(h_0(t)\) is called a baseline hazard function, and the \(x_i\)’s are covariates. The Cox PH model assumes that, for any 2 values of a covariate, the hazard ratio (HR) is constant over time.
With this model, once we have proof of the validation of the PH assumption of the model, we could have nice coefficients to interpret as the logarithm of HR(hazard ratio) among different groups. We will be following the way that we built our parametric models, by using Survival times vs. age + size + nodes + grade + Treatment.
3.5.1 Cox-PH model for dtime
m_death_withage = coxph(Surv(dtime, death) ~ Treatment + size + nodes + grade + age, data=rotterdam)
m_death_withage## Call:
## coxph(formula = Surv(dtime, death) ~ Treatment + size + nodes +
## grade + age, data = rotterdam)
##
## coef exp(coef) se(coef) z p
## TreatmentChemo 0.544967 1.724551 0.360119 1.513 0.130
## TreatmentHormon 0.488511 1.629887 0.366267 1.334 0.182
## TreatmentNaN/Other Treatment 0.491504 1.634773 0.356997 1.377 0.169
## size20-50 0.450933 1.569776 0.065260 6.910 4.85e-12
## size>50 0.818215 2.266451 0.091189 8.973 < 2e-16
## nodes 0.073329 1.076085 0.004865 15.072 < 2e-16
## grade3 0.350976 1.420453 0.070141 5.004 5.62e-07
## age 0.015067 1.015182 0.002552 5.903 3.56e-09
##
## Likelihood ratio test=513.7 on 8 df, p=< 2.2e-16
## n= 2982, number of events= 1272
First we should check if the PH assumption holds. We will be using a formal test.
## chisq df p
## Treatment 4.32 3 0.22855
## size 4.71 2 0.09487
## nodes 3.87 1 0.04915
## grade 2.61 1 0.10637
## age 14.48 1 0.00014
## GLOBAL 26.91 8 0.00073
There are two variables, age and nodes, that have have p-value smaller than 0.05, which indicates the violation of the PH assumption.
One way we came up to solve this problem is to put stratification on age and nodes by putting a strata() on them when fitting the model. What this does is to recognize the correlation between age and dtime and nodes and dtime, but not actually including them in our model results.
m_death_strataage = coxph(Surv(dtime, death) ~ Treatment + size + strata(nodes) + grade + strata(age), data=rotterdam)
m_death_strataage## Call:
## coxph(formula = Surv(dtime, death) ~ Treatment + size + strata(nodes) +
## grade + strata(age), data = rotterdam)
##
## coef exp(coef) se(coef) z p
## TreatmentChemo 0.45621 1.57808 0.47456 0.961 0.336
## TreatmentHormon 0.44920 1.56706 0.49904 0.900 0.368
## TreatmentNaN/Other Treatment 0.76097 2.14035 0.48430 1.571 0.116
## size20-50 0.43163 1.53977 0.08021 5.381 7.40e-08
## size>50 0.63412 1.88536 0.14221 4.459 8.23e-06
## grade3 0.38089 1.46358 0.09141 4.167 3.09e-05
##
## Likelihood ratio test=67.3 on 6 df, p=1.459e-12
## n= 2982, number of events= 1272
## chisq df p
## Treatment 3.198 3 0.36
## size 1.819 2 0.40
## grade 0.635 1 0.43
## GLOBAL 5.791 6 0.45
Now we can see that all the variables’ p-value are greater than 0.05 which indicates the PH assumption holds in our model.
For treatments, we can see that Chemotherapy, Hormontherapy and Other Treatment all have higher risks than patients receiving both therapies, with hazard ratio 1.57808, 1.56706 and 2.14035 respectively. However, note that all three p-values are greater than 0.05, which means that we do not have significant evidence for such relationships.
And for size, it is clear that patients with larger tumor size enjoy higher risks, and as the size grows, the risk is also increasing, with hazard ratio 1.53977 and 1.88536 for 20-50mm vs. <=20mm and >50mm vs. <=20mm. Such relationship is significant as we can see from the p-values since they are all less than 0.05.
For grade, we could see that the risk is going to be multiplied by 1.46358, if it goes from grade II to grade III, and the relation is staitstically significant.
3.5.2 Cox-PH model for rtime
m_recur = coxph(Surv(rtime, recur) ~ Treatment + strata(size) + strata(nodes) + strata(grade) + age, data=rotterdam)
m_recur## Call:
## coxph(formula = Surv(rtime, recur) ~ Treatment + strata(size) +
## strata(nodes) + strata(grade) + age, data = rotterdam)
##
## coef exp(coef) se(coef) z p
## TreatmentChemo 0.280433 1.323703 0.290152 0.967 0.33379
## TreatmentHormon 0.396060 1.485958 0.300374 1.319 0.18732
## TreatmentNaN/Other Treatment 0.817338 2.264464 0.291876 2.800 0.00511
## age -0.015377 0.984741 0.002553 -6.023 1.71e-09
##
## Likelihood ratio test=63.27 on 4 df, p=5.96e-13
## n= 2982, number of events= 1518
For rtime, we found that all the diagnostic information size, nodes and grade obeys the PH assumption, and we can only investigate Treatment and age.
We can find that patients taking both chemo and hormon therapies still have lowest hazard of breast cancer recurrence, and having chemo only, hormon only or NaN/Other Treatment will multiply the hazard by 1.323703, 1.485958 and 2.264464 respectively. However, the statistical significance is not that strong.
And for age, it’s still surprised to see that its coefficient is negative, indicating that every 1-year increase in age will make the hazard of breast cancer recurrence by 0.984741. The reason behind this should be the same as the idea of competing risk we mentioned above.
3.6 Summary
Based on all the survival analysis above, we found that different treatment ways actually does not differ significantly in extending patients’ survival time or time till recurrence, and the main factors that affect patients’ life are the diagnostic information: tumor size, number of positive lymph nodes and tumor grade.
This reminds us the importance of taking screening every three years. Current medical level still fails to cure breast cancer, but with early diagnosis, patients’ lives can be extended by a lot.