Chapter3 Survival Analysis

3.1 Notation

In survival analysis, there is an important concept called “censoring”. Censoring is a kind of missing data problem where an event is not completely observed, for reasons like termination of study or loss of communication with the participant. And censoring will cause the observed time data inaccurate. For instance, for left-censored data, we don’t know the exact value of the data point, but only an upper bound for it; And for right-censored data, we only knows a lower bound for it.

In our dataset rotterdam, for the time variable dtime and rtime that we are curious about, death and recur are the corresponding variable that records if censoring occured. Since if someone is dead in this study, this means that the survival time we observed is exact; But for someone is still alive in the end of the study, the survival time we observed is just a lower bound. And similarly, if someone suffers from recurrent breast cancer during the study, our rtime should be accurate; But for someone did not encounter recurrence of breast cancer in the study, we are not sure if she will encounter in the future, so the data point is also a lower point.

For all the survival analysis below, we will include death and recur to examine the relationship between dtime and rtime vs. the diagnostic information and treatment received.

3.2 Kaplan-Miere estimator of the entire dataset

Generally speaking, Kaplan-Miere curve is a non-parametric estimator of the survival function which takes censoring into account. Its y-axis measures \(P(X \leq k)\) for various values of \(k\).

3.2.0.1 Death Time

Kaplan-Meier curve also helps us finding the median and mean estimate. For median, we can directly find it from the survfit object KM_d. For mean, we need to calculate the area under the Kaplan-Meier curve.

## Call: survfit(formula = Surv(dtime, death) ~ 1, data = rotterdam)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    2982    1272    4033    3888    4309
## [1] 4099.795

The overall mean survival time till death for breast cancer is 4099.795 days, and the overall median survival time till death for breast cancer is 4033 days. Both of them are approximately 7 years.

We can find that the mean and median we calculated above is much longer than the mean and median of the variable dtime itself, because a lot of data point are right-censored:

## [1] 2605.34
## [1] 2638.5

3.2.0.2 Recurrence Time

We can find the mean and median in a similar way:

## Call: survfit(formula = Surv(rtime, recur) ~ 1, data = rotterdam)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    2982    1518    2983    2719    3193
## [1] 3588.535

The overall mean survival time till recurrence for breast cancer is 3588.535 days, which is approximately 10 years. The overall median survival time till recurrence for breast cancer is 2983 days, which is approximately 8 years. Notice here our mean is much greater than the median, and this indicates that our rtime should be right-skewed.

3.2.0.3 Survival time after recurrence

## Call: survfit(formula = Surv(drecurtime, death) ~ 1, data = rotterdam_recur)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    1518    1077     894     815     949
## [1] 1407.135

The overall mean survival time after rucurrence till death for breast cancer is 1407.135 days, which is approximately a little less than 4 years. The overall median survival time after rucurrence till death for breast cancer is 894 days, which is approximately 2 years and a half. Here, the mean is also greater than the median, also indicating right-skewness:

Since a Kaplan-Miere estimator is unbiased, we could view the mean and median as being very close to the true value of survival time.

3.3 Kaplan-Miere estimator on different variables in rotterdam

There are 16 variables in the rotterdam dataset. Of course we could have fit each variable with a KM estimator, but it would be meaningless to do them all. We will stick to the Diagnostics and Treatment we mentioned in Chapter 2 and fit size, Nodes_level(we are not using nodes because a Kaplan-Miere estimator does not work well with quantitative variables), grade and Treatment each with KM estimators with respect to dtime, rtime, and drecurtime to grasp the survival time within each categories of the variables.

3.3.4 Treatment vs. Survival Times

Treatment vs. dtime

Treatment vs. rtime

As we have discussed in Chapter 2, we know that generally chemotherapy is used on patients with age lower than 50 years old and hormontherapy is used on patients with age higher than 50 years old. Based on the difference of treatment, we could see that chemotherapy has a better effect than hormontherapy with respect to death time and a smaller yet still better effect regarding the recurrence time.

Treatment vs. drecurtime

However, sadly enough, from the above plot we could see that no matter what treatment a patient use, it does not make a difference for the death survival time after cancer cells have recurred. This matches our conclusion from the visualization in Chapter 2.

3.4 Parametric Models

Another point that we are going to explore is if we would be able to fit our data to a parametric model. This matters since if we could fit any parametric model, then we should have a model good enough to generate predictions of breast cancer patients’ survival and would have nice and interpretable coefficients to work with.

To do so, we will begin by checking if any of Exponential, Weibull, or Log-normal distribution would be adequate parametric assumption to cast on our data. We will verify the adequacy by checking the Cox-Snell residual plot. We will be fitting models using variables: Treatment, size, nodes, grade, age (we have shown in Chapter 2 that age is a confounder for categories in Treatment).

3.4.1 User-defined Cox-Snell function

If we fit Exponential model on the survival time, we have the function \(S(t) = e^{-\lambda t}\), and we can transform the survival function into an expression for the “complementary log-log”:

\[\begin{align*} S(t) &= e^{-\lambda t} \\ \log[S(t)] &= -\lambda t \\ -\log[S(t)] &= \lambda t \\ \log\big(-\log[S(t)] \big) &= \log\lambda + \log t \end{align*}\]

Then, if we let \(y=\log\big(-\log[S(t)] \big)\), \(m=1\), \(b=\log(\lambda)\), and \(x=\log(t)\), then we can view the equation in the form \(y=mx+b = 1*x+b\).

This means that if Exponential model is adequate, its complementary log-log plot should be a linear line with slope=1 and intercept=0.

And for any other parametric model, we can define a Cox-Snell residual as \(CS_i = -log(\hat{S}_p(t_i|x_i))\) for \(i=1,\dots, n\), and \(CS_i\)s should behave like data drawn from an Exponential distribution (\(\lambda=1\)).

Thus, we can check the adequacy of any model by checking the Exponential-ness of its Cox-Snell residuals. We can graph \(CS_i\) vs \(log(CS_i)\), and the graph should also look linear with slope =1 and intercept=0, if the parametric model is adequate for the data.

3.4.2 Exponential and Weibull

We first tried to examine the adquacy of Exponential and Weibull model, but none of them are adquate for our dataset:

  1. dtime, death ~ Treatment + size + nodes + grade + age under Exponential:
## Call:
## survreg(formula = Surv(dtime, death) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam, dist = "exponential")
## 
## Coefficients:
##                  (Intercept)               TreatmentChemo 
##                  10.72043480                  -0.51333347 
##              TreatmentHormon TreatmentNaN/Other Treatment 
##                  -0.41382108                  -0.45674929 
##                    size20-50                      size>50 
##                  -0.44229181                  -0.76874022 
##                        nodes                       grade3 
##                  -0.06922296                  -0.32925464 
##                          age 
##                  -0.01424763 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -12125.2   Loglik(intercept only)= -12360.4
##  Chisq= 470.37 on 8 degrees of freedom, p= <2e-16 
## n= 2982

  1. rtime, recur ~ Treatment + size + nodes + grade + age under Exponential:
## Call:
## survreg(formula = Surv(rtime, recur) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam, dist = "exponential")
## 
## Coefficients:
##                  (Intercept)               TreatmentChemo 
##                  9.081159515                 -0.373811470 
##              TreatmentHormon TreatmentNaN/Other Treatment 
##                 -0.507332960                 -0.485387357 
##                    size20-50                      size>50 
##                 -0.372059978                 -0.654240460 
##                        nodes                       grade3 
##                 -0.080821343                 -0.393920415 
##                          age 
##                  0.008095379 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -13897.4   Loglik(intercept only)= -14153.7
##  Chisq= 512.6 on 8 degrees of freedom, p= <2e-16 
## n= 2982

  1. dtime, death ~ Treatment + size + nodes + grade + age under Weibull:
## Call:
## survreg(formula = Surv(dtime, death) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam, dist = "weibull")
## 
## Coefficients:
##                  (Intercept)               TreatmentChemo 
##                  10.10735484                  -0.39504003 
##              TreatmentHormon TreatmentNaN/Other Treatment 
##                  -0.37941684                  -0.35423500 
##                    size20-50                      size>50 
##                  -0.33487166                  -0.61672126 
##                        nodes                       grade3 
##                  -0.05500626                  -0.26093659 
##                          age 
##                  -0.01130402 
## 
## Scale= 0.7370449 
## 
## Loglik(model)= -12055.3   Loglik(intercept only)= -12322.7
##  Chisq= 534.68 on 8 degrees of freedom, p= <2e-16 
## n= 2982

  1. rtime, recur ~ Treatment + size + nodes + grade + age under Weibull:
## Call:
## survreg(formula = Surv(rtime, recur) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam, dist = "weibull")
## 
## Coefficients:
##                  (Intercept)               TreatmentChemo 
##                  9.092880859                 -0.376425745 
##              TreatmentHormon TreatmentNaN/Other Treatment 
##                 -0.508855839                 -0.489527042 
##                    size20-50                      size>50 
##                 -0.376473401                 -0.659532060 
##                        nodes                       grade3 
##                 -0.081625087                 -0.398251643 
##                          age 
##                  0.008216064 
## 
## Scale= 1.014145 
## 
## Loglik(model)= -13897.2   Loglik(intercept only)= -14145.7
##  Chisq= 497.1 on 8 degrees of freedom, p= <2e-16 
## n= 2982

3.4.3 Log-normal models

Finally, we will examine the adequacy of Log-normal model.

## 
## Call:
## survreg(formula = Surv(dtime, death) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam, dist = "lognormal")
##                                 Value Std. Error      z       p
## (Intercept)                   9.91513    0.27975  35.44 < 2e-16
## TreatmentChemo               -0.42503    0.26292  -1.62   0.106
## TreatmentHormon              -0.31448    0.26896  -1.17   0.242
## TreatmentNaN/Other Treatment -0.41617    0.26011  -1.60   0.110
## size20-50                    -0.34787    0.05108  -6.81 9.8e-12
## size>50                      -0.60916    0.08067  -7.55 4.3e-14
## nodes                        -0.07773    0.00544 -14.29 < 2e-16
## grade3                       -0.30866    0.05490  -5.62 1.9e-08
## age                          -0.01005    0.00197  -5.10 3.3e-07
## Log(scale)                    0.06670    0.02162   3.08   0.002
## 
## Scale= 1.07 
## 
## Log Normal distribution
## Loglik(model)= -12018.2   Loglik(intercept only)= -12286.5
##  Chisq= 536.66 on 8 degrees of freedom, p= 9.5e-111 
## Number of Newton-Raphson Iterations: 4 
## n= 2982

## 
## Call:
## survreg(formula = Surv(rtime, recur) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam, dist = "lognormal")
##                                 Value Std. Error      z       p
## (Intercept)                   8.78912    0.30877  28.46 < 2e-16
## TreatmentChemo               -0.36139    0.29098  -1.24 0.21425
## TreatmentHormon              -0.41809    0.29943  -1.40 0.16263
## TreatmentNaN/Other Treatment -0.58078    0.28741  -2.02 0.04331
## size20-50                    -0.41866    0.05992  -6.99 2.8e-12
## size>50                      -0.67179    0.09845  -6.82 8.9e-12
## nodes                        -0.10590    0.00662 -16.00 < 2e-16
## grade3                       -0.43987    0.06470  -6.80 1.1e-11
## age                           0.00892    0.00238   3.75 0.00018
## Log(scale)                    0.28360    0.01980  14.33 < 2e-16
## 
## Scale= 1.33 
## 
## Log Normal distribution
## Loglik(model)= -13780.4   Loglik(intercept only)= -14045.8
##  Chisq= 530.65 on 8 degrees of freedom, p= 1.9e-109 
## Number of Newton-Raphson Iterations: 4 
## n= 2982

We could see that the Log-normal parametric model is an adequate model for both the dtime and rtime vs. Treatment + size + nodes + grade + age.

3.4.3.1 Interpretation of the coefficients

Lognormal is an example of a class of models called Accelerated Failure Time (AFT) models, in which every 1-unit increase in \(x_i\) is associated with a scaling of time by \(e^{c_i}\), where \(c\) is the coefficient of \(x\) in the regression model. The quantity \(e^{c_i}\) is referred to as a time ratio (TR), and thus each \(c_i\) represents a log(TR).

Thus, the results of our lognormal model can be interpreted in the following way:

3.4.3.1.1 For dtime:
## [1] 0.6537517
## [1] 0.7301685
## [1] 0.6595692
## [1] 0.7061876
## [1] 0.543805
## [1] 0.9252168
## [1] 0.7344282
## [1] 0.9900038

Having chemotherapy, hormon-therapy and NaN/Other Treatment is associated with a scaling of mean survival time by 0.6537517, 0.7301685 and 0.6595692 respectively, compared to having both chemo and hormon therapies. However, the relationships are all insignificant.

Compared to tumor size <=20mm, having tumor size 20-50mm and >50mm will make the expected survival time scaled by 0.7061876 and 0.543805.

For node and age, each extra positive lymph nodes and every 1-year increase in age is associated with a scaling of mean survival time by 0.9252168 and 0.9900038.

And compared to grade II, grade III of cancer cell makes the expected survival time multiplied by 0.7344282.

3.4.3.1.2 For rtime:
## [1] 0.69671
## [1] 0.6583021
## [1] 0.5594634
## [1] 0.6579301
## [1] 0.5107943
## [1] 0.8995107
## [1] 0.6441177
## [1] 1.008963

Having chemotherapy, hormon-therapy and NaN/Other Treatment is associated with a scaling of mean time till recurrence by 0.69671, 0.6583021 and 0.5594634 respectively, compared to having both chemo and hormon therapies. And the relations here are also insignificant.

Compared to tumor size <=20mm, having tumor size 20-50mm and >50mm will make the expected survival time scaled by 0.6579301 and 0.5107943.

For node and age, each extra positive lymph nodes and every 1-year increase in age is associated with a scaling of mean survival time by 0.8995107 and 1.008963.

And compared to grade II, grade III of cancer cell makes the expected survival time multiplied by 0.6441177.

3.4.3.2 Why log-normal

As to why the Log-normal model would be suitable for the data, we are not sure. One thing to consider is that the non-monotonicity of the hazard function, which is one main characteristic of log-normal model compared to Exponential and Weibull.

The plot below graphs the general hazard functions of our dtime and rtime in a non-parametric way, and we can find that both of them shows non-monotonicity. Though we cannot graph the exact hazard functions conditioning on the diagnostic and treatment information, the non-parametric hazard can still partially explain the adquacy of log-normal.

3.4.3.3 Positive coefficient of age

In the rtime ~ Treatment + size + nodes + age models, it is surprised to see that the coefficient of age is positive, though the value is small. One possible explanation could be the idea of “competing events” and “competing risk”.

In this scenario, having recurred breast cancer and being dead could be somewhat “competing events”. Though they are not completely “cannot happen on one person at the same time”, it is still reasonable to think that for older patients, it is more likely to die from breast cancer or other complications than being cancer-free for years and then having breast cancer recurred; whereas for younger people, the risk of having recurrent breast cancer could be higher than being dead from the first breast cancer.

The plot above also helps verifying the idea. We can find that the group who has been dead during the study but has never had breast cancer recurred (the green one) tends to be older than others.

The positive coefficient of age in the log-normal model may seem indicating a protective effect of being old against having breast cancer recurred, but this should not be the true case. It is very likely that this is caused by the competing risk between being dead and having breast cancer recurred for people in different age group. The elderly patients are less likely to suffer from recurrent breast cancer because they are more likely to die from breast cancer or other complications during the treatment after the first diagnosis.

3.5 Cox-PH model:

Another very important part of a survival analysis is looking for the hazard ratio by building and interpreting the coxph model.

Cox’s PH model assumes that the hazard function for any subject can be written as: \(h(t) = h_0(t)e^{b_1 x_1 + b_2 x_2 + \dots + b_k x_k}\), where \(h_0(t)\) is called a baseline hazard function, and the \(x_i\)’s are covariates. The Cox PH model assumes that, for any 2 values of a covariate, the hazard ratio (HR) is constant over time.

With this model, once we have proof of the validation of the PH assumption of the model, we could have nice coefficients to interpret as the logarithm of HR(hazard ratio) among different groups. We will be following the way that we built our parametric models, by using Survival times vs. age + size + nodes + grade + Treatment.

3.5.1 Cox-PH model for dtime

## Call:
## coxph(formula = Surv(dtime, death) ~ Treatment + size + nodes + 
##     grade + age, data = rotterdam)
## 
##                                  coef exp(coef) se(coef)      z        p
## TreatmentChemo               0.544967  1.724551 0.360119  1.513    0.130
## TreatmentHormon              0.488511  1.629887 0.366267  1.334    0.182
## TreatmentNaN/Other Treatment 0.491504  1.634773 0.356997  1.377    0.169
## size20-50                    0.450933  1.569776 0.065260  6.910 4.85e-12
## size>50                      0.818215  2.266451 0.091189  8.973  < 2e-16
## nodes                        0.073329  1.076085 0.004865 15.072  < 2e-16
## grade3                       0.350976  1.420453 0.070141  5.004 5.62e-07
## age                          0.015067  1.015182 0.002552  5.903 3.56e-09
## 
## Likelihood ratio test=513.7  on 8 df, p=< 2.2e-16
## n= 2982, number of events= 1272

First we should check if the PH assumption holds. We will be using a formal test.

##           chisq df       p
## Treatment  4.32  3 0.22855
## size       4.71  2 0.09487
## nodes      3.87  1 0.04915
## grade      2.61  1 0.10637
## age       14.48  1 0.00014
## GLOBAL    26.91  8 0.00073

There are two variables, age and nodes, that have have p-value smaller than 0.05, which indicates the violation of the PH assumption.

One way we came up to solve this problem is to put stratification on age and nodes by putting a strata() on them when fitting the model. What this does is to recognize the correlation between age and dtime and nodes and dtime, but not actually including them in our model results.

## Call:
## coxph(formula = Surv(dtime, death) ~ Treatment + size + strata(nodes) + 
##     grade + strata(age), data = rotterdam)
## 
##                                 coef exp(coef) se(coef)     z        p
## TreatmentChemo               0.45621   1.57808  0.47456 0.961    0.336
## TreatmentHormon              0.44920   1.56706  0.49904 0.900    0.368
## TreatmentNaN/Other Treatment 0.76097   2.14035  0.48430 1.571    0.116
## size20-50                    0.43163   1.53977  0.08021 5.381 7.40e-08
## size>50                      0.63412   1.88536  0.14221 4.459 8.23e-06
## grade3                       0.38089   1.46358  0.09141 4.167 3.09e-05
## 
## Likelihood ratio test=67.3  on 6 df, p=1.459e-12
## n= 2982, number of events= 1272
##           chisq df    p
## Treatment 3.198  3 0.36
## size      1.819  2 0.40
## grade     0.635  1 0.43
## GLOBAL    5.791  6 0.45

Now we can see that all the variables’ p-value are greater than 0.05 which indicates the PH assumption holds in our model.

For treatments, we can see that Chemotherapy, Hormontherapy and Other Treatment all have higher risks than patients receiving both therapies, with hazard ratio 1.57808, 1.56706 and 2.14035 respectively. However, note that all three p-values are greater than 0.05, which means that we do not have significant evidence for such relationships.

And for size, it is clear that patients with larger tumor size enjoy higher risks, and as the size grows, the risk is also increasing, with hazard ratio 1.53977 and 1.88536 for 20-50mm vs. <=20mm and >50mm vs. <=20mm. Such relationship is significant as we can see from the p-values since they are all less than 0.05.

For grade, we could see that the risk is going to be multiplied by 1.46358, if it goes from grade II to grade III, and the relation is staitstically significant.

3.5.2 Cox-PH model for rtime

## Call:
## coxph(formula = Surv(rtime, recur) ~ Treatment + strata(size) + 
##     strata(nodes) + strata(grade) + age, data = rotterdam)
## 
##                                   coef exp(coef)  se(coef)      z        p
## TreatmentChemo                0.280433  1.323703  0.290152  0.967  0.33379
## TreatmentHormon               0.396060  1.485958  0.300374  1.319  0.18732
## TreatmentNaN/Other Treatment  0.817338  2.264464  0.291876  2.800  0.00511
## age                          -0.015377  0.984741  0.002553 -6.023 1.71e-09
## 
## Likelihood ratio test=63.27  on 4 df, p=5.96e-13
## n= 2982, number of events= 1518

For rtime, we found that all the diagnostic information size, nodes and grade obeys the PH assumption, and we can only investigate Treatment and age.

We can find that patients taking both chemo and hormon therapies still have lowest hazard of breast cancer recurrence, and having chemo only, hormon only or NaN/Other Treatment will multiply the hazard by 1.323703, 1.485958 and 2.264464 respectively. However, the statistical significance is not that strong.

And for age, it’s still surprised to see that its coefficient is negative, indicating that every 1-year increase in age will make the hazard of breast cancer recurrence by 0.984741. The reason behind this should be the same as the idea of competing risk we mentioned above.

3.6 Summary

Based on all the survival analysis above, we found that different treatment ways actually does not differ significantly in extending patients’ survival time or time till recurrence, and the main factors that affect patients’ life are the diagnostic information: tumor size, number of positive lymph nodes and tumor grade.

This reminds us the importance of taking screening every three years. Current medical level still fails to cure breast cancer, but with early diagnosis, patients’ lives can be extended by a lot.