SurvCART: Constructing Survival Tree in R

[Author: Madan G. Kundu]

In practice, survival times may be influenced by a combination of baseline variables. Survival trees (i.e., trees with survival data) offer a relatively flexible approach to understanding the effects of covariates, including their interaction, on survival times when the functional form of the association is unknown, and also have the advantage of easier interpretation. This blog presents the construction of a survival tree following the SurvCART algorithm (Kundu and Ghosh 2021).  Usually, a survival tree is constructed based on the heterogeneity in time-to-event distribution. However, in some applications with marker dependent censoring (e.g., less education or better prognosis might lead to early censoring) ignorance of censoring distribution might lead to inconsistent survival tree (Cui, Zhou and Kosorok, 2021).  Therefore, the SurvCART algorithm is flexible to construct a survival tree based on heterogeneity both in time-to-event and censoring distribution. However, it is important to emphasize that the use of censoring heterogeneity in the construction of survival trees is optional.  In this blog, the construction of a survival tree is illustrated with an R dataset in step by step approach and the results are explained using the functions available in the LongCART package


Installing and Loading LongCART package
R> install.packages("LongCART")
R> library(LongCART)
The GBSG2 dataset
We will illustrate the SurvCART algorithm with GBSG2 data to analyze recurrence-free survival (RFS) originated from a prospective randomized clinical trial conducted by the German Breast Cancer Study Group. The purpose is to evaluate the effect of prognostic factors on RFS among node-positive breast cancer patients receiving chemotherapy in the adjuvant setting. RFS is time-to-event endpoint which was defined as the time from mastectomy to the first occurrence of either recurrence, contralateral or secondary tumor, or death. This GBSG2 dataset is included in the LongCART package in R. The dataset contains the information from 686 breast cancer patients with the following variables: time (RFS follow-up time in days), cens (censoring indicator: 0[censored], 1[event]), horTH (hormonal therapy: yes, no), age (age in years), menostat (menopausal status: pre, post), tsize (tumour size in mm), tgrade (tumour grade: I, II, III), pnodes (number of positive nodes), progrec (level of progesterone receptor[PR] in fmol), and estrec (level of the estrogen receptor in fmol). For details about the conduct of the study, please refer to Schumacher et al.

Let’s first explore the dataset. 
R> library(speff2trial)
R> data("ACTG175", package = "speff2trial")
R> adata<- reshape(data=ACTG175[,!(names(ACTG175) %in% c("cd80", "cd820"))],
+ varying=c("cd40", "cd420", "cd496"), v.names="cd4",
+ idvar="pidnum", direction="long", times=c(0, 20, 96))
R> adata<- adata[order(adata$pidnum, adata$time),]
The median RFS time based on the entire 686 patients was 60 months with a total of 299 reported RFS events.  Does any baseline covariate influence the median RFS time and if yes then how? A survival tree is meant to find this answer.

Impact of individual categorical baseline variables on RFS
We would like to start with understanding the impact of individual baseline covariates on the median RFS time. There could be two types of baseline covariates: categorical (e.g., horTH [hormonal therapy]) and continuous (e.g., age).  The impact of an individual categorical variable can be statistically assessed using StabCat.surv() [Note: StabCat.surv() performs a statistical test as described in Section 2.2.1 of Kundu and Ghosh 2021].

Let’s focus only on the heterogeneity in RFS (assuming exponential underlying distribution for RFS), but not on the censoring distribution. 
R> out1<- StabCat.surv(data=GBSG2, timevar="time", censorvar="cens", 
R>                     splitvar="horTh", time.dist="exponential", 
R>                     event.ind=1) 
Stability Test for Categorical grouping variable 
Test.statistic= 8.562, Adj. p-value= 0.003 
Greater evidence of heterogeneity in time-to-event distribution 
R> out1$pval
[1] 0.003431809
The p-value is 0.00343 suggesting the influence of hormonal therapy on RFS. Here we considered Exponential distribution for RFS, but the following distributions can be specified as well: "weibull""lognormal" or "normal"

Now we illustrate the use of StabCat.surv() considering heterogeneity both in RFS time and censoring time.
R> out1<- StabCat.surv(data=GBSG2, timevar="time", censorvar="cens", 
R>                     splitvar="horTh", time.dist="exponential", 
R>                     cens.dist="exponential", event.ind=1) 
Stability Test for Categorical grouping variable 
Test.statistic= 8.562 0.013, Adj. p-value= 0.007 0.909 
Greater evidence of heterogeneity in time-to-event distribution 
> out1$pval
[1] 0.006863617
Note the two Test.statistic values (8.562 and 0.013): the first one corresponds to for heterogeneity in RFS time and the second one corresponds to heterogeneity in censoring distribution.  Consequently, we also see the two p-values (adjusted).  The overall p-value 0.006863617. In comparison to the earlier p-value of 0.00343, this p-value is greater due to the multiplicity adjustment.

Impact of individual continuous baseline variables on RFS
Similarly, we can assess the impact of an individual continuous variable by performing a statistical test as described in Section 2.2.2 of Kundu and Ghosh 2021 which is implemented in StabCont.surv()
R> out1<- StabCont.surv(data=GBSG2, timevar="time", censorvar="cens",
R>                      splitvar="age", time.dist="exponential",
R>                      event.ind=1)
Stability Test for Continuous grouping variable 
Test.statistic= 0.925, Adj. p-value= 0.359 
Greater evidence of heterogeneity in time-to-event distribution 
> out1$pval
[1] 0.3591482
The p-value is 0.3591482 suggesting no impact of age on RFS. Heterogeneity in censoring distribution can be considered here as well by specifying cens.dist as shown with StabCat.surv().

Construction of survival tree
A survival tree can be constructed using SurvCART(). Note that the SurvCART() function currently can handle the categorical baseline variables with numerical levels only. For nominal variables, please create the corresponding numerically coded dummy variable(s) as has been done below for horTh, trade and menostat  
R> GBSG2$horTh1<- as.numeric(GBSG2$horTh)
R> GBSG2$tgrade1<- as.numeric(GBSG2$tgrade)
R> GBSG2$menostat1<- as.numeric(GBSG2$menostat)
We also have to add the subject id if that is not already there.
R> GBSG2$subjid<- 1:nrow(GBSG2)
In the SurvCART(), we have to list out all the baseline variables of interest and declare whether it is categorical or continuous. Further, we also have to specify the RFS distribution "exponential"(default), "weibull""lognormal" or "normal". The censoring distribution could be "NA"(default: i.e., no censoring heterogeneity), "exponential" "weibull""lognormal" or "normal"

Let’s construct the survival tree with heterogeneity in RFS only (i.e., ignoring censoring heterogeneity)
R> out.tree<- SurvCART(data=GBSG2, patid="subjid", 
R>                censorvar="cens", timevar="time", 
R>                gvars=c('horTh1', 'age', 'menostat1', 'tsize',          
R>                        'tgrade1', 'pnodes', 'progrec', 'estrec'),  
R>                tgvars=c(0,1,0,1,0,1, 1,1), 
R>                time.dist="exponential", 
R>                event.ind=1, alpha=0.05, minsplit=80, minbucket=40, 
R>                print=TRUE)
All the baseline variables are listed in gvars argument. The tgvars argument is accompanied with the gvars argument which indicates type of the partitioning variables (0=categorical or 1=continuous). Further, we have considered exponential distribution for RFS time. 

Within SurvCART(), at any given tree-node, each of the baseline variables is tested for heterogeneity using eitherStabCat.surv() or StabCont.surv() and the corresponding p-value is obtained. Further, these p-values are adjusted according to the Hochberg procedure and subsequently, the baseline variable with the smallest p-value is selected. These details would be printed if print=TRUE. This procedure keeps iterating until we hit the terminal node (i.e., no more statistically significant baseline variable, node size is smaller than the minsplit or further splitting results in a node size smaller than  minbucket).

Now let’s view the tree result


In the above output, each row corresponds to a single node including the 7  terminal nodes identified by TERMINAL=TRUE.  Now let’s visualize the tree result
R> par(xpd = TRUE)
R> plot(out.tree, compress = TRUE)
R> text(out.tree, use.n = TRUE)
The resultant tree is as follows:

We can also plot KM plot for sub-groups identified by survival tree as follows:
KMPlot.SurvCART(out.tree, scale.time=365.25, type=1)


Last but not least, we can consider heterogeneity in censoring distribution in SurvCART() by specifying cens.dist .

Reference

1. Kundu, M. G., & Ghosh, S. (2021). Survival trees based on heterogeneity in time‐to‐event and censoring distributions using parameter instability test. Statistical Analysis and Data Mining: The ASA Data Science Journal14(5), 466-483. [link]
2. Y. Cui, R. Zhu, M. Zhou, and M. Kosorok, Consistency of survival tree and forest models: Splitting bias and correction. Statistica Sinica Preprint No: SS-2020-0263, 2021.
3. Schumacher, M., Bastert, G., Bojar, H., Hübner, K., Olschewski, M., Sauerbrei, W., … & Rauschecker, H. F. (1994). Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology12(10), 2086-2093. [link]
4. LongCART package v 3.1 or higher, https://CRAN.R-project.org/package=LongCART

LongCART – Regression tree for longitudinal data

Longitudinal changes in a population of interest are often heterogeneous and may be influenced by a combination of baseline factors. The longitudinal tree (that is, regression tree with longitudinal data) can be very helpful to identify and characterize the sub-groups with distinct longitudinal profile in a heterogenous population. This blog presents the capabilities of the R package LongCART for constructing longitudinal tree according to the LongCART algorithm (Kundu and Harezlak 2019). In addition, this packages can also be used to formally evaluate whether any particular baseline covariate affects the longitudinal profile via parameter instability test. In this blog, construction of longitudinal tree is illlustrated with an R dataset in step by step approach and the results are explained.

Installing and Loading LongCART package
R> install.packages("LongCART")
R> library(LongCART)
Get the example dataset The ACTG175 dataset in speff2trial package contains longitudinal observation of CD4 counts from clinical trial in HIV patients. This dataset is in “wide” format, and, we need to convert  it to first “long” format.
R> library(speff2trial)
R> data("ACTG175", package = "speff2trial")
R> adata<- reshape(data=ACTG175[,!(names(ACTG175) %in% c("cd80", "cd820"))],
+ varying=c("cd40", "cd420", "cd496"), v.names="cd4",
+ idvar="pidnum", direction="long", times=c(0, 20, 96))
R> adata<- adata[order(adata$pidnum, adata$time),]
Longtudinal model of interest Since the count data including CD4 counts are often log transformed before modeling, a simple longitudinal model for CD4 counts would be: log(CD4 countit) = beta0 + beta1*t + bi + epsilonit Does the fixed parameters of above longitdinal vary with the level of baseline covariate? Categorical baseline covariate Suppose we want to evaluate whether any of the fixed model parameters changes with the levels of any baseline categorical partitioning variable, say, gender.
R> adata$Y=ifelse(!is.na(adata$cd4),log(adata$cd4+1), NA)
R> StabCat(data=adata, patid="pidnum", fixed=Y~time, splitvar="gender")
Stability Test for Categorical grouping variable
Test.statistic=0.297, p-value=0.862
The p-value is 0.862 which indicates that we don’t have any evidence that fixed parameters vary with the levels of gender. Continuous baseline covariate Now suppose we are interested to evaluate whether any of the fixed model parameters changes with the levels of any baseline continuous partitioning variable, say, wtkg.
R> StabCont(data=adata, patid="pidnum", fixed=Y~time, splitvar="wtkg")
Stability Test for Continuous grouping variable
Test.statistic=1.004 1.945, Adj. p-value=0.265 0.002
The result returns two two p-values – the first p-value correspond to parameter instability test of beta0 and the second ones correspond to beta1 . Constructing tree for longitudinal profile The ACTG175 dataset contains several baseline variables including gender, hemo (presence of hemophilia), homo (homosexual activity), drugs (history of intravenous drug use ), oprior (prior non-zidovudine antiretroviral therapy), z30 (zidovudine use 30 days prior to treatment initiation), zprior (zidovudine use prior to treatment initiation), race, str2 (antiretroviral history), treat (treatment indicator), offtrt (indicator of off-treatment before 96 weeks), age, wtkg (weight) and karnof (Karnofsky score). We can construct longitudinal tree to identify the sub-groups defined by these baseline variables such that the individuals within the given sub-groups are homogeneous with respect to longitudinal profile of CD4 counts but the longitudinal profiles among the sub-groups are heterogenous.
R> gvars=c("age", "gender", "wtkg", "hemo", "homo", "drugs",
+ "karnof", "oprior", "z30", "zprior", "race",
+ "str2", "symptom", "treat", "offtrt", "strat")
R> tgvars=c(1, 0, 1, 0, 0, 0,
+ 1, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0)
R> out.tree<- LongCART(data=adata, patid="pidnum", fixed=Y~time,
+ gvars=gvars, tgvars=tgvars, alpha=0.05,
+ minsplit=100, minbucket=50, coef.digits=3)
All the baseline variables are listed in gvars argument. The gvars argument is accompanied with the tgvars argument which indicates type of the partitioning variables (0=categorical or 1=continuous). Note that the LongCART() function currently can handle the categorical variables with numerical levels only. For nominal variables, please create the corresponding numerically coded dummy variable(s).  Now let’s view the tree results
R> out.tree$Treeout
ID n yval var index p (Instability) loglik improve Terminal
1 1 2139 5.841-0.003time offtrt 1.00 0.000 -4208 595 FALSE
2 2 1363 5.887-0.002time treat 1.00 0.000 -2258 90 FALSE
3 4 316 5.883-0.004time str2 1.00 0.005 -577 64 FALSE
4 8 125 5.948-0.002time symptom NA 1.000 -176 NA TRUE
5 9 191 5.84-0.005time symptom NA 0.842 -378 NA TRUE
6 5 1047 5.888-0.001time wtkg 68.49 0.008 -1645 210 FALSE
7 10 319 5.846-0.002time karnof NA 0.260 -701 NA TRUE
8 11 728 5.907-0.001time age NA 0.117 -849 NA TRUE
9 3 776 5.781-0.007time karnof 100.00 0.000 -1663 33 FALSE
10 6 360 5.768-0.009time wtkg NA 0.395 -772 NA TRUE
11 7 416 5.795-0.005time z30 1.00 0.014 -883 44 FALSE
12 14 218 5.848-0.003time treat NA 0.383 -425 NA TRUE
13 15 198 5.738-0.007time age NA 0.994 -444 NA TRUE
In the above output, each row corresponds to single node including the 7  terminal nodes identified by TERMINAL=TRUE.  Now let’s visualize the tree results
R> par(xpd = TRUE)
R> plot(out.tree, compress = TRUE)
R> text(out.tree, use.n = TRUE)
The resultant tree is as follows:

ACTG175tree