**Interested in publishing a one-time post on R-bloggers.com? Press here to learn how.**

## Another way of analysis?

When there are so many tools and techniques of prediction modelling, why do we have another field known as survival analysis? As one of the most popular branch of statistics, Survival analysis is a way of prediction at various points in time. This is to say, while other prediction models make predictions of whether an event will occur, survival analysis predicts whether the event will occur at a specified time. Thus, it requires a time component for prediction and correspondingly, predicts the time when an event will happen. This helps one in understanding the expected duration of time when events occur and provide much more useful information. One can think of natural areas of application of survival analysis which include biological sciences where one can predict the time for bacteria or other cellular organisms to multiple to a particular size or expected time of decay of atoms. Some interesting applications include prediction of the expected time when a machine will break down and maintenance will be required## How hard does it get..

It is not easy to apply the concepts of survival analysis right off the bat. One needs to understand the ways it can be used first. This includes Kaplan-Meier Curves, creating the survival function through tools such as survival trees or survival forests and log-rank test.```
```

Let’s go through each of them one by one in R. We will use the survival package in R as a starting example. The survival package has the surv() function that is the center of survival analysis. ```
```

# install.packages("survival") # Loading the package library("survival")

```
```

The package contains a sample dataset for demonstration purposes. The dataset is pbc which contains a 10 year study of 424 patients having Primary Biliary Cirrhosis (pbc) when treated in Mayo clinic. A point to note here from the dataset description is that out of 424 patients, 312 participated in the trial of drug D-penicillamine and the rest 112 consented to have their basic measurements recorded and followed for survival but did not participate in the trial. 6 of these 112 cases were lost. ```
```

We are particularly interested in ‘time’ and ‘status’ features in the dataset. Time represents the number of days after registration and final status (which can be censored, liver transplant or dead). Since it is survival, we will consider the status as dead or not-dead (transplant or censored). Further details about the dataset can be read from the command: ```
```

#Dataset description ?pbc

```
```

We start with a direct application of the Surv() function and pass it to the survfit() function. The Surv() function will take the time and status parameters and create a survival object out of it. The survfit() function takes a survival object (the one which Surv() produces) and creates the survival curves. ```
```

#Fitting the survival model survival_func=survfit(Surv(pbc$time,pbc$status == 2)~1) survival_func Call: survfit(formula = Surv(pbc$time, pbc$status == 2) ~ 1) n events median 0.95LCL 0.95UCL 418 161 3395 3090 3853

```
```

The function gives us the number of values, the number of positives in status, the median time and 95% confidence interval values. The model can also be plotted. ```
```

#Plot the survival model plot(survival_func)

As expected, the plot shows us the decreasing probabilities for survival as time passes. The dashed lines are the upper and lower confidence intervals. In the survfit() function here, we passed the formula as ~ 1 which indicates that we are asking the function to fit the model solely on the basis of survival object and thus have an intercept. The output along with the confidence intervals are actually Kaplan-Meier estimates. This estimate is prominent in medical research survival analysis. The Kaplan – Meier estimates are based on the number of patients (each patient as a row of data) from the total number who survive for a certain time after treatment. (which is the event). We can represent the Kaplan – Meier function by the formula:

```
```

Ŝ(t)=∏(1-di/ni) for all i where ti≤t Here, di the number of events and ni is the total number of people at risk at time ti

## What to make of the graph?

Unlike other machine learning techniques where one uses test samples and makes predictions over them, the survival analysis curve is a self – explanatory curve. From the curve, we see that the possibility of surviving about 1000 days after treatment is roughly 0.8 or 80%. We can similarly define probability of survival for different number of days after treatment. At the same time, we also have the confidence interval ranges which show the margin of expected error. For example, in case of surviving 1000 days example, the upper confidence interval reaches about 0.85 or 85% and goes down to about 0.75 or 75%. Post the data range, which is 10 years or about 3500 days, the probability calculations are very erratic and vague and should not be taken up. For example, if one wants to know the probability of surviving 4500 days after treatment, then though the Kaplan – Meier graph above shows a range between 0.25 to 0.55 which is itself a large value to accommodate the lack of data, the data is still not sufficient enough and a better data should be used to make such an estimate.## Alternative models: Cox Proportional Hazard model

The survival package also contains a cox proportional hazard function coxph() and use other features in the data to make a better survival model. Though the data has untreated missing values, I am skipping the data processing and fitting the model directly. In practice, however, one needs to study the data and look at ways to process the data appropriately so that the best possible models are fitted. As the intention of this article is to get the readers acquainted with the function rather than processing, applying the function is the shortcut step which I am taking.```
```

# Fit Cox Model Cox_model = coxph(Surv(pbc$time,pbc$status==2) ~.,data=pbc) summary(Cox_model) Call: coxph(formula = Surv(pbc$time, pbc$status == 2) ~ ., data = pbc) n= 276, number of events= 111 (142 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(>|z|) id -2.729e-03 9.973e-01 1.462e-03 -1.866 0.06203 . trt -1.116e-01 8.944e-01 2.156e-01 -0.518 0.60476 age 3.191e-02 1.032e+00 1.200e-02 2.659 0.00784 ** sexf -3.822e-01 6.824e-01 3.074e-01 -1.243 0.21378 ascites 6.321e-02 1.065e+00 3.874e-01 0.163 0.87038 hepato 6.257e-02 1.065e+00 2.521e-01 0.248 0.80397 spiders 7.594e-02 1.079e+00 2.448e-01 0.310 0.75635 edema 8.860e-01 2.425e+00 4.078e-01 2.173 0.02980 * bili 8.038e-02 1.084e+00 2.539e-02 3.166 0.00155 ** chol 5.151e-04 1.001e+00 4.409e-04 1.168 0.24272 albumin -8.511e-01 4.270e-01 3.114e-01 -2.733 0.00627 ** copper 2.612e-03 1.003e+00 1.148e-03 2.275 0.02290 * alk.phos -2.623e-05 1.000e+00 4.206e-05 -0.624 0.53288 ast 4.239e-03 1.004e+00 1.941e-03 2.184 0.02894 * trig -1.228e-03 9.988e-01 1.334e-03 -0.920 0.35741 platelet 7.272e-04 1.001e+00 1.177e-03 0.618 0.53660 protime 1.895e-01 1.209e+00 1.128e-01 1.680 0.09289 . stage 4.468e-01 1.563e+00 1.784e-01 2.504 0.01226 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 exp(coef) exp(-coef) lower .95 upper .95 id 0.9973 1.0027 0.9944 1.000 trt 0.8944 1.1181 0.5862 1.365 age 1.0324 0.9686 1.0084 1.057 sexf 0.6824 1.4655 0.3736 1.246 ascites 1.0653 0.9387 0.4985 2.276 hepato 1.0646 0.9393 0.6495 1.745 spiders 1.0789 0.9269 0.6678 1.743 edema 2.4253 0.4123 1.0907 5.393 bili 1.0837 0.9228 1.0311 1.139 chol 1.0005 0.9995 0.9997 1.001 albumin 0.4270 2.3422 0.2319 0.786 copper 1.0026 0.9974 1.0004 1.005 alk.phos 1.0000 1.0000 0.9999 1.000 ast 1.0042 0.9958 1.0004 1.008 trig 0.9988 1.0012 0.9962 1.001 platelet 1.0007 0.9993 0.9984 1.003 protime 1.2086 0.8274 0.9690 1.508 stage 1.5634 0.6397 1.1020 2.218 Concordance= 0.849 (se = 0.031 ) Rsquare= 0.462 (max possible= 0.981 ) Likelihood ratio test= 171.3 on 18 df, p=0 Wald test = 172.5 on 18 df, p=0 Score (logrank) test = 286.1 on 18 df, p=0

```
```

The Cox model output is similar to how a linear regression output comes up. The R2 is only 46% which is not high and we don’t have any feature which is highly significant. The top important features appear to be age, bilirubin (bili) and albumin. Let’s see how the plot looks like. ```
```

#Create a survival curve from the cox model Cox_curve <- survfit(Cox_model) plot(Cox_curve)

With more data, we get a different plot and this one is more volatile. Compared to the Kaplan – Meier curve, the cox-plot curve is higher for the initial values and lower for the higher values. The major reason for this difference is the inclusion of variables in cox-model. The plots are made by similar functions and can be interpreted the same way as the Kaplan – Meier curve.

## Going traditional : Using survival forests

Random forests can also be used for survival analysis and the ranger package in R provides the functionality. However, the ranger function cannot handle the missing values so I will use a smaller data with all rows having NA values dropped. This will reduce my data to only 276 observations.```
```

#Using the Ranger package for survival analysis Install.packages("ranger") library(ranger) #Drop rows with NA values pbc_nadrop=pbc[complete.cases(pbc), ] #Fitting the random forest ranger_model <- ranger(Surv(pbc_nadrop$time,pbc_nadrop$status==2) ~.,data=pbc_nadrop,num.trees = 500, importance = "permutation",seed = 1) #Plot the death times plot(ranger_model$unique.death.times,ranger_model$survival[1,], type = "l", ylim = c(0,1),)

Let’s look at the variable importance plot which the random forest model calculates.

#Get the variable importance data.frame(sort(ranger_model$variable.importance,decreasing = TRUE)) sort.ranger_model.variable.importance..decreasing...TRUE. bili 0.0762338981 copper 0.0202733989 albumin 0.0165070226 age 0.0130134413 edema 0.0122113704 ascites 0.0115315711 chol 0.0092889960 protime 0.0060215073 id 0.0055867915 ast 0.0049932803 stage 0.0030225398 hepato 0.0029290675 trig 0.0028869184 platelet 0.0012958105 sex 0.0010639806 spiders 0.0005210531 alk.phos 0.0003291581 trt -0.0002020952These numbers may be different for different runs. In my example, we see that bilirubin is the most important feature.

## Lessons learned: Conclusion

Though the input data for Survival package’s Kaplan – Meier estimate, Cox Model and ranger model are all different, we will compare the methodologies by plotting them on the same graph using ggplot.#Comparing models library(ggplot2) #Kaplan-Meier curve dataframe #Add a row of model name km <- rep("Kaplan Meier", length(survival_func$time)) #Create a dataframe km_df <- data.frame(survival_func$time,survival_func$surv,km) #Rename the columns so they are same for all dataframes names(km_df) <- c("Time","Surv","Model") #Cox model curve dataframe #Add a row of model name cox <- rep("Cox",length(Cox_curve$time)) #Create a dataframe cox_df <- data.frame(Cox_curve$time,Cox_curve$surv,cox) #Rename the columns so they are same for all dataframes names(cox_df) <- c("Time","Surv","Model") #Dataframe for ranger #Add a row of model name rf <- rep("Survival Forest",length(ranger_model$unique.death.times)) #Create a dataframe rf_df <- data.frame(ranger_model$unique.death.times,sapply(data.frame(ranger_model$survival),mean),rf) #Rename the columns so they are same for all dataframes names(rf_df) <- c("Time","Surv","Model") #Combine the results plot_combo <- rbind(km_df,cox_df,rf_df) #Make a ggplot plot_gg <- ggplot(plot_combo, aes(x = Time, y = Surv, color = Model)) plot_gg + geom_line() + ggtitle("Comparison of Survival Curves")

We see here that the Cox model is the most volatile with the most data and features. It is higher for lower values and drops down sharply when the time increases. The survival forest is of the lowest range and resembles Kaplan-Meier curve. The difference might be because of Survival forest having less rows. The essence of the plots is that there can be different approaches to the same concept of survival analysis and one may choose the technique based on one’s comfort and situation. A better data with processed data points and treated missing values might fetch us a better R2 and more stable curves. At the same time, they will help better in finding time to event cases such as knowing the time when a promotion’s effect dies down, knowing when tumors will develop and become significant and lots of other applications with a significant chunk of them being from medical science. Survival, as the name suggests, relates to surviving objects and is thus related to event occurrence in a completely different way than machine learning. It is important to know this technique to know more and more ways data can help us in solving problems, with time involved in this particular case. Hope this article serves the purpose of giving a glimpse of survival analysis and the feature rich packages available in R.

## Here is the complete code for the article:

# install.packages("survival") # Loading the package library("survival") #Dataset description ?pbc #Fitting the survival model survival_func=survfit(Surv(pbc$time,pbc$status == 2)~1) survival_func #Plot the survival model plot(survival_func) # Fit Cox Model Cox_model = coxph(Surv(pbc$time,pbc$status==2) ~.,data=pbc) summary(Cox_model) #Create a survival curve from the cox model Cox_curve <- survfit(Cox_model) plot(Cox_curve) #Using the Ranger package for survival analysis #install.packages("ranger") library(ranger) #Drop rows with NA values pbc_nadrop=pbc[complete.cases(pbc), ] #Fitting the random forest ranger_model <- ranger(Surv(pbc_nadrop$time,pbc_nadrop$status==2) ~.,data=pbc_nadrop,num.trees = 500, importance = "permutation",seed = 1) #Plot the death times plot(ranger_model$unique.death.times,ranger_model$survival[1,], type = "l", ylim = c(0,1),) #Get the variable importance data.frame(sort(ranger_model$variable.importance,decreasing = TRUE)) #Comparing models library(ggplot2) #Kaplan-Meier curve dataframe #Add a row of model name km <- rep("Kaplan Meier", length(survival_func$time)) #Create a dataframe km_df <- data.frame(survival_func$time,survival_func$surv,km) #Rename the columns so they are same for all dataframes names(km_df) <- c("Time","Surv","Model") #Cox model curve dataframe #Add a row of model name cox <- rep("Cox",length(Cox_curve$time)) #Create a dataframe cox_df <- data.frame(Cox_curve$time,Cox_curve$surv,cox) #Rename the columns so they are same for all dataframes names(cox_df) <- c("Time","Surv","Model") #Dataframe for ranger #Add a row of model name rf <- rep("Survival Forest",length(ranger_model$unique.death.times)) #Create a dataframe rf_df <- data.frame(ranger_model$unique.death.times,sapply(data.frame(ranger_model$survival),mean),rf) #Rename the columns so they are same for all dataframes names(rf_df) <- c("Time","Surv","Model") #Combine the results plot_combo <- rbind(km_df,cox_df,rf_df) #Make a ggplot plot_gg <- ggplot(plot_combo, aes(x = Time, y = Surv, color = Model)) plot_gg + geom_line() + ggtitle("Comparison of Survival Curves")

## Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Vishnu Reddy and Saneesh Veetil contributed to this article.Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Hi,

Nice article.

A couple of, hopefully constructive, comments:

1. The use of install.packages(“survival”) is not needed, since the survival package is a “recommended” package, and therefore part of any standard R installation.

2. In the example using the Cox model, recognizing that you are focusing on the functions, there are reasons that the model is fairly unstable. First, given the formula specification, you included ‘id’ as a covariate, which is simply a unique, per subject, case number in the dataset. Second, the model is arguably overfit, given that there are only 111 events in the complete case dataset (n = 276, per the output) used to fit the model and you have, if my count is correct, 17 covariates. That is only about 6.5 events per covariate degree of freedom (111 / 17), which is well below the common rule of thumb guidance of 10 to 20 events per covariate degree of freedom to reduce the chances of overfitting. It might be reasonable to consider using a pre-specified, reduced model for the examples when comparing the 3 methods.

As you note, the Cox model and the forest model will both have a smaller sample size (n = 276) than the KM model (n = 418), due to missing data.

Thanks.