R code for computing variable importance for a survival model

The following R code computes the relative importance for predictor variables in a survival model. The implemented method for computing the relative importance was inspired by the Leo Breiman’s method for computing variable importance in a Random Forest.

Breiman’s method for computing variable importance can best be explained with an example.
Suppose you have 5 predictor variables, say x1 to x5. These 5 variables are used for predicting some observed response y. However, the 5 variables do not predict exactly the observed values for y. In other words, the predictions based on the 5 variables will more or less deviate from the observed values for y. The Mean Squared Error (MSE) is calculated as the mean of these deviations.
Assume now that predictor x1 has no predictive value for the response y. Hence, if we would randomly permute the observed values for x1, then our predictions for y would hardly change. As a consequence, the MSE before and after permuting the observed values for x1 would be similar.
On the other hand, assume that x3 is strongly related to our response y. If we randomly permute the observed values for x3, then our MSE before and after this permutation would deviate considerably.
Based on these random permutations and our observed change in MSE, we may conclude that predictor variable x3 is more important than x1 in predicting y.

The R code below applies Breiman’s permutation method for computing the relative importance of predictor variables to a survival model. However, instead of MSE the code employs concordance as a measure for the prediction accuracy.
Furthermore, the R code also compares in 2 simulations the performance of Breiman’s method applied to survival models with that of Breiman’s method implemented in a random survival forest.

survival model variable importance

Suggestions and/or questions? Please contact Stefan Gelissen (email: info at datall-analyse.nl).
See this page for an overview of all of Stefan’s R code blog posts.

R code

library(survival)
library(mvtnorm)


#Function for computing importance of predictor variables in a survival model
#based on Breiman's permutation method.
#Note that only right censoring is allowed.
breiman <- function (model, time, event=rep(1,length(time)), covdata, nrep,
                     plot=TRUE){
  #model is an object returned by the survreg or coxph function
  #time is the follow up time
  #event is the status indicator
  #covdata is a data frame containing the covariates
  #nrep specifies the number of Monte Carlo simulations for computing the
  # variable importance
  #plot diplays a dotchart with the computed variable importance
  
  #calculate the concordance of model (will be used as benchmark)
  if (attr(model, "class")=="survreg") {
    cmod <- 1-survConcordance(Surv(cbind(time=time, event=event))~
                                predict(model, newdata=covdata))$concordance}
  else {
    cmod <- survConcordance(Surv(cbind(time=time, event=event))~
                              predict(model, newdata=covdata))$concordance}
  
  #compute variable importance
  ncov <- ncol(covdata)
  nobs <- nrow(covdata)
  
  vi <- sapply(1:ncov, function (i){
    simMat <- covdata[rep(seq(nobs), nrep), ]
    permutedCol <- as.vector(replicate(nrep, sample(covdata[,i], nobs, replace=FALSE)))
    indicatorNrep <- rep(1:nrep, each=nobs)
    simMat[,i] <- permutedCol
    if (attr(model, "class")=="survreg") {
      permuted.pred <- predict(model, newdata=simMat, type="response")
      scmod <- tapply(permuted.pred, indicatorNrep, function (j)
        1-survConcordance(Surv(cbind(time=time, event=event))~j)$concordance)}
    else {
      permuted.pred <- predict(model, newdata=simMat, type="risk")
      scmod <- tapply(permuted.pred, indicatorNrep, function (j)
        survConcordance(Surv(cbind(time=time, event=event))~j)$concordance)}
    
    mean((cmod-scmod)/cmod)})
  
  names(vi) <- colnames(covdata)
  
  #dotchart displaying variable importance
  if (plot==TRUE) {dotchart(vi, xlab="Relative importance")}
  
  return(vi)
}





#EXAMPLE

#generate data
N <- 200 #number of observations

#variance-covariance matrix for predictor variables
varcov <- matrix(c(1, 0, 0,
                   0, 40, 0,
                   0, 0, .25), ncol=3, byrow=TRUE)
dataS <- data.frame(rmvnorm(N, mean=c(2, 10, 1), sigma=varcov, method="chol"))
colnames(dataS) <- c("x1", "x2", "x3")

#some arbitrary function
dataS$y <- exp(2 + .5*dataS$x1 + .01*(dataS$x2)^2 + .3*dataS$x3 + rnorm(N)*.5)

#apply censoring (here: Type I censoring)
#failure=1, censored=0
dataS$Cens <- ifelse(dataS$y>1500, 0, 1)
N-sum(dataS$Cens) #number of censored observations
dataS$y <- ifelse(dataS$Cens==0, 1500, dataS$y)

#split data into training and validation set
d <- sample(nrow(dataS), size=.5*(nrow(dataS)), replace=FALSE)
#training data
dataT <- dataS[d,]
#validation data
dataV <- dataS[-d,]



#fit parametric survival model to the training data
#for identifying the correct functional form for predictor variables
#in a parametric survival model, see this blog post: http://blogs2.datall-analyse.nl/2016/02/19/rcode_martingale_residuals_aft_model/
modpsm <- survreg(Surv(y, Cens)~x1 + I(x2^2) + x3, data=dataT, dist="lognormal")
modpsm

#compute variable importance (using the validation data)
breiman(model=modpsm, time=dataV$y, event=dataV$Cens, covdata=dataV[,1:3],
        nrep=100)



#fit a Cox regression model to the training data
modcox <- coxph(Surv(y, Cens)~x1 + I(x2^2) + x3, data=dataT)
modcox

#compute variable importance (using the validation data)
breiman(model=modcox, time=dataV$y, event=dataV$Cens, covdata=dataV[,1:3],
        nrep=100)





##SIMULATIONS
#note: running these simulations will be time consuming, but the results
#are worth waiting for
library(copula)
library(randomForestSRC)
library(cluster)



#SIMULATION 1

##Generate data:
#- Five predictor variables (x1 to x5) and one response variable (y).
#- Correlations between these five predictors and log(y) are set to
# .8, .6., .5, .4, and .3.
#- Correlations between predictor variables are set to .2.
myGaussian <- mvdc(copula=normalCopula(param=c(.8,.6,.5,.4,.3,
                                               .2,.2,.2,.2,
                                               .2,.2,.2,
                                               .2,.2,
                                               .2),
                                       dispstr="un", dim=6),
                   margins=c("lnorm", "norm", "norm", "norm", "norm", "norm"),
                   paramMargins=list(list(meanlog=3, sdlog=1),
                                     list(mean=10, sd=1),
                                     list(mean=100, sd=10),
                                     list(mean=1, sd=.5),
                                     list(mean=50, sd=20),
                                     list(mean=5, sd=1)) )

#statistical population
statpop <- rMvdc(10000, myGaussian)
colnames(statpop) <- c("y", "x1", "x2", "x3", "x4", "x5")

#check generated data
cor(cbind(log(statpop[,1]), statpop[,-1])) #correlations
summary(lm(log(y)~., data=data.frame(statpop))) #fit ols-model to data

#apply censoring (here: Type I censoring)
#failure=1, censored=0
Cens <- ifelse(statpop[,1]>50, 0, 1)
10000-sum(Cens) #number of censored observations
statpop[,1] <- ifelse(Cens==0, 50, statpop[,1])



##Monte-Carlo experiment

#Run 500 Monte-Carlo trials. For each trial, draw a sample of 50 from the
#statistical population. Subsequently, compute for the trial sample the
#importance of each of the predictor variables (x1 to x5) .
simMat <- matrix(NA, nrow=500, ncol=10)

for(i in 1:500){
  #sample
  randSampleObs <- sample(10000, 2*50, replace=FALSE)
  randSampleCens <- Cens[randSampleObs]
  randSample <- data.frame(statpop[randSampleObs,])
  randSample$Cens <- randSampleCens
  colnames(randSample) <- c("y", "x1", "x2", "x3", "x4", "x5", "Cens")
  
  #split random sample into training and validation set
  d <- sample(nrow(randSample), size=.5*(nrow(randSample)), replace=FALSE)
  #training data
  randSampleT <-randSample[d,]
  #validation data (needed for Breiman's method for survival models)
  randSampleV <-randSample[-d,]
  
  #fit parametric survival model to training data
  modpsm <- survreg(Surv(y, Cens)~., data=randSampleT, dist="lognormal")
  #fit random forest to training data
  modrf <- rfsrc(Surv(y, Cens)~., data=randSampleT)
  
  #Breiman's variable importance for parametric survival model
  resultBreiman <- breiman(modpsm, randSampleV[,1], randSampleV[,7],
                           randSampleV[,-c(1,7)], 50, FALSE)
  simMat[i, 1:5] <- names(resultBreiman)[order(-resultBreiman)]
  #Breiman's variable importance for random forest
  resultBreimanRF <- vimp(modrf)$importance
  simMat[i, 6:10] <- names(resultBreimanRF)[order(-resultBreimanRF)]}

#extract results
Breiman <- simMat[,1:5] #Breiman's method applied to parametric surival model
BreimanRF <- simMat[,6:10] #Breiman's method applied to random forest



##Gower similarity
#assess the degree of similarity between the estimated and the true
#ranked importance of the predictor variables
trueRanking <- c("x1", "x2", "x3", "x4", "x5")

#results Breiman's variable importance for survival model
simBreiman <- apply(Breiman, 1, function (x)
  1-daisy(data.frame(rbind(trueRanking, x)), metric="gower")[1])
mBpsm <- mean(simBreiman); sdBpsm <- sd(simBreiman)

#results Breiman's variable importance for random forest
simBreimanRF <- apply(BreimanRF, 1, function (x)
  1-daisy(data.frame(rbind(trueRanking, x)), metric="gower")[1])
mBrf <- mean(simBreimanRF); sdBrf <- sd(simBreimanRF)

#plot results
#Breiman (psm) = Breiman's variable importance method applied to survival model
#Breiman (rf) = Breiman's variable importance method applied to random forest
simGower <- cbind(c(mBrf, mBpsm), c(mBrf-sdBrf, mBpsm-sdBpsm),
                  c(mBrf+sdBrf, mBpsm+sdBpsm))
rownames(simGower) <- c("Breiman (rf)", "Breiman (psm)")
dotchart(simGower[,1], xlab="Gower Similarity",
         xlim=c(min(simGower[,2]), max(simGower[,3])))
segments(simGower[,2], 1:2, simGower[,3], 1:2, col="darkgray")



##Mean Ranked Variable Importance
#For each of the 500 Monte-Carlo trails, the predictor variables are
#ranked according to their computed variable importance value.
#The predictor variable with the highest calculated importance
#value is assigned position 1 in the ranking, the variable with the
#second highest value position 2, et cetera.
#The Mean Ranked Variable Importance is the average position calculated
#over 500 trials.

#results Breiman's variable importance for parametric survival model
rankBreiman <- apply(Breiman, 1, order)
mrBpsm <- apply(rankBreiman, 1, mean); sdrBpsm <- apply(rankBreiman, 1, sd)
lclBpsm <- mrBpsm-sdrBpsm; uclBpsm <- mrBpsm+sdrBpsm

#results Breiman's variable importance for random forest
rankBreimanRF <- apply(BreimanRF, 1, order)
mrBrf <- apply(rankBreimanRF, 1, mean); sdrBrf <- apply(rankBreimanRF, 1, sd)
lclBrf <- mrBrf-sdrBrf; uclBrf <- mrBrf+sdrBrf

#plot results
#Breiman (psm) = Breiman's variable importance method applied to survival model
#Breiman (rf) = Breiman's variable importance method applied to random forest
simRank <- c(mrBpsm, mrBrf)
names(simRank) <- rep(trueRanking, times=2)

dotchart(simRank, xlab="Mean Ranked Variable Importance",
         groups=factor(rep(c("1. Breiman (psm)", "2. Breiman (rf)"), each=5)),
         xlim=c(min(c(lclBpsm, lclBrf)), max(c(uclBpsm, uclBrf))),
         cex=.8, col="darkgreen")
segments(lclBpsm, 8:12, uclBpsm, 8:12, col="darkgray")
segments(lclBrf, 1:5, uclBrf, 1:5, col="darkgray")
abline(v=c(1,2,3,4,5), col="gray", lty=2)


#Breiman's method applied to a parametric survival model seems to
#predict more precisely the correct order of variable importances
#compared to the same method implemented in a random forest.





#SIMULATION 2

##Generate data:
#- Five predictor variables (x1 to x5) and one response variable (y).
#- Correlations between these five predictors and log(y) are set to
# .8, .6., .4, .2, and .0.
#- Correlations between predictor variables are set to .2.
#- Mean of each predictor variable is zero, and the variance is 1.
myGaussian <- mvdc(copula=normalCopula(param=c(.8,.6,.4,.2,0,
                                               .2,.2,.2,.2,
                                               .2,.2,.2,
                                               .2,.2,
                                               .2),
                                       dispstr="un", dim=6),
                   margins=c("lnorm", "norm", "norm", "norm", "norm", "norm"),
                   paramMargins=list(list(meanlog=3, sdlog=1),
                                     list(mean=0, sd=1),
                                     list(mean=0, sd=1),
                                     list(mean=0, sd=1),
                                     list(mean=0, sd=1),
                                     list(mean=0, sd=1)) )

#statistical population
statpop <- rMvdc(10000, myGaussian)
colnames(statpop) <- c("y", "x1", "x2", "x3", "x4", "x5")

#check generated data
cor(cbind(log(statpop[,1]), statpop[,-1])) #correlations
summary(lm(log(y)~., data=data.frame(statpop))) #fit ols-model to data
#Why has the coefficient for x5 a (large) negative value? Predictor variable
#x5 is a classical example of, what is called in regression modeling, a suppressor.
#That is, there is a zero correlation between x5 and the response variable in
#the statistical population. At the same time, x5 is correlated with all the
#other predictor variables (x1 to x4) in the population.
#Furthermore, all these other predictors (x1 to x4) are positively correlated with
#the response variable in the population. Under such circumstances, the coefficient
#of the suppressor in a regression model usually has a negative value.

#apply censoring (here: Type I censoring)
#failure=1, censored=0
Cens <- ifelse(statpop[,1]>50, 0, 1)
10000-sum(Cens) #number of censored observations
statpop[,1] <- ifelse(Cens==0, 50, statpop[,1])



##Monte-Carlo experiment

#Run 500 Monte-Carlo trials. For each trial, draw a sample of 50 from the
#statistical population. Subsequently, compute for the trial sample the
#importance of each of the predictor variables (x1 to x5) .
simMat <- matrix(NA, nrow=500, ncol=10)

for(i in 1:500){
  #sample
  randSampleObs <- sample(10000, 2*50, replace=FALSE)
  randSampleCens <- Cens[randSampleObs]
  randSample <- data.frame(statpop[randSampleObs,])
  randSample$Cens <- randSampleCens
  colnames(randSample) <- c("y", "x1", "x2", "x3", "x4", "x5", "Cens")
  
  #split random sample into training and validation set
  d <- sample(nrow(randSample), size=.5*(nrow(randSample)), replace=FALSE)
  #training data
  randSampleT <-randSample[d,]
  #validation data (needed for Breiman's method for surival models)
  randSampleV <-randSample[-d,]
  
  #fit parametric survival model to training data
  modpsm <- survreg(Surv(y, Cens)~., data=randSampleT, dist="lognormal")
  #fit random forest to training data
  modrf <- rfsrc(Surv(y, Cens)~., data=randSampleT)
  
  #Breiman's variable importance for parametric survival model
  resultBreiman <- breiman(modpsm, randSampleV[,1], randSampleV[,7],
                           randSampleV[,-c(1,7)], 50, FALSE)
  simMat[i, 1:5] <- names(resultBreiman)[order(-resultBreiman)]
  #Breiman's variable importance for random forest
  resultBreimanRF <- vimp(modrf)$importance
  simMat[i, 6:10] <- names(resultBreimanRF)[order(-resultBreimanRF)]}

#extract results
Breiman <- simMat[,1:5] #Breiman's method applied to parametric survival model
BreimanRF <- simMat[,6:10] #Breiman's method applied to random forest



##Gower similarity
#assess the degree of similarity between the estimated and the true
#ranked importance of the predictor variables
trueRanking <- c("x1", "x2", "x3", "x4", "x5")

#results Breiman's variable importance for survival model
simBreiman <- apply(Breiman, 1, function (x)
  1-daisy(data.frame(rbind(trueRanking, x)), metric="gower")[1])
mBpsm <- mean(simBreiman); sdBpsm <- sd(simBreiman)

#results Breiman's variable importance for random forest
simBreimanRF <- apply(BreimanRF, 1, function (x)
  1-daisy(data.frame(rbind(trueRanking, x)), metric="gower")[1])
mBrf <- mean(simBreimanRF); sdBrf <- sd(simBreimanRF)

#plot results
#Breiman (psm) = Breiman's variable importance method applied to survival model
#Breiman (rf) = Breiman's variable importance method applied to random forest
simGower <- cbind(c(mBrf, mBpsm), c(mBrf-sdBrf, mBpsm-sdBpsm),
                  c(mBrf+sdBrf, mBpsm+sdBpsm))
rownames(simGower) <- c("Breiman (rf)", "Breiman (psm)")
dotchart(simGower[,1], xlab="Gower Similarity",
         xlim=c(min(simGower[,2]), max(simGower[,3])))
segments(simGower[,2], 1:2, simGower[,3], 1:2, col="darkgray")



##Mean Ranked Variable Importance

#results Breiman's variable importance for parametric survival model
rankBreiman <- apply(Breiman, 1, order)
mrBpsm <- apply(rankBreiman, 1, mean); sdrBpsm <- apply(rankBreiman, 1, sd)
lclBpsm <- mrBpsm-sdrBpsm; uclBpsm <- mrBpsm+sdrBpsm

#results Breiman's variable importance for random forest
rankBreimanRF <- apply(BreimanRF, 1, order)
mrBrf <- apply(rankBreimanRF, 1, mean); sdrBrf <- apply(rankBreimanRF, 1, sd)
lclBrf <- mrBrf-sdrBrf; uclBrf <- mrBrf+sdrBrf

#plot results
#Breiman (psm) = Breiman's variable importance method applied to surival model
#Breiman (rf) = Breiman's variable importance method applied to random forest
simRank <- c(mrBpsm, mrBrf)
names(simRank) <- rep(trueRanking, times=2)

dotchart(simRank, xlab="Mean Ranked Variable Importance",
         groups=factor(rep(c("1. Breiman (psm)", "2. Breiman (rf)"), each=5)),
         xlim=c(min(c(lclBpsm, lclBrf)), max(c(uclBpsm, uclBrf))),
         cex=.8, col="darkgreen")
segments(lclBpsm, 8:12, uclBpsm, 8:12, col="darkgray")
segments(lclBrf, 1:5, uclBrf, 1:5, col="darkgray")
abline(v=c(1,2,3,4,5), col="gray", lty=2)


#Of both methods, Breiman's method applied to a parametric survival model
#seems to do the worst job of predicting the correct variable importance
#order for this data.

#For explaining the results of Breiman's method applied to survival models,
#it is useful to look once again at the ols-model fitted to the
#statistical population.
summary(lm(log(y)~., data=data.frame(statpop))) #fit ols-model to data
#Now, take the absolute values of the ols-coefficients for x1 to x5,
#and order them from large to small.
#The resulting order is x1, x2, x5, x3, and x4. This order resembles the
#order of the ranked variable importances we found for Breiman's method
#applied to survival model (see dotchart for Mean Ranked Variables Importance).
#Note that it only makes sense to compare the absolute values of the
#ols-coefficients when the predictor variables are measured on the same scale.
#Remember that this was the case with the predictor variables employed in
#Simulation 1 (variables x1 to x5 all had mean 0 and variance 1).





#To conclude, Simulation 1 demonstrated that Breiman's method
#applied to a parametric surival may predict more precisely the correct
#order of variable importances when compared to the same method implemented
#in a random survival forest.
#However, Simulation 2 demonstrated that under specific circumstances the results
#of Breiman's method applied to a parametric survival model should be
#interpreted with care. More specifically, when one of the predictor variables
#is expected to be a suppressor, then Breiman's method applied to a parametric
#survival model may yield invalid variable importance results for some
#of the predictor variables in the model. Under such circumstances, it
#is better not to include the supressor as a predictor in the parametric
#survival model.