R code for fitting a model to longitudinal data with a copula

The R code below demonstrates how to fit a model to longitudinal data by means of a copula. Longitudinal data is also referred to as panel data or repeated measures data.

The R code also shows how to create forecasts for longitudinal data, and how to compute prediction intervals for these forecasts.

Note that the code in this blog post focuses on the analysis of balanced longitudinal data. However, in my next blog post I will show how to handle unbalanced longitudinal data.

Two-stage parametric ML method
The copula in the R code is fitted using the two-stage parametric ML approach (also referred to as the Inference Functions for Margins [IFM] method). This method fits a copula in two steps:

  1. Estimate the parameters of the marginals.
  2. Fix the marginal parameters at the values estimated in the first step, and subsequently estimate the copula parameters (a minimal toy example of these two steps is sketched below).
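
Below is a minimal sketch of the two stages on a toy bivariate example. The data and marginal choices in this sketch are hypothetical and purely illustrative; the full longitudinal implementation follows in the R code section.

library(copula)
set.seed(1)
#hypothetical data: normal margins coupled by a Gaussian copula
z <- rMvdc(500, mvdc(normalCopula(.5), c("norm", "norm"),
                     list(list(mean=0, sd=1), list(mean=2, sd=3))))
#stage 1: estimate the marginal parameters (the sample moments are used here
#for brevity; they are close to the normal MLEs)
m1 <- c(mean(z[,1]), sd(z[,1])); m2 <- c(mean(z[,2]), sd(z[,2]))
#stage 2: fix the margins, map the data to [0,1] with the estimated cdfs,
#and estimate the copula parameter by ML
u <- cbind(pnorm(z[,1], m1[1], m1[2]), pnorm(z[,2], m2[1], m2[2]))
fitCopula(normalCopula(), u, method="ml")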

Recommended literature
For those interested in modeling longitudinal data with copulas, I recommend reading:

  • Joe, H. (2015), Dependence Modeling with Copulas, Boca Raton, FL: CRC Press (see pp. 322–327 for an example of modeling longitudinal data).
  • Lambert, P., and Vandenhende, F. (2002), A copula-based model for multivariate non-normal longitudinal data, Statistics in Medicine, 21, 3197–3217.
  • Frees, E., and Wang, P. (2005), Credibility using copulas, North American Actuarial Journal, 9, 31–48.

Suggestions and/or questions? Please contact Stefan Gelissen (email: info at datall-analyse.nl).
See this page for an overview of all of Stefan's R code blog posts.

R code

library(copula)
library(SPREDA)

#We are going to work with a Gumbel distribution (in its smallest extreme value
#form, as provided by SPREDA's psev, dsev, qsev, and rsev functions), so first
#we define functions for the cdf, pdf, quantile function, and random number
#generation of this distribution under a location-scale parametrization.
pmyGumbel <- function(x, location=0, scale=1) {
  xt <- (x-location)/scale
  psev(xt)
}

dmyGumbel <- function(x, location=0, scale=1) {
  xt <- (x-location)/scale
  dsev(xt)/scale
}

qmyGumbel <- function(p, location=0, scale=1) {
  location + qsev(p)*scale
}

rmyGumbel <- function(n, location=0, scale=1) {
  location + rsev(n)*scale
}
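
#Optional sanity check for the helper functions above (a quick self-test,
#not part of the analysis): the quantile function should invert the cdf
round(pmyGumbel(qmyGumbel(c(.1, .5, .9), location=2, scale=.5),
                location=2, scale=.5), 6) #should return .1 .5 .9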



##Generate some artificial data using an n-dimensional Gaussian copula

#n-dimensional Gaussian copula
ndim <- 20 #number of dimensions
myCop <- normalCopula(param=.6, dim=ndim, dispstr="ex")

#Marginals:
#all n marginals follow a Gumbel distribution with location=0 and scale=.1
myMvdc <- mvdc(copula=myCop, rep("myGumbel", ndim),
               rep(list(list(location=0, scale=.1)), ndim))

#The Gumbel distribution is skewed to the left
plot(1, type="n", xlab="x", ylab="f(x)", xlim=c(-.5, .2), ylim=c(0, 4))
curve(dmyGumbel(x, location=0, scale=.1), add=TRUE)

#Draw sample (this sample will subsequently be used as the artificial data)
ns <- 120 #sample size
rs <- rMvdc(ns, myMvdc)

#Inspect the data from the first 5 cases, and observe that there is considerable
#individual heterogeneity (that is, values within the same case tend
#to be more alike than values across cases) and serial dependence (that is,
#values at different time points within the same case are correlated).
matplot(t(rs[1:5,]), type="b", pch=1, lty=1, ylab="y(t)", xlab="t-index")

#Next, transform the generated data by employing a nonlinear model.
#The employed nonlinear model is an asymptotic function through the origin.
plot(1, type="n", xlab="x", ylab="f(x)", xlim=c(0, 6000), ylim=c(-1.4, 0))
curve(-1.4*(1-exp(-exp(-7.5)*x)), add=TRUE) #nonlinear function

#Transform the data
t <- rep(seq(250, 250*ndim, 250), times=ns) #time points
phi1 <- -1.4 #asymptote
x1 <- rep(c(-2, 0, 2), each=(ns/3)*ndim) #covariate values
phi2 <- -7.5 + .7*x1 #rate of change depending on covariate x1
ys <- phi1*(1-exp(-exp(phi2)*t)) + as.vector(t(rs)) #response
dataL <- data.frame(id=rep(1:ns, each=ndim), ys , time=t, x1) #combine data

#Visualize the transformed data
matplot(t(matrix(as.vector(dataL[,2]), ncol=ndim, byrow=TRUE)), type="b", pch=1,
        lty=1, col=rep(c("red", "darkgreen", "blue"), each=ns/3),
        ylab="y(t)", xlab="t-index")
legend("topright", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("red", "darkgreen", "blue"),
       legend=c("x1 = -2", "x1 = 0", "x1 = 2"),
       bty="n", y.intersp=1.2, cex=.7)

#Inspect 5 cases with value x1=2, and notice the individual heterogeneity
#and serial dependence
dataLselect <- dataL[which(x1==2),]
matplot(t(matrix(as.vector(dataLselect[,2]), ncol=ndim, byrow=TRUE))[,1:5],
        type="b", pch=1, lty=1, ylab="y(t)", xlab="t-index")



##Fit an n-dimensional Gaussian copula to the artificial data
#by means of the two-stage parametric ML estimation method, also referred
#to as the method of Inference Functions for Margins (IFM).


#Stage 1 of the IFM: fit the marginals, and obtain the marginal parameters
#For obtaining the marginal parameters, we will fit a nonlinear model to the data

#Use the getInitial function for SSasympOrig() for obtaining start values
#for phi1 and beta0 (note: phi1=Asym and beta0=lrc)
(start <- getInitial(ys ~ SSasympOrig(time, Asym, lrc), data = dataL))

#Define the log-likelihood function for the nonlinear model
llik <- function (theta, y, t, cov) {
  #parameters
  phi1 <- theta[1]
  beta0 <- theta[2]
  beta1 <- theta[3]
  sigma <- exp(theta[4])
  
  phi2 <- beta0 + beta1*cov
  mu <- phi1*(1-exp(-exp(phi2)*t))
  
  #log-likelihood
  sum(log( dsev((y-mu) / sigma) / sigma ))
}

#For obtaining a start value for sigma (that is, the scale of the Gumbel
#distribution), set beta1=0, and inspect the resulting log-likelihood function
sigmas <- seq(.001, 5, length.out=100)
startSigma <- sapply(1:length(sigmas),
                     function (i) llik(theta=c(start[1], start[2], 0, log(sigmas[i])),
                                       y=dataL$ys, t=dataL$time, cov=dataL$x1))

#Plot the resulting log-likelihood function, and notice that the maximum of
#the log-likelihood is at about .5. Hence, a plausible start value for sigma is .5
plot(sigmas, startSigma, xlab="sigma", ylab="log-likelihood", type="l")

#Now that we have plausible start values, fit the nonlinear model by
#maximizing the log-likelihood function.
#Note that the log transformation of sigma constrains the scale parameter
#of the Gumbel distribution to be positive.
nlmod <- optim(c(start[1], start[2], 0, log(.5)), llik,
               control=list(fnscale=-1),
               y=dataL$ys, t=dataL$time, cov=dataL$x1)
nlmod #results
nlmod$par #MLEs (the true values were -1.4, -7.5, .7, log(.1)=-2.3, respectively)
exp(nlmod$par[4]) #the MLE for scale parameter

#Based on the estimated parameters, subsequently compute the cumulative
#probabilities for the marginals
phi1hat <- nlmod$par[1]; beta0hat <- nlmod$par[2]
beta1hat <- nlmod$par[3]; sigmahat <- exp(nlmod$par[4])
phi2hat <- cbind(1, dataL$x1)%*%c(beta0hat, beta1hat)
muhat <- phi1hat*(1-exp(-exp(phi2hat)*dataL$time))

udat <- matrix(pmyGumbel(dataL$ys, location=muhat, scale=sigmahat),
               ncol=ndim, byrow=TRUE)


#Stage 2 of the IFM: fix the parameters of the marginals and
#estimate the copula parameter

#Starting value for the copula parameter: convert the pairwise Kendall's tau
#estimates to the Gaussian copula correlations (rho = sin(pi*tau/2))
rhoEst <- sin(cor(udat, method="kendall")*pi/2)

#Fit a Gaussian copula with an exchangeable structure. Note that it is also
#possible to fit other dispersion matrices (for instance, AR(1) and Toeplitz)
#and other copulas (for instance, a Gumbel or Frank copula), as sketched below.
copEst <- fitCopula(myCop, udat, start=mean(rhoEst[upper.tri(rhoEst)]),
                    method="ml", estimate.variance=TRUE)
copEst
#MLE for the copula parameter (remember: the true value for the parameter was .6)
copEst@estimate
#Standard error for MLE copula parameter
SEparms <- sqrt(copEst@var.est)
#Construct 95% confidence interval for the copula parameter
(ci <- copEst@estimate + c(-1, 1)*qnorm(1-.05/2)*SEparms)
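
#As noted above, other dependence structures can be fitted in exactly the same
#way; for instance (constructors only, with arbitrary starting parameters;
#fitting then proceeds as with myCop above):
copAR1 <- normalCopula(param=.6, dim=ndim, dispstr="ar1") #AR(1) dispersion
copFrank <- frankCopula(param=5, dim=ndim) #Frank copula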

#When using the two-stage parametric ML method, the standard error of the MLE
#for the copula parameter is usually underestimated (i.e., too small).
#The reason for this underestimation is that the marginal MLEs were fixed at
#the second stage of the fitting process. As a consequence, when estimating
#the copula parameter at this second stage, variation in the MLEs of the
#marginals is not taken into account.
#You may use a resampling method (such as a parametric bootstrap method)
#for estimating standard errors that include the variability of marginal MLEs.
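
#A rough sketch of such a parametric bootstrap (an illustration that reuses
#the objects fitted above; nBoot is kept small here because refitting both
#stages in every replicate is slow):
nBoot <- 50 #increase in practice
bootRho <- replicate(nBoot, {
  #simulate a new sample from the fitted model (fitted copula and margins)
  eB <- rMvdc(ns, mvdc(normalCopula(copEst@estimate, dim=ndim, dispstr="ex"),
                       rep("myGumbel", ndim),
                       rep(list(list(location=0, scale=sigmahat)), ndim)))
  ysB <- as.vector(muhat) + as.vector(t(eB))
  #stage 1: refit the marginals
  nlB <- optim(nlmod$par, llik, control=list(fnscale=-1),
               y=ysB, t=dataL$time, cov=dataL$x1)
  phi2B <- cbind(1, dataL$x1)%*%nlB$par[2:3]
  muB <- nlB$par[1]*(1-exp(-exp(phi2B)*dataL$time))
  uB <- matrix(pmyGumbel(ysB, location=muB, scale=exp(nlB$par[4])),
               ncol=ndim, byrow=TRUE)
  #stage 2: refit the copula with the refitted margins fixed
  fitCopula(myCop, uB, start=copEst@estimate, method="ml",
            estimate.variance=FALSE)@estimate
})
sd(bootRho) #bootstrap SE that includes the variability of the marginal MLEs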



##Forecasting
#It is possible to predict (or forecast) the values at future time points for
#individual cases. The predicted values are quantiles (such as the median).
#For generating these predictions, a subcopula of the fitted n-dimensional copula
#is recursively applied. See the predictU function below for what these subcopulas
#are and how they are constructed.
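
#To see what the conditional step inside predictU computes, compare cCopula
#(the conditional/Rosenblatt transform) with the known closed form for a
#bivariate Gaussian copula, C(u2|u1) = pnorm((qnorm(u2)-rho*qnorm(u1))/
#sqrt(1-rho^2)) (a quick illustrative check; rhoDemo is arbitrary):
rhoDemo <- .6
cCopula(matrix(c(.3, .7), nrow=1), copula=normalCopula(rhoDemo), indices=2)
pnorm((qnorm(.7) - rhoDemo*qnorm(.3))/sqrt(1-rhoDemo^2)) #same value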

#Function for generating predictions.
#Caution: this is a tailor-made function for a Gaussian copula with
#an exchangeable dispersion matrix.
predictU <- function (observedU, nAhead, p, simplify=TRUE) {
  #-observedU is a vector of initial observations for an individual case.
  # The values in this vector are estimated cumulative probabilities, and hence
  # all should be within the range [0, 1].
  #-nAhead is the number of steps ahead for which to predict the p-th quantile of
  # the individual case.
  #-p is a number specifying the quantile to be estimated, where p
  # should be within the range [0, 1].
  #-if simplify is TRUE, then only return the predicted quantiles. If FALSE, then
  # also return the conditional medians that are plugged in during the recursion.
  
  nobs <- length(observedU)
  maxtime <- nobs+nAhead
  
  udpred <- rep(NA, maxtime)
  ui <- rep(NA, maxtime)
  
  udpred[1:nobs] <- observedU
  ui[1:nobs] <- observedU
  
  #predict quantiles at nAhead time points
  for (i in (nobs+1):maxtime) {
    uisub <- ui[1:(i-1)]
    
    #construct the subcopula:
    #-this subcopula is, similar to its parent n-dimensional copula,
    # a Gaussian copula with an exchangeable dispersion matrix
    #-in contrast to its n-dimensional parent, the subcopula 
    # has i dimensions
    #-finally, the dispersion matrix of the subcopula is a submatrix
    # of the dispersion matrix of the n-dimensional parent
    subcopula <- normalCopula(param=copEst@estimate, dim=i, dispstr="ex")
    
    #using the subcopula, compute the p-th quantile at time point i,
    #given the values at all previous time points
    ud <- NA
    udfun <- function (ud, uisub, p) {p - cCopula(t(as.matrix(c(uisub, ud))),
                                                  copula=subcopula,
                                                  indices=i)}
    try(ud <- uniroot(udfun, interval=c(1-.99999999999999, .99999999999999),
                      uisub=uisub, p=p, tol=1e-5)$root, silent=FALSE)
    udpred[i] <- ud
    
    #using the subcopula, also compute a point prediction at time point i,
    #given the values at all previous time points; the conditional median
    #(p=.5) serves as this point prediction and is plugged in at the
    #next step of the recursion
    uilik <- NA
    try(uilik <- uniroot(udfun, interval=c(1-.99999999999999, .99999999999999),
                         uisub=uisub, p=.5, tol=1e-5)$root, silent=FALSE)
    ui[i] <- uilik
  }
  if (simplify) udpred
  else list(quantile=udpred, ui=ui)
}


#1) Generate a holdout sample for cases with x1=-2
ns2 <- 400 #sample size
rs2 <- rMvdc(ns2, myMvdc) 

x1 <- rep(-2, times=ns2*ndim) #value on covariate x1
t <- rep(seq(250, 250*ndim, 250), times=ns2) #time points
phi1 <- -1.4 #true value for phi1 (=asymptote)
phi2 <- -7.5 + .7*x1 #true value for phi2 (=rate of change)
ys <- phi1*(1-exp(-exp(phi2)*t)) + as.vector(t(rs2)) #response
dataL2 <- data.frame(id=rep(1:ns2, each=ndim), ys , time=t, x1) #combine data


#2) Based on the parameters we estimated above, compute the cumulative
#probabilities for the marginals
phi2hat2 <- cbind(1, dataL2$x1)%*%c(beta0hat, beta1hat)
muhat <- phi1hat*(1-exp(-exp(phi2hat2)*dataL2$time))
udat2 <- matrix(pmyGumbel(dataL2$ys, location=muhat, scale=sigmahat),
                ncol=ndim, byrow=TRUE)


#3) Forecasts for 4 individual cases of the holdout sample with x1=-2
obs <- 10 #number of initial observations
nAheads <- 10 #number of steps ahead for the predictions
casenr <- c(1:4) #select the first 4 cases of the sample

#Predictions without taking the individual heterogeneity into account
phi2s <- cbind(1, x1[casenr[1]])%*%c(beta0hat, beta1hat)
mus <- phi1hat*(1-exp(-exp(phi2s)*seq(250, 250*ndim, 250))) 

#Generate predictions for these 4 individual cases, but this time taking the
#individual heterogeneity into account. The predictions are the medians 10 steps
#ahead, given 10 initial observations.
pred <- sapply(1:4, function (i) {
  case <- udat2[casenr[i], 1:ndim]
  upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
  qmyGumbel(upred50, location=mus, scale=sigmahat)})

#Plot the results:
#-black lines are the observations
#-red lines are the predicted medians taking individual heterogeneity into account
#-blue lines are predictions not taking the individual heterogeneity into account
#-green line is the time point at which the forecasting starts (that is, after
# the first 10 initial observations)
par(mfrow=c(2, 2))
matplot(cbind(dataL2[which(dataL2$id==1), 2], pred[,1]), type="b",
        pch=1, lty=1, ylim=c(-1.2,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[1], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==2), 2], pred[,2]), type="b",
        pch=1, lty=1, ylim=c(-1.2,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[2], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==3), 2], pred[,3]), type="b",
        pch=1, lty=1, ylim=c(-1.2,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[3], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==4), 2], pred[,4]), type="b",
        pch=1, lty=1, ylim=c(-1.2,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[4], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")
par(mfrow=c(1, 1))


#4) Have a closer look at a single individual case
casenr <- 3
case <- udat2[casenr, 1:ndim]
caseData <- dataL2[which(dataL2$id==casenr),]

#Predict the median, 5th, and 95th quantiles for this individual case.
#The 5th and 95th quantiles together comprise a 90% prediction interval.
upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
upred95 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.95)
upred05 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.05)

y50 <- qmyGumbel(upred50, location=mus, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=mus, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=mus, scale=sigmahat)

#Plot the results, and note that approximately 1 (out of 10) predictions
#should lie outside the boundaries of the 90% prediction interval.
matplot(cbind(caseData$ys, y50, y95, y05), type="b", pch=1, lty=1,
        col=c("black", "red", "blue", "blue"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr, "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#5) Similar, but now start with only 2 initial observations.
upred50 <- predictU(observedU=case[1:2], nAhead=18, p=.5)
upred95 <- predictU(observedU=case[1:2], nAhead=18, p=.95)
upred05 <- predictU(observedU=case[1:2], nAhead=18, p=.05)

y50 <- qmyGumbel(upred50, location=mus, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=mus, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=mus, scale=sigmahat)

#Plot the results
matplot(cbind(caseData$ys, y50, y95, y05), type="b", pch=1, lty=1,
        col=c("black", "red", "blue", "blue"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr, "; x1 = -2"))
abline(v=2, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#6) Make forecasts for time points in the range from 5250 to 7500.
#Remember that the model was fitted on observations in the range [250, 5000],
#thus time points in the range [5250, 7500] are new territory for the model.

#Select data from a single case in the holdout sample
casenr <- 3
case <- udat2[casenr, 1:ndim]
caseData <- dataL2[which(dataL2$id==casenr),]

obs <- length(udat2[casenr,]) #number of initial observations in the range [250, 5000]
nAheads <- 10 #number of steps ahead for the forecasts
timeAtForecasts <- seq(5250, 7500, 250) #time points at the steps ahead

phi2hat2 <- cbind(1, rep(x1[casenr], obs+nAheads))%*%c(beta0hat, beta1hat)
muhat <- phi1hat*(1-exp(-exp(phi2hat2)*c(caseData$time, timeAtForecasts)))

#Generate forecasts
upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
upred95 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.95)
upred05 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.05)

y50 <- qmyGumbel(upred50, location=muhat, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=muhat, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=muhat, scale=sigmahat)

#Plot the results
plot(1:obs, caseData$ys, type="b", xlim=c(1,30), ylim=c(-1.5, 0),
     ylab="y(t)", xlab="t-index")
lines(21:30, y50[21:30], type="l", col="red")
lines(21:30, y95[21:30], type="l", col="blue")
lines(21:30, y05[21:30], type="l", col="blue")
abline(v=obs, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,NA,NA),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#7) We now have a look at the coverage probability of the predicted quantiles.
#That is,
#-for the 5th quantile, 5% (=nominal coverage probability) of the observed values
# are expected to be lower than the predicted 5th quantile
#-for the median, 50% of the observations are expected to be lower than the
# predicted 50th quantile
#-for the 95th quantile, 95% are expected to be lower than the predicted
# 95th quantile.

#Coverage probabilities of 1 step ahead predictions
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.5)[11]
  upred95 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.95)[11]
  upred05 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.05)[11]
  c(sum(uobs<=upred50), sum(uobs<=upred95), sum(uobs<=upred05))})

sum(cp[1,])/(ns2) #nominal coverage probability is .50
sum(cp[2,])/(ns2) #nominal coverage probability is .95
sum(cp[3,])/(ns2) #nominal coverage probability is .05

#Coverage probabilities of 5 steps ahead predictions
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11:15]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.5)[11:15]
  upred95 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.95)[11:15]
  upred05 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.05)[11:15]
  c(sum(uobs<=upred50), sum(uobs<=upred95), sum(uobs<=upred05))})

sum(cp[1,])/(ns2*5) #nominal coverage probability is .50
sum(cp[2,])/(ns2*5) #nominal coverage probability is .95
sum(cp[3,])/(ns2*5) #nominal coverage probability is .05

#Coverage probability of 10 steps ahead predictions for the median
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11:20]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=10, p=.5)[11:20]
  c(sum(uobs<=upred50))})

sum(cp)/(ns2*10) #nominal coverage probability is .50



##Repeat the same procedure, but this time set x1=0

#1) Generate a holdout sample for cases with x1=0
ns2 <- 400 #sample size
rs2 <- rMvdc(ns2, myMvdc)

x1 <- rep(0, times=ns2*ndim) #value on covariate x1
t <- rep(seq(250, 250*ndim, 250), times=ns2) #time points
phi1 <- -1.4 #true value for phi1 (=asymptote)
phi2 <- -7.5 + .7*x1 #true value for phi2 (=rate of change)
ys <- phi1*(1-exp(-exp(phi2)*t)) + as.vector(t(rs2)) #response
dataL2 <- data.frame(id=rep(1:ns2, each=ndim), ys , time=t, x1) #combine data


#2) Based on the parameters we estimated above, compute the cumulative
#probabilities for the marginals
phi2hat2 <- cbind(1, dataL2$x1)%*%c(beta0hat, beta1hat)
muhat <- phi1hat*(1-exp(-exp(phi2hat2)*dataL2$time))
udat2 <- matrix(pmyGumbel(dataL2$ys, location=muhat, scale=sigmahat),
                ncol=ndim, byrow=TRUE)


#3) Forecasts for 4 individual cases of the holdout sample with x1=0
obs <- 10 #number of initial observations
nAheads <- 10 #number of steps ahead for the predictions
casenr <- c(1:4) #select the first 4 cases of the sample

#Predictions without taking the individual heterogeneity into account
phi2s <- cbind(1, x1[casenr[1]])%*%c(beta0hat, beta1hat)
mus <- phi1hat*(1-exp(-exp(phi2s)*seq(250, 250*ndim, 250))) 

#Generate predictions for these 4 individual cases, but this time taking the
#individual heterogeneity into account. The predictions are the medians 10 steps
#ahead, given 10 initial observations.
pred <- sapply(1:4, function (i) {
  case <- udat2[casenr[i], 1:ndim]
  upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
  qmyGumbel(upred50, location=mus, scale=sigmahat)})

#Plot the results:
#-black lines are the observations
#-red lines are the predicted medians taking individual heterogeneity into account
#-blue lines are predictions not taking the individual heterogeneity into account
#-green line is the time point at which the forecasting starts (that is, after
# the first 10 initial observations)
par(mfrow=c(2, 2))
matplot(cbind(dataL2[which(dataL2$id==1), 2], pred[,1]), type="b",
        pch=1, lty=1, ylim=c(-1.9,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[1], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==2), 2], pred[,2]), type="b",
        pch=1, lty=1, ylim=c(-1.9,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[2], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==3), 2], pred[,3]), type="b",
        pch=1, lty=1, ylim=c(-1.9,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[3], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==4), 2], pred[,4]), type="b",
        pch=1, lty=1, ylim=c(-1.9,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[4], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")
par(mfrow=c(1, 1))


#4) Have a closer look at a single individual case
casenr <- 3
case <- udat2[casenr, 1:ndim]
caseData <- dataL2[which(dataL2$id==casenr),]

#Predict the median, 5th, and 95th quantiles for this individual case.
#The 5th and 95th quantiles together comprise a 90% prediction interval.
upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
upred95 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.95)
upred05 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.05)

y50 <- qmyGumbel(upred50, location=mus, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=mus, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=mus, scale=sigmahat)

#Plot the results, and note that approximately 1 (out of 10) predictions
#should lie outside the boundaries of the 90% prediction interval.
matplot(cbind(caseData$ys, y50, y95, y05), type="b", pch=1, lty=1,
        col=c("black", "red", "blue", "blue"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr, "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#5) Similar, but now start with only 2 initial observations.
upred50 <- predictU(observedU=case[1:2], nAhead=18, p=.5)
upred95 <- predictU(observedU=case[1:2], nAhead=18, p=.95)
upred05 <- predictU(observedU=case[1:2], nAhead=18, p=.05)

y50 <- qmyGumbel(upred50, location=mus, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=mus, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=mus, scale=sigmahat)

#Plot the results
matplot(cbind(caseData$ys, y50, y95, y05), type="b", pch=1, lty=1,
        col=c("black", "red", "blue", "blue"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr, "; x1 = -2"))
abline(v=2, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#6) We now have a look at the coverage probability of the predicted quantiles.
#That is,
#-for the 5th quantile, 5% (=nominal coverage probability) of the observed values
# are expected to be lower than the predicted 5th quantile
#-for the median, 50% of the observations are expected to be lower than the
# predicted 50th quantile
#-for the 95th quantile, 95% are expected to be lower than the predicted
# 95th quantile.

#Coverage probabilities of 1 step ahead predictions
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.5)[11]
  upred95 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.95)[11]
  upred05 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.05)[11]
  c(sum(uobs<=upred50), sum(uobs<=upred95), sum(uobs<=upred05))})

sum(cp[1,])/(ns2) #nominal coverage probability is .50
sum(cp[2,])/(ns2) #nominal coverage probability is .95
sum(cp[3,])/(ns2) #nominal coverage probability is .05

#Coverage probabilities of 5 steps ahead predictions
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11:15]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.5)[11:15]
  upred95 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.95)[11:15]
  upred05 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.05)[11:15]
  c(sum(uobs<=upred50), sum(uobs<=upred95), sum(uobs<=upred05))})

sum(cp[1,])/(ns2*5) #nominal coverage probability is .50
sum(cp[2,])/(ns2*5) #nominal coverage probability is .95
sum(cp[3,])/(ns2*5) #nominal coverage probability is .05

#Coverage probability of 10 steps ahead predictions for the median
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11:20]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=10, p=.5)[11:20]
  c(sum(uobs<=upred50))})

sum(cp)/(ns2*10) #nominal coverage probability is .50



##Repeat the same procedure, but this time set x1=2

#1) Generate a holdout sample for cases with x1=2
ns2 <- 400 #sample size
rs2 <- rMvdc(ns2, myMvdc)

x1 <- rep(2, times=ns2*ndim) #value on covariate x1
t <- rep(seq(250, 250*ndim, 250), times=ns2) #time points
phi1 <- -1.4 #true value for phi1 (=asymptote)
phi2 <- -7.5 + .7*x1 #true value for phi2 (=rate of change)
ys <- phi1*(1-exp(-exp(phi2)*t)) + as.vector(t(rs2)) #response
dataL2 <- data.frame(id=rep(1:ns2, each=ndim), ys , time=t, x1) #combine data

#2) Based on the parameters we estimated above, compute the cumulative
#probabilities for the marginals
phi2hat2 <- cbind(1, dataL2$x1)%*%c(beta0hat, beta1hat)
muhat <- phi1hat*(1-exp(-exp(phi2hat2)*dataL2$time))
udat2 <- matrix(pmyGumbel(dataL2$ys, location=muhat, scale=sigmahat),
                ncol=ndim, byrow=TRUE)


#3) Forecasts for 4 individual cases of the holdout sample with x1=2
obs <- 10 #number of initial observations
nAheads <- 10 #number of steps ahead for the predictions
casenr <- c(1:4) #select the first 4 cases of the sample

#Predictions without taking the individual heterogeneity into account
phi2s <- cbind(1, x1[casenr[1]])%*%c(beta0hat, beta1hat)
mus <- phi1hat*(1-exp(-exp(phi2s)*seq(250, 250*ndim, 250))) 

#Generate predictions for these 4 individual cases, but this time taking the
#individual heterogeneity into account. The predictions are the medians 10 steps
#ahead, given 10 initial observations.
pred <- sapply(1:4, function (i) {
  case <- udat2[casenr[i], 1:ndim]
  upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
  qmyGumbel(upred50, location=mus, scale=sigmahat)})

#Plot the results:
#-black lines are the observations
#-red lines are the predicted medians taking individual heterogeneity into account
#-blue lines are predictions not taking the individual heterogeneity into account
#-green line is the time point at which the forecasting starts (that is, after
# the first 10 initial observations)
par(mfrow=c(2, 2))
matplot(cbind(dataL2[which(dataL2$id==1), 2], pred[,1]), type="b",
        pch=1, lty=1, ylim=c(-2.1,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[1], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==2), 2], pred[,2]), type="b",
        pch=1, lty=1, ylim=c(-2.1,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[2], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==3), 2], pred[,3]), type="b",
        pch=1, lty=1, ylim=c(-2.1,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[3], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")

matplot(cbind(dataL2[which(dataL2$id==4), 2], pred[,4]), type="b",
        pch=1, lty=1, ylim=c(-2.1,0),
        col=c("black", "red"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr[4], "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
lines(1:20, mus, lty=2, col="blue")
par(mfrow=c(1, 1))


#4) Have a closer look at a single individual case
casenr <- 3
case <- udat2[casenr, 1:ndim]
caseData <- dataL2[which(dataL2$id==casenr),]

#Predict the median, 5th, and 95th quantiles for this individual case.
#The 5th and 95th quantiles together comprise a 90% prediction interval.
upred50 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.5)
upred95 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.95)
upred05 <- predictU(observedU=case[1:obs], nAhead=nAheads, p=.05)

y50 <- qmyGumbel(upred50, location=mus, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=mus, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=mus, scale=sigmahat)

#Plot the results, and note that approximately 1 (out of 10) predictions
#should lie outside the boundaries of the 90% prediction interval.
matplot(cbind(caseData$ys, y50, y95, y05), type="b", pch=1, lty=1,
        col=c("black", "red", "blue", "blue"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr, "; x1 = -2"))
abline(v=obs, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#5) Similar, but now start with only 2 initial observations.
upred50 <- predictU(observedU=case[1:2], nAhead=18, p=.5)
upred95 <- predictU(observedU=case[1:2], nAhead=18, p=.95)
upred05 <- predictU(observedU=case[1:2], nAhead=18, p=.05)

y50 <- qmyGumbel(upred50, location=mus, scale=sigmahat)
y95 <- qmyGumbel(upred95, location=mus, scale=sigmahat)
y05 <- qmyGumbel(upred05, location=mus, scale=sigmahat)

#Plot the results
matplot(cbind(caseData$ys, y50, y95, y05), type="b", pch=1, lty=1,
        col=c("black", "red", "blue", "blue"), ylab="y(t)", xlab="t-index",
        main=paste("Case", casenr, "; x1 = -2"))
abline(v=2, lty=2, col="darkgreen")
legend("bottomleft", lty=c(1, 1, 1), pch=c(1,1,1),
       col=c("black", "red", "blue"),
       legend=c("observed", "median", "90% prediction interval"),
       bty="n", y.intersp=1.2, cex=.7)


#6) We now have a look at the coverage probability of the predicted quantiles.
#That is,
#-for the 5th quantile, 5% (=nominal coverage probability) of the observed values
# are expected to be lower than the predicted 5th quantile
#-for the median, 50% of the observations are expected to be lower than the
# predicted 50th quantile
#-for the 95th quantile, 95% are expected to be lower than the predicted
# 95th quantile.

#Coverage probabilities of 1 step ahead predictions
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.5)[11]
  upred95 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.95)[11]
  upred05 <- predictU(observedU=udat2[i, 1:10], nAhead=1, p=.05)[11]
  c(sum(uobs<=upred50), sum(uobs<=upred95), sum(uobs<=upred05))})

sum(cp[1,])/(ns2) #nominal coverage probability is .50
sum(cp[2,])/(ns2) #nominal coverage probability is .95
sum(cp[3,])/(ns2) #nominal coverage probability is .05

#Coverage probabilities of 5 steps ahead predictions
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11:15]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.5)[11:15]
  upred95 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.95)[11:15]
  upred05 <- predictU(observedU=udat2[i, 1:10], nAhead=5, p=.05)[11:15]
  c(sum(uobs<=upred50), sum(uobs<=upred95), sum(uobs<=upred05))})

sum(cp[1,])/(ns2*5) #nominal coverage probability is .50
sum(cp[2,])/(ns2*5) #nominal coverage probability is .95
sum(cp[3,])/(ns2*5) #nominal coverage probability is .05

#Coverage probability of 10 steps ahead predictions for the median
#(note: this is computationally intensive)
cp <- sapply(1:ns2, function(i) {
  uobs <- udat2[i, 11:20]
  upred50 <- predictU(observedU=udat2[i, 1:10], nAhead=10, p=.5)[11:20]
  c(sum(uobs<=upred50))})

sum(cp)/(ns2*10) #nominal coverage probability is .50

#Finally, I have not yet done extensive coverage probability simulation studies.
#Some remaining questions regarding the coverage probabilities are:
#-How are the coverage probabilities for other types of models?
# In the simulations above I employed a specific nonlinear model (an asymptotic
# function through the origin), but how are the results for other linear and
# nonlinear models?
#-What are the results if only a small number of initial observations is used?
# For the simulations above I used 10 initial observations, but what are the
# coverage probabilities with only a few initial observations (for instance,
# 2 or 3)?
#-Should the coverage probabilities, in case of a small number of initial
# observations, deviate substantially from the nominal coverage probabilities,
# then what minimal number of initial observations is needed for the coverage
# probabilities to come close to the nominal ones?