In digital soil mapping (DSM), we make predictions of the spatial distribution of a soil property, which comes with uncertainties/errors. To quantify the accuracy we split the data into a training and test set, where we train a machine learning model (e.g., random forest, additive models, splines, etc.) on the training set and evaluate the model on the test set. Then we predict over covariates to make a continous map and also we predict the uncertainties (e.g., standard deviation, standard error, prediction limit range, etc.). But how do we evaluate the estimates of uncertainty we just predicted? One option is to calculate the prediction interval coverage probability (PICP), which measures how well the predictions fall within the formulated prediction interval (PI). PIs are similar to confidence intervals but use predicted data to determine the intervals.
This function will be in the rafikisol R package, however, I have not loaded it yet as there are some more functions I want to add. Nevertheless, we will make it into a function called calcPICP(), which was basically taken from the book “Using R for Digital Soil Mapping” by Malone et al., (2017).
The function takes 3 parameters, x = a data frame of the data, response = the vector of the measured data (e.g., data$response), and pred = the predicted value (e.g., data$predicted).
calcPICP = function(data, response, pred){
#We first get the residuals of the model
res = response - pred
#Then we get the standard deviation of the residuals and combine with the data.
data$stdev = sd(res)
#We than make a series of quantiles from a normal cumulative distribution.
qp <- qnorm(c(0.995, 0.9875, 0.975, 0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.525))
#Then make a matrix the with the row length of the data and columns of qp
vMat <- matrix(NA, nrow = nrow(data), ncol = length(qp))
#Now we must loop around the quantiles and multiply it by the standard deviation to get a series of standard errors with different prediction intervals.
for(i in 1:length(qp)){
vMat[, i] <- data$stdev * qp[i]
}
#Make another matrix same as before for the upper limits
uMat <- matrix(NA, nrow = nrow(data), ncol = length(qp))
#We calculate the upper limits by adding the series of standard errors to the predictions of the model.
for(i in 1:length(qp)) {
uMat[, i] <- pred + vMat[, i]
}
#We make another matrix for the lower limits
lMat <- matrix(NA, nrow = nrow(data), ncol = length(qp))
#We calculate the lower limits by subtracting the series from the predicted values.
for(i in 1:length(qp)) {
lMat[, i] <- pred - vMat[, i]
}
#Now we want to see which prediction intervals cover the measured data creating a matrix of 1s and 0s.
bMat <- matrix(NA, nrow = nrow(data), ncol = length(qp))
for(i in 1:ncol(bMat)){
bMat[, i]<-as.numeric(response <= uMat[, i] &
response >= lMat[, i])
}
#To calculate the PICP we take the colsums/nrow*100 for the matrix of 0s and 1s
picp <- colSums(bMat)/nrow(bMat)*100
#Make a vector of confidence levels
cl <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
#We put into a data frame for plotting
results <- data.frame(picp = picp, cl = cl)
#Since we want PICP to CI to be a 1:1 line we also calculate Lin’s concordance correlation coefficient (CCC) with the yardstick R package.
ccc <- as.data.frame(yardstick::ccc_vec(results$picp, results$cl))
#Make name correct
names(ccc) = "CCC" #name
#must add axis values for plotting
ccc$x = 10 #x axis
ccc$y = 90 #y axis
#Now we can plot the PICP to CI, add the 1:1 line and the CCC
p = ggplot(data = results, aes(x= cl, y = picp))+ #add data
geom_point()+ #add points
geom_text(data = ccc,aes(x= x, y =y, label = paste("CCC = ",round(CCC, 2))))+ #add CCC value
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = 'red')+ #add 1:1 line
labs(x = 'Confidence level', y = "PICP", title = "PICP to confidence level")+ #labels
theme_bw() #make it look good
#Now we want to return a list of the plot as well as a data frame of the total results.
return(setNames(list(p, results), c("Plot", "Results")))
}
Now we have the function giving us a plot of the PICP to CI and results. This is useful when running many models and now we can just plug in the data.
picp = picpCalc(dat, dat$clay, dat$pred)
now, plot the data.
picp[[1]]