AFEchidna: a new R package for solving mixed linear models for plant and animal datasets

Mixed linear models (MLM) are linear models with a combination of fixed and random effects used to explain the variation in traits of interest, such as milk yield in cows or volume growth in forest trees. MLM are widely used in the analysis of progeny test data for plants and animals. Nowadays, most software uses the Restricted Maximum Likelihood (REML) method to estimate the variance components of random effects, and then estimates the fixed effects and predicts the random effects. Such genetic analysis software includes ASReml, Echidna, SAS, BLUPF90, and the R packages sommer, breedR, etc. Echidna is free software released in 2018 by Professor Gilmour, the main developer of ASReml. It also uses the REML method to estimate parameter values, and its syntax and functionality are very close to those of ASReml. It is the most powerful free software for animal and plant genetic evaluation, but its usage is somewhat complicated, which may be difficult for ordinary users.

Here, I release AFEchidna, an R package based on the Echidna software, and demonstrate how to use a mixed linear model to generate solutions for variance components, genetic parameters, and BLUPs of random effects.

The main functions in AFEchidna:
  • get.es0.file(): generate the es0 file
  • echidna(): specify the mixed linear model
  • Var(): output variance components
  • plot(): model diagnostic plots
  • pin(): calculate genetic parameters
  • predict(): model predictions
  • coef(): model equation solutions
  • model.comp(): compare different models
  • update(): run a new model
library(AFEchidna)
setwd("D:\\Rdata")
get.es0.file(dat.file="Provenance.csv")  # generate the .es file from the data file
get.es0.file(es.file="Provenance.es")    # generate the .es0 file from the .es file

Specify a mixed model:

m1.esr <- echidna( fixed=height~1+Prov,
                   random=~Block*Female,
                   residual=~units,
                   es0.file='Provenance.es0')
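In the random formula, Block*Female is the usual crossing operator: it expands to Block + Female + Block:Female, so the model fits block and female-parent main effects plus their interaction, which is why these three terms appear in the variance component table below.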

Output related results:
> Var(m1.esr)
          Term   Sigma       SE   Z.ratio
1     Residual 2.52700 0.131470 19.221115
2        Block 0.10749 0.089924  1.195343
3       Female 0.18950 0.083980  2.256490
4 Block:Female 0.19762 0.086236  2.291618
> pin(m1.esr, mulp=c(Va~4*V3,
+ Vp~V1+V3+V4,
+ h2~4*V3/(V1+V3+V4)), digit=5)
 Term Estimate      SE
1  Va  0.75801 0.33592
11 Vp  2.91415 0.14973
12 h2  0.26011 0.10990
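The pin() formulas follow the usual open-pollinated (half-sib) assumptions: the additive variance is estimated as Va = 4σ²_Female = 4 × 0.1895 ≈ 0.758, the phenotypic variance as Vp = σ²_Residual + σ²_Female + σ²_Block:Female = 2.527 + 0.1895 + 0.1976 ≈ 2.914, and the narrow-sense heritability as h² = Va/Vp ≈ 0.758/2.914 ≈ 0.26, matching the output above (V1, V3 and V4 refer to the rows of the Var() table).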
> coef(m1.esr)$fixed
  Term Level    Effect         SE
1 Prov    11 0.0000000  0.0000000
2 Prov    12 -1.6656325 0.3741344
3 Prov    13 -0.6237406 0.3701346
4 Prov     0 -1.2201920 0.3769892
5   mu     1 11.5120637 0.3619919
> coef(m1.esr)$random %>% head
    Term Level    Effect        SE
1  Block     2 0.3110440 0.1854718
2  Block     3 0.1268193 0.1858290
3  Block     4 0.2055624 0.1858158
4  Block     5 -0.2918516 0.1866871
5  Block     0 -0.3515741 0.1878287
6 Female   191 -0.1633745 0.3433451

An easy way to run a batch analysis:
mt.esr <- update(m1.esr,trait=~height+diameter+volume,batch=TRUE)
> Var(mt.esr)

V1-Residual; V2-Block; V3-Female; V4-Block.Female
Converge: 1 means TRUE; 0 means FALSE.

              V1       V2      V3      V4   V1.se    V2.se   V3.se   V4.se Converge maxit
height    2.5270 0.107490 0.18950 0.19762 0.13147 0.089924 0.08398 0.08624        1     4
diameter 16.8810 0.167130 0.80698 0.69100 0.87659 0.198100 0.41470 0.50067        1     5
volume    0.0037 0.000084 0.00022 0.00016 0.00019 0.000077 0.00010 0.00011        1     6

$55,000 in Awards for Energy & Buildings Hackathon, Sponsored by NYSERDA

The New York State Energy Research & Development Authority (NYSERDA) is partnering with Onboard Data to host a $55,000 Global Energy & Buildings Hackathon. We’re inviting all engineers, data scientists and software developers, whether they are professionals, professors, researchers or students, to participate. More below…


Challenge participants will propose exciting, new ideas that can improve our world’s buildings. The hackathon will share data from 200+ buildings with participants. This one-of-a-kind data set is normalized from the equipment, systems and IoT devices found within buildings.
We seek submissions that positively impact or accelerate the decarbonization of New York State buildings. 

Total awards are $55,000. Sign-ups stay open until April 15th and the competition is open from April 22nd to May 30th. More can be found here: www.rtemhackathon.com.

Advance the next generation of building technology!

Batched Imputation for High Dimensional Missing Data Problems

High dimensional data spaces are notoriously associated with slower computation time, whether for imputation or some other operation. As such, in high dimensional contexts many imputation methods run out of gas and fail to converge (e.g., MICE, PCA-based approaches, etc., depending on the size of the data). Further, though some approaches to high dimensional imputation exist, most are limited by being unable to simultaneously and natively handle mixed-type data.

To address these problems of slow or failing computation, as well as the complexities associated with mixed-type data, I recently released the first version of a new algorithm, hdImpute, for fast, accurate imputation in high dimensional missing data problems. The algorithm is built on top of missForest and missRanger, with chained random forests as the imputation engine.

The benefit of hdImpute is in simplifying the computational burden via a batch process comprised of a few stages. First, the algorithm divides the data space into smaller subsets based on cross-feature correlations (i.e., column-wise rather than row-wise subsets). It then imputes each batch, the size of which is controlled by a user-set hyperparameter, batch, and continues through subsequent batches until all batches are individually imputed. The final step joins the completed, imputed subsets and returns a single, completed data frame.
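To make the batch logic concrete, here is a minimal sketch of the idea, not the package’s internal implementation; impute_in_batches() and ranked_features are hypothetical names, and missRanger simply stands in for the chained random forest engine mentioned above.

# Illustrative sketch only: split the ranked columns into batches, impute each
# batch with chained random forests, then bind the completed subsets back together
library(missRanger)

impute_in_batches <- function(d, ranked_features, batch = 2) {
  # group the ranked feature names into 'batch' roughly equal column-wise subsets
  groups <- split(ranked_features,
                  cut(seq_along(ranked_features), breaks = batch, labels = FALSE))
  # impute each subset independently
  imputed <- lapply(groups, function(cols) {
    missRanger(d[, cols, drop = FALSE], num.trees = 100, verbose = 0)
  })
  # join the completed subsets and restore the original column order
  do.call(cbind, unname(imputed))[, names(d)]
}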

Let’s walk through a brief demo with some simulated data. First, create the data.

# load a couple libraries
library(tidyverse)
library(hdImpute) # using v0.1.0 here

# create the data
{
  set.seed(1234)

  data <- data.frame(X1 = c(1:6),
                     X2 = c(rep("A", 3), rep("B", 3)),
                     X3 = c(3:8),
                     X4 = c(5:10),
                     X5 = c(rep("A", 3), rep("B", 3)),
                     X6 = c(6, 3, 9, 4, 4, 6))

  data <- data[rep(1:nrow(data), 500), ] # expand/duplicate rows

  data <- data[sample(1:nrow(data)), ]   # shuffle rows
}
Next, take a look to make sure we are indeed working with mixed-type data. 

# quick check to make sure we have mixed-type data
data %>%
  map(., class) %>%
  unique()

[[1]]
[1] "integer"

[[2]]
[1] "character"

[[3]]
[1] "numeric"
Good to go: three data classes represented in our data object. Practically, the value of this feature is that there is no requirement for lengthy preprocessing of the data, unless desired by the user of course.

Finally, introduce some NA’s to our data object, and store in d.

# produce NAs (30%)
d <- missForest::prodNA(data, noNA = 0.30) %>%
  as_tibble()
Importantly, there are two ways to use hdImpute: in stages, which allows more flexibility (e.g., tying different stages into different points of a modeling pipeline), or as a single call to hdImpute(). Let’s take a look at the first approach via stages.

Approach 1: Stages

To use hdImpute in stages, three functions are used, which comprise each of the stages of the algorithm: 

  1. feature_cor(): creates the correlation matrix. Note: depending on the size and dimensionality of the data, as well as the speed of the machine, this preprocessing step could take some time. 

  2. flatten_mat(): flattens the correlation matrix from the previous stage, and ranks the features based on absolute correlations. The input for flatten_mat() should be the stored output from feature_cor().

  3. impute_batches(): creates batches based on the feature rankings from flatten_mat(), and then imputes missing values for each batch, until all batches are completed. Then, joins the batches to give a completed, imputed data set.

Here’s the full workflow.

# stage 1: calculate correlation matrix and store as matrix/array
all_cor <- feature_cor(d)

# stage 2: flatten and rank the features by absolute correlation and store as df/tbl
flat_mat <- flatten_mat(all_cor) # can set return_mat = TRUE to print the flattened and ranked feature list

# stage 3: impute, join, and return
imputed1 <- impute_batches(data = d,
                           features = flat_mat,
                           batch = 2)

# take a look at the completed data
imputed1

d # compare to the original if desired
Of note, setting return_mat = TRUE returns all cross-feature correlations as previously mentioned. But calling the stored object (regardless of the value passed to return_mat) will return the vector of ranked features based on absolute correlation from calling flatten_mat(). Thus, the default for return_mat is set to FALSE, as it isn’t necessary to inspect the cross-feature correlations, though users certainly can. But all that is required for imputation in stage 3 is the vector of ranked features (passed to the argument features), which will be split into batches based on their position in the ranked vector.

That’s it! We have a completed data frame returned via hdImpute. The speed and accuracy of the algorithm are better understood in larger scale benchmarking experiments. But I hope the logic is clear enough from this simple demonstration. 

Approach 2: Single Call

Finally, consider the simpler, yet less flexible approach of making a single call to hdImpute(). There are only two required arguments: data (original data with missing values) and batch (number of batches to create from the data object). Here’s what this approach might look like using our simulated data from before:

# fit and store in imputed2
imputed2 <- hdImpute(data = d, batch = 2)

# take a look
imputed2

# make sure you get the same results compared to the stages approach above
imputed1 == imputed2

Visually Comparing

Though several methods and packages exist to explore imputation error, a quick approach to comparing error between the imputed values and the original (complete) data is to visualize the contour of the relationship between a handful of features. For example, take a look at the contour plot of the original data.

data %>%
  select(X1:X3) %>%
  ggplot(aes(x = X1, y = X3, fill = X2)) +
  geom_density_2d() +
  theme_minimal() +
  labs(title = "Original (Complete) Data")


Next, here’s the same plot but for the imputed set. 

imputed2 %>%
  select(X1:X3) %>%
  ggplot(aes(x = X1, y = X3, fill = X2)) +
  geom_density_2d() +
  theme_minimal() +
  labs(title = "Imputed (Complete) Data")



As expected, the imputed version has a bit of error (looser distributions) relative to the original version. But the overall pattern is quite similar, suggesting hdImpute did a fair job of imputing plausible values.

Contribute

To dig more into the package, check out the repo, which includes source code, a getting started vignette, tests, and all other documentation.

As the software is in its infancy, contribution at any level is welcomed, from bug reports to feature suggestions. Feel free to open an issue or PR to request the change or contribute. 

Thanks and I hope this helps ease high dimensional imputation!

Gaussian Simulation of Spatial Data Using R


Richard E. Plant

This post is a condensed version of an Additional Topic to accompany Spatial Data Analysis in Ecology and Agriculture using R, Second Edition. The full version, together with the data and R code, can be found in the Additional Topics section of the book’s website, http://psfaculty.plantsciences.ucdavis.edu/plant/sda2.htm. Chapter and section references are contained in that text, which is referred to as SDA2.

1.     Introduction

Probably the most common task in spatial data analysis is to estimate the values of some quantity at locations other than those at which it was measured. Beginning GIS classes often emphasize the use of spatial interpolation methods such as inverse distance weighting and kriging to carry out this task, and these are indeed important tools. They are not the whole story, however, and for some applications they may actually give results that lead to incorrect conclusions. A common alternative to interpolation is a method called conditional simulation. The purpose of this post is to describe conditional simulation, in particular conditional Gaussian simulation, to compare it with interpolation, and to show how to program it in R. We will also touch on the related topic of unconditional Gaussian simulation and see how the two differ.

To demonstrate conditional simulation we need a data set that is densely defined in space so that we may sample from it, use the sample data to predict population values at unsampled points, and compare these predicted values with the true data values. The yield monitor data associated with Data Set 4 of SDA2 are very suitable for this purpose. In particular, we will use the wheat yield data of Field 2. A cleaned version of this data set was developed in Section 6.3.2 of SDA2, and this was converted to a regular grid. Fig. 1 shows the full gridded data set. The grid has 144 cells in the east-west direction and 72 cells in the north-south direction. Also shown in the figure are the 78 points at which field sample data were collected. The locations of these points in the grid were set to the closest grid cell to each sample point. Here is a part of the code showing the names of the R data objects and the assignment of sample point locations.

> data.Set4.2 <- read.csv("set4\\set4.296sample.csv", header = TRUE)
> Y.full.data.sf <- st_read("Created\\Set42pop.shp")
> Y.full.data.sp <- as(Y.full.data.sf, "Spatial")
> gridded(Y.full.data.sp) <- TRUE
> proj4string(Y.full.data.sp) <- CRS("+proj=utm +zone=10
+   +ellps=WGS84")
> sample.coords <- cbind(data.Set4.2$Easting, data.Set4.2$Northing)

Figure 1. Yield population data. The sample points are shown as bright yellow pixels.

As was discussed in SDA2, the areas in the western half of the field with very low yield were the result of a failed herbicide application due to equipment malfunction. We will use the sample points to extract from the full grid the data values that will be used to develop our kriging interpolations and conditional simulations. We will also need the record numbers of the sample points.

> closest.point <- function(sample.pt, grid.data){
+    dist.sq <- (coordinates(grid.data)[,1]-sample.pt[1])^2 +
+       (coordinates(grid.data)[,2]-sample.pt[2])^2
+    return(which.min(dist.sq))
+ }
> samp.pts <- apply(sample.coords, 1, closest.point,
+   grid.data = Y.full.data.sp)
> data.Set4.2$ptno <- samp.pts
> data.Set4.2$Yield <- Y.full.data.sp$Yield[samp.pts]
The kriging analysis is similar, but not identical, to the analysis carried out in Sec. 6.3.2 of SDA2.

> # Convert data.Set4.2 to a Spatial Points Data Frame
> # This is needed for the function krige()
> data.Set4.2.sp <- data.Set4.2
> coordinates(data.Set4.2.sp) <- c("Easting", "Northing")
> proj4string(data.Set4.2.sp) <- CRS("+proj=utm +zone=10
+   +ellps=WGS84")
> # Kriging
> data.var <- variogram(object = Yield ~ 1, locations =
+    data.Set4.2.sp, cutoff = 600)
> data.fit <- fit.variogram(object = data.var, model =
+    vgm(150000, "Sph", 250, 1))
> Yield.krige <- krige(formula = Yield ~ 1,
+    locations = data.Set4.2.sp, newdata = Y.full.data.sp,
+    model = data.fit)

Figure 2a. Kriged interpolation of the sample point data in Figure 1.

Figure 2b. Conditional Gaussian simulation of the sample point data in Figure 1.

Fig. 2a shows the result of an ordinary kriging interpolation of the data, and Fig. 2b shows a conditional Gaussian simulation of the same data. The code to develop this simulation will be given in Section 4. Comparison of the figures shows that the kriging interpolation approximates the pattern of yield well but is much smoother than the original data, while the simulation is considerably less smooth than the kriging interpolation. Based on Fig. 2 one can say that although the kriged interpolation may be “optimal,” the pattern of variation of the conditional Gaussian simulation in some sense more closely resembles that of the true data. In what sense is the kriged interpolation optimal? The answer to this question is given in SDA2 Sec. 6.3.2, and in much greater detail by Isaaks and Srivastava (1989, Chapter 12). Here it is in brief: if the quantity to be estimated is denoted Y(x) and the estimate is denoted Ŷ(x), where x represents the position vector, and if Y is measured at locations xi, then the kriged estimate Ŷ(x) minimizes the error variance

σE² = E{[Ŷ(x) − Y(x)]²}.     (1)
This obviously does not mean that the kriged interpolation is the “best” estimate in every sense. The weaknesses of the kriged interpolation are described so well by Goovaerts (1997, p. 370) that I can do no better than to quote him. Interpolated estimates like those produced by kriging “…tend to smooth out local details of the spatial variation of the attribute. Typically, small values are overestimated, whereas large values are underestimated. Such conditional bias is a serious shortcoming when trying to detect patterns of extreme attribute values, such as zones of high permeability values or zones rich in a metal. Another drawback of estimation is that the smoothing is not uniform. Rather, it depends on the local data configuration: smoothing is minimal close to the data locations and increases as the location being estimated gets farther away. A map of kriging estimates appears more variable in densely sampled areas than in sparsely sampled areas.”

An example from my own experience may make the first point clearer. I was involved for many years in experiments at the UC Davis Long Term Research in Agricultural Systems (LTRAS) site, a large agricultural field experiment. At one point, we were interested in conducting experiments on soil drainage at the site. This was made very difficult (actually, almost impossible) by the fact that the site lies directly next to, and at a higher elevation than, a creek. The site has some local areas of very high drainage, and water applied to the soil runs through these local areas and into the creek in the way that water drains from a bathtub. This means that taking soil samples and interpolating them would provide an unacceptably false estimate of the drainage of the site, because the interpolation would likely undervalue the very high drainage areas, where the water more quickly drains to the creek.

Conditional simulation is a Monte Carlo process (see SDA2 Sec. 3.3). For convenience we will repeat a bit of the discussion in that section here. The basic idea of “traditional” Monte Carlo simulation is to formulate a probabilistic model of a process, run this model many times, and collect statistics on the outcomes. SDA2 Sec. 3.3 contains a simple example of a Monte Carlo simulation of the tossing of a fair coin 20 times. The simulation is carried out using a function called coin.toss() that generates the number of heads as a pseudorandom number from a binomial distribution.

> coin.toss <- function (n.tosses, p){
+   n.heads <- rbinom(1, n.tosses, p)
+ }

The function takes two arguments, the number of tosses n.tosses and the probability of heads p. The simulation is then carried out as follows.

> set.seed(123)
> n.tosses <- 20
> p <- 0.5
> n.reps <- 10000
> U <- replicate(n.reps, coin.toss(n.tosses, p))

The function replicate() generates a vector U whose elements are the number of heads in each of the 10,000 replications of the experiment.

> mean(U)
[1] 9.9814
> var(U)
[1] 4.905345
The mean and variance of U are close to their theoretically predicted values of 10 and 5, respectively.
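These values follow from the binomial formulas E{U} = np = 20 × 0.5 = 10 and Var{U} = np(1 − p) = 20 × 0.5 × 0.5 = 5.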
In this post we will only discuss a specific kind of conditional simulation, called conditional Gaussian simulation. As the name implies, the value of the quantity being simulated is assumed to follow a normal (i.e., Gaussian) distribution. We will see that this assumption greatly simplifies the analysis. Zanon and Leuangthong (2004) give a nice outline of the steps involved in conditional Gaussian simulation. They are as follows.

1. If the data are not normally distributed, transform them into a distribution that is to the greatest extent possible normal.
2. Generate a random sequence of steps that encounters each data value once.
3. At each step sequentially,
    a. search to find nearby sample data and previously simulated values,
    b.  calculate the conditional distribution based on these data, and
    c. perform Monte Carlo simulation to obtain a single value from the
        distribution.
4. Repeat Step 3 until every data location has been visited.
5. If necessary, back transform the simulated data to the original distribution.

As with neural networks in the Additional Topic on that technology, probably the best way to gain an understanding of conditional Gaussian simulation is to construct a conditional Gaussian simulation algorithm that carries out the process in a similar way to the R function used to construct the simulation shown in Fig. 2b. To do this we must first review some important aspects of geostatistics (Section 2). In Section 3 we will construct the simple conditional Gaussian simulation algorithm, and in Section 4 we will describe the use of the R gstat library (Pebesma, 2003, 2004) functions to carry out conditional and unconditional Gaussian simulation.

2.     Geostatistics Review

a. Variance and covariance
Let Y(x)  represent the quantity of interest measured at n locations xi, i =1,…,n, where xi is a two dimensional position vector. It is traditional in statistics to use this subscript notation, whereas in geostatistics it is traditional to use a notation involving the spatial lag, and we will switch to that in a moment. The key step in the conditional simulation algorithm described in the previous section is Step 3b, computing the probability distribution of the process Y(x) at the current location x conditional on the values of Y at nearby already sampled or simulated locations. To do this we must be able to compute the covariance between Y values at these locations, and this is done using the variogram. Much of the material that we will be going over in Subsection a is contained in SDA2 Sections 4.6.1 and 6.3.2, but it is convenient to repeat it in greater detail. Let the lag distance h be the distance between two points in the region. Assume that the region is stationary and isotropic (see SDA2, Sec. 3.2.2), so that h only depends on the magnitude of the distance and not its direction or the location of the points in the region. The derivation of the relationship between the covariance and the variogram follows an argument similar to that used in the Additional Topic on Spatial Sampling. The covariance between Y(x) and Y(x+h) is defined as

C(h) = Cov{Y(x), Y(x+h)}.     (2)

To carry out a conditional simulation we must relate this covariance to the variogram. The variogram γ(h) is defined as (SDA2, Sec. 4.6.1)

γ(h) = ½ Var{Y(x+h) − Y(x)}.     (3)

If Var{Y} = σ² then it turns out (see the Additional Topic for a derivation) that

C(h) = σ² − γ(h).     (4)

As discussed in SDA2, Sec. 4.6.1, the variogram is estimated by the experimental variogram γ̂(h). It is worthwhile noting that an experimental variogram may have a nugget, i.e., the fitted value of γ̂(h) may remain greater than 0 as h approaches 0; nevertheless the variogram γ(h) is always 0 at h = 0, so that the variance is given by C(0) = σ².

b. Simple kriging
The interpolation exercise carried out in Section 1 involved ordinary kriging. The difference between simple kriging and ordinary kriging is that in simple kriging the value of the mean μ of the distribution of Y is assumed to be known, whereas in ordinary kriging it is not. In the most common interpolation applications the mean is obviously not known, and so ordinary kriging is the more widely used of the two. We will see, however, that simple kriging can be employed in conditional simulation, and so we will develop it here. Since the mean μ is known, we can define a new variable

Z(x) = Y(x) − μ,     (5)

and derive the equations in terms of Z, which has mean zero. The kriging interpolator Ẑ(x) is a linear interpolator, which means that it is defined by the equation

Ẑ(x) = Σi φi Z(xi),     (6)

where the xi are the locations at which Y, and therefore Z, is measured, and x is the location at which it is to be interpolated. Kriging interpolation is based on the assumption of a probability model for Z(x), and is derived within that model with the objective of minimizing the variance of the error associated with the interpolator.

It turns out (again, see the Additional Topic for a derivation) that the minimum of the expected variance is achieved under the condition

                    Σj φj C(xi, xi + hij) − C(xi, x) = 0,   i = 1,…,n.                           (7)
The equations (7) are called the simple kriging equations.

Equations (7) can be written in matrix form. If we let the n by n matrix C have elements Cij = C(xi, xi + hij) (hopefully this abuse of notation is not confusing: C is a matrix and C(xi, xi + hij) is the covariance between locations xi and xi + hij), and let the vector c have components ci = C(xi, x), then the system of n equations of the form of Equation (7) for the n coefficients φi can be written as the matrix equation
                                                                     Cφ = c.                                                   (8)
The values of φ can then be determined by solving Equation (8). This is not, however, the most computationally efficient way to compute these quantities, as will be discussed below. We will follow in the footsteps of Leuangthong et al. (2008) and create a simple four-point example in which we can directly compute the appropriate simple kriging quantities. In our case we will use the four points in the northwest corner of the sample set. We will krige on the quantity Z, the value of Yield minus its mean over the full 78-point data set.

> Y.mean <- mean(data.Set4.2$Yield)
> data.Set4.2$Z <- data.Set4.2$Yield - Y.mean
> data.4pts <- data.Set4.2[which(data.Set4.2$Easting < 592200 &
+    data.Set4.2$Northing > 4267750),]

Figure 4. Simple four-point example, showing the Z values. The value at the blue point is to be estimated based on the values at the red points.

Fig. 4 shows the four points. We will use the values of the three points shown in red to estimate the value of the point shown in blue. The values of Z are shown at each point. We can anticipate that the value of the blue point will be seriously overestimated, which illustrates the smoothing nature of kriging described above. First, we compute the distance matrix of these points and then we apply the variogram formula computed in Section 1. The way this is coded we need to be careful about the order of the data and make sure that the point to be estimated is the last in that order.

> print(h4.mat <- as.matrix(with(data.4pts,
+   dist(cbind(Easting, Northing)))))
       1       2        3        4
1  0.00000 61.0000 61.03278 86.26703
2 61.00000  0.0000 84.86460 61.00000
3 61.03278 84.8646  0.00000 59.00000
4 86.26703 61.0000 59.00000  0.00000

> print(nugget <- data.fit$psill[1])
[1] 0
> print(psill <- data.fit$psill[2])
[1] 1533706
> print(rnge <- data.fit$range[2])

[1] 248.2806

Using the formula for the spherical variogram given by Isaaks and Srivastava (1989), we can construct a function sph.vgm() in terms of the nugget, the partial sill, and the range (the sill is the sum of the nugget and the partial sill). Here is the computation of the variogram matrix.

> sph.vgm <- function(nug, psill, rnge, hh){
+    if (hh == 0) return(0)
+    if (hh > rnge) return(nug + psill)
+    return(nug + psill*((3*hh)/(2*rnge) - 0.5 * (hh/rnge)^3))
+    }
> gam.mat <- matrix(0, 4, 4)
> for (i in 1:4)
+    for (j in 1:4)
+       gam.mat[i,j] <- sph.vgm(0, psill, rnge, h4.mat[i,j])
> gam.mat
        [,1]     [,2]     [,3]     [,4]
[1,]      0.0 553850.7 554136.1 767179.4
[2,] 553850.7      0.0 755728.0 553850.7
[3,] 554136.1 755728.0      0.0 536401.1
[4,] 767179.4 553850.7 536401.1      0.0
We can then use Equation (4) to compute the variance-covariance matrix.

> # The fourth point is the point to be estimated
> print(C.full <- nugget + psill - gam.mat)

         [,1]      [,2]      [,3]      [,4]
[1,] 1533705.6  979855.0  979569.6  766526.3
[2,]  979855.0 1533705.6  777977.7  979855.0
[3,]  979569.6  777977.7 1533705.6  997304.5
[4,]  766526.3  979855.0  997304.5 1533705.6
The matrix C in Equation (8) is the 3 x 3 upper left corner of C.full and the vector c is the first three elements of the fourth column. We can solve Equation (8) for φ and then compute the simple kriging estimate of Z.
> C.mat <- C.full[1:3, 1:3]
> c.vec <- C.full[1:3, 4]
> print(phi <- solve(C.mat, c.vec))
[1] -0.1015358  0.4591511  0.4822024
> print(Z.est <- sum(phi * data.4pts$Z[1:3]))
[1] 979.1371
As predicted, the estimate is rather poor, since the true value of Z is anomalously low. In the next subsection we will see that the simple kriging Equation (8) can be expressed as a linear regression model.

c. Regression digression
The derivation in Subsection b showed that the simple kriging estimator is the best linear unbiased estimator in the sense that among unbiased estimators it minimizes the variance of the error. The Gauss-Markov theorem (Kutner et al., 2005, p. 18) states that the estimator of the linear regression model is also the best linear unbiased estimator. It must therefore be the case that the solution of the simple kriging equations is the same as that of the corresponding linear regression equations. The purpose of this brief subsection is to give an informal argument to show this. SDA2 Appendix A.2 contains a discussion of the equations of multiple linear regression. We will reproduce some of this discussion here. Multiple linear regression provides the least squares estimate for the parameters β0, β1, …, βn of the model

Y = β0 + β1X1 + β2X2 + … + βnXn + ε.     (9)

Here I have changed to a nonstandard notation different from that of SDA2 but matching that of this Additional Topic. To observe the analogy between simple kriging and linear regression we will be concerned with the special case in which β0 = 0 (regression through the mean). The reason β0 = 0 is that simple kriging is concerned with estimating the value of a quantity Z having mean 0, and this corresponds to regression through the mean. The equations for the regression model turn out to be (again, see the Additional Topic for a derivation)

Σj βj E{XiXj} − E{XiY} = 0,   i = 1, …, n.     (10)

If we compare this equation with the simple kriging equations (7), which are repeated here,

Σj φj C(xi, xi + hij) − C(xi, x) = 0,   i = 1,…,n,     (11)

we can see that

E{XiXj} = C(xi, xj),   E{XiY} = C(xi, x),   and φi = βi.     (12)

Thus, we have shown (albeit with a bit of hand waving) that simple kriging does indeed produce the same result as linear regression. In the next subsection we will see that these results are also the same as those generated by Bayes’ rule.
 
d. Bayesian statistics
Recall that Bayes’ rule is written as follows. Let A and B be two events. The joint probability, denoted P{A,B}, is the probability that both events A and B occur. The conditional probability that A occurs given that B occurs is denoted P{A|B}, and one can similarly define the conditional probability P{B|A} that B occurs given that A occurs. The laws of probability state that (Larsen and Marx, 1986, p. 42)

P{A,B} = P{A|B}P{B} = P{B|A}P{A},     (13)

and these can be used to formally define the conditional probability. Bayes’ rule is obtained by dividing the second equation in (13) by P{A}:

P{B|A} = P{A|B}P{B} / P{A}.     (14)

If we have observed the event A then Bayes’ rule gives us a means to compute the probability that B occurs given this observation, together with any prior information that we may already have about event B. In order to describe how Bayes’ rule is applied to conditional simulation we will shift from a model expressed in terms of events A and B to one in which the probabilities involve two scalar quantities y1 and y2. The following discussion is taken primarily from Leuangthong et al. (2008, p. 107), although I have changed the notation to better match mine. Rather than considering the probability of events, we consider probability densities, which we denote with a lower case p. We will assume that our objective is to compute the value of the quantity Y2 based on having computed the value of Y1, and that p(y1) and p(y2) represent the probability densities of the values of Y1 and Y2. Bayes’ rule becomes

p(y2|y1) = p(y1, y2) / p(y1).     (15)
The density p(y2|y1) is in geostatistics generally called the conditional density, p(y1,y2) is called the joint density, and p(y1) is called the marginal density. I should note that if you are familiar with Bayesian statistics in the context of subjective probability theory then both the terminology and the concepts as applied here are slightly different and should not be confused. The formula for the probability density of a univariate Gaussian random variable with mean m and variance σ² is (Larsen and Marx, 1986, p. 210)

p(y) = (1 / √(2πσ²)) exp{−(y − m)² / (2σ²)}.     (16)

I will not repeat the mathematical derivations here (again, see the Additional Topic) but note that they are based on Leuangthong et al. (2008, p. 108). They show that if the distribution p(y1) of Y1 is Gaussian (i.e., normal) then the conditional density p(y2|y1) of Y2 is also Gaussian.  This property, that the densities all have the same Gaussian form, is a rare one not shared by other probability densities. It is a primary reason why we do conditional Gaussian simulation. The great advantage of taking a Bayesian approach to conditional Gaussian simulation is that it facilitates the successive incorporation of the new data at each step of the simulation process (i.e. at each new visit to a point to be simulated). This incorporation is described, for example, by Leuangthong et al. (2008, p. 109) and by Goovaerts (1997, p. 376). Basically, once we know Y1 = y1,  we can compute Y2 = y2 based on the formulas given above. Once we know Y2 = y2, we can compute Y3 = y3, and so forth. In other words, the calculation at each new visit point is a computationally simple matter of Bayesian updating. As described by Goovaerts, this provides tremendous efficiency in the simulation. This is the way that the computations are carried out in software such as the gstat package. The presentation of the algorithm is simplified, however, if one adopts a simple matrix inversion approach like that used in the four-point example above, and this is what we shall do in Section 3. Because they use Bayesian updating of a Gaussian distribution, however, R library algorithms such as those in gstat require that the data be normally distributed, and for this reason if they are not then they must be transformed. That is the subject of the next subsection.
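For reference, here is the standard bivariate form of that result, stated in the zero-mean setting that applies to our transformed variable Z: if Y1 and Y2 are jointly Gaussian with variances σ1² and σ2² and covariance C12, then the conditional distribution of Y2 given Y1 = y1 is also Gaussian, with

E{Y2 | Y1 = y1} = (C12 / σ1²) y1   and   Var{Y2 | Y1 = y1} = σ2² − C12² / σ1².

With more than one conditioning value, the simple kriging mean and variance play exactly the same role at each visit point.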

e. The normal scores transformation
If you have had a course in linear regression, one of the topics you probably covered was transforming the data to a form that more closely fits the assumptions made to derive the method. The situation is similar with conditional simulation, but the objective is different. Linear regression is fairly robust to deviations from normality of the distribution of errors but is highly sensitive to deviations in error magnitude, and for this reason transformations may be carried out to “stabilize the variance,” that is, to make the error distribution more homogeneous. Conditional Gaussian simulation, on the other hand, assumes a normal distribution of the data, and thus data whose distribution is not Gaussian must be transformed to come closer to matching this assumption. The standard transformation used in geostatistics is called the normal scores transformation and is quite simple to describe and implement. It is described by Pyrcz and Deutsch (2018) and by Goovaerts (1997); we will follow the nice description by Goovaerts (1997, p. 268).

Step 1. Arrange the sample data in ascending order, breaking ties if necessary.

> # First compute normal scores of Yield at sample points:
> # Arrange Yield values in ascending order
> # data.Set4.2 is the original sample data, still a data frame

> data.Yield.ord <- data.Set4.2$Yield[order(data.Set4.2$Yield)]
> # Check for ties
> min(diff(data.Yield.ord))
[1] 0.5649183
Step 2. Compute the cumulative frequency of the ordered data values (i.e., the quantiles). In the next section we will also need their coordinates as well as their record numbers, so we will include these as well.

> data.Set4.2.ord <- data.frame(Yield = data.Yield.ord,
+    Yquant = seq(1:length(data.Yield.ord)) /
+       length(data.Yield.ord),
+    Easting = data.Set4.2$Easting[order(data.Set4.2$Yield)],
+    Northing = data.Set4.2$Northing[order(data.Set4.2$Yield)],
+    ptno = data.Set4.2$ptno[order(data.Set4.2$Yield)])
Step 3. Match the quantile of each of the ordered values with the corresponding quantile of the normal distribution.

> data.Set4.2.ord$Z <- qnorm(data.Set4.2.ord$Yquant)
> data.Set4.2.ord$Z[length(data.Set4.2.ord$Z)]
[1] Inf
> data.Set4.2.ord$Z[length(data.Set4.2.ord$Z) - 1]
[1] 2.231606
> data.Set4.2.ord$Z[length(data.Set4.2.ord$Z)] <- 3.0
The function qnorm() assigns the value Inf to the largest quantile, so this is replaced with a suitably large finite value. In theory the normal score transformation produces a quantity Z(x) with mean exactly zero, but in practice numerical issues plus our adjustment of the largest quantile may upset this a bit, so we will adjust the values to return their mean to zero.

> # Give the z values a mean of zero
> mean(data.Set4.2.ord$Z)
[1] 0.03846154
> data.Set4.2.ord$Z <- data.Set4.2.ord$Z - mean(data.Set4.2.ord$Z)
Fig. 5 shows the normal score transformation schematically. The fiftieth quantile Yield out of 78 has the value 5167 kg, and corresponds to the quantile value 50/78 = 0.64. The quantile value 0.64 of the normal distribution on the right corresponds to the value 0.32, which is the normal score.

                                     Figure 5. Diagrammatic representation of normal score transformation. Based on similar figures by Goovaerts (1997, p. 268) and Pyrcz and Deutsch (2018).

After the Gaussian simulation is completed the results must be transformed back to the original data units. Because the data are so densely arranged, this can be accomplished by simple linear interpolation.
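As a side note, base R’s approx() function performs exactly this kind of piecewise linear interpolation, so the back transform can be sketched in a single call; this assumes the normal-score table data.Set4.2.ord (columns Z and Yield) built above.

> # Sketch of the back transform using approx(): rule = 2 clamps simulated
> # values that fall outside the range of the observed normal scores
> back.transform <- function(z) approx(x = data.Set4.2.ord$Z,
+    y = data.Set4.2.ord$Yield, xout = z, rule = 2)$y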

3. The Conditional Gaussian Simulation Algorithm

In Section 2 we created a normal score transformation Z of the yields of the sample points. We will use this transformed quantity in the simulation. Here are the steps.

Step 1, the visit order. Since we start with a set of 78 locations whose Z value is known, we will randomly pick one of these as our starting point. Then we randomly arrange all the locations, sampled and unsampled.

> set.seed(123)
> print(pt1 <- sample(samp.pts, 1))
[1] 5763
> visit.order <- sample(1:nrow(Y.full.data.sf),
+   replace = FALSE)
> visit.order <- c(pt1, visit.order[-which(visit.order == pt1)])
> visit.order[1:10]
[1] 5763 2463 2511 8718 2986 1842 9334 3371 4761 6746
Step 2. At the current visit point, if it is a sample point, insert its known value. If not, determine the parameters (mean and variance) of the Gaussian cumulative distribution function using simple kriging of nearby known values. This is where software such as gstat uses Bayesian updating; we will use simple matrix inversion. We will base our estimate on the ten closest points whose value has already been determined. We create a data field Zsim in the original sf object Y.full.data.sf to hold the simulated values. We also determine the quantities to be used in the simple kriging.

> Y.full.data.sf$Zsim <- 0
> data.varn <- variogram(object = Z ~ 1, locations =
+    data.Set4.2.ord.sp, cutoff = 600)
> data.fitn <- fit.variogram(object = data.varn, model =
+    vgm(1, "Sph", 250, 1))
> print(Z.nug <- data.fitn$psill[1])
[1] 0
> print(Z.sill <- Z.nug + data.fitn$psill[2])
[1] 1.157344
> print(Z.range <- data.fitn$range[2])
[1] 239.7456
> n.kr.pts <- 10
The variogram() argument cutoff is the spatial separation distance up to which point pairs are included in the computation. The arguments of the function vgm() are respectively the initial guess of the partial sill, the model, and the initial guesses of the range and the nugget. Again, the partial sill is the sill minus the nugget. Fig. 6 shows the setup for the 100th visit point. The point to be simulated is shown in red, the sample points are shown in green, and the points whose Z values have already been simulated are shown in blue. It is evident that the simulation is based on a combination of sample values and previously simulated values.
Figure 6. Diagram of the simulation data for the one hundredth visit point.

Next we create a function closest.points() to find the closest points to the current visit point. The argument coord.data is used to pass the data frame Y.full.data.sf. The argument n.pts has the value 10.

> closest.points <- function(visit.pt, coord.data,
+    known.pts, n.pts){
+    c.pts <- numeric(n.pts)
+ # Determine the coordinates of the current visit point
+    visit.coords <- with(coord.data, c(Easting[visit.pt],
+          Northing[visit.pt]))
+ # Determine the coordinates of all the points whose value is known
+    known.coords <- with(coord.data,
+        cbind(Easting[known.pts], Northing[known.pts]))
+ # Respectively go through the known points and find the
+ # n.pts closest points

+    test.coords <- known.coords
+    for (n in 1:n.pts){
+       dist.sq <- (test.coords[,1] - visit.coords[1])^2 +
+          (test.coords[,2] - visit.coords[2])^2
+       min.pt <- which.min(dist.sq)
+       c.pts[n] <- known.pts[min.pt]
# Once a point is selected, change its coordinates to an
# impossible value to keep it from being selected again

+       test.coords[min.pt,] <- 0
+       }
+    return(c.pts)
+    }
We use the array known.pts to hold the indices of the locations whose Z value is known. Initially these are the sample points.

> known.pts <- samp.pts

At each point in the visit order, if it is a sample point then that value is used (Step 2a). These values are assigned at the start.

> for (i in 1:length(samp.pts))
+    Y.full.data.sf$Zsim[samp.pts[i]] <- with(data.Set4.2.ord,
+       Z[which(ptno == samp.pts[i])])
If the visit point is not a sample point then the Monte Carlo computation is implemented. Rather than using the R function dist() to compute the distance matrix, as was done for the simple four-point example, it is more convenient to use our own function to compute the distances.

> pt.dist <- function(x1, x2) sqrt(sum((x1 - x2)^2))

There is a lot of code, so I have tried to liberally include comments to explain what is going on. The drawing of the value from the normal distribution takes place at the end of the loop. The first line of code sets up the cycling through the visit points.

> for (visit.pt in visit.order){
+ # If the visit point is not a sample point, use the kriging variance to
+ # construct the distribution and draw a value from this
+ # distribution

+    if (!(visit.pt %in% samp.pts)){
+ # Find the closest points to the visit point
+       cl.pts <- closest.points(visit.pt, Y.full.data.sf,
+          known.pts, n.kr.pts)
+ # Add the visit point to end of the set of closest points
+       krige.pts <- c(cl.pts, visit.pt)
+ # Determine the distance matrix
+       h.mat <- matrix(0, length(krige.pts), length(krige.pts))
+       for (i in 1:length(krige.pts))
+          for (j in 1:length(krige.pts)){
+             x1 <- with(Y.full.data.sf,
+               c(Easting[krige.pts[i]],Northing[krige.pts[i]]))
+             x2 <- with(Y.full.data.sf,
+               c(Easting[krige.pts[j]],Northing[krige.pts[j]]))
+             h.mat[i,j] <- pt.dist(x1, x2)
+             }
+ # Determine the gamma matrix
+       gam.mat <- matrix(0, nrow(h.mat), nrow(h.mat))
+       for (i in 1:(n.kr.pts + 1))
+          for (j in 1:(n.kr.pts + 1))
+             gam.mat[i,j] <-
+                sph.vgm(Z.nug, Z.sill, Z.range, h.mat[i,j])
+       C.full <- Z.sill - gam.mat
+       C.mat <- C.full[1:(n.kr.pts), 1:(n.kr.pts)]
+       c.vec <- C.full[1:(n.kr.pts),(n.kr.pts + 1)]
+ # Solve for the phi values
+       phi <- solve(C.mat, c.vec)
+ #Estimate the mean via simple kriging
+       mu.est <- sum(phi * Y.full.data.sf$Zsim[cl.pts])
+ # Variance equals the sill
+       sd.est <- sqrt(Z.sill)
+ # Draw a value from a normal distribution with these parameters
+       Y.full.data.sf$Zsim[visit.pt] <- rnorm(1, mu.est, sd.est)
+ # Add the visit point to the list of known points
+       known.pts <- c(known.pts, visit.pt)
+       }
+    }
If you use the code accompanying this post to run this example, then as you watch the little blue wheel go around, think about the fact that when we repeat this example in the next section using the gstat function krige(), the computation will be effectively instantaneous. R library functions are generally better than anything we could make by hand. Finally, we use interpolation to transform the simulation back to the original scale (the function to do this was created in Section 1 but will be shown in Section 4). Fig. 7 shows the result. It is a considerably cruder simulation of the original data, although it does preserve some of the reduction in yield on the west side of the field.

Figure 7. Results of the “homemade” conditional simulation.

Although the homemade simulation is clearly much cruder, the general yield pattern is still discernible. In the next section we will go through the same code as was used in Section 1 to illustrate the use of the gstat package to carry out conditional and unconditional Gaussian simulation.

4. Conditional and Unconditional Gaussian Simulation using gstat

Now that we have an idea of how conditional Gaussian simulation works, we will go over the code used to generate Figs. 2b and 3 in Section 1. After that we will create code to generate an unconditional Gaussian simulation and compare the two. The R code in the code file accompanying this Additional topic is repeated to match the presentation order here. The first step is the normal score transformation of the 78 sample points. This has already been described in Subsection 2e. The next step is to create a SpatialPointsDataFrame for the gstat function variogram() to compute and fit the variogram. We have already seen the last four lines of code in Section 3.

> coordinates(data.ordered) <- c("Easting", "Northing")
> proj4string(data.ordered) <- CRS("+proj=utm +zone=10 +ellps=WGS84")
> data.varn <- variogram(object = Z ~ 1,
+    locations = data.ordered, cutoff = 600)
> data.fitn <- fit.variogram(object = data.varn,
+    model = vgm(1, “Sph”, 250, 1))


The last step is the actual computation of the conditional Gaussian simulation.

> Z.sim <- krige(formula = Z ~ 1, locations = data.ordered,
+   newdata = Y.full.data.sp, model = data.fitn, nmax = 10, nsim = 6)
drawing 6 GLS realisations of beta…
[using conditional Gaussian simulation]

> class(Z.sim)
[1] "SpatialPixelsDataFrame"
attr(,"package")
[1] "sp"

> spplot(Z.sim) # Check
Note that I am using krige() to carry out the conditional simulation. There are other ways to do it, but I find this the most straightforward. The function krige() is a “wrapper” for the function gstat(). A wrapper function provides a more convenient interface to another function, often for certain specialized applications. The argument locations contains the known data values, in the form of a Spatial object such as a SpatialPointsDataFrame, and the argument newdata contains the locations at which the simulation values are to be computed, again in the form of a Spatial object. In this case I use a SpatialPixelsDataFrame. The argument nsim is the number of simulations to be computed. The function gstat() is so efficient that it does not take much more time to compute 100, or even 1000, simulations than to compute one (you will use this feature in Exercise 3). The last step in the code above is to do a quick and dirty plot to check that the result is reasonable.

The code for the function krige() when used for ordinary kriging is very similar to that used for conditional Gaussian simulation, so let’s take a look at the former in order to compare the two. Here is the call to the function krige() used to produce Fig. 2a.

> Yield.krige <- krige(formula = Yield ~ 1,
+    locations = data.Set4.2.sp, newdata = Y.full.data.sp,
+    model = data.fit)
The difference in the two function calls is the assignment of a positive value to the argument nsim in the conditional Gaussian simulation (the default value of nsim is zero). This positive value signals that conditional Gaussian simulation rather than kriging is to be used; the value of nsim indicates the number of simulations to be computed. The argument nmax provides an indication of the number of known values to be used, but the algorithm is much more sophisticated than the simple one used in Section 3. The last step in the conditional Gaussian simulation process is to transform the data back to the initial units. As indicated, this is done using simple linear interpolation. We first create an interpolation function xform() and then do the transformation.

> xform <- function(in.vec, out.vec, val){
+    i <- 1
+    while (i < length(in.vec) & val >= in.vec[i+1]) i <- i + 1
+    if (i == length(in.vec)) return(out.vec[i])
+    return(out.vec[i] + ((val - in.vec[i]) /
+      (in.vec[i+1] - in.vec[i])) * (out.vec[i+1] - out.vec[i]))
+    }
> # Back transform on a point by point basis
> Yield.sim <- Z.sim
> for(i in 1:nrow(Yield.sim@data))
+    for(j in 1:ncol(Yield.sim@data))
+       Yield.sim@data[i,j] <- xform(data.ordered@data$Z,
+          data.ordered@data$Yield, Z.sim@data[i,j])
Now we move on to unconditional Gaussian simulation. The difference between conditional and unconditional simulation is that in unconditional simulation only previously simulated values are used in Step 3a, the establishment of “known” values. The simulation begins with no “known” sample values. Unconditional Gaussian simulation cannot be done using krige() but rather requires the direct use of the function gstat(). The argument dummy = T signals that this is an unconditional simulation. The argument nmax is the number of nearest observations used for a simulation. The object returned by the function gstat() is then used as an argument in the function predict() to generate the actual simulation.

> Z.pred <- gstat(formula = Z ~ 1, locations = ~Easting + Northing,
+   dummy = T, beta = 0, model = data.fitn, nmax = 20)
> set.seed(123)
> Z.usim <- predict(Z.pred, newdata = Y.full.data.sp, nsim = 6)
[using unconditional Gaussian simulation]
In a simple kriging application beta is the value of the mean, which in our Gaussian simulation is zero. Fig. 8 shows a plot of the result. The simulation has the same general local spatial structure as the original data but does not preserve the global spatial pattern.

Figure 8. Unconditional simulation using yield data
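For completeness, the unconditional realizations can be back-transformed to yield units in the same way as the conditional ones. The following sketch simply mirrors the back-transform loop shown earlier; it assumes the same xform() function and normal-score data, and that Z.usim, like Z.sim, is returned as a Spatial object with a @data slot.

> # Back transform the unconditional simulations on a point by point basis
> Yield.usim <- Z.usim
> for(i in 1:nrow(Yield.usim@data))
+    for(j in 1:ncol(Yield.usim@data))
+       Yield.usim@data[i,j] <- xform(data.ordered@data$Z,
+          data.ordered@data$Yield, Z.usim@data[i,j])
> spplot(Yield.usim) # quick check of the realizations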

5. Further Reading

Much of the discussion in this Additional Topic is taken from Leuangthong et al. (2008). Goovaerts (1997) contains the same material, but the presentation is at a higher mathematical level. Wackernagel (2003) and Journel (1989) contain more mathematically sophisticated discussions of the relationship between kriging and linear regression.


6. Exercises

  1. Using the four-point data set of Fig. 4, compute a conditional Gaussian simulation of the fourth point based on the other three.

  2. Try carrying out conditional simulation on the original Yield data without using the normal score transformation. How does the result compare with that computed here?

  3. In principle, since a conditional Gaussian simulation is a set of pointwise Monte Carlo simulations, the mean of a large number of simulations should converge to the kriged interpolation. Compute 10, 100, and 1000 simulations of the normalized data and see if they do converge.

  4. A naïve way to address the problems associated with kriging would be to simply krige the data and then add a normally distributed random variable to each kriged value using the variance as computed from the variogram. Try this out and compare the results with those obtained using conditional Gaussian simulation. How do you explain the difference?

  5. One of Goovaerts’ assertions about kriging is that smoothing is least close to the data locations and increases as the location being estimated gets farther away. Using the variogram information from the full 78 sample points (computing a variogram with only 39 data points is not wise), carry out a kriging interpolation and a conditional Gaussian simulation based on only the western half of the sample data and see whether kriging in the “unsampled” eastern half is indeed smoother than the simulation.

  6. Using multiple simulations, construct several histogram and variogram comparisons similar to those of Fig. 4b and Fig. 5b and see how they compare with each other.

7. References

Box, G. E. P., and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Menlo Park, CA.
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press, Oxford, UK.
Isaaks, E. H., and R. M. Srivastava (1989). An Introduction to Applied Geostatistics. Oxford University Press, Oxford, UK.
Journel, A. G. (1989). Fundamentals of Geostatistics in Five Lessons. American Geophysical Union, Washington, DC.
Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li (2005). Applied Linear Statistical Models. McGraw-Hill Irwin, Boston, MA.
Larsen, R. J., and M. L. Marx (1986). An Introduction to Mathematical Statistics and Its Applications. Prentice-Hall, Englewood Cliffs, NJ.
Leuangthong, O., K. D. Khan, and C. V. Deutsch (2008). Solved Problems in Geostatistics. Wiley, Hoboken, NJ.
Pebesma, E. (2003). gstat User's Manual. Utrecht University, Utrecht, The Netherlands.
Pebesma, E. J. (2004). Multivariable geostatistics in S: the gstat package. Computers and Geosciences 30: 683-691.
Pyrcz, M. J., and C. V. Deutsch (2018). Transforming Data to a Gaussian Distribution. https://geostatisticslessons.com/lessons/normalscore
Ripley, B. D. (1981). Spatial Statistics. Wiley, New York.
Wackernagel, H. (2003). Multivariate Geostatistics. Springer, Berlin.
Zanon, S., and O. Leuangthong (2004). Implementation Aspects of Sequential Simulation. Geostatistics Banff 2004. DOI: 10.1007/978-1-4020-3610-1_55

Copyright © 2022 Richard E. Plant

RBaseX, a BaseX client written in R

The primary programming language used in the Data Science course at Erasmus University Rotterdam was R. So when the topic ‘XML, XQuery and XSL’ was tackled, there was a need for a tool that could work with R. BaseX, an open source XML database, offered the functionality that was needed for the exercises, but there was no R support. However, a server protocol was available in which all communication is fully described. This protocol has been used for the development of RBaseX. Version 1.1.1 of this client is available at https://cran.r-project.org/web/packages/RBaseX/index.html.

All following examples are based on the minutes of the Dutch Parliament, which are freely available in XML format. BaseX is written in Java. It is shipped as .jar, .zip, .exe or .war. On Windows and Linux, it works out of the box. All clients are based on the client/server architecture. Hence, a BaseX database server must be started first. RBaseX can be used in two modes:
  • Standard mode is used for connecting to a server and sending commands
  • Query Mode is used for defining queries, binding variables, iterative evaluation

Standard mode

RBaseX is developed using the R6 package for object-oriented programming. All interactions between the client and BaseX begin with the creation of a new instance of BasexClient.
The following command creates a new BasexClient instance and stores it in the variable Session. This Session is used in all the examples.
library(RBaseX)
Session <- BasexClient$new("localhost", 1984L, username = "admin", password = "admin")
With the Command method, the Create statement can be used to create new databases. If a database with that name already exists, it is overwritten.
Existing databases can be opened with the Open statement. BaseX also has a convenient Check statement which combines Open and Create: if a database with that name already exists, it is opened; otherwise it is created (and opened).
Session$Command("Check Parliament")
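For comparison, the explicit commands look roughly like this (a sketch only; the CREATE DB and OPEN command syntax is assumed from the BaseX documentation):
Session$Command("Create DB Parliament")  # create (or overwrite) the database
Session$Command("Open Parliament")       # open an existing database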
After the client has sent a command to the server, the server responds by sending the outcome of the command as a raw vector (byte array). This outcome is converted to a list. In most cases this list has a $result attribute and, if appropriate, an $info attribute. To indicate whether the process was successful, a single byte is appended to the raw vector: a \0x00 byte means success, \0x01 indicates an error. The value of this status byte is stored as a private variable in the Session instance and can be accessed through the $success attribute.
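A minimal sketch of inspecting the outcome of a command (the Info command and the get_success() accessor described below are assumed to be available):
res <- Session$Command("Info")
str(res$result)        # the server response, converted from the raw vector
Session$get_success()  # TRUE when the trailing status byte was \0x00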
The public methods get_success() and set_success() are not described in the server protocol, and neither are set_intercept(), get_intercept() and restore_intercept(). I added them because I needed to catch errors. When the Intercept variable is FALSE and an error in your code is detected, execution stops. When Intercept is TRUE, you can catch the error. Scraping the site of the Parliament is surprisingly easy. A search for all the minutes of a given date returns a page that contains an ID and the number of documents. The two can be combined to construct a link to all the minutes in XML format. The files can be stored directly in the database. The following store function uses the BaseX Add command.
Session$set_intercept(TRUE)
store <- function(Period) {
  for (i in seq_along(Period)) {
    # minutes(): helper defined elsewhere that returns the links to the minutes for one date
    Minutes <- minutes(Period[i])
    if (!identical(Minutes, list())) {
      # Build the download links and add every document to the database
      Source <- paste("https://zoek.officielebekendmakingen.nl/", Minutes, sep = "")
      mapply(function(source) Session$Add("Debates", source), Source)
    }
  }
}
Thanks to set_intercept(), errors due to missing files are handled neatly. Downloading and storing the 740 files in the database took 3.5 minutes.

Querying

‘Simple’ XQuery statements can be handled as normal commands, so you can use the following statement to return the number of records that were inserted:
(Count <- Session$Command("xquery count(collection('Parliament'))")$result)
## [[1]]
## [1] "740"
I use the BaseX-GUI to develop queries. Queries that have been tested in the GUI are copied into the R-environment.
The following XQuery code returns, for every meeting, the debate ID, the names of the speakers, the order in which they spoke and the party they represent. result2frame() is a utility function which converts the results to a data frame.
library(glue)
library(knitr)
  
Query_Expr <-paste(
'import module namespace  functx = "http://www.functx.com";', 
'for $Debat in collection("Parliament")',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',
'  for $Speaker at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speaker/spreker/naam/achternaam/text()',
'    let $Pol := $Speaker/spreker/politiek/text()',
'    order by $debate-id, $CountInner',
'return($debate-id, $CountInner, $Spreker, ($Pol, "n.v.t")[1])')

result <- Session$Command(as.character(glue("xquery {Query_Expr}")))
result_frame <- result2frame(result$result, 4)
names(result_frame) <- c('Debate ID','Speech', 'Speaker', 'Party')

kable(head(result_frame,3))
Debate ID         Speech   Speaker      Party
20202021-102-10        1   voorzitter   n.v.t
20202021-102-10        2   voorzitter   n.v.t
20202021-102-2         1   voorzitter   n.v.t
As you can see, ordering by debate-id does not give the correct order. This would be corrected if the digits following the second hyphen were padded with zeros. Since some meetings have more than 100 items, I pad to length 3. It is easy to write a function for this transformation and incorporate it in the query expression.
import module namespace functx = "http://www.functx.com";
declare function local:order_id
  ( $Meeting as xs:string)
{ let $debate-id := fn:analyze-string( $Meeting, "(\d{8}-\d*)-(\d*)")
  let $date := $debate-id//fn:match/*:group[@nr="1"]/text()
  let $item-nr := functx:pad-integer-to-length($debate-id//fn:match/*:group[@nr="2"]/number(), 3)
  return fn:string-join(($date, $item-nr), "-")
};
After incorporating this function in the query, ordering is correct.
Debate ID         Speech   Speaker      Party
20202021-102-2         1   voorzitter   n.v.t
20202021-102-2         2   Bromet       GroenLinks
20202021-102-2         3   voorzitter   n.v.t

Query mode

ExecuteQuery

The Session$Query() command does not execute a query expression. It creates a new instance of the QueryClass and adds this instance to the Session variable. It is this QueryClass instance that has an ExecuteQuery() method. The following example demonstrates this by extracting the text from the minutes.
Query_Expr <-paste(
'import module namespace  functx = "http://www.functx.com";', 
'for $Debat at $CountOuter in collection("Parliament")',
'    where $CountOuter <=2',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',  
'  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speach/spreker/naam/achternaam/text()',
'    let $Pol := $Speach/spreker/politiek/text()',
'    order by $debate-id, $CountInner',
'    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
'       let $tekst := fn:string-join(fn:data($par//al/text()), "
")',
'    return($debate-id, $Spreker, ($Pol, "n.v.t")[1], $CountPar, $tekst)'
)

QueryExe <- Session$Query(Query_Expr)
result <- QueryExe$queryObject$ExecuteQuery()
result_frame <- result2frame(result, 5)
names(result_frame) <- c('Debate ID','Speaker', 'Party', 'Order', 'Text')

kable(head(result_frame,1))
Debate ID        Speaker      Party   Order   Text
20202021-102-2   voorzitter   n.v.t       1   Allereerst hebben we het traditionele mondelinge vragenuur. …

Binding

An XQuery expression may use external variables. Before they can be used, they have to be bound to the queryObject. In the following example, Bind() is used to bind $Regex to a regular expression.
library(dplyr)
library(knitr)
Query_Bind <- Session$Query(paste(
'declare variable $Regex external;',
'for $Debat at $CountOuter in collection("Parliament")',
'    where $CountOuter <= 2',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',  
'  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speach/spreker/naam/achternaam/text()',
'    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
'       let $tekst := fn:string-join(fn:data($par//al/text()), "
")',
'    return($debate-id, $Spreker, $tekst)'))

Query_Bind$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")
## $Binding
## character(0)
## 
## $success
## [1] TRUE
BindResult <- Query_Bind$queryObject$ExecuteQuery()
ResultFrame <- result2frame(BindResult, 3)
names(ResultFrame) <- c('Debate ID','Speaker', 'Text')

kable(head(ResultFrame,1))
Debate ID      Speaker      Text
20202021-102   voorzitter   Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven weer geschiedenis. …


Iteration

Sometimes there is a need to iterate over the results of a query. When the More()/Next() construct is used, BaseX prefixes every item with a byte that represents the Type ID or XDM metadata. For every item, RBaseX creates a list whose elements are this byte and the value of the item.
In the following example I am only interested in every third item. The example shows how to iterate over all the items and also how to use the wrapper functions which are available for all the commands.
Query_Iter <- Session$Query(paste(
  'declare variable $Regex external;',
  'for $Debat at $CountOuter in collection("Parliament")',
  '    where $CountOuter <= 2',
  '  let $debate-id := fn:analyze-string(',
  '    $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',  
  '  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
  '    let $Spreker := $Speach/spreker/naam/achternaam/text()',
  '    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
  '       let $tekst := fn:string-join(fn:data($par//al/text()), "
")',
  '    return($debate-id, $Spreker, $tekst)'))

Query_Iter$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")
## $Binding
## character(0)
## 
## $success
## [1] TRUE
IterCnt <- 0
IterResult <- c()
while (IterCnt <= 10 && More(Query_Iter)) {
  NextItem <- Next(Query_Iter)
  IterCnt <- IterCnt + 1
  if (IterCnt %% 3 == 0) {
    IterResult <- c(IterResult, NextItem)
    (str(NextItem[[1]][[2]]))
  }
}
##  chr "Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven we"| __truncated__
##  chr "Voorzitter. Het was altijd al een eer om hier te staan. Vandaag is dat het in het bijzonder. We voelen ons alle"| __truncated__
##  chr "Dank u wel. Het woord is nu aan de minister. Gaat uw gang."

BaseX modules

BaseX provides many more functionalities. A complete overview can be found at https://docs.basex.org/wiki/Main_Page.

Create a hyper-marketing model using Naïve Bayes

By Huey Fern Tay with Greg Page

Everyone loves an extra income stream – even the super-wealthy owners of luxurious properties that they inhabit for only a few weeks each year.  Offering a property as a short-term rental through a platform like Airbnb can provide a wonderful side hustle.  For some owners, however, the associated hassles could be a powerful deterrent to using the service.  Text messages at 3 a.m. about Wi-Fi passwords, stopped-up toilets, and the lack of water pressure in the shower might be just enough to tip the scales against such an undertaking…especially when such messages are followed up by angry “Why isn’t this fixed yet?” queries just 30 minutes later.

So what if an all-in-one concierge service could take away ALL of those hassles?  If an intermediary service could handle all of the tenant interactions, the marketing, the logistics of the key hand-offs, etc. then suddenly the idle rich jetsetters might be a bit more willing to open up their pied-a-terres to the unbathed masses.  Such a service would benefit all stakeholders – travelers would have more options, the property owners would earn more income, and the platform would receive more commission fees from the extra transactions. In exchange for a fee paid to the service, willing property owners could have a side hustle that was “all side, no hustle.”  

Let’s imagine that such a service is looking to establish itself, with an initial marketing outreach effort to high-end property owners not already using Airbnb.  Let’s also imagine that this new service is operating on a shoestring budget, and therefore needs to target its outreach carefully. How can it identify the properties within a city that are most likely to command high values in the short-term rental market?

The Naïve Bayes classifier is a good candidate for the task at hand because of its simplicity, computational efficiency, and ability to handle categorical variables.  Furthermore, its classification outcomes come with associated probability values – we can use those to identify records that are most likely to belong to some particular group. 

To illustrate how this method could be used to solve the business problem outlined above, I will use Airbnb data from San Francisco listings.

One of the first decisions the modeler must make is how to bin the data. In this case, the question is which numerical variable to use to separate the properties: price, review ratings, or number of reviews? Each of these variables has a different impact on the outcome and may not be equally effective at separating the classes. If the classes are not well separated, then even a large dataset will not be helpful.

The next important decision is the number of classification categories to create, and the cut-off for each group. In other words, should you create four groups and bin them equally? Or should you create three groups by dividing the data in a 15-70-15 proportion, 20-60-20, 30-40-30, etc.? The decision made at this step has a big impact on the model, as the sketch below illustrates.
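For instance, equal-frequency binning on price might look like this (a sketch only; the data frame sf_listings, the price column and the tier labels are illustrative, not the original analysis):
library(dplyr)

sf_listings <- sf_listings %>%
  mutate(
    # Model 1: four equal-frequency price tiers
    tier4 = cut(price,
                breaks = quantile(price, probs = c(0, 0.25, 0.50, 0.75, 1)),
                include.lowest = TRUE,
                labels = c("Budget", "Average", "Above Average", "Pricey Digs")),
    # Model 2: three equal-frequency price tiers
    tier3 = cut(price,
                breaks = quantile(price, probs = c(0, 1/3, 2/3, 1)),
                include.lowest = TRUE,
                labels = c("Budget", "Average", "Above Average and Pricey Digs")))
table(sf_listings$tier3)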

Consider the two models below, each created with 3,811 rows of data representing 60% of the total dataset. Both models were built with the same predictor variables, such as the number of bedrooms, bathrooms, property type, location, etc., but in Model 1 the data was binned into 4 equal groups while in Model 2 the data was binned into 3 equal groups. The summary for Model 1 shows an accuracy of 54.19%, which is good considering that this is slightly more than double the No Information Rate (naïve rate). Model 2’s accuracy is 65.36%, nearly double its No Information Rate.
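The three-group model could be fit along these lines (again a sketch: the 60/40 split, the predictor names and the use of e1071/caret are assumptions, continuing the hypothetical objects created above):
library(e1071)
library(caret)

set.seed(123)
train_idx <- createDataPartition(sf_listings$tier3, p = 0.6, list = FALSE)
train_df  <- sf_listings[train_idx, ]
valid_df  <- sf_listings[-train_idx, ]

nb_model <- naiveBayes(tier3 ~ bedrooms + bathrooms + property_type + neighbourhood,
                       data = train_df)
# Accuracy, No Information Rate and per-class sensitivity
confusionMatrix(predict(nb_model, valid_df), valid_df$tier3)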

These are encouraging results, but in our Airbnb example we are more interested in knowing how well the model performs when it is asked to classify properties into any of the classes used in the model. For this reason, it is worth considering the true positive rate, i.e. ‘sensitivity’. Model 1 is better at predicting the true positives for classes at opposite ends of the spectrum, which suggests it has difficulty reading nuances. Model 2’s performance in this regard is comparatively more balanced.


Naïve Bayes model with four classification outcomes



Naïve Bayes output with three classification outcomes


But wait – let’s get back to our original goal.  While overall accuracy is good to see, what we are most interested in here is identifying that high-end price group.  Owners of such units will be the best targets for our all-in-one concierge service.  Therefore, let’s dive a bit deeper to examine this model’s suitability for identifying such properties. 

By running the predict() function with the type=’raw’ parameter included, we can view the associated probabilities for each outcome class, and then rank records by probability of belonging to some particular outcome group. 
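Continuing the sketch (nb_model, valid_df and the tier labels are the hypothetical objects from above):
# Posterior probabilities for each class; columns are named after the tier labels
probs <- predict(nb_model, newdata = valid_df, type = "raw")

library(dplyr)
ranked <- valid_df %>%
  mutate(p_top = probs[, "Above Average and Pricey Digs"]) %>%
  arrange(desc(p_top))

# Share of true top-tier listings among the 100, 150 and 200 likeliest records
sapply(c(100, 150, 200),
       function(n) mean(head(ranked, n)$tier3 == "Above Average and Pricey Digs"))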

Taking this approach with the validation set, we find that among the 100 records identified by the model as most likely to land in the top tier group, 96 truly belonged to “Above Average and Pricey Digs.”   Among the 150 likeliest, 140 units, or 93.33%, actually belonged to that group, and among the 200 likeliest, 185 units, or 92.5%, were truly in that top tier. 

But that’s not all. It is worth going one step further and evaluating the model with lift charts or decile-wise lift charts, because these charts show how effectively our model ‘skims the cream’.

The decile-wise lift chart below illustrates how effectively the model can predict membership in the ‘Above Average and Pricey Digs’ group. When the model is used to identify the top 27% of properties most likely to be in this category, its performance is more than 3.5x better than a random guess.
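A decile-wise lift chart of this kind can be sketched with the gains package (using the hypothetical objects from above; the original charts may have been produced differently):
library(gains)

actual <- as.integer(valid_df$tier3 == "Above Average and Pricey Digs")
g <- gains(actual, probs[, "Above Average and Pricey Digs"], groups = 10)

# Lift per decile: mean response within each decile divided by the overall mean
barplot(g$mean.resp / mean(actual), names.arg = g$depth,
        xlab = "Depth of file (percentile)", ylab = "Decile-wise lift")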

Decile-wise lift chart showing the model is more than 3.5 times better than random at classifying the top 27 percent of properties

Another way to assess the model’s ability to identify top-tier rentals is with a two-dimensional lift chart.  Such a chart only works with two-outcome class scenarios, so we start here by collapsing the first and second tiers together, and then labelling that group as “other.”  

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

In the entire validation set, there are 780 units that land in the highest price tier.  The values along the x-axis represent all the validation set records, ranked in order of their probability of belonging to the highest-tier class.  The y-axis shows the number of correct predictions.  The solid line represents the model’s performance – it shows us, for instance, that among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group.  The line flattens out at around x=1500, because by that point, the model has already identified nearly all of the records that truly belonged to this outcome class. 

The dotted line, by contrast, shows how effective a model would be if it simply labelled all the records as belonging to the top tier.  Since 33.6% of the validation records belong to this group, each x-axis value here corresponds to a y-axis value that is exactly 33.6% as large.  The gap between the solid line and the dotted line represents the model’s improvement over that naive baseline as the number of cases increases.
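A simple version of such a chart can be drawn from the ranked validation records (same hypothetical objects as above):
# Cumulative correct predictions versus number of ranked validation records
is_top <- ranked$tier3 == "Above Average and Pricey Digs"
plot(seq_along(is_top), cumsum(is_top), type = "l",
     xlab = "Validation records, ranked by predicted probability of the top tier",
     ylab = "Cumulative top-tier listings identified")
# Naive baseline: label every record as top tier (~33.6% of records)
abline(a = 0, b = mean(is_top), lty = 2)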

What do these results mean for the concierge service?  Let’s revisit our original assumptions: 


  • such a service would most likely appeal to the owners of properties in the highest pricing tier;
  • the initial outreach efforts should be made to owners of properties not already registered with Airbnb; and that
  • the new service has a limited budget, and therefore needs to carefully focus its outreach efforts only on that tier of properties whose owners would be likeliest to use it
Given these assumptions, our primary interest does not lie with overall model accuracy.  The model’s ability to distinguish between the bottom two pricing tiers is almost immaterial to us; however, we are keenly interested in the answer to this question:  When the model predicts that a property will belong to the highest pricing tier, how often is it correct? 

As demonstrated here, the model delivers quite effectively in this regard, especially when we maintain a relatively narrow focus on the properties that are most likely to be top tier.  Splashy magazine inserts, Super Bowl advertisements, and big-ticket endorsements from celebrities might be in the cards for this service down the road, after it spreads across the globe and prepares for its IPO roadshow.  For now, though, the hyper-specific focus that can come from “skimming the cream” off the top of those Naïve Bayes model probability predictions may be the surest next step for this service’s success. 

Data source: Inside Airbnb
 

Walmart’s (WMT) 7-Year Nonlinear Market Trend at ‘Critical Juncture’

This is a brief market update from my earlier post on October 28, 2021, Walmart’s 7-Year Nonlinear Market Trend using ‘Stealth Curves.’

The prior post showed the then-current nonlinear market structure for Walmart (WMT) stock.

October 28, 2021 Market View



Fast forward to today …

This market is now resting on its multi-year lower Stealth Support Curve. As the below updated graphic indicates, this represents a critical juncture for this market.

Current Market View (1-27-2022)



This current chart can be produced using the code in my October 2021 post along with updated market data.  The data is freely available from Nasdaq.com here. Just be certain the updated data match the column headings in my ‘reformatted’ GitHub data file.  
 
For those interested in numerous Stealth Curve examples applied to various other markets, simply view this LinkedIn post.  Full details of Stealth Curve equation parameterization are described in my latest Amazon publication, Stealth Curves: The Elegance of ‘Random’ Markets.

Feel free to contact me directly on my LinkedIn site.
 

Brian K. Lee, MBA, PRM, CMA, CFA


Bifurcation diagram for nonautonomous SIR model

I am trying to produce a bifurcation diagram for a nonautonomous two-patch SIR model, but it is not giving the proper result. The model and parameter values are in the code. I solved the model with an ODE solver for a set of bifurcation parameter values and then took the maximum and minimum of the solution for the plot. How should I draw a phase plane analysis for my model? Any suggestions?

######### Bifurcation Analysis #########
library(deSolve)
library(tidyverse)
vals_betaAbar <- seq(1,70,by=1)  #bifurcation parameter

outdf = NULL

for (i in 1:length(vals_betaAbar)) {

betaAbar = vals_betaAbar[i]


sir.two.patch.model.closed <- function (t, x, params){
#create local variable IA, IB, RA, RB, SA, SB
IA <- x[1]
IB <- x[2]
RA <- x[3]
RB <- x[4]
SA <- x[5]
SB <- x[6]
#we can simplify the code using "with"
#this argument to "with" lets us use the variable names
with(
as.list(params),
{ #the system of rate equations
dIA <- betaAbar*(1+sigA*cos(2*pi*t/Ti))*IA*SA/Na - (muA+gamaA+phiIAB*(1+sigA*cos(2*pi*t/Ti)))*IA + phiIBA*(1+sigA*cos(2*pi*t/Ti))*IB
dIB <- betaBbar*(1+sigA*cos(2*pi*t/Ti))*IB*SB/NB - (muB+gamaB+phiIBA*(1+sigA*cos(2*pi*t/Ti)))*IB + phiIAB*(1+sigA*cos(2*pi*t/Ti))*IA
dRA <- gamaA*IA - (muA+phiRAB*(1+sigA*cos(2*pi*t/Ti)))*RA + phiRBA*(1+sigA*cos(2*pi*t/Ti))*RB
dRB <- gamaB*IB - (muB+phiRBA*(1+sigA*cos(2*pi*t/Ti)))*RB + phiRAB*(1+sigA*cos(2*pi*t/Ti))*RA
dSA <- muA*Na - betaAbar*(1+sigA*cos(2*pi*t/Ti))*IA*SA/Na - (muA+phiSAB*(1+sigA*cos(2*pi*t/Ti)))*SA + phiSBA*(1+sigA*cos(2*pi*t/Ti))*SB
dSB <- muB*NB - betaBbar*(1+sigA*cos(2*pi*t/Ti))*IB*SB/NB - (muB+phiSBA*(1+sigA*cos(2*pi*t/Ti)))*SB + phiSAB*(1+sigA*cos(2*pi*t/Ti))*SA
#combine results into a single vector dx
dx <- c(dIA,dIB,dRA,dRB,dSA,dSB)
#return result as a list
list(dx)
}
)
}
#function seq returns a sequence
times <- seq(0,100,length.out=3000)
#function c “c”combines values into a vector
params <- c(Na=100, NB=1000, betaAbar=betaAbar, betaBbar=5,muA=0.06, muB=0.02, gamaA=12, gamaB=12,phiIAB=0.4, phiIBA=0.04, phiSAB=0.4, phiSBA=0.04, phiRAB=0.4, phiRBA=0.04, sigA=0., Ti=1)

#initial conditions
xstart <- c(IA=5,IB=5,RA=0,RB=0,SA=100-5,SB=1000-5)
#result stored in data frame
out <- as.data.frame(ode(xstart,times,sir.two.patch.model.closed,params))

out$betaAbar= vals_betaAbar[i]

out=out[251:300,]

outdf = rbind.data.frame(outdf, out)

}


bifdata=NULL
for (i in 1:length(vals_betaAbar)) {

betaAbar = vals_betaAbar[i]


bifdata_A = NULL
bifdata_A$IA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$IA)
bifdata_A$IA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$IA)
bifdata_A$IB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$IB)
bifdata_A$IB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$IB)

bifdata_A$SA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$SA)
bifdata_A$SA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$SA)
bifdata_A$SB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$SB)
bifdata_A$SB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$SB)

bifdata_A$RA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$RA)
bifdata_A$RA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$RA)
bifdata_A$RB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$RB)
bifdata_A$RB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$RB)

bifdata_A$betaAbar= vals_betaAbar[i]

bifdata = rbind.data.frame(bifdata, bifdata_A)
}

dev.new()
ggplot(data=bifdata)+
geom_line(aes(x=betaAbar,y=IA_max))+
geom_point(aes(x=betaAbar,y=IA_min))

#############################

Predicting future recessions

Even if this sounds incredible, yes, we can predict future recessions using a couple of time series, some simple econometric models, and … R

The basic idea is that the slope of the yield curve is somewhat linked to the probability of future recessions. In other words, the difference between the short and the long term rate can be used as a tool for monitoring business cycles. Nothing new about that: Kessel (1965) documented the cyclical behavior of the yield spread, and he showed that the yield spread tended to decline immediately before a recession. This relationship is one of the most famous stylized facts among economists (see Figure 1).

Figure 1: The yield spread and recessions

– So, why don’t people use this model to predict recessions?
– Well, it seems to be related to the fact that (i) they think it only used to work in the US, and (ii) they don’t feel qualified to run sophisticated R code to estimate this relationship.

This post is about answering these two questions: (i) yes, the yield curve does signal recessions, and (ii) yes, it is easy to monitor economic cycles with R using the EWS package!

First, if you have some doubts about the predictive power of the yield spread, please have a look at Hasse and Lajaunie (2022)’s recent paper, published in the Quarterly Review of Economics and Finance. The authors – Quentin and I – reexamine the predictive power of the yield spread across countries and over time. Using a dynamic panel/dichotomous model framework and a unique dataset covering 13 OECD countries over the period 1975–2019, we empirically show that the yield spread signals recessions. This result is robust to different econometric specifications, controlling for recession risk factors and time sampling. The main results are reported in Table 1.

Table 1: Estimation of the predictive power of the yield spread (1975–2019)

– Wait, what does “dichotomous model” mean?
– Don’t be afraid: the academic literature provides a specific econometric framework for predicting future recessions.

Estrella and Hardouvelis (1991) and Kauppi and Saikkonen (2008) advanced the use of binary regression models (probit and logit models) to model the relationship between recession dummies (i.e., binary variables) and the yield spread (i.e., a continuous variable). Classic linear regressions cannot do the job here. If you have specific questions about probit/logit models, you should have a look at Quentin’s PhD dissertation; he is a specialist in nonlinear econometrics.

Now, let’s talk about the EWS R package. In a few words, the package is available on CRAN and includes the data and code you need to replicate our empirical findings. So you only have to run a few lines of code to estimate the predictive power of the yield spread. Not bad, eh?

Here is an example focusing on the US: first install and load the package, then we extract the data we need.
# Load the package
library(EWS)

# Load the dataset included in the package
data("data_USA")
# Print the first six rows of the dataframe
head(data_USA)
Well, now we just have to run these few lines:
# Data process
Var_Y <- as.vector(data_USA$NBER)
Var_X <- as.vector(data_USA$Spread)
# Estimate the logit regression
results <- Logistic_Estimation(Dicho_Y = Var_Y, Exp_X = Var_X, Intercept = TRUE, Nb_Id = 1, Lag = 1, type_model = 1)
# print results
print(results)
The results can be printed… and plotted! Here is an illustration of what you should have:

$Estimation
name Estimate Std.Error zvalue Pr
1 Intercept -1.2194364 0.3215586 -3.792268 0.0001492776
2 1 -0.5243175 0.2062655 -2.541955 0.0110234400
and then what you could plot:

Figure 2: The predictive power of the yield spread in the US (1999-2020)

Nice output; let’s interpret what we have. First the estimation results: the intercept is equal to -1.21 and is highly significant, and the coefficient on the lagged yield spread is equal to -0.52 and is also highly significant. This basic result illustrates the predictive power of the yield spread.

– But why does “1” appear instead of the name of the lagged variable? And what if we choose another lag? And what if we choose model 2 instead of model 1?
– “1” refers to the number associated with the lagged variable, and you can change the model or the number of lags via the function arguments:
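For example (a sketch only – the exact coding of type_model should be checked in the EWS documentation; type_model = 2 is assumed here to add the lagged index to the specification):
results_dyn <- Logistic_Estimation(Dicho_Y = Var_Y, Exp_X = Var_X,
                                   Intercept = TRUE, Nb_Id = 1,
                                   Lag = 1, type_model = 2)
print(results_dyn)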
$Estimation
name Estimate Std.Error zvalue Pr
1 Intercept 0.08342331 0.101228668 0.8241075 4.098785e-01
2 1 -0.32340655 0.046847136 -6.9034433 5.075718e-12
3 Index_Lag 0.85134073 0.003882198 219.2934980 0.000000e+00

Last but not least, you can choose the best model according to AIC, BIC or R2 criteria:
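These criteria are returned with the estimation results (a sketch, assuming they are stored as list elements of the returned object, as the printed output below suggests):
results$AIC
results$BIC
results$R2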

$AIC
[1] 164.0884
$BIC
[1] 177.8501
$R2
[1] 0.2182592

Everything you need to know about the predictive power of the yield spread is here. These in-sample estimations confirm empirical evidence from the literature for the US. And for those who are interested in out-of-sample forecasting… the EWS package provides what you need. I’ll write another post soon!

References

Estrella, A., & Hardouvelis, G. A. (1991). The term structure as a predictor of real economic activity. The Journal of Finance, 46(2), 555-576.

Hasse, J. B., & Lajaunie, Q. (2022). Does the yield curve signal recessions? New evidence from an international panel data analysis. The Quarterly Review of Economics and Finance, 84, 9-22.

Hasse, J. B., & Lajaunie, Q. (2020). EWS: Early Warning System. R package version 0.1.0.

Kauppi, H., & Saikkonen, P. (2008). Predicting US recessions with dynamic binary response models. The Review of Economics and Statistics, 90(4), 777-791.

Kessel, R. A. (1965). The Cyclical Behavior of the Term Structure of Interest Rates. NBER Occasional Paper 91, National Bureau of Economic Research.

Detecting multicollinearity — it’s not that easy sometimes

By Huey Fern Tay with Greg Page

When are two variables too related to one another to be used together in a linear regression model? Should the maximum acceptable correlation be 0.7? Or should the rule of thumb be 0.8? There is actually no single, ‘one-size-fits-all’ answer to this question.

As an alternative to using pairwise correlations, an analyst can examine the variance inflation factor, or VIF, associated with each numeric input variable. Sometimes, however, pairwise correlations, and even VIF scores, do not tell the whole story.

Consider this correlation matrix created from a Los Angeles Airbnb dataset.



Two item pairs identified in the correlation matrix above have a strong correlation value:

  • Beds + log_accommodates (r = 0.701)

  • Beds + bedrooms (r = 0.706)

Based on one school of thought, these correlation values are cause for concern; meanwhile, other sources suggest that such values are nothing to be worried about.

The variance inflation factor, which is used to detect the severity of multicollinearity, does not suggest anything unusual either.

library(car)
vif(model_test) 

The VIF for each potential input variable is found by making a separate linear regression model, in which the variable being scored serves as the outcome, and the other numeric variables are the predictors. The VIF score for each variable is then

VIF = 1 / (1 - R²)

where R² is the r-squared of that auxiliary regression.
When the other numeric inputs explain a large percentage of the variability in some other variable, then that variable will have a high VIF. Some sources will tell you that any VIF above 5 means that a variable should not be used in a model, whereas other sources will say that VIF values below 10 are acceptable. None of the vif() results here appear to be problematic, based on standard cutoff thresholds.
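Working back from a VIF score to the underlying r-squared just inverts this formula: R² = 1 - 1/VIF. A quick sketch (the VIF value of about 2.57 for beds is an assumption, chosen to be consistent with the r-squared of roughly 0.61 quoted below):
vif_beds <- 2.57       # assumed VIF for beds (the actual value appears in the vif() output above)
1 - 1 / vif_beds       # implied r-squared, approximately 0.61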

Based on the vif() results shown above, plus some algebraic manipulation of the VIF formula, we know that a model predicting beds as the outcome, using log_accommodates, bedrooms, and bathrooms as the inputs, has an r-squared just a little higher than 0.61. That is verified with the model shown below:



But look at what happens when we build a multi-linear regression model predicting the price of an Airbnb property listing.



The model summary hints at a problem because the coefficient for beds is negative. The proper interpretation for each coefficient in this linear model is the way that log_price will be impacted by a one-unit increase in that input, with all other inputs held constant.

Literally, then, this output indicates that having more beds within a house or apartment makes its rental value go down, all else held constant. That not only defies common sense, it also contradicts something that we already know to be the case — that bed number and log_price are positively associated. Indeed, the correlation matrix shown above indicates a moderately strong linear relationship between these values, of 0.4816.

After dropping ‘beds’ from the original model, the adjusted R-squared declines only marginally, from 0.4878 to 0.4782.



This tiny decline in adjusted r-squared is not worrisome at all. The very low p-value associated with this model’s F-statistic indicates a highly significant overall model. Moreover, the signs of the coefficients for each of these inputs are consistent with the directionality that we see in the original correlation matrix.

Moreover, we still need to include other important variables that determine real estate pricing, e.g. location and property type. After factoring in these categories, along with other considerations such as pool availability, cleaning fees, and pet-friendly options, the model’s adjusted R-squared rises to 0.6694. In addition, the residual standard error declines from 0.5276 in the original model to 0.4239.

Long story short: we cannot rely entirely on rules of thumb, or even cutoff thresholds from textbooks, when evaluating the multicollinearity risk associated with specific numeric inputs. We must also examine the signs of the model coefficients. When a coefficient’s sign “flips” from the direction we would expect based on that variable’s correlation with the response variable, that can also indicate that the model coefficients are affected by multicollinearity.

Data source: Inside Airbnb