Natural Gas Prices Are Again on an Unsustainable Upward Trajectory …

I identified a real-time market condition in the natural gas market (symbol = UNG) on LinkedIn and followed up on a prior r-bloggers post here. The market ultimately declined 42% following the earlier breach of its identified unsustainable Stealth Support Curve.

Chart from earlier r-blogger post

In short order, this same market is once again on another such path. Once this market meaningfully breaks its current Stealth Support Curve, the magnitude of its ultimate decline is naturally indeterminate and is in no way guaranteed to again be ‘large.’ Nonetheless, it is highly improbable for market price evolution to continue along its current trajectory. If market low prices continue to adhere to the current Stealth Support Curve (below), prices will increase at least +1,800% within a month. Thus, the forecast of impending Stealth Support Curve breach is not a difficult one. The relevant questions are 1) when will it occur and 2 ) what is the magnitude of decline following the breach?

Past and Current Stealth Support Curves

Stealth Support Curve Formulas

t1 = 9/28/2011

R Code

library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/ung/historical

# Download reformatted data (columns/headings) from my github site and save to a local drive

# https://github.com/123blee/Stealth_Curves.io/blob/main/UNG_prices_4_11_2022.xlsx

# Load your local data file
ung <- read_excel("... Insert your local file path here .../UNG_prices_4_11_2022.xlsx")
ung


# Convert 'Date and Time' to 'Date' column
ung[["Date"]] <- as.Date(ung[["Date"]])
ung

bars <- nrow(ung)

# Add bar indicator as first tibble column
ung %
  add_column(t = 1:nrow(ung), .before = "Date")
ung

# Add 40 future days to the tibble for projection of the Stealth Curve once added
future <- 40
ung %
  add_row(t = (bars+1):(bars+future))

# Market Pivot Lows using 'Low' Prices
# Chart 'Low' UNG prices
xmin <- 2250
xmax <- bars + future
ymin <- 0
ymax <- 25
plot.new()
background <- c("azure1")
chart_title_low <- c("Natural Gas (UNG) \nDaily Low Prices ($)")
u <- par("usr") 
rect(u[1], u[3], u[2], u[4], col = background) 
par(ann=TRUE)
par(new=TRUE)
t <- ung[["t"]]
Price <- ung[["Low"]]
plot(x=t, y=Price, main = chart_title_low, type="l", col = "blue", 
     
     ylim = c(ymin, ymax) ,
     
     xlim = c(xmin, xmax ) ) 

# Add 1st Stealth Support Curve to tibble
# Stealth Support Curve parameters
a <-   -444.56  
b <-      6.26  
c <-  -2555.01  

ung %
  mutate(Stealth_Curve_Low_1 = a/(t + c) + b)
ung

# Omit certain Stealth Support Curve values from charting
ung[["Stealth_Curve_Low_1"]][1:2275] <- NA
ung[["Stealth_Curve_Low_1"]][2550:xmax] <- NA
ung

# Add 1st Stealth Curve to chart
lines(t, ung[["Stealth_Curve_Low_1"]])

# Add 2nd Stealth Support Curve to tibble
# Stealth Support Curve parameters
a <-   -277.30  
b <-      8.70  
c <-  -2672.65 

ung %
  mutate(Stealth_Curve_Low_2 = a/(t + c) + b)
ung

# Omit certain Stealth Support Curve values from charting
ung[["Stealth_Curve_Low_2"]][1:2550] <- NA
ung[["Stealth_Curve_Low_2"]][2660:xmax] <- NA
ung

# Add 2nd Stealth Curve to chart
lines(t, ung[["Stealth_Curve_Low_2"]])

The current low price is $22.68 on 4-11-2022 (Market close = $23.30).

The contents of this article are in no way meant to imply trading, investing, and/or hedging advice. Consult your personal financial expert(s) for all such matters. Details of Stealth Curve parameterization are found in my Amazon text, ‘Stealth Curves: The Elegance of Random Markets’ .

Brian K. Lee, MBA, PRM, CMA, CFA

Connect with me on LinkedIn.

Generate an MS Excel Workbook from inside RMarkdown

I make a presentation every week or so with {RMarkdown}. Invariably, one or more associates, those not fluent in R, will ask, “Can I get a copy of your ‘Excel’ ?” I don’t do much with Microsoft Excel directly, however, I’ve made creating a workbook part of my workflow using {openxlsx}. Now, I can immediately fire off a matching MS Excel Workbook after a discussion and look responsive. Here is an example:

```{r setup}
  library(tidyverse)
  library(openxlsx)

  make_one_sheet <- function(){ 
    sheet_name <- knitr::opts_current$get()$label
    addWorksheet(wb, sheet = sheet_name)
    writeData(wb, sheet = sheet_name, x = df_to_excel) 
    insertPlot(wb, sheet = sheet_name, 
      startCol = ncol(df_to_excel) + 2 ) 
    saveWorkbook(wb, file ="my_excel.xlsx", 
      overwrite = TRUE) 
  }

  wb <- createWorkbook()
```

I define a function in the setup chunk, to add to each chunk, which writes out the data and an image of the ggplot to tie everything together. The chunk label becomes the worksheet name. Also the workbook is set up in the setup chunk. Saving the workbook with every chunk simplifies adding new chunks as the deck is developed.

```{r dot_plot}
  plot_data <- iris %>% filter(Sepal.Length > 5) 

  plot_data %>% ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +  
    geom_point() 

  df_to_excel <- plot_data %>% 
    select(Species, Sepal.Length, Sepal.Width)  

  make_one_sheet()
```

Each chunk has the same four steps: do all of the processing of data to get ready for the specific ggplot, create the ggplot like one normally would, reorganize the data frame for the worksheet, and finally make the worksheet/workbook. Then you can do another ggplot:

```{r histogram}
  plot_data <- iris %>%  group_by(Species) %>%
    mutate(Group = cut(Petal.Length, breaks = 0:7)) %>% 
    group_by(Group, Species) %>% tally()

  plot_data %>% ggplot(aes(x = Group, y = n, fill = Species)) +
    geom_bar(stat = "identity", position = "stack")

  df_to_excel <- plot_data %>%
    pivot_wider(id_cols = "Group", names_from = "Species", values_from = "n")
  
  make_one_sheet()
```

Sometimes you want to pivot data for the worksheet, change some of the columns names, etc., for the workbook. Now you have your slides and a companion Excel Workbook!

AFEchidna: a new package solving mixed linear model for plant and animal datasets

Mixed linear models(MLM) are linear models with a combination of fixed and random effects to explain the degree of variation in interest traits, such as milk yield in cows or volume growth in forest trees. MLM are widely used in the analysis in the progeny test data of plants and animals. . Nowadays, most software uses the Restricted Maximum Likelihood (REML) method to estimate the variance components of random effects, and then estimate the fixed effects and predict the random effects. Such genetic analysis software includes ASReml, Echidna, SAS, BLUPF90, and R packages sommer, breedR, etc. Echidna is a free software developed in 2018 by Professor Gilmour, the main developer of ASReml. It also uses REML method to estimate parameter values, and its syntax and function is very close to that of ASReml. It is the most powerful free software for animal and plant genetic assessment, but its usage is a little complicated, which may be difficult for ordinary users.

Here, I released AFEchidna, an R package based on Echidna software, and demonstrated how to use a mixed linear model to generate solutions for variance components, genetic parameters, and random effects BLUPs.

The main functions in AFEchidna:

get.es0.file: generate es0 file
echidna(): specify mixed linear model
Var(): output variance components
plot(): model diagnose plots
pin(): calculate genetic parameters
predict(): model predictions
coef(): model equation solutions
model.comp(): compare different models
update(): run new mode

library(AFEchidna)
setwd("D:\\Rdata")
get.es0.file(dat.file=" Provenance.csv") # generate .es file
get.es0.file(es.file=" Provenance.es") # generate .es0 file

Specified a mixed model:

m1.esr <- echidna( fixed=height~1+Prov,
random=~Block*Female,
residual=~units,
es0.file='Provenance.es0')

Output related results:
> Var(m1.esr)
Term Sigma SE Z.ratio
1 Residual 2.52700 0.131470 19.221115
2 Block 0.10749 0.089924 1.195343
3 Female 0.18950 0.083980 2.256490
4 Block:Female 0.19762 0.086236 2.291618
> pin(m1.esr, mulp=c(Va~4*V3,
+ Vp~V1+V3+V4,
+ h2~4*V3/(V1+V3+V4)), digit=5)
Term Estimate SE
1 Va 0.75801 0.33592
11 Vp 2.91415 0.14973
12 h2 0.26011 0.10990
> coef(m1.esr)$fixed
Term Level Effect SE
1 Prov 11 0.0000000 0.0000000
2 Prov 12 -1.6656325 0.3741344
3 Prov 13 -0.6237406 0.3701346
4 Prov 0 -1.2201920 0.3769892
5 mu 1 11.5120637 0.3619919
> coef(m1.esr)$random %>% head
Term Level Effect SE
1 Block 2 0.3110440 0.1854718
2 Block 3 0.1268193 0.1858290
3 Block 4 0.2055624 0.1858158
4 Block 5 -0.2918516 0.1866871
5 Block 0 -0.3515741 0.1878287
6 Female 191 -0.1633745 0.3433451

A easy way to run batch analysis:
mt.esr <- update(m1.esr,trait=~height+diameter+volume,batch=TRUE)
> Var(mt.esr)

V1-Residual; V2-Block; V3-Female; V4-Block.Female
Converge: 1 means True; 0 means FALSE.

V1 V2 V3 V4 V1.se V2.se V3.se V4.se Converge maxit
height 2.5270 0.107490 0.18950 0.19762 0.13147 0.089924 0.08398 0.08624 1 4
diameter 16.8810 0.167130 0.80698 0.69100 0.87659 0.198100 0.41470 0.50067 1 5
volume 0.0037 0.000084 0.00022 0.00016 0.00019 0.000077 0.00010 0.00011 1 6

55,000 in Awards for Energy & Buildings Hackathon, Sponsored by NYSERDA

The New York State Energy Research & Development Agency (NYSERDA) is partnering with Onboard Data to host a $55,000 Global Energy & Buildings Hackathon. We’re inviting all engineers, data scientists and software developers whether they are professionals, professors, researchers or students to participate. More below…

Challenge participants will propose exciting, new ideas that can improve our world’s buildings. The hackathon will share data from 200+ buildings to participants. This data set is rich and one of a kind. The data set is normalized from equipment, systems and IoT devices found within buildings. We seek submissions that positively impact or accelerate the decarbonization of New York State buildings.

Total awards are $55,000. Sign-ups stay open until April 15th and the competition is open from April 22nd to May 30th. More can be found here: www.rtemhackathon.com.

Advance the next generation of building technology!

Batched Imputation for High Dimensional Missing Data Problems

High dimensional data spaces are notoriously associated with slower computation time, whether imputation or some other operation. As such, in high dimensional contexts many imputation methods run out of gas, and fail to converge (e.g., MICE, PCA, etc., depending on the size of the data). Further, though some approaches to high dimensional imputation exist, most are limited by being unable to simultaneously and natively handle mixed-type data.

To address these problems of either inefficient or slow computation time, as well as the complexities associated with mixed-type data, I recently released the first version of a new algorithm, hdImpute, for fast, accurate imputation for high dimensional missing data problems. The algorithm is built on top of missForest and missRanger, with chained random forests as the imputation engine.

The benefit of the hdImpute is in simplifying the computational burden via a batch process comprised of a few stages. First, the algorithm divides the data space into smaller subsets of data based on cross-feature correlations (i.e., column-wise instead of row-wise subsets). The algorithm then imputes each batch, the size of which is controlled by a hyperparameter, batch, set by the user. Then, the algorithm continues to subsequent batches until all batches are individually imputed. Then, the final step joins the completed, imputed subsets and returns a single, completed data frame.

Let’s walk through a brief demo with some simulated data. First, create the data.

# load a couple libraries
library(tidyverse)
library(hdImpute) # using v0.1.0 here

# create the data
{
set.seed(1234)

data <- data.frame(X1 = c(1:6),
X2 = c(rep("A", 3), 
rep("B", 3)), 
X3 = c(3:8),
X4 = c(5:10),
X5 = c(rep("A", 3), 
rep("B", 3)), 
X6 = c(6,3,9,4,4,6))

data <- data[rep(1:nrow(data), 500), ] # expand/duplicate rows

data <- data[sample(1:nrow(data)), ] # shuffle rows
}

Next, take a look to make sure we are indeed working with mixed-type data.

# quick check to make sure we have mixed-type data
data %>% 
map(., class) %>% 
unique()

> [[1]] [1] "integer" [[2]] [1] "character" [[3]] [1] "numeric"

Good to go: three data classes represented in our data object. Practically, the value of this feature is that there is no requirement for lengthy preprocessing of the data, unless desired by the user of course.

Finally, introduce some NA’s to our data object, and store in d.

# produce NAs (30%)
d <- missForest::prodNA(data, noNA = 0.30) %>% 
as_tibble()

Importantly, the package allows for two approaches to use hdImpute: in stages (to allow for more flexibility, or, e.g., tying different stages into different points of a modeling pipeline); or as a single call to hdImpute(). Let’s take a look at the first approach via stages.

Approach 1: Stages

To use hdImpute in stages, three functions are used, which comprise each of the stages of the algorithm:

feature_cor(): creates the correlation matrix. Note: Dependent on the size and dimensionality of the data as well as the speed of the machine, this preprocessing step could take some time.
flatten_mat(): flattens the correlation matrix from the previous stage, and ranks the features based on absolute correlations. The input for flatten_mat() should be the stored output from feature_cor().
impute_batches(): creates batches based on the feature rankings from flatten_mat(), and then imputes missing values for each batch, until all batches are completed. Then, joins the batches to give a completed, imputed data set.

Here’s the full workflow.

# stage 1: calculate correlation matrix and store as matrix/array
all_cor <- feature_cor(d)

# stage 2: flatten and rank the features by absolute correlation and store as df/tbl
flat_mat <- flatten_mat(all_cor) # can set return_mat = TRUE to print the flattened and ranked feature list

# stage 3: impute, join, and return
imputed1 <- impute_batches(data = d,
features = flat_mat, 
batch = 2)

# take a look at the completed data
imputed1

d # compare to the original if desired

Of note, setting return_mat = TRUE returns all cross-feature correlations as previously mentioned. But calling the stored object (regardless of the value passed to return_mat) will return the vector of ranked features based on absolute correlation from calling flatten_mat(). Thus, the default for return_mat is set to FALSE, as it isn’t necessary to inspect the cross-feature correlations, though users certainly can. But all that is required for imputation in stage 3 is the vector of ranked features (passed to the argument features), which will be split into batches based on their position in the ranked vector.

That’s it! We have a completed data frame returned via hdImpute. The speed and accuracy of the algorithm are better understood in larger scale benchmarking experiments. But I hope the logic is clear enough from this simple demonstration.

Approach 2: Single Call

Finally, consider the simpler, yet less flexible approach of making a single call to hdImpute(). There are only two required arguments: data (original data with missing values) and batch (number of batches to create from the data object). Here’s what this approach might look like using our simulated data from before:

# fit and store in imputed2
imputed2 <- hdImpute(data = d, batch = 2)

# take a look
imputed2

# make sure you get the same results compared to the stages approach above
imputed1 == imputed2

Visually Comparing

Though several methods and packages exist to explore imputation error, a quick approach to comparing error between the imputed values and the original (complete) data is to visualize the contour of the relationship between a handful of features. For example, take a look at the contour plot of the original data.

data %>%
select(X1:X3) %>% 
ggplot(aes(x = X1, y = X3, fill = X2)) + 
geom_density_2d() + 
theme_minimal() + 
labs(title = "Original (Complete) Data")

Next, here’s the same plot but for the imputed set.

imputed2 %>% 
select(X1:X3) %>% 
ggplot(aes(x = X1, y = X3, fill = X2)) + 
geom_density_2d() + 
theme_minimal(). + 
labs(title = "Imputed (Complete) Data")

As expected, the imputed version has a bit of error (looser distributions) relative to the original version. But the overall pattern is quite similar, suggesting hdImpute did a fair job of imputing plausible values.

Contribute

To dig more into the package, check out the repo, which includes source code, a getting started vignette, tests, and all other documentation.

As the software is in its infancy, contribution at any level is welcomed, from bug reports to feature suggestions. Feel free to open an issue or PR to request the change or contribute.

Thanks and I hope this helps ease high dimensional imputation!

Gaussian Simulation of Spatial Data Using R

Richard E. Plant

This post is a condensed version of an Additional Topic to accompany Spatial Data Analysis in Ecology and Agriculture using R, Second Edition. The full version together with the data and R code can be found in the Additional Topics section of the book’s website, http://psfaculty.plantsciences.ucdavis.edu/plant/sda2.htm Chapter and section references are contained in that text, which is referred to as SDA2.

1. Introduction

Probably the most common task in spatial data analysis is to estimate the values of some quantity at locations other than those at which it was measured. Beginning GIS classes often emphasize the use of spatial interpolation methods such as inverse distance weighting and kriging to carry out this task, and these are indeed important tools. They are not the whole story, however, and for some applications they may actually give results that lead to incorrect conclusions. A common alternative to interpolation is a method called conditional simulation. The purpose of this post is to describe conditional simulation, in particular conditional Gaussian simulation, to compare it with interpolation, and to show how to program it in R. We will also touch on the related topic of unconditional Gaussian simulation and see how the two differ. To demonstrate conditional simulation we need a data set that is densely defined in space so that we may sample from it, use the sample data to predict population values at unsampled points, and compare these predicted values with the true data values. The yield monitor data associated with Data Set 4 of SDA2 are very suitable for this purpose. In particular, we will use the wheat yield data of Field 2. A cleaned version of this data set was developed in Section 6.3.2 of SDA2, and this was converted to a regular grid. Fig. 1 shows the full gridded data set. The grid has 144 cells in the east-west direction and 72 cells in the north-south direction. Also shown in the figure are the 78 points at which field sample data were collected. The locations of these points in the grid were set to the closest grid cell to each sample point. Here is a part of the code showing the names of the R data objects and the assignment of sample point locations. > data.Set4.2 <- read.csv(“set4\\set4.296sample.csv”, header = TRUE)
> Y.full.data.sf <- st_read(“Created\\Set42pop.shp”)
> Y.full.data.sp <- as(Y.full.data.sf, “Spatial”)
> gridded(Y.full.data.sp) <- TRUE
> proj4string(Y.full.data.sp) <- CRS(“+proj=utm +zone=10
+ellps=WGS84″)
> sample.coords <- cbind(data.Set4.2$Easting, data.Set4.2$Northing)

Figure 1. Yield population data. The sample points are shown as bright yellow pixels.

As was discussed in SDA2, the areas in the western half of the field with very low yield were the result of a failed herbicide application due to equipment malfunction. We will use the sample points to extract from the full grid the data values that will be used to develop our kriging interpolations and conditional simulations. We will also need the record numbers of the sample points.

> closest.point <- function(sample.pt, grid.data){
+    dist.sq <- (coordinates(grid.data)[,1]-sample.pt[1])^2 +
+       (coordinates(grid.data)[,2]-sample.pt[2])^2
+    return(which.min(dist.sq))
+ }
> samp.pts <- apply(sample.coords, 1, closest.point,
+   grid.data = Y.full.data.sp)
> data.Set4.2$ptno <- samp.pts
> data.Set4.2$Yield <- Y.full.data.sp$Yield[samp.pts] The kriging analysis is similar, but not identical, to the analysis carried out in Sec. 6.3.2 of SDA2. > # Convert data.Set4.2 to a Spatial Points Data Frame
> # This is needed for the function krige()
> data.Set4.2.sp <- data.Set4.2
> coordinates(data.Set4.2.sp) <- c(“Easting”, “Northing”)
> proj4string(data.Set4.2.sp) <- CRS(“+proj=utm +zone=10
+   +ellps=WGS84″)
> # Kriging
> data.var <- variogram(object = Yield ~ 1, locations =
+    data.Set4.2.sp, cutoff = 600)
> data.fit <- fit.variogram(object = data.var, model =
+    vgm(150000, “Sph”, 250, 1))
> Yield.krige <- krige(formula = Yield ~ 1,
+    locations = data.Set4.2.sp, newdata = Y.full.data.sp,
+    model = data.fit)

Figure 2a. Kriged interpolation of the sample point data in Figure 1.

Figure 2b. Conditional Gaussian simulation of the sample point data in Figure 1. Fig. 2a shows the result of an ordinary kriging interpolation of the data. Fig. 2b shows a conditional Gaussian simulation of the same data. The code to develop this simulation will be given in Section 4. Comparison of the three figures shows that the kriging interpolation well approximates the pattern of yield but is much smoother than the original data. The simulation is considerably less smooth than the kriging interpolation. The smoothness of the kriged interpolation relative to those of the real data and the conditional Gaussian simulation is easy to see. Based on Fig. 2 one can say that although the kriged interpolation may be “optimal,” the pattern of variation of the conditional Gaussian simulation in some sense more closely resembles that of the true data. In what sense is the kriged interpolation optimal? The answer to this question is given in SDA2 Sec. 6.3.2, and in much greater detail by Isaacs and Srivastava (1989, Chapter 12). Here it is in brief: if the quantity to be estimated is denoted Y(x) and the estimates are denoted

, where x represents the position vector, and if Y is measured at locations x_i, then the kriged estimate

minimizes the error variance

(1)

This obviously does not mean that the kriged interpolation is the “best” estimate in every sense. The weaknesses of the kriged interpolation are described so well by Goovaerts (1997, p. 370) that I can do no better than to quote him. Interpolated estimates like that produced by kriging “…tend to smooth out local details of the spatial variation of the attribute. Typically, small values are overestimated, whereas large values are underestimated. Such conditional bias is a serious shortcoming when trying to detect patterns of extreme attribute values, such as zones of high permeability values or zones rich in a metal. Another drawback of estimation is that the smoothing is not uniform. Rather, it depends on the local data configuration: smoothing is minimal close to the data locations and increases as the location being estimated gets farther away. A map of kriging estimates appears more variable in densely sampled areas that in sparsely sampled areas.” An example from my own experience may make the first point clearer. I was involved for many years in experiments at the UC Davis Long Term Research in Agricultural Systems (LTRAS) site, a large agricultural field experiment. At one point, we were interested in conducting experiments on soil drainage at the site. This was made very difficult (actually, almost impossible) by the fact that the site lies directly next to, and at a higher elevation than, a creek. The site has some local areas of very high drainage, and water applied to the soil runs through these local areas and into the creek in the way that water drains from a bathtub. This means that taking soil samples and interpolating them would provide an unacceptably false estimate of the drainage of the site, because the interpolation would likely undervalue the very high drainage areas, where the water more quickly drains to the creek. Conditional simulation is a Monte Carlo process (see SDA2 Sec. 3.3). For convenience we will repeat a bit of the discussion in that section here. The basic idea of “traditional” Monte Carlo simulation is to formulate a probabilistic model of a process, run this model many times, and collect statistics on the outcomes. SDA2 Sec. 3.3 contains a simple example of a Monte Carlo simulation of the tossing of a fair coin 20 times. The simulation is carried out using a function called coin.toss() that generates the number of heads as a pseudorandom number from a binomial distribution. > coin.toss <- function (n.tosses, p){
+ n.heads <- rbinom(1,n.tosses,p)
+ } The function takes two arguments, the number of tosses n.tosses and the probability of heads p. The simulation is then carried out as follows.

> set.seed(123)
> n.tosses <- 20
> p <- 0.5
> n.reps <- 10000
> U <- replicate(n.reps, coin.toss(n.tosses, p))

The function replicate() generates a vector U whose elements are the number of heads in each of the 10,000 replications of the experiment.

> mean(U)
[1] 9.9814
> var(U)
[1] 4.905345 The mean and variance of U are close to their theoretically predicted values of 10 and 5, respectively.
In this post we will only discuss a specific kind of conditional simulation, called conditional Gaussian simulation. As the name implies, the value of the quantity being simulated is assumed to follow a normal (i.e., Gaussian) distribution. We will see that this greatly improves the analysis. Zanon and Leuangthong (2004) give a nice outline of the steps involved in conditional Gaussian simulation. They are as follows.

1. If the data are not normally distributed, transform them into a distribution that is to the greatest extent possible normal.
2. Generate a random sequence of steps that encounters each data value once.
3. At each step sequentially,
    a. search to find nearby sample data and previously simulated values,
    b. calculate the conditional distribution based on these data, and
    c. perform Monte Carlo simulation to obtain a single value from the
    distribution.
4. Repeat Step 3 until every data location has been visited.
5. If necessary, back transform the simulated data to the original distribution. As with neural networks in the Additional Topic on that technology, probably the best way to gain understanding of conditional Gaussian simulation is to construct a conditional Gaussian simulation algorithm that carries out the process in a similar way to the R function used to construct the simulation shown in Fig. 2b. To do this we must first review some important aspects of geostatistics (Section 2). In Section 3 we will construct the simple conditional Gaussian simulation algorithm, and in Section 4 we will describe the use of R gstat library (Pebesma, 2003, 2004) functions to carry out conditional and unconditional Gaussian simulation.

2. Geostatistics Review

a. Variance and covariance
Let Y(x) represent the quantity of interest measured at n locations x_i, i =1,…,n, where x_i is a two dimensional position vector. It is traditional in statistics to use this subscript notation, whereas in geostatistics it is traditional to use a notation involving the spatial lag, and we will switch to that in a moment. The key step in the conditional simulation algorithm described in the previous section is Step 3b, computing the probability distribution of the process Y(x) at the current location x conditional on the values of Y at nearby already sampled or simulated locations. To do this we must be able to compute the covariance between Y values at these locations, and this is done using the variogram. Much of the material that we will be going over in Subsection a is contained in SDA2 Sections 4.6.1 and 6.3.2, but it is convenient to repeat it in greater detail. Let the lag distance h be the distance between two points in the region. Assume that the region is stationary and isotropic (see SDA2, Sec. 3.2.2), so that h only depends on the magnitude of the distance and not its direction or the location of the points in the region. The derivation of the relationship between the covariance and the variogram follows an argument similar to that used in the Additional Topic on Spatial Sampling. The covariance between Y(x) and Y(x+h) is defined as

C(h) = Cov{Y(x),Y(x+h)}.

(2)

To carry out a conditional simulation we must relate this covariance to the variogram. The variogram γ(h) is defined as (SDA2, Sec. 4.6.1)

(3)

If Var{Y} = σ² then it turns out (see the Additional Topic for a derivation) that

(4)

As discussed in SDA2, Sec. 4.6.1, the variogram is estimated by the experimental variogram . It is worthwhile noting that an experimental variogram may have a nugget, i.e. that if

is the experimental variogram then it may be the case that

> 0, but nevertheless the variogram γ(h) is always 0 at h = 0, so that the variance is given by C(0) = σ².

b. Simple kriging
The interpolation exercise carried out in Section 1 involved ordinary kriging. The difference between simple kriging and ordinary kriging is that in simple kriging the value of the mean μ of the distribution is assumed to be known, whereas in ordinary kriging it is not. In the most common interpolation applications, the mean is obviously not known, and so ordinary kriging is the more widely used of the two. We will see, however, that simple kriging can be employed in conditional simulation, and so we will develop it here. Since the mean μ is known, we can define a new variable

Z(x) = Y(x) – μ.

(5)

and derive the equations in terms of Z, which has mean zero. The kriging interpolator

is a linear interpolation, which means that it is defined by the equation

(6)

where the x_i are the locations at which Y, and therefore Z, is measured, and x is the location at which it is to be interpolated. Kriging interpolation is based on the assumption of a probability model for

, and is derived within that model with the objective of minimizing the variance of the error associated with the interpolator.

It turns out (again, see the Additional Topic for a derivation) that the minimum of the expected variance is achieved under the condition

                                                Σφ_iC(x_i,x_{i +}h_ij) – C(x_i,x) = 0, i = 1,…,n.                           (7)
The equations (7) are called the simple kriging equations.

Equations (7) can be written in matrix form. If we let the n by n matrix C have elements C_ij = C(x_i,x_i +h_ij) (hopefully this abuse of notation is not confusing: C is a matrix and C(x_i,x_i +h_ij) is the covariance between locations x_i and x_i+ h_ij, and let the vector c have components C(x_i,x) then a system of n equations of the form of Equation (7) for the n coefficients φ_i can be written as the matrix equation
                                                                     Cφ = c.                                                   (8)
The values of φ the can be determined by solving Equation (8). This is not, however, the most computationally efficient way to compute these quantities, as will be discussed below. We will follow in the footsteps of Leuangthong et al. (2008) and create a simple four-point example in which we can directly compute the appropriate simple kriging quantities. In our case we will use the four points in the northwest corner of the sample set. We will krige on the quantity Z, the value of Yield minus its mean over the full 78 point data set. > Y.mean <- mean(data.Set4.2$Yield)
> data.Set4.2$Z <- data.Set4.2$Yield – Y.mean
> data.4pts <- data.Set4.2[which(data.Set4.2$Easting < 592200 &
+    data.Set4.2$Northing > 4267750),]

Figure 4. Simple four-point example, showing the Z values. The value at the blue point is to be estimated based on the values at the red points. Fig. 4 shows the four points. We will use the values of the three points shown in red to estimate the value of the point shown in blue. The values of Z are shown at each point. We can anticipate that the value of the blue point will be seriously overestimated, which illustrates the smoothing nature of kriging described above. First, we compute the distance matrix of these points and then we apply the variogram formula computed in Section 1. The way this is coded we need to be careful about the order of the data and make sure that the point to be estimated is the last in that order. > print(h4.mat <-as.matrix(with(data.4pts,
+   dist(cbind(Easting, Northing))))
       1       2        3        4
1 0.00000 61.0000 61.03278 86.26703
2 61.00000 0.0000 84.86460 61.00000
3 61.03278 84.8646 0.00000 59.00000
4 86.26703 61.0000 59.00000 0.00000
> print(nugget <- data.fit$psill[1])
[1] 0
> print(psill <- data.fit$psill[2])
[1] 1533706
> print(rnge <- data.fit$range[2])
[1] 248.2806

Using the formula for the spherical variogram given by Issacs and Srivastava (1997), we can construct a function sph.vgm() in terms of the nugget, the partial sill, and the range (the sill is the sum of the nugget and the partial sill). Here is the computation of the variogram matrix. > sph.vgm <- function(nug, psill, rnge, hh){
+    if (hh == 0) return(0)
+    if (hh > rnge) return(nug + psill)
+    return(nug + psill*((3*hh)/(2*rnge) – 0.5 * (hh/rnge)^3))
+    }
> gam.mat <- matrix(0, 4, 4)
> for (i in 1:4)
+    for (j in 1:4)
+       gam.mat[i,j] <- sph.vgm(0, psill, rnge, h4.mat[i,j])
> gam.mat
        [,1]     [,2]     [,3]     [,4]
[1,]      0.0 553850.7 554136.1 767179.4
[2,] 553850.7      0.0 755728.0 553850.7
[3,] 554136.1 755728.0      0.0 536401.1
[4,] 767179.4 553850.7 536401.1      0.0 We can then use Equation (5) to compute the variance-covariance matrix > # The fourth point is the point to be estimated
> print(C.full <- nugget + psill – gam.mat)
         [,1]      [,2]      [,3]      [,4]
[1,] 1533705.6 979855.0 979569.6 766526.3
[2,] 979855.0 1533705.6 777977.7 979855.0
[3,] 979569.6 777977.7 1533705.6 997304.5
[4,] 766526.3 979855.0 997304.5 1533705.6 The matrix C in Equation (8) is the 3 x 3 upper left corner of C.full and the vector c is the first three elements of the fourth column. We can solve Equation (8) for φ and then compute the simple kriging estimate of Z.
> C.mat <- Cfull[1:3, 1:3]
> c.vec <- Cfull[1:3,4]
> print(phi <- solve(C.mat, c.vec))
[1] -0.1015358 0.4591511 0.4822024
> print(Z.est <- sum(phi * data.4pts$Z[1:3]))
[1] 979.1371 As predicted, the estimate is rather poor, since the true value of Z is anomalously low. In the next subsection we will see that the simple kriging Equation (8) can be expressed as a linear regression model.

c. Regression digression
The derivation in Subsection b showed that the simple kriging estimator is the best linear unbiased estimator in the sense that among unbiased estimators it minimizes the variance of the error. The Gauss-Markov theorem (Kutner et al., 2005, p. 18) states that the estimator of the linear regression model is the best linear unbiased estimator. It must therefore be the case that the solution of the simple kriging equations is the same as that of corresponding linear regression equation. The purpose of this brief subsection is to give an informal argument to show this. SDA2 Appendix A.2 contains a discussion of the equations of multiple linear regression. We will reproduce some of this discussion here. Multiple linear regression provides the least squares estimate for the parameters of the model

. (9)

Here I have changed to a nonstandard notation different from that of SDA2 but matching that of this Additional Topic. To observe the analogy between simple kriging and linear regression we will be concerned with the special case in which β₀ = 0. (regression through the mean). The reason β₀ = 0 is that simple kriging is concerned with estimating the value of a quantity Z having mean 0, and this corresponds to regression through the mean. The equations for the regression model turn out to be (again, see the Additional Topic for a derivation)                          Σ_jβ_jX_iX_j – X_iY = 0, i = 1, … n .                               (10)

If we compare this equation with Equation (7), the simple kriging equations, which is repeated here,

                             Σφ_iC(x_i,x_{i +}h_ij) – C(x_i,x) = 0, i = 1,…,n.                           (11)

we can see that

                             X_iX_j = C(x_i,x_j), φ_i = β_i, X_iY = C(x_i,x) .                       (12)

Thus, we have shown (albeit with a bit of hand waving) that simple kriging does indeed produce the same result as linear regression. In the next subsection we will see that they are also the same as those generated by Bayes’ rule.

d. Bayesian statistics
Recall that Bayes’ rule is written as follows. Let A and B be two events. The joint probability, denoted P{A,B}, is the probability that both events A and B occur. The conditional probability that A occurs given that B occurs is denoted P{A|B}, and one can similarly define the conditional probability P{B|A} that B occurs given that A occurs. The laws of probability state that (Larsen and Marx, 86, p. 42)

P{A,B} = P{A|B}P{B} = P{B|A} P{A},

(13)

and these can be used to formally define the conditional probability. Bayes’ rule is obtained by dividing the second equation in (13) by P{A},

.                              (14)

If we have observed the event A then Bayes’ rule gives us a means to compute the probability that B occurs given this observation together with any prior information that we may already have about event B. In order to describe how Bayes’ rule is applied to conditional simulation we will shift from a model expressed in terms of events A and B to one in which the probabilities involve two scalar quantities y₁ and y₂. The following discussion is taken primarily from Leuangthong et al. (2008, p. 107), although I have changed the notation to better match mine. Rather than considering the probability of events, we consider probability densities, which we denote with a lower case p. We will assume that our objective is to compute the value of the quantity Y₂ based on having computed the value of Y₁, and that p(y₁) and p(y₂) represent probability densities of the values of Y₁ andY₂. Bayes’ rule becomes                                         .           (15)
The density p(y₂|y₁) is in geostatistics generally called the conditional density, p(y₁,y₂) is called the joint density, and p(y₁) is called the marginal density. I should note that if you are familiar with Bayesian statistics in the context of subjective probability theory then both the terminology and the concepts as applied here are slightly different and should not be confused. The formula for the probability density of a univariate distribution of a Gaussian random variable with mean m and variance is (Larsen and Marx, 1986, p. 210)

.                      (16)

I will not repeat the mathematical derivations here (again, see the Additional Topic) but note that they are based on Leuangthong et al. (2008, p. 108). They show that if the distribution p(y₁) of Y₁ is Gaussian (i.e., normal) then the conditional density p(y₂|y₁) of Y₂ is also Gaussian. This property, that the densities all have the same Gaussian form, is a rare one not shared by other probability densities. It is a primary reason why we do conditional Gaussian simulation. The great advantage of taking a Bayesian approach to conditional Gaussian simulation is that it facilitates the successive incorporation of the new data at each step of the simulation process (i.e. at each new visit to a point to be simulated). This incorporation is described, for example, by Leuangthong et al. (2008, p. 109) and by Goovaerts (1997, p. 376). Basically, once we know Y₁ = y₁, we can compute Y₂ = y₂ based on the formulas given above. Once we know Y₂ = y₂, we can compute Y₃ = y₃, and so forth. In other words, the calculation at each new visit point is a computationally simple matter of Bayesian updating. As described by Goovaerts, this provides tremendous efficiency in the simulation. This is the way that the computations are carried out in software such as the gstat package. The presentation of the algorithm is simplified, however, if one adopts a simple matrix inversion approach like that used in the four-point example above, and this is what we shall do in Section 3. Because they use Bayesian updating of a Gaussian distribution, however, R library algorithms such as those in gstat require that the data be normally distributed, and for this reason if they are not then they must be transformed. That is the subject of the next subsection.

e. The normal scores transformation
If you have had a course in linear regression, one of the topics you probably covered was transforming the data to a form that more closely fits the assumptions made to derive the method. The situation is similar with conditional simulation, but the objective is different. Linear regression is fairly robust to deviations from normality of the distribution of errors but is highly sensitive to deviations in error magnitude, and for this reason transformations may be carried out to “stabilize the variance,” that is, to make the error distribution more homogeneous. Conditional Gaussian simulation, on the other hand, assumes a normal distribution of the data, and thus data whose distribution is not Gaussian must be transformed to come closer to matching this assumption. The standard transformation used in geostatistics is called the normal scores transformation and is quite simple to describe and implement. It is described by Pyrcz and Deutsch (2018) and by Goovaerts (1997); we will follow the nice description by Goovaerts (1997, p. 268). Step 1. Arrange the sample data in ascending order, breaking ties if necessary. > # First compute normal scores of Yield at sample points:
> # Arrange Yield values in ascending order
> # data.Set4.2 is the original sample data, still a data frame
> data.Yield.ord <- data.Set4.2$Yield[order(data.Set4.2$Yield)]
> # Check for ties
> min(diff(data.Yield.ord))
[1] 0.5649183 Step 2. Compute the cumulative frequency of the ordered data values (i.e., the quantiles). In the next section we will also need their coordinates as well as their record numbers so we will include these as well. > data.Set4.2.ord <- data.frame(Yield = data.Yield.ord,
+    Yquant = seq(1:length(data.Yield.ord)) /
+       length(data.Yield.ord),
+    Easting = data.Set4.2$Easting[order(data.Set4.2$Yield)],
+    Northing = data.Set4.2$Northing[order(data.Set4.2$Yield)],
+    ptno = data.Set4.2$ptno[order(data.Set4.2$Yield)]) Step 3. Match the quantile of each of the ordered values with the corresponding quantile of the normal distribution. > data.ordered$Z <- qnorm(data.ordered$Yquant)
> data.ordered$Z[length(data.ordered$Z)]
[1] Inf
> data.Set4.2.ord$Z[length(data.Set4.2.ord$Z) – 1]
[1] 2.231606
> data.ordered$Z[length(data.ordered$Z)] <- 3.0 The function qnorm() assigns the value Inf to the largest quantile, so this replaced with a suitably large finite value. In theory the normal score transformation produces a quantity Z(x) with mean exactly zero, but in practice numerical issues plus our adjustment of the largest quantile may upset this a bit, so we will adjust the values to return their mean to zero, > # Give the z values a mean of zero
> mean(data.Set4.2.ord$Z)
[1] 0.03846154
> data.Set4.2.ord$Z <- data.Set4.2.ord$Z – mean(data.Set4.2.ord$Z) Fig. 5 shows the normal score transformation schematically. The fiftieth quantile Yield out of 78 has the value 5167 kg, and corresponds to the quantile value 50/78 = 0.64. The quantile value 0.64 of the normal distribution on the right corresponds to the value 0.32, which is the normal score.

Figure 5. Diagrammatic representation of normal score transformation. Based on similar figures by Goovaerts (1997, p. 268) and Pyrcz and Deutsch (2018).

After the Gaussian simulation is completed the results must be transformed back to the original data units. Because the data are so densely arranged, this can be accomplished by simple linear interpolation.

3. The Conditional Gaussian Simulation Algorithm

In Section 2 we created a normal score transformation Z of the yields of the sample points. We will use this transformed quantity in the simulation. Here are the steps. Step 1, the visit order. Since we start with a set of 78 locations whose Z value is known, we will randomly pick one of these as are starting point. Then we randomly arrange all the locations, sampled and unsampled. > set.seed(123)
> print(pt1 <- sample(samp.pts, 1))
[1] 5763
> visit.order <- sample(1:length(data.Full.ordered$Z),
+   replace = FALSE)
> visit.order <- c(pt1, visit.order[-which(visit.order == pt1)])
> visit.order[1:10]
[1] 5763 2463 2511 8718 2986 1842 9334 3371 4761 6746 Step 2. At the current visit point, if it is a sample point, insert its known value. If not, determine the parameters (mean and variance) of the Gaussian cumulative distribution function using simple kriging of nearby known values. This where software such as gstat uses Bayesian updating and we will use simple matrix inversion. We will base our estimate on the ten closest points whose value has already been determined. We create a data field Zsim in the original sf object Y.full.data.sf to hold the simulated values. We also determine the quantities to be used in the simple kriging. > Y.full.data.sf$Zsim <- 0
> data.varn <- variogram(object = Z ~ 1, locations =
+    data.Set4.2.ord.sp, cutoff = 600)
> data.fitn <- fit.variogram(object = data.varn, model =
+    vgm(1, “Sph”, 250, 1))
> print(Z.nug <- data.fitn$psill[1])
[1] 0
> print(Z.sill <- Z.nug + data.fitn$psill[2])
[1] 1.157344
> print(Z.range <- data.fitn$range[2])
[1] 239.7456
> n.kr.pts <- 10 The variogram() argument cutoff is the spatial separation distance up to which point pairs are included in the computation. The arguments of the function vgm() are respectively the initial guess of the partial sill, the model, and the initial guesses of the range and the nugget. Again, the partial sill is the sill minus the nugget. Fig. 6 shows the setup for the 100^th visit point. The point to be simulated is shown in red, the sample points are shown in green, and the points whose Z values have already been simulated are shown in blue. It is evident that the simulation is based on a combination of sample values and previously simulated values.

Figure 6. Diagram of the simulation data for the one hundredth visit point. Next we create a function closest.points() to find the closest points to the current visit point. The argument coord.data is used to pass the data frame Y.full.data.sf. The argument n.pts has the value 10. > closest.points <- function(visit.pt, coord.data,
+    known.pts, n.pts){
+    c.pts <- numeric(n.pts)
+ # Determine the coordinates of the current visit point
+    visit.coords <- with(coord.data, c(Easting[visit.pt],
+          Northing[visit.pt]))
+ # Determine the coordinates of all the points whose value is known
+    known.coords <- with(coord.data,
+        cbind(Easting[known.pts], Northing[known.pts]))
+ # Respectively go through the known points and find the
+ # n.pts closest points
+    test.coords <- known.coords
+    for (n in 1:(n.pts + 1)){
+       dist.sq <- (test.coords[,1] – visit.coords[1])^2 +
+          (test.coords[,2] – visit.coords[2])^2
+       min.pt <- which.min(dist.sq)
+       c.pts[n] <- known.pts[min.pt]
# Once a point is selected, change its coordinates to an
# impossible value to keep it from being selected again
+       test.coords[min.pt,] <- 0
+       }
+    return(c.pts)
+    } We use the array known.pts to hold the indices of the locations whose Z value is known. Initially these are the sample points. > known.pts <- samp.pts At each point in the visit order, if it is a sample then that value is used (Step 2a). These values are assigned at the start > for (i in 1:length(samp.pts))
+    Y.full.data.sf$Zsim[samp.pts[i]] <- with(data.Set4.2.ord,
+       Z[which(ptno == samp.pts[i])]) If the visit point is not a sample point then the Monte Carlo computation is implemented. Rather than using the R function dist() to compute the distance matrix, as was done for the simple four point example, it is more convenient to use our own function to compute the distances. > pt.dist <- function(x1, x2) sqrt(sum((x1 – x2)^2)) There is a lot of code, so I have tried to liberally include comments to explain what is going on. The drawing of the value from the normal distribution takes place at the end of the loop. The first line of code sets up the cycling through the visit points. > for (visit.pt in visit.order){
+ # If visit point not a sample point use the kriging variance to
+ # construct the distribution and draw a value from this
+       distribution
+    if (!(visit.pt %in% samp.pts)){
+ # Find the closest points to the visit point
+       cl.pts <- closest.points(visit.pt, Y.full.data.sf,
+          known.pts, n.kr.pts)
+ # Add the visit point to end of the set of closest points
+       krige.pts <- c(cl.pts, visit.pt)
+ # Determine the distance matrix
+       h.mat <- matrix(0, length(krige.pts), length(krige.pts))
+       for (i in 1:length(krige.pts))
+          for (j in 1:length(krige.pts)){
+             x1 <- with(Y.full.data.sf,
+               c(Easting[krige.pts[i]],Northing[krige.pts[i]]))
+             x2 <- with(Y.full.data.sf,
+               c(Easting[krige.pts[j]],Northing[krige.pts[j]]))
+             h.mat[i,j] <- pt.dist(x1, x2)
+             }
+ # Determine the gamma matrix
+       gam.mat <- matrix(0, nrow(h.mat), nrow(h.mat))
+       for (i in 1:(n.kr.pts + 1))
+          for (j in 1:(n.kr.pts + 1))
+             gam.mat[i,j] <-
+                sph.vgm(Z.nug, Z.sill, Z.range, h.mat[i,j])
+       C.full <- Z.sill – gam.mat
+       C.mat <- C.full[1:(n.kr.pts), 1:(n.kr.pts)]
+       c.vec <- C.full[1:(n.kr.pts),(n.kr.pts + 1)]
+ # Solve for the phi values
+       phi <- solve(C.mat, c.vec)
+ #Estimate the mean via simple kriging
+       mu.est <- sum(phi * Y.full.data.sf$Zsim[cl.pts])
+ # Variance equals the sill
+       sd.est <- sqrt(Z.sill)
+ # Draw a value from a normal distribution with these parameters
+       Y.full.data.sf$Zsim[visit.pt] <- rnorm(1, mu.est, sd.est)
+ # Add the visit point to the list of known points
+       known.pts <- c(known.pts, visit.pt)
+       }
+    } If you use the code accompanying this post to run this example, then as you watch the little blue wheel go around, think about the fact that when we repeat this example in the next section using the gstat function krige(), the computation will be effectively instantaneous. R library functions are generally better than anything we could make by hand. Finally, we use interpolation to transform the simulation back to the original scale (the function to do this was created in Section 1 but will be shown in Section 4). Fig. 6 shows the result. It is a considerably cruder simulation of the original data, although it does preserve some of the reduction in yield on the west side of the field.

Figure 6. Results of the “homemade” conditional simulation. Although the homemade simulation is clearly much cruder, the general yield pattern is still discernable. In the next section we will go through the same code as was used in Section 1 to illustrate the use of the gstat package to carry out conditional and unconditional Gaussian simulation.

4. Conditional and Unconditional Gaussian Simulation using gstat

Now that we have an idea of how conditional Gaussian simulation works, we will go over the code used to generate Figs. 2b and 3 in Section 1. After that we will create code to generate an unconditional Gaussian simulation and compare the two. The R code in the code file accompanying this Additional topic is repeated to match the presentation order here. The first step is the normal score transformation of the 78 sample points. This has already been described in Subsection 2e. The next step is to create a SpatialPointsDataFrame for the gstat function variogram() to compute and fit the variogram. We have already seen the last four lines of code in Section 3.

> coordinates(data.ordered) <- c(“Easting”, “Northing”)
> proj4string(data.ordered) <- CRS(“+proj=utm +zone=10 +ellps=WGS84”)
> data.varn <- variogram(object = Z ~ 1,
+    locations = data.ordered, cutoff = 600)
> data.fitn <- fit.variogram(object = data.varn,
+    model = vgm(1, “Sph”, 250, 1))

The last step is the actual computation of the conditional Gaussian simulation. > Z.sim <- krige(formula = Z ~ 1, locations = data.ordered,+ newdata = Y.full.data.sp, model = data.fitn, nmax = 10, nsim = 6)
drawing 6 GLS realisations of beta…
[using conditional Gaussian simulation]
> class(Z.sim)
[1] “SpatialPixelsDataFrame”
attr(,”package”)
[1] “sp”
> spplot(Z.sim) # Check Note the I am using krige() to carry out the conditional simulation. There are other ways to do it but I find this the most straightforward. The function krige() is a “wrapper” for the function gstat() A wrapper function provides a more convenient interface to another function, often for certain specialized applications. The argument locations contains the known data values, in the form of a Spatial object such as a SpatialPointsDataFrame, and the argument newdata contains the locations at which the simulation values are to be computed, again in the form of a Spatial object. In this case I use a SpatialPixelsDataFrame. The argument nsim is the number of simulations to be computed. The function gstat() is so efficient that it does not take much more time to compute 100, or ever 1000, simulations than to compute one (You will use this feature in Exercise 3). The last step in the code above is to do a quick and dirty plot to check that the result is reasonable.

The code for the function krige() when used for ordinary kriging is very similar to that used for conditional Gaussian simulation, so let’s take a look at the former in order to compare the two. Here is the call to the function krige() used to produce Fig. 2a. > Yield.krige <- krige(formula = Yield ~ 1,
+    locations = data.Set4.2.sp, newdata = Y.full.data.sp,
+    model = data.fit) The difference in the two function calls is the assignment of a positive value to the argument nsim in the conditional Gaussian simulation (the default value of nsim is zero). This positive value signals that conditional Gaussian simulation rather than kriging is to be used. The value of the argument nsim if it is not zero indicates the number of simulations to be computed. The argument nmax provides an indication of the number of known values to be used, but the algorithm is much more sophisticated than the simple one used in Section 3. The last step in the conditional Gaussian simulation process is to transform the data back to the initial units. As indicated, this is done using simple linear interpolation. We first create an interpolation function xform() and then do the transformation. > xform <- function(in.vec, out.vec, val){
+    i <- 1
+    while (i < length(in.vec) & val >= in.vec[i+1]) i <- i + 1
+    if (i == length(in.vec)) return(out.vec[i])
+    return(out.vec[i] + ((val – in.vec[i]) /
+      (in.vec[i+1] – in.vec[i])) * (out.vec[i+1] – out.vec[i]))
+    }
> # Back transform on a point by point basis
> Yield.sim <- Z.sim
> for(i in 1:nrow(Yield.sim@data)) +    for(j in 1:ncol(Yield.sim@data))
+       Yield.sim@data[i,j] <- xform(data.ordered@data$Z,
+          data.ordered@data$Yield, Z.sim@data[i,j]) Now we move on to unconditional Gaussian simulation. The difference between conditional and unconditional simulation is that in unconditional simulation only previously simulated values are used in Step 3a, the establishment of “known” values. The simulation begins with no “known” sample values. Unconditional Gaussian simulation cannot be done using krige() but rather requires the direct use of the function gstat(). The argument dummy = T signals that this is an unconditional simulation. The argument nmax is the number of nearest observations used for a simulation. The object returned by the function gstat() is then used as an argument in the function predict() to generate the actual simulation. > Z.pred <- gstat(formula = Z ~ 1, locations = ~Easting + Northing,
+   dummy = T, beta = 0, model = data.fitn, nmax = 20)
> set.seed(123)
> Z.usim <- predict(Z.pred, newdata = Y.full.data.sp, nsim = 6)
[using unconditional Gaussian simulation] In a simple kriging application beta is the value of the mean, which in our Gaussian simulation is zero. Fig. 8 shows a plot of the result. The simulation has the same general local spatial structure as the original data but does not preserve the global spatial pattern.

Figure 8. Unconditional simulation using yield data

5. Further Reading

Much of the discussion in this Additional Topic is taken from Leuangthong et al. (2008). Goovaerts (1997) contains the same material, but the presentation is at a higher mathematical level. Wackernagel (2003) and Journel (2009) contain more mathematically sophisticated discussions of the relationship between kriging and linear regression.

6. Exercises

Using the four-point data set of Fig. 6, compute a conditional Gaussian simulation of the fourth point based on the other three.

Try carrying out conditional simulation on the original Yield data without using the normal score transformation. How does the result compare with that computed here?

In principle, since a conditional Gaussian simulation is a set of pointwise Monte Carlo simulations, the mean of a large number of simulations should converge to the kriged interpolation. Compute 10, 100, 1000 simulations of the normalized data and see if they do converge.

A naïve way to address the problems associated with kriging would be to simply krige the data and then add a normally distributed random variable to each kriged value using the variance as computed from the variogram. Try this out and compare the results with those obtained using conditional Gaussian simulation. How do you explain the difference?

One of Goovaerts’ assertion about kriging is that smoothing is least close to the data locations and increases as the location being estimated gets farther away. Using the variogram information from the full 78 sample points (computing a variogram with only 39 data points is not wise), carry out a kriging interpolation and a conditional Gaussian simulation based on only the western half of the sample data and see whether kriging in the “unsampled” eastern half is indeed smoother than the simulation.

Using multiple simulations, construct several histogram and variogram comparisons similar to those of Fig. 4b and Fig. 5b and see how they compare with each other.

7. References

Box, G. E. P., and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Menlo Park, CA. Leuangthong, O, K.D. Khan, and C.V. Deutsch (2008). Solved Problems in Geostatistics. Wiley, New York. Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford, UK, Oxford University Press. Isaaks, E. H., and R. M. Srivastava (1989). An Introduction to Applied Geostatistics. Oxford University Press, Oxford, U.K. Journel, A.G. (1989) Fundamentals of Geostatistics in Five Lessons. American Geophysical Union, Washington, D.C. Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li (2005). Applied Linear Statistical Models. Mc-GrawHill Irwin, Boston, MA. Larsen, R. J., and M. L. Marx (1986). An Introduction to Mathematical Statistics and its Applications. Prentice-Hall, Englewood Cliffs, NJ. Leuangthong, O, K.D. Kahn, and C.V. Deutsch (2008). Solved Problems in Geostatistics. Wiley, Hoboken, NJ Pebesma, E. (2003), gstat User’s Manual. Utrecht University, Utrecht, The Netherlands Pebesma, E. J. (2004). Multivariable geostatistics in S: the gstat package. Computers and Geosciences 30: 683-691. Pyrcz, N.J. and C.V. Deutsch (2018). Transforming Data to a Gaussian Distribution. https://geostatisticslessons.com/lessons/normalscore Ripley, B. D. (1981). Spatial Statistics. Wiley, New York. Wackernagel, H. (2003) Multivariate Geostatistics. Springer, Berlin Zanon, S. and O. Leuangthong (2004) Implementation Aspects of Sequential Simulation. Geostatistics Banff 2004 DOI: 10.1007/978-1-4020-3610-1_55 Copyright © 2022 Richard E. Plant

RbaseX, a BaseX-client written in R

The primary programming language used in the Data Science course at Erasmus University Rotterdam was ‘R’. So when the topic ‘XML, XQuery and XSL’ was tackled, there was a need for a tool that could work with R. BaseX, an Open Source XML database, offered the functionalities that were needed for the excercises but there was no R support. However, a server protocol was available in which all communication was fully described. This protocol has now been used for the development of RBaseX. Version 1.1.1 of this client is available at (https://cran.r-project.org/web/packages/RBaseX/index.html). All following examples are based on the minutes of the Dutch Parliament which are freely available in XML-format. BaseX is written in Java. It is shipped as .jar, .zip, .exe or .war. On Windows and Linux, it works out of the box. All clients are based on the client/server architecture. Hence, a BaseX database server must be started first. RbaseX can be used in two modes:

Standard mode is used for connecting to a server and sending commands
Query Mode is used for defining queries, binding variables, iterative evaluation

Standard mode

RBaseX is developed using the R6 package for object-oriented programming. All interactions between the client and BaseX begin with the creation of a new instance of BaseXClient.
The following command creates a new BasexClient instance and stores it in the variable Session. This Session is used in all the examples.

library(RBaseX)
Session <- BasexClient$new("localhost", 1984L, username = "admin", password = "admin")

In the Command-command, the Create statement can be used to create new databases. If a database with that name already exists, it is overwritten.
Existing databases can be opened with the statement Open. BaseX has a convenience check statement which combines open and create. If a database with that name already exists, it is opened. Otherwise it is created (and opened).

Session$Command("Check Parliament")

After the client has send a command to the server, the server responds by sending the outcome of the command as a raw vector or byte-array. This outcome is converted to a list. In most cases this list has attributes $result and if appropriate $info. To indicate if the process was successful a single byte is added to the raw vetor. A \0x00 byte means success, \0x01 indicates an error. The value of this status-byte is stored as a private variable in the Session instance-variable and can be accessed as the $success attribute.
Public methods get_success() and set_success() are methods which are not described in the server protocol. Neither are the methods set_intercept, get_intercept and restore_intercept. I added them because I needed to catch errors. When the Intercept variable has the value FALSE and an error in your code is detected, execution stops. When the Intercept variable has the value TRUE, you can catch the error. Scraping the site of the Parliament is surprisingly easy. A search for all the minutes of a given date, returns a page that contains an ID and the number of documents. The two can be combined to construct a link to all the minutes in XML-format. The files can be stored directly into the database. The following store-function uses the BaseX Add-command.

Session$set_intercept(TRUE)
store <- function(Period) {
  for (i in 1:length(Period)) {
    Minutes <- minutes(Period[i])
    if (!identical(Minutes, list())) {
      Source <- paste("https://zoek.officielebekendmakingen.nl/", Minutes, sep="")
      mapply(function(source) Session$Add("Debates", source), Source)
    }
  }
}

Thanks to the set_intercept, errors, due to missing files, are handled neatly. Downloading and storing the 740 files in the database took 3.5 minutes.

Querying

‘Simple’ XQuery-statements can be handled as normal commands. So you can use the following statement to return the number of records that were inserted:

(Count <- Session$Command("xquery count(collection('Parliament'))")$result)

## [[1]]
## [1] "740"

I use the BaseX-GUI to develop queries. Queries that have been tested in the GUI are copied into the R-environment.
The following XQuery code returns for every meeting the Debate-ID, the names of the speakers, the order in which they spoke and the party they represent. result2frame Is a utility function which converts the results to a dataframe.

library(glue)
library(knitr)
  
Query_Expr <-paste(
'import module namespace  functx = "http://www.functx.com";', 
'for $Debat in collection("Parliament")',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',
'  for $Speaker at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speaker/spreker/naam/achternaam/text()',
'    let $Pol := $Speaker/spreker/politiek/text()',
'    order by $debate-id, $CountInner',
'return($debate-id, $CountInner, $Spreker, ($Pol, "n.v.t")[1])')

result <- Session$Command(as.character(glue("xquery {Query_Expr}")))
result_frame <- result2frame(result$result, 4)
names(result_frame) <- c('Debate ID','Speach', 'Speaker', 'Party')

kable(head(result_frame,3))

Debate ID	Speach	Speaker	Party
20202021-102-10	1	voorzitter	n.v.t
20202021-102-10	2	voorzitter	n.v.t
20202021-102-2	1	voorzitter	n.v.t

As you can see the ordering by debate-id does not give a correct order. This would be corrected if the digit(s) following the second hyphen were padded with 0’s. Since some meetings have more than 100 items, I pad up to length 3. It is easy to write a function for this transformation and incorporate it in the query expression.

import module namespace functx = “http://www.functx.com”;
declare function local:order_id
( $Meeting as xs:string)
{ let $debate-id := fn:analyze-string( $Meeting, “(\d{8}-\d)-(\d)”)
let $date := $debate-id//fn:match/:group[@nr=“1”]/text()
let $item-nr := functx:pad-integer-to-length($debate-id//fn:match/:group[@nr=“2”]/number(),3)
return fn:string-join(($date, $item-nr), “-”)
};

After incorporating this function in the query, ordering is correct.

Debate ID	Speach	Speaker	Party
20202021-102-2	1	voorzitter	n.v.t
20202021-102-2	2	Bromet	GroenLinks
20202021-102-2	3	voorzitter	n.v.t

Query mode

ExecuteQuery

The Session$Query()-command does not execute a query expression. It creates a new instance of the QueryClass and adds this instance to the Session variable. It is this QueryClass-instance that has an ExecuteQuery() method. This is demonstrated in the following example to extract the text from the minutes.

Query_Expr <-paste(
'import module namespace  functx = "http://www.functx.com";', 
'for $Debat at $CountOuter in collection("Parliament")',
'    where $CountOuter <=2',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',  
'  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speach/spreker/naam/achternaam/text()',
'    let $Pol := $Speach/spreker/politiek/text()',
'    order by $debate-id, $CountInner',
'    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
'       let $tekst := fn:string-join(fn:data($par//al/text()), "
")',
'    return($debate-id, $Spreker, ($Pol, "n.v.t")[1], $CountPar, $tekst)'
)

QueryExe <- Session$Query(Query_Expr)
result <- QueryExe$queryObject$ExecuteQuery()
result_frame <- result2frame(result, 5)
names(result_frame) <- c('Debate ID','Speaker', 'Party', 'Order', 'Text')

kable(head(result_frame,1))

Debate ID	Speaker	Party	Order	Text
20202021-102-2	voorzitter	n.v.t	1	Allereerst hebben we het traditionele mondelinge vragenuur. …

Binding

An XQuery expression may use external variables. Before using them they have to be bound to the queryObject(). In the following example, bind() is used to bind $Regex to a regular expression.

library(dplyr)
library(knitr)
Query_Bind <- Session$Query(paste(
'declare variable $Regex external;',
'for $Debat at $CountOuter in collection("Parliament")',
'    where $CountOuter <= 2',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',  
'  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speach/spreker/naam/achternaam/text()',
'    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
'       let $tekst := fn:string-join(fn:data($par//al/text()), "
")',
'    return($debate-id, $Spreker, $tekst)'))

Query_Bind$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")

## $Binding
## character(0)
## 
## $success
## [1] TRUE

BindResult <- Query_Bind$queryObject$ExecuteQuery()
ResultFrame <- result2frame(BindResult, 3)
names(ResultFrame) <- c('Debate ID','Speaker', 'Text')

kable(head(ResultFrame,1))

Debate ID	Speaker	Text
20202021-102	voorzitter	Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven weer geschiedenis. …

Iteration

Sometimes there is a need to iterate over the results of a query. When using the more()-next() construct, BaseX prefixes every item with a byte that represents the Type ID or XDM meta data. For every item, RBaseX creates a list with as elements the byte and the value of the item.
In the following example I’m only interested in every third item. This example shows how to iterate over all the items and also shows how to use the wrapper functions which are available for all the commands.

Query_Iter <- Session$Query(paste(
  'declare variable $Regex external;',
  'for $Debat at $CountOuter in collection("Parliament")',
  '    where $CountOuter <= 2',
  '  let $debate-id := fn:analyze-string(',
  '    $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',  
  '  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
  '    let $Spreker := $Speach/spreker/naam/achternaam/text()',
  '    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
  '       let $tekst := fn:string-join(fn:data($par//al/text()), "
")',
  '    return($debate-id, $Spreker, $tekst)'))

Query_Iter$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")

## $Binding
## character(0)
## 
## $success
## [1] TRUE

IterCnt <- 0
IterResult <- c()
while (IterCnt <= 10 && More(Query_Iter)) {
  NextItem <- Next(Query_Iter)
  IterCnt <- IterCnt + 1
  if (IterCnt %% 3 == 0) {
    IterResult <- c(IterResult, NextItem)
    (str(NextItem[[1]][[2]]))
  }
}

##  chr "Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven we"| __truncated__
##  chr "Voorzitter. Het was altijd al een eer om hier te staan. Vandaag is dat het in het bijzonder. We voelen ons alle"| __truncated__
##  chr "Dank u wel. Het woord is nu aan de minister. Gaat uw gang."

BaseX modules

BaseX provides many more functionalities. A complete overview can be found at https://docs.basex.org/wiki/Main_Page.

Create a hyper-marketing model using Naïve Bayes

By Huey Fern Tay with Greg Page

Everyone loves an extra income stream – even the super-wealthy owners of luxurious properties that they only inhabit for just a few weeks each year. Offering a property as a short-term rental through a platform like Airbnb can provide a wonderful side hustle. For some owners, however, the associated hassles could be a powerful deterrent to using the service. Text messages at 3 a.m. about Wi-Fi passwords, stopped-up toilets, and the lack of water pressure in the shower might be just enough to tip the scales against such an undertaking…especially when such messages are followed up by angry “Why isn’t this fixed yet?” queries just 30 minutes later.

So what if an all-in-one concierge service could take away ALL of those hassles? If an intermediary service could handle all of the tenant interactions, the marketing, the logistics of the key hand-offs, etc. then suddenly the idle rich jetsetters might be a bit more willing to open up their pied-a-terres to the unbathed masses. Such a service would benefit all stakeholders – travelers would have more options, the property owners would earn more income, and the platform would receive more commission fees from the extra transactions. In exchange for a fee paid to the service, willing property owners could have a side hustle that was “all side, no hustle.”

Let’s imagine that such a service is looking to establish itself, with an initial marketing outreach effort to high-end property owners not already using Airbnb. Let’s also imagine that this new service is operating on a shoestring budget, and therefore needs to. How can it identify the properties within a city that are most likely to command high values in the short-term rental market?

The Naïve Bayes classifier is a good candidate for the task at hand because of its simplicity, computational efficiency, and ability to handle categorical variables. Furthermore, its classification outcomes come with associated probability values – we can use those to identify records that are most likely to belong to some particular group.

To illustrate how this method could be used to solve the business problem outlined above, I will utilize Airbnb data of San Francisco listings.

One of the first decisions the modeler must make is deciding how to bin the data. In this case, the question is which numerical variable would you use to separate the properties? Would you use price, review ratings, or number of reviews? Each of these variables has a different impact on the outcome and may not be equally effective at separating classes. If the classes are not well separated, then even a large dataset will not be helpful.

The next important decision is to determine the number of classification categories you would like to create. Also, what would the cut-off be for each group? In other words, should you create four groups and bin them equally? Or should you create three groups by dividing the data according to a 15-70-15 proportion, 20-40-20, or 30-40-30…etc? The decision made at this step has a big impact on the model.

Consider both models below which were each created with 3811 rows of data representing 60% of the total dataset. Both models were created with the same predictor variables, such as the number of bedrooms, bathrooms, property type, location, etc. But in Model 1, the data was binned into 4 equal groups while in Model 2, the data was binned into 3 equal groups. The model summary for Model 1 showed the model has an accuracy of 54.19%, which is good considering that this performance is slightly more than double the No Information Rate (Naïve Rate). Model 2’s accuracy’s level is at 65.36%, a level which is nearly double its No Information Rate.

These are encouraging results but in our Airbnb example, we are more interested in knowing how well our model performs when it is asked to classify properties into any of the classes used in the model. For this reason, it is worth considering the true positive rate i.e. ‘sensitivity’. Model 1 is better at predicting the true positives (‘sensitivity’) for classes at opposite ends of the spectrum. This suggests Model 1 has difficulty reading nuances. On the other hand, Model 2’s performance in this regard is comparatively more balanced.

Naive Bayes model with four classification outcome

Naive Bayes model with four classification outcome

Naive Bayes output with three classification outcome

But wait – let’s get back to our original goal. While overall accuracy is good to see, what we are most interested in here is identifying that high-end price group. Owners of such units will be the best targets for our all-in-one concierge service. Therefore, let’s dive a bit deeper to examine this model’s suitability for identifying such properties.

By running the predict() function with the type=’raw’ parameter included, we can view the associated probabilities for each outcome class, and then rank records by probability of belonging to some particular outcome group.

Taking this approach with the validation set, we find that among the 100 records identified by the model as most likely to land in the top tier group, 96 truly belonged to “Above Average and Pricey Digs.” Among the 150 likeliest, 140 units, or 93.33%, actually belonged to that group, and among the 200 likeliest, 185 units, or 92.5%, were truly in that top tier.

But that’s not all. It is worth going one step further by evaluating the model with lift charts or decile-wise lift charts because these charts determine how effectively our model ‘skims the cream’.

The decile-wise lift charts below illustrate how effectively the model can predict membership in the ‘above average and pricey digs’ group. When the model is used to classify the top 27% properties in this category, its performance is more than 3.5x better than a random guess.

Decile-wise lift chart shows the model is 3.5 times better at classifying the top 27 percent properties

Decile-wise lift chart shows the model is 3.5 times better at classifying the top 27 percent properties

Another way to assess the model’s ability to identify top-tier rentals is with a two-dimensional lift chart. Such a chart only works with two-outcome class scenarios, so we start here by collapsing the first and second tiers together, and then labelling that group as “other.”

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

In the entire validation set, there are 780 units that land in the highest price tier. The values along the x-axis represent all the validation set records, ranked in order of their probability of belonging to the highest-tier class. The y-axis shows the number of correct predictions. The solid line represents the model’s performance – it shows us, for instance, that among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group. The line flattens out at around x=1500, because by that point, the model has already identified nearly all of the records that truly belonged to this outcome class.

The dotted line, by contrast, shows how effective a model would be if it simply labelled all the records as belonging to the top tier. Since 33.6%, of the validation records belong to this group, each x-axis value here corresponds to a y-axis value that is exactly 33.6% as large. The difference between the solid line and the dotted line represents the model’s improvement as the number of cases increases.

What do these results mean for the concierge service? Let’s revisit our original assumptions:

such a service would most likely appeal to the owners of properties in the highest pricing tier;
the initial outreach efforts should be made to owners of properties not already registered with Airbnb; and that
the new service has a limited budget, and therefore needs to carefully focus its outreach efforts only on that tier of properties whose owners would be likeliest to use it

Given these assumptions, our primary interest does not lie with overall model accuracy. The model’s ability to distinguish between the bottom two pricing tiers is almost immaterial to us; however, we are keenly interested in the answer to this question: When the model predicts that a property will belong to the highest pricing tier, how often is it correct?

As demonstrated here, the model delivers quite effectively in this regard, especially when we maintain a relatively narrow focus on the properties that are most likely to be top tier. Splashy magazine inserts, Super Bowl advertisements, and big-ticket endorsements from celebrities might be in the cards for this service down the road, after it spreads across the globe and prepares for its IPO roadshow. For now, though, the hyper-specific focus that can come from “skimming the cream” off the top of those Naïve Bayes model probability predictions may be the surest next step for this service’s success.

Data source: Inside Airbnb

Walmart’s (WMT) 7-Year Nonlinear Market Trend at ‘Critical Juncture’

This is a brief market update from my earlier post on October 28, 2021, Walmart’s 7-Year Nonlinear Market Trend using ‘Stealth Curves.’

The prior post indicated the then current nonlinear market structure for Walmart (WMT) stock.

October 28, 2021 Market View

Fast forward to today …

This market is now resting on its multi-year lower Stealth Support Curve. As the below updated graphic indicates, this represents a critical juncture for this market.

Current Market View (1-27-2022)

This current chart can be produced using the code in my October 2021 post along with updated market data. The data is freely available from Nasdaq.com here. Just be certain the updated data match the column headings in my ‘reformatted’ GitHub data file.

For those interested in numerous Stealth Curve examples applied to various other markets, simply view this LinkedIn post. Full details of Stealth Curve equation parameterization are described in my latest Amazon publication, Stealth Curves: The Elegance of ‘Random’ Markets.

Feel free to contact me directly on my LinkedIn site.

Brian K. Lee, MBA, PRM, CMA, CFA

Bifurcation diagram for nonautonomous SIR model

I am trying to do the bifurcation diagram for a nonautonomous SIR two-patch model. But It’s not giving the proper result. The model and parameter values are in the code. I solved the model by ODE solution for a set of bifurcation parameter values and then took the maximum and minimum for the plot. How should I draw a phase plane analysis for my model? Any suggestions?

Bifurcation Analysis #########

library(deSolve)
library(tidyverse)
vals_betaAbar <- seq(1,70,by=1) #bifurcation parameter

outdf = NULL

for (i in 1:length(vals_betaAbar)) {

betaAbar = vals_betaAbar[i]

sir.two.patch.model.closed <- function (t, x, params){
#create local variable IA, IB, RA, RB, SA, SB
IA <- x[1]
IB <- x[2]
RA <- x[3]
RB <- x[4]
SA <- x[5]
SB <- x[6]
#we can simplify code using “with”
#this argument to “with” lets us use the variable Names
with(
as.list(params),
{ #the system of rate equations
dIA <- betaAbar(1+sigAcos(2pit/Ti))IASA/Na-(muA+gamaA+phiIAB(1+sigAcos(2pit/Ti)))IA+phiIBA(1+sigAcos(2pit/Ti))IB
dIB <- betaBbar(1+sigAcos(2pit/Ti))IBSB/NB-(muB+gamaB+phiIBA(1+sigAcos(2pit/Ti)))IB +phiIAB(1+sigAcos(2pit/Ti))IA
dRA <- gamaAIA-(muA+phiRAB(1+sigAcos(2pit/Ti)))RA+phiRBA(1+sigAcos(2pit/Ti))RB
dRB <- gamaBIB-(muB+phiRBA(1+sigAcos(2pit/Ti)))RB+phiRAB(1+sigAcos(2pit/Ti))RA
dSA <- muANa-betaAbar(1+sigAcos(2pit/Ti))IASA/Na-(muA+phiSAB(1+sigAcos(2pit/Ti)))SA+phiSBA(1+sigAcos(2pit/Ti))SB
dSB <- muBNB-betaBbar(1+sigAcos(2pit/Ti))IBSB/NB-(muB+phiSBA(1+sigAcos(2pit/Ti)))SB+phiSAB(1+sigAcos(2pit/Ti))SA
#combine results into a single vector dx
dx <- c(dIA,dIB,dRA,dRB,dSA,dSB)
#return result as a list
list(dx)
}
)
}
#function seq returns a sequence
times <- seq(0,100,length.out=3000)
#function c “c”combines values into a vector
params <- c(Na=100, NB=1000, betaAbar=betaAbar, betaBbar=5,muA=0.06, muB=0.02, gamaA=12, gamaB=12,phiIAB=0.4, phiIBA=0.04, phiSAB=0.4, phiSBA=0.04, phiRAB=0.4, phiRBA=0.04, sigA=0., Ti=1)

#initial conditions
xstart <- c(IA=5,IB=5,RA=0,RB=0,SA=100-5,SB=1000-5)
#result stored in data frame
out <- as.data.frame(ode(xstart,times,sir.two.patch.model.closed,params))

out$betaAbar= vals_betaAbar[i]

out=out[251:300,]

outdf = rbind.data.frame(outdf, out)

}

bifdata=NULL
for (i in 1:length(vals_betaAbar)) {

betaAbar = vals_betaAbar

bifdata_A = NULL
bifdata_A$IA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$IA)
bifdata_A$IA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$IA)
bifdata_A$IB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$IB)
bifdata_A$IB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$IB)

bifdata_A$SA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$SA)
bifdata_A$SA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$SA)
bifdata_A$SB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$SB)
bifdata_A$SB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$SB)

bifdata_A$RA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$RA)
bifdata_A$RA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$RA)
bifdata_A$RB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$RB)
bifdata_A$RB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$RB)

bifdata_A$betaAbar= vals_betaAbar[i]

bifdata = rbind.data.frame(bifdata, bifdata_A)
}

dev.new()
ggplot(data=bifdata)+
geom_line(aes(x=betaAbar,y=IA_max))+
geom_point(aes(x=betaAbar,y=IA_min))

#############################