R – Page 33 – R-posts.com

Surface Renewal Analysis for Energy Flux Exchanges in an Ecosystem: 1: Calculating Ramp Characteristics using R.

Summary: The application of Surface Renewal (SR) analysis as an alternative to energy and gaseous flux measurements using eddy covariance has gained prominence as a less costly and reliable method. The R code provided in this post is the first of two posts. In this first post R programming language is applied to calculate ramp characteristics; namely: ramp amplitude (A) and ramp duration (τ). A sample dataset is provided to demonstrate the functionality of the code.

INTRODUCTION

The surface energy balance equation explains the net solar radiation from the sun to and from the earth’s surface and is comprised of three important elements namely: latent heat (LE), sensible heat (H) and ground heat flux (G). LE is the latent heat flux that results from both evapotranspiration from the soil and vegetation. Sensible heat warms the layer above the surface while ground flux is associated with heat stored in the soil or pedozone. The sum of these 3 components totals the radiative flux which is a combination of both longwave and shortwave radiation energy. Quantification of these components within a watershed or field, assists scientific understanding of plant development, precipitation and water demands by competing users. There are several micrometeorological methods available to measure these components both directly and indirectly. The eddy covariance (EC) method utilizes an omnidirectional 3D sonic anemometer as well as either an open or closed-path infrared H₂O gas analyzing system to measure LE. Solar radiation can be measured using radiometers. Soil heat plates placed at various depths, measure ground heat flux (G) while sensible heat is measured indirectly as the remainder after subtracting G and H from net solar radiation. The EC methodology is rather expensive and has several limitations which are discussed by several micrometeorologists (e.g. Castellvi et al., (2006); Castellvi et al, (2009a, 2009b); and Suvočarev et al., 2014)). The surface renewal method (SR) however provides an efficient and cheap option for estimating H, LE energy and methane (CH₄) and carbon dioxide (CO₂) gas exchange fluxes to and from varied ecosystems such as uniform pine forest (Katul et al., 1996); grapevine canopies (Spano et al., 2000); rice fields (Castellvi et al., 2006); and rangelands (Castellvi et al., 2008). The method is both cheaper and simple to implement on medium spatial scales such as livestock paddocks and feedlots. SR method does not require instrument levelling and has no limitations in orientation (Suvočarev et al., 2014)). Additionally, the SR method has demonstrated better energy balance closure (Castellvi, et al., 2008).

To estimate H flux, measurements are taken at one height together with horizontal wind speeds and temperature. To estimate LE fluxes, measurements may be taken at one height together with horizontal wind speeds and moisture concentrations. The Surface Renewal (SR) process is derived from understandings of gaseous flux exchanges to or from an air parcel traveling at a known or specified height over a canopy. In the SR process an air parcel moves into the lower reaches of the canopy and remains for a duration τ (tau); before another parcel replaces it ejecting it upwards. Paw et al, (1995) created a diagram of this process and likened it to a ramp-like event.

Figure 1: Air parcel diagram of surface renewal process.

Figure 2: An ideal depiction of a surface renewal ramp where atmospheric conditions are unstable.

The ramp is characterized by both an amplitude (A) and period (τ – Greek letter :tau). The purpose of this R code is to calculate these two characteristics for measuring both gaseous and energy flux exchanges occurring during the SR process. These values are then applied in estimating H, LE, CO₂ and CH₄. Ramp Characteristics Amplitude Van Atta (1977) and Antonia and Van Atta (1978) (both cited in Synder et al., 2007) pioneered structure function analysis for ramp patterns in temperature. However, they did not connect ramps and structure functions to the concept of surface renewal but rather relegated the structure function analysis to other features of turbulent flow. The Biomicrometeorological Research Team at the University of California, Davis were the first to connect to the concepts of surface renewal to ramp patterns analyzed by structure functions. The surface renewal method was thus developed and refined by scientific contributions of several researchers led by Kyaw Tha Paw U, (Professor of Atmospheric Sciences and Biometeorologist)** at the University of California, Davis; researchers at INRA, France and the University of Sassari, Italy. These later researchers not only developed surface renewal, but were the first to show usage of structure functions in surface renewal for micrometeorological purposes (A list of their publications and presentations highlighting these early innovations are provided below). These researchers studied structure functions, ramp patterns and turbulence and their relationships. They utilized the structure function approach to estimate both amplitude and duration from moments recorded in the field (Equation 1).

(Equation 1)
Where h(V_i-V_i-j) is the difference between two sequential high-frequency measurements (time lag interval, j) of water vapor density or air temperature. M represents the total number of data points in the 0.5 hour time period. i represents the summation index while n is the structure function order. The R code builds of these procedures and calculates the 2^nd, 3^rd and 5^th moments of a structure function following Van Atta (1977). Applying the three moments to calculate p and q; as:

(Equation 2)
And;

(Equation 3)
A cubic equation encompassing ramp amplitude is provided as:

(Equation 4)
Equating y to 0 and solving for the real roots of the equation yields the ramp amplitude.
Ramp duration
The “inverse ramp frequency” also known as the mean ramp event, τ , is calculated as:

(Equation 5)

j represents the instance when the ratio between the third moment and observation time (seconds) is a maximum.
Demonstration The sample dataset (test.csv) utilized in this package consists of water vapor measurements (gm-3) recorded over 30 minutes (half an hour) at 20 Hz per second. Therefore a total of 36000 measurements are provided. The sample dataset can be downloaded here.
R code
The code provided below converts the H₂O concentrations from g per m³ to mmol per m³. Thereafter, 2nd, 3rd and 5th moments are calculated for each 1/20th of a second. These values are then applied to calculating two values; p and q. A (amplitude) is derived by solving for the true roots of a cubic polynomial equation. The R code generates a numeric vector that consists of three values: 1) r value – this is the time (seconds) when the ratio between the third moment and observation time (seconds) is a maximum. 2) τ (Greek letter – tau) is the total ramp duration 3) A is the amplitude of the ramp that models the SR process. In order for the package to function, two supporting packages are necessary. These are the NISTunits and polynom packages.

ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, “Package”])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

### usage of packages

packages <- c(“NISTunits”,“polynom”,“bigmemory”)
ipak(packages) ## Loading required package: NISTunits ## Loading required package: polynom ## Loading required package: bigmemory ## Loading required package: bigmemory.sri ## NISTunits   polynom bigmemory 
##      TRUE      TRUE      TRUE require(NISTunits)


 ### cube root function
Math.cbrt <- function(x) 
  {
  sign(x) * abs(x)^(1/3)
  }

####
#### set the directory and read file. In my case my dataset is stored in
#### D:SurfaceRENEWAL folder

setwd(“D:/SurfaceRENEWAL/”)
df<-read.csv(“test.csv”, header=TRUE)

## creating empty data frame
### based on the value of hertz. In this case it is 20 Hz per second..

hertz<-20


dat <- data.frame(col1=numeric(hertz), col2=numeric(hertz), col3=numeric(hertz),stringsAsFactors = FALSE)

####convert to mmol m-3

df$H2O<-1000*df$H2O/18.01528
 
### Creating and Initializing the table with three columns 
dat <- data.frame(col1=numeric(hertz), col2=numeric(hertz), col3=numeric(hertz),stringsAsFactors = FALSE)
dat ##    col1 col2 col3
## 1     0    0    0
## 2     0    0    0
## 3     0    0    0
## 4     0    0    0
## 5     0    0    0
## 6     0    0    0
## 7     0    0    0
## 8     0    0    0
## 9     0    0    0
## 10    0    0    0
## 11    0    0    0
## 12    0    0    0
## 13    0    0    0
## 14    0    0    0
## 15    0    0    0
## 16    0    0    0
## 17    0    0    0
## 18    0    0    0
## 19    0    0    0
## 20    0    0    0 for (p in 1:hertz)
{
  #### initializing cumulative totals
  tempTot2<-0
  tempTot3<-0
  tempTot5<-0
  #### the first column is the frequency
  #### the second column is the duration (in seconds)
  dat$col1[p]<-p
  dat$col2[p]<-p/hertz

  #### initializing temporary variables for calculating 
  #### 2nd, 3rd and 5th moments
  
  tem2<-0
  tem3<-0
  tem5<-0
    for (i in (p+1):nrow(df))
      {
      tem2<-(df$H2O[i]-df$H2O[i-p])^2
      tempTot2<-tem2+tempTot2  
      tem3<-(df$H2O[i]-df$H2O[i-p])^3
      tempTot3<-tem3+tempTot3
      tem5<-(df$H2O[i]-df$H2O[i-p])^5
      tempTot5<-tem5+tempTot5
      }

  dat$col3[p]<-(tempTot2/(nrow(df)-dat$col1[p]))
  dat$col4[p]<-(tempTot3/(nrow(df)-dat$col1[p]))
  dat$col5[p]<-(tempTot5/(nrow(df)-dat$col1[p]))
  
}


dat$moment2<-abs(dat$col3)
dat$moment3<-abs(dat$col4)
dat$moment5<-abs(dat$col5)

#### Solution following Synder et al.(2007)
#### values for p and q
dat$p<-10*dat$col3 – (dat$col5/dat$col4)
dat$q <- 10*dat$col4
#### solutions to the cubic equation after Synder et al., (2007)
dat$D<-(((dat$q)^2)/4)+(((dat$p)^3)/27)

for (i in 1:hertz) {
  if (dat$D[i]>0)
    {
        one1<-(dat$D[i])^0.5
        one2<–0.5*dat$q[i]
        dat$x1[i]<-Math.cbrt(one2+one1)
        dat$x2[i]<-Math.cbrt(one2-one1)
        dat$a[i]<-dat$x1[i]+dat$x2[i]
        rm(one1)
        rm(one2)
    }
 else {
   if (dat$D[i]<0)
   {
     one1<-(dat$D[i])^0.5
     one2<–0.5*dat$q[i]
     dat$x1[i]<-Math.cbrt(one1+one2)
     dat$x2[i]<-Math.cbrt(one1-one2)
     rm(one1)
     rm(one2)
     
     part1<-((-1)*dat$p[i]/3)^3 
    b<-NISTradianTOdeg(acos((-0.5*dat$q[i])/((part1)^0.5)))
    
    a1=2*((-dat$p[i]/3)^0.5)*(cos(NISTdegTOradian(b/3)))
    a2=2*((-dat$p[i]/3)^0.5)*(cos(NISTdegTOradian(120+b/3))) 
    a3=2*((-dat$p[i]/3)^0.5)*(cos(NISTdegTOradian(240+b/3)))
    c<-max(a1, a2, a3)
    dat$a[i]<-c
    rm(c, a1, a2, a3, part1)
    
   }
 } 
}

myvars <- names(dat) %in% c(“x1”, “x2”) 
dat <- dat[!myvars]

##### using polynomial solver for a ####
require(polynom)

for (i in 1:hertz){
  
  p<-polynomial(coef = c(dat$q[i],dat$p[i], 0, 1))
  pz <- solve(p)
  length(pz)
  for (j in 1:length(pz))
       {
    assign(paste(“sol”, j, sep = “”), pz[j]) }
      dat$sol1[i]<-sol1
  dat$sol2[i]<-sol2
  dat$sol3[i]<-sol3
    rm(pz)
    rm(sol1, sol2, sol3)
}


for(i in 1:hertz)
{
z1<-dat$sol1[i]
z2<-dat$sol2[i]
z3<-dat$sol3[i]

dat$realsol1[i]<-Re(z1)
dat$Imsol1[i]<-Im(z1)
dat$realsol1[i]
dat$Imsol1[i]


dat$realsol2[i]<-Re(z2)
dat$Imsol2[i]<-Im(z2)
dat$realsol2[i]
dat$Imsol2[i]

dat$realsol3[i]<-Re(z3)
dat$Imsol3[i]<-Im(z3)
dat$realsol3[i]
dat$Imsol3[i]

max(dat$realsol1[i],dat$realsol2[i],dat$realsol3[i])
dat$valA[i]<-max(dat$realsol1[i],dat$realsol2[i],dat$realsol3[i])
dat$max3lagtime[i]<-dat$moment3[i]/dat$col2[i]
}

myvars <- names(dat) %in% c(“col1”, “col2”,
                            “moment2”, “moment3”,
                            “moment5”,“p”, “q”, “a”,“valA”,  “max3lagtime” ) 
summaryTable <- dat[myvars]

summaryTable ##    col1 col2   moment2    moment3    moment5          p           q
## 1     1 0.05  39.97585   227.4659   262452.8  -754.0533   -2274.659
## 2     2 0.10 115.63242  1134.9434  3160154.1 -1628.0914  -11349.434
## 3     3 0.15 189.03933  2265.2436  8969219.4 -2069.1011  -22652.436
## 4     4 0.20 254.38826  3320.6327 15645921.7 -2167.8465  -33206.327
## 5     5 0.25 312.56747  4242.6024 22015534.6 -2063.4833  -42426.024
## 6     6 0.30 364.54446  5068.9708 27896667.6 -1857.9740  -50689.708
## 7     7 0.35 410.76632  5884.4135 34475044.8 -1751.0421  -58844.135
## 8     8 0.40 452.93659  6660.0190 41266600.0 -1666.8026  -66600.190
## 9     9 0.45 492.31085  7343.4792 47495349.9 -1544.5819  -73434.792
## 10   10 0.50 529.03029  7939.6602 53081193.5 -1395.2721  -79396.602
## 11   11 0.55 562.93084  8509.7643 59150608.6 -1321.6019  -85097.643
## 12   12 0.60 593.42319  9046.5529 66051182.5 -1367.0223  -90465.529
## 13   13 0.65 620.52924  9426.8395 71306547.4 -1358.9127  -94268.395
## 14   14 0.70 644.90293  9647.2450 73292986.2 -1148.2677  -96472.450
## 15   15 0.75 667.44272  9846.4826 73991999.9  -840.1344  -98464.826
## 16   16 0.80 689.19924 10113.5842 75160476.5  -539.6436 -101135.842
## 17   17 0.85 710.79955 10484.7886 78308090.9  -360.7380 -104847.886
## 18   18 0.90 732.54936 10959.1331 83838648.3  -324.6232 -109591.331
## 19   19 0.95 754.19575 11389.8848 89552992.0  -320.5445 -113898.848
## 20   20 1.00 775.68790 11665.0130 94068144.4  -307.2478 -116650.130
##           a     valA max3lagtime
## 1  28.85951 28.85951    4549.317
## 2  43.46502 43.46502   11349.434
## 3  50.20279 50.20279   15101.624
## 4  52.87582 52.87582   16603.163
## 5  53.45273 53.45273   16970.410
## 6  53.04339 53.04339   16896.569
## 7  53.41124 53.41124   16812.610
## 8  53.87871 53.87871   16650.048
## 9  53.91351 53.91351   16318.843
## 10 53.62664 53.62664   15879.320
## 11 53.86497 53.86497   15472.299
## 12 54.90598 54.90598   15077.588
## 13 55.33885 55.33885   14502.830
## 14 54.13317 54.13317   13781.779
## 15 52.21134 52.21134   13128.643
## 16 50.44371 50.44371   12641.980
## 17 49.70186 49.70186   12335.045
## 18 50.11435 50.11435   12176.815
## 19 50.67653 50.67653   11989.352
## 20 50.95577 50.95577   11665.013 ### picking the ramp characteristics based on the rato of third moment and lag time.

shortened_sol<-dat[which(dat$max3lagtime==max(dat$max3lagtime)),]
shortened_sol ##   col1 col2     col3      col4      col5  moment2  moment3  moment5
## 5    5 0.25 312.5675 -4242.602 -22015535 312.5675 4242.602 22015535
##           p         q         D        a               sol1
## 5 -2063.483 -42426.02 124575728 53.45273 -26.72637-8.91137i
##                 sol2        sol3  realsol1    Imsol1  realsol2   Imsol2
## 5 -26.72637+8.91137i 53.45273+0i -26.72637 -8.911368 -26.72637 8.911368
##   realsol3 Imsol3     valA max3lagtime
## 5 53.45273      0 53.45273    16970.41 amplitude<-shortened_sol$a
tau<-((-1*shortened_sol$col2)*(shortened_sol$valA)^3)/shortened_sol$col4
r<-shortened_sol$col2

solution<-data.frame(“Parameters” = c(“r”, “amplitude”, “tau”), 
                     “Values”= c(r, amplitude, tau), 
                     “Units”= c(“seconds”,“mmol m-3”,“seconds” ))
solution ##   Parameters    Values    Units
## 1          r  0.250000  seconds
## 2  amplitude 53.452730 mmol m-3
## 3        tau  8.999479  seconds

Acknowledgements

The author would like to acknowledge Andy Suyker for providing the test dataset. The authors also acknowledges the contributions of Kosana Suvočarev and Tala Awada. References on Surface Renewal Starting with first publications on surface renewal (by Paw U et al.***) Conference Papers & Abstracts Paw U, K.T. and Y. Brunet, 1991. A surface renewal measure of sensible heat flux density. pp. 52-53. In preprints, 20th Conference on Agricultural and Forest Meteorology, September 10-13, 1991, Salt Lake City, Utah. American Meteorological Society, Boston, MA. Paw U, K.T., 1993. Using surface renewal/turbulent coherent structure concepts to estimate and analyze scalar fluxes from plant canopies. Annales Geophysicae 11(supplement II):C284. Paw U, K.T. and Su, H-B., 1994. The usage of structure functions in studying turbulent coherent structures and estimating sensible heat flux. pp. 98-99. In preprints, 21st Conference on Agricultural and Forest Meteorology, March 7-11, 1994, San Diego, California. American Meteorological Society, Boston, MA. Paw U, K.T., Qiu, J. , Su, H.B., Watanabe, T., and Brunet, Y., 1995. Surface renewal analysis: a new method to obtain scalar fluxes without velocity data. Agric. For. Meteorol. 74:119-137. Spano D., Snyder R.L., Paw U K.T., and DeFonso E. 1994. Verification of the surface renewal method for estimating evapotranspiration. 297-298. In preprints, 21st Conference on Agricultural and Forest Meteorology, March 7-11, 1994, San Diego, California. American Meteorological Society, Boston, MA. Paw U, K.T., Su, H.-B., and Braaten, D.A., 1996. The usage of structure functions in estimating water vapor and carbon dioxide exchange between plant canopies and the atmosphere. pp. J14-J15. In preprints, 22nd Conference on Agricultural and Forest Meteorology, and 12th Conference on Biometeorology and Aerobiology, January 28-February 2, 1996, Atlanta, Georgia. American Meteorological Society, Boston, MA. Spano, D., Duce, P., Snyder, R.L., and Paw U, K.T., 1996. Verification of the structure function approach to determine sensible heat flux density using surface renewal analysis. pp. 163-164. In preprints, 22nd Conference on Agricultural and Forest Meteorology, January 28-February 2, 1996, Atlanta, Georgia. American Meteorological Society, Boston, MA. Spano, D., Duce, P., Snyder, R.L. and Paw U., K.T., 1998. Surface renewal analysis for sensible and latent heat flux density over sparse canopy. pp. 129-130. In preprints, 23rd Conference on Agricultural and Forest Meteorology, November 2-6, 1998, Albuquerque, New Mexico. American Meteorological Society, Boston, MA. Spano, D., Snyder, R.L., Duce, P., Paw U, K.T., Falk, M.B. 2000. Determining scalar fluxes over an old-growth forest using surface renewal. In, Preprints of the 24th Conference on Agricultural and Forest Meteorology, Davis, California, American Meteorological Society, Boston, MA, pp.80-81. Spano, D., Duce, P., Snyder, R.L., Paw U, K.T. and Falk, M. 2002. Surface renewal determination of scalar fluxes over an old-growth forest. In preprints, 25th Conference on Agricultural and Forest Meteorology, Norfolk, Virgina, American Meteorological Society, Boston, MA, Pp. 61-62. Snyder,R.L., Spano, D., Duce, P., Paw U, K.T., 2002. Using surface renewal analysis to determine crop water use coefficients. In preprints, 25th Conference on Agricultural and Forest Meteorology, Norfolk, Virgina, American Meteorological Society, Boston, MA, Pp. 9-10. Spano, D., Duce, P., Snyder, R.L., Paw U, K.T., Baldocchi, D., Xu, L., 2004. Micrometeorological measurements to assess fire fuel dryness. In CD preprints, 26th Conference on Agricultural and Forest Meteorology, Vancover, British Columbia, American Meteorological Society, Boston, MA, 11.7. Snyder, R.L., Spano, D., and Paw U, K.T., 1996. Surface renewal analysis for sensible and latent heat flux density. Boundary Layer Meteorol. 77:249-266. Spano, D., Snyder, R.L., Duce, P., and Paw U, K.T., 1997. Surface renewal analysis for sensible heat flux density using structure functions. Agric. Forest Meteorol., 86:259-271. Spano, D., Duce, P., Snyder, R.L., and Paw U, K.T., 1997. Surface renewal estimates of evapotranspiration. Tall Canopies. Acta Horticult., 449:63-68. Snyder, R.L., Paw U, K.T., Spano, D., and Duce, P., 1997. Surface renewal estimates of evapotranspiration. Theory. Acta Horticult., 449:49-55. Spano, D., Snyder, R.L., Duce, P., and Paw U, K.T., 2000. Estimating sensible and latent heat flux densities from grapevine canopies using surface renewal. Agric. Forest. Meteorol. 104:171-183. Paw U, K.T. 2001. Coherent structures and surface renewal. Book Chapter in Advanced Short Course on Agricultural, Forest and Micro Meteorology. Sponsored by the Consiglio Nazionale delle Ricerche (CNR, Italy) and the University of Sassari, Italy. Bologna, Italy, 2001, pp. 63-76. Paw U, K.T., Snyder, R.L., Spano, D., and Su, H.-B. 2005. Surface Renewal Estimates of Scalar Exchange. Refereed chapter in Micrometeorology of Agricultural Systems, ed. J.L. Hatfield, Agronomy Society of America. pp. 455-483. Snyder, R.L., Spano, D., Duce, P., Paw U, K.T., and Rivera, M., 2008. Surface Renewal estimation of pasture evapotranspiration. J. Irrig. Drainage Eng. ASCE 134:716-721. Shapland, T.M., McElrone, A.J., Snyder, R.L., and Paw U, K.T., 2012. Structure function analysis of two-scale scalar ramps. Part I: Theory and Modelling. Boundary-Layer Meteorol. 145:5-25. Shapland, T.M., McElrone, A.J., Snyder, R.L., and Paw U, K.T., 2012. Structure function analysis of two-scale scalar ramps. Part II: Ramp Characteristics and surface renewal flux estimation. Boundary-Layer Meteorol. 145:27-44. Shapland, T.M., McElrone, A.J., Snyder, R.L., and Paw U, K.T.,2013. A turnkey program for field-scale energy flux density measurements using eddy covariance and surface renewal. Italian J Agrometeorology, 18(1): 5-16. McElrone, AJ., Shapland, T.M., Calderon, A., Fitzmaurice, L., Paw U, K.T, and Snyder, R.L. 2013. J Visualized Exp. 82::e50666 Snyder, R. L., Spano D., Duce K.T., Paw U, Anderson F. E. and Falk M. (2007). Surface Renewal Manual. University of California, Davis, California. Spano, D., Snyder, R. L., & Duce, P. (2000). Estimating sensible and latent heat flux densities from grapevine canopies using surface renewal. Agricultural and Forest Meteorology, 104(3), 171-183. Suvočarev, K., Shapland, T. M., Snyder, R. L., & Martínez-Cob, A. (2014). Surface renewal performance to independently estimate sensible and latent heat fluxes in heterogeneous crop surfaces. Journal of hydrology, 509, 83-93. Paw U, K. T., Snyder, R. L., Spano, D., & Su, H. B. (2005). Surface renewal estimates of scalar exchange. Agronomy, 47, 455. Other references cited in the paper Castellvi, F., Martinez-Cob, A., & Perez-Coveta, O. (2006). Estimating sensible and latent heat fluxes over rice using surface renewal. Agricultural and forest meteorology, 139(1), 164-169. Castellvi, F., & Snyder, R. L. (2009). Combining the dissipation method and surface renewal analysis to estimate scalar fluxes from the time traces over rangeland grass near Ione (California). Hydrological processes, 23(6), 842-857. Castellví, F., & Snyder, R. L. (2009). On the performance of surface renewal analysis to estimate sensible heat flux over two growing rice fields under the influence of regional advection. Journal of hydrology, 375(3), 546-553. Castellvi, F., Snyder, R. L., & Baldocchi, D. D. (2008). Surface energy-balance closure over rangeland grass using the eddy covariance method and surface renewal analysis. Agricultural and forest meteorology, 148(6), 1147-1160. Katul, G., Hsieh, C. I., Oren, R., Ellsworth, D., & Phillips, N. (1996). Latent and sensible heat flux predictions from a uniform pine forest using surface renewal and flux variance methods. Boundary-Layer Meteorology, 80(3), 249-282.

Mengistu, M. G., & Savage, M. J. (2010). Surface renewal method for estimating sensible heat flux. Water SA, 36(1), 9-18. ***Kyaw Tha Paw U, Professor of Atmospheric Sciences and Biometeorologist; BS MIT, PhD, Yale University. [email protected]. Tel. (530) 752-1510

Web data acquisition: from database to dataframe for data analysis and visualization (Part 4)

The previous post described how the deeply nested JSON data on fligths were parsed and stored in an R-friendly database structure. However, looking into the data, the information is not yet ready for statistical analysis and visualization and some further processing is necessary before extracting insights and producing nice plots. In the parsed batch, it is clearly visible the redundant structure of the data with the flight id repeted for each segment of each flight. This is also confirmed with the following simple check as the rows of the dataframe are more than the unique counts of the elements in the id column.

dim(data_items)
[1] 397  15

length(unique(data_items$id))
201

# real time changes of data could produce different results

This implies that the information of each segment of each flight has to be aggregated and merged in a dataset as single observations of a statistical analysis between, for example, price and distance. First, a unique primary key for each observation has to be used as reference variable to uniquely identify each element of the dataset.

library(plyr) # sql like functions
library(readr) # parse numbers from strings
 
data_items <- data.frame(data_items)
 
# id (primary key)
data <- data.frame(unique(data_items$id))
colnames(data) <- c('id')
 
# n° of segment
n_segment <- aggregate(data_items['segment.id'], by=data_items['id'], length)
data <- join(data, n_segment, by='id', type='left', match='first') # sql left join
# mileage
mileage <- aggregate(data_items['segment.leg.mileage'], by=data_items['id'], sum)
data <- join(data, mileage, by='id', type='left', match='first') # sql left join
# price
price <- data.frame('id'=data_items$id, 'price'=parse_number(data_items$saleTotal))
data <- join(data, price, by='id', type='left', match='first') # sql left join
# dataframe
colnames(data) <- c('id','segment', 'mileage', 'price')
head(data)

The aggregation of mileage and price using the unique primary key allows to set up a dataframe ready for statistical analysis and data visualization. Current data tells us that there is a maximum of three segments in the connection between FCO and LHR with a minimum price of around EUR 122 and a median around EUR 600.


# descriptive statistics
summary(data)
 
 
# histogram price & distance
g1 <- ggplot(data, aes(x=price)) + 
  geom_histogram(bins = 50) +  
  ylab("Distribution of the Price (EUR)") +
  xlab("Price (EUR)") 
 
g2 <- ggplot(data, aes(x=mileage)) + 
  geom_histogram(bins = 50) +  
  ylab("Distribution of the Distance") +
  xlab("Distance (miles)")
 
grid.arrange(g1, g2)
# price - distance relationship
s0 <- ggplot(data = data, aes(x = mileage, y = price)) +
    geom_smooth(method = "lm", se=FALSE, color="black") +
    geom_point() + labs(x = "Distance in miles", y = "Price (EUR)")
s0 <- ggMarginal(s0, type = "histogram", binwidth = 30)
s0

Of course, plenty of other analysis and graphical representations using flights features are possible given the large set of variables available in QPX Express API and the availability of data in real time.

To conclude the 4-step (flight) trip from data acquisition to data analysis, let's recap the most important concepts described in each of the post: 1) Client-Server connection 2) POST request in R 3) Data parsing and structuring 4) Data analysis and visualization

That's all folks! #R #rstats #maRche #json #curl #qpxexpress #Rbloggers This post is also shared in www.r-bloggers.com

Shiny app to explore ggplot2

Do you struggle with learning ggplot2? Do you have problems with understanding what aesthetics actually do and how manipulating them change your plots?

Here is the solution! Explore 33 ggplot2 geoms on one website! I created this ggplot2 explorer to help all R learners to understand how to plot beautiful/useful charts using the most popular vizualization package ggplot2. It won’t teach you how to write a code, but definitely will show you how ggplot2 geoms look like, and how manipulating their arguments changes visualization. You can find my app here

And you can find code on my github

Shiny: data presentation with an extra

A Shiny app with three tabs presenting different sections of the same data.
Shiny is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with drop-down lists and checkboxes—programming-free. The apps can be useful for both the data analyst and the public.

Shiny apps are based on the Internet: This allows for private consultation of the data on one’s own browser as well as for online publication. Free apps can handle quite a lot of data, which can be increased with a subscription.

The target user of Shiny is extremely broad. Let’s take science—open science. At a time when openly archiving all data is becoming standard practice (e.g., OSF.io, Figshare.com, Github.com), Shiny can be used to walk the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging all aspects of the data. Needless to say, these apps do not replace raw data archiving.

The apps simply add. For instance, the data in the lower app is a little noisy, right? Indeed it shows none of the succession of waves that characterizes word reading. The app helped me in identifying this issue. Instead of running along a host of participants and electrodes through a heavy code score, I found that the drop-down lists of the app let me seamlessly surf the data. By Participant 7, though, my wave dropped me…

Those data were very poor—systematically poorer than those of any other participants. I then checked the EEG preprocessing logs, and confirmed that those data had to be discarded. So much for the analysis utility of such an app. On the open science utility, what I did on discovering the fault was maintain the discarded data in the app, with a note, so that any colleagues and reviewers could consider it too. Now, although this example of use concerns a rather salient trait in the data, some other Shiny app might help to spot patterns such as individual differences, third variables.

Building a Shiny app is not difficult. Apps basically draw on some data presentation code (tables, plots) that you already have. Then just add a couple of scripts into the folder: one for the user interface (named iu.R), one for the process (named server.R), and perhaps another one compiling the commands for deploying the app and checking any errors.

The steps to start a Shiny app from scratch are:

1: Tutorials. Being open-source software, the best manuals are available through a Google search.

2: User token. Signing up and reading in your private key—just once.

3: GO. Plunge into the ui and server scripts, and deployApp().

4: Bugs and logs. They are not bugs in fact—rather fancies. For instance, some special characters have to get even more special (technically, UTF-8 encoding). For a character such as μ, Shiny prefers Âμ. Just cling to error logs by calling:

showLogs(appPath = getwd(), appFile = NULL, appName = NULL, account = NULL, entries = 50, streaming = FALSE)

The log output will mention any typos and unaccepted characters, pointing to specific lines in your code.

It may take a couple of intense days to get a first app running. As usual with programming, it’s easy to run into the traps which are there to spice up the way. The app’s been around for years, so tips and tricks abound on the Internet. For greater companionship, there are dedicated Google groups, and then there’s StackExchange etc., where you can post any needs/despair. Post your code, log, and explanation, and you’ll be rescued out in a couple of days. Long live those contributors.

It will often be enough to upload a bare app, but you might then think it can look better.

5 (optional): Pro up. Use tabs to combine multiple apps in one, use different widgets, etc. Tutorials like this one on Youtube can take you there, especially those that provide the code, as in the description of that video. Use those scripts as templates. For example, see this script in which the function conditionalPanel() is used to modify the app’s sidebar based on which tab is selected. The utility of tabs is illustrated in the upper cover of this article and in the app shown in the text: When having multiple data sections, the tabs allow you to have all in one (cover screenshot), instead of linking to other apps in each (screenshot in text).

Time for logistics. You can include any text in your app’s webpage, such as explanations of any length, web links, and author information. Oh, also importantly: the Shiny application allows for the presentation of data in any of the forms available in R—notably, plots and tables. Costs: Shiny is free to use, in principle. You may only have to pay if you use your app(s) a lot—first month, commonly—, in which case you may pay 9 euros a month. There are different subscription plans. The free plan allows 25 hours of use per month, distributed among up to five apps.

There do exist alternatives to Shiny. One is fairly similar: It’s called Tableau. A nice comparison of the two apps is available here. Then, there’s a more advanced application called D3.js, which allows for lush graphics but proves more restrictive to newbies.

In sum, if you already program in R or even in Python, but have not tried online data presentation yet, consider it.

Feel free to share your ideas, experiences, or doubts in a comment on the original post.

New online datacamp course: Forecasting in R

Forecasting in R is taught by Rob J. Hyndman, author of the forecast package

Forecasting involves making predictions about the future. It is required in many situations such as deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning. This course provides an introduction to time series forecasting using R.

What You’ll Learn
Chapter 1: Exploring and Visualizing Time Series in R
The first thing to do in any data analysis task is to plot the data.
Chapter 2: Benchmark Methods and Forecast Accuracy
In this chapter, you will learn general tools that are useful for many different forecasting situations.
Chapter 3: Exponential Smoothing
is framework generates reliable forecasts quickly and for a wide range of time series.
Chapter 4: Forecasting with ARIMA Models
ARIMA models provide another approach to time series forecasting.
Chapter 5: Advanced Methods
In this chapter, you will look at some methods that handle more complicated seasonality.

You can start the free chapter for free of Forecasting in R.

It can be easy to explore data generating mechanisms with the simstudy package

I learned statistics and probability by simulating data. Sure, I battled my way through proofs, but I never believed the results until I saw it in a simulation. I guess I have it backwards, it worked for me. And now that I do this for a living, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure – I’ll use the occasional formula, but I always feel the need to check it with simulation. It’s just the way I am.

Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called simstudy, or simulating study data. My goal here is to introduce the basic idea behind simstudy, and provide a relatively simple example that comes from a question posed by a user who was trying to generate correlated longitudinal data.

The basic idea

Simulation using simstudy has two essential steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved. Simulation can include observed or randomized treatment assignment/exposures, survival data, longitudinal/panel data, multi-level/hierarchical data, data sets with correlated variables based on a specified covariance structure, and data sets with missing data resulting from any sort of missingness pattern.

The key to simulating data in simstudy is the creation of series of data definition tables that look like this:

Here’s the code that is used to generate this definition, which is stored as a data.table :

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")

To create a simple data set based on these definitions, all one needs to do is execute a single genData command. In this example, we generate 500 records that are based on the definition in the def table:

dt <- genData(500, def)

dt

[code language="lang="r"]
##      idnum nr       x1       y1  y2 xCat          g1 a1
##   1:     1  7 10.74929 30.01273 123    3 13310.84731  0
##   2:     2  7 18.56196 44.77329  17    1   395.41606  0
##   3:     3  7 16.96044 43.76427  42    3   522.45258  0
##   4:     4  7 19.51511 45.14214  22    3  3045.06826  0
##   5:     5  7 10.79791 27.25012 128    1   406.88647  0
##  ---                                                   
## 496:   496  7 19.96636 52.05377  21    3   264.85557  1
## 497:   497  7 15.97957 39.62428  44    2    40.59823  0
## 498:   498  7 19.74036 47.32292  21    3  3410.54787  0
## 499:   499  7 19.71597 48.26259  25    3   206.90961  1
## 500:   500  7 14.60405 28.94185  53    1    97.43068  0

There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want give a little more introduction by way of an example that came in from across the globe a couple of days ago. (I’d say the best thing about building an R package is hearing from folks literally all over the world and getting to talk to them about statistics and R.)

Going a bit further: simulating a prospective cohort study with repeated measures

The question was, can we simulate a study with two arms, say a control and treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? The answer is, of course.

Drawing on her original code, we wanted a scenario the included two treatment arms or exposures and three measurements per individual. The change over time was supposed to be greater for one of the groups. This was what I sent back to my correspondent:

# Define the outcome

ydef <- defDataAdd(varname = "Y", dist = "normal", 
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T", 
                   variance = 3)

# Generate a 'blank' data.table with 24 observations and assign them to groups

set.seed(1234)

indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

# Create a longitudinal data set of 3 records for each id

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# Let's look at the data

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")

My correspondent quickly pointed out that I hadn’t really provided her with the full solution – as the measurements for each individual in the example above are not correlated across the time periods. If we generate a data set based on 1,000 individuals and estimate a linear regression model we see that the parameter estimates are quite good, though we can see from the estimate of alpha (at the bottom of the output), which was approximately 0.02, that there is little within-individual correlation:

# Fit a GEE model to the data

fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)

## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)          4.98268 0.07227 4753.4   <2e-16 ***
## factor(T)Yes         1.48555 0.10059  218.1   <2e-16 ***
## period               2.53946 0.05257 2333.7   <2e-16 ***
## factor(T)Yes:period  3.51294 0.07673 2096.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    2.952 0.07325
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha  0.01737 0.01862
## Number of clusters:   1000   Maximum cluster size: 3

One way to generate correlated data

To address this shortcoming, there are at least two ways we can go about it. The first is to use the simstudy function genCorData. In this example, we generate correlated errors that are normally distributed with mean 0, variance 3, and common correlation coefficient of 0.7. Using this approach, the underlying data generation process is a bit cryptic, because we are acknowledging that we don’t know what is driving the relationship between the three outcomes, just that they have some common cause or other relationship that results in a strong relationship. However, it does the trick:

# define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")

# define the correlated errors

mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)

# generate correlated data for each id and assign treatment

dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")

# create longitudinal data set and generate outcome based on definition

longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id", timevars = c("V1", 
    "V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# look at the data, outcomes should appear more correlated, lines a bit straighter

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")

Again, we recover the true parameters. And this time, if we look at the estimated correlation, we see that indeed the outcomes are correlated within each individual. The estimate is 0.71, very close to the the “true” value 0.7.

fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")

summary(fit)

## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           5.0636  0.0762 4411   <2e-16 ***
## factor(T)Yes          1.4945  0.1077  192   <2e-16 ***
## period                2.4972  0.0303 6798   <2e-16 ***
## factor(T)Yes:period   3.5204  0.0426 6831   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     3.07   0.117
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.711  0.0134
## Number of clusters:   1000   Maximum cluster size: 3

Another way to generate correlated data

A second way to generate correlated data is through an individual level random-effect or random intercept. This could be considered some unmeasured characteristic of the individuals (which happens to have a convenient normal distribution with mean zero). This random effect contributes equally to all instances of an individuals outcomes, but the outcomes for a particular individual deviate slightly from the hypothetical straight line as a result of unmeasured noise.

ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0, variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect", 
    variance = 1)

indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

indData[1:6]

##    id T randomEffect
## 1:  1 0      -1.3101
## 2:  2 1       0.3423
## 3:  3 0       0.5716
## 4:  4 1       2.6723
## 5:  5 0      -0.9996
## 6:  6 1      -0.0722

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")

fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)

## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           4.9230  0.0694 5028   <2e-16 ***
## factor(T)Yes          1.4848  0.1003  219   <2e-16 ***
## period                2.5310  0.0307 6793   <2e-16 ***
## factor(T)Yes:period   3.5076  0.0449 6104   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     2.63  0.0848
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.619  0.0146
## Number of clusters:   1000   Maximum cluster size: 3

I sent all this back to my correspondent, but I haven’t heard yet if either of these solutions meets her needs. I certainly hope so. Speaking of which, if there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.

Three-Way Analysis of Variance: Simple Second-Order Interaction Effects and Simple Main Effects

In this article we will show how to run a three-way analysis of variance when both the third-order interaction effect and the second-order interaction effects are statistically significant. This type of analysis can become pretty tedious, especially when our factors have many levels, so we will try to explain it here as clearly as possible. (If you want to watch me doing these analyses live, get my free course on statistical analysis with R here.)

First of all, let’s present the fictitious data we are going to work with. Let’s suppose that a pharmaceutical company is planning to launch a new vitamin that allegedly improves the employees’ resistance to effort. The vitamin is tested on a sample of 720 employees, divided into three groups: employees who take a placebo (the control group), employees who take the vitamin in low dose and employees who take the vitamin in high dose. Half of the employees are male, and half are female. Moreover, we have both blue collar employees and white collar employees in our sample. The resistance to effort is measured on a scale whatsoever, from 1 to 30 (30 being the highest resistance). Our goal is to determine whether the effort resistance is influenced by three factors: dose of vitamin (placebo, low dose, and high dose), gender (male, female) and type of employee (blue collar, white collar). You can find the experiment data in CSV format here. Third-order interaction effect First of all, let’s check whether the third-order interaction effect is significant. We are going to run the analysis using the aov function in the stats package (our data frame is called vitamin). aov1 <- aov(effort~dose*gender*type, data=vitamin) summary(aov1) In the formula above the interaction effect is, of course, dosegendertype. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

                                  Df Sum Sq Mean Sq F value   Pr(>F)


dose:gender:type   2    187    93.4  22.367 3.81e-10

The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects. Simple second-order interaction effects The simple second-order interaction effects are the effects of each pair of factors at each level of the third factor. Specifically, we have to compute the following effects:

the interaction effect of dose and type of employee, for each gender category (male and female)
the interaction effect of gender and type of employee, at each dose level (placebo, low and high)
the interaction effect of dose and gender, for each type of employee (blue collar and white collar).

The total number of second-order interaction effect is given by the sum of the factor levels. In our case we have 7 effects (3+2+2). We will not analyze of all them, because this article would become too long. We will only focus on the first set of effects, leaving the others for you as an exercise. So let’s investigate the interaction effect of dose and type of employee for each gender group. First we have to create two separate data frames, for male and female employees. We do that with the filter command in the dplyr package (though you can also use brackets or subsets). vitamin_male <- filter(vitamin, gender=="male") vitamin_female <- filter(vitamin, gender=="female") Now we perform a two-way analysis of variance on each data frame (the factors being dose and type, of course). [code lang=””r””] aov1 <- aov(effort~dose*type, data=vitamin_male) summary(aov1) aov2 <- aov(effort~dose*type, data=vitamin_female) summary(aov2) The results of the analyses are shown below (we have only retained the lines with the interaction effects).

                Df Sum Sq Mean Sq F value   Pr(>F)    

 dose:type     2    249   124.7   28.42 3.57e-12            

                    Df Sum Sq Mean Sq F value   Pr(>F)    

 dose:type     2  137.2    68.6   17.31 6.74e-08

We can notice that both simple second-order interaction effects are significant (p<0.01). So we are dealing with a combined influence of the factors dose and type of employee in both male and female groups. In this situation, we have to examine the simple main effects for each factor. This is what we are going to do in the next section. Simple main effects Let’s compute the main effect for the factor dose of vitamin, which is the most important (after all, the company wants to demonstrate that the vitamin does affect the resistance to effort). You will be able to compute the other simple main effects yourself, using this as an example. Now we must create four separate data frames, for each combination of the factors gender and type of employee: male – blue collar, male – white collar, female – blue collar, female – white collar. [code lang=””r””] vitamin_male_blue <- filter(vitamin, gender=="male", type=="blue collar") vitamin_male_white <- filter(vitamin, gender=="male", type=="white collar") vitamin_female_blue <- filter(vitamin, gender=="female", type=="blue collar") vitamin_female_white <- filter(vitamin, gender=="female", type=="white collar") Next we perform a one-way ANOVA for each data frame. Let’s do it for the first group, male – blue collar. [code lang=””r””] aov1 <- aov(effort~dose, data=vitamin_male_blue) summary(aov1)


                 Df Sum Sq Mean Sq F value Pr(>F)    

 dose          2 2943.5  1471.8   349.9 <2e-16

The simple main effect for the factor dose on this group is statistically significant (p<0.01). In other words, there is a significant difference between placebo, low dose and high dose levels within the male – blue collar employees category, regarding the resistance to effort. To find out how big the differences are, we use the TuckeyHSD function to compute the test with the same name.

TukeyHSD(aov1)


                                              diff        lwr       upr   p adj

 low dose-high dose -2.528333  -3.413363 -1.643303     0

 placebo-high dose  -9.558333 -10.443363 -8.673303     0

 placebo-low dose   -7.030000  -7.915030 -6.144970     0

By inspection of the table we conclude that the differences in effort resistance between the dose groups are significant (p<0.01). The highest difference, in absolute values, is that between low dose and placebo levels: 9.5 points. So the employees who took a high dose present a higher resistance to effort than those who just took a placebo. One more example: the simple main effects of the variable dose of vitamin on the female – blue collar group. [code lang=””r””] aov1 <- aov(effort~dose, data=vitamin_female_blue) summary(aov1) Df Sum Sq Mean Sq F value Pr(>F)
dose 2 399.6 199.81 45.57 <2e-16

TukeyHSD(aov1)


                                            diff        lwr       upr     p adj

 low dose-high dose  1.083333  0.1797508  1.986916 0.0141485

 placebo-high dose  -2.476667 -3.3802492 -1.573084 0.0000000

 placebo-low dose   -3.560000 -4.4635826 -2.656417 0.0000000

The simple main effect is statistically significant, as it results from the first table. Furthermore, all the differences between dose levels are significant. The highest difference is the difference between low dose and placebo (3.5 points). To learn more on data analysis in R, check the free “Statistics with R” video course here.

Everyone knows that loops in R are to be avoided but vectorization is not always possible

It goes without saying that there are always many ways to solve a problem in R, but clearly some ways are better (for example, faster) than others. Recently, I found myself in a situation where I could not find a way to avoid using a loop, and I was immediately concerned, knowing that I would want this code to be flexible enough to run with a very large number of observations, possibly over many observations. Two tools immediately came to mind: data.table and Rcpp . This brief description explains the background of the simulation problem I was working on and walks through the evolution of ideas to address the problems I ran into when I tried to simulate a large number of individuals. In particular, when I tried to simulate a very large number of individuals, say over 1 million, running the simulation over night wasn’t enough.

Setting up the problem

The task in question here is not the focus, but needs a little explanation to understand what is motivating the programming issue. I am conducting a series of simulations that involve generating an individual-level stochastic (Markov) process for any number of individuals. For the data generation, I am using the simstudy package developed to help facilitate simulated data. The functions defDataAdd and genData are both from simstudy. The first part of the simulation involves specifying the transition matrix P that determine a state I am calling status, and then defining the probability of an event that are based on a particular status level at a particular time point. For each individual, I generate 36 months of data and a status and event for each month.

library(data.table)
library(simstudy)

set.seed(123)

P <- matrix(c(0.985, 0.015, 0.000, 0.000, 
              0.000, 0.950, 0.050, 0.000,
              0.000, 0.000, 0.850, 0.150,
              0.000, 0.000, 0.000, 1.000), 
            nrow = 4, byrow = TRUE)

form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"

dtDef <- defDataAdd(varname = "event", 
            formula = form, 
            dist = "binary",
            link = "identity")

N = 5000
did <- genData(N)

In order to simulate the Markov process, I decided immediately that Rcpp would be most appropriate because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding, states need to be generated sequentially, which means no obvious way to vectorize (if someone has figured that out, let me know.)

#include  < RcppArmadilloExtensions/sample.h >; 

// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]

IntegerVector MCsim( unsigned int nMonths, NumericMatrix P, 
                     int startStatus, unsigned int startMonth ) {
  
  IntegerVector sim( nMonths );
  IntegerVector m( P.ncol());
  NumericVector currentP;
  IntegerVector newstate;
  
  unsigned int q = P.ncol();
  
  m = Rcpp::seq(1, q);
  
  sim[startMonth - 1] = startStatus;
  
  for (unsigned int i = startMonth; i < nMonths; i++) {
    
    newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i-1) - 1));
    sim(i) = newstate(0);
    
  }
  
  return sim;
}

The process is simulated for each individual using the Rcpp function MCsim, but is done in the context of a data.table statement. The key here is that each individual is processed separately through the keyby = id statement. This obviates the requirement to loop through individuals even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case months) per individual.

dt <- did[, .(status = MCsim(36, P, 1, 1)), 
          keyby = id]
dt[, month := 1 : .N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)

dt

##           id status month event
##      1:    1      1     1     0
##      2:    1      1     2     0
##      3:    1      1     3     0
##      4:    1      1     4     0
##      5:    1      1     5     0
##     ---                        
## 179996: 5000      4    32     0
## 179997: 5000      4    33     0
## 179998: 5000      4    34     0
## 179999: 5000      4    35     0
## 180000: 5000      4    36     0

This is where things begin to slow down

It is the next phase of the simulation that started to cause me problems. For the simulation, I need to assign individuals to a group or cohort which is defined by a month and is based on several factors: (1) whether an event occurred in that month, (2) whether the status of that individual in that month exceeded a value of 1, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An indivdual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e. the earliest month where all three criteria are met.) Again, the specifics of the simulation are not important here. What is important, is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long, and they must be repeated a large number of times. My first, naïve, approach was to create an R function that loops through all the individuals and loops within each individual until a cohort is found:

rAssignCohortID <- function(id, month, status, 
                            event, nInds, 
                            startMonth, thresholdNum) {
  
  cohort <- rep(0, length(id));

  for (j in (1 : nInds))  {
    
    idMonth <- month[id == j];
    idEvent <- event[id == j];
    idStatus <- status[id == j];
    
    endMonth <- length(idMonth);
    
    done <- FALSE;
    i <- max(startMonth - idMonth[1], 13);
    
    while (i <= endMonth & !done) {
      
      if (idEvent[i] == 1 & idStatus[i] > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[begin:end]);
        
        if (sumED <= thresholdNum) {
          
          cohort[id == j] <- i - 1 + month[1];
          done = TRUE;
        }
      }    
      i = i + 1;
    }    
  }
  
  return(cohort);
}

system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])

##    user  system elapsed 
##   10.92    1.81   12.89

Working through possible solutions

The naïve approach works, but can we do better? I thought Rcpp might be a solution, because we know that loops in C++ are much more efficient. However, things did not turn out so well after I translated the function into C++; in fact, they got a little worse.


#include < Rcpp.h >;

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector cAssignCohortID( IntegerVector id, 
                    IntegerVector month, 
                    IntegerVector status, 
                    IntegerVector event,
                    int nInds, 
                    int startMonth, 
                    int thresholdNum) {
    
  IntegerVector cohort(id.length(), 0);
  
  IntegerVector idMonth;
  IntegerVector idEvent;
  IntegerVector idStatus;
  
  for (int j = 0; j < nInds; j++) {
    
    idMonth = month[id == j+1];
    idEvent = event[id == j+1];
    idStatus = status[id == j+1];

    int endMonth = idMonth.length();
    int sumED;
    bool done = FALSE;
    int i = std::max(startMonth - idMonth(0), 12);
    int begin;
    int end;

    while (i < endMonth & !done) {
      
      if (idEvent(i) == 1 & idStatus(i) > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[Rcpp::seq(begin, end)]);
        
        if (sumED >= thresholdNum) {
          cohort[id == j + 1] = i + month(0);
          done = TRUE;
        }
      }    
      i += 1;
    }    
  }
   
  return(cohort);
}

system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])

## user  system elapsed 
## 13.88   2.03   16.05

I know that the function cAssignCohortID bogs down not in the loop, but in each phase where I need to subset the data set to work on a single id. For example, I need to execute the statement idMonth = month[id == j+1] for each id, and this apparently uses a lot of resources. I tried variations on this theme, alternatives to subset the data set within the Rcpp function, but could get no improvements. But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things data.table is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual id at a time, and let data.table do the hard work of splitting up the data set to process by individual. As you can see, things got markedly faster.

rAssignCohort <- function(id, month, status, event, 
                   nInds, startMonth, thresholdNum) {
  
  cohort <- 0
  
  endMonth = length(month);
    
  done = FALSE;
  i = max(startMonth - month[1], 13);
  
  while (i <= endMonth & !done) {
    
    if (event[i] == 1 & status[i] > 1) {
        
      begin = i-12;
      end = i-1;
        
      sumED = sum(event[begin:end]);
        
      if (sumED >= thresholdNum) {
          
        cohort <- i - 1 + month[1];
        done = TRUE;
      }
    }    
    i = i + 1;
  }
  
  return(cohort)
}

system.time(dt[, cohort3 := rAssignCohort(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2), 
               keyby=id])

##    user  system elapsed 
##     0.2     0.0     0.2

Finally, it occurred to me that an Rcpp function that is not required to subset the data might offer more yet improvements in speed. So, for the last iteration, I combined the strengths of looping in Rcpp with the strengths of subsetting in data.table to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)

#include < Rcpp.h >; 

using namespace Rcpp;

// [[Rcpp::export]]
int cAssignCohort( IntegerVector month, 
                  IntegerVector status,
                  IntegerVector event,
                  int startMonth, int thresholdNum) {
  
  int endMonth = month.length();
  int sumED;
  int cohort = 0;
  bool done = FALSE;
  int i = std::max(startMonth - month(0), 12);
  int begin;
  int end;
  
  while (i < endMonth & !done) {
    
    if (event(i) == 1 & status(i) > 1) {
      
      begin = i-12;
      end = i-1;
      
      sumED = sum(event[Rcpp::seq(begin, end)]);
      
      if (sumED >= thresholdNum) {
        cohort = i + month(0);
        done = TRUE;
      }
    }    
    i += 1;
  }
  return(cohort);
}

system.time(dt[, cohort4 := cAssignCohort(month, status, event, 
                               startMonth=13,  thresholdNum = 2), keyby=id])

##    user  system elapsed 
##    0.01    0.00    0.01

For a more robust comparison, let’s use the benchmark function in package rbenchmark, and you can see how well data.table performs and how much Rcpp can add when used efficiently.

library(rbenchmark)

benchmark(
  dt[, cohort1 := rAssignCohortID(id, month, status, event,       # Naïve approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort2 := cAssignCohortID(id, month, status, event,        # Rcpp approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort3 := rAssignCohort(id, month, status, event,    # data.table approach
                      nInds = N, startMonth = 13, thresholdNum = 2), keyby=id],
  dt[, cohort4 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
    replications = 5,
    columns = c("replications", "elapsed", "relative"))

##   replications elapsed relative
## 1            5   46.18    461.8
## 2            5   68.01    680.1
## 3            5    0.96      9.6
## 4            5    0.10      1.0

Postscript

I shared all of this with the incredibly helpful folks who have created data.table, and they offered a data.table only solution that avoids all looping, which I will share here for completeness. While it is an improvement over the third approach presented above (R function with data.table statment keyby), it is still no match for the fastest solution. (But, this all just goes to show you there will always be new approaches to consider, and I don’t claim to have come any where near to trying them all out.)

dtfunc <- function(dx) {
  
  dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by=id]
  map <- CJ(id=1:N, start=13L, end=36L, event=1L, statusx=1L, prev12x=1L)
  ans <- dx[map, on=.(id, event, status > statusx, prev12 > prev12x, month >= start, month <= end), 
            .I, allow=TRUE, by=.EACHI, nomatch=0L][, .(id, I)]
  minans <- ans[, .(I=min(I)), by=id]
  
  dx <- dx[, cohort5 := 0L][minans, cohort5 := min(month) - 1L + dx$month[I], on="id", by=.EACHI]

  return(dx)
}

system.time(dtfunc(dt))

##    user  system elapsed 
##    0.18    0.00    0.17

And here is a more complete comparison of the fastest version with this additional approach:

benchmark(
  dt[, cohort6 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
  dt2 <- dtfunc(dt),
  replications = 5,
  columns = c("replications", "elapsed", "relative"))

##   replications elapsed relative
## 1            5    0.10      1.0
## 2            5    0.85      8.5

Sinew: a R Package to create self-populating roxygen2 skeletons

Sinew is a R package that generates a roxygen2 skeletons populated with information scraped from within the function script. The goal of the package is to automate nearly all of the mundane tasks needed to document functions, properly set up the import fields for oxygenation, and make it easier to attain documentation consistency across functions.

Functionality

makeOxygen: Create a skeleton for roxygen2 documentation populated with information scraped from within the package function scripts.
makeImport: Create import(s) calls for DESCRIPTION, NAMESPACE, and roxygen2.
makeDictionary: Create a R file of all the unique roxygen2 parameter fields in a package R subdirectory.

Installation

#CRAN
install.packages('sinew')

#DEV
devtools::install_github('metrumresearchgroup/sinew')

Example Output

makeOxygen(lm)
#' @title FUNCTION_TITLE
#' @description FUNCTION_DESCRIPTION
#' @param formula PARAM_DESCRIPTION
#' @param data PARAM_DESCRIPTION
#' @param subset PARAM_DESCRIPTION
#' @param weights PARAM_DESCRIPTION
#' @param na.action PARAM_DESCRIPTION
#' @param method PARAM_DESCRIPTION, Default: 'qr'
#' @param model PARAM_DESCRIPTION, Default: TRUE
#' @param x PARAM_DESCRIPTION, Default: FALSE
#' @param y PARAM_DESCRIPTION, Default: FALSE
#' @param qr PARAM_DESCRIPTION, Default: TRUE
#' @param singular.ok PARAM_DESCRIPTION, Default: TRUE
#' @param contrasts PARAM_DESCRIPTION, Default: NULL
#' @param offset PARAM_DESCRIPTION
#' @param ... PARAM_DESCRIPTION
#' @return OUTPUT_DESCRIPTION
#' @importFrom stats model.frame

In more detail…

For new developers, getting a package ready for building and submitting to CRAN is an expletive-filled, head-scratching experience to say the least. Trying to figure out the basics of what goes in depends and what goes in imports is a lost afternoon most of us would like back. Once that is understood, filling in relevant information to each field is a mundane task even for a well polished package developer. The out-of-the-box roxygen2 skeleton supplied by RStudio gives the bare bones road map of what should be part of function documentation:

Machine Learning Using Support Vector Machines

Support Vector Machines (SVM) is a data classification method that separates data using hyperplanes. The concept of SVM is very intuitive and easily understandable. If we have labeled data, SVM can be used to generate multiple separating hyperplanes such that the data space is divided into segments and each segment contains only one kind of data. SVM technique is generally useful for data which has non-regularity which means, data whose distribution is unknown.

Let’s take a simple example to understand how SVM works. Say you have only two kinds of values and we can represent them as in the figure:

This data is simple to classify and one can see that the data is clearly separated into two segments. Any line that separates the red and blue items can be used to classify the above data. Had this data been multi-dimensional data, any plane can separate and successfully classify the data. However, we want to find the “most optimal” solution. What will then be the characteristic of this most optimal line? We have to remember that this is just the training data and we can have more data points which can lie anywhere in the subspace. If our line is too close to any of the datapoints, noisy test data is more likely to get classified in a wrong segment. We have to choose the line which lies between these groups and is at the farthest distance from each of the segments. To solve this classifier line, we draw the line as y=ax+b and make it equidistant from the respective data points closest to the line. So we want to maximize the margin (m).

We know that all points on the line ax+b=0 will satisfy the equation. If we draw two parallel lines – ax+b=-1 for one segment and ax+b=1 for the other segment such that these lines pass through a datapoint in the segment which is nearest to our line then the distance between these two lines will be our margin. Hence, our margin will be m=2/||a||. Looked at another way, if we have a training dataset (x1,x2,x3…xn) and want to produce and outcome y such that y is either -1 or 1 (depending on which segment the datapoint belongs to), then our classifier should correctly classify data points in the form of y=ax+b. This also means that y(ax+b)>1 for all data points. Since we have to maximize the margin, we have the solve this problem with the constraint of maximizing 2/||a|| or, minimizing ||a||^2/2 which is basically the dual form of the constraint. Solving this can be easy or complex depending upon the dimensionality of data. However, we can do this quickly with R and try out with some sample dataset. To use SVM in R, I just created a random data with two features x and y in excel. I took all the values of x as just a sequence from 1 to 20 and the corresponding values of y as derived using the formula y(t)=y(t-1) + r(-1:9) where r(a,b) generates a random integer between a and b. I took y(1) as 3. The following code in R illustrates a set of sample generated values: x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20) y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60) #Create a data frame of the data train=data.frame(x,y) Let’s see how our data looks like. For this we use the plot function #Plot the dataset plot(train,pch=16)

Seems to be a fairly good data. Looking at the plot, it also seems like a linear regression should also be a good fit to the data. We’ll use both the models and compare them. First, the code for linear regression: #Linear regression model <- lm(y ~ x, train) #Plot the model using abline abline(model)

SVM To use SVM in R, we have a package e1071. The package is not preinstalled, hence one needs to run the line “install.packages(“e1071”) to install the package and then import the package contents using the library command. The syntax of svm package is quite similar to linear regression. We use svm function here. #SVM library(e1071) #Fit a model. The function syntax is very similar to lm function model_svm <- svm(y ~ x , train) #Use the predictions on the data pred <- predict(model_svm, train) #Plot the predictions and the plot to see our model fit points(train$x, pred, col = "blue", pch=4)

The points follow the actual values much more closely than the abline. Can we verify that? Let’s calculate the rmse errors for both the models: #Linear model has a residuals part which we can extract and directly calculate rmse error <- model$residuals lm_error <- sqrt(mean(error^2)) # 3.832974 #For svm, we have to manually calculate the difference between actual values (train$y) with our predictions (pred) error_2 <- train$y – pred svm_error <- sqrt(mean(error_2^2)) &amp;amp;amp;amp;nbsp;# 2.696281 In this case, the rmse for linear model is ~3.83 whereas our svm model has a lower error of ~2.7. A straightforward implementation of SVM has an accuracy higher than the linear regression model. However, the SVM model goes far beyond that. We can further improve our SVM model and tune it so that the error is even lower. We will now go deeper into the SVM function and the tune function. We can specify the values for the cost parameter and epsilon which is 0.1 by default. A simple way is to try for each value of epsilon between 0 and 1 (I will take steps of 0.01) and similarly try for cost function from 4 to 2^9 (I will take exponential steps of 2 here). I am taking 101 values of epsilon and 8 values of cost function. I will thus be testing 808 models and see which ones performs best. The code may take a short while to run all the models and find the best version. The corresponding code will be svm_tune <- tune(svm, y ~ x, &amp;amp;amp;amp;nbsp;data = train, ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9)) ) print(svm_tune) #Printing gives the output: #Parameter tuning of ‘svm’: # – sampling method: 10-fold cross validation #- best parameters: # epsilon cost #0.09 256 #- best performance: 2.569451 #This best performance denotes the MSE. The corresponding RMSE is 1.602951 which is square root of MSE An advantage of tuning in R is that it lets us extract the best function directly. We don’t have to do anything and just extract the best function from the svm_tune list. We can now see the improvement in our model by calculating its RMSE error using the following code #The best model best_mod <- svm_tune$best.model best_mod_pred <- predict(best_mod, train) error_best_mod <- train$y – best_mod_pred # this value can be different on your computer # because the tune method randomly shuffles the data best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738 This tuning method is known as grid search. R runs all various models with all the possible values of epsilon and cost function in the specified range and gives us the model which has the lowest error. We can also plot our tuning model to see the performance of all the models together plot(svm_tune)

This plot shows the performance of various models using color coding. Darker regions imply better accuracy. The use of this plot is to determine the possible range where we can narrow down our search to and try further tuning if required. For instance, this plot shows that I can run tuning for epsilon in the new range of 0 to 0.2 and while I’m at it, I can move in even lower steps (say 0.002) but going further may lead to overfitting so I can stop at this point. From this model, we have improved on our initial error of 2.69 and come as close as 1.29 which is about half of our original error in SVM. We have come very far in our model accuracy. Let’s see how the best model looks like when plotted. plot(train,pch=16) points(train$x, best_mod_pred, col = "blue", pch=4)

Visually, the points predicted by our tuned model almost follow the data. This is the power of SVM and we are just seeing this for data with two features. Imagine the abilities of the model with more number of complex features! Summary SVM is a powerful technique and especially useful for data whose distribution is unknown (also known as non-regularity in data). Because the example considered here consisted of only two features, the SVM fitted by R here is also known as linear SVM. SVM is powered by a kernel for dealing with various kinds of data and its kernel can also be set during model tuning. Some such examples include gaussian and radial. Hence, SVM can also be used for non-linear data and does not require any assumptions about its functional form. Because we separate data with the maximum possible margin, the model becomes very robust and can deal with incongruencies such as noisy test data or biased train data. We can also interpret the results produced by SVM through visualization. A common disadvantage with SVM is associated with its tuning. The level of accuracy in predicting over the training data has to be defined in our data. Because our example was custom generated data, we went ahead and tried to get our model accuracy as high as possible by reducing the error. However, in business situations when one needs to train the model and continually predict over test data, SVM may fall into the trap of overfitting. This is the reason SVM needs to be carefully modeled – otherwise the model accuracy may not be satisfactory. As I did in the example, SVM technique is closely related to regression technique. For linear data, we can compare SVM with linear regression while non-linear SVM is comparable to logistic regression. As the data becomes more and more linear in nature, linear regression becomes more and more accurate. At the same time, tuned SVM can also fit the data. However, noisiness and bias can severely impact the ability of regression. In such cases, SVM comes in really handy! With this article, I hope one may understand the basics of SVM and its implementation in R. One can also deep dive into the SVM technique by using model tuning. The full R code used in the article is laid out as under: x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20) y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60) #Create a data frame of the data train=data.frame(x,y) #Plot the dataset plot(train,pch=16) #Linear regression model <- lm(y ~ x, train) #Plot the model using abline abline(model) #SVM library(e1071) #Fit a model. The function syntax is very similar to lm function model_svm <- svm(y ~ x , train) #Use the predictions on the data pred <- predict(model_svm, train) #Plot the predictions and the plot to see our model fit points(train$x, pred, col = "blue", pch=4) #Linear model has a residuals part which we can extract and directly calculate rmse error <- model$residuals lm_error <- sqrt(mean(error^2)) # 3.832974 #For svm, we have to manually calculate the difference between actual values (train$y) with our predictions (pred) error_2 <- train$y – pred svm_error <- sqrt(mean(error_2^2)) # 2.696281 # perform a grid search svm_tune <- tune(svm, y ~ x, data = train, ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9)) ) print(svm_tune) #Parameter tuning of ‘svm’: # – sampling method: 10-fold cross validation #- best parameters: # epsilon cost #0 8 #- best performance: 2.872047 #The best model best_mod <- svm_tune$best.model best_mod_pred <- predict(best_mod, train) error_best_mod <- train$y – best_mod_pred # this value can be different on your computer # because the tune method randomly shuffles the data best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738 plot(svm_tune) plot(train,pch=16) points(train$x, best_mod_pred, col = "blue", pch=4) Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.