Granger causality without assuming linear regression: enhancements to the generalCorr package

Consider the regression Y(t) = a0 + a1 Y(t-1) + ... + ap Y(t-p) + b1 X(t-1) + ... + bp X(t-p) + e(t).

Let (X —g—> Y) denote that the time series X(t) Granger-causes the series Y(t). The R package `lmtest' has a function grangertest() for testing (X —g—> Y). It tests the Granger non-causality null hypothesis H0: b1 = b2 = ... = bp = 0, i.e., that the coefficients on all lagged X terms are zero. This is a standard procedure in econometrics textbooks and assumes linear regression and the F-test. The F-test is correct only if the underlying distribution of the regression errors e(t) is Normal. Normality is a strong assumption and easily relaxed by using the bootstrap. generalCorr::bootGcRsq relaxes the Normality assumption and uses kernel regressions, which provide far better fits (higher R-squares). generalCorr::causeSummary(mtx) is a powerful tool for assessing concurrent causality, which is not covered by Granger causality.

Measures of dependence in statistics are symmetric. Why? Dependence relations in nature or data are almost never symmetric. (a) An infant depends on its mother for survival, but the mother's survival does not equally depend on the infant. (b) New York's rainfall depends on its latitude, but the latitude does not depend on New York's rainfall at all. As a measure of dependence, the 100+ year old Pearson correlation coefficient miserably underestimates dependence. For example, if x=1:10 and y=sin(x), y depends perfectly on x, so a good measure of dependence should be 1. Instead, the Pearson correlation coefficient of -0.17 under-estimates it by 83%. The gmcmtx0(mtx) function in the `generalCorr' package provides a non-symmetric matrix of generalized correlation coefficients with the correct measure of dependence, and depMeas(x,y) gives a correct measure of dependence between two variables.
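A minimal sketch of these calls follows (the ChickEgg illustration is taken from the lmtest documentation; the rest reuses the toy x and y above, and no output values are reproduced here):

library(lmtest)
library(generalCorr)
#classical Granger test assuming linear regression and the F-test
data(ChickEgg)
grangertest(egg ~ chicken, order = 3, data = ChickEgg) #does chicken Granger-cause egg?
#non-symmetric dependence measures on the toy example
x=1:10
y=sin(x)
cor(x,y)           #Pearson correlation, about -0.17
depMeas(x,y)       #generalized dependence measure, close to 1
mtx=cbind(y,x)
gmcmtx0(mtx)       #non-symmetric matrix of generalized correlations
causeSummary(mtx)  #concurrent (non-Granger) causality summary
#see help(bootGcRsq) for the bootstrap kernel-regression version of the Granger test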

R can help decide the opening of pandemic-ravaged economies

Epidemiological models do not provide death forecasts. Statistical models using the inverse Mills ratio, generalized linear models (GLM) with a Poisson link for death-count data, plus autoregressive distributed lag (ARDL) models can provide improved death forecasts for individual states.

See details at http://ssrn.com/abstract=3637682 and http://ssrn.com/abstract=3649680. Our point-estimate forecast for US deaths for the week ending July 27, 2020 is 6529, with a 95% confidence interval of [optimistic 4719 to pessimistic 8026] deaths.

The following two functions may be useful. They compute 95% confidence intervals for an individual state's forecast of the latest count of Covid-19 deaths using my `meboot' package. I recommend using reps=1000 for more reliable confidence intervals.

fittedLast=function(yt,zt){
#ARDL(1,1): yt = a + b1 zt + b2 yLag + b3 zLag + err
#yt = actual NEW deaths in week t, or ND
#zt = NEW deaths forecast by the US-wide model using IMR, or PredNewDeaths
#we expect an equal number of weeks of data for yt and the predicted new deaths zt
yt=as.numeric(yt)
zt=as.numeric(zt)
nweeks=length(yt)
if (length(zt)!=nweeks) stop("unmatched yt zt in fittedLast")

#yt and zt are both nweeks x 1 vectors

yLag=as.numeric(yt[1:(nweeks-1)])
yyt=as.numeric(yt[2:nweeks])
zzt=as.numeric(zt[2:nweeks])
zLag=as.numeric(zt[1:(nweeks-1)])

#not sure if glm will work; fit only when all counts are positive

latest.fitted=NA
if(min(yt)>0 & min(zt)>0){
reg=glm(yyt~yLag+zzt+zLag,family=poisson(link = "log"))
#reg=lm(yyt~yLag+zzt)
f1=fitted.values(reg)
#f1=fitted(reg)
latest.fitted=f1[length(f1)]
}
return(latest.fitted)
}#end function fittedLast

#example
#set.seed(99)
#z=sample(1:10);y=sample(1:10);
#fittedLast(y,z)
#reg=lm(y[2:10]~y[1:9]+z[2:10])
#print(cbind(y[2:10],fitted(reg)))
#coef(reg)

ARDL95=function(yt,zt,reps=100){
require(meboot)

#yt = actual NEW deaths in week t, or ND
#zt = NEW deaths forecast by the US-wide model using IMR, or PredNewDeaths

#we expect an equal number of weeks of data for the predicted new deaths zt
#calls fittedLast, which fits
#ARDL(1,1): yt = a + b1 zt + b2 yLag + b3 zLag + err
yt=as.numeric(yt)
zt=round(as.numeric(zt),0)
nweeks=length(yt)
if (length(zt)!=(nweeks)) stop("unmatched yt zt in ARDL95")
mid=fittedLast(yt=yt, zt=zt) #middle of confInterval
#create data
y.ens=meboot(x=as.numeric(yt), reps=reps)$ens
z.ens=meboot(x=as.numeric(zt), reps=reps)$ens
outf=rep(NA,reps)#place to store one fitted value per replicate
for (j in 1:reps){
outf[j]=fittedLast(yt=y.ens[,j],z.ens[,j])
}# end j loop over reps
#use ensembles above to get 95percent conf interval
#of the latest.fitted
qu1=quantile(outf,probs=c(0.025,0.975),na.rm=TRUE)
out3=c(qu1[1],mid,qu1[2])
return(out3)
}#end function ARDL95
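
A hypothetical usage example follows (the weekly counts are invented for illustration only; in practice yt holds a state's actual weekly new deaths and zt the corresponding model forecasts):

#example
#yt=c(120,95,88,102,76,81,70) #hypothetical actual weekly new deaths
#zt=c(110,100,90,95,80,78,72) #hypothetical forecast weekly new deaths
#set.seed(235)
#ARDL95(yt=yt,zt=zt,reps=1000)
#returns c(lower 2.5% limit, fitted point forecast, upper 97.5% limit)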


R code for removing bias in Covid-19 data using econometric tools

A version of our detailed paper with all the theory, entitled “A Novel Solution to Biased Data in Covid-19 Incidence Studies,” has been submitted to the journal Econometrica. It is available at vinod/virus1.pdf and also at ssrn.com/abstract=3637682. The paper solves the bias in data arising from non-random testing of the entire population for Covid-19. A two-equation model overcomes the bias by using the inverse Mills ratio and improves forecasts of new deaths in all nine out of nine weekly out-of-sample comparisons. We identify the political party of the governors of states with desirable negative and undesirable positive trends.
library(sampleSelection)
library(margins)
library(dplyr)
##Pull Data
Cumulative_Deaths=read.csv("Cumulative Deaths.csv")
Cumulative_Infections=read.csv("Cumulative Infections.csv")
Covid=read.csv("Covid_0615.csv")
##clean data
Covid$State.Abbreviation<-Covid$STA
Final_Dataset<-merge(Covid,Cumulative_Deaths,by="State.Abbreviation")
Data<-merge(Final_Dataset,Cumulative_Infections,by="State.Abbreviation")
2020/06/15: First Step Probit Model

myprobit<-glm(TI~HES+CPT+UI+HR+HI, family=binomial(link="probit"),data=Data)
##TI is Testing Indicator (i.e. 1 for tested and 0 for not tested)
##HES is Hospital Employee Share
##CPT is portion of population that Commutes on Public Transit
##UI is Uninsured Population
##HR is Hypertension Rate (one of the leading comorbidities)
##HI is Household Income

summary(myprobit)
## 
## Call:
## glm(formula = TI ~ HES + CPT + UI + HR + HI, family = binomial(link = "probit"), 
##     data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7296  -0.3849  -0.3553  -0.3376   2.5180  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.4319390  0.0636110 -38.231  < 2e-16 ***
## HES          0.0663127  0.0081659   8.121 4.64e-16 ***
## CPT          0.0193450  0.0004345  44.523  < 2e-16 ***
## UI          -0.0070335  0.0009966  -7.058 1.69e-12 ***
## HR           0.0198518  0.0011021  18.012  < 2e-16 ***
## HI           0.0021614  0.0004608   4.690 2.73e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 254187  on 499999  degrees of freedom
## Residual deviance: 250590  on 499994  degrees of freedom
## AIC: 250602
## 
## Number of Fisher Scoring iterations: 5
2020/06/15: Probit Model Marginal Effects

#Marginal Effects of probit model
effects_pro=margins(myprobit) #print(effects_pro)
summary(effects_pro)
 

    factor           AME           SE          z            p         lower         upper
1      CPT  0.0025685341 5.801388e-05  44.274475 0.000000e+00  0.0024548290  0.0026822392
2      HES  0.0088046672 1.084429e-03   8.119170 4.693813e-16  0.0066792246  0.0109301098
3       HI  0.0002869787 6.119577e-05   4.689519 2.738481e-06  0.0001670372  0.0004069203
4       HR  0.0026358250 1.465264e-04  17.988742 2.387156e-72  0.0023486386  0.0029230114
5       UI -0.0009338730 1.323531e-04  -7.055923 1.714593e-12 -0.0011932802 -0.0006744657
2020/06/15: Extract inverse Mills ratio from probit model

##Compute inverse Mills ratio from probit model using the 'invMillsRatio' command from the 'sampleSelection' package.
IMR=invMillsRatio(myprobit,all=FALSE)
Data$IMR<-invMillsRatio(myprobit)$IMR1
##create table of IMR values by state
IMR.Index=cbind(as.character(Data$State.Abbreviation),Data$IMR)
IMR_table=distinct(data.frame(IMR.Index))
2020/06/15: Second Step Heckman Equation (i.e. with IMR)

##GLM with Poisson distribution to estimate cumulative deaths based on the log of cumulative infections from the previous week, as well as the IMR. 
##Use 'glm' command with 'family=poisson(link="log")'
##Only ran the model on individuals who were tested (i.e. TI = 1).
##CD is "Cumulative Deaths"
##CI is "Cumulative Infections"

glm_IMR_0615_CD<-glm(X20200615_CD~log(X20200608_CI)+IMR,family=poisson(link="log"),
                     data=subset(Data,TI==1))
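
The comparison model without the IMR term, whose fitted values appear below as Fitted_CD, is not shown in the code above; a plausible definition, assuming the same Poisson GLM with the IMR regressor simply dropped, is:

glm_0615_CD<-glm(X20200615_CD~log(X20200608_CI),family=poisson(link="log"),
                 data=subset(Data,TI==1)) #assumed specification of the no-IMR baseline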

#Compute fitted values for each state
Tested_Individuals<-subset(Data,TI==1)
Fitted_CD_IMR=fitted(glm_IMR_0615_CD)
Fitted_CD=fitted(glm_0615_CD)

Results_Table=cbind(as.character(Tested_Individuals$State.Abbreviation),Tested_Individuals$X20200622_CI,
               Tested_Individuals$X20200622_CD,Tested_Individuals$X20200615_CD,Fitted_CD_IMR,Fitted_CD,
               Tested_Individuals$IMR)

#Pull fitted value for each state
Final_Results=distinct(data.frame(Results_Table))

View(Final_Results)
Assessing Nation-wide Forecast

#Compare fitted values with in-sample (2020/06/15) and out-of-sample (2020/06/22) actual deaths

Actual_Infections_Out_Sample=sum(as.numeric(Final_Results[,2]))

Actual_Deaths_Out_Sample=sum(as.numeric(Final_Results[,3]))
Predict_Deaths_IMR=sum(as.numeric(Final_Results$Fitted_CD_IMR))
Predict_Deaths_no_IMR=sum(as.numeric(Final_Results$Fitted_CD))
Out_of_sample_per_diff_CD_IMR=(Predict_Deaths_IMR-Actual_Deaths_Out_Sample)/Actual_Deaths_Out_Sample
Out_of_sample_per_diff__CD_no_IMR=(Predict_Deaths_no_IMR-Actual_Deaths_Out_Sample)/Actual_Deaths_Out_Sample

Actual_Deaths_In_Sample=sum(as.numeric(Final_Results[,4]))
In_sample_per_diff_CD_IMR=(Predict_Deaths_IMR-Actual_Deaths_In_Sample)/Actual_Deaths_In_Sample
In_sample_per_diff_CD_no_IMR=(Predict_Deaths_no_IMR-Actual_Deaths_In_Sample)/Actual_Deaths_In_Sample

print(rbind(Actual_Deaths_In_Sample,In_sample_per_diff_CD_IMR,In_sample_per_diff_CD_no_IMR))
##                                       [,1]
## Actual_Deaths_In_Sample       1.098220e+05
## In_sample_per_diff_CD_IMR     4.064085e-03
## In_sample_per_diff_CD_no_IMR -4.705036e-02
print(rbind(Actual_Infections_Out_Sample,Actual_Deaths_Out_Sample,Predict_Deaths_no_IMR, Out_of_sample_per_diff_CD_IMR,Out_of_sample_per_diff__CD_no_IMR))
##                                            [,1]
## Actual_Infections_Out_Sample       2.290489e+06
## Actual_Deaths_Out_Sample           1.139440e+05
## Predict_Deaths_no_IMR              1.046548e+05
## Out_of_sample_per_diff_CD_IMR     -3.225860e-02
## Out_of_sample_per_diff__CD_no_IMR -8.152394e-02

In all nine weekly comparisons, using the IMR bias correction improves out-of-sample forecasts of Covid-19 deaths for the US as a whole.

cbind(Week,IMR_Perc_Diff,Without_IMR_Perc_Diff)
##       Week IMR_Perc_Diff Without_IMR_Perc_Diff
##  [1,]    1        -25.27                -26.96
##  [2,]    2        -21.61                -22.50
##  [3,]    3        -17.66                -18.73
##  [4,]    4        -12.14                -13.78
##  [5,]    5         -9.38                -11.28
##  [6,]    6         -7.25                 -9.70
##  [7,]    7         -5.53                 -9.01
##  [8,]    8         -3.94                 -8.03
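
The vectors printed above can be assembled along the following lines (a sketch with hypothetical per-week vectors Pred_IMR, Pred_noIMR, and Actual; the actual entries come from repeating the weekly comparison above for each week):

#Week=1:8
#Pred_IMR, Pred_noIMR, Actual are length-8 vectors of weekly US totals (hypothetical names)
#IMR_Perc_Diff=round(100*(Pred_IMR-Actual)/Actual,2)
#Without_IMR_Perc_Diff=round(100*(Pred_noIMR-Actual)/Actual,2)
#cbind(Week,IMR_Perc_Diff,Without_IMR_Perc_Diff)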

Epidemiologists should consider GLM & IMR to overcome testing bias and obtain better Covid death predictions

A Social Science Research Network (SSRN) reviewer has posted our paper “A Novel Solution to Biased Data in Covid-19 Incidence Studies,” written by myself (H D Vinod) and K Theiss. It uses a new two-equation generalized linear model (GLM) with a Poisson link for forecasting Covid-19 deaths in the US and in individual states. The first equation explicitly models the probability of being tested and helps build an estimate of the bias using the inverse Mills ratio (IMR). The detailed paper is available for free download at:

ssrn.com/abstract=3637682. It uses open-source R software. A supplement to the paper entitled “State-by-state Forecasts of Covid-19 Deaths” is available for free download at http://www.fordham.edu/economics/vinod/WeeklyCovid.pdf. Our bias-corrected out-of-sample death forecasts for future weeks should help state governors and mayors decide on limits to opening local economies. One can compare the performance of Democratic versus Republican governors using the tabulated information.