Numerical Partial Derivative Estimation – the {NNS} package

NNS (v0.5.5), now on CRAN, has an updated partial derivative routine, dy.d_(). This function estimates true average partial derivatives, as well as partial derivatives under ceteris paribus conditions for points of interest.

The example below shows the syntax for estimating the first derivatives of the function y = x_1^2 * x_2^2 at the point x_1 = 0.5, x_2 = 0.5, with respect to both regressors x_1 and x_2.

set.seed(123)
x_1 = runif(1000)
x_2 = runif(1000)
y = x_1 ^ 2 * x_2 ^ 2
dy.d_(cbind(x_1, x_2), y, wrt = 1:2, eval.points = t(c(.5,.5)))["First",]
[[1]]
[1] 0.2454744

[[2]]
[1] 0.2439307


The analytical solution for both regressors at x_1 = x_2 = 0.5 is 0.25.
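
As a quick check of that value (plain arithmetic, not using NNS), the analytical partial derivative dy/dx_1 = 2 * x_1 * x_2^2 can be evaluated directly; by symmetry the same holds for x_2:

2 * 0.5 * 0.5 ^ 2
[1] 0.25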

The referenced paper gives many more examples, comparing dy.d_() to kernel regression gradients and OLS coefficients.

For even more NNS capabilities, check out the examples at GitHub:
https://github.com/OVVO-Financial/NNS/blob/NNS-Beta-Version/examples/index.md


Reference Paper:
Vinod, Hrishikesh D. and Viole, Fred, Comparing Old and New Partial Derivative Estimates from Nonlinear Nonparametric Regressions
https://ssrn.com/abstract=3681104 

Supplemental Materials:
https://ssrn.com/abstract=3681436


Most popular on Netflix. Daily Tops for last 60 days

Every day, around 9 pm, I get a fresh portion of the Netflix Top movies / TV shows. I have been doing this for more than two months and decided to show the first results, answering the following questions:

How many movies / TV shows make the Top?
Which movie held #1 on Netflix the longest?
Which movie stayed in the Tops the longest?
For how many days do movies / TV shows stay in the Tops and at #1?
And, finally, to try to plot all these up-and-down zigzags…

I took a 60-day span (almost the whole harvest so far) and the Top5 overall, not the Top10, in each category, in order to talk about the really most popular and trendy titles.

So, let's count how many movies made the Top5 – it is definitely less than 5 * 60…
library(tidyverse)
library (stats)
library (gt)
fjune <- read_csv("R/movies/fjune.csv")
#wrangle raw data - reverse (fresh date first), take top 5, take last 60 days
fjune_dt <- fjune %>% rev () %>% slice (1:5) %>% select (1:60)

#gather it together and count the number of unique titles 
fjune_dt_gathered <- gather (fjune_dt)
colnames (fjune_dt_gathered) <- c("date", "title")
unique_fjune_gathered <- unique (fjune_dt_gathered$title)

str (unique_fjune_gathered)
#chr [1:123] "Unknown Origins" "The Debt Collector 2" "Lucifer" "Casino Royale"
OK, it is 123 out of 300.
Now let's look at the distribution across positions #1, #2, #3, etc.
t_fjune_dt <- as.data.frame (t(fjune_dt),stringsAsFactors = FALSE)
# list of unique titles in each #1 #2 #3etc. for each day
unique_fjune <- sapply (t_fjune_dt, unique)

# number of unique titles in each #1 #2 #3 etc.
n_unique_fjune <- sapply (unique_fjune, length)
n_unique_fjune <- setNames (n_unique_fjune, c("Top1", "Top2", "Top3", "Top4", "Top5"))

n_unique_fjune
Top1 Top2 Top3 Top4 Top5
32   45   45   52   49
Which movie stayed in the Tops / at #1 the longest?
# Top5
table_fjune_dt5 <- sort (table (fjune_dt_gathered$title), decreasing = T)
# Top1
table_fjune_dt1 <- sort (table (t_fjune_dt$V1), decreasing = T)

# plotting the results
bb5 <- barplot (table_fjune_dt5 [1:5], ylim=c(0,22), main = "Days in Top5", las = 1, col = 'light blue')
text(bb5,table_fjune_dt5 [1:5] +1.2,labels=as.character(table_fjune_dt5 [1:5]))
bb1 <- barplot (table_fjune_dt1 [1:5], main = "Days as Top1", las = 1, ylim=c(0,6), col = 'light green')
text(bb1,table_fjune_dt1 [1:5] +0.5, labels=as.character(table_fjune_dt1 [1:5]))


Let's weight and rank (1st = 5, 2nd = 4, 3rd = 3, etc.) the Top5 movies / TV shows over the last 60 days.

i<- (5:1)
fjune_dt_gathered5 <- cbind (fjune_dt_gathered, i) 
fjune_dt_weighted <- fjune_dt_gathered5 %>% group_by(title) %>% summarise(sum = sum (i)) %>% arrange (desc(sum))
top_n (fjune_dt_weighted, 10) %>% gt () 


As we see, the longer a movie stays in the Top, the better rank it gets with this simple weighting, which means the "lonely stars" that collected the most #1 placements drew less top attention over the whole time span.

Average days in top
av_fjune_dt5 <- round (nrow (fjune_dt_gathered) / length (unique_fjune_gathered),1) # in Top5
av_fjune_dt1 <- round (nrow (t_fjune_dt) / length (unique_fjune$V1),1) #as Top1

cat("Average days in top5: ", av_fjune_dt5, "\n") 
cat("Average days as top1: ", av_fjune_dt1)
#Average days in top5: 2.4
#Average days as top1: 1.9 
Average days in top distribution
as5 <- as.data.frame (table_fjune_dt5, stringsAsFactors = FALSE) 

as1 <- as.data.frame (table_fjune_dt1, stringsAsFactors = FALSE) 
par (mfcol = c(1,2)) 
boxplot (as5$Freq, ylim=c(0,7), main = "Days in Top5") 
boxplot (as1$Freq, ylim=c(0,7), main = "Days as Top1")

And now let's try to plot the Top5 movies / TV shows' daily changes.
I played with different numbers of days and picked 4 – with a longer span the plots were less readable, since every new day brings newcomers into the Top, each with its own line, its own position in the legend and, ideally, its own color.

fjune_dt_gathered55 <- fjune_dt_gathered5 %>% slice (1:20) %>% rev ()
ggplot(fjune_dt_gathered55)+ aes (x=date, y=i, group = title) + geom_line (aes (color = title), size = 2) +
geom_point (aes (color = title), size = 5)+ theme_classic()+
theme(legend.position="right")+ylab("top")+ ylim ("5", "4", "3", "2", "1")+
geom_line(aes(colour = title), arrow = arrow(angle = 15, ends = "last", type = "closed"))+scale_color_brewer(palette="Paired") 

Simulations Comparing Interaction for Adjusted Risk Ratios versus Adjusted Odds Ratios


Introduction

Risk ratios (RR) and odds ratios (OR) are both used to analyze factors associated with increased risk of disease. They differ in that a risk ratio is the ratio of two risks and an odds ratio is the ratio of two odds, and risks and odds are calculated differently. For example, with one toss of a fair die, the risk of rolling a 1 is 1/6 = 0.167, while the odds of rolling a 1 are 1/5 = 0.200. In textbook terms, risk is successes divided by total trials and odds are successes divided by failures; in this example rolling a 1 is a success and rolling 2 through 6 is a failure.

In R, glm (generalized linear model) is used to compute adjusted risk ratios (ARR's) and adjusted odds ratios (AOR's). Adjusting eliminates the influence of correlations between factors on the ratios; in statistical terms, adjusting controls for confounding. In addition to confounding, RR's and OR's can be influenced by interaction, also called effect modification. This occurs when the ratio for one group is different from the ratio for another group. For example, non-smokers who work with asbestos might have 4.0 times the risk of getting lung cancer compared to non-smokers who do not work with asbestos. For smokers, this risk ratio might be much larger, at 8.0, meaning smokers who work with asbestos have 8.0 times the risk of getting lung cancer compared to smokers who do not work with asbestos. In this case there is interaction between smoking and asbestos exposure because the RR for non-smokers is 4.0 and the RR for smokers is 8.0. Alternatively, it could be said that smoking modifies the effect of asbestos exposure on lung cancer.

For a given data set, the AOR's and ARR's will almost always agree in terms of statistical significance: the AOR's and ARR's will be different, but the p values will be close. This is not true for interaction terms. Often, using the same data, an analysis of interaction using OR's will indicate significant interaction while the RR's will indicate no interaction. This phenomenon is explained very well in the article "Distributional interaction: Interpretational problems when using incidence odds ratios to assess interaction" by Ulka B Campbell, Nicolle M Gatto and Sharon Schwartz (Epidemiologic Perspectives & Innovations 2005, 2:1, doi:10.1186/1742-5573-2-1), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079911/.
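To make the distinction concrete, here is a minimal sketch with hypothetical counts (not data used later in this article) that computes a risk ratio and an odds ratio from the same 2 x 2 table:

# Hypothetical 2 x 2 table: exposed / unexposed vs diseased / healthy
exposed   <- c(diseased = 30, healthy = 70)
unexposed <- c(diseased = 10, healthy = 90)

risk_exposed   <- exposed["diseased"]   / sum(exposed)        # 30 / 100 = 0.30
risk_unexposed <- unexposed["diseased"] / sum(unexposed)      # 10 / 100 = 0.10
odds_exposed   <- exposed["diseased"]   / exposed["healthy"]     # 30 / 70
odds_unexposed <- unexposed["diseased"] / unexposed["healthy"]   # 10 / 90

risk_exposed / risk_unexposed   # RR = 3.0
odds_exposed / odds_unexposed   # OR is about 3.86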

The authors described the problem for RR’s and OR’s and included mathematical explanations. The purpose of this article is to show how this problem occurs when using adjusted RR’s and adjusted OR’s with the R glm function to assess interaction.

Creating a test data set

The first step is to create a test data set. The R code below produces a data set of 1,000,000 lines with simulated variables for two risk factors, a grouping variable and an outcome variable. This is designed to simulate a situation where there are two groups of people that have different risks of disease. There are also two risk factors and these are designed to have the same risk ratios for the disease for simulated people in both groups.

>
> rm(list=ls(all=TRUE))
>
> nx=1000000
>
> set.seed(523)
>
> x1=runif(nx)
> x2=runif(nx)
> grpx=runif(nx)
> dep1=runif(nx)
>
> rf1=ifelse(x1 < 0.20,1,0)
> rf2=ifelse(x2 < 0.30,1,0)
> grp=ifelse(grpx < 0.5,1,0)
>
> d1=as.data.frame(cbind(rf1,rf2,grp,dep1))

The R output above creates the two risk factor variables (rf1 and rf2) and the group variable (grp). The proportions are 0.20 for rf1, 0.30 for rf2 and 0.50 each for grp 0 and grp 1. This simulates a scenario where the risk factors are not too rare and the two groups are similar to a male/female grouping. The proportions will vary slightly due to randomizing with the runif R function.


First Simulation

> #Sim 1
>
> attach(d1,warn.conflict=F)
>
> xg12f=(grp*100)+(rf1*10)+rf2
>
> disprop=0.01
>
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
>
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
>
> sum(disease)/nrow(d1)
[1] 0.019679
>

The block of output above creates a new variable, xg12f, that combines the three variables, grp, rf1 and rf2. This makes it easier to create the outcome variable named disease. Each value of xg12f defines a unique combination of the variables grp, rf1 and rf2. The disease proportion (disprop) is initially set to 0.01 and then the series of ifelse statements increases the disease proportion by multiples of disprop based on the risk factors and group. This simulates a situation where the disease is rare for people with no risk factors (xg12f = 0) and then increases with various combinations of risk factors and group. For example, a simulated person in group 0 with risk factor 1 but not risk factor 2 would have a value for xg12f of 10. The third ifelse statement sets the disease proportion for simulated persons in this category to 1.7 times the disprop value. In general, simulated persons with risk factors have higher disease proportions and simulated persons in group 1 have higher disease proportions. Simulated persons in group 1 with both risk factors (xg12f = 111) have the highest multiple at 5.1 times the proportion of disease for simulated persons in group 0 with no risk factors (xg12f = 0). As shown above, the overall proportion of simulated persons with the disease is 0.019679.
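To see this encoding at a glance, one could tabulate the combined variable and list its unique combinations (a small check, not part of the original transcript):

# Each xg12f value is a three-digit code: hundreds = grp, tens = rf1, units = rf2
table(xg12f)
unique(data.frame(grp, rf1, rf2, xg12f))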

  >
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-0.3265  -0.2001  -0.2001  -0.1422   3.0328 

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)   
(Intercept) -4.598839   0.018000 -255.492   <2e-16 ***
rf1          0.500622   0.029614   16.905   <2e-16 ***
rf2          0.401599   0.026798   14.986   <2e-16 ***
grp          0.677464   0.021601   31.362   <2e-16 ***
rf1:rf2     -0.004331   0.031359   -0.138    0.890   
rf1:grp      0.032458   0.032800    0.990    0.322   
rf2:grp      0.032862   0.030670    1.071    0.284   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 193574  on 999999  degrees of freedom
Residual deviance: 189361  on 999993  degrees of freedom
AIC: 189375

Number of Fisher Scoring iterations: 7

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.01  0.01   0.01
rf1         1.65  1.56   1.75
rf2         1.49  1.42   1.57
grp         1.97  1.89   2.05
rf1:rf2     1.00  0.94   1.06
rf1:grp     1.03  0.97   1.10
rf2:grp     1.03  0.97   1.10
>


The R output above uses the R function glm to produce ARR’s with 95% confidence intervals and p values. In this glm the family is binomial with link = log. This produces coefficients that are the logs of the ARR’s and these are used to produce the table of ARR’s and 95% CI that appears below the glm output. It should be noted that the intercept is the log of the disease proportion for simulated persons that have no risk factors. In the table of ARR’s the data for the intercept is not the ARR but rather the proportion of disease for simulated persons with no risk factors, 0.01. As expected, the ARR’s are statistically significant for rf1, rf2 and grp. Also as expected, the interaction terms rf1:rf2, rf1:grp and rf2:grp, are not statistically significant.
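
Because the log link makes effects multiplicative, group-specific risk ratios can be read off the fitted coefficients. A small illustrative sketch (not part of the original output), using the log-link fit r1 from just above:

b <- coef(r1)                  # log-scale coefficients from the log-link glm
exp(b["rf1"])                  # ARR of rf1 in group 0 (with rf2 = 0)
exp(b["rf1"] + b["rf1:grp"])   # ARR of rf1 in group 1 (with rf2 = 0)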


Next are the AOR’s.


>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-0.3265  -0.2000  -0.2000  -0.1423   3.0326 

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)   
(Intercept) -4.588289   0.018205 -252.037   <2e-16 ***
rf1          0.505839   0.030181   16.760   <2e-16 ***
rf2          0.405558   0.027242   14.887   <2e-16 ***
grp          0.686728   0.021929   31.316   <2e-16 ***
rf1:rf2      0.002334   0.032418    0.072    0.943   
rf1:grp      0.042153   0.033593    1.255    0.210   
rf2:grp      0.040418   0.031327    1.290    0.197   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 193574  on 999999  degrees of freedom
Residual deviance: 189361  on 999993  degrees of freedom
AIC: 189375

Number of Fisher Scoring iterations: 7

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.01  0.01   0.01
rf1         1.66  1.56   1.76
rf2         1.50  1.42   1.58
grp         1.99  1.90   2.07
rf1:rf2     1.00  0.94   1.07
rf1:grp     1.04  0.98   1.11
rf2:grp     1.04  0.98   1.11
>
The R output above is similar to the previous glm output but in this glm the family is binomial with link = logit instead of log. This produces coefficients that are the logs of the AOR’s and these are used to produce the table of AOR’s and 95% CI. The line for the intercept in the table is not AOR’s but is the odds of disease for simulated persons with no risk factors. As with ARR’s, the AOR’s are statistically significant for rf1, rf2 and grp. The interaction terms rf1:rf2, rf1:grp and rf2:grp, are not statistically significant which also agrees with the glm output for the ARR’s.


Second Simulation

> #sim 2
>
> disprop=0.05
>
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
>
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
>
> sum(disease)/nrow(d1)
[1] 0.09847
>
The R output above uses the same data file but the disprop variable is increased from 0.01 to 0.05. This reflects a situation where the disease proportion is higher. As shown above, the overall proportion of simulated persons with the disease is 0.09847.
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-0.7718  -0.4588  -0.4168  -0.3207   2.4468 

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)   
(Intercept) -2.9935084  0.0078756 -380.099   <2e-16 ***
rf1          0.5071062  0.0127189   39.870   <2e-16 ***
rf2          0.4029193  0.0115901   34.764   <2e-16 ***
grp          0.6901061  0.0093582   73.743   <2e-16 ***
rf1:rf2      0.0007967  0.0130359    0.061    0.951   
rf1:grp      0.0155518  0.0139306    1.116    0.264   
rf2:grp      0.0206406  0.0131138    1.574    0.115   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643416  on 999999  degrees of freedom
Residual deviance: 620356  on 999993  degrees of freedom
AIC: 620370

Number of Fisher Scoring iterations: 6

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.05  0.05   0.05
rf1         1.66  1.62   1.70
rf2         1.50  1.46   1.53
grp         1.99  1.96   2.03
rf1:rf2     1.00  0.98   1.03
rf1:grp     1.02  0.99   1.04
rf2:grp     1.02  0.99   1.05
>
The R output above is glm output with family = binomial and link = log. As before, the coefficients are the logs of ARR’s and are used to produce the table of ARR’s. Also as before the line for the intercept in the table is not ARR’s but is the prevalence of disease for simulated persons with no risk factors. In this case it is 0.05 in contrast to the earlier glm result with the previous data file which was 0.01. As expected, this output shows statistically significant ARR’s for the risk factors and no statistically significant results for the interaction terms.
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-0.7701  -0.4586  -0.4157  -0.3210   2.4459 

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)   
(Intercept) -2.939696   0.008344 -352.305  < 2e-16 ***
rf1          0.534243   0.014054   38.012  < 2e-16 ***
rf2          0.423218   0.012628   33.515  < 2e-16 ***
grp          0.740304   0.010119   73.158  < 2e-16 ***
rf1:rf2      0.042163   0.015615    2.700  0.00693 **
rf1:grp      0.072082   0.015824    4.555 5.23e-06 ***
rf2:grp      0.063938   0.014672    4.358 1.31e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643416  on 999999  degrees of freedom
Residual deviance: 620356  on 999993  degrees of freedom
AIC: 620370

Number of Fisher Scoring iterations: 5

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.05  0.05   0.05
rf1         1.71  1.66   1.75
rf2         1.53  1.49   1.57
grp         2.10  2.06   2.14
rf1:rf2     1.04  1.01   1.08
rf1:grp     1.07  1.04   1.11
rf2:grp     1.07  1.04   1.10
>
The R output above is glm output with family = binomial and link = logit. The coefficients are the logs of AOR’s and are used to produce the table of AOR’s. The line for the intercept in the table is not AOR’s but is the odds of disease for simulated persons with no risk factors. As expected, this output shows statistically significant AOR’s for the risk factors. However, in contrast to the results with the ARR’s, this glm output shows statistically significant results for the interaction terms. In short, with this data set, the interaction terms are not statistically significant when using ARR’s but they are statistically significant when using AOR’s. Lastly, the value of the disprop variable is increased again, this time to 0.10. The output below shows the creation of the new disease variable and shows the overall proportion of simulated persons with the disease is increased to 0.196368.

The output below is the by now familiar glm output for ARR's and AOR's. With the increased proportion of disease, the ARR's still show the interaction terms as not statistically significant, but the AOR's show the interaction terms as having p values less than 2 x 10^-16, which is pretty low. Once again, the ARR's show no interaction while the AOR's indicate strong interaction.
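The divergence can also be seen without glm by computing stratified ratios directly from the simulated data. A sketch (run after any of the simulations above; for simplicity it looks at rf1 within each group among simulated persons with rf2 = 0):

p00 <- mean(disease[rf1 == 0 & rf2 == 0 & grp == 0])  # baseline risk, group 0
p10 <- mean(disease[rf1 == 1 & rf2 == 0 & grp == 0])  # risk with rf1, group 0
p01 <- mean(disease[rf1 == 0 & rf2 == 0 & grp == 1])  # baseline risk, group 1
p11 <- mean(disease[rf1 == 1 & rf2 == 0 & grp == 1])  # risk with rf1, group 1

c(RR_group0 = p10 / p00, RR_group1 = p11 / p01)   # both near 1.7 by construction
odds <- function(p) p / (1 - p)
c(OR_group0 = odds(p10) / odds(p00), OR_group1 = odds(p11) / odds(p01))
# The RRs stay close across groups, while the ORs drift apart as the disease
# becomes more common - the source of the "interaction" seen only in the AOR's.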

 Third Simulation

>
> #sim 3
>
> disprop=0.10
>
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
>
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
>
> sum(disease)/nrow(d1)
[1] 0.196368
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-1.2029  -0.6670  -0.5688  -0.4587   2.1465 

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)   
(Intercept) -2.303778   0.005401 -426.540   <2e-16 ***
rf1          0.512466   0.008491   60.356   <2e-16 ***
rf2          0.402541   0.007816   51.502   <2e-16 ***
grp          0.691635   0.006344  109.025   <2e-16 ***
rf1:rf2      0.008824   0.008221    1.073    0.283   
rf1:grp      0.013005   0.009147    1.422    0.155   
rf2:grp      0.011577   0.008706    1.330    0.184   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 990652  on 999999  degrees of freedom
Residual deviance: 938046  on 999993  degrees of freedom
AIC: 938060

Number of Fisher Scoring iterations: 6

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.10  0.10   0.10
rf1         1.67  1.64   1.70
rf2         1.50  1.47   1.52
grp         2.00  1.97   2.02
rf1:rf2     1.01  0.99   1.03
rf1:grp     1.01  1.00   1.03
rf2:grp     1.01  0.99   1.03
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-1.1920  -0.6657  -0.5655  -0.4604   2.1433 

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.190813   0.006088 -359.89   <2e-16 ***
rf1          0.561856   0.010543   53.29   <2e-16 ***
rf2          0.438542   0.009391   46.70   <2e-16 ***
grp          0.796766   0.007483  106.48   <2e-16 ***
rf1:rf2      0.134330   0.012407   10.83   <2e-16 ***
rf1:grp      0.169283   0.012129   13.96   <2e-16 ***
rf2:grp      0.124378   0.011124   11.18   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 990652  on 999999  degrees of freedom
Residual deviance: 938069  on 999993  degrees of freedom
AIC: 938083

Number of Fisher Scoring iterations: 4

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.11  0.11   0.11
rf1         1.75  1.72   1.79
rf2         1.55  1.52   1.58
grp         2.22  2.19   2.25
rf1:rf2     1.14  1.12   1.17
rf1:grp     1.18  1.16   1.21
rf2:grp     1.13  1.11   1.16
>


Discussion


These simulations show that, given the same data, different generalized linear models (glm) will sometimes lead to different conclusions about interaction among the independent variables in their effect on the outcome. When the disease is rare, adjusted odds ratios and adjusted risk ratios tend to agree regarding the statistical significance of the interaction terms. However, as the prevalence of the disease increases, the adjusted odds ratios and adjusted risk ratios tend to produce conflicting results for the interaction terms. This is discussed and explained mathematically in the article referred to above: "Distributional interaction: Interpretational problems when using incidence odds ratios to assess interaction" by Ulka B Campbell, Nicolle M Gatto and Sharon Schwartz (Epidemiologic Perspectives & Innovations 2005, 2:1, doi:10.1186/1742-5573-2-1), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079911/. The purpose of this paper was to illustrate how this works with adjusted risk ratios and adjusted odds ratios using the R glm function.

For practicing statisticians and epidemiologists, it might be good advice to use risk ratios whenever denominators are available, and to use odds ratios only when denominators are not available, such as in case-control studies. In case-control studies, data is collected for selected persons with a disease (cases) and for a specific number of persons without the disease (controls). With this data there are no denominators for persons at risk of getting the disease, so disease rates cannot be calculated. Without disease rates, rate ratios cannot be calculated, but case-control data can still be used to calculate odds ratios. For more on this see "Statistical Methods in Cancer Research Volume I: The Analysis of Case-Control Studies", IARC Scientific Publication No. 32, by Breslow NE and Day NE, at: https://publications.iarc.fr/Book-And-Report-Series/Iarc-Scientific-Publications/Statistical-Methods-In-Cancer-Research-Volume-I-The-Analysis-Of-Case-Control-Studies-1980

Here is the R program used in this analysis. It should run when copied and pasted into the R window.


rm(list=ls(all=TRUE))

nx=1000000

set.seed(523)

x1=runif(nx)
x2=runif(nx)
grpx=runif(nx)
dep1=runif(nx)

rf1=ifelse(x1 < 0.20,1,0)
rf2=ifelse(x2 < 0.30,1,0)
grp=ifelse(grpx < 0.5,1,0)

d1=as.data.frame(cbind(rf1,rf2,grp,dep1))

#Sim 1

attach(d1,warn.conflict=F)

xg12f=(grp*100)+(rf1*10)+rf2

disprop=0.01

disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)

disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)

sum(disease)/nrow(d1)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=log))
summary(r1)
x1=exp(confint.default(r1))
rr=exp(coef(r1))
round(cbind(rr,x1),digits=2)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=logit))
summary(r1)
x1=exp(confint.default(r1))
or=exp(coef(r1))
round(cbind(or,x1),digits=2)

#sim 2

disprop=0.05

disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)

disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)

sum(disease)/nrow(d1)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=log))
summary(r1)
x1=exp(confint.default(r1))
rr=exp(coef(r1))
round(cbind(rr,x1),digits=2)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=logit))
summary(r1)
x1=exp(confint.default(r1))
or=exp(coef(r1))
round(cbind(or,x1),digits=2)

#sim 3

disprop=0.10

disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)

disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)

sum(disease)/nrow(d1)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=log))
summary(r1)
x1=exp(confint.default(r1))
rr=exp(coef(r1))
round(cbind(rr,x1),digits=2)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=logit))
summary(r1)
x1=exp(confint.default(r1))
or=exp(coef(r1))
round(cbind(or,x1),digits=2)

ROC for Decision Trees – where did the data come from?

By Jerry Tuttle
      In doing decision tree classification problems, I have often graphed the ROC (Receiver Operating Characteristic) curve. The True Positive Rate (TPR) is on the y-axis, and the False Positive Rate (FPR) is on the x-axis. True Positive is when the lab test predicts you have the disease and you actually do have it. False Positive is when the lab test predicts you have the disease but you actually do not have it.
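      As a quick reminder of how these rates come out of a confusion matrix, here is a small sketch with made-up counts (not the kyphosis data used below):

cm <- matrix(c(50, 5, 10, 16), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("absent", "present"),
                             predicted = c("absent", "present")))
TPR <- cm["present", "present"] / sum(cm["present", ])   # TP / (TP + FN)
FPR <- cm["absent",  "present"] / sum(cm["absent",  ])   # FP / (FP + TN)
c(TPR = TPR, FPR = FPR)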
      The following code uses the sample dataset kyphosis from the rpart package, creates a default decision tree, prints the confusion matrix, and plots the ROC curve. (Kyphosis is a type of spinal deformity.)

library(rpart)
df <- kyphosis
set.seed(1)
mytree <- rpart(Kyphosis ~ Age + Number + Start, data = df, method="class")
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(mytree, uniform=TRUE, main="Kyphosis Tree")
predicted <- predict(mytree, type="class")
table(df$Kyphosis,predicted)
library(ROCR)
pred <- prediction(predict(mytree, type="prob")[, 2], df$Kyphosis)
plot(performance(pred, "tpr", "fpr"), col="blue", main="ROC Kyphosis, using library ROCR")
abline(0, 1, lty=2)
auc <- performance(pred, "auc")
auc@y.values


(Figure: ROC curve produced with the ROCR package)

      However, for a long time there has been a disconnect in my mind between the confusion matrix and the ROC curve. The confusion matrix provides a single value of the ordered pair (x=FPR, y=TPR) on the ROC curve, but the ROC curve has a range of values. Where are the other values coming from?
      The answer is that a default decision tree confusion matrix uses a single probability threshold value of 50%. A decision tree ends in a set of terminal nodes. Every observation falls in exactly one of those terminal nodes. The predicted classification for the entire node is based on whichever classification has the greater percentage of observations, which for binary classifications requires a probability greater than 50%. So for example if a single observation has predicted probability based on its terminal node of 58% that the disease is present, then a 50% threshold would classify this observation as disease being present. But if the threshold were changed to 60%, then the observation would be classified as disease not being present.
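      For instance (a toy illustration, not the kyphosis data), the same predicted probability flips classification as the threshold moves past it:

p_hat <- 0.58                              # hypothetical predicted probability of "present"
ifelse(p_hat > 0.50, "present", "absent")  # threshold 50% -> "present"
ifelse(p_hat > 0.60, "present", "absent")  # threshold 60% -> "absent"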
     The ROC curve uses a variety of probability thresholds, reclassifies each observation, recalculates the confusion matrix, and recalculates a new value of the ordered pair (x=FPR, y=TPR) for the ROC curve. The resulting curve shows the spread of these ordered pairs over all (including interpolated, and possibly extrapolated) probability thresholds, but the threshold values are not commonly displayed on the curve. Plotting performance in the ROCR package does this behind the scenes, but I wanted to verify this myself. This dataset has a small number of predictions of disease present, and at large threshold values the prediction is zero, resulting in a one column confusion matrix and zero FPR and TPR. The following code individually applies different probability thresholds to build the ROC curve, although it does not extrapolate for large values of FPR and TPR.

dat <- data.frame()
s <- predict(mytree, type="prob")[, 2]
for (i in 1:21){
p <- .05*(i-1)
thresh <- ifelse(s > p, "present", "absent")
t <- table(df$Kyphosis,thresh)
fpr <- ifelse(ncol(t)==1, 0, t[1,2] / (t[1,2] + t[1,1]))
tpr <- ifelse(ncol(t)==1, 0, t[2,2] / (t[2,2] + t[2,1]))
dat[i,1] <- fpr
dat[i,2] <- tpr
}
colnames(dat) <- c("fpr", "tpr")
plot(x=dat$fpr, y=dat$tpr, xlab="FPR", ylab="TPR", xlim=c(0,1), ylim=c(0,1),
main="ROC Kyphosis, using indiv threshold calcs", type="b", col="blue")
abline(0, 1, lty=2)


(Figure: ROC curve from the individual threshold calculations)

R can help decide the opening of pandemic-ravaged economies

Epidemiological models do not provide death forecasts. Statistical models using the inverse Mills ratio, generalized linear models (GLM) with a Poisson link for death-count data, plus autoregressive distributed lag (ARDL) models, can provide improved death forecasts for individual states.

See details at http://ssrn.com/abstract=3637682 and http://ssrn.com/abstract=3649680. Our 95% confidence interval is [optimistic 4719 to pessimistic 8026] deaths, with a point-estimate forecast for US deaths of 6529 for the week ending July 27, 2020.

The following two functions may be useful. We are finding 95% confidence intervals for the individual state forecast of the latest count of Covid-19 deaths using my `meboot' package. I recommend using reps=1000 for more reliable confidence intervals.

fittedLast=function(yt,zt){
#ARDL(1,1) says yt = a + b1*zt + b2*yLag + b3*zLag + err
#yt = actual NEW deaths on week t, or ND
#zt = NEW deaths forecast by the US-wide model using IMR, or PredNewDeaths
#we expect one less week of data for predicted new deaths or zt
yt=as.numeric(yt)
zt=as.numeric(zt)
nweeks=length(yt)
if (length(zt)!=nweeks) stop("unmatched yt zt in fittedLast")
#yt is an nweeks X 1 vector, zt is an (nweeks-1) X 1 vector
yLag=as.numeric(yt[1:(nweeks-1)])
yyt=as.numeric(yt[2:nweeks])
zzt=as.numeric(zt[2:nweeks])
zLag=as.numeric(zt[1:(nweeks-1)])
#not sure if glm will work
latest.fitted=NA
if(min(yt)>0 & min(zt)>0){
reg=glm(yyt~yLag+zzt+zLag,family=poisson(link = "log"))
#reg=lm(yyt~yLag+zzt)
f1=fitted.values(reg)
#f1=fitted(reg)
latest.fitted=f1[length(f1)]
}
return(latest.fitted)
}#end function fittedLast

#example
#set.seed(99)
#z=sample(1:10);y=sample(1:11);
#fittedLast(y,z)
#reg=lm(y[2:10]~y[1:9]+z[2:10])
#print(cbind(y[2:10],fitted(reg)))
#coef(reg)

ARDL95=function(yt,zt,reps=100){
require(meboot)
#yt = actual NEW deaths on week t, or ND
#zt = NEW deaths forecast by the US-wide model using IMR, or PredNewDeaths
#we expect equal no of weeks of data for predicted new deaths or zt
#calls fittedLast which fits
#ARDL(1,1) says yt = a + b1*zt + b2*yLag + b3*zLag + err
yt=as.numeric(yt)
zt=round(as.numeric(zt),0)
nweeks=length(yt)
if (length(zt)!=(nweeks)) stop("unmatched yt zt in ARDL95")
mid=fittedLast(yt=yt, zt=zt) #middle of confInterval
#create data
y.ens=meboot(x=as.numeric(yt), reps=reps)$ens
z.ens=meboot(x=as.numeric(zt), reps=reps)$ens
outf=rep(NA,reps)#place to store one fitted value per replicate
for (j in 1:reps){
outf[j]=fittedLast(yt=y.ens[,j],z.ens[,j])
}# end j loop over reps
#use ensembles above to get 95 percent conf interval
#of the latest.fitted
qu1=quantile(outf,prob=c(0.025,0.975),na.rm=TRUE)
out3=c(qu1[1],mid,qu1[2])
return(out3)
}#end function ARDL95
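
A hypothetical usage sketch (toy weekly counts, not the actual Covid-19 data used in the papers above):

yt <- c(120, 135, 150, 160, 172, 181, 195, 210)  # actual new deaths per week (made up)
zt <- c(118, 130, 155, 158, 170, 185, 190, 215)  # model-predicted new deaths (made up)
ARDL95(yt, zt, reps = 1000)  # lower 2.5% limit, fitted value, upper 97.5% limit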


labourR 1.0.0: Automatic Coding of Occupation Titles

Occupation classification is an important step in tasks such as labour market analysis, epidemiological studies and official statistics. To assist research on the labour market, ESCO has defined a taxonomy for occupations. Occupations are specified and organized in a hierarchical structure based on the International Standard Classification of Occupations (ISCO). labourR is a new package that performs occupation coding for multilingual free-form text (e.g. a job title) using the ESCO hierarchical classification model.



The initial motivation was to retrieve the work experience history from a Curriculum Vitae generated from the Europass online CV editor. Document vectorization is performed using the ESCO model and a fuzzy match is allowed with various string distance metrics.

The labourR package:

  • Classifies multilingual free text using the ESCO-ISCO hierarchy of occupations.
  • Performs fully vectorized and memory-efficient computations.
  • Includes facilities to assist research in information mining of labour market data.

Installation

You can install the released version of labourR from CRAN with,

install.packages("labourR")
Example
    library(labourR)
    corpus <- data.frame(
      id = 1:3,
      text = c("Data Scientist", "Junior Architect Engineer", "Cashier at McDonald's")
    )
    Given an ISCO level, the top suggested ISCO group is returned. num_leaves specifies the number of ESCO occupations used to perform a plurality vote,

    classify_occupation(corpus = corpus, isco_level = 3, lang = "en", num_leaves = 5)
    #>    id iscoGroup                                          preferredLabel
    #> 1:  1       251       Software and applications developers and analysts
    #> 2:  2       214 Engineering professionals (excluding electrotechnology)
    #> 3:  3       523                              Cashiers and ticket clerks
    
    For further information browse the vignettes.

    Preview data with either head or heaD

    R is a case-sensitive programming language, which sometimes creates unusual yet annoying problems for users. A common mistake I often make is pressing the shift key too early when typing the head() function – which results in heaD(). However, this returns an error message (not surprisingly) rather than printing the first six rows of the data. 
    > head(mtcars)
                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    
    > heaD(mtcars)
    Error in heaD(mtcars) : could not find function "heaD"
    
    At the beginning this didn’t bother me, since I could easily fix the typo, but over time it became a bit annoying. The faster I typed in R, the more I repeated this mistake. Then I thought that creating a heaD() function doing the same job as head() would be the ultimate solution to my problem. Because I would need this “new” function every time I open R, I decided to add it to my .Rprofile file. The .Rprofile file contains R code to be run when R starts up. Typically, .Rprofile is located in the user’s home directory. On my computer, I place it under: C:\Users\okanb\OneDrive\Documents

    To create a new .Rprofile file, you can simply create a text file using a text editor, add the content to be used when R starts up, and save it as .Rprofile (i.e., there is no file name, just the .Rprofile extension). If you already have this file on your computer, you can simply run usethis::edit_r_profile() to open and edit the file in R. In the file, I added the following line: 
    # Create a heaD function
    heaD <- function(x) head(x) # or just: heaD <- head
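    
    If you also want heaD() to accept head()'s optional arguments (for example the number of rows to show), a slightly more general variant (an optional tweak, not from the original post) would be:
    # Pass any arguments straight through to head(), e.g. heaD(mtcars, 3)
    heaD <- function(...) head(...)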
    
    After editing and saving .Rprofile, you need to restart R. From this point on, R will recognize both head() and heaD().

    > heaD(mtcars)
                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    

    Creating custom neural networks with nnlib2Rcpp

    For anyone interested, this is a post about creating arbitrary, new, or custom Neural Networks (NN) using the nnlib2Rcpp R package. The package provides some useful NN models that are ready to use. Furthermore, it can be a versatile basis for experimentation with new or custom NN models, which is what this brief tutorial is about. A warning is necessary at this point:

    Warning: the following text contains traces of C++ code. If you are in any way sensitive to consuming C++, please abandon reading immediately.

    The NN models in nnlib2Rcpp are created using a collection of C++ classes written for creating NN models called nnlib2. A cut-down class-diagram of the classes (and class-templates) in this collection can be found here . The most important class in the collection is “component” (for all components that constitute a NN). Objects of class “component” can be added to a NN “topology” (hosted in objects of class “nn”) and interact with each other. Layers of processing nodes (class “layer”), groups of connections (class “connection_set”), and even entire neural nets (class “nn”) are based on class “component”. When implementing new components, it is also good to remember that:

    – Objects of class “layer” contain objects of class “pe” (processing elements [or nodes]).

    – Template “Layer” simplifies creation of homogeneous “layer” sub-classes containing a particular “pe” subclass (i.e. type of nodes).

    – Objects of class “connection_set” contain objects of class “connection”.

    – Template “Connection_Set” simplifies creation of homogeneous “connection_set” sub-classes containing a particular “connection” subclass (i.e. type of connections).

    – Customized and modified NN components and sub-components are to be defined based on these classes and templates.

    – All aforementioned classes have an “encode” (training) and a “recall” (mapping) method; both are virtual and can be overridden with custom behavior. Calling “nn” “encode” triggers the “encode” function of all the components in its topology which, in turn, triggers “encode” for “pe” objects (processing nodes) in a “layer” or “connection” objects in a “connection_set”. Similarly for “recall”.

    The NN to create in this example will be based on the Perceptron, the most classic of them all. It is not yet implemented in nnlib2Rcpp, so in this example we will play the role of Prof. Rosenblatt and his team [6] and implement a simple multi-class Perceptron ourselves. Unlike Prof. Rosenblatt, you can, instead of inventing it, find information about it on its Wikipedia page [1]. We will implement a simplified (not 100% scientifically sound) variation, with no bias, a fixed learning rate (0.3), and connection weights initialized to 0.

    Let’s add it to nnlib2Rcpp.

    Step 1: setup the tools needed.

    To follow this example, you will need to have Rtools [2] and the Rcpp R package [3] installed, and the nnlib2Rcpp package source (version 0.1.4 or above). This can be downloaded from CRAN [4] or the latest version can be found on github [5]. If fetched or downloaded from github, the nnlib2Rcpp.Rproj is a project file for building the package in RStudio. I recommend getting the source from github, unpacking it (if needed) in a directory and then opening the aforementioned nnlib2Rcpp.Rproj in Rstudio. You can then test-build the unmodified package; if it succeeds you can proceed to the next step, adding your own NN components.

    Step 2: define the model-specific components and sub-components.

    Open the “additional_parts.h” C++ header file found in sub-directory “src” and create the needed classes. Much of the default class behavior is similar to what is required for a Perceptron, so we will focus on what is different and specific to the model. We will need to define (in the “additional_parts.h” file) the following:

    (a) a “pe” subclass for the Perceptron processing nodes. All “pe” objects provide three methods, namely “input_function”, “activation_function”, and “threshold_function”; by default, each is applied to the result of the previous one, except for the “input_function” which gathers (by default sums) all incoming values and places result on the internal register variable “input”. The sequence of executing these methods is expected to place the final result in “pe” variable “output”. You may choose (or choose not) to modify these methods if this fits your model and implementation approach. You may also choose to modify “pe” behavior in its “encode” and/or “recall” functions, possibly bypassing the aforementioned methods completely. It may help to see the “pe.h” header file (also in directory “src”) for more insight on the base class. In any case, a C++ implementation for Perceptron processing nodes could be:
    class perceptron_pe : public pe
    {
    public:
    
    DATA threshold_function(DATA value)
     {
     if(value>0) return 1;
     return 0;
     }
    };
    
    (b) Next you may want to define a class for layers consisting of “perceptron_pe” objects as defined above; this can be done quickly using the template “Layer”:

    typedef Layer< perceptron_pe > perceptron_layer;
    
    (c) Moving on to the connections now. Notice that in Perceptron connections are the only elements modified (updating their weights) during encoding. Among other functionality, each connection knows its source and destination nodes, maintains and updates the weight, modifies transferred data etc. So a C++ class for such Percepton connections could be:
    class perceptron_connection: public connection
    {
    public:
    
    // mapping, multiply value coming from source node by weight
    // and send it to destination node.
    void recall()
     {
     destin_pe().receive_input_value( weight() * source_pe().output );
     }
    
    // training, weights are updated:
    void encode()
     {
     weight() = weight() + 0.3 * (destin_pe().input - destin_pe().output) * source_pe().output;
     }
    };
    
    For simplicity, the learning rate during training is fixed at 0.3, and the connection assumes that the desired output values will be placed in the “input” registers of the destination nodes before the weights are updated.
    Note: for compatibility with nnlib2Rcpp version 0.1.4 (the current version on CRAN)- the example above assumes that the desired values are placed as input to the processing nodes right before update of weights (encoding); version 0.1.5 and above provides direct access from R to the “misc” variables that nodes and connections maintain (via “NN” method “set_misc_values_at”, more on “NN” below). It may have been more elegant to use these “misc” variables for holding desired output in processing nodes instead of “input”.

    (d) Next, you may want to define a class for groups of such connections, which can be done quickly using the template “Connection_Set”:
    typedef Connection_Set< perceptron_connection > perceptron_connection_set;
    
    Step 3: Add the ability to create such components at run-time.

    Again in the “additional_parts.h” C++ header file found in directory “src” add code that creates Percepron layers and groups of connections when a particular name is used. Locate the “generate_custom_layer” function and add to it the line:
    if(name == "perceptron") return new perceptron_layer(name,size);
    
    (you will notice other similar definitions are already there). Finally, locate the “generate_custom_connection_set” function and add to it the line:
    if(name == "perceptron") return new perceptron_connection_set(name);
    
    (again, you will notice other similar definition examples are already there).
    Note: as of July 15, 2020 all the aforementioned changes to “additional_parts.h” C++ header are already implemented as an example in the nnlib2Rcpp version 0.1.5 github repo [5].

    That is it. You can now build the modified library and then return to the R world to use your newly created Perceptron components in a NN. The “NN” R module in nnlib2Rcpp allows you to combine these (and other) components in a network and then use it in R.

    It is now time to see if this cut-down modified Perceptron is any good. In the example below, the iris dataset is used to train it. The example uses the “NN” R module in nnlib2Rcpp to build the network and then trains and tests it. The network topology consists of a generic input layer (component #1) of size 4 i.e. as many nodes as the iris features, a set of connections (component #2) whose weights are initialized to 0 (in create_connections_in_sets) below, and a processing layer (component #3) of size 3 i.e. as many nodes as the iris species:

    library("nnlib2Rcpp")
    
    # create the NN and define its components
    p <- new("NN")
    p$add_layer("generic",4)
    p$add_connection_set("perceptron")
    p$add_layer("perceptron",3)
    p$create_connections_in_sets(0,0)
    
    # show the NN topology
    p$outline()
    
    # prepare some data based on iris dataset
    data_in <- as.matrix(iris[1:4])
    iris_cases <- nrow((data_in))
    species_id <- unclass(iris[,5])
    desired_data_out <- matrix(data=0, nrow=iris_cases, ncol=3)
    for(c in 1:iris_cases) desired_data_out[c,species_id[c]]=1
    
    # encode data and desired output (for 30 training epochs)
    for(i in 1:30)
    for(c in 1:iris_cases)
    {
    p$input_at(1,data_in[c,])
    p$recall_all(TRUE)
    p$input_at(3,desired_data_out[c,])
    p$encode_at(2)
    }
    
    # show the NN
    p$print()
    
    # Recall the data to see what species Perceptron returns:
    for(c in 1:iris_cases)
    {
    p$input_at(1,data_in[c,])
    p$recall_all(TRUE)
    cat("iris case ",c,", desired = ", desired_data_out[c,], " returned = ", p$get_output_from(3),"\n")
    }
    
    Checking the output one sees that our Perceptron variation is not THAT bad. At least it recognizes Iris setosa and virginica quite well. However, classification performance on versicolor cases is rather terrible.

    iris case 1 , desired = 1 0 0 returned = 1 0 0
    iris case 2 , desired = 1 0 0 returned = 1 0 0
    iris case 3 , desired = 1 0 0 returned = 1 0 0

    iris case 148 , desired = 0 0 1 returned = 0 0 1
    iris case 149 , desired = 0 0 1 returned = 0 0 1
    iris case 150 , desired = 0 0 1 returned = 0 0 1

    Anyway, this example was not about classification success but about creating a new NN type in the nnlib2Rcpp R package. I hope it will be useful to some of you out there.

    Links (all accessed July 12, 2020):

    [1] Perceptron:
    https://en.wikipedia.org/w/index.php?title=Perceptron&oldid=961536136

    [2] RTools:
    https://cran.r-project.org/bin/windows/Rtools/history.html

    [3] Rcpp package:
    https://cran.r-project.org/web/packages/Rcpp/index.html

    [4] nnlib2Rcpp package on CRAN:
    https://cran.r-project.org/web/packages/nnlib2Rcpp/index.html

    [5] nnlib2Rcpp package on github:
    https://github.com/VNNikolaidis/nnlib2Rcpp

    [6] Frank Rosenblatt:
    https://en.wikipedia.org/wiki/Frank_Rosenblatt


    PS. Time permitting, more components will be added to the collection (and added to nnlib2Rcpp), maybe accompanied by posts similar to this one; these will eventually be available in the package. Any interesting or useful NN component that you would like to contribute is welcomed (credit, of course, will go to you, its creator); if so, please contact me using the comments below. (Another project is to create parallel-processing versions of the components, if anyone wants to help).

    A single loop is not enough. A collection of hello world control structures




    As the post on “hello world” functions has been quite appreciated by the R community, here follows the second round of functions for wannabe R programmers.

    # If else statement:
    # See the code syntax below for if else statement
    x=10
    if(x>1){
    print("x is greater than 1")
    }else{
    print("x is less than 1")
    }
    
    # See the code below for nested if else statement
    
    x=10
    if(x>1 & x<7){
    print("x is between 1 and 7")} else if(x>8 & x< 15){
    print("x is between 8 and 15")
    }
    
    
    # For loops:
    # Below code shows for loop implementation
    x = c(1,2,3,4,5)
    for(i in 1:5){
    print(x[i])
    }
    
    
    # While loop :
    # Below code shows while loop in R
    x = 2.987
    while(x <= 4.987) {
    x = x + 0.987
    print(c(x,x-2,x-1))
    }
    
    
    # Repeat Loop:
    # The repeat loop is an infinite loop and used in association with a break statement.
    
    # Below code shows repeat loop:
    a = 1
    repeat{
    print(a)
    a = a+1
    if (a > 4) {
    break
    }
    }
    
    # Break statement:
    # A break statement is used in a loop to stop the iterations and transfer control outside of the loop.
    
    #Below code shows break statement:
    x = 1:10
    for (i in x){
    if (i == 6){
    break
    }
    print(i)
    }
    
    # Next statement:
    # The next statement skips the current iteration of a loop without terminating it.
    
    #Below code shows next statement
    x = 1: 4
    for (i in x) {
    if (i == 2){
    next
    }
    print(i)
    }
    
    
    # function
    
    words = c("R", "datascience", "machinelearning","algorithms","AI")
    words.names = function(x) {
    for(name in x){
    print(name)
    }
    }
    
    words.names(words) # Calling the function
    
    
    # extract the elements above the main diagonal of a (square) matrix
    # example of a correlation matrix
    
    cor_matrix <- matrix(c(1, -0.25, 0.89, -0.25, 1, -0.54, 0.89, -0.54, 1), 3,3)
    rownames(cor_matrix) <- c("A","B","C")
    colnames(cor_matrix) <- c("A","B","C")
    cor_matrix
    
    rho <- list()
    name <- colnames(cor_matrix)
    var1 <- list()
    var2 <- list()
    for (i in 1:ncol(cor_matrix)){
    for (j in 1:ncol(cor_matrix)){
    if (i != j & i<j){
    rho <- c(rho,cor_matrix[i,j])
    var1 <- c(var1, name[i])
    var2 <- c(var2, name[j])
    }
    }
    }
    
    d <- data.frame(var1=as.character(var1), var2=as.character(var2), rho=as.numeric(rho))
    d
    
    var1 var2 rho
    1 A B -0.25
    2 A C 0.89
    3 B C -0.54
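    
    The double loop works, but the same extraction can be done in a vectorized way with upper.tri(); a short sketch (same cor_matrix as above, same row order as the loop):
    
    idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)   # positions above the diagonal
    data.frame(var1 = rownames(cor_matrix)[idx[, 1]],
               var2 = colnames(cor_matrix)[idx[, 2]],
               rho  = cor_matrix[idx])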
    
    

    As programming is the best way to learn and think, have fun programming awesome functions!

    This post is also shared in R-bloggers and LinkedIn