And you can find the code on my GitHub.
![Example](http://i.imgur.com/7SxJwXm.png)
def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")

To create a simple data set based on these definitions, all one needs to do is execute a single
genData
command. In this example, we generate 500 records that are based on the definition in the def
table:
dt <- genData(500, def)
dt
##      idnum nr       x1       y1  y2 xCat          g1 a1
##   1:     1  7 10.74929 30.01273 123    3 13310.84731  0
##   2:     2  7 18.56196 44.77329  17    1   395.41606  0
##   3:     3  7 16.96044 43.76427  42    3   522.45258  0
##   4:     4  7 19.51511 45.14214  22    3  3045.06826  0
##   5:     5  7 10.79791 27.25012 128    1   406.88647  0
##  ---
## 496:   496  7 19.96636 52.05377  21    3   264.85557  1
## 497:   497  7 15.97957 39.62428  44    2    40.59823  0
## 498:   498  7 19.74036 47.32292  21    3  3410.54787  0
## 499:   499  7 19.71597 48.26259  25    3   206.90961  1
## 500:   500  7 14.60405 28.94185  53    1    97.43068  0

There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want to give a little more introduction by way of an example that came in from across the globe a couple of days ago. (I’d say the best thing about building an R package is hearing from folks literally all over the world and getting to talk to them about statistics and R.)
# Define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal",
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T", variance = 3)

# Generate a 'blank' data.table with 24 observations and assign them to groups
set.seed(1234)

indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

# Create a longitudinal data set of 3 records for each id
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# Let's look at the data
ggplot(data = longData, aes(x = factor(period), y = Y)) +
  geom_line(aes(color = T, group = id)) +
  scale_color_manual(values = c("#e38e17", "#8e17e3")) +
  xlab("Time")
# Fit a GEE model to the data
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period,
              family = gaussian(link = "identity"),
              data = longData, id = id, corstr = "exchangeable")
summary(fit)
##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
##     family = gaussian(link = "identity"), data = longData, id = id,
##     corstr = "exchangeable")
##
##  Coefficients:
##                     Estimate Std.err   Wald Pr(>|W|)
## (Intercept)          4.98268 0.07227 4753.4   <2e-16 ***
## factor(T)Yes         1.48555 0.10059  218.1   <2e-16 ***
## period               2.53946 0.05257 2333.7   <2e-16 ***
## factor(T)Yes:period  3.51294 0.07673 2096.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    2.952 0.07325
##
## Correlation: Structure = exchangeable  Link = identity
##
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha  0.01737 0.01862
## Number of clusters:   1000   Maximum cluster size: 3
genCorData
. In this example, we generate correlated errors that are normally distributed with mean 0, variance 3, and a common correlation coefficient of 0.7. With this approach, the underlying data generation process is a bit cryptic, because we are acknowledging that we don’t know what drives the relationship between the three outcomes, just that they share some common cause or other relationship that induces a strong correlation. However, it does the trick:
# define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal",
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")

# define the correlated errors
mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)

# generate correlated data for each id and assign treatment
dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")

# create longitudinal data set and generate outcome based on definition
longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id",
                       timevars = c("V1", "V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# look at the data, outcomes should appear more correlated, lines a bit straighter
ggplot(data = longData, aes(x = factor(period), y = Y)) +
  geom_line(aes(color = T, group = id)) +
  scale_color_manual(values = c("#e38e17", "#8e17e3")) +
  xlab("Time")
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period,
              family = gaussian(link = "identity"),
              data = longData, id = id, corstr = "exchangeable")
summary(fit)
##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
##     family = gaussian(link = "identity"), data = longData, id = id,
##     corstr = "exchangeable")
##
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)
## (Intercept)           5.0636  0.0762 4411   <2e-16 ***
## factor(T)Yes          1.4945  0.1077  192   <2e-16 ***
## period                2.4972  0.0303 6798   <2e-16 ***
## factor(T)Yes:period   3.5204  0.0426 6831   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     3.07   0.117
##
## Correlation: Structure = exchangeable  Link = identity
##
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.711  0.0134
## Number of clusters:   1000   Maximum cluster size: 3
ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0,
                 variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal",
                    formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect",
                    variance = 1)

indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")
indData[1:6]
##    id T randomEffect
## 1:  1 0      -1.3101
## 2:  2 1       0.3423
## 3:  3 0       0.5716
## 4:  4 1       2.6723
## 5:  5 0      -0.9996
## 6:  6 1      -0.0722
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

ggplot(data = longData, aes(x = factor(period), y = Y)) +
  geom_line(aes(color = T, group = id)) +
  scale_color_manual(values = c("#e38e17", "#8e17e3")) +
  xlab("Time")
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period,
              family = gaussian(link = "identity"),
              data = longData, id = id, corstr = "exchangeable")
summary(fit)
##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
##     family = gaussian(link = "identity"), data = longData, id = id,
##     corstr = "exchangeable")
##
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)
## (Intercept)           4.9230  0.0694 5028   <2e-16 ***
## factor(T)Yes          1.4848  0.1003  219   <2e-16 ***
## period                2.5310  0.0307 6793   <2e-16 ***
## factor(T)Yes:period   3.5076  0.0449 6104   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     2.63  0.0848
##
## Correlation: Structure = exchangeable  Link = identity
##
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.619  0.0146
## Number of clusters:   1000   Maximum cluster size: 3

I sent all this back to my correspondent, but I haven’t heard yet if either of these solutions meets her needs. I certainly hope so. Speaking of which, if there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.
                 Df Sum Sq Mean Sq F value   Pr(>F)
dose:gender:type  2    187    93.4  22.367 3.81e-10

The interaction effect is statistically significant: F(2) = 22.367, p < 0.001. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.

Simple second-order interaction effects

The simple second-order interaction effects are the effects of each pair of factors at each level of the third factor. Specifically, we have to compute the following effects:
Df Sum Sq Mean Sq F value Pr(>F)
dose:type 2 249 124.7 28.42 3.57e-12
Df Sum Sq Mean Sq F value Pr(>F)
dose:type 2 137.2 68.6 17.31 6.74e-08
Df Sum Sq Mean Sq F value Pr(>F)
dose 2 2943.5 1471.8 349.9 <2e-16
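A minimal sketch of how such simple effects can be computed: subset the data on each level of the third factor and fit the two-way ANOVA within each subset. The data frame and all variable names below (score, dose, gender, type) are simulated and purely illustrative, not the article's actual data set.

```r
# Simulated toy data; all names (score, dose, gender, type) are illustrative
set.seed(1)
df <- expand.grid(dose   = factor(c("placebo", "low", "high")),
                  gender = factor(c("female", "male")),
                  type   = factor(c("A", "B")),
                  rep    = 1:10)
df$score <- rnorm(nrow(df), mean = 50 + 2 * as.numeric(df$dose))

# Fit the dose:type model separately within each level of gender
for (g in levels(df$gender)) {
  cat("gender =", g, "\n")
  print(summary(aov(score ~ dose * type, data = subset(df, gender == g))))
}
```

The `dose:type` row of each within-gender table is the simple second-order interaction at that level of gender.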
TukeyHSD(aov1)

                        diff        lwr       upr p adj
low dose-high dose -2.528333 -3.413363 -1.643303 0
placebo-high dose -9.558333 -10.443363 -8.673303 0
placebo-low dose -7.030000 -7.915030 -6.144970 0
TukeyHSD(aov1)

                        diff        lwr       upr     p adj
low dose-high dose  1.083333  0.1797508  1.986916 0.0141485
placebo-high dose  -2.476667 -3.3802492 -1.573084 0.0000000
placebo-low dose   -3.560000 -4.4635826 -2.656417 0.0000000

The simple main effect is statistically significant, as the first table shows. Furthermore, all the differences between dose levels are significant. The largest difference is between placebo and low dose (about 3.56 points). To learn more on data analysis in R, check the free “Statistics with R” video course here.
defDataAdd
and genData
are both from simstudy
. The first part of the simulation involves specifying the transition matrix P
that determines a state I am calling status
, and then defining the probability of an event based on the status level at a particular time point. For each individual, I generate 36 months of data, with a status and an event indicator for each month.
library(data.table)
library(simstudy)

set.seed(123)

P <- matrix(c(0.985, 0.015, 0.000, 0.000,
              0.000, 0.950, 0.050, 0.000,
              0.000, 0.000, 0.850, 0.150,
              0.000, 0.000, 0.000, 1.000),
            nrow = 4, byrow = TRUE)

form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"

dtDef <- defDataAdd(varname = "event", formula = form,
                    dist = "binary", link = "identity")

N <- 5000
did <- genData(N)

In order to simulate the Markov process, I decided immediately that
Rcpp
would be most appropriate because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding it, states need to be generated sequentially, which means there is no obvious way to vectorize (if someone has figured that out, let me know).
#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector MCsim(unsigned int nMonths, NumericMatrix P,
                    int startStatus, unsigned int startMonth) {

  IntegerVector sim(nMonths);
  IntegerVector newstate;

  unsigned int q = P.ncol();
  IntegerVector m = Rcpp::seq(1, q);

  sim[startMonth - 1] = startStatus;

  for (unsigned int i = startMonth; i < nMonths; i++) {
    newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i - 1) - 1));
    sim(i) = newstate(0);
  }

  return sim;
}

The process is simulated for each individual using the Rcpp function
MCsim
, but is done in the context of a data.table
statement. The key here is that each individual is processed separately through the keyby = id
statement. This obviates the requirement to loop through individuals even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case months) per individual.
dt <- did[, .(status = MCsim(36, P, 1, 1)), keyby = id]
dt[, month := 1:.N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)
dt
##           id status month event
##      1:    1      1     1     0
##      2:    1      1     2     0
##      3:    1      1     3     0
##      4:    1      1     4     0
##      5:    1      1     5     0
##     ---
## 179996: 5000      4    32     0
## 179997: 5000      4    33     0
## 179998: 5000      4    34     0
## 179999: 5000      4    35     0
## 180000: 5000      4    36     0
1, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An individual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e., the earliest month in which all three criteria are met).
Again, the specifics of the simulation are not important here. What is important is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long and must be repeated a large number of times.
My first, naïve, approach was to create an R function that loops through all the individuals and loops within each individual until a cohort is found:
rAssignCohortID <- function(id, month, status, event,
                            nInds, startMonth, thresholdNum) {

  cohort <- rep(0, length(id))

  for (j in 1:nInds) {

    idMonth <- month[id == j]
    idEvent <- event[id == j]
    idStatus <- status[id == j]

    endMonth <- length(idMonth)
    done <- FALSE
    i <- max(startMonth - idMonth[1], 13)

    while (i <= endMonth & !done) {
      if (idEvent[i] == 1 & idStatus[i] > 1) {
        begin <- i - 12
        end <- i - 1
        sumED <- sum(idEvent[begin:end])
        if (sumED >= thresholdNum) {
          cohort[id == j] <- i - 1 + month[1]
          done <- TRUE
        }
      }
      i <- i + 1
    }
  }

  return(cohort)
}
system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event, nInds = N, startMonth = 13, thresholdNum = 2)])
##    user  system elapsed
##   10.92    1.81   12.89
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector cAssignCohortID(IntegerVector id, IntegerVector month,
                              IntegerVector status, IntegerVector event,
                              int nInds, int startMonth, int thresholdNum) {

  IntegerVector cohort(id.length(), 0);

  IntegerVector idMonth;
  IntegerVector idEvent;
  IntegerVector idStatus;

  for (int j = 0; j < nInds; j++) {

    idMonth = month[id == j + 1];
    idEvent = event[id == j + 1];
    idStatus = status[id == j + 1];

    int endMonth = idMonth.length();
    int sumED;
    bool done = false;
    int i = std::max(startMonth - idMonth(0), 12);
    int begin;
    int end;

    while (i < endMonth && !done) {
      if (idEvent(i) == 1 && idStatus(i) > 1) {
        begin = i - 12;
        end = i - 1;
        sumED = sum(idEvent[Rcpp::seq(begin, end)]);
        if (sumED >= thresholdNum) {
          cohort[id == j + 1] = i + month(0);
          done = true;
        }
      }
      i += 1;
    }
  }

  return cohort;
}
system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event, nInds = N, startMonth = 13, thresholdNum = 2)])
##    user  system elapsed
##   13.88    2.03   16.05

I know that the function
cAssignCohortID
bogs down not in the loop, but in each phase where I need to subset the data set to work on a single id
. For example, I need to execute the statement idMonth = month[id == j+1]
for each id
, and this apparently uses a lot of resources. I tried variations on this theme, with alternative ways to subset the data set within the Rcpp function, but got no improvement.
But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things data.table
is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual id
at a time, and let data.table
do the hard work of splitting up the data set to process by individual.
As you can see, things got markedly faster.
rAssignCohort <- function(id, month, status, event,
                          nInds, startMonth, thresholdNum) {

  cohort <- 0

  endMonth <- length(month)
  done <- FALSE
  i <- max(startMonth - month[1], 13)

  while (i <= endMonth & !done) {
    if (event[i] == 1 & status[i] > 1) {
      begin <- i - 12
      end <- i - 1
      sumED <- sum(event[begin:end])
      if (sumED >= thresholdNum) {
        cohort <- i - 1 + month[1]
        done <- TRUE
      }
    }
    i <- i + 1
  }

  return(cohort)
}
system.time(dt[, cohort3 := rAssignCohort(id, month, status, event, nInds = N, startMonth = 13, thresholdNum = 2), keyby=id])
##    user  system elapsed
##     0.2     0.0     0.2

Finally, it occurred to me that an
Rcpp
function that is not required to subset the data might offer yet more improvement in speed. So, for the last iteration, I combined the strengths of looping in Rcpp
with the strengths of subsetting in data.table
to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int cAssignCohort(IntegerVector month, IntegerVector status,
                  IntegerVector event, int startMonth, int thresholdNum) {

  int endMonth = month.length();
  int sumED;
  int cohort = 0;
  bool done = false;
  int i = std::max(startMonth - month(0), 12);
  int begin;
  int end;

  while (i < endMonth && !done) {
    if (event(i) == 1 && status(i) > 1) {
      begin = i - 12;
      end = i - 1;
      sumED = sum(event[Rcpp::seq(begin, end)]);
      if (sumED >= thresholdNum) {
        cohort = i + month(0);
        done = true;
      }
    }
    i += 1;
  }

  return cohort;
}
system.time(dt[, cohort4 := cAssignCohort(month, status, event, startMonth=13, thresholdNum = 2), keyby=id])
##    user  system elapsed
##    0.01    0.00    0.01

For a more robust comparison, let’s use the
benchmark
function in package rbenchmark
, and you can see how well data.table
performs and how much Rcpp
can add when used efficiently.
library(rbenchmark)

benchmark(
  dt[, cohort1 := rAssignCohortID(id, month, status, event,   # naive approach
       nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort2 := cAssignCohortID(id, month, status, event,   # Rcpp approach
       nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort3 := rAssignCohort(id, month, status, event,     # data.table approach
       nInds = N, startMonth = 13, thresholdNum = 2), keyby = id],
  dt[, cohort4 := cAssignCohort(month, status, event,         # combined data.table/Rcpp
       startMonth = 13, thresholdNum = 2), keyby = id],
  replications = 5,
  columns = c("replications", "elapsed", "relative")
)
##   replications elapsed relative
## 1            5   46.18    461.8
## 2            5   68.01    680.1
## 3            5    0.96      9.6
## 4            5    0.10      1.0
keyby
), it is still no match for the fastest solution. (But this all just goes to show that there will always be new approaches to consider, and I don’t claim to have come anywhere near to trying them all out.)
dtfunc <- function(dx) {

  dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by = id]

  map <- CJ(id = 1:N, start = 13L, end = 36L, event = 1L,
            statusx = 1L, prev12x = 1L)

  ans <- dx[map, on = .(id, event, status > statusx, prev12 > prev12x,
                        month >= start, month <= end),
            .I, allow = TRUE, by = .EACHI, nomatch = 0L][, .(id, I)]

  minans <- ans[, .(I = min(I)), by = id]

  dx <- dx[, cohort5 := 0L][minans,
           cohort5 := min(month) - 1L + dx$month[I], on = "id", by = .EACHI]

  return(dx)
}

system.time(dtfunc(dt))
##    user  system elapsed
##    0.18    0.00    0.17

And here is a more complete comparison of the fastest version with this additional approach:
benchmark(
  dt[, cohort6 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
       startMonth = 13, thresholdNum = 2), keyby = id],
  dt2 <- dtfunc(dt),
  replications = 5,
  columns = c("replications", "elapsed", "relative")
)
##   replications elapsed relative
## 1            5    0.10      1.0
## 2            5    0.85      8.5
# CRAN
install.packages('sinew')

# DEV
devtools::install_github('metrumresearchgroup/sinew')
makeOxygen(lm)
#' @title FUNCTION_TITLE
#' @description FUNCTION_DESCRIPTION
#' @param formula PARAM_DESCRIPTION
#' @param data PARAM_DESCRIPTION
#' @param subset PARAM_DESCRIPTION
#' @param weights PARAM_DESCRIPTION
#' @param na.action PARAM_DESCRIPTION
#' @param method PARAM_DESCRIPTION, Default: 'qr'
#' @param model PARAM_DESCRIPTION, Default: TRUE
#' @param x PARAM_DESCRIPTION, Default: FALSE
#' @param y PARAM_DESCRIPTION, Default: FALSE
#' @param qr PARAM_DESCRIPTION, Default: TRUE
#' @param singular.ok PARAM_DESCRIPTION, Default: TRUE
#' @param contrasts PARAM_DESCRIPTION, Default: NULL
#' @param offset PARAM_DESCRIPTION
#' @param ... PARAM_DESCRIPTION
#' @return OUTPUT_DESCRIPTION
#' @importFrom stats model.frame
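The same works for your own functions. Here is a sketch running makeOxygen on a small user-defined function (the function itself is a toy, purely for illustration):

```r
library(sinew)

# A toy function, purely for illustration
scale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Print an roxygen2 skeleton for it, including the parameter defaults
makeOxygen(scale01)
```

The skeleton can then be pasted above the function definition and the PARAM_DESCRIPTION placeholders filled in.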
-u
and -p
flags:
$ mysql -u <username> -p < database_backup.sql
Then run the R command line, using the command below:
$ R
And install the Shiny and RMySQL packages as well as initialize the Shiny library at the end:
> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)
If you face any problems during the installation of these dependencies, carefully read error messages, e.g. you might need sudo apt-get install libmysqlclient-dev
for installing RMySQL.
Finally, run the application:
> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/r-shiny-mysql-template")
And the new tab that should have just opened in your browser shows you the example included in the template:
LICENSE
and README.md
contain information about the license and the template (how to run it, technologies, structure, etc.) respectively. They are not functionally important to our project, so we will not explore them here. Please review these files yourself for a general understanding.
The style.css
file is responsible for the styles of the page.
The database_backup.sql
file contains the code for creating the MySQL table and user and for writing data to the table. You can use your own table or change the data in this one.
Let’s move on to the code. First, open the app.R
file. This file ensures the connection to the MySQL database, reads data, and passes it to the index.html
file, which contains the main code of using AnyChart. The following is a part of the app.R
code, which contains the htmlTemplate
function; here we specify the name of the file where the data will be transmitted to, the names of our page and chart, and the JSON encoded chart data from MySQL database.
htmlTemplate("index.html",
title = "Anychart R Shiny template",
chartTitle = shQuote("Top 5 fruits"),
chartData = toJSON(loadData())
)
The main thing here is the index.html
file, which is actually where the template for creating charts is. As you see, the first part of this file simply connects all necessary files to the code, including the AnyChart library, the CSS file with styles, and so on. I’ll skip this for now and proceed directly to the script
tag and the anychart.onDocumentReady(function () {...})
function.
anychart.onDocumentReady(function() {
var chart = anychart.pie({{ chartData }});
chart.title({{ chartTitle }});
chart.container("container");
chart.draw();
});
This pattern works as follows. We create a pie chart by using the function pie()
and get the data that have already been read and prepared using the R code. Please note that the names of the variables containing data are the same in the app.R
and index.html
files. Then we display the chart title via chart.title({{ chartTitle }}) and specify the ID of the element that will contain the chart, which is a div with id = container in this case. To show all that was coded, we use chart.draw()
.
anychart
. Its structure should look as illustrated below. Please note that some differences are possible (and acceptable) if you are using a new project in RStudio.
anychart/
www/
css/
style.css # css style
ui.R # main application code
server.R # server code
database_backup.sql # data set
index.html # html template
Now you know what files you need. If you’ve made a project with the studio, the ui.R
and server.R
files are created automatically. If you’ve made a project by yourself, just create empty files with the same names and extensions as specified above.
The main difference from the original example included in the template is that we should change the file index.html
and divide app.R
into parts. You can copy the rest of the files or create new ones for your own chart.
Please take a look at the file server.R
. If you’ve made a project using RStudio, it was created automatically and you don’t need to change anything in it. However, if you’ve made it yourself, open it in a text editor and add the code below, which is standard for the Shiny framework. You can read more about that here.
The file structure of ui.R
is similar to the one of app.R
, so you can copy app.R
from the template and change/add the following lines:
loadData = dbGetQuery(db, "SELECT name, value FROM fruits")
data1 <- character()
#data preparation
for(var in 1:nrow(loadData)){
c = c(as.character(loadData[var, 1]), loadData[var, 2])
data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)
ui = function(){
htmlTemplate("index.html",
title = "Anychart R Shiny template",
chartTitle = shQuote("Fruits"),
chartData = toJSON(data)
)}
Since we are going to change the chart type, from pie to 3D vertical bar (column), the data needs some preparation before being passed to index.html
. The main difference is that we will use the entire data from the database, not simply the top 5 positions.
We will slightly modify and expand the basic template. Let’s see the resulting code of the index.html
first (the script
tag) and then explore it.
anychart.onDocumentReady(function() {
var chart = anychart.column3d({{ chartData }});
chart.title({{ chartTitle }});
chart.animation(true);
var xAxis = chart.xAxis();
xAxis.title("fruits");
var yAxis = chart.yAxis();
yAxis.title("pounds, t");
var yScale = chart.yScale();
yScale.minimum(0);
yScale.maximum(120);
chart.container("container");
chart.draw();
});
With the help of var chart = anychart.column3d({{chartData}})
, we are creating a 3D column chart by using the function column3d()
. Here you can choose any other chart type you need; consider getting help from Chartopedia if you are unsure which one works best in your situation.
Next, we are adding animation to the column chart via chart.animation(true)
to make it appear on page load gradually.
In the following section, we are creating two variables, xAxis
and yAxis
. Including these is required if you want to provide the coordinate axes of the chart with captions. So, you create a variable for each axis and then use its title() function to set the caption you want to see.
The next unit is basically optional. We are explicitly specifying the maximum and minimum values for the Y axis, or else AnyChart will independently calculate these values. You can do that the same way for the X axis.
And that’s it! Our 3D column chart is ready, and all seems to be fine for successfully running the code. The only thing left to do before that is to change the MySQL table to make it look as follows:
('apple',100),
('orange',58),
('banana',81),
('lemon',42),
('melon',21),
('kiwi',66),
('mango',22),
('pear',48),
('coconut',29),
('cherries',65),
('grapes',31),
('strawberries',76),
To see what you’ve got, follow the same steps as for running the R, Shiny and MySQL template example, but do not forget to change the path to the folder and the folder name to anychart
. So, let’s open the terminal and command the following:
$ cd anychart
$ mysql < database_backup.sql
$ R
> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)
> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/anychart")
ui.R
and server.R
below. The full source code of this example can be found on GitHub.
ui.R:

server.R:

Nomograms are useful computational tools for model visualisation, graphical assessment of variable importance and the calculation of predicted values. The nomogram function in the rms package is a popular way of creating (static) nomograms for a variety of regression models. For example, the following code will generate a nomogram from a logistic regression model used to model the probability of survival on the Titanic (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html):
library(PASWR)
data(titanic3)
library("rms")

t.data <- datadist(titanic3)
options(datadist = "t.data")

fit <- lrm(formula = survived ~ age + pclass + sex, data = titanic3)
plot(nomogram(fit, fun = function(x) plogis(x)))
This nomogram is a graphical calculator of the probability of surviving the Titanic, based on three explanatory variables, sex, age and passenger class. The nomogram can become more cumbersome to read and use when higher order interaction terms and smoothers are present in the model. For example, consider the following nomogram of the model containing all possible interactions of the three main effects.
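As a sketch, that static interaction nomogram can be produced along the same lines as the earlier example (this reuses the rms/datadist setup; a model with all interactions may generate estimability warnings on this data set):

```r
library(PASWR)   # provides the titanic3 data
library(rms)

data(titanic3)
t.data <- datadist(titanic3)
options(datadist = "t.data")

# Logistic model with all two- and three-way interactions
fit.int <- lrm(survived ~ (age + pclass + sex)^3, data = titanic3)

# Static nomogram on the probability scale
plot(nomogram(fit.int, fun = plogis))
```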
The DynNom package (Jalali, A., Newell, J.), built on Shiny, allows the creation of dynamic nomograms from any generalised linear model or Cox proportional hazards model. DynNom supports model objects created by the lm, glm and coxph functions. It also supports models generated using ols, Glm, lrm and cph in the rms package, which allows the inclusion of smoothing splines. For example, the following R code will build a dynamic nomogram for the corresponding higher order logistic regression model using the DynNom package.
fit2 <- glm(survived ~ (age + pclass + sex)^3, titanic3, family = "binomial")

library(DynNom)
DynNom(fit2, titanic3)

The resulting dynamic nomogram allows the prediction of the probability of survival (and corresponding 95% confidence interval) for any chosen set of values of the explanatory variables. The Shiny tabs display the corresponding predicted values graphically and numerically, along with the underlying model summary.
The ability to interact with a model, in terms of the effect of changes in the explanatory variables on the predicted response, is a useful translational tool. In theory, all published model summaries could be accompanied by a hyperlink directing the reader to the corresponding model nomogram. To do this, the DNbuilder function within DynNom will generate the ui.R, server.R and global.R files needed to share the nomogram online on any server (e.g. http://shinyapps.io).
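A minimal sketch of that workflow, reusing the interaction model from the earlier example (DNbuilder writes the app files to a folder in the current working directory):

```r
library(PASWR)
library(DynNom)

data(titanic3)
fit2 <- glm(survived ~ (age + pclass + sex)^3, data = titanic3,
            family = "binomial")

# Generate ui.R, server.R and global.R, ready to deploy to a Shiny server
DNbuilder(fit2, titanic3)
```

The generated folder can then be uploaded as-is to a Shiny server such as shinyapps.io.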
For example, the dynamic nomogram created for the titanic dataset, published using the DNbuilder function, is available at https://amir.shinyapps.io/titanic.