WhatsR – An R-Package for processing exported WhatsApp Chat Logs

WhatsApp is one of the most heavily used mobile instant messaging applications around the world. It is especially popular for everyday communication with friends and family and most users communicate on a daily or a weekly basis through the app. Interestingly, it is possible for WhatsApp users to extract a log file from each of their chats. This log file contains all textual communication in the chat that was not manually deleted or is not too far in the past.

This logging of digital communication is on the one hand interesting for researchers seeking to investigate interpersonal communication, social relationships, and linguistics, and can on the other hand also be interesting for individuals seeking to learn more about their own chatting behavior (or their social relationships).

The WhatsR R-package enables users to transform exported WhatsApp chat logs into a usable data frame object with one row per sent message and multiple variables of interest. In this blog post, I will demonstrate how the package can be used to process and visualize chat log files.


Installing the Package
The package can be installed either from CRAN or, for the most up-to-date version, from GitHub. I recommend installing the GitHub version for the most recent features and bugfixes.
# from CRAN
# install.packages("WhatsR")

# from GitHub
devtools::install_github("gesiscss/WhatsR")
The package also needs to be attached before it can be used. For creating nicer plot arrangements, I recommend also installing and attaching the patchwork package.
# installing patchwork package
install.packages("patchwork")

# attaching packages
library(WhatsR)
library(patchwork)
Obtaining a Chat Log
You can export one of your own chat logs from your phone to your email address as explained in this tutorial. If you do this, I recommend using the “without media” export option, as it allows you to export more messages.

If you don’t want to use one of your own chat logs, you can create an artificial chat log with the same structure as a real one but with made up text using the WhatsR package!
## creating chat log for demonstration purposes

# setting seed for reproducibility
set.seed(1234)

# simulating chat log
# (and saving it automatically as a .txt file in the working directory)
create_chatlog(n_messages = 20000,
               n_chatters = 5,
               n_emoji = 5000,
               n_diff_emoji = 50,
               n_links = 999,
               n_locations = 500,
               n_smilies = 2500,
               n_diff_smilies = 10,
               n_media = 999,
               n_sdp = 300,
               startdate = "01.01.2019",
               enddate = "31.12.2023",
               language = "english",
               time_format = "24h",
               os = "android",
               path = getwd(),
               chatname = "Simulated_WhatsR_chatlog")
Parsing Chat Log File
Once you have a chat log on your device, you can use the WhatsR package to import the chat log and parse it into a usable data frame structure.
data <- parse_chat("Simulated_WhatsR_chatlog.txt", verbose = TRUE)
Checking the Parsed Chat Log
You should now have a data frame object with one row per sent message and 19 variables with information extracted from the individual messages. For a detailed overview of what each column contains and how it is computed, you can check the related open source publication for the package. A tabular overview is also included below.
## Checking the chat log
dim(data)
colnames(data)

Column Name: Description

DateTime: Timestamp for date and time the message was sent. Formatted as yyyy-mm-dd hh:mm:ss
Sender: Name of the sender of the message as saved in the contact list of the exporting phone, or telephone number. Messages inserted by WhatsApp into the chat are coded with “WhatsApp System Message”
Message: Text of user-generated messages with all information contained in the exported chat log
Flat: Simplified version of the message with emojis, numbers, punctuation, and URLs removed. Better suited for some text mining or machine learning tasks
TokVec: Tokenized version of the Flat column. Instead of one text string, each cell contains a list of individual words. Better suited for some text mining or machine learning tasks
URL: A list of all URLs or domains contained in the message body
Media: A list of all media attachment filenames contained in the message body
Location: A list of all shared location URLs or indicators in the message body, or indicators for shared live locations
Emoji: A list of all emoji glyphs contained in the message body
EmojiDescriptions: A list of all emojis as textual representations contained in the message body
Smilies: A list of all smilies contained in the message body
SystemMessage: Messages that are inserted by WhatsApp into the conversation and not generated by users
TokCount: Amount of user-generated tokens per message
TimeOrder: Order of messages as per the timestamps on the exporting phone
DisplayOrder: Order of messages as they appear in the exported chat log

Checking Descriptives of Chat Logs
Now, you can have a first look at the overall statistics of the chat log. You can check the number of messages, sent tokens, and chat participants; the dates of the first and last message; the timespan of the chat; and the number of emoji, smilies, links, media files, and locations in the chat log.
# Summary statistics
summarize_chat(data, exclude_sm = TRUE)
We can also check the distribution of the number of tokens per message and a set of summary statistics for each individual chat participant.
# Summary statistics per person
summarize_tokens_per_person(data, exclude_sm = TRUE)
Visualizing Chat Logs
The chat characteristics can now be visualized using the custom functions from the WhatsR package. These functions are essentially wrappers around ggplot2 with some options for customizing the plots. Most characteristics can be visualized in multiple ways. For the visualizations, we can exclude the WhatsApp system messages using exclude_sm = TRUE. Let's try it out:

Amount of sent messages
# Plotting amount of messages
p1 <- plot_messages(data, plot = "bar", exclude_sm = TRUE)
p2 <- plot_messages(data, plot = "cumsum", exclude_sm = TRUE)
p3 <- plot_messages(data, plot = "heatmap", exclude_sm = TRUE)
p4 <- plot_messages(data, plot = "pie", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p3) | free(p2)) / (free(p1) | free(p4))
Four different ways of visualizing the amount of sent messages in a WhatsApp chat log.

Amount of sent tokens
# Plotting amount of tokens
p5 <- plot_tokens(data, plot = "bar", exclude_sm = TRUE)
p6 <- plot_tokens(data, plot = "box", exclude_sm = TRUE)
p7 <- plot_tokens(data, plot = "violin", exclude_sm = TRUE)
p8 <- plot_tokens(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p5) | free(p6)) / (free(p7) | free(p8))
Four different ways of visualizing the amount of sent tokens in a WhatsApp chat log.

Amount of sent tokens over time
# Plotting amount of tokens over time
p9 <- plot_tokens_over_time(data,
                            plot = "year",
                            exclude_sm = TRUE)

p10 <- plot_tokens_over_time(data,
                             plot = "day",
                             exclude_sm = TRUE)

p11 <- plot_tokens_over_time(data,
                             plot = "hour",
                             exclude_sm = TRUE)

p12 <- plot_tokens_over_time(data,
                             plot = "heatmap",
                             exclude_sm = TRUE)

p13 <- plot_tokens_over_time(data,
                             plot = "alltime",
                             exclude_sm = TRUE)
# (p13 is created for completeness but not included in the patchwork layout below)

# Printing plots with patchwork package
(free(p9) | free(p10)) / (free(p11) | free(p12))
Four different ways of visualizing the amount of sent tokens over time in a WhatsApp chat log.

Amount of sent links
# Plotting amount of links
p14 <- plot_links(data, plot = "bar", exclude_sm = TRUE)
p15 <- plot_links(data, plot = "splitbar", exclude_sm = TRUE)
p16 <- plot_links(data, plot = "heatmap", exclude_sm = TRUE)
p17 <- plot_links(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p14) | free(p15)) / (free(p16) | free(p17))
Four different ways of visualizing the amount of sent links in a WhatsApp chat log.

Amount of sent smilies
# Plotting amount of smilies
p18 <- plot_smilies(data, plot = "bar", exclude_sm = TRUE)
p19 <- plot_smilies(data, plot = "splitbar", exclude_sm = TRUE)
p20 <- plot_smilies(data, plot = "heatmap", exclude_sm = TRUE)
p21 <- plot_smilies(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p18) | free(p19)) / (free(p20) | free(p21))
Four different ways of visualizing the amount of sent smilies in a WhatsApp chat log.

Amount of sent emoji
# Plotting amount of emoji
p22 <- plot_emoji(data,
                  plot = "bar",
                  min_occur = 300,
                  exclude_sm = TRUE,
                  emoji_size = 5)

p23 <- plot_emoji(data,
                  plot = "splitbar",
                  min_occur = 70,
                  exclude_sm = TRUE,
                  emoji_size = 5)

p24 <- plot_emoji(data,
                  plot = "heatmap",
                  min_occur = 300,
                  exclude_sm = TRUE)

p25 <- plot_emoji(data,
                  plot = "cumsum",
                  min_occur = 300,
                  exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p22) | free(p23)) / (free(p24) | free(p25))
Four different ways of visualizing the amount of sent emoji in a WhatsApp chat log.

Distribution of reaction times
# Plotting distribution of reaction times
p26 <- plot_replytimes(data,
                       type = "replytime",
                       exclude_sm = TRUE)
p27 <- plot_replytimes(data,
                       type = "reactiontime",
                       exclude_sm = TRUE)

# Printing plots with patchwork package
free(p26) | free(p27)
Average reply times and reaction times for each individual chat participant in a WhatsApp chat log.

Lexical Dispersion
A lexical dispersion plot visualizes where specific words occur within a text corpus. Because the simulated chat log in this example uses lorem ipsum text in which all words occur at similar rates, we add the string “testword” to a random subsample of messages. For visualizing real chat logs, this would of course not be necessary.
# Adding "testword" to a random subset of messages for demonstration purposes
set.seed(12345)
word_additions <- sample(dim(data)[1], 50)
data$TokVec[word_additions] <- sapply(data$TokVec[word_additions],
                                      function(x) {c(x, "testword")})
data$Flat[word_additions] <- sapply(data$Flat[word_additions],
                                    function(x) {paste(x, "testword")})
Now you can create the lexical dispersion plot:
# Plotting lexical dispersion plot
plot_lexical_dispersion(data,
                        keywords = c("testword"),
                        exclude_sm = TRUE)
Lexical dispersion plot for the occurrence of the word “testword” in the simulated WhatsApp chat log.

Response Networks
# Plotting response network
plot_network(data,
             edgetype = "n",
             collapse_sessions = TRUE,
             exclude_sm = TRUE)
Network graph showing how often each chat participant directly responded to the previous message (a subsequent message is counted as a “response” here).

Issues and Long-Term Availability
Unfortunately, WhatsApp chat logs are a moving target when it comes to parsing and visualization. The structure of exported chat logs changes from time to time, and it also differs between operating systems (Android & iOS), time formats (am/pm vs. 24h), and language settings (e.g. English & German) on the exporting phone. When the structure changes, the WhatsR package can be limited in its functionality or become completely dysfunctional until it is updated and tested. Should you encounter any issues, reports on the GitHub issues page are welcome. Should you want to contribute to improving or maintaining the package, pull requests and collaborations are also welcome!

Visualising connection in R workshop

Join our workshop on Visualising connection in R, which is part of our workshops for Ukraine series!

Here’s some more info: 

Title: Visualising connection in R
Date: Thursday, November 9th, 19:00 – 21:00 CEST (Rome, Berlin, Paris timezone)
Speaker: Rita Giordano is a freelance data visualisation consultant and scientific illustrator based in the UK. By training, she is a physicist who holds a PhD in statistics applied to structural biology. She has extensive experience in research and data science. Furthermore, she has over fourteen years of professional experience working with R. She is also a LinkedIn instructor. You can find her course “Build Advanced Charts with R” on LinkedIn Learning.
Description: How to show connection? It depends on the connection we want to visualise. We could use a network, chord, or Sankey diagram.  The workshop will focus on how to visualise connections using chord diagrams. We will explore how to create a chord diagram with the {circlize} package.  In the final part of the workshop, I will briefly mention how to create a Sankey diagram with networkD3. Attendees need to have installed the {circlize} and {networkD3} packages.  
Minimal registration fee: 20 euro (or 20 USD or 800 UAH)


How can I register?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring the participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?

  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).


You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops, for which you can get the recordings & materials, here.


Looking forward to seeing you during the workshop!





Ebook launch – Simple Data Science (R)

Simple Data Science (R) covers the fundamentals of data science and machine learning. The book is beginner-friendly and has detailed code examples. It is available on Scribd.


Topics covered in the book –
  • Data science introduction
  • Basic statistics
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Regression, classification, and clustering
  • Boosting with lightGBM package
  • Hands-on projects
  • Data science use cases

Download recently published book – Learn Data Science with R

Learn Data Science with R is for learning the R language and data science. The book is beginner-friendly and easy to follow. It is available for download on a pay-what-you-want basis: the minimum price is 0 and the suggested contribution is Rs. 1300 (~$18). Please review the book on Goodreads.


The book topics are –
  • R Language
  • Data Wrangling with data.table package
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Boosting with lightGBM package
  • Hands-on projects

3D route visualisation – Kilimanjaro





I will show here how to create an interactive 3D visualisation of geospatial data using the rgl library, which allows exporting the results to HTML.

The dataset to visualize is a 7-day hike up Mount Kilimanjaro (5895 metres) in Tanzania. It contains latitude, longitude, and elevation at a sequence of timestamps.
route <- readr::read_delim("https://raw.githubusercontent.com/mvhurban/kilimanjaro/master/Kilimanjaro_ascent.csv", ";", escape_double = FALSE, trim_ws = TRUE)
The hiker's elevation in metres can be plotted simply:
plot(route$ele)
When visualizing small areas, such as cities, mountains, or parts of countries, the distortion from projecting the geoid onto a 2D surface is insignificant, and it is sufficient to set the projection center to the visualized area. Our dataset falls into this category and can therefore be bounded by:
max_lat = max(route$lat)
min_lat = min(route$lat)
max_lon = max(route$lon)
min_lon = min(route$lon)
To create a 3D surface, the horizontal units must match the altitude units (metres). This can be achieved by scaling the coordinates by factors corresponding to the arc length of one degree of latitude or longitude.

We can use the function distGeo (which calculates the great-circle distance between two points on Earth) to obtain the conversion scales.
library(geosphere)
library(rgdal)
lon_scale = geosphere::distGeo(c(min_lon, (min_lat+max_lat)/2), c(max_lon,(min_lat+max_lat)/2))/(max_lon-min_lon) 
lat_scale = geosphere::distGeo(c(min_lon, min_lat), c(min_lon, max_lat))/(max_lat-min_lat)
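As a quick sanity check, the latitude scale should be close to the well-known rule of thumb that one degree of latitude spans roughly 111 km:
lat_scale # metres per degree of latitude; expect a value around 111000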
Having conversion scales for latitude and longitude, we can convert the GPS coordinates to Cartesian coordinates in the plane tangent to the Earth at the center of the dataset. One can picture this as placing a flat sheet of paper on the spot of the Earth under scrutiny.
route$lon_m <- route$lon * lon_scale
route$lat_m <- route$lat * lat_scale
To obtain the elevation data surrounding the track, we define a polygon (rectangle) enclosing the track. Note that the polygon begins and ends with the same point.
y_coord <- c(min_lat, min_lat, max_lat, max_lat, min_lat) # latitude
x_coord <- c(min_lon, max_lon, max_lon, min_lon, min_lon) # longitude
Geospatial polygons can be handled by the sp library. Such a polygon must carry a projection string (prj_sps below) defining how to interpret its coordinates:
library(sp)
polygon_border <- cbind(x_coord, y_coord)
p <- sp::Polygon(polygon_border)
ps <- sp::Polygons(list(p), 1)
sps <- sp::SpatialPolygons(list(ps))
prj_sps <- sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
sp::proj4string(sps) <- prj_sps
spdf = sp::SpatialPolygonsDataFrame(sps, data.frame(f=99.9))
#plot(spdf)
We define a grid of points at which elevation data will be requested, and attach the projection string to the grid object:
grid_data <- sp::makegrid(spdf, cellsize = 0.001) # cellsize in map units (here degrees, since the CRS is longlat)
grid <- sp::SpatialPoints(grid_data, proj4string = prj_sps)
Elevation data for a raster object can be downloaded using the elevatr library, which accesses the Terrain Tiles on AWS.
library(elevatr)
elevation_df <- elevatr::get_elev_raster(grid, prj = prj_sps, z = 12)
#plot(elevation_df)
Transforming the raster object into a data frame:
library(raster)
kili_map <- as.data.frame(raster::rasterToPoints(elevation_df))
Since we have an equidistant rectangular grid, we can rearrange the points into matrices and project them onto the plane:
lenX = length(unique(kili_map$x))
lenY = length(unique(kili_map$y))
z_full = matrix(kili_map$layer, nrow = lenX, ncol = lenY, byrow = FALSE)
x_full = lon_scale * matrix(kili_map$x, nrow = lenX, ncol = lenY, byrow = FALSE) # longitude
y_full = lat_scale * matrix(kili_map$y, nrow = lenX, ncol = lenY, byrow = FALSE) # latitude
At this stage it is already possible to generate a 3D model of the route, but depending on the dataset, additional problems may occur. Here we address a mismatch between the track altitudes and the altitudes downloaded with the elevatr library.

In our case the elevations do not match: some points of the Kilimanjaro dataset lie above and some below the elevatr altitudes, and this overlap makes the display ugly. We solve the problem by discarding the provided elevations and projecting the GPS coordinates onto the downloaded grid. (Another option would be to use elevatr::get_elev_point(), which returns an altitude for each point of a data frame.)

The projection is done by finding the grid point closest to each route point and assigning its elevation; the nearest-neighbour search is done with a rolling join from the data.table library:
library(data.table)
dt_route = data.table::data.table(route)
N = nrow(dt_route)
data.table::setkeyv(dt_route, c('lon_m','lat_m'))
coor_x = data.table::data.table(ind_x=c(1:lenX), lon_m=x_full[ , 1])
coor_y = data.table::data.table(ind_y=c(1:lenY),lat_m=y_full[1, ])
dt_route$ind_y = coor_y[dt_route, on = 'lat_m', roll='nearest']$ind_y
dt_route$ind_x = coor_x[dt_route, on = 'lon_m', roll='nearest']$ind_x
for (i in c(1:N)) {
  dt_route$altitude[i] <- z_full[dt_route$ind_x[i], dt_route$ind_y[i]]
}
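As a side note, the loop above can be replaced by a single vectorized lookup, since base R allows indexing a matrix with a two-column matrix of row and column indices:
# loop-free equivalent of the altitude assignment above
dt_route$altitude <- z_full[cbind(dt_route$ind_x, dt_route$ind_y)]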
For our purposes the grid object is too big, so we can skip a few points. Note that keeping only every 5th grid point in each direction retains 1/25 of the points, a size reduction of 96%!
skip_cell = 5
x = x_full[seq(1, lenX, by=skip_cell), seq(1, lenY, by=skip_cell)]
y = y_full[seq(1, lenX, by=skip_cell), seq(1, lenY, by=skip_cell)]
z = z_full[seq(1, lenX, by=skip_cell), seq(1, lenY, by=skip_cell)]
Colors can be assigned to particular altitudes via a color palette; for obvious reasons we have chosen terrain.colors. Since Kilimanjaro has snow above 5300 m, we also add a snowline.
snowline <- 5300 # above this line color will be white
colorlut <- terrain.colors(snowline-0, alpha = 0)
# mapping altitude to color: minimum is 0 (green), maximum is 5300 (white)
z_col<-z
z_col[z_col>snowline] <- snowline
col <- colorlut[z_col]
Calculating relative time spent on the hike:
dt_route = dt_route[order(time), ]
time <- as.POSIXlt(dt_route$time, format='%Y-%m-%d-%H-%M-%S')
start_time = time[1]
end_time = time[N]
dt_route$relative_time <- as.numeric(difftime(time, start_time, units ='hours'))/as.numeric(difftime(end_time, start_time, units='hours'))
Here we add some additional perks to the map:
#### CAMPS
camp_index<-c(1, 2897, 4771, 7784, 9275, 10396,15485, 17822)
camp_lon <- dt_route[camp_index, lon_m]
camp_lat <- dt_route[camp_index, lat_m]
camp_alt <- dt_route[camp_index, altitude]
camp_names <- c('Machame gate','Machame camp', 'Shira cave', 'Baranco camp', 'Karanga camp', 'Barafu camp', 'Mweka camp', 'Mweka gate')
#### PEAKS
peak_lon <- lon_scale*c( 37.3536381, 37.4561200, 37.3273936, 37.2370328)
peak_lat <- lat_scale*c(-3.0765783, -3.0938911, -3.0679128, -3.0339558)
peak_alt <- c(5895, 5149, 4689, 2962)-50
peak_names<-c("Uhuru peak (5895)", "Mawenzi (5149)", "Lava Tower (4689)", "Shira (2962)")

#### Hike
skip_points <- 10
aux_index <- seq(1, N, by=skip_points)
time <- dt_route[aux_index, relative_time]
hike <- cbind(dt_route[aux_index, lon_m], dt_route[aux_index, lat_m], dt_route[aux_index, altitude])
Finally, we plot everything using the rgl library. It is important that the rgl window initialized by rgl::open3d() stays open while the objects are created. Each created object is loaded into an HTML element within the rgl scene3d, and the current state of the rgl window can be saved at any moment as a scene object with the command kilimanjaro_scene3d <- scene3d(). Note the role of the parameter alpha: alpha = 1 corresponds to non-transparent objects, while for the text labels some transparency is allowed (alpha = 0.9), which smooths the rendering of the edges.
library(rmarkdown)
rgl::open3d()
hike_path <- rgl::plot3d(hike, type="l", alpha = 1, lwd = 4, col = "blue", xlab = "latitude", ylab='longitude', add = TRUE)["data"]
rgl::aspect3d("iso") # set equilateral spacing on axes
camp_shape <- rgl::spheres3d(camp_lon, camp_lat, camp_alt, col="green", radius = 150)
rgl::rgl.surface(x, z, y, color = col, alpha=1)
hiker <- rgl::spheres3d(hike[1, , drop=FALSE], radius = 200, col = "red")
camp_text <- rgl::rgl.texts(camp_lon, camp_lat, camp_alt + 400, camp_names,
                            font = 2, cex = 1, color = 'black',
                            depth_mask = TRUE, depth_test = "always", alpha = 0.7)
peak_shape <- rgl::spheres3d(peak_lon, peak_lat, peak_alt, col = "black", radius = 150)
peak_text <- rgl::rgl.texts(peak_lon, peak_lat, peak_alt + 400, peak_names,
                            font = 2, cex = 1, color = 'black',
                            depth_mask = TRUE, depth_test = "always", alpha = 0.9)
To generate an interactive HTML, the rgl library provides the rglwidget function, which can be used to display the motion of the hiker in real time. It is worth mentioning how the ageControl object works: it generates a set of objects with a predefined life cycle, where births defines the time of spawn and ages defines the vector of time points (in object age) at which object properties, such as position or transparency (alpha), change.
library(rgl)
rgl::rglwidget() %>%
  playwidget(
    ageControl(births = 0, ages = time, vertices = hike, objids = hiker),
    step = 0.01, start = 0, stop = 1, rate = 0.01,
    components = c("Reverse", "Play", "Slower", "Faster", "Reset", "Slider", "Label"),
    loop = TRUE) %>%
  toggleWidget(ids = c(camp_shape, camp_text), label = "Camps") %>%
  toggleWidget(ids = hike_path, label = "Route") %>%
  toggleWidget(ids = c(peak_shape, peak_text), label = "Peaks") %>%
  asRow(last = 3)
And that’s it!
A version modified to satisfy the requirements of Shiny apps is posted here:
https://mvhurban.shinyapps.io/GPS_zobrazovanie/

MultiNav: anomaly detection and interactive visualization with multivariate data

In statistical process control (SPC), a.k.a. anomaly detection, one compares some statistic to its “in control” distribution. Many statistical and BI software suites (e.g. Tableau, PowerBI) can do SPC. Almost all of these, however, focus on univariate processes, which are easier to visualize and discuss than multivariate ones.
The purpose of our MultiNav package is to ingest multivariate data streams, compute multivariate summaries, and allow us to query the data interactively when an anomaly is detected.

In the following example we look into an agricultural IoT use case and try to detect sensors with anomalous readings. The data consists of hourly measurements, over 5 days, of the girth of 100 avocado trees. For each such tree, an anomaly score is computed using four variations of Hotelling's T2 statistic. The full details are available in the official documentation.
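For intuition, here is a minimal sketch of the classical Hotelling's T2 score for a single observation, assuming a known “in control” mean vector and covariance matrix (the function name is hypothetical; MultiNav implements its own variations internally):

hotelling_t2 <- function(x, mu, S) {
  # squared Mahalanobis distance of x from the in-control mean
  d <- x - mu
  as.numeric(t(d) %*% solve(S) %*% d)
}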

This one line of code will produce the following dashboard:
MultiNav(data, type = "scores", xlbl="Date id")


Crucially for our purposes, this dashboard is interactive. Charts are linked so that one may hover over a particular point and have it light up in all panels (JMP and S+ style). Hovering over a point will also draw the raw evolution of the measurement in time and compare it to the group via a functional boxplot.
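As an aside, functional boxplots of raw curves can also be produced directly in R, for instance with the fda package; the following is a minimal sketch with simulated curves, not MultiNav's internal code:

library(fda)
curves <- matrix(rnorm(120 * 100), nrow = 120) # 120 hourly points, 100 simulated trees
fbplot(curves, method = "MBD") # flags outlying curves by modified band depth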

There is a lot more that MultiNav can do. The details can be found in the inline help and the official documentation. For the purpose of this blog, we merely summarize the reasons we find MultiNav very useful:

1. Interactive visualizations that link anomaly score with raw data.
2. Functional boxplot for querying the source of the anomalies.
3. Out-of-the-box multivariate scoring functions, with the possibility of extension by the user.
4. The interactive components are embeddable in Shiny apps, for a full-blown interactive dashboard.

Using Shiny Dashboards for Financial Analysis

For some time now, I have been trading traditional assets—mostly U.S. equities. About a year ago, I jumped into the cryptocurrency markets to try my hand there as well. In my time in investor Telegram chats and subreddits, I often saw people arguing over which investments had performed better over time, but the reality was that most such statements were anecdotal, and thus unfalsifiable.

Given the paucity of cryptocurrency data available in an easily accessible format, it was quite difficult to say for certain that a particular investment was a good one relative to some alternative, unless you were very familiar with a handful of APIs. Even then, assuming you knew how to get daily OHLC data for a crypto-asset like Bitcoin, in order to compare it to some other asset—say Amazon stock—you would have to eyeball trends from a website like Yahoo finance or scrape that data separately and build your own visualizations and metrics. In short, historical asset performance comparisons in the crypto space were difficult to conduct for all but the most technically savvy individuals, so I set out to build a tool that would remedy this, and the Financial Asset Comparison Tool was born.

In this post, I aim to describe a few key components of the dashboard and also call out lessons learned from the process of iterating on the tool along the way. Before proceeding, I highly recommend that you read the app's README and take a look at the UI and the code base itself, as this will provide the context necessary to understand the rest of the commentary below.


I'll start by delving into a few principles that I find to be key when designing analytic dashboards, drawing on the asset comparison dashboard as my exemplar, and will end with some discussion of the relative utility of a few packages integral to the app. Overall, my goal is not to focus only on the tool that I built, but to highlight a few main best practices when it comes to building dashboards for any analysis.

Build the app around the story, not the other way around.


Before ever writing a single line of code for an analytic app, I find that it is absolutely imperative to have a clear vision of the story that the tool must tell. I do not mean by this that you should already have conclusions about your data that you will then force the app into telling, but rather that you must know how you want your user to interact with the app in order to glean useful information.

In the case of my asset comparison tool, I wanted to serve multiple audiences—everyone from a casual trader who just wanted to see which investment produced the greatest net profit over a period of time, to a more experienced trader who had more nuanced questions about risk-adjusted return on investment given varying discount rates. The trick is thus building the app in such a way that it serves all possible audiences without hindering any one type of user in particular.

The way I designed my app to meet this need was to build the UI such that as you descend the various sections vertically, the metrics displayed scale in complexity. My reasoning for this becomes apparent when you consider the two extremes in terms of users—the most basic vs. the most advanced trader.

The most basic user will care only about the assets of interest, the time period they want to examine, and how their initial investment performed over time. As such, they will start with the sidebar, input their assets and time frame of choice, and then use the top right-most input box to modulate their initial investment amount (although some may choose to stick with the default value here). They will then see the first chart change to reflect their choices, and they will see, both visually, and via the summary table below, which asset performed better.

The experienced trader, on the other hand, will start off exactly as the novice did, by choosing assets of interest, a time frame of reference, and an initial investment amount. They may then choose to modulate the LOESS parameters as they see fit, descending the page, looking over the simple returns section, perhaps stopping to make changes to the corresponding inputs there, and finally ending at the bottom of the page—at the Sharpe Ratio visualizations. Here they will likely spend more time—playing around with the time period over which to measure returns and changing the risk-free rate to align with their own personal macroeconomic assumptions.

The point of these two examples is to illustrate that the app by dint of its structure alone guides the user through the analytic story in a waterfall-like manner—building from simple portfolio performance, to relative performance, to the most complicated metrics for risk-adjusted returns. This keeps the novice trader from being overwhelmed or confused, and also allows the most experienced user to follow the same line of thought that they would anyway when comparing assets, while following a logical progression of complexity, as shown via the screenshot below.




Once you think you have a structure that guides all users through the story you want them to experience, test it by asking yourself if the app flows in such a way that you could pose and answer a logical series of questions as you navigate the app without any gaps in cohesion. In the case of this app, the questions that the UI answers as you descend are as follows:



  • How do these assets compare in terms of absolute profit?
  • How do these assets compare in terms of simple return on investment?
  • How do these assets compare in terms of variance-adjusted and/or risk-adjusted return on investment?


Thus, when you string these questions together, you can make statements of the type: “Asset X seemed to outperform Asset Y in terms of absolute profit, and this trend held true as well when it comes to simple return on investment, over varying time frames. That said, when you take into account the variance inherent to Asset X, it seems that Asset Y may have been the better choice, as the excess downside risk associated with Asset X outweighs its excess net profitability.”


Too many cooks in the kitchen—the case for a functional approach to app-building.



While the design of the UI of any analytic app is of great importance, it’s important to not forget that the code base itself should also be well-designed; a fully-functional app from the user’s perspective can still be a terrible app to work with if the code is a jumbled, incomprehensible mess. A poorly designed code base makes QC a tiresome, aggravating process, and knowledge sharing all but impossible.

For this reason, I find that sourcing a separate R script file containing all analytic functions necessitated by the app is the best way to go, as done below (you can see Functions.R at my repo here).


# source the Functions.R file, where all analytic functions for the app are stored
source("Functions.R")


Not only does this allow for a more comprehensible and less-cluttered App.R, but it also drastically improves testability and reusability of the code. Consider the example function below, used to create the portfolio performance chart in the app (first box displayed in the UI, center middle).

build_portfolio_perf_chart <- function(data, port_loess_param = 0.33){
  
  port_tbl <- data[,c(1,4:5)]
  
  # grabbing the 2 asset names
  asset_name1 <- sub('_.*', '', names(port_tbl)[2])
  asset_name2 <- sub('_.*', '', names(port_tbl)[3])
  
  # transforms dates into correct type so smoothing can be done
  port_tbl[,1] <- as.Date(port_tbl[,1])
  date_in_numeric_form <- as.numeric((port_tbl[,1]))
  # assigning loess smoothing parameter
  loess_span_parameter <- port_loess_param
  
  # now building the plotly itself
  port_perf_plot <- plot_ly(data = port_tbl, x = ~port_tbl[,1]) %>%
    # asset 1 data plotted
    add_markers(y =~port_tbl[,2],
                marker = list(color = '#FC9C01'),
                name = asset_name1,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,2] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#FC9C01'),
              name = asset_name1,
              showlegend = TRUE) %>%
    # asset 2 data plotted
    add_markers(y =~port_tbl[,3],
                marker = list(color = '#3498DB'),
                name = asset_name2,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,3] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#3498DB'),
              name = asset_name2,
              showlegend = TRUE) %>%
    layout(
      title = FALSE,
      xaxis = list(type = "date",
                   title = "Date"),
      yaxis = list(title = "Portfolio Value ($)"),
      legend = list(orientation = 'h',
                    x = 0,
                    y = 1.15)) %>%
    add_annotations(
      x= 1,
      y= 1.133,
      xref = "paper",
      yref = "paper",
      text = "",
      showarrow = F
    )
  
  return(port_perf_plot)
  
}


Writing this function in the sourced Functions.R file instead of directly within App.R allows the developer to first test the function itself with fake data—i.e. data not gleaned from the reactive inputs.
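For example, a quick smoke test outside of Shiny could look like the hypothetical snippet below; the fake column layout mirrors what build_portfolio_perf_chart expects (a date column first and the two portfolio-value columns in positions 4 and 5), and the asset names are made up:

library(plotly)
source("Functions.R")

# fake stand-in for the reactive dataset
fake_data <- data.frame(
  date = seq(as.Date("2021-01-01"), by = "day", length.out = 100),
  filler_1 = NA, filler_2 = NA, # placeholder columns 2:3
  BTC_value = 1000 + cumsum(rnorm(100, mean = 5)),
  AMZN_value = 1000 + cumsum(rnorm(100, mean = 4))
)

# if this renders a sensible chart, the function logic is sound
build_portfolio_perf_chart(fake_data, port_loess_param = 0.33)

Once the function has been tested in this way, it can be integrated into App.R on the server side, as shown below, with very little code.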

  output$portfolio_perf_chart <- 
    debounce(
      renderPlotly({
        data <- react_base_data()
        build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
      }), 
      millis = 2000) # sets wait time for debounce


This process allows for better error-identification and troubleshooting. If, for example, you want to change the work accomplished by the analytic function in some way, you can make the changes necessary to the code, and if the app fails to produce the desired outcome, you simply restart the chain: first you test the function in a vacuum outside of the app, and if it runs fine there, then you know that you have a problem with the way the reactive inputs are integrating with the function itself. This is a huge time saver when debugging.

Lastly, this allows for ease of reproducibility and hand-offs. If, say, one of your functions simply takes in a dataset and produces a chart of some sort, then it can easily be copied from Functions.R and reused elsewhere. I have done this too many times to count, ripping code from one project and, with a few alterations, instantly applying it in another context. This is easy to do if the functions are written in a manner not dependent on a particular Shiny reactive structure. For all these reasons, it makes sense in most cases to keep the code for the app UI and inputs cleanly separated from the analytic functions via a sourced R script.

Dashboard documentation—both a story and a manual, not one or the other.


When building an app for a customer at work, I never simply write an email with a link in it and write “here you go!” That will result in, at best, a steep learning curve, and at worst, an app used in an unintended way, resulting in user frustration or incorrect results. I always meet with the customer, explain the purpose and functionalities of the tool, walk through the app live, take feedback, and integrate any key takeaways into further iterations.

Even if you are just planning on writing some code to put up on GitHub, you should still consider all of these steps when working on the documentation for your app. In most cases, the README is the epicenter of your documentation—the README is your meeting with the customer.  As you saw when reading the README for the Asset Comparison Tool, I always start my READMEs with a high-level introduction to the purpose of the app—hopefully written or supplemented with visuals (as seen below) that are easy to understand and will capture the attention of browsing passers-by. 






After this introduction, the rest of the potential sections to include can vary greatly from app to app. In some cases an app is meant to answer one particular question and might have a variety of filters or supplemental functionalities; one such example can be found here. As can be seen in that README, I spend a great deal of time on the methodology after making the overall purpose clear, calling out additional options along the way. In the case of the README for the Asset Comparison Tool, however, the story is a bit different. Given that there are many questions the app seeks to answer, it makes sense to answer each in turn, writing the README in such a way that its progression mirrors the logical flow intended for the app's user.

One should of course not neglect to cover necessary technical information in the README as well. Anything that is not immediately clear from using the app should be clarified in the README—from calculation details to the source of your data, etc. Finally, don’t neglect the iterative component! Mention how you want to interact with prospective users and collaborators in your documentation. For example, I normally call out how I would like people to use the Issues tab on GitHub to propose any changes or additions to the documentation, or the app in general. In short, your documentation must include both the story you want to tell, and a manual for your audience to follow. 

Why Shiny Dashboard?



One of the first things you will notice about the app.R code is that the entire thing is built using Shiny Dashboard as its skeleton. There are two main reasons for this, which I will touch on in turn.

Shiny Dashboard provides the biggest bang for your buck in terms of how much UI complexity and customizability you get out of just a small amount of code.


I can think of few cases where any analyst or developer would prefer longer, more verbose code to a shorter, succinct solution. That said, Shiny Dashboard’s simplicity when it comes to UI manipulation and customization is not just helpful because it saves you time as a coder, but because it is intuitive from the perspective of your audience.

Most of the folks that use the tools I have built to shed insight into economic questions don’t know how to code in R or Python, but they can, with a little help from extensive commenting and detailed documentation, understand the broad structure of an app coded in Shiny Dashboard format. This is, I believe, largely a function of two features of Shiny Dashboard: the colloquial-English-like syntax of the code for UI elements, and the lack of the necessity for in-line or external CSS.

As you can see from the example below, Shiny Dashboard’s system of “boxes” for UI building is easy to follow. Users can see a box in the app and easily tie that back to a particular box in the UI code.

Here is the box as visible to the user:







And here is the code that produces the box:

box(
        title = "Portfolio Performance Inputs",
        status= "primary",
        solidHeader = TRUE,
        h5("This box focuses on portfolio value, i.e., how much an initial investment of the amount specified below (in USD) would be worth over time, given price fluctuations."),
        
        textInput(
          inputId = "initial_investment",
          label = "Enter your initial investment amount ($):",
          value = "1000"),
        
        hr(),
        
        h5("The slider below modifies the", a(href = "https://stats.stackexchange.com/questions/2002/how-do-i-decide-what-span-to-use-in-loess-regression-in-r", "smoothing parameter"), "used in the", a(href = "https://en.wikipedia.org/wiki/Local_regression", "LOESS function"), "that produces the lines on the scatterplot."),
        
        sliderInput(
          inputId = "port_loess_param",
          label = "Smoothing parameter for portfolio chart:",
          min = 0.1,
          max = 2,
          value = .33,
          step = 0.01,
          animate = FALSE
        ),
        
        hr(),
        h5("The table below provides metrics by which we can compare the portfolios. For each column, the asset that performed best by that metric is colored green."),
        
        height = 500, 
        width = 4
      )


Secondly, and somewhat related to the first point, with Shiny Dashboard, much of the coloring and overall UI design comes pre-made via dashboard-wide “skins”, and box-specific “statuses.”

This is great if you are okay sacrificing a bit of control for a significant reduction in code complexity. In my experience dealing with non-coding-proficient audiences, I find that in-line CSS or complicated external CSS makes folks far more uncomfortable with the code in general. Anything you can do to reduce this anxiety and make those using your tools feel as though they understand them better is a good thing, and Shiny Dashboard makes that easier.

Shiny Dashboard’s combination of sidebar and boxes makes for easy and efficient data processing when your app has a waterfall-like analytic structure. 


Having written versions of this app both in base Shiny and using Shiny Dashboard, the number one reason I chose Shiny Dashboard was the fact that the analytic questions I sought to solve followed a waterfall-like structure, as explained in the previous section. This works perfectly well with Shiny Dashboard’s combination of sidebar input controls and inputs within UI boxes themselves.  

The inputs of primordial importance to all users are included in the sidebar UI: the two assets to analyze, and the date range over which to compare their performance. These are the only inputs that all users, regardless of experience or intent, must absolutely use, and when they are changed, all views in the dashboard will be affected. All other inputs are stored in the UI Boxes adjacent to the views that they modulate. This makes for a much more intuitive and fluid user experience, as once the initial sidebar inputs have been modulated, the sidebar can be hidden, as all other non-hidden inputs affect only the visualizations to which they are adjacent.

This waterfall-like structure also makes for more efficient reactive processes on the Shiny back end. The inputs on the sidebar are parameters that, when changed, force the main reactive function that creates the primary dataset to fire, thus recreating the base dataset (as can be seen in the code for that base dataset's creation below).

  # utility function to be used within the server; this enables us to use a textInput for our portfolio values
  exists_as_number <- function(item) {
    !is.null(item) && !is.na(item) && is.numeric(item)
  }
  
  # data-creation reactives (i.e. everything that doesn't directly feed an output)
  
  # first is the main data pull, which will fire whenever the primary inputs (asset_1a, asset_2a, initial_investment, or port_dates1a) change
  react_base_data <- reactive({
    if (exists_as_number(as.numeric(input$initial_investment)) == TRUE) {
      # creates the dataset to feed the viz
      return(
        get_pair_data(
          asset_1 = input$asset_1a,
          asset_2 = input$asset_2a, 
          port_start_date = input$port_dates1a[1],
          port_end_date = input$port_dates1a[2],
          initial_investment = (as.numeric(input$initial_investment))
        )
      )
    } else {
      return(
        get_pair_data(
          asset_1 = input$asset_1a,
          asset_2 = input$asset_2a, 
          port_start_date = input$port_dates1a[1],
          port_end_date = input$port_dates1a[2],
          initial_investment = (0)
        )
      )
    }
  })


Each of the visualizations is then produced via its own separate reactive function, each of which takes the main reactive as an input (as shown below). This makes it so that whenever the sidebar inputs are changed, all reactives fire and all visualizations are updated; however, if all that is changed is a single loess smoothing parameter input, only the reactive used in the creation of that particular parameter-dependent visualization fires, which makes for great computational efficiency.

 # Now the reactives for the actual visualizations
  output$portfolio_perf_chart <- 
    debounce(
      renderPlotly({
        data <- react_base_data()
        build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
      }), 
      millis = 2000) # sets wait time for debounce
  


Why Plotly?


Plotly vs. ggplot is always a fun subject for discussion among folks who build visualizations in R. Sometimes I feel like such discussions just devolve into the same type of argument as R vs. Python for data science (my answer to this question being just pick one and learn it well), but over time I have found that there are actually some circumstances where the plotly vs. ggplot debate can yield cleaner answers.

In particular, I have found in working on this particular type of analytic app that there are two areas where plotly has a bit of an advantage: clickable interactivity, and wide data.

Those familiar with ggplot will know that every good ggplot begins with long data. It is possible, via some functions, to transform wide data into a long format, but that transformation can sometimes be problematic. While there are essentially no circumstances in which it is impossible to transform wide data into long format, there are a handful of cases where it is excessively cumbersome: namely, when dealing with indexed xts objects (as shown below) or time series / OHLC-styled data.
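To make this concrete, here is an illustrative sketch (asset names and values made up) of plotting two columns of a wide, indexed xts object directly with plotly, with no reshaping into long format:

library(xts)
library(plotly)

dates <- seq(as.Date("2021-01-01"), by = "day", length.out = 30)
prices <- xts(cbind(BTC = 1000 + cumsum(rnorm(30)),
                    AMZN = 1000 + cumsum(rnorm(30))),
              order.by = dates)

# each wide column becomes its own trace; the rowname index supplies x
plot_ly(x = index(prices)) %>%
  add_lines(y = as.numeric(prices$BTC), name = "BTC") %>%
  add_lines(y = as.numeric(prices$AMZN), name = "AMZN")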




In these cases—either due to the sometimes-awkward way in which you have to handle rowname indexes in xts, or the time and code complexity saved by not having to transform every dataset into long format—plotly offers efficiency gains relative to ggplot.

The aforementioned efficiency gains are a reason to choose plotly in some cases because they make the life of the coder easier, but there are also reasons why it sometimes makes the life of the user easier as well.

If one of the primary utilities of a visualization is to allow the user the ability to seamlessly and intuitively zoom in on, select, or filter the data displayed, particularly in the context of a Shiny App, then plotly should be strongly considered. Sure, ggplotly wrappers can be used to make a ggplot interactive, but with an added layer of abstraction comes an added layer of possible errors. While in most cases a ggplotly wrapper should work seamlessly, I have found that, particularly in cases where auto-sizing and margin size specification is key, ggplotly can require a great deal of added code in order to work correctly in a Shiny context.
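For completeness, the wrapper approach mentioned above is a one-liner in the simplest case (a minimal illustration, not code from the app itself):

library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggplotly(p) # interactive, but with an added layer of abstraction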

In summary, when considering when to start with plotly vs. when to start with ggplot, I find one question to be particularly helpful: what do I value most—visual complexity and/or customization, or interactive versatility and/or preserving wide data?




If I choose the former, then ggplot is what I need; otherwise, I go with plotly. More often than not I find that ggplot emerges victorious, but even if you disagree with me in my decision-making calculus, I think it is helpful to at least think through what your personal calculus is. This will save you time when coding, as instead of playing around with various types of viz, you can simply pose the question(s) behind your calculus and know quickly what solution best fits your problem.

Why Formattable?


The case for formattable is, in my opinion, a much easier one to make than arguing for plotly vs. ggplot. The only question worth asking when deciding whether or not to use formattable in your app is: do I want my table to tell a quick story via added visual complexity within the same cell that contains my data, or is a reference table all I am looking for? If you chose the former, formattable is probably a good way to go. You'll notice as well that the case for formattable is very specific; in most cases there is likely a simpler solution via the DT or kableExtra packages.



The one downside that I have encountered in dealing with formattable code is the amount of code necessary to generate even moderately complicated tables. That said, this problem is easily remedied via a quick function that we can use to kill most of the duplicative coding, as seen in the example below.



First, here is the long form version:


  react_formattable <- reactive({
    return(
      formattable(react_port_summary_table(), 
                  list(
                    "Asset Portfolio Max Worth" = formatter("span",
                                                            style = x ~ style(
                                                              display = "inline-block",
                                                              direction = "rtl",
                                                              "border-radius" = "4px",
                                                              "padding-right" = "2px",
                                                              "background-color" = csscolor("darkslategray"),
                                                              width = percent(proportion(x)),
                                                              color = csscolor(gradient(x, "red", "green"))
                                                            )),
                    "Asset Portfolio Latest Worth" = formatter("span",
                                                               style = x ~ style(
                                                                 display = "inline-block",
                                                                 direction = "rtl",
                                                                 "border-radius" = "4px",
                                                                 "padding-right" = "2px",
                                                                 "background-color" = csscolor("darkslategray"),
                                                                 width = percent(proportion(x)),
                                                                 color = csscolor(gradient(x, "red", "green"))
                                                               )),
                    "Asset Portfolio Absolute Profit" = formatter("span",
                                                                  style = x ~ style(
                                                                    display = "inline-block",
                                                                    direction = "rtl",
                                                                    "border-radius" = "4px",
                                                                    "padding-right" = "2px",
                                                                    "background-color" = csscolor("darkslategray"),
                                                                    width = percent(proportion(x)),
                                                                    color = csscolor(gradient(x, "red", "green"))
                                                                  )),
                    "Asset Portfolio Rate of Return" = formatter("span",
                                                                 style = x ~ style(
                                                                   display = "inline-block",
                                                                   direction = "rtl",
                                                                   "border-radius" = "4px",
                                                                   "padding-right" = "2px",
                                                                   "background-color" = csscolor("darkslategray"),
                                                                   width = percent(proportion(x)),
                                                                   color = csscolor(gradient(x, "red", "green"))
                                                                 ))
                    
                  )
      )
      
    )
  })


This code can easily be shortened by defining a custom function, as shown below.


simple_formatter <- function(){
    formatter("span",
              style = x ~ style(
                display = "inline-block",
                direction = "rtl",
                "border-radius" = "4px",
                "padding-right" = "2px",
                "background-color" = csscolor("darkslategray"),
                width = percent(proportion(x)),
                color = csscolor(gradient(x, "red", "green"))
              ))
  }
  
  react_formattable <- reactive({
    return(
      formattable(react_port_summary_table(), 
                  list(
                    "Asset Portfolio Max Worth" = simple_formatter(),
                    "Asset Portfolio Latest Worth" = simple_formatter(),
                    "Asset Portfolio Absolute Profit" = simple_formatter(),
                    "Asset Portfolio Rate of Return" = simple_formatter()
                    )
                  )
      )
    })


As can be seen, formattable allows for a great deal of added complexity in crafting your table, though that complexity may not be suited to every app. That said, if you do want to quickly draw a user’s attention to something in a table, formattable is a great solution, and most of the repetitive details can be factored out into a function, as shown.
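If you want to go one step further, the list of formatters itself can be built programmatically, so adding another highlighted column becomes a one-line change. This is just a sketch of that idea, using the same column names and the simple_formatter() defined above:


  # Sketch: build the named list of formatters programmatically,
  # so each highlighted column is declared only once.
  highlight_cols <- c("Asset Portfolio Max Worth",
                      "Asset Portfolio Latest Worth",
                      "Asset Portfolio Absolute Profit",
                      "Asset Portfolio Rate of Return")

  react_formattable <- reactive({
    formattable(
      react_port_summary_table(),
      setNames(
        replicate(length(highlight_cols), simple_formatter(), simplify = FALSE),
        highlight_cols
      )
    )
  })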

Conclusions:


That was a lot, I know, but I hope this commentary, and the Asset Comparison Tool exemplar more generally, has helped to inform your understanding of how dashboards can serve as a helpful analytic tool. I also hope it has prompted some thoughts as to the best practices to follow when building such a tool. I’ll end with a quick tl;dr:


  • Shave complexity wherever possible, and make code as simple as possible by keeping the code for the app’s UI and inner mechanism (inputs, reactives, etc.) separate from the code for the analytic functions and visualizations.
  • Build with the most extreme cases in mind (think of how your most edge-case user might use the app, and ensure that behavior won’t break the app; see the validation sketch after this list).
  • Document, document, and then document some more. Make your README both a story and a manual.
  • Give Shiny Dashboard a shot if you want an easy-to-construct UI over which you don’t need complete control when it comes to visual design.
  • Pick your visualization packages based on what you want to prioritize for your user, not the other way around (this applies to ggplot, plotly, formattable, etc.).
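On the edge-case point, Shiny’s built-in validate()/need() helpers are a cheap way to keep unexpected input from breaking a reactive. A minimal sketch, where the input name input$assets and the helper build_summary() are hypothetical stand-ins for your own app’s names:


  output$summary_table <- renderTable({
    # Fail gracefully instead of erroring when nothing is selected.
    # (input$assets and build_summary() are hypothetical names.)
    validate(
      need(length(input$assets) > 0, "Please select at least one asset.")
    )
    build_summary(input$assets)
  })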

Thanks for reading!

Creating Interactive Charts with R, Shiny, MySQL and AnyChart JS via Template

How to create interactive charts with R, Shiny, MySQL and the AnyChart JS library via a template

Data visualization and charting are becoming an increasingly important part of web development. People perceive information much better when it is represented graphically rather than as raw numbers. As a result, business intelligence apps, reports, and so on widely implement graphs and charts to visualize and clarify data and, consequently, to speed up and facilitate its analysis for further decision making. While there are many ways to handle data visualization in R, today let’s see how to create interactive charts with the help of the popular JavaScript (HTML5) charting library AnyChart. It recently got an official R, Shiny and MySQL template that makes the whole process pretty straightforward and easy. (Disclaimer: I am the CTO at the AnyChart team. The template I am talking about here is released under the Apache 2.0 license; the library itself can be used for free in any personal, educational, and other non-profit projects and is open on GitHub, but for commercial purposes it requires a commercial license, though it is fully functional when taken on a free trial.) In this step-by-step tutorial, we will take a closer look at the template and the basic pie chart example it includes, and then I will show you how to quickly modify it to get a different data visualization, e.g. a 3D column (vertical bar) chart.

Briefly about AnyChart

AnyChart is a flexible, cross-browser JS charting library for adding interactive charts to websites and web apps in quite a simple way. It does not require any installation and works with any platform and database. Among its other features, AnyChart offers templates for popular technology stacks, like R, Shiny and MySQL in the present case, which further facilitate its integration.

Getting started

First of all, let’s make sure the R language is installed. If not, you can visit the official R website and follow the instructions. If you have worked with R before, you most likely already have RStudio; then you are welcome to create a project in it now, because the R part of this tutorial can be done there. If you do not currently have RStudio, you are welcome to install it from the official RStudio website. But, actually, using RStudio is not mandatory; a simple text editor such as Notepad will be enough in our case. After that, we should check that MySQL is properly installed. To do that, open a terminal window and enter the following command:

$ mysql --version
mysql  Ver 14.14 Distrib 5.7.16, for Linux (x86_64) using EditLine wrapper

You should receive the above response (or a similar one) to be sure all is well. Please follow these instructions to install MySQL if you do not have it at the moment. Now that all the required components have been installed, we are ready to write some code for our example.

Basic template

First, to download the R, Shiny and MySQL template for AnyChart, type the following command in the terminal:

$ git clone https://github.com/anychart-integrations/r-shiny-mysql-template.git

The folder you get features the following structure:

r-shiny-mysql-template/
  www/
    css/
      style.css          # css style
  app.R                  # main application code
  database_backup.sql    # MySQL database dump
  LICENSE
  README.md
  index.html             # html template

Let’s take a look at the project files and examine how this sample works. We’ll run the example first. Open the terminal and go to the repository folder:

$ cd r-shiny-mysql-template

Set up the MySQL database (to specify your username and password, make use of the -u and -p flags):

$ mysql < database_backup.sql

Then run the R command line:

$ R

And install the Shiny and RMySQL packages, initializing the Shiny library at the end:

> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)

If you face any problems during the installation of these dependencies, read the error messages carefully; e.g. you might need sudo apt-get install libmysqlclient-dev for installing RMySQL. Finally, run the application:

> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/r-shiny-mysql-template")

A new tab should open in your browser, showing the example included in the template:

[Figure: Interactive pie chart created with R and the AnyChart JS charting library; basic sample from the R, Shiny and MySQL integration template]

Basic template: code

Now, let’s go back to the folder with our template to see how it works. The files LICENSE and README.md contain information about the license and the template (how to run it, technologies, structure, etc.) respectively. They are not functionally important to our project, so we will not explore them here; please check them yourself for a general understanding. The style.css file is responsible for the styles of the page. The database_backup.sql file contains the code for creating the MySQL table and user and for writing data to the table. You can use your own table or change the data in this one. Let’s move on to the code. First, open the app.R file. This file ensures the connection to the MySQL database, reads the data, and passes it to the index.html file, which contains the main code using AnyChart. The following part of app.R contains the htmlTemplate function; here we specify the name of the file the data will be transmitted to, the titles of our page and chart, and the JSON-encoded chart data from the MySQL database:

htmlTemplate("index.html",
             title = "Anychart R Shiny template",
             chartTitle = shQuote("Top 5 fruits"),
             chartData = toJSON(loadData())
)

The main thing here is the index.html file, which is where the template for creating charts actually lives. The first part of this file simply connects all the necessary files to the code, including the AnyChart library, the CSS file with styles, and so on. I’ll skip this for now and proceed directly to the script tag and the anychart.onDocumentReady(function () {...}) function:

anychart.onDocumentReady(function() {
  var chart = anychart.pie({{ chartData }});
  chart.title({{ chartTitle }});
  chart.container("container");
  chart.draw();
});

This pattern works as follows. We create a pie chart using the pie() function and pass it the data that have already been read and prepared by the R code. Please note that the names of the variables containing the data are the same in the app.R and index.html files. Then we set the chart title via chart.title({{ chartTitle }}) and specify the ID of the element that will contain the chart, which is a div with id = "container" in this case. Finally, to render everything that was coded, we call chart.draw().
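For orientation, here is a minimal sketch of what the surrounding app.R plausibly looks like when assembled. The connection details mirror those used in the ui.R listing further below, and the LIMIT 5 query is an assumption based on the “Top 5 fruits” title, so check the actual file in the repository:

library(shiny)
library(RMySQL)
library(jsonlite)

# Read the chart data from MySQL. The connection details are assumptions
# mirroring the ui.R listing below; the real template file may differ.
loadData <- function() {
  db <- dbConnect(MySQL(), dbname = "anychart_db", host = "localhost",
                  port = 3306, user = "anychart_user",
                  password = "anychart_pass")
  on.exit(dbDisconnect(db))
  dbGetQuery(db, "SELECT name, value FROM fruits LIMIT 5")
}

ui <- htmlTemplate("index.html",
                   title = "Anychart R Shiny template",
                   chartTitle = shQuote("Top 5 fruits"),
                   chartData = toJSON(loadData()))

server <- function(input, output) {}

shinyApp(ui = ui, server = server)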

Modifying the template to create a custom chart

Now that we’ve explored the basic example included in the template, we can move forward and create our own custom interactive chart. To do that, we simply need to change the template a little and add some features if needed. Let’s see how it works. First, we create all the necessary files ourselves or make a new project using RStudio. Second, we add a project folder named anychart. Its structure should look as illustrated below; some differences are possible (and acceptable) if you are using a new project in RStudio.

anychart/
  www/
    css/
      style.css          # css style
  ui.R                   # main application code
  server.R               # sub code
  database_backup.sql    # data set
  index.html             # html template

Now you know what files you need. If you’ve made a project with RStudio, the ui.R and server.R files are created automatically. If you’ve made the project by yourself, just create empty files with the same names and extensions as specified above. The main difference from the original example included in the template is that we should change the file index.html and divide app.R into parts. You can copy the rest of the files or create new ones for your own chart. Please take a look at the file server.R. If you’ve made a project using RStudio, it was created automatically and you don’t need to change anything in it. However, if you’ve made it yourself, open it in a text editor and add the standard Shiny framework code shown in the server.R listing at the end of this tutorial. You can read more about that here. The file structure of ui.R is similar to that of app.R, so you can copy app.R from the template and change/add the following lines:

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

#data preparation
data1 <- character()
for(var in 1:nrow(loadData)){
  c = c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)

ui = function(){
  htmlTemplate("index.html",
               title = "Anychart R Shiny template",
               chartTitle = shQuote("Fruits"),
               chartData = toJSON(data)
  )}

Since we are going to change the chart type from pie to 3D vertical bar (column), the data needs some preparation before being passed to index.html. The main difference is that we will use the entire data set from the database, not just the top 5 positions. We will slightly modify and expand the basic template. Let’s look at the resulting code of index.html first (the script tag) and then explore it:

anychart.onDocumentReady(function() {
  var chart = anychart.column3d({{ chartData }});
  chart.title({{ chartTitle }});
  chart.animation(true);
  var xAxis = chart.xAxis();
  xAxis.title("fruits");
  var yAxis = chart.yAxis();
  yAxis.title("pounds, t");
  var yScale = chart.yScale();
  yScale.minimum(0);
  yScale.maximum(120);
  chart.container("container");
  chart.draw();
});

With var chart = anychart.column3d({{ chartData }}), we create a 3D column chart using the column3d() function. Here you can choose any other chart type you need; consider getting help from Chartopedia if you are unsure which one works best in your situation. Next, we add animation to the column chart via chart.animation(true) to make it appear gradually on page load. In the following section, we create two variables, xAxis and yAxis. Including these is required if you want to provide the chart’s coordinate axes with captions: create variables holding the X and Y axes, and then use the title() function to set the captions you want to see.
The next unit is basically optional. We are explicitly specifying the maximum and minimum values for the Y axis; otherwise AnyChart will calculate these values itself. You can do the same for the X axis. And that’s it! Our 3D column chart is ready, and all seems to be fine for successfully running the code. The only thing left to do is to change the data in the MySQL table (i.e. the VALUES list written by database_backup.sql) to look as follows:

('apple',100), ('orange',58), ('banana',81), ('lemon',42), ('melon',21), ('kiwi',66),
('mango',22), ('pear',48), ('coconut',29), ('cherries',65), ('grapes',31), ('strawberries',76)

To see what you’ve got, follow the same steps as for running the R, Shiny and MySQL template example, but do not forget to change the path and the folder name to anychart. So, let’s open the terminal and run the following:

$ cd anychart
$ mysql < database_backup.sql
$ R
> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)
> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/anychart")

[Figure: Interactive 3D column chart made with R and the AnyChart JS charting library, based on the R, Shiny and MySQL integration template]

For consistency purposes, I am including the code of ui.R and server.R below. The full source code of this example can be found on GitHub.

ui.R:

library(shiny)
library(RMySQL)
library(jsonlite)

data1 <- character()
db = dbConnect(MySQL(),
               dbname = "anychart_db",
               host = "localhost",
               port = 3306,
               user = "anychart_user",
               password = "anychart_pass")
loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

#data preparation
for(var in 1:nrow(loadData)){
  c = c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)

server = function(input, output){}

ui = function(){
  htmlTemplate("index.html",
               title = "Anychart R Shiny template",
               chartTitle = shQuote("Fruits"),
               chartData = toJSON(data)
  )}

shinyApp(ui = ui, server = server)
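As an aside, the data-preparation loop in ui.R can be written more compactly. The following one-liner is a sketch that produces the same two-column character matrix:

# Sketch: equivalent of the for-loop above in a single step.
# cbind() of two character vectors yields the same (name, value) matrix.
data = cbind(as.character(loadData$name), as.character(loadData$value))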

server.R:

library(shiny)

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    # generate bins based on input$bins from ui.R
    x <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    # draw the chart with the specified number of bins
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
})

Conclusion

When your technology stack includes R, Shiny and MySQL, using AnyChart JS with the integration template discussed in this tutorial requires no big effort and lets you add beautiful interactive JavaScript-based charts to your web apps quite quickly. It is also worth mentioning that you can customize the look and feel of charts created this way as deeply as needed by using the library’s numerous out-of-the-box features: add or remove axis labels, change the background color and how the axes are positioned, leverage interactivity, and so on. The scope of this tutorial is actually even broader, because the process described here not only applies to the AnyChart JS charting library but is also mostly the same for its sister libraries AnyMap (geovisualization in maps), AnyStock (date/time graphs), and AnyGantt (charts for project management). All of them are free for non-profit projects but, I must put it clearly here again just in case, require a special license for commercial use. I hope you find this article helpful in your work with interactive data visualization in R. If you have any questions, please ask.