A single loop is not enough. A collection of hello world control structures

As the post on “hello world” functions has been quite appreciated by the R community, here follows the second round of functions for wannabe R programmer.

# If else statement:
# See the code syntax below for if else statement
print(“x is greater than 1”)
print(“x is less than 1”)

# See the code below for nested if else statement

if(x>1 & x<7){
print(“x is between 1 and 7”)} else if(x>8 & x< 15){
print(“x is between 8 and 15”)

# For loops:
# Below code shows for loop implementation
x = c(1,2,3,4,5)
for(i in 1:5){

# While loop :
# Below code shows while loop in R
x = 2.987
while(x <= 4.987) {
x = x + 0.987

# Repeat Loop:
# The repeat loop is an infinite loop and used in association with a break statement.

# Below code shows repeat loop:
a = 1
a = a+1
if (a > 4) {

# Break statement:
# A break statement is used in a loop to stop the iterations and flow the control outside of the loop.

#Below code shows break statement:
x = 1:10
for (i in x){
if (i == 6){

# Next statement:
# Next statement enables to skip the current iteration of a loop without terminating it.

#Below code shows next statement
x = 1: 4
for (i in x) {
if (i == 2){

# function

words = c(“R”, “datascience”, “machinelearning”,”algorithms”,”AI”)
words.names = function(x) {
for(name in x){

words.names(words) # Calling the function

# extract the elements above the main diagonal of a (square) matrix
# example of a correlation matrix

cor_matrix <- matrix(c(1, -0.25, 0.89, -0.25, 1, -0.54, 0.89, -0.54, 1), 3,3)
rownames(cor_matrix) <- c(“A”,”B”,”C”)
colnames(cor_matrix) <- c(“A”,”B”,”C”)

rho <- list()
name <- colnames(cor_matrix)
var1 <- list()
var2 <- list()
for (i in 1:ncol(cor_matrix)){
for (j in 1:ncol(cor_matrix)){
if (i != j & i<j){
rho <- c(rho,cor_matrix[i,j])
var1 <- c(var1, name[i])
var2 <- c(var2, name[j])

d <- data.frame(var1=as.character(var1), var2=as.character(var2), rho=as.numeric(rho))

var1 var2 rho
1 A B -0.25
2 A C 0.89
3 B C -0.54

As programming is the best way to learn and think, have fun programming awesome functions!

This post is also shared in R-bloggers and LinkedIn

Azure Machine Learning For R Practitioners With The R SDK

[This article was first published on the Azure Medium channel, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

As probably you already know, Microsoft provided its Azure Machine Learning SDK for Python to build and run machine learning workflows, helping organizations to use massive data sets and bring all the benefits of the Azure cloud to machine learning.

Although Microsoft initially invested in R as the Advanced Analytics preferred language introducing the SQL Server R server and R services in the 2016 version, they abruptly shifted their attention to Python, investing exclusively on it. This basically happened for the following reasons:

  • Python’s simple syntax and readability make the language accessible to non-programmers
  • The most popular machine learning and deep learning open source libraries (such as Pandas, scikit-learn, TensorFlow, PyTorch, etc.) are deeply used by the Python community
  • Python is a better choice for productionalization: it’s relatively very fast; it implements OOPs concepts in a better way; it is scalable (Hadoop/Spark); it has better functionality to interact with other systems; etc.

Azure ML Python SDK Main Key Points

One of the most valuable aspects of the Python SDK is its ease to use and flexibility. You can simply use just few classes, injecting them into your existing code or simply referring to your script files into method calls, in order to accomplish the following tasks:

  • Explore your datasets and manage their lifecycle
  • Keep track of what’s going on into your machine learning experiments using the Python SDK tracking and logging features
  • Transform your data or train your models locally or using the best cloud computation resources needed by your workloads
  • Register your trained models on the cloud, package them into container image and deploy them on web services hosted in Azure Container Instances or Azure Kubernetes Services
  • Use Pipelines to automate workflows of machine learning tasks (data transformation, training, batch scoring, etc.)
  • Use automated machine learning (AutoML) to iterate over many combinations of defined data transformation pipelines, machine learning algorithms and hyperparameter settings. It then finds the best-fit model based on your chosen performance metric.

In summary, the scenario is the following one:

AI/ML lifecycle using Azure ML
fig. 1 — AI/ML lifecycle using Azure ML

What About The R Community Engagement?

In the last 3 years Microsoft pushed a lot over the Azure ML Python SDK, making it a stable product and a first class citizen of the Azure cloud. But they seem to have forgotten all the R professionals who developed a huge amount of data science project all around the world.

We must not forget that in Analytics and Data Science the key of success of a project is to quickly try out a large number of analytical tools and find what’s the best one for the case in analysis. R was born for this reason. It has a lot of flexibility when you want to work with data and build some model, because it has tons of packages and easy of use visualization functionality. That’s why a lot of Analytics projects are developed using R by many statisticians and data scientists.

Fortunately in the last months Microsoft extended a hand to the R community, releasing a new project called Azure Machine Learning R SDK.

Can I Use R To Spin The Azure ML Wheels?

Starting from October 2019 Microsoft released a R interface for Azure Machine Learning SDK on GitHub. The idea behind this project is really straightforward. The Azure ML Python SDK is a way to simplify the access and the use of the Azure cloud storage and computation for machine learning purposes keeping the main code as the one a data scientist developed on its laptop.

Why not allow the Azure ML infrastructure to run also R code (using proper “cooked” Docker images) and let R data scientists call the Azure ML Python SDK methods using R functions?

The interoperability between Python and R is obtained thanks to reticulate. So, once the Python SDK module azureml is imported into any R environment using the import function, functions and other data within the azureml module can be accessed via the $ operator, like an R list.

Obviously, the machine hosting your R environment must have Python installed too in order to make the R SDK work properly.

Let’s start to configure your preferred environment.

Set Up A Development Environment For The R SDK

There are two option to start developing with the R SDK:

  1. Using an Azure ML Compute Instance (the fastest way, but not the cheaper one!)
  2. Using your machine (laptop, VM, etc.)

Set Up An Azure ML Compute Instance

Once you created an Azure Machine Learning Workspace through the Azure portal (a basic edition is enough), you can access to the brand new Azure Machine Learning Studio. Under the Compute section, you can create a new Compute Instance, choosing its name and its sizing:

Create an Azure ML Compute Instance
fig. 2 — Create an Azure ML Compute Instance

The advantage of using a Compute Instance is that the most used software and libraries by data scientists are already installed, including the Azure ML Python SDK and RStudio Server Open Source Edition. That said, once your Compute Instance is started, you can connect to RStudio using the proper link:

Launch RStudio from a started Compute Intstance
fig. 3 — Launch RStudio from a started Compute Intstance

At the end of your experimentation, remember to shut down your Compute Instance, otherwise you’ll be charged according to the chosen plan:

Remember to shut down your Compute Instance
fig. 4 — Remember to shut down your Compute Instance

Set Up Your Machine From Scratch

First of all you need to install the R engine from CRAN or MRAN. Then you could also install RStudio Desktop, the preferred IDE of R professionals.

The next step is to install Conda, because the R SDK needs to bind to the Python SDK through reticulate. If you really don’t need Anaconda for specific purposes, it’s recommended to install a lightweight version of it, Miniconda. During its installation, let the installer add the conda installation of Python to your PATH environment variable.

Install The R SDK

Open your RStudio, simply create a new R script (File → New File → R Script) and install the last stable version of Azure ML R SDK package (azuremlsdk) available on CRAN in the following way:


If you want to install the latest committed version of the package from GitHub (maybe because the product team has fixed an annoying bug), you can instead use the following function:

During the installation you could get this error:

Timezone error
fig. 5 — Timezone error

In this case, you just need to set the TZ environment variable with your preferred timezone:


Then simply re-install the R SDK.

You may also be asked to update some dependent packages:

Dependent packages to be updated
fig. 6 — Dependent packages to be updated

If you don’t have any requirement about dependencies in your project, it’s always better to update them all (put focus on the prompt in the console; press 1; press enter).

If you are on your Compute Instance and you get a warning like the following one:

Warning about non-system installation of Python
fig. 7 — Warning about non-system installation of Python

just put the focus on the console and press “n”, since the Compute Instance environment already has a Conda installation. Microsoft engineers are already investigating on this issue.

You need then to install the Azure ML Python SDK, otherwise your azuremlsdk R package won’t work. You can do that directly from RStudio thanks to an azuremlsdk function:

azuremlsdk::install_azureml(remove_existing_env = TRUE)

The remove_existing_env parameter set to TRUE will remove the default Azure ML SDK environment r-reticulate if previously installed (it’s a way to clean up a Python SDK installation).

Just keep in mind that in this way you’ll install the version of the Azure ML Python SDK expected by your installed version of the azuremlsdk package. You can check what version you will install putting the cursor over the install_azureml function and visualizing the code definition clicking F2:

install_azureml code definition
fig. 8 — install_azureml code definition

Sometimes there are new feature and fixes on the latest version of the Python SDK. If you need to install it, first check what version is available on this link:

Azure ML Python SDK latest version
fig. 9 — Azure ML Python SDK latest version

Then use that version number in the following code:

azuremlsdk::install_azureml(version = "1.2.0", remove_existing_env = TRUE)

Sometimes you may need to install an updated version of a single component of the Azure ML Python SDK to test, for example new features. Supposing you want to update the Azure ML Data Prep SDK, here the code you could use:

reticulate::py_install("azureml-dataprep==1.4.2", envname = "r-reticulate", pip = TRUE)

In order to check if the installation is working correctly, try this:


It should return something like this:

Checking that azuremlsdk is correctly installed
fig. 10 — Checking that azuremlsdk is correctly installed

Great! You’re now ready to spin the Azure ML wheels using your preferred programming language: R!


After a long period during which Microsoft focused exclusively on Python SDK to enable data scientists to benefit from Azure computing and storage services, they recently released the R SDK too. This article focuses on the steps needed to install the Azure Machine Learning R SDK on your preferred environment.

Next articles will deal with the R SDK main capabilities.

Using bwimge R package to describe patterns in images of natural structures

This tutorial illustrates how to use the bwimge R package (Biagolini-Jr 2019) to describe patterns in images of natural structures. Digital images are basically two-dimensional objects composed by cells (pixels) that hold information of the intensity of three color channels (red, green and blue). For some file formats (such as png) another channel (the alpha channel) represents the degree of transparency (or opacity) of a pixel. If the alpha channel is equal to 0 the pixel will be fully transparent, if the alpha channel is equal to 1 the pixel will be fully opaque. Bwimage’s images analysis is based on transforming color intensity data to pure black-white data, and transporting the information to a matrix where it is possible to obtain a series of statistics data. Thus, the general routine of bwimage image analysis is initially to transform an image into a binary matrix, and secondly to apply a function to extract the desired information. Here, I provide examples and call attention to the following key aspects: i) transform an image to a binary matrix; ii) introduce distort images function; iii) demonstrate examples of bwimage application to estimate canopy openness; and iv) describe vertical vegetation complexity. The theoretical background of the available methods is presented in Biagolini & Macedo (2019) and in references cited along this tutorial. You can reproduce all examples of this tutorial by typing the given commands at the R prompt. All images used to illustrate the example presented here are in public domain. To download images, check out links in the Data availability section of this tutorial. Before starting this tutorial, make sure that you have installed and loaded bwimage, and all images are stored in your working directory.

install.packages("bwimage") # Download and install bwimage 
library("bwimage") # Load bwimage package 
setwd(choose.dir()) # Choose your directory. Remember to stores images to be analyzed in this folder. 

Transform an image to a binary matrix

Transporting your image information to a matrix is the first step in any bwimage analysis. This step is critical for high quality analysis. The function threshold_color can be used to execute the thresholding process; with this function the averaged intensity of red, green and blue (or only just one channel if desired) is compared to a threshold (argument threshold_value). If the average intensity is less than the threshold (default is 50%) the pixel will be set as black, otherwise it will be white. In the output matrix, the value one represents black pixels, zero represents white pixels and NA represents transparent pixels. Figure 1 shows a comparison of threshold output when using all three channels in contrast to using just one channel (i.e. the effect of change argument channel).

Figure 1. The effect of using different color channels for thresholding a bush image. Figure A represents the original image. Figures B, C, D, and E, represent the output using all three channels, and just red, green and blue channels, respectively.
You can reproduce the threshold image by following the code:
# RGB comparassion
bush_rgb=threshold_color(imagename, channel = "rgb")
bush_r=threshold_color(imagename, channel = "r")
bush_g=threshold_color(imagename, channel = "g")
bush_b=threshold_color(imagename, channel = "b")

par(mfrow = c(2, 2), mar = c(0,0,0,0))
image(t(bush_rgb)[,nrow(bush_rgb):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_r)[,nrow(bush_r):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_g)[,nrow(bush_g):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_b)[,nrow(bush_b):1], col = c("white","black"), xaxt = "n", yaxt = "n")
In this first example, the overall variations in thresholding are hard to detect with a simple visual inspection. This is because the way images were produced create a high contrast between the vegetation and the white background. Later in this tutorial, more information about this image will be presented. For a clear visual difference in the effect of change argument channel, let us repeat the thresholding process with two new images with more extreme color channel contrasts: sunflower (Figure 2), and Brazilian flag (Figure 3).

Figure 2. The effect of using different color channels for thresholding a sunflower image. Figure A represents the original image. Figures B, C, D, and E, represent the output using all three channels, and just red, green and blue, respectively.
Figure 3. The effect of using different color channels for thresholding a Brazilian flag image. Figure A represents the original image. Figures B, C, D, and E, represent the output using all three channels, and just red, green and blue, respectively.
You can reproduce the thresholding output of images 2 and 3, by changing the first line of the previous code for the following codes, and just follow the remaining code lines.
file_name="sunflower.JPG" # for figure 2
file_name="brazilian_flag.JPG" # for figure 03
Another important parameter that can affect output quality is the threshold value used to define if the pixel must be converted to black or white (i.e. the argument threshold_value in function threshold_color). Figure 4 compares the effect of using different threshold limits in the threshold output of the same bush image processed above.

Illustrate tutorial examples
Figure 4 Comparison of different threshold values (i.e. threshold_value argument) to threshold a bush image. In this example, all color channels were considered, and thresholding values selected for images A to H, were 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8 and 0.9, respectively.
You can reproduce the threshold image with the following code:
# Threshold value comparassion
bush_1=threshold_color(file_name, threshold_value = 0.1)
bush_2=threshold_color(file_name, threshold_value = 0.2)
bush_3=threshold_color(file_name, threshold_value = 0.3)
bush_4=threshold_color(file_name, threshold_value = 0.4)
bush_5=threshold_color(file_name, threshold_value = 0.5)
bush_6=threshold_color(file_name, threshold_value = 0.6)
bush_7=threshold_color(file_name, threshold_value = 0.7)
bush_8=threshold_color(file_name, threshold_value = 0.8)

par(mfrow = c(4, 2), mar = c(0,0,0,0))
image(t(bush_1)[,nrow(bush_1):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_2)[,nrow(bush_2):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_3)[,nrow(bush_3):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_4)[,nrow(bush_4):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_5)[,nrow(bush_5):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_6)[,nrow(bush_6):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_7)[,nrow(bush_7):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bush_8)[,nrow(bush_8):1], col = c("white","black"), xaxt = "n", yaxt = "n")
The bwimage package s threshold algorithm (described above) provides a simple, powerful and easy to understand process to convert colored images to a pure black and white scale. However, this algorithm was not designed to meet specific demands that may arise according to user applicability. Users interested in specific algorithms can use others R packages, such as auto_thresh_mask (Nolan 2019), to create a binary matrix to apply bwimage function. Below, we provide examples of how to apply four algorithms (IJDefault, Intermodes, Minimum, and RenyiEntropy) from the auto_thresh_mask function (auto_thresh_mask package – Nolan 2019), and use it to calculate vegetation density of the bush image (i.e. proportion of black pixels in relation to all pixels). I repeated the same analysis using bwimage algorithm to compare results. Figure 5 illustrates differences between image output from algorithms.
# read tif image
img = ijtiff::read_tif("VD01.tif")

#  IJDefault 
IJDefault_mask= auto_thresh_mask(img, "IJDefault")
IJDefault_matrix = 1*!IJDefault_mask[,,1,1]
# 0.1216476

#  Intermodes 
Intermodes_mask= auto_thresh_mask(img, "Intermodes")
Intermodes_matrix = 1*!Intermodes_mask[,,1,1]
# 0.118868

#  Minimum 
Minimum_mask= auto_thresh_mask(img, "Minimum")
Minimum_matrix = 1*!Minimum_mask[,,1,1]
# 0.1133822

#  RenyiEntropy 
RenyiEntropy_mask= auto_thresh_mask(img, "RenyiEntropy")
RenyiEntropy_matrix = 1*!RenyiEntropy_mask[,,1,1]
# 0.1545827

# bWimage
# 0.1398836

The calculated vegetation density for each algorithm was:

Algorithm Vegetation density
IJDefault 0.1334882
Intermodes 0.1199355
Minimum 0.1136603
RenyiEntropy 0.1599628
Bwimage 0.1397852

For a description of each algorithms, check out the documentation of function auto_thresh_mask and its references.


Illustrate tutorial examples
Figure 5 Comparison of thresholding output from the bush image using five algorithms. Image A represents the original image, and images from letters B to F, represent the output from thresholding of bwimage, IJDefault, Intermodes, Minimum, and RenyiEntropy algorithms, respectively.
You can reproduce the threshold image with the following code:
par(mar = c(0,0,0,0)) ## Remove the plot margin
image(t(bw_matrix)[,nrow(bw_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(bw_matrix)[,nrow(bw_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(IJDefault_matrix)[,nrow(IJDefault_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(Intermodes_matrix)[,nrow(Intermodes_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(Minimum_matrix)[,nrow(Minimum_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(RenyiEntropy_matrix)[,nrow(RenyiEntropy_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
If you applied the above functions, you may have noticed that high resolution images imply in large R objects that can be computationally heavy (depending on your GPU setup). The argument compress_method from threshold_color and threshold_image_list functions can be used to reduce the output matrix. It reduces GPU usage and time necessary to run analyses. But it is necessary to keep in mind that by reducing resolution the accuracy of data description will be lowered. To compare different resamplings, from a figure of 2500×2500 pixels, check out figure 2 from Biagolini-Jr and Macedo (2019)
The available methods for image reduction are: i) frame_fixed, which resamples images to a desired target width and height; ii) proportional, which resamples the image by a given ratio provided in the argument “proportion”; iii) width_fixed, which resamples images to a target width, and also reduces the image height by the same factor. For instance, if the original file had 1000 pixels in width, and the new width_was set to 100, height will be reduced by a factor of 0.1 (100/1000); and iv) height_fixed, analogous to width_fixed, but assumes height as reference.

Distort images function

In many cases image distortion is intrinsic to image development, for instance global maps face a trade-off between distortion and the total amount of information that can be presented in the image. The bwimage package has two functions for distorting images (stretch and compress functions) which allow allow application of four different algorithms for mapping images, from circle to square and vice versa. Algorithms were adapted from Lambers (2016). Figure 6 compares image distortion of two images using stretch and compress functions, and all available algorithms.

Illustrate tutorial examples
Figure 6. Overview differences in the application of two distortion functions (stretch and compress) and all available algorithms.
You can reproduce distortion images with the following the code:
# Distortion images

## Compress
# chesstablet_matrix

# target_matrix

## stretch
# chesstablet_matrix

# target_matrix

# Plot
par(mfrow = c(4,5), mar = c(0,0,0,0))
image(t(chesstablet_matrix)[,nrow(chesstablet_matrix):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_cmr)[,nrow(comp_cmr):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_cms)[,nrow(comp_cms):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_cmq)[,nrow(comp_cmq):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_cme)[,nrow(comp_cme):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(target_matrix)[,nrow(target_matrix):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_tmr)[,nrow(comp_tmr):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_tms)[,nrow(comp_tms):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_tmq)[,nrow(comp_tmq):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(comp_tme)[,nrow(comp_tme):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(chesstablet_matrix)[,nrow(chesstablet_matrix):1], col = c("white","bisque","black"), xaxt = "n", yaxt = "n")
image(t(stre_cmr)[,nrow(stre_cmr):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_cms)[,nrow(stre_cms):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_cmq)[,nrow(stre_cmq):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_cme)[,nrow(stre_cme):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(target_matrix)[,nrow(target_matrix):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_tmr)[,nrow(stre_tmr):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_tms)[,nrow(stre_tms):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_tmq)[,nrow(stre_tmq):1], col = c("white","black"), xaxt = "n", yaxt = "n")
image(t(stre_tme)[,nrow(stre_tme):1], col = c("white","black"), xaxt = "n", yaxt = "n")

Application examples

Estimate canopy openness

Canopy openness is one of the most common vegetation parameters of interest in field ecology surveys. Canopy openness can be calculated based on pictures on the ground or by an aerial system e.g. (Díaz and Lencinas 2018). Next, we demonstrate how to estimate canopy openness, using a picture taken on the ground. The photo setup is described in Biagolini-Jr and Macedo (2019). Canopy closure can be calculated by estimating the total amount of vegetation in the canopy. Canopy openness is equal to one minus the canopy closure. You can calculate canopy openness for the canopy image example (provide by bwimage package) using the following code:
canopy=system.file("extdata/canopy.JPG",package ="bwimage")
canopy_matrix=threshold_color(canopy,"jpeg", compress_method="proportional",compress_rate=0.1)
1-denseness_total(canopy_matrix) # canopy openness
For users interested in deeper analyses of canopy images, I also recommend the caiman package.

Describe vertical vegetation complexity

There are several metrics to describe vertical vegetation complexity that can be performed using a picture of a vegetation section against a white background, as described by Zehm et al. (2003). Part of the metrics presented by these authors were implemented in bwimage, and the following code shows how to systematically extract information for a set of 12 vegetation pictures. A description of how to obtain a digital image for the following methods is presented in Figure 7.

Illustrate tutorial examples
Figure 7. Illustration of setup to obtain a digital image for vertical vegetation complexity analysis. A vegetation section from a plot of 30 x 100 cm (red line), is photographed against a white cloth panel of 100 x 100 cm (yellow line) placed perpendicularly to the ground on the 100 cm side of the plot. A plastic canvas of 50x100cm (white line) was used to lower the vegetation along a narrow strip in front of a camera positioned on a tripod at a height of 45 cm (blue line).
As illustrated above, the first step to analyze images is to convert them into a binary matrix. You can use the function threshold_image_list to create a list for holding all binary matrices.
files_names= c("VD01.JPG", "VD02.JPG", "VD03.JPG", "VD04.JPG", "VD05.JPG", "VD06.JPG", "VD07.JPG", "VD08.JPG", "VD09.JPG", "VD10.JPG", "VD11.JPG", "VD12.JPG")

image_matrix_list=threshold_image_list(files_names, filetype = "jpeg",compress_method = "frame_fixed",target_width = 500,target_height=500)
Once you obtain the list of matrices, you can use a loop or apply family functions to extract information from all images and save them into objects or a matrix. I recommend storing all image information in a matrix, and exporting this matrix as a csv file. It is easier to transfer information to another database software, such as an excel sheet. Below, I illustrate how to apply functions denseness_total, heigh_propotion_test, and altitudinal_profile, to obtain information on vegetation density, a logical test to calculate the height below which 75% of vegetation denseness occurs, and the average height of 10 vertical image sections and its SD (note: sizes expressed in cm).
colnames(answer_matrix)=c("denseness", "heigh 0.75", "altitudinal mean", "altitudinal SD")
# Loop to analyze all images and store values in the matrix
for(i in 1:length(image_matrix_list)){
  answer_matrix[i,2]=heigh_propotion_test(image_matrix_list[[i]],proportion=0.75, height_size= 100)
  answer_matrix[i,3]=altitudinal_profile(image_matrix_list[[i]],n_sections=10, height_size= 100)[[1]]
  answer_matrix[i,4]=altitudinal_profile(image_matrix_list[[i]],n_sections=10, height_size= 100)[[2]]
Finally, we analyze information of holes data (i.e. vegetation gaps), in 10 image lines equally distributed among image (Zehm et al. 2003). For this purpose, we use function altitudinal_profile. Sizes are expressed in number of pixels.
# set a number of samples
# create a matrix to receive calculated values
colnames(answer_matrix2)=c("Image name", "heigh", "N of holes", "Mean size", "SD","Min","Max")

# Loop to analyze all images and store values in the matrix
for(i in 1:length(image_matrix_list)){
  for(k in 1:nsamples){
  line_heigh= k* length(image_matrix_list[[i]][,1])/nsamples
  aux=hole_section_data(image_matrix_list[[i]][line_heigh,] )
  answer_matrix2[((i-1)*nsamples)+k ,1]=files_names[i]
  answer_matrix2[((i-1)*nsamples)+k ,2]=line_heigh
  answer_matrix2[((i-1)*nsamples)+k ,3:7]=aux

write.table(answer_matrix2, file = "Image_data2.csv", sep = ",", col.names = NA, qmethod = "double")

Data availability

To download images access: https://figshare.com/articles/image_examples_for_bwimage_tutorial/9975656


Biagolini-Jr C (2019) bwimage: Describe Image Patterns in Natural Structures. https://cran.r-project.org/web/packages/bwimage/index.html

Biagolini-Jr C, Macedo RH (2019) bwimage: A package to describe image patterns in natural structures. F1000Research 8 https://f1000research.com/articles/8-1168

Díaz GM, Lencinas JD (2018) Model-based local thresholding for canopy hemispherical photography. Canadian Journal of Forest Research 48:1204-1216 https://www.nrcresearchpress.com/doi/abs/10.1139/cjfr-2018-0006

Lambers M (2016) Mappings between sphere, disc, and square. Journal of Computer Graphics Techniques Vol 5:1-21 http://jcgt.org/published/0005/02/01/paper-lowres.pdf

Nolan R (2019) autothresholdr: An R Port of the ‘ImageJ’ Plugin ‘Auto Threshold. https://cran.r-project.org/web/packages/autothresholdr/

Zehm A, Nobis M, Schwabe A (2003) Multiparameter analysis of vertical vegetation structure based on digital image processing. Flora-Morphology, Distribution, Functional Ecology of Plants 198:142-160 https://doi.org/10.1078/0367-2530-00086

Does imputing model labels using the model predictions can improve it’s performance?

In some scenarios a data scientist may want to train a model for which there exists an abundance of observations, but only a small fraction of is labeled, making the sample size available to train the model rather small. Although there’s plenty of literature on the subject (e.g. “Active learning”, “Semi-supervised learning” etc) one may be tempted (maybe due to fast approaching deadlines) to train a model with the labelled data and use it to impute the missing labels.

While for some the above suggestion might seem simply incorrect, I have encountered such suggestions on several occasions and had a hard time refuting them. To make sure it wasn’t just the type of places I work at I went and asked around in 2 Israeli (sorry non Hebrew readers) machine learning oriented Facebook groups about their opinion: Machine & Deep learning Israel and Statistics and probability group. While many were referring me to methods discussed in the literature, almost no one indicated the proposed method was utterly wrong. I decided to perform a simulation study to get a definitive answer once and for all. If you’re interested in reading what were the results see my analysis on Github.

Reproducible development with Rmarkdown and Github

I’m pretty sure most readers of this blog are already familiar with Rmarkdown and Github. In this post I don’t pretend to invent the wheel but rather give a quick run-down of how I set-up and use these tools to produce high quality and scalable (in human time) reproducible data science development code.


While data science processes usually don’t involve the exact same workflows like software development (for which Git was originally intended) I think Git is actually very well suited to the iterative nature of data-science tasks. When walking down different avenues in the exploration path, it’s worth while to have them reside in different branches. That way instead of jotting down in general pointers what you did along with some code snippets in some text file (or god-forbid word when you want to have images as well) you can instead go back to the relevant branch, see the different iterations and read a neat report with code and images. You can even re-visit ideas that didn’t make it into the master branch. Be sure to use informative branch names and commit messages!

Below is in illustration of how that process might look like:

Using Github allows one to easily package his code, supporting files etc (using repos) and share it with fellow researches, which can in turn clone the repo, re-run the code and go through all the development iterations without a hassle.


Most people familiar with Rmarkdown know it’s a great tool to write neat reports in all sorts of formats (html, PDF and even word!). One format that really makes it a great combo with Github is the github_document format. While one can’t view HTML files on Github, the output file from a github_document knit is an .md file which renders perfectly well on github, supporting images, tables, math, table of contents and many other. What some may not realize is that Rmarkdown is also a great development tool in itself. It behaves much like the popular Jupiter notebooks, with plots, tables and equations showing next to the code that generated them. What’s more, it has tons of cool features that really support reproducible development such as:
  • The first r-chunk (labled “setup” in the Rstudio template) always runs once when you execute code within chunks following it (pressing ctrl+Enter). It’s handy to load all packages used in later chucks (I like installing missing ones too) in this chunk such that whenever you run code within any of the chunks below it the needed packages are loaded.
  • When running code from within a chunk (pressing ctrl+Enter) the working directory will always be the one which the .Rmd file is located at. In short this means no more worrying about setting the working directory – be it when working on several projects simultaneously or when cloning a repo from Github.
  • It has many cool code execution tools such as a button to run code in all chunks up to the current one, run all code in the current chunk and it has a green progress bar so you don’t get lost too!
  • If your script is so long that scrolling around it becomes tedious, you can use this neat feature in Rstudio:  When viewing Rmarkdown files you can view an interactive table of contents that enables you to jump between sections (defined by # headers) in your code:

To summarize this section, I would highly recommend developing with Rmd files rather than R files.

A few set-up tips

  • Place a file “passwords.R” with all passwords in the directory to which you clone repos and source it via the Rmd. That way you don’t accidentally publish your passwords to Github
  • I like working with cache on all chunks in my Rmd. It’s usually good practice to avoid uploading the cache files generated in the process to Github so be sure to add to your .gitignore file the file types: *.RData, *.rdb, *.rdx, *.rds, *__packages
  • Github renders CSV files pretty nicely (and enables searching them conveniently)  so if you have some reference tables you want to include and you have a *.csv entry in your .gitignore  file, you may want to add to your .gitignore the following entry: !reference_table_which_renders_nicely_on_github.csv to exclude it from the exclusion list.

Sample Reproducible development repo

Feel free to clone the sample reproducible development repo below and get your reproducible project running ASAP!


Lyric Analysis with NLP and Machine Learning using R: Part One – Text Mining

June 22
By Debbie Liske

This is Part One of a three part tutorial series originally published on the DataCamp online learning platform in which you will use R to perform a variety of analytic tasks on a case study of musical lyrics by the legendary artist, Prince. The three tutorials cover the following:

Musical lyrics may represent an artist’s perspective, but popular songs reveal what society wants to hear. Lyric analysis is no easy task. Because it is often structured so differently than prose, it requires caution with assumptions and a uniquely discriminant choice of analytic techniques. Musical lyrics permeate our lives and influence our thoughts with subtle ubiquity. The concept of Predictive Lyrics is beginning to buzz and is more prevalent as a subject of research papers and graduate theses. This case study will just touch on a few pieces of this emerging subject.

Prince: The Artist

To celebrate the inspiring and diverse body of work left behind by Prince, you will explore the sometimes obvious, but often hidden, messages in his lyrics. However, you don’t have to like Prince’s music to appreciate the influence he had on the development of many genres globally. Rolling Stone magazine listed Prince as the 18th best songwriter of all time, just behind the likes of Bob Dylan, John Lennon, Paul Simon, Joni Mitchell and Stevie Wonder. Lyric analysis is slowly finding its way into data science communities as the possibility of predicting “Hit Songs” approaches reality.

Prince was a man bursting with music – a wildly prolific songwriter, a virtuoso on guitars, keyboards and drums and a master architect of funk, rock, R&B and pop, even as his music defied genres. – Jon Pareles (NY Times)
In this tutorial, Part One of the series, you’ll utilize text mining techniques on a set of lyrics using the tidy text framework. Tidy datasets have a specific structure in which each variable is a column, each observation is a row, and each type of observational unit is a table. After cleaning and conditioning the dataset, you will create descriptive statistics and exploratory visualizations while looking at different aspects of Prince’s lyrics.

Check out the article here!

(reprint by permission of DataCamp online learning platform)

Anomaly Detection in R

The World of Anomalies

Imagine you are a credit card selling company and you know about a particular customer who makes a purchase of 25$ every week. You guessed this purchase is his fixed weekly rations but one day, this customer makes a different purchase of 700$. This development will not just startle you but also compel you to talk to the customer and find out the reason so you can approve the transaction. This is because, the behavior of the customer had become fixed and the change was so different that it was not expected. Hence we call this event as an anomaly.

Anomalies are hard to detect because they can also be real phenomenon. Let’s say that the customer in the example above made the usual purchases while he was living alone and will be starting his family this week. This will mean that this should be the first of his future purchases of similar magnitude or he is throwing a party this week and this was a one-time large purchase. In all these cases, the customer will be classified as making an ‘abnormal’ choice. We as the credit card seller need to know which of these cases are genuine and which are mistakes which can be corrected if they reconfirm the same with the customer. The usefulness of detecting such anomalies are very useful especially in BFSI industry with the primary use in credit card transactions. Such anomalies can be signs of fraud or theft. Someone making multiple transactions of small amounts from the same credit card, making one very large transaction which is a few order of magnitudes larger than the average, making transactions from an unfamiliar location are such examples that can caused by fraudsters and must be caught. With the popularity of adoption, let’s study the ways we can detect anomalies.

Detecting The Pattern To Find Anomalies

Anomalies are essentially the outliers in our data. If something happens regularly, it is not an anomaly but a trend. Things which happen once or twice and are deviant from the usual behavior, whether continuously or with lags are all anomalies. So it all boils down to the definition of outliers for our data. R provides a lot of packages with different approaches to anomaly detection. We will use the AnomalyDetection package in R to understand the concept of anomalies using one such method. However, the package needs to be installed specially from github. This requires the install_github() function in devtools package. We will also use the Rcpp package which helps us to integrate R with C++ functions. Another github package to be used in this article is the wikipedia trend package which contains the API to access wikipedia and create data for anomaly detection analysis.

The package is capable of identifying outliers in the presence of seasonality and trend in the data. The package uses an algorithm known as Seasonal Hybrid ESD algorithm which finds outliers globally as well as locally in time series or a vector of data. The package has a lot of features, some of which include visualization graphs, type of anomalies (positive or negative) and specifying the window of interest.
#Install the devtools package then github packages

#Loading the libraries
The first step is data preparation. We will use the page views on wikipedia page marked on fifa data starting from date 18th March 2013. (link: https://en.wikipedia.org/wiki/FIFA). The wp_trend function gives us the access statistics for the page with the ability to filter data from within the function. We will use this data to model day wise page views and understand anomalies in the pattern of those view numbers
#Download wikipedia webpage "fifa" 
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")
This gives us a dataset of about 1022 observations and 8 columns. Looking at the data reveals some redundant information captured
   project   language article access     agent      granularity date       views
197 wikipedia en       Fifa    all-access all-agents daily       2016-01-13 116  
546 wikipedia en       Fifa    all-access all-agents daily       2016-12-27  64  
660 wikipedia en       Fifa    all-access all-agents daily       2017-04-20 100  
395 wikipedia en       Fifa    all-access all-agents daily       2016-07-29  70  
257 wikipedia en       Fifa    all-access all-agents daily       2016-03-13  75  
831 wikipedia en       Fifa    all-access all-agents daily       2017-10-08 194  
229 wikipedia en       Fifa    all-access all-agents daily       2016-02-14  84  
393 wikipedia en       Fifa    all-access all-agents daily       2016-07-27 140  
293 wikipedia en       Fifa    all-access all-agents daily       2016-04-18 105  
420 wikipedia en       Fifa    all-access all-agents daily       2016-08-23 757
We see that project, language, article, access, agent and granularity appear to be same for all rows and are irrelevant for us. We are only concerned with date and views as the features to work on. Let’s plot the views against date
#Plotting data
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()
Anomaly detection
We see some huge spikes at different intervals. There are a lot of anomalies in this data. Before we process them further, let’s keep only the relevant columns.
# Keep only date & page views and discard all other variables
We will now perform anomaly detection using Seasonal Hybrid ESD Test. The technique maps data as a series and captures seasonality while pointing out data which does not follow the seasonality pattern. The AnomalyDetectionTs() function finds the anomalies in the data. It will basically narrow down all the peaks keeping in mind that not more than 10% of data can be anomalies (by default). We can also reduce this number by changing the max_anoms parameter in the data. We can also specify which kind of anomalies are to be identified using the direction parameter. Here, we are going to specify only positive direction anomalies to be identified. That means that sudden dips in the data are not considered.
#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)
Anomaly Detection 2

Our data has 5.68% anomalies in positive direction if we take a level of significance (alpha) to be 95%. Since we had a total of 1022 observations, 5.68% of the number is about 58 observations. We can look at the specific dates which are pointed out by the algorithm.
# Look at the anomaly dates
   timestamp anoms
1  2015-07-01   269
2  2015-07-02   233
3  2015-07-04   198
4  2015-07-05   330
5  2015-07-06   582
6  2015-07-07   276
7  2015-07-08   211
8  2015-07-09   250
9  2015-07-10   198
10 2015-07-20   315
11 2015-07-21   209
12 2015-07-25   202
13 2015-07-26   217
14 2015-09-18   278
15 2015-09-25   234
16 2015-09-26   199
17 2015-10-03   196
18 2015-10-07   242
19 2015-10-08   419
20 2015-10-09   240
21 2015-10-11   204
22 2015-10-12   223
23 2015-10-13   237
24 2015-10-18   204
25 2015-10-28   213
26 2015-12-03   225
27 2015-12-21   376
28 2015-12-22   212
29 2016-02-24   240
30 2016-02-26   826
31 2016-02-27   516
32 2016-02-29   199
33 2016-04-04   330
34 2016-05-13   217
35 2016-05-14   186
36 2016-06-10   196
37 2016-06-11   200
38 2016-06-12   258
39 2016-06-13   245
40 2016-06-14   204
41 2016-06-22   232
42 2016-06-27   273
43 2016-06-28   212
44 2016-07-10   221
45 2016-07-11   233
46 2016-08-22   214
47 2016-08-23   757
48 2016-08-24   244
49 2016-09-18   250
50 2016-09-19   346
51 2017-01-10   237
52 2017-03-29   392
53 2017-06-03   333
54 2017-06-21   365
55 2017-10-08   194
56 2017-10-09   208
57 2017-10-11   251
58 2017-10-14   373
We have the exact dates and the anomaly values for each date. In a typical anomaly detection process, each of these dates are looked case by case and the reason for anomalies is identified. For instance, the page views can be higher on these dates if there had been fifa matches or page updates on these particular days. Another reason could be big news about fifa players. However, if page views on any of the dates does not correspond to any special event, then those days are true anomalies and should be flagged. In other situations such as credit card transactions, such anomalies can indicate fraud and quick action must be taken on identification.

The ‘Anomaly Way’

Anomalies are a kind of outlier so SH-ESD (Seasonal Hybrid ESD) is not the only way to detect anomalies. Moreover, ‘AnomalyDetection’ is not the only package we will look upon. Let’s try the anomalize package which is available in CRAN. However, it is always recommended to update the package using github as the owners keep the most recent package versions there and it takes time and testing for the changes to move into standard repositories such as CRAN. We will first install the package from CRAN so that the dependencies are also installed then update the package using devtools
#Installing anomalize
#Update from github
#Load the package
# We will also use tidyverse package for processing and coindeskr to get bitcoin data
I am also using the tidyverse package (Link) and coindeskr package (Link). The coindeskr package is used to download the bitcoin data and tidyverse is used for speedy data processing. We will now download bitcoin data from 1st January 2017
#Get bitcoin data from 1st January 2017
bitcoin_data <- get_historic_price(start = "2017-01-01")
This data indicates the price per date. Let’s convert it into a time series
#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>% rownames_to_column() %>% as.tibble() %>% mutate(date = as.Date(rowname)) %>% select(-one_of('rowname'))
In the time series conversion, we are actually converting the data to a tibble_df which the package requires. We could have alternatively converted the data into tibbletime object. Since it is a time series now, we should also see the seasonality and trend patterns in the data. It is important to remove them so that anomaly detection is not affected. We will now decompose the series. We will also plot the series

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>% time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
Anomaly Detection 2
We have some beautiful plots with the first plot being overall observed data, second being season, third being trend and the final plot analyzed for anomalies. The red points indicate anomalies according to the anomalize function. However, we are not looking for this plot. We only want the anomalies plot with trend and seasonality removed. Let’s plot the data again with recomposed data. This can be done by setting the time_recomposed() function

#Plot the data again by recomposing data
bitcoin_data_ts %>% time_decompose(Price) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
Anomaly Detection 3
This is a better plot and shows the anomalies. We all know how bitcoin prices shot up in 2018. The grey portion explains the expected trend. Let’s see what these red points are.
#Extract the anomalies
anomalies=bitcoin_data_ts %>% time_decompose(Price) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')
Now the anomalies dataset consists of the data points which were identified as anomalies by the algorithm

Conclusion: Are You An Anomaly?

We have twitter’s anomaly detection package based on Seasonal Hybrid ESD (SH-ESD) as well as CRAN’s anomaly detection package based on factor analysis, Mahalanobis distance, Horn’s parallel analysis or Principal component analysis. We also have TsOutliers package and anomalize packages in R. There are a lot more packages than one could find in R. They all have the same concept but differ in the underlying algorithm which they use to detect anomaly. Hence, one can get a general idea from all such packages: anomalies are data points which do not follow the general trend or do not lie under the expected behavior of the rest of the data. The next question which is raised is the criteria for a data point to be following expected behavior. The rest of the data points are all anomalies. One can also have varying types of anomalies such as direction based anomalies as described by the anomaly detection package (positive or negative) or anomalies not following events such as matches in fifa data. One can similarly pitch in another logic for anomaly classification and treat them accordingly.

Here is the entire code used in this article

#Install the devtools package then github packages

#Loading the libraries

# Download wikipedia webpage "fifa"
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")

# Plotting data
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()

# Keep only date & page views and discard all other variables

#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)

# Look at the anomaly dates

#Installing anomalize
#Update from github
#Load the package
# We will also use tidyverse package for processing and coindeskr to get bitcoin data

#Get bitcoin data from 1st January 2017
bitcoin_data = get_historic_price(start = "2017-01-01")

#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>% rownames_to_column() %>% as.tibble() %>% mutate(date = as.Date(rowname)) %>% select(-one_of('rowname'))

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>% time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()

#Plot the data again by recomposing data
bitcoin_data_ts %>% time_decompose(Price) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

#Extract the anomalies
anomalies=bitcoin_data_ts %>% time_decompose(Price) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Potuganti, Saneesh Veetil and Chaitanya Sagar contributed to this article. Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

R as learning tool: solving integrals

Integrals are so easy only math teachers could make them difficult.When I was in high school I really disliked math and, with hindsight, I would say it was just because of the the prehistoric teaching tools (when I saw this video I thought I’m not alone). I strongly believe that interaction CAUSES learning (I’m using “causes” here on purpose being quite aware of the difference between correlation and causation), practice should come before theory and imagination is not a skill you, as a teacher, could assume in your students. Here follows a short and simple practical explanation of integrals. The only math-thing I will write here is the following: f(x) = x + 7. From now on everything will be coded in R. So, first of all, what is a function? Instead of using the complex math philosophy let’s just look at it with a programming eye: it is a tool that takes something in input and returns something else as output. For example, if we use the previous tool with 2 as an input we get a 9. Easy peasy. Let’s look at the code:
# here we create the tool (called "f")
# it just takes some inputs and add it to 7
f <- function(x){x+7}

# if we apply it to 2 it returns a 9

Then the second question comes by itself. What is an integral? Even simpler, it is just the sum of this tool applied to many inputs in a range. Quite complicated, let’s make it simpler with code: 
# first we create the range of inputs
# basically x values go from 4 to 6 
# with a very very small step (0.01)
# seq stands for sequence(start, end, step)

x <- seq(4, 6, 0.01) 
4.00 4.01 4.02 4.03 4.04 4.05 4.06 4.07...


As you see, x has many values and each of them is indexed so it’s easy to find, e.g. the first element is 4 (x[1]). Now that we have many x values (201) within the interval from 4 to 6, we compute the integral.
# since we said that the integral is 
# just a sum, let's call it IntSum and 
# set it to the start value of 0
# in this way it will work as an accumulator
IntSum = 0
Differently from the theory in which the calculation of the integral produces a new non-sense formula (just kidding, but this seems to be what math teachers are supposed to explain), the integral does produce an output, i.e. a number. We find this number by summing the output of each input value we get from the tool (e.g. 4+7, 4.01+7, 4.02+7, etc) multiplied by the step between one value and the following (e.g. 4.01-4, 4.02-4.01, 4.03-4.02, etc). Let’s clarify this, look down here:
# for each value of x 
for(i in 2:201){
    # we do a very simple thing:
    # we cumulate with a sum
    # the output value of the function f 
    # multiplied by each steps difference
    IntSum = IntSum + f(x[i])*(x[i]-x[i-1])
    # So for example,  
    # with the first and second x values the numbers will be:
    #0.1101 = 0 + (4.01 + 7)*(4.01 - 4)
    # with the second and third:
    #0.2203 = 0.1101 + (4.02 + 7)*(4.02 - 4.01)
    # with the third and fourth:
    #0.3306 = 0.2203 + (4.03 + 7)*(4.03 - 4.02)
    # and so on... with the sum (integral) growing and growing
    # up until the last value

Done! We have the integral but let’s have a look to the visualization of this because it can be represented and made crystal clear. Let’s add a short line of code to serve the purpose of saving the single number added to the sum each time. The reason why we decide to call it “bin” instead of, for example, “many_sum” will be clear in a moment.
# we need to store 201 calculation and we
# simply do what we did for IntSum but 201 times
bin = rep(0, 201)
0 0 0 0 0 0 0 0 0 0 0 0 ...
Basically, we created a sort of memory to host each of the calculation as you see down here:
for (i in 2:201){
    # the sum as earlier
    IntSum = IntSum + f(x[i])*(x[i]-x[i-1])
    # overwrite each zero with each number
    bin[i] = f(x[i])*(x[i]-x[i-1])


0.0000 0.1101 0.1102 0.1103 0.1104 0.1105 ..

Now if you look at the plot below you get the whole story: each bin is a tiny bar with a very small area and is the smallest part of the integral (i.e. the sum of all the bins).
# plotting them all
barplot(bin, names.arg=x)
This tells you a lot about the purpose of integral and the possibility of calculating areas of curvy surfaces. To have an idea of this just change the function f with, let’s say, sin(x) or log(x). What is happening? And what if you increase/decrease the number of bins? Have fun replicating the code changing some numbers and functions. Integrals should be clearer in the end. That’s all folks! #R #rstats #maRche #Rbloggers 

Using Shiny Dashboards for Financial Analysis

For some time now, I have been trading traditional assets—mostly U.S. equities. About a year ago, I jumped into the cryptocurrency markets to try my hand there as well. In my time in investor Telegram chats and subreddits, I often saw people arguing over which investments had performed better over time, but the reality was that most such statements were anecdotal, and thus unfalsifiable.

Given the paucity of cryptocurrency data available in an easily accessible format, it was quite difficult to say for certain that a particular investment was a good one relative to some alternative, unless you were very familiar with a handful of APIs. Even then, assuming you knew how to get daily OHLC data for a crypto-asset like Bitcoin, in order to compare it to some other asset—say Amazon stock—you would have to eyeball trends from a website like Yahoo finance or scrape that data separately and build your own visualizations and metrics. In short, historical asset performance comparisons in the crypto space were difficult to conduct for all but the most technically savvy individuals, so I set out to build a tool that would remedy this, and the Financial Asset Comparison Tool was born.

In this post, I aim to describe a few key components of the dashboard, and also call out lessons learned from the process of iterating on the tool along the way. Prior to proceeding, I highly recommend that you read the app’s README and take a look at the UI and code base itself, as this will provide the context necessary to understanding the rest of the commentary below.

I’ll start by delving into a few principles that I find to be to key when designing analytic dashboards, drawing on the asset comparison dashboard as my exemplar, and will end with some discussion of the relative utility of a few packages integral to the app. Overall, my goal is not to focus on the tool that I built alone, but to highlight a few main best practices when it comes to building dashboards for any analysis.

Build the app around the story, not the other way around.

Before ever writing a single line of code for an analytic app, I find that it is absolutely imperative to have a clear vision of the story that the tool must tell. I do not mean by this that you should already have conclusions about your data that you will then force the app into telling, but rather, that you must know how you want your user to interact with the app in order glean useful information.

In the case of my asset comparison tool, I wanted to serve multiple audiences—everyone from a casual trader who just wanted to see which investment produced the greatest net profit over a period of time, to a more experience trader, who had more nuanced questions about risk-adjusted return on investment given varying discount rates. The trick is thus building the app in such a way that serves all possible audiences without hindering any one type of user in particular.

The way I designed my app to meet this need was to build the UI such that as you descend the various sections vertically, the metrics displayed scale in complexity. My reasoning for this becomes apparent when you consider the two extremes in terms of users—the most basic vs. the most advanced trader.

The most basic user will care only about the assets of interest, the time period they want to examine, and how their initial investment performed over time. As such, they will start with the sidebar, input their assets and time frame of choice, and then use the top right-most input box to modulate their initial investment amount (although some may choose to stick with the default value here). They will then see the first chart change to reflect their choices, and they will see, both visually, and via the summary table below, which asset performed better.

The experienced trader, on the other hand, will start off exactly as the novice did, by choosing assets of interest, a time frame of reference, and an initial investment amount. They may then choose to modulate the LOESS parameters as they see fit, descending the page, looking over the simple returns section, perhaps stopping to make changes to the corresponding inputs there, and finally ending at the bottom of the page—at the Sharpe Ratio visualizations. Here they will likely spend more time—playing around with the time period over which to measure returns and changing the risk-free rate to align with their own personal macroeconomic assumptions.

The point of these two examples is to illustrate that the app by dint of its structure alone guides the user through the analytic story in a waterfall-like manner—building from simple portfolio performance, to relative performance, to the most complicated metrics for risk-adjusted returns. This keeps the novice trader from being overwhelmed or confused, and also allows the most experienced user to follow the same line of thought that they would anyway when comparing assets, while following a logical progression of complexity, as shown via the screenshot below.

Once you think you have a structure that guides all users through the story you want them to experience, test it by asking yourself if the app flows in such a way that you could pose and answer a logical series of questions as you navigate the app without any gaps in cohesion. In the case of this app, the questions that the UI answers as you descend are as follows:

  • How do these assets compare in terms of absolute profit?
  • How do these assets compare in terms of simple return on investment?
  • How do these assets compare in terms of variance-adjusted and/or risk-adjusted return on investment?

Thus, when you string these questions together, you can make statements of the type: “Asset X seemed to outperform Asset Y in terms of absolute profit, and this trend held true as well when it comes to simple return on investment, over varying time frames. That said, when you take into account the variance inherent to Asset X, it seems that Asset Y may have been the best choice, as the excess downside risk associated with Asset X outweighs its excess net profitability.

Too many cooks in the kitchen—the case for a functional approach to app-building.

While the design of the UI of any analytic app is of great importance, it’s important to not forget that the code base itself should also be well-designed; a fully-functional app from the user’s perspective can still be a terrible app to work with if the code is a jumbled, incomprehensible mess. A poorly designed code base makes QC a tiresome, aggravating process, and knowledge sharing all but impossible.

For this reason, I find that sourcing a separate R script file containing all analytic functions necessitated by the app is the best way to go, as done below (you can see Functions.R at my repo here).

# source the Functions.R file, where all analytic functions for the app are stored

Not only does this allow for a more comprehensible and less-cluttered App.R, but it also drastically improves testability and reusability of the code. Consider the example function below, used to create the portfolio performance chart in the app (first box displayed in the UI, center middle).

build_portfolio_perf_chart <- function(data, port_loess_param = 0.33){
  port_tbl <- data[,c(1,4:5)]
  # grabbing the 2 asset names
  asset_name1 <- sub('_.*', '', names(port_tbl)[2])
  asset_name2 <- sub('_.*', '', names(port_tbl)[3])
  # transforms dates into correct type so smoothing can be done
  port_tbl[,1] <- as.Date(port_tbl[,1])
  date_in_numeric_form <- as.numeric((port_tbl[,1]))
  # assigning loess smoothing parameter
  loess_span_parameter <- port_loess_param
  # now building the plotly itself
  port_perf_plot <- plot_ly(data = port_tbl, x = ~port_tbl[,1]) %>%
    # asset 1 data plotted
    add_markers(y =~port_tbl[,2],
                marker = list(color = '#FC9C01'),
                name = asset_name1,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,2] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#FC9C01'),
              name = asset_name1,
              showlegend = TRUE) %>%
    # asset 2 data plotted
    add_markers(y =~port_tbl[,3],
                marker = list(color = '#3498DB'),
                name = asset_name2,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,3] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#3498DB'),
              name = asset_name2,
              showlegend = TRUE) %>%
      title = FALSE,
      xaxis = list(type = "date",
                   title = "Date"),
      yaxis = list(title = "Portfolio Value ($)"),
      legend = list(orientation = 'h',
                    x = 0,
                    y = 1.15)) %>%
      x= 1,
      y= 1.133,
      xref = "paper",
      yref = "paper",
      text = "",
      showarrow = F

Writing this function in the sourced Functions.R file instead of directly within the App.R allows for the developer to first test the function itself with fake data—i.e. data not gleaned from the reactive inputs. Once it has been tested in this way, it can be integrated in the app.R on the server side as shown below, with very little code.

  output$portfolio_perf_chart <- 
        data <- react_base_data()
        build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
      millis = 2000) # sets wait time for debounce

This process allows for better error-identification and troubleshooting. If, for example, you want to change the work accomplished by the analytic function in some way, you can make the changes necessary to the code, and if the app fails to produce the desired outcome, you simply restart the chain: first you test the function in a vacuum outside of the app, and if it runs fine there, then you know that you have a problem with the way the reactive inputs are integrating with the function itself. This is a huge time saver when debugging.

Lastly, this allows for ease of reproducibility and hand-offs. If, say, one of your functions simply takes in a dataset and produces a chart of some sort, then it can be easily copied from the Functions.R and reused elsewhere. I have done this too many times to count, ripping code from project and, with a few alterations, instantly applying it in other contexts. This is easy to do if the functions are written in a manner not dependent on a particular Shiny reactive structure. For all these reasons, it makes sense in most cases to keep the code for the app UI and inputs cleanly separated from the analytic functions via a sourced R script.

Dashboard documentation—both a story and a manual, not one or the other.

When building an app for a customer at work, I never simply write an email with a link in it and write “here you go!” That will result in, at best, a steep learning curve, and at worst, an app used in an unintended way, resulting in user frustration or incorrect results. I always meet with the customer, explain the purpose and functionalities of the tool, walk through the app live, take feedback, and integrate any key takeaways into further iterations.

Even if you are just planning on writing some code to put up on GitHub, you should still consider all of these steps when working on the documentation for your app. In most cases, the README is the epicenter of your documentation—the README is your meeting with the customer.  As you saw when reading the README for the Asset Comparison Tool, I always start my READMEs with a high-level introduction to the purpose of the app—hopefully written or supplemented with visuals (as seen below) that are easy to understand and will capture the attention of browsing passers-by. 

After this introduction, the rest of the potential sections to include can vary greatly from app-to-app. In some cases apps are meant to answer one particular question, and might have a variety of filters or supplemental functionalities—one such example can be found here. As can be seen, in that README, I spend a great deal of time on the methodology after making the overall purpose clear, calling out additional options along the way. In the case of the README for the Asset Comparison Tool, however, the story is a bit different. Given that there are many questions that the app seeks to answer, it makes sense to answer each in turn, writing the README in such a way that its progression mirrors the logical flow of the progression intended for the app’s user.

One should of course not neglect to cover necessary technical information in the README as well. Anything that is not immediately clear from using the app should be clarified in the README—from calculation details to the source of your data, etc. Finally, don’t neglect the iterative component! Mention how you want to interact with prospective users and collaborators in your documentation. For example, I normally call out how I would like people to use the Issues tab on GitHub to propose any changes or additions to the documentation, or the app in general. In short, your documentation must include both the story you want to tell, and a manual for your audience to follow. 

Why Shiny Dashboard?

One of the first things you will notice about the app.R code is that the entire thing is built using Shiny Dashboard as its skeleton. There are a two main reasons for this, which I will touch on in turn.

Shiny Dashboard provides the biggest bang for your buck in terms of how much UI complexity and customizability you get out of just a small amount of code.

I can think of few cases where any analyst or developer would prefer longer, more verbose code to a shorter, succinct solution. That said, Shiny Dashboard’s simplicity when it comes to UI manipulation and customization is not just helpful because it saves you time as a coder, but because it is intuitive from the perspective of your audience.

Most of the folks that use the tools I have built to shed insight into economic questions don’t know how to code in R or Python, but they can, with a little help from extensive commenting and detailed documentation, understand the broad structure of an app coded in Shiny Dashboard format. This is, I believe, largely a function of two features of Shiny Dashboard: the colloquial-English-like syntax of the code for UI elements, and the lack of the necessity for in-line or external CSS.

As you can see from the example below, Shiny Dashboard’s system of “boxes” for UI building is easy to follow. Users can see a box in the app and easily tie that back to a particular box in the UI code.

Here is the box as visible to the user:

And here is the code that produces the box:

        title = "Portfolio Performance Inputs",
        status= "primary",
        solidHeader = TRUE,
        h5("This box focuses on portfolio value, i.e., how much an initial investment of the amount specified below (in USD) would be worth over time, given price fluctuations."),
          inputId = "initial_investment",
          label = "Enter your initial investment amount ($):",
          value = "1000"),
        h5("The slider below modifies the", a(href = "https://stats.stackexchange.com/questions/2002/how-do-i-decide-what-span-to-use-in-loess-regression-in-r", "smoothing parameter"), "used in the", a(href = "https://en.wikipedia.org/wiki/Local_regression", "LOESS function"), "that produces the lines on the scatterplot."),
          inputId = "port_loess_param",
          label = "Smoothing parameter for portfolio chart:",
          min = 0.1,
          max = 2,
          value = .33,
          step = 0.01,
          animate = FALSE
        h5("The table below provides metrics by which we can compare the portfolios. For each column, the asset that performed best by that metric is colored green."),
        height = 500, 
        width = 4

Secondly, and somewhat related to the first point, with Shiny Dashboard, much of the coloring and overall UI design comes pre-made via dashboard-wide “skins”, and box-specific “statuses.”

This is great if you are okay sacrificing a bit of control for a significant reduction in code complexity. In my experience dealing with non-coding-proficient audiences, I find that in-line CSS or complicated external CSS makes folks far more uncomfortable with the code in general. Anything you can do to reduce this anxiety and make those using your tools feel as though they understand them better is a good thing, and Shiny Dashboard makes that easier.

Shiny Dashboard’s combination of sidebar and boxes makes for easy and efficient data processing when your app has a waterfall-like analytic structure. 

Having written versions of this app both in base Shiny and using Shiny Dashboard, the number one reason I chose Shiny Dashboard was the fact that the analytic questions I sought to solve followed a waterfall-like structure, as explained in the previous section. This works perfectly well with Shiny Dashboard’s combination of sidebar input controls and inputs within UI boxes themselves.  

The inputs of primordial importance to all users are included in the sidebar UI: the two assets to analyze, and the date range over which to compare their performance. These are the only inputs that all users, regardless of experience or intent, must absolutely use, and when they are changed, all views in the dashboard will be affected. All other inputs are stored in the UI Boxes adjacent to the views that they modulate. This makes for a much more intuitive and fluid user experience, as once the initial sidebar inputs have been modulated, the sidebar can be hidden, as all other non-hidden inputs affect only the visualizations to which they are adjacent.

This waterfall-like structure also makes for more efficient reactive processes on the Shiny back-end. The inputs on the sidebar are parameters that, when changed, force the main reactive function that creates that primary dataset to fire, thus recreating the base dataset (as can be seen in the code for that base datasets creation below).

  # utility functions to be used within the server; this enables us to use a textinput for our portfolio values
  exists_as_number <- function(item) {
    !is.null(item) && !is.na(item) && is.numeric(item)
  # data-creation reactives (i.e. everything that doesn't directly feed an output)
  # first is the main data pull which will fire whenever the primary inputs (asset_1a, asset_2a, initial_investment, or port_dates1a change)
  react_base_data <- reactive({
    if (exists_as_number(as.numeric(input$initial_investment)) == TRUE) {
      # creates the dataset to feed the viz
          asset_1 = input$asset_1a,
          asset_2 = input$asset_2a, 
          port_start_date = input$port_dates1a[1],
          port_end_date = input$port_dates1a[2],
          initial_investment = (as.numeric(input$initial_investment))
    } else {
          asset_1 = input$asset_1a,
          asset_2 = input$asset_2a, 
          port_start_date = input$port_dates1a[1],
          port_end_date = input$port_dates1a[2],
          initial_investment = (0)

Each of the visualizations are then produced via their own separate reactive functions, each of which takes as an input the main reactive (as shown below). This makes it so that whenever the sidebar inputs are changed, all reactives fire and all visualizations are updated; however, if all that is changed is a single loess smoothing parameter input, only the reactive used in the creation of that particular parameter-dependent visualization fires, which makes for great computational efficiency.

 # Now the reactives for the actual visualizations
  output$portfolio_perf_chart <- 
        data <- react_base_data()
        build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
      millis = 2000) # sets wait time for debounce

Why Plotly?

Plotly vs. ggplot is always a fun subject for discussion among folks who build visualizations in R. Sometimes I feel like such discussions just devolve into the same type of argument as R vs. Python for data science (my answer to this question being just pick one and learn it well), but over time I have found that there are actually some circumstances where the plotly vs. ggplot debate can yield cleaner answers.

In particular, I have found in working on this particular type of analytic app that there are two areas where plotly has a bit of an advantage: clickable interactivity, and wide data.

Those familiar with ggplot will know that every good ggplot begins with long data. It is possible, via some functions, to transform wide data into a long format, but that transformation can sometimes be problematic. While there are essentially no circumstances in which it is impossible to transform wide data into long format, there are a handful of cases where it is excessively cumbersome: namely, when dealing with indexed xts objects (as shown below) or time series / OHLC-styled data.

In these cases—either due to the sometimes-awkward way in which you have to handle rowname indexes in xts, or the time and code complexity saved by not having to transform every dataset into long format—plotly offers efficiency gains relative to ggplot.

The aforementioned efficiency gains are a reason to choose plotly in some cases because it makes the life of the coder easier, but there are also reasons why it sometimes make the life of the user easier as well.

If one of the primary utilities of a visualization is to allow the user the ability to seamlessly and intuitively zoom in on, select, or filter the data displayed, particularly in the context of a Shiny App, then plotly should be strongly considered. Sure, ggplotly wrappers can be used to make a ggplot interactive, but with an added layer of abstraction comes an added layer of possible errors. While in most cases a ggplotly wrapper should work seamlessly, I have found that, particularly in cases where auto-sizing and margin size specification is key, ggplotly can require a great deal of added code in order to work correctly in a Shiny context.

In summary, when considering when to start with plotly vs. when to start with ggplot, I find one question to be particularly helpful: what do I value most—visual complexity and/or customization, or interactive versatility and/or preserving wide data?

If I choose the former, then ggplot is what I need; otherwise, I go with plotly. More often than not I find that ggplot emerges victorious, but even if you disagree with me in my decision-making calculus, I think it is helpful to at least think through what your personal calculus is. This will save you time when coding, as instead of playing around with various types of viz, you can simply pose the question(s) behind your calculus and know quickly what solution best fits your problem.

Why Formattable?

The case for formattable is, in my opinion, a much easier case to make than arguing for plotly vs. ggplot. The only question worth asking when deciding on whether or not to use formattable in your app is: do I want my table to tell a quick story via added visual complexity within the same cell that contains my data, or is a reference table all I am looking for? If you chose the former, formattable is probably a good way to go. You’ll notice as well that the case for formattable is very specific–in most cases there is likely a simpler solution via the DT  or kableExtra packages.

The one downside that I have encountered in dealing with formattable code is the amount of code necessary to generate even moderately complicated tables. That said, this problem is easily remedied via a quick function that we can use to kill most of the duplicative coding, as seen in the example below.

First, here is the long form version:

  react_formattable <- reactive({
                    "Asset Portfolio Max Worth" = formatter("span",
                                                            style = x ~ style(
                                                              display = "inline-block",
                                                              direction = "rtl",
                                                              "border-radius" = "4px",
                                                              "padding-right" = "2px",
                                                              "background-color" = csscolor("darkslategray"),
                                                              width = percent(proportion(x)),
                                                              color = csscolor(gradient(x, "red", "green"))
                    "Asset Portfolio Latest Worth" = formatter("span",
                                                               style = x ~ style(
                                                                 display = "inline-block",
                                                                 direction = "rtl",
                                                                 "border-radius" = "4px",
                                                                 "padding-right" = "2px",
                                                                 "background-color" = csscolor("darkslategray"),
                                                                 width = percent(proportion(x)),
                                                                 color = csscolor(gradient(x, "red", "green"))
                    "Asset Portfolio Absolute Profit" = formatter("span",
                                                                  style = x ~ style(
                                                                    display = "inline-block",
                                                                    direction = "rtl",
                                                                    "border-radius" = "4px",
                                                                    "padding-right" = "2px",
                                                                    "background-color" = csscolor("darkslategray"),
                                                                    width = percent(proportion(x)),
                                                                    color = csscolor(gradient(x, "red", "green"))
                    "Asset Portfolio Rate of Return" = formatter("span",
                                                                 style = x ~ style(
                                                                   display = "inline-block",
                                                                   direction = "rtl",
                                                                   "border-radius" = "4px",
                                                                   "padding-right" = "2px",
                                                                   "background-color" = csscolor("darkslategray"),
                                                                   width = percent(proportion(x)),
                                                                   color = csscolor(gradient(x, "red", "green"))

This code can easily be shortened via the integration of a custom function, as shown below.

simple_formatter <- function(){
              style = x ~ style(
                display = "inline-block",
                direction = "rtl",
                "border-radius" = "4px",
                "padding-right" = "2px",
                "background-color" = csscolor("darkslategray"),
                width = percent(proportion(x)),
                color = csscolor(gradient(x, "red", "green"))
  react_formattable <- reactive({
                    "Asset Portfolio Max Worth" = simple_formatter(),
                    "Asset Portfolio Latest Worth" = simple_formatter(),
                    "Asset Portfolio Absolute Profit" = simple_formatter(),
                    "Asset Portfolio Rate of Return" = simple_formatter()

As can be seen, formattable allows for a great deal of added complexity in crafting your table—complexity that may not be suited for all apps. That said, if you do want to quickly draw a user’s attention to something in a table, formattable is a great solution, and most of the details of the code can be greatly simplified via a function, as shown.


That was a lot—I know—but I hope that from this commentary and my exemplar of the Asset Comparison Tool more generally has helped to inform your understanding of how dashboards can serve as a helpful analytic tool. Furthermore, I hope to have prompted some thoughts as to the best practices to be followed when building such a tool. I’ll end with a quick tl;dr:

  • Shave complexity wherever possible, and make code as simple as possible by keeping the code for the app’s UI and inner mechanism (inputs, reactives, etc.) separate from the code for the analytic functions and visualizations.
  • Build with the most extreme cases in mind (think of how your most edge-case user might use the app, and ensure that behavior won’t break the app)
  • Document, document, and then document some more. Make your README both a story and a manual.
  • Give Shiny Dashboard a shot if you want an easy-to-construct UI over which you don’t need complete control when it comes to visual design.
  • Pick your visualization packages based on what you want to prioritize for your user, not the other way around (this applies to ggplot, plotly, formattable, etc.).

Thanks for reading!