XGBoostLSS – An extension of XGBoost to probabilistic forecasting


Introduction 


To reason rigorously under uncertainty we need to invoke the language of probability (Zhang et al. 2020). Any model that falls short of quantifying the uncertainty attached to its outcome is likely to yield an incomplete and potentially misleading picture. While this is an irrevocable consensus in statistics, a common, albeit very persistent, misconception is that machine learning models usually lack proper ways of quantifying uncertainty. Even though the two terms exist in parallel and are often used interchangeably, the perception that machine learning and statistics imply a non-overlapping set of techniques remains lively, both among practitioners and academics. This is vividly portrayed by the provocative (and potentially tongue-in-cheek) statement of Brian D. Ripley that 'machine learning is statistics minus any checking of models and assumptions', made at the useR! 2004 conference in Vienna to illustrate the difference between machine learning and statistics.

In fact, the relationship between statistics and machine learning is artificially complicated by such statements, which is unfortunate at best, as they imply a profound and qualitative distinction between the two disciplines (Januschowski et al. 2020). The paper by Breiman (2001) is a notable exception, as it proposes to differentiate the two based on scientific culture rather than on methods alone. While the approaches discussed in Breiman (2001) are an admissible partitioning of the space of how to analyse and model data, more recent advances have gradually made this distinction less clear-cut. In fact, current research in both statistics and machine learning gravitates towards bringing the two disciplines closer together. In an era in which the output of prediction models increasingly needs to be turned into explainable and reliable insights, this is an exceedingly promising and encouraging development, as both disciplines have much to learn from each other. Along with Januschowski et al. (2020), we argue that it is more constructive to seek common ground than to introduce artificial boundaries.

To further close the gap between the two cultures, we propose a new framework based on the eminent XGBoost that predicts the entire conditional distribution of a univariate response variable. We term our model XGBoostLSS, as it combines the accuracy and speed of XGBoost with the flexibility and interpretability of GAMLSS, which allow for the estimation and prediction of the entire conditional distribution. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. XGBoostLSS allows the user to choose from a wide range of continuous, discrete and mixed discrete-continuous distributions to better adapt to the data at hand, as well as to provide predictive distributions, from which prediction intervals and quantiles can be derived.

Applications

In the following, we present both a simulation study and a real-world example that demonstrate the functionality of XGBoostLSS.

Simulation

We start with a simulated data set that exhibits heteroskedasticity, where the interest lies in predicting the 5% and 95% quantiles. Points outside the 5% and 95% quantiles are coloured in red, with the black dashed lines depicting the actual 5% and 95% quantiles.
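A minimal sketch of such a data-generating process is shown below; the variable names, coefficients and sample size are illustrative and not the exact simulation used here:

n <- 5000
set.seed(123)

# Heteroskedastic data: the scale of y grows with x
x <- runif(n, 0, 10)
y <- rnorm(n, mean = 10, sd = 1 + 0.5 * x)
sim_data <- data.frame(x = x, y = y)

# True 5% and 95% quantiles, used as the dashed reference lines
q05_true <- qnorm(0.05, mean = 10, sd = 1 + 0.5 * x)
q95_true <- qnorm(0.95, mean = 10, sd = 1 + 0.5 * x)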

Let’s fit the XGBoostLSS model to the data. In general, the syntax is similar to the original XGBoost implementation. However, the user has to make a distributional assumption by specifying a family in the function call. As the data have been generated by a Normal distribution, we use the Normal as a function input. The user also has the option of providing a list of hyper-parameters that are used for training the surrogate regression model to find an optimized set of parameters. As our model allows us to model the entire conditional distribution, we obtain prediction intervals and quantiles of interest directly from the predicted quantile function. Once the model is trained, we can predict all parameters of the distribution. The following Figure shows the predictions of XGBoostLSS for the 5% and 95% quantiles in blue.
Comparing the coverage of the intervals with the nominal level of 90% shows that XGBoostLSS not only correctly models the heteroskedasticity in the data, but also provides an accurate forecast of the 5% and 95% quantiles. The flexibility of XGBoostLSS also comes from its ability to provide attribute importance, as well as partial dependence plots, for all of the distributional parameters. In the following, we only investigate the effect on the conditional variance.
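As a hedged sketch of how such quantiles and their empirical coverage can be derived once the distributional parameters have been predicted (pred_mean, pred_sd and y_test are placeholders for the predicted location, the predicted scale and the hold-out response, not objects returned verbatim by the package):

# Quantiles of the predicted Normal distribution for each test observation
q05_hat <- qnorm(0.05, mean = pred_mean, sd = pred_sd)
q95_hat <- qnorm(0.95, mean = pred_mean, sd = pred_sd)

# Empirical coverage of the 90% prediction interval;
# it should be close to the nominal level of 0.9
coverage <- mean(y_test >= q05_hat & y_test <= q95_hat)
coverage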

Munich Rent

Considering the active discussion around imposing a rent freeze in German cities, we have chosen to revisit the famous Munich Rent data set commonly used in the GAMLSS literature, as Munich is among the most expensive cities in Germany when it comes to living costs. As our dependent variable, we select the net rent per square meter in EUR.

Even though the Generalized Beta Type 2 provides the best approximation to the data, we use the more parsimonious Normal distribution, as it has only two distributional parameters, compared to four for the Generalized Beta Type 2.
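One way to arrive at such a ranking of candidate distributions is to compare them by an information criterion. A sketch using gamlss::fitDist() is given below; rent$rentsqm is an assumed name for the net rent per square meter, and the exact selection procedure used for XGBoostLSS may differ:

library(gamlss)

# Fit a set of candidate distributions for a positive, real-valued response
# and rank them by generalized AIC (k = 2 corresponds to the classical AIC).
dist_fit <- fitDist(rent$rentsqm, k = 2, type = "realplus")

# GAIC-ranked candidate distributions; smaller is better
head(dist_fit$fits)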



Now that we have specified the distribution, we fit XGBoostLSS to the data. Again, we use Bayesian optimization to find an optimal set of hyper-parameters. The estimated effects presented in the following Figure indicate that newer flats are on average more expensive, with the variance first decreasing and then increasing again for flats built around 1980 and later. Also, as expected, rents per square meter decrease with increasing size of the apartment.

The diagnostics for XGBoostLSS are based on quantile residuals of the fitted model and are shown in the following Figure.
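For a Normal response, quantile residuals can be computed directly from the predicted parameters; a brief sketch (pred_mean and pred_sd again stand for the predicted location and scale on hold-out data, y_test for the observed rents):

# Quantile residuals: map each observation through its predicted CDF,
# then through the standard Normal quantile function. If the model is
# well calibrated, the residuals are approximately standard Normal.
u <- pnorm(y_test, mean = pred_mean, sd = pred_sd)
qres <- qnorm(u)

qqnorm(qres)
qqline(qres)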

Despite some slight under-fitting in the tails of the distribution, XGBoostLSS provides a well-calibrated forecast and confirms that our model is a good approximation to the data. XGBoostLSS also allows us to investigate feature importance for all distributional parameters. Looking at the top 10 features with the highest Shapley values for both the conditional mean and variance indicates that both yearc and area are considered the most important variables.

Besides the global attribute importance, the user might also be interested in the local attribute importance of each single prediction. This allows us to answer questions like 'How did the feature values of a single data point affect its prediction?' For illustration purposes, we select the first predicted rent of the test data set and present the local attribute importance for the conditional mean.
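The xgboost R package can return per-feature SHAP-style contributions for a single observation via predict(..., predcontrib = TRUE). A sketch is given below; bst_mu (the booster fitted for the location parameter) and dtest (an xgb.DMatrix of test features) are illustrative object names, and XGBoostLSS may compute its local importances differently:

library(xgboost)

# Per-feature contributions for the first test observation; the last
# column ("BIAS") is the baseline, the remaining columns are additive
# contributions to the predicted conditional mean.
contrib <- predict(bst_mu, newdata = dtest, predcontrib = TRUE)
contrib[1, ]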

As we have modelled all parameters of the Normal distribution, XGBoostLSS provides a probabilistic forecast, from which any quantity of interest can be derived. The following Figure shows a random subset of 50 predictions only, for ease of readability. The red dots show the actual out-of-sample rents, while the boxplots visualize the predicted distributions.

We can also plot a subset of the forecasted densities and cumulative distributions.
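A sketch of how such densities and cumulative distribution functions can be drawn from the predicted parameters, again using pred_mean and pred_sd as placeholders for the predicted location and scale:

idx  <- 1:4                              # a handful of test observations
grid <- seq(2, 20, length.out = 200)     # plausible range of rent per sqm

op <- par(mfrow = c(1, 2))

# Predicted densities
plot(NULL, xlim = range(grid), ylim = c(0, 0.5),
     xlab = "Rent per sqm", ylab = "Density")
for (i in idx) lines(grid, dnorm(grid, pred_mean[i], pred_sd[i]), col = i)

# Predicted cumulative distribution functions
plot(NULL, xlim = range(grid), ylim = c(0, 1),
     xlab = "Rent per sqm", ylab = "CDF")
for (i in idx) lines(grid, pnorm(grid, pred_mean[i], pred_sd[i]), col = i)

par(op)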

Comparison to other approaches



All measures, except the Median Absolute Error, show that XGBoostLSS provides more accurate forecasts than existing implementations.


Software Implementation

In its current implementation, XGBoostLSS is available in R and will soon be made publicly available in the GitHub repo StatMixedML/XGBoostLSS. Extensions to Julia and Python are also planned. I am currently also working on an extension of CatBoost to allow for probabilistic forecasting; you may find it in the GitHub repo StatMixedML/CatBoostLSS. Please note that due to some time restrictions, the public release of the package(s) may be somewhat delayed.

Reference Paper


März, Alexander (2019) “XGBoostLSS – An extension of XGBoost to probabilistic forecasting”.
März, Alexander (2020) “CatBoostLSS – An extension of CatBoost to probabilistic forecasting”.

The Premier Machine Learning Conference (15% discount code)

Attend any or all of the five jointly scheduled events! Get an additional 15% discount on your booking with the code RBLOGGERSMLW. Join us in Las Vegas for the largest Predictive Analytics World event to date! Learn more



5 days, 8 tracks, 160 speakers and over 150 exciting sessions


Join the Machine Learning Week 2020, May 31 – June 4, Las Vegas! It brings together five co-located events:
 

    • Predictive Analytics World for Business
    • Predictive Analytics World for Financial Services
    • Predictive Analytics World for Industry 4.0
    • Predictive Analytics World for Healthcare
    • Deep Learning World
gathering the top practitioners and the leading experts in data science and machine learning. By design, this mega-conference is where to meet the who’s who and keep up on the latest techniques, making it the leading machine learning event.  You can expect top-class experts from world-famous companies such as: 

Google: Join Richard Dutton (Head of Machine Learning for Corporate Engineering) at his session “How Google Uses AI and Machine Learning in the Enterprise”!
Lyft: Gil Arditi (Product Lead, Machine Learning) will speak about “From Self-Driving to Fraud Detection – How Lyft Streamlines Machine Learning Deployment”.
LinkedIn: Do you want to learn something about launching artificial intelligence products? Then the session from Priyanka Gariba (Head of Artificial Intelligence Technical Program Management) is the right fit for you!
Visa: Abhishek Joshi 'AJ' (Senior Director) will explain "The Predictive Power of Card Transaction Data".
Microsoft: Jaya Mathew (Senior Data Scientist) will present "Getting the Most out of Your IoT Data: Basics of Predictive Maintenance, Bootstrapping your Model with Your Data & Comparing the Traditional vs Deep Learning Methodologies"
Verizon: Join Bryan Larish Ph.D. (Director of Technology) at his session “Auditing Cellular Networks with Deep Learning”

 See the full 8-track agenda here.

What are the 5 conferences about?

Predictive Analytics World for Business focuses on concrete examples of deployed predictive analytics. Hear from the horse's mouth precisely how Fortune 500 analytics competitors and other top practitioners deploy machine learning, and the kind of business results they achieve.

Predictive Analytics World for Financial Services is the leading data science event covering the deployment of machine learning by banks, insurance companies, credit card companies, investment firms, and other financial institutions.

The Predictive Analytics World Healthcare program will feature sessions and case studies across healthcare business operations and clinical applications, so you can witness how data science and machine learning are employed at leading enterprises, resulting in improved outcomes, lower costs, and higher patient satisfaction.

Predictive Analytics World for Industry 4.0 is the leading vendor-neutral conference on machine learning for smart manufacturing and IoT. Data scientists, industrial planners, and other machine learning experts will meet in Las Vegas to explore the latest trends and technologies in machine and deep learning for the IoT era.

Deep Learning World is the premier conference covering the commercial deployment of deep learning. The event's mission is to foster breakthroughs in the value-driven operationalization of established deep learning methods.

Attend any or all of the five jointly scheduled events! Get an additional 15% discount on your booking with the code RBLOGGERSMLW. Join us in Las Vegas for the largest Predictive Analytics World event to date! Learn more

New Package to Process TVDI index and Filter Golay Savitzky Raster

Description

  • Use MODIS image to calculate TVDI index
  • Make multiple Raster images at the same time
  • Can be used to calculate large image files
  • UI interface calculates TVDI index
  • UI interface exports Golay Savitzky filter images
  • The functions in the TVDI package
    • Golay_Raster
    • Golay_GUI (may fail if you don’t have GTK+)
    • Mean_Raster
    • Mask_Multi_Raster
    • IQR_Raster
    • TVDI_process
    • TVDI_Largefiles_process
    • TVDI_GUI (may fail if you don’t have GTK+)

How to Download and Install

  • Download and Install from Github
install.packages("devtools")
library("devtools")
install_github("nguyenduclam/TVDIpk")
library("TVDIpk")
  • Install from CRAN (waiting for update on CRAN)
install.packages("TVDIpk")
  • Note that if the GTK+ library is not already installed on your system, installation may fail. In that case, please install and load the gWidgetsRGtk2 library beforehand:
install.packages("gWidgetsRGtk2")
library("gWidgetsRGtk2")

How to Use the Package

  1. Golay UI
    • Golay_GUI()
  2. TVDI UI
    • TVDI_GUI()


rco: Make Your R Code Run Faster Today!

The rco package can optimize R code in a variety of ways. The package implements common subexpression elimination, constant folding, constant propagation, and dead code elimination, among other relevant code optimization strategies.

Currently, rco can be downloaded as a GitHub package. The rco package can be used as an RStudio Addin, through a shiny GUI interface, or as R functions via either optimize_files() or optimize_text(). Found an optimization method not currently supported by the rco package? Feel free to contribute! Your contribution is more than welcome. Install rco now and make your R code run faster today!
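A short usage sketch with the two functions named above; the GitHub path is assumed to be jcrodriguez1989/rco and should be checked against the repository:

# install the development version from GitHub (repository path assumed)
# install.packages("remotes")
# remotes::install_github("jcrodriguez1989/rco")
library(rco)

# Constant folding / propagation example: expressions whose values are
# known ahead of time can be simplified by the optimizer.
code <- "
x <- 3 * 8
y <- x + 1
"
optimize_text(code)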

15+ Resources to Get Started with R

R is the second most sought after language in data science behind Python, so gaining mastery of R is a prerequisite to a thriving career in the field. Whether you’re an experienced developer or a newbie considering a career move, here are some excellent resources so you can get started with R.


What is R?

R is a programming language and environment designed for statistical analysis. It’s used mainly by data miners and statisticians. It’s a free resource and runs on a wide variety of platforms, including UNIX, Windows, and Mac OS. It has thousands of well-documented extensions and is cross-platform compatible. It’s not quite as popular outside of the field of data science, but it’s one of the best options for exploring datasets in a deep-dive manner or for one-off data insights. Head over to the R site and download a copy of R, so you’re ready to get started.

Free R Resources for Beginners

Let’s take a look at how a beginner might break into R. It’s not quite as friendly as Python, but it’s definitely accessible with good resources and practice. 

Platforms and Documentation

r-bloggers.com: R-bloggers is a collection of blogs designed by R experts that covers a wide range of R topics. No matter what you’re curious about or have an issue with, R-bloggers has it covered.

Books

R for Data Science: This classic handbook provides resources and documentation. It’s available for free on the website, or you can purchase a physical copy from Amazon.

Hands-On Programming with R: Garrett Grolemund’s classic is a practical, hands-on approach to R programming. It gives you the instruction you need plus practical programming skills to begin with R right from the very beginning.

Courses

Codecademy: Codecademy’s mission is to bring development knowledge even to beginners, and its R offerings are no different. While many of the lessons require a membership, it does offer a basic set of courses to get you started.

edX.org: edX offers a range of free R courses to get you started, but we recommend starting with Microsoft’s Introduction to R for Data Science for a comprehensive overview. The courses cost nothing and are offered asynchronously. Some do come with official certification for a fee.

Free R Resources for Developers

If you’ve already got some development experience under your belt, these resources could be a great way to get started with R by utilizing your current experience. Even better, they’re free.

Platforms and Documentation

storybench.com: Storybench is an experiential learning platform designed to provide exercises in digital storytelling. They offer projects in R, most notably “How to Explore Correlations in R.” Once you’ve gotten the basics, the next logical step is to take on projects for hands-on learning.

Books

R Programming for Data Science: This option is available for free (though you can choose to donate in support of the project). It offers full resources for learning R and understanding key data science principles. If you upgrade the package, the online book comes with a full video suite.

Text Mining with R: Another book available for free on the website, this option offers a targeted approach to text mining with a full GitHub repository as support.

R in Action: Another entirely free resource for business developers. If you’ve already got an established career in development in the business world, this could be an excellent resource for getting started with R in the business world.

Courses

Coursera: Johns Hopkins’ popular partnership with Coursera, “Data Science: Foundations Using R”, is a great way for developers to build the skills needed to break into the field of data science.

edX + Harvard: The X Series Program in Data Analysis for Life Sciences is a course series designed to combine learning R with real-life projects for a full learning experience. You can upgrade to an official learning certificate for a fee or take the individual courses for free.

Paid Resources for Beginners and Beyond

Sometimes, you’ve got to invest a little in your learning experience. Here are a couple of things that can really jumpstart your R programming skills if you’ve got some wiggle room in your budget.

Getting Started with R: A primer on using R for the biological sciences. It contains valuable information for getting started on statistical analysis using the R programming language.

flowingdata.com: Flowingdata is a membership site designed to elevate your visualizations. It’s another excellent way to get experiential learning with R projects.

RStudio: It’s not cheap, but if you’re serious about making a career in R, you’ll want to get it. Save up and invest. They do, however, have a series of free webinars you can peruse.

Bonus! More Blogs and Newsletters

https://blog.revolutionanalytics.com/r/: A blog designed to highlight milestones in data science, covering a range of topics including both R and Python for the multilingual developers out there.

https://journal.r-project.org/: An open access, refereed journal detailing the latest in R programming news and projects. Papers have a focus on accessibility, and the articles are intended to reach a wide audience.

https://morningcupofcoding.com/: Offers a wide range of curated coding articles, including R programming. Check their back issues for articles of interest.

opendatascience.com: ODSC’s general weekly newsletter provides members with trending topics in the fields of modeling, tools & platforms, and more.

Getting Started with R Programming


While both Python and R are invaluable resources for getting started in Data Science, the statistical capabilities of R are in wide demand. Whether you’re new to the world of coding or an experienced developer, R can open doors into the world of Data Science.

Beginners guide to Bubble Map with Shiny

Map Bubble
A map bubble is a type of map chart where the bubble or circle position indicates geographical location and the bubble size is used to show differences in the magnitude of quantitative variables like population.

We will be using the highcharter package to show earthquake magnitude and depth. Highcharter is a versatile charting library for building interactive charts, and it is one of the easiest to learn and to integrate with Shiny.

Bubble Map

 

About the dataset
The dataset used here is from the US Geological Survey website and covers recent one-week earthquake events. There are about 420 recorded observations with magnitude more than 2.0 globally. The dataset has 22 variables, of which we will be using time, latitude, longitude, depth, magnitude (mag) and the nearest named place of the event.

Shiny Application
This application has a single app.R file and the earthquake dataset. Before we start with the UI function, we will load the dataset and fetch the world JSON object from the Highcharts map collection with the hcmap function. See the app here

library(shiny)
library(highcharter)
library(dplyr)
edata <- read.csv('earthquake.csv') %>% rename(lat=latitude,lon = longitude)
wmap <- hcmap()
Using the dplyr package, the latitude and longitude variables are renamed to lat and lon with the rename verb. The column names are important.


ui
It has a sidebar panel with 3 widgets and a main panel for displaying the map.

  • Two sliders, one for filtering out low magnitude values and the other for adjusting bubble size.
  • One select widget for the bubble size variable. The user can select the magnitude or depth of the earthquake event; mag and depth are columns in the dataset.
  • The widget output function highchartOutput for use in Shiny.
ui <- fluidPage(
   
titlePanel("MapBubble"), # Application title
 sidebarLayout(
 sidebarPanel(     
  
   sliderInput('mag', 'Magnitude more than (Richter Scale)', min = 1, max = 6, step = 0.5, value = 1),  # default kept within the slider range
 
   selectInput('bubble','Bubble Size indicates',choices = c('Magnitude'= 'mag','Depth(in Km)' = 'depth')),

   sliderInput('bublesize','Adjust bubble Size',min = 2,max = 10,step = 1,value = 6)      
        ),
      
      # Display a Map Bubble
      mainPanel(
        highchartOutput('eqmap',height = "500px")         
      )
   )
)
Server
Before rendering, we will filter the dataset within a reactive context. Any numeric column that we want to indicate with bubble size should be named z. input$bubble comes from the select widget.

renderHighchart is the render function for use in Shiny. We pass the filtered data and the chart type mapbubble to the hc_add_series function. Place, time and the z variable are displayed in the tooltip with the "point" format.
The subtitle is used to show the number of filtered observations on the map.

 
server <- function(input, output) {

  # filter by magnitude and rename the selected column to z (used for bubble size)
  data <- reactive(edata %>%
               filter(mag >= input$mag) %>%
               rename(z = input$bubble))

  output$eqmap <- renderHighchart(

    wmap %>%
      hc_legend(enabled = FALSE) %>%
      # bubble size is given as a percentage of the plot area
      hc_add_series(data = data(), type = "mapbubble", name = "",
                    maxSize = paste0(input$bublesize, '%')) %>%
      hc_tooltip(useHTML = TRUE, headerFormat = '',
                 pointFormat = paste('Location: {point.place} <br>',
                                     'Time: {point.time} <br>',
                                     input$bubble, ': {point.z}')) %>%
      hc_title(text = "Global Seismic Activity") %>%
      hc_subtitle(text = paste('No of obs:', nrow(data()), sep = '')) %>%
      hc_mapNavigation(enabled = TRUE)
  )
}
}

# Run the application 
shinyApp(ui = ui, server = server)
The Shiny R file can be found at the GitHub repository.

Generate synthetic data using R

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’. 

Steps to build synthetic data


1. Installation


Install the conjurer package by using the following code. Since the package uses base R functions, it does not have any dependencies.
 install.packages("conjurer") 

2. Build customers


A customer is identified by a unique customer identifier (ID). A customer ID is alphanumeric, with the prefix “cust” followed by a number. This number ranges from 1 to the number of customers provided as the argument to the function. For example, if there are 100 customers, then the customer IDs will range from cust001 to cust100. This ensures that the customer ID is always of the same length. Let us build a group of customer IDs using the following code. For simplicity, let us assume that there are 100 customers. Customer IDs are built using the function buildCust. This function takes one argument, “numOfCust”, which specifies the number of customer IDs to be built.
library(conjurer)
customers <- buildCust(numOfCust =  100)
print(head(customers))
#[1] "cust001" "cust002" "cust003" "cust004" "cust005" "cust006"

3. Build products


The next step is building some products. A product is identified by a product ID. Similar to a customer ID, a product ID is also alphanumeric, with the prefix “sku”, which signifies a stock keeping unit. This prefix is followed by a number ranging from 1 to the number of products provided as the argument to the function. For example, if there are 10 products, then the product IDs will range from sku01 to sku10. This ensures that the product ID is always of the same length. Besides the product ID, the product price range must be specified. Let us build a group of products using the following code. For simplicity, let us assume that there are 10 products and that their price range is from 5 dollars to 50 dollars. Products are built using the function buildProd. This function takes 3 arguments, as given below.
    • numOfProd. This defines the number of product IDs to be generated.
    • minPrice. This is the minimum value of the price range.
    • maxPrice. This is the maximum value of the price range.
library(conjurer)
products <- buildProd(numOfProd = 10, minPrice = 5, maxPrice = 50)
print(head(products))
#     SKU Price
# 1 sku01 43.60
# 2 sku02 48.56
# 3 sku03 36.16
# 4 sku04 19.02
# 5 sku05 17.19
# 6 sku06 25.35

4. Build transactions


Now that a group of customer IDs and products are built, the next step is to build transactions. Transactions are built using the function genTrans. This function takes 5 arguments. The details are as follows.
    • cycles. This represents the cyclicality of the data. It can take the following values:
      • "y". If cycles is set to the value "y", it means that there is only one instance of a high number of transactions during the entire year. This is a very common situation for some retail clients, where the highest number of sales occurs during the holiday period in December.
      • "q". If cycles is set to the value "q", it means that there are 4 instances of a high number of transactions. This is generally noticed in the financial services industry, where financial statements are revised every quarter and have an impact on equity transactions in the secondary market.
      • "m". If cycles is set to the value "m", it means that there are 12 instances of a high number of transactions in a year. The number of transactions increases once every month and then subsides for the rest of the month.
    • spike. This represents the seasonality of the data. It can take any value from 1 to 12. These numbers represent the months of a year, from January to December respectively. For example, if spike is set to 12, it means that December has the highest number of transactions.
    • trend. This represents the slope of data distribution. It can take a value of 1 or -1.
      • If the trend is set to value 1, then the aggregated monthly transactions will exhibit an upward trend from January to December and vice versa if it is set to -1.
    • outliers. This signifies the presence of outliers. If set to value 1, then outliers are generated randomly. If set to value 0, then no outliers are generated. The presence of outliers is a very common occurrence and hence setting the outliers to 1 is recommended. However, there are instances where outliers are not needed. For example, if the objective of data generation is solely for visualization purposes then outliers may not be needed.
    • transactions. This represents the number of transactions to be generated.
Let us build transactions using the following code
transactions <- genTrans(cycles = "y", spike = 12, outliers = 1, transactions = 10000)
Visualize generated transactions by using
TxnAggregated <- aggregate(transactions$transactionID, by = list(transactions$dayNum), length)
plot(TxnAggregated, type = "l", ann = FALSE)

5. Build final data


Bringing customers, products and transactions together is the final step of generating synthetic data. This process entails 3 steps as given below.

5.1 Allocate customers to transactions


The allocation of transactions is achieved with the help of the buildPareto function. This function takes 3 arguments, as detailed below.
    • factor1 and factor2. These are the factors to be mapped to each other. As the name suggests, they must be of data type factor.
    • pareto. This defines the percentage allocation and is a numeric data type. This argument takes the form c(x, y), where x and y are numeric and their sum is 100. If we set pareto to c(80, 20), it allocates 80 percent of factor1 to 20 percent of factor2. This is based on the well-known Pareto principle.
Let us now allocate transactions to customers first by using the following code.
customer2transaction <- buildPareto(customers, transactions$transactionID, pareto = c(80,20))
Assign readable names to the output by using the following code.
names(customer2transaction) <- c('transactionID', 'customer')

#inspect the output
print(head(customer2transaction))
#   transactionID customer
# 1     txn-91-11  cust072
# 2    txn-343-25  cust089
# 3    txn-264-08  cust076
# 4    txn-342-07  cust030
# 5      txn-2-19  cust091
# 6    txn-275-06  cust062

5.2 Allocate products to transactions


Now, using a similar step as above, allocate transactions to products using the following code.
product2transaction <- buildPareto(products$SKU,transactions$transactionID,pareto = c(70,30))
names(product2transaction) <- c('transactionID', 'SKU')

#inspect the output
print(head(product2transaction))
#   transactionID   SKU
# 1    txn-182-30 sku10
# 2    txn-179-21 sku01
# 3    txn-179-10 sku10
# 4    txn-360-08 sku01
# 5     txn-23-09 sku01
# 6    txn-264-20 sku10

5.3 Final data


Finally, merge the two allocation tables with the transactions table to build the final data set using the following code.
df1 <- merge(x = customer2transaction, y = product2transaction, by = "transactionID")

dfFinal <- merge(x = df1, y = transactions, by = "transactionID", all.x = TRUE)

#inspect the output
print(head(dfFinal))
#   transactionID customer   SKU dayNum mthNum
# 1      txn-1-01  cust076 sku03      1      1
# 2      txn-1-02  cust062 sku04      1      1
# 3      txn-1-03  cust087 sku07      1      1
# 4      txn-1-04  cust010 sku04      1      1
# 5      txn-1-05  cust039 sku01      1      1
# 6      txn-1-06  cust010 sku01      1      1
Thus, we have the final data set with transactions, customers and products.

Interpret the results

The column names of the final data frame can be interpreted as follows.
    • Each row is a transaction and the data frame has all the transactions for a year i.e 365 days.
    • transactionID is the unique identifier for that transaction.
    • customer is the unique customer identifier. This is the customer who made that transaction.
    • SKU is the product that was bought in that transaction.
    • dayNum is the day number in the year. There would be 365 unique dayNum in the data frame.
    • mthNum is the month number. This ranges from 1 to 12 and represents January to December respectively.

Summary & concluding remarks


In this article, we started by building customers, products and transactions. Later on, we also saw how to bring them all together into a final data set. At the time of writing this article, the package is predominantly focused on building the basic data set and there is room for improvement. If you are interested in contributing to this package, please find the details at contributions.

How to reverse engineer a heat map into its underlying values

Astrolabe Diagnostics is a fully bootstrapped five-person biotech startup. We offer the Antibody Staining Data Set (ASDS), a free service that helps immunologists find out the expression of different molecules (markers) across subsets in the immune system. Essentially, the ASDS is a big table of numbers, where every row is a subset and every column a marker. Recently, the Sean Bendall lab at Stanford released the preprint of a similar study, where they measured markers for four of the subsets that the ASDS covered. Since the two studies used different techniques for their measurements I was curious to examine the correlation between the results. However, the preprint did not include any of the actual data. The closest was Figure 1D, a heat map for 98 of the markers measured in the study:



I decided to take the heat map image and “reverse engineer” it into the underlying values. Specifically, what I needed was the “Median scaled expression” referred to in the legend in the bottom right. Since I could not find any existing packages or use cases for easily doing this I decided to hack a solution (check out the code and PNG and CSV files at the github repository). First, I manually entered the marker names from the X-axis into a spreadsheet. Then, I cropped the above image, removing the legends, axes, and the top heat map row which includes an aggregate statistic not relevant to this exercise.



I loaded the image into R using the readPNG function from the png package. This results in a three-dimensional matrix where the first two dimensions are the pixel rows and columns and the third holds the RGB values. The rows map to the four subsets (“Transitional”, “Naive”, “Non-switched”, and “Switched”) and the columns map to the markers, and I wanted to get a single pixel value for each (Subset, Marker) combination. Deciding on the row for each subset was easy enough: I loaded the image in GIMP and picked rows 50, 160, 270, and 380. In order to find the column for each marker, I initially planned to iterate over the tile width. Unfortunately, tile widths are not consistent, which is further complicated by the vertical white lines. I ended up choosing them manually in GIMP as well:

Marker,Pixel
CD1d,14
CD31,40
HLA-DQ,70
CD352,100
CD21,128
CD196,156
CD79b,185
CD1c,219
...
I could now get the RGB value for a (Subset, Marker) combination from the PNG. For example, if I wanted the CD31 value for the “Non-switched” subset, I would go to heat_map_png[270, 40, ]. This gives me the vector [0.6823529, 0.0000000, 0.3882353]. In order to map these values to the “Median scaled expression” values, I used the legend in the bottom left. First, I cropped it into its own PNG file:



I imported it into R using readPNG, arbitrarily took the pixels from row 10, and mapped them into values using seq:

# Import legend PNG, keep only one row, and convert to values. The values "0"
# and "0.86" are taken from the image.
legend_png <- png::readPNG("legend.png")
legend_mtx <- legend_png[10, , ]
legend_vals <- seq(0, 0.86, length.out = nrow(legend_mtx))
At this point I planned to reshape the heat map PNG matrix into a data frame and join the RGB values into the legend values. However, this led to two issues.

One, reshaping a three-dimensional matrix into two dimensions is a headache since I want to make sure I end up with the row and column order I need. Sticking to the spirit of the hack, I iterated over all (Subset, Marker) values instead. This is inelegant (iterating in R is frowned upon) but is a reasonable compromise given the small image size.

Two, I can’t actually join on the legend RGB values. The heat map uses a gradient and therefore some of its values might be missing from the legend itself (the reader can visually infer them). Instead, I calculated the distance between each heat map pixel and the legend pixels and picked the nearest legend pixel for its “Median scaled expression”.

heat_map_df <- lapply(names(marker_cols), function(marker) {
  lapply(names(cell_subset_rows), function(cell_subset) {
    v <- t(heat_map_png[cell_subset_rows[cell_subset], marker_cols[marker], ])
    dists <- apply(legend_mtx, 1, function(x) sqrt(sum((x - v) ^ 2)))
    data.frame(
      Marker = marker,
      CellSubset = cell_subset,
      Median = legend_vals[which.min(dists)],
      stringsAsFactors = FALSE
    )
  }) %>% dplyr::bind_rows()
}) %>% dplyr::bind_rows()
I now have the heat_map_df values I need to compare to the ASDS! As a sanity check, I reproduced the original heat map using ggplot:
library(ggplot2)

heat_map_df$Marker <- 
  factor(heat_map_df$Marker, levels = names(marker_cols))
heat_map_df$CellSubset <-
  factor(heat_map_df$CellSubset, levels = rev(names(cell_subset_rows)))

ggplot(heat_map_df, aes(x = Marker, y = CellSubset)) +
  geom_tile(aes(fill = Median), color = "white") +
  scale_fill_gradient2(
    name = "Median Scaled Expression",
    low = "black", mid = "red", high = "yellow",
    midpoint = 0.4) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.4),
        axis.title = element_blank(),
        legend.position = "bottom",
        panel.background = element_blank())


The resulting code gets the job done and can be easily repurposed for other heat maps. There will be some manual work involved, namely, setting cell_subset_rows to the rows in the new heat map, updating marker_cols.csv accordingly, and setting the boundary values in the seq call when calculating legend_vals. Furthermore, we should be able to adapt the above into a more autonomous solution by calculating the boundaries between tiles using diff, running it separately on the rows and the columns (getting the row and column labels will not be trivial and will require OCR). For a one-time exercise, though, the above hack works remarkably well — sometimes that is all the data science you need to get the job done. Check out this YouTube video for the actual comparison between the data sets!

Vignette: Podlover – A Package to Analyze Podcasting Data

Note: Some of the code blocks below got reformatted by the WordPress editor. To see the working code, please visit the original vignette in the repo’s README.md file.

The Backstory: Podlove – a WordPress plugin for Podcasting

The Podlove podcasting suite is an open source toolset that helps you publish and manage a podcast within a WordPress blog. Over the years it has become the de facto standard for easy yet flexible podcast publishing in the German-speaking podcast community. Podlove includes an analytics dashboard that gives you an overview of how your podcast is performing over time and across various dimensions such as media formats. While it is a practical overview for everyday analytics, it is limited when it comes to more complex or fine-grained analytics.

podlover Brings Podlove Data into R

The podlover package allows you to access the data behind the Podlove dashboard. It connects to the relevant WordPress MySQL tables, fetches the raw data, and connects and cleans it into a tidy dataset with one row per download attempt. Furthermore, it allows you to:
    • plot download data for multiple episodes as point, line, area and ridgeline graphs
    • use options for absolute vs relative display (think: release dates vs. days since release) and cumulative vs. non-cumulative display.
    • compare episodes, episode formats, sources/contexts, podcast clients, and operating systems over time
    • create and compare performance data for episode launches and long-term performance
    • calculate and plot regressions to see if you are gaining or losing listeners over time.
This vignette demonstrates what podlover can do.

Installation

This package is based on the statistical programming framework R. If you’re a podcast producer who is new to R, you will need to install R as well as its programming environment RStudio and familiarize yourself with them. Both are free open source tools. Start here to get RStudio and R. podlover is available as a package from GitHub at https://github.com/lordyo/podlover and can be installed via devtools:
# install devtools if you don't have it already
install.packages("devtools")
# install podlover from GitHub
devtools::install_github("lordyo/podlover")
Once installed, you can load the package.
library(podlover)

Ways to Access Podlove Data

There are two ways to get your download data into podlover:
    • You either fetch the data directly from the WordPress database with podlove_get_and_clean(). This will give you the most recent data and, once established, is the most comfortable way of accessing data. Establishing the connection can be tricky, though.
    • Or you download the necessary tables and feed them to podlover with podlove_clean_stats(). This is easier, but takes longer and will only give you a snapshot of the data at a certain point in time.

Fetching Download Data via a Database Connection

Behind every WordPress site is a MySQL database containing almost everything that’s stored in the blog. When installing Podlove under WordPress, the plugin creates additional database tables containing podcast-specific data. The function podlove_get_and_clean() fetches those. To make that happen, you will need:
    1. db_name: The WordPress database’s name.
    2. db_host: The (external) hostname of the database.
    3. db_user: The database’s user name (usually not the same as your WordPress login).
    4. db_password: The user’s password for the database (usually not the same as your WordPress password).
    5. Permissions to access your database from an external IP address.
    6. The names of the database tables

Database name, user and password

db_name, db_user and db_password can be found in the wp-config.php file in the root folder of your podcast’s WordPress directory. Look for the following passage:
// ** MySQL settings - You can get this info from your web host ** //
/** The name of the database for WordPress */
define( 'DB_NAME', 'lorem_wp_ipsum' );
/** MySQL database username */
define( 'DB_USER', 'lorem_dolor' );
/** MySQL database password */
define( 'DB_PASSWORD', 'my_password' );

External Database Hostname

Note that this file also includes a hostname, but this is the internal hostname – you’re looking for the external hostname. This you will need to get from your hoster’s admin panel, usually where MySQL databases are managed. Check your hoster’s support section if you get stuck.

Access permissions

MySQL databases are sensitive to hacking attacks, which is why they usually aren’t accessible to just any visitor – even if she has the correct access information. You will probably need to grant an access permission to your database and user from the IP address R is running on (“whitelisting”). This is also done via your hoster’s admin panel. Check your hoster’s support sections for more info. (Note: Some hosters are stricter and don’t allow any access except via SSH tunnels. podlover doesn’t provide that option yet.)

Prefix of the Table Names

Finally, you might need to check whether the table name prefix in your WordPress database corresponds to the usual naming conventions. Most WordPress installations start the tables with wp_, but sometimes this prefix differs (e.g. wp_wtig_). For starters, you can just try the default prefix built into the function. If the prefix is different from the default, you will get an error message. If that’s the case, access your hoster’s MySQL management tool (e.g. phpMyAdmin, MySQL Workbench), open your database, check whether the tables start with something other than just wp_ and write down the prefix. You can of course also use a locally installed MySQL tool to do so (e.g. HeidiSQL).

Downloading the data

Once you have gathered all this, it’s time to access your data and store it in a data frame:
download_data <- podlove_get_and_clean()
Four input prompts will show up, asking for the database name, user, password and host. You have the option to save these values to your system’s keyring, so you don’t have to enter them repeatedly. Use ?rstudioapi::askForSecret to learn more about where these values are stored, or ?keyring to learn more about how keyrings are used within R. After entering the information, you should see something like this:
connection established
fetched table wp_podlove_downloadintentclean
fetched table wp_podlove_mediafile
fetched table wp_podlove_useragent
fetched table wp_podlove_episode
fetched table wp_posts
connection closed

Troubleshooting

You might also get an error message, meaning something went wrong. If you see the following error message…
Error in .local(drv, ...) : 
  Failed to connect to database: Error: Access denied for user 'username'@'XX.XX.XX.XXX' to database 'databasename'
…then the function couldn’t access the database. This means either that there’s something wrong with your database name, user name, password or host name (see sections “Database name, user and password” and “External Database Hostname” above), or that access to this database with this username is restricted, i.e. your IP is not whitelisted (see “Access permissions” above). If you can’t make that work, your only option is to download the tables yourself (see “Working with Local Table Downloads”). If your error says…
connection established
Error in .local(conn, statement, ...) : 
  could not run statement: Table 'dbname.tablename' doesn't exist
…this means you were able to access the database (congrats!), but the table names/prefix are incorrect. Check your table name prefix as described under “Prefix of the table names” and try again while specifying the prefix:
download_data <- podlove_get_and_clean(tbl_prefix = "PREFIX")

Working with Local Table Downloads

If you can’t or don’t want to work with podlove_get_and_clean(), you can still analyze your data by downloading the individual database tables yourself and feeding them directly to the cleaning function podlove_clean_stats(). First, you will need to get the necessary tables. The easiest way is to use your hoster’s database management tool, e.g. phpMyAdmin or MySQL Workbench. These can usually be accessed from your hosting administration overview: look for a “databases”, “MySQL” or “phpMyAdmin” option, find the list of tables (usually starting with wp_...), select a table and look for an “export” option. Export the tables to CSV. If you get stuck, check your hoster’s support section. Warning: When using database tools, you can break things – i.e. your site and your podcast. To be on the safe side, always make a backup first, don’t change any names or options, and don’t delete anything! You will need the following tables, each in its own CSV file with headings (column titles):
    • wp_podlove_downloadintentclean
    • wp_podlove_episode
    • wp_podlove_mediafile
    • wp_podlove_useragent
    • wp_posts
Note: The prefix of the tables (here wp_) might be different or longer in your case. Once you have downloaded the tables, you need to import them into R as data frames and use the podlove_clean_stats() function to connect the tables and clean the data:
# replace file names with your own
download_table <- read.csv("wp_podlove_downloadintentclean.csv", as.is = TRUE)
episode_table <- read.csv("wp_podlove_episode.csv", as.is = TRUE)
mediafile_table <- read.csv("wp_podlove_mediafile.csv", as.is = TRUE)
useragent_table <- read.csv("wp_podlove_useragent.csv", as.is = TRUE)
posts_table <- read.csv("wp_posts.csv", as.is = TRUE)
# connect & clean the tables
download_data <- podlove_clean_stats(df_stats = download_table,
                                     df_episode = episode_table,
                                     df_mediafile = mediafile_table,
                                     df_user = useragent_table,
                                     df_posts = posts_table)

Create Example Data

podlover includes a number of functions to generate example download tables. This can be useful if you want to test the package without having real data, or to write reproducible examples for a vignette like this one. We will use an example data set for the next chapters. Generate some random data with the function podlove_create_example() with ~10,000 downloads in total. The seed parameter fixes the randomization to give you the same data as in this example. The clean parameter states that you want a data frame of cleaned data, not the raw input tables.
downloads <- podlove_create_example(total_dls = 10000, seed = 12, clean = TRUE)
Here it is:
print(downloads)
#> # A tibble: 6,739 x 20
#>    ep_number title ep_num_title duration post_date  post_datehour      
#>                                        
#>  1 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  2 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  3 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  4 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  5 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  6 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  7 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  8 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  9 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#> 10 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#> # ... with 6,729 more rows, and 14 more variables: ep_age_hours ,
#> #   ep_age_days , hours_since_release , days_since_release ,
#> #   source , context , dldate , dldatehour ,
#> #   weekday , hour , client_name , client_type ,
#> #   os_name , dl_attempts 
Nice – this is all the data you need for further analysis. It contains information about the episode (what was downloaded?), the download (when was it downloaded?) and the user agent (how / by which agent was it downloaded?). By the way, if you want to access the raw data or try out the cleaning function, you can set the podlove_create_example() parameter clean = FALSE:
table_list <- podlove_create_example(total_dls = 10000, seed = 12)
The result is a list of 5 named tables (posts, episodes, mediafiles, useragents, downloads). Now all you have to do is feed them to the cleaning function:
downloads <- podlove_clean_stats(df_stats = table_list$downloads,
                                 df_mediafile = table_list$mediafiles,
                                 df_user = table_list$useragents,
                                 df_episodes = table_list$episodes,
                                 df_posts = table_list$posts)

Summary

Now that you have the clean download data, it’s time to check it out. podlover includes a simple summary function to give you an overview of the data:
podlove_podcast_summary(downloads)
#> 'downloads': 
#> 
#> A podcast with 10 episodes, released between 2019-01-01 and 2019-12-05.
#> 
#> Total runtime:  11m 4d 22H 0M 0S.
#> Average time between episodes: 2928240s (~4.84 weeks).
#> 
#> Episodes were downloaded 6739 times between 2019-01-01 and 2020-01-04.
#> 
#> Downloads per episode: 673.9
#> min: 132 | 25p: 375 | med: 703 | 75p: 791 | max: 1327
#> 
#> Downloads per day: 18.3
#> min: 1 | 25p: 3 | med: 7 | 75p: 16 | max: 572
#> NULL
If you set the parameter return_params to TRUE, you can access the individual indicators directly. The verbose parameter defines whether you want to see the printed summary.
pod_sum <- podlove_podcast_summary(downloads, return_params = TRUE, verbose = FALSE)
names(pod_sum)
#>  [1] "n_episodes"                 "ep_first_date"             
#>  [3] "ep_last_date"               "runtime"                   
#>  [5] "ep_interval"                "n_downloads"               
#>  [7] "dl_first_date"              "dl_last_date"              
#>  [9] "downloads_per_episode_mean" "downloads_per_episode_5num"
#> [11] "downloads_per_day_mean"     "downloads_per_day_5num"
pod_sum$n_downloads
#> [1] 6739
pod_sum$dl_last_date
#> [1] "2020-01-04 22:00:00 UTC"

Download curves

One of the main features of the podlover package is that it lets you plot all kinds of download curves over time – aggregated and grouped, with relative and absolute starting points. The plotting function relies on the ggplot2 package and the data needs to be prepared first. The function podlove_prepare_stats_for_graph() does just that, and the function podlove_graph_download_curves() takes care of the plotting.

Parameters

The functional combo of podlove_prepare_stats_for_graph() and podlove_graph_download_curves() accepts the following parameters for a graph:
    • df_stats: The clean data to be analyzed, as prepared by the import or cleaning function.
    • gvar: The grouping variable. Defining one will create multiple curves, one for each group. This needs to be one of the variables (columns) in the clean data:
      • ep_number: The episode’s official number
      • title: The episode’s title
      • ep_num_title: The episode’s title with the number in front
      • source: The download source – e.g. “feed” for RSS, “webplayer” for plays on a website, “download” for file downloads
      • context: The file type for feeds and downloads, “episode” for feed accesses
      • client_name: The client application (e.g. the podcatcher’s or browser’s name)
      • client_type: A more coarse grouping of the clients, e.g. “mediaplayer”, “browser”, “mobile app”.
      • os_name: The operating system’s name of the client (e.g. Android, Linux, Mac)
      • Any other grouping variable you create yourself from the existing data.
    • hourly (podlove_prepare_stats_for_graph() only): If set to TRUE, the downloads will be shown per hour, otherwise per day
    • relative (podlove_prepare_stats_for_graph() only): If set to TRUE, the downloads will be shown relative to their publishing date, i.e. all curves starting at 0. Otherwise, the curves will show the download on their specific dates.
    • cumulative (podlove_graph_download_curves() only): If set to TRUE, the downloads will accumulate and show the total sum over time (rising curve). Otherwise, they will show uncumulated downloads (scattered peaks).
    • plot_type (podlove_graph_download_curves() only): What kind of plot to use – either line plots ("line") on one graph, or individual ridgeline plots ("ridge").
    • labelmethod (podlove_graph_download_curves() only): Where to attach the labels ("first.points" for the beginning of the line, "last.points" for the end of the line)

Total downloads over time

Let’s say you want to see the daily total downloads of your podcast over time, in accumulated fashion. First, you prepare the graphics data necessary:
total_dls_acc <- podlove_prepare_stats_for_graph(df_stats = downloads, 
                                                 hourly = FALSE, 
                                                 relative = FALSE)
Here, you are not specifying any gvar (which means you’ll get just one curve instead of many). hourly is set to FALSE (= daily data) and relative is set to FALSE (absolute dates). Now feed this data to the plotting function:
g_tdlacc <- podlove_graph_download_curves(df_tidy_data = total_dls_acc,
                                          cumulative = TRUE, 
                                          plot_type = "line",
                                          printout = FALSE)
print(g_tdlacc)
If we don’t cumulate the data, we can see the individual spikes of the episode launches:
g_tdl <- podlove_graph_download_curves(df_tidy_data = total_dls_acc,
                                          cumulative = FALSE, 
                                          plot_type = "line",
                                          printout = FALSE)
print(g_tdl)

Downloads by episode

Now you want to look at the individual episodes. For this, you will need to use the gvar parameter. For an episode overview, you can either set it to title, ep_number or ep_num_title. Here, we’re using title (unquoted!) and specify the labelmethod to show the labels at the beginning of the curves.
ep_dls_acc <- podlove_prepare_stats_for_graph(df_stats = downloads,  
                                              gvar = title, # group by episode title
                                              hourly = FALSE,  
                                              relative = FALSE)
g_ep_dlsacc <- podlove_graph_download_curves(df_tidy_data = ep_dls_acc,
                                             gvar = title, # use the same gvar!
                                             cumulative = TRUE,
                                             plot_type = "line", 
                                             labelmethod = "first.points",
                                             printout = FALSE)
print(g_ep_dlsacc)
As you can see, this shows the curves spread over the calendar X axis. But how do the episodes hold up against each other? For this, we use the parameter relative = TRUE, which lets all curves start at the same point. The labelling parameter labelmethod = "last.points" works better for this kind of curve.
ep_dls_acc_rel <- podlove_prepare_stats_for_graph(df_stats = downloads,  
                                              gvar = title,
                                              hourly = FALSE,  
                                              relative = TRUE) # relative plotting
g_ep_dlsaccrel <- podlove_graph_download_curves(df_tidy_data = ep_dls_acc_rel,
                                             gvar = title,
                                             cumulative = TRUE,
                                             plot_type = "line", 
                                             labelmethod = "last.points",
                                             printout = FALSE)
print(g_ep_dlsaccrel)
If you want to look at the uncumulated data, the line plot doesn’t work very well. For this, a ridge plot is the right choice (but only if you don’t have too many episodes):
ep_dls <- podlove_prepare_stats_for_graph(df_stats = downloads,  
                                              gvar = ep_num_title, # better for sorting
                                              hourly = FALSE,  
                                              relative = FALSE)
g_ep_dls <- podlove_graph_download_curves(df_tidy_data = ep_dls,
                                             gvar = ep_num_title,
                                             cumulative = FALSE, # no cumulation
                                             plot_type = "ridge", # use a ridgeline plot
                                             printout = FALSE)
print(g_ep_dls)

Downloads by other parameters

You can compare not only episodes, but also aspects of episodes or downloads. Let’s look at the parameter source, which lists how our listeners get their episodes. The labelmethod here is set to angled.boxes:
source_acc <- podlove_prepare_stats_for_graph(df_stats = downloads,  
                                              gvar = source, # new gvar
                                              hourly = FALSE,  
                                              relative = FALSE) 
g_source_acc <- podlove_graph_download_curves(df_tidy_data = source_acc,
                                             gvar = source,  # same as above!
                                             cumulative = TRUE,
                                             plot_type = "line", 
                                             labelmethod = "angled.boxes",
                                             printout = FALSE)
print(g_source_acc)

Episode Performance

Podcast episode downloads typically follow a “heavy front, long tail” distribution: many downloads happen right after release via automatic downloads by podcatchers, while fewer trickle in over the long term. It therefore helps to distinguish between the episode launch period (the heavy front) and the long-term performance (the long tail). For this, you can use the function podlove_performance_stats(), which creates a table of performance stats per episode. You will first need to define how long a “launch” lasts and when the long-term period starts. The two don’t have to be adjacent. Here, we’ll use 0-3 days for the launch and 30 days and above for the long-term period.
perf <- podlove_performance_stats(downloads, launch = 3, post_launch = 30)
perf
#> # A tibble: 10 x 5
#>    title               dls dls_per_day dls_per_day_at_lau~ dls_per_day_after_la~
#>    <chr>             <dbl>       <dbl>               <dbl>                 <dbl>
#>  1 Acute myeloid le~   669        4.17               131                  1.20  
#>  2 Ashton-under-Lyne  1243        3.37               198.                 1.46  
#>  3 Cortinarius viol~   375        4.57                75.4                1.05  
#>  4 Debora Green        132        4.40                39                  0.0333
#>  5 Ficus aurea         461        4.26                86.2                1.17  
#>  6 Gwoyeu Romatzyh     776        4.16               132.                 1.23  
#>  7 Mary Toft          1327        3.87               226.                 1.49  
#>  8 Samantha Smith      737        2.79               133.                 1.38  
#>  9 Shapinsay           791        3.72               140                  1.33  
#> 10 White-winged fai~   228        4.07                61.5                0.732
colnames(perf)
#> [1] "title"                    "dls"                     
#> [3] "dls_per_day"              "dls_per_day_at_launch"   
#> [5] "dls_per_day_after_launch"
The table shows four values per episode: the overall downloads, the overall average downloads per day, the average downloads per day during the launch and the average downloads per day during the post-launch period. If you want to see a ranking of the best launches, you can just sort the list:
perf %>%
  dplyr::select(title, dls_per_day_at_launch) %>% 
  dplyr::arrange(desc(dls_per_day_at_launch))
#> # A tibble: 10 x 2
#>    title                  dls_per_day_at_launch
#>    <chr>                                  <dbl>
#>  1 Mary Toft                              226. 
#>  2 Ashton-under-Lyne                      198. 
#>  3 Shapinsay                              140  
#>  4 Samantha Smith                         133. 
#>  5 Gwoyeu Romatzyh                        132. 
#>  6 Acute myeloid leukemia                 131  
#>  7 Ficus aurea                             86.2
#>  8 Cortinarius violaceus                   75.4
#>  9 White-winged fairywren                  61.5
#> 10 Debora Green                            39
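The same idea works for the long-tail side of the table, for example a ranking by dls_per_day_after_launch (a small variation on the snippet above; output omitted):
perf %>%
  dplyr::select(title, dls_per_day_after_launch) %>% 
  dplyr::arrange(desc(dls_per_day_after_launch))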
So there are episodes with different launch strengths and long-term performance. Can you plot them against each other? Yes, you can! The function podlove_graph_performance() gives you a nice four-box grid. The top right corner shows “evergreen” episodes with strong launches and strong long-term performance, the top left the “shooting stars” with strong launches and a loss of interest over time, the bottom right the “slow burners” which took a while to find an audience, and the bottom left shows… well, the rest.
g_perf <- podlove_graph_performance(perf, printout = FALSE)
print(g_perf)

Regression Analysis: Is your podcast gaining or losing listeners?

The one question every podcast producer is asking themselves is: “Is my podcast’s audience growing?”. This is not an easy question to answer, because you’re dealing with lagging time series data. One approach to answering it is to calculate a regression model of downloads against time or episode number (tip of the hat to Bernhard Fischer, who prototyped this idea). If the number of downloads measured at a specified time after launch is rising over time, the podcast is gaining listeners. If it falls, it’s losing listeners. If it stays stable, it’s keeping listeners. To prepare the regression, you first need to use the function podlove_downloads_until() to create a dataset of downloads at a specific point in each episode’s life. The longer the period between launch and measuring point, the more valid the model will be – but you’ll also have fewer data points. For this example, we’ll pick a period of 30 days after launch:
du <- podlove_downloads_until(downloads, 30)
du
#> # A tibble: 10 x 12
#>    ep_number title ep_num_title duration post_date  post_datehour      
#>    <chr>     <chr> <chr>        <chr>    <date>     <dttm>             
#>  1 01        Asht~ 01: Ashton-~ 00:32:2~ 2019-01-01 2019-01-01 00:00:00
#>  2 02        Mary~ 02: Mary To~ 00:46:0~ 2019-01-27 2019-01-27 01:00:00
#>  3 03        Sama~ 03: Samanth~ 00:30:3~ 2019-04-15 2019-04-15 06:00:00
#>  4 04        Shap~ 04: Shapins~ 00:41:1~ 2019-06-06 2019-06-06 10:00:00
#>  5 05        Gwoy~ 05: Gwoyeu ~ 00:27:0~ 2019-07-02 2019-07-02 12:00:00
#>  6 06        Acut~ 06: Acute m~ 00:23:4~ 2019-07-28 2019-07-28 13:00:00
#>  7 07        Ficu~ 07: Ficus a~ 00:33:2~ 2019-09-18 2019-09-18 17:00:00
#>  8 08        Cort~ 08: Cortina~ 00:19:0~ 2019-10-14 2019-10-14 18:00:00
#>  9 09        Whit~ 09: White-w~ 00:18:3~ 2019-11-09 2019-11-09 20:00:00
#> 10 10        Debo~ 10: Debora ~ 00:28:5~ 2019-12-05 2019-12-05 22:00:00
#> # ... with 6 more variables: ep_age_hours <dbl>, ep_age_days <dbl>,
#> #   ep_rank <dbl>, measure_day <dbl>, measure_hour <dbl>, downloads <int>
We can feed this dataset into the regression function podlove_episode_regression(). You can choose whether you want to use the post_datehour parameter (better if your episodes don’t come out regularly) or ep_rank, which corresponds to the episode number (it has a different name because episode numbers are strings):
reg <- podlove_episode_regression(du, terms = "ep_rank")
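If your episodes don’t come out at regular intervals, the same call should also work with post_datehour as the regression term instead (a sketch based on the two options described above; not run here):
reg_time <- podlove_episode_regression(du, terms = "post_datehour")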
If you’re statistically inclined, you can check out the model directly and see if your model is significant:
summary(reg)
#> 
#> Call:
#> stats::lm(formula = formula_string, data = df_regression_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -222.21  -23.69  -11.74   57.09  155.70 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   789.47      71.28  11.075 3.94e-06 ***
#> ep_rank       -64.08      11.49  -5.578 0.000523 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 104.3 on 8 degrees of freedom
#> Multiple R-squared:  0.7955, Adjusted R-squared:  0.7699 
#> F-statistic: 31.12 on 1 and 8 DF,  p-value: 0.0005233
Or you could just check out the regression plot with the function podlove_graph_regression() and see in which direction the line points:
g_reg <- podlove_graph_regression(du, predictor = ep_rank)
Oh noes! It seems like your podcast is steadily losing listeners, at a rate of about 64 downloads per episode!
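Since the Call line in the summary above shows that podlove_episode_regression() fits a standard stats::lm() model, you can also use predict() on it, for example to estimate the expected 30-day downloads of a hypothetical next episode (a sketch, assuming the returned object behaves like any lm fit):
# Hypothetical episode 11: expected downloads 30 days after launch
predict(reg, newdata = data.frame(ep_rank = 11))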

Finally

There is much more that could be done with Podlove data, R, and podlover. For example, some of the download data has not yet been included in the import script, such as the geolocation data. Some of the functions are still quite manual and would benefit from wrapper functions to quickly analyze data. If you want to contribute, check out the GitHub repo at https://github.com/lordyo/podlover. And if you want to use or contribute to the Podlove project (the podcasting tools), check out https://podlove.org.

Choropleth maps with Highcharts and Shiny

Choropleth maps show differences in color or shading of pre-defined regions like states or countries, which correspond to differences in quantitative values such as total rainfall, average temperature, or economic indicators.



In our case, we will use the sales of a toy-making company in different countries around the world as the quantitative value. See the example with this Shiny app.



Highcharter is an R wrapper for Highcharts, a JavaScript-based charting library.
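Outside of Shiny, a minimal highcharter choropleth only needs a data frame with iso3 country codes and a value column. The following sketch uses made-up sales figures (not the salespoint.csv data used below) and the worldgeojson map shipped with highcharter:
library(highcharter)

# Made-up example data: total sales per country, keyed by iso3 code
toy_sales <- data.frame(
  iso3  = c("USA", "DEU", "FRA", "AUS", "JPN"),
  sales = c(250000, 120000, 98000, 45000, 87000)
)

highchart(type = "map") %>%
  hc_add_series_map(map = worldgeojson, df = toy_sales,
                    value = "sales", joinBy = "iso3") %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "Toy sales by country")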

Rendering Choropleth Maps with Highcharts in Shiny

  • To see the Highcharts/Shiny interaction, we will begin by creating a basic Shiny dashboard layout: it contains a single select widget and a single tab for displaying the map.

library(shiny)
library(shinydashboard)
library(highcharter)
library(countrycode)
library(dplyr)

sales <- read.csv('salespoint.csv')

ui <- dashboardPage(
  dashboardHeader(title = "Map"),

  dashboardSidebar(
    sidebarMenu(
      selectInput('yearid', 'Select Year',
                  choices = c(2003, 2004, 2005), selected = 2003)
    )
  ),

  dashboardBody(
    tabBox(title = 'ToyShop', id = 'tabset1', width = 12,
           tabPanel('World-Sales',
                    highchartOutput('chart', height = '500px')))
  )
)



  • In the server function, we filter and summarize the data with the dplyr library and create a reactive object:
server <- function(input, output, session) {

  total <- reactive({
    sales %>%
      filter(YEAR_ID == as.numeric(input$yearid)) %>%
      group_by(COUNTRY) %>%
      summarize(TOTAL_SALES = as.integer(sum(SALES))) %>%
      mutate(iso3 = countrycode(COUNTRY, "country.name", "iso3c"))
  })


Here we use the countrycode library to convert long country names into one of many coding schemes, adding a new column iso3 to the summarized data with the mutate function.
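For illustration, a single conversion with countrycode() looks like this (a standalone example, not part of the app code):
library(countrycode)
countrycode("United States", origin = "country.name", destination = "iso3c")  # "USA"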

  • Pass the reactive object into the renderHighchart function, customizing the tooltip and subtitle content with the reactive widget inputs.

  output$chart <- renderHighchart(
    highchart(type = "map") %>%
      hc_add_series_map(map = worldgeojson, df = total(),
                        value = "TOTAL_SALES", joinBy = "iso3") %>%
      hc_colorAxis(stops = color_stops()) %>%
      hc_tooltip(useHTML = TRUE, headerFormat = '',
                 pointFormat = paste0(input$yearid,
                                      ' {point.COUNTRY} Sales : {point.TOTAL_SALES} ')) %>%
      hc_title(text = 'Global Sales') %>%
      hc_subtitle(text = paste0('Year: ', input$yearid)) %>%
      hc_exporting(enabled = TRUE, filename = 'custom')
  )
}
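With ui and server defined, the app can be launched locally via the standard shinyApp() call (not shown in the original post):
shinyApp(ui = ui, server = server)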



The dataset and the Shiny R file can be downloaded from here.