Trend Forecasting Models and Seasonality with Time Series

Gasoline prices have always been an issue in Turkey; Turkish people love to drive wherever they go, but they complain about the prices anyway. I wanted to dig into the last seven years of prices and see how they developed. I used unleaded 95-octane gasoline prices from Petrol Ofisi, a fuel distribution company in Turkey, arranged monthly between 2013 and 2020 in the Turkish currency, lira (TL). The dataset is here

head(gasoline_df)
#        date gasoline
#1 2013-01-01     4.67
#2 2013-02-01     4.85
#3 2013-03-01     4.75
#4 2013-04-01     4.61
#5 2013-05-01     4.64
#6 2013-06-01     4.72
First of all, I wanted to see how gasoline prices developed over the period and analyze the fitted trend lines.

#Trend lines 
library(tidyverse)

ggplot(gasoline_df, aes(x = date, y = gasoline)) + geom_point() +
  stat_smooth(method = 'lm', aes(colour = 'linear'), se = FALSE) +
  stat_smooth(method = 'lm', formula = y ~ poly(x, 2), aes(colour = 'quadratic'), se = FALSE) +
  stat_smooth(method = 'lm', formula = y ~ poly(x, 3), aes(colour = 'cubic'), se = FALSE) +
  #exp.model.df is created below from the exponential trend model
  stat_smooth(data = exp.model.df, method = 'loess', aes(x, y, colour = 'exponential'), se = FALSE) 
Because the response variable is log-transformed in the exponential model, we have to exponentiate the fitted values and create a separate data frame (the exp.model.df used above) for the exponential trend line.

#exponential trend model
exp.model <- lm(log(gasoline)~date,data = gasoline_df) 
exp.model.df <- data.frame(x=gasoline_df$date,
                           y=exp(fitted(exp.model)))



As we see above, the cubic and quadratic models almost overlap each other and seem to fit the data better. However, we will analyze each model in detail.

The Forecasting Trend Models

The linear trend describes $y_t$, the value of the series at a given time $t$, as:

$y_t = \beta_0 + \beta_1 t + \epsilon_t$

where $\beta_0$ and $\beta_1$ are the coefficients.

model_linear <- lm(data = gasoline_df,gasoline~date)
Above, we created a model variable for the linear trend model. In order to compare the models, we extract from each trend model the adjusted coefficient of determination, which is used to compare regression models with different numbers of explanatory variables.

Adjusted $R^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$
$n$: sample size
$k$: number of explanatory variables
$R^2$: coefficient of determination, which is the square of the correlation between the observed and predicted values: $r^2_{y\hat{y}}$

adj_r_squared_linear <- summary(model_linear) %>% 
  .$adj.r.squared %>% 
  round(4)
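As a quick sanity check (a sketch, not in the original post), we can recompute the adjusted $R^2$ from the formula above and compare it with the value reported by summary(); here $k = 1$ because the linear trend model has a single explanatory variable.

#manual adjusted R-squared for the linear trend model
r2 <- summary(model_linear)$r.squared
n <- nrow(gasoline_df)
k <- 1
1 - (1 - r2) * ((n - 1)/(n - k - 1)) #should match adj_r_squared_linear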
The exponential trend, unlike the linear trend, allows the series to increase at an increasing rate in each period. It is described as:
$\ln(y_t) = \beta_0 + \beta_1 t + \epsilon_t$, where $\ln(y_t)$ is the natural logarithm of the response variable.

To make predictions from the fitted model, we use the exponential function $\hat{y}_t = \exp(b_0 + b_1 t + s_e^2/2)$, because the dependent variable was transformed with the natural logarithm. We add half of the squared residual standard error so that $\hat{y}_t$ does not fall below the expected value of $y_t$. To find $s_e$, we execute the summary function as below. In addition, we can see the significance of the coefficients and of the model: because their p-values are less than 0.05, they are significant at the 5% level of significance.

summary(model_exponential)

#Call:
#lm(formula = log(gasoline) ~ date, data = gasoline_df)

#Residuals:
#      Min        1Q    Median        3Q       Max 
#-0.216180 -0.083077  0.009544  0.087953  0.151469 

#Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -1.0758581  0.2495848  -4.311  4.5e-05 ***
#date         0.0001606  0.0000147  10.928  < 2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 0.09939 on 82 degrees of freedom
#Multiple R-squared:  0.5929,  Adjusted R-squared:  0.5879 
#F-statistic: 119.4 on 1 and 82 DF,  p-value: < 2.2e-16
The exponential model variable itself was created with:

model_exponential <- lm(data = gasoline_df, log(gasoline) ~ date)
To get the adjusted $R^2$, we first back-transform the predicted values.

library(purrr)

#0.09939 is the residual standard error (s_e) from the summary output above
y_predicted <- model_exponential$fitted.values %>% 
  map_dbl(~exp(. + 0.09939^2/2))
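Equivalently, the hard-coded residual standard error can be avoided by reading it straight from the model object (a small sketch; sigma() is a base R accessor for lm fits):

#same back-transformation without hard-coding s_e
y_predicted <- exp(fitted(model_exponential) + sigma(model_exponential)^2/2)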
And then, we execute the formula shown above to get the adjusted $R^2$.

r_squared_exponential <- (cor(y_predicted,gasoline_df$gasoline))^2

adj_r_squared_exponential <- 1-(1-r_squared_exponential)*
  ((nrow(gasoline_df)-1)/(nrow(gasoline_df)-1-1))
Sometimes a time series changes direction for a variety of reasons. This kind of target variable is fitted with polynomial trend models. If there is one change of direction, we use the quadratic trend model.

$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \epsilon_t$

If there are two changes of direction, we use the cubic trend model:

$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \epsilon_t$

Then again, we execute the same process as we did in the linear model.

#Model variables
#poly() requires a numeric input, so the Date column is converted
model_quadratic <- lm(data = gasoline_df, gasoline ~ poly(as.numeric(date), 2))
model_cubic <- lm(data = gasoline_df, gasoline ~ poly(as.numeric(date), 3))

#adjusted coefficients of determination
adj_r_squared_quadratic <- summary(model_quadratic) %>% 
  .$adj.r.squared

adj_r_squared_cubic <- summary(model_cubic) %>% 
  .$adj.r.squared
In order to compare all the models, we create a variable that gathers all the adjusted $R^2$ values.

adj_r_squared_all <- c(linear=round(adj_r_squared_linear,4),
                   exponential=round(adj_r_squared_exponential,4),
                   quadratic=round(adj_r_squared_quadratic,4),
                   cubic=round(adj_r_squared_cubic,4))
As we can see below, the results justify the graphic we built before; the polynomial models are much better than the linear and exponential models. Although the two polynomial models look almost the same, the quadratic model is slightly better than the cubic model.

adj_r_squared_all
     #linear exponential   quadratic       cubic 
     #0.6064      0.6477      0.8660      0.8653
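Since the quadratic model comes out on top, it can also be used to extend the trend a few periods ahead. The short sketch below is not part of the original analysis; it assumes the date column is of class Date, as used when fitting the models, and the future dates are made up purely for illustration (bearing in mind that polynomial trends extrapolate poorly far beyond the sample).

#trend forecast for the next 12 months from the quadratic model
future_dates <- data.frame(date = seq(as.Date("2020-01-01"), by = "month", length.out = 12))
round(predict(model_quadratic, newdata = future_dates), 2)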
Seasonality

The seasonality component represents repeats within a specific period of time. Time series with weekly, monthly, or quarterly observations tend to show seasonal variations that repeat every year. For example, sales of retail goods increase every year in the Christmas period, and holiday tours increase in the summer. In order to detect seasonality, we use decomposition analysis.

Decomposition analysis: $T$, $S$, and $I$ are the trend, seasonality, and random components of the series, respectively. When the seasonal variation increases as the time series increases, we use the multiplicative model.

$y_t = T_t \times S_t \times I_t$

If the variation looks constant, we should use the additive model.

$y_t = T_t + S_t + I_t$

To decide which model fits better, we have to look at the graph.

#we create a time series variable for the plot
gasoline_ts <- ts(gasoline_df$gasoline,start = c(2013,1),end = c(2019,12),frequency = 12)
#The plot has all the components of the series (T, S, I)
plot(gasoline_ts,lwd=2,ylab="Gasoline")


As we can see from the plot above, it seems that the multiplicative model would fit the data better, especially between 2016 and 2019. When we analyze the graph, we also see some cyclical (conjunctural) variations caused by the expansion or contraction of the economy. Unlike seasonality, cyclical fluctuations can be several months or years in length.

Additionally, the magnitude of cyclical fluctuations can change over time; because the fluctuations differ in magnitude and length, they are hard to capture with historical data. For that reason, we neglected the cyclical component of the series.

Extracting the seasonality

Moving averages (MA) are usually used to separate the effect of the trend from seasonality.

$m$-term moving average $= \frac{\text{sum of the last } m \text{ observations}}{m}$

We use the centred moving average (CMA), which is the average of two consecutive moving averages, to handle an even-order moving average. For example, the first two MA terms are $\bar{y}_{6.5}$ and $\bar{y}_{7.5}$, but there are no such terms in the original series; therefore we average the two to find the CMA term that matches up with the series:

$\bar{y}_7 = \frac{\bar{y}_{6.5} + \bar{y}_{7.5}}{2}$

The centre argument of the ma function defaults to TRUE, which gives the CMA value.

library(forecast)

gasoline_trend <- forecast::ma(gasoline_ts,12)
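To see what the centred moving average is doing, here is a quick check that is not part of the original post: the first non-missing value of gasoline_trend sits at position 7 and should equal the average of the first two 12-term means.

#manual 2x12 centred moving average for position 7
ma_first <- mean(gasoline_ts[1:12])   #average of observations 1-12, i.e. y-bar 6.5
ma_second <- mean(gasoline_ts[2:13])  #average of observations 2-13, i.e. y-bar 7.5
(ma_first + ma_second)/2              #should match gasoline_trend[7]
gasoline_trend[7]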
Note that $\bar{y}_t$ eliminates both the seasonal and random variations; hence $\bar{y}_t = T_t$.

Since $y_t = T_t \times S_t \times I_t$ and $\bar{y}_t = T_t$, the detrended variable is found by $\frac{y_t}{\bar{y}_t} = S_t \times I_t$; this is also called the ratio-to-moving-average method.

gasoline_detrend <- gasoline_ts/gasoline_trend
Each month has multiple ratios, each coinciding with a different year; in this sample, each month has seven ratios. The arithmetic average is used to determine a common value for each month; by doing this, we eliminate the random component and isolate the seasonality in the detrended variable. This is called the unadjusted seasonal index.

unadjusted_seasonality <- sapply(1:12, function(x) {
  mean(window(gasoline_detrend, c(2013, x), c(2019, 12), deltat = 1), na.rm = TRUE)
}) %>% round(4)
The seasonal indexes for monthly data should sum to 12, with an average of 1; to achieve that, each unadjusted seasonal index is multiplied by 12 and divided by the sum of the 12 unadjusted seasonal indexes.

adjusted_seasonality <- (unadjusted_seasonality*(12/sum(unadjusted_seasonality))) %>% 
                        round(4)
The average of the adjusted seasonal indices is 1; if an index equals 1, there is no seasonality in that month. In order to plot the seasonality:

#building a data frame to plot in ggplot (tibble() is the current replacement for data_frame())
adjusted_seasonality_df <- tibble(
  months = month.abb,
  index = adjusted_seasonality)

#converting char month names to factor sequentially
adjusted_seasonality_df$months <- factor(adjusted_seasonality_df$months,levels = month.abb)


ggplot(data = adjusted_seasonality_df, mapping = aes(x = months, y = index)) +
 geom_point(aes(colour = months)) +
 geom_hline(yintercept = 1, linetype = "dashed") +
 theme(legend.position = "none") +
 ggtitle("Seasonality of Unleaded Gasoline Prices for 95 Octane") +
 theme(plot.title = element_text(hjust = 0.5))






As we can see above, there is approximately a three-percent seasonal decrease in January and December; on the other hand, there is a two-percent seasonal increase in May and June. There is no seasonality in March, July, and August, because their index values are approximately equal to 1.

Decomposing the time series graphically

We will first show the trend line on the time series.

#Trend is shown by red line
plot(gasoline_ts, lwd = 2, ylab = "Gasoline")
lines(gasoline_trend, col = "red", lwd = 3)



And will isolate the trend line from the plot:

plot(gasoline_trend,lwd=3,col="red",ylab="Trend")



Let's see the seasonality line:

plot(as.ts(rep(unadjusted_seasonality,12)),lwd=2,ylab="Seasonality",xlab="")
And finally, we will show the randomness line; in order to do that:
$I_t = \frac{y_t}{S_t \times T_t}$

randomness <- gasoline_ts/(gasoline_trend*unadjusted_seasonality)
plot(randomness,ylab="Randomness",lwd=2)
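As a cross-check that is not part of the original post, base R's decompose() performs the same ratio-to-moving-average decomposition for the multiplicative model, so its seasonal figure should be close to the adjusted indices computed above.

#base R decomposition of the same series
decomposed <- decompose(gasoline_ts, type = "multiplicative")
round(decomposed$figure, 4)  #seasonal indices, comparable to adjusted_seasonality
plot(decomposed)             #observed, trend, seasonal and random panels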



References

Sanjiv Jaggia, Alison Kelly (2013). Business Statistics: Communicating with Numbers.

tsmp v0.4.8 release – Introducing the Matrix Profile API

A new tool for painlessly analyzing your time series

We’re surrounded by time-series data. From finance to IoT to marketing, many organizations produce thousands of these metrics and mine them to uncover business-critical insights. A Site Reliability Engineer might monitor hundreds of thousands of time series streams from a server farm, in the hopes of detecting anomalous events and preventing catastrophic failure. Alternatively, a brick and mortar retailer might care about identifying patterns of customer foot traffic and leveraging them to guide inventory decisions.

Identifying anomalous events (or “discords”) and repeated patterns (“motifs”) are two fundamental time-series tasks. But how does one get started? There are dozens of approaches to both questions, each with unique positives and drawbacks. Furthermore, time-series data is notoriously hard to analyze, and the explosive growth of the data science community has led to a need for more “black-box” automated solutions that can be leveraged by developers with a wide range of technical backgrounds.

We at the Matrix Profile Foundation believe there’s an easy answer. While it’s true that there’s no such thing as a free lunch, the Matrix Profile (a data structure & set of associated algorithms developed by the Keogh research group at UC-Riverside) is a powerful tool to help solve this dual problem of anomaly detection and motif discovery. Matrix Profile is robust, scalable, and largely parameter-free: we’ve seen it work for a wide range of metrics including website user data, order volume and other business-critical applications. And as we will detail below, the Matrix Profile Foundation has implemented the Matrix Profile across three of the most common data science languages (Python, R and Golang) as an easy-to-use API that’s relevant for time series novices and experts alike.

The basics of Matrix Profile are simple: If I take a snippet of my data and slide it along the rest of the time series, how well does it overlap at each new position? More specifically, we can evaluate the Euclidean distance between a subsequence and every possible time series segment of the same length, building up what’s known as the snippet’s “Distance Profile.” If the subsequence repeats itself in the data, there will be at least one match and the minimum Euclidean distance will be zero (or close to zero in the presence of noise).

In contrast, if the subsequence is highly unique (say it contains a significant outlier), matches will be poor and all overlap scores will be high. Note that the type of data is irrelevant: We’re only looking at pattern conservation. We then slide every possible snippet across the time series, building up a collection of Distance Profiles. By taking the minimum value for each time step across all distance profiles, we can build the final Matrix Profile. Notice that both ends of the Matrix Profile value spectrum are useful. High values indicate uncommon patterns or anomalous events; in contrast, low values highlight repeatable motifs and provide valuable insight into your time series of interest. For those interested, this post by one of our co-founders provides a more in-depth discussion of the Matrix Profile.
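To make the idea concrete, here is a tiny illustrative sketch in R of a single distance profile; it is not the Foundation's implementation, and real implementations also z-normalize each window and use far faster algorithms.

#naive distance profile: Euclidean distance from one query subsequence to every window
distance_profile <- function(series, query_start, m) {
  query <- series[query_start:(query_start + m - 1)]
  n_windows <- length(series) - m + 1
  sapply(seq_len(n_windows), function(i) {
    window <- series[i:(i + m - 1)]
    sqrt(sum((query - window)^2))
  })
}

set.seed(1)
x <- c(sin(1:100/5), rnorm(50), sin(1:100/5))  #a motif, noise, then the motif again
dp <- distance_profile(x, query_start = 1, m = 50)
order(dp)[1:3]  #position 1 is the trivial self-match; the repeat near position 151 should follow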


Although the Matrix Profile can be a game-changer for time series analysis, leveraging it to produce insights is a multi-step computational process, where each step requires some level of domain experience. However, we believe that the most powerful breakthroughs in data science occur when the complex is made accessible. When it comes to the Matrix Profile, there are three facets to accessibility: “out-of-the-box” working implementations, gentle introductions to core concepts that can naturally lead into deeper exploration, and multi-language accessibility. Today, we’re proud to unveil the Matrix Profile API (MPA), a common codebase written in R, Python and Golang that achieves all three of these goals.

Using the Matrix Profile consists of three steps. First, you Compute the Matrix Profile itself. However, this is not the end: you need to Discover something by leveraging the Matrix Profile that you’ve created. Do you want to find repeated patterns? Or perhaps uncover anomalous events? Finally, it’s critical that you Visualize your findings, as time series analysis greatly benefits from some level of visual inspection.

Normally, you’d need to read through pages of documentation (both academic and technical) to figure out how to execute each of these three steps. This may not be a challenge if you’re an expert with prior knowledge of the Matrix Profile, but we’ve seen that many users simply want to Analyze their data without wading through the methodology, just to get to a basic starting point. Can the code simply leverage some reasonable defaults to produce a reasonable output?

To parallel this natural computational flow, MPA consists of three core components:

  1. Compute (computing the Matrix Profile)
  2. Discover (evaluate the MP for motifs, discords, etc)
  3. Visualize (display results through basic plots)
These three capabilities are wrapped up into a high-level capability called Analyze. This is a user-friendly interface that enables people who know nothing about the inner workings of Matrix Profile to quickly leverage it for their own data. And as users gain more experience and intuition with MPA, they can easily dive deeper into any of the three core components to make further functional gains.

As an example, we’ll use the R flavour of MPA to analyze the synthetic time series shown below (here is the code):

Visual inspection reveals that there are both patterns and discords present. However, one immediate problem is that your choice of subsequence length will change both the number and location of your motifs! Are there only two sinusoidal motifs present between indices 0-500, or is each cycle an instance of the pattern? Let’s see how MPA handles this challenge:


Because we haven’t specified any information regarding our subsequence length, `analyze` begins by leveraging a powerful calculation known as the pan-matrix profile (or PMP) to generate insights that will help us evaluate different subsequence lengths. We’ll discuss the details of PMP in a later post (or you can read the associated paper), but in a nutshell, it is a global calculation of all possible subsequence lengths condensed into a single visual summary. The X-axis is the index of the matrix profile, and the Y-axis is the corresponding subsequence length. The darker the shade, the lower the Euclidean distance at that point. We can use the “peaks” of the triangles to find the 6 “big” motifs visually present in the synthetic time series.

The PMP is all well and good, but we promised a simple way of understanding your time series. To facilitate this, `analyze` will combine PMP with an under-the-hood algorithm to choose sensible motifs and discords from across all possible window sizes. The additional graphs created by `analyze` show the top three motifs and top three discords, along with the corresponding window size and position within the Matrix Profile (and, by extension, your time series).

Not surprisingly, this is a lot of information coming out of the default setting. Our goal is that this core function call can serve as a jumping-off point for many of your future analyses. For example, the PMP indicates that there is a conserved motif of length ~175 within our time series. Try calling `analyze` on that subsequence length and see what happens!
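For readers who want to try this on their own data, here is a minimal sketch of the call pattern described above. It assumes the tsmp package (v0.4.8 or later) is installed and that analyze() and visualize() behave as outlined in this post; the synthetic series below is made up for illustration.

library(tsmp)

set.seed(42)
synthetic <- c(sin(seq(0, 8*pi, length.out = 400)),  #repeated sinusoidal motif
               cumsum(rnorm(300)),                   #random-walk segment
               sin(seq(0, 8*pi, length.out = 400)))  #the motif returns

#no window size supplied: analyze() starts from the pan-matrix profile
result <- analyze(synthetic)

#re-run with an explicit subsequence length suggested by the PMP
result_fixed <- analyze(synthetic, windows = 175)
visualize(result_fixed)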



We hope that MPA enables you to more painlessly analyze your time series! For further information, visit our website (https://matrixprofile.org/), GitHub repos (https://github.com/matrix-profile-foundation) or follow us on Twitter (https://twitter.com/matrixprofile). MPF also operates a Discord channel where you can engage with fellow users of the Matrix Profile and ask questions. Happy time series hunting!

Acknowledgements

Thank you to Tyler Marrs, Frankie Cancino, Francisco Bischoff, Austin Ouyang and Jack Green for reviewing this article and assisting in its creation. And above all, thank you to Eamonn Keogh, Abdullah Mueen and their numerous graduate students for creating the Matrix Profile and continuing to drive its development.

Supplemental
  1. Matrix Profile research papers can be found on Eamonn Keogh’s UCR web page: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
  2. The Python implementation of Matrix Profile algorithms can be found here: https://github.com/matrix-profile-foundation/matrixprofile
  3. The R implementation of Matrix Profile algorithms can be found here: https://github.com/matrix-profile-foundation/tsmp
  4. The Golang implementation of Matrix Profile algorithms can be found here: https://github.com/matrix-profile-foundation/go-matrixprofile

Machine Learning with R: A Hands-on Introduction from Robert Muenchen at Machine Learning Week, Las Vegas

Join Robert Muenchen’s workshop about Machine Learning with R at Machine Learning Week on May 31 – June 4, 2020 in Las Vegas! 

Workshop Description 
The workshop will take place on May 31, 2020. 

R offers a wide variety of machine learning (ML) functions, each of which works in a slightly different way. This one-day, hands-on workshop starts with ML basics and takes you step-by-step through increasingly complex modeling styles. This workshop makes ML modeling easier through the use of packages that standardize the way the various functions work. When finished, you should be able to use R to apply the most popular and effective machine learning models to make predictions and assess the likely accuracy of those predictions.

The instructor will guide attendees on hands-on execution with R, covering:
    • A brief introduction to R’s tidyverse functions, including a comparison of the caret and parsnip packages
    • Pre-processing data
    • Selecting variables
    • Partitioning data for model development and validation
    • Setting model training controls
    • Developing predictive models using naïve Bayes, classification and regression trees, random forests, gradient boosting machines, and neural networks (more, if time permits)
    • Evaluating model effectiveness using measures of accuracy and visualization
    • Interpreting what “black-box” models are doing internally
Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their laptop. Software installation instructions are available at http://r4stats.com/workshops/PAW2020. Attendees receive an electronic copy of the course materials and related R code after the workshop.

See more information about the workshop here.

XGBoostLSS – An extension of XGBoost to probabilistic forecasting


Introduction 


To reason rigorously under uncertainty we need to invoke the language of probability (Zhang et al. 2020). Any model that falls short of quantifying the uncertainty attached to its outcome is likely to yield an incomplete and potentially misleading picture. While this is an irrevocable consensus in statistics, a common misconception, albeit a very persistent one, is that machine learning models usually lack proper ways of quantifying uncertainty. Despite the fact that the two terms exist in parallel and are used interchangeably, the perception that machine learning and statistics imply non-overlapping sets of techniques remains lively, both among practitioners and academics. This is vividly portrayed by the provocative (and potentially tongue-in-cheek) statement of Brian D. Ripley that ‘machine learning is statistics minus any checking of models and assumptions‘, made during the useR! 2004 conference in Vienna to illustrate the difference between machine learning and statistics.

In fact, such statements artificially complicate the relationship between statistics and machine learning, which is unfortunate at best, as they imply a profound and qualitative distinction between the two disciplines (Januschowski et al. 2020). The paper by Breiman (2001) is a notable exception, as it proposes to differentiate the two based on scientific culture rather than on methods alone. While the approaches discussed in Breiman (2001) are an admissible partitioning of the space of how to analyse and model data, more recent advances have gradually made this distinction less clear-cut. In fact, the current research trend in both statistics and machine learning gravitates towards bringing the two disciplines closer together. In an era of increasing necessity that the output of prediction models be turned into explainable and reliable insights, this is an exceedingly promising and encouraging development, as both disciplines have much to learn from each other. Along with Januschowski et al. (2020), we argue that it is more constructive to seek common ground than to introduce artificial boundaries. 

In order to further close the gap between the two cultures, we propose a new framework based on the eminent XGBoost that predicts the entire conditional distribution of a univariate response variable. We term our model XGBoostLSS, as it combines the accuracy and speed of XGBoost with the flexibility and interpretability of GAMLSS, which allow for the estimation and prediction of the entire conditional distribution. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. XGBoostLSS allows the user to choose from a wide range of continuous, discrete and mixed discrete-continuous distributions to better adapt to the data at hand, as well as to provide predictive distributions, from which prediction intervals and quantiles can be derived.  

Applications

In the following, we present both a simulation study and a real-world example that demonstrate the functionality of XGBoostLSS.

Simulation

We start with a simulated data set that exhibits heteroskedasticity, where the interest lies in predicting the 5% and 95% quantiles. Points outside the 5% and 95% quantiles are coloured in red, with the black dashed lines depicting the actual 5% and 95% quantiles.
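For intuition, a heteroskedastic toy data set of this kind can be simulated in a few lines of R; the sketch below is not the authors' code, and the particular mean and variance functions are made up for illustration.

#simulate y ~ N(10, (1 + 0.5x)^2) so the spread grows with x
set.seed(123)
n <- 5000
x <- runif(n, 0, 10)
y <- rnorm(n, mean = 10, sd = 1 + 0.5*x)

q05 <- qnorm(0.05, mean = 10, sd = 1 + 0.5*x)  #true 5% quantile
q95 <- qnorm(0.95, mean = 10, sd = 1 + 0.5*x)  #true 95% quantile

plot(x, y, pch = 16, cex = 0.4,
     col = ifelse(y < q05 | y > q95, "red", "grey60"))  #points outside the quantiles in red
ord <- order(x)
lines(x[ord], q05[ord], lty = 2)  #actual 5% quantile (dashed)
lines(x[ord], q95[ord], lty = 2)  #actual 95% quantile (dashed)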

Let’s fit the XGBoostLSS model to the data. In general, the syntax is similar to the original XGBoost implementation. However, the user has to make a distributional assumption by specifying a family in the function call. As the data has been generated by a Normal distribution, we use the Normal as the function input. The user also has the option of providing a list of hyper-parameters that are used for training the surrogate regression model to find an optimized set of parameters. As our model allows us to model the entire conditional distribution, we obtain prediction intervals and quantiles of interest directly from the predicted quantile function. Once the model is trained, we can predict all parameters of the distribution. The following figure shows the predictions of XGBoostLSS for the 5% and 95% quantiles in blue.
Comparing the coverage of the intervals with the nominal level of 90% shows that XGBoostLSS not only correctly models the heteroskedasticity in the data, but also provides an accurate forecast for the 5% and 95% quantiles. The flexibility of XGBoostLSS also comes from its ability to provide attribute importance, as well as partial dependence plots, for all of the distributional parameters. In the following, we only investigate the effect on the conditional variance.

Munich Rent

Considering there is an active discussion around imposing a rent freeze in German cities, we have chosen to revisit the famous Munich Rent data set commonly used in the GAMLSS literature, as Munich is among the most expensive cities in Germany when it comes to living costs. As our dependent variable, we select the net rent per square metre in EUR.

Even though the Generalized Beta Type 2 provides the best approximation to the data, we use the more parsimonious Normal distribution, as it has only two distributional parameters, compared to four for the Generalized Beta Type 2. 



Now that we have specified the distribution, we fit XGBoostLSS to the data. Again, we use Bayesian optimization to find an optimal set of hyper-parameters. The estimated effects presented in the following figure indicate that newer flats are on average more expensive, with the variance first decreasing and then increasing again for flats built around 1980 and later. Also, as expected, rents per square metre decrease with increasing size of the apartment.

The diagnostics for XGBoostLSS are based on quantile residuals of the fitted model and are shown in the following Figure.

Despite some slight under-fitting in the tails of the distribution, XGBoostLSS provides a well-calibrated forecast and confirms that our model is a good approximation to the data. XGBoostLSS also allows us to investigate feature importance for all distributional parameters. Looking at the top 10 features with the highest Shapley values for both the conditional mean and variance indicates that yearc and area are considered the most important variables.

Besides the global attribute importance, the user might also be interested in the local attribute importance for each single prediction. This allows us to answer questions like ‘How did the feature values of a single data point affect its prediction?‘ For illustration purposes, we select the first predicted rent of the test data set and present the local attribute importance for the conditional mean.

As we have modelled all parameters of the Normal distribution, XGBoostLSS provides a probabilistic forecast, from which any quantity of interest can be derived. The following figure shows a random subset of 50 predictions only, for ease of readability. The red dots show the actual out-of-sample rents, while the boxplots visualize the predicted distributions.

We can also plot a subset of the forecasted densities and cumulative distributions.

Comparison to other approaches



All measures, except the Median Absolute Error, show that XGBoostLSS provides more accurate forecasts than existing implementations.


Software Implementation

In its current implementation, XGBoostLSS is available in R and will soon be made publicly available on the Git repo StatMixedML/XGBoostLSS. Extensions to Julia and Python are also planned. I am currently also working on an extension of CatBoost to allow for probabilistic forecasting; you may find it on the Git repo StatMixedML/CatBoostLSS. Please note that due to some time restrictions, the public release of the package(s) may be somewhat delayed.

Reference Paper


März, Alexander (2019) “XGBoostLSS – An extension of XGBoost to probabilistic forecasting”.
März, Alexander (2020) “CatBoostLSS – An extension of CatBoost to probabilistic forecasting”.

The Premier Machine Learning Conference (15% discount code)

Attend any or all of the five jointly scheduled events! Get an additional 15% discount on your booking with the code RBLOGGERSMLW. Join us in Las Vegas for the largest Predictive Analytics World event to date! Learn more



5 days, 8 tracks, 160 speakers and over 150 exciting sessions


Join the Machine Learning Week 2020, May 31 – June 4, Las Vegas! It brings together five co-located events:
 

    • Predictive Analytics World for Business
    • Predictive Analytics World for Financial Services
    • Predictive Analytics World for Industry 4.0
    • Predictive Analytics World for Healthcare
    • Deep Learning World
gathering the top practitioners and the leading experts in data science and machine learning. By design, this mega-conference is where to meet the who’s who and keep up on the latest techniques, making it the leading machine learning event.  You can expect top-class experts from world-famous companies such as: 

Google: Join Richard Dutton (Head of Machine Learning for Corporate Engineering) at his session “How Google Uses AI and Machine Learning in the Enterprise”!
Lyft: Gil Arditi (Product Lead, Machine Learning) will speak about “From Self-Driving to Fraud Detection – How Lyft Streamlines Machine Learning Deployment”.
LinkedIn: Want to learn something about launching artificial intelligence products? Then the session from Priyanka Gariba (Head of Artificial Intelligence Technical Program Management) is the best fit for you!
Visa: Abhishek Joshi ‘AJ’ (Senior Director) will explain “The Predictive Power of Card Transaction Data”.
Microsoft: Jaya Mathew (Senior Data Scientist) will present “Getting the Most out of Your IoT Data: Basics of Predictive Maintenance, Bootstrapping your Model with Your Data & Comparing the Traditional vs Deep Learning Methodologies”.
Verizon: Join Bryan Larish, Ph.D. (Director of Technology) at his session “Auditing Cellular Networks with Deep Learning”.

 See the full 8-track agenda here.

What are the 5 conferences about?

Predictive Analytics World for Business focuses on concrete examples of deployed predictive analytics. Hear from the horse’s mouth precisely how Fortune 500 analytics competitors and other top practitioners deploy machine learning, and the kind of business results they achieve. Predictive Analytics World for Financial Services is the leading data science event covering the deployment of machine learning by banks, insurance companies, credit card companies, investment firms, and other financial institutions. The Predictive Analytics World Healthcare program will feature sessions and case studies across Healthcare Business Operations and Clinical applications so you can witness how data science and machine learning are employed at leading enterprises, resulting in improved outcomes, lower costs, and higher patient satisfaction. Predictive Analytics World for Industry 4.0 is the leading vendor-neutral conference for machine learning for smart manufacturing and IoT. Data scientists, industrial planners, and other machine learning experts will meet in Las Vegas to explore the latest trends and technologies in machine & deep learning for the IoT era. Deep Learning World is the premier conference covering the commercial deployment of deep learning. The event’s mission is to foster breakthroughs in the value-driven operationalization of established deep learning methods.

Attend any or all of the five jointly scheduled events! Get an additional 15% discount on your booking with the code RBLOGGERSMLW. Join us in Las Vegas for the largest Predictive Analytics World event to date! Learn more

New Package to Process TVDI index and Filter Golay Savitzky Raster

Description

  • Use MODIS image to calculate TVDI index
  • Make multiple Raster images at the same time
  • Can be used to calculate large image files
  • UI interface calculates TVDI index
  • UI interface exports Golay Savitzky filter images
  • The functions in the TVDI package
    • Golay_Raster
    • Golay_GUI (may fail if GTK+ is not installed)
    • Mean_Raster
    • Mask_Multi_Raster
    • IQR_Raster
    • TVDI_process
    • TVDI_Largefiles_process
    • TVDI_GUI (may fail if GTK+ is not installed)

How to Download and Install

  • Download and Install from Github
install.packages("devtools")
library("devtools")
install_github("nguyenduclam/TVDIpk")
library("TVDIpk")
  • Install from CRAN (waiting for the update on CRAN)
install.packages("TVDIpk")
  • Note that if the GTK+ library is not already installed on your system, installation may fail. In that case, please install and load the gWidgetsRGtk2 library beforehand:
install.packages("gWidgetsRGtk2")
library("gWidgetsRGtk2")

How to Use the Package

  1. Golay UI
    • Golay_GUI()
  2. TVDI UI
    • TVDI_GUI()


rco: Make Your R Code Run Faster Today!

The rco package can optimize R code in a variety of different ways. The package implements common subexpression elimination, constant folding, constant propagation, and dead code elimination, among other very relevant code optimization strategies.

Currently, rco can be downloaded as a GitHub package. The rco package can function as an RStudio addin, be used through a Shiny GUI, or be called as an R function through either the optimize_files() or optimize_text() functions. Found an optimization method not currently supported by the rco package? Feel free to contribute! Your contribution is more than welcome. Install rco now and make your R code run faster today!
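As a quick illustration, the snippet below sketches how optimize_text() might be called on a small piece of code containing an obvious constant-folding opportunity; it assumes rco has been installed from GitHub as described above, and the exact output format may differ.

library(rco)

code_text <- "
n <- 1000
x <- 8 * 24 * 60   #constant folding candidate
y <- x + 1         #constant propagation candidate
"
optimize_text(code_text)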

15+ Resources to Get Started with R

R is the second most sought-after language in data science, behind Python, so gaining mastery of R is a prerequisite to a thriving career in the field. Whether you’re an experienced developer or a newbie considering a career move, here are some excellent resources so you can get started with R.

[Related Article: Data-Driven Exploration of the R User Community Worldwide]

What is R?

R is a programming language and environment designed for statistical analysis. It’s used mainly by data miners and statisticians. It’s a free resource and runs on a wide variety of platforms, including UNIX, Windows, and Mac OS. It has thousands of well-documented extensions and is cross-platform compatible. It’s not quite as popular outside the field of data science, but it’s one of the best options for exploring datasets in a deep-dive manner or for one-off data insights. Head over to the R site and download a copy of R, so you’re ready to get started.

Free R Resources for Beginners

Let’s take a look at how a beginner might break into R. It’s not quite as friendly as Python, but it’s definitely accessible with good resources and practice. 

Platforms and Documentation

r-bloggers.com: R-bloggers is a collection of blogs designed by R experts that covers a wide range of R topics. No matter what you’re curious about or have an issue with, R-bloggers has it covered.

Books

R for Data Science: This classic handbook provides resources and documentation. It’s available for free on the website, or you can purchase a physical copy from Amazon. Hands-on Programming with R: Garrett Grolemund’s classic is a practical, hands-on approach to R programming. It gives you the instruction you need plus practical programming skills to begin with R right from the very beginning.

Courses

Codecademy: Codecademy’s mission is to bring development knowledge even to beginners, and its R offers are no different. While many of the lessons will require a membership, it does offer a basic set of courses to get you started. edX.org: EdX offers a range of free R courses to get you started, but we recommend starting with Microsoft’s Introduction to R for Data Science for a comprehensive overview. The courses cost nothing and are offered asynchronously. Some do come with official certification for a fee.

Free R Resources for Developers

If you’ve already got some development experience under your belt, these resources could be a great way to get started with R by utilizing your current experience. Even better, they’re free.

Platforms and Documentation

storybench.com: Storybench is an experiential learning platform designed to provide exercises in digital storytelling. They offer projects in R, most notably “How to Explore Correlations in R.” Once you’ve gotten the basics, the next logical step is to take on projects for hands-on learning.

Books

R Programming for Data Science: This option is available for free (though you can choose to donate in support of the project). It offers full resources for learning R and understanding key data science principles. If you upgrade the package, the online book comes with a full video suite. Text Mining with R: Another book available for free on the website, this option offers a targeted approach to text mining with a full Github repository as support. R in Action: Another entirely free resource for business developers. If you’ve already got an established career in development in the business world, this could be an excellent resource for getting started with R in the business world.

Courses

Coursera: Johns Hopkins’s popular partnership with Coursera, “Data Science, Foundations Using R” is a great way for developers to build skills to break into the field of Data Science. edX + Harvard: X Series Program in Data Analysis for Life Sciences is a course series designed to implement both learning R and real-life projects for a full learning experience. You can upgrade to an official learning certificate for a fee or take the individual courses for free.

Paid Resources for Beginners and Beyond

Sometimes, you’ve got to invest a little in your learning experience. Here are a couple of things that can really jumpstart your R-programming skills if you’ve got some wiggle room in your budget. Getting Started with R: A primer on using R for the biological sciences. It contains valuable information for getting started on statistical analysis using the R programming language. flowingdata.com: Flowingdata is a membership site designed to elevate your visualizations. It’s another excellent way to get experiential learning with R projects. Rstudio: It’s not cheap, but if you’re serious about making a career in R, you’ll want to get it. Save up and invest. They do, however, have a series of free webinars you can peruse.

Bonus! More Blogs and Newsletters

https://blog.revolutionanalytics.com/r/ : Blog designed to highlight milestones in Data Science and includes a range of topics including both R and Python for you multilingual developers out there. https://journal.r-project.org/: Open access, refereed journal detailing the latest in R-programming news and projects. Papers have a focus on accessibility, and the articles are tended to reach a wide audience.  https://morningcupofcoding.com/: Offers a wide range of curated coding articles, including R programming. Check their back issues for articles of interest. opendatascience.com: ODSC’s general weekly newsletter provides members with trending topics in the fields of modeling, tools & platforms, and more.

Getting Started with R Programming

[Related Article: Where is Data Science Heading? Watching R’s Most Popular Packages May Have the Answer]

While both Python and R are invaluable resources for getting started in Data Science, the statistical capabilities of R are in wide demand. Whether you’re new to the world of coding or an experienced developer, R can open doors into the world of Data Science.

Beginners guide to Bubble Map with Shiny

Map Bubble
A map bubble is a type of map chart where the position of a bubble or circle indicates geographical location and the bubble size is used to show differences in the magnitude of quantitative variables like population.

We will be using the highcharter package to show earthquake magnitude and depth. Highcharter is a versatile charting library for building interactive charts, and one of the easiest to learn and to integrate with Shiny.

Bubble Map

 

About dataset
The dataset used here is from the US Geological Survey website and covers recent one-week earthquake events. There are about 420 recorded observations with magnitude greater than 2.0 globally. The dataset has 22 variables, of which we will be using time, latitude, longitude, depth, magnitude (mag) and the nearest named place of the event.

Shiny Application
This application has a single app.R file and the earthquake dataset. Before we start with the UI function, we will load the dataset and fetch the world map object from the Highcharts map collection with the hcmap function. See the app here.

library(shiny)
library(highcharter)
library(dplyr)
edata <- read.csv('earthquake.csv') %>% rename(lat=latitude,lon = longitude)
wmap <- hcmap()
Using the dplyr package, the latitude and longitude variables are renamed lat and lon with the rename verb; these are the column names highcharter expects for point coordinates. 


ui
It has a sidebar panel with 3 widgets and a main panel for displaying the map.

  • Two sliders, one for filtering out low-magnitude values and the other for adjusting bubble size.
  • One select widget for the bubble-size variable. The user can select magnitude or depth of the earthquake event; mag and depth are columns in the dataset.
  • The widget output function highchartOutput is used to display the map in Shiny.
ui <- fluidPage(

  titlePanel("MapBubble"), # Application title
  sidebarLayout(
    sidebarPanel(

      sliderInput('mag', 'Magnitude more than (Richter Scale)',
                  min = 1, max = 6, step = 0.5, value = 1), #default within the slider range

      selectInput('bubble', 'Bubble Size indicates',
                  choices = c('Magnitude' = 'mag', 'Depth (in Km)' = 'depth')),

      sliderInput('bublesize', 'Adjust bubble Size',
                  min = 2, max = 10, step = 1, value = 6)
    ),

    # Display a Map Bubble
    mainPanel(
      highchartOutput('eqmap', height = "500px")
    )
  )
)
Server
Before rendering, we filter the dataset within a reactive context. Any numeric column that we want to represent with bubble size should be named z; input$bubble comes from the select widget.

renderHighchart is the render function for use in Shiny. We pass the filtered data and the chart type mapbubble to the hc_add_series function. The place, time and z variables are displayed in the tooltip using the “point” format. 
The subtitle shows the number of filtered observations on the map.

 
server <- function(input, output) {

  data <- reactive(edata %>%
                     filter(mag >= input$mag) %>%
                     rename(z = input$bubble))

  output$eqmap <- renderHighchart(

    wmap %>%
      hc_legend(enabled = FALSE) %>%
      #bubble size as a percentage of the map size
      hc_add_series(data = data(), type = "mapbubble", name = "",
                    maxSize = paste0(input$bublesize, '%')) %>%
      hc_tooltip(useHTML = TRUE, headerFormat = '',
                 pointFormat = paste('Location: {point.place} <br> Time: {point.time} <br>',
                                     input$bubble, ': {point.z}')) %>%
      hc_title(text = "Global Seismic Activity") %>%
      hc_subtitle(text = paste('No of obs:', nrow(data()), sep = '')) %>%
      hc_mapNavigation(enabled = TRUE)
  )
}

# Run the application 
shinyApp(ui = ui, server = server)
The Shiny R file can be found here, at the GitHub repository.

Generate synthetic data using R

If you are building data science applications and need some data to demonstrate a prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generate synthetic data using the R package ‘conjurer’. 

Steps to build synthetic data


1. Installation


Install the conjurer package by using the following code. Since the package uses base R functions, it does not have any dependencies.
 install.packages("conjurer") 

2. Build customers


A customer is identified by a unique customer identifier (ID). A customer ID is alphanumeric, with the prefix “cust” followed by a number. This number ranges from 1 to the number of customers provided as the argument to the function. For example, if there are 100 customers, then the customer IDs will range from cust001 to cust100. This ensures that the customer ID is always of the same length. Let us build a group of customer IDs using the following code. For simplicity, let us assume that there are 100 customers. Customer IDs are built using the function buildCust. This function takes one argument, numOfCust, which specifies the number of customer IDs to be built.
library(conjurer)
customers <- buildCust(numOfCust =  100)
print(head(customers))
#[1] "cust001" "cust002" "cust003" "cust004" "cust005" "cust006"

3. Build products


The next step is building some products. A product is identified by a product ID. Similar to a customer ID, a product ID is also alphanumeric, with the prefix “sku”, which signifies a stock keeping unit. This prefix is followed by a number ranging from 1 to the number of products provided as the argument to the function. For example, if there are 10 products, then the product IDs will range from sku01 to sku10. This ensures that the product ID is always of the same length. Besides the product ID, the product price range must be specified. Let us build a group of products using the following code. For simplicity, let us assume that there are 10 products and that their price range is from 5 dollars to 50 dollars. Products are built using the function buildProd. This function takes 3 arguments, as given below.
    • numOfProd. This defines the number of product IDs to be generated.
    • minPrice. This is the minimum value of the price range.
    • maxPrice. This is the maximum value of the price range.
library(conjurer)
products <- buildProd(numOfProd = 10, minPrice = 5, maxPrice = 50)
print(head(products))
#     SKU Price
# 1 sku01 43.60
# 2 sku02 48.56
# 3 sku03 36.16
# 4 sku04 19.02
# 5 sku05 17.19
# 6 sku06 25.35

4. Build transactions


Now that a group of customer IDs and products are built, the next step is to build transactions. Transactions are built using the function genTrans. This function takes 5 arguments. The details are as follows.
    • cycles. This represents the cyclicality of the data. It can take the following values:
      • “y”. If cycles is set to the value “y”, there is only one instance of a high number of transactions during the entire year. This is a very common situation for some retail clients, where the highest number of sales occurs during the holiday period in December.
      • “q”. If cycles is set to the value “q”, there are 4 instances of a high number of transactions. This is generally noticed in the financial services industry, where the financial statements are revised every quarter and have an impact on equity transactions in the secondary market.
      • “m”. If cycles is set to the value “m”, there are 12 instances of a high number of transactions in a year. This means that the number of transactions increases once every month and then subsides for the rest of the month.
    • spike. This represents the seasonality of the data. It can take any value from 1 to 12. These numbers represent the months of the year, from January to December respectively. For example, if spike is set to 12, it means that December has the highest number of transactions.
    • trend. This represents the slope of the data distribution. It can take a value of 1 or -1.
      • If the trend is set to 1, the aggregated monthly transactions exhibit an upward trend from January to December, and vice versa if it is set to -1.
    • outliers. This signifies the presence of outliers. If set to 1, outliers are generated randomly. If set to 0, no outliers are generated. The presence of outliers is a very common occurrence, so setting outliers to 1 is recommended. However, there are instances where outliers are not needed; for example, if the objective of data generation is solely for visualization purposes, then outliers may not be needed.
    • transactions. This represents the number of transactions to be generated.
Let us build transactions using the following code
transactions <- genTrans(cycles = "y", spike = 12, outliers = 1, transactions = 10000)
Visualize generated transactions by using
TxnAggregated <- aggregate(transactions$transactionID, by = list(transactions$dayNum), length)
plot(TxnAggregated, type = "l", ann = FALSE)

5. Build final data


Bringing customers, products and transactions together is the final step of generating synthetic data. This process entails 3 steps as given below.

5.1 Allocate customers to transactions


The allocation of transactions is achieved with the help of the buildPareto function. This function takes 3 arguments, as detailed below.
    • factor1 and factor2. These are the factors to be mapped to each other. As the name suggests, they must be of data type factor.
    • pareto. This defines the percentage allocation and is a numeric data type. This argument takes the form c(x, y), where x and y are numeric and sum to 100. If we set pareto to c(80, 20), it allocates 80 percent of factor1 to 20 percent of factor2. This is based on the well-known Pareto principle.
Let us now allocate transactions to customers first by using the following code.
customer2transaction <- buildPareto(customers, transactions$transactionID, pareto = c(80,20))
Assign readable names to the output by using the following code.
names(customer2transaction) <- c('transactionID', 'customer')

#inspect the output
print(head(customer2transaction))
#   transactionID customer
# 1     txn-91-11  cust072
# 2    txn-343-25  cust089
# 3    txn-264-08  cust076
# 4    txn-342-07  cust030
# 5      txn-2-19  cust091
# 6    txn-275-06  cust062

5.2 Allocate products to transactions


Now, using a similar step as above, we allocate transactions to products with the following code.
product2transaction <- buildPareto(products$SKU,transactions$transactionID,pareto = c(70,30))
names(product2transaction) <- c('transactionID', 'SKU')

#inspect the output
print(head(product2transaction))
#   transactionID   SKU
# 1    txn-182-30 sku10
# 2    txn-179-21 sku01
# 3    txn-179-10 sku10
# 4    txn-360-08 sku01
# 5     txn-23-09 sku01
# 6    txn-264-20 sku10

5.3 Final data


Now, we merge the customer and product allocations with the transaction details to build the final data set, using the following code.
df1 <- merge(x = customer2transaction, y = product2transaction, by = "transactionID")

dfFinal <- merge(x = df1, y = transactions, by = "transactionID", all.x = TRUE)

#inspect the output
print(head(dfFinal))
#   transactionID customer   SKU dayNum mthNum
# 1      txn-1-01  cust076 sku03      1      1
# 2      txn-1-02  cust062 sku04      1      1
# 3      txn-1-03  cust087 sku07      1      1
# 4      txn-1-04  cust010 sku04      1      1
# 5      txn-1-05  cust039 sku01      1      1
# 6      txn-1-06  cust010 sku01      1      1
Thus, we have the final data set with transactions, customers and products.

Interpret the results

The column names of the final data frame can be interpreted as follows.
    • Each row is a transaction and the data frame has all the transactions for a year i.e 365 days.
    • transactionID is the unique identifier for that transaction.
    • customer is the unique customer identifier. This is the customer who made that transaction.
    • SKU is the product that was bought in that transaction.
    • dayNum is the day number in the year. There would be 365 unique dayNum in the data frame.
    • mthNum is the month number. This ranges from 1 to 12 and represents January to December respectively.
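As a quick check of the generated pattern (a sketch, not part of the original article), aggregating transactions by month should show the largest count in December, matching the cycles = "y" and spike = 12 settings used above.

#transactions per month; mthNum 12 (December) should have the most
txn_by_month <- aggregate(dfFinal$transactionID, by = list(month = dfFinal$mthNum), FUN = length)
txn_by_month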

Summary & concluding remarks


In this article, we started by building customers, products and transactions. Later on, we saw how to bring them all together into a final data set. At the time of writing this article, the package is predominantly focused on building the basic data set, and there is room for improvement. If you are interested in contributing to this package, please find the details at contributions.