Francisco Bischoff – R-posts.com

A new tool for painlessly analyzing your time series

We’re surrounded by time-series data. From finance to IoT to marketing, many organizations produce thousands of these metrics and mine them to uncover business-critical insights. A Site Reliability Engineer might monitor hundreds of thousands of time series streams from a server farm, in the hopes of detecting anomalous events and preventing catastrophic failure. Alternatively, a brick and mortar retailer might care about identifying patterns of customer foot traffic and leveraging them to guide inventory decisions.

Identifying anomalous events (or “discords”) and repeated patterns (“motifs”) are two fundamental time-series tasks. But how does one get started? There are dozens of approaches to both questions, each with unique positives and drawbacks. Furthermore, time-series data is notoriously hard to analyze, and the explosive growth of the data science community has led to a need for more “black-box” automated solutions that can be leveraged by developers with a wide range of technical backgrounds.

We at the Matrix Profile Foundation believe there’s an easy answer. While it’s true that there’s no such thing as a free lunch, the Matrix Profile (a data structure & set of associated algorithms developed by the Keogh research group at UC-Riverside) is a powerful tool to help solve this dual problem of anomaly detection and motif discovery. Matrix Profile is robust, scalable, and largely parameter-free: we’ve seen it work for a wide range of metrics including website user data, order volume and other business-critical applications. And as we will detail below, the Matrix Profile Foundation has implemented the Matrix Profile across three of the most common data science languages (Python, R and Golang) as an easy-to-use API that’s relevant for time series novices and experts alike.

The basics of Matrix Profile are simple: If I take a snippet of my data and slide it along the rest of the time series, how well does it overlap at each new position? More specifically, we can evaluate the Euclidean distance between a subsequence and every possible time series segment of the same length, building up what’s known as the snippet’s “Distance Profile.” If the subsequence repeats itself in the data, there will be at least one match and the minimum Euclidean distance will be zero (or close to zero in the presence of noise).

In contrast, if the subsequence is highly unique (say it contains a significant outlier), matches will be poor and all overlap scores will be high. Note that the type of data is irrelevant: We’re only looking at pattern conservation. We then slide every possible snippet across the time series, building up a collection of Distance Profiles. By taking the minimum value for each time step across all distance profiles, we can build the final Matrix Profile. Notice that both ends of the Matrix Profile value spectrum are useful. High values indicate uncommon patterns or anomalous events; in contrast, low values highlight repeatable motifs and provide valuable insight into your time series of interest. For those interested, this post by one of our co-founders provides a more in-depth discussion of the Matrix Profile.

Although the Matrix Profile can be a game-changer for time series analysis, leveraging it to produce insights is a multi-step computational process, where each step requires some level of domain experience. However, we believe that the most powerful breakthroughs in data science occur when the complex is made accessible. When it comes to the Matrix Profile, there are three facets to accessibility: “out-of-the-box” working implementations, gentle introductions to core concepts that can naturally lead into deeper exploration, and multi-language accessibility. Today, we’re proud to unveil the Matrix Profile API (MPA), a common codebase written in R, Python and Golang that achieves all three of these goals.

Using the Matrix Profile consists of three steps. First, you Compute the Matrix Profile itself. However, this is not the end: you need to Discover something by leveraging the Matrix Profile that you’ve created. Do you want to find repeated patterns? Or perhaps uncover anomalous events? Finally, it’s critical that you Visualize your findings, as time series analysis greatly benefits from some level of visual inspection.

Normally, you’d need to read through pages of documentation (both academic and technical) to figure out how to execute each of these three steps. This may not be a challenge if you’re an expert with prior knowledge of the Matrix Profile, but we’ve seen that many users simply want to Analyze their data by breaking through the methodology to get to a basic starting point. Can the code simply leverage some reasonable defaults to produce a reasonable output?

To parallel this natural computational flow, MPA consists of three core components:

Compute (computing the Matrix Profile)
Discover (evaluate the MP for motifs, discords, etc)
Visualize (display results through basic plots)

These three capabilities are wrapped up into a high-level capability called Analyze. This is a user-friendly interface that enables people who know nothing about the inner workings of Matrix Profile to quickly leverage it for their own data. And as users gain more experience and intuition with MPA, they can easily dive deeper into any of the three core components to make further functional gains.

As an example, we’ll use the R flavour of MPA to analyze the synthetic time series shown below (here is the code):

Visual inspection reveals that there are both patterns and discords present. However, one immediate problem is that your choice of subsequence length will change both the number and location of your motifs! Are there only two sinusoidal motifs present between indices 0-500, or is each cycle an instance of the pattern? Let’s see how MPA handles this challenge:

Because we haven’t specified any information regarding our subsequence length, `analyze` begins by leveraging a powerful calculation known as the pan-matrix profile (or PMP) to generate insights that will help us evaluate different subsequence lengths. We’ll discuss the details of PMP in a later post (or you can read the associated paper), but in a nutshell, it is a global calculation of all possible subsequence lengths condensed into a single visual summary. The X-axis is the index of the matrix profile, and the Y-axis is the corresponding subsequence length. The darker the shade, the lower the Euclidean distance at that point. We can use the “peaks” of the triangles to find the 6 “big” motifs visually present in the synthetic time series.

The PMP is all well and good, but we promised a simple way of understanding your time series. To facilitate this, `analyze` will combine PMP with an under the hood algorithm to choose sensible motifs and discords from across all possible window sizes. The additional graphs created by `analyze` show the top three motifs and top three discords, along with the corresponding window size and position within the Matrix Profile (and, by extension, your time series).

Not surprisingly, this is a lot of information coming out of the default setting. Our goal is that this core function call can serve as a jumping-off point for many of your future analyses. For example, the PMP indicates that there is a conserved motif of length ~175 within our time series. Try calling `analyze` on that subsequence length and see what happens!

We hope that MPA enables you to more painlessly analyze your time series! For further information, visit our website (https://matrixprofile.org/), GitHub repos (https://github.com/matrix-profile-foundation) or follow us on Twitter (https://twitter.com/matrixprofile). MPF also operates a Discord channel where you can engage with fellow users of the Matrix Profile and ask questions. Happy time series hunting!

Acknowledgements

Thank you to Tyler Marrs, Frankie Cancino, Francisco Bischoff, Austin Ouyang and Jack Green for reviewing this article and assisting in its creation. And above all, thank you to Eamonn Keogh, Abdullah Mueen and their numerous graduate students for creating the Matrix Profile and continuing to drive its development.

Supplemental

Matrix Profile research papers can be found on Eamonn Keogh’s UCR web page: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

The Python implementation of Matrix Profile algorithms can be found here: https://github.com/matrix-profile-foundation/matrixprofile

The R implementation of Matrix Profile algorithms can be found here: https://github.com/matrix-profile-foundation/tsmp

The Golang implementation of Matrix Profile algorithms can be found here: https://github.com/matrix-profile-foundation/go-matrixprofile

Recently I began to look further into Time Series(TS). During the course of my Master’s degree, I used the forecast package quite a bit (Thanks to Prof. Hyndman), and TS got my attention. So, after reading lots of publications about everything you can imagine about TS, I came across one publication from Prof. Eamonn, of the University of California, that made me contact him to ask a few questions. After receiving a considerable amount of information from him, one particular subject caught my attention: Matrix Profile. I was so much impressed that I started to write the TSMP R Package.

Enough talking, so what is a Matrix Profile? Why should you care about it?

Matrix Profile

If you are here, you are likely aware of what a Distance Matrix (DM) is. If not, think about those tables that used to be on maps with the distance between cities. It is widely used in TS for clustering, classification, motif search, etc. However, even for modestly sized datasets, the algorithms can take months to compute it and even with speed-up techniques (i.e., indexing, lower-bounding, early abandoning) they can be, at best, one or two orders of magnitude faster. Matrix Profile it’s like a DM but faster (much faster) to compute. Figure 1 shows a DM and a Matrix Profile. As you can see, in the Matrix Profile, as the name suggests, you see the Profile of a DM. It stores the minimum Euclidean distance of every subset of one TS (think of a Sliding Window) with another (or itself, called Self-Join). It also stores a companion vector called Profile Index, that gives us the index of each nearest neighbor.

And is this good?

Why should you care?

DM usually stores redundant information, useless for most TS applications. The Matrix Profile has a host of interesting and exploitable properties. For example, the highest point on the Matrix Profile corresponds to the TS discord, the (tied) lowest points correspond to the locations of the best TS motif pair, and the variance can be seen as a measure of the TS complexity. Moreover, the histogram of the values in the Matrix Profile is the exact answer to the TS density estimation. Particularly, it has implications for TS motif discovery, TS joins, shapelet discovery (classification), density estimation, semantic segmentation, visualization, rule discovery, clustering, etc.¹

This method has the following advantages/features:

It is exact, providing no false positives or false dismissals.
It is simple and parameter-free. In contrast, the more general metric space APSS² algorithms require building and tuning spatial access methods and/or hash functions.
It requires an inconsequential space overhead, just O(n) with a small constant factor.
It is extremely scalable, and for massive datasets, we can compute the results in an anytime fashion, allowing ultra-fast approximate solutions.
Having computed the similarity join for a dataset, we can incrementally update it very efficiently. In many domains, this means we can effectively maintain exact joins on streaming data forever.
It provides full joins, eliminating the need to specify a similarity threshold, which is a near-impossible task in this domain.
It is embarrassingly parallelizable, both on multicore processors and in distributed systems.

Performance on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz using a random walk dataset

set.seed(2018)
data <- cumsum(sample(c(-1, 1), 40000, TRUE))

	Elapsed Time³	Data size	Window size	Threads
scrimp()	45.30s	40000	1000	1
stomp_par()	52.72s	40000	1000	8
stomp()	136.01s	40000	1000	1
stamp_par()	140.25s	40000	1000	8
stamp()	262.03s	40000	1000	1

Author: Francisco Bischoff

tsmp v0.4.8 release – Introducing the Matrix Profile API

Time Series with Matrix Profile

Matrix Profile

Why should you care?

Performance on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz using a random walk dataset

Further readings