A matrix profile annotates a time series in a way that makes data mining tasks on that series, such as motif discovery and anomaly detection, simple. I will not explain what a matrix profile is in this post; the following page covers the topic in detail:

https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

I was introduced to the concept of matrix profiles at the KDD 2017 conference. Unfortunately, at the time all of the code was implemented in MATLAB. I dabbled a bit with converting the code to Python but never finished. Last month, December 2018, a Target employee published a GitHub repository with working algorithms:

https://github.com/target/matrixprofile-ts

My main interest in matrix profiles is their usefulness for anomaly detection. This blog post demonstrates how to use the Python module to detect anomalies in a NAB dataset. Specifically, I am working with the NYC Taxi dataset.

Data Overview

The data consists of the number of taxi passengers from 2014-07-01 to 2015-01-31. There are 5 known anomalies during this period:

NYC Marathon - 2014-11-02
Thanksgiving - 2014-11-27
Christmas - 2014-12-25
New Year's Day - 2015-01-01
Blizzard - 2015-01-26 and 2015-01-27

I will see how closely the discords found with the matrix profile line up with these dates.

In [2]:

# import the module explicitly instead of using a wildcard import
from matrixprofile import matrixProfile
from matrixprofile.discords import discords

In [3]:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline

Load Data

In [4]:

df = pd.read_csv('/home/tyler/data/nab/realKnownCause/realKnownCause/nyc_taxi.csv')

In [5]:

df['timestamp'] = pd.to_datetime(df['timestamp'])

In [6]:

df = df.set_index('timestamp').sort_index()

In [7]:

df.head()

Out[7]:

	value
timestamp
2014-07-01 00:00:00	10844
2014-07-01 00:30:00	8127
2014-07-01 01:00:00	6210
2014-07-01 01:30:00	4656
2014-07-01 02:00:00	3820

Resample Hourly

The original dataset is in 30-minute increments, so I sum each pair of half-hour counts to get hourly totals.

In [8]:

df = df.resample('1H').sum()

In [9]:

df.head()

Out[9]:

	value
timestamp
2014-07-01 00:00:00	18971
2014-07-01 01:00:00	10866
2014-07-01 02:00:00	6693
2014-07-01 03:00:00	4433
2014-07-01 04:00:00	4379
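
To make the aggregation concrete, here is a minimal dependency-free sketch of what summing into hourly buckets does. It only uses the first four half-hourly rows shown in the earlier `df.head()` output; the variable names are mine and this is not how pandas implements `resample`.

```python
from datetime import datetime

# First four half-hourly rows copied from the df.head() output above.
half_hourly = [
    ("2014-07-01 00:00:00", 10844),
    ("2014-07-01 00:30:00", 8127),
    ("2014-07-01 01:00:00", 6210),
    ("2014-07-01 01:30:00", 4656),
]

hourly = {}
for stamp, count in half_hourly:
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    # Truncate each timestamp to the start of its hour and add the count to that bucket.
    hour = ts.replace(minute=0, second=0)
    hourly[hour] = hourly.get(hour, 0) + count

for hour, total in hourly.items():
    print(hour, total)
```

The two totals agree with the first two rows of the resampled frame.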

In [10]:

a = df.values.squeeze()

# subsequence length used to compute the matrix profile:
# the measurements are hourly and we want to find daily events,
# so we use a window of 24 (the number of hours in a day)
m = 24
profile = matrixProfile.stomp(a, m)
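
For intuition about what the profile values mean, the sketch below computes a matrix profile by brute force: for every window of length m, it records the z-normalized Euclidean distance to the nearest other window that is not a trivial overlap of itself. This is only an illustration on synthetic data, not the STOMP algorithm and not the module's code; the function names and the exclusion half-width of m // 2 are my assumptions.

```python
import math

def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]

def naive_matrix_profile(series, m):
    """Brute-force O(n^2 * m) matrix profile, for illustration only."""
    n_subs = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n_subs)]
    excl = m // 2  # assumed exclusion half-width that skips trivial self-matches
    profile = []
    for i in range(n_subs):
        best = math.inf
        for j in range(n_subs):
            if abs(i - j) <= excl:
                continue
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile

# Synthetic repeating pattern with one distorted cycle (not the taxi data).
pattern = [0, 1, 2, 3, 2, 1]
series = pattern * 6
series[19:23] = [3, 0, 3, 0]

profile_demo = naive_matrix_profile(series, m=6)
peak = max(range(len(profile_demo)), key=profile_demo.__getitem__)
print(len(series), len(profile_demo), peak)
```

Windows that repeat elsewhere in the series get a distance near zero, while windows touching the distorted cycle have no close match and stand out as peaks. Note also that a series of length n yields n - m + 1 profile values, which is why the profile has to be padded before it can be stored next to the original data.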

In [11]:

# the profile has len(a) - m + 1 entries, so pad the tail with NaN to match the index
df['profile'] = np.append(profile[0], np.full(m - 1, np.nan))

Plot Matrix Profile

Below is a plot of the hourly data and the matrix profile. Visually, you can see both motifs (low values) and discords (high values). We are interested in the discords, which show up as high peaks in the plot. A couple of periods jump out that seem close to Thanksgiving and the blizzard.

In [12]:

#Plot the signal data
fig, (ax1, ax2) = plt.subplots(2,1,sharex=True,figsize=(15,10))
df['value'].plot(ax=ax1, title='Raw Data')

#Plot the Matrix Profile
df['profile'].plot(ax=ax2, c='r', title='Matrix Profile')
plt.show()

In [13]:

# exclude up to a day on the left and right side
ex_zone = 24

# we look for the 5 events specified in the data explanation
anoms = discords(df['profile'], ex_zone, k=5)
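
The idea behind picking discords with an exclusion zone can be sketched as follows: repeatedly take the largest remaining profile value, then blank out its neighbors so the same event is not reported again from an adjacent position. This is my own minimal version under those assumptions, with a hypothetical function name; it is not the module's implementation and may differ from it in details.

```python
import math

def top_k_discords(profile, ex_zone, k):
    """Return indices of the k largest profile values, keeping picks more than ex_zone apart."""
    values = list(profile)
    picks = []
    for _ in range(k):
        best_idx, best_val = None, -math.inf
        for idx, val in enumerate(values):
            # NaN padding and already excluded (-inf) entries never win the comparison.
            if not math.isnan(val) and val > best_val:
                best_idx, best_val = idx, val
        if best_idx is None:
            break  # nothing left to pick
        picks.append(best_idx)
        # Blank out ex_zone positions on each side of the pick.
        lo = max(best_idx - ex_zone, 0)
        hi = min(best_idx + ex_zone + 1, len(values))
        values[lo:hi] = [-math.inf] * (hi - lo)
    return picks

# Toy profile with NaN padding at the end, like the padded column above.
toy_profile = [0.1, 0.9, 1.0, 0.2, 0.3, 0.8, 0.1, math.nan, math.nan]
picked = top_k_discords(toy_profile, ex_zone=1, k=2)
print(picked)
```

Without the exclusion zone the second pick would be index 1, right next to the first peak; with it, the second pick moves to the separate peak at index 5.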

In [14]:

df.iloc[anoms]

Out[14]:

	value	profile
timestamp
2015-01-27 09:00:00	3874	3.275818
2014-11-02 00:00:00	48219	2.468237
2015-01-25 20:00:00	29503	2.094151
2014-12-31 23:00:00	35978	1.803195
2014-12-24 00:00:00	20646	1.581157
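
To compare these timestamps with the known anomalies, here is a small sketch that measures how many days separate each event from the nearest discord start. The timestamps and event dates are copied from this post; the one-day tolerance for calling something a match is my assumption, chosen because each discord marks the start of a 24-hour window.

```python
from datetime import date, datetime

# Discord start times copied from the output table above.
discord_starts = [
    datetime(2015, 1, 27, 9),
    datetime(2014, 11, 2, 0),
    datetime(2015, 1, 25, 20),
    datetime(2014, 12, 31, 23),
    datetime(2014, 12, 24, 0),
]

# Known anomalies from the data overview.
known_events = {
    "NYC Marathon": [date(2014, 11, 2)],
    "Thanksgiving": [date(2014, 11, 27)],
    "Christmas": [date(2014, 12, 25)],
    "New Year's Day": [date(2015, 1, 1)],
    "Blizzard": [date(2015, 1, 26), date(2015, 1, 27)],
}

TOLERANCE_DAYS = 1  # assumed tolerance, roughly one subsequence length

def nearest_gap_days(event_days, starts):
    """Smallest calendar-day gap between any event day and any discord start."""
    return min(abs((d - s.date()).days) for d in event_days for s in starts)

gaps = {name: nearest_gap_days(days, discord_starts) for name, days in known_events.items()}
matched = [name for name, gap in gaps.items() if gap <= TOLERANCE_DAYS]
missed = [name for name, gap in gaps.items() if gap > TOLERANCE_DAYS]

for name, gap in gaps.items():
    print(f"{name}: {gap} day(s) from nearest discord")
```

With this tolerance, every event except Thanksgiving has a discord on or next to its date.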

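Conclusions

Using the matrix profile to identify discords within the NYC taxi dataset seems fruitful. The top five discords land on or within a day of four of the five known anomalies: the NYC Marathon, Christmas (flagged on Christmas Eve), New Year's Day (flagged in the last hour of New Year's Eve), and the blizzard, which accounts for two of the five discords. Thanksgiving did not make the top five, although the plot shows a peak near it, so asking for more than five discords might surface it. At first it can be a little tricky to decide what subsequence length to use, but once you understand your problem it becomes clear; here, hourly data and day-long events pointed to a length of 24. I will be using this module more frequently when I need to identify anomalies in the future.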