Anomaly Detection: Matrix Profile Discords
Matrix profiles annotate a time series in a way that makes many data mining tasks on that series simple. I will not explain what a matrix profile is in this post. For background, visit this page:
https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
At the KDD 2017 conference, I had the pleasure of being introduced to the concept of matrix profiles. Unfortunately, at the time all of the code was implemented in MATLAB. I dabbled with converting the code to Python but never completed it. Last month, December 2018, someone (a Target employee) created a GitHub repository with working algorithms. Here is the repository:
https://github.com/target/matrixprofile-ts
My main interest in matrix profiles is their usefulness for anomaly detection. This blog post demonstrates how to use the Python module to detect anomalies in a NAB (Numenta Anomaly Benchmark) dataset. Specifically, I am working with the NYC Taxi dataset.
Data Overview
The data consists of the number of taxi passengers from 2014-07-01 to 2015-01-31. There are 5 known anomalies during this period:
- NYC Marathon - 2014-11-02
- Thanksgiving - 2014-11-27
- Christmas - 2014-12-25
- New Year's Day - 2015-01-01
- Blizzard - 2015-01-26 and 2015-01-27
Let's see how close the discords found with the matrix profile come to these dates.
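One way to make "how close" concrete is to measure the gap between a detected timestamp and the nearest known event. The sketch below uses only the dates listed above; the names `KNOWN_EVENTS` and `nearest_event` are my own, and the timestamp in the usage line is a made-up placeholder, not a result from this analysis.

```python
from datetime import datetime

# Known anomaly dates as listed above (the blizzard spans two days).
KNOWN_EVENTS = {
    "NYC Marathon": [datetime(2014, 11, 2)],
    "Thanksgiving": [datetime(2014, 11, 27)],
    "Christmas": [datetime(2014, 12, 25)],
    "New Year's Day": [datetime(2015, 1, 1)],
    "Blizzard": [datetime(2015, 1, 26), datetime(2015, 1, 27)],
}

def nearest_event(detected):
    """Return (event name, gap) for the known event closest to `detected`.

    The gap is measured from midnight at the start of the event day.
    """
    best_name, best_gap = None, None
    for name, days in KNOWN_EVENTS.items():
        for day in days:
            gap = abs(detected - day)
            if best_gap is None or gap < best_gap:
                best_name, best_gap = name, gap
    return best_name, best_gap

# Hypothetical detection time, for illustration only.
name, gap = nearest_event(datetime(2014, 11, 26, 18))
print(name, gap)
```

Because a discord index marks the start of a 24-hour subsequence, a gap of up to a day can still mean the window overlaps the event.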
from matrixprofile import *
from matrixprofile.discords import discords
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
Load Data
df = pd.read_csv('/home/tyler/data/nab/realKnownCause/realKnownCause/nyc_taxi.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp').sort_index()
df.head()
Resample Hourly
The original dataset is in 30-minute increments, so I sum each pair of readings into an hourly total.
df = df.resample('1H').sum()
df.head()
a = df.values.squeeze()
# subsequence length to compute the matrix profile
# since we have hourly measurements and want to find daily events,
# we will create a length of 24 - number of hours in a day
m = 24
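# stomp returns a tuple; the first element holds the matrix profile values
profile = matrixProfile.stomp(a, m)
# the profile has len(a) - m + 1 values, so pad the tail with m - 1 NaNs
# to line it up with the DataFrame index
df['profile'] = np.append(profile[0], np.full(m - 1, np.nan))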
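For intuition about what the profile values mean, here is a brute-force sketch in plain Python: for every window of length `m`, it finds the z-normalized Euclidean distance to the most similar non-overlapping window. This is not how the library computes it (STOMP is far more efficient), and `znorm`, `naive_matrix_profile`, and the exclusion zone of `m // 2` are my own illustrative choices rather than names or settings taken from matrixprofile-ts.

```python
import math

def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]

def naive_matrix_profile(series, m):
    """Brute-force O(n^2 * m) matrix profile, for illustration only."""
    n_sub = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n_sub)]
    excl = m // 2  # assumed exclusion zone to skip trivial self-matches
    profile = []
    for i in range(n_sub):
        best = math.inf
        for j in range(n_sub):
            if abs(i - j) <= excl:
                continue
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(windows[i], windows[j])))
            best = min(best, d)
        profile.append(best)
    return profile

# Toy series: a repeating pattern with a single spike injected at position 16.
series = [0, 1, 0, -1] * 8
series[16] = 5
toy_profile = naive_matrix_profile(series, m=4)
print(len(series), len(toy_profile))
print(toy_profile.index(max(toy_profile)))
```

The toy profile has `len(series) - m + 1` entries, which is why the notebook pads `m - 1` NaN values above. Windows that repeat elsewhere score near zero, while the windows that overlap the spike have no close match and score high.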
Plot Matrix Profile
Below is a plot of the hourly data and the matrix profile. Visually, you can see both motifs (low values) and discords (high values). We are interested in the discords, which show up as high peaks in the matrix profile. A couple of peaks stand out that seem close to Thanksgiving and the blizzard.
# Plot the signal data
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15, 10))
df['value'].plot(ax=ax1, title='Raw Data')
# Plot the matrix profile
df['profile'].plot(ax=ax2, c='r', title='Matrix Profile')
plt.show()
# exclude up to a day on the left and right side
ex_zone = 24
# we look for the 5 events specified in the data explanation
anoms = discords(df['profile'], ex_zone, k=5)
df.iloc[anoms]
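The idea behind picking discords with an exclusion zone can be sketched in a few lines: repeatedly take the highest remaining profile value, then blank out its neighbors so the same event is not reported twice. This is my own minimal illustration of the idea, not the library's `discords` implementation; the name `top_k_discords` and the handling of NaN padding are assumptions.

```python
import math

def top_k_discords(profile, ex_zone, k):
    """Return indices of up to k largest profile values, suppressing neighbors."""
    # Work on a copy; NaN padding is treated as "never pick this position".
    work = [(-math.inf if math.isnan(v) else v) for v in profile]
    picks = []
    for _ in range(k):
        idx = max(range(len(work)), key=work.__getitem__)
        if work[idx] == -math.inf:
            break  # nothing left to pick
        picks.append(idx)
        # Blank out ex_zone positions on each side of the pick.
        lo = max(0, idx - ex_zone)
        hi = min(len(work), idx + ex_zone + 1)
        for j in range(lo, hi):
            work[j] = -math.inf
    return picks

# Toy profile with two separate peaks and NaN padding at the end.
toy = [0.1, 0.2, 5.0, 4.9, 0.3, 0.2, 3.0, 0.1, float('nan')]
print(top_k_discords(toy, ex_zone=1, k=2))
```

Without the exclusion zone, the two highest values in the toy profile would both come from the same peak. With hourly data and `ex_zone = 24`, each reported discord is at least a day away from the others.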
Conclusions
Using the matrix profile to identify discords within the NYC taxi dataset seems fruitful. All of the anomalies mentioned in the dataset overview were found. At first it can be a little tricky to think about what subsequence length to use, but once you understand your problem it becomes clear. I will be using this module more frequently when I need to identify anomalies in the future.