Matrix profiles are used to annotate a time series in a way that makes data mining tasks within the time series simple. I will not explain what a matrix profile is within this post. Feel free to visit this page to understand more about them:
During the KDD 2017 conference, I had the pleasure of being introduced to the concept of matrix profiles. Unfortunately, at the time all of the code was implemented in MATLAB. I dabbled a little with converting the code to Python but never completed it. Last month, December 2018, someone (a Target employee) created a GitHub repository with working algorithms. Here is the GitHub repository:
My main interest in matrix profiles is their usefulness for anomaly detection. This blog post demonstrates how to use the Python module to detect anomalies within a NAB (Numenta Anomaly Benchmark) dataset. Specifically, I am working with the NYC Taxi dataset.
The data consists of the number of taxi passengers from 2014-07-01 to 2015-01-31. There are 5 known anomalies in this period:
- NYC Marathon - 2014-11-02
- Thanksgiving - 2014-11-27
- Christmas - 2014-12-25
- New Years - 2015-01-01
- Snow Blizzard - 2015-01-26 and 2015-01-27
I will see how close the discords found with the matrix profile come to these known dates.
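To make "how close" concrete, here is a minimal sketch of a helper that reports the nearest known event for any timestamp. The dates come from the list above; the names `KNOWN_EVENTS` and `nearest_event` are my own and are not part of NAB or the matrix profile module.

```python
import pandas as pd

# Known anomaly dates as listed in this post.
KNOWN_EVENTS = {
    "NYC Marathon": ["2014-11-02"],
    "Thanksgiving": ["2014-11-27"],
    "Christmas": ["2014-12-25"],
    "New Years": ["2015-01-01"],
    "Snow Blizzard": ["2015-01-26", "2015-01-27"],
}


def nearest_event(ts):
    """Return (event name, gap in whole days) for the closest known event date."""
    day = pd.Timestamp(ts).normalize()  # drop the time of day
    best = None
    for name, dates in KNOWN_EVENTS.items():
        for d in dates:
            gap = abs((day - pd.Timestamp(d)).days)
            if best is None or gap < best[1]:
                best = (name, gap)
    return best


print(nearest_event("2014-11-27 15:00"))
print(nearest_event("2015-01-25 22:00"))
```

A gap of 0 or 1 day is what I would treat as a hit here, since the subsequence length used later covers a full day.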
from matrixprofile import *
from matrixprofile.discords import discords
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
df = pd.read_csv('/home/tyler/data/nab/realKnownCause/realKnownCause/nyc_taxi.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp').sort_index()
df = df.resample('1H').sum()
a = df.values.squeeze()

# subsequence length to compute the matrix profile
# since we have hourly measurements and want to find daily events,
# we will create a length of 24 - number of hours in a day
m = 24
profile = matrixProfile.stomp(a,m)
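# stomp returns a (matrix profile, profile index) tuple; keep only the profile.
# It has len(a) - m + 1 entries, so pad with m - 1 NaNs to align with the raw data.
df['profile'] = np.append(profile[0], np.full(m - 1, np.nan))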
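For intuition about what is being computed, here is a rough brute-force sketch on a toy series. It is not the STOMP algorithm and not the module's code: it simply takes, for every window of length m, the z-normalized Euclidean distance to its nearest non-overlapping neighbor. The names `znorm` and `naive_matrix_profile` and the exclusion radius of `m // 2` are my own assumptions.

```python
import numpy as np


def znorm(x):
    """Z-normalize a window; constant windows map to all zeros."""
    std = x.std()
    if std > 0:
        return (x - x.mean()) / std
    return np.zeros_like(x, dtype=float)


def naive_matrix_profile(ts, m):
    """Brute-force matrix profile (quadratic in the number of windows)."""
    ts = np.asarray(ts, dtype=float)
    n_windows = len(ts) - m + 1
    windows = np.array([znorm(ts[i:i + m]) for i in range(n_windows)])
    excl = m // 2  # assumed radius for skipping trivial self-matches
    profile = np.full(n_windows, np.inf)
    for i in range(n_windows):
        dists = np.linalg.norm(windows - windows[i], axis=1)
        lo, hi = max(0, i - excl), min(n_windows, i + excl + 1)
        dists[lo:hi] = np.inf  # ignore the window itself and close overlaps
        profile[i] = dists.min()
    return profile


# Toy series: a repeating sine pattern with one distorted stretch.
t = np.arange(200)
series = np.sin(2 * np.pi * t / 20)
series[100:110] += 2.0

m = 20
mp = naive_matrix_profile(series, m)

# The profile is m - 1 points shorter than the series, hence the NaN padding.
padded = np.append(mp, np.full(m - 1, np.nan))
print(len(series), len(mp), len(padded))
print(int(np.argmax(mp)))
```

The clean cycles have near-identical neighbors one period away, so their profile values sit close to zero, while windows overlapping the distorted stretch stand out as the peak.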
Plot Matrix Profile
Below is a plot of the hourly data and the matrix profile. Visually, you can see both motifs and discords. We are interested in the discords, which show up as high peaks in the matrix profile. A couple of periods jump out that seem close to Thanksgiving and the blizzard.
#Plot the signal data
fig, (ax1, ax2) = plt.subplots(2,1,sharex=True,figsize=(15,10))
df['value'].plot(ax=ax1, title='Raw Data')

#Plot the Matrix Profile
df['profile'].plot(ax=ax2, c='r', title='Matrix Profile')
plt.show()
# exclude up to a day on the left and right side
ex_zone = 24

# we look for the 5 events specified in the data explanation
anoms = discords(df['profile'], ex_zone, k=5)
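To show what the exclusion zone does, here is a small sketch of a top-k peak search on a synthetic profile. It is my own illustration, not the module's `discords` implementation, and `top_k_discords` is a hypothetical name: pick the largest value, mask `ex_zone` points on each side, and repeat.

```python
import numpy as np
import pandas as pd


def top_k_discords(profile, ex_zone, k):
    """Indices of the k largest profile values, at least ex_zone + 1 apart."""
    work = np.array(profile, dtype=float)
    work[np.isnan(work)] = -np.inf  # padded NaNs can never be selected
    picks = []
    for _ in range(k):
        idx = int(np.argmax(work))
        if not np.isfinite(work[idx]):
            break  # every remaining point is masked
        picks.append(idx)
        lo, hi = max(0, idx - ex_zone), min(len(work), idx + ex_zone + 1)
        work[lo:hi] = -np.inf  # mask the neighborhood of this pick
    return picks


# Synthetic hourly profile (not the taxi data) with two separated peaks.
index = pd.date_range("2014-07-01", periods=96, freq=pd.Timedelta(hours=1))
toy = pd.Series(np.ones(96), index=index)
toy.iloc[30] = 5.0
toy.iloc[31] = 4.5  # second-largest value, but inside the first peak's zone
toy.iloc[70] = 3.0
toy.iloc[-3:] = np.nan  # mimic the NaN padding at the end

picks = top_k_discords(toy.values, ex_zone=24, k=2)
print(toy.index[picks])
```

Without the exclusion zone, the second pick would be index 31, which is really the same event as index 30. In the same way, if `anoms` above holds integer positions, as I assume here, `df.index[anoms]` would give the matching timestamps to compare against the known dates.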
Using the matrix profile to identify discords within the NYC taxi dataset seems fruitful. All of the anomalies mentioned in the dataset overview were found. At first it can be a little tricky to decide which subsequence length to use, but once you understand your problem it becomes clear. I will be using this module more frequently when I need to identify anomalies in the future.