# MASS (Mueen's Algorithm for Similarity Search) - Robot Dog Example

This post is used to illustrate how to use mass-ts; a library that I created composed of similarity search algorithms in time series data. For this example, I will walk you through similarity search on the UCR data set - robot dog.

### Data Explaination¶

The dataset comes from an accelerometer inside a Sony AIBO robot dog. The query comes from a period when the dog was walking on carpet, the test data we will search comes from a time the robot walked on cement (for 5000 data points), then carpet (for 3000 data points), then back onto cement.

Note
All of the code in this example is using Python 3.

In [1]:
import numpy as np
import mass_ts as mts
from matplotlib import pyplot as plt

%matplotlib inline


In [2]:
robot_dog = np.loadtxt('robot_dog.txt')


### Plot Data¶

In [3]:
plt.figure(figsize=(25,5))
plt.plot(np.arange(len(robot_dog)), robot_dog)
plt.title('Robot Dog Sample')
plt.show()


The series data for the robot dog is difficult to see a pattern unless you make the plot very large. At 25 by 5 you can see a pattern in the middle of the data set. This is where the carpet walk takes place.

In [4]:
plt.figure(figsize=(25,5))
plt.plot(np.arange(len(carpet_walk)), carpet_walk)
plt.title('Carpet Walk Sample')
plt.show()


## Search for Carpet Walk¶

Now we can search for the carpet walk snippet within the robot dog sample using MASS.

In [5]:
distances = mts.mass3(robot_dog, carpet_walk, 256)

In [6]:
min_idx = np.argmin(distances)

In [7]:
min_idx

Out[7]:
7479

The minimum index found is the same as the author claims. We can now visualize this below.

In [8]:
plt.figure(figsize=(25,5))
plt.plot(np.arange(len(robot_dog)), robot_dog)
plt.plot(np.arange(min_idx, min_idx + 100), carpet_walk, c='r')
plt.title('Robot Dog Sample')
plt.show()

In [9]:
# Plot the carpet walk query
fig, (ax1, ax2, ax3) = plt.subplots(3,1,sharex=True,figsize=(20,10))
ax1.plot(np.arange(len(carpet_walk)), carpet_walk)
ax1.set_ylabel('Carpet Walk Query', size=12)

# Plot use case best match from original authors
ax2.plot(np.arange(100), robot_dog[7478:7478+100])
ax2.set_ylabel('Original Best Match', size=12)

# Plot our best match
ax3.plot(np.arange(100), robot_dog[min_idx:min_idx+100])
ax3.set_ylabel('Our Best Match', size=12)

plt.show()


# AAMP - Matrix Profile

This notebook is used to implement the AAMP algorithm of the matrix profile from the following paper:

https://arxiv.org/pdf/1901.05708.pdf

It claims to be faster than the Keogh and Mueen algorithm SCRIMP++. It computes the distances with a non-normalized Euclidean distance.

In [1]:
from IPython.display import Image

In [2]:
Image('images/matrixprofileAAMP.PNG')

Out[2]:
In [3]:
import numpy as np

In [4]:
sampledata = np.loadtxt('sampledata.txt')

In [5]:
import matplotlib.pyplot as plt

%matplotlib inline

In [6]:
from matrixprofile.matrixProfile import stomp

In [7]:
import numpy as np

def aamp(t, m):
"""
Computes the matrix profile using squared distance to optimize
computation.

Parameters
----------
t : np.array
Series object to compute the matrix profile.
m : int
Window size.

Returns
-------
tuple : (np.array, np.array)
The matrix profile and matrix profile index.
"""
t = np.array(t)

# initialize profile
n = len(t)
s = n - m
p = np.full(s, np.inf)
pi = np.full(s, np.inf)

# Additional optimization - Powers of 2 are better to compute than
# odd numbers.
if len(t) % 2 == 1:
t = np.insert(t, 0, 0)

for k in range(s - 1):
a = t[0:m]
b = t[k+1:m+k+1]
d = np.sum((a - b) ** 2)

if d < p[0]:
p[0] = d
pi[0] = k

if d < p[k]:
p[k] = d
pi[k] = 0

for i in range(s - k - 1):
kplusi = k + i + 1
d = d - (t[i] - t[kplusi]) ** 2 + (t[m+i] - t[m+kplusi]) ** 2

if p[i] > d:
p[i] = d
pi[i] = kplusi

if p[kplusi] > d:
p[kplusi] = d
pi[kplusi] = i

p = np.sqrt(p)
return (p, pi)

In [8]:
a = np.array([1, 2, 3])
a ** 2

Out[8]:
array([1, 4, 9])

### Experiment 1¶

In this experiment I compare STOMP with the AAMP algorithm on a synthetic data set.

In [9]:
t = sampledata
m = 32

In [10]:
plt.figure(figsize=(20, 5))
plt.plot(range(len(t)), t)
plt.title('Synthetic Data')
plt.show()

In [11]:
%%timeit
p, pi = aamp(t, m)

976 ms ± 55.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:
p, pi = aamp(t, m)

In [13]:
%%timeit
stomp_mp, stomp_idx = stomp(np.array(t), m)

263 ms ± 34.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]:
stomp_mp, stomp_idx = stomp(np.array(t), m)

In [15]:
#Plot the signal data
fig, (ax1, ax2, ax3) = plt.subplots(3,1,sharex=True,figsize=(20,5))
ax1.plot(np.arange(len(t)), t, label="Synthetic Data")
ax1.set_ylabel('Signal', size=12)

#Plot the Matrix Profile
ax2.set_ylabel('AAMP', size=12)

#Plot the Matrix Profile
ax3.set_ylabel('STOMP', size=12)
ax3.set_xlabel('Sample', size=12)

plt.tight_layout()
plt.show()


In the first expirment, you see that STOMP is significantly faster when compared in Python. The research paper claims that the window size (m) and series length (n) are major factors when comparing the algorithms. Note that in the paper they compare SCRIMP++. However, at this time I do not have a Python implementation of SCRIMP++. Also, STOMP is faster than SCRIMP++ according to Keogh's paper, Matrix Proﬁle XI: SCRIMP++: Time Series Motif Discovery at Interactive Speed.

The distance profiles of both the AAMP and STOMP algorithm differ. This is due to the differences in calculation. The AAMP algorith does not Z-Normalize the data.

### Experiment 2¶

In this experiment I generate a uniform distribution and follow what the research claims; m an n are major factors in time complexity.

In [16]:
t = np.random.uniform(size=2**10)
m = 2 ** 5

In [17]:
plt.figure(figsize=(20, 5))
plt.plot(range(len(t)), t)
plt.title('Synthetic Data')
plt.show()

In [18]:
%%timeit
p, pi = aamp(t, m)

1.3 s ± 36.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]:
%%timeit
stomp_mp, stomp_idx = stomp(np.array(t), m)

285 ms ± 9.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In this experiment we see that STOMP is still faster.

### Experiment 3 - Matlab runtimes of AAMP vs SCRIMP++¶

I reached out to the author of the paper to identify any issues with their claims. They provided me with the Matlab code used in their experiments. This experiment compares Matlab implementations to one another.

In [20]:
Image('images/matlab_scrimp_aamp.PNG')

Out[20]:

The AAMP algorithm performs significantly faster than the SCRIMP++ algorithm within Matlab. I'm not too familiar with Matlab as a whole, but it seems that there is some sort of optimization that lacks within the Python implementation.

# Dedupe - Record Deduplication

A Data Scientist's task is 80% data cleaning and 20% modelling. In this post, I show how you can deduplicate records quicker utilizing the dedupe library. The dedupe library, from the company Dedupe.io, essentially makes the task of identifying duplicate records easy. You train a model and it clusters duplicates. Thankfully the company released an open source library that can be used by anyone with knowledge of coding. However, if you are not inclined to write code, I suggest that you check out their GUI software at dedupe.io.

This post will focus on a library, pandas dedupe, that I have contributed to. It brings the power of dedupe to the pandas library making it interactive within a Jupyter notebook. The pandas dedupe library is found at:

https://github.com/Lyonk71/pandas-dedupe

## Install Pandas Dedupe Library¶

In [1]:
!pip install git+git://github.com/Lyonk71/pandas-dedupe.git

Collecting git+git://github.com/Lyonk71/pandas-dedupe.git
Cloning git://github.com/Lyonk71/pandas-dedupe.git to c:\users\tyler\appdata\local\temp\pip-req-build-xytctzaq
Requirement already satisfied: dedupe in c:\programdata\anaconda3\lib\site-packages (from pandas-dedupe==0.42) (1.9.6)
Requirement already satisfied: unidecode in c:\programdata\anaconda3\lib\site-packages (from pandas-dedupe==0.42) (1.0.23)
Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-packages (from pandas-dedupe==0.42) (0.20.1)
Requirement already satisfied: affinegap>=1.3 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.10)
Requirement already satisfied: future>=0.14 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.17.1)
Requirement already satisfied: dedupe-hcluster in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.3.6)
Requirement already satisfied: rlr>=2.4.3 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (2.4.5)
Requirement already satisfied: doublemetaphone in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.1)
Requirement already satisfied: simplecosine>=1.2 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.2)
Requirement already satisfied: zope.index in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (4.4.0)
Requirement already satisfied: highered>=0.2.0 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.2.1)
Requirement already satisfied: numpy>=1.13 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.16.2)
Requirement already satisfied: fastcluster<1.1.25 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.1.24)
Requirement already satisfied: BTrees>=4.1.4 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (4.5.1)
Requirement already satisfied: Levenshtein-search in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.4.4)
Requirement already satisfied: categorical-distance>=1.9 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.9)
Requirement already satisfied: dedupe-variable-datetime in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.1.5)
Requirement already satisfied: simplejson in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (3.16.0)
Requirement already satisfied: haversine>=0.4.1 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (2.0.0)
Requirement already satisfied: python-dateutil>=2 in c:\programdata\anaconda3\lib\site-packages (from pandas->pandas-dedupe==0.42) (2.6.0)
Requirement already satisfied: pytz>=2011k in c:\programdata\anaconda3\lib\site-packages (from pandas->pandas-dedupe==0.42) (2017.2)
Requirement already satisfied: pylbfgs in c:\programdata\anaconda3\lib\site-packages (from rlr>=2.4.3->dedupe->pandas-dedupe==0.42) (0.2.0.12)
Requirement already satisfied: setuptools in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (39.2.0)
Requirement already satisfied: persistent in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (4.4.3)
Requirement already satisfied: six in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (1.10.0)
Requirement already satisfied: zope.interface in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (4.6.0)
Requirement already satisfied: datetime-distance in c:\programdata\anaconda3\lib\site-packages (from dedupe-variable-datetime->dedupe->pandas-dedupe==0.42) (0.1.3)
Building wheels for collected packages: pandas-dedupe
Building wheel for pandas-dedupe (setup.py): started
Building wheel for pandas-dedupe (setup.py): finished with status 'done'
Stored in directory: C:\Users\tyler\AppData\Local\Temp\pip-ephem-wheel-cache-yfm93kjy\wheels\13\cd\fe\56faa6c628f81a5fac9c9f4245ab7b1f57dc2df48159f0a9b5
Successfully built pandas-dedupe

You are using pip version 19.0.3, however version 19.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


## Example of Deduplication¶

In [2]:
from pandas_dedupe import dedupe_dataframe

In [3]:
import pandas as pd


### Generate Fake Data¶

In this section I generate some fake data and duplicate some records.

In [4]:
import faker

In [5]:
fake = faker.Faker()

In [6]:
data = {
'Name': [],
}

for i in range(100):
data['Name'].append(fake.name())
df = pd.DataFrame(data)


### Duplicate Records¶

Here I duplicate some records so that we can demonstrate dedupe. When you have already trained the model pandas_dedupe reads that training file and uses that for clustering.

In [7]:
df = pd.concat([df, df.sample(frac=0.2)])

In [8]:
len(df)

Out[8]:
120
In [9]:
len(df.drop_duplicates())

Out[9]:
100
In [10]:
dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])

WARNING:dedupe.backport:Dedupe does not currently support multiprocessing on Windows

importing data ...
clustering...
# duplicate sets 6


### Illustrate Dedupe¶

Dedupe will prompt with many records that it thinks are similar. You tell it what is and isn't similar so that the model can give better results.

### Dedupe Output¶

Once the training and clustering process is complete, you are presented with a dataframe that provides a cluster id and confidence. Records with similar cluster ids are considered as duplicates. The confidence score provides you with a certaintity score from 0 to 1.

In [11]:
dedupe_df.sort_values(['confidence', 'cluster id'], ascending=False)

Out[11]:
52 028 michael orchard suite 654 carterside, in 6... michael pugh 3.0 0.434289
98 881 adrian centers apt. 030 lake timothymouth,... brett bowman 3.0 0.434289
9 unit 1117 box 7761 dpo ap 36039 tami sanders 1.0 0.406689
62 unit 5686 box 6589 dpo ap 37867 holly zuniga 1.0 0.406689
62 unit 5686 box 6589 dpo ap 37867 holly zuniga 1.0 0.406689
57 0871 ross court apt. 021 north josephstad, dc ... edward patrick 4.0 0.399691
57 0871 ross court apt. 021 north josephstad, dc ... edward patrick 4.0 0.399691
79 873 leblanc rapid apt. 613 new virginia, nd 66835 veronica barnes 4.0 0.359868
13 psc 7201, box 7089 apo ap 08072 donna rodriguez 0.0 0.358922
59 873 becker lodge haroldfort, wa 06364 carolyn cruz 4.0 0.355432
2 psc 4428, box 6674 apo ae 16543 kathryn rios 0.0 0.354226
41 psc 6766, box 2963 apo ap 87901 timothy ellis 0.0 0.353936
41 psc 6766, box 2963 apo ap 87901 timothy ellis 0.0 0.353936
14 psc 7160, box 2173 apo aa 44802 todd fitzgerald 0.0 0.350237
66 607 edward plains suite 962 sydneyhaven, in 63516 kimberly becker 5.0 0.342488
80 605 allen motorway apt. 482 veronicaport, ny 9... lauren jensen 5.0 0.342488
56 psc 2245, box 6643 apo aa 25112 alexander gardner 0.0 0.332051
67 psc 0257, box 1093 apo ae 29265 lisa walter 0.0 0.328142
67 psc 0257, box 1093 apo ae 29265 lisa walter 0.0 0.328142
19 552 pierce lodge port kari, ky 85896 daniel hoffman 2.0 0.314833
69 8935 kidd river apt. 554 north charleneport, t... tammie foster 2.0 0.314833
0 8436 jones pine suite 444 east heidi, nh 32076 jennifer villa NaN NaN
1 084 chambers drive apt. 915 bryanmouth, nc 41999 peggy scott NaN NaN
3 1034 garrison flat phillipview, nh 17993 teresa young NaN NaN
4 usns owens fpo ae 15528 michael russell NaN NaN
4 usns owens fpo ae 15528 michael russell NaN NaN
5 5288 cervantes spurs apt. 210 lake matthewport... susan allen NaN NaN
6 6775 joseph lakes apt. 870 west olivia, in 47685 johnny newton NaN NaN
7 78796 mallory street port mark, sd 96285 ronald guerra NaN NaN
8 800 sanders drive port mark, md 29074 elizabeth cummings NaN NaN
... ... ... ... ...
72 29989 julie street apt. 167 north dylanbury, n... gregory smith NaN NaN
73 57766 alvarez neck apt. 525 new mauricefurt, s... jennifer stephens NaN NaN
74 21038 scott fork port coryville, in 32757 charles hunt NaN NaN
75 7783 hughes rest apt. 006 lake william, nd 66845 jill alexander NaN NaN
76 345 april camp suite 232 east christopher, il ... donald rogers NaN NaN
77 0638 connie centers gardnermouth, or 34880 jose burns NaN NaN
77 0638 connie centers gardnermouth, or 34880 jose burns NaN NaN
78 648 zimmerman spur tranville, il 59313 christopher kelly NaN NaN
81 383 daniel trail suite 919 port frank, co 15515 carlos walker NaN NaN
82 4882 davis passage suite 848 lake danaville, k... michael miranda NaN NaN
83 0038 rogers keys apt. 163 fergusonchester, mo ... derek johnson NaN NaN
84 8991 dillon well hunterborough, fl 59577 chad nelson NaN NaN
84 8991 dillon well hunterborough, fl 59577 chad nelson NaN NaN
85 777 turner extension tammybury, ok 61562 matthew lamb NaN NaN
86 86773 lauren union apt. 215 davidbury, ok 85820 margaret bennett NaN NaN
86 86773 lauren union apt. 215 davidbury, ok 85820 margaret bennett NaN NaN
87 493 armstrong vista port james, dc 08747 troy garcia NaN NaN
87 493 armstrong vista port james, dc 08747 troy garcia NaN NaN
88 6651 giles crest ortegamouth, nm 21927 yvonne willis NaN NaN
89 5751 garrett curve suite 957 emilyhaven, ut 51537 christopher martinez NaN NaN
90 461 kimberly turnpike michaelstad, dc 03933 scott mitchell NaN NaN
91 72268 white inlet guerrerotown, sd 47498 john gill NaN NaN
92 78043 klein vista apt. 904 west adrianshire, n... tracy newman NaN NaN
93 172 gordon vista apt. 897 mayton, ca 88189 jordan brown NaN NaN
94 889 kenneth lake hernandezberg, la 76212 kimberly meadows NaN NaN
95 1640 watkins view north omarhaven, al 93195 christopher garcia NaN NaN
96 48848 jennifer loaf apt. 203 port katherinemou... joanna snyder NaN NaN
96 48848 jennifer loaf apt. 203 port katherinemou... joanna snyder NaN NaN
97 9160 moreno knoll apt. 368 new sean, nc 61255 sarah hayes NaN NaN
99 157 bonilla cliff suite 675 clarktown, wy 74303 eric lee NaN NaN

120 rows × 4 columns

Notice that I suggested dedupe should consider invalid records as similar. This can affect the end result of your clustered, however for demonstration purposes this suffices.