Dedupe - Record Deduplication

A Data Scientist's job is often said to be 80% data cleaning and 20% modelling. In this post, I show how you can deduplicate records more quickly using the dedupe library. The dedupe library, from the company Dedupe.io, makes the task of identifying duplicate records easy: you train a model and it clusters the duplicates for you. Thankfully, the company released the core as an open source library that anyone comfortable with code can use. If you would rather not write code, I suggest that you check out their GUI software at dedupe.io.

This post focuses on pandas-dedupe, a library that I have contributed to. It brings the power of dedupe to pandas dataframes, making the process interactive within a Jupyter notebook. The pandas-dedupe library is found at:

https://github.com/Lyonk71/pandas-dedupe

Install Pandas Dedupe Library

In [1]:
!pip install git+git://github.com/Lyonk71/pandas-dedupe.git
Collecting git+git://github.com/Lyonk71/pandas-dedupe.git
  Cloning git://github.com/Lyonk71/pandas-dedupe.git to c:\users\tyler\appdata\local\temp\pip-req-build-xytctzaq
Requirement already satisfied (use --upgrade to upgrade): pandas-dedupe==0.42 from git+git://github.com/Lyonk71/pandas-dedupe.git in c:\users\tyler\src\pandas-dedupe
Requirement already satisfied: dedupe in c:\programdata\anaconda3\lib\site-packages (from pandas-dedupe==0.42) (1.9.6)
Requirement already satisfied: unidecode in c:\programdata\anaconda3\lib\site-packages (from pandas-dedupe==0.42) (1.0.23)
Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-packages (from pandas-dedupe==0.42) (0.20.1)
Requirement already satisfied: affinegap>=1.3 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.10)
Requirement already satisfied: future>=0.14 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.17.1)
Requirement already satisfied: dedupe-hcluster in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.3.6)
Requirement already satisfied: rlr>=2.4.3 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (2.4.5)
Requirement already satisfied: doublemetaphone in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.1)
Requirement already satisfied: simplecosine>=1.2 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.2)
Requirement already satisfied: zope.index in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (4.4.0)
Requirement already satisfied: highered>=0.2.0 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.2.1)
Requirement already satisfied: numpy>=1.13 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.16.2)
Requirement already satisfied: fastcluster<1.1.25 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.1.24)
Requirement already satisfied: BTrees>=4.1.4 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (4.5.1)
Requirement already satisfied: Levenshtein-search in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.4.4)
Requirement already satisfied: categorical-distance>=1.9 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (1.9)
Requirement already satisfied: dedupe-variable-datetime in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (0.1.5)
Requirement already satisfied: simplejson in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (3.16.0)
Requirement already satisfied: haversine>=0.4.1 in c:\programdata\anaconda3\lib\site-packages (from dedupe->pandas-dedupe==0.42) (2.0.0)
Requirement already satisfied: python-dateutil>=2 in c:\programdata\anaconda3\lib\site-packages (from pandas->pandas-dedupe==0.42) (2.6.0)
Requirement already satisfied: pytz>=2011k in c:\programdata\anaconda3\lib\site-packages (from pandas->pandas-dedupe==0.42) (2017.2)
Requirement already satisfied: pylbfgs in c:\programdata\anaconda3\lib\site-packages (from rlr>=2.4.3->dedupe->pandas-dedupe==0.42) (0.2.0.12)
Requirement already satisfied: setuptools in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (39.2.0)
Requirement already satisfied: persistent in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (4.4.3)
Requirement already satisfied: six in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (1.10.0)
Requirement already satisfied: zope.interface in c:\programdata\anaconda3\lib\site-packages (from zope.index->dedupe->pandas-dedupe==0.42) (4.6.0)
Requirement already satisfied: pyhacrf-datamade>=0.2.0 in c:\programdata\anaconda3\lib\site-packages (from highered>=0.2.0->dedupe->pandas-dedupe==0.42) (0.2.3)
Requirement already satisfied: datetime-distance in c:\programdata\anaconda3\lib\site-packages (from dedupe-variable-datetime->dedupe->pandas-dedupe==0.42) (0.1.3)
Building wheels for collected packages: pandas-dedupe
  Building wheel for pandas-dedupe (setup.py): started
  Building wheel for pandas-dedupe (setup.py): finished with status 'done'
  Stored in directory: C:\Users\tyler\AppData\Local\Temp\pip-ephem-wheel-cache-yfm93kjy\wheels\13\cd\fe\56faa6c628f81a5fac9c9f4245ab7b1f57dc2df48159f0a9b5
Successfully built pandas-dedupe
You are using pip version 19.0.3, however version 19.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
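
GitHub has since disabled the unauthenticated git:// protocol, so if the clone step above fails, the HTTPS form of the same URL should work:

!pip install git+https://github.com/Lyonk71/pandas-dedupe.git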

Example of Deduplication

In [2]:
from pandas_dedupe import dedupe_dataframe
In [3]:
import pandas as pd

Generate Fake Data

In this section I use the faker library to generate some fake name and address records.

In [4]:
import faker
In [5]:
fake = faker.Faker()
In [6]:
data = {
    'Name': [],
    'Address': [],
}

for i in range(100):
    data['Name'].append(fake.name())
    data['Address'].append(fake.address())
df = pd.DataFrame(data)

Duplicate Records

Here I duplicate some of the records so that we can demonstrate dedupe. Note that when you have already trained a model, pandas_dedupe reads the saved training file and uses it for clustering instead of prompting you again.
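
If you need to retrain from scratch, delete the cached files so that pandas_dedupe starts a new interactive labeling session. This is a minimal sketch: the settings file name comes from the clustering output below, while the training file name is my assumption based on the library's defaults.

import os

# 'dedupe_dataframe_learned_settings' is named in the clustering output
# below; 'dedupe_dataframe_training.json' is assumed from the library's
# defaults.
for cache_file in ['dedupe_dataframe_learned_settings',
                   'dedupe_dataframe_training.json']:
    if os.path.exists(cache_file):
        os.remove(cache_file)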

In [7]:
df = pd.concat([df, df.sample(frac=0.2)])
In [8]:
len(df)
Out[8]:
120
In [9]:
len(df.drop_duplicates())
Out[9]:
100
In [10]:
dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])
WARNING:dedupe.backport:Dedupe does not currently support multiprocessing on Windows
importing data ...
reading from dedupe_dataframe_learned_settings
clustering...
# duplicate sets 6

Illustrate Dedupe

Dedupe prompts you with pairs of records that it thinks may be duplicates. You label each pair as a match or not (the console labeler accepts yes, no, unsure, and finished responses) so that the model can give better results.

dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])

Dedupe Output

Once the training and clustering process is complete, you are presented with a dataframe that provides a cluster id and a confidence score. Records with the same cluster id are considered duplicates of one another, and the confidence score gives a measure of certainty from 0 to 1.

In [11]:
dedupe_df.sort_values(['confidence', 'cluster id'], ascending=False)
Out[11]:
Address Name cluster id confidence
52 028 michael orchard suite 654 carterside, in 6... michael pugh 3.0 0.434289
98 881 adrian centers apt. 030 lake timothymouth,... brett bowman 3.0 0.434289
9 unit 1117 box 7761 dpo ap 36039 tami sanders 1.0 0.406689
62 unit 5686 box 6589 dpo ap 37867 holly zuniga 1.0 0.406689
62 unit 5686 box 6589 dpo ap 37867 holly zuniga 1.0 0.406689
57 0871 ross court apt. 021 north josephstad, dc ... edward patrick 4.0 0.399691
57 0871 ross court apt. 021 north josephstad, dc ... edward patrick 4.0 0.399691
79 873 leblanc rapid apt. 613 new virginia, nd 66835 veronica barnes 4.0 0.359868
13 psc 7201, box 7089 apo ap 08072 donna rodriguez 0.0 0.358922
59 873 becker lodge haroldfort, wa 06364 carolyn cruz 4.0 0.355432
2 psc 4428, box 6674 apo ae 16543 kathryn rios 0.0 0.354226
41 psc 6766, box 2963 apo ap 87901 timothy ellis 0.0 0.353936
41 psc 6766, box 2963 apo ap 87901 timothy ellis 0.0 0.353936
14 psc 7160, box 2173 apo aa 44802 todd fitzgerald 0.0 0.350237
66 607 edward plains suite 962 sydneyhaven, in 63516 kimberly becker 5.0 0.342488
80 605 allen motorway apt. 482 veronicaport, ny 9... lauren jensen 5.0 0.342488
56 psc 2245, box 6643 apo aa 25112 alexander gardner 0.0 0.332051
67 psc 0257, box 1093 apo ae 29265 lisa walter 0.0 0.328142
67 psc 0257, box 1093 apo ae 29265 lisa walter 0.0 0.328142
19 552 pierce lodge port kari, ky 85896 daniel hoffman 2.0 0.314833
69 8935 kidd river apt. 554 north charleneport, t... tammie foster 2.0 0.314833
0 8436 jones pine suite 444 east heidi, nh 32076 jennifer villa NaN NaN
1 084 chambers drive apt. 915 bryanmouth, nc 41999 peggy scott NaN NaN
3 1034 garrison flat phillipview, nh 17993 teresa young NaN NaN
4 usns owens fpo ae 15528 michael russell NaN NaN
4 usns owens fpo ae 15528 michael russell NaN NaN
5 5288 cervantes spurs apt. 210 lake matthewport... susan allen NaN NaN
6 6775 joseph lakes apt. 870 west olivia, in 47685 johnny newton NaN NaN
7 78796 mallory street port mark, sd 96285 ronald guerra NaN NaN
8 800 sanders drive port mark, md 29074 elizabeth cummings NaN NaN
... ... ... ... ...
72 29989 julie street apt. 167 north dylanbury, n... gregory smith NaN NaN
73 57766 alvarez neck apt. 525 new mauricefurt, s... jennifer stephens NaN NaN
74 21038 scott fork port coryville, in 32757 charles hunt NaN NaN
75 7783 hughes rest apt. 006 lake william, nd 66845 jill alexander NaN NaN
76 345 april camp suite 232 east christopher, il ... donald rogers NaN NaN
77 0638 connie centers gardnermouth, or 34880 jose burns NaN NaN
77 0638 connie centers gardnermouth, or 34880 jose burns NaN NaN
78 648 zimmerman spur tranville, il 59313 christopher kelly NaN NaN
81 383 daniel trail suite 919 port frank, co 15515 carlos walker NaN NaN
82 4882 davis passage suite 848 lake danaville, k... michael miranda NaN NaN
83 0038 rogers keys apt. 163 fergusonchester, mo ... derek johnson NaN NaN
84 8991 dillon well hunterborough, fl 59577 chad nelson NaN NaN
84 8991 dillon well hunterborough, fl 59577 chad nelson NaN NaN
85 777 turner extension tammybury, ok 61562 matthew lamb NaN NaN
86 86773 lauren union apt. 215 davidbury, ok 85820 margaret bennett NaN NaN
86 86773 lauren union apt. 215 davidbury, ok 85820 margaret bennett NaN NaN
87 493 armstrong vista port james, dc 08747 troy garcia NaN NaN
87 493 armstrong vista port james, dc 08747 troy garcia NaN NaN
88 6651 giles crest ortegamouth, nm 21927 yvonne willis NaN NaN
89 5751 garrett curve suite 957 emilyhaven, ut 51537 christopher martinez NaN NaN
90 461 kimberly turnpike michaelstad, dc 03933 scott mitchell NaN NaN
91 72268 white inlet guerrerotown, sd 47498 john gill NaN NaN
92 78043 klein vista apt. 904 west adrianshire, n... tracy newman NaN NaN
93 172 gordon vista apt. 897 mayton, ca 88189 jordan brown NaN NaN
94 889 kenneth lake hernandezberg, la 76212 kimberly meadows NaN NaN
95 1640 watkins view north omarhaven, al 93195 christopher garcia NaN NaN
96 48848 jennifer loaf apt. 203 port katherinemou... joanna snyder NaN NaN
96 48848 jennifer loaf apt. 203 port katherinemou... joanna snyder NaN NaN
97 9160 moreno knoll apt. 368 new sean, nc 61255 sarah hayes NaN NaN
99 157 bonilla cliff suite 675 clarktown, wy 74303 eric lee NaN NaN

120 rows × 4 columns

Notice that during labeling I told dedupe that some records that were not true duplicates were similar. This can affect the end result of your clustering; however, for demonstration purposes it suffices.
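
With the clusters in hand, a little pandas is enough to collapse each duplicate set to a single record. A minimal sketch, assuming the 'cluster id' and 'confidence' column names shown above:

# Rows dedupe placed in a duplicate set versus rows it left unmatched.
clustered = dedupe_df[dedupe_df['cluster id'].notna()]
unmatched = dedupe_df[dedupe_df['cluster id'].isna()]

# Keep the highest-confidence record from each duplicate set and
# re-attach the unmatched records.
deduplicated = pd.concat([
    clustered.sort_values('confidence', ascending=False)
             .drop_duplicates(subset='cluster id', keep='first'),
    unmatched,
])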

Matrix Profile - Brute Force

This notebook demonstrates the brute-force algorithm for computing the matrix profile. The algorithm comes from this slide deck:

https://www.cs.ucr.edu/~eamonn/Matrix_Profile_Tutorial_Part2.pdf

In [1]:
from IPython.display import Image
In [2]:
Image('images/bruteforce.png')
Out[2]:
[Image: brute force matrix profile algorithm from the slide deck]
In [3]:
import numpy as np
In [4]:
import warnings
In [5]:
def znormalize(series):
    # Work on a copy so the caller's array (or a view into it) is not
    # mutated in place.
    series = series - np.mean(series)
    std = np.std(series)

    if std == 0:
        raise ValueError("The standard deviation cannot be zero")

    return series / std

def brute_force_mp(series, window_size=None, query=None):
    if query is None:
        raise ValueError("A query subsequence is required")

    if isinstance(query, np.ndarray) and window_size:
        warnings.warn('Query and window_size provided. Using query to determine window size.')

    if isinstance(query, np.ndarray):
        query = znormalize(query)
        window_size = len(query)

    # If length is odd, zero-pad the time series
    if len(series) % 2 == 1:
        series = np.insert(series, 0, 0)

    # If length is odd, zero-pad the query
    if len(query) % 2 == 1:
        query = np.insert(query, 0, 0)
        window_size += 1

    series_len = len(series)
    shape = series_len - window_size + 1
    profile = np.full(shape, np.inf)

    # Slide a window across the series and record the z-normalized
    # Euclidean distance between each subsequence and the query.
    for index in range(0, series_len - window_size + 1):
        sub_sequence = znormalize(series[index:index + window_size])
        distance = np.sqrt(np.sum(np.power(sub_sequence - query, 2)))
        profile[index] = distance

    return profile
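
As a quick sanity check (a synthetic example of my own, not part of the original post), a query lifted straight out of a random walk should produce a distance of zero at its own location:

np.random.seed(0)
walk = np.cumsum(np.random.randn(200))  # even length, so the padding branch is skipped
sanity_query = walk[50:70]              # even window length for the same reason

sanity_profile = brute_force_mp(walk, query=sanity_query)
print(np.argmin(sanity_profile))  # 50: the query matches itself with distance 0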

Data

The data consists of closing stock prices. I use the algorithm to identify which part of the series is closest to our query subsequence.

In [42]:
series = np.loadtxt('stock_close.txt')
query = series[35:55]

mp = brute_force_mp(series, query=query)
In [43]:
from matplotlib import pyplot as plt

%matplotlib inline
In [44]:
plt.figure(figsize=(15,2))
plt.plot(series)
x = np.arange(35, 55)  # highlight the query region, series[35:55]
y = series[x]
plt.plot(x, y, c='r')
plt.title('Raw Data')
plt.show()

This plot shows the raw data, with the query subsequence being searched for highlighted in red.

In [46]:
plt.figure(figsize=(15,2))
plt.plot(mp)
x = np.arange(35, 55)  # the region of the profile containing the match
y = mp[x]
plt.plot(x, y, c='r')
plt.title('Matrix Profile')
plt.show()

The matrix profile shows that the pattern was found: note the large dip in the plot. The dip indicates that the distance between the query and that location in the series is very small; in this case it is zero, since the query was taken directly from the series. A repeated pattern found this way is called a motif.
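
The full brute-force matrix profile from the slide deck is just this distance profile computed for every subsequence of the series against the series itself, with an exclusion zone so a subsequence does not trivially match itself. Here is a sketch of that self-join, assuming an even-length series and an even window size so the zero-padding branches above never fire:

def brute_force_self_join(series, window_size):
    # O(n^2) brute force: one distance profile per subsequence.
    n = len(series) - window_size + 1
    matrix_profile = np.full(n, np.inf)
    exclusion = max(1, window_size // 2)  # a common default of m / 2

    for i in range(n):
        dp = brute_force_mp(series, query=series[i:i + window_size])
        # Blank out trivial matches near the subsequence itself.
        lo, hi = max(0, i - exclusion), min(n, i + exclusion + 1)
        dp[lo:hi] = np.inf
        matrix_profile[i] = dp.min()

    return matrix_profile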

Pareto Plot With Matplotlib

A Pareto plot is essentially a bar chart sorted in descending order, with a line showing the cumulative percentage of the total. It combines the information of a bar chart and a pie chart into one. In this post I demonstrate how to create a Pareto plot in Python with matplotlib.

In [1]:
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

Create some sample data representing sales from an ice cream shop.

In [2]:
df = pd.DataFrame({
    'Flavor': ['Chocolate', 'Vanilla', 'Mint', 'Swirl', 'Nut'],
    'Orders': [1500,  670,  950,  450,   75]
})

Plot a bar and pie chart of the data to illustrate the benefit of using a Pareto plot.

In [3]:
df.plot(kind='bar', x='Flavor', y='Orders', legend=None, title='Ice Cream Sales')
plt.show()
In [4]:
df.plot(kind='pie', labels=df['Flavor'], y='Orders', legend=None, autopct='%.2f%%', title='Ice Cream Sales')
plt.tight_layout()
plt.show()
In [5]:
def pareto_plot(df, x=None, y=None, title=None, show_pct_y=False, pct_format='{0:.0%}'):
    xlabel = x
    ylabel = y

    # Sort the bars in descending order and compute each bar's share
    # of the total along with the running cumulative share.
    tmp = df.sort_values(y, ascending=False)
    x = tmp[x].values
    y = tmp[y].values
    weights = y / y.sum()
    cumsum = weights.cumsum()

    fig, ax1 = plt.subplots()
    ax1.bar(x, y)
    ax1.set_xlabel(xlabel)
    ax1.set_ylabel(ylabel)

    # Second y-axis for the cumulative percentage line.
    ax2 = ax1.twinx()
    ax2.plot(x, cumsum, '-ro', alpha=0.5)
    ax2.set_ylabel('', color='r')
    ax2.tick_params('y', colors='r')

    vals = ax2.get_yticks()
    ax2.set_yticklabels(['{:,.2%}'.format(v) for v in vals])

    # hide y-labels on right side
    if not show_pct_y:
        ax2.set_yticks([])

    # Annotate each point on the line with its cumulative percentage.
    formatted_weights = [pct_format.format(v) for v in cumsum]
    for i, txt in enumerate(formatted_weights):
        ax2.annotate(txt, (x[i], cumsum[i]), fontweight='heavy')

    if title:
        plt.title(title)

    plt.tight_layout()
    plt.show()
In [6]:
pareto_plot(df, x='Flavor', y='Orders', title='Ice Cream Sales')

From our toy example, we can see that chocolate, mint, and vanilla make up 86% of our sales. From an advertising perspective, we could offer discounts on those flavors to increase their sales further. Hopefully, you can see the benefit of using this plot instead of a separate bar and pie chart.
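
The 86% figure is easy to verify directly from the dataframe:

top3 = df.sort_values('Orders', ascending=False).head(3)
print(top3['Orders'].sum() / df['Orders'].sum())  # 0.8559..., roughly 86%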
