Dedupe - Record Deduplication
A Data Scientist's task is often said to be 80% data cleaning and 20% modelling. In this post, I show how you can deduplicate records faster using the dedupe library. The dedupe library, from the company Dedupe.io, makes the task of identifying duplicate records easy: you train a model and it clusters the duplicates for you. Thankfully, the company released it as an open source library that anyone comfortable with code can use. If you are not inclined to write code, I suggest you check out their GUI software at dedupe.io.
This post focuses on pandas-dedupe, a library I have contributed to. It brings the power of dedupe to pandas, making deduplication interactive within a Jupyter notebook. The library is installed below straight from its GitHub repository.
Install Pandas Dedupe Library
!pip install git+https://github.com/Lyonk71/pandas-dedupe.git
Example of Deduplication
from pandas_dedupe import dedupe_dataframe
import pandas as pd
Generate Fake Data
In this section I generate some fake data and duplicate some records.
import faker
fake = faker.Faker()
data = {
    'Name': [],
    'Address': [],
}
for i in range(100):
    data['Name'].append(fake.name())
    data['Address'].append(fake.address())
df = pd.DataFrame(data)
Duplicate Records
Here I duplicate a sample of the records so that we have something for dedupe to find. Note that if you have already trained the model, pandas_dedupe reads the saved training file and uses it for clustering instead of prompting you again.
df = pd.concat([df, df.sample(frac=0.2)])
len(df)
len(df.drop_duplicates())
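Because df.sample copies rows verbatim, the duplicates above are exact, and drop_duplicates alone would catch them. The real value of dedupe is in fuzzy matches. Here is a hedged sketch of injecting near-duplicates; the add_typos helper is hypothetical, not part of any library, and I use a tiny hand-built dataframe so the snippet stands alone:

```python
import random
import pandas as pd

random.seed(0)

def add_typos(s, n=2):
    """Randomly replace n characters to simulate data-entry errors."""
    chars = list(s)
    for _ in range(n):
        i = random.randrange(len(chars))
        chars[i] = random.choice('abcdefghijklmnopqrstuvwxyz')
    return ''.join(chars)

df = pd.DataFrame({'Name': ['Alice Smith', 'Bob Jones', 'Carol White'],
                   'Address': ['1 Main St', '2 Oak Ave', '3 Pine Rd']})

# make noisy copies of a couple of rows, then append them;
# these will no longer be exact duplicates
noisy = df.sample(n=2, random_state=0).copy()
noisy['Name'] = noisy['Name'].map(add_typos)
df_noisy = pd.concat([df, noisy], ignore_index=True)
```

Running dedupe_dataframe on df_noisy would exercise the fuzzy matching that drop_duplicates cannot do.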
Illustrate Dedupe
Dedupe will prompt you with pairs of records that it thinks are similar. You tell it which pairs are and are not duplicates so that the model can give better results.
dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])
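If you want to retrain from scratch rather than reuse previously saved labels, you can delete the cached training artifacts before calling dedupe_dataframe again. The filenames below are an assumption based on the pandas-dedupe README; the snippet creates stand-in files so it runs on its own:

```python
import os

# stand-ins for the cache files pandas_dedupe writes to the working
# directory (filenames assumed from the library's README)
cache_files = ('dedupe_dataframe_learned_settings',
               'dedupe_dataframe_training.json')
for f in cache_files:
    open(f, 'w').close()  # simulate previously saved training artifacts

# delete them so the next dedupe_dataframe call retrains from scratch
for f in cache_files:
    if os.path.exists(f):
        os.remove(f)
```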
Dedupe Output
Once the training and clustering process is complete, you are presented with a dataframe containing a cluster id and a confidence value. Records with the same cluster id are considered duplicates, and the confidence column gives a certainty score from 0 to 1.
dedupe_df.sort_values(['confidence', 'cluster id'], ascending=False)
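A common next step is to collapse each cluster down to one canonical record. A minimal sketch, assuming the 'cluster id' and 'confidence' column names shown above and using a toy frame shaped like pandas_dedupe's output:

```python
import pandas as pd

# toy frame shaped like pandas_dedupe's output (column names assumed)
dedupe_df = pd.DataFrame({
    'Name': ['Alice Smith', 'Alice Smyth', 'Bob Jones'],
    'Address': ['1 Main St', '1 Main St', '2 Oak Ave'],
    'cluster id': [0, 0, 1],
    'confidence': [0.95, 0.91, 1.0],
})

# keep the highest-confidence record from each cluster
canonical = (dedupe_df.sort_values('confidence', ascending=False)
                      .drop_duplicates('cluster id')
                      .sort_index())
```

Here the two "Alice" rows share cluster id 0, so only the higher-confidence one survives.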
Notice that during labeling I told dedupe that some records which were not true duplicates were similar. This can affect the end result of your clustering; however, for demonstration purposes it suffices.