Dedupe - Record Deduplication
A Data Scientist's task is often said to be 80% data cleaning and 20% modelling. In this post, I show how you can deduplicate records faster using the dedupe library. The dedupe library, from the company Dedupe.io, makes the task of identifying duplicate records easy: you train a model and it clusters the duplicates for you. Thankfully, the company released it as an open source library that anyone comfortable with code can use. If you are not inclined to write code, I suggest you check out their GUI software at dedupe.io.
This post focuses on pandas-dedupe, a library I have contributed to. It brings the power of dedupe to pandas, making deduplication interactive within a Jupyter notebook. The library is installed below straight from its GitHub repository.
Install Pandas Dedupe Library
!pip install git+https://github.com/Lyonk71/pandas-dedupe.git
Example of Deduplication
from pandas_dedupe import dedupe_dataframe
import pandas as pd
Generate Fake Data
In this section I generate some fake data and duplicate some records.
import faker
fake = faker.Faker()
data = {
    'Name': [],
    'Address': [],
}
for i in range(100):
    data['Name'].append(fake.name())
    data['Address'].append(fake.address())
df = pd.DataFrame(data)
Duplicate Records
Here I duplicate a sample of the records so that we have something for dedupe to find. Note that if you have already trained the model, pandas_dedupe reads the saved training file and uses it for clustering instead of prompting you again.
df = pd.concat([df, df.sample(frac=0.2)])
len(df)
len(df.drop_duplicates())
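Because df.sample copies rows verbatim, the duplicates above are exact, and drop_duplicates alone would catch them. The real value of dedupe is in fuzzy matches. Here is a hedged sketch of injecting near-duplicates; the add_typos helper is hypothetical, not part of any library, and I use a tiny hand-built dataframe so the snippet stands alone:

```python
import random
import pandas as pd

random.seed(0)

def add_typos(s, n=2):
    """Randomly replace n characters to simulate data-entry errors."""
    chars = list(s)
    for _ in range(n):
        i = random.randrange(len(chars))
        chars[i] = random.choice('abcdefghijklmnopqrstuvwxyz')
    return ''.join(chars)

df = pd.DataFrame({'Name': ['Alice Smith', 'Bob Jones', 'Carol White'],
                   'Address': ['1 Main St', '2 Oak Ave', '3 Pine Rd']})

# make noisy copies of a couple of rows, then append them;
# these will no longer be exact duplicates
noisy = df.sample(n=2, random_state=0).copy()
noisy['Name'] = noisy['Name'].map(add_typos)
df_noisy = pd.concat([df, noisy], ignore_index=True)
```

Running dedupe_dataframe on df_noisy would exercise the fuzzy matching that drop_duplicates cannot do.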
Illustrate Dedupe
Dedupe will prompt you with pairs of records that it thinks are similar. You tell it which pairs are and are not duplicates so that the model can give better results.
dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])
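If you want to retrain from scratch rather than reuse previously saved labels, you can delete the cached training artifacts before calling dedupe_dataframe again. The filenames below are an assumption based on the pandas-dedupe README; the snippet creates stand-in files so it runs on its own:

```python
import os

# stand-ins for the cache files pandas_dedupe writes to the working
# directory (filenames assumed from the library's README)
cache_files = ('dedupe_dataframe_learned_settings',
               'dedupe_dataframe_training.json')
for f in cache_files:
    open(f, 'w').close()  # simulate previously saved training artifacts

# delete them so the next dedupe_dataframe call retrains from scratch
for f in cache_files:
    if os.path.exists(f):
        os.remove(f)
```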
Dedupe Output
Once the training and clustering process is complete, you are presented with a dataframe containing a cluster id and a confidence value. Records with the same cluster id are considered duplicates, and the confidence column gives a certainty score from 0 to 1.
dedupe_df.sort_values(['confidence', 'cluster id'], ascending=False)
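A common next step is to collapse each cluster down to one canonical record. A minimal sketch, assuming the 'cluster id' and 'confidence' column names shown above and using a toy frame shaped like pandas_dedupe's output:

```python
import pandas as pd

# toy frame shaped like pandas_dedupe's output (column names assumed)
dedupe_df = pd.DataFrame({
    'Name': ['Alice Smith', 'Alice Smyth', 'Bob Jones'],
    'Address': ['1 Main St', '1 Main St', '2 Oak Ave'],
    'cluster id': [0, 0, 1],
    'confidence': [0.95, 0.91, 1.0],
})

# keep the highest-confidence record from each cluster
canonical = (dedupe_df.sort_values('confidence', ascending=False)
                      .drop_duplicates('cluster id')
                      .sort_index())
```

Here the two "Alice" rows share cluster id 0, so only the higher-confidence one survives.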
Notice that during labeling I told dedupe that some records which were not true duplicates were similar. This can affect the end result of your clustering; however, for demonstration purposes it suffices.