dirty_cat.deduplicate(data, n_clusters=None, ngram_range=(2, 4), analyzer='char_wb', method='average')

Deduplicate data by hierarchically clustering similar strings.


Parameters

data

The data to be deduplicated.

n_clusters : Optional[int], optional, default=None

Number of clusters to use for hierarchical clustering. If None, the number of clusters that leads to the highest silhouette score is used.

ngram_range : Tuple[int, int], optional, default=(2, 4)

Range to use for computing n-gram distance.

analyzer : Literal["word", "char", "char_wb"], optional, default='char_wb'

Analyzer parameter for the CountVectorizer used to compute the string similarities. Options: {'word', 'char', 'char_wb'}, describing whether the similarities should be computed on word counts or character n-gram counts. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
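The padding behaviour of 'char_wb' can be illustrated with a small sketch. The helper below is not part of dirty_cat; it only mimics what the analyzer does for a single word:

```python
def char_wb_ngrams(word, n):
    """Character n-grams from inside word boundaries, as 'char_wb' does:
    the word is padded with one space on each side before slicing."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_wb_ngrams("cat", 2))  # [' c', 'ca', 'at', 't ']
```

The leading and trailing space n-grams are what distinguish 'char_wb' from plain 'char': they mark where a word starts and ends, which helps separate similar substrings occurring at different word positions.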

method : str, optional, default='average'

Linkage method parameter to use for merging clusters via scipy's linkage method. Options: {'single', 'complete', 'average', 'centroid', 'median', 'ward'}, describing different methods to calculate the distance between two clusters. Option 'average' calculates the distance between two clusters as the average distance between data points in the first and second cluster.
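As an illustration of the 'average' option, the sketch below (plain Python on 1-D points, not scipy's implementation) computes the average-linkage distance between two small clusters:

```python
from statistics import mean

def average_linkage(cluster_a, cluster_b):
    # Average distance over all pairs (a, b), one point from each cluster.
    return mean(abs(a - b) for a in cluster_a for b in cluster_b)

# Pairwise distances: |0-4|, |0-5|, |1-4|, |1-5| -> 4, 5, 3, 4
print(average_linkage([0.0, 1.0], [4.0, 5.0]))  # 4.0
```

In deduplicate, the pairwise distances come from the n-gram distance matrix rather than absolute differences, but the cluster-merging rule is the same.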


Returns

The deduplicated data, as a Series mapping each original value to its deduplicated spelling.

See also

GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.


Notes

Deduplication is done by first computing the n-gram distance between unique categories in the data, then performing hierarchical clustering on this distance matrix, and finally choosing the most frequent element in each cluster as the 'correct' spelling. This method works best if the true number of categories is significantly smaller than the number of observed spellings.
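The final step, picking the most frequent spelling in each cluster, can be sketched like this (illustrative only; the cluster assignments here are made up, not the output of the actual clustering):

```python
from collections import Counter

# Hypothetical clusters of observed spellings.
clusters = {0: ['black', 'black', 'blacn'], 1: ['white', 'hvite', 'white']}

corrections = {
    cluster_id: Counter(spellings).most_common(1)[0][0]  # most frequent spelling
    for cluster_id, spellings in clusters.items()
}
print(corrections)  # {0: 'black', 1: 'white'}
```

Because the majority spelling wins within each cluster, a rare misspelling like 'blacn' is mapped to the common form, which is why the method needs the true categories to be much less numerous than the observed spellings.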


Examples

>>> from dirty_cat.datasets import make_deduplication_data
>>> duplicated = make_deduplication_data(examples=['black', 'white'],
...                                      entries_per_example=[5, 5])
>>> duplicated
['blacn', 'black', 'black', 'black', 'black',
 'hvite', 'white', 'white', 'white', 'white']

To deduplicate the data, we can build a correspondence matrix:

>>> from dirty_cat import deduplicate
>>> deduplicate_correspondence = deduplicate(duplicated)
>>> deduplicate_correspondence
blacn    black
black    black
black    black
black    black
black    black
hvite    white
white    white
white    white
white    white
white    white
dtype: object

The translation table above is actually a series, giving the deduplicated values, and indexed by the original values. A deduplicated version of the initial list can easily be created:

>>> deduplicated = list(deduplicate_correspondence)
>>> deduplicated
['black', 'black', 'black', 'black', 'black',
 'white', 'white', 'white', 'white', 'white']

We have our dirty categories deduplicated.

Examples using dirty_cat.deduplicate

Deduplicating misspelled categories with deduplicate