dirty_cat.deduplicate

dirty_cat.deduplicate(data, n_clusters=None, ngram_range=(2, 4), analyzer='char_wb', method='average')[source]

Deduplicate categorical data by hierarchically clustering similar strings.

This works best when the data contain a limited number of underlying categories that appear with small variations and/or misspellings.

Parameters:
data : sequence of str

The data to be deduplicated.

n_clusters : int, optional

Number of clusters to use for hierarchical clustering. If None, use the number of clusters that leads to the highest silhouette score.

ngram_range : 2-tuple of int, default=(2, 4)

The lower and upper boundaries of the range of n-values for different n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.

analyzer : {'word', 'char', 'char_wb'}, default='char_wb'

Analyzer parameter for the CountVectorizer used to compute the string similarities. Describes whether the similarity should be computed from word counts or character n-gram counts. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

method : {'single', 'complete', 'average', 'centroid', 'median', 'ward'}, default='average'

Linkage method to use for merging clusters via scipy.cluster.hierarchy.linkage(). Option average calculates the distance between two clusters as the average distance between the data points in the first and second cluster (see the sketch below for an illustration of this and the analyzer parameter).
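For intuition, here is a minimal sketch of what these two parameters control, using scikit-learn's CountVectorizer and SciPy's linkage directly (this illustrates the concepts only, not dirty_cat's internal code):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from scipy.spatial.distance import pdist
>>> from scipy.cluster.hierarchy import linkage
>>> # analyzer='char_wb' extracts character n-grams (here of sizes 2 to 4)
>>> # only from text inside word boundaries, padding word edges with spaces
>>> vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
>>> counts = vectorizer.fit_transform(['black', 'blacn']).toarray()
>>> # method='average' merges clusters by the mean pairwise distance
>>> # between their members (UPGMA)
>>> Z = linkage(pdist(counts, metric='cosine'), method='average')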

Returns:
list of str

The deduplicated data.

See also

dirty_cat.GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

dirty_cat.MinHashEncoder

Encode string columns as a numeric array with the minhash method.

dirty_cat.SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

Notes

Deduplication is done by first computing the n-gram distance between unique categories in data, then performing hierarchical clustering on this distance matrix, and choosing the most frequent element in each cluster as the ‘correct’ spelling. This method works best if the true number of categories is significantly smaller than the number of observed spellings.
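The following sketch outlines this procedure with scikit-learn and SciPy; it is an approximation for illustration only (the actual implementation may differ in details such as the distance metric and the silhouette-based choice of n_clusters):

>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from scipy.spatial.distance import pdist
>>> from scipy.cluster.hierarchy import linkage, fcluster
>>> data = ['blacn', 'black', 'black', 'hvite', 'white', 'white']
>>> unique, counts = np.unique(data, return_counts=True)
>>> # n-gram representation of the unique categories
>>> X = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4)).fit_transform(unique).toarray()
>>> # hierarchical clustering on the pairwise n-gram distances
>>> Z = linkage(pdist(X, metric='cosine'), method='average')
>>> labels = fcluster(Z, t=2, criterion='maxclust')  # as if n_clusters=2
>>> # the most frequent spelling in each cluster is taken as 'correct'
>>> translation = {}
>>> for c in np.unique(labels):
...     members, freqs = unique[labels == c], counts[labels == c]
...     translation.update({m: members[np.argmax(freqs)] for m in members})
>>> deduplicated = [translation[d] for d in data]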

Examples

>>> from dirty_cat.datasets import make_deduplication_data
>>> duplicated = make_deduplication_data(examples=['black', 'white'],
                                         entries_per_example=[5, 5],
                                         prob_mistake_per_letter=0.3,
                                         random_state=42)
>>> duplicated
['blacn', 'black', 'black', 'black', 'black',
 'hvite', 'white', 'white', 'white', 'white']

To deduplicate the data, we can build a correspondence table:

>>> from dirty_cat import deduplicate
>>> deduplicate_correspondence = deduplicate(duplicated)
>>> deduplicate_correspondence
blacn    black
black    black
black    black
black    black
black    black
hvite    white
white    white
white    white
white    white
white    white
dtype: object

The translation table above is a pandas Series, giving the deduplicated values and indexed by the original values. A deduplicated version of the initial list can easily be created:

>>> deduplicated = list(deduplicate_correspondence)
>>> deduplicated
['black', 'black', 'black', 'black', 'black',
 'white', 'white', 'white', 'white', 'white']

We have our dirty categories deduplicated.
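Because the correspondence is indexed by the original values, it can also serve as a lookup table for other data containing the same spellings. A small usage sketch (duplicate index entries must be dropped first, since pandas.Series.map requires a unique index):

>>> import pandas as pd
>>> mapping = deduplicate_correspondence[~deduplicate_correspondence.index.duplicated()]
>>> pd.Series(['hvite', 'blacn']).map(mapping)
0    white
1    black
dtype: object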
