dirty_cat: machine learning with dirty categories
dirty_cat facilitates machine learning on non-curated categories: its encoders are robust to morphological variants, such as typos. See the examples, such as the first one, for an introduction to the problems of dirty categories and misspelled entities.
- TableVectorizer: a transformer automatically turning a pandas dataframe into a numpy array for machine learning, giving a default encoding pipeline you can tweak.
- GapEncoder: scalable and interpretable, where each encoding dimension corresponds to a topic that summarizes the substrings captured.
- SimilarityEncoder: a simple modification of one-hot encoding that captures similarities between strings.
- MinHashEncoder: very scalable.
- fuzzy_join(): approximate matching using morphological similarity.
- FeatureAugmenter: a scikit-learn transformer for joining multiple tables.
- deduplicate(): merging categories of similar morphology (spelling).
- Installing:
$ pip install --user --upgrade dirty_cat
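To make the idea behind similarity-based encoding concrete, here is a minimal standard-library sketch. It is not the library's implementation: the helper names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) are hypothetical, and the Jaccard overlap of character 3-grams is an assumed stand-in for whatever similarity the encoders actually compute. Where one-hot encoding gives a misspelled category an unrelated column, this kind of encoding gives it a vector close to the one of the correctly spelled category.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (assumed metric)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every reference category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical reference categories and two observed entries, one with a typo.
prototypes = ["police officer", "firefighter", "librarian"]
encoded = similarity_encode(["police oficer", "librarian"], prototypes)

for row in encoded:
    print([round(x, 2) for x in row])
```

The misspelled "police oficer" stays close to "police officer" and far from the other categories, while an exact match gets a similarity of 1 in its own column, as in one-hot encoding.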
Usage examples

Dirty categories: machine learning with non-normalized strings

Handling datetime features with the DatetimeEncoder

Fuzzy joining dirty tables and the FeatureAugmenter

Deduplicating misspelled categories with deduplicate
For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].
API documentation
Vectorizing a dataframe
TableVectorizer: Easily transform a heterogeneous array to a numerical one.
Dirty Category encoders
GapEncoder: Constructs latent topics with continuous encoding.
MinHashEncoder: Encode string categorical features as a numeric array, using the minhash method applied to the n-gram decomposition of the strings.
SimilarityEncoder: Encode string categorical features as a numeric array.
TargetEncoder: Encode categorical features as a numeric array given a target vector.
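The minhash idea mentioned above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the helper names, the seeded BLAKE2 hash, the 3-gram size, and the 64 components are all choices made for this example. For each seed, the encoding keeps the smallest hash value over the string's n-grams, so two strings sharing most of their n-grams agree on most components.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(gram, seed):
    """Deterministic 64-bit hash of an n-gram for a given seed (assumed hash family)."""
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_encode(text, n_components=64, n=3):
    """One component per seed: the minimum hash over all n-grams of the string."""
    grams = char_ngrams(text, n)
    if not grams:
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def agreement(a, b):
    """Fraction of components on which two encodings coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


clean = minhash_encode("police officer")
typo = minhash_encode("police oficer")
other = minhash_encode("firefighter")

print(agreement(clean, typo), agreement(clean, other))
```

The fraction of matching components estimates the n-gram overlap of the two strings, which is why the encoding tolerates typos without storing a vocabulary of categories.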
Other encoders
DatetimeEncoder: Transforms each datetime column into several numeric columns for temporal features (e.g. year, month, day...).
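As a rough picture of what splitting a datetime into numeric temporal features means, the sketch below uses only the standard library. The function name `encode_datetime`, the input format, and the particular features returned are assumptions made for this example; the encoder's own feature set and options may differ.

```python
from datetime import datetime


def encode_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Split one timestamp string into separate numeric features."""
    dt = datetime.strptime(value, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
    }


features = encode_datetime("2021-03-15 08:30:00")
print(features)
```

Each key would become its own numeric column, which lets a model pick up seasonal or weekly patterns that a raw timestamp hides.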
Joining tables
fuzzy_join: Join two tables on categorical string columns, based on approximate matching using morphological similarity.
FeatureAugmenter: Transformer augmenting the number of features in a table by joining multiple tables.
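Approximate matching of join keys can be illustrated as follows. This is a sketch under assumptions, not the signature or behaviour of the library's function: `fuzzy_match_keys` is a hypothetical helper, the similarity is a Jaccard overlap of character 3-grams, and the 0.2 threshold is arbitrary. Each key of the left table is paired with its most similar key in the right table, or with `None` when nothing is similar enough.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (assumed metric)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def fuzzy_match_keys(left_keys, right_keys, threshold=0.2):
    """Pair each left key with its closest right key, or None below the threshold."""
    pairs = []
    for key in left_keys:
        best = max(right_keys, key=lambda cand: similarity(key, cand), default=None)
        if best is None or similarity(key, best) < threshold:
            pairs.append((key, None))
        else:
            pairs.append((key, best))
    return pairs


# Hypothetical join keys from two tables, with typos on the left side.
left = ["France", "Germnay", "Untied States", "Brazil"]
right = ["france", "germany", "united states", "italy"]

pairs = fuzzy_match_keys(left, right)
print(pairs)
```

Once the keys are paired, the rows can be merged with an ordinary exact join on the matched key; the threshold controls how many doubtful matches are kept.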
Deduplication: merging variants of the same entry
deduplicate: Deduplicate data by hierarchically clustering similar strings.
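The clustering idea can be sketched with the standard library. Several things here are assumptions rather than the library's method: the helper name `deduplicate_sketch`, the use of `difflib.SequenceMatcher` as the string similarity, single-linkage clustering cut at a fixed threshold (implemented as connected components with a union-find), and the choice of the most frequent spelling as the representative of each cluster.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_sketch(values, threshold=0.8):
    """Replace each string by the most frequent spelling in its cluster."""
    unique = sorted(set(values))
    parent = {u: u for u in unique}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single linkage cut at a threshold: merge any pair that is similar enough.
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)

    counts = Counter(values)
    canonical = {}
    for members in clusters.values():
        representative = max(members, key=lambda m: (counts[m], m))
        for m in members:
            canonical[m] = representative
    return [canonical[v] for v in values]


cities = ["london", "london", "londn", "paris", "pariss", "paris"]
cleaned = deduplicate_sketch(cities)
print(cleaned)
```

The pairwise comparison is quadratic in the number of distinct strings, so a sketch like this only suits columns with a moderate number of unique values.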
Data download and generation
Fetches the employee_salaries dataset (regression), available at https://openml.org/d/42125
Fetches the medical charge dataset (regression), available at https://openml.org/d/42720
Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805
Fetches the open payments dataset (classification), available at https://openml.org/d/42738
Fetches the road safety dataset (classification), available at https://openml.org/d/42803
Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132
Fetches the drug directory dataset (classification), available at https://openml.org/d/43044
Fetches a dataset of an indicator from the World Bank open data platform.
Download Wikipedia embeddings by type.
Returns the directory in which dirty_cat looks for data.
Duplicates examples with spelling mistakes.
About
dirty_cat is, for now, a repository for ideas coming out of a research project: little is known yet about the problems of dirty categories, and tradeoffs will emerge in the long run. We really need people to give feedback on successes and failures with the different techniques, and to point us to open datasets on which we can do more empirical work. dirty_cat received funding from project DirtyData (ANR-17-CE23-0018).