dirty_cat: machine learning on dirty categories

dirty_cat facilitates machine learning on non-curated categories: its encoders are robust to morphological variants, such as typos.

Automatic features from heterogeneous dataframes

SuperVectorizer: a transformer automatically turning a pandas dataframe into a numpy array for machine learning – a default encoding pipeline you can tweak.

An example

OneHotEncoder but for non-normalized categories
  • GapEncoder, scalable and interpretable, where each encoding dimension corresponds to a topic that summarizes substrings captured.

  • SimilarityEncoder, a simple modification of one-hot encoding to capture similarities between strings.

  • MinHashEncoder, very scalable
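To make the idea of a one-hot encoding that tolerates typos concrete, here is a minimal sketch in plain Python. It is not dirty_cat's implementation: the helper names (char_ngrams, ngram_similarity, similarity_encode), the choice of character 3-grams, and the Jaccard similarity are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (assumed metric)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, categories, n=3):
    """Encode each value as its similarity to every reference category.

    With exact matches only, this reduces to one-hot encoding.
    """
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


categories = ["police officer", "firefighter"]
encoded = similarity_encode(
    ["police officer", "polise officer", "firefighter"], categories
)
```

An exact match gets a 1.0 in its own column, as in one-hot encoding, while the misspelled "polise officer" still scores much closer to "police officer" than to "firefighter".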

For a detailed description of the problem of encoding dirty categorical data, see "Similarity encoding for learning with dirty categorical variables" [1] and "Encoding high-cardinality string categorical variables" [2].

Recent changes


$ pip install --user dirty_cat


API documentation

Vectorizing a dataframe


Easily transforms a heterogeneous data table (such as a dataframe) to a numerical array for machine learning.
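As a rough illustration of what turning a heterogeneous table into a numerical array involves, the sketch below dispatches on column type in plain Python. It is not the library's code: vectorize_table is a hypothetical helper, rows are assumed to be dictionaries with identical keys, and string columns are simply one-hot encoded here.

```python
def vectorize_table(rows):
    """Turn a list of dict rows into (feature_names, numeric_matrix)."""
    columns = list(rows[0])
    feature_names, encoders = [], []
    for col in columns:
        values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: pass the value through unchanged.
            feature_names.append(col)
            encoders.append(lambda row, col=col: [float(row[col])])
        else:
            # Non-numeric column: one-hot encode over the observed categories.
            cats = sorted({str(v) for v in values})
            feature_names.extend(f"{col}_{c}" for c in cats)
            encoders.append(
                lambda row, col=col, cats=cats: [
                    1.0 if str(row[col]) == c else 0.0 for c in cats
                ]
            )
    matrix = [[x for enc in encoders for x in enc(row)] for row in rows]
    return feature_names, matrix


table = [{"age": 30, "job": "nurse"}, {"age": 41, "job": "pilot"}]
names, matrix = vectorize_table(table)
```

Each column gets its own encoder, and the per-column outputs are concatenated into one numeric row per input row.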

Dirty Category encoders


This encoder can be understood as a continuous encoding on a set of latent categories estimated from the data.


Encode string categorical features as a numeric array, using the minhash method applied to the n-gram decomposition of each string.
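The minhash idea can be sketched in a few lines of plain Python: for each of several hash functions, keep the minimum hash value over the string's n-grams. Strings that share many n-grams then tend to agree on many components. This is an illustration only, not the library's implementation: the function names, the use of MD5 with a seed prefix as the hash family, and the 3-gram size are assumptions.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(gram, seed):
    """Deterministic 64-bit hash of an n-gram; the seed selects the hash function."""
    digest = hashlib.md5(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as the per-seed minimum hash over its n-grams."""
    grams = char_ngrams(text, n)
    if not grams:
        # Strings too short to yield any n-gram map to an all-zero vector.
        return [0] * n_components
    return [
        min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)
    ]


signature = minhash_encode("police officer")
```

The output size is fixed by n_components and does not depend on the number of distinct categories, which is what makes this kind of encoding scale.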


Encode string categorical features as a numeric array.


Encode categorical features as a numeric array given a target vector.
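A minimal sketch of encoding categories from a target vector, in plain Python: each category is replaced by the mean of the target over the training rows that carry it. The function names are hypothetical, and falling back to the global mean for unseen categories is an assumption. Practical implementations usually also shrink per-category means toward the global mean, which is omitted here.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Compute the mean target per category, plus the global mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def target_encode(categories, means, global_mean):
    """Replace each category by its mean target (global mean if unseen)."""
    return [means.get(c, global_mean) for c in categories]


means, global_mean = fit_target_means(["a", "a", "b"], [1.0, 3.0, 10.0])
encoded = target_encode(["a", "b", "c"], means, global_mean)
```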

Other encoders


This encoder transforms each datetime column into several numeric columns corresponding to temporal features, e.g. year, month, and day.
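The transformation described above can be sketched with the standard library alone. The helper name datetime_features and the choice of exactly three features (year, month, day) are assumptions for illustration; the encoder itself may extract more.

```python
from datetime import datetime


def datetime_features(values):
    """Expand each datetime into a [year, month, day] row of numeric features."""
    return [[d.year, d.month, d.day] for d in values]


rows = datetime_features([datetime(2021, 3, 14, 15, 9), datetime(1999, 12, 31)])
```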

Data download


Fetches the employee_salaries dataset, available at https://openml.org/d/42125


Fetches the medical charge dataset, available at https://openml.org/d/42720


Fetches the midwest survey dataset, available at https://openml.org/d/42805


Fetches the open payments dataset, available at https://openml.org/d/42738


Fetches the road safety dataset, available at https://openml.org/d/42803


Fetches the traffic violations dataset, available at https://openml.org/d/42132


Returns the directory in which dirty_cat looks for data.


dirty_cat is, for now, a repository for ideas coming out of a research project: little is known yet about the problems of dirty categories, and tradeoffs will emerge in the long run. We really need people to give feedback on successes and failures with the different techniques, and to point us to open datasets on which we can do more empirical work. dirty_cat received funding from project DirtyData (ANR-17-CE23-0018).


Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.


Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

See also

Many classic categorical encoding schemes are available here: https://contrib.scikit-learn.org/category_encoders/

Similarity encoding is also available in Spark ML: https://github.com/rakutentech/spark-dirty-cat