dirty_cat: machine learning on dirty categories

dirty_cat helps with machine learning on non-curated categories. It provides encoders that are robust to morphological variants, such as typos, in the category strings.

The SimilarityEncoder is a drop-in replacement for scikit-learn’s OneHotEncoder.
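To give an intuition for why similarity encoding tolerates typos, here is a minimal pure-Python sketch of the idea. It is not dirty_cat's implementation: the helper names (ngrams, ngram_similarity, similarity_encode), the choice of 3-grams, and the Jaccard formula are assumptions made for illustration. Where one-hot encoding would give a misspelled category an all-zero row, this encoding gives it a row close to that of the category it resembles.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with one space on each side."""
    padded = " " + text.lower() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, categories, n=3):
    """Encode each value as a row of similarities to the reference categories."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# Hypothetical example data: two clean categories and one typo.
categories = ["police officer", "firefighter"]
rows = similarity_encode(
    ["police officer", "polcie officer", "firefighter"], categories
)
for row in rows:
    print([round(x, 2) for x in row])
```

An exact match scores 1.0 against its own category, while the misspelled "polcie officer" still scores much higher against "police officer" than against "firefighter".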

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1].

Installing: $ pip install --user dirty_cat

Recent changes

Requires Python 3

API documentation


SimilarityEncoder Encode string categorical features as a numeric array.
TargetEncoder Encode categorical features as a numeric array given a target vector.
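As a rough illustration of what encoding categories "given a target vector" can mean, the sketch below replaces each category with the mean of the target observed for it. This is a simplified assumption, not the library's TargetEncoder, which may apply smoothing or other refinements. The function names (fit_target_means, target_encode) and the fallback to the global mean for unseen categories are hypothetical.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Compute the mean target per category, plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def target_encode(categories, means, global_mean):
    """Replace each category by its mean target; unseen categories get the global mean."""
    return [means.get(category, global_mean) for category in categories]


# Hypothetical training data: category "a" seen twice, "b" once.
means, global_mean = fit_target_means(["a", "a", "b"], [1.0, 3.0, 10.0])
encoded = target_encode(["a", "b", "c"], means, global_mean)
print(encoded)
```

Unlike similarity encoding, this scheme yields one number per category and needs the target at fit time.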

Data download

datasets.fetch_employee_salaries fetches the employee_salaries dataset
datasets.fetch_medical_charge fetches the medical charge dataset
datasets.fetch_midwest_survey fetches the midwest survey dataset
datasets.fetch_open_payments fetches the open payments dataset
datasets.fetch_road_safety fetches the road safety dataset
datasets.fetch_traffic_violations fetches the traffic violations dataset
datasets.get_data_dir Returns the directories in which dirty_cat looks for data.


For now, dirty_cat is a repository for developing ideas with high-quality implementations, a form of research project: little is known yet about the problems of dirty categories. We hope that tradeoffs will emerge in the long run and that they will enable us to build better software. We need feedback on successes and failures with the different techniques, and pointers to open datasets on which we can do more empirical work. We also welcome contributions within the scope of dirty categories.

See also

Many classic categorical encoding schemes are available here: http://contrib.scikit-learn.org/categorical-encoding/

Similarity encoding is also available in Spark ML: https://github.com/rakutentech/spark-dirty-cat

[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.