dirty_cat: machine learning on dirty categories

dirty_cat facilitates machine learning on non-curated categorical data: it provides encoders that are robust to morphological variants, such as typos.

Automatically ingest a heterogeneous dataframe

SuperVectorizer: a simple transformer that easily turns a non-normalized pandas dataframe into a numpy array suitable for machine learning.

An example

OneHotEncoder but for non-normalized categories
  • GapEncoder, scalable and interpretable, where each encoding dimension corresponds to a topic that summarizes the substrings it captures.

  • SimilarityEncoder, a simple modification of one-hot encoding that captures similarities between strings.

  • MinHashEncoder, a very scalable encoder based on hashing the ngram decomposition of strings.

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Recent changes

Installation

$ pip install --user dirty_cat


API documentation

Encoders / Vectorizers


GapEncoder: This encoder can be understood as a continuous encoding on a set of latent categories estimated from the data.
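A rough analogue of this idea, as an illustration only (not dirty_cat's actual implementation, which fits a Gamma-Poisson factorization): factorize a matrix of character-ngram counts with scikit-learn's NMF, so each latent dimension is a "topic" over substrings.

```python
# Sketch: continuous encoding on latent "topics" of substrings, via
# non-negative matrix factorization of character-ngram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

entries = ["Police Officer II", "Police Officer III",
           "Fire Fighter", "Firefighter/EMT"]

# Count character trigrams for each entry.
counts = CountVectorizer(analyzer="char_wb",
                         ngram_range=(3, 3)).fit_transform(entries)

# Factorize into 2 latent topics; each entry is encoded by its
# non-negative loadings on those topics.
encoding = NMF(n_components=2, init="nndsvda",
               max_iter=500).fit_transform(counts)
print(encoding.shape)  # (n_samples, n_topics)
```

Entries sharing substrings load on the same topics, giving an interpretable, continuous encoding.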


MinHashEncoder: Encode string categorical features as a numeric array, using the minhash method applied to the ngram decomposition of the strings.
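A pure-Python sketch of the minhash idea (not the library's code): hash every ngram of a string under several salted hash functions, keep each minimum, and use those minima as a fixed-size encoding; similar strings share ngrams, so they share many of their minima.

```python
# Sketch of minhash encoding of a string's ngram decomposition.
import hashlib

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def _hash(token, seed):
    """Deterministic integer hash of a token, salted by a seed."""
    h = hashlib.md5(f"{seed}:{token}".encode()).hexdigest()
    return int(h, 16)

def minhash_encode(s, dim=8, n=3):
    """Encode a string as the min hash of its ngrams under `dim` seeds."""
    grams = ngrams(s, n)
    return [min(_hash(g, seed) for g in grams) for seed in range(dim)]

# Strings with overlapping ngrams collide on many minima:
a = minhash_encode("Police Officer")
b = minhash_encode("Police Oficer")
c = minhash_encode("Fire Fighter")
assert sum(x == y for x, y in zip(a, b)) > sum(x == y for x, y in zip(a, c))
```

Because each string is encoded independently in one pass, the method scales to very large numbers of categories.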


SimilarityEncoder: Encode string categorical features as a numeric array.


TargetEncoder: Encode categorical features as a numeric array given a target vector.


SuperVectorizer: Easily transforms a heterogeneous data table (such as a dataframe) to a numerical array for machine learning.

Data download


fetch_employee_salaries: Fetches the employee_salaries dataset.


fetch_medical_charge: Fetches the medical charge dataset.


fetch_midwest_survey: Fetches the midwest survey dataset.


fetch_open_payments: Fetches the open payments dataset.


fetch_road_safety: Fetches the road safety dataset.


fetch_traffic_violations: Fetches the traffic violations dataset.


get_data_dir: Returns the directory in which dirty_cat looks for data.


dirty_cat is, for now, a repository for ideas coming out of a research project: little is yet known about the problems of dirty categories. Tradeoffs will emerge in the long run. We really need people giving feedback on successes and failures with the different techniques, and pointing us to open datasets on which we can do more empirical work.


[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning journal, Springer, 2018.


[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.

See also

Many classic categorical encoding schemes are available here: https://contrib.scikit-learn.org/category_encoders/

Similarity encoding is also available in Spark ML: https://github.com/rakutentech/spark-dirty-cat