dirty_cat: machine learning with dirty categories

dirty_cat facilitates machine learning on non-curated categories: its encoders are robust to morphological variants such as typos. For an introduction to the problems of dirty categories and misspelled entities, see the examples, starting with the first one.


Automatic features from heterogeneous dataframes

TableVectorizer: a transformer automatically turning a pandas dataframe into a numpy array for machine learning – a default encoding pipeline you can tweak.

An example

OneHotEncoder but for non-normalized categories
Joining tables on non-normalized categories
Deduplicating dirty categories

deduplicate(), merging categories of similar morphology (spelling).

An example

Installing:

$ pip install --user --upgrade dirty_cat

Usage examples

Dirty categories: machine learning with non normalized strings


Investigating and interpreting dirty categories


Handling datetime features with the DatetimeEncoder


Fuzzy joining dirty tables and the FeatureAugmenter


Deduplicating misspelled categories with deduplicate


Wikipedia embeddings to enrich the data


For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

API documentation

Vectorizing a dataframe

TableVectorizer

Easily transform a heterogeneous array to a numerical one.

Dirty Category encoders

GapEncoder

Constructs latent topics with continuous encoding.

MinHashEncoder

Encode string categorical features as a numeric array, using the minhash method applied to the n-gram decomposition of strings.
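
To illustrate the idea behind minhash on n-grams, here is a minimal standard-library sketch. It is not the MinHashEncoder implementation: the helper names (char_ngrams, seeded_hash, minhash_encode), the space padding, the trigram size, and the choice of BLAKE2 as the hash function are all assumptions made for this example.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string padded with spaces."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(ngram, seed):
    """Deterministic 64-bit hash of an n-gram; the seed selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as the minimum hash of its n-grams under several seeds.

    Hypothetical helper: strings sharing many n-grams tend to share many
    components, because the minimum is often reached on a common n-gram.
    """
    grams = char_ngrams(text, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


london = minhash_encode("london", n_components=64)
typo = minhash_encode("londn", n_components=64)
# Count the components on which the clean string and its typo agree.
shared = sum(x == y for x, y in zip(london, typo))
```

The encoding has a fixed width regardless of the number of distinct categories, which is what makes this family of methods attractive for high-cardinality columns.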

SimilarityEncoder

Encode string categorical features as a numeric array.
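
As a rough illustration of similarity-based encoding, the sketch below represents each string by its similarity to a list of reference categories. It assumes a Jaccard similarity over character trigrams; the function names are hypothetical and the library's exact similarity formula may differ.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string padded with spaces."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (an assumption)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, categories, n=3):
    """Encode each value as its vector of similarities to the reference categories."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


encoded = similarity_encode(["london"], ["london", "londn", "paris"])
print(encoded)  # -> [[1.0, 0.375, 0.0]]
```

Unlike one-hot encoding, a misspelled entry gets a non-zero value in the column of the category it resembles, so the typo is not treated as an unrelated category.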

TargetEncoder

Encode categorical features as a numeric array given a target vector.
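
The general idea of target encoding can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the TargetEncoder code: the function names and the smoothing parameter (which shrinks per-category means toward the global mean) are hypothetical.

```python
from collections import defaultdict


def fit_target_means(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of its targets.

    With smoothing > 0, rare categories are pulled toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in sums
    }
    return encoding, global_mean


def transform_target_means(categories, encoding, global_mean):
    """Encode categories; categories unseen during fitting get the global mean."""
    return [encoding.get(c, global_mean) for c in categories]


encoding, global_mean = fit_target_means(["a", "a", "b"], [1.0, 3.0, 10.0])
print(transform_target_means(["a", "b"], encoding, global_mean))  # -> [2.0, 10.0]
```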

Other encoders

DatetimeEncoder

Transforms each datetime column into several numeric columns for temporal features (e.g. year, month, day).
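
The kind of expansion described above can be sketched in a few lines. The function name and the inclusion of the hour are assumptions for illustration; the features actually produced by DatetimeEncoder depend on its parameters.

```python
from datetime import datetime


def encode_datetime(value):
    """Split one datetime into numeric temporal features (hypothetical helper)."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "hour": value.hour,
    }


features = encode_datetime(datetime(2023, 3, 14, 15, 9))
print(features)  # -> {'year': 2023, 'month': 3, 'day': 14, 'hour': 15}
```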

Joining tables

fuzzy_join

Join two tables on categorical string columns, using approximate matching based on morphological similarity.
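
A minimal sketch of approximate joining, using plain lists of dicts instead of dataframes. Everything here is an assumption made for illustration: the function names, the trigram Jaccard similarity, the lowercasing, and the threshold value. It does not reproduce the signature or the matching strategy of fuzzy_join.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string padded with spaces."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_join_sketch(left, right, left_on, right_on, threshold=0.3):
    """Left-join two lists of dicts on the most similar key above a threshold.

    Rows of `left` without a close enough match are kept unchanged.
    """
    joined = []
    for left_row in left:
        best_row, best_score = None, 0.0
        for right_row in right:
            score = ngram_similarity(
                left_row[left_on].lower(), right_row[right_on].lower()
            )
            if score > best_score:
                best_row, best_score = right_row, score
        merged = dict(left_row)
        if best_row is not None and best_score >= threshold:
            merged.update(best_row)  # right-hand values win on shared keys
        joined.append(merged)
    return joined


sales = [{"city": "Londn", "sales": 3}, {"city": "Tokyo", "sales": 5}]
countries = [{"name": "london", "country": "UK"}, {"name": "paris", "country": "FR"}]
result = fuzzy_join_sketch(sales, countries, left_on="city", right_on="name")
```

In this example "Londn" is matched to "london" despite the typo, while "Tokyo" has no sufficiently similar counterpart and is left unmatched.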

FeatureAugmenter

Transformer that augments the number of features in a table by joining multiple tables.

Deduplication: merging variants of the same entry

deduplicate

Deduplicate data by hierarchically clustering similar strings.
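
The sketch below shows one simple way to merge spelling variants with the standard library. It swaps in a simplification: strings are grouped by single-linkage merging (via union-find) whenever their trigram Jaccard similarity reaches a threshold, and each group is replaced by its most frequent member. The function names, the similarity measure, and the threshold are assumptions; the clustering used by deduplicate is not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string padded with spaces."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def deduplicate_sketch(values, threshold=0.3):
    """Replace each value by the most frequent member of its similarity cluster."""
    unique = sorted(set(values))
    parent = {u: u for u in unique}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Single linkage: merge two clusters as soon as one pair is similar enough.
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if ngram_similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    counts = Counter(values)
    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)

    canonical = {}
    for members in clusters.values():
        # Most frequent spelling wins; ties are broken alphabetically.
        representative = max(members, key=lambda m: (counts[m], m))
        for m in members:
            canonical[m] = representative
    return [canonical[v] for v in values]


cities = ["london", "london", "londn", "paris", "paris", "pariss"]
print(deduplicate_sketch(cities))
# -> ['london', 'london', 'london', 'paris', 'paris', 'paris']
```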

Data download and generation

datasets.fetch_employee_salaries

Fetches the employee_salaries dataset (regression), available at https://openml.org/d/42125

datasets.fetch_medical_charge

Fetches the medical charge dataset (regression), available at https://openml.org/d/42720

datasets.fetch_midwest_survey

Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805

datasets.fetch_open_payments

Fetches the open payments dataset (classification), available at https://openml.org/d/42738

datasets.fetch_road_safety

Fetches the road safety dataset (classification), available at https://openml.org/d/42803

datasets.fetch_traffic_violations

Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132

datasets.fetch_drug_directory

Fetches the drug directory dataset (classification), available at https://openml.org/d/43044

datasets.fetch_world_bank_indicator

Fetches a dataset of an indicator from the World Bank open data platform.

datasets.get_ken_embeddings

Download Wikipedia embeddings by type.

datasets.get_data_dir

Returns the directory in which dirty_cat looks for data.

datasets.make_deduplication_data

Duplicates examples with spelling mistakes.

About

dirty_cat is, for now, a repository for ideas coming out of a research project: little is known yet about the problems of dirty categories, and tradeoffs will emerge in the long run. We really need people to give feedback on successes and failures with the different techniques and to point us to open datasets on which we can do more empirical work. dirty_cat received funding from project DirtyData (ANR-17-CE23-0018).