dirty_cat: machine learning with dirty categories

dirty_cat facilitates machine learning on non-curated categories: its encoders are robust to morphological variants, such as typos. See the usage examples for an introduction to the problems posed by dirty categories and misspelled entities.

Automatic features from heterogeneous dataframes

TableVectorizer: a transformer to easily turn a pandas dataframe into a numpy array suitable for machine learning – a default encoding pipeline you can tweak.


OneHotEncoder but for non-normalized categories

GapEncoder, scalable and interpretable: each encoding dimension corresponds to a topic that summarizes the substrings it captures.
SimilarityEncoder, an enhanced one-hot encoder that captures the string similarities in the data.
MinHashEncoder, very scalable and suitable for big data.
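To make "string similarities" concrete, here is a minimal sketch of a similarity-based alternative to one-hot encoding, assuming a Jaccard similarity over character 3-grams. It is an illustration of the idea only, not the library's implementation; the helper names (`ngrams`, `ngram_similarity`, `similarity_encode`) are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0 to 1)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, categories, n=3):
    """Encode each value as its similarity to every reference category."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# Reference categories play the role of the one-hot columns.
categories = ["police officer", "firefighter", "teacher"]
encoded = similarity_encode(["police oficer", "teacher"], categories)
```

An exact match yields a 1 in its own column, as one-hot encoding would, while the misspelled "police oficer" still scores highest against "police officer" instead of becoming an unrelated category.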

Joining tables on non-normalized categories

fuzzy_join(), approximate matching using morphological similarity.
FeatureAugmenter, a transformer for joining multiple tables together.
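The idea of joining on approximately matching keys can be sketched with the standard library alone. This sketch swaps in `difflib.SequenceMatcher` as the similarity measure, which is an assumption made for self-containment and not what the library uses; the function name `fuzzy_left_join`, the `threshold` parameter, and the `right_` prefix are all hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_left_join(left, right, key, threshold=0.6):
    """Left-join two lists of dicts on approximately matching `key` values.

    Each left row is paired with the most similar right row. If the best
    similarity falls below `threshold`, no right-hand fields are added.
    """
    joined = []
    for left_row in left:
        best_row, best_score = None, 0.0
        for right_row in right:
            score = SequenceMatcher(
                None, left_row[key].lower(), right_row[key].lower()
            ).ratio()
            if score > best_score:
                best_row, best_score = right_row, score
        merged = dict(left_row)
        if best_row is not None and best_score >= threshold:
            # Prefix right-hand columns so they do not overwrite the left key.
            merged.update({f"right_{k}": v for k, v in best_row.items()})
        joined.append(merged)
    return joined


countries = [
    {"country": "Untied States"},
    {"country": "Frnace"},
    {"country": "Atlantis"},
]
capitals = [
    {"country": "United States", "capital": "Washington"},
    {"country": "France", "capital": "Paris"},
]
result = fuzzy_left_join(countries, capitals, key="country")
```

The two misspelled countries are matched to their capitals, while "Atlantis" has no sufficiently similar counterpart and keeps only its left-hand columns. The threshold controls that trade-off between recall and spurious matches.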

Deduplicating dirty categories

deduplicate(), merging categories of similar morphology (spelling).


Recent changes

Contributing

Installing:

    $ pip install --user --upgrade dirty_cat

Usage examples

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

API documentation

Vectorizing a dataframe

`TableVectorizer`	Automatically transform a heterogeneous dataframe to a numerical array.
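Turning a heterogeneous table into numbers amounts to routing each column to a suitable encoding. The sketch below illustrates one plausible routing rule based on value type and cardinality. The rule, the `cardinality_threshold` value, and the function name `choose_encoding` are assumptions for illustration; refer to the `TableVectorizer` reference for its actual behavior and defaults.

```python
from datetime import date
from numbers import Number


def choose_encoding(values, cardinality_threshold=40):
    """Pick an encoding strategy for one column from its values."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, Number) for v in present):
        return "passthrough"
    # datetime is a subclass of date, so this covers both.
    if present and all(isinstance(v, date) for v in present):
        return "datetime features"
    n_unique = len(set(present))
    if n_unique < cardinality_threshold:
        return "one-hot"
    # Many distinct strings: likely a dirty, high-cardinality category.
    return "dirty-category encoder"


# A toy table stored as a dict of columns.
table = {
    "salary": [52000.0, 61000.0, 58000.0],
    "hired": [date(2015, 3, 1), date(2018, 7, 15), date(2020, 1, 6)],
    "gender": ["F", "M", "F"],
    "job_title": ["Police Officer III", "Police Oficer II", "Firefighter"],
}
# A tiny threshold is used here only because the toy table has three rows.
plan = {
    name: choose_encoding(col, cardinality_threshold=3)
    for name, col in table.items()
}
```

Here `plan` maps each column name to the strategy chosen for it: numbers pass through, dates are expanded into features, the two-valued column is one-hot encoded, and the free-text column is sent to a dirty-category encoder.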

Dirty category encoders

`GapEncoder`	Constructs latent topics with continuous encoding.
`MinHashEncoder`	Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
`SimilarityEncoder`	Encode string categorical features to a similarity matrix.
`TargetEncoder`	Encode categorical features as a numeric array given a target vector.
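As an illustration of applying MinHash to n-gram decompositions, the sketch below encodes a string as the minimum hash value of its character 3-grams under several salted hash functions. It is a simplified stand-in rather than the library's implementation: the use of `zlib.crc32`, the salting scheme, and the names `ngrams` and `minhash_encode` are assumptions.

```python
import zlib


def ngrams(text, n=3):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as the minimum hash of its n-grams, one per salt."""
    grams = ngrams(text, n)
    if not grams:
        # Strings shorter than n have no n-grams; use a constant encoding.
        return [0] * n_components
    encoding = []
    for salt in range(n_components):
        # Salting the input yields a different deterministic hash per component.
        hashes = (zlib.crc32(f"{salt}:{gram}".encode("utf-8")) for gram in grams)
        encoding.append(min(hashes))
    return encoding


short = minhash_encode("london")
longer = minhash_encode("london uk")
```

Because every 3-gram of "london" also occurs in "london uk", each component of `longer` is less than or equal to the matching component of `short`. More generally, strings that share many n-grams tend to agree on many components, and the encoding needs no fitted vocabulary, which is what makes the approach scale.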

Other encoders

`DatetimeEncoder`	Transforms each datetime column into several numeric columns for temporal features (e.g. year, month, day).
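The transformation from one datetime column to several numeric columns can be sketched as follows. The choice of features (year, month, day, hour) and the function name `datetime_features` are assumptions for illustration; the encoder's actual output columns may differ.

```python
from datetime import datetime


def datetime_features(timestamps):
    """Expand each datetime into a row of numeric features."""
    rows = []
    for ts in timestamps:
        rows.append([ts.year, ts.month, ts.day, ts.hour])
    return rows


features = datetime_features(
    [datetime(2021, 3, 14, 9, 30), datetime(2022, 12, 1, 18, 0)]
)
# features == [[2021, 3, 14, 9], [2022, 12, 1, 18]]
```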

Joining tables

`fuzzy_join`	Join two tables with categorical columns based on approximate matching of morphological similarity.
`FeatureAugmenter`	Augment a main table by automatically joining multiple auxiliary tables on it.

Deduplication: merging variants of the same entry

`deduplicate`	Deduplicate categorical data by hierarchically clustering similar strings.
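A minimal sketch of deduplication by clustering, under stated assumptions: strings are compared with a Jaccard distance over character 3-grams, clusters are formed by single-linkage clustering cut at a fixed distance (computed here as connected components with union-find), and each string is replaced by the most frequent member of its cluster. The library's distance, linkage method, and cut-off selection may differ; `deduplicate_sketch` and `max_distance` are hypothetical names.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def distance(a, b):
    """Jaccard distance between n-gram sets (0 = identical, 1 = disjoint)."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    union = grams_a | grams_b
    return 1.0 - len(grams_a & grams_b) / len(union) if union else 0.0


def deduplicate_sketch(values, max_distance=0.6):
    """Replace each value by the most frequent spelling in its cluster."""
    unique = sorted(set(values))
    parent = {u: u for u in unique}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of strings closer than the cut-off (single linkage).
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if distance(a, b) <= max_distance:
                parent[find(a)] = find(b)

    counts = Counter(values)
    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)

    canonical = {}
    for members in clusters.values():
        best = max(members, key=lambda m: counts[m])
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


raw = [
    "police officer", "police officer", "police oficer",
    "firefighter", "firefigther", "firefighter",
]
clean = deduplicate_sketch(raw)
```

After the call, both misspellings are mapped to the dominant spelling of their cluster, so `clean` contains only "police officer" and "firefighter". The pairwise loop is quadratic in the number of distinct strings, which is acceptable for a sketch but not for large vocabularies.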

Data download and generation

`datasets.fetch_employee_salaries`	Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125
`datasets.fetch_medical_charge`	Fetches the medical charge dataset (regression), available at https://openml.org/d/42720
`datasets.fetch_midwest_survey`	Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805
`datasets.fetch_open_payments`	Fetches the open payments dataset (classification), available at https://openml.org/d/42738
`datasets.fetch_road_safety`	Fetches the road safety dataset (classification), available at https://openml.org/d/42803
`datasets.fetch_traffic_violations`	Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132
`datasets.fetch_drug_directory`	Fetches the drug directory dataset (classification), available at https://openml.org/d/43044
`datasets.fetch_world_bank_indicator`	Fetches a dataset of an indicator from the World Bank open data platform.
`datasets.get_ken_table_aliases`	Get the supported aliases of embedded KEN entities tables.
`datasets.get_ken_types`	Helper function to search for KEN entity types.
`datasets.get_ken_embeddings`	Download Wikipedia embeddings by type.
`datasets.get_data_dir`	Returns the directory in which dirty_cat looks for data.
`datasets.make_deduplication_data`	Duplicates examples with spelling mistakes.

About

dirty_cat is a young project born from research. We really need people to give feedback on successes and failures with the different techniques on real-world data, and to point us to open datasets on which we can do more empirical work. dirty_cat received funding from project DirtyData (ANR-17-CE23-0018).

dirty_cat

Version 0.4.1
