=================================================
dirty_cat: machine learning with dirty categories
=================================================

.. toctree::
   :maxdepth: 2

.. currentmodule:: dirty_cat

.. container:: larger-container

    `dirty_cat` facilitates machine learning on non-curated categories: it is
    **robust to morphological variants**, such as typos. See
    :ref:`examples `, such as `the first one `_,
    for an introduction to problems of dirty categories or misspelled entities.

|

Automatic features from heterogeneous dataframes

:class:`TableVectorizer`: a transformer **automatically turning a pandas
dataframe into a numpy array** for machine learning -- a default encoding
pipeline you can tweak.

.. rst-class:: centered

    :ref:`An example `
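
To illustrate what "automatically turning a dataframe into a numeric array" means, here is a minimal standard-library sketch. It is not the :class:`TableVectorizer` implementation: the ``vectorize_table`` helper and the toy rows are hypothetical, and it assumes only two column kinds (numeric columns passed through, all other columns one-hot encoded), leaving out the handling of high-cardinality columns.

```python
def vectorize_table(rows):
    """Turn a list of dict rows into a numeric matrix and its column names."""
    names, matrix = [], [[] for _ in rows]
    for col in rows[0]:
        values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: pass the values through unchanged.
            names.append(col)
            for out, v in zip(matrix, values):
                out.append(float(v))
        else:
            # Non-numeric column: one-hot encode its distinct values.
            levels = sorted({str(v) for v in values})
            names.extend(f"{col}_{level}" for level in levels)
            for out, v in zip(matrix, values):
                out.extend(1.0 if str(v) == level else 0.0 for level in levels)
    return names, matrix


# Toy data, for illustration only.
rows = [
    {"age": 31, "department": "police"},
    {"age": 45, "department": "fire"},
]
names, matrix = vectorize_table(rows)
```

Each column is dispatched to an encoder according to its content, which is the general idea behind a default encoding pipeline that can later be tweaked per column.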

OneHotEncoder but for non-normalized categories

* :class:`GapEncoder`, scalable and interpretable, where each encoding
  dimension corresponds to a topic that summarizes the substrings it captures.
  :ref:`Example `

* :class:`SimilarityEncoder`, a simple modification of one-hot encoding
  that captures similarities between strings. :ref:`Example `

* :class:`MinHashEncoder`, very scalable. :ref:`Example `
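
The idea of a one-hot encoding relaxed by string similarity can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, and the similarity used here is the Jaccard overlap of character 3-grams.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, categories, n=3):
    """Encode each value by its similarity to every reference category."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# Toy data: the second value is a misspelling of the first category.
categories = ["police officer", "fire fighter"]
values = ["police officer", "polce officer", "fire fighter"]
encoded = similarity_encode(values, categories)
```

An exact match gets a 1.0 in its own column, as in one-hot encoding, while a misspelled value still lands close to the category it resembles instead of becoming a new, unrelated column.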

Joining tables on non-normalized categories

* :func:`fuzzy_join`, approximate matching using morphological similarity.
  :ref:`Example `

* :class:`FeatureAugmenter`, a scikit-learn transformer for joining multiple
  tables. :ref:`Example `
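
Approximate joining can be illustrated without any dependency. The sketch below is not :func:`fuzzy_join` itself: it swaps in ``difflib.SequenceMatcher`` from the standard library as the similarity measure, and the ``fuzzy_left_join`` helper, its ``threshold`` parameter, and the toy tables are all hypothetical.

```python
from difflib import SequenceMatcher


def best_match(key, candidates):
    """Return (candidate, score) for the candidate most similar to key."""
    scored = [
        (c, SequenceMatcher(None, key.lower(), c.lower()).ratio())
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


def fuzzy_left_join(left, right, left_key, right_key, threshold=0.6):
    """Left-join two lists of dicts on approximately equal string keys."""
    right_by_key = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match, score = best_match(row[left_key], list(right_by_key))
        # Keep the right-hand columns only when the match is close enough.
        extra = right_by_key[match] if score >= threshold else {}
        joined.append({**row, **extra})
    return joined


# Toy tables: "Untied States" is a typo, "Atlantis" has no counterpart.
left = [{"id": 1, "country": "Untied States"}, {"id": 2, "country": "Atlantis"}]
right = [
    {"name": "United States", "capital": "Washington"},
    {"name": "France", "capital": "Paris"},
]
result = fuzzy_left_join(left, right, "country", "name")
```

The threshold decides how dissimilar two keys may be before a row is left unmatched. Choosing it is a tradeoff between missed matches and spurious ones.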

Deduplicating dirty categories

:func:`deduplicate`, merging categories of similar morphology (spelling).

.. rst-class:: centered

    :ref:`An example `
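
Merging spelling variants can be sketched as follows. This is a simplified greedy strategy written for illustration, not the algorithm used by :func:`deduplicate`: it assumes that the most frequent spelling is the correct one, and the ``deduplicate_sketch`` helper and its ``threshold`` parameter are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_sketch(values, threshold=0.8):
    """Map each value to the most frequent sufficiently similar spelling."""
    counts = Counter(values)
    canonical = []  # representative spellings, most frequent first
    mapping = {}
    for value, _ in counts.most_common():
        for rep in canonical:
            if SequenceMatcher(None, value, rep).ratio() >= threshold:
                # Close to a more frequent spelling: merge into it.
                mapping[value] = rep
                break
        else:
            # No similar representative yet: this spelling becomes one.
            canonical.append(value)
            mapping[value] = value
    return [mapping[v] for v in values]


# Toy data with two misspellings.
raw = ["london", "london", "londn", "paris", "pariss", "paris", "london"]
clean = deduplicate_sketch(raw)
# clean == ["london", "london", "london", "paris", "paris", "paris", "london"]
```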

.. container:: right-align

    `Recent changes `_

    `Contributing `_

.. container:: install_instructions

    :Installing: ``$ pip install --user --upgrade dirty_cat``

.. _usage_examples:

Usage examples
==============

.. container:: larger-container

    .. include:: auto_examples/index.rst
        :start-line: 5
        :end-before: .. rst-class:: sphx-glr-signature

|


For a detailed description of the problem of encoding dirty categorical data,
see
`Similarity encoding for learning with dirty categorical variables `_ [1]_
and
`Encoding high-cardinality string categorical variables `_ [2]_.

API documentation
=================

Vectorizing a dataframe
-----------------------

.. autosummary::
   :toctree: generated/
   :template: class.rst
   :nosignatures:

   TableVectorizer

Dirty Category encoders
-----------------------

.. autosummary::
   :toctree: generated/
   :template: class.rst
   :nosignatures:

   GapEncoder
   MinHashEncoder
   SimilarityEncoder
   TargetEncoder

Other encoders
--------------

.. autosummary::
   :toctree: generated/
   :template: class.rst
   :nosignatures:

   DatetimeEncoder

Joining tables
--------------

.. autosummary::
   :toctree: generated/
   :template: function.rst
   :nosignatures:

   fuzzy_join

.. autosummary::
   :toctree: generated/
   :template: class.rst
   :nosignatures:

   FeatureAugmenter

Deduplication: merging variants of the same entry
-------------------------------------------------

.. autosummary::
   :toctree: generated/
   :template: function.rst
   :nosignatures:

   deduplicate

Data download and generation
----------------------------

.. autosummary::
   :toctree: generated/
   :template: function.rst
   :nosignatures:

   datasets.fetch_employee_salaries
   datasets.fetch_medical_charge
   datasets.fetch_midwest_survey
   datasets.fetch_open_payments
   datasets.fetch_road_safety
   datasets.fetch_traffic_violations
   datasets.fetch_drug_directory
   datasets.fetch_world_bank_indicator
   datasets.get_ken_embeddings
   datasets.get_data_dir
   datasets.make_deduplication_data

About
=====

dirty_cat is for now a repository for ideas coming out of a research
project: little is known yet about the problems of dirty categories.
Tradeoffs will emerge in the long run. We really need people giving feedback
on successes and failures with the different techniques, and pointing us to
open datasets on which we can do more empirical work.
dirty_cat received funding from `project DirtyData `_ (ANR-17-CE23-0018).

.. [1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for
   learning with dirty categorical variables. 2018. Machine Learning journal,
   Springer.
.. [2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string
   categorical variables. 2020. IEEE Transactions on Knowledge & Data
   Engineering.

Related projects
================

- `scikit-learn `_ - a very popular machine learning library; dirty_cat
  inherits its API
- `categorical-encoding `_ - scikit-learn compatible classic categorical
  encoding schemes
- `spark-dirty-cat `_ - a Scala implementation of dirty_cat for Spark ML
- `CleverCSV `_ - a package for dealing with dirty csv files
- `GAMA `_ - a modular AutoML assistant that uses dirty_cat as part of its
  search space