Release 0.4.1

Major changes

Minor changes

  • Improved date column detection and date format inference in TableVectorizer. The format inference now searches for a format that works for all non-missing values of the column, instead of relying on pandas behavior. If no such format exists, the column is not cast to a date column. #543 by Leo Grinsztajn
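
The inference logic can be sketched in plain Python (a simplified illustration, not the actual TableVectorizer code; the candidate formats below are hypothetical):

```python
from datetime import datetime

def infer_format(values, candidates=("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")):
    """Return the first candidate format that parses every non-missing value, or None."""
    for fmt in candidates:
        try:
            for value in values:
                if value is not None:
                    datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format fails on at least one value; try the next
        else:
            return fmt
    return None
```

Because the chosen format must parse every non-missing value, a column containing "13/02/2021" cannot be matched by a month-first format, and a column with no working format is left unchanged.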

Release 0.4.0

Major changes

Minor changes

  • Add example Wikipedia embeddings to enrich the data. #487 by Jovan Stojanovic

    • datasets.fetching: contains a new function get_ken_embeddings() that can be used to download Wikipedia embeddings and filter them by type.

  • datasets.fetching: contains a new function fetch_world_bank_indicator() that can be used to download indicators from the World Bank Open Data platform. #291 by Jovan Stojanovic

  • Removed example Fitting scalable, non-linear models on data with dirty categories. #386 by Jovan Stojanovic

  • MinHashEncoder’s minhash() method is no longer public. #379 by Jovan Stojanovic

  • Fetching functions now have an additional argument directory, which can be used to specify where to save and load from datasets. #432 and #453 by Lilian Boulard

  • The TableVectorizer’s default OneHotEncoder for low-cardinality categorical variables now defaults to handle_unknown="ignore" instead of handle_unknown="error" (for scikit-learn >= 1.0.0). This means that categories seen only at test time will be encoded as a vector of zeros instead of raising an error. #473 by Leo Grinsztajn
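
The new behavior corresponds to scikit-learn's own handle_unknown option; for instance:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["cat"], ["dog"]])

# A category unseen during fit is encoded as a vector of zeros
# instead of raising an error.
row = enc.transform([["bird"]]).toarray()
```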

Bug fixes

Release 0.3.0

Major changes

  • New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, …). It is now the default transformer used in the TableVectorizer for datetime columns. #239 by Leo Grinsztajn
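
The kind of decomposition DatetimeEncoder performs can be sketched with plain pandas (an illustration of the extracted features, not the encoder's actual implementation):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2021-03-05 14:30:00", "2022-07-19 08:15:00"]))

# One numerical column per datetime component.
features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "hour": dates.dt.hour,
})
```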

  • The TableVectorizer has seen some major improvements and bug fixes:

    • Fixed the automatic casting logic in transform.

    • To avoid dimensionality explosion when a feature has two unique values, the default encoder (OneHotEncoder) now drops one of the two vectors (see parameter drop="if_binary").

    • fit_transform and transform can now return unencoded features, matching the ColumnTransformer’s behavior. Previously, a RuntimeError was raised.

    #300 by Lilian Boulard
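
The drop="if_binary" behavior can be illustrated with scikit-learn directly:

```python
from sklearn.preprocessing import OneHotEncoder

# A binary feature: with drop="if_binary", one of the two
# one-hot columns is dropped, leaving a single column.
enc = OneHotEncoder(drop="if_binary")
X = [["yes"], ["no"], ["yes"]]
out = enc.fit_transform(X).toarray()
```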

  • Backward-incompatible change in the TableVectorizer: to apply the remainder transformer to features (with the *_transformer parameters), the value 'remainder' must now be passed, instead of None as in previous versions. None now indicates that the default transformer should be used. #303 by Lilian Boulard

  • Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. #289 by Lilian Boulard

  • Bumped minimum dependencies:

  • Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.

Notes

Release 0.2.2

Bug fixes

Release 0.2.1

Major changes

Bug fixes

Notes

Release 0.2.0

Also see pre-release 0.2.0a1 below for additional changes.

Major changes

  • Bump minimum dependencies:

  • datasets.fetching - backward-incompatible changes to the example datasets fetchers:

    • The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.

    • The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.

    • The example notebooks were updated to reflect these changes. #155 by Lilian Boulard

  • Backward-incompatible change to MinHashEncoder: The MinHashEncoder now only supports two-dimensional inputs of shape (n_samples, 1). #185 by Lilian Boulard and Alexis Cvetkov.
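
Callers that previously passed a one-dimensional array now need to reshape it first, e.g.:

```python
import numpy as np

# A 1D array of string categories must now be reshaped
# to (n_samples, 1) before being passed to MinHashEncoder.
names = np.array(["london", "paris", "london"])
X = names.reshape(-1, 1)
```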

  • Update handle_missing parameters:

    • GapEncoder: the default value "zero_impute" becomes "empty_impute" (see doc).

    • MinHashEncoder: the default value "" becomes "zero_impute" (see doc).

    #210 by Alexis Cvetkov.

  • Add a method get_feature_names_out for the GapEncoder and the TableVectorizer, since get_feature_names will be deprecated in scikit-learn 1.2. #216 by Alexis Cvetkov

Notes

  • Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.

  • Improvements to the TableVectorizer

    • Missing values are not systematically imputed anymore

    • Type casting and per-column imputation are now learnt during fitting

    • Several bugfixes

    #201 by Lilian Boulard

Release 0.2.0a1

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:

pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository:

pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes

  • Bump minimum dependencies:

    • Python (>= 3.6)

    • NumPy (>= 1.16)

    • SciPy (>= 1.2)

    • scikit-learn (>= 0.20.0)

  • TableVectorizer: Added automatic transform through the TableVectorizer class. It transforms columns automatically based on their type, providing a replacement for scikit-learn’s ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames. #167 by Lilian Boulard

  • Backward incompatible change to GapEncoder: The GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix. #185 by Lilian Boulard and Alexis Cvetkov.

Bug fixes

Release 0.1.1

Major changes

Bug fixes

Release 0.1.0

Major changes

  • GapEncoder: Added online Gamma-Poisson factorization through the GapEncoder class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. #153 by Alexis Cvetkov

Bug fixes

Release 0.0.7

  • MinHashEncoder: Added minhash_encoder.py and fast_hash.py files that implement minhash encoding through the MinHashEncoder class. This method allows for fast and scalable encoding of string categorical variables.

  • datasets.fetch_employee_salaries: changed the download source for employee_salaries.

    • The function now returns a Bunch with a dataframe under the field "data", instead of the path to the CSV file.

    • The field "description" has been renamed to "DESCR".

  • SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.

  • SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.

  • TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.

  • MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.

Release 0.0.6

  • SimilarityEncoder: Accelerate SimilarityEncoder.transform by:

    • computing the vocabulary count vectors in fit instead of transform

    • computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the SimilarityEncoder.
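
The parallelization pattern can be sketched with joblib (using a toy Jaccard n-gram similarity for illustration, not dirty_cat's actual metric):

```python
from joblib import Parallel, delayed

def ngram_similarity(a, b, n=3):
    """Toy character n-gram Jaccard similarity (illustrative only)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

prototypes = ["london", "paris"]
queries = ["londonn", "pariss"]

def row(query):
    return [ngram_similarity(query, p) for p in prototypes]

# One row of the similarity matrix per query, computed in parallel;
# n_jobs plays the role of SimilarityEncoder's n_jobs attribute.
matrix = Parallel(n_jobs=2, prefer="threads")(delayed(row)(q) for q in queries)
```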

  • SimilarityEncoder: Fix a bug that prevented a SimilarityEncoder from being created when categories was a list.

  • SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.

Release 0.0.5

  • SimilarityEncoder: Change the default ngram range to (2, 4), which performs better empirically.

  • SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.

  • SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.

  • SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.

  • SimilarityEncoder: Performance improvements in the ngram similarity.

  • SimilarityEncoder: Expose a get_feature_names method.