Release 0.4.1¶
Major changes¶
- fuzzy_join() and FeatureAugmenter can now join on numerical columns based on the euclidean distance. #530 by Jovan Stojanovic
- fuzzy_join() and FeatureAugmenter can perform many-to-many joins on lists of numerical or string key columns. #530 by Jovan Stojanovic
- GapEncoder.transform() no longer continues fitting the instance. This makes functions that depend on it (get_feature_names_out(), score(), etc.) deterministic once fitted. #548 by Lilian Boulard
- fuzzy_join() and FeatureAugmenter now perform joins on missing values, as in pandas.merge, but raise a warning. #522 and #529 by Jovan Stojanovic
- Added get_ken_table_aliases() and get_ken_types() for exploring KEN embeddings. #539 by Lilian Boulard
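The numerical fuzzy_join() above boils down to a nearest-neighbor lookup in the key columns' space. The sketch below reproduces the idea with plain pandas and scikit-learn; the data and column names are invented for illustration, and this is not dirty_cat's implementation:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

left = pd.DataFrame({"city": ["A", "B"], "lat": [48.85, 45.76], "lon": [2.35, 4.83]})
right = pd.DataFrame({"lat": [48.86, 45.75], "lon": [2.34, 4.84], "pop": [2.1, 0.5]})

# For each left row, find the closest right row in (lat, lon) space.
nn = NearestNeighbors(n_neighbors=1).fit(right[["lat", "lon"]])
dist, idx = nn.kneighbors(left[["lat", "lon"]])

# Attach the matched right-hand columns to the left table.
joined = left.join(right.iloc[idx.ravel()].reset_index(drop=True)[["pop"]])
```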
Minor changes¶
- Improved date column detection and date format inference in the TableVectorizer. The format inference now looks for a format that works for all non-missing values of the column, instead of relying on pandas' behavior. If no such format exists, the column is not cast to a date column. #543 by Leo Grinsztajn
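The inference strategy can be sketched in a few lines: a candidate format is kept only if it parses every non-missing value of the column. The candidate list and data below are illustrative, not the actual TableVectorizer code:

```python
from datetime import datetime

values = ["2021-01-15", "2021-03-02", None, "2021-12-31"]
candidates = ["%d/%m/%Y", "%Y-%m-%d", "%m-%d-%Y"]

def works_for_all(fmt, vals):
    # A format is acceptable only if it parses every non-missing value.
    for v in vals:
        if v is None:
            continue
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            return False
    return True

# First candidate that parses everything; None means "do not cast".
inferred = next((f for f in candidates if works_for_all(f, values)), None)
```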
Release 0.4.0¶
Major changes¶
- SuperVectorizer is renamed TableVectorizer; a warning is raised when using the old name. #484 by Jovan Stojanovic
- New experimental feature: joining tables using fuzzy_join() by approximate key matching. Matches are based on string similarities, and the nearest-neighbor match is found for each category. #291 by Jovan Stojanovic and Leo Grinsztajn
- New experimental feature: FeatureAugmenter, a transformer that augments the number of features in a main table by using fuzzy_join() to bring in information from auxiliary tables. #409 by Jovan Stojanovic
- Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore should not be imported in your code. #331 by Lilian Boulard
- The MinHashEncoder now supports an n_jobs parameter to parallelize the hash computations. #267 by Leo Grinsztajn and Lilian Boulard
- New experimental feature: deduplicating misspelled categories using deduplicate(), by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. #339 by Moritz Boos
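The approximate key matching behind fuzzy_join() can be sketched with standard scikit-learn tools: represent each key by its character n-grams and take the nearest neighbor. This is a conceptual stand-in with made-up data, not the library's implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

left_keys = ["Germny", "Franc"]            # misspelled keys to match
right_keys = ["Germany", "France", "Italy"]

# Vectorize keys as character n-grams, then match each left key
# to its nearest right key.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(right_keys)
nn = NearestNeighbors(n_neighbors=1).fit(vec.transform(right_keys))
_, idx = nn.kneighbors(vec.transform(left_keys))
matches = [right_keys[i] for i in idx.ravel()]
```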
Minor changes¶
- Added an example of using Wikipedia embeddings to enrich the data. #487 by Jovan Stojanovic
- datasets.fetching: contains a new function get_ken_embeddings() that can be used to download Wikipedia embeddings and filter them by type.
- datasets.fetching: contains a new function fetch_world_bank_indicator() that can be used to download indicators from the World Bank Open Data platform. #291 by Jovan Stojanovic
- Removed the example "Fitting scalable, non-linear models on data with dirty categories". #386 by Jovan Stojanovic
- MinHashEncoder's minhash() method is no longer public. #379 by Jovan Stojanovic
- Fetching functions now have an additional argument directory, which can be used to specify where to save and load datasets from. #432 and #453 by Lilian Boulard
- The TableVectorizer's default OneHotEncoder for low-cardinality categorical variables now defaults to handle_unknown="ignore" instead of handle_unknown="error" (for scikit-learn >= 1.0.0). This means that categories seen only at test time will be encoded as a vector of zeros instead of raising an error. #473 by Leo Grinsztajn
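The new handle_unknown="ignore" behavior can be seen directly with a standalone scikit-learn OneHotEncoder:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories.
enc = OneHotEncoder(handle_unknown="ignore").fit([["cat"], ["dog"]])

# "bird" was never seen during fit: instead of raising, it is
# encoded as an all-zero row.
out = enc.transform([["cat"], ["bird"]]).toarray()
```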
Bug fixes¶
- The MinHashEncoder now considers None and empty strings as missing values, rather than raising an error. #378 by Gael Varoquaux
Release 0.3.0¶
Major changes¶
- New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, …). It is now the default transformer used in the TableVectorizer for datetime columns. #239 by Leo Grinsztajn
- The TableVectorizer has seen some major improvements and bug fixes:
  - Fixed the automatic casting logic in transform.
  - To avoid dimensionality explosion when a feature has two unique values, the default encoder (OneHotEncoder) now drops one of the two vectors (see parameter drop="if_binary").
  - fit_transform and transform can now return unencoded features, matching the ColumnTransformer's behavior. Previously, a RuntimeError was raised.
- Backward-incompatible change in the TableVectorizer: to apply remainder to features (with the *_transformer parameters), the value 'remainder' must now be passed, instead of None as in previous versions. None now indicates that the default transformer should be used. #303 by Lilian Boulard
- Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. #289 by Lilian Boulard
- Bumped minimum dependencies:
  - scikit-learn >= 0.23
  - scipy >= 1.4.0
  - numpy >= 1.17.3
  - pandas >= 1.2.0
  #299 and #300 by Lilian Boulard
- Dropped support for the Jaro, Jaro-Winkler and Levenshtein distances. The SimilarityEncoder now exclusively uses ngram similarity, and the similarity parameter is deprecated. It will be removed in 0.5. #282 by Lilian Boulard
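For reference, an n-gram similarity can be sketched as the Jaccard overlap of character 3-grams; this is a simplified stand-in, not necessarily dirty_cat's exact formula:

```python
def ngrams(s, n=3):
    # Set of character n-grams of the string.
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard overlap: shared n-grams over all n-grams.
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

sim_close = ngram_similarity("London", "Londun")  # close misspelling
sim_far = ngram_similarity("London", "Paris")     # unrelated string
```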
Notes¶
- The transformers_ attribute of the TableVectorizer now contains column names instead of column indices for the "remainder" columns. #266 by Leo Grinsztajn
Release 0.2.2¶
Bug fixes¶
- Fixed a bug in the TableVectorizer causing a FutureWarning when using the get_feature_names_out() method. #262 by Lilian Boulard
Release 0.2.1¶
Major changes¶
- Improvements to the TableVectorizer:
  - Type detection works better: it handles dates, numeric columns encoded as strings, and numeric columns containing strings for missing values.
  - get_feature_names() becomes get_feature_names_out(), following changes in the scikit-learn API. get_feature_names() is deprecated in scikit-learn > 1.0. #241 by Gael Varoquaux
- Improvements to the MinHashEncoder:
  - It is now possible to fit multiple columns simultaneously with the MinHashEncoder. This is very useful when using, for instance, the make_column_transformer() function on multiple columns.
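The improved type detection relies on the kind of heuristic sketched below: a string column is treated as numeric when its non-missing values parse as numbers. This is illustrative only, not the actual TableVectorizer logic, and the 0.75 threshold is an invented parameter:

```python
import pandas as pd

# A numeric column stored as strings, with a string missing-value marker.
col = pd.Series(["1.5", "2", "N/A", "3.25"])

# Values that cannot be parsed become NaN instead of raising.
parsed = pd.to_numeric(col, errors="coerce")

# Treat the column as numeric if most values parsed successfully.
is_numeric = parsed.notna().mean() >= 0.75
```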
Bug-fixes¶
- Fixed a bug that resulted in the GapEncoder ignoring the analyzer argument. #242 by Jovan Stojanovic
- GapEncoder's get_feature_names_out now accepts all iterators, not just lists. #255 by Lilian Boulard
- Fixed a DeprecationWarning raised by the usage of distutils.version.LooseVersion. #261 by Lilian Boulard
Notes¶
- Removed trailing imports in the MinHashEncoder.
- Fixed typos and updated links for the website.
- Improved documentation of the TableVectorizer and the SimilarityEncoder.
Release 0.2.0¶
Also see pre-release 0.2.0a1 below for additional changes.
Major changes¶
- Bump minimum dependencies:
  - scikit-learn (>= 0.21.0) #202 by Lilian Boulard
  - pandas (>= 1.1.5) ! NEW REQUIREMENT ! #155 by Lilian Boulard
- datasets.fetching - backward-incompatible changes to the example dataset fetchers:
  - The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference in this regard.
  - The frontend, however, changed a little: the fetching functions stay the same, but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of the dirty_cat.datasets.fetch_* functions for more information.
  - The example notebooks were updated to reflect these changes. #155 by Lilian Boulard
- Backward-incompatible change to the MinHashEncoder: the MinHashEncoder now only supports two-dimensional inputs of shape (N_samples, 1). #185 by Lilian Boulard and Alexis Cvetkov
- Updated the handle_missing parameters:
  - GapEncoder: the default value "zero_impute" becomes "empty_impute" (see doc).
  - MinHashEncoder: the default value "" becomes "zero_impute" (see doc).
  #210 by Alexis Cvetkov
- Added a get_feature_names_out method to the GapEncoder and the TableVectorizer, since get_feature_names will be deprecated in scikit-learn 1.2. #216 by Alexis Cvetkov
Notes¶
- Removed the hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.
- Improvements to the TableVectorizer:
  - Missing values are no longer systematically imputed.
  - Type casting and per-column imputation are now learnt during fitting.
  - Several bug fixes.
Release 0.2.0a1¶
Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:
pip install --pre dirty_cat==0.2.0a1
or from the GitHub repository:
pip install git+https://github.com/dirty-cat/dirty_cat.git
Major changes¶
- Bump minimum dependencies:
  - Python (>= 3.6)
  - NumPy (>= 1.16)
  - SciPy (>= 1.2)
  - scikit-learn (>= 0.20.0)
- TableVectorizer: added automatic transform through the TableVectorizer class. It transforms columns automatically based on their type, providing a replacement for scikit-learn's ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames. #167 by Lilian Boulard
- Backward-incompatible change to the GapEncoder: the GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix. #185 by Lilian Boulard and Alexis Cvetkov
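The two-dimensional support described above amounts to fitting one encoder per column and stacking the outputs side by side. This is sketched here with numpy and a toy per-column encoder in place of the real GapEncoder models:

```python
import numpy as np

def encode_column(col):
    # Toy stand-in for a per-column encoder: one-hot encode the
    # column's unique values.
    cats = sorted(set(col))
    return np.array([[float(v == c) for c in cats] for v in col])

X = [["a", "x"], ["b", "y"], ["a", "y"]]

# One independent encoder per column, outputs concatenated column-wise.
cols = list(zip(*X))
encoded = np.hstack([encode_column(c) for c in cols])
```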
Bug-fixes¶
- Fix get_feature_names for scikit-learn > 0.21. #216 by Alexis Cvetkov
Release 0.1.1¶
Major changes¶
Bug-fixes¶
- RuntimeWarnings due to overflow in the GapEncoder. #161 by Alexis Cvetkov
Release 0.1.0¶
Major changes¶
- GapEncoder: Added online Gamma-Poisson factorization through the GapEncoder class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. #153 by Alexis Cvetkov
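A rough intuition for the GapEncoder: factorize a matrix of character n-gram counts so that each string becomes a mixture of a few latent substring "topics". The real model is an online Gamma-Poisson factorization; plain NMF and invented example strings are used below purely for illustration:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

strings = ["police officer", "police sergeant", "fire fighter", "fire chief"]

# Count character n-grams, then factorize into 2 latent components;
# each row of `activations` describes a string as a non-negative
# mixture of the components.
counts = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(strings)
activations = NMF(n_components=2, random_state=0, max_iter=500).fit_transform(counts)
```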
Bug-fixes¶
- Multiprocessing exception in notebook. #154 by Lilian Boulard
Release 0.0.7¶
- MinHashEncoder: Added the minhash_encoder.py and fast_hash.py files, which implement minhash encoding through the MinHashEncoder class. This method allows for fast and scalable encoding of string categorical variables.
- datasets.fetch_employee_salaries: changed the origin of the download for employee_salaries.
  - The function now returns a bunch with a dataframe under the field "data", and not the path to the csv file.
  - The field "description" has been renamed to "DESCR".
- SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
- SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
- TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
- MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
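The minhash idea itself is small enough to sketch: hash every character n-gram of a string and keep the minimum under several independent hash functions, so that similar strings tend to share signature entries. This is illustrative code, not the fast_hash implementation:

```python
import hashlib

def minhash(s, n=3, n_hashes=4):
    # All character n-grams of the string.
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    # One minimum per seeded hash function -> fixed-size signature.
    return [
        min(int(hashlib.md5(f"{seed}{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(n_hashes)
    ]

sig_a = minhash("London")
sig_b = minhash("Londun")  # close strings often share signature entries
```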
Release 0.0.6¶
- SimilarityEncoder: Accelerated SimilarityEncoder.transform by:
  - computing the vocabulary count vectors in fit instead of transform;
  - computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the SimilarityEncoder.
- SimilarityEncoder: Fixed a bug that prevented a SimilarityEncoder from being created when categories was a list.
- SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
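The fit/transform split plus joblib parallelism can be sketched as follows, with a precomputed vocabulary and a toy set-overlap similarity standing in for the real ngram similarity:

```python
from joblib import Parallel, delayed

# "Fit": the vocabulary (and, in the real encoder, its count vectors)
# is prepared once, up front.
vocabulary = ["apple", "banana", "cherry"]

def similarities(s):
    # Toy similarity: shared-character Jaccard overlap with each
    # vocabulary entry.
    return [len(set(s) & set(v)) / len(set(s) | set(v)) for v in vocabulary]

# "Transform": per-sample similarity rows, computed in parallel.
rows = Parallel(n_jobs=2)(delayed(similarities)(s) for s in ["apple", "grape"])
```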
Release 0.0.5¶
- SimilarityEncoder: Changed the default ngram range to (2, 4), which performs better empirically.
- SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
- SimilarityEncoder: Performance improvements in the ngram similarity.
- SimilarityEncoder: Exposed a get_feature_names method.