dirty_cat.FeatureAugmenter
- class dirty_cat.FeatureAugmenter(tables, main_key, match_score=0.0, analyzer='char_wb', ngram_range=(2, 4))
Augment a main table by automatically joining multiple auxiliary tables on it.
Given a list of tables and key column names, fuzzy join them to the main table.
The principle is as follows:
The auxiliary tables and the name of the key column in the main table are provided at initialisation.
The main table is provided for fitting, and the auxiliary tables are joined to it sequentially when transform() is called.
It is advised to use hyperparameter tuning tools such as GridSearchCV to determine the best match_score parameter, as this can significantly improve your results (see the example ‘Fuzzy joining dirty tables with the FeatureAugmenter’ for an illustration).
- Parameters:
- tables : list of 2-tuples of (DataFrame, str)
List of (table, column name) tuples, the tables to join.
- main_key : str
The key column name in the main table (passed during fit) on which the join will be performed.
- match_score : float, default=0.0
Minimum score, in the [0, 1] interval, that the closest match must reach to be accepted. 1 means that only a perfect match is accepted; 0 means that the closest match is accepted, no matter how distant. For numerical joins, this defines the maximum Euclidean distance between the matches.
- analyzer : {‘word’, ‘char’, ‘char_wb’}, default=‘char_wb’
Analyzer parameter for the CountVectorizer used for the string similarities. Determines whether the count matrix is built from word counts or character n-gram counts. The option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.
- ngram_range : 2-tuple of int, default=(2, 4)
The lower and upper boundaries of the range of n-values for the n-grams used in the string similarity. All values of n such that min_n <= n <= max_n are used.
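To make the analyzer, ngram_range and match_score parameters more concrete, here is a minimal pure-Python sketch. It is not dirty_cat's implementation: the function names are hypothetical, the n-gram extraction is a simplified imitation of the ‘char_wb’ behaviour described above, and scoring candidates with cosine similarity on raw n-gram counts is an assumption made only for illustration.

```python
from collections import Counter
from math import sqrt


def char_wb_ngrams(text, ngram_range=(2, 4)):
    """Count character n-grams inside word boundaries (hypothetical helper).

    Each word is lowercased and padded with one space on both sides, so
    n-grams at the edges of a word include that padding.
    """
    min_n, max_n = ngram_range
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # The padded word is too short for this n: keep it once and stop.
                grams.append(padded)
                break
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return Counter(grams)


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (assumed metric)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def best_match(query, candidates, match_score=0.0):
    """Return the closest candidate, or None if its score is below match_score."""
    query_grams = char_wb_ngrams(query)
    scored = [
        (cosine_similarity(query_grams, char_wb_ngrams(candidate)), candidate)
        for candidate in candidates
    ]
    score, candidate = max(scored)
    return candidate if score >= match_score else None


countries = ["France", "Italia", "Germany"]
print(best_match("Italy", countries))                   # closest match is kept
print(best_match("Italy", countries, match_score=1.0))  # imperfect match rejected
```

With match_score=0 the closest candidate is always returned; raising the threshold trades recall for precision, which is why tuning it matters.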
See also
dirty_cat.fuzzy_join()
Join two tables (dataframes) based on approximate column matching.
dirty_cat.datasets.get_ken_embeddings()
Download vector embeddings for many common entities (cities, places, people…).
Examples
>>> import pandas as pd
>>> from dirty_cat import FeatureAugmenter
>>> X = pd.DataFrame(['France', 'Germany', 'Italy'], columns=['Country'])
>>> X
   Country
0   France
1  Germany
2    Italy
>>> aux_table_1 = pd.DataFrame([['Germany', 84_000_000], ['France', 68_000_000], ['Italy', 59_000_000]], columns=['Country', 'Population'])
>>> aux_table_1
   Country  Population
0  Germany    84000000
1   France    68000000
2    Italy    59000000
>>> aux_table_2 = pd.DataFrame([['French Republic', 2937], ['Italy', 2099], ['Germany', 4223], ['UK', 3186]], columns=['Country name', 'GDP (billion)'])
>>> aux_table_2
      Country name  GDP (billion)
0  French Republic           2937
1            Italy           2099
2          Germany           4223
3               UK           3186
>>> aux_table_3 = pd.DataFrame([['France', 'Paris'], ['Italia', 'Rome'], ['Germany', 'Berlin']], columns=['Countries', 'Capital'])
>>> aux_table_3
  Countries  Capital
0    France    Paris
1    Italia     Rome
2   Germany   Berlin
>>> aux_tables = [(aux_table_1, "Country"), (aux_table_2, "Country name"), (aux_table_3, "Countries")]
>>> fa = FeatureAugmenter(tables=aux_tables, main_key='Country')
>>> augmented_table = fa.fit_transform(X)
>>> augmented_table
   Country Country_aux  Population     Country name  GDP (billion) Countries Capital
0   France      France    68000000  French Republic           2937    France   Paris
1  Germany     Germany    84000000          Germany           4223   Germany  Berlin
2    Italy       Italy    59000000            Italy           2099    Italia    Rome
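The sequential joining shown in the example can be sketched with the standard library alone. This is an illustration of the idea, not dirty_cat's code: tables are plain lists of dicts, the helper names are hypothetical, and difflib.SequenceMatcher stands in for the n-gram based similarity that the class actually uses. The "_aux" suffix for colliding column names mirrors the Country_aux column in the output above.

```python
from difflib import SequenceMatcher


def fuzzy_left_join(main_rows, main_key, aux_rows, aux_key, match_score=0.0):
    """Attach the closest auxiliary row to each main row (hypothetical helper)."""
    joined = []
    for row in main_rows:
        # Score every auxiliary row against this row's key value.
        scored = [
            (SequenceMatcher(None, row[main_key], aux[aux_key]).ratio(), index)
            for index, aux in enumerate(aux_rows)
        ]
        score, best = max(scored)
        if score >= match_score:
            source = aux_rows[best]
        else:
            # No acceptable match: keep the columns but leave them empty.
            source = dict.fromkeys(aux_rows[best])
        # Suffix auxiliary columns whose names already exist in the main row.
        extra = {(k + "_aux" if k in row else k): v for k, v in source.items()}
        joined.append({**row, **extra})
    return joined


def augment(main_rows, main_key, tables, match_score=0.0):
    """Join each (rows, key) auxiliary table onto the main rows, one after another."""
    for aux_rows, aux_key in tables:
        main_rows = fuzzy_left_join(main_rows, main_key, aux_rows, aux_key, match_score)
    return main_rows


X = [{"Country": "France"}, {"Country": "Germany"}, {"Country": "Italy"}]
population = [
    {"Country": "Germany", "Population": 84_000_000},
    {"Country": "France", "Population": 68_000_000},
    {"Country": "Italy", "Population": 59_000_000},
]
capitals = [
    {"Countries": "France", "Capital": "Paris"},
    {"Countries": "Italia", "Capital": "Rome"},
    {"Countries": "Germany", "Capital": "Berlin"},
]
augmented = augment(X, "Country", [(population, "Country"), (capitals, "Countries")])
for row in augmented:
    print(row)
```

Each pass only reads the main key column, so the order of the auxiliary tables affects column order but not which rows are matched.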
Methods
- fit(X[, y]): Fit the instance to the main table.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X[, y]): Transform X by joining the auxiliary tables to it.
- fit(X, y=None)
Fit the instance to the main table.
In practice, just checks if the key columns in X, the main table, and in the auxiliary tables exist.
- Parameters:
- X : DataFrame, shape [n_samples, n_features]
The main table, to be joined to the auxiliary ones.
- y : None
Unused, only here for compatibility.
- Returns:
- FeatureAugmenter
Fitted FeatureAugmenter instance (self).
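The key-column check that fit is described as performing might look roughly like the sketch below. The function name, the use of plain lists of column names in place of DataFrames, and the choice of ValueError are all assumptions for illustration; the source does not state which exception the library raises.

```python
def check_key_columns(main_columns, main_key, aux_tables):
    """Raise if a key column is missing (hypothetical helper, not the library's).

    main_columns: column names of the main table.
    aux_tables: list of (column names, key column name) pairs.
    """
    if main_key not in main_columns:
        raise ValueError(f"Key {main_key!r} not found in the main table.")
    for position, (aux_columns, aux_key) in enumerate(aux_tables):
        if aux_key not in aux_columns:
            raise ValueError(
                f"Key {aux_key!r} not found in auxiliary table {position}."
            )


# Passes silently: every key column exists.
check_key_columns(
    ["Country"],
    "Country",
    [(["Country", "Population"], "Country"), (["Countries", "Capital"], "Countries")],
)
```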
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”}, default=None
Configure output of transform and fit_transform.
- “default”: Default output format of a transformer
- “pandas”: DataFrame output
- None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
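The <component>__<parameter> naming convention can be illustrated with a toy model. ToyEstimator and ToyPipeline below are hypothetical stand-ins written for this sketch, not scikit-learn or dirty_cat classes; they only show how a double-underscore name could be split and routed to the named component.

```python
class ToyEstimator:
    """Hypothetical estimator exposing a single match_score parameter."""

    def __init__(self, match_score=0.0):
        self.match_score = match_score

    def get_params(self, deep=True):
        return {"match_score": self.match_score}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}.")
            setattr(self, name, value)
        return self


class ToyPipeline:
    """Hypothetical nested object holding named components."""

    def __init__(self, steps):
        self.steps = dict(steps)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "augmenter__match_score" into ("augmenter", "match_score").
            component, separator, sub_key = key.partition("__")
            if not separator or component not in self.steps:
                raise ValueError(f"Invalid parameter {key!r}.")
            self.steps[component].set_params(**{sub_key: value})
        return self


pipeline = ToyPipeline([("augmenter", ToyEstimator())])
pipeline.set_params(augmenter__match_score=0.6)
print(pipeline.steps["augmenter"].get_params())
```

Hyperparameter search tools rely on the same naming scheme to address a parameter such as match_score inside a nested object.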