dirty_cat.FeatureAugmenter

Usage examples at the bottom of this page.

class dirty_cat.FeatureAugmenter(tables, main_key, match_score=0.0, analyzer='char_wb', ngram_range=(2, 4))

Augment a main table by automatically joining multiple auxiliary tables on it.

Given a list of tables and key column names, fuzzy join them to the main table.

The principle is as follows:

  1. The auxiliary tables and the name of the key column in the main table are provided at initialisation.

  2. The main table is provided for fitting, and the auxiliary tables are joined to it sequentially when transform() is called.
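
As a rough illustration of this sequential principle, the sketch below left-joins two auxiliary tables onto a main table, one after the other. It is not the library's implementation: the helper `sequential_fuzzy_join` and its `cutoff` argument are hypothetical, and `difflib` from the standard library stands in for the n-gram based matching that FeatureAugmenter relies on.

```python
import difflib

import pandas as pd


def sequential_fuzzy_join(main, main_key, tables, cutoff=0.0):
    """Left-join each auxiliary table onto `main`, one table at a time.

    Hypothetical helper: difflib replaces the library's own string matcher.
    """
    result = main.copy()
    for aux, aux_key in tables:
        aux = aux.reset_index(drop=True)
        keys = aux[aux_key].tolist()
        positions = []
        for value in result[main_key]:
            close = difflib.get_close_matches(value, keys, n=1, cutoff=cutoff)
            # -1 is not a valid row label, so reindex() yields an all-NaN row.
            positions.append(keys.index(close[0]) if close else -1)
        matched = aux.reindex(positions)
        matched.index = result.index  # align matched rows with the main table
        result = pd.concat([result, matched], axis=1)
    return result


main = pd.DataFrame({"Country": ["France", "Germany", "Italy"]})
capitals = pd.DataFrame({"Countries": ["France", "Italia", "Germany"],
                         "Capital": ["Paris", "Rome", "Berlin"]})
gdp = pd.DataFrame({"Country name": ["French Republic", "Italy", "Germany", "UK"],
                    "GDP (billion)": [2937, 2099, 4223, 3186]})

joined = sequential_fuzzy_join(
    main, "Country", [(capitals, "Countries"), (gdp, "Country name")]
)
```

With `cutoff=0.0`, every row of the main table receives its closest candidate, however distant, which loosely mirrors the behaviour described for `match_score=0`.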

It is advised to use hyperparameter tuning tools such as GridSearchCV to determine the best value of the match_score parameter, as this can significantly improve your results (see the example ‘Fuzzy joining dirty tables with the FeatureAugmenter’ for an illustration).

Parameters:
tables : list of 2-tuples of (DataFrame, str)

List of (table, column name) tuples: the auxiliary tables to join, each with the name of its key column.

main_key : str

The key column name in the main table (passed during fit) on which the join will be performed.

match_score : float, default=0.0

Minimum score, in the [0, 1] interval, that the closest match must reach to be accepted. 1 means that only a perfect match is accepted, and 0 means that the closest match is always accepted, no matter how distant. For numerical joins, this defines the maximum Euclidean distance between the matches.

analyzer : {‘word’, ‘char’, ‘char_wb’}, default=‘char_wb’

Analyzer parameter for the CountVectorizer used to compute the string similarities. Determines whether the count vectors are built from word counts or from character n-gram counts. The char_wb option creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces.

ngram_range : 2-tuple of int, default=(2, 4)

The lower and upper boundaries of the range of n-values for the different n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.
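
To make the interplay of analyzer, ngram_range and match_score more concrete, here is a minimal pure-Python sketch. The helpers `char_wb_ngrams`, `jaccard` and `closest_match` are hypothetical, and Jaccard overlap is an assumed stand-in for the vectorised similarity that the library derives from CountVectorizer. Whether a score exactly equal to match_score is accepted is also an assumption, and words shorter than n simply yield no n-gram of that size here.

```python
def char_wb_ngrams(text, ngram_range=(2, 4)):
    """Character n-grams taken inside word boundaries, padded with spaces."""
    min_n, max_n = ngram_range
    grams = set()
    # Lowercasing mirrors the default behaviour of CountVectorizer.
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.add(padded[start:start + n])
    return grams


def jaccard(a, b, ngram_range=(2, 4)):
    """Share of distinct n-grams that two strings have in common."""
    grams_a = char_wb_ngrams(a, ngram_range)
    grams_b = char_wb_ngrams(b, ngram_range)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def closest_match(value, candidates, match_score=0.0, ngram_range=(2, 4)):
    """Return the most similar candidate, or None if it scores too low."""
    best = max(candidates, key=lambda c: jaccard(value, c, ngram_range))
    if jaccard(value, best, ngram_range) >= match_score:
        return best
    return None


candidates = ["France", "Italia", "Germany"]
bigrams = char_wb_ngrams("Italy", ngram_range=(2, 2))
loose = closest_match("Italy", candidates, match_score=0.0)   # closest wins
strict = closest_match("Italy", candidates, match_score=1.0)  # no perfect match
exact = closest_match("France", candidates, match_score=1.0)  # perfect match
```

In this sketch a threshold of 0 always returns the nearest candidate, while a threshold of 1 only keeps candidates whose n-gram sets are identical to those of the query.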

See also

dirty_cat.fuzzy_join()

Join two tables (dataframes) based on approximate column matching.

dirty_cat.datasets.get_ken_embeddings()

Download vector embeddings for many common entities (cities, places, people…).

Examples

>>> import pandas as pd
>>> from dirty_cat import FeatureAugmenter
>>> X = pd.DataFrame(['France', 'Germany', 'Italy'],
...                  columns=['Country'])
>>> X
   Country
0   France
1  Germany
2    Italy
>>> aux_table_1 = pd.DataFrame([['Germany', 84_000_000],
...                             ['France', 68_000_000],
...                             ['Italy', 59_000_000]],
...                            columns=['Country', 'Population'])
>>> aux_table_1
   Country  Population
0  Germany    84000000
1   France    68000000
2    Italy    59000000
>>> aux_table_2 = pd.DataFrame([['French Republic', 2937],
...                             ['Italy', 2099],
...                             ['Germany', 4223],
...                             ['UK', 3186]],
...                            columns=['Country name', 'GDP (billion)'])
>>> aux_table_2
      Country name  GDP (billion)
0  French Republic           2937
1            Italy           2099
2          Germany           4223
3               UK           3186
>>> aux_table_3 = pd.DataFrame([['France', 'Paris'],
...                             ['Italia', 'Rome'],
...                             ['Germany', 'Berlin']],
...                            columns=['Countries', 'Capital'])
>>> aux_table_3
  Countries Capital
0    France   Paris
1    Italia    Rome
2   Germany  Berlin
>>> aux_tables = [(aux_table_1, "Country"),
...               (aux_table_2, "Country name"),
...               (aux_table_3, "Countries")]
>>> fa = FeatureAugmenter(tables=aux_tables, main_key='Country')
>>> augmented_table = fa.fit_transform(X)
>>> augmented_table
   Country Country_aux  Population     Country name  GDP (billion) Countries Capital
0   France      France    68000000  French Republic           2937    France   Paris
1  Germany     Germany    84000000          Germany           4223   Germany  Berlin
2    Italy       Italy    59000000            Italy           2099    Italia    Rome

Methods

fit(X[, y])

Fit the instance to the main table.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, y])

Join the auxiliary tables to X, the main table.

fit(X, y=None)

Fit the instance to the main table.

In practice, this only checks that the key columns exist in X (the main table) and in the auxiliary tables.

Parameters:
X : DataFrame, shape [n_samples, n_features]

The main table, to be joined to the auxiliary ones.

y : None

Unused, only here for compatibility.

Returns:
FeatureAugmenter

Fitted FeatureAugmenter instance (self).
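
The kind of check described for fit() can be sketched as follows. The helper `check_key_columns` and its error messages are hypothetical; the exception type raised by the library itself is not stated here, so ValueError is an assumption.

```python
import pandas as pd


def check_key_columns(main, main_key, tables):
    """Raise ValueError if a key column is missing (hypothetical helper)."""
    if main_key not in main.columns:
        raise ValueError(f"Main key {main_key!r} not found in the main table.")
    for position, (aux, aux_key) in enumerate(tables):
        if aux_key not in aux.columns:
            raise ValueError(
                f"Key {aux_key!r} not found in auxiliary table {position}."
            )


main = pd.DataFrame({"Country": ["France", "Germany", "Italy"]})
aux = pd.DataFrame({"Countries": ["France", "Italia"],
                    "Capital": ["Paris", "Rome"]})

# Both key columns exist, so this call returns without raising.
check_key_columns(main, "Country", [(aux, "Countries")])

# A key that is absent from the auxiliary table is reported.
try:
    check_key_columns(main, "Country", [(aux, "Country name")])
    missing_key_detected = False
except ValueError:
    missing_key_detected = True
```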

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
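
The <component>__<parameter> convention used by get_params and set_params can be illustrated with a small scikit-learn Pipeline. The step names "scale" and "clf" are arbitrary choices for this sketch, and the estimators are generic stand-ins rather than a FeatureAugmenter.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update parameters of nested components through the double-underscore syntax.
pipe.set_params(clf__C=0.1, scale__with_mean=False)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
c_value = params["clf__C"]
with_mean = params["scale__with_mean"]

# deep=False only lists the parameters of the pipeline itself.
shallow_keys = set(pipe.get_params(deep=False))
```

The same syntax is what a parameter grid would use to reach a parameter of a step inside a pipeline.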

transform(X, y=None)

Join the auxiliary tables to X, the main table.

Parameters:
X : DataFrame, shape [n_samples, n_features]

The main table, to be joined to the auxiliary ones.

y : None

Unused, only here for compatibility.

Returns:
DataFrame

The final joined table.

Examples using dirty_cat.FeatureAugmenter