dirty_cat.TargetEncoder


class dirty_cat.TargetEncoder(categories='auto', clf_type='binary-clf', dtype=np.float64, handle_unknown='error', handle_missing='')

Encode categorical features as a numeric array given a target vector.

Each category is encoded according to its effect on the target variable y: it is represented by the expected value of y conditional on that category (a probability for classification targets). Because categorical variables can contain rare categories, whose conditional estimates are unreliable, the encoder takes an empirical Bayes approach and shrinks each estimate toward the overall mean of y.
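The shrinkage described above can be sketched in plain NumPy. This is an illustrative reconstruction, not the library's code: the helper names `fit_shrunk_means` and `encode` are hypothetical, and the smoothing constant (number of samples divided by number of categories) is an assumption chosen because it reproduces the figures shown in the Examples section of this page.

```python
from collections import defaultdict

import numpy as np


def fit_shrunk_means(categories, y):
    """Return ({category: encoding}, prior) for one categorical column."""
    y = np.asarray(y, dtype=float)
    prior = y.mean()  # overall mean of the target
    groups = defaultdict(list)
    for cat, target in zip(categories, y):
        groups[cat].append(target)
    # Assumed smoothing constant: n_samples / n_categories.
    smoothing = len(y) / len(groups)
    encodings = {}
    for cat, targets in groups.items():
        count = len(targets)
        # The weight is small for rare categories, pulling them toward the prior.
        weight = count / (count + smoothing)
        encodings[cat] = weight * np.mean(targets) + (1 - weight) * prior
    return encodings, prior


def encode(categories, encodings, prior):
    """Look up each category; unseen categories fall back to the prior mean."""
    return np.array([[encodings.get(cat, prior)] for cat in categories])


train = ["male", "Male", "Female", "male", "Female"]
y = [1, 2, 3, 4, 5]
encodings, prior = fit_shrunk_means(train, y)

# 'MALE' and 'FEMALE' were never seen, so they receive the prior mean (3.0).
encoded = encode(["MALE", "FEMALE", "Female", "male", "Female"], encodings, prior)
print(encoded)
```

With this data, 'Female' (two samples, mean 4) is pulled toward the prior to 39/11, and 'male' (two samples, mean 2.5) to 30/11. The fallback in `encode` mirrors what the handle_unknown='ignore' option is documented to do.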

Parameters:
categories : ‘auto’ or list of list of int or str

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.

  • list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.

The categories used can be found in the categories_ attribute.

clf_type : {‘regression’, ‘binary-clf’, ‘multiclass-clf’}, default=‘binary-clf’

The type of prediction problem: regression, binary classification, or multiclass classification.

dtype : number type, default=np.float64

Desired dtype of output.

handle_unknown : {‘error’, ‘ignore’}, default=‘error’

Whether to raise an error or ignore it when an unknown category is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the encoded columns for this feature are assigned the prior mean of the target variable.

handle_missing : {‘error’, ‘’}, default=‘’

Whether to raise an error or impute with the empty string ‘’ when missing values (NaN) are present during fit() (default is to impute). When this parameter is set to ‘’ and a missing value is encountered during fit_transform(), the resulting encoded columns for this feature will be all zeros.
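The imputation step described for handle_missing=‘’ amounts to treating missing entries as one extra category, the empty string. A minimal sketch of that preprocessing with pandas, assuming a single column of strings (this shows the idea only, not the encoder's internal code):

```python
import numpy as np
import pandas as pd

# A column with two kinds of missing marker: NaN and None.
column = pd.Series(["Female", np.nan, "male", None])

# Replace every missing entry with the empty-string category.
imputed = column.fillna("")
print(imputed.tolist())
```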

See also

dirty_cat.GapEncoder

Encode dirty categories (strings) by constructing latent topics with continuous encoding.

dirty_cat.MinHashEncoder

Encode string columns as a numeric array with the minhash method.

dirty_cat.SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

References

For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.

Examples

>>> import numpy as np
>>> from dirty_cat import TargetEncoder
>>> enc = TargetEncoder(handle_unknown='ignore')
>>> X = [['male'], ['Male'], ['Female'], ['male'], ['Female']]
>>> y = np.array([1, 2, 3, 4, 5])
>>> enc.fit(X, y)
TargetEncoder(handle_unknown='ignore')

The encoder has found the following categories:

>>> enc.categories_
[array(['Female', 'Male', 'male'], dtype='<U6')]

We will encode the following categories, of which the first two are unknown:

>>> X2 = [['MALE'], ['FEMALE'], ['Female'], ['male'], ['Female']]
>>> enc.transform(X2)
array([[3.        ],
       [3.        ],
       [3.54545455],
       [2.72727273],
       [3.54545455]])

As expected, they were encoded according to their influence on y. The unknown categories were assigned the mean of the target variable.

Attributes:
n_features_in_ : int

Number of features in the data seen during fit().

categories_ : list of ndarray

The categories of each feature determined during fit() (in order corresponding with output of transform()).

n_ : int

Length of y.

Methods

fit(X, y)

Fit the instance to X.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X using the specified encoding scheme.

fit(X, y)

Fit the instance to X.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to determine the categories of each feature.

y : ndarray

The associated target vector.

Returns:
TargetEncoder

Fitted TargetEncoder instance (self).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See “Introducing the set_output API” in the scikit-learn documentation for an example of how to use the API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
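The <component>__<parameter> naming can be seen with any scikit-learn Pipeline. The sketch below uses two ordinary scikit-learn estimators as placeholders; the step names "scale" and "clf" are arbitrary labels chosen for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update parameters of nested components through the pipeline itself.
pipe.set_params(clf__C=0.1, scale__with_mean=False)

# get_params(deep=True) exposes the same nested names.
params = pipe.get_params(deep=True)
print(params["clf__C"], params["scale__with_mean"])
```

A simple estimator such as TargetEncoder takes its own parameter names directly, for example set_params(handle_unknown='ignore').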

transform(X)

Transform X using the specified encoding scheme.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to encode.

Returns:
2-d ndarray

Transformed input.
