dirty_cat.TargetEncoder

Usage examples at the bottom of this page.

class dirty_cat.TargetEncoder(categories='auto', clf_type='binary-clf', dtype=<class 'numpy.float64'>, handle_unknown='error', handle_missing='')

Encode categorical features as a numeric array given a target vector.

Each category is encoded according to its effect on the target variable y: it is represented by the probability of y conditional on that category. Because categorical variables can contain rare categories, for which this conditional estimate is unreliable, the method takes an empirical Bayes approach and shrinks the estimate.
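The shrinkage idea can be sketched in a few lines of plain Python. This is only an illustration in the spirit of the Micci-Barreca scheme, not dirty_cat's implementation: the function name shrunk_target_means, the smoothing constant m, and the weight n / (n + m) are assumptions chosen for the example.

```python
from collections import defaultdict


def shrunk_target_means(categories, y, m=1.0):
    """Return {category: shrunk estimate of E[y | category]}.

    Hypothetical helper: blends each per-category mean with the global
    mean, trusting the per-category mean more as its count grows.
    """
    prior = sum(y) / len(y)  # global mean of the target
    counts = defaultdict(int)
    sums = defaultdict(float)
    for category, target in zip(categories, y):
        counts[category] += 1
        sums[category] += target

    encoding = {}
    for category, n in counts.items():
        weight = n / (n + m)  # assumed shrinkage weight, in (0, 1)
        category_mean = sums[category] / n
        encoding[category] = weight * category_mean + (1 - weight) * prior
    return encoding


cats = ["a", "a", "a", "b"]
target = [1, 1, 0, 1]
enc = shrunk_target_means(cats, target, m=1.0)
```

In this toy data the category "b" is seen only once, so its raw mean of 1.0 is pulled noticeably toward the global mean of 0.75, while "a", seen three times, moves less.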

Parameters
  • categories ('auto' or a list of lists/arrays of values.) –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.

    The categories used can be found in the categories_ attribute.

  • clf_type (string {'regression', 'binary-clf', 'multiclass-clf'}) – The type of classification/regression problem.

  • dtype (number type, default np.float64) – Desired dtype of output.

  • handle_unknown ('error' (default) or 'ignore') – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting encoded columns for this feature will be all zeros.

  • handle_missing ('error' or '' (default)) – Whether to raise an error or impute with blank string ‘’ if missing values (NaN) are present during fit (default is to impute). When this parameter is set to ‘’, and a missing value is encountered during fit_transform, the resulting encoded columns for this feature will be all zeros.
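The two handling options described above can be pictured with a small stand-alone sketch. The helper encode_column and its arguments are hypothetical and do not come from dirty_cat; for brevity it applies both options at encoding time, whereas the parameter description ties missing-value handling to fit.

```python
import math


def encode_column(values, encoding, handle_unknown="error", handle_missing=""):
    """Hypothetical helper illustrating the documented options (not dirty_cat code)."""
    encoded = []
    for value in values:
        # Treat None and float NaN as missing values.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            if handle_missing == "error":
                raise ValueError("missing value found")
            value = ""  # impute with the blank string
        if value in encoding:
            encoded.append(encoding[value])
        elif handle_unknown == "error":
            raise ValueError(f"unknown category: {value!r}")
        else:
            encoded.append(0.0)  # 'ignore': unknown categories encode to zero
    return encoded


known = {"a": 0.7, "b": 0.9}
lenient = encode_column(["a", "z", float("nan")], known, handle_unknown="ignore")
```

With the default handle_unknown="error", the same call would raise a ValueError at the unseen category "z" instead of returning a zero.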

categories_

The categories of each feature determined during fitting (in order corresponding with output of transform).

Type

list of arrays

References

For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.

fit(X, y)

Fit the TargetEncoder to X.

Parameters
  • X (array-like, shape [n_samples, n_features]) – The data to determine the categories of each feature.

  • y (array) – The associated target vector.

Return type

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)
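The fit-then-transform pattern can be sketched with a toy class. ToyMeanEncoder is a hypothetical stand-in written for this example, not dirty_cat's encoder: it handles a single column passed as a flat list and applies no shrinkage.

```python
class ToyMeanEncoder:
    """Toy stand-in (not dirty_cat's class): maps each category to its mean target."""

    def fit(self, X, y):
        totals, counts = {}, {}
        for category, target in zip(X, y):
            totals[category] = totals.get(category, 0.0) + target
            counts[category] = counts.get(category, 0) + 1
        self.means_ = {c: totals[c] / counts[c] for c in totals}
        return self  # returning self allows method chaining

    def transform(self, X):
        return [self.means_[category] for category in X]

    def fit_transform(self, X, y=None, **fit_params):
        # Fit on (X, y), then transform the same X.
        return self.fit(X, y, **fit_params).transform(X)


X = ["a", "a", "b", "b"]
y = [1, 0, 1, 1]
one_step = ToyMeanEncoder().fit_transform(X, y)
two_steps = ToyMeanEncoder().fit(X, y).transform(X)
```

Both calls give the same encoding here. Although the signature defaults y to None, a target-based encoder needs the target vector, so y should always be passed.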

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
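The <component>__<parameter> form is easiest to see on a small pipeline. The sketch below uses standard scikit-learn estimators rather than TargetEncoder, and the step names "scale" and "clf" are arbitrary labels chosen for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names ("scale", "clf") become the <component> prefixes.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update parameters of nested steps through the pipeline itself.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
```

Because set_params returns the estimator instance, such calls can be chained or used inline when building a parameter search.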

transform(X)

Transform X using specified encoding scheme.

Parameters

X (array-like, shape [n_samples, n_features]) – The data to encode.

Returns

X_new – Transformed input.

Return type

2-d array

Examples using dirty_cat.TargetEncoder