dirty_cat.TargetEncoder

Usage examples at the bottom of this page.

class dirty_cat.TargetEncoder(categories='auto', clf_type='binary-clf', dtype=<class 'numpy.float64'>, handle_unknown='error', handle_missing='')

Encode categorical features as a numeric array given a target vector.

Each category is encoded according to its effect on the target variable y. Because categorical variables can contain rare categories, whose per-category statistics are noisy, the encoder represents each category by the probability of y conditional on that category and takes an empirical Bayes approach to shrink this estimate.
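The shrinkage idea can be illustrated with a minimal pure-Python sketch for a binary target. It is not the dirty_cat implementation: the helper name shrunk_target_means, the smoothing strength m, and the weight n / (n + m) are assumptions chosen for illustration, since this page does not state the exact weighting that the encoder uses.

```python
from collections import defaultdict


def shrunk_target_means(categories, y, m=1.0):
    """Map each category to a shrunk estimate of P(y = 1 | category).

    Hypothetical helper: `m` is an assumed smoothing strength, not a
    parameter of dirty_cat.TargetEncoder.
    """
    prior = sum(y) / len(y)  # global mean of the target
    counts = defaultdict(int)
    sums = defaultdict(float)
    for category, target in zip(categories, y):
        counts[category] += 1
        sums[category] += target

    encoding = {}
    for category, n in counts.items():
        weight = n / (n + m)  # approaches 1 as the category gets more samples
        category_mean = sums[category] / n
        # Blend the per-category mean with the global prior.
        encoding[category] = weight * category_mean + (1 - weight) * prior
    return encoding


cities = ["Paris", "Paris", "Paris", "Paris", "Lyon", "Nice"]
bought = [1, 1, 1, 0, 0, 1]
enc = shrunk_target_means(cities, bought, m=2.0)
for city, value in sorted(enc.items()):
    print(city, round(value, 3))
```

In this sketch the categories seen only once ("Lyon" and "Nice") end up much closer to the global mean than their raw per-category means of 0 and 1, which is the intended effect of shrinking estimates for rare categories.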

Parameters:
categories: typing.Union[typing.Literal["auto"], typing.List[typing.List[typing.Union[str, int]]]], default="auto"

Categories (unique values) per feature:
- 'auto': Determine categories automatically from the training data.
- list: categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.

The categories used can be found in the categories_ attribute.
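The two constraints on an explicit categories list (sorted, and no mixing of strings with numbers within one column) can be checked before the list is passed to the encoder. The helper below is a hypothetical sketch, not part of dirty_cat, and the encoder may perform its own validation.

```python
def check_categories(categories):
    """Raise ValueError if a per-column list is mixed-type or unsorted.

    Hypothetical helper for illustration; not a dirty_cat function.
    """
    for i, cats in enumerate(categories):
        n_str = sum(isinstance(c, str) for c in cats)
        # Check mixing first: sorting a mixed list raises TypeError in Python 3.
        if 0 < n_str < len(cats):
            raise ValueError(f"column {i}: strings and numbers are mixed")
        if list(cats) != sorted(cats):
            raise ValueError(f"column {i}: categories are not sorted")


# One list per column: column 0 holds strings, column 1 holds integers.
check_categories([["Lyon", "Nice", "Paris"], [1, 2, 3]])
```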

clf_type: typing.Literal["regression", "binary-clf", "multiclass-clf"], default="binary-clf"

The type of classification/regression problem.

dtype: type, default=np.float64

Desired dtype of output.

handle_unknown: typing.Literal["error", "ignore"], default="error"

Whether to raise an error or ignore an unknown category encountered during transform (default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting encoded columns for this feature will be all zeros.

handle_missing: typing.Literal["error", ""], default=""

Whether to raise an error or impute with a blank string '' if missing values (NaN) are present during fit (default is to impute). When this parameter is set to '' and a missing value is encountered during fit_transform, the resulting encoded columns for this feature will be all zeros.

References

For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.

Attributes:
n_features_in_: int

Number of features in the data seen during fit.

categories_: typing.List[np.array]

The categories of each feature determined during fitting (in order corresponding with output of transform).

Methods

fit(X, y)

Fit the TargetEncoder to X.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X using the specified encoding scheme.

fit(X, y)

Fit the TargetEncoder to X.

Parameters:
X: array-like, shape [n_samples, n_features]

The data to determine the categories of each feature.

y: array

The associated target vector.

Returns:
TargetEncoder

Fitted TargetEncoder instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X: array-like of shape (n_samples, n_features)

Input samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params: dict

Additional fit parameters.

Returns:
X_new: ndarray of shape (n_samples, n_features_new)

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.
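The <component>__<parameter> naming used by set_params can be illustrated by splitting each key on its first double underscore. This is a simplified pure-Python sketch of the convention, not the library's code; the function name split_nested_params and the component name "encoder" are hypothetical.

```python
def split_nested_params(params):
    """Separate an estimator's own parameters from nested ones.

    Keys of the form '<component>__<parameter>' are grouped by component;
    all other keys are treated as parameters of the estimator itself.
    """
    own, nested = {}, {}
    for key, value in params.items():
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"encoder__handle_unknown": "ignore", "dtype": float}
)
print(own)
print(nested)
```

Here "encoder__handle_unknown" addresses the handle_unknown parameter of a component registered under the assumed name "encoder", for example a step of a Pipeline.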

transform(X)

Transform X using the specified encoding scheme.

Parameters:
X: array-like, shape [n_samples, n_features_new]

The data to encode.

Returns:
2-d np.array

Transformed input.

Examples using dirty_cat.TargetEncoder

Dirty categories: machine learning with non normalized strings
