dirty_cat.TargetEncoder
Usage examples at the bottom of this page.
- class dirty_cat.TargetEncoder(categories='auto', clf_type='binary-clf', dtype=<class 'numpy.float64'>, handle_unknown='error', handle_missing='')
Encode categorical features as a numeric array given a target vector.
Each category is encoded according to its effect on the target variable y: it is represented by the probability of y conditional on that category. Because categorical variables can contain rare categories, for which this conditional estimate is noisy, the encoder takes an empirical Bayes approach and shrinks each estimate toward the prior.
- Parameters:
- categories: typing.Union[typing.Literal[“auto”], typing.List[typing.List[typing.Union[str, int]]]], default=“auto”
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.
The categories used can be found in the categories_ attribute.
- clf_type: typing.Literal[“regression”, “binary-clf”, “multiclass-clf”], default=“binary-clf”
The type of classification/regression problem.
- dtype: type, default=np.float64
Desired dtype of output.
- handle_unknown: typing.Literal[“error”, “ignore”], default=“error”
Whether to raise an error or ignore it if an unknown category is present during transform (the default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the encoded columns for this feature are assigned the prior mean of the target variable.
- handle_missing: typing.Literal[“error”, “”], default=“”
Whether to raise an error or impute with blank string ‘’ if missing values (NaN) are present during fit (default is to impute). When this parameter is set to ‘’, and a missing value is encountered during fit_transform, the resulting encoded columns for this feature will be all zeros.
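The constraint on explicitly passed categories (sorted, and no mix of strings and numeric values within a column) can be checked before constructing the encoder. The helper below is a minimal sketch: check_categories is a hypothetical name and not part of dirty_cat, and the encoder may run its own validation.

```python
def check_categories(categories):
    """Validate an explicit per-column category list (hypothetical helper)."""
    for i, column_categories in enumerate(categories):
        # A column must hold only strings or only non-string values.
        kinds = {isinstance(value, str) for value in column_categories}
        if len(kinds) > 1:
            raise ValueError(f"column {i}: strings and numeric values are mixed")
        # The categories of each column must already be in sorted order.
        if list(column_categories) != sorted(column_categories):
            raise ValueError(f"column {i}: categories are not sorted")
    return categories


# One list per column: categories[i] describes the i-th column.
categories = check_categories([["Female", "Male", "male"], [1, 2, 3]])
```

Note that string sorting is case-sensitive, so 'Male' sorts before 'male' and the two count as distinct categories.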
See also
GapEncoder
Encodes dirty categories (strings) by constructing latent topics with continuous encoding.
MinHashEncoder
Encode string columns as a numeric array with the minhash method.
SimilarityEncoder
Encode string columns as a numeric array with n-gram string similarity.
References
For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
Examples
>>> import numpy as np
>>> from dirty_cat import TargetEncoder
>>> enc = TargetEncoder(handle_unknown='ignore')
>>> X = [['male'], ['Male'], ['Female'], ['male'], ['Female']]
>>> y = np.array([1, 2, 3, 4, 5])
>>> enc.fit(X, y)
TargetEncoder(handle_unknown='ignore')
The encoder has found the following categories:
>>> enc.categories_
[array(['Female', 'Male', 'male'], dtype='<U6')]
We will encode the following categories, of which the first two are unknown:
>>> X2 = [['MALE'], ['FEMALE'], ['Female'], ['male'], ['Female']]
>>> enc.transform(X2)
array([[3.        ],
       [3.        ],
       [3.54545455],
       [2.72727273],
       [3.54545455]])
As expected, they were encoded according to their influence on y. The unknown categories were assigned the mean of the target variable.
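The encoded values in this example can be reproduced by hand. The sketch below is not the library's implementation: it assumes the shrinkage weight for a category is count / (count + n / k), where n is the number of samples and k the number of distinct categories. That weight is an assumption chosen because it matches the numbers shown above; the function name shrunk_target_means is hypothetical. With a numeric target, the conditional probability of y becomes the per-category mean of y.

```python
from collections import defaultdict


def shrunk_target_means(categories, targets):
    """Return ({category: shrunk mean of y}, prior mean of y)."""
    n = len(targets)
    prior = sum(targets) / n  # mean of y over all samples

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    k = len(counts)  # number of distinct categories
    encoding = {}
    for category in sorted(counts):
        category_mean = sums[category] / counts[category]
        # ASSUMPTION: shrinkage weight; rare categories get a small weight
        # and are therefore pulled toward the prior mean.
        weight = counts[category] / (counts[category] + n / k)
        encoding[category] = weight * category_mean + (1 - weight) * prior
    return encoding, prior


X = ["male", "Male", "Female", "male", "Female"]
y = [1, 2, 3, 4, 5]
encoding, prior = shrunk_target_means(X, y)

# Unknown categories fall back to the prior mean, as with handle_unknown='ignore'.
X2 = ["MALE", "FEMALE", "Female", "male", "Female"]
encoded = [encoding.get(category, prior) for category in X2]
print([round(value, 8) for value in encoded])
# [3.0, 3.0, 3.54545455, 2.72727273, 3.54545455]
```

For 'Female' the category mean is 4 and the weight is 2 / (2 + 5/3) = 6/11, which gives 6/11 * 4 + 5/11 * 3 = 39/11 ≈ 3.545, between the category mean and the prior mean of 3.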
- Attributes:
- n_features_in_: int
Number of features in the data seen during fit.
- categories_: typing.List[np.ndarray]
The categories of each feature determined during fitting (in order corresponding with output of transform).
Methods
- fit(X, y): Fit the instance to X.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform X using the specified encoding scheme.
- fit(X, y)
Fit the instance to X.
- Parameters:
- X: array-like, shape [n_samples, n_features]
The data to determine the categories of each feature.
- y: array
The associated target vector.
- Returns:
TargetEncoder
Fitted TargetEncoder instance (self).
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns:
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params: dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform: {“default”, “pandas”}, default=None
Configure output of transform and fit_transform.
“default” : Default output format of a transformer.
“pandas” : DataFrame output.
None : Transform configuration is unchanged.
- Returns:
- self: estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params: dict
Estimator parameters.
- Returns:
- self: estimator instance
Estimator instance.
Examples using dirty_cat.TargetEncoder
Dirty categories: machine learning with non normalized strings