dirty_cat.TargetEncoder
Usage examples at the bottom of this page.
- class dirty_cat.TargetEncoder(categories='auto', clf_type='binary-clf', dtype=<class 'numpy.float64'>, handle_unknown='error', handle_missing='')
Encode categorical features as a numeric array given a target vector.
Each category is encoded according to its effect on the target variable y. Because categorical variables can contain rare categories, the encoder represents each category by the probability of y conditional on that category and applies an empirical Bayes approach to shrink this estimate.
- Parameters:
- categories : ‘auto’ or list of list of int or str
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.
The categories used can be found in the categories_ attribute.
- clf_type : {‘regression’, ‘binary-clf’, ‘multiclass-clf’}, default=‘binary-clf’
The type of classification/regression problem.
- dtype : number type, default=np.float64
Desired dtype of output.
- handle_unknown : {‘error’, ‘ignore’}, default=‘error’
Whether to raise an error or ignore unknown categories encountered during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the encoded columns for this feature are assigned the prior mean of the target variable.
- handle_missing : {‘error’, ‘’}, default=‘’
Whether to raise an error or impute with the blank string ‘’ if missing values (NaN) are present during fit() (default is to impute). When this parameter is set to ‘’ and a missing value is encountered during fit_transform(), the resulting encoded columns for this feature will be all zeros.
See also
dirty_cat.GapEncoder
Encodes dirty categories (strings) by constructing latent topics with continuous encoding.
dirty_cat.MinHashEncoder
Encode string columns as a numeric array with the minhash method.
dirty_cat.SimilarityEncoder
Encode string columns as a numeric array with n-gram string similarity.
References
For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
Examples
>>> import numpy as np
>>> from dirty_cat import TargetEncoder
>>> enc = TargetEncoder(handle_unknown='ignore')
>>> X = [['male'], ['Male'], ['Female'], ['male'], ['Female']]
>>> y = np.array([1, 2, 3, 4, 5])
>>> enc.fit(X, y)
TargetEncoder(handle_unknown='ignore')
The encoder has found the following categories:
>>> enc.categories_
[array(['Female', 'Male', 'male'], dtype='<U6')]
We will encode the following categories, of which the first two are unknown:
>>> X2 = [['MALE'], ['FEMALE'], ['Female'], ['male'], ['Female']]
>>> enc.transform(X2)
array([[3.        ],
       [3.        ],
       [3.54545455],
       [2.72727273],
       [3.54545455]])
As expected, they were encoded according to their influence on y. The unknown categories were assigned the mean of the target variable.
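The numbers in this example are consistent with a simple shrinkage rule, sketched below in plain Python. This is an illustration rather than the library's code: it assumes each known category is encoded as λ · mean(y | category) + (1 − λ) · mean(y), with weight λ = n_c / (n_c + n / k), where n_c is the category count, n the number of samples and k the number of distinct categories. The helper name `shrunk_target_means` is hypothetical.

```python
from collections import Counter, defaultdict


def shrunk_target_means(categories, y):
    """Return (encoding dict, prior mean) for one categorical column.

    Hypothetical helper; the weighting rule is an assumption inferred
    from the example output, not taken from the dirty_cat source.
    """
    n = len(y)
    prior = sum(y) / n                  # global (prior) mean of the target
    counts = Counter(categories)        # n_c for each category
    k = len(counts)                     # number of distinct categories

    # Sum of the target values observed for each category.
    sums = defaultdict(float)
    for cat, target in zip(categories, y):
        sums[cat] += target

    encoding = {}
    for cat, n_c in counts.items():
        cond_mean = sums[cat] / n_c     # mean of y given the category
        lam = n_c / (n_c + n / k)       # assumed shrinkage weight
        encoding[cat] = lam * cond_mean + (1 - lam) * prior
    return encoding, prior


cats = ["male", "Male", "Female", "male", "Female"]
y = [1, 2, 3, 4, 5]
encoding, prior = shrunk_target_means(cats, y)

# Unknown categories fall back to the prior mean,
# mirroring handle_unknown='ignore'.
for cat in ["MALE", "FEMALE", "Female", "male", "Female"]:
    print(cat, round(encoding.get(cat, prior), 8))
```

With these inputs the sketch gives about 3.545 for 'Female', 2.727 for 'male' and 3.0 for the unknown 'MALE' and 'FEMALE', which matches the transform output in the example.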
- Attributes:
- n_features_in_ : int
Number of features in the data seen during fit().
- categories_ : list of ndarray
The categories of each feature determined during fit() (in order corresponding with output of transform()).
- n_ : int
Length of y.
Methods
fit(X, y): Fit the instance to X.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_output(*[, transform]): Set output container.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X using the specified encoding scheme.
- fit(X, y)
Fit the instance to X.
- Parameters:
- X : array-like, shape [n_samples, n_features]
The data to determine the categories of each feature.
- y : ndarray
The associated target vector.
- Returns:
- TargetEncoder
Fitted TargetEncoder instance (self).
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.