dirty_cat.SimilarityEncoder

Usage examples at the bottom of this page.

class dirty_cat.SimilarityEncoder(similarity=None, ngram_range=(2, 4), categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='ignore', handle_missing='', hashing_dim=None, n_prototypes=None, random_state=None, n_jobs=None)[source]

Encode string categorical features as a numeric array.

The input to this transformer should be an array-like of strings. The method is based on calculating the morphological similarities between the categories. This encoding is an alternative to OneHotEncoder in the case of dirty categorical variables.
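As a rough illustration of "morphological similarity", consider the Jaccard similarity between the sets of character n-grams of two strings. The sketch below is a simplification (an assumption on our part: dirty_cat's actual n-gram similarity is count-based and differs in preprocessing, so the numbers will not match its output exactly), but it conveys how each encoded column measures closeness to one known category:

```python
# Simplified sketch of an n-gram string similarity. This is NOT
# dirty_cat's internal code; it only illustrates the idea.
def ngrams(s, n_min=2, n_max=4):
    """Set of all character n-grams of s, for n_min <= n <= n_max."""
    return {s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)}

def ngram_similarity(a, b):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

# Each encoded column is the similarity to one known category, so a
# misspelled value still gets a non-zero score for the category it
# resembles:
categories = ['london', 'paris']
print([ngram_similarity('londres', c) for c in categories])
# → [0.3, 0.0]
```

A one-hot encoding would map 'londres' to all zeros here; the similarity encoding keeps it close to 'london'.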

Parameters:
similarity : None

Deprecated in dirty_cat 0.3, will be removed in 0.5. Was used to specify the type of pairwise string similarity to use. Since 0.3, only the ngram similarity is supported.

ngram_range : tuple (min_n, max_n), default=(2, 4)

The lower and upper boundaries of the range of n-values for the character n-grams used to compute the similarity.

categories : {‘auto’, ‘most_frequent’, ‘k-means’} or list of list of str, default=‘auto’

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.

  • list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.

  • ‘most_frequent’ : Compute the most frequent values for every categorical variable.

  • ‘k-means’ : Compute the K nearest neighbors of K-means centroids in order to choose the prototype categories.

The categories used can be found in the categories_ attribute.

dtype : number type, default=np.float64

Desired dtype of output.

handle_unknown : ‘error’ or ‘ignore’, default=‘ignore’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to ignore). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

handle_missing : ‘error’ or ‘’, default=‘’

Whether to raise an error or impute with blank string ‘’ if missing values (NaN) are present during fit (default is to impute). When this parameter is set to ‘’, and a missing value is encountered during fit_transform, the resulting encoded columns for this feature will be all zeros. In the inverse transform, the missing category will be denoted as None.

hashing_dim : int or None, default=None

If None, the base vectorizer is CountVectorizer, else it’s set to HashingVectorizer with a number of features equal to hashing_dim.

n_prototypes : int, optional

Number of prototypes to use. Useful when ‘most_frequent’ or ‘k-means’ is used. Must be a positive integer.
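For illustration, under the ‘most_frequent’ strategy the prototypes amount to the most common values of a column. A minimal sketch of that idea (the helper name is hypothetical, not dirty_cat's internal code):

```python
from collections import Counter

# Hypothetical helper illustrating the 'most_frequent' strategy:
# keep only the n_prototypes most common values of a column.
def most_frequent_prototypes(values, n_prototypes):
    counts = Counter(values)
    return [value for value, _ in counts.most_common(n_prototypes)]

column = ['London', 'Londres', 'London', 'Paris', 'London', 'Paris']
print(most_frequent_prototypes(column, n_prototypes=2))
# → ['London', 'Paris']
```

Restricting the prototypes this way bounds the number of output columns, which matters when a dirty column has many rare variants.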

random_state : int, RandomState instance or None, default=None

Useful when the ‘k-means’ strategy is used.

n_jobs : int, optional

Maximum number of processes used to compute similarity matrices. Used only if fast=True in SimilarityEncoder.transform.

See also

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

GapEncoder

Encode dirty categories (strings) by constructing latent topics with continuous encoding.

deduplicate

Deduplicate data by hierarchically clustering similar strings.

References

For a detailed description of the method, see Similarity encoding for learning with dirty categorical variables by Cerda, Varoquaux and Kégl, Machine Learning journal (Springer), 2018.

Examples

>>> enc = SimilarityEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
SimilarityEncoder()

It inherits the same methods as sklearn’s OneHotEncoder:

>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

But it provides a continuous encoding based on similarity instead of a discrete one based on exact matches:

>>> enc.transform([['Female', 1], ['Male', 4]])
array([[1.        , 0.42857143, 1.        , 0.        , 0.        ],
       [0.42857143, 1.        , 0.        , 0.        , 0.        ]])
>>> enc.inverse_transform([[1., 0.42857143, 1., 0., 0.], [0.42857143, 1., 0., 0., 0.]])
array([['Female', 1],
       ['Male', None]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
Attributes:
categories_ : list of np.ndarray

The categories of each feature determined during fitting (in order corresponding with output of transform).

Methods

fit(X[, y])

Fit the instance to X.

fit_transform(X[, y])

Fit SimilarityEncoder to data, then transform it.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_most_frequent(prototypes)

Get the most frequent category prototypes.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Convert the data back to the original representation.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, fast])

Transform X using specified encoding scheme.

fit(X, y=None)[source]

Fit the instance to X.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to determine the categories of each feature.

y : None

Unused, only here for compatibility.

Returns:
SimilarityEncoder

The fitted SimilarityEncoder instance (self).

fit_transform(X, y=None, **fit_params)[source]

Fit SimilarityEncoder to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params

Additional fit parameters.

Returns:
array of shape (n_samples, n_features_new)

Transformed array.

get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters:
input_features : array-like of str or None, default=None

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:
feature_names_out : ndarray of str objects

Transformed feature names.
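The naming pattern is visible in the example earlier on this page (‘gender_Female’, ‘group_1’, …): one name per (input feature, fitted category) pair. A quick sketch of that pattern (illustrative only, not the library's implementation):

```python
# Sketch of the output-name pattern: one "<feature>_<category>" name
# per (input feature, fitted category) pair, in category order.
categories_ = [['Female', 'Male'], ['1', '2', '3']]  # as in categories_
input_features = ['gender', 'group']
names = [f'{feature}_{category}'
         for feature, cats in zip(input_features, categories_)
         for category in cats]
print(names)
# → ['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3']
```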

get_most_frequent(prototypes)[source]

Get the most frequent category prototypes.

Parameters:
prototypes : list of str

The list of values for a category variable.

Returns:
np.array

The n_prototypes most frequent values for a category variable.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

property infrequent_categories_

Infrequent categories for each feature.

inverse_transform(X)[source]

Convert the data back to the original representation.

When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.

For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.

Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_encoded_features)

The transformed data.

Returns:
X_tr : ndarray of shape (n_samples, n_features)

Inverse transformed array.

set_output(*, transform=None)[source]

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X, fast=True)[source]

Transform X using specified encoding scheme.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to encode.

fast : bool, default=True

Whether to use the fast computation of ngrams.

Returns:
X_new : 2-d array, shape [n_samples, n_features_new]

Transformed input.

Examples using dirty_cat.SimilarityEncoder

Dirty categories: machine learning with non normalized strings

Investigating and interpreting dirty categories

Scalability considerations for similarity encoding