dirty_cat.SimilarityEncoder


class dirty_cat.SimilarityEncoder(similarity='ngram', ngram_range=(2, 4), categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='ignore', handle_missing='', hashing_dim=None, n_prototypes=None, random_state=None, n_jobs=None)

Encode string categorical features as a numeric array.

The input to this transformer should be an array-like of strings. The method is based on calculating the morphological similarities between the categories. The categories can be encoded using one of the implemented string similarities: similarity='ngram' (default), ‘levenshtein-ratio’, ‘jaro’, or ‘jaro-winkler’. This encoding is an alternative to OneHotEncoder in the case of dirty categorical variables.
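The idea can be pictured with a small pure-Python sketch. It is not the library's implementation: the helpers `char_ngrams` and `ngram_similarity` are hypothetical names, and the Jaccard-style ratio over character 3-grams is an assumed, simplified stand-in for the 'ngram' similarity. Each input string becomes a vector of its similarities to the known categories, where a one-hot encoding would hold a single 1.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Ratio of shared to total n-grams: an assumed, simplified similarity."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


categories = ["london", "paris"]
dirty_values = ["london", "londn", "pariss"]

# One row per input value, one column per category.
encoded = [[ngram_similarity(v, c) for c in categories] for v in dirty_values]
```

In this sketch the misspelled "londn" keeps a non-zero similarity to "london", whereas a one-hot encoding would treat it as an unrelated category.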

Parameters
  • similarity (str {'ngram', 'levenshtein-ratio', 'jaro', or 'jaro-winkler'}) – The type of pairwise string similarity to use.

  • ngram_range (tuple (min_n, max_n), default=(2, 4)) – Only used when similarity='ngram'. The range of n values for the n-gram similarity.

  • categories ('auto', 'k-means', 'most_frequent' or a list of lists/arrays of values) –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.

    • ’most_frequent’ : Compute the most frequent values for every categorical variable.

    • ’k-means’ : Compute the K nearest neighbors of the K-means centroids in order to choose the prototype categories.

    The categories used can be found in the categories_ attribute.

  • dtype (number type, default np.float64) – Desired dtype of output.

  • handle_unknown ('error' or 'ignore' (default)) – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to ignore). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

  • handle_missing ('error' or '' (default)) – Whether to raise an error or impute with blank string ‘’ if missing values (NaN) are present during fit (default is to impute). When this parameter is set to ‘’, and a missing value is encountered during fit_transform, the resulting encoded columns for this feature will be all zeros. In the inverse transform, the missing category will be denoted as None.

  • hashing_dim (int or None) – If None, the base vectorizer is CountVectorizer; otherwise it is HashingVectorizer with a number of features equal to hashing_dim.

  • n_prototypes (int or None) – Number of prototypes to use. Only relevant when categories is 'most_frequent' or 'k-means'. Must be a strictly positive integer.

  • random_state (int, RandomState instance or None) – Seed or random number generator. Only relevant when categories is 'k-means'.

  • n_jobs (int, optional) – Maximum number of processes used to compute similarity matrices. Used only if fast=True in SimilarityEncoder.transform.
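As an illustration of the ngram_range parameter above, the following sketch collects the character n-grams of a string for every n between min_n and max_n. It assumes the range is inclusive at both ends, as in scikit-learn's vectorizers; `ngrams_in_range` is a hypothetical helper, not part of dirty_cat, and the library may additionally pad or normalize strings.

```python
def ngrams_in_range(text, ngram_range=(2, 4)):
    """List the character n-grams of `text` for min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the string.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


grams = ngrams_in_range("paris")
```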

categories_

The categories of each feature determined during fitting (in order corresponding with output of transform).

Type

list of arrays

_infrequent_enabled

Avoid taking into account the existence of infrequent categories.

Type

bool, default=False

References

For a detailed description of the method, see Similarity encoding for learning with dirty categorical variables by Cerda, Varoquaux, Kegl. 2018 (accepted for publication at: Machine Learning journal, Springer).

fit(X, y=None)

Fit the SimilarityEncoder to X.

Parameters

X (array-like, shape [n_samples, n_features]) – The data to determine the categories of each feature.

Return type

self

fit_transform(X, y=None, **fit_params)

Fit SimilarityEncoder to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_feature_names(input_features=None)

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

Return feature names for output features.

For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used as a feature name.

Parameters

input_features (list of str of shape (n_features,)) – String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns

output_feature_names – Array of feature names.

Return type

ndarray of shape (n_output_features,)

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

get_most_frequent(prototypes)

Get the most frequent category prototypes.

Parameters

prototypes (list) – The list of values for a category variable.

Return type

The n_prototypes most frequent values for a category variable
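A minimal sketch of this kind of selection, using only the standard library, is given below. The function `most_frequent` is a hypothetical stand-in and does not reproduce the method's actual code; it only shows what "the n_prototypes most frequent values" means for one column.

```python
from collections import Counter


def most_frequent(values, n_prototypes):
    """Return the `n_prototypes` most common values (hypothetical helper)."""
    counts = Counter(values)
    return [value for value, _ in counts.most_common(n_prototypes)]


column = ["paris", "london", "paris", "londn", "london", "paris"]
prototypes = most_frequent(column, n_prototypes=2)
```

Here the rare misspelling "londn" is left out of the prototypes, so it would be encoded only through its similarity to the retained values.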

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

property infrequent_categories_

Infrequent categories for each feature.

inverse_transform(X)

Convert the data back to the original representation.

When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.

For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.

Returns

X_tr – Inverse transformed array.

Return type

ndarray of shape (n_samples, n_features)
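The handling of unknown categories described above can be sketched for a single encoded feature as follows. This is an illustration under assumptions, not the inherited implementation: `inverse_row` is a hypothetical helper, and picking the column with the largest value is assumed to be how a non-zero row is mapped back to a category.

```python
def inverse_row(row, categories):
    """Map one encoded row back to a category, or None if it is all zeros."""
    if not any(row):
        # All-zero row: the category was unknown at transform time.
        return None
    best = max(range(len(row)), key=row.__getitem__)
    return categories[best]


categories = ["london", "paris"]
known = inverse_row([0.4, 0.0], categories)
unknown = inverse_row([0.0, 0.0], categories)
```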

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
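The <component>__<parameter> naming used for nested objects can be illustrated with plain string handling. The helper `split_nested_key` and the component name "encoder" are hypothetical; the sketch only shows how such a key reads, not how set_params is implemented.

```python
def split_nested_key(key):
    """Split 'component__parameter' into (component, parameter)."""
    component, sep, parameter = key.partition("__")
    if not sep:
        # No double underscore: the key names a parameter of the estimator itself.
        return None, key
    return component, parameter


nested = split_nested_key("encoder__similarity")
simple = split_nested_key("similarity")
```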

transform(X, fast=True)

Transform X using specified encoding scheme.

Parameters

X (array-like, shape [n_samples, n_features]) – The data to encode.

Returns

X_new – Transformed input.

Return type

2-d array, shape [n_samples, n_features_new]
