dirty_cat.MinHashEncoder


class dirty_cat.MinHashEncoder(n_components=30, ngram_range=(2, 4), hashing='fast', minmax_hash=False, handle_missing='zero_impute')

Encode string categorical features as a numeric array, using the minhash method applied to the n-gram decomposition of each string.

Parameters
  • n_components (int, default=30) – The number of dimensions of the encoded strings. Values around 300 tend to give good prediction performance, at a higher computational cost.

  • ngram_range (tuple (min_n, max_n), default=(2, 4)) – The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used.

  • hashing (str {'fast', 'murmur'}, default='fast') – Hashing function. 'fast' is faster, but there may be concerns about the entropy of its hashes.

  • minmax_hash (bool, default=False) – If True, return the min hash and the max hash concatenated.

  • handle_missing (str {'error', 'zero_impute'}, default='zero_impute') – Whether to raise an error or encode missing values (NaN) with vectors filled with zeros.

References

For a detailed description of the method, see "Encoding high-cardinality string categorical variables" by Cerda and Varoquaux (2019).

fit(X, y=None)

Fit the MinHashEncoder to X. In practice, just initializes a dictionary to store encodings to speed up computation.

Parameters

X (array-like, shape (n_samples, ) or (n_samples, 1)) – The string data to encode.

Returns

The fitted MinHashEncoder instance.

Return type

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_fast_hash(string)

Encode a string with the fast hashing function. Fast hashing supports both min_hash and minmax_hash encoding.

Parameters

string (str) – The string to encode.

Returns

The encoded string, using specified encoding scheme.

Return type

array, shape (n_components, )

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

get_unique_ngrams(string, ngram_range)

Return the set of unique n-grams of a string.

Parameters
  • string (str) – The string to split into n-grams.

  • ngram_range (tuple (min_n, max_n)) – The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used.

Returns

The set of unique n-grams of the string.

Return type

set
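
As an illustration of how ngram_range selects n-grams, here is a minimal pure-Python sketch. It is not the library's implementation: the function name unique_ngrams is hypothetical, and any normalization or padding the library may apply to the string is not reproduced here.

```python
def unique_ngrams(string, ngram_range):
    """Return the set of character n-grams of `string` for every n
    with min_n <= n <= max_n (illustrative sketch only)."""
    min_n, max_n = ngram_range
    ngrams = set()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the string; a string shorter
        # than n yields no n-grams for that n.
        for start in range(len(string) - n + 1):
            ngrams.add(string[start:start + n])
    return ngrams


print(sorted(unique_ngrams("paris", (2, 3))))
# ['ar', 'ari', 'is', 'pa', 'par', 'ri', 'ris']
```

Because the result is a set, an n-gram that occurs several times in the string is counted once.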

minhash(string, n_components, ngram_range)

Encode a string using the murmur hashing function.

Parameters
  • string (str) – The string to encode.

  • n_components (int) – The number of dimensions of the encoded string.

  • ngram_range (tuple (min_n, max_n)) – The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used.

Returns

The encoded string.

Return type

array, shape (n_components, )
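
The general minhash idea can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's code: it swaps murmur hashing for a salted hashlib.blake2b, encodes a string with no n-grams as zeros, and the names seeded_hash, minhash_sketch and matching_fraction are hypothetical.

```python
import hashlib


def ngrams(string, ngram_range):
    """Set of character n-grams for min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    return {string[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(string) - n + 1)}


def seeded_hash(ngram, seed):
    """32-bit hash of an n-gram; each seed gives a different hash function.
    Assumption: blake2b with a salt stands in for a seeded murmur hash."""
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=4,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(string, n_components, ngram_range):
    """One component per seed: the minimum hash over the string's n-grams."""
    grams = ngrams(string, ngram_range)
    if not grams:
        # Assumption: strings too short to have n-grams are encoded as zeros.
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams)
            for seed in range(n_components)]


def matching_fraction(a, b):
    """Share of components on which two encodings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


london = minhash_sketch("london", 64, (2, 4))
londres = minhash_sketch("londres", 64, (2, 4))
paris = minhash_sketch("paris", 64, (2, 4))
print(matching_fraction(london, londres), matching_fraction(london, paris))
```

Strings that share many n-grams tend to agree on more components, which is what makes the encoding useful for similar but not identical category names.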

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)

Transform X using specified encoding scheme.

Parameters

X (array-like, shape (n_samples, ) or (n_samples, 1)) – The string data to encode.

Returns

Transformed input.

Return type

array, shape (n_samples, n_components)
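
The sketch below illustrates two behaviours described on this page: reusing stored encodings for repeated strings, and the two handle_missing modes. It is a rough illustration, not the library's implementation. The names encode_one and transform_sketch are hypothetical, encode_one is a simple stand-in for the real per-string encoding, and the error message is made up.

```python
import hashlib
import math


def encode_one(string, n_components):
    """Hypothetical stand-in for a per-string encoding of length n_components."""
    return [int.from_bytes(
                hashlib.blake2b(string.encode("utf-8"), digest_size=4,
                                salt=seed.to_bytes(8, "little")).digest(),
                "little")
            for seed in range(n_components)]


def is_missing(value):
    """Treat float NaN as a missing value."""
    return isinstance(value, float) and math.isnan(value)


def transform_sketch(X, n_components, handle_missing="zero_impute"):
    """Encode each entry of X, caching encodings of repeated strings."""
    cache = {}  # maps a string to its already computed encoding
    rows = []
    for value in X:
        if is_missing(value):
            if handle_missing == "error":
                raise ValueError("missing value found in input")
            # 'zero_impute': a missing value becomes a vector of zeros.
            rows.append([0] * n_components)
            continue
        if value not in cache:
            cache[value] = encode_one(value, n_components)
        # Copy so that rows do not share the cached list object.
        rows.append(list(cache[value]))
    return rows


encoded = transform_sketch(["london", float("nan"), "london"], n_components=3)
print(len(encoded), encoded[1])
# 3 [0, 0, 0]
```

With categorical data the same string usually appears many times, so looking up a stored encoding avoids recomputing the hashes for each occurrence.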
