dirty_cat.MinHashEncoder

Usage examples at the bottom of this page.

class dirty_cat.MinHashEncoder(n_components=30, ngram_range=(2, 4), hashing='fast', minmax_hash=False, handle_missing='zero_impute')

Encode string categorical features as a numeric array by applying the minhash method to the n-gram decomposition of each string.
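To make "n-gram decomposition" concrete, here is a minimal sketch of extracting character n-grams from a string for a given range of n. The helper name char_ngrams is hypothetical and is not part of dirty_cat; the library's own decomposition may differ in details such as padding or byte-level handling.

```python
def char_ngrams(string, ngram_range=(2, 4)):
    """Return the set of character n-grams with min_n <= n <= max_n.

    Hypothetical helper for illustration only, not the dirty_cat API.
    """
    min_n, max_n = ngram_range
    grams = set()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the string.
        for start in range(len(string) - n + 1):
            grams.add(string[start:start + n])
    return grams


print(sorted(char_ngrams("paris", ngram_range=(2, 3))))
# ['ar', 'ari', 'is', 'pa', 'par', 'ri', 'ris']
```

Strings that differ by a typo or a suffix still share most of their n-grams, which is what the encoding relies on.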

Parameters:
n_components : int, default=30

The number of dimensions of the encoded strings. Values around 300 tend to give good prediction performance, at a higher computational cost.

ngram_range : typing.Tuple[int, int], default=(2, 4)

The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used.

hashing : typing.Literal["fast", "murmur"], default="fast"

Hashing function to use. The "fast" option is faster, but the entropy of its hashes may be a concern.

minmax_hash : bool, default=False

If True, return the min hash and the max hash concatenated.

handle_missing : typing.Literal["error", "zero_impute"], default="zero_impute"

Whether to raise an error or encode missing values (NaN) with vectors filled with zeros.
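The two handle_missing behaviours can be sketched as follows. This is an illustration under assumptions, not the library's code: the function name encode_or_impute, the encode callback, and the choice of ValueError as the exception type are all hypothetical.

```python
import math


def encode_or_impute(value, n_components, encode, handle_missing="zero_impute"):
    """Encode `value`, treating None/NaN according to `handle_missing`.

    Hypothetical helper; `encode` is any callable returning n_components numbers.
    """
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    if not is_missing:
        return encode(value)
    if handle_missing == "error":
        # Assumed exception type; the library may raise something else.
        raise ValueError("Missing value found in input data.")
    # "zero_impute": a missing entry becomes a vector filled with zeros.
    return [0.0] * n_components


# Stand-in encoder used only for this demonstration.
dummy_encode = lambda s: [float(len(s))] * 3
print(encode_or_impute(float("nan"), 3, dummy_encode))  # [0.0, 0.0, 0.0]
```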

References

For a detailed description of the method, see Encoding high-cardinality string categorical variables by Cerda, Varoquaux (2019).

Attributes:
hash_dict_ : LRUDict

Computed hashes.

Methods

fit(X[, y])

Fit the MinHashEncoder to X.

fit_transform(X[, y])

Fit to data, then transform it.

get_fast_hash(string)

Encode a string with fast hashing function.

get_params([deep])

Get parameters for this estimator.

minhash(string, n_components, ngram_range)

Encode a string using murmur hashing function.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X using specified encoding scheme.

fit(X, y=None)

Fit the MinHashEncoder to X. In practice, this only initializes a dictionary that stores encodings to speed up computation.

Parameters:
X : array-like, shape (n_samples, ) or (n_samples, 1)

The string data to encode.

y : None

Unused, only here for compatibility.

Returns:
MinHashEncoder

The fitted MinHashEncoder instance.
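The encoding cache that fit initializes can be pictured with a small least-recently-used dictionary. The sketch below is an assumption about the general idea only: TinyLRU and cached_encode are hypothetical names, and the library's LRUDict may be implemented differently.

```python
from collections import OrderedDict


class TinyLRU:
    """Bounded mapping that evicts the least recently used key (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        # Reading a key marks it as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            # Drop the oldest entry once the cache is over capacity.
            self._data.popitem(last=False)


def cached_encode(value, cache, encode):
    """Compute encode(value) once, then serve repeats from the cache."""
    if value not in cache:
        cache[value] = encode(value)
    return cache[value]
```

Because categorical columns repeat the same strings many times, each distinct string is hashed once and later occurrences become a dictionary lookup.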

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_fast_hash(string)

Encode a string with the fast hashing function. Fast hashing supports both min_hash and minmax_hash encoding.

Parameters:
string : str

The string to encode.

Returns:
np.array of shape (n_components, )

The encoded string, using specified encoding scheme.
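The difference between min_hash and minmax_hash encoding can be sketched as below. Several things here are assumptions made for illustration: the helper names, the use of salted BLAKE2 from hashlib in place of the library's fast hash, and the choice to keep the output length at n_components by using half as many hash functions and requiring an even n_components.

```python
import hashlib


def seeded_hash(gram, seed):
    """Hash an n-gram to a 32-bit integer; each seed acts as a distinct hash function."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minmax_sketch(grams, n_components):
    """Concatenate per-seed minima and maxima over a non-empty collection of n-grams."""
    if n_components % 2 != 0:
        raise ValueError("n_components must be even in this sketch.")
    half = n_components // 2
    mins = [min(seeded_hash(g, seed) for g in grams) for seed in range(half)]
    maxs = [max(seeded_hash(g, seed) for g in grams) for seed in range(half)]
    # First half: min hashes; second half: max hashes.
    return mins + maxs


encoded = minmax_sketch(["lo", "on", "nd", "do"], n_components=6)
print(len(encoded))  # 6
```

Taking both the minimum and the maximum extracts two values from each hash function, so half as many hash evaluations fill the same number of output dimensions.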

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

minhash(string, n_components, ngram_range)

Encode a string using murmur hashing function.

Parameters:
string : str

The string to encode.

n_components : int

The number of dimensions of the encoded string.

ngram_range : typing.Tuple[int, int]

The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used.

Returns:
array, shape (n_components, )

The encoded string.
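As a rough picture of what a minhash encoding with these three arguments computes, the sketch below takes, for each of n_components seeded hash functions, the minimum hash value over the string's n-grams. It is not the library's implementation: murmur hashing is not in the Python standard library, so salted BLAKE2 from hashlib is swapped in, and the helper names and the all-zero result for strings with no n-grams are assumptions.

```python
import hashlib


def char_ngrams(string, ngram_range):
    """Set of character n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    return {
        string[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(string) - n + 1)
    }


def seeded_hash(gram, seed):
    """32-bit hash of an n-gram; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(string, n_components, ngram_range):
    """One minimum per hash function, giving a vector of length n_components."""
    grams = char_ngrams(string, ngram_range)
    if not grams:
        # String shorter than min_n: nothing to hash (choice made for this sketch).
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def agreement(a, b):
    """Fraction of components on which two encodings coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


london = minhash_sketch("london", 30, (2, 4))
londres = minhash_sketch("londres", 30, (2, 4))
print(len(london), agreement(london, london), agreement(london, londres))
```

Two strings agree on a given component whenever the n-gram with the smallest hash lies in both of their n-gram sets, so the agreement fraction tends to grow with the overlap between those sets.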

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
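The <component>__<parameter> naming can be illustrated with a small standalone sketch that splits such keys on the first double underscore. The function route_params and the step name "encoder" are hypothetical; this only mimics the naming convention and does not call scikit-learn.

```python
def route_params(params):
    """Group parameters by the component named before the first '__'.

    Keys without '__' belong to the top-level object and are stored under "".
    """
    routed = {}
    for key, value in params.items():
        component, sep, sub_key = key.partition("__")
        if sep:
            # Nested parameter: forward the remainder to the named component.
            routed.setdefault(component, {})[sub_key] = value
        else:
            routed.setdefault("", {})[key] = value
    return routed


print(route_params({"encoder__n_components": 100, "verbose": True}))
# {'encoder': {'n_components': 100}, '': {'verbose': True}}
```

Only the first double underscore is split off, so a deeper key such as "a__b__c" is handed to component "a" as "b__c", which lets each level resolve its own part.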

transform(X)

Transform X using the specified encoding scheme.

Parameters:
X : array-like, shape (n_samples, ) or (n_samples, 1)

The string data to encode.

Returns:
array, shape (n_samples, n_components)

Transformed input.

Examples using dirty_cat.MinHashEncoder

Dirty categories: machine learning with non normalized strings
