dirty_cat.MinHashEncoder
Usage examples at the bottom of this page.
- class dirty_cat.MinHashEncoder(n_components=30, ngram_range=(2, 4), hashing='fast', minmax_hash=False, handle_missing='zero_impute')
Encode string categorical features as a numeric array by applying the minhash method to the n-gram decomposition of each string.
- Parameters:
- n_components : int, default=30
The number of dimensions of the encoded strings. Values around 300 tend to give good prediction performance, at a higher computational cost.
- ngram_range : typing.Tuple[int, int], default=(2, 4)
The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- hashing : typing.Literal['fast', 'murmur'], default='fast'
Hashing function. 'fast' is faster, but the entropy of its hashes may be a concern.
- minmax_hash : bool, default=False
If True, return the min hash and the max hash concatenated.
- handle_missing : typing.Literal['error', 'zero_impute'], default='zero_impute'
Whether to raise an error or encode missing values (NaN) with vectors filled with zeros.
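To make the roles of n_components, ngram_range and minmax_hash concrete, here is a minimal pure-Python sketch of minhash over character n-grams. It is not the library's implementation: the helper names (char_ngrams, seeded_hash, minhash_sketch), the salted BLAKE2b hash, the all-zero result for strings too short to produce any n-gram, and the half-minima/half-maxima layout are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(string, ngram_range=(2, 4)):
    """Return the set of character n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    return {
        string[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(string) - n + 1)
    }


def seeded_hash(ngram, seed):
    """Hypothetical stand-in hash (salted BLAKE2b), not the library's function."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(string, n_components=30, ngram_range=(2, 4), minmax_hash=False):
    """Illustrative minhash encoding of one string as a list of integers."""
    ngrams = char_ngrams(string, ngram_range)
    if not ngrams:
        # Assumption: a string too short to yield any n-gram maps to zeros.
        return [0] * n_components
    if minmax_hash:
        # Assumption: half of the components hold minima, the other half maxima.
        if n_components % 2:
            raise ValueError("n_components must be even when minmax_hash=True")
        seeds = range(n_components // 2)
        mins = [min(seeded_hash(g, s) for g in ngrams) for s in seeds]
        maxs = [max(seeded_hash(g, s) for g in ngrams) for s in seeds]
        return mins + maxs
    # One component per seed: the smallest hash value over all n-grams.
    return [min(seeded_hash(g, s) for g in ngrams) for s in range(n_components)]


encoded = minhash_sketch("london", n_components=6)
print(len(encoded), encoded)
```

Each component is the minimum of one seeded hash over the string's n-grams, so the output length always equals n_components regardless of the string length.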
References
For a detailed description of the method, see Encoding high-cardinality string categorical variables by Cerda, Varoquaux (2019).
- Attributes:
- hash_dict_ : LRUDict
Computed hashes.
Methods
- fit(X[, y]): Fit the MinHashEncoder to X.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_fast_hash(string): Encode a string with the fast hashing function.
- get_params([deep]): Get parameters for this estimator.
- minhash(string, n_components, ngram_range): Encode a string using the murmur hashing function.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform X using the specified encoding scheme.
- fit(X, y=None)
Fit the MinHashEncoder to X. In practice, this only initializes a dictionary that stores encodings to speed up computation.
- Parameters:
- X : array-like of shape (n_samples,) or (n_samples, 1)
The string data to encode.
- y : None
Unused, only here for compatibility.
- Returns:
- MinHashEncoder
The fitted MinHashEncoder instance.
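The idea of storing encodings in a dictionary can be sketched as follows. This is an assumption-laden illustration rather than the library's code: encode_with_cache and toy_encode are hypothetical names, and a plain dict is used here whereas the hash_dict_ attribute is documented as an LRUDict.

```python
def encode_with_cache(strings, encode, cache=None):
    """Encode each string, calling `encode` only once per distinct value."""
    cache = {} if cache is None else cache
    rows = []
    for s in strings:
        if s not in cache:
            # First time this string is seen: compute and remember its encoding.
            cache[s] = encode(s)
        rows.append(cache[s])
    return rows, cache


calls = []


def toy_encode(s):
    """Hypothetical stand-in encoder that records how often it is called."""
    calls.append(s)
    return [len(s)]


rows, cache = encode_with_cache(["paris", "london", "paris", "paris"], toy_encode)
print(len(calls), "encoder calls for", len(rows), "rows")
```

Because categorical columns usually repeat the same strings many times, this kind of cache means the hashing cost scales with the number of distinct values rather than the number of rows.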
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_fast_hash(string)
Encode a string with the fast hashing function. Fast hashing supports both min_hash and minmax_hash encoding.
- Parameters:
- string : str
The string to encode.
- Returns:
- np.array of shape (n_components, )
The encoded string, using specified encoding scheme.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- minhash(string, n_components, ngram_range)
Encode a string using the murmur hashing function.
- Parameters:
- string : str
The string to encode.
- n_components : int
The number of dimensions of the encoded string.
- ngram_range : typing.Tuple[int, int]
The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- Returns:
- array, shape (n_components, )
The encoded string.
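A general property of minhash helps explain why such encodings are informative: for two strings, the fraction of components on which their minhash signatures agree estimates the Jaccard similarity of their n-gram sets. The sketch below illustrates this under stated assumptions. The names ngram_set, signature, jaccard and estimated_jaccard are hypothetical, and salted BLAKE2b stands in for the murmur hash, so the values differ from what the library returns.

```python
import hashlib


def ngram_set(string, ngram_range=(2, 4)):
    """Set of character n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    return {
        string[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(string) - n + 1)
    }


def signature(string, n_components=256, ngram_range=(2, 4)):
    """Minhash signature using a stand-in hash (assumes the string yields n-grams)."""
    grams = ngram_set(string, ngram_range)
    sig = []
    for seed in range(n_components):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return sig


def jaccard(a, b, ngram_range=(2, 4)):
    """Exact Jaccard similarity of the two n-gram sets."""
    set_a, set_b = ngram_set(a, ngram_range), ngram_set(b, ngram_range)
    return len(set_a & set_b) / len(set_a | set_b)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature components that are equal."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


exact = jaccard("london", "londres")
approx = estimated_jaccard(signature("london"), signature("londres"))
print(f"exact={exact:.2f} estimated={approx:.2f}")
```

More components give a tighter estimate, which is consistent with larger n_components values improving prediction performance at a higher computational cost.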
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
Examples using dirty_cat.MinHashEncoder

Dirty categories: machine learning with non normalized strings