dirty_cat.GapEncoder


class dirty_cat.GapEncoder(n_components=10, batch_size=128, gamma_shape_prior=1.1, gamma_scale_prior=1.0, rho=0.95, rescale_rho=False, hashing=False, hashing_n_features=4096, init='k-means++', tol=0.0001, min_iter=2, max_iter=5, ngram_range=(2, 4), analyzer='char', add_words=False, random_state=None, rescale_W=True, max_iter_e_step=20, handle_missing='zero_impute')[source]

Constructs latent topics with continuous encoding.

This encoder can be understood as a continuous encoding on a set of latent categories estimated from the data. The latent categories are built by capturing combinations of substrings that frequently co-occur.

The GapEncoder supports online learning on batches of data for scalability through the partial_fit() method.

The principle is as follows:

  1. Given an input string array X, we build its bag-of-n-grams representation V (n_samples, vocab_size).

  2. Instead of using the n-grams counts as encodings, we look for low-dimensional representations by modeling n-grams counts as linear combinations of topics V = HW, with W (n_topics, vocab_size) the topics and H (n_samples, n_topics) the associated activations.

  3. Assuming that n-grams counts follow a Poisson law, we fit H and W to maximize the likelihood of the data, with a Gamma prior for the activations H to induce sparsity.

  4. In practice, this is equivalent to a non-negative matrix factorization with the Kullback-Leibler divergence as loss, and a Gamma prior on H. We thus optimize H and W with the multiplicative update method.
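To make the factorization concrete, the objective can be sketched offline with scikit-learn building blocks: CountVectorizer for the bag-of-n-grams matrix V, and NMF with a Kullback-Leibler loss for V = HW. This is only an illustration of the model; GapEncoder's own solver is an online multiplicative update with a Gamma prior on H, which plain NMF does not include.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.decomposition import NMF
>>> docs = ['paris, FR', 'Paris', 'London, UK', 'London']
>>> # V: bag of character 2- to 4-grams, shape (n_samples, vocab_size)
>>> V = CountVectorizer(analyzer='char', ngram_range=(2, 4)).fit_transform(docs)
>>> # V = HW with a KL loss; multiplicative updates require solver='mu'
>>> nmf = NMF(n_components=2, beta_loss='kullback-leibler', solver='mu')
>>> H = nmf.fit_transform(V)  # activations, shape (n_samples, n_topics)
>>> W = nmf.components_       # topics, shape (n_topics, vocab_size)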

Parameters:
n_components : int, optional, default=10

Number of latent categories used to model string data.

batch_size : int, optional, default=128

Number of samples per batch.

gamma_shape_prior : float, optional, default=1.1

Shape parameter for the Gamma prior distribution.

gamma_scale_prior : float, optional, default=1.0

Scale parameter for the Gamma prior distribution.

rho : float, optional, default=0.95

Weight parameter for the update of the W matrix.

rescale_rho : bool, optional, default=False

If True, use rho ** (batch_size / len(X)) instead of rho to obtain an update rate per iteration that is independent of the batch size.

hashing : bool, optional, default=False

If True, HashingVectorizer is used instead of CountVectorizer. It has the advantage of being very low on memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory.

hashing_n_features : int, default=2**12

Number of features for the HashingVectorizer. Only relevant if hashing=True.
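For illustration, a memory-bounded configuration might look like this (hypothetical settings, shown only as a sketch):

>>> enc = GapEncoder(hashing=True, hashing_n_features=2**14)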

init : {'k-means++', 'random', 'k-means'}, default='k-means++'

Initialization method of the W matrix. If init=’k-means++’, we use the init method of KMeans. If init=’random’, topics are initialized with a Gamma distribution. If init=’k-means’, topics are initialized with a KMeans on the n-grams counts.

tol : float, default=1e-4

Tolerance for the convergence of the matrix W.

min_iter : int, default=2

Minimum number of iterations on the input data.

max_iter : int, default=5

Maximum number of iterations on the input data.

ngram_range : int 2-tuple, default=(2, 4)

The lower and upper boundaries of the range of n-values for the n-grams used in the bag-of-n-grams representation. All values of n such that min_n <= n <= max_n will be used.

analyzer : {'word', 'char', 'char_wb'}, default='char'

Analyzer parameter for the HashingVectorizer / CountVectorizer. Describes whether the matrix V to factorize should be made of word counts or character-level n-gram counts. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

add_words : bool, default=False

If True, add the word counts to the bag-of-n-grams representation of the input data.
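For illustration, hypothetical configurations of the analyzer-related parameters might look like this:

>>> enc = GapEncoder(analyzer='word')                  # topics over word counts
>>> enc = GapEncoder(analyzer='char', add_words=True)  # char n-grams plus word counts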

random_state : int or RandomState, optional

Random number generator seed for reproducible output across multiple function calls.

rescale_W : bool, default=True

If True, the weight matrix W is rescaled at each iteration to have an l1 norm equal to 1 for each row.

max_iter_e_step : int, default=20

Maximum number of iterations to adjust the activations h at each step.

handle_missing : {'error', 'zero_impute'}, default='zero_impute'

Whether to raise an error or impute with the empty string ('') if missing values (NaN) are present during fit() (the default is to impute).

See also

dirty_cat.MinHashEncoder

Encode string columns as a numeric array with the minhash method.

dirty_cat.SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

dirty_cat.deduplicate

Deduplicate data by hierarchically clustering similar strings.

References

For a detailed description of the method, see Encoding high-cardinality string categorical variables by Cerda and Varoquaux (2019).

Examples

>>> enc = GapEncoder(n_components=2)

Let’s encode the following non-normalized data:

>>> X = [['paris, FR'], ['Paris'], ['London, UK'], ['Paris, France'],
         ['london'], ['London, England'], ['London'], ['Pqris']]
>>> enc.fit(X)
GapEncoder(n_components=2)

The GapEncoder has found the following two topics:

>>> enc.get_feature_names_out()
['england, london, uk', 'france, paris, pqris']

It got it right: the recurring topics are “London” and “England” on the one side, and “Paris” and “France” on the other.

As this is a continuous encoding, we can look at the level of activation of each topic for each category:

>>> enc.transform(X)
array([[ 0.05202843, 10.54797156],
      [ 0.05000118,  4.54999882],
      [12.04734788,  0.05265212],
      [ 0.05263068, 16.54736932],
      [ 6.04999624,  0.05000376],
      [19.546716  ,  0.053284  ],
      [ 6.04999623,  0.05000376],
      [ 0.05002016,  4.54997983]])

The higher the value, the stronger the correspondence with the topic.

Attributes:
rho_ : float

Effective update rate for the W matrix.

fitted_models_ : list of GapEncoderColumn

Column-wise fitted GapEncoders.

column_names_ : list of str

Column names of the data the encoder was fitted on.

Methods

fit(X[, y])

Fit the instance on X.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names([col_names, n_labels, ...])

Return clean feature names.

get_feature_names_out([col_names, n_labels, ...])

Return the labels that best summarize the learned components/topics.

get_params([deep])

Get parameters for this estimator.

partial_fit(X[, y])

Partial fit this instance on X.

score(X)

Score this instance on X.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Return the encoded vectors (activations) H of input strings in X.

fit(X, y=None)[source]

Fit the instance on X.

Parameters:
X : array-like, shape (n_samples, n_features)

The string data to fit the model on.

y : None

Unused, only here for compatibility.

Returns:
GapEncoder

The fitted GapEncoder instance (self).

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_feature_names(col_names=None, n_labels=3, input_features=None)[source]

Return clean feature names. Compatibility method for sklearn < 1.0.

Use get_feature_names_out() instead.

For each topic, labels with the highest activations are selected.

Parameters:
col_names : 'auto' or list of str, optional

The column names to be added as prefixes before the labels. If col_names=None, no prefixes are used. If col_names=’auto’, column names are automatically defined:

  • if the input data was a DataFrame, its column names are used,

  • otherwise, ‘col1’, …, ‘colN’ are used as prefixes.

Prefixes can be manually set by passing a list for col_names.

n_labels : int, default=3

The number of labels used to describe each topic.

input_features : None

Unused, only here for compatibility.

Returns:
list of str

The labels that best describe each topic. Each element contains the labels joined by a comma.

get_feature_names_out(col_names=None, n_labels=3, input_features=None)[source]

Return the labels that best summarize the learned components/topics.

For each topic, labels with the highest activations are selected.

Parameters:
col_names : 'auto' or list of str, optional

The column names to be added as prefixes before the labels. If col_names=None, no prefixes are used. If col_names=’auto’, column names are automatically defined:

  • if the input data was a DataFrame, its column names are used,

  • otherwise, ‘col1’, …, ‘colN’ are used as prefixes.

Prefixes can be manually set by passing a list for col_names.

n_labels : int, default=3

The number of labels used to describe each topic.

input_features : None

Unused, only here for compatibility.

Returns:
list of str

The labels that best describe each topic. Each element contains the labels joined by a comma.
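For instance, reusing X from the Examples section above (the actual labels depend on the fitted topics, so they are only indicated in comments):

>>> enc = GapEncoder(n_components=2).fit(X)
>>> names = enc.get_feature_names_out(n_labels=2)          # e.g. ['london, uk', 'paris, france']
>>> names = enc.get_feature_names_out(col_names=['city'])  # each label prefixed with the column name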

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

partial_fit(X, y=None)[source]

Partial fit this instance on X.

To be used in an online learning procedure where batches of data arrive one at a time.

Parameters:
X : array-like, shape (n_samples, n_features)

The string data to fit the model on.

y : None

Unused, only here for compatibility.

Returns:
GapEncoder

The fitted GapEncoder instance (self).
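A minimal online-learning sketch, reusing X from the Examples section above and feeding it in two batches:

>>> enc = GapEncoder(n_components=2)
>>> for batch in (X[:4], X[4:]):
...     enc = enc.partial_fit(batch)
>>> enc.transform(X).shape
(8, 2)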

score(X)[source]

Score this instance on X.

Returns the sum over the columns of X of the Kullback-Leibler divergence between the n-grams counts matrix V of X, and its non-negative factorization HW.

Parameters:
X : array-like, shape (n_samples, n_features)

The data to encode.

Returns:
float

The Kullback-Leibler divergence.
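For instance, the score can be used to compare how closely different settings reconstruct the same data (reusing X from the Examples section above; lower is better):

>>> enc = GapEncoder(n_components=2).fit(X)
>>> kl = enc.score(X)  # sum of column-wise KL divergences between V and HW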

set_output(*, transform=None)[source]

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {"default", "pandas"}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.
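A sketch, assuming a scikit-learn version recent enough to provide the set_output API, and reusing X from the Examples section above:

>>> enc = GapEncoder(n_components=2).set_output(transform="pandas")
>>> out = enc.fit_transform(X)  # a pandas DataFrame instead of an ndarray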

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)[source]

Return the encoded vectors (activations) H of input strings in X.

Given the learnt topics W, the activations H are tuned to fit V = HW. When X has several columns, they are encoded separately and then concatenated.

Remark: calling transform multiple times in a row on the same input X can give slightly different encodings. This is expected, and is due to a caching mechanism used to speed things up.

Parameters:
X : array-like, shape (n_samples, n_features)

The string data to encode.

Returns:
ndarray, shape (n_samples, n_topics * n_features)

Transformed input.
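A minimal shape check on a hypothetical two-column input: each column is encoded with its own fitted model and the activations are concatenated.

>>> X2 = [['Paris', 'FR'], ['London', 'UK'], ['Paris', 'FR'], ['London', 'UK']]
>>> enc2 = GapEncoder(n_components=2).fit(X2)
>>> enc2.transform(X2).shape  # n_topics * n_features = 2 * 2
(4, 4)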
