Usage examples at the bottom of this page.

class dirty_cat.GapEncoder(n_components=10, batch_size=128, gamma_shape_prior=1.1, gamma_scale_prior=1.0, rho=0.95, rescale_rho=False, hashing=False, hashing_n_features=4096, init='k-means++', tol=0.0001, min_iter=2, max_iter=5, ngram_range=(2, 4), analyzer='char', add_words=False, random_state=None, rescale_W=True, max_iter_e_step=20, handle_missing='zero_impute')[source]

This encoder can be understood as a continuous encoding on a set of latent categories estimated from the data. The latent categories are built by capturing combinations of substrings that frequently co-occur.
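
To make the notion of co-occurring substrings concrete, the sketch below counts character n-grams for two strings and lists the n-grams they share. It assumes the default ngram_range=(2, 4) from the signature above. The helper char_ngrams is a hypothetical illustration, not part of dirty_cat, and it skips any preprocessing that CountVectorizer may apply.

```python
from collections import Counter


def char_ngrams(text, ngram_range=(2, 4)):
    """Count every character n-gram of `text` with a length in ngram_range.

    Hypothetical helper for illustration; not the library's vectorizer.
    """
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


a = char_ngrams("police officer")
b = char_ngrams("police sergeant")

# Substrings present in both entries; shared n-grams such as "pol" are
# the kind of evidence that can tie two strings to a common latent category.
shared = sorted(set(a) & set(b))
print(shared)
```

Strings that share many n-grams get similar count vectors, which is what allows a factorization of the count matrix to group them.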

The GapEncoder supports online learning on batches of data for scalability through the partial_fit method.

  • n_components (int, default=10) – Number of latent categories used to model string data.

  • batch_size (int, default=128) – Number of samples per batch.

  • gamma_shape_prior (float, default=1.1) – Shape parameter for the Gamma prior distribution.

  • gamma_scale_prior (float, default=1.0) – Scale parameter for the Gamma prior distribution.

  • rho (float, default=0.95) – Weight parameter for the update of the W matrix.

  • rescale_rho (bool, default=False) – If true, use rho ** (batch_size / len(X)) instead of rho to obtain an update rate per iteration that is independent of the batch size.

  • hashing (bool, default=False) – If true, HashingVectorizer is used instead of CountVectorizer. This uses very little memory and scales to large datasets, since no vocabulary dictionary needs to be stored in memory.

  • hashing_n_features (int, default=2**12) – Number of features for the HashingVectorizer. Only relevant if hashing=True.

  • init (str, default='k-means++') – Initialization method of the W matrix. Options: {‘k-means++’, ‘random’, ‘k-means’}. If init='k-means++', the init method of sklearn.cluster.KMeans is used. If init='random', topics are initialized with a Gamma distribution. If init='k-means', topics are initialized with a KMeans on the n-gram counts; this usually makes convergence faster, but the initialization itself is a bit slower.

  • tol (float, default=1e-4) – Tolerance for the convergence of the matrix W.

  • min_iter (int, default=2) – Minimum number of iterations on the input data.

  • max_iter (int, default=5) – Maximum number of iterations on the input data.

  • ngram_range (tuple, default=(2, 4)) – The range of n-gram lengths used to build the bag-of-n-grams representation of the input data.

  • analyzer (str, default='char') – Analyzer parameter for the CountVectorizer/HashingVectorizer. Options: {‘word’, ‘char’, ‘char_wb’}, describing whether the matrix V to factorize should be made of word counts or character n-gram counts. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

  • add_words (bool, default=False) – If true, add the word counts to the bag-of-n-grams representation of the input data.

  • random_state (int or None, default=None) – Pass an int for reproducible output across multiple function calls.

  • rescale_W (bool, default=True) – If true, the weight matrix W is rescaled at each iteration to have an l1 norm equal to 1 for each row.

  • max_iter_e_step (int, default=20) – Maximum number of iterations to adjust the activations h at each step.

  • handle_missing ('error' or 'zero_impute' (default)) – Whether to raise an error or impute with the empty string ‘’ if missing values (NaN) are present during fit (default is to impute). In the inverse transform, the missing category will be denoted as None.
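
The formula given for rescale_rho can be checked numerically. The sketch below assumes one update of W per batch, so one pass over the data performs len(X) / batch_size updates. The function names are hypothetical and only reproduce the arithmetic stated in the parameter description, not the library's internal update.

```python
def effective_rho(rho, batch_size, n_samples, rescale_rho=False):
    """Return the per-update weight, following the rescale_rho description."""
    if rescale_rho:
        return rho ** (batch_size / n_samples)
    return rho


def decay_over_one_pass(rho, batch_size, n_samples, rescale_rho):
    """Cumulative weight after one pass, assuming one update per batch."""
    n_updates = n_samples / batch_size
    return effective_rho(rho, batch_size, n_samples, rescale_rho) ** n_updates


n_samples = 1024
for batch_size in (128, 256):
    plain = decay_over_one_pass(0.95, batch_size, n_samples, rescale_rho=False)
    rescaled = decay_over_one_pass(0.95, batch_size, n_samples, rescale_rho=True)
    print(batch_size, plain, rescaled)
```

Under that assumption, the cumulative weight with rescale_rho=True is the same for both batch sizes, whereas without rescaling it depends on how many batches one pass contains.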


For a detailed description of the method, see Encoding high-cardinality string categorical variables by Cerda, Varoquaux (2019).

fit(X, y=None)[source]

Fit the GapEncoder on batches of X.


X (array-like, shape (n_samples, n_features)) – The string data to fit the model on.

Return type

GapEncoder


fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.


X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_feature_names(input_features=None, col_names=None, n_labels=3)[source]

Deprecated; use get_feature_names_out instead.

get_feature_names_out(col_names=None, n_labels=3)[source]

Returns the labels that best summarize the learned components/topics. For each topic, labels with highest activations are selected.

  • col_names ({None, list or str}, default=None) –

    The column names to be added as prefixes before the labels. If col_names == None, no prefixes are used. If col_names == ‘auto’, column names are automatically defined:

    • if the input data was a dataframe, its column names are used

    • otherwise, ‘col1’, …, ‘colN’ are used as prefixes

    Prefixes can be manually set by passing a list for col_names.

  • n_labels (int, default=3) – The number of labels used to describe each topic.


topic_labels – The labels that best describe each topic.

Return type

list of strings
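
The selection rule described above (keep the labels with the highest activations for each topic) can be sketched as follows. The activation values and the helper top_labels are hypothetical; how the library computes activations for candidate labels is not covered here.

```python
def top_labels(activations, n_labels=3):
    """Return the n_labels labels with the highest activation for one topic.

    `activations` maps a candidate label to its (made-up) activation.
    """
    ranked = sorted(activations, key=activations.get, reverse=True)
    return ranked[:n_labels]


# Hypothetical activations of four candidate labels on a single topic.
topic_activations = {
    "police": 0.9,
    "officer": 0.7,
    "fire": 0.1,
    "sergeant": 0.5,
}
print(top_labels(topic_activations, n_labels=3))
```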


get_params(deep=True)

Get parameters for this estimator.


deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.


params – Parameter names mapped to their values.

Return type

dict


partial_fit(X, y=None)[source]

Partial fit of the GapEncoder on X. To be used in an online learning procedure where batches of data arrive one by one.


X (array-like, shape (n_samples, n_features)) – The string data to fit the model on.
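
In an online setting, the data has to be cut into batches before each one is handed to partial_fit. The sketch below shows only that batching step, using a hypothetical helper iter_batches and made-up rows. The batch size of 128 mirrors the batch_size default in the signature, which is an illustrative choice rather than a requirement.

```python
def iter_batches(rows, batch_size=128):
    """Yield consecutive slices of `rows` with at most batch_size rows each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Made-up single-column string data: 300 rows, 1 feature.
rows = [[f"job title {i}"] for i in range(300)]

batch_sizes = [len(batch) for batch in iter_batches(rows)]
print(batch_sizes)
```

Each yielded batch has the (n_samples, n_features) layout that partial_fit expects and could be passed to it in turn on an already constructed encoder.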

Return type

GapEncoder



score(X)

Returns the sum over the columns of X of the Kullback-Leibler divergence between the n-grams counts matrix V of X, and its non-negative factorization HW.


X (array-like (str), shape (n_samples, n_features)) – The data to encode.


kl_divergence – The Kullback-Leibler divergence.
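
For intuition, the sketch below computes a Kullback-Leibler divergence between a small count matrix V and a product HW in pure Python. It uses the generalized form common in non-negative matrix factorization, sum(v * log(v / hw) - v + hw). Whether the library uses exactly this variant is an assumption, and the matrices are made up.

```python
import math


def matmul(H, W):
    """Multiply two matrices given as lists of rows."""
    return [[sum(h * w for h, w in zip(row, col)) for col in zip(*W)]
            for row in H]


def generalized_kl(V, H, W, eps=1e-10):
    """Generalized KL divergence between V and the product HW.

    Entries of HW are floored at eps to avoid division by zero, and
    terms with v == 0 contribute only hw (the 0 * log 0 = 0 convention).
    """
    total = 0.0
    for v_row, hw_row in zip(V, matmul(H, W)):
        for v, hw in zip(v_row, hw_row):
            hw = max(hw, eps)
            if v > 0:
                total += v * math.log(v / hw)
            total += hw - v
    return total


H = [[1, 0], [0, 2]]              # activations: 2 samples x 2 topics
W = [[2, 0, 1], [0, 1, 1]]        # topics: 2 topics x 3 n-grams
V_exact = matmul(H, W)            # counts reproduced exactly by HW
V_other = [[3, 0, 1], [0, 2, 2]]  # counts that HW does not match

print(generalized_kl(V_exact, H, W), generalized_kl(V_other, H, W))
```

The divergence is essentially zero when HW reproduces V and grows as the factorization fits the counts less well, so lower values indicate a better fit.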

Return type

float



set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.


**params (dict) – Estimator parameters.


self – Estimator instance.

Return type

estimator instance
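
The <component>__<parameter> naming convention can be illustrated by splitting a parameter name on its first double underscore. This is a minimal sketch of the convention only; the step name gapencoder is a hypothetical example and split_param is not a library function.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter).

    Names without a double underscore address the estimator itself,
    signalled here by a component of None.
    """
    component, delimiter, parameter = name.partition("__")
    if delimiter:
        return component, parameter
    return None, name


print(split_param("gapencoder__n_components"))
print(split_param("n_components"))
```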


transform(X)

Return the encoded vectors (activations) H of input strings in X. Given the learnt topics W, the activations H are tuned to fit V = HW. When X has several columns, they are encoded separately and then concatenated.

Remark: calling transform multiple times in a row on the same input X can give slightly different encodings. This is expected due to a caching mechanism to speed things up.


X (array-like, shape (n_samples, n_features)) – The string data to encode.


H – Transformed input.

Return type

2-d array, shape (n_samples, n_topics * n_features)
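
The output shape follows from encoding each column separately and concatenating the results side by side. The sketch below reproduces that bookkeeping with made-up activation values for two columns and two topics. The helper concat_columns is hypothetical and stands in for the concatenation step only.

```python
def concat_columns(blocks):
    """Concatenate per-column activation matrices horizontally.

    `blocks` holds one matrix per input column, each with n_samples rows
    and n_topics values per row.
    """
    combined = []
    for rows in zip(*blocks):
        merged = []
        for row in rows:
            merged.extend(row)
        combined.append(merged)
    return combined


# Made-up activations: 3 samples, 2 topics, for each of 2 input columns.
H_col1 = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
H_col2 = [[0.3, 0.7], [0.6, 0.4], [1.0, 0.0]]

H = concat_columns([H_col1, H_col2])
print(len(H), len(H[0]))
```

With 2 topics and 2 input columns, each sample ends up with 2 * 2 = 4 values, matching the (n_samples, n_topics * n_features) shape stated above.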

Examples using dirty_cat.GapEncoder