Fitting scalable, non-linear models on data with dirty categories

A very classic dilemma when training a machine learning model consists in choosing between using a linear model or a non-linear one.

Linear models are very well studied and understood. They are generally fast and easy to optimize, and generate interpretable results. However, for some problems with a complex relationship between the input and the output, linear models reach their expressivity limit: whatever the number of samples the training set may have, past some point, its precision won’t get any better.

Non-linear models, however, tend to scale better with sample size: they are able to digest the information in the additional samples to get a better estimate of the link between the input and the output.

Non-linear models form a very large model class. Among others, this class includes:

  • Neural Networks

  • Tree-based methods such as Random Forests, and the very powerful Gradient Boosting Machines [1]

  • Kernel Methods.

However, reaching the phase where the non-linear model outperforms the linear one can be complicated. Indeed, a more complex model often means a longer fitting/tuning process:

  • Neural networks often need extended model tuning time, in order to achieve good optimization and network architecture.

  • Gradient Boosting Machines do not tend to scale extremely well with increasing sample size, as all the data needs to be loaded into the main memory.

  • For kernel methods, parameter fitting requires the inversion of a gram matrix of size \(n \times n\) (\(n\) being the number of samples), yielding a quadratic dependency (with \(n\)) in the final computation time.

In order to get the best out of a non-linear model, one has to make it scalable. For kernel methods, approximation algorithms that drop the quadratic dependency with the sample size while ensuring almost the same model capacity exist.

In this example, you will learn how to:
  1. Build a ML pipeline that uses a kernel method.

  2. Make this pipeline scalable, by using online algorithms and dimension reduction methods.


This example assumes the reader is familiar with similarity encoding and its use-cases.

  • For an introduction to dirty categories, see this example.

  • To learn with dirty categories using the SimilarityEncoder, see this example.

Training a simple pipeline

The data that the model will fit is the drug_directory dataset.

from dirty_cat.datasets import fetch_drug_directory

# We'll only gather the information, and not load the dataset in memory for now
drug_directory = fetch_drug_directory(load_dataframe=False)
Product listing data submitted to the U.S. FDA for all unfinished, unapproved drugs.

The NONPROPRIETARYNAME column, is composed of text observations with describing each drug’s composition. The PRODUCTTYPENAME column consists of categorical values: therefore, our problem is a classification problem. You can have a glimpse of the values here:

import pandas as pd

df = pd.read_csv(
# This will be useful further down in the example.
columns_names = df.columns
0            diluent           HUMAN OTC DRUG
3        Dulaglutide  HUMAN PRESCRIPTION DRUG
4        Dulaglutide  HUMAN PRESCRIPTION DRUG

Estimators construction

Our input is categorical, thus needs to be encoded. As observations often consist in variations around a few concepts (for instance, 'Amlodipine Besylate' and 'Amlodipine besylate and atorvastatin calcium' have one ingredient in common), we need an encoding able to capture similarities between observations.

from dirty_cat import SimilarityEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')

Two other columns are used to predict the output: DOSAGEFORMNAME and ROUTENAME. They are both categorical and can be encoded with a OneHotEncoder. We use a ColumnTransformer to stack the OneHotEncoder and the SimilarityEncoder. We can now choose a kernel method, for instance a SupportVectorClassifier, to fit the encoded inputs.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

column_transformer = make_column_transformer(
    (similarity_encoder, ['NONPROPRIETARYNAME']),
    (OneHotEncoder(handle_unknown='ignore'), ['DOSAGEFORMNAME', 'ROUTENAME']),

# The classifier and the ColumnTransformer are stacked into a Pipeline object
classifier = SVC(kernel='rbf', random_state=42, gamma=1)
steps = [('transformer', column_transformer), ('classifier', classifier)]
model = Pipeline(steps)

Data Loading and Preprocessing

Like in most machine learning setups, the data has to be split into 2 exclusive parts:

  • One for model training.

  • One for model testing.

For this reason, we create a simple wrapper around pandas.read_csv(), that extracts the X, and y from the dataset.

def preprocess(df, label_encoder):
    df = df.loc[df['PRODUCTTYPENAME'].isin(

    df = df.dropna()

    y = df[['PRODUCTTYPENAME']].values

    y_int = label_encoder.transform(np.squeeze(y))

    return X, y_int

def get_X_y(**kwargs):
    """simple wrapper around pd.read_csv that extracts features and labels

    Some systematic preprocessing is also carried out to avoid doing this
    transformation repeatedly in the code.
    global label_encoder
    df = pd.read_csv(drug_directory.path, **drug_directory.read_csv_kwargs, **kwargs)
    return preprocess(df, label_encoder)

Classifier objects in scikit-learn often require y to be integer labels. Additionally, average_precision_score() requires a binary version of the labels. For these two purposes, we create:

Their transform methods can the be called at appropriate times.

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder = LabelEncoder()['HUMAN OTC DRUG', 'HUMAN PRESCRIPTION DRUG'])

one_hot_encoder = OneHotEncoder(categories="auto", sparse=False)[[0], [1]])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with


During the following training procedures of this example, we will assume that the dataset was shuffled prior to its loading. As a reason, we can take the first \(n\) observations for the training set, the next \(m\) observations for the test set and so on. This may not be the case for all datasets, so be careful before applying this code to your own !

Finally, the X and y are loaded.

train_set_size = 5000
test_set_size = 10000
offset = 100000

X_train, y_train = get_X_y(skiprows=1, names=columns_names,

X_test, y_test = get_X_y(skiprows=offset, names=columns_names,

Evaluating time and sample complexity

Let’s get an idea of model precision and performance depending on the number of the samples used in the train set. The Pipeline is trained over different training set sizes. For this, X_train and y_train get sliced into subsets of increasing size, while X_test and y_test do not change when the sample size varies.

import time
from sklearn.metrics import average_precision_score

# define the different train set sizes on which to evaluate the model
train_set_sizes = [train_set_size // 10, train_set_size // 3, train_set_size]

train_times_svc, test_scores_svc = [], []

for n in train_set_sizes:

    t0 = time.perf_counter()[:n], y_train[:n])
    train_time = time.perf_counter() - t0

    y_pred = model.predict(X_test)

    y_pred_onehot = one_hot_encoder.transform(y_pred.reshape(-1, 1))
    y_test_onehot = one_hot_encoder.transform(y_test.reshape(-1, 1))

    test_score = average_precision_score(y_test_onehot, y_pred_onehot)


    msg = ("using {:>5} samples: model fitting took {:.1f}s, test accuracy of "
    print(msg.format(n, train_time, test_score))
using   500 samples: model fitting took 0.2s, test accuracy of 0.501
using  1666 samples: model fitting took 3.1s, test accuracy of 0.505
using  5000 samples: model fitting took 51.4s, test accuracy of 0.563

Increasing training size clearly improves model accuracy. However, the training time and the input size increase quadratically with the training set size. Indeed, kernel methods need to process an entire \(n \times n\) matrix at once. In order for this matrix to be loaded into memory, \(n\) has to remain low: Using 30000 observations, the input is a \(30000 \times 30000\) matrix. If composed of 32-bit floats, its total size is around \(30000^2 \times 32 = 2.8 \times 10^{10} \text{bits} = 4\text{GB}\)

Reducing input dimension using kernel approximation methods.

The main scalability issues with kernels methods is the processing of a large square matrix. To understand where this matrix comes from, we need to delve a little bit deeper these methods internals.

Kernel approximation methods

From what was said below, two criteria are limiting a kernel algorithm to scale:

  • It processes a matrix, whose size increases quadratically with the number of samples.

  • During fitting time, this matrix is inverted, meaning it has to be loaded into main memory.

Kernel approximation methods such as RBFSampler or Nystroem [4] try to approximate this similarity matrix, without actually creating it. By allowing the program to not compute the perfect similarity matrix, the problem complexity becomes linear! Plus, the samples also do not need to be processed at once into main memory. We are not bound to use a SupportVectorClassifier anymore, and can instead use an online optimization that will process the input by batch.

Reducing the transformers dimensionality

There is one last scalability issue in our pipeline: the SimilarityEncoder and the RBFSampler both implement the fit method. How to adapt those method to an online setting, where the data is never loaded as a whole?

A simple solution is to partially fit the SimilarityEncoder and the RBFSampler on a subset of the data, prior to the online fitting step.

from sklearn.kernel_approximation import RBFSampler
n_out_encoder = 1000
n_out_rbf = 5000
n_samples_encoder = 10000

X_encoder, _ = get_X_y(nrows=n_samples_encoder, names=columns_names)

similarity_encoder = SimilarityEncoder(
    similarity='ngram', categories='most_frequent', n_prototypes=n_out_encoder,
    random_state=42, ngram_range=(2, 4))

# Fit the rbf_sampler with the similarity matrix.
column_transformer = make_column_transformer(
    (similarity_encoder, ['NONPROPRIETARYNAME']),
    (OneHotEncoder(handle_unknown='ignore'), ['DOSAGEFORMNAME', 'ROUTENAME']),

transformed_categories = column_transformer.fit_transform(X_encoder)

# gamma is a parameter of the rbf function, that sets how fast the similarity
# between two points should decrease as the distance between them rises. It
# is data-specific, and needs to be chosen carefully, for example using
# cross-validation.
rbf_sampler = RBFSampler(
    gamma=0.5, n_components=n_out_rbf, random_state=42)

def encode(X, y_int, one_hot_encoder, column_transformer, rbf_sampler):
    X_sim_encoded = column_transformer.transform(X)

    X_highdim = rbf_sampler.transform(X_sim_encoded.toarray())

    y_onehot = one_hot_encoder.transform(y_int.reshape(-1, 1))

    return X_highdim, y_onehot

# The inputs and labels of the val and test sets have to be pre-processed the
# same way the training set was processed:
X_test_kernel_approx, y_true_test_onehot = encode(
    X_test, y_test, one_hot_encoder, column_transformer, rbf_sampler)

Online training for out-of-memory data

We now have all the theoretical elements to create an non-linear, online kernel method.

import warnings
from sklearn.linear_model import SGDClassifier

online_train_set_size = 100000
# Filter warning on max_iter and tol
warnings.filterwarnings('ignore', module='sklearn.linear_model')
sgd_classifier = SGDClassifier(
    max_iter=1, tol=None, random_state=42, average=10)

We can now start the training, by looping over batches one by one. Note that only one pass over the whole dataset is done. It may be worth doing several passes, but for very large sample sizes, the increase in test accuracy is likely to be marginal.

batchsize = 1000
test_scores_rbf = []
train_times_rbf = []
online_train_set_sizes = []
t0 = time.perf_counter()

iter_csv = pd.read_csv(
    drug_directory.path, nrows=online_train_set_size, chunksize=batchsize,
    skiprows=1, names=columns_names, **drug_directory.read_csv_kwargs)

for batch_no, batch in enumerate(iter_csv):
    X_batch, y_batch = preprocess(batch, label_encoder)
    # skip iteration if batch is empty after preprocessing
    if len(y_batch) == 0:
    X_batch_kernel_approx, y_batch_onehot = encode(
        X_batch, y_batch, one_hot_encoder, column_transformer, rbf_sampler)

    # make one pass of stochastic gradient descent over the batch.
        X_batch_kernel_approx, y_batch, classes=[0, 1])

    # print train/test accuracy metrics every 5 batch
    if (batch_no % 5) == 0:
        message = "batch {:>4} ".format(batch_no)
        for origin, X, y_true_onehot in zip(
                ('train', 'val'),
                (X_batch_kernel_approx, X_test_kernel_approx),
                (y_batch_onehot, y_true_test_onehot)):

            y_pred = sgd_classifier.predict(X)

            # preprocess correctly the labels and prediction to match
            # average_precision_score expectations
            y_pred_onehot = one_hot_encoder.transform(
                y_pred.reshape(-1, 1))

            score = average_precision_score(y_true_onehot, y_pred_onehot)
            message += "{} precision: {:.4f}  ".format(origin, score)
            if origin == 'val':
                train_times_rbf.append(time.perf_counter() - t0)
                online_train_set_sizes.append((batch_no + 1)*batchsize)

batch    0 train precision: 0.8728  val precision: 0.5003
batch    5 train precision: 0.5139  val precision: 0.5592
batch   10 train precision: 0.5559  val precision: 0.6715
batch   15 train precision: 0.7414  val precision: 0.7931
batch   20 train precision: 0.8488  val precision: 0.8054
batch   25 train precision: 0.6988  val precision: 0.8200
batch   30 train precision: 0.8881  val precision: 0.8281
batch   35 train precision: 0.9507  val precision: 0.8411
batch   40 train precision: 0.9227  val precision: 0.8471
batch   45 train precision: 0.9041  val precision: 0.8508
batch   50 train precision: 0.7899  val precision: 0.8534
batch   55 train precision: 0.8975  val precision: 0.8555
batch   60 train precision: 0.9122  val precision: 0.8575
batch   65 train precision: 0.8437  val precision: 0.8599
batch   70 train precision: 0.6775  val precision: 0.8644
batch   75 train precision: 0.9011  val precision: 0.8653
batch   80 train precision: 0.9120  val precision: 0.8658
batch   85 train precision: 0.9219  val precision: 0.8645
batch   90 train precision: 0.9446  val precision: 0.8647
batch   95 train precision: 0.9473  val precision: 0.8661

So far, we fitted two kinds of models: a exact kernel algorithm, and an approximate, online one. Lets compare both the accuracies and the number of visited samples for each model as we increase our time budget:

import matplotlib.pyplot as plt

f, axs = plt.subplots(2, 1, sharex=True, figsize=(10, 7))
ax_score, ax_capacity = axs

ax_capacity.set_ylabel('training set size')

ax_score.plot(train_times_svc, test_scores_svc, 'b-', label='exact')
ax_score.plot(train_times_rbf, test_scores_rbf, 'r-', label='online')

ax_capacity.plot(train_times_svc, train_set_sizes, 'b-', label='exact')
ax_capacity.plot(train_times_rbf, online_train_set_sizes, 'r-', label='online')

    bbox_to_anchor=(0., 1.02, 1., .102),
    loc=3, ncol=2, mode='expand',

# compare the two methods in their common time range
ax_score.set_xlim(0, min(train_times_svc[-1], train_times_rbf[-1]))

title = """Test set accuracy and number of samples visited
samples seen by the SGDClassifier"""
Test set accuracy and number of samples visited samples seen by the SGDClassifier
Text(0.5, 0.98, 'Test set accuracy and number of samples visited\nsamples seen by the SGDClassifier')

This plot shows us that for the time budget, the online model will eventually process more samples, be faster and reach a far higher test accuracy that the non-online, exact kernel method.

Our online model also outperforms online linear models (for instance, SimilarityEncoder + SGDClassifier). We did not fit two online models here for simplicity purposes, but to train an online linear model, simply comment out the line in encode X_highdim = rbf_sampler.transform(X_sim_encoded.toarray()) and change it for example with X_highdim = X_sim_encoded.

In particular, this hierarchy between the linear model and the non-linear one shows that there were some significant non-linear relationships between the input and the output. By scaling a kernel method, we successfully took this non-linearity into account in our model, which was a far from trivial task at the beginning of this example!


Total running time of the script: ( 4 minutes 24.073 seconds)

Gallery generated by Sphinx-Gallery