Investigating dirty categories

What are dirty categorical variables, and how can a good encoding help with statistical learning?

What do we mean by dirty categories?

Let’s look at a dataset called employee salaries:

from dirty_cat import datasets

employee_salaries = datasets.fetch_employee_salaries()
print(employee_salaries.description)
data = employee_salaries.X
print(data.head(n=5))
Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.
  gender department  ... date_first_hired year_first_hired
0      F        POL  ...       09/22/1986             1986
1      M        POL  ...       09/12/1988             1988
2      F        HHS  ...       11/19/1989             1989
3      M        COR  ...       05/05/2014             2014
4      M        HCA  ...       03/05/2007             2007

[5 rows x 9 columns]

Here is the number of unique entries per column:

print(data.nunique())
gender                        2
department                   37
department_name              37
division                    694
assignment_category           2
employee_position_title     385
underfilled_job_title        84
date_first_hired           2264
year_first_hired             51
dtype: int64

As we can see, some columns contain many different unique entries:

print(data['employee_position_title'].value_counts().sort_index())
Abandoned Vehicle Code Enforcement Specialist        4
Accountant/Auditor I                                 3
Accountant/Auditor II                                1
Accountant/Auditor III                              35
Administrative Assistant to the County Executive     1
                                                    ..
Welder                                               3
Work Force Leader I                                  1
Work Force Leader II                                28
Work Force Leader III                                2
Work Force Leader IV                                 9
Name: employee_position_title, Length: 385, dtype: int64

These different entries are often variations on the same entities: there are 3 kinds of Accountant/Auditor.

Such variations will break traditional categorical encoding methods:

  • Using simple one-hot encoding will create orthogonal features, whereas it is clear that those 3 terms have a lot in common.

  • If we wanted to use word embedding methods such as word2vec, we would have to go through a cleaning phase: those algorithms are not trained to work on data such as ‘Accountant/Auditor I’. Such cleaning can be error-prone and time-consuming.

The problem becomes easier if we can capture relationships between entries.
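
To build intuition, the sketch below (plain Python sets, not dirty_cat’s exact formula) shows how character 3-grams capture that two variations of the same job title are almost identical strings:

def ngrams(string, n=3):
    """Return the set of character n-grams of a string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}

a = ngrams('Accountant/Auditor I')
b = ngrams('Accountant/Auditor II')
# Jaccard-style overlap: close to 1 for near-duplicate strings,
# close to 0 for unrelated ones
print(len(a & b) / len(a | b))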

To simplify understanding, we will focus on the column describing the employee’s position title, keeping the gender and the current annual salary alongside it:

values = data[['employee_position_title', 'gender']]
values.insert(0, 'current_annual_salary', employee_salaries.y)

String similarity between entries

That’s where our encoders come into play. In order to robustly embed dirty semantic data, the SimilarityEncoder builds a similarity matrix based on the 3-gram structure of the data.

sorted_values = values['employee_position_title'].sort_values().unique()

from dirty_cat import SimilarityEncoder

similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(
    sorted_values.reshape(-1, 1))
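
Each category is represented by its similarity to every category seen during fit, so transformed_values should be a square 385 × 385 similarity matrix. This is also why 1 - transformed_values can be passed to MDS below as a precomputed dissimilarity:

print(transformed_values.shape)  # expected: (385, 385)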

Plotting the new representation using multi-dimensional scaling

Let’s now plot a couple of points at random using a low-dimensional representation, to get an intuition of what the similarity encoder is doing:

from sklearn.manifold import MDS

mds = MDS(dissimilarity='precomputed', n_init=10, random_state=42)
two_dim_data = mds.fit_transform(1 - transformed_values)
# the similarities lie in the 0-1 range, so 1 - similarity
# yields a valid (non-negative) dissimilarity matrix
print(two_dim_data.shape)
print(sorted_values.shape)
(385, 2)
(385,)

We first quickly fit a nearest-neighbors model so that the plot does not get too busy:
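
The selection code itself does not appear above; a minimal sketch, assuming we keep a handful of random categories together with their nearest neighbors in the encoded space, and store their positions in an indices array used by the plotting code below, could be:

import numpy as np

from sklearn.neighbors import NearestNeighbors

n_points = 5
np.random.seed(42)

# pick a few categories at random
random_points = np.random.choice(len(sorted_values), n_points, replace=False)

# for each of them, also keep its nearest neighbor in the similarity space
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(transformed_values)
_, neighbor_indices = nearest_neighbors.kneighbors(
    transformed_values[random_points])
indices = np.unique(neighbor_indices.squeeze())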

Then we plot it, adding the categories in the scatter plot:

import matplotlib.pyplot as plt

f, ax = plt.subplots()
ax.scatter(x=two_dim_data[indices, 0], y=two_dim_data[indices, 1])
# adding the legend
for x in indices:
    ax.text(x=two_dim_data[x, 0], y=two_dim_data[x, 1], s=sorted_values[x],
            fontsize=8)
ax.set_title(
    'multi-dimensional-scaling representation using a 3gram similarity matrix')
Text(0.5, 1.0, 'multi-dimensional-scaling representation using a 3gram similarity matrix')

Heatmap of the similarity matrix

We can also plot the similarity matrix for those observations as a heatmap:
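
The plotting code for the heatmap is not shown above; a possible sketch, reusing the indices selected earlier, is:

f2, ax2 = plt.subplots(figsize=(6, 6))
cax2 = ax2.matshow(transformed_values[indices, :][:, indices])
ax2.set_xticks(np.arange(len(indices)))
ax2.set_yticks(np.arange(len(indices)))
ax2.set_xticklabels(sorted_values[indices], rotation=60, ha='right')
ax2.set_yticklabels(sorted_values[indices])
ax2.xaxis.tick_bottom()
ax2.set_title('Similarities across categories')
f2.colorbar(cax2)
f2.tight_layout()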

Similarities across categories

As shown in the previous plot, the nearest neighbor of “Communication Equipment Technician” is “Telecommunication Technician”, although it is also very close to “Senior Supply Technician”. We therefore capture both the “communication” part (even when it is not present as a standalone word) and the “technician” part of this category.

Encoding categorical data using SimilarityEncoder

A typical data-science workflow uses one-hot encoding to represent categories.

from sklearn.preprocessing import OneHotEncoder

# encode only a subset of the observations
n_obs = 20
employee_position_titles = values['employee_position_title'].head(
    n_obs).to_frame()
categorical_encoder = OneHotEncoder(sparse=False)  # newer scikit-learn versions (>= 1.2) use sparse_output=False
one_hot_encoded = categorical_encoder.fit_transform(employee_position_titles)
f3, ax3 = plt.subplots(figsize=(6, 6))
ax3.matshow(one_hot_encoded)
ax3.set_title('Employee Position Title values, one-hot encoded')
ax3.axis('off')
f3.tight_layout()

The corresponding representation is very sparse.

SimilarityEncoder can be used to replace one-hot encoding while capturing the similarities:
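
The code producing the similarity-encoded figure is not shown above; a minimal sketch, mirroring the one-hot example on the same 20 observations, might be:

f4, ax4 = plt.subplots(figsize=(6, 6))
# fit a fresh encoder on the subset so the one fitted earlier is left untouched
similarity_encoded = SimilarityEncoder(similarity='ngram').fit_transform(
    employee_position_titles)
ax4.matshow(similarity_encoded)
ax4.set_title('Employee Position Title values, similarity encoded')
ax4.axis('off')
f4.tight_layout()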


Other examples in the dirty_cat documentation show how similarity encoding impacts prediction performance.
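
As a pointer, a supervised pipeline could look like the sketch below. The column name comes from the dataset above; the choice of Ridge regression and a simple train/test split are illustrative assumptions, not the documentation’s benchmark:

from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# similarity-encode the dirty column, then feed it to a linear model
pipeline = make_pipeline(
    make_column_transformer(
        (SimilarityEncoder(similarity='ngram'), ['employee_position_title']),
        remainder='drop'),
    Ridge(),
)

X = values[['employee_position_title']]
y = employee_salaries.y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # R² on held-out data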

