Feature interpretation with the GapEncoder

We illustrate here how categorical encodings obtained with the GapEncoder can be interpreted in terms of latent topics. As an example, we use the employee salaries dataset and encode the column Employee Position Title, which contains dirty categorical data.

Importing the data

We first get the dataset:

from dirty_cat.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
print(employee_salaries.description)
Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.

Now, we retrieve the dirty column to encode:

dirty_column = 'employee_position_title'
X_dirty = employee_salaries.X[[dirty_column]]
print(X_dirty.head(), end='\n\n')
print(f'Number of dirty entries = {len(X_dirty)}')
       employee_position_title
0  Office Services Coordinator
1        Master Police Officer
2             Social Worker IV
3       Resident Supervisor II
4      Planning Specialist III

Number of dirty entries = 9228
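
What makes this column dirty is that it contains many distinct, partially overlapping job titles. As a quick illustrative check (the exact count depends on the dataset version), we can count the unique entries:

print(f'Number of unique entries = {X_dirty[dirty_column].nunique()}')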

Encoding dirty job titles

We first create an instance of the GapEncoder with n_components=10:

from dirty_cat import GapEncoder
enc = GapEncoder(n_components=10, random_state=42)

Then we fit the model on the dirty categorical data and transform it to obtain encoded vectors of size 10:

X_enc = enc.fit_transform(X_dirty)
print(f'Shape of encoded vectors = {X_enc.shape}')
Shape of encoded vectors = (9228, 10)
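
Each row of X_enc is a vector of non-negative topic activations. As a minimal sketch (assuming, as above, that fit_transform returned a NumPy array), we can inspect the first entry:

import numpy as np

# Show the first job title and its 10 topic activations, rounded for readability
print(X_dirty[dirty_column].iloc[0])
print(np.round(X_enc[0], 2))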

Interpreting encoded vectors

The GapEncoder can be understood as a continuous encoding onto a set of latent topics estimated from the data. These topics are built by capturing combinations of substrings that frequently co-occur, and the encoded vectors correspond to their activations. To interpret the latent topics, we select, for each of them, the few input labels with the highest activations. In the example below we select 3 labels to summarize each topic.

topic_labels = enc.get_feature_names_out(n_labels=3)
for k, labels in enumerate(topic_labels):
    print(f'Topic n°{k}: {labels}')
Topic n°0: correctional, correction, warehouse
Topic n°1: administrative, specialist, principal
Topic n°2: services, officer, service
Topic n°3: coordinator, equipment, operator
Topic n°4: firefighter, rescuer, rescue
Topic n°5: management, enforcement, permitting
Topic n°6: technology, technician, mechanic
Topic n°7: community, sergeant, sheriff
Topic n°8: representative, accountant, auditor
Topic n°9: assistant, library, safety

As expected, topics capture labels that frequently co-occur. For instance, the labels firefighter, rescuer and rescue appear together in Firefighter/Rescuer III or Fire/Rescue Lieutenant.
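
We can check this co-occurrence directly on the raw data, with a small illustrative filter:

# List a few of the distinct job titles mentioning 'Fire' or 'Rescue'
mask = X_dirty[dirty_column].str.contains('Fire|Rescue')
print(X_dirty[dirty_column][mask].unique()[:5])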

This enables us to understand how different samples are encoded:

import matplotlib.pyplot as plt

# Encode the first 20 entries and plot their topic activations as a heatmap
encoded_labels = enc.transform(X_dirty[:20])
plt.figure(figsize=(8, 10))
plt.imshow(encoded_labels)
plt.xlabel('Latent topics', size=12)
plt.xticks(range(0, 10), labels=topic_labels, rotation=50, ha='right')
plt.ylabel('Data entries', size=12)
plt.yticks(range(0, 20), labels=X_dirty[:20].to_numpy().flatten())
plt.colorbar().set_label(label='Topic activations', size=12)
plt.tight_layout()
plt.show()
[Figure: heatmap of topic activations for the first 20 entries (x-axis: latent topics, y-axis: data entries)]

As we can see, each dirty category activates only a small number of topics. The topics can thus be reliably used to summarize each entry: they are, in effect, latent categories captured from the data.
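
To quantify this sparsity, here is a minimal sketch; the 10% cutoff is an arbitrary choice for illustration:

import numpy as np

# Normalize each row so that its activations sum to 1, then count the
# topics that account for more than 10% of an entry's total activation
activations = X_enc / X_enc.sum(axis=1, keepdims=True)
n_active = (activations > 0.1).sum(axis=1)
print(f'Average number of strongly activated topics: {n_active.mean():.2f}')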
