Feature interpretation with the GapEncoder¶
We illustrate here how categorical encodings obtained with the GapEncoder can be interpreted in terms of latent topics. As an example, we use the employee salaries dataset and encode the column Employee Position Title, which contains dirty categorical data.
We first get the dataset:
Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.
Now, we retrieve the dirty column to encode:
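The column-selection step can be sketched with pandas alone; the toy DataFrame below is a hypothetical stand-in for the real table:

```python
import pandas as pd

# Hypothetical stand-in rows for the employee salaries table.
df = pd.DataFrame({
    'employee_position_title': [
        'Office Services Coordinator',
        'Master Police Officer',
        'Social Worker IV',
    ],
    'salary': [55000, 80000, 60000],
})

# Double brackets keep the result a one-column DataFrame rather
# than a Series, which is the shape an encoder's fit expects.
X_dirty = df[['employee_position_title']]
print(X_dirty)
print(f'Number of dirty entries = {len(X_dirty)}')
```

On the real dataset the same selection yields the 9228 entries shown in the output.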
employee_position_title
0    Office Services Coordinator
1    Master Police Officer
2    Social Worker IV
3    Resident Supervisor II
4    Planning Specialist III

Number of dirty entries = 9228
Encoding dirty job titles¶
We first create an instance of the GapEncoder with n_components=10:
Then we fit the model on the dirty categorical data and transform it to obtain encoded vectors of size 10:
Shape of encoded vectors = (9228, 10)
Interpreting encoded vectors¶
The GapEncoder can be understood as a continuous encoding on a set of latent topics estimated from the data. The latent topics are built by capturing combinations of substrings that frequently co-occur, and encoded vectors correspond to their activations. To interpret these latent topics, we select for each of them a few labels from the input data with the highest activations. In the example below we select 3 labels to summarize each topic.
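The selection mechanism itself can be sketched with plain NumPy, using a toy activation matrix as a stand-in for the encoder's output (the labels and numbers below are hypothetical):

```python
import numpy as np

# Rows = input labels, columns = latent topics (hypothetical values).
labels = np.array(['Police Officer', 'Firefighter', 'Social Worker', 'Rescuer'])
activations = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
])

# For each topic, keep the n_labels entries with the highest activation.
n_labels = 2
for k in range(activations.shape[1]):
    top = np.argsort(activations[:, k])[::-1][:n_labels]
    print(f"Topic n°{k}: {', '.join(labels[top])}")
# Topic n°0: Police Officer, Social Worker
# Topic n°1: Rescuer, Firefighter
```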
Topic n°0: correctional, correction, warehouse
Topic n°1: administrative, specialist, principal
Topic n°2: services, officer, service
Topic n°3: coordinator, equipment, operator
Topic n°4: firefighter, rescuer, rescue
Topic n°5: management, enforcement, permitting
Topic n°6: technology, technician, mechanic
Topic n°7: community, sergeant, sheriff
Topic n°8: representative, accountant, auditor
Topic n°9: assistant, library, safety
As expected, the topics capture labels that frequently co-occur. For instance, the labels firefighter, rescuer and rescue appear together in job titles such as Firefighter/Rescuer III or Fire/Rescue Lieutenant.
This lets us understand the encoding of individual samples:
import matplotlib.pyplot as plt

# Encode the first 20 entries and plot their topic activations.
# `topic_labels` holds the per-topic label summaries built above.
encoded_labels = enc.transform(X_dirty[:20])

plt.figure(figsize=(8, 10))
plt.imshow(encoded_labels)
plt.xlabel('Latent topics', size=12)
plt.xticks(range(0, 10), labels=topic_labels, rotation=50, ha='right')
plt.ylabel('Data entries', size=12)
plt.yticks(range(0, 20), labels=X_dirty[:20].to_numpy().flatten())
plt.colorbar().set_label(label='Topic activations', size=12)
plt.tight_layout()
plt.show()
As we can see, each dirty category activates only a small number of latent topics. The topic labels can thus reliably summarize each encoded entry: the topics act as latent categories captured from the data.
Total running time of the script: (0 minutes 3.127 seconds)