.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/02_investigating_dirty_categories.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_02_investigating_dirty_categories.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_02_investigating_dirty_categories.py:


Investigating and interpreting dirty categories
===============================================

What are dirty categorical variables and how can a good encoding help with
statistical learning?

We illustrate how categorical encodings obtained with the |Gap| can be
interpreted in terms of latent topics.

We use the employee salaries dataset as an example.

.. |Gap| replace:: :class:`~dirty_cat.GapEncoder`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |SE| replace:: :class:`~dirty_cat.SimilarityEncoder`

.. GENERATED FROM PYTHON SOURCE LINES 26-30

What do we mean by dirty categories?
------------------------------------

Let's look at the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 30-37

.. code-block:: default

    from dirty_cat import datasets

    employee_salaries = datasets.fetch_employee_salaries()
    print(employee_salaries.description)
    data = employee_salaries.X
    print(data.head(n=5))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.
      gender department  ... date_first_hired year_first_hired
    0      F        POL  ...       09/22/1986             1986
    1      M        POL  ...       09/12/1988             1988
    2      F        HHS  ...       11/19/1989             1989
    3      M        COR  ...       05/05/2014             2014
    4      M        HCA  ...       03/05/2007             2007

    [5 rows x 9 columns]

.. GENERATED FROM PYTHON SOURCE LINES 38-39

Here is the number of unique entries per column:

.. GENERATED FROM PYTHON SOURCE LINES 39-41

.. code-block:: default

    print(data.nunique())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    gender                        2
    department                   37
    department_name              37
    division                    694
    assignment_category           2
    employee_position_title     385
    underfilled_job_title        84
    date_first_hired           2264
    year_first_hired             51
    dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 42-43

As we can see, some columns have many unique entries:

.. GENERATED FROM PYTHON SOURCE LINES 43-45

.. code-block:: default

    print(data["employee_position_title"].value_counts().sort_index())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Abandoned Vehicle Code Enforcement Specialist         4
    Accountant/Auditor I                                  3
    Accountant/Auditor II                                 1
    Accountant/Auditor III                               35
    Administrative Assistant to the County Executive      1
                                                         ..
    Welder                                                3
    Work Force Leader I                                   1
    Work Force Leader II                                 28
    Work Force Leader III                                 2
    Work Force Leader IV                                  9
    Name: employee_position_title, Length: 385, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 46-65

These different entries are often variations of the same entity.
For example, there are 3 kinds of "Accountant/Auditor".

Such variations will break traditional categorical encoding methods:

* Using a simple |OneHotEncoder| will create orthogonal features, whereas it
  is clear that those 3 terms have a lot in common (see the short sketch
  after this list).

* If we wanted to use word embedding methods such as Word2vec, we would first
  have to go through a cleaning phase, because those algorithms are not
  trained to work on data such as "Accountant/Auditor I". Such cleaning is
  error-prone and time-consuming.
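To make the first point concrete, here is a small illustration (not part of
the original example) that applies scikit-learn's |OneHotEncoder| to the three
"Accountant/Auditor" variants. The three strings are hand-picked from the
column above; everything else is standard scikit-learn usage.

.. code-block:: python

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Three variants of the same job title, taken from the column above
    variants = np.array(
        [["Accountant/Auditor I"], ["Accountant/Auditor II"], ["Accountant/Auditor III"]]
    )

    # A one-hot encoding treats them as three unrelated categories
    one_hot = OneHotEncoder().fit_transform(variants).toarray()
    print(one_hot)

Each row contains a single 1 in its own column, so the dot product between any
two encoded variants is 0: the encoding carries no notion of how similar the
underlying strings are.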
The problem becomes easier if we can capture relationships between entries.

To simplify understanding, we will focus on the column describing the
employee's position title:

.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: default

    values = data[["employee_position_title", "gender"]]
    values.insert(0, "current_annual_salary", employee_salaries.y)

.. GENERATED FROM PYTHON SOURCE LINES 70-78

.. _example_similarity_encoder:

String similarity between entries
---------------------------------

That's where our encoders come into play. In order to robustly embed dirty
semantic data, the |SE| creates a similarity matrix based on an n-gram
representation of the data.

.. GENERATED FROM PYTHON SOURCE LINES 78-86

.. code-block:: default

    sorted_values = values["employee_position_title"].sort_values().unique()

    from dirty_cat import SimilarityEncoder

    similarity_encoder = SimilarityEncoder()
    transformed_values = similarity_encoder.fit_transform(sorted_values.reshape(-1, 1))

.. GENERATED FROM PYTHON SOURCE LINES 87-92

Plotting the new representation using multi-dimensional scaling
................................................................

Let's now plot a couple of points at random using a low-dimensional
representation to get an intuition of what the |SE| is doing:

.. GENERATED FROM PYTHON SOURCE LINES 92-102

.. code-block:: default

    from sklearn.manifold import MDS

    mds = MDS(dissimilarity="precomputed", n_init=10, random_state=42)
    two_dim_data = mds.fit_transform(1 - transformed_values)
    # transformed values lie in the 0-1 range,
    # so 1 - transformed_value yields a positive dissimilarity matrix
    print(two_dim_data.shape)
    print(sorted_values.shape)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/manifold/_mds.py:299: FutureWarning: The default value of `normalized_stress` will change to `'auto'` in version 1.4. To suppress this warning, manually set the value of `normalized_stress`.
      warnings.warn(
    (385, 2)
    (385,)

.. GENERATED FROM PYTHON SOURCE LINES 103-104

We first quickly fit a nearest-neighbors model so that the plot does not get
too busy:

.. GENERATED FROM PYTHON SOURCE LINES 104-118

.. code-block:: default

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    n_points = 5
    np.random.seed(42)

    random_points = np.random.choice(
        len(similarity_encoder.categories_[0]), n_points, replace=False
    )
    nn = NearestNeighbors(n_neighbors=2).fit(transformed_values)
    _, indices_ = nn.kneighbors(transformed_values[random_points])
    indices = np.unique(indices_.squeeze())

.. GENERATED FROM PYTHON SOURCE LINES 119-120

Then we plot it, adding the categories to the scatter plot:

.. GENERATED FROM PYTHON SOURCE LINES 120-135

.. code-block:: default

    import matplotlib.pyplot as plt

    f, ax = plt.subplots()
    ax.scatter(x=two_dim_data[indices, 0], y=two_dim_data[indices, 1])
    # adding the legend
    for x in indices:
        ax.text(
            x=two_dim_data[x, 0],
            y=two_dim_data[x, 1],
            s=sorted_values[x],
            fontsize=8,
        )
    ax.set_title("multi-dimensional-scaling representation using a 3gram similarity matrix")

.. image-sg:: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_001.png
   :alt: multi-dimensional-scaling representation using a 3gram similarity matrix
   :srcset: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Text(0.5, 1.0, 'multi-dimensional-scaling representation using a 3gram similarity matrix')
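To build an intuition for what such an n-gram similarity looks like, here is a
rough, hand-rolled character 3-gram Jaccard similarity. It is only a simplified
stand-in: the |SE| uses its own n-gram similarity, whose exact formula and
default n-gram range differ from this sketch.

.. code-block:: python

    def ngrams(string, n=3):
        """Return the set of character n-grams of a string."""
        padded = f" {string.lower()} "  # pad so word boundaries contribute n-grams
        return {padded[i : i + n] for i in range(len(padded) - n + 1)}


    def ngram_similarity(a, b, n=3):
        """Jaccard similarity between the n-gram sets of two strings."""
        grams_a, grams_b = ngrams(a, n), ngrams(b, n)
        return len(grams_a & grams_b) / len(grams_a | grams_b)


    print(ngram_similarity("Accountant/Auditor II", "Accountant/Auditor III"))
    print(ngram_similarity("Accountant/Auditor II", "Firefighter/Rescuer III"))

Close variants of the same job title share most of their 3-grams and therefore
get a high similarity, while unrelated titles share very few. The heatmap below
shows the same idea on the encoded categories.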
.. GENERATED FROM PYTHON SOURCE LINES 136-140

Heatmap of the similarity matrix
................................

We can also plot the similarity matrix for those observations:

.. GENERATED FROM PYTHON SOURCE LINES 140-152

.. code-block:: default

    f2, ax2 = plt.subplots(figsize=(6, 6))
    cax2 = ax2.matshow(transformed_values[indices, :][:, indices])
    ax2.set_yticks(np.arange(len(indices)))
    ax2.set_xticks(np.arange(len(indices)))
    ax2.set_yticklabels(sorted_values[indices], rotation=30)
    ax2.set_xticklabels(sorted_values[indices], rotation=60, ha="right")
    ax2.xaxis.tick_bottom()
    ax2.set_title("Similarities across categories")
    f2.colorbar(cax2)
    f2.tight_layout()

.. image-sg:: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_002.png
   :alt: Similarities across categories
   :srcset: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 153-159

As shown in the previous plot, the nearest neighbor of "Communication
Equipment Technician" is "Telecommunication Technician", although it is also
very close to "Senior Supply Technician": the encoding thus captures both the
"Communication" part (which does not appear as a standalone word in the
category) and the "Technician" part of this category.

.. GENERATED FROM PYTHON SOURCE LINES 162-165

Feature interpretation with the |Gap|
-------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 167-172

The |Gap| is a better encoder than the |SE| in the sense that it is both more
scalable and more interpretable, as we will now demonstrate.

First, let's retrieve the dirty column to encode:

.. GENERATED FROM PYTHON SOURCE LINES 172-178

.. code-block:: default

    dirty_column = "employee_position_title"
    X_dirty = data[[dirty_column]]
    print(X_dirty.head(), end="\n\n")
    print(f"Number of dirty entries = {len(X_dirty)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

           employee_position_title
    0  Office Services Coordinator
    1        Master Police Officer
    2             Social Worker IV
    3       Resident Supervisor II
    4      Planning Specialist III

    Number of dirty entries = 9228

.. GENERATED FROM PYTHON SOURCE LINES 179-185

.. _example_gap_encoder:

Encoding dirty job titles
.........................

Then, we'll create an instance of the |Gap| with 10 components:

.. GENERATED FROM PYTHON SOURCE LINES 185-190

.. code-block:: default

    from dirty_cat import GapEncoder

    enc = GapEncoder(n_components=10, random_state=42)

.. GENERATED FROM PYTHON SOURCE LINES 191-193

Finally, we'll fit the model on the dirty categorical data and transform it
in order to obtain encoded vectors of size 10:

.. GENERATED FROM PYTHON SOURCE LINES 193-197

.. code-block:: default

    X_enc = enc.fit_transform(X_dirty)
    print(f"Shape of encoded vectors = {X_enc.shape}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Shape of encoded vectors = (9228, 10)
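Because the |Gap| builds its encoding from substrings, the fitted encoder is
not limited to the job titles seen during ``fit``. As a quick check (the
misspelled title below is a made-up input, not an entry of the dataset), we
can project a new string onto the same 10 latent dimensions:

.. code-block:: python

    import pandas as pd

    # A hypothetical, misspelled variant that does not appear in the data
    new_entry = pd.DataFrame(
        {"employee_position_title": ["Telecomunication Technician II"]}
    )
    print(enc.transform(new_entry).round(2))

The resulting vector lives in the same topic space as ``X_enc``, so downstream
models can handle such spelling variants without any manual cleaning.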
.. GENERATED FROM PYTHON SOURCE LINES 198-208

Interpreting encoded vectors
............................

The |Gap| can be understood as a continuous encoding on a set of latent
topics estimated from the data. The latent topics are built by capturing
combinations of substrings that frequently co-occur, and the encoded vectors
correspond to their activations.
To interpret these latent topics, we select for each of them a few labels
from the input data with the highest activations.
In the example below we select 3 labels to summarize each topic.

.. GENERATED FROM PYTHON SOURCE LINES 208-214

.. code-block:: default

    topic_labels = enc.get_feature_names_out(n_labels=3)
    for k in range(len(topic_labels)):
        labels = topic_labels[k]
        print(f"Topic n°{k}: {labels}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Topic n°0: correctional, correction, warehouse
    Topic n°1: administrative, specialist, principal
    Topic n°2: services, officer, service
    Topic n°3: coordinator, equipment, operator
    Topic n°4: firefighter, rescuer, rescue
    Topic n°5: management, enforcement, permitting
    Topic n°6: technology, technician, mechanic
    Topic n°7: community, sergeant, sheriff
    Topic n°8: representative, accountant, auditor
    Topic n°9: assistant, library, safety

.. GENERATED FROM PYTHON SOURCE LINES 215-220

As expected, the topics capture labels that frequently co-occur. For instance,
the labels "firefighter", "rescuer" and "rescue" appear together in
"Firefighter/Rescuer III" or "Fire/Rescue Lieutenant".

This helps us understand how different samples are encoded:

.. GENERATED FROM PYTHON SOURCE LINES 220-232

.. code-block:: default

    encoded_labels = enc.transform(X_dirty[:20])
    plt.figure(figsize=(8, 10))
    plt.imshow(encoded_labels)
    plt.xlabel("Latent topics", size=12)
    plt.xticks(range(0, 10), labels=topic_labels, rotation=50, ha="right")
    plt.ylabel("Data entries", size=12)
    plt.yticks(range(0, 20), labels=X_dirty[:20].to_numpy().flatten())
    plt.colorbar().set_label(label="Topic activations", size=12)
    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_003.png
   :alt: 02 investigating dirty categories
   :srcset: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 233-236

As we can see, each dirty category mostly activates a small number of topics.
These topics can therefore be used to reliably summarize each dirty category:
they are, in effect, latent categories captured from the data.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 9.756 seconds)


.. _sphx_glr_download_auto_examples_02_investigating_dirty_categories.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/02_investigating_dirty_categories.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 02_investigating_dirty_categories.py <02_investigating_dirty_categories.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 02_investigating_dirty_categories.ipynb <02_investigating_dirty_categories.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_