.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/02_investigating_dirty_categories.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_02_investigating_dirty_categories.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_02_investigating_dirty_categories.py:


Investigating and interpreting dirty categories
===============================================

What are dirty categorical variables and how can a good encoding help with
statistical learning?

We illustrate how categorical encodings obtained with the |Gap| can be
interpreted in terms of latent topics.

We use the employee salaries dataset as an example.

.. |Gap| replace:: :class:`~dirty_cat.GapEncoder`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |SE| replace:: :class:`~dirty_cat.SimilarityEncoder`

.. GENERATED FROM PYTHON SOURCE LINES 26-30

What do we mean by dirty categories?
------------------------------------

Let's look at the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 30-37

.. code-block:: default

    from dirty_cat import datasets

    employee_salaries = datasets.fetch_employee_salaries()
    print(employee_salaries.description)
    data = employee_salaries.X
    print(data.head(n=5))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.
      gender department  ... date_first_hired year_first_hired
    0      F        POL  ...       09/22/1986             1986
    1      M        POL  ...       09/12/1988             1988
    2      F        HHS  ...       11/19/1989             1989
    3      M        COR  ...       05/05/2014             2014
    4      M        HCA  ...       03/05/2007             2007

    [5 rows x 9 columns]

.. GENERATED FROM PYTHON SOURCE LINES 38-39

Here is the number of unique entries per column:

.. GENERATED FROM PYTHON SOURCE LINES 39-41

.. code-block:: default

    print(data.nunique())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    gender                        2
    department                   37
    department_name              37
    division                    694
    assignment_category           2
    employee_position_title     385
    underfilled_job_title        84
    date_first_hired           2264
    year_first_hired             51
    dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 42-43

As we can see, some columns have many unique entries:

.. GENERATED FROM PYTHON SOURCE LINES 43-45

.. code-block:: default

    print(data["employee_position_title"].value_counts().sort_index())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Abandoned Vehicle Code Enforcement Specialist         4
    Accountant/Auditor I                                  3
    Accountant/Auditor II                                 1
    Accountant/Auditor III                               35
    Administrative Assistant to the County Executive      1
                                                         ..
    Welder                                                3
    Work Force Leader I                                   1
    Work Force Leader II                                 28
    Work Force Leader III                                 2
    Work Force Leader IV                                  9
    Name: employee_position_title, Length: 385, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 46-65

These different entries are often variations of the same entity.
For example, there are 3 kinds of "Accountant/Auditor".

Such variations will break traditional categorical encoding methods:

* Using a simple |OneHotEncoder| will create orthogonal features, whereas it
  is clear that those 3 terms have a lot in common (see the short sketch
  after this list).

* If we wanted to use word embedding methods such as Word2vec, we would first
  have to go through a cleaning phase, because those algorithms are not
  trained to work on data such as "Accountant/Auditor I". Such cleaning is
  error-prone and time-consuming.
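To make the first point concrete, here is a small illustration (not part of
the original example) that applies scikit-learn's |OneHotEncoder| to the three
"Accountant/Auditor" variants. The three strings are hand-picked from the
column above; everything else is standard scikit-learn usage.

.. code-block:: python

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Three variants of the same job title, taken from the column above
    variants = np.array(
        [["Accountant/Auditor I"], ["Accountant/Auditor II"], ["Accountant/Auditor III"]]
    )

    # A one-hot encoding treats them as three unrelated categories
    one_hot = OneHotEncoder().fit_transform(variants).toarray()
    print(one_hot)

Each row contains a single 1 in its own column, so the dot product between any
two encoded variants is 0: the encoding carries no notion of how similar the
underlying strings are.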
The problem becomes easier if we can capture relationships between entries.

To simplify understanding, we will focus on the column describing the
employee's position title:

.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: default

    values = data[["employee_position_title", "gender"]]
    values.insert(0, "current_annual_salary", employee_salaries.y)

.. GENERATED FROM PYTHON SOURCE LINES 70-78

.. _example_similarity_encoder:

String similarity between entries
---------------------------------

That's where our encoders come into play. In order to robustly embed dirty
semantic data, the |SE| creates a similarity matrix based on an n-gram
representation of the data.

.. GENERATED FROM PYTHON SOURCE LINES 78-86

.. code-block:: default

    sorted_values = values["employee_position_title"].sort_values().unique()

    from dirty_cat import SimilarityEncoder

    similarity_encoder = SimilarityEncoder()
    transformed_values = similarity_encoder.fit_transform(sorted_values.reshape(-1, 1))

.. GENERATED FROM PYTHON SOURCE LINES 87-92

Plotting the new representation using multi-dimensional scaling
................................................................

Let's now plot a couple of points at random using a low-dimensional
representation to get an intuition of what the |SE| is doing:

.. GENERATED FROM PYTHON SOURCE LINES 92-102

.. code-block:: default

    from sklearn.manifold import MDS

    mds = MDS(dissimilarity="precomputed", n_init=10, random_state=42)
    two_dim_data = mds.fit_transform(1 - transformed_values)
    # transformed values lie in the 0-1 range,
    # so 1 - transformed_value yields a positive dissimilarity matrix
    print(two_dim_data.shape)
    print(sorted_values.shape)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/manifold/_mds.py:299: FutureWarning: The default value of `normalized_stress` will change to `'auto'` in version 1.4. To suppress this warning, manually set the value of `normalized_stress`.
      warnings.warn(
    (385, 2)
    (385,)

.. GENERATED FROM PYTHON SOURCE LINES 103-104

We first quickly fit a nearest-neighbors model so that the plot does not get
too busy:

.. GENERATED FROM PYTHON SOURCE LINES 104-118

.. code-block:: default

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    n_points = 5
    np.random.seed(42)

    random_points = np.random.choice(
        len(similarity_encoder.categories_[0]), n_points, replace=False
    )
    nn = NearestNeighbors(n_neighbors=2).fit(transformed_values)
    _, indices_ = nn.kneighbors(transformed_values[random_points])
    indices = np.unique(indices_.squeeze())

.. GENERATED FROM PYTHON SOURCE LINES 119-120

Then we plot it, adding the categories to the scatter plot:

.. GENERATED FROM PYTHON SOURCE LINES 120-135

.. code-block:: default

    import matplotlib.pyplot as plt

    f, ax = plt.subplots()
    ax.scatter(x=two_dim_data[indices, 0], y=two_dim_data[indices, 1])
    # adding the legend
    for x in indices:
        ax.text(
            x=two_dim_data[x, 0],
            y=two_dim_data[x, 1],
            s=sorted_values[x],
            fontsize=8,
        )
    ax.set_title("multi-dimensional-scaling representation using a 3gram similarity matrix")

.. image-sg:: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_001.png
   :alt: multi-dimensional-scaling representation using a 3gram similarity matrix
   :srcset: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Text(0.5, 1.0, 'multi-dimensional-scaling representation using a 3gram similarity matrix')
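To build an intuition for what such an n-gram similarity looks like, here is a
rough, hand-rolled character 3-gram Jaccard similarity. It is only a simplified
stand-in: the |SE| uses its own n-gram similarity, whose exact formula and
default n-gram range differ from this sketch.

.. code-block:: python

    def ngrams(string, n=3):
        """Return the set of character n-grams of a string."""
        padded = f" {string.lower()} "  # pad so word boundaries contribute n-grams
        return {padded[i : i + n] for i in range(len(padded) - n + 1)}


    def ngram_similarity(a, b, n=3):
        """Jaccard similarity between the n-gram sets of two strings."""
        grams_a, grams_b = ngrams(a, n), ngrams(b, n)
        return len(grams_a & grams_b) / len(grams_a | grams_b)


    print(ngram_similarity("Accountant/Auditor II", "Accountant/Auditor III"))
    print(ngram_similarity("Accountant/Auditor II", "Firefighter/Rescuer III"))

Close variants of the same job title share most of their 3-grams and therefore
get a high similarity, while unrelated titles share very few. The heatmap below
shows the same idea on the encoded categories.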
.. GENERATED FROM PYTHON SOURCE LINES 136-140

Heatmap of the similarity matrix
................................

We can also plot the similarity matrix for those observations:

.. GENERATED FROM PYTHON SOURCE LINES 140-152

.. code-block:: default

    f2, ax2 = plt.subplots(figsize=(6, 6))
    cax2 = ax2.matshow(transformed_values[indices, :][:, indices])
    ax2.set_yticks(np.arange(len(indices)))
    ax2.set_xticks(np.arange(len(indices)))
    ax2.set_yticklabels(sorted_values[indices], rotation=30)
    ax2.set_xticklabels(sorted_values[indices], rotation=60, ha="right")
    ax2.xaxis.tick_bottom()
    ax2.set_title("Similarities across categories")
    f2.colorbar(cax2)
    f2.tight_layout()

.. image-sg:: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_002.png
   :alt: Similarities across categories
   :srcset: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 153-159

As shown in the previous plot, the nearest neighbor of "Communication
Equipment Technician" is "Telecommunication Technician", although it is also
very close to "Senior Supply Technician": the encoding thus captures both the
"Communication" part (which does not appear as a standalone word in the
category) and the "Technician" part of this category.

.. GENERATED FROM PYTHON SOURCE LINES 162-165

Feature interpretation with the |Gap|
-------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 167-172

The |Gap| is a better encoder than the |SE| in the sense that it is both more
scalable and more interpretable, as we will now demonstrate.

First, let's retrieve the dirty column to encode:

.. GENERATED FROM PYTHON SOURCE LINES 172-178

.. code-block:: default

    dirty_column = "employee_position_title"
    X_dirty = data[[dirty_column]]
    print(X_dirty.head(), end="\n\n")
    print(f"Number of dirty entries = {len(X_dirty)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

           employee_position_title
    0  Office Services Coordinator
    1        Master Police Officer
    2             Social Worker IV
    3       Resident Supervisor II
    4      Planning Specialist III

    Number of dirty entries = 9228

.. GENERATED FROM PYTHON SOURCE LINES 179-185

.. _example_gap_encoder:

Encoding dirty job titles
.........................

Then, we'll create an instance of the |Gap| with 10 components:

.. GENERATED FROM PYTHON SOURCE LINES 185-190

.. code-block:: default

    from dirty_cat import GapEncoder

    enc = GapEncoder(n_components=10, random_state=42)

.. GENERATED FROM PYTHON SOURCE LINES 191-193

Finally, we'll fit the model on the dirty categorical data and transform it
in order to obtain encoded vectors of size 10:

.. GENERATED FROM PYTHON SOURCE LINES 193-197

.. code-block:: default

    X_enc = enc.fit_transform(X_dirty)
    print(f"Shape of encoded vectors = {X_enc.shape}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Shape of encoded vectors = (9228, 10)
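Because the |Gap| builds its encoding from substrings, the fitted encoder is
not limited to the job titles seen during ``fit``. As a quick check (the
misspelled title below is a made-up input, not an entry of the dataset), we
can project a new string onto the same 10 latent dimensions:

.. code-block:: python

    import pandas as pd

    # A hypothetical, misspelled variant that does not appear in the data
    new_entry = pd.DataFrame(
        {"employee_position_title": ["Telecomunication Technician II"]}
    )
    print(enc.transform(new_entry).round(2))

The resulting vector lives in the same topic space as ``X_enc``, so downstream
models can handle such spelling variants without any manual cleaning.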
.. GENERATED FROM PYTHON SOURCE LINES 198-208

Interpreting encoded vectors
............................

The |Gap| can be understood as a continuous encoding on a set of latent
topics estimated from the data. The latent topics are built by capturing
combinations of substrings that frequently co-occur, and the encoded vectors
correspond to their activations.
To interpret these latent topics, we select for each of them a few labels
from the input data with the highest activations.
In the example below we select 3 labels to summarize each topic.

.. GENERATED FROM PYTHON SOURCE LINES 208-214

.. code-block:: default

    topic_labels = enc.get_feature_names_out(n_labels=3)
    for k in range(len(topic_labels)):
        labels = topic_labels[k]
        print(f"Topic n°{k}: {labels}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Topic n°0: correctional, correction, warehouse
    Topic n°1: administrative, specialist, principal
    Topic n°2: services, officer, service
    Topic n°3: coordinator, equipment, operator
    Topic n°4: firefighter, rescuer, rescue
    Topic n°5: management, enforcement, permitting
    Topic n°6: technology, technician, mechanic
    Topic n°7: community, sergeant, sheriff
    Topic n°8: representative, accountant, auditor
    Topic n°9: assistant, library, safety

.. GENERATED FROM PYTHON SOURCE LINES 215-220

As expected, the topics capture labels that frequently co-occur. For instance,
the labels "firefighter", "rescuer" and "rescue" appear together in
"Firefighter/Rescuer III" or "Fire/Rescue Lieutenant".

This helps us understand how different samples are encoded:

.. GENERATED FROM PYTHON SOURCE LINES 220-232

.. code-block:: default

    encoded_labels = enc.transform(X_dirty[:20])
    plt.figure(figsize=(8, 10))
    plt.imshow(encoded_labels)
    plt.xlabel("Latent topics", size=12)
    plt.xticks(range(0, 10), labels=topic_labels, rotation=50, ha="right")
    plt.ylabel("Data entries", size=12)
    plt.yticks(range(0, 20), labels=X_dirty[:20].to_numpy().flatten())
    plt.colorbar().set_label(label="Topic activations", size=12)
    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_003.png
   :alt: 02 investigating dirty categories
   :srcset: /auto_examples/images/sphx_glr_02_investigating_dirty_categories_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 233-236

As we can see, each dirty category mostly activates a small number of topics.
These topics can therefore be used to reliably summarize each dirty category:
they are, in effect, latent categories captured from the data.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 9.756 seconds)


.. _sphx_glr_download_auto_examples_02_investigating_dirty_categories.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/02_investigating_dirty_categories.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 02_investigating_dirty_categories.py <02_investigating_dirty_categories.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 02_investigating_dirty_categories.ipynb <02_investigating_dirty_categories.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_