.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/05_deduplication.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code
        or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_05_deduplication.py:


Deduplicating misspelled categories with deduplicate
====================================================

Real-world datasets often come with slight misspellings in the category
names, for instance when the category is entered manually. Such misspellings
break many data-analysis steps that require exact matching, such as a
'GROUP BY'.

Merging the multiple variants of the same category or entity is known as
*deduplication*. It is performed by the |dd| function.

Deduplication relies on *unsupervised learning* to find structure in data
without being given explicit labels/categories a priori.
Specifically, clustering strings by the distances between them finds groups
of strings that are similar to each other (e.g. differ only by a misspelling),
which gives us an easy tool to flag potentially misspelled category names
in an unsupervised manner.

.. |dd| replace:: :func:`~dirty_cat.deduplicate`

.. GENERATED FROM PYTHON SOURCE LINES 25-37

An example dataset
-------------------

Imagine the following example:
as data scientists, our job is to analyze the data from a hospital ward.
We notice that most of the cases involve the prescription of one of three
different medications: "Contrivan", "Genericon", or "Zipholan".
However, data entry is manual and, either because the prescribing doctor's
handwriting was hard to decipher or because of mistakes during data input,
there are multiple spelling mistakes for these three medications.

Let's generate some example data that demonstrate this.

..
GENERATED FROM PYTHON SOURCE LINES 37-65

.. code-block:: default

    import numpy as np
    from dirty_cat.datasets import make_deduplication_data

    # our three medication names
    medications = ["Contrivan", "Genericon", "Zipholan"]
    entries_per_medications = [500, 100, 1500]

    # 5% probability of a typo per letter
    prob_mistake_per_letter = 0.05

    duplicated_names = make_deduplication_data(
        medications,
        entries_per_medications,
        prob_mistake_per_letter,
        random_state=42,  # set seed for reproducibility
    )
    # we extract the unique medication names in the data & how often they appear
    unique_examples, counts = np.unique(duplicated_names, return_counts=True)
    # and build a series out of them
    import pandas as pd

    ex_series = pd.Series(counts, index=unique_examples)

    # This is our data:
    ex_series.head()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Ciltrivan    1
    Cjntrivan    1
    Cmntrivan    1
    Coatrivan    1
    Cobtvivan    1
    dtype: int64


.. GENERATED FROM PYTHON SOURCE LINES 66-68

Visualize the data
------------------

.. GENERATED FROM PYTHON SOURCE LINES 68-75

.. code-block:: default

    import matplotlib.pyplot as plt

    ex_series.plot.barh(figsize=(10, 15))
    plt.xlabel("Medication name")
    plt.ylabel("Counts")


.. image-sg:: /auto_examples/images/sphx_glr_05_deduplication_001.png
   :alt: 05 deduplication
   :srcset: /auto_examples/images/sphx_glr_05_deduplication_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Text(33.222222222222214, 0.5, 'Counts')


.. GENERATED FROM PYTHON SOURCE LINES 76-83

We can now clearly see the structure of the data: the three original
medications are the most common ones, but there are many spelling
mistakes and hence many slight variations of the names of the original
medications.

The idea is to use the fact that each misspelled medication name will be
closest, in string distance, to either the correctly or incorrectly spelled
original medication name, so that the names form clusters.

..
GENERATED FROM PYTHON SOURCE LINES 85-92

We can visualize the pair-wise distance between all medication names
--------------------------------------------------------------------

Below we use a heatmap to visualize the pairwise distance between medication
names. A darker color means that two medication names are closer together
(i.e. more similar); a lighter color means a larger distance. We can see
that we are dealing with three clusters: the original medication names and
the misspellings that cluster around them.

.. GENERATED FROM PYTHON SOURCE LINES 92-106

.. code-block:: default

    from dirty_cat import compute_ngram_distance
    from scipy.spatial.distance import squareform

    ngram_distances = compute_ngram_distance(unique_examples)
    square_distances = squareform(ngram_distances)

    import seaborn as sns

    fig, axes = plt.subplots(1, 1, figsize=(12, 12))
    sns.heatmap(
        square_distances, yticklabels=ex_series.index, xticklabels=ex_series.index, ax=axes
    )


.. image-sg:: /auto_examples/images/sphx_glr_05_deduplication_002.png
   :alt: 05 deduplication
   :srcset: /auto_examples/images/sphx_glr_05_deduplication_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none



.. GENERATED FROM PYTHON SOURCE LINES 107-118

.. _example_deduplication:

Deduplication: suggest corrections of misspelled names
------------------------------------------------------

The |dd| function uses clustering based on string similarities
to group duplicated names.

The number of clusters will need some adjustment depending on the data you have.
If no fixed number of clusters is given, |dd| tries to set it automatically
via the `silhouette score `_.

.. GENERATED FROM PYTHON SOURCE LINES 118-123

.. code-block:: default

    from dirty_cat import deduplicate

    deduplicated_data = deduplicate(duplicated_names)

.. GENERATED FROM PYTHON SOURCE LINES 124-125

We can visualize the distribution of categories in the deduplicated data:

.. GENERATED FROM PYTHON SOURCE LINES 125-135

..
code-block:: default

    deduplicated_unique_examples, deduplicated_counts = np.unique(
        deduplicated_data, return_counts=True
    )
    deduplicated_series = pd.Series(deduplicated_counts, index=deduplicated_unique_examples)

    deduplicated_series.plot.barh(figsize=(10, 15))
    plt.xlabel("Medication name")
    plt.ylabel("Counts")


.. image-sg:: /auto_examples/images/sphx_glr_05_deduplication_003.png
   :alt: 05 deduplication
   :srcset: /auto_examples/images/sphx_glr_05_deduplication_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Text(38.722222222222214, 0.5, 'Counts')


.. GENERATED FROM PYTHON SOURCE LINES 136-144

In this example we can correct all spelling mistakes by using the ideal
number of clusters as determined by the silhouette score.

However, the translation/deduplication often won't be perfect and will
require some tweaks.
In this case, we can construct and update a translation table based on the
data returned by |dd|.
It consists of the (potentially) misspelled category names as indices and the
(potentially) correct categories as values.

.. GENERATED FROM PYTHON SOURCE LINES 144-151

.. code-block:: default

    # create a table that maps original -> corrected categories
    translation_table = pd.Series(deduplicated_data, index=duplicated_names)

    # remove duplicates in the original data
    translation_table = translation_table[~translation_table.index.duplicated(keep="first")]

.. GENERATED FROM PYTHON SOURCE LINES 152-155

Since the number of correct spellings will likely be much smaller than the
number of original categories, we can print the estimated clusters and
their most common exemplars (the guessed correct spellings):

.. GENERATED FROM PYTHON SOURCE LINES 155-168

.. code-block:: default

    def print_corrections(spell_correct):
        correct = np.unique(spell_correct.values)
        for c in correct:
            print(
                f"Guessed correct spelling: {c!r} for "
                f"{spell_correct[spell_correct==c].index.values}"
            )


    print_corrections(translation_table)

..
rst-class:: sphx-glr-script-out

 .. code-block:: none

    Guessed correct spelling: 'Contrivan' for ['Contrivan' 'Coctrivan' 'Contriwan' 'Conthivan' 'tontrivan' 'Contrivap'
     'Cmntrivan' 'Cortrivan' 'Cjntrivan' 'Contrisan' 'qontrivan' 'Contrivxn'
     'Csntrivan' 'Conzrivan' 'Cwntrivan' 'Contrizan' 'Coezrivan' 'Contriuan'
     'Contrivaw' 'Ciltrivan' 'Contruvan' 'Contravan' 'Coztrivaz' 'Coatrivan'
     'Contrioan' 'Cobtvivan' 'pontrivan']
    Guessed correct spelling: 'Genericon' for ['Genericon' 'Generyhon' 'Genmricon' 'uenericon' 'aenericon']
    Guessed correct spelling: 'Zipholan' for ['Zipholan' 'Ziphglan' 'eipholan' 'Ziphvlan' 'Ziwholan' 'Ziphocan'
     'lipholan' 'zipholan' 'Zipholsn' 'Zieholan' 'Zivhoyan' 'Ziphonan'
     'Zopholan' 'Ziphoqan' 'Zipholnn' 'Ziphotan' 'Zipeolan' 'Zipholln'
     'Zipholap' 'Zzpholan' 'zppholan' 'Zipholau' 'gipholan' 'Zidholan'
     'Zaeholan' 'Zwpholan' 'Zipholaz' 'jipholan' 'Zvpholan' 'dipholan'
     'Ziphblan' 'Ziptolan' 'Ziphojan' 'Zizholan' 'sipholzn' 'Ziaholan'
     'Zipxolan' 'Zipholen' 'Zmpholan' 'Zipdolan' 'Zipholpn' 'Zxppolan'
     'Ziphwlan' 'bipholan' 'Ziqholan' 'tipholan' 'Zipholax' 'Zilholan'
     'ripholan' 'Ziphclan' 'Ziaholaa' 'Ziphosan' 'Zikholan' 'Ziphocaa'
     'Zapholan' 'Zhjholan' 'mipholan']


.. GENERATED FROM PYTHON SOURCE LINES 169-172

If we want to adapt the translation table post hoc, we can easily
modify it manually and apply it, for instance by changing the
correspondence for the last entry as follows:

.. GENERATED FROM PYTHON SOURCE LINES 172-176

.. code-block:: default

    translation_table.iloc[-1] = "Completely new category"
    new_deduplicated_names = translation_table[duplicated_names]
    assert (new_deduplicated_names == "Completely new category").sum() > 0

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 4.931 seconds)


.. _sphx_glr_download_auto_examples_05_deduplication.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: binder-badge

      ..
image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/05_deduplication.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 05_deduplication.py <05_deduplication.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 05_deduplication.ipynb <05_deduplication.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_