.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/04_fuzzy_joining_and_FeatureAugmenter.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_04_fuzzy_joining_and_FeatureAugmenter.py: Fuzzy joining dirty tables and the FeatureAugmenter ===================================================== Here we show how to combine data from different sources, with a vocabulary not well normalized. Joining is difficult: one entry on one side does not have an exact match on the other side. The |fj| function enables to join tables without cleaning the data by accounting for the label variations. To illustrate, we will join data from the `2022 World Happiness Report `_. with tables provided in `the World Bank open data platform `_ in order to create a first prediction model. Moreover, the |fa| is a scikit-learn Transformer that makes it easy to use such fuzzy joining multiple tables to bring in information in a machine-learning pipeline. In particular, it enables tuning parameters of |fj| to find the matches that maximize prediction accuracy. .. |fj| replace:: :func:`~dirty_cat.fuzzy_join` .. |fa| replace:: :func:`~dirty_cat.FeatureAugmenter` .. GENERATED FROM PYTHON SOURCE LINES 30-34 Data Importing and preprocessing -------------------------------- We import the happiness score table first: .. GENERATED FROM PYTHON SOURCE LINES 34-42 .. code-block:: default import pandas as pd df = pd.read_csv( "https://raw.githubusercontent.com/dirty-cat/datasets/master/data/Happiness_report_2022.csv", thousands=",", ) df.drop(df.tail(1).index, inplace=True) .. GENERATED FROM PYTHON SOURCE LINES 43-44 Let's look at the table: .. GENERATED FROM PYTHON SOURCE LINES 44-46 .. code-block:: default df.head(3) .. raw:: html

	RANK	Country	Happiness score	Whisker-high	Whisker-low	Dystopia (1.83) + residual	Explained by: GDP per capita	Explained by: Social support	Explained by: Healthy life expectancy	Explained by: Freedom to make life choices	Explained by: Generosity	Explained by: Perceptions of corruption
0	1	Finland	7821.0	7886.0	7756.0	2518.0	1892.0	1258.0	775.0	736.0	109.0	534.0
1	2	Denmark	7636.0	7710.0	7563.0	2226.0	1953.0	1243.0	777.0	719.0	188.0	532.0
2	3	Iceland	7557.0	7651.0	7464.0	2320.0	1936.0	1320.0	803.0	718.0	270.0	191.0

.. GENERATED FROM PYTHON SOURCE LINES 47-51 This is a table that contains the happiness index of a country along with some of the possible explanatory factors: GDP per capita, Social support, Generosity etc. .. GENERATED FROM PYTHON SOURCE LINES 53-55 For the sake of this example, we only keep the country names and our variable of interest: the 'Happiness score'. .. GENERATED FROM PYTHON SOURCE LINES 55-57 .. code-block:: default df = df[["Country", "Happiness score"]] .. GENERATED FROM PYTHON SOURCE LINES 58-67 Additional tables from other sources ------------------------------------ Now, we need to include explanatory factors from other sources, to complete our covariates (X table). Interesting tables can be found on `the World Bank open data platform `_, for which we have a downloading function: .. GENERATED FROM PYTHON SOURCE LINES 67-69 .. code-block:: default from dirty_cat.datasets import fetch_world_bank_indicator .. GENERATED FROM PYTHON SOURCE LINES 70-71 We extract the table containing GDP per capita by country: .. GENERATED FROM PYTHON SOURCE LINES 71-74 .. code-block:: default gdppc = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X gdppc.head(3) .. raw:: html

	Country Name	GDP per capita (current US$)
0	Aruba	29342.100858
1	Africa Eastern and Southern	1549.772730
2	Afghanistan	368.754614

.. GENERATED FROM PYTHON SOURCE LINES 75-76 Then another table, with life expectancy by country: .. GENERATED FROM PYTHON SOURCE LINES 76-79 .. code-block:: default life_exp = fetch_world_bank_indicator("SP.DYN.LE00.IN", "life_exp").X life_exp.head(3) .. raw:: html

	Country Name	Life expectancy at birth, total (years)
0	Aruba	75.723000
1	Africa Eastern and Southern	63.313856
2	Afghanistan	62.575000

.. GENERATED FROM PYTHON SOURCE LINES 80-81 And a table with legal rights strength by country: .. GENERATED FROM PYTHON SOURCE LINES 81-84 .. code-block:: default legal_rights = fetch_world_bank_indicator("IC.LGL.CRED.XQ").X legal_rights.head(3) .. raw:: html

	Country Name	Strength of legal rights index (0=weak to 12=strong)
0	Africa Eastern and Southern	4.538462
1	Afghanistan	10.000000
2	Africa Western and Central	5.863636

.. GENERATED FROM PYTHON SOURCE LINES 85-91 A correspondance problem ------------------------ Alas, the entries for countries do not perfectly match between our original table (df), and those that we downloaded from the worldbank (gdppc): .. GENERATED FROM PYTHON SOURCE LINES 91-94 .. code-block:: default df.sort_values(by="Country").tail(7) .. raw:: html

	Country	Happiness score
29	Uruguay	6474.0
52	Uzbekistan	6063.0
107	Venezuela	4925.0
76	Vietnam	5485.0
131	Yemen*	4197.0
136	Zambia	3760.0
143	Zimbabwe	2995.0

.. GENERATED FROM PYTHON SOURCE LINES 95-97 .. code-block:: default gdppc.sort_values(by="Country Name").tail(7) .. raw:: html

	Country Name	GDP per capita (current US$)
253	Vietnam	3756.489121
252	Virgin Islands (U.S.)	39552.168595
193	West Bank and Gaza	3663.969055
255	World	12236.615743
258	Yemen, Rep.	701.714878
260	Zambia	1137.343633
261	Zimbabwe	1773.920411

.. GENERATED FROM PYTHON SOURCE LINES 98-103 We can see that Yemen is written "Yemen*" on one side, and "Yemen, Rep." on the other. We also have entries that probably do not have correspondances: "World" on one side, whereas the other table only has country-level data. .. GENERATED FROM PYTHON SOURCE LINES 105-111 Joining tables with imperfect correspondance -------------------------------------------- We will now join our initial table, df, with the 3 additional ones that we have extracted. .. GENERATED FROM PYTHON SOURCE LINES 113-119 .. _example_fuzzy_join: 1. Joining GDP per capita table ............................... To join them with dirty_cat, we only need to do the following: .. GENERATED FROM PYTHON SOURCE LINES 119-137 .. code-block:: default from dirty_cat import fuzzy_join # We will ignore the warnings: import warnings warnings.filterwarnings("ignore") df1 = fuzzy_join( df, # our table to join gdppc, # the table to join with left_on="Country", # the first join key column right_on="Country Name", # the second join key column return_score=True, ) df1.tail(20) # We merged the first WB table to our initial one. .. raw:: html

	Country	Happiness score	Country Name	GDP per capita (current US$)	matching_score
126	Sri Lanka	4362.0	Sri Lanka	4013.687657	1.000000
127	Madagascar*	4339.0	Madagascar	500.511032	0.779244
128	Egypt	4288.0	Egypt, Arab Rep.	3698.834981	0.627379
129	Chad*	4251.0	Chad	685.690284	0.658960
130	Ethiopia	4241.0	Ethiopia	925.077428	1.000000
131	Yemen*	4197.0	Yemen, Rep.	701.714878	0.626980
132	Mauritania*	4153.0	Mauritania	2166.046799	0.796150
133	Jordan	4152.0	Jordan	4103.258966	1.000000
134	Togo	4112.0	Togo	973.206129	1.000000
135	India	3777.0	India	2256.590409	1.000000
136	Zambia	3760.0	Zambia	1137.343633	1.000000
137	Malawi	3750.0	Malawi	634.835660	1.000000
138	Tanzania	3702.0	Tanzania	1099.287598	1.000000
139	Sierra Leone	3574.0	Sierra Leone	480.039211	1.000000
140	Lesotho*	3512.0	Lesotho	1094.098185	0.736366
141	Botswana*	3471.0	Botswana	6805.221274	0.780092
142	Rwanda*	3268.0	Rwanda	822.347989	0.735689
143	Zimbabwe	2995.0	Zimbabwe	1773.920411	1.000000
144	Lebanon	2955.0	Lebanon	4136.146347	1.000000
145	Afghanistan	2404.0	Afghanistan	368.754614	1.000000

.. GENERATED FROM PYTHON SOURCE LINES 138-142 .. topic:: Note: We fix the ``return_score`` parameter to `True` so as to keep the matching score, that we will use later to show what are the worst matches. .. GENERATED FROM PYTHON SOURCE LINES 144-160 We see that our |fj| succesfully identified the countries, even though some country names differ between tables. For instance, "Czechia" is well identified as "Czech Republic" and "Luxembourg*" as "Luxembourg". .. topic:: Note: This would all be missed out if we were using other methods such as `pandas.merge `_, which can only find exact matches. In this case, to reach the best result, we would have to `manually` clean the data (e.g. remove the * after country name) and look for matching patterns in every observation. Let's do some more inspection of the merging done. .. GENERATED FROM PYTHON SOURCE LINES 163-165 Let's print the four worst matches, which will give us an overview of the situation: .. GENERATED FROM PYTHON SOURCE LINES 165-168 .. code-block:: default df1.sort_values("matching_score").head(4) .. raw:: html

	Country	Happiness score	Country Name	GDP per capita (current US$)	matching_score
87	Ivory Coast	5235.0	East Asia & Pacific	13041.781186	0.500000
121	Palestinian Territories*	4483.0	Palau	12083.888149	0.501106
94	Laos	5140.0	Lao PDR	2535.623432	0.562278
111	Turkey	4744.0	Turkiye	9661.235975	0.567053

.. GENERATED FROM PYTHON SOURCE LINES 169-172 We see that some matches were unsuccesful (e.g "Palestinian Territories*" and "Palau"), because there is simply no match in the two tables. .. GENERATED FROM PYTHON SOURCE LINES 174-177 In this case, it is better to use the threshold parameter so as to include only precise-enough matches: .. GENERATED FROM PYTHON SOURCE LINES 177-187 .. code-block:: default df1 = fuzzy_join( df, gdppc, left_on="Country", right_on="Country Name", match_score=0.35, return_score=True, ) df1.sort_values("matching_score").head(4) .. raw:: html

	Country	Happiness score	Country Name	GDP per capita (current US$)	matching_score
87	Ivory Coast	5235.0	East Asia & Pacific	13041.781186	0.500000
121	Palestinian Territories*	4483.0	Palau	12083.888149	0.501106
94	Laos	5140.0	Lao PDR	2535.623432	0.562278
111	Turkey	4744.0	Turkiye	9661.235975	0.567053

.. GENERATED FROM PYTHON SOURCE LINES 188-190 Matches that are not available (or precise enough) are marked as `NaN`. We will remove them using the drop_unmatched parameter: .. GENERATED FROM PYTHON SOURCE LINES 190-202 .. code-block:: default df1 = fuzzy_join( df, gdppc, left_on="Country", right_on="Country Name", match_score=0.35, drop_unmatched=True, ) df1.drop(columns=["Country Name"], inplace=True) .. GENERATED FROM PYTHON SOURCE LINES 203-205 We can finally plot and look at the link between GDP per capital and happiness: .. GENERATED FROM PYTHON SOURCE LINES 205-222 .. code-block:: default import matplotlib.pyplot as plt import seaborn as sns sns.set_context("notebook") plt.figure(figsize=(4, 3)) ax = sns.regplot( data=df1, x="GDP per capita (current US$)", y="Happiness score", lowess=True, ) ax.set_ylabel("Happiness index") ax.set_title("Is a higher GDP per capita linked to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_and_FeatureAugmenter_001.png :alt: Is a higher GDP per capita linked to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_and_FeatureAugmenter_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 223-227 It seems that the happiest countries are those having a high GDP per capita. However, unhappy countries do not have only low levels of GDP per capita. We have to search for other patterns. .. GENERATED FROM PYTHON SOURCE LINES 229-234 2. Joining life expectancy table ................................ Now let's include other information that may be relevant, such as in the life_exp table: .. GENERATED FROM PYTHON SOURCE LINES 234-246 .. code-block:: default df2 = fuzzy_join( df1, life_exp, left_on="Country", right_on="Country Name", match_score=0.45, ) df2.drop(columns=["Country Name"], inplace=True) df2.head(3) .. raw:: html

	Country	Happiness score	GDP per capita (current US$)	Life expectancy at birth, total (years)
0	Finland	7821.0	53654.750296	82.131707
1	Denmark	7636.0	68007.756673	81.551220
2	Iceland	7557.0	68727.636665	83.065854

.. GENERATED FROM PYTHON SOURCE LINES 247-248 Let's plot this relation: .. GENERATED FROM PYTHON SOURCE LINES 248-260 .. code-block:: default plt.figure(figsize=(4, 3)) fig = sns.regplot( data=df2, x="Life expectancy at birth, total (years)", y="Happiness score", lowess=True, ) fig.set_ylabel("Happiness index") fig.set_title("Is a higher life expectancy linked to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_and_FeatureAugmenter_002.png :alt: Is a higher life expectancy linked to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_and_FeatureAugmenter_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 261-263 It seems the answer is yes! Countries with higher life expectancy are also happier. .. GENERATED FROM PYTHON SOURCE LINES 266-270 3. Joining legal rights strength table ...................................... And the table with a measure of legal rights strength in the country: .. GENERATED FROM PYTHON SOURCE LINES 270-282 .. code-block:: default df3 = fuzzy_join( df2, legal_rights, left_on="Country", right_on="Country Name", match_score=0.45, ) df3.drop(columns=["Country Name"], inplace=True) df3.head(3) .. raw:: html

	Country	Happiness score	GDP per capita (current US$)	Life expectancy at birth, total (years)	Strength of legal rights index (0=weak to 12=strong)
0	Finland	7821.0	53654.750296	82.131707	6.0
1	Denmark	7636.0	68007.756673	81.551220	8.0
2	Iceland	7557.0	68727.636665	83.065854	4.0

.. GENERATED FROM PYTHON SOURCE LINES 283-284 Let's take a look at their correspondance in a figure: .. GENERATED FROM PYTHON SOURCE LINES 284-296 .. code-block:: default plt.figure(figsize=(4, 3)) fig = sns.regplot( data=df3, x="Strength of legal rights index (0=weak to 12=strong)", y="Happiness score", lowess=True, ) fig.set_ylabel("Happiness index") fig.set_title("Does a country's legal rights strength lead to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_and_FeatureAugmenter_003.png :alt: Does a country's legal rights strength lead to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_and_FeatureAugmenter_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 297-299 From this plot, it is not clear that this measure of legal strength is linked to happiness. .. GENERATED FROM PYTHON SOURCE LINES 301-303 Great! Our joined table has become bigger and full of useful information. And now we are ready to apply a first machine learning model to it! .. GENERATED FROM PYTHON SOURCE LINES 305-310 Prediction model ---------------- We now separate our covariates (X), from the target (or exogenous) variables: y .. GENERATED FROM PYTHON SOURCE LINES 310-313 .. code-block:: default X = df3.drop("Happiness score", axis=1).select_dtypes(exclude=object) y = df3[["Happiness score"]] .. GENERATED FROM PYTHON SOURCE LINES 314-315 Let us now define the model that will be used to predict the happiness score: .. GENERATED FROM PYTHON SOURCE LINES 315-325 .. code-block:: default from sklearn import __version__ as sklearn_version if sklearn_version < "1.0": from sklearn.experimental import enable_hist_gradient_boosting from sklearn.ensemble import HistGradientBoostingRegressor from sklearn.model_selection import KFold hgdb = HistGradientBoostingRegressor(random_state=0) cv = KFold(n_splits=2, shuffle=True, random_state=0) .. GENERATED FROM PYTHON SOURCE LINES 326-330 To evaluate our model, we will apply a `4-fold cross-validation`. We evaluate our model using the `R2` score. Let's finally assess the results of our models: .. GENERATED FROM PYTHON SOURCE LINES 330-338 .. code-block:: default from sklearn.model_selection import cross_validate cv_results_t = cross_validate(hgdb, X, y, cv=cv, scoring="r2") cv_r2_t = cv_results_t["test_score"] print(f"Mean R2 score is {cv_r2_t.mean():.2f} +- {cv_r2_t.std():.2f}") .. rst-class:: sphx-glr-script-out .. code-block:: none Mean R2 score is 0.63 +- 0.03 .. GENERATED FROM PYTHON SOURCE LINES 339-350 We have a satisfying first result: an R2 of 0.66! Data cleaning varies from dataset to dataset: there are as many ways to clean a table as there are errors. |fj| method is generalizable across all datasets. Data transformation is also often very costly in both time and ressources. |fj| is fast and easy-to-use. Now up to you, try improving our model by adding information into it and beating our result! .. GENERATED FROM PYTHON SOURCE LINES 352-359 Using the |fa| to fuzzy join multiple tables -------------------------------------------- A faster way to merge different tables from the World Bank to `X` is to use the |fa|. The |fa| is a transformer that can easily chain joins of tables on a main table. .. GENERATED FROM PYTHON SOURCE LINES 361-365 .. _example_feature_augmenter: Instantiating the transformer ............................. .. GENERATED FROM PYTHON SOURCE LINES 365-367 .. code-block:: default y = df["Happiness score"] .. GENERATED FROM PYTHON SOURCE LINES 368-371 We gather the auxilliary tables into a list of (tables, keys) for the `tables` parameter. An instance of the transformer with the necessary information is: .. GENERATED FROM PYTHON SOURCE LINES 371-382 .. code-block:: default from dirty_cat import FeatureAugmenter fa = FeatureAugmenter( tables=[ (gdppc, "Country Name"), (life_exp, "Country Name"), (legal_rights, "Country Name"), ], main_key="Country", ) .. GENERATED FROM PYTHON SOURCE LINES 383-387 Fitting and transforming into the final table ............................................. To get our final joined table we will fit and transform the main table (df) with our create instance of the |fa|: .. GENERATED FROM PYTHON SOURCE LINES 387-391 .. code-block:: default df_final = fa.fit_transform(df) df_final.head(10) .. raw:: html

	Country	Happiness score	Country Name	GDP per capita (current US$)	Country Name_aux	Life expectancy at birth, total (years)	Country Name_aux	Strength of legal rights index (0=weak to 12=strong)
0	Finland	7821.0	Finland	53654.750296	Finland	82.131707	Finland	6.0
1	Denmark	7636.0	Denmark	68007.756673	Denmark	81.551220	Denmark	8.0
2	Iceland	7557.0	Iceland	68727.636665	Iceland	83.065854	Iceland	4.0
3	Switzerland	7512.0	Switzerland	91991.600458	Switzerland	83.100000	Switzerland	6.0
4	Netherlands	7415.0	Netherlands	57767.878811	Netherlands	81.409756	Netherlands	2.0
5	Luxembourg*	7404.0	Luxembourg	133590.146976	Luxembourg	81.741463	Luxembourg	3.0
6	Sweden	7384.0	Sweden	61028.738060	Sweden	82.407317	Sweden	7.0
7	Norway	7365.0	Norway	89154.276093	Norway	83.209756	Norway	5.0
8	Israel	7364.0	Israel	52170.711862	Israel	82.700000	Israel	6.0
9	New Zealand	7200.0	New Zealand	48781.026633	New Zealand	82.056098	New Zealand	12.0

.. GENERATED FROM PYTHON SOURCE LINES 392-395 And that's it! As previously, we now have a big table ready for machine learning. Let's create our machine learning pipeline: .. GENERATED FROM PYTHON SOURCE LINES 395-413 .. code-block:: default from sklearn.pipeline import make_pipeline from sklearn.compose import make_column_transformer # We include only the columns that will be pertinent for our regression: encoder = make_column_transformer( ( "passthrough", [ "GDP per capita (current US$)", "Life expectancy at birth, total (years)", "Strength of legal rights index (0=weak to 12=strong)", ], ), remainder="drop", ) pipeline = make_pipeline(fa, encoder, HistGradientBoostingRegressor()) .. GENERATED FROM PYTHON SOURCE LINES 414-417 And the best part is that we are now able to evaluate the parameters of the |fj|. For instance, the ``match_score`` was manually picked and can now be introduced into a grid search: .. GENERATED FROM PYTHON SOURCE LINES 417-427 .. code-block:: default from sklearn.model_selection import GridSearchCV # We will test four possible values of match_score: params = {"featureaugmenter__match_score": [0.2, 0.3, 0.4, 0.5]} grid = GridSearchCV(pipeline, param_grid=params) grid.fit(df, y) print(grid.best_params_) .. rst-class:: sphx-glr-script-out .. code-block:: none {'featureaugmenter__match_score': 0.2} .. GENERATED FROM PYTHON SOURCE LINES 428-431 The grid searching gave us the best value of 0.5 for the parameter ``match_score``. Let's use this value in our regression: .. GENERATED FROM PYTHON SOURCE LINES 431-434 .. code-block:: default print(f"Mean R2 score with pipeline is {grid.score(df, y):.2f}") .. rst-class:: sphx-glr-script-out .. code-block:: none Mean R2 score with pipeline is 0.83 .. GENERATED FROM PYTHON SOURCE LINES 435-446 .. topic:: Note: Here, ``grid.score()`` takes directly the best model (with ``match_score=0.5``) that was found during the grid search. Thus, it is equivalent to fixing the ``match_score`` to 0.5 and refitting the pipeline on the data. Great, by evaluating the correct ``match_score`` we improved our results significantly! .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 14.332 seconds) .. _sphx_glr_download_auto_examples_04_fuzzy_joining_and_FeatureAugmenter.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/04_fuzzy_joining_and_FeatureAugmenter.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 04_fuzzy_joining_and_FeatureAugmenter.py <04_fuzzy_joining_and_FeatureAugmenter.py>` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 04_fuzzy_joining_and_FeatureAugmenter.ipynb <04_fuzzy_joining_and_FeatureAugmenter.ipynb>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_