.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/06_ken_embeddings_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_06_ken_embeddings_example.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_06_ken_embeddings_example.py:

Wikipedia embeddings to enrich the data
=======================================

When the data comprises common entities (cities, companies or famous
people), bringing in new information assembled from external sources may be
the key to improving the analysis.

Embeddings, or vectorial representations of entities, are a convenient way to
capture and summarize the information on an entity.
Relational data embeddings capture all common entities from Wikipedia. [#]_
These will be called `KEN embeddings` in the following example.

We will see that these embeddings of common entities significantly improve our
results.

.. [#] https://soda-inria.github.io/ken_embeddings/


.. |Pipeline| replace::
    :class:`~sklearn.pipeline.Pipeline`

.. |OneHotEncoder| replace::
    :class:`~sklearn.preprocessing.OneHotEncoder`

.. |ColumnTransformer| replace::
    :class:`~sklearn.compose.ColumnTransformer`

.. |MinHash| replace::
    :class:`~dirty_cat.MinHashEncoder`

.. |HGBR| replace::
    :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. GENERATED FROM PYTHON SOURCE LINES 37-42

The data
--------

We will take a look at the video game sales dataset.
Let's retrieve the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 42-53

.. code-block:: default

    import pandas as pd

    X = pd.read_csv(
        "https://raw.githubusercontent.com/William2064888/vgsales.csv/main/vgsales.csv",
        sep=";",
        on_bad_lines="skip",
    )
    # Shuffle the data
    X = X.sample(frac=1, random_state=11, ignore_index=True)
    X.head(3)
.. rst-class:: sphx-glr-script-out

.. code-block:: none

        Rank                                        Name Platform  Year     Genre           Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
    0   6500                    Star Wars: Bounty Hunter       GC  2002   Shooter           LucasArts      0.20      0.05       0.0         0.01          0.26
    1  13442                  Thrillville: Off the Rails       DS  2007  Strategy           LucasArts      0.03      0.01       0.0         0.00          0.05
    2  15074  Thomas and Friends: Steaming around Sodor      3DS  2015    Action  Avanquest Software      0.00      0.02       0.0         0.00          0.02
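Before picking a target, it helps to check the size of the table and which
of the columns we will rely on contain missing values. This is a quick
inspection aside, not part of the original pipeline; it only uses the
columns shown above:

.. code-block:: default

    # Overall shape of the table: (number of games, number of columns)
    print(X.shape)
    # Missing-value counts for the columns used later in this example
    print(X[["Name", "Platform", "Year", "Genre", "Publisher", "Global_Sales"]].isna().sum())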
.. GENERATED FROM PYTHON SOURCE LINES 54-55

Our goal will be to predict the sales amount (y, our target column):

.. GENERATED FROM PYTHON SOURCE LINES 55-58

.. code-block:: default

    y = X["Global_Sales"]
    y

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0        0.26
    1        0.05
    2        0.02
    3        1.16
    4        0.03
             ...
    16567    0.25
    16568    0.49
    16569    0.22
    16570    0.53
    16571    0.11
    Name: Global_Sales, Length: 16572, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 59-60

Let's take a look at the distribution of our target variable:

.. GENERATED FROM PYTHON SOURCE LINES 60-68

.. code-block:: default

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set_theme(style="ticks")

    sns.histplot(y)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_06_ken_embeddings_example_001.png
   :alt: 06 ken embeddings example
   :srcset: /auto_examples/images/sphx_glr_06_ken_embeddings_example_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 69-70

The distribution is heavily skewed, so it is better to take the log of sales
rather than the absolute values:

.. GENERATED FROM PYTHON SOURCE LINES 70-76

.. code-block:: default

    import numpy as np

    y = np.log(y)
    sns.histplot(y)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_06_ken_embeddings_example_002.png
   :alt: 06 ken embeddings example
   :srcset: /auto_examples/images/sphx_glr_06_ken_embeddings_example_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 77-78

Before moving further, let's carry out some basic preprocessing:

.. GENERATED FROM PYTHON SOURCE LINES 78-85

.. code-block:: default

    # Get a mask of the rows with missing values in "Publisher" and "Global_Sales"
    mask = X.isna()["Publisher"] | X.isna()["Global_Sales"]
    # And remove them from both X and y, so that the two stay aligned
    X.dropna(subset=["Publisher", "Global_Sales"], inplace=True)
    y = y[~mask]

.. GENERATED FROM PYTHON SOURCE LINES 86-93

Extracting entity embeddings
----------------------------

We will use KEN embeddings to enrich our data.

We will start by checking out the available tables with
:func:`~dirty_cat.datasets.get_ken_table_aliases`:

.. GENERATED FROM PYTHON SOURCE LINES 93-97

.. code-block:: default

    from dirty_cat.datasets import get_ken_table_aliases

    get_ken_table_aliases()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'all_entities', 'schools', 'albums', 'movies', 'companies', 'games'}

.. GENERATED FROM PYTHON SOURCE LINES 98-101

The *games* table is the most relevant to our case.
Let's see what kinds of types we can find in it with the function
:func:`~dirty_cat.datasets.get_ken_types`:

.. GENERATED FROM PYTHON SOURCE LINES 101-105

.. code-block:: default

    from dirty_cat.datasets import get_ken_types

    get_ken_types(embedding_table_id="games")
.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                       Type
    0                            <wikicat_1994_video_games>
    1                                  <wikicat_irem_games>
    2                         <wikicat_ea_guingamp_players>
    3    <wikicat_video_game_companies_of_the_united_ki...
    4           <wikicat_asian_games_medalists_in_swimming>
    ..                                                  ...
    636                    <wikicat_college_football_games>
    637                          <wikicat_sonic_team_games>
    638                   <wikicat_space_opera_video_games>
    639            <wikicat_boxers_at_the_2002_asian_games>
    640                    <wikicat_motorcycle_video_games>

    [641 rows x 1 columns]
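The result is a regular pandas DataFrame with a single "Type" column (as
shown above), so it can be narrowed down with ordinary string matching.
For instance, the following sketch (an aside, not part of the original
example) keeps only the publisher-related types:

.. code-block:: default

    ken_types = get_ken_types(embedding_table_id="games")
    # Keep only the types whose name mentions "publish"
    print(ken_types[ken_types["Type"].str.contains("publish")])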
.. GENERATED FROM PYTHON SOURCE LINES 106-110

Interesting, we have a broad range of topics!

Next, we'll use :func:`~dirty_cat.datasets.get_ken_embeddings` to extract
the embeddings of the entities we need:

.. GENERATED FROM PYTHON SOURCE LINES 110-112

.. code-block:: default

    from dirty_cat.datasets import get_ken_embeddings

.. GENERATED FROM PYTHON SOURCE LINES 113-120

KEN embeddings are classified by types.
The :func:`~dirty_cat.datasets.get_ken_embeddings` function allows us to
specify the types to be included and/or excluded, so as not to load all
Wikipedia entity embeddings in a table.

In a first table, we include all embeddings with the type name "game" and
exclude those with the type name "companies" or "developer":

.. GENERATED FROM PYTHON SOURCE LINES 120-126

.. code-block:: default

    embedding_games = get_ken_embeddings(
        types="game",
        exclude="companies|developer",
        embedding_table_id="games",
    )

.. GENERATED FROM PYTHON SOURCE LINES 127-129

In a second table, we include all embeddings containing the type name
"game_development_companies", "game_companies" or "game_publish":

.. GENERATED FROM PYTHON SOURCE LINES 129-142

.. code-block:: default

    embedding_publisher = get_ken_embeddings(
        types="game_development_companies|game_companies|game_publish",
        embedding_table_id="games",
        suffix="_aux",
    )

    # We keep the 200 embedding column names in a list (for the |Pipeline|):
    n_dim = 200

    emb_columns = [f"X{j}" for j in range(n_dim)]

    emb_columns2 = [f"X{j}_aux" for j in range(n_dim)]

.. GENERATED FROM PYTHON SOURCE LINES 143-152

Merging the entities
....................

We will now merge the entities from Wikipedia with their equivalent matches
in our video game sales table:

The entities from the 'embedding_games' table will be merged along the
column "Name", and the ones from the 'embedding_publisher' table along the
column "Publisher".

.. GENERATED FROM PYTHON SOURCE LINES 152-160

.. code-block:: default

    from dirty_cat import FeatureAugmenter

    fa1 = FeatureAugmenter(tables=[(embedding_games, "Entity")], main_key="Name")
    fa2 = FeatureAugmenter(tables=[(embedding_publisher, "Entity")], main_key="Publisher")

    X_full = fa1.fit_transform(X)
    X_full = fa2.fit_transform(X_full)
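As a quick sanity check (an aside using the column lists defined above), we
can verify that the two joins added the expected embedding columns:

.. code-block:: default

    # X_full should contain the original columns plus the
    # 200 + 200 embedding dimensions added by the two joins
    assert set(emb_columns).issubset(X_full.columns)
    assert set(emb_columns2).issubset(X_full.columns)
    print(X.shape, "->", X_full.shape)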
.. GENERATED FROM PYTHON SOURCE LINES 161-167

Prediction with base features
-----------------------------

For now, we will put the KEN embeddings aside and build a typical learning
pipeline, in which we try to predict the amount of sales using only the
base features contained in the initial table.

.. GENERATED FROM PYTHON SOURCE LINES 169-172

We first use scikit-learn's |ColumnTransformer| to define the columns that
will be included in the learning process, and the appropriate encoding of
categorical variables using the |MinHash| and |OneHotEncoder|:

.. GENERATED FROM PYTHON SOURCE LINES 172-187

.. code-block:: default

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder

    from dirty_cat import MinHashEncoder

    min_hash = MinHashEncoder(n_components=100)
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

    encoder = make_column_transformer(
        ("passthrough", ["Year"]),
        (ohe, ["Genre"]),
        (min_hash, ["Platform"]),
        remainder="drop",
    )

.. GENERATED FROM PYTHON SOURCE LINES 188-190

We incorporate our |ColumnTransformer| into a |Pipeline|.
We define our predictor, |HGBR|, which is fast and reliable for large
datasets:

.. GENERATED FROM PYTHON SOURCE LINES 190-196

.. code-block:: default

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    hgb = HistGradientBoostingRegressor(random_state=0)
    pipeline = make_pipeline(encoder, hgb)

.. GENERATED FROM PYTHON SOURCE LINES 197-198

The |Pipeline| can now be readily applied to the dataframe for prediction:

.. GENERATED FROM PYTHON SOURCE LINES 198-219

.. code-block:: default

    from sklearn.model_selection import cross_validate

    # We will save the results in a dictionary:
    all_r2_scores = dict()
    all_rmse_scores = dict()

    cv_results = cross_validate(
        pipeline, X_full, y, scoring=["r2", "neg_root_mean_squared_error"]
    )

    all_r2_scores["Base features"] = cv_results["test_r2"]
    all_rmse_scores["Base features"] = -cv_results["test_neg_root_mean_squared_error"]

    print("With base features:")
    print(
        f"Mean R2 is {all_r2_scores['Base features'].mean():.2f} +-"
        f" {all_r2_scores['Base features'].std():.2f} and the RMSE is"
        f" {all_rmse_scores['Base features'].mean():.2f} +-"
        f" {all_rmse_scores['Base features'].std():.2f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    With base features:
    Mean R2 is 0.21 +- 0.01 and the RMSE is 1.30 +- 0.01

.. GENERATED FROM PYTHON SOURCE LINES 220-225

Prediction with KEN Embeddings
------------------------------

We will now build a second learning pipeline using only the KEN embeddings
from Wikipedia.

.. GENERATED FROM PYTHON SOURCE LINES 227-228

We keep only the embedding columns:

.. GENERATED FROM PYTHON SOURCE LINES 228-232

.. code-block:: default

    encoder2 = make_column_transformer(
        ("passthrough", emb_columns), ("passthrough", emb_columns2), remainder="drop"
    )

.. GENERATED FROM PYTHON SOURCE LINES 233-234

We redefine the |Pipeline|:

.. GENERATED FROM PYTHON SOURCE LINES 234-236

.. code-block:: default

    pipeline2 = make_pipeline(encoder2, hgb)

.. GENERATED FROM PYTHON SOURCE LINES 237-238

Let's look at the results:

.. GENERATED FROM PYTHON SOURCE LINES 238-253

.. code-block:: default

    cv_results = cross_validate(
        pipeline2, X_full, y, scoring=["r2", "neg_root_mean_squared_error"]
    )

    all_r2_scores["KEN features"] = cv_results["test_r2"]
    all_rmse_scores["KEN features"] = -cv_results["test_neg_root_mean_squared_error"]

    print("With KEN Embeddings:")
    print(
        f"Mean R2 is {all_r2_scores['KEN features'].mean():.2f} +-"
        f" {all_r2_scores['KEN features'].std():.2f} and the RMSE is"
        f" {all_rmse_scores['KEN features'].mean():.2f} +-"
        f" {all_rmse_scores['KEN features'].std():.2f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    With KEN Embeddings:
    Mean R2 is 0.36 +- 0.01 and the RMSE is 1.16 +- 0.01

.. GENERATED FROM PYTHON SOURCE LINES 254-256

Including the embeddings is clearly relevant for the prediction task at
hand!

.. GENERATED FROM PYTHON SOURCE LINES 258-264

Prediction with KEN Embeddings and base features
------------------------------------------------

Now that we have seen the prediction scores both with and without the
embeddings, we will run a final prediction with all variables included.

.. GENERATED FROM PYTHON SOURCE LINES 266-267

We include both the embeddings and the base features:

.. GENERATED FROM PYTHON SOURCE LINES 267-276

.. code-block:: default

    encoder3 = make_column_transformer(
        ("passthrough", emb_columns),
        ("passthrough", emb_columns2),
        ("passthrough", ["Year"]),
        (ohe, ["Genre"]),
        (min_hash, ["Platform"]),
        remainder="drop",
    )
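Since all the embedding columns and "Year" are passed through unchanged, the
three "passthrough" entries above could equally be merged into one. An
equivalent, more compact definition (a sketch, not what this example
actually runs):

.. code-block:: default

    # Concatenating the column lists yields the same design matrix
    encoder3_compact = make_column_transformer(
        ("passthrough", emb_columns + emb_columns2 + ["Year"]),
        (ohe, ["Genre"]),
        (min_hash, ["Platform"]),
        remainder="drop",
    )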
.. GENERATED FROM PYTHON SOURCE LINES 277-278

We redefine the |Pipeline|:

.. GENERATED FROM PYTHON SOURCE LINES 278-280

.. code-block:: default

    pipeline3 = make_pipeline(encoder3, hgb)

.. GENERATED FROM PYTHON SOURCE LINES 281-282

Let's look at the results:

.. GENERATED FROM PYTHON SOURCE LINES 282-297

.. code-block:: default

    cv_results = cross_validate(
        pipeline3, X_full, y, scoring=["r2", "neg_root_mean_squared_error"]
    )

    all_r2_scores["Base + KEN features"] = cv_results["test_r2"]
    all_rmse_scores["Base + KEN features"] = -cv_results["test_neg_root_mean_squared_error"]

    print("With KEN Embeddings and base features:")
    print(
        f"Mean R2 is {all_r2_scores['Base + KEN features'].mean():.2f} +-"
        f" {all_r2_scores['Base + KEN features'].std():.2f} and the RMSE is"
        f" {all_rmse_scores['Base + KEN features'].mean():.2f} +-"
        f" {all_rmse_scores['Base + KEN features'].std():.2f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    With KEN Embeddings and base features:
    Mean R2 is 0.49 +- 0.01 and the RMSE is 1.05 +- 0.01

.. GENERATED FROM PYTHON SOURCE LINES 298-302

Plotting the results
....................

Finally, we plot the scores on a boxplot:

.. GENERATED FROM PYTHON SOURCE LINES 302-309

.. code-block:: default

    plt.figure(figsize=(5, 3))
    # sphinx_gallery_thumbnail_number = -1
    ax = sns.boxplot(data=pd.DataFrame(all_r2_scores), orient="h")
    plt.xlabel("Prediction accuracy", size=15)
    plt.yticks(size=15)
    plt.tight_layout()

.. image-sg:: /auto_examples/images/sphx_glr_06_ken_embeddings_example_003.png
   :alt: 06 ken embeddings example
   :srcset: /auto_examples/images/sphx_glr_06_ken_embeddings_example_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 310-319

There is a clear improvement when the KEN embeddings are included among the
explanatory variables.

In this case, the embeddings from Wikipedia introduced additional background
information on the games and their publishers that would otherwise have been
missed, and this significantly improved the prediction score.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 10 minutes 48.731 seconds)


.. _sphx_glr_download_auto_examples_06_ken_embeddings_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/06_ken_embeddings_example.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 06_ken_embeddings_example.py <06_ken_embeddings_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 06_ken_embeddings_example.ipynb <06_ken_embeddings_example.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_