
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/01_dirty_categories.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_01_dirty_categories.py>`
        to download the full example code
        or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_01_dirty_categories.py:


Dirty categories: machine learning with non-normalized strings
==============================================================

Including strings that represent categories often calls for much data
preparation. In particular, categories may appear with many morphological
variants when they have been manually input or assembled from diverse
sources.

Here we look at a dataset on wages [#]_ where the column 'Employee
Position Title' contains dirty categories. On such a column, standard
categorical encodings lead to very high dimensions and can lose
information on which categories are similar.

We investigate various encodings of this dirty column for the machine
learning workflow, predicting the 'Current Annual Salary' with gradient
boosted trees. First we manually assemble a complex encoder for the full
dataframe, after which we show a much simpler way, albeit with less fine
control.


.. [#] https://www.openml.org/d/42125


.. |TV| replace:: :class:`~dirty_cat.TableVectorizer`

.. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |ColumnTransformer| replace:: :class:`~sklearn.compose.ColumnTransformer`

.. |RandomForestRegressor| replace:: :class:`~sklearn.ensemble.RandomForestRegressor`

.. |Gap| replace:: :class:`~dirty_cat.GapEncoder`

.. |MinHash| replace:: :class:`~dirty_cat.MinHashEncoder`

.. |HGBR| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. |SE| replace:: :class:`~dirty_cat.SimilarityEncoder`

.. |permutation importances| replace:: :func:`~sklearn.inspection.permutation_importance`

.. GENERATED FROM PYTHON SOURCE LINES 57-61

The data
--------

We first retrieve the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 61-65

.. code-block:: default

    from dirty_cat.datasets import fetch_employee_salaries

    employee_salaries = fetch_employee_salaries()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/dirty_cat/datasets/_fetching.py:608: UserWarning: Could not find the dataset 42125 locally. Downloading it from OpenML; this might take a while... If it is interrupted, some files might be invalid/incomplete: if on the following run, the fetching raises errors, you can try fixing this issue by deleting the directory /home/circleci/project/dirty_cat/datasets/data.
      info = _fetch_openml_dataset(dataset_id, data_directory)

.. GENERATED FROM PYTHON SOURCE LINES 66-67

X, the input data (descriptions of employees):

.. GENERATED FROM PYTHON SOURCE LINES 67-70

.. code-block:: default

    X = employee_salaries.X
    X

.. raw:: html
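
    <table>
      <thead>
        <tr><th></th><th>gender</th><th>department</th><th>department_name</th><th>division</th><th>assignment_category</th><th>employee_position_title</th><th>underfilled_job_title</th><th>date_first_hired</th><th>year_first_hired</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>F</td><td>POL</td><td>Department of Police</td><td>MSB Information Mgmt and Tech Division Records...</td><td>Fulltime-Regular</td><td>Office Services Coordinator</td><td>NaN</td><td>09/22/1986</td><td>1986</td></tr>
        <tr><th>1</th><td>M</td><td>POL</td><td>Department of Police</td><td>ISB Major Crimes Division Fugitive Section</td><td>Fulltime-Regular</td><td>Master Police Officer</td><td>NaN</td><td>09/12/1988</td><td>1988</td></tr>
        <tr><th>2</th><td>F</td><td>HHS</td><td>Department of Health and Human Services</td><td>Adult Protective and Case Management Services</td><td>Fulltime-Regular</td><td>Social Worker IV</td><td>NaN</td><td>11/19/1989</td><td>1989</td></tr>
        <tr><th>3</th><td>M</td><td>COR</td><td>Correction and Rehabilitation</td><td>PRRS Facility and Security</td><td>Fulltime-Regular</td><td>Resident Supervisor II</td><td>NaN</td><td>05/05/2014</td><td>2014</td></tr>
        <tr><th>4</th><td>M</td><td>HCA</td><td>Department of Housing and Community Affairs</td><td>Affordable Housing Programs</td><td>Fulltime-Regular</td><td>Planning Specialist III</td><td>NaN</td><td>03/05/2007</td><td>2007</td></tr>
        <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
        <tr><th>9223</th><td>F</td><td>HHS</td><td>Department of Health and Human Services</td><td>School Based Health Centers</td><td>Fulltime-Regular</td><td>Community Health Nurse II</td><td>NaN</td><td>11/03/2015</td><td>2015</td></tr>
        <tr><th>9224</th><td>F</td><td>FRS</td><td>Fire and Rescue Services</td><td>Human Resources Division</td><td>Fulltime-Regular</td><td>Fire/Rescue Division Chief</td><td>NaN</td><td>11/28/1988</td><td>1988</td></tr>
        <tr><th>9225</th><td>M</td><td>HHS</td><td>Department of Health and Human Services</td><td>Child and Adolescent Mental Health Clinic Serv...</td><td>Parttime-Regular</td><td>Medical Doctor IV - Psychiatrist</td><td>NaN</td><td>04/30/2001</td><td>2001</td></tr>
        <tr><th>9226</th><td>M</td><td>CCL</td><td>County Council</td><td>Council Central Staff</td><td>Fulltime-Regular</td><td>Manager II</td><td>NaN</td><td>09/05/2006</td><td>2006</td></tr>
        <tr><th>9227</th><td>M</td><td>DLC</td><td>Department of Liquor Control</td><td>Licensure, Regulation and Education</td><td>Fulltime-Regular</td><td>Alcohol/Tobacco Enforcement Specialist II</td><td>NaN</td><td>01/30/2012</td><td>2012</td></tr>
      </tbody>
    </table>
    <p>9228 rows × 9 columns</p>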
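
As a rough illustration of why an exact-match (one-hot) view loses similarity
information on a column such as 'employee_position_title', the sketch below
compares a few job titles with a character 3-gram Jaccard similarity. The
helpers ``ngrams`` and ``jaccard`` are hypothetical names written for this
illustration, and "Social Worker III" is a made-up variant; this is a toy,
not how the dirty_cat encoders are implemented.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Two titles from the table above, plus one hypothetical variant
titles = ["Social Worker IV", "Social Worker III", "Master Police Officer"]

# Exact-match view: two titles are either identical (1) or unrelated (0)
one_hot_sim = [[int(a == b) for b in titles] for a in titles]

# n-gram view: morphological variants keep a high similarity
ngram_sim = [[round(jaccard(a, b), 2) for b in titles] for a in titles]

for title, row in zip(titles, ngram_sim):
    print(f"{title:<22} {row}")
```

In the exact-match view the two social-worker titles are as unrelated as a
social worker and a police officer, whereas the n-gram view keeps them close.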



.. GENERATED FROM PYTHON SOURCE LINES 71-72

``y`` is our target column (the annual salary):

.. GENERATED FROM PYTHON SOURCE LINES 72-75

.. code-block:: default

    y = employee_salaries.y
    y.name

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    'current_annual_salary'

.. GENERATED FROM PYTHON SOURCE LINES 76-77

Now, let's carry out some basic preprocessing:

.. GENERATED FROM PYTHON SOURCE LINES 77-87

.. code-block:: default

    import pandas as pd

    X["date_first_hired"] = pd.to_datetime(X["date_first_hired"])
    X["year_first_hired"] = X["date_first_hired"].apply(lambda x: x.year)
    # Get a mask of the rows with missing values in 'gender'
    mask = X.isna()["gender"]
    # And remove them
    X.dropna(subset=["gender"], inplace=True)
    y = y[~mask]

.. GENERATED FROM PYTHON SOURCE LINES 88-93

Assembling a machine-learning pipeline that encodes the data
------------------------------------------------------------

To build a learning pipeline, we need to assemble encoders for each column,
and apply a supervised learning model on top.

.. GENERATED FROM PYTHON SOURCE LINES 95-100

The categorical encoders
........................

An encoder is needed to turn a categorical column into a numerical
representation:

.. GENERATED FROM PYTHON SOURCE LINES 100-104

.. code-block:: default

    from sklearn.preprocessing import OneHotEncoder

    one_hot = OneHotEncoder(handle_unknown="ignore", sparse=False)

.. GENERATED FROM PYTHON SOURCE LINES 105-108

We assemble these to apply them to the relevant columns.
The |ColumnTransformer| is created by specifying a set of transformers
along with the column names to which each must be applied:

.. GENERATED FROM PYTHON SOURCE LINES 108-119

.. code-block:: default

    from sklearn.compose import make_column_transformer

    encoder = make_column_transformer(
        (one_hot, ["gender", "department_name", "assignment_category"]),
        ("passthrough", ["year_first_hired"]),
        # Last but not least, our dirty column
        (one_hot, ["employee_position_title"]),
        remainder="drop",
    )

.. GENERATED FROM PYTHON SOURCE LINES 120-127

Pipelining an encoder with a learner
....................................

We will use a |HGBR|, which is a good predictor for data with
heterogeneous columns (the experimental import below is only required
for scikit-learn versions earlier than 1.0):

.. GENERATED FROM PYTHON SOURCE LINES 127-137

.. code-block:: default

    from sklearn.experimental import enable_hist_gradient_boosting

    # We can now import the HistGradientBoostingRegressor from ensemble
    from sklearn.ensemble import HistGradientBoostingRegressor

    # We then create a pipeline chaining our encoders to a learner
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
      warnings.warn(

.. GENERATED FROM PYTHON SOURCE LINES 138-139

The pipeline can be readily applied to the dataframe for prediction:

.. GENERATED FROM PYTHON SOURCE LINES 139-141

.. code-block:: default

    pipeline.fit(X, y)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(

.. raw:: html
Pipeline(steps=[('columntransformer',
                     ColumnTransformer(transformers=[('onehotencoder-1',
                                                      OneHotEncoder(handle_unknown='ignore',
                                                                    sparse=False),
                                                      ['gender', 'department_name',
                                                       'assignment_category']),
                                                     ('passthrough', 'passthrough',
                                                      ['year_first_hired']),
                                                     ('onehotencoder-2',
                                                      OneHotEncoder(handle_unknown='ignore',
                                                                    sparse=False),
                                                      ['employee_position_title'])])),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])
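
The ``FutureWarning`` in the output above notes that ``sparse`` was renamed to
``sparse_output`` in scikit-learn 1.2. As a minimal sketch, assuming a toy
dataframe with made-up values and a hypothetical helper
``make_dense_one_hot``, the same kind of column transformer could be built so
that it works on either side of the rename:

```python
import inspect

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder


def make_dense_one_hot():
    """Build a dense OneHotEncoder with whichever keyword this version accepts."""
    params = inspect.signature(OneHotEncoder).parameters
    keyword = "sparse_output" if "sparse_output" in params else "sparse"
    return OneHotEncoder(handle_unknown="ignore", **{keyword: False})


# Toy data with made-up values, mimicking three of the columns used above
X_toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M"],
        "year_first_hired": [1986, 1988, 1989, 2014],
        "employee_position_title": [
            "Social Worker IV",
            "Master Police Officer",
            "Social Worker IV",
            "Manager II",
        ],
    }
)

encoder_toy = make_column_transformer(
    (make_dense_one_hot(), ["gender"]),
    ("passthrough", ["year_first_hired"]),
    (make_dense_one_hot(), ["employee_position_title"]),
    remainder="drop",
)

X_toy_enc = encoder_toy.fit_transform(X_toy)
# 2 gender columns + 1 passthrough column + 3 distinct titles
print(X_toy_enc.shape)
```

Each distinct title becomes its own column here, which is why the number of
one-hot columns grows with the number of variants in a dirty column.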


.. GENERATED FROM PYTHON SOURCE LINES 142-147

Dirty-category encoding
-----------------------

The |OneHotEncoder| is actually not well suited to the 'Employee Position
Title' column, as this column contains 400 different entries:

.. GENERATED FROM PYTHON SOURCE LINES 147-151

.. code-block:: default

    import numpy as np

    # Count the distinct values of the dirty column itself (not of the target y)
    X["employee_position_title"].nunique()

.. GENERATED FROM PYTHON SOURCE LINES 152-156

.. _example_minhash_encoder:

We will now experiment with encoders specially made for handling
dirty columns:

.. GENERATED FROM PYTHON SOURCE LINES 156-172

.. code-block:: default

    from dirty_cat import (
        SimilarityEncoder,
        TargetEncoder,
        MinHashEncoder,
        GapEncoder,
    )

    encoders = {
        "one-hot": one_hot,
        "similarity": SimilarityEncoder(),
        "target": TargetEncoder(handle_unknown="ignore"),
        "minhash": MinHashEncoder(n_components=100),
        "gap": GapEncoder(n_components=100),
    }

.. GENERATED FROM PYTHON SOURCE LINES 173-176

We now loop over the different encoding methods,
instantiate a new |Pipeline| each time, fit it
and store the returned cross-validation score:

.. GENERATED FROM PYTHON SOURCE LINES 176-196

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    all_scores = dict()

    for name, method in encoders.items():
        encoder = make_column_transformer(
            (one_hot, ["gender", "department_name", "assignment_category"]),
            ("passthrough", ["year_first_hired"]),
            # Last but not least, our dirty column
            (method, ["employee_position_title"]),
            remainder="drop",
        )

        pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())
        scores = cross_val_score(pipeline, X, y)
        print(f"{name} encoding")
        print(f"r2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
        all_scores[name] = scores

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
    one-hot encoding
    r2 score: mean: 0.776; std: 0.028

    similarity encoding
    r2 score: mean: 0.923; std: 0.014

    target encoding
    r2 score: mean: 0.842; std: 0.030

    minhash encoding
    r2 score: mean: 0.919; std: 0.012

    gap encoding
    r2 score: mean: 0.922; std: 0.013

.. GENERATED FROM PYTHON SOURCE LINES 197-201

Plotting the results
....................

Finally, we plot the scores in a boxplot:

.. GENERATED FROM PYTHON SOURCE LINES 201-212

.. code-block:: default

    import seaborn
    import matplotlib.pyplot as plt

    plt.figure(figsize=(4, 3))
    ax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient="h")
    plt.ylabel("Encoding", size=20)
    plt.xlabel("Prediction accuracy ", size=20)
    plt.yticks(size=20)
    plt.tight_layout()

.. image-sg:: /auto_examples/images/sphx_glr_01_dirty_categories_001.png
   :alt: 01 dirty categories
   :srcset: /auto_examples/images/sphx_glr_01_dirty_categories_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 213-224

The clear trend is that encoders grasping similarities between categories
(|SE|, |MinHash|, and |Gap|) perform better than those discarding them.

|SE| scores marginally best here (mean r2 of 0.923, against 0.922 for the
|Gap| and 0.919 for the |MinHash|), but it is less scalable on big
data than the |MinHash| and |Gap|. The most scalable encoder is
the |MinHash|. On the other hand, the |Gap| has the benefit of
providing interpretable features
(see :ref:`sphx_glr_auto_examples_02_investigating_dirty_categories.py`).

|

.. GENERATED FROM PYTHON SOURCE LINES 226-235

.. _example_table_vectorizer:

A simpler way: automatic vectorization
--------------------------------------

The code to assemble a column transformer is a bit tedious. We will
now explore a simpler, automated way of encoding the data.

Let's start again from the raw data:

.. GENERATED FROM PYTHON SOURCE LINES 235-239

.. code-block:: default

    employee_salaries = fetch_employee_salaries()
    X = employee_salaries.X
    y = employee_salaries.y

.. GENERATED FROM PYTHON SOURCE LINES 240-242

We'll drop the 'date_first_hired' column as it's redundant with
'year_first_hired'.

.. GENERATED FROM PYTHON SOURCE LINES 242-244

.. code-block:: default

    X = X.drop(["date_first_hired"], axis=1)

.. GENERATED FROM PYTHON SOURCE LINES 245-246

We still have a complex and heterogeneous dataframe:

.. GENERATED FROM PYTHON SOURCE LINES 246-248

.. code-block:: default

    X

.. raw:: html
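
    <table>
      <thead>
        <tr><th></th><th>gender</th><th>department</th><th>department_name</th><th>division</th><th>assignment_category</th><th>employee_position_title</th><th>underfilled_job_title</th><th>year_first_hired</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>F</td><td>POL</td><td>Department of Police</td><td>MSB Information Mgmt and Tech Division Records...</td><td>Fulltime-Regular</td><td>Office Services Coordinator</td><td>NaN</td><td>1986</td></tr>
        <tr><th>1</th><td>M</td><td>POL</td><td>Department of Police</td><td>ISB Major Crimes Division Fugitive Section</td><td>Fulltime-Regular</td><td>Master Police Officer</td><td>NaN</td><td>1988</td></tr>
        <tr><th>2</th><td>F</td><td>HHS</td><td>Department of Health and Human Services</td><td>Adult Protective and Case Management Services</td><td>Fulltime-Regular</td><td>Social Worker IV</td><td>NaN</td><td>1989</td></tr>
        <tr><th>3</th><td>M</td><td>COR</td><td>Correction and Rehabilitation</td><td>PRRS Facility and Security</td><td>Fulltime-Regular</td><td>Resident Supervisor II</td><td>NaN</td><td>2014</td></tr>
        <tr><th>4</th><td>M</td><td>HCA</td><td>Department of Housing and Community Affairs</td><td>Affordable Housing Programs</td><td>Fulltime-Regular</td><td>Planning Specialist III</td><td>NaN</td><td>2007</td></tr>
        <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
        <tr><th>9223</th><td>F</td><td>HHS</td><td>Department of Health and Human Services</td><td>School Based Health Centers</td><td>Fulltime-Regular</td><td>Community Health Nurse II</td><td>NaN</td><td>2015</td></tr>
        <tr><th>9224</th><td>F</td><td>FRS</td><td>Fire and Rescue Services</td><td>Human Resources Division</td><td>Fulltime-Regular</td><td>Fire/Rescue Division Chief</td><td>NaN</td><td>1988</td></tr>
        <tr><th>9225</th><td>M</td><td>HHS</td><td>Department of Health and Human Services</td><td>Child and Adolescent Mental Health Clinic Serv...</td><td>Parttime-Regular</td><td>Medical Doctor IV - Psychiatrist</td><td>NaN</td><td>2001</td></tr>
        <tr><th>9226</th><td>M</td><td>CCL</td><td>County Council</td><td>Council Central Staff</td><td>Fulltime-Regular</td><td>Manager II</td><td>NaN</td><td>2006</td></tr>
        <tr><th>9227</th><td>M</td><td>DLC</td><td>Department of Liquor Control</td><td>Licensure, Regulation and Education</td><td>Fulltime-Regular</td><td>Alcohol/Tobacco Enforcement Specialist II</td><td>NaN</td><td>2012</td></tr>
      </tbody>
    </table>
    <p>9228 rows × 8 columns</p>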
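
To make "heterogeneous" concrete, one can tabulate the data type and the
number of distinct values of each column. The sketch below does so on a small
made-up dataframe (``toy`` and its values are hypothetical, loosely modelled
on the rows shown above); on the real data the same two calls would be applied
to ``X``.

```python
import pandas as pd

# Made-up rows with the same kinds of columns as the dataframe above
toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M", "M"],
        "department": ["POL", "POL", "HHS", "COR", "HHS"],
        "employee_position_title": [
            "Office Services Coordinator",
            "Master Police Officer",
            "Social Worker IV",
            "Resident Supervisor II",
            "Community Health Nurse II",
        ],
        "year_first_hired": [1986, 1988, 1989, 2014, 2015],
    }
)

# One row per column: its dtype and how many distinct values it holds
summary = pd.DataFrame(
    {"dtype": toy.dtypes.astype(str), "n_unique": toy.nunique()}
)
print(summary.sort_values("n_unique"))
```

Columns with few distinct values are natural candidates for one-hot encoding,
while string columns with many distinct values are the ones where
dirty-category encoders matter.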



.. GENERATED FROM PYTHON SOURCE LINES 249-251

The |TV| can turn this dataframe into a form suited for
machine learning.

.. GENERATED FROM PYTHON SOURCE LINES 253-262

Using the TableVectorizer in a supervised-learning pipeline
-----------------------------------------------------------

Assembling the |TV| in a |Pipeline| with a powerful learner,
such as gradient boosted trees, gives **a machine-learning method that
can be readily applied to the dataframe**.

The |TV| requires at least dirty_cat 0.2.0.

.. GENERATED FROM PYTHON SOURCE LINES 262-269

.. code-block:: default

    from dirty_cat import TableVectorizer

    pipeline = make_pipeline(
        TableVectorizer(auto_cast=True), HistGradientBoostingRegressor()
    )

.. GENERATED FROM PYTHON SOURCE LINES 270-271

Let's perform a cross-validation to see how well this model predicts:

.. GENERATED FROM PYTHON SOURCE LINES 271-280

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(pipeline, X, y, scoring="r2")

    print(f"scores={scores}")
    print(f"mean={np.mean(scores)}")
    print(f"std={np.std(scores)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    scores=[0.92688413 0.89763491 0.92818275 0.93156022 0.92611958]
    mean=0.9220763174855001
    std=0.012361865357065842

.. GENERATED FROM PYTHON SOURCE LINES 281-284

The prediction performed here is about as good as the best manually
assembled pipelines above (mean r2 of 0.922), but the code is much simpler
as it does not involve specifying columns manually.

.. GENERATED FROM PYTHON SOURCE LINES 286-291

Analyzing the features created
------------------------------

Let us perform the same workflow, but without the |Pipeline|, so we can
analyze the TableVectorizer's mechanisms along the way.

.. GENERATED FROM PYTHON SOURCE LINES 291-293

.. code-block:: default

    table_vec = TableVectorizer(auto_cast=True)

.. GENERATED FROM PYTHON SOURCE LINES 294-295

We split the data between train and test, and transform them:

.. GENERATED FROM PYTHON SOURCE LINES 295-304

.. code-block:: default

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=42
    )

    X_train_enc = table_vec.fit_transform(X_train, y_train)
    X_test_enc = table_vec.transform(X_test)

.. GENERATED FROM PYTHON SOURCE LINES 305-306

The encoded data, X_train_enc and X_test_enc, are numerical arrays:

.. GENERATED FROM PYTHON SOURCE LINES 306-308

.. code-block:: default

    X_train_enc

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
            5.24002256e-02, 9.83610175e+00, 2.00700000e+03],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.00500000e+03],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.00900000e+03],
           ...,
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 1.99000000e+03],
           [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.01200000e+03],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.01400000e+03]])

.. GENERATED FROM PYTHON SOURCE LINES 309-310

They have more columns than the original dataframe, but the number
remains manageable:

.. GENERATED FROM PYTHON SOURCE LINES 310-312

.. code-block:: default

    X_train.shape, X_train_enc.shape

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ((7843, 8), (7843, 169))

.. GENERATED FROM PYTHON SOURCE LINES 313-318

Inspecting the features created
...............................

The |TV| assigns a transformer to each column. We can inspect this
choice:

.. GENERATED FROM PYTHON SOURCE LINES 318-322

.. code-block:: default

    from pprint import pprint

    pprint(table_vec.transformers_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [('low_card_cat',
      OneHotEncoder(drop='if_binary', handle_unknown='ignore'),
      ['gender', 'department', 'department_name', 'assignment_category']),
     ('high_card_cat',
      GapEncoder(n_components=30),
      ['division', 'employee_position_title', 'underfilled_job_title']),
     ('remainder', 'passthrough', ['year_first_hired'])]

.. GENERATED FROM PYTHON SOURCE LINES 323-336

This is what is being passed to the |ColumnTransformer| under the hood.
If you're familiar with how the latter works, it should be very intuitive.
We can notice it classified the columns 'gender', 'department',
'department_name' and 'assignment_category' as low cardinality string
variables. A |OneHotEncoder| will be applied to these columns.

The vectorizer actually distinguishes between string variables (data type
``object`` and ``string``) and categorical variables (data type ``category``).

Next, we can have a look at the encoded feature names.

Before encoding:

.. GENERATED FROM PYTHON SOURCE LINES 336-338

.. code-block:: default

    X.columns.to_list()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title', 'underfilled_job_title', 'year_first_hired']

.. GENERATED FROM PYTHON SOURCE LINES 339-340

After encoding (we only show the first 8 feature names):

.. GENERATED FROM PYTHON SOURCE LINES 340-343

.. code-block:: default

    feature_names = table_vec.get_feature_names_out()
    feature_names[:8]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['gender_F', 'gender_M', 'gender_nan', 'department_BOA', 'department_BOE', 'department_CAT', 'department_CCL', 'department_CEC']

.. GENERATED FROM PYTHON SOURCE LINES 344-350

As we can see, it gave us interpretable columns.
This is because we used the |Gap| on the column 'division',
which was classified as a high cardinality string variable
(default values, see |TV|'s docstring).

In total, we have a reasonable number of encoded columns:

.. GENERATED FROM PYTHON SOURCE LINES 350-353

.. code-block:: default

    len(feature_names)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    169

.. GENERATED FROM PYTHON SOURCE LINES 354-366

Feature importances in the statistical model
--------------------------------------------

In this section, we will train a regressor, and plot the feature importances.

.. topic:: Note:

    To minimize computation time, we use the feature importances computed by the
    |RandomForestRegressor|, but you should prefer |permutation importances|
    instead (which are less subject to biases).

First, let's train the |RandomForestRegressor|:

.. GENERATED FROM PYTHON SOURCE LINES 366-372

.. code-block:: default

    from sklearn.ensemble import RandomForestRegressor

    regressor = RandomForestRegressor()
    regressor.fit(X_train_enc, y_train)

.. raw:: html
RandomForestRegressor()
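
The note above recommends permutation importances over the impurity-based
ones. A minimal sketch of that alternative is given below on synthetic data,
so that it runs quickly; the arrays ``X_demo`` and ``y_demo`` are made up for
the illustration, and on this example one would presumably pass
``X_test_enc`` and ``y_test`` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of three features drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

# Rank features from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(
        f"feature {idx}: {result.importances_mean[idx]:.3f} "
        f"+/- {result.importances_std[idx]:.3f}"
    )
```

Because the importances are measured on held-out data, this approach costs
more computation than reading ``feature_importances_`` from the fitted forest.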


.. GENERATED FROM PYTHON SOURCE LINES 373-374

Retrieving the feature importances:

.. GENERATED FROM PYTHON SOURCE LINES 374-381

.. code-block:: default

    importances = regressor.feature_importances_
    std = np.std([tree.feature_importances_ for tree in regressor.estimators_], axis=0)
    # argsort orders the features from least to most important
    indices = np.argsort(importances)
    # Reverse so that the most important features come first
    indices = list(reversed(indices))

.. GENERATED FROM PYTHON SOURCE LINES 382-383

Plotting the results:

.. GENERATED FROM PYTHON SOURCE LINES 383-396

.. code-block:: default

    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 9))
    plt.title("Feature importances")
    n = 20
    n_indices = indices[:n]
    labels = np.array(feature_names)[n_indices]
    # xerr: on a horizontal bar plot, the spread applies to the bar length
    plt.barh(range(n), importances[n_indices], color="b", xerr=std[n_indices])
    plt.yticks(range(n), labels, size=15)
    plt.tight_layout(pad=1)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_01_dirty_categories_002.png
   :alt: Feature importances
   :srcset: /auto_examples/images/sphx_glr_01_dirty_categories_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 397-408

We can deduce from this data that the three factors that most determine
the salary are: being hired for a long time, being a manager, and
having a permanent, full-time job :)

.. topic:: The |TV| automates preprocessing

  As this notebook demonstrates, many preprocessing steps can be
  automated by the |TV|, and the resulting pipeline can still be
  inspected, even with non-normalized entries.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 2 minutes 54.415 seconds)

.. _sphx_glr_download_auto_examples_01_dirty_categories.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/01_dirty_categories.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 01_dirty_categories.py <01_dirty_categories.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 01_dirty_categories.ipynb <01_dirty_categories.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_