.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/01_dirty_categories.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_01_dirty_categories.py>`
        to download the full example code
        or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_01_dirty_categories.py:


Dirty categories: machine learning with non-normalized strings
==============================================================

Including strings that represent categories often calls for much data
preparation. In particular, categories may appear with many morphological
variants when they have been manually input or assembled from diverse
sources.

Here we look at a dataset on wages [#]_ where the column 'Employee
Position Title' contains dirty categories. On such a column, standard
categorical encodings lead to very high dimensions and can lose
information on which categories are similar.

We investigate various encodings of this dirty column for the machine
learning workflow, predicting the 'Current Annual Salary' with gradient
boosted trees. First we manually assemble a complex encoder for the full
dataframe, after which we show a much simpler way, albeit with less fine
control.

.. [#] https://www.openml.org/d/42125

.. |TV| replace:: :class:`~dirty_cat.TableVectorizer`

.. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |ColumnTransformer| replace:: :class:`~sklearn.compose.ColumnTransformer`

.. |RandomForestRegressor| replace:: :class:`~sklearn.ensemble.RandomForestRegressor`

.. |Gap| replace:: :class:`~dirty_cat.GapEncoder`

.. |MinHash| replace:: :class:`~dirty_cat.MinHashEncoder`

.. |HGBR| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. |SE| replace:: :class:`~dirty_cat.SimilarityEncoder`

.. |permutation importances| replace:: :func:`~sklearn.inspection.permutation_importance`

.. GENERATED FROM PYTHON SOURCE LINES 57-61

The data
--------

We first retrieve the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 61-65

.. code-block:: default

    from dirty_cat.datasets import fetch_employee_salaries

    employee_salaries = fetch_employee_salaries()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/dirty_cat/datasets/_fetching.py:608: UserWarning: Could not find the dataset 42125 locally. Downloading it from OpenML; this might take a while... If it is interrupted, some files might be invalid/incomplete: if on the following run, the fetching raises errors, you can try fixing this issue by deleting the directory /home/circleci/project/dirty_cat/datasets/data.
      info = _fetch_openml_dataset(dataset_id, data_directory)

.. GENERATED FROM PYTHON SOURCE LINES 66-67

X, the input data (descriptions of employees):

.. GENERATED FROM PYTHON SOURCE LINES 67-70

.. code-block:: default

    X = employee_salaries.X
    X

.. raw:: html

	gender	department	department_name	division	assignment_category	employee_position_title	underfilled_job_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records...	Fulltime-Regular	Office Services Coordinator	NaN	09/22/1986	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	NaN	09/12/1988	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	NaN	11/19/1989	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	NaN	05/05/2014	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	NaN	03/05/2007	2007
...	...	...	...	...	...	...	...	...	...
9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	NaN	11/03/2015	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	NaN	11/28/1988	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Serv...	Parttime-Regular	Medical Doctor IV - Psychiatrist	NaN	04/30/2001	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	NaN	09/05/2006	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	NaN	01/30/2012	2012

9228 rows × 9 columns
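
Before choosing encoders, it can help to count the distinct values in each
column, since columns with many distinct strings are the likely "dirty" ones.
The sketch below does this on a small hand-made dataframe whose rows are copied
from the first five rows of the table; ``toy`` and ``cardinality`` are
illustrative names and are not part of the example script.

```python
import pandas as pd

# Hand-made toy data (first five rows of two columns of the table)
toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M", "M"],
        "employee_position_title": [
            "Office Services Coordinator",
            "Master Police Officer",
            "Social Worker IV",
            "Resident Supervisor II",
            "Planning Specialist III",
        ],
    }
)

# Number of distinct values per column, highest cardinality first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

On the full dataset, the same ``nunique()`` call on ``X`` would give a rough
idea of which columns need a dirty-category encoder.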

.. GENERATED FROM PYTHON SOURCE LINES 71-72

and y, our target column (the annual salary):

.. GENERATED FROM PYTHON SOURCE LINES 72-75

.. code-block:: default

    y = employee_salaries.y
    y.name

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    'current_annual_salary'

.. GENERATED FROM PYTHON SOURCE LINES 76-77

Now, let's carry out some basic preprocessing:

.. GENERATED FROM PYTHON SOURCE LINES 77-87

.. code-block:: default

    import pandas as pd

    X["date_first_hired"] = pd.to_datetime(X["date_first_hired"])
    X["year_first_hired"] = X["date_first_hired"].apply(lambda x: x.year)
    # Get a mask of the rows with missing values in 'gender'
    mask = X.isna()["gender"]
    # And remove them
    X.dropna(subset=["gender"], inplace=True)
    y = y[~mask]

.. GENERATED FROM PYTHON SOURCE LINES 88-93

Assembling a machine-learning pipeline that encodes the data
------------------------------------------------------------

To build a learning pipeline, we need to assemble encoders for each column,
and apply a supervised learning model on top.

.. GENERATED FROM PYTHON SOURCE LINES 95-100

The categorical encoders
........................

An encoder is needed to turn a categorical column into a numerical
representation:

.. GENERATED FROM PYTHON SOURCE LINES 100-104

.. code-block:: default

    from sklearn.preprocessing import OneHotEncoder

    one_hot = OneHotEncoder(handle_unknown="ignore", sparse=False)

.. GENERATED FROM PYTHON SOURCE LINES 105-108

We assemble these to apply them to the relevant columns.
The |ColumnTransformer| is created by specifying a set of transformers
along with the names of the columns to which each must be applied:

.. GENERATED FROM PYTHON SOURCE LINES 108-119

.. code-block:: default

    from sklearn.compose import make_column_transformer

    encoder = make_column_transformer(
        (one_hot, ["gender", "department_name", "assignment_category"]),
        ("passthrough", ["year_first_hired"]),
        # Last but not least, our dirty column
        (one_hot, ["employee_position_title"]),
        remainder="drop",
    )

.. GENERATED FROM PYTHON SOURCE LINES 120-127

Pipelining an encoder with a learner
....................................

We will use a |HGBR|, which is a good predictor for data with
heterogeneous columns (for scikit-learn versions earlier than 1.0,
the experimental feature must be enabled explicitly):

.. GENERATED FROM PYTHON SOURCE LINES 127-137

.. code-block:: default

    from sklearn.experimental import enable_hist_gradient_boosting

    # We can now import the HistGradientBoostingRegressor from ensemble
    from sklearn.ensemble import HistGradientBoostingRegressor

    # We then create a pipeline chaining our encoders to a learner
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
      warnings.warn(

.. GENERATED FROM PYTHON SOURCE LINES 138-139

The pipeline can be readily applied to the dataframe for prediction:

.. GENERATED FROM PYTHON SOURCE LINES 139-141

.. code-block:: default

    pipeline.fit(X, y)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(

.. raw:: html

Pipeline(steps=[('columntransformer',
                     ColumnTransformer(transformers=[('onehotencoder-1',
                                                      OneHotEncoder(handle_unknown='ignore',
                                                                    sparse=False),
                                                      ['gender', 'department_name',
                                                       'assignment_category']),
                                                     ('passthrough', 'passthrough',
                                                      ['year_first_hired']),
                                                     ('onehotencoder-2',
                                                      OneHotEncoder(handle_unknown='ignore',
                                                                    sparse=False),
                                                      ['employee_position_title'])])),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])
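
To make the limitation of one-hot encoding on a dirty column concrete, here is
a minimal sketch on hand-made data. The dataframe ``toy``, the encoder
``toy_encoder`` and the job titles are illustrative assumptions, not taken from
the dataset. It shows that every distinct spelling gets its own column, and
that a title unseen during fitting is encoded as all zeros, so its similarity
to known titles is lost.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three related police titles and one unrelated title
toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M"],
        "year_first_hired": [1986, 1988, 1989, 2014],
        "employee_position_title": [
            "Police Officer II",
            "Police Officer III",
            "Master Police Officer",
            "Social Worker IV",
        ],
    }
)

toy_encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("passthrough", ["year_first_hired"]),
    (OneHotEncoder(handle_unknown="ignore"), ["employee_position_title"]),
    remainder="drop",
)
encoded = toy_encoder.fit_transform(toy)
# 2 gender columns + 1 year column + 4 title columns (one per distinct string)
n_rows, n_cols = encoded.shape

# A title that was not seen during fit, although it resembles known ones
new_row = pd.DataFrame(
    {
        "gender": ["F"],
        "year_first_hired": [2020],
        "employee_position_title": ["Police Officer I"],
    }
)
new_encoded = toy_encoder.transform(new_row)
if hasattr(new_encoded, "toarray"):  # the output may be sparse or dense
    new_encoded = new_encoded.toarray()

# The last 4 columns are the title block: all zeros for the unseen title
title_block = new_encoded[0, -4:]
print(n_rows, n_cols, title_block)
```

With ``handle_unknown="ignore"``, the unseen title does not raise an error,
but the model receives no information about it at all.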


.. GENERATED FROM PYTHON SOURCE LINES 142-147

Dirty-category encoding
-----------------------

The |OneHotEncoder| is actually not well suited to the 'Employee
Position Title' column, as this column contains 400 different entries:

.. GENERATED FROM PYTHON SOURCE LINES 147-151

.. code-block:: default

    import numpy as np

    # Count the distinct entries of the dirty column
    X["employee_position_title"].nunique()

.. GENERATED FROM PYTHON SOURCE LINES 152-156

.. _example_minhash_encoder:

We will now experiment with encoders specially made for handling
dirty columns:

.. GENERATED FROM PYTHON SOURCE LINES 156-172

.. code-block:: default

    from dirty_cat import (
        SimilarityEncoder,
        TargetEncoder,
        MinHashEncoder,
        GapEncoder,
    )

    encoders = {
        "one-hot": one_hot,
        "similarity": SimilarityEncoder(),
        "target": TargetEncoder(handle_unknown="ignore"),
        "minhash": MinHashEncoder(n_components=100),
        "gap": GapEncoder(n_components=100),
    }

.. GENERATED FROM PYTHON SOURCE LINES 173-176

We now loop over the different encoding methods,
instantiate a new |Pipeline| each time, fit it
and store the returned cross-validation score:

.. GENERATED FROM PYTHON SOURCE LINES 176-196

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    all_scores = dict()

    for name, method in encoders.items():
        encoder = make_column_transformer(
            (one_hot, ["gender", "department_name", "assignment_category"]),
            ("passthrough", ["year_first_hired"]),
            # Last but not least, our dirty column
            (method, ["employee_position_title"]),
            remainder="drop",
        )

        pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())
        scores = cross_val_score(pipeline, X, y)
        print(f"{name} encoding")
        print(f"r2 score:  mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
        all_scores[name] = scores

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    one-hot encoding
    r2 score:  mean: 0.776; std: 0.028

    similarity encoding
    r2 score:  mean: 0.923; std: 0.014

    target encoding
    r2 score:  mean: 0.842; std: 0.030

    minhash encoding
    r2 score:  mean: 0.919; std: 0.012

    gap encoding
    r2 score:  mean: 0.922; std: 0.013

.. GENERATED FROM PYTHON SOURCE LINES 197-201

Plotting the results
....................

Finally, we plot the scores on a boxplot:

.. GENERATED FROM PYTHON SOURCE LINES 201-212

.. code-block:: default

    import seaborn
    import matplotlib.pyplot as plt

    plt.figure(figsize=(4, 3))
    ax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient="h")
    plt.ylabel("Encoding", size=20)
    plt.xlabel("Prediction accuracy ", size=20)
    plt.yticks(size=20)
    plt.tight_layout()

.. image-sg:: /auto_examples/images/sphx_glr_01_dirty_categories_001.png
   :alt: 01 dirty categories
   :srcset: /auto_examples/images/sphx_glr_01_dirty_categories_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 213-224

The clear trend is that encoders grasping similarities between categories
(|SE|, |MinHash|, and |Gap|) perform better than those discarding them.

|SE| is marginally the best performer, but it is less scalable on big
data than the |MinHash| and |Gap|. The most scalable encoder is
the |MinHash|. On the other hand, the |Gap| has the benefit of
providing interpretable features
(see :ref:`sphx_glr_auto_examples_02_investigating_dirty_categories.py`).

|

.. GENERATED FROM PYTHON SOURCE LINES 226-235

.. _example_table_vectorizer:

A simpler way: automatic vectorization
--------------------------------------

The code to assemble a column transformer is a bit tedious. We will
now explore a simpler, automated way of encoding the data.

Let's start again from the raw data:

.. GENERATED FROM PYTHON SOURCE LINES 235-239

.. code-block:: default

    employee_salaries = fetch_employee_salaries()
    X = employee_salaries.X
    y = employee_salaries.y

.. GENERATED FROM PYTHON SOURCE LINES 240-242

We'll drop the 'date_first_hired' column as it's redundant with
'year_first_hired'.

.. GENERATED FROM PYTHON SOURCE LINES 242-244

.. code-block:: default

    X = X.drop(["date_first_hired"], axis=1)

.. GENERATED FROM PYTHON SOURCE LINES 245-246

We still have a complex and heterogeneous dataframe:

.. GENERATED FROM PYTHON SOURCE LINES 246-248

.. code-block:: default

    X

.. raw:: html

	gender	department	department_name	division	assignment_category	employee_position_title	underfilled_job_title	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records...	Fulltime-Regular	Office Services Coordinator	NaN	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	NaN	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	NaN	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	NaN	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	NaN	2007
...	...	...	...	...	...	...	...	...
9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	NaN	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	NaN	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Serv...	Parttime-Regular	Medical Doctor IV - Psychiatrist	NaN	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	NaN	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	NaN	2012

9228 rows × 8 columns
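
The claim that 'date_first_hired' is redundant with 'year_first_hired' can be
checked directly. The sketch below does so on three rows copied from the first
table of this example; ``toy``, ``derived_year`` and ``is_redundant`` are
illustrative names, and the ``%m/%d/%Y`` date format is an assumption based on
how the dates are displayed.

```python
import pandas as pd

# Three (date, year) pairs copied from the displayed table
toy = pd.DataFrame(
    {
        "date_first_hired": ["09/22/1986", "09/12/1988", "05/05/2014"],
        "year_first_hired": [1986, 1988, 2014],
    }
)

# Recompute the year from the date and compare it with the stored year
derived_year = pd.to_datetime(toy["date_first_hired"], format="%m/%d/%Y").dt.year
is_redundant = bool((derived_year == toy["year_first_hired"]).all())

# If the year column carries the same information, the date can be dropped
if is_redundant:
    toy = toy.drop(columns=["date_first_hired"])
print(is_redundant, list(toy.columns))
```

Note that the year is only a coarse summary of the date, so dropping the date
is a modelling choice rather than a lossless operation.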

.. GENERATED FROM PYTHON SOURCE LINES 249-251

The |TV| can turn this dataframe into a form suited for
machine learning.

.. GENERATED FROM PYTHON SOURCE LINES 253-262

Using the TableVectorizer in a supervised-learning pipeline
-----------------------------------------------------------

Assembling the |TV| in a |Pipeline| with a powerful learner,
such as gradient boosted trees, gives **a machine-learning method that
can be readily applied to the dataframe**.

The |TV| requires at least dirty_cat 0.2.0.

.. GENERATED FROM PYTHON SOURCE LINES 262-269

.. code-block:: default

    from dirty_cat import TableVectorizer

    pipeline = make_pipeline(
        TableVectorizer(auto_cast=True), HistGradientBoostingRegressor()
    )

.. GENERATED FROM PYTHON SOURCE LINES 270-271

Let's perform a cross-validation to see how well this model predicts:

.. GENERATED FROM PYTHON SOURCE LINES 271-280

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(pipeline, X, y, scoring="r2")

    print(f"scores={scores}")
    print(f"mean={np.mean(scores)}")
    print(f"std={np.std(scores)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    scores=[0.92688413 0.89763491 0.92818275 0.93156022 0.92611958]
    mean=0.9220763174855001
    std=0.012361865357065842

.. GENERATED FROM PYTHON SOURCE LINES 281-284

The prediction here is about as good as above, but the code is much
simpler, as it does not involve specifying columns manually.

.. GENERATED FROM PYTHON SOURCE LINES 286-291

Analyzing the features created
------------------------------

Let us perform the same workflow, but without the |Pipeline|, so we can
analyze the TableVectorizer's mechanisms along the way.

.. GENERATED FROM PYTHON SOURCE LINES 291-293

.. code-block:: default

    table_vec = TableVectorizer(auto_cast=True)

.. GENERATED FROM PYTHON SOURCE LINES 294-295

We split the data between train and test, and transform them:

.. GENERATED FROM PYTHON SOURCE LINES 295-304

.. code-block:: default

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=42
    )

    X_train_enc = table_vec.fit_transform(X_train, y_train)
    X_test_enc = table_vec.transform(X_test)

.. GENERATED FROM PYTHON SOURCE LINES 305-306

The encoded data, X_train_enc and X_test_enc, are numerical arrays:

.. GENERATED FROM PYTHON SOURCE LINES 306-308

.. code-block:: default

    X_train_enc

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
            5.24002256e-02, 9.83610175e+00, 2.00700000e+03],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.00500000e+03],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.00900000e+03],
           ...,
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 1.99000000e+03],
           [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.01200000e+03],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
            5.00000000e-02, 5.00000000e-02, 2.01400000e+03]])

.. GENERATED FROM PYTHON SOURCE LINES 309-310

They have more columns than the original dataframe, but not many more:

.. GENERATED FROM PYTHON SOURCE LINES 310-312

.. code-block:: default

    X_train.shape, X_train_enc.shape

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ((7843, 8), (7843, 169))

.. GENERATED FROM PYTHON SOURCE LINES 313-318

Inspecting the features created
...............................

The |TV| assigns a transformer to each column. We can inspect this
choice:

.. GENERATED FROM PYTHON SOURCE LINES 318-322

.. code-block:: default

    from pprint import pprint

    pprint(table_vec.transformers_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [('low_card_cat',
      OneHotEncoder(drop='if_binary', handle_unknown='ignore'),
      ['gender', 'department', 'department_name', 'assignment_category']),
     ('high_card_cat',
      GapEncoder(n_components=30),
      ['division', 'employee_position_title', 'underfilled_job_title']),
     ('remainder', 'passthrough', ['year_first_hired'])]

.. GENERATED FROM PYTHON SOURCE LINES 323-336

This is what is being passed to the |ColumnTransformer| under the hood.
If you're familiar with how the latter works, it should be very intuitive.
We can see that it classified columns such as 'gender' and
'assignment_category' as low cardinality string variables.
A |OneHotEncoder| will be applied to these columns.

The vectorizer distinguishes between string variables (data type
``object`` and ``string``) and categorical variables (data type
``category``).

Next, we can have a look at the encoded feature names.

Before encoding:

.. GENERATED FROM PYTHON SOURCE LINES 336-338

.. code-block:: default

    X.columns.to_list()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title', 'underfilled_job_title', 'year_first_hired']

.. GENERATED FROM PYTHON SOURCE LINES 339-340

After encoding (we only show the first 8 feature names):

.. GENERATED FROM PYTHON SOURCE LINES 340-343

.. code-block:: default

    feature_names = table_vec.get_feature_names_out()
    feature_names[:8]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['gender_F', 'gender_M', 'gender_nan', 'department_BOA', 'department_BOE', 'department_CAT', 'department_CCL', 'department_CEC']

.. GENERATED FROM PYTHON SOURCE LINES 344-350

As we can see, it gave us interpretable columns.
This is because we used the |Gap| on the column 'division',
which was classified as a high cardinality string variable
(default values, see |TV|'s docstring).

In total, we have a reasonable number of encoded columns:

.. GENERATED FROM PYTHON SOURCE LINES 350-353

.. code-block:: default

    len(feature_names)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    169

.. GENERATED FROM PYTHON SOURCE LINES 354-366

Feature importances in the statistical model
--------------------------------------------

In this section, we will train a regressor, and plot the feature importances.

.. topic:: Note:

   To minimize computation time, we use the feature importances computed by the
   |RandomForestRegressor|, but you should prefer |permutation importances|
   instead (which are less subject to biases).

First, let's train the |RandomForestRegressor|:

.. GENERATED FROM PYTHON SOURCE LINES 366-372

.. code-block:: default

    from sklearn.ensemble import RandomForestRegressor

    regressor = RandomForestRegressor()
    regressor.fit(X_train_enc, y_train)

.. raw:: html

RandomForestRegressor()
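
The note in this section recommends permutation importances over the
impurity-based ones. As a minimal sketch of what that could look like, the
block below uses ``sklearn.inspection.permutation_importance`` on synthetic
data. The arrays ``X_toy`` and ``y_toy`` and the small forest are assumptions
made for illustration, not the data or settings of this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only the first of three features drives the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

toy_regressor = RandomForestRegressor(n_estimators=50, random_state=0)
toy_regressor.fit(X_toy, y_toy)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(
    toy_regressor, X_toy, y_toy, n_repeats=5, random_state=0
)
most_important = int(np.argmax(result.importances_mean))
print(result.importances_mean, most_important)
```

In this example one would pass the fitted ``regressor`` together with
``X_test_enc`` and ``y_test``, so that the importances are measured on held-out
data, at the cost of extra computation.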

.. GENERATED FROM PYTHON SOURCE LINES 373-374

Retrieving the feature importances:

.. GENERATED FROM PYTHON SOURCE LINES 374-381

.. code-block:: default

    importances = regressor.feature_importances_
    # Spread of each feature's importance across the trees of the forest
    std = np.std([tree.feature_importances_ for tree in regressor.estimators_], axis=0)
    indices = np.argsort(importances)  # sorted from least to most important
    indices = list(reversed(indices))  # now from most to least important

.. GENERATED FROM PYTHON SOURCE LINES 382-383

Plotting the results:

.. GENERATED FROM PYTHON SOURCE LINES 383-396

.. code-block:: default

    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 9))
    plt.title("Feature importances")
    n = 20
    n_indices = indices[:n]
    labels = np.array(feature_names)[n_indices]
    # Bars are horizontal, so the error bars go along the x axis
    plt.barh(range(n), importances[n_indices], color="b", xerr=std[n_indices])
    plt.yticks(range(n), labels, size=15)
    plt.tight_layout(pad=1)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_01_dirty_categories_002.png
   :alt: Feature importances
   :srcset: /auto_examples/images/sphx_glr_01_dirty_categories_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 397-408

We can deduce from this data that the three factors that most strongly
determine the salary are: having been hired a long time ago, being a
manager, and having a permanent, full-time job :)

.. topic:: The |TV| automates preprocessing

   As this notebook demonstrates, many preprocessing steps can be
   automated by the |TV|, and the resulting pipeline can still be
   inspected, even with non-normalized entries.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 2 minutes 54.415 seconds)

.. _sphx_glr_download_auto_examples_01_dirty_categories.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-cat/dirty-cat/0.4.1?urlpath=lab/tree/notebooks/auto_examples/01_dirty_categories.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 01_dirty_categories.py <01_dirty_categories.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 01_dirty_categories.ipynb <01_dirty_categories.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_