dirty_cat.TableVectorizer

Usage examples at the bottom of this page.

class dirty_cat.TableVectorizer(*, cardinality_threshold=40, low_card_cat_transformer=None, high_card_cat_transformer=None, numerical_transformer=None, datetime_transformer=None, auto_cast=True, impute_missing='auto', remainder='passthrough', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)[source]

Automatically transform a heterogeneous dataframe to a numerical array.

Easily transforms a heterogeneous data table (such as a DataFrame) into a numerical array for machine learning. To do so, it transforms each column depending on its data type. It provides a simplified interface to the ColumnTransformer; more documentation of attributes and functions is available in its doc.
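For intuition, here is a minimal sketch of what this interface abstracts away, built directly on scikit-learn's ColumnTransformer (the column names and encoders are illustrative assumptions, not necessarily what TableVectorizer would pick):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hand-built equivalent of part of what TableVectorizer automates:
# route each column group to a suitable transformer.
manual = ColumnTransformer([
    ("low_card_cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("numeric", "passthrough", ["year_first_hired"]),
])

X = pd.DataFrame({"gender": ["F", "M"], "year_first_hired": [1986, 1988]})
out = manual.fit_transform(X)
print(out.shape)  # (2, 3): two one-hot columns plus the passthrough column
```

TableVectorizer infers the column groups and default transformers automatically from the dtypes instead of requiring them to be listed by hand.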

New in version 0.2.0.

Parameters:
cardinality_threshold : int, default=40

Two lists of features will be created depending on this value: categorical features whose cardinality is strictly below this value are considered low-cardinality, and those at or above it are considered high-cardinality. Different transformers will be applied to these two groups, defined by the parameters low_card_cat_transformer and high_card_cat_transformer respectively. Note: currently, missing values are counted as a single unique value (so they count towards the cardinality).
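A sketch of this split using pandas (the threshold and data are illustrative; nunique(dropna=False) mirrors the note that missing values count as one unique value):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "division": ["Records", "Fugitive", "Case Mgmt", "Patrol"],
})
cardinality_threshold = 3

# Strictly under the threshold -> low cardinality; at or above -> high.
low_card = [col for col in df.columns
            if df[col].nunique(dropna=False) < cardinality_threshold]
high_card = [col for col in df.columns
             if df[col].nunique(dropna=False) >= cardinality_threshold]

print(low_card)   # ['gender']
print(high_card)  # ['division']
```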

low_card_cat_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on categorical/string features with low cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. OneHotEncoder), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (OneHotEncoder(handle_unknown="ignore", drop="if_binary")). Features classified under this category are imputed based on the strategy defined with impute_missing.

high_card_cat_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on categorical/string features with high cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (GapEncoder(n_components=30)). Features classified under this category are imputed based on the strategy defined with impute_missing.

numerical_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (here nothing, so ‘passthrough’). Features classified under this category are not imputed at all (regardless of impute_missing).

datetime_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (DatetimeEncoder()). Features classified under this category are not imputed at all (regardless of impute_missing).

auto_cast : bool, optional, default=True

If set to True, will try to convert each column to the best possible data type (dtype).

impute_missing : {‘auto’, ‘force’, ‘skip’}, default=’auto’

When to impute missing values in categorical (textual) columns. ‘auto’ will impute missing values if it is considered appropriate (we are using an encoder that does not support missing values and/or specific versions of pandas, numpy and scikit-learn). ‘force’ will impute missing values in all categorical columns. ‘skip’ will not impute at all. When imputed, missing values are replaced by the string ‘missing’. As imputation logic for numerical features can be quite intricate, it is left to the user to manage. See also attribute imputed_columns_.
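The string imputation described above amounts to filling missing values with the literal string ‘missing’ on the affected categorical columns; a minimal sketch:

```python
import pandas as pd

col = pd.Series(["Police Officer", None, "Social Worker"], dtype="object")
imputed = col.fillna("missing")  # missing values become the string 'missing'
print(imputed.tolist())  # ['Police Officer', 'missing', 'Social Worker']
```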

remainder : {‘drop’, ‘passthrough’} or Transformer, default=’passthrough’

By default (remainder=’passthrough’), all remaining columns that were not specified in transformers are automatically passed through: this subset of columns is concatenated with the output of the transformers. By specifying remainder=’drop’, the non-specified columns are dropped instead. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator; the estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.
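Since TableVectorizer builds on scikit-learn's ColumnTransformer, the remainder semantics can be sketched directly on the latter (column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# remainder='drop': only columns named in the transformers survive.
ct_drop = ColumnTransformer([("keep", "passthrough", ["a"])],
                            remainder="drop")
dropped = ct_drop.fit_transform(X)
print(dropped.shape)  # (2, 1)

# remainder='passthrough': unlisted columns are appended to the output.
ct_pass = ColumnTransformer([("keep", "passthrough", ["a"])],
                            remainder="passthrough")
passed = ct_pass.fit_transform(X)
print(passed.shape)  # (2, 3)
```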

sparse_threshold : float, default=0.3

If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
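The same mechanism can be observed on the underlying ColumnTransformer; in this illustrative sketch a one-hot block with density 8/64 = 0.125 stays sparse under the default threshold of 0.3, but is densified when sparse_threshold=0:

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"cat": list("abcdefgh")})  # 8 distinct values -> 8x8 one-hot

# Density of the one-hot output is 8 nonzeros / 64 cells = 0.125.
always_dense = ColumnTransformer([("ohe", OneHotEncoder(), ["cat"])],
                                 sparse_threshold=0.0)
keep_sparse = ColumnTransformer([("ohe", OneHotEncoder(), ["cat"])],
                                sparse_threshold=0.3)

dense_out = always_dense.fit_transform(X)
sparse_out = keep_sparse.fit_transform(X)
print(sparse.issparse(dense_out), sparse.issparse(sparse_out))  # False True
```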

n_jobs : int, optional

Number of jobs to run in parallel. None (the default) means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

transformer_weights : dict, optional

Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

verbose : bool, default=False

If True, the time elapsed while fitting each transformer will be printed as it is completed.

See also

dirty_cat.GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

dirty_cat.MinHashEncoder

Encode string columns as a numeric array with the minhash method.

dirty_cat.SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

Notes

The column order of the input data is not guaranteed to be the same as the output data (returned by transform()). This is due to the way the underlying ColumnTransformer works. However, the output column order will always be the same for different calls to transform() on the same fitted TableVectorizer instance. For example, if input data has columns [‘name’, ‘job’, ‘year’], then output columns might be shuffled, e.g. [‘job’, ‘year’, ‘name’], but every call to transform() on this instance will return them in that same order.

Examples

Fit a TableVectorizer on an example dataset:

>>> from dirty_cat.datasets import fetch_employee_salaries
>>> ds = fetch_employee_salaries()
>>> ds.X.head(3)
  gender department                          department_name                                           division assignment_category      employee_position_title underfilled_job_title date_first_hired  year_first_hired
0      F        POL                     Department of Police  MSB Information Mgmt and Tech Division Records...    Fulltime-Regular  Office Services Coordinator                   NaN       09/22/1986              1986
1      M        POL                     Department of Police         ISB Major Crimes Division Fugitive Section    Fulltime-Regular        Master Police Officer                   NaN       09/12/1988              1988
2      F        HHS  Department of Health and Human Services      Adult Protective and Case Management Services    Fulltime-Regular             Social Worker IV                   NaN       11/19/1989              1989
>>> tv = TableVectorizer()
>>> tv.fit(ds.X)

Now, we can inspect the transformers assigned to each column:

>>> tv.transformers_
[
    ('datetime', DatetimeEncoder(), ['date_first_hired']),
    ('low_card_cat', OneHotEncoder(drop='if_binary', handle_unknown='ignore'),
     ['gender', 'department', 'department_name', 'assignment_category']),
    ('high_card_cat', GapEncoder(n_components=30),
     ['division', 'employee_position_title', 'underfilled_job_title']),
    ('remainder', 'passthrough', ['year_first_hired'])
]
Attributes:
transformers_ : list of 3-tuples (str, Transformer or str, list of str)

The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In case there were no columns selected, this will be an unfitted transformer. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).

columns_ : Index

The columns of the data seen during fit. Data passed to the transform method is checked against them.

types_ : dict mapping of str to type

A mapping of inferred types per column. Key is the column name, value is the inferred dtype. Exists only if auto_cast=True.

imputed_columns_ : list of str

The list of columns in which we imputed the missing values.

Methods

fit(X[, y])

Fit all transformers using X.

fit_transform(X[, y])

Fit all transformers, transform the data, and concatenate the results.

get_feature_names([input_features])

Return clean feature names.

get_feature_names_out([input_features])

Return clean feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set the output container when "transform" and "fit_transform" are called.

set_params(**kwargs)

Set the parameters of this estimator.

transform(X)

Transform X by applying the fitted transformers on the columns.

fit(X, y=None)[source]

Fit all transformers using X.

Parameters:
X : {array-like, dataframe} of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array-like of shape (n_samples,…), default=None

Targets for supervised learning.

Returns:
self : ColumnTransformer

This estimator.

fit_transform(X, y=None)[source]

Fit all transformers, transform the data, and concatenate the results.

In practice, it (1) converts features to their best possible types if auto_cast=True, (2) classifies columns based on their data type, (3) replaces “false missing” values (see _replace_false_missing()) and imputes categorical columns depending on impute_missing, and finally (4) transforms X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array-like of shape (n_samples,), optional

Targets for supervised learning.

Returns:
{array-like, sparse matrix} of shape (n_samples, sum_n_components)

Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

get_feature_names(input_features=None)[source]

Return clean feature names. Compatibility method for sklearn < 1.0.

Use get_feature_names_out() instead.

Parameters:
input_features : None

Unused, only here for compatibility.

Returns:
list of str

Feature names.

get_feature_names_out(input_features=None)[source]

Return clean feature names.

Feature names are formatted like: “<column_name>_<value>” if encoded by OneHotEncoder or alike, (e.g. “job_title_Police officer”), or “<column_name>” otherwise.

Parameters:
input_features : None

Unused, only here for compatibility.

Returns:
list of str

Feature names.

get_params(deep=True)[source]

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

property named_transformers_

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

set_output(*, transform=None)[source]

Set the output container when “transform” and “fit_transform” are called.

Calling set_output will set the output of all estimators in transformers and transformers_.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**kwargs)[source]

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.

Parameters:
**kwargs : dict

Estimator parameters.

Returns:
self : ColumnTransformer

This estimator.

transform(X)[source]

Transform X by applying the fitted transformers on the columns.

Parameters:
X : array-like of shape (n_samples, n_features)

The data to be transformed.

Returns:
{array-like, sparse matrix} of shape (n_samples, sum_n_components)

Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
