dirty_cat.SuperVectorizer

Usage examples at the bottom of this page.

class dirty_cat.SuperVectorizer(*, cardinality_threshold=40, low_card_cat_transformer=None, high_card_cat_transformer=None, numerical_transformer=None, datetime_transformer=None, auto_cast=True, impute_missing='auto', remainder='passthrough', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)[source]

Easily transforms a heterogeneous data table (such as a dataframe) into a numerical array for machine learning. To do so, it transforms each column depending on its data type. It provides a simplified interface to sklearn.compose.ColumnTransformer; more documentation of attributes and functions is available in its doc.
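Conceptually, SuperVectorizer assembles a ColumnTransformer for you. A minimal sketch of the manual setup it automates, written with scikit-learn directly (the column assignments below are illustrative placeholders, not dirty_cat internals):

```python
# What SuperVectorizer automates: routing each column to a transformer
# suited to its type, here done by hand with ColumnTransformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"],  # low-cardinality categorical
    "age": [25, 32, 47],                 # numerical
})

ct = ColumnTransformer(
    transformers=[
        ("low_card_cat", OneHotEncoder(), ["city"]),
    ],
    remainder="passthrough",  # numerical columns pass through untouched
)
result = ct.fit_transform(df)
print(result.shape)  # 3 rows, 2 one-hot columns + 1 numerical column
```

SuperVectorizer performs this column classification automatically at fit time, so no column lists have to be maintained by hand.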

New in version 0.2.0.

Parameters:
cardinality_threshold : int, default=40

Two lists of features will be created depending on this value: strictly under this value, the low cardinality categorical features, and above or equal, the high cardinality categorical features. Different transformers will be applied to these two groups, defined by the parameters low_card_cat_transformer and high_card_cat_transformer respectively. Note: currently, missing values are counted as a single unique value (so they count in the cardinality).
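The split rule described above can be sketched with plain pandas. This mimics the documented behaviour (strict inequality for the low-cardinality group, missing values counted as one extra unique value); it is not dirty_cat's actual internal code:

```python
# Illustrative sketch of the cardinality split: columns whose number of
# unique values is strictly below the threshold go to the low-cardinality
# group, the rest to the high-cardinality group.
import pandas as pd

def split_by_cardinality(df, threshold=40):
    low, high = [], []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        # dropna=False so missing values count as a single unique value
        n_unique = df[col].nunique(dropna=False)
        (low if n_unique < threshold else high).append(col)
    return low, high

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # 3 unique values
    "free_text": ["a1", "b2", "c3", "d4"],     # all distinct
})
low, high = split_by_cardinality(df, threshold=4)
print(low, high)  # ['color'] ['free_text']
```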

low_card_cat_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None

Transformer used on categorical/string features with low cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. OneHotEncoder(drop="if_binary")), a Pipeline containing the preprocessing steps, 'drop' for dropping the columns, 'remainder' for applying remainder, 'passthrough' to return the unencoded columns, or None to use the default transformer (OneHotEncoder()). Features classified under this category are imputed based on the strategy defined with impute_missing.

high_card_cat_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None

Transformer used on categorical/string features with high cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder()), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (GapEncoder(n_components=30)). Features classified under this category are imputed based on the strategy defined with impute_missing.

numerical_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None

Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler()), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (here nothing, so ‘passthrough’). Features classified under this category are not imputed at all (regardless of impute_missing).

datetime_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None

Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder()), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, ‘passthrough’ to return the unencoded columns, or None to use the default transformer (DatetimeEncoder()). Features classified under this category are not imputed at all (regardless of impute_missing).

auto_cast : bool, default=True

If set to True, will try to convert each column to the best possible data type (dtype).
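A rough sketch of the kind of dtype inference this performs, using plain pandas (SuperVectorizer's own casting logic may differ in its details):

```python
# Numeric-looking string columns are converted to a numeric dtype so they
# can be routed to the numerical transformer instead of a categorical one.
import pandas as pd

df = pd.DataFrame({"n": ["1", "2", "3"], "s": ["a", "b", "c"]})
df["n"] = pd.to_numeric(df["n"])  # "1" -> 1, column dtype becomes int64
print(df.dtypes["n"])  # int64
```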

impute_missing : str, default='auto'

When to impute missing values in categorical (textual) columns. 'auto' will impute missing values when it is considered necessary, e.g. when the chosen encoder does not support missing values (which also depends on the installed versions of pandas, numpy and scikit-learn). 'force' will impute missing values in all categorical columns. 'skip' will not impute at all. When imputed, missing values are replaced by the string 'missing'. As imputation logic for numerical features can be quite intricate, it is left to the user to manage. See also attribute imputed_columns_.
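The documented imputation behaviour for categorical columns can be sketched as follows. The helper below is hypothetical, not dirty_cat's internal function; it only illustrates the rule that missing values become the literal string 'missing' while numerical columns are left alone:

```python
# Replace missing values in string columns with the string 'missing';
# numerical columns are deliberately not touched.
import numpy as np
import pandas as pd

def impute_categorical(df):
    out = df.copy()
    for col in out.select_dtypes(include=["object"]).columns:
        out[col] = out[col].fillna("missing")
    return out

df = pd.DataFrame({"job": ["nurse", np.nan, "clerk"],
                   "age": [30, np.nan, 41]})
imputed = impute_categorical(df)
print(imputed["job"].tolist())      # ['nurse', 'missing', 'clerk']
print(imputed["age"].isna().sum())  # numerical column untouched -> 1
```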

remainder : typing.Union[typing.Literal["drop", "passthrough"], sklearn.base.TransformerMixin], default='passthrough'

With remainder='drop', only the columns handled by the fitted transformers are combined in the output, and the non-specified columns are dropped. With remainder='passthrough' (the default), all remaining columns that were not handled by a transformer are automatically passed through; this subset of columns is concatenated with the output of the transformers. By setting remainder to an estimator, the remaining non-specified columns are transformed by that estimator, which must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

sparse_threshold : float, default=0.3

If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
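The density rule can be demonstrated with the underlying ColumnTransformer, which SuperVectorizer wraps. A sketch with a highly sparse one-hot output (density 1/8 = 0.125):

```python
# Below the threshold (0.125 < 0.3) the stacked output stays sparse;
# with sparse_threshold=0 the result is always densified.
import pandas as pd
import scipy.sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"c": list("abcdefgh")})  # 8 distinct values

sparse_ct = ColumnTransformer([("ohe", OneHotEncoder(), ["c"])],
                              sparse_threshold=0.3)
dense_ct = ColumnTransformer([("ohe", OneHotEncoder(), ["c"])],
                             sparse_threshold=0)

sparse_out = sparse_ct.fit_transform(df)
dense_out = dense_ct.fit_transform(df)
print(scipy.sparse.issparse(sparse_out))  # True
print(scipy.sparse.issparse(dense_out))   # False
```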

n_jobs : int, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

transformer_weights : dict, default=None

Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values are the weights.

verbose : bool, default=False

If True, the time elapsed while fitting each transformer will be printed as it is completed.

Notes

The column order of the input data is not guaranteed to be the same in the output data (returned by transform). This is due to the way the underlying ColumnTransformer works. However, the output column order will always be the same across calls to transform on the same fitted SuperVectorizer instance. For example, if the input data has columns ['name', 'job', 'year'], then the output columns might be shuffled, e.g. ['job', 'year', 'name'], but every call to transform will return them in that same order.

Attributes:
transformers_ : typing.List[typing.Tuple[str, typing.Union[str, sklearn.base.TransformerMixin], typing.List[str]]]

The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In case there were no columns selected, this will be an unfitted transformer. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).

columns_: pandas.Index

The fitted array’s columns. They are applied to the data passed to the transform method.

types_: typing.Dict[str, type]

A mapping of inferred types per column. Key is the column name, value is the inferred dtype. Exists only if auto_cast=True.

imputed_columns_: typing.List[str]

The list of columns in which we imputed the missing values.

Methods

fit(X[, y])

Fit all transformers using X.

fit_transform(X[, y])

Fit all transformers, transform the data, and concatenate the results.

get_feature_names([input_features])

Ensures compatibility with sklearn < 1.0.

get_feature_names_out([input_features])

Returns clean feature names with format "<column_name>_<value>" if encoded by OneHotEncoder or similar, e.g. "job_title_Police officer", or "<column_name>" otherwise.

get_params([deep])

Get parameters for this estimator.

set_params(**kwargs)

Set the parameters of this estimator.

transform(X)

Transform X by applying fitted transformers on each column, and concatenate the results.

fit(X, y=None)

Fit all transformers using X.

Parameters:
X : {array-like, dataframe} of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array-like of shape (n_samples, …), default=None

Targets for supervised learning.

Returns:
self : ColumnTransformer

This estimator.

fit_transform(X, y=None)[source]

Fit all transformers, transform the data, and concatenate the results. In practice, it (1) converts features to their best possible types if auto_cast=True, (2) classifies columns based on their data type, (3) replaces "false missing" values (see function _replace_false_missing) and imputes categorical columns depending on impute_missing, and finally (4) transforms X.

Parameters:
X : {array-like, dataframe} of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array-like of shape (n_samples,), default=None

Targets for supervised learning.

Returns:
{array-like, sparse matrix} of shape (n_samples, sum_n_components)

hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

get_feature_names(input_features=None)[source]

Ensures compatibility with sklearn < 1.0. Use get_feature_names_out instead.

get_feature_names_out(input_features=None)[source]

Returns clean feature names with format "<column_name>_<value>" if encoded by OneHotEncoder or similar, e.g. "job_title_Police officer", or "<column_name>" otherwise.

Returns:
typing.List[str]

Feature names.

get_params(deep=True)

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

property named_transformers_

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.

Parameters:
**kwargs : dict

Estimator parameters.

Returns:
self : ColumnTransformer

This estimator.

transform(X)[source]

Transform X by applying fitted transformers on each column, and concatenate the results.

Parameters:
X : {array-like, dataframe} of shape (n_samples, n_features)

The data to be transformed.

Returns:
{array-like, sparse matrix} of shape (n_samples, sum_n_components)

hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Examples using dirty_cat.SuperVectorizer

Dirty categories: machine learning with non normalized strings

Handling datetime features with the DatetimeEncoder