dirty_cat.SuperVectorizer


class dirty_cat.SuperVectorizer(*, cardinality_threshold=40, low_card_cat_transformer=OneHotEncoder(), high_card_cat_transformer=GapEncoder(n_components=30), numerical_transformer=None, datetime_transformer=DatetimeEncoder(), auto_cast=True, impute_missing='auto', remainder='passthrough', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)[source]

Easily transforms a heterogeneous data table (such as a dataframe) to a numerical array for machine learning. To do so, it transforms each column according to its data type. It provides a simplified interface to scikit-learn’s ColumnTransformer.

New in version 0.2.0.

Parameters
  • cardinality_threshold (int, default=40) – Threshold used to split categorical features into two groups: features whose cardinality (number of unique values) is strictly below this value are considered low-cardinality, and features at or above it are considered high-cardinality. A different encoder is applied to each group, set by the parameters low_card_cat_transformer and high_card_cat_transformer respectively.

  • low_card_cat_transformer (Transformer or str or None, default=OneHotEncoder()) – Transformer used on categorical/string features with low cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. OneHotEncoder()), a Pipeline containing the preprocessing steps, None to apply remainder, ‘drop’ for dropping the columns, or ‘passthrough’ to return the unencoded columns.

  • high_card_cat_transformer (Transformer or str or None, default=GapEncoder(n_components=30)) – Transformer used on categorical/string features with high cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder()), a Pipeline containing the preprocessing steps, None to apply remainder, ‘drop’ for dropping the columns, or ‘passthrough’ to return the unencoded columns.

  • numerical_transformer (Transformer or str or None, default=None) – Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler()), a Pipeline containing the preprocessing steps, None to apply remainder, ‘drop’ for dropping the columns, or ‘passthrough’ to return the unencoded columns.

  • datetime_transformer (Transformer or str or None, default=DatetimeEncoder()) – Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder()), a Pipeline containing the preprocessing steps, None to apply remainder, ‘drop’ for dropping the columns, or ‘passthrough’ to return the unencoded columns.

  • auto_cast (bool, default=True) – If set to True, will try to convert each column to the best possible data type (dtype).

  • impute_missing (str, default='auto') – When to impute missing values in string columns. ‘auto’ will impute missing values if it’s considered appropriate (we are using an encoder that does not support missing values and/or specific versions of pandas, numpy and scikit-learn). ‘force’ will impute all missing values. ‘skip’ will not impute at all. When imputed, missing values are replaced by the string ‘missing’. See also attribute imputed_columns_.

  • remainder ({'drop', 'passthrough'} or estimator, default='passthrough') – By default ('passthrough'), all remaining columns that were not assigned to a transformer are automatically passed through; this subset of columns is concatenated with the output of the transformers. With remainder='drop', the non-specified columns are dropped instead. By setting remainder to an estimator, the remaining non-specified columns are transformed by that estimator; the estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

  • sparse_threshold (float, default=0.3) – If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.

  • n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • transformer_weights (dict, default=None) – Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

  • verbose (bool, default=False) – If True, the time elapsed while fitting each transformer will be printed as it is completed.
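As a sketch of how the cardinality split works (assuming a pandas DataFrame; this reproduces the documented rule with pandas directly, not the internal implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],   # 3 unique values
    "employee_id": ["id_0", "id_1", "id_2", "id_3"],  # 4 unique values
})

cardinality_threshold = 4  # illustrative value; the default is 40

# Strictly below the threshold -> low cardinality; at or above -> high.
low_card = [c for c in df.columns if df[c].nunique() < cardinality_threshold]
high_card = [c for c in df.columns if df[c].nunique() >= cardinality_threshold]
# low_card  -> ["city"]         (handled by low_card_cat_transformer)
# high_card -> ["employee_id"]  (handled by high_card_cat_transformer)
```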

transformers_

The final distribution of columns. List of 3-tuples containing (1) the name of the group of columns, (2) the encoder/transformer instance that will be applied, or the string “passthrough” or “drop”, and (3) the list of column names or indices.

Type

List[Tuple[str, Union[str, BaseEstimator], Union[str, int]]]

columns_

The column names of the fitted array.

Type

List[Union[str, int]]

types_

A mapping of inferred types per column. Key is the column name, value is the inferred dtype.

Type

Dict[str, type]

imputed_columns_

The list of columns in which missing values were imputed.

Type

List[str]

OptionalEstimator

alias of Optional[Union[sklearn.base.BaseEstimator, str]]

fit(X, y=None)

Fit all transformers using X.

Parameters
  • X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.

  • y (array-like of shape (n_samples,...), default=None) – Targets for supervised learning.

Returns

self – This estimator.

Return type

ColumnTransformer

fit_transform(X, y=None)[source]

Fit all transformers, transform the data, and concatenate results.

Parameters
  • X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.

  • y (array-like of shape (n_samples,), default=None) – Targets for supervised learning.

Returns

X_t – hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type

{array-like, sparse matrix} of shape (n_samples, sum_n_components)

Raises

RuntimeError – If no transformers could be constructed, usually because the transformers passed do not match any column. To fix the issue, try passing fewer None values as transformers.
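Since SuperVectorizer is a simplified interface to scikit-learn’s ColumnTransformer, a roughly equivalent manual construction (a sketch of the behavior, not the internal implementation) looks like:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris"],
    "salary": [45000, 52000, 49000],
})

# SuperVectorizer would assign its low_card_cat_transformer
# (OneHotEncoder by default) to the low-cardinality "city" column
# and, with remainder='passthrough', keep "salary" unchanged.
ct = ColumnTransformer(
    [("low_card_cat", OneHotEncoder(), ["city"])],
    remainder="passthrough",
)
X_t = ct.fit_transform(df)
# X_t stacks two one-hot columns for "city" plus the "salary" column,
# giving shape (3, 3).
```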

get_feature_names()[source]

Deprecated; use get_feature_names_out() instead.

get_feature_names_out(input_features=None)[source]

Returns clean feature names in the format “<column_name>_<value>” for columns encoded by OneHotEncoder or similar (e.g. “job_title_Police officer”), and “<column_name>” for columns that were not encoded.

get_params(deep=True)

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

property named_transformers_

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
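This attribute comes from the underlying ColumnTransformer, so it can be sketched with scikit-learn directly (the transformer name "low_card_cat" is illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Paris", "London"]})

ct = ColumnTransformer([("low_card_cat", OneHotEncoder(), ["city"])])
ct.fit(df)

# Look up the fitted encoder by the name it was registered under.
fitted_enc = ct.named_transformers_["low_card_cat"]
```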

set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.

Parameters

**kwargs (dict) – Estimator parameters.

Returns

self – This estimator.

Return type

ColumnTransformer

transform(X)[source]

Transform X by applying transformers on each column, then concatenate results.

Parameters

X ({array-like, dataframe} of shape (n_samples, n_features)) – The data to be transformed.

Returns

X_t – hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type

{array-like, sparse matrix} of shape (n_samples, sum_n_components)
