dirty_cat.SuperVectorizer
Usage examples at the bottom of this page.
- class dirty_cat.SuperVectorizer(*, cardinality_threshold=40, low_card_cat_transformer=None, high_card_cat_transformer=None, numerical_transformer=None, datetime_transformer=None, auto_cast=True, impute_missing='auto', remainder='passthrough', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)
Easily transforms a heterogeneous data table (such as a dataframe) to a numerical array for machine learning. To do so, it transforms each column depending on its data type. It provides a simplified interface to sklearn.compose.ColumnTransformer; more documentation on attributes and functions is available in that class's documentation.
New in version 0.2.0.
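Since SuperVectorizer is a thin layer over sklearn.compose.ColumnTransformer, a rough hand-rolled equivalent of its default behaviour on a small table can be sketched with plain scikit-learn (the data and column names here are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A hypothetical heterogeneous table: one numeric column and one
# low-cardinality string column.
X = pd.DataFrame({
    "age": [23, 45, 31],
    "city": ["Paris", "Lyon", "Paris"],
})

# Roughly what SuperVectorizer builds by default for such a table:
# numeric columns pass through, low-cardinality strings are one-hot encoded.
ct = ColumnTransformer(
    [("low_card_cat", OneHotEncoder(), ["city"])],
    remainder="passthrough",
)
out = ct.fit_transform(X)
print(out.shape)  # (3, 3): two one-hot columns plus the passthrough "age"
```

SuperVectorizer automates the column assignment that is written out by hand above.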
- Parameters:
- cardinality_threshold : int, default=40
Two lists of features will be created depending on this value: strictly under this value, the low cardinality categorical features, and above or equal, the high cardinality categorical features. Different transformers will be applied to these two groups, defined by the parameters low_card_cat_transformer and high_card_cat_transformer respectively. Note: currently, missing values are counted as a single unique value (so they count in the cardinality).
- low_card_cat_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None
Transformer used on categorical/string features with low cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. OneHotEncoder(drop="if_binary")), a Pipeline containing the preprocessing steps, 'drop' for dropping the columns, 'remainder' for applying remainder, 'passthrough' to return the unencoded columns, or None to use the default transformer (OneHotEncoder()). Features classified under this category are imputed based on the strategy defined with impute_missing.
- high_card_cat_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None
Transformer used on categorical/string features with high cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder()), a Pipeline containing the preprocessing steps, 'drop' for dropping the columns, 'remainder' for applying remainder, 'passthrough' to return the unencoded columns, or None to use the default transformer (GapEncoder(n_components=30)). Features classified under this category are imputed based on the strategy defined with impute_missing.
- numerical_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None
Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler()), a Pipeline containing the preprocessing steps, 'drop' for dropping the columns, 'remainder' for applying remainder, 'passthrough' to return the unencoded columns, or None to use the default transformer (here nothing, so 'passthrough'). Features classified under this category are not imputed at all (regardless of impute_missing).
- datetime_transformer : typing.Optional[typing.Union[sklearn.base.TransformerMixin, typing.Literal["drop", "remainder", "passthrough"]]], default=None
Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder()), a Pipeline containing the preprocessing steps, 'drop' for dropping the columns, 'remainder' for applying remainder, 'passthrough' to return the unencoded columns, or None to use the default transformer (DatetimeEncoder()). Features classified under this category are not imputed at all (regardless of impute_missing).
- auto_cast : bool, default=True
If set to True, will try to convert each column to the best possible data type (dtype).
- impute_missing : str, default='auto'
When to impute missing values in categorical (textual) columns. 'auto' will impute missing values if it is considered appropriate (we are using an encoder that does not support missing values and/or specific versions of pandas, numpy and scikit-learn). 'force' will impute missing values in all categorical columns. 'skip' will not impute at all. When imputed, missing values are replaced by the string 'missing'. As imputation logic for numerical features can be quite intricate, it is left to the user to manage. See also attribute imputed_columns_.
- remainder : typing.Union[typing.Literal["drop", "passthrough"], sklearn.base.TransformerMixin], default='passthrough'
By default (remainder='passthrough'), all remaining columns that were not specified in transformers are automatically passed through: this subset of columns is concatenated with the output of the transformers. With remainder='drop', only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator; the estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.
- sparse_threshold : float, default=0.3
If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
- n_jobs : int, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- transformer_weights : dict, default=None
Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.
- verbose : bool, default=False
If True, the time elapsed while fitting each transformer will be printed as it is completed.
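The cardinality split described under cardinality_threshold can be sketched with plain pandas (an illustrative reconstruction, not the library's code: it assumes nunique(dropna=False) as the counting rule, so that missing values add one to the cardinality; the threshold is lowered from the default of 40 for the demo):

```python
import pandas as pd

cardinality_threshold = 4  # lowered from the default of 40 for the demo

X = pd.DataFrame({
    "gender": ["M", "F", None, "F"],           # 3 unique values: M, F, NaN
    "name": ["Alice", "Bob", "Carol", "Dan"],  # 4 unique values
})

# Missing values count as one extra unique value, hence dropna=False.
low_card = [col for col in X.columns
            if X[col].nunique(dropna=False) < cardinality_threshold]
high_card = [col for col in X.columns
             if X[col].nunique(dropna=False) >= cardinality_threshold]
print(low_card, high_card)  # ['gender'] ['name']
```

Note that without the missing value, "gender" would have cardinality 2 rather than 3; missing values can push a column across the threshold.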
Notes
The column order of the input data is not guaranteed to be the same as that of the output data (returned by transform). This is due to the way the ColumnTransformer works. However, the output column order will always be the same across different calls to transform on the same fitted SuperVectorizer instance. For example, if the input data has columns ['name', 'job', 'year'], then the output columns might be shuffled, e.g. ['job', 'year', 'name'], but every call to transform will return them in that same order.
- Attributes:
- transformers_ : typing.List[typing.Tuple[str, typing.Union[str, sklearn.base.TransformerMixin], typing.List[str]]]
The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, 'drop', or 'passthrough'. In case there were no columns selected, this will be an unfitted transformer. If there are remaining columns, the final element is a tuple of the form ('remainder', transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_) == len(transformers) + 1; otherwise, len(transformers_) == len(transformers).
- columns_ : pandas.Index
The fitted array’s columns. They are applied to the data passed to the transform method.
- types_: typing.Dict[str, type]
A mapping of inferred types per column. Key is the column name, value is the inferred dtype. Exists only if auto_cast=True.
- imputed_columns_: typing.List[str]
The list of columns in which we imputed the missing values.
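After fitting, the inherited transformers_ attribute can be inspected. Here is a sketch using a plain ColumnTransformer (the class SuperVectorizer extends), with made-up data and transformer names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"city": ["Paris", "Lyon"], "age": [23, 45]})
ct = ColumnTransformer(
    [("cat", OneHotEncoder(), ["city"])], remainder="passthrough"
).fit(X)

# transformers_ holds (name, fitted_transformer, columns) tuples; since a
# column ("age") was left over, a trailing "remainder" entry is appended.
for name, transformer, columns in ct.transformers_:
    print(name)
# cat
# remainder
```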
Methods
fit(X[, y]): Fit all transformers using X.
fit_transform(X[, y]): Fit all transformers, transform the data, and concatenate the results.
get_feature_names([input_features]): Ensures compatibility with sklearn < 1.0.
get_feature_names_out([input_features]): Returns clean feature names with format "<column_name>_<value>" if encoded by OneHotEncoder or alike.
get_params([deep]): Get parameters for this estimator.
set_params(**kwargs): Set the parameters of this estimator.
transform(X): Transform X by applying fitted transformers on each column, and concatenate the results.
- fit(X, y=None)
Fit all transformers using X.
- Parameters:
- X{array-like, dataframe} of shape (n_samples, n_features)
Input data, of which specified subsets are used to fit the transformers.
- yarray-like of shape (n_samples,…), default=None
Targets for supervised learning.
- Returns:
- selfColumnTransformer
This estimator.
- fit_transform(X, y=None)
Fit all transformers, transform the data, and concatenate the results. In practice, it (1) converts features to their best possible types if auto_cast=True, (2) classifies columns based on their data type, (3) replaces "false missing" values (see the function _replace_false_missing) and imputes categorical columns depending on impute_missing, and (4) finally transforms X.
- Parameters:
- X{array-like, dataframe} of shape (n_samples, n_features)
Input data, of which specified subsets are used to fit the transformers.
- yarray-like of shape (n_samples,), default=None
Targets for supervised learning.
- Returns:
- {array-like, sparse matrix} of shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
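The sparse-stacking behaviour of the return value (governed by sparse_threshold) can be sketched directly with scipy; this is an illustrative paraphrase of the density rule, not the library's code:

```python
import numpy as np
from scipy import sparse

# Per-transformer results: a dense numeric block and a sparse one-hot block.
dense_block = np.array([[23.0], [45.0], [31.0]])
sparse_block = sparse.csr_matrix(np.eye(3))

# The results are hstacked; if any block is sparse and the overall density
# falls below sparse_threshold, the stacked result stays sparse.
stacked = sparse.hstack([dense_block, sparse_block]).tocsr()
density = stacked.nnz / np.prod(stacked.shape)
print(stacked.shape, round(density, 2))  # (3, 4) 0.5
```

With the default sparse_threshold=0.3, a density of 0.5 as above would be returned densified; a much sparser result (e.g. a wide one-hot encoding) would be kept sparse.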
- get_feature_names(input_features=None)
Ensures compatibility with sklearn < 1.0. Use get_feature_names_out instead.
- get_feature_names_out(input_features=None)
Returns clean feature names with format "<column_name>_<value>" if encoded by OneHotEncoder or alike, e.g. "job_title_Police officer", or "<column_name>" otherwise.
- Returns:
- typing.List[str]
Feature names.
- get_params(deep=True)
Get parameters for this estimator.
Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- property named_transformers_
Access the fitted transformer by name.
Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
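A sketch of looking up a fitted sub-transformer by name, using a plain ColumnTransformer (which SuperVectorizer extends; data and transformer names are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"city": ["Paris", "Lyon"], "age": [23.0, 45.0]})
ct = ColumnTransformer([
    ("cat", OneHotEncoder(), ["city"]),
    ("num", StandardScaler(), ["age"]),
]).fit(X)

# Retrieve the fitted encoder by the name given at construction time.
ohe = ct.named_transformers_["cat"]
print(ohe.categories_)  # [array(['Lyon', 'Paris'], dtype=object)]
```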
- set_params(**kwargs)
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.
- Parameters:
- **kwargsdict
Estimator parameters.
- Returns:
- selfColumnTransformer
This estimator.
- transform(X)
Transform X by applying fitted transformers on each column, and concatenate the results.
- Parameters:
- X{array-like, dataframe} of shape (n_samples, n_features)
The data to be transformed.
- Returns:
- {array-like, sparse matrix} of shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
Examples using dirty_cat.SuperVectorizer
Dirty categories: machine learning with non normalized strings
Handling datetime features with the DatetimeEncoder