dirty_cat.DatetimeEncoder
Usage examples at the bottom of this page.
- class dirty_cat.DatetimeEncoder(extract_until='hour', add_day_of_the_week=False)
Transforms each datetime column into several numeric columns of temporal features (e.g. year, month, day).
Constant extracted features are dropped; for instance, if every date in a column has the same year, no “year” column is added for it. If the dates are timezone-aware, all extracted features correspond to the provided timezone.
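As a rough illustration of the behaviour described above, the following standard-library sketch splits each date into numeric features and drops the ones that never vary. It is not dirty_cat's implementation; `extract_features` and `drop_constant` are hypothetical helper names introduced only for this example.

```python
from datetime import datetime

LEVELS = ["year", "month", "day", "hour"]


def extract_features(dates):
    """Return one row of numeric features (as floats) per datetime."""
    return [[float(getattr(d, level)) for level in LEVELS] for d in dates]


def drop_constant(rows, names):
    """Keep only the columns whose values are not all identical."""
    keep = [j for j in range(len(names)) if len({row[j] for row in rows}) > 1]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Hypothetical data: every date falls in 2022, so "year" is constant.
dates = [datetime(2022, 10, 15), datetime(2022, 12, 25), datetime(2022, 5, 18, 12)]
rows, names = drop_constant(extract_features(dates), LEVELS)
print(names)    # ['month', 'day', 'hour']
print(rows[0])  # [10.0, 15.0, 0.0]
```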
- Parameters:
- extract_until : {'year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond', 'nanosecond'}, default='hour'
Extract features up to this granularity. If not every feature has been extracted, a 'total_time' feature is added, containing the time elapsed since the epoch (in seconds). For instance, if you specify 'day', only the 'year', 'month', 'day' and 'total_time' features are created.
- add_day_of_the_week : bool, default=False
Add a day-of-the-week feature (if the day is extracted). This is a numerical feature ranging from 0 (Monday) to 6 (Sunday).
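To make the two derived features concrete, here is a minimal standard-library sketch of how a 'total_time' value (seconds since the epoch) and a day-of-the-week value could be computed for one date. The helper names `total_time` and `day_of_week` are hypothetical, and the use of UTC is an assumption made for the example; the encoder's own computation may differ in detail.

```python
from datetime import datetime, timezone

# Unix epoch, expressed as a timezone-aware datetime (assumption: UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def total_time(d):
    """Seconds elapsed between the epoch and a timezone-aware datetime."""
    return (d - EPOCH).total_seconds()


def day_of_week(d):
    """Day of the week as an integer from 0 (Monday) to 6 (Sunday)."""
    return d.weekday()


d = datetime(2022, 10, 15, 12, 30, tzinfo=timezone.utc)

# With granularity 'day', the hour and minute are not extracted on their own;
# that information only survives through the total_time value.
features = {
    "year": d.year,
    "month": d.month,
    "day": d.day,
    "total_time": total_time(d),
    "dayofweek": day_of_week(d),  # 5: 15 October 2022 was a Saturday
}
print(features)
```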
See also
dirty_cat.GapEncoder
Encodes dirty categories (strings) by constructing latent topics with continuous encoding.
dirty_cat.MinHashEncoder
Encode string columns as a numeric array with the minhash method.
dirty_cat.SimilarityEncoder
Encode string columns as a numeric array with n-gram string similarity.
Examples
>>> from dirty_cat import DatetimeEncoder
>>> enc = DatetimeEncoder()
Let’s encode the following dates:
>>> X = [['2022-10-15'], ['2021-12-25'], ['2020-05-18'], ['2019-10-15 12:00:00']]
>>> enc.fit(X)
DatetimeEncoder()
The encoder will output a transformed array with four columns (year, month, day and hour):
>>> enc.transform(X)
array([[2022.,   10.,   15.,    0.],
       [2021.,   12.,   25.,    0.],
       [2020.,    5.,   18.,    0.],
       [2019.,   10.,   15.,   12.]])
- Attributes:
- n_features_in_ : int
Number of features in the data seen during fit.
- n_features_out_ : int
Number of features of the transformed data.
- features_per_column_ : mapping of int to list of str
Dictionary mapping the index of each original column to the list of features extracted from it.
- col_names_ : None or list of str
Names of the features of the input data if the input was a pandas DataFrame, otherwise None.
Methods
- fit(X[, y]): Fit the instance to X.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_feature_names([input_features]): Return clean feature names.
- get_feature_names_out([input_features]): Return clean feature names.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X[, y]): Transform X by replacing each datetime column with corresponding numerical features.
- fit(X, y=None)
Fit the instance to X.
In practice, this just stores which extracted features are not constant.
- Parameters:
- X : array-like, shape (n_samples, n_features)
Data where each column is a datetime feature.
- y : None
Unused, only here for compatibility.
- Returns:
- DatetimeEncoder
Fitted DatetimeEncoder instance (self).
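The idea that fitting only records which features vary can be sketched with a tiny single-column stand-in. This is an illustration written with the standard library, not the library's code; the class `TinyDatetimeEncoder` and its `kept_` attribute are hypothetical names.

```python
from datetime import datetime

LEVELS = ("year", "month", "day", "hour")


class TinyDatetimeEncoder:
    """Hypothetical single-column stand-in, for illustration only."""

    def fit(self, dates):
        # Remember which features take more than one value in the training data.
        self.kept_ = [
            level for level in LEVELS
            if len({getattr(d, level) for d in dates}) > 1
        ]
        return self

    def transform(self, dates):
        # Reuse the selection made at fit time, whatever the new data looks like.
        return [[float(getattr(d, level)) for level in self.kept_] for d in dates]


train = [datetime(2021, 3, 1), datetime(2021, 7, 9)]
enc = TinyDatetimeEncoder().fit(train)
print(enc.kept_)                               # ['month', 'day']
print(enc.transform([datetime(2030, 1, 2)]))   # [[1.0, 2.0]]
```

Because the selection is stored at fit time, the output of `transform` has the same columns for any later input, even one where a kept feature happens to be constant.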
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names(input_features=None)
Return clean feature names. Compatibility method for sklearn < 1.0.
Use get_feature_names_out() instead.
- Parameters:
- input_features : None
Unused, only here for compatibility.
- Returns:
- list of str
List of feature names.
- get_feature_names_out(input_features=None)
Return clean feature names.
Feature names are formatted as “<column_name>_<new_feature>” if the original data has column names, and as “<column_index>_<new_feature>” otherwise, where <new_feature> is one of {“year”, “month”, “day”, “hour”, “minute”, “second”, “millisecond”, “microsecond”, “nanosecond”, “dayofweek”}.
- Parameters:
- input_features : None
Unused, only here for compatibility.
- Returns:
- list of str
List of feature names.
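The naming scheme described for get_feature_names_out can be sketched as follows. This is a standalone illustration of the “<column_name>_<new_feature>” format, assuming a mapping shaped like features_per_column_; the function `feature_names` is a hypothetical helper, not part of dirty_cat.

```python
def feature_names(features_per_column, col_names=None):
    """Build output names from a {column index: [features]} mapping."""
    names = []
    for index, features in features_per_column.items():
        # Use the column name when available, otherwise fall back to its index.
        prefix = col_names[index] if col_names is not None else str(index)
        names.extend(f"{prefix}_{feature}" for feature in features)
    return names


# Hypothetical mapping for two datetime columns.
mapping = {0: ["year", "month"], 1: ["day", "dayofweek"]}

print(feature_names(mapping))
# ['0_year', '0_month', '1_day', '1_dayofweek']
print(feature_names(mapping, ["start", "end"]))
# ['start_year', 'start_month', 'end_day', 'end_dayofweek']
```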
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See “Introducing the set_output API” for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
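The <component>__<parameter> convention can be illustrated with a small standard-library sketch. It only mimics the naming scheme and is not how scikit-learn implements set_params; `Scaler` and `TinyPipeline` are hypothetical classes invented for this example.

```python
class Scaler:
    """Hypothetical component with a single parameter."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean


class TinyPipeline:
    """Hypothetical nested object holding named components."""

    def __init__(self, **components):
        self.components = components

    def set_params(self, **params):
        for key, value in params.items():
            # Split "scaler__with_mean" into ("scaler", "__", "with_mean").
            name, sep, sub_key = key.partition("__")
            if sep:
                # Nested parameter: forward it to the named component.
                setattr(self.components[name], sub_key, value)
            else:
                # Plain parameter: set it on this object directly.
                setattr(self, key, value)
        return self


pipe = TinyPipeline(scaler=Scaler())
pipe.set_params(scaler__with_mean=False)
print(pipe.components["scaler"].with_mean)  # False
```

Returning `self` mirrors the documented return value and allows calls to be chained.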
- transform(X, y=None)
Transform X by replacing each datetime column with corresponding numerical features.
- Parameters:
- X : array-like, shape (n_samples, n_features)
The data to transform, where each column is a datetime feature.
- y : None
Unused, only here for compatibility.
- Returns:
- ndarray, shape (n_samples, n_features_out_)
Transformed input.