dirty_cat.DatetimeEncoder


class dirty_cat.DatetimeEncoder(extract_until='hour', add_day_of_the_week=False)

Transforms each datetime column into several numeric columns of temporal features (e.g., year, month, day).

Constant extracted features are dropped; for instance, if the year is always the same in a column, the extracted “year” feature is not added. If the dates are timezone-aware, all extracted features correspond to the provided timezone.
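The dropping rule can be illustrated with a minimal standard-library sketch. This is not dirty_cat's implementation: the helper name `extract_non_constant` is hypothetical, and the sketch assumes the inputs are already `datetime` objects.

```python
from datetime import datetime

FEATURES = ("year", "month", "day", "hour")  # granularities up to 'hour'


def extract_non_constant(dates):
    """Return {feature: values}, keeping only features that vary across dates."""
    kept = {}
    for name in FEATURES:
        values = [getattr(d, name) for d in dates]
        # A feature with a single distinct value carries no information.
        if len(set(values)) > 1:
            kept[name] = values
    return kept


dates = [datetime(2022, 10, 15), datetime(2022, 12, 25), datetime(2022, 5, 18)]
features = extract_non_constant(dates)
print(sorted(features))  # 'year' and 'hour' are constant here, so they are absent
```

Here every date falls in 2022 at midnight, so only the month and day survive.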

Parameters:
extract_until : {‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’, ‘nanosecond’}, default=‘hour’

Extract features up to this granularity. If not all features have been extracted, a ‘total_time’ feature is added, which contains the time elapsed since the epoch (in seconds). For instance, if you specify ‘day’, only the ‘year’, ‘month’, ‘day’ and ‘total_time’ features are created.
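As a rough illustration of what a ‘total_time’ value looks like, the sketch below computes the seconds elapsed since a reference point using only the standard library. It assumes the epoch is the Unix epoch (1970-01-01 UTC); the helper name `total_seconds_since_epoch` is hypothetical and not part of dirty_cat.

```python
from datetime import datetime, timezone

# Assumption: "epoch" means the Unix epoch, 1970-01-01 00:00:00 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def total_seconds_since_epoch(d):
    """Seconds elapsed between the epoch and a timezone-aware datetime."""
    return (d - EPOCH).total_seconds()


d = datetime(2020, 5, 18, 13, 45, 30, tzinfo=timezone.utc)

# With a granularity of 'day', these are the separately extracted features...
coarse = (d.year, d.month, d.day)
# ...and the finer-grained information is only carried by the total time.
total_time = total_seconds_since_epoch(d)
print(coarse, total_time)
```

The total time keeps the hour, minute and second information that the coarse features discard.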

add_day_of_the_week : bool, default=False

Add day of the week feature (if day is extracted). This is a numerical feature from 0 (Monday) to 6 (Sunday).
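The 0 (Monday) to 6 (Sunday) numbering matches the convention of `date.weekday()` in the Python standard library, which gives a quick way to check what value to expect for a given date. The snippet below only illustrates the numbering and does not call dirty_cat.

```python
from datetime import date

# date.weekday() uses the same convention: Monday is 0, Sunday is 6.
monday = date(2020, 5, 18).weekday()
saturday = date(2022, 10, 15).weekday()
print(monday, saturday)  # 0 5
```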

See also

dirty_cat.GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

dirty_cat.MinHashEncoder

Encode string columns as a numeric array with the minhash method.

dirty_cat.SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

Examples

>>> from dirty_cat import DatetimeEncoder
>>> enc = DatetimeEncoder()

Let’s encode the following dates:

>>> X = [['2022-10-15'], ['2021-12-25'], ['2020-05-18'], ['2019-10-15 12:00:00']]
>>> enc.fit(X)
DatetimeEncoder()

The encoder will output a transformed array with four columns (year, month, day and hour):

>>> enc.transform(X)
array([[2022.,   10.,   15.,    0.],
       [2021.,   12.,   25.,    0.],
       [2020.,    5.,   18.,    0.],
       [2019.,   10.,   15.,   12.]])

Attributes:
n_features_in_ : int

Number of features in the data seen during fit.

n_features_out_ : int

Number of features of the transformed data.

features_per_column_ : mapping of int to list of str

Dictionary mapping the index of the original columns to the list of features extracted for each column.

col_names_ : None or list of str

List of the names of the features of the input data, if input data was a pandas DataFrame, otherwise None.

Methods

fit(X[, y])

Fit the instance to X.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names([input_features])

Return clean feature names.

get_feature_names_out([input_features])

Return clean feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, y])

Transform X by replacing each datetime column with corresponding numerical features.

fit(X, y=None)

Fit the instance to X.

In practice, just stores which extracted features are not constant.

Parameters:
X : array-like, shape (n_samples, n_features)

Data where each column is a datetime feature.

y : None

Unused, only here for compatibility.

Returns:
DatetimeEncoder

Fitted DatetimeEncoder instance (self).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_feature_names(input_features=None)

Return clean feature names. Compatibility method for sklearn < 1.0.

Use get_feature_names_out() instead.

Parameters:
input_features : None

Unused, only here for compatibility.

Returns:
list of str

List of feature names.

get_feature_names_out(input_features=None)

Return clean feature names.

Feature names are formatted like: “<column_name>_<new_feature>” if the original data has column names, otherwise with format “<column_index>_<new_feature>” where <new_feature> is one of {“year”, “month”, “day”, “hour”, “minute”, “second”, “millisecond”, “microsecond”, “nanosecond”, “dayofweek”}.

Parameters:
input_features : None

Unused, only here for compatibility.

Returns:
list of str

List of feature names.
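The naming scheme can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the helper `feature_names`, the column names `signup` and `login`, and the example mapping are all hypothetical.

```python
def feature_names(columns, features_per_column):
    """Build '<column>_<feature>' names, falling back to the column index."""
    names = []
    for idx, feats in features_per_column.items():
        # Use the column name when available, otherwise the column index.
        prefix = columns[idx] if columns is not None else str(idx)
        names.extend(f"{prefix}_{feat}" for feat in feats)
    return names


# Hypothetical mapping from column index to the features kept for it.
mapping = {0: ["year", "month", "day"], 1: ["hour", "dayofweek"]}

with_names = feature_names(["signup", "login"], mapping)
without_names = feature_names(None, mapping)
print(with_names)
print(without_names)
```

With column names the output starts with `signup_year`; without them it starts with `0_year`.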

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See the scikit-learn example “Introducing the set_output API” for how to use this API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
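To make the <component>__<parameter> form concrete, the sketch below shows how such keys can be split into a component name and a parameter name. It is a simplified illustration, not scikit-learn's implementation; the helper `group_nested_params` and the step name `encoder` are hypothetical.

```python
def group_nested_params(params):
    """Group '<component>__<parameter>' keys by component name.

    Keys without a double underscore belong to the top-level object and
    are stored under the empty string.
    """
    grouped = {}
    for key, value in params.items():
        # Split on the first double underscore only.
        component, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault("", {})[key] = value
    return grouped


params = {"verbose": True, "encoder__extract_until": "day"}
grouped = group_nested_params(params)
print(grouped)
```

Here `extract_until` would be routed to the component named `encoder`, while `verbose` stays on the outer object.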

transform(X, y=None)

Transform X by replacing each datetime column with corresponding numerical features.

Parameters:
X : array-like, shape (n_samples, n_features)

The data to transform, where each column is a datetime feature.

y : None

Unused, only here for compatibility.

Returns:
ndarray, shape (n_samples, n_features_out_)

Transformed input.
