purged module

Classes for purged cross-validation in time series.

As described in Advances in Financial Machine Learning, Marcos Lopez de Prado, 2018.


BasePurgedCV class

BasePurgedCV(
    n_folds=10,
    purge_td=0
)

Abstract class for purged time series cross-validation.

Time series cross-validation requires that each sample have a prediction time, at which the features are used to predict the response, and an evaluation time, at which the response is known and the error can be computed. Consequently, unlike in standard sklearn cross-validation, the samples X, the response y, pred_times and eval_times must all be pandas DataFrames or Series sharing the same index. The samples are also assumed to be ordered by prediction time.

Subclasses


eval_times property

Times at which the response becomes available and the error can be computed.


indices property

Indices.


n_folds property

Number of folds.


pred_times property

Times at which predictions are made.


purge method

BasePurgedCV.purge(
    train_indices,
    test_fold_start,
    test_fold_end
)

Purge part of the train set.

Given the index test_fold_start of the left boundary of a test fold and the index test_fold_end of its right boundary, this method removes from the train set every sample whose evaluation time is later than the prediction time of the first test sample after the left boundary.
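The purging rule can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: purge_sketch is a hypothetical helper, the times are held in ordinary lists rather than pandas Series, test_fold_end is assumed to be exclusive, ties are assumed to be kept, and purge_td is ignored.

```python
from datetime import datetime, timedelta


def purge_sketch(train_indices, test_fold_start, test_fold_end,
                 pred_times, eval_times):
    """Hypothetical stand-in for the purge step, using plain lists."""
    # Prediction time of the first sample in the test fold.
    test_start_time = pred_times[test_fold_start]
    # Train samples before the test fold survive only if their response is
    # already known when the first test prediction is made (assumption:
    # a tie counts as "known").
    before = [i for i in train_indices
              if i < test_fold_start and eval_times[i] <= test_start_time]
    # Train samples located after the test fold are left untouched here.
    after = [i for i in train_indices if i >= test_fold_end]
    return before + after


day0 = datetime(2020, 1, 1)
pred_times = [day0 + timedelta(days=i) for i in range(10)]
# Each response becomes known two days after the prediction.
eval_times = [t + timedelta(days=2) for t in pred_times]

train_indices = [0, 1, 2, 3, 7, 8, 9]  # test fold covers positions 4, 5, 6
purged = purge_sketch(train_indices, 4, 7, pred_times, eval_times)
print(purged)  # [0, 1, 2, 7, 8, 9]
```

Sample 3 is dropped because its response is only known one day after the first test prediction is made.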


purge_td property

Purge period.


split method

BasePurgedCV.split(
    X,
    y=None,
    pred_times=None,
    eval_times=None
)

Yield the indices of the train and test sets.


PurgedKFoldCV class

PurgedKFoldCV(
    n_folds=10,
    n_test_folds=2,
    purge_td=0,
    embargo_td=0
)

Purged and embargoed combinatorial cross-validation.

The samples are split into n_folds folds containing equal numbers of samples, without shuffling. In each cross-validation round, n_test_folds folds are used as the test set, while the other folds are used as the train set. There is one round for each way of choosing n_test_folds folds among the n_folds folds.
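To get a feel for the number of rounds, the possible test-fold combinations can be enumerated with the standard library. This is only an illustration of the counting; the order in which the class visits the combinations is not specified here.

```python
from itertools import combinations
from math import comb

n_folds, n_test_folds = 5, 2

# Each round picks n_test_folds fold numbers as the test set.
rounds = list(combinations(range(n_folds), n_test_folds))
for test_folds in rounds:
    train_folds = [f for f in range(n_folds) if f not in test_folds]
    print("test folds:", test_folds, "train folds:", train_folds)

print(len(rounds))   # 10
print(comb(10, 2))   # 45 rounds with the default n_folds=10, n_test_folds=2
```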

Each sample should be tagged with a prediction time and an evaluation time. The split is such that the intervals [pred_times, eval_times] associated with samples in the train and test sets do not overlap. (Overlapping samples are dropped.) In addition, an "embargo" period is defined, giving the minimum time between an evaluation time in the test set and a prediction time in the training set. When the data are temporally correlated, this avoids contamination of the test set by the train set.

Superclasses

Inherited members


compute_test_set method

PurgedKFoldCV.compute_test_set(
    fold_bound_list
)

Compute the position indices of the samples in the test set.


compute_train_set method

PurgedKFoldCV.compute_train_set(
    test_fold_bounds,
    test_indices
)

Compute the position indices of the samples in the train set.


embargo method

PurgedKFoldCV.embargo(
    train_indices,
    test_indices,
    test_fold_end
)

Apply the embargo procedure to part of the train set.

This amounts to dropping the train set samples whose prediction time occurs within PurgedKFoldCV.embargo_td of the test set sample evaluation times. This method applies the embargo only to the part of the training set immediately following the end of the test set determined by test_fold_end.
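The embargo rule described above can be sketched in plain Python as follows. The helper embargo_sketch is hypothetical and is not the library's code: it assumes list inputs, an exclusive test_fold_end, and that a train sample predicted exactly at the end of the embargo window is still dropped.

```python
from datetime import datetime, timedelta


def embargo_sketch(train_indices, test_indices, test_fold_end,
                   pred_times, eval_times, embargo_td):
    """Hypothetical stand-in for the embargo step, using plain lists."""
    # Latest evaluation time among test samples located before test_fold_end.
    last_test_eval = max(eval_times[i] for i in test_indices
                         if i < test_fold_end)
    cutoff = last_test_eval + embargo_td
    # Samples before the end of the test fold are not affected; later ones
    # are kept only if they are predicted strictly after the cutoff.
    return [i for i in train_indices
            if i < test_fold_end or pred_times[i] > cutoff]


day0 = datetime(2020, 1, 1)
pred_times = [day0 + timedelta(days=i) for i in range(10)]
# Each response becomes known one day after the prediction.
eval_times = [t + timedelta(days=1) for t in pred_times]

test_indices = [3, 4, 5]
train_indices = [0, 1, 2, 6, 7, 8, 9]
kept = embargo_sketch(train_indices, test_indices, 6,
                      pred_times, eval_times, timedelta(days=2))
print(kept)  # [0, 1, 2, 9]
```

Here the last test response is known on day 6, so with a two-day embargo only train samples predicted after day 8 remain on the right-hand side of the test fold.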


embargo_td property

Embargo period.


n_test_folds property

Number of folds used in the test set.


PurgedWalkForwardCV class

PurgedWalkForwardCV(
    n_folds=10,
    n_test_folds=1,
    min_train_folds=2,
    max_train_folds=None,
    split_by_time=False,
    purge_td=0
)

Purged walk-forward cross-validation.

The samples are split into n_folds folds containing equal numbers of samples, without shuffling. In each cross-validation round, n_test_folds contiguous folds are used as the test set, while the train set consists of between min_train_folds and max_train_folds immediately preceding folds.
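The walk-forward scheme can be sketched at the level of fold numbers. The function walk_forward_rounds below is a hypothetical illustration: it assumes the test window advances one fold per round and it ignores purging.

```python
def walk_forward_rounds(n_folds, n_test_folds=1, min_train_folds=2,
                        max_train_folds=None):
    """Return (train_folds, test_folds) pairs as lists of fold numbers."""
    rounds = []
    # The first test fold needs at least min_train_folds folds before it,
    # and the test window must fit inside the n_folds folds.
    for first_test in range(min_train_folds, n_folds - n_test_folds + 1):
        n_train = first_test
        if max_train_folds is not None:
            n_train = min(first_test, max_train_folds)
        train = list(range(first_test - n_train, first_test))
        test = list(range(first_test, first_test + n_test_folds))
        rounds.append((train, test))
    return rounds


for train, test in walk_forward_rounds(6, max_train_folds=3):
    print("train folds:", train, "test folds:", test)
# train folds: [0, 1] test folds: [2]
# train folds: [0, 1, 2] test folds: [3]
# train folds: [1, 2, 3] test folds: [4]
# train folds: [2, 3, 4] test folds: [5]
```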

Each sample should be tagged with a prediction time and an evaluation time. The split is such that the intervals [pred_times, eval_times] associated with samples in the train and test sets do not overlap. (Overlapping samples are dropped.)

With split_by_time=True, the samples can instead be split into folds spanning equal time intervals (using the prediction time as the time tag), rather than folds containing equal numbers of samples.

Superclasses

Inherited members


compute_fold_bounds method

PurgedWalkForwardCV.compute_fold_bounds()

Compute a list containing the fold (left) boundaries.
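The difference between count-based and time-based left boundaries can be sketched as follows. Both helpers are hypothetical and work on a sorted list of prediction times; how the class itself rounds boundaries or handles uneven fold sizes is an assumption.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def fold_bounds_by_count(n_samples, n_folds):
    # Left boundaries of folds holding (approximately) equal sample counts.
    return [k * n_samples // n_folds for k in range(n_folds)]


def fold_bounds_by_time(pred_times, n_folds):
    # Left boundaries of folds spanning equal time intervals; pred_times
    # must be sorted in increasing order.
    start = pred_times[0]
    fold_span = (pred_times[-1] - start) / n_folds
    return [bisect_left(pred_times, start + k * fold_span)
            for k in range(n_folds)]


day0 = datetime(2020, 1, 1)
# Unevenly spaced prediction times: dense at first, then sparse.
pred_times = [day0 + timedelta(days=d) for d in (0, 1, 2, 3, 10, 20, 30, 40)]

print(fold_bounds_by_count(len(pred_times), 4))  # [0, 2, 4, 6]
print(fold_bounds_by_time(pred_times, 4))        # [0, 4, 5, 6]
```

With unevenly spaced samples, the time-based folds hold very different numbers of samples: the first ten-day fold contains four samples, while later folds contain one or two.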


compute_test_set method

PurgedWalkForwardCV.compute_test_set(
    fold_bound,
    count_folds
)

Compute the position indices of the samples in the test set.


compute_train_set method

PurgedWalkForwardCV.compute_train_set(
    fold_bound,
    count_folds
)

Compute the position indices of the samples in the train set.


fold_bounds property

Fold boundaries.


max_train_folds property

Maximal number of folds to be used in the train set.


min_train_folds property

Minimal number of folds to be used in the train set.


n_test_folds property

Number of folds used in the test set.


split_by_time property

Whether the folds span identical time intervals. Otherwise, the folds contain an (approximately) equal number of samples.