purged module¶
Classes for purged cross-validation in time series.
As described in Advances in Financial Machine Learning, Marcos Lopez de Prado, 2018.
BasePurgedCV class¶
Abstract class for purged time series cross-validation.
Time series cross-validation requires that each sample have a prediction time, at which the features are used to predict the response, and an evaluation time, at which the response is known and the error can be computed. Consequently, unlike in standard sklearn cross-validation, the samples X, the response y, pred_times and eval_times must all be pandas DataFrames or Series sharing the same index. The samples are also assumed to be ordered by prediction time.
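The input layout described above can be illustrated with a small sketch. The data, the column name `feature` and the two-day evaluation horizon are made up for illustration; only the requirements being checked (shared index, time ordering) come from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 daily samples, each response known 2 days after prediction.
index = pd.RangeIndex(6)
pred_times = pd.Series(pd.date_range("2020-01-01", periods=6, freq="D"), index=index)
eval_times = pred_times + pd.Timedelta(days=2)

X = pd.DataFrame({"feature": np.arange(6, dtype=float)}, index=index)
y = pd.Series(np.arange(6, dtype=float) * 2.0, index=index)

# X, y, pred_times and eval_times must all share the same index.
same_index = all(obj.index.equals(X.index) for obj in (y, pred_times, eval_times))
# Samples must be ordered by prediction time.
time_ordered = pred_times.is_monotonic_increasing
# A response cannot be known before its prediction is made.
valid_intervals = bool((eval_times >= pred_times).all())
```

Running checks of this kind before splitting is a cheap way to catch misaligned inputs.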
Subclasses
- PurgedKFoldCV
- PurgedWalkForwardCV
eval_times property¶
Times at which the response becomes available and the error can be computed.
indices property¶
Position indices of the samples.
n_folds property¶
Number of folds.
pred_times property¶
Times at which predictions are made.
purge method¶
Purge part of the train set.
Given the left boundary index test_fold_start and the right boundary index test_fold_end of the test set, this method removes from the train set all samples whose evaluation time is later than the prediction time of the first test sample after the left boundary.
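A minimal sketch of the purging rule described above, assuming position-based indices. The function `purge_sketch` and its signature are hypothetical and are not the library's method; the handling of exact ties and of train samples located after the test fold are also assumptions.

```python
import numpy as np
import pandas as pd

def purge_sketch(train_indices, pred_times, eval_times, test_fold_start, test_fold_end):
    """Hypothetical helper illustrating purging; not the library's implementation."""
    positions = np.arange(len(pred_times))
    first_test_pred_time = pred_times.iloc[test_fold_start]
    # Keep samples whose response is known no later than the first test prediction.
    # (Treating an exact tie as "keep" is an assumption of this sketch.)
    known_in_time = positions[(eval_times <= first_test_pred_time).to_numpy()]
    kept_before = np.intersect1d(train_indices, known_in_time)
    # Train samples located after the test fold are left untouched in this sketch.
    kept_after = np.intersect1d(train_indices, positions[test_fold_end:])
    return np.concatenate((kept_before, kept_after))

# Made-up data: daily predictions, responses known 36 hours later.
pred_times = pd.Series(pd.date_range("2020-01-01", periods=10, freq="D"))
eval_times = pred_times + pd.Timedelta(hours=36)

# The test fold covers positions 4 and 5; all other positions start in the train set.
train_indices = np.array([0, 1, 2, 3, 6, 7, 8, 9])
purged = purge_sketch(train_indices, pred_times, eval_times, test_fold_start=4, test_fold_end=6)
```

Here position 3 is dropped because its response only becomes known after the first test sample's prediction time.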
purge_td property¶
Purge period.
split method¶
Yield the indices of the train and test sets.
PurgedKFoldCV class¶
Purged and embargoed combinatorial cross-validation.
The samples are split into n_folds folds containing equal numbers of samples, without shuffling. In each cross-validation round, n_test_folds folds are used as the test set and the remaining folds as the train set. There are as many rounds as there are ways to choose n_test_folds folds among the n_folds folds.
Each sample should be tagged with a prediction time and an evaluation time. The split is such that the intervals [pred_times, eval_times] associated with samples in the train and test sets do not overlap; overlapping samples are dropped. In addition, an "embargo" period is defined, giving the minimal time between an evaluation time in the test set and a prediction time in the training set. This avoids contamination of the test set by the train set in the presence of temporal correlation.
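The combinatorial structure of the rounds can be sketched with the standard library alone. This only enumerates fold numbers before any purging or embargo; the values 5 and 2 are illustrative, and the order in which the class itself yields rounds may differ.

```python
from itertools import combinations
from math import comb

n_folds, n_test_folds = 5, 2  # illustrative values

# Each round picks n_test_folds folds for testing; the remaining folds form
# the train set (before purging and embargoing).
rounds = [
    (tuple(f for f in range(n_folds) if f not in test_folds), test_folds)
    for test_folds in combinations(range(n_folds), n_test_folds)
]
n_rounds = len(rounds)
```

The number of rounds grows quickly with n_folds, so it is worth computing `comb(n_folds, n_test_folds)` before launching an expensive model fit for every round.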
Superclasses
- BasePurgedCV
Inherited members
- BasePurgedCV.eval_times
- BasePurgedCV.indices
- BasePurgedCV.n_folds
- BasePurgedCV.pred_times
- BasePurgedCV.purge()
- BasePurgedCV.purge_td
- BasePurgedCV.split()
compute_test_set method¶
Compute the position indices of the samples in the test set.
compute_train_set method¶
Compute the position indices of the samples in the train set.
embargo method¶
Apply the embargo procedure to part of the train set.
This amounts to dropping the train set samples whose prediction time occurs within PurgedKFoldCV.embargo_td of the evaluation times of the test set samples. The embargo is applied only to the part of the train set immediately following the end of the test set, as determined by test_fold_end.
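A minimal sketch of the embargo rule described above. The function `embargo_sketch`, its signature (including the `test_fold_start` argument) and the sample data are hypothetical; the sketch only assumes that train samples following the test fold are dropped until their prediction time lies more than embargo_td after the last test evaluation time.

```python
import numpy as np
import pandas as pd

def embargo_sketch(train_indices, pred_times, eval_times,
                   test_fold_start, test_fold_end, embargo_td):
    """Hypothetical helper illustrating the embargo; not the library's implementation."""
    positions = np.arange(len(pred_times))
    last_test_eval_time = eval_times.iloc[test_fold_start:test_fold_end].max()
    # Samples whose prediction time lies beyond the embargo window.
    cleared = (pred_times > last_test_eval_time + embargo_td).to_numpy()
    # Only the part of the train set after the test fold is affected.
    after_fold = positions[test_fold_end:][cleared[test_fold_end:]]
    allowed = np.concatenate((positions[:test_fold_end], after_fold))
    return np.intersect1d(train_indices, allowed)

# Made-up data: daily predictions, responses known one day later.
pred_times = pd.Series(pd.date_range("2020-01-01", periods=10, freq="D"))
eval_times = pred_times + pd.Timedelta(days=1)

# The test fold covers positions 4 and 5.
train_indices = np.array([0, 1, 2, 3, 6, 7, 8, 9])
embargoed = embargo_sketch(train_indices, pred_times, eval_times,
                           test_fold_start=4, test_fold_end=6,
                           embargo_td=pd.Timedelta(hours=36))
```

Positions 6 and 7 are dropped because their prediction times fall within 36 hours of the last test evaluation time; the train samples before the test fold are not touched by the embargo.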
embargo_td property¶
Embargo period.
n_test_folds property¶
Number of folds used in the test set.
PurgedWalkForwardCV class¶
PurgedWalkForwardCV(
n_folds=10,
n_test_folds=1,
min_train_folds=2,
max_train_folds=None,
split_by_time=False,
purge_td=0
)
Purged walk-forward cross-validation.
The samples are split into n_folds folds containing equal numbers of samples, without shuffling. In each cross-validation round, n_test_folds contiguous folds are used as the test set, while the train set consists of between min_train_folds and max_train_folds immediately preceding folds.
Each sample should be tagged with a prediction time and an evaluation time. The split is such that the intervals [pred_times, eval_times] associated with samples in the train and test sets do not overlap; overlapping samples are dropped.
With split_by_time=True in the PurgedWalkForwardCV.split() method, it is also possible to split the samples into folds spanning equal time intervals (using the prediction time as a time tag), instead of folds containing equal numbers of samples.
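The walk-forward scheme described above can be sketched at the level of fold numbers. The generator `walk_forward_folds` is a hypothetical helper, not part of the class; it assumes the test window advances one fold at a time and ignores purging.

```python
def walk_forward_folds(n_folds, n_test_folds=1, min_train_folds=2, max_train_folds=None):
    """Hypothetical helper: yield (train_folds, test_folds) as lists of fold numbers."""
    # The first test fold can only start once min_train_folds folds precede it.
    for test_start in range(min_train_folds, n_folds - n_test_folds + 1):
        if max_train_folds is None:
            train_start = 0  # use every preceding fold
        else:
            train_start = max(0, test_start - max_train_folds)
        train_folds = list(range(train_start, test_start))
        test_folds = list(range(test_start, test_start + n_test_folds))
        yield train_folds, test_folds

rounds = list(walk_forward_folds(n_folds=5, n_test_folds=1,
                                 min_train_folds=2, max_train_folds=3))
```

With these illustrative settings the train window grows from two folds to three and then slides forward, always ending right before the test fold.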
Superclasses
- BasePurgedCV
Inherited members
- BasePurgedCV.eval_times
- BasePurgedCV.indices
- BasePurgedCV.n_folds
- BasePurgedCV.pred_times
- BasePurgedCV.purge()
- BasePurgedCV.purge_td
- BasePurgedCV.split()
compute_fold_bounds method¶
Compute a list containing the fold (left) boundaries.
compute_test_set method¶
Compute the position indices of the samples in the test set.
compute_train_set method¶
Compute the position indices of the samples in the train set.
fold_bounds property¶
Fold boundaries.
max_train_folds property¶
Maximal number of folds to be used in the train set.
min_train_folds property¶
Minimal number of folds to be used in the train set.
n_test_folds property¶
Number of folds used in the test set.
split_by_time property¶
Whether the folds span identical time intervals. Otherwise, the folds contain an (approximately) equal number of samples.
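The difference between the two splitting modes can be illustrated by computing left fold boundaries both ways. This is a sketch under assumptions: the irregular timestamps are made up, and the variable names `bounds_by_count` and `bounds_by_time` are illustrative rather than taken from the class.

```python
import numpy as np
import pandas as pd

# Made-up, irregularly spaced prediction times (sorted).
pred_times = pd.Series(pd.to_datetime([
    "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
    "2020-01-05", "2020-01-06", "2020-01-09", "2020-01-13",
]))
n_folds = 2
n_samples = len(pred_times)

# Equal sample counts: left boundary (position) of each fold.
bounds_by_count = [int(chunk[0]) for chunk in np.array_split(np.arange(n_samples), n_folds)]

# Equal time spans: each fold starts at the first sample whose prediction time
# reaches the start of the corresponding time interval.
span = (pred_times.iloc[-1] - pred_times.iloc[0]) / n_folds
start_times = [pred_times.iloc[0] + k * span for k in range(n_folds)]
bounds_by_time = [int(pred_times.searchsorted(t, side="left")) for t in start_times]
```

With densely sampled early dates and sparse later ones, the time-based split puts six samples in the first fold and two in the second, whereas the count-based split puts four in each.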