Synthetic

Synthetic data is data that could have been generated by financial markets but was not. Synthetic price and return data address the small-data problem in finance and have numerous uses, including testing new investment strategies and feeding data-hungry ML models. They also help us detect behavioral differences and outlier discrepancies between real and mimicked markets. For example, if our model performs well on a subset of real-world data, we can run it on synthetic data to find out whether we unknowingly introduced look-ahead bias or some other Achilles' heel into our model.

To assist us in generating synthetic data, vectorbt implements the class SyntheticData, which takes the start date, the end date, and the frequency, and builds a datetime-like Pandas Index. It then calls the abstract class method SyntheticData.generate_key, which takes the key and the index, generates new data, and returns a Series or a DataFrame ready to be consumed by Data. The key here is either a feature or a symbol, depending on the data orientation that the user has chosen. We must override this method and implement our own data generation logic.

Note

If the logic depends on the data orientation (that is, whether features or symbols should be generated), you should be more specific and override SyntheticData.generate_symbol and/or SyntheticData.generate_feature.
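The flow described above (build an index from the start date, end date, and frequency, then call a per-key generator) can be sketched in plain pandas and NumPy. This is only an illustration of the idea, not vectorbt's implementation: `build_index`, `generate_key`, and `pull` below are hypothetical stand-ins, and the toy generator simply compounds normally distributed returns.

```python
import numpy as np
import pandas as pd


def build_index(start, end, freq="D"):
    # Hypothetical helper: build a datetime index and treat the end date as exclusive
    index = pd.date_range(start=start, end=end, freq=freq)
    return index[index < pd.Timestamp(end)]


def generate_key(key, index, rng):
    # Toy generator (an assumption for illustration): compound small normal returns
    returns = rng.normal(0.0, 0.01, size=len(index))
    return pd.Series(100.0 * np.cumprod(1.0 + returns), index=index, name=key)


def pull(keys, start, end, freq="D", seed=None):
    # One shared index, one generated Series per key, combined column-wise
    rng = np.random.default_rng(seed)
    index = build_index(start, end, freq)
    return pd.concat([generate_key(key, index, rng) for key in keys], axis=1)


data = pull(["BTC/USD", "ETH/USD"], "2020-01-01", "2021-01-01", seed=42)
print(data.head())
```

Only `generate_key` would need to change to plug in a different data-generating process; the index construction stays the same.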

There are two preset classes: RandomData, which uses cumulative normally-distributed returns, and GBMData, which uses Geometric Brownian Motion. Both generators are very basic but still useful for testing models. Their weakness is that they rarely produce the dramatic moves that real asset prices regularly make in response to new information. To account for this, we'll build a data generator based on the Lévy alpha-stable distribution!
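To make the two baseline ideas concrete, here is a minimal NumPy sketch of a price path built from cumulative normal returns and one built from Geometric Brownian Motion. The function names, parameter names, and defaults are assumptions chosen for illustration; they are not the signatures or internals of RandomData or GBMData.

```python
import numpy as np


def random_price(n, start_value=100.0, mean=0.0, std=0.01, seed=None):
    # Compound normally distributed simple returns (illustrative sketch)
    rng = np.random.default_rng(seed)
    returns = rng.normal(mean, std, size=n)
    return start_value * np.cumprod(1.0 + returns)


def gbm_price(n, start_value=100.0, mean=0.0, std=0.01, dt=1.0, seed=None):
    # Discretized GBM: log-returns are normal with drift (mean - std^2 / 2) * dt
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    log_returns = (mean - 0.5 * std ** 2) * dt + std * np.sqrt(dt) * z
    return start_value * np.exp(np.cumsum(log_returns))


random_path = random_price(1000, seed=1)
gbm_path = gbm_price(1000, seed=1)

# Largest one-step log move, measured in units of the per-step volatility
largest_move = np.abs(np.diff(np.log(gbm_path))).max() / 0.01
print(largest_move)
```

Because every step is Gaussian, the largest move in such a path stays within a few standard deviations, which is the limitation a heavy-tailed generator is meant to address.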

>>> from vectorbtpro import *  # (1)!
>>> from scipy.stats import levy_stable

>>> def geometric_levy_price(alpha, beta, drift, vol, shape):  # (2)!
...     _rvs = levy_stable.rvs(alpha, beta, loc=0, scale=1, size=shape)
...     _rvs_sum = np.cumsum(_rvs, axis=0)
...     return np.exp(vol * _rvs_sum + (drift - 0.5 * vol ** 2))

>>> class LevyData(vbt.SyntheticData):  # (3)!
...
...     _settings_path = dict(custom="data.custom.levy")  # (4)!
...
...     @classmethod
...     def generate_key(
...         cls, 
...         key, 
...         index, 
...         columns, 
...         start_value=None,  # (5)!
...         alpha=None, 
...         beta=None, 
...         drift=None, 
...         vol=None, 
...         seed=None,
...         **kwargs
...     ):
...         start_value = cls.resolve_custom_setting(start_value, "start_value")  # (6)!
...         alpha = cls.resolve_custom_setting(alpha, "alpha")
...         beta = cls.resolve_custom_setting(beta, "beta")
...         drift = cls.resolve_custom_setting(drift, "drift")
...         vol = cls.resolve_custom_setting(vol, "vol")
...         seed = cls.resolve_custom_setting(seed, "seed")
...         if seed is not None:
...             np.random.seed(seed)
...
...         shape = (len(index), len(columns))
...         out = geometric_levy_price(alpha, beta, drift, vol, shape)
...         out = start_value * out
...         return pd.DataFrame(out, index=index, columns=columns)

>>> LevyData.set_custom_settings(  # (7)!
...     populate_=True,
...     start_value=100., 
...     alpha=1.68, 
...     beta=0.01, 
...     drift=0.0, 
...     vol=0.01, 
...     seed=None
... )

1. Imports np, pd, njit, and vbt.
2. Generation function, see Asset price mimicry.
3. Subclass SyntheticData.
4. Specify the path to global settings for this class and assign it a new identifier "custom". There's one more identifier already registered for us: "base", which points to general settings.
5. Set most arguments to None to pull their default value from the global settings. You can also hard-code the values here if you've decided to not use global settings.
6. Access the global settings and, if the argument value is None, replace None with the default value.
7. Populate the global settings for this class.
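The "None means use the stored default" pattern from the annotations can be illustrated without vectorbt. The sketch below is a simplified stand-in: `LEVY_DEFAULTS`, `resolve_setting`, and `describe_run` are hypothetical names, and a plain dictionary plays the role of the global settings.

```python
# Hypothetical stand-in for a per-class settings store (not vectorbt's settings object)
LEVY_DEFAULTS = {
    "start_value": 100.0,
    "alpha": 1.68,
    "beta": 0.01,
    "drift": 0.0,
    "vol": 0.01,
    "seed": None,
}


def resolve_setting(value, name, defaults=LEVY_DEFAULTS):
    # Fall back to the stored default only when the caller passed None
    return defaults[name] if value is None else value


def describe_run(start_value=None, alpha=None, vol=None):
    # Each argument defaults to None so that the settings store decides the value
    start_value = resolve_setting(start_value, "start_value")
    alpha = resolve_setting(alpha, "alpha")
    vol = resolve_setting(vol, "vol")
    return {"start_value": start_value, "alpha": alpha, "vol": vol}


print(describe_run())           # all values come from the defaults
print(describe_run(alpha=1.9))  # an explicit argument overrides the default
```

Testing with `is None` rather than truthiness matters here: an explicit `0.0` (for example, a zero drift) is a valid value and must not be replaced by the default.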

Let's try it out by generating and plotting the close price of several symbols:

>>> levy_data = LevyData.pull(
...     "Close",
...     keys_are_features=True,
...     columns=pd.Index(["BTC/USD", "ETH/USD", "XRP/USD"], name="symbol"),
...     start="2020-01-01 UTC",
...     end="2021-01-01 UTC",
...     seed=42
... )
>>> levy_data.get()
symbol                        BTC/USD     ETH/USD     XRP/USD
2020-01-01 00:00:00+00:00   99.218626  101.893255  100.371131
2020-01-02 00:00:00+00:00  100.062835   99.537102   97.857226
2020-01-03 00:00:00+00:00   95.321467  100.474547   98.246993
2020-01-04 00:00:00+00:00   96.493680   96.455981   99.797874
2020-01-05 00:00:00+00:00   98.489931   95.658733   98.892301
...                               ...         ...         ...
2020-12-27 00:00:00+00:00  189.477849   91.730109   55.055316
2020-12-28 00:00:00+00:00  190.620767   89.452822   59.555616
2020-12-29 00:00:00+00:00  187.641089   92.164802   60.034154
2020-12-30 00:00:00+00:00  188.287168   92.245270   59.188719
2020-12-31 00:00:00+00:00  185.500114   91.701142   58.443060

[366 rows x 3 columns]

>>> levy_data.get().vbt.plot().show()

Well done! We've built our own data mimicker that simulates sudden large changes in price.
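One way to check the "sudden large changes" claim is to compare how often large steps occur under the stable distribution versus a Gaussian of comparable scale. The sketch below is an independent sanity check, not part of LevyData: it draws steps with the same alpha and beta as the settings above and, as an assumption for a fair comparison, uses a normal distribution with standard deviation sqrt(2), which is what the stable law with scale 1 reduces to at alpha = 2. The threshold of 6 is arbitrary.

```python
import numpy as np
from scipy.stats import levy_stable

n = 5000

# Heavy-tailed steps with the alpha and beta used for LevyData above
stable_steps = levy_stable.rvs(1.68, 0.01, loc=0, scale=1, size=n, random_state=42)

# Gaussian reference: a stable law with alpha=2 and scale=1 is N(0, 2)
rng = np.random.default_rng(42)
normal_steps = rng.normal(0.0, np.sqrt(2.0), size=n)


def tail_share(steps, threshold=6.0):
    # Fraction of steps whose absolute size exceeds the (arbitrary) threshold
    return float(np.mean(np.abs(steps) > threshold))


stable_share = tail_share(stable_steps)
normal_share = tail_share(normal_steps)
print(stable_share, normal_share)
print(np.abs(stable_steps).max(), np.abs(normal_steps).max())
```

The stable sample should show a visibly larger share of extreme steps and a much larger maximum step, which is what produces the jumps in the generated prices.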
