Synthetic
Synthetic data is data that might have been generated by financial markets but was not. Synthetic price and return data address the small-data problem in finance and have numerous uses, including testing new investment strategies and feeding data-hungry ML models. They also help us detect discrepancies in behavior and outliers between real and mimicked markets. For example, if our model performs well on a subset of real-world data, we can check it against synthetic data to find out whether we unknowingly introduced look-ahead bias or some other weakness into the model.
To assist us in generating synthetic data, vectorbt implements the class SyntheticData, which takes the start date, the end date, and the frequency, and builds a datetime-like Pandas Index. It then calls the abstract class method SyntheticData.generate_key, which takes the key and the index, generates new data, and returns a Series or a DataFrame ready to be consumed by Data. The key here is either a feature or a symbol, depending on the data orientation that the user has chosen. We must override this method and implement our own data generation logic.
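The flow described above can be sketched without vectorbt. The following is a minimal illustration of the pattern only, not vectorbt's implementation: the names `ToySyntheticData`, `pull_key`, and `ConstantData` are hypothetical, the end date is assumed to be exclusive, and only pandas is required.

```python
from abc import ABC, abstractmethod

import pandas as pd


class ToySyntheticData(ABC):
    """Hypothetical stand-in that mimics the described flow; not vectorbt's API."""

    @classmethod
    def pull_key(cls, key, start, end, freq="D", **kwargs):
        # Build the datetime index from the start date, end date, and frequency
        index = pd.date_range(start=start, end=end, freq=freq)
        # Assumption: treat the end date as exclusive
        index = index[index < pd.Timestamp(end)]
        # Delegate the actual data generation to the subclass
        return cls.generate_key(key, index, **kwargs)

    @classmethod
    @abstractmethod
    def generate_key(cls, key, index, **kwargs):
        """Return a Series or DataFrame for the given key and index."""


class ConstantData(ToySyntheticData):
    @classmethod
    def generate_key(cls, key, index, value=100.0, **kwargs):
        # Trivial generation logic: a flat line at `value`
        return pd.Series(value, index=index, name=key)


flat = ConstantData.pull_key("Close", "2020-01-01", "2020-01-08")
print(flat)
```

The base class owns the index construction, so a subclass only has to answer one question: given a key and an index, what values should be produced?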
Note
If the logic depends on the data orientation (that is, whether features or symbols should be generated), you should be more specific and override SyntheticData.generate_symbol and/or SyntheticData.generate_feature.
There are two preset classes: RandomData, which uses cumulative normally-distributed returns, and GBMData, which uses Geometric Brownian Motion. Both generators are very basic but still useful for testing models. Their weakness is that they are driven by Gaussian shocks, whereas real asset prices regularly make dramatic moves in response to new information. To account for this, we'll build a data generator based on the Lévy alpha-stable distribution!
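For reference, the two preset ideas can be sketched in plain NumPy. This is an illustration under stated assumptions rather than the code behind RandomData or GBMData: the function names, parameter names, and default values are hypothetical, and the random-walk variant compounds simple returns, which is only one plausible reading of "cumulative normally-distributed returns".

```python
import numpy as np


def random_walk_price(n_steps, start_value=100.0, mean=0.0, std=0.01, seed=None):
    """Compound i.i.d. normal simple returns into a price path (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    returns = rng.normal(mean, std, size=n_steps)
    return start_value * np.cumprod(1.0 + returns)


def gbm_price(n_steps, start_value=100.0, mean=0.0, std=0.01, dt=1.0, seed=None):
    """Exact discretization of Geometric Brownian Motion (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(0.0, 1.0, size=n_steps)
    # Log-price increments: drift corrected by half the variance, plus a scaled shock
    log_steps = (mean - 0.5 * std ** 2) * dt + std * np.sqrt(dt) * shocks
    return start_value * np.exp(np.cumsum(log_steps))


rw_path = random_walk_price(365, seed=42)
gbm_path = gbm_price(365, seed=42)
print(rw_path[:3], gbm_path[:3])
```

Both paths are driven by Gaussian shocks, so a single-step move of many standard deviations is vanishingly rare. That is the limitation the Lévy-based generator below is meant to address.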
>>> from vectorbtpro import * # (1)!
>>> from scipy.stats import levy_stable
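>>> def geometric_levy_price(alpha, beta, drift, vol, shape):  # (2)!
...     # Draw i.i.d. Lévy alpha-stable shocks of the requested shape
...     _rvs = levy_stable.rvs(alpha, beta, loc=0, scale=1, size=shape)
...     # Accumulate the shocks along the time axis (axis 0)
...     _rvs_sum = np.cumsum(_rvs, axis=0)
...     # Exponentiate to get a positive price path relative to 1
...     return np.exp(vol * _rvs_sum + (drift - 0.5 * vol ** 2))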
>>> class LevyData(vbt.SyntheticData):  # (3)!
...
...     _settings_path = dict(custom="data.custom.levy")  # (4)!
...
...     @classmethod
...     def generate_key(
...         cls,
...         key,
...         index,
...         columns,
...         start_value=None,  # (5)!
...         alpha=None,
...         beta=None,
...         drift=None,
...         vol=None,
...         seed=None,
...         **kwargs
...     ):
...         start_value = cls.resolve_custom_setting(start_value, "start_value")  # (6)!
...         alpha = cls.resolve_custom_setting(alpha, "alpha")
...         beta = cls.resolve_custom_setting(beta, "beta")
...         drift = cls.resolve_custom_setting(drift, "drift")
...         vol = cls.resolve_custom_setting(vol, "vol")
...         seed = cls.resolve_custom_setting(seed, "seed")
...         if seed is not None:
...             np.random.seed(seed)
...
...         shape = (len(index), len(columns))
...         out = geometric_levy_price(alpha, beta, drift, vol, shape)
...         out = start_value * out
...         return pd.DataFrame(out, index=index, columns=columns)
>>> LevyData.set_custom_settings(  # (7)!
...     populate_=True,
...     start_value=100.,
...     alpha=1.68,
...     beta=0.01,
...     drift=0.0,
...     vol=0.01,
...     seed=None
... )
1. Imports np, pd, njit, and vbt
2. Generation function, see Asset price mimicry
3. Subclass SyntheticData
4. Specify the path to global settings for this class and assign it a new identifier "custom". There's one more identifier already registered for us: "base", which points to general settings.
5. Set most arguments to None to pull their default value from the global settings. You can also hard-code the values here if you've decided to not use global settings.
6. Access the global settings and, if the argument value is None, replace it with the default value
7. Populate the global settings for this class
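The None-defaulting pattern described in the annotations above can be illustrated in plain Python. This is a simplified sketch of the idea only: `DEFAULTS`, `resolve_setting`, and `generate` are hypothetical names and do not reflect how vectorbt stores or resolves its settings.

```python
# Hypothetical defaults, standing in for a global settings store
DEFAULTS = {"start_value": 100.0, "alpha": 1.68, "seed": None}


def resolve_setting(value, name, defaults=DEFAULTS):
    """Return the explicit value, or the stored default when the value is None."""
    return defaults[name] if value is None else value


def generate(start_value=None, alpha=None, seed=None):
    # Every argument defaults to None and is resolved at call time
    start_value = resolve_setting(start_value, "start_value")
    alpha = resolve_setting(alpha, "alpha")
    seed = resolve_setting(seed, "seed")
    return start_value, alpha, seed


print(generate())           # (100.0, 1.68, None)
print(generate(alpha=1.9))  # (100.0, 1.9, None)
```

Because defaults are looked up when the function is called rather than baked into the signature, changing the stored defaults affects all later calls without editing the function.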
Let's try it out by generating and plotting the close price for several symbols:
>>> levy_data = LevyData.pull(
...     "Close",
...     keys_are_features=True,
...     columns=pd.Index(["BTC/USD", "ETH/USD", "XRP/USD"], name="symbol"),
...     start="2020-01-01 UTC",
...     end="2021-01-01 UTC",
...     seed=42
... )
>>> levy_data.get()
symbol                        BTC/USD     ETH/USD     XRP/USD
2020-01-01 00:00:00+00:00   99.218626  101.893255  100.371131
2020-01-02 00:00:00+00:00  100.062835   99.537102   97.857226
2020-01-03 00:00:00+00:00   95.321467  100.474547   98.246993
2020-01-04 00:00:00+00:00   96.493680   96.455981   99.797874
2020-01-05 00:00:00+00:00   98.489931   95.658733   98.892301
...                               ...         ...         ...
2020-12-27 00:00:00+00:00  189.477849   91.730109   55.055316
2020-12-28 00:00:00+00:00  190.620767   89.452822   59.555616
2020-12-29 00:00:00+00:00  187.641089   92.164802   60.034154
2020-12-30 00:00:00+00:00  188.287168   92.245270   59.188719
2020-12-31 00:00:00+00:00  185.500114   91.701142   58.443060

[366 rows x 3 columns]
>>> levy_data.get().vbt.plot().show()
Well done! We've built our own data mimicker that simulates sudden large changes in price.
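One way to sanity-check that claim is to compare the largest single shock in a Lévy alpha-stable sample against a Gaussian sample of the same size. The sketch below reuses the `alpha` and `beta` values from the settings above and calls `levy_stable.rvs` directly. The sample size, the seed, and the helper name `largest_shock` are arbitrary choices for illustration, and the interquartile range is used as the yardstick because the variance of an alpha-stable distribution with `alpha < 2` is infinite.

```python
import numpy as np
from scipy.stats import levy_stable, norm


def largest_shock(samples):
    """Largest absolute draw, in units of the sample's interquartile range."""
    q25, q75 = np.percentile(samples, [25, 75])
    return np.abs(samples).max() / (q75 - q25)


n = 10_000
levy_draws = levy_stable.rvs(1.68, 0.01, loc=0, scale=1, size=n, random_state=42)
normal_draws = norm.rvs(loc=0, scale=1, size=n, random_state=42)

print(f"Levy:   {largest_shock(levy_draws):.1f} IQRs")
print(f"Normal: {largest_shock(normal_draws):.1f} IQRs")
```

If the Lévy figure comes out several times larger than the Gaussian one, the generator is producing the kind of rare, outsized moves that the normal-based generators cannot.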