Automated stock trading strategy using deep reinforcement learning and recurrent neural networks.
We begin by discussing the challenges in stock trading, particularly the issues related to noisy and irregular data. The proposed solution is a Deep Reinforcement Learning (DRL) model with an architecture designed to handle these challenges by learning from historical stock data.
The dataset includes stock prices of major companies such as NVIDIA, Microsoft, Apple, Amazon, and Google, covering the period from January 1, 2009, to June 1, 2024. Because this period contains significant market events such as the 2020 pandemic and the outbreak of wars, the dataset is split into a training period (2009-2022) and a testing period (2022-2024), and a metric called the "Turbulence Threshold" is introduced to handle extreme market fluctuations.
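The report does not spell out the formula behind this threshold; assuming it follows the standard financial turbulence index (a Mahalanobis distance of current returns from their historical distribution, as commonly used in DRL trading work), it can be sketched as follows. The lookback length and function name are illustrative.

```python
import numpy as np
import pandas as pd

def turbulence_index(returns: pd.DataFrame, lookback: int = 252) -> pd.Series:
    """Mahalanobis-distance-style turbulence of daily returns (illustrative sketch).

    `returns` has one column per stock (e.g., NVDA, MSFT, AAPL, AMZN, GOOG).
    Days whose index exceeds a chosen threshold are treated as turbulent,
    and the agent can then be restricted to selling or holding.
    """
    turbulence = pd.Series(0.0, index=returns.index)
    for t in range(lookback, len(returns)):
        hist = returns.iloc[t - lookback:t]               # trailing window of returns
        mu = hist.mean().values
        cov = np.cov(hist.values, rowvar=False)
        diff = returns.iloc[t].values - mu
        # (y_t - mu) * Sigma^{-1} * (y_t - mu)^T: large values flag unusual market moves
        turbulence.iloc[t] = float(diff @ np.linalg.pinv(cov) @ diff)
    return turbulence
```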
Before training the model, the data undergoes several preprocessing steps, such as normalization and feature extraction. The importance of data consistency, the impact of weighting data points, and strategies for maintaining data integrity are emphasized.
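As a rough illustration of this stage (the column names and the exact feature set are assumptions, not the report's implementation), a minimal normalization and feature-extraction step might look like:

```python
import pandas as pd

def preprocess(prices: pd.DataFrame) -> pd.DataFrame:
    """Minimal preprocessing sketch: fill gaps, extract simple features, z-score normalize."""
    df = prices.sort_index().ffill()                        # keep the price series consistent (no missing days)
    features = pd.DataFrame(index=df.index)
    features["return"] = df["close"].pct_change()           # daily return as an extracted feature
    features["ma_10"] = df["close"].rolling(10).mean()      # short moving average as a trend indicator
    features = features.dropna()
    return (features - features.mean()) / features.std()    # z-score normalization per feature
```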
The environment is designed using the OpenAI Gym framework, simulating a stock market in which the agent (the trading algorithm) can buy, sell, or hold stocks. The action space consists of discrete actions corresponding to buying, selling, or holding, while the observation space includes various stock-related quantities such as prices and technical indicators.
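A minimal sketch of such an environment, written against the classic OpenAI Gym API; the observation layout (a window of past prices plus cash and holdings) and the reward (change in portfolio value) are illustrative assumptions rather than the report's exact design.

```python
import gym
import numpy as np
from gym import spaces

class StockTradingEnv(gym.Env):
    """Toy single-stock trading environment: the agent buys, sells, or holds one share at a time."""

    def __init__(self, prices, window_size=30):
        super().__init__()
        self.prices = np.asarray(prices, dtype=np.float32)
        self.window_size = window_size
        # Discrete action space: 0 = sell, 1 = hold, 2 = buy
        self.action_space = spaces.Discrete(3)
        # Observation: the last `window_size` prices plus current cash and shares held
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(window_size + 2,), dtype=np.float32
        )

    def _obs(self):
        window = self.prices[self.t - self.window_size:self.t]
        return np.concatenate([window, [self.cash, self.shares]]).astype(np.float32)

    def reset(self):
        self.t = self.window_size
        self.cash, self.shares = 10_000.0, 0
        self.portfolio_value = self.cash
        return self._obs()

    def step(self, action):
        price = float(self.prices[self.t])
        if action == 2 and self.cash >= price:      # buy one share
            self.cash -= price
            self.shares += 1
        elif action == 0 and self.shares > 0:       # sell one share
            self.cash += price
            self.shares -= 1
        self.t += 1
        done = self.t >= len(self.prices)
        new_value = self.cash + self.shares * float(self.prices[self.t - 1])
        reward = new_value - self.portfolio_value   # reward = change in portfolio value
        self.portfolio_value = new_value
        return self._obs(), reward, done, {}
```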
The proposed model is CLSTM-PPO (Cascading Long Short-Term Memory - Proximal Policy Optimization). It uses LSTM layers to capture temporal dependencies in the stock data and the PPO algorithm to optimize trading decisions. The model is trained to maximize cumulative return while limiting risks such as the maximum pullback.
The following algorithm summarizes the process of our work:
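In outline, and assuming the Stable Baselines3 ecosystem (RecurrentPPO from sb3-contrib as the LSTM-based PPO agent, a library version that still accepts the classic Gym API, and the StockTradingEnv sketched above; the synthetic price series below only stands in for the real data), the workflow can be sketched as:

```python
import numpy as np
from sb3_contrib import RecurrentPPO   # PPO with an LSTM policy (Stable Baselines3 contrib)

# Stand-in price series for the training (2009-2022) and testing (2022-2024) periods.
rng = np.random.default_rng(0)
train_prices = 100 * rng.lognormal(mean=0.0005, sigma=0.02, size=3000).cumprod()
test_prices = 100 * rng.lognormal(mean=0.0005, sigma=0.02, size=600).cumprod()

train_env = StockTradingEnv(train_prices, window_size=30)
test_env = StockTradingEnv(test_prices, window_size=30)

# Train an LSTM policy with PPO (the CLSTM-PPO idea) on the training period.
model = RecurrentPPO("MlpLstmPolicy", train_env, verbose=0)
model.learn(total_timesteps=10_000)

# Back-test on the held-out period and collect per-step rewards for the metrics below.
obs, state, done, rewards = test_env.reset(), None, False, []
while not done:
    action, state = model.predict(obs, state=state, deterministic=True)
    obs, reward, done, _ = test_env.step(int(action))
    rewards.append(reward)
```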
The model’s performance is evaluated using several financial metrics:
Cumulative Return (CR)
Max Earning Rate (MER)
Maximum Pullback (MPB)
Average Profitability Per Trade (APPT)
Sharpe Ratio (SR)
These metrics help assess the profitability, risk, and overall performance of the trading strategy.
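For reference, these metrics can be computed from the daily portfolio values and per-trade profits roughly as below; the precise definitions of MER and APPT used in the report may differ, so this is a common formulation rather than the report's own code.

```python
import numpy as np

def evaluate(portfolio_values, trade_profits, trading_days=252, risk_free_rate=0.0):
    """Common formulations of CR, MER, MPB, APPT, and SR (illustrative)."""
    v = np.asarray(portfolio_values, dtype=float)
    daily_returns = v[1:] / v[:-1] - 1.0

    cr = v[-1] / v[0] - 1.0                                     # Cumulative Return
    mer = v.max() / v[0] - 1.0                                  # Max Earning Rate: peak gain over the start
    running_peak = np.maximum.accumulate(v)
    mpb = ((running_peak - v) / running_peak).max()             # Maximum Pullback: worst drop from a running peak
    appt = float(np.mean(trade_profits)) if len(trade_profits) else 0.0  # Average Profitability Per Trade
    excess = daily_returns - risk_free_rate / trading_days
    sr = np.sqrt(trading_days) * excess.mean() / excess.std()   # annualized Sharpe Ratio

    return {"CR": cr, "MER": mer, "MPB": mpb, "APPT": appt, "SR": sr}
```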
- Time Window Size:
This hyperparameter defines the length of the sequence of data points (e.g., days of stock prices) that the model considers as input, and it is crucial for capturing patterns over time.
Impact: A larger time window allows the model to capture long-term trends and dependencies in the data, but it also increases model complexity and the computational resources required. A smaller time window focuses on short-term patterns, which may miss broader trends but can react faster to recent changes.
Values Tested: Window sizes of 5, 15, 30, and 50 were tested (see the configuration sketch after this list). The optimal window size depends on the specific dataset and trading strategy; the report finds that larger windows generally improve performance but come with trade-offs.
- Hidden Size of LSTM Networks:
This is the number of units in the hidden layers of the LSTM (Long Short-Term Memory) networks, which model the temporal dependencies in the stock data.
Impact: A larger hidden size allows the model to capture more complex patterns and interactions in the data, which can improve accuracy but also increases the risk of overfitting, especially with limited data. A smaller hidden size reduces the risk of overfitting and the computational cost, but may fail to capture intricate relationships in the data.
Considerations: The report suggests tuning the hidden size according to the complexity of the data and the amount of available training data.
- Number of Time Steps:
This hyperparameter defines the number of steps, or the sequence length, that the model processes at once during training.
Impact: More time steps let the model consider a longer sequence of past events, which helps in understanding long-term dependencies but increases computational cost and the risk of overfitting. Fewer time steps make the model focus on the immediate past, which can speed up training but may miss important historical information.
Value Used: The report uses a large number of time steps (e.g., 10,000) so that the model can learn from extensive historical data.
- Boolean Parameter for State Termination:
This Boolean hyperparameter controls when the model should terminate a sequence or trajectory during training, ending the current state when certain conditions are met.
Impact: If set to True, the model terminates the sequence early, which can prevent overfitting by not letting the model focus too long on any particular pattern. If set to False, the model keeps learning from the current sequence, which can help capture long-term dependencies but may lead to overfitting.
Use Case: This parameter adds flexibility to training, making the model adaptable to different market conditions by controlling how long it focuses on specific sequences.
- Architectural Hyperparameters:
These are additional hyperparameters governing the overall training process:
Learning Rate: Determines how quickly the model adjusts its parameters in response to the gradients. A higher learning rate can speed up training but may overshoot the optimal solution, while a lower rate gives more precise updates at the cost of slower convergence.
Batch Size: The number of samples processed before the model's parameters are updated. Larger batches make the gradient estimates more stable but require more memory; smaller batches allow more frequent updates but with noisier gradients.
Number of Epochs: How many times the entire dataset is passed through the model during training. Too few epochs can result in underfitting, while too many can lead to overfitting.
Values Used: The report uses batch sizes of 64 and 128 and tests around 10 and 15 epochs, balancing training time and model performance (these values appear in the configuration sketch after this list).
- Stable Baselines Parameters:
These parameters concern the specific implementation of the Proximal Policy Optimization (PPO) algorithm within the Stable Baselines library:
Value Function Coefficient: Weights the contribution of the value function in the loss calculation.
Advantage Estimation: Adjusts how the advantages (differences between expected and actual rewards) are calculated and used in training.
Implementation: The report suggests setting these parameters according to the recommendations in the related literature or to the default settings of the Stable Baselines framework so that the PPO algorithm performs well; a configuration sketch combining them with the hyperparameters above is given below.
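To make the roles of these hyperparameters concrete, the sketch below ties them together in a single RecurrentPPO configuration. The batch size, number of epochs, and total time steps reflect values mentioned in the report; every other value, the keyword names, and the reuse of the `train_env` from the earlier environment sketch are assumptions based on the sb3-contrib API.

```python
from sb3_contrib import RecurrentPPO

# Hypothetical configuration combining the hyperparameters discussed above.
# `train_env` is the trading environment sketched earlier, built with one of the
# tested time window sizes (5, 15, 30, or 50).
model = RecurrentPPO(
    "MlpLstmPolicy",
    train_env,
    learning_rate=3e-4,                         # step size of gradient updates (illustrative default)
    n_steps=2048,                               # rollout length collected before each update
    batch_size=64,                              # report tests 64 and 128
    n_epochs=10,                                # report tests around 10 and 15
    gae_lambda=0.95,                            # advantage estimation smoothing (GAE)
    vf_coef=0.5,                                # value function coefficient in the loss
    policy_kwargs=dict(lstm_hidden_size=128),   # hidden size of the LSTM layers
    verbose=1,
)
model.learn(total_timesteps=10_000)             # number of time steps mentioned in the report
```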
It is worth highlighting that larger time windows generally lead to better model performance, but with diminishing returns and increased computational cost.
While the proposed method can yield profitable trading strategies, it is also sensitive to market conditions and requires careful tuning of hyperparameters.