Transitioning from Supervised Learning systems to Multi-Agent Reinforcement learning for financial platforms — Part 3

Prakhar Gurawa
InsiderFinance Wire
6 min read · Oct 18, 2021


Previous post:

Application of Reinforcement Learning on our financial systems

Describing the existing machine learning framework for our financial system

The current model is based on supervised learning, where the typical workflow can be represented as data collection, model training, trading policy, backtesting, and live testing. The market data is sourced from providers such as Quandl and AlphaVantage, split into training, validation, and test sets, and passed to a time-based LSTM model that predicts a probable profit or loss. The trading policy then lets us make appropriate decisions based on the output of the LSTM model. Backtesting is performed on a hold-out set of historic market data, and finally all system parameters are frozen and the system is tested on live data. This mechanism works, but not efficiently or profitably enough, which leads us to move to other techniques such as multi-agent systems and reinforcement learning.
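Below is a minimal, hypothetical sketch of this supervised pipeline, assuming Keras; the window length, binary profit/loss labelling, and layer sizes are illustrative assumptions, not the production configuration.

```python
# Hypothetical sketch: windowed close prices fed to an LSTM that predicts
# the probability of a profitable next move. All sizes are illustrative.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(prices, window=30):
    """Slice a 1-D price series into (samples, window, 1) tensors and
    binary labels (1 if the next close is higher than the current one)."""
    X, y = [], []
    for t in range(len(prices) - window - 1):
        X.append(prices[t:t + window])
        y.append(1.0 if prices[t + window + 1] > prices[t + window] else 0.0)
    return np.array(X)[..., None], np.array(y)

prices = np.cumsum(np.random.randn(1000)) + 100.0   # stand-in for Quandl/AlphaVantage data
X, y = make_windows(prices)

model = Sequential([LSTM(32, input_shape=(X.shape[1], 1)),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, validation_split=0.2, epochs=5, verbose=0)
```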

Describing the financial MDP

For every MDP we need to define its state space (the states the agent can be in), action space (the different actions the agent can take), and reward (the reward the agent receives for taking a particular action in a particular state).
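Formally, this amounts to specifying the usual MDP tuple; the transition model P and discount factor γ below are part of the standard definition rather than anything stated explicitly in this post:

```latex
\[
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle,
\qquad P(s' \mid s, a), \qquad
R : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}, \qquad
\gamma \in [0, 1)
\]
```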

Level of Agents

We will have macro agents (which decide whether to buy, sell, or hold an asset) and micro agents (which decide where to place the order). The action output of the macro agent is the input of the micro agent, which finally places an order in the limit order book. Both agents operate under a cooperative assumption: each works to maximize the overall reward. The multi-agent reinforcement learning framework for financial use cases using macro and micro agents is described in the next figure.
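As a rough illustration of that division of labour (class names, methods, and the single-unit order size below are assumptions, not the article's implementation):

```python
# Illustrative macro/micro interface: the macro agent emits a trading
# signal, the micro agent turns it into a concrete limit order.
from dataclasses import dataclass
from typing import Optional

HOLD, BUY, SELL = 0, 1, 2          # trading signals produced by the macro agent

@dataclass
class LimitOrder:
    side: int                       # BUY or SELL
    price: float
    size: int

class MacroAgent:
    def act(self, market_state) -> int:
        """Map macro-level features to a trading signal (placeholder policy)."""
        return HOLD

class MicroAgent:
    def place(self, signal: int, order_book) -> Optional[LimitOrder]:
        """Turn the macro signal into a limit order at the touch."""
        if signal == HOLD:
            return None
        price = order_book.best_ask if signal == BUY else order_book.best_bid
        return LimitOrder(side=signal, price=price, size=1)
```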

State space

We can build a state space consisting of technical indicators such as MACD (Moving Average Convergence Divergence (Anghel 2015 [1])), RSI (Relative Strength Index (Bhargavi, Gumparthi, and Anith 2017 [2])), Williams %R (a momentum indicator invented by Larry Williams; Dahlquist 2011), Weighted Bar Direction (a weighted measure of the direction and importance of the candlestick (William and Jafari 2011 [3])), and the previous day's High-Low range. These indicators are chosen because of their simplicity and popularity. Mathematically, our position state is represented as [A, B, C], where A is the number of contracts already bought, B is the number of contracts already sold, and C is the profit or loss of the current position.
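A minimal sketch of how such an indicator-based state vector could be computed with pandas; the look-back periods are conventional defaults that the article does not specify, and the Weighted Bar Direction feature is left out because its weighting scheme is not given.

```python
# Indicator features for the state space, assuming a DataFrame `df`
# with 'high', 'low' and 'close' columns. Periods are conventional defaults.
import pandas as pd

def macd(close: pd.Series, fast=12, slow=26, signal=9) -> pd.Series:
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    line = ema_fast - ema_slow
    return line - line.ewm(span=signal, adjust=False).mean()      # MACD histogram

def rsi(close: pd.Series, period=14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

def williams_r(df: pd.DataFrame, period=14) -> pd.Series:
    hh = df["high"].rolling(period).max()
    ll = df["low"].rolling(period).min()
    return -100 * (hh - df["close"]) / (hh - ll)

def state_features(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "macd": macd(df["close"]),
        "rsi": rsi(df["close"]),
        "williams_r": williams_r(df),
        "hl_range": (df["high"] - df["low"]).shift(1),   # previous day's High-Low range
    })
```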

Action space

The action space for our use case is small, containing only three actions, represented as 0 to hold, 1 to buy, and 2 to sell the asset. The chosen action belongs to the set A = {hold, buy, sell}, also known as the asset trading signal.

Reward space

As our aim is to maximize return, the reward is proportional to the profit at a particular state. The rewards are defined as below, with pt the close price of the asset at time t and β the commission.
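The original reward equation is shown as an image in the post and is not reproduced here; the sketch below is one common formulation of such a reward (the position implied by the action earns the next price change, minus a commission β charged whenever the position changes) and should be read as an assumption, not the author's exact definition.

```python
# Hedged sketch of a commission-adjusted trading reward. Signs, scaling and
# the commission model are assumptions, not the article's exact equation.
HOLD, BUY, SELL = 0, 1, 2
POSITION = {HOLD: 0, BUY: +1, SELL: -1}

def reward(action, prev_action, p_t, p_next, beta=0.001):
    pos, prev_pos = POSITION[action], POSITION[prev_action]
    pnl = pos * (p_next - p_t)                   # profit from holding the implied position
    cost = beta * abs(pos - prev_pos) * p_t      # commission when the position changes
    return pnl - cost
```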

Learning Algorithm: Q Learning

We will use a well-known temporal-difference algorithm, Q-learning, a model-free method for learning the optimal policy π : S → A that maps states to actions. Q-learning is preferred over SARSA here because Q-learning learns the optimal policy directly, whereas SARSA learns a near-optimal, more risk-averse policy; learning the optimal policy is crucial for our case. The Q-values are updated with the update rule below:
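The post shows this update as an image; it is the standard Q-learning update, with α the learning rate and γ the discount factor:

```latex
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big]
\]
```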

For the exploration-exploitation trade-off, we will use a decaying ε-greedy policy, in which the agent takes a random action with probability ε and the policy (greedy) action with probability 1 − ε; ε decays over time.
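A minimal tabular sketch combining the update rule and the decaying ε-greedy policy; env is a placeholder environment whose reset() and step(action) are assumed to return states, rewards, and a done flag, and all hyperparameters are illustrative.

```python
# Tabular Q-learning with a decaying epsilon-greedy policy (illustrative only).
import random
from collections import defaultdict

def train(env, episodes=500, alpha=0.1, gamma=0.99,
          eps_start=1.0, eps_min=0.05, eps_decay=0.995):
    n_actions = 3                                   # hold, buy, sell
    Q = defaultdict(lambda: [0.0] * n_actions)
    eps = eps_start
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < eps:               # explore
                action = random.randrange(n_actions)
            else:                                   # exploit
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, r, done = env.step(action)  # assumed (state, reward, done) signature
            td_target = r + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
        eps = max(eps_min, eps * eps_decay)         # decay exploration rate
    return Q
```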

Working of macro and micro agents

For the macro agent, the state space at time t consists of historic price data from t−h to t, where h is a hyperparameter representing how far we need to go back in history to predict the price at time t. The macro agent's reward function is taken from Yagna Patel [4]. The training of the macro agent, in which a neural network is used to learn the optimal policy, is described below.
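The training figure is not reproduced here; as a rough stand-in, this is how the macro agent's state at time t could be assembled from the look-back window h and the indicator features defined earlier (the exact feature layout is an assumption):

```python
# Assemble the macro agent's state: the last h closes plus current indicators.
import numpy as np

def macro_state(prices: np.ndarray, features: np.ndarray, t: int, h: int) -> np.ndarray:
    """Stack the price history from t-h to t with the indicator vector at t."""
    history = prices[t - h:t]            # look-back window of length h
    return np.concatenate([history, features[t]])
```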

Finally, the micro agent determines the action to place an order in the limit order book. The quantity and size of the order are determined by the data provided by the macro agent. Here the order book is a data structure that holds all outstanding bid and ask price/volume levels [4]. Also, the overall state space for the macro agent is the historic price of the stock together with the quantitative measures we decided on before.
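A toy version of such an order-book structure, for illustration only (a production book would also track individual orders, timestamps, and queue priority):

```python
# Minimal order book: price -> volume maps per side, with best-bid/ask helpers.
from dataclasses import dataclass, field

@dataclass
class OrderBook:
    bids: dict = field(default_factory=dict)    # price -> resting bid volume
    asks: dict = field(default_factory=dict)    # price -> resting ask volume

    @property
    def best_bid(self) -> float:
        return max(self.bids) if self.bids else float("nan")

    @property
    def best_ask(self) -> float:
        return min(self.asks) if self.asks else float("nan")

    def add(self, side: str, price: float, volume: int) -> None:
        book = self.bids if side == "bid" else self.asks
        book[price] = book.get(price, 0) + volume
```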

The overall working of the pipeline with macro and micro agents

Multi-agent system development software suggestions

Several agent platforms are suitable for this approach, such as JADE [5] (Java Agent Development Framework), a free framework written in Java for developing FIPA-compliant agents, or PADE [6] (Python Agent Development Framework).

Conclusion

A multi-agent reinforcement learning approach seems well suited to this space, given the cost of labelled datasets and the need for an autonomous system that can adapt on the fly without retraining. With appropriate hyperparameters, agents and multi-agent systems should also handle chaotic markets better than the existing supervised approach, as they can capture the more dynamic aspects of the market in a real-time environment.

References

[1]: Anghel, Gabriel Dan I. 2015. “Stock Market Efficiency and the MACD. Evidence from Countries around the World”.

[2]: Bhargavi, R., Srinivas Gumparthi, and R. Anith. 2017. “Relative Strength Index for Developing Effective Trading Strategies in Constructing Optimal Portfolio.” International Journal of Applied Engineering Research 12 (19): 8926–36.

[3]: William, Ron, and Sheba Jafari. 2011. “Candlestick Analysis.”

[4]: Yagna Patel “Optimizing Market Making using Multi-Agent Reinforcement Learning”.

[5]: JADE (Java Agent Development Framework), https://jade.tilab.com/

[6]: PADE https://pade.readthedocs.io/en/latest/#python-agent-development-framework

Other papers for a better understanding of this field:

  • Simon Kuttruff “A Machine Learning framework for Algorithmic trading on Energy markets”.
  • Prakhar Ganesh, Puneet Rakheja. “Deep Reinforcement Learning in High Frequency Trading”.
  • B. J. Heaton et al., “Deep learning in finance”.
  • Y. Li, “Deep reinforcement learning: an overview”.
  • Zhenhan Huang, Fumihide Tanaka “A Modularized and Scalable Multi-Agent Reinforcement Learning-based System for Financial Portfolio Management”.

