Bayesian optimization (BO), a sequential decision-making strategy that efficiently finds a global optimum by iterating between search and evaluation, has recently gained great popularity owing to its efficiency in terms of data requirements. In chemical engineering, where data generation through experiments is often time-consuming and expensive, the utility of BO has been examined on diverse experimental design problems, including materials discovery and reaction design [1-2]. BO achieves its efficiency by balancing exploration and exploitation of the search space; to this end, a so-called acquisition function is defined and optimized. In standard BO, however, the widely used acquisition functions are one-step optimal: they consider only the immediate improvement at the next step and ignore the gains that would accrue over many rounds of future evaluations. Yet hardly any real-world decision-making problem can be solved in a single iteration; real-world problems require decision making over multiple iterations starting from the initial knowledge state. Theoretically, obtaining an optimal solution to the multi-step lookahead BO problem requires solving a stochastic dynamic programming (DP) problem, which is computationally intractable in almost all cases. Several approximate methods for multi-step lookahead BO have therefore been suggested [3-4]; however, they are either too computationally expensive to implement or restricted to two-step lookahead.
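The one-step-optimal BO loop described above can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes a Gaussian-process surrogate with an RBF kernel and the Expected Improvement (EI) acquisition function, which scores a candidate only by its immediate expected improvement over the incumbent best; all function names and hyperparameters are illustrative.

```python
# Minimal one-step-lookahead BO sketch: GP surrogate + Expected Improvement.
# Illustrative only; kernel length scale and grid are arbitrary choices.
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, length=0.3):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    # Posterior mean and std of a zero-mean GP at query points Xq.
    Kinv = np.linalg.inv(rbf_kernel(X, X) + noise * np.eye(len(X)))
    Ks = rbf_kernel(X, Xq)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf_kernel(Xq, Xq) - Ks.T @ Kinv @ Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, y_best):
    # One-step optimal: values only the immediate improvement over y_best.
    z = (y_best - mu) / sigma  # minimization convention
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

f = lambda x: (x - 0.65) ** 2          # toy objective, minimum at x = 0.65
X = np.array([0.1, 0.5, 0.9]); y = f(X)
Xq = np.linspace(0.0, 1.0, 201)

for _ in range(5):                      # the search/evaluate iteration
    mu, sigma = gp_posterior(X, y, Xq)
    x_next = Xq[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))

x_best = X[np.argmin(y)]                # should land near 0.65
```

Because EI is maximized greedily at every round, the loop never reasons about how today's query shapes the information available at later rounds, which is exactly the limitation the multi-step lookahead formulation addresses.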
In this work, we propose a reinforcement learning (RL) based BO architecture for multi-step lookahead decision making in an uncertain environment. RL is used to solve the DP problem approximately and efficiently, thereby enabling multi-step lookahead decision making. To incorporate RL into BO, the BO problem must first be translated into the Markov decision process (MDP) form that RL requires. Unlike the games and robotics domains to which RL has been applied thus far, proper definitions of the agent's state and reward are not obvious for the BO problem. This paper therefore suggests a novel way of defining an MDP for solving the multi-step lookahead BO problem. Proximal Policy Optimization (PPO), a state-of-the-art RL algorithm, is employed in this work. The performance of the proposed RL-based BO has been tested on several benchmark functions by comparing the average regret at each decision step with that of conventional BO. The proposed BO attains lower average regret than conventional BO, meaning that the RL-based BO finds a better optimum faster. The proposed BO can be applied to a variety of sequential decision-making problems cast in an unknown environment (e.g., one with an unknown decision-reward map) to accelerate the finding of the global optimal solution.
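To make the BO-to-MDP translation concrete, one possible formulation (an illustrative sketch, not the paper's actual definition) treats the GP posterior summarized on a fixed grid as the state, the grid index to query as the action, and the one-step improvement of the incumbent best as the reward. The class and variable names below are hypothetical.

```python
# Illustrative MDP wrapper around a black-box minimization problem.
# State: posterior mean/std on a grid plus the incumbent best value.
# Action: grid index to query. Reward: improvement of the incumbent best.
import numpy as np

GRID = np.linspace(0.0, 1.0, 51)

def posterior(X, y, noise=1e-6, length=0.2):
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    Kinv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    Ks = k(X, GRID)
    mu = Ks.T @ Kinv @ y
    var = 1.0 - np.einsum('ij,ik,kj->j', Ks, Kinv, Ks)  # diag of posterior cov
    return mu, np.sqrt(np.maximum(var, 1e-12))

class BOEnv:
    """Casts one episode of BO (a full budget of evaluations) as an MDP."""
    def __init__(self, f, horizon=10):
        self.f, self.horizon = f, horizon

    def reset(self):
        x0 = np.random.choice(GRID, size=2, replace=False)  # initial design
        self.X, self.y, self.t = x0, self.f(x0), 0
        return self._state()

    def _state(self):
        mu, sigma = posterior(self.X, self.y)
        # What the policy observes: posterior summary + incumbent best.
        return np.concatenate([mu, sigma, [self.y.min()]])

    def step(self, action):
        x = GRID[action]
        y_new = self.f(np.array([x]))[0]
        reward = max(self.y.min() - y_new, 0.0)   # one-step improvement
        self.X = np.append(self.X, x)
        self.y = np.append(self.y, y_new)
        self.t += 1
        return self._state(), reward, self.t >= self.horizon

env = BOEnv(lambda x: (x - 0.3) ** 2)
s = env.reset()
s, r, done = env.step(np.argmin(s[:GRID.size]))   # greedy-on-mean action
```

A policy trained on such an environment with PPO maximizes the cumulative improvement over the whole episode rather than the next step alone, which is what gives the approach its multi-step lookahead character.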