Natl Sci Open, Volume 3, Number 5, 2024
Article Number: 20230054 | Pages: 18 | Section: Information Sciences
DOI: https://doi.org/10.1360/nso/20230054
Published online: 22 March 2024
RESEARCH ARTICLE
Learning the continuous-time optimal decision law from discrete-time rewards
^{1} School of Automation, Guangdong University of Technology, Guangdong Key Laboratory of IoT Information Technology, Guangzhou 510006, China
^{2} Key Laboratory of Intelligent Information Processing and System Integration of IoT, Ministry of Education, Guangzhou 510006, China
^{3} School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
^{4} 111 Center for Intelligent Batch Manufacturing Based on IoT Technology, Guangzhou 510006, China
^{5} UTA Research Institute, the University of Texas at Arlington, Fort Worth 76118, USA
^{6} Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville 37996, USA
^{7} Oak Ridge National Laboratory, Oak Ridge 37830, USA
^{8} Guangdong-Hong Kong-Macao Joint Laboratory for Smart Discrete Manufacturing, Guangzhou 510006, China
^{*} Corresponding authors (emails: ci.chen@gdut.edu.cn (Ci Chen); elhxie@ntu.edu.sg (Lihua Xie); shlxie@gdut.edu.cn (Shengli Xie))
Received: 6 September 2023
Revised: 18 January 2024
Accepted: 18 March 2024
The concept of reward is fundamental in reinforcement learning, with a wide range of applications in the natural and social sciences. Seeking an interpretable reward for decision-making that largely shapes the system's behavior has always been a challenge in reinforcement learning. In this work, we explore a discrete-time reward for reinforcement learning in continuous time and action spaces that represent many phenomena captured by applying physical laws. We find that the discrete-time reward leads to the extraction of the unique continuous-time decision law and improved computational efficiency by dropping the integrator operator that appears in classical results with integral rewards. We apply this finding to solve output-feedback design problems in power systems. The results reveal that our approach removes an intermediate stage of identifying dynamical models. Our work suggests that the discrete-time reward is efficient in searching for the desired decision law, which provides a computational tool to understand and modify the behavior of large-scale engineering systems using the learned optimal decision.
Key words: continuous-time state and action / decision law learning / discrete-time reward / dynamical systems / reinforcement learning
© The Author(s) 2024. Published by Science Press and EDP Sciences.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
Reinforcement learning (RL) refers to action-based learning [1], which modifies control policies based on rewards received from the natural and man-made systems under study. RL can be interpreted as the invisible hand introduced by Adam Smith, as it describes the social benefits of actions governed by certain interests. Biologically, living organisms learn to act by interacting with the environment and observing the resulting reward stimulus. In the cognitive sciences, RL was used by Burrhus Skinner to study behavior pattern learning based on reinforcement and punishment stimuli.
Recent advances in RL [2-5] have revealed its advantage in understanding complex dynamical systems and extracting decision laws for an underlying system. A seminal breakthrough by Google DeepMind [6, 7] has resulted in a promising approach to playing the game of Go by using evaluation-based action selections with the help of powerful algorithmic computing. The resulting agent achieved a long-standing goal of artificial intelligence by defeating a world champion in the game of Go, the complexity of which is far beyond the ability of humans to master or model.
Although remarkable, most RL results were developed for discrete-time systems. For example, in the game of Go, every board position lies within a 19×19 grid, meaning that the state and action spaces on which the RL algorithms operate are discrete. However, the advantages of using a continuous-time model, rather than a discrete-time one, become clear when one aims to identify explicit analytical laws for underlying physical systems [8], such as Newton's laws of motion and the Navier-Stokes equations in fluid dynamics [9]. In such cases, discrete-time-system-oriented RL, such as the Go example in refs. [6, 7], may not be applicable for searching decision laws for the underlying continuous-time system.
One classic method to obtain continuous-time decision laws is to compute them after system identification. A review of system identification techniques is documented in ref. [10]. Various techniques have recently been employed to discover governing equations, including symbolic regression-based modeling [8, 11], sparse identification [12], empirical dynamic modeling [13, 14], and automated inference of dynamics [15]. In system identification-based methods, the system model must be learned before the control design.
In the field of RL, progress in distilling decision laws for continuous-time dynamic processes has been slow. Some early attempts have been made, including refs. [16-20]. However, these works share a severe limitation: solving a continuous-time Bellman equation is required for obtaining the optimal decision law. In ref. [16], Euler's method was used to discretize the Bellman equation so that RL-based methods for discrete-time systems can be applied. The main concern for ref. [16] is that its result is based only on the discretized system and may not lead to the optimal control policy even if the sampling period becomes small. Instead of using discretization, an exact method for continuous-time RL was given in ref. [21], wherein an integral reward is designed for feedback. This reward-feedback-based technique was later termed integral reinforcement learning (IRL) [22], as it requires that the integral reward be available for feedback. It is interesting to seek a continuous-time optimal decision law via the integral reward, since the reward is one of the fundamental units in RL for shaping the behavior of complex systems. Later, several learning-based studies relied on the assumption of feeding back the integral reward [23-28].
However, it is not always desirable to use such an integral reward, as the integral operation is computationally expensive and storage-intensive, especially for dynamical systems of large dimensions. As an illustrative example, suppose the utility for learning is defined as a quadratic energy function of the system state and action. Given this kind of utility, calculating the integral reward requires taking the tensor product of two vector spaces (state and action) with dimensions n and m. After eliminating duplicate entries in the integral operation, the total dimension of the stored data for computing the reward becomes $\frac{1}{2}n\left(n+1\right)+\frac{1}{2}m\left(m+1\right)$ for each sample. Even more data storage and computation are occupied when action and output data are collected over a long period. It is thus challenging and interesting to avoid these drawbacks of the integral reward while still extracting the continuous-time optimal decision law.
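As a quick check of this storage count, the following minimal sketch (dimension values are illustrative) evaluates the per-sample storage $\frac{1}{2}n\left(n+1\right)+\frac{1}{2}m\left(m+1\right)$:

```python
# Per-sample storage required by the integral reward after removing the
# duplicate entries of the symmetric state-state and action-action products.
def integral_reward_storage(n, m):
    return n * (n + 1) // 2 + m * (m + 1) // 2

# Example: n = 100 states and m = 10 actions (illustrative dimensions)
print(integral_reward_storage(100, 10))  # 5105 entries per sample
```

The count grows quadratically in both n and m, which is the drawback the discrete-time reward avoids.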
In this work, we focus on a discrete-time reward, which starts at $r(u({t}_{1})\mathrm{,}\,y({t}_{1}))$, stacks $r(u({t}_{2})\mathrm{,}\,y({t}_{2}))$ below, and continues with all $r(u({t}_{i})\mathrm{,}\,y({t}_{i}))$ until $r(u({t}_{s})\mathrm{,}\,y({t}_{s}))$. This is in sharp contrast to the IRL-based method, which uses the integral reward that starts at ${\int}_{{t}_{1}}^{{t}_{2}}r\left(u(t)\mathrm{,}\,y(t)\right)\text{d}t$ and continues until ${\int}_{{t}_{s-1}}^{{t}_{s}}r(u(t)\mathrm{,}\,y(t))\text{d}t$ for feedback. The discrete-time reward has clear physical merits, as it can be sampled at nonuniform times and represents a simple slice of the overall integral reward. However, the inner mechanism of learning a decision law for underlying continuous-time dynamical systems from a discrete-time reward remains unstudied. In this work, we aim to study such a mechanism and propose an analytical reinforcement learning framework using the discrete-time reward to capture the optimal decision law for continuous-time dynamical systems. This technical innovation comes from the introduction of state derivative feedback into the learning process, which is in sharp contrast to the existing works based on IRL. We apply this framework to solve output-feedback design problems in power systems. Note that an output-feedback decision law design was given in ref. [28], which, however, required computing integrals of the system input and output, wherein one integral is used to formulate rewards and the other for system state reconstruction. Compared to ref. [28], we remove the computation of integral rewards and need only one integral operator, for system state reconstruction, in the output-feedback design. The presented framework is a data-driven approach that removes the intermediate stage of identifying dynamical models required in model-based control design methods. Our result suggests an analytical framework for achieving desired performance for complex dynamical systems.
CONTINUOUS-TIME OPTIMAL DECISION LAW LEARNING FROM DISCRETE-TIME REWARD
In this work, we revisit the problem of optimal decision law learning for continuous-time dynamical systems. We notice that an analytical dynamical system model is necessary to extract its explicit decision law. Here, we consider the following linear time-invariant continuous-time dynamical system, which is extensively used to capture a large number of physical phenomena in different communities, ranging from control science [23] and neuroscience [29, 30] to complex network science [31-33]:$\{\begin{array}{l}\dot{x}(t)=Ax(t)+Bu(t)\mathrm{,}\hfill \\ y(t)=Cx(t)\mathrm{,}\hfill \end{array}$(1)where the notation t denotes the time for system evolution, $x\left(t\right)={\left[{x}_{1}\left(t\right)\mathrm{,}\,{x}_{2}\left(t\right)\mathrm{,}\,\dots \mathrm{,}\,{x}_{n}\left(t\right)\right]}^{\text{T}}\in {\mathbb{R}}^{n}$ represents the stacked state at time t with dimension n×1, like the operating states of organs within a human digestive system, $y\left(t\right)={\left[{y}_{1}\left(t\right)\mathrm{,}\,{y}_{2}\left(t\right)\mathrm{,}\,\dots \mathrm{,}\,{y}_{p}\left(t\right)\right]}^{\text{T}}\in {\mathbb{R}}^{p}$ denotes the output measurement, like the mouth condition among the organs of the digestive system, and $u\left(t\right)={\left[{u}_{1}\left(t\right)\mathrm{,}\,{u}_{2}\left(t\right)\mathrm{,}\,\dots \mathrm{,}\,{u}_{m}\left(t\right)\right]}^{\text{T}}\in {\mathbb{R}}^{m}$ is the system action for applying the decision law to transform the system state, like the action of eating to stimulate the digestive system. The matrix A is called the drift dynamics, denoting how the system state evolves without any action. The action matrix B describes a mapping between the system state and the controller through which we attempt to change the behavior of the system. The matrix C denotes a mapping from the state to the output measurement. The system in eq. (1) is assumed to satisfy the controllability of (A, B), which evaluates the capability of control in manipulating the state. Its dual concept is the observability of (A, C), which evaluates the ability to observe the state from the output. The controllability and observability conditions are standard and essential for system design and control, and have been widely considered in recent works such as refs. [31-33]. The decision law is also termed a control policy that aims to take an initial state x(0) to a state with a prescribed performance using the output y(t).
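To make eq. (1) concrete, here is a minimal forward-Euler simulation sketch; the matrices A, B, C below are illustrative placeholders rather than any system from the paper:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # drift dynamics (Hurwitz here)
B = np.array([[0.0], [1.0]])               # action matrix
C = np.array([[1.0, 0.0]])                 # output map

def simulate(x0, u_fn, dt=1e-3, steps=1000):
    """Forward-Euler integration of x' = Ax + Bu, returning outputs y = Cx."""
    x = np.array(x0, dtype=float)
    ys = []
    for k in range(steps):
        u = u_fn(k * dt)
        x = x + dt * (A @ x + B @ u)
        ys.append(C @ x)
    return np.array(ys)

# Unforced response from x(0) = [1, 0]: the output decays since A is Hurwitz.
ys = simulate([1.0, 0.0], lambda t: np.array([0.0]))
print(ys[-1])
```

For this choice of A, the exact unforced output at t = 1 is $2e^{-1}-e^{-2}\approx 0.60$, which the Euler trajectory approximates closely.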
To determine the decision law u(t), we collect input-output data $U\in {\mathbb{R}}^{s\times m}$ and $Y\in {\mathbb{R}}^{s\times p}$ over time for the system evolution as$U=\left[\begin{array}{c}{u}^{\text{T}}\left({t}_{1}\right)\\ \vdots \\ {u}^{\text{T}}\left({t}_{i}\right)\\ \vdots \\ {u}^{\text{T}}\left({t}_{s}\right)\end{array}\right]=\left[\begin{array}{cccc}{u}_{1}\left({t}_{1}\right)& {u}_{2}\left({t}_{1}\right)& \dots & {u}_{m}\left({t}_{1}\right)\\ \vdots & \vdots & \ddots & \vdots \\ {u}_{1}\left({t}_{i}\right)& {u}_{2}\left({t}_{i}\right)& \dots & {u}_{m}\left({t}_{i}\right)\\ \vdots & \vdots & \ddots & \vdots \\ {u}_{1}\left({t}_{s}\right)& {u}_{2}\left({t}_{s}\right)& \dots & {u}_{m}\left({t}_{s}\right)\end{array}\right],$$Y=\left[\begin{array}{c}{y}^{\text{T}}\left({t}_{1}\right)\\ \vdots \\ {y}^{\text{T}}\left({t}_{i}\right)\\ \vdots \\ {y}^{\text{T}}\left({t}_{s}\right)\end{array}\right]=\left[\begin{array}{cccc}{y}_{1}\left({t}_{1}\right)& {y}_{2}\left({t}_{1}\right)& \dots & {y}_{p}\left({t}_{1}\right)\\ \vdots & \vdots & \ddots & \vdots \\ {y}_{1}\left({t}_{i}\right)& {y}_{2}\left({t}_{i}\right)& \dots & {y}_{p}\left({t}_{i}\right)\\ \vdots & \vdots & \ddots & \vdots \\ {y}_{1}\left({t}_{s}\right)& {y}_{2}\left({t}_{s}\right)& \dots & {y}_{p}\left({t}_{s}\right)\end{array}\right]\mathrm{,}$where the nonuniform sampling times satisfy ${t}_{1}<{t}_{2}<\dots <{t}_{i}<\dots <{t}_{s-1}<{t}_{s}$ with ${t}_{1}$ and ${t}_{s}$ being, respectively, the points of time when the data collection starts and ends.
Given the discrete-time data samples U and Y, we now formulate a reward function as${\Theta}_{r}\left(U\text{,}Y\right)=\left[\begin{array}{c}r\left(u\left({t}_{1}\right)\mathrm{,}\,y\left({t}_{1}\right)\right)\\ \vdots \\ r\left(u\left({t}_{i}\right)\mathrm{,}\,y\left({t}_{i}\right)\right)\\ \vdots \\ r\left(u\left({t}_{s}\right)\mathrm{,}\,y\left({t}_{s}\right)\right)\end{array}\right]\mathrm{,}$(2)where the utility $r(u({t}_{i})\mathrm{,}\,y({t}_{i}))$ is defined as ${u}^{\text{T}}({t}_{i})Ru({t}_{i})+{y}^{\text{T}}({t}_{i})Qy({t}_{i})$ with the weighting matrices Q and R being symmetric positive definite for tuning the outputs and actions; the vertical dots indicate that ${\Theta}_{r}(U\mathrm{,}Y)$ is a collection of the utility data that starts at $r(u({t}_{1})\mathrm{,}\,y({t}_{1}))$ and continues through all $r(u({t}_{i})\mathrm{,}\,y({t}_{i}))$ until $r(u({t}_{s})\mathrm{,}\,y({t}_{s}))$. The notation ${\Theta}_{r}(U\mathrm{,}Y)$ is a vector containing rewards observed by nonuniform sampling. Eq. (2) is called a discrete-time reward, which is in contrast to refs. [21, 22] with an integral reward consisting of the integration ${\int}_{{t}_{i}}^{{t}_{i+1}}\left[{u}^{\text{T}}(t)Ru(t)+{y}^{\text{T}}(t)Qy(t)\right]\text{d}t$.
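A small sketch of assembling the discrete-time reward vector in eq. (2) from stacked samples (function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def discrete_time_reward(U, Y, R, Q):
    """Row i is r(u(t_i), y(t_i)) = u^T R u + y^T Q y at sample time t_i."""
    return np.einsum('ij,jk,ik->i', U, R, U) + np.einsum('ij,jk,ik->i', Y, Q, Y)

U = np.array([[1.0, 0.0], [0.0, 2.0]])   # s = 2 action samples, m = 2
Y = np.array([[1.0], [3.0]])             # s = 2 output samples, p = 1
R, Q = np.eye(2), np.eye(1)
print(discrete_time_reward(U, Y, R, Q))  # [ 2. 13.]
```

Each entry is a plain quadratic form evaluated at one sample time; no integration over the sampling interval is needed.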
Then, we ask: can we design the decision law u(t) based on the discrete-time reward to solve the optimization problem$J(Q\text{,}R\text{,}u(t)\mathrm{,}\,y(t))=\underset{u(t)}{\mathrm{min}}{\displaystyle {\int}_{0}^{\infty}}r(u(t)\mathrm{,}\,y(t))\text{d}t$(3)without prior knowledge of the system dynamics A, B, C, and the system state x(t)?
The optimization criterion in eq. (3) concerns minimizing the energy in the outputs with the least possible supplied action energy over the infinite continuous-time horizon. The available information for the design of a decision law is only the input and output data. This is called an output-feedback design, meaning that only the output y(t), rather than the state x(t), is available. The output-feedback design is more challenging than the state-feedback one, since the system output y(t) represents only a part of the full state x(t). A key issue that needs addressing is how to use the discrete-time reward for decision law learning within the output-feedback design.
We shall introduce an analytical framework, as illustrated in Figure 1, for extracting the optimal decision law that minimizes the criterion in eq. (3) based on the discrete-time reward in eq. (2). A distinguishing feature of the presented framework is that the discrete-time reward occupies a central place in learning. The information flow in Figure 1 illustrates that the input-output data are first collected as discrete-time data samples, based on which the discrete-time reward is constructed. Then, the discrete-time reward is fed back to a critic module for updating the value estimate. Next, this updated value estimate is used for control policy improvement, which finally leads to the optimal decision law.
Figure 1 Schematic framework of the reinforcement learning algorithm using policy iteration for continuous-time dynamical systems. (A) At each time t=t_{i}, for i=1, 2, …, one observes the current output y(t) and action u(t). The sampled input-output data are collected along the trajectory of the dynamical system in real time, and are stacked over the time interval [t_{1}, t_{s}] as the discrete-time input-output data U and Y. (B) The input-output data U and Y, associated with the prescribed optimization criterion, are used for updating the value estimate given in the critic module, based on which the control policy in the actor module is updated. The ultimate goal of this framework is to use the input-output data U and Y for learning the optimal decision law that minimizes the user-defined optimization criterion J(Q, R, u(t), y(t)).
One question for the framework in Figure 1 is its solvability. This was partially answered by the control system community (Supplementary information, Section 1A). Assuming that the system dynamics A, B, C and the system state x(t) are available for the design, a model-based offline decision law u(t) that solves the optimization of eq. (3) is given by [23]$u(t)={K}^{*}x(t)\mathrm{,}$(4)where ${K}^{*}$ is an optimal decision gain determined by ${K}^{*}=-{R}^{-1}{B}^{\text{T}}{P}^{*}$ with the matrix ${P}^{*}$ obtained from solving an algebraic Riccati equation involving the full system dynamics A, B, C (Supplementary information, Section 1B).
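For reference, the model-based baseline can be sketched with SciPy's Riccati solver, using the standard LQR relation $u={K}^{*}x$ with ${K}^{*}=-{R}^{-1}{B}^{\text{T}}{P}^{*}$; the matrices are illustrative and assume full model knowledge, which the data-driven method avoids:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Qx = np.eye(2)    # state weighting (plays the role of C^T Q C here)
R = np.eye(1)

# Solve the algebraic Riccati equation for P*, then form the optimal gain.
P = solve_continuous_are(A, B, Qx, R)
K = -np.linalg.solve(R, B.T @ P)

# The closed loop A + B K* must be Hurwitz (eigenvalues in the left half-plane).
print(np.linalg.eigvals(A + B @ K).real < 0)   # [ True  True]
```

This offline solution requires A and B explicitly, which is exactly the knowledge the subsequent data-driven design dispenses with.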
Considering that only the system output y(t), rather than the system state x(t), is available, we may set up an output-feedback design to approximate the optimal decision law in eq. (4) through$u(t)={K}_{\text{o}}^{*}\Phi (u(t)\mathrm{,}y(t))\mathrm{,}$(5)where ${K}_{\text{o}}^{*}$ is a feedback decision gain and Φ(u(t), y(t)) is a feedforward signal from the data-learning point of view. Here, eq. (5) provides a way of transforming the design problem of u(t) into two subproblems: searching for ${K}_{\text{o}}^{*}$ and for Φ(u(t), y(t)). Indeed, the design of Φ(u(t), y(t)) is feedforward, depending on the control and output only. One realization form for Φ(u(t), y(t)) will later be specified as η(t) in this study. We thus shift the focus to resolving the feedback gain ${K}_{\text{o}}^{*}$ from the discrete-time reward in eq. (2), under the key premise that the model A, B, C and the state x(t) defined in eq. (1) exist but accurate model information is not available for design beforehand.
To search for the gain ${K}_{\text{o}}^{*}$ that meets the optimization criterion in eq. (3) without prior knowledge of A, B, C, one may turn to machine learning for the solution. In the setting of machine learning, an unknown system is referred to as an unknown environment. Thus, through interactions with the environment, the design of a control policy that maximizes a reward (equivalently, minimizes the cost given in eq. (3)) is termed RL [4]. Recent advances have revealed that RL is a promising method across various disciplines for finding a decision law that gives rise to satisfactory system performance.
Although great success has been achieved, RL-based results typically assume that the state and action are constrained to a discrete-time space, so it is not readily feasible to learn the decision law in eq. (5) for the continuous-time systems in eq. (1). The framework of continuous-time systems is more suitable for modeling most physical phenomena, as the models of physical systems obtained from the application of physical laws are naturally in continuous-time forms, as in refs. [23, 29-33]. Note that the discretization technique may not be applicable for transforming continuous-time systems into discrete-time ones. The reason is rooted in the different structures of the optimal decision law for continuous-time and discrete-time systems.
Another key observation for the framework in Figure 1 is that the dynamical systems are indeed continuous-time in terms of the state x(t), while the rewards for feedback are sampled over discrete time series. The discrete-time data principle is the cornerstone of parameter learning, with numerous applications ranging from control and signal processing to astrophysics and economics. Although it is now possible to utilize IRL for learning a continuous-time optimal decision law, the method of IRL violates such a principle for data collection and processing. The direct aftermath is that IRL requires measuring the integral of the tensor product of two vector spaces over the time interval $[{t}_{i}\mathrm{,}{t}_{i+1}]$, including the output-action data ${\int}_{{t}_{i}}^{{t}_{i+1}}y(\tau )\otimes u(\tau )\text{d}\tau $ (or state-action data ${\int}_{{t}_{i}}^{{t}_{i+1}}x(\tau )\otimes u(\tau )\text{d}\tau $), action-action data ${\int}_{{t}_{i}}^{{t}_{i+1}}u(\tau )\otimes u(\tau )\text{d}\tau $, and output-output data ${\int}_{{t}_{i}}^{{t}_{i+1}}y(\tau )\otimes y(\tau )\text{d}\tau $ (or state-state data ${\int}_{{t}_{i}}^{{t}_{i+1}}x(\tau )\otimes x(\tau )\text{d}\tau $), wherein the symbol $\otimes $ denotes the Kronecker product operator. These integral tensored data are required in IRL as the smallest unit for formulating the integral rewards. Recent advances in adaptive optimal control support this view of the decision law design [22-28], where the continuous-time integration operator has to be applied over the tensor product.
Here, we advocate using discrete-time data samples as the smallest unit, from which the discrete-time reward is constructed for learning the feedback gain ${K}_{\text{o}}^{*}$. We shall explore the inner mechanism of learning a decision law for the underlying continuous-time dynamical system from the discrete-time reward, and provide rigorous mathematical reasoning for the decision law learning.
The schematic of the presented RL-based framework is illustrated in Figure 2, with a focus on constructing a suitable discrete-time reward for feedback learning.
Figure 2 Computational approach for deriving optimal design laws from the data. (A) Preprocess the actions and outputs of the dynamical system and construct the feedforward signals that will be used for the feedback gain learning and the design of an online real-time control loop (Supplementary information, Section 2A). (B) Measure the input-output data, as well as the feedforward signals, over discrete time series, based on which the discrete-time data samples are assembled using the tensor product (Supplementary information, Section 2B). (C) This part is central for learning the feedback gain ${K}_{\text{o}}^{*}$ from discrete-time data. First, calculate the Bellman equation for optimality via policy iterations. Then, through policy evaluation and improvement, the optimal feedback gain is obtained from the discrete-time data samples with rigorous mathematical operations and convergence deduction (Supplementary information, Section 2C). Finally, both the feedforward signal in (A) and the feedback gain ${K}_{\text{o}}^{*}$ contribute to the optimal decision law in eq. (5).
In Figure 2A, the input-output signals, u(t) and y(t), determine the data sets U and Y and also the feedforward signals $\eta (t)={\left[{\eta}_{u}^{\text{T}}(t)\mathrm{,}\,{\eta}_{y}^{\text{T}}(t)\right]}^{\text{T}}\in {\mathbb{R}}^{n(m+p)}$ and $\theta (t)={\left[{\dot{\eta}}_{u}^{\text{T}}(t)\mathrm{,}{\dot{\eta}}_{y}^{\text{T}}(t)\right]}^{\text{T}}\in {\mathbb{R}}^{n(m+p)}$ satisfying${\dot{\eta}}_{u}(t)=({I}_{m}\otimes {D}_{\eta}){\eta}_{u}(t)+u(t)\otimes b\mathrm{,}$(6)${\dot{\eta}}_{y}(t)=({I}_{p}\otimes {D}_{\eta}){\eta}_{y}(t)+y(t)\otimes b\mathrm{,}$(7)where the companion matrix ${D}_{\eta}$ and the vector b are user-defined variables as detailed in Supplementary information, Section 2A. The matrix ${D}_{\eta}$ should be made Hurwitz by choosing the entries on its last row to be negative. Let the feedforward signal Φ(u(t), y(t)) in eq. (5) be realized as $\eta (t)$, which denotes the change of the state after the parametrization [37]. This further generates the following data sets collected over several time instants:${\Theta}_{\eta}=\left[\begin{array}{c}{\eta}^{\text{T}}({t}_{1})\\ \vdots \\ {\eta}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times n(p+m)}\mathrm{,}$(8)${\Theta}_{\theta}=\left[\begin{array}{c}{\theta}^{\text{T}}({t}_{1})\\ \vdots \\ {\theta}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times n(p+m)}\mathrm{,}$(9)where n, p, and m are the row dimensions of x(t), y(t), and u(t), respectively, and s denotes the number of time samples. Note that $\eta (t)$ and $\theta (t)$ are vectors in the continuous-time space, as opposed to the data matrices ${\Theta}_{\eta}$ and ${\Theta}_{\theta}$. The results in Figure 2A and B reveal how to learn the feedback gain ${K}_{\text{o}}^{*}$ from the discrete-time reward.
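A sketch of the feedforward filter in eq. (6) for a scalar input (m = 1), integrated with forward Euler; the values of $D_{\eta}$ and b below are assumed user choices for illustration, not values from the paper:

```python
import numpy as np

D_eta = np.array([[0.0, 1.0], [-2.0, -3.0]])   # Hurwitz companion matrix
b = np.array([0.0, 1.0])

def step_eta(eta_u, u, dt):
    # For m = 1 the Kronecker products in eq. (6) reduce to D_eta and u * b.
    return eta_u + dt * (D_eta @ eta_u + u * b)

eta_u = np.zeros(2)
dt = 1e-3
for _ in range(10000):          # drive the filter with a constant input u = 1
    eta_u = step_eta(eta_u, 1.0, dt)

# Steady state solves D_eta @ eta_u + b = 0, i.e. eta_u -> -inv(D_eta) @ b.
print(np.round(eta_u, 2))       # approaches [0.5, 0]
```

Because $D_{\eta}$ is Hurwitz, the filter state forgets its own initial condition exponentially, which is what makes η(t) usable as the feedforward realization of Φ(u(t), y(t)).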
Now, all the data sets required for learning the gain ${K}_{\text{o}}^{*}$ have been determined, namely U, Y, ${\Theta}_{\eta}$, and ${\Theta}_{\theta}$. Based on these four data sets, we are ready to construct the following discrete-time data samples:${\Theta}_{\eta u}=\left[\begin{array}{c}{\eta}^{\text{T}}({t}_{1})\otimes {u}^{\text{T}}({t}_{1})\\ \vdots \\ {\eta}^{\text{T}}({t}_{s})\otimes {u}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times nm(p+m)}\mathrm{,}$${\Theta}_{yy}=\left[\begin{array}{c}{y}^{\text{T}}({t}_{1})\otimes {y}^{\text{T}}({t}_{1})\\ \vdots \\ {y}^{\text{T}}({t}_{s})\otimes {y}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times {p}^{2}}\mathrm{,}$${\Theta}_{\eta y}=\left[\begin{array}{c}{\eta}^{\text{T}}({t}_{1})\otimes {y}^{\text{T}}({t}_{1})\\ \vdots \\ {\eta}^{\text{T}}({t}_{s})\otimes {y}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times np(p+m)}\mathrm{,}$${\Theta}_{\eta \eta}=\left[\begin{array}{c}{\eta}^{\text{T}}({t}_{1})\otimes {\eta}^{\text{T}}({t}_{1})\\ \vdots \\ {\eta}^{\text{T}}({t}_{s})\otimes {\eta}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times {n}^{2}{(p+m)}^{2}}\mathrm{,}$${\Theta}_{\theta \eta}=\left[\begin{array}{c}{\theta}^{\text{T}}({t}_{1})\otimes {\eta}^{\text{T}}({t}_{1})\\ \vdots \\ {\theta}^{\text{T}}({t}_{s})\otimes {\eta}^{\text{T}}({t}_{s})\end{array}\right]\in {\mathbb{R}}^{s\times {n}^{2}{(p+m)}^{2}}\mathrm{,}$using the tensor product of discrete-time spaces, the data flow of which is given in Figure 2B. Take ${\Theta}_{\eta u}$ for example. Each row of ${\Theta}_{\eta u}$ is obtained by the tensor product of the vectors $\eta ({t}_{i})$ and $u({t}_{i})$, while each column is collected over the time series ranging from t=t_{1} to t=t_{s}.
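Each of these sample matrices can be assembled row by row with a Kronecker product; a minimal sketch for ${\Theta}_{\eta u}$ (names and values are illustrative):

```python
import numpy as np

def stack_kron(Eta, U):
    """Row i is eta(t_i)^T kron u(t_i)^T; rows are stacked over the s samples."""
    return np.vstack([np.kron(Eta[i], U[i]) for i in range(Eta.shape[0])])

Eta = np.array([[1.0, 2.0], [3.0, 4.0]])   # s = 2 feedforward samples
U = np.array([[5.0], [6.0]])               # s = 2 action samples, m = 1
print(stack_kron(Eta, U))    # [[ 5. 10.]
                             #  [18. 24.]]
```

No integration appears anywhere: each row is formed from one sampling instant, which is the "smallest unit" property emphasized above.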
The learning philosophy has now evolved from the integral reward to the discrete-time one, resulting in extra design benefits. The main advantage of using the discrete-time reward is the significantly improved computational efficiency compared with the method of IRL. For example, in the considered output-feedback design, the data space for storage equals the sum of the dimensions of U, Y, ${\Theta}_{\eta}$, and ${\Theta}_{\theta}$, which is labeled ${T}_{\text{tal}\mathrm{,1}}=s\times [p+m+2n(p+m)]$. If a system with the same dimension were treated in the setting of IRL, the data space for storage would become ${T}_{\text{tal}\mathrm{,2}}=s\times \left[nm(p+m)+\frac{p(p+1)}{2}+np(p+m)+\frac{(np+nm)(np+nm+1)}{2}\right]$, which is obtained by summing the dimensions of ${\Theta}_{\eta u}$, ${\Theta}_{yy}$, ${\Theta}_{\eta y}$, and ${\Theta}_{\eta \eta}$ after eliminating the duplicate elements in ${\Theta}_{yy}$ and ${\Theta}_{\eta \eta}$. The column length of ${T}_{\text{tal}\mathrm{,1}}$ is much less than that of ${T}_{\text{tal}\mathrm{,2}}$, especially for a large magnitude of s, p, m, or n. Besides, the integral operator has to be imposed on all the ${T}_{\text{tal}\mathrm{,2}}$ samples in IRL, while it is avoided for the ${T}_{\text{tal}\mathrm{,1}}$ samples. This reveals that less data storage and less computational time are consumed in discrete-time-reward-based learning than in IRL.
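The two storage counts can be compared directly; a sketch with illustrative dimensions (the per-sample formulas follow the text above, with s = 1):

```python
def ttal1(n, m, p):
    # discrete-time reward: U, Y, Theta_eta, Theta_theta per sample
    return p + m + 2 * n * (p + m)

def ttal2(n, m, p):
    # IRL: integral tensored data per sample, with duplicates removed
    return (n * m * (p + m) + p * (p + 1) // 2 + n * p * (p + m)
            + (n * p + n * m) * (n * p + n * m + 1) // 2)

# Example: n = 10 states, m = 2 actions, p = 3 outputs (illustrative)
print(ttal1(10, 2, 3), ttal2(10, 2, 3))   # 105 1531
```

Even for this modest system, the IRL storage is more than an order of magnitude larger, and the gap widens quadratically with n.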
We seek the feedback gain ${K}_{\text{o}}^{*}$ by employing policy iteration. Let the gain matrix at the kth iteration be ${\overline{K}}^{k}$, and let the vector ${S}^{k}$, whose bottom block is $\text{vec}\left({\overline{K}}^{k}\right)$, collect the unknowns to be learned:${S}^{k}=\left[\begin{array}{c}\vdots \\ \text{vec}\left({\overline{K}}^{k}\right)\end{array}\right]\mathrm{,}$(10)where vec(·) is the vectorization operator and ${\overline{K}}^{0}$ denotes an initial stabilizing gain obtained by trial and error. Construct libraries ${\Theta}^{k}\left({\overline{K}}^{k}\mathrm{,}{\Theta}_{\theta \eta}\mathrm{,}{\Theta}_{\eta \eta}\mathrm{,}{\Theta}_{\eta y}\mathrm{,}{\Theta}_{\eta u}\right)$ and ${\Phi}^{k}\left({\overline{K}}^{k}\mathrm{,}{\Theta}_{\eta \eta}\mathrm{,}{\Theta}_{yy}\right)$ consisting of the discrete-time data samples ${\Theta}_{\theta \eta}$, ${\Theta}_{\eta y}$, ${\Theta}_{\eta u}$, ${\Theta}_{\eta \eta}$, and ${\Theta}_{yy}$.
As a variant of the reward in eq. (2), the discrete-time reward used in the policy iteration is defined as follows:${\Theta}_{r}({U}_{k}\mathrm{,}Y)=\left[\begin{array}{c}r({u}_{k}({t}_{1})\mathrm{,}y({t}_{1}))\\ \vdots \\ r({u}_{k}({t}_{s})\mathrm{,}y({t}_{s}))\end{array}\right]\mathrm{,}$(11)where ${u}_{k}(t)={\overline{K}}^{k}\eta (t)$ denotes the iterative decision law at the kth iteration step. Considering that the optimal decision law is unknown, one needs to use the reward of eq. (11) in the policy iteration, rather than eq. (2), to ensure algorithmic stability. Based on eq. (11), straightforward manipulation leads to ${\Theta}_{r}({U}_{k}\mathrm{,}Y)={\Theta}_{\eta \eta}\text{vec}\left({({\overline{K}}^{k})}^{\text{T}}R{\overline{K}}^{k}\right)+{\Theta}_{yy}\text{vec}(Q)$. The discrete-time reward in eq. (11) is nonuniformly sampled and can be stacked into the form of ${\Phi}^{k}({\overline{K}}^{k}\mathrm{,}{\Theta}_{\eta \eta}\mathrm{,}{\Theta}_{yy})$.
With the collected data, we construct a policy iterationbased Bellman equation for solving S^{k+1}:${\Theta}^{k}({\overline{K}}^{k}\mathrm{,}{\Theta}_{\theta \eta}\mathrm{,}{\Theta}_{\eta \eta}\mathrm{,}{\Theta}_{\eta y}\mathrm{,}{\Theta}_{\eta u}){S}^{k+1}={\Phi}^{k}({\overline{K}}^{k}\mathrm{,}{\Theta}_{\eta \eta}\mathrm{,}{\Theta}_{yy})\mathrm{,}$(12)where the matrix ${\overline{K}}^{k+1}$, contained in S^{k+1}, is the feedback gain that we want to learn (Supplementary information, Section 2B). A verifiable condition$\text{rank}([{\Theta}_{\eta \eta}\mathrm{,}{\Theta}_{\eta u}])=(nm+np)\left(\frac{nm+np+1}{2}+m\right)$(13)
is proposed for evaluating the richness of the collected data samples that uniquely solve the iterative gain ${\overline{K}}^{k+1}$ from eq. (12) (Supplementary information, Section 2C). This rank condition is related to persistent excitation, which is well known in parameter estimation and adaptive control [37-39].
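In practice the rank condition in eq. (13) can be checked numerically; a sketch with random placeholder arrays standing in for real trajectory data (so this only illustrates the check itself):

```python
import numpy as np

n, m, p, s = 2, 1, 1, 60
q = n * (m + p)                                   # dimension of eta(t)
rng = np.random.default_rng(1)
Theta_eta_eta = rng.standard_normal((s, q * q))   # placeholder samples
Theta_eta_u = rng.standard_normal((s, q * m))

q2 = n * m + n * p
required = q2 * (q2 + 1) // 2 + q2 * m            # right-hand side of eq. (13)
rank = np.linalg.matrix_rank(np.hstack([Theta_eta_eta, Theta_eta_u]))
print(rank >= required)    # data are rich enough if this holds
```

With real data, a failure of this check signals that the exploration signal was insufficiently exciting and more (or richer) samples are needed before solving eq. (12).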
As illustrated in Figure 2C, the computation in eq. (12) is carried out iteratively, replacing ${\overline{K}}^{k}$ from the previous step with the newly computed gain until the convergence criteria are met. Such an iterative procedure ultimately leads to the unique optimal feedback gain. Label the converged ${\overline{K}}^{k+1}$ as ${K}_{\text{o}}^{*}$; the decision law is then given by eq. (5), which solves the optimization in eq. (3) (Supplementary information, Section 2C).
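The convergence behavior of such a policy iteration can be illustrated with its classical model-based analog, Kleinman's algorithm (a hedged sketch: it assumes A and B are known, which the data-driven scheme above avoids, and uses the sign convention u = K x):

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

K = np.zeros((1, 2))    # A itself is Hurwitz here, so K^0 = 0 is stabilizing
for _ in range(10):
    Ak = A + B @ K
    # policy evaluation: Ak^T P + P Ak + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # policy improvement
    K = -np.linalg.solve(R, B.T @ P)

P_star = solve_continuous_are(A, B, Q, R)
print(np.allclose(P, P_star, atol=1e-8))   # iteration reaches the optimum
```

Each iteration solves a linear (Lyapunov) equation rather than the nonlinear Riccati equation, mirroring how eq. (12) solves a linear system in the unknowns at every step.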
As for solving ${\overline{K}}^{k+1}$ from eq. (12), the computation error, termed ${e}_{k}({t}_{i})$, consists of the multiplication of a matrix exponential and an initial system state, ${e}^{(A-LC){t}_{i}}x(0)$ (Supplementary information, Section 2B). This is indeed an error in the basis function approximation for resolving the state [28], which occurs in the preprocessing period for the feedforward signals η(t) and θ(t) as illustrated in Figure 2A. This error vanishes if x(0)=0. Although the output-feedback design does not allow manipulating the initial state x(0), one can decrease the computation error ${e}_{k}({t}_{i})$ by executing the preprocessing period for a long enough time. This allows us to handle an unknown nonzero initial state x(0) and to remove the impact of the computation error ${e}_{k}({t}_{i})$ through the decay of the matrix exponential ${e}^{(A-LC){t}_{i}}$.
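The decay argument can be visualized numerically: if $A-LC$ is Hurwitz, $\Vert {e}^{(A-LC){t}_{i}}x(0)\Vert $ shrinks as the preprocessing time grows. A sketch with assumed placeholder matrices (L here is a hypothetical observer-style gain, not a value from the paper):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
C = np.array([[1.0, 0.0]])
L = np.array([[1.0], [1.0]])      # chosen so that A - L C is Hurwitz
x0 = np.array([1.0, -1.0])        # unknown nonzero initial state

# Norm of the error term e^{(A - LC) t} x(0) at increasing preprocessing times.
norms = [np.linalg.norm(expm((A - L @ C) * t) @ x0) for t in (0.0, 2.0, 8.0)]
print(norms)   # the error contribution shrinks as t grows
```

Running the preprocessing stage longer therefore trades wall-clock time for a smaller residual error, without any knowledge of x(0).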
Although the computation error e_{k}(t_{i}) caused by x(0) can be reduced in this way, the design of the basis function approximation still requires the dimension of the system state x(t). Fortunately, one may deduce the state dimension either from the model structure of the physical system or from system data. From the physics perspective, applying physical laws may provide a reasonable estimate of the dimension of the controlled system; for example, the relationship between the applied force and the resulting mass displacement in a mass-spring-damper system is clear once the physical laws are applied. From the data perspective, the dimension of the state vector we are seeking relates the collected input data to the output data. Take subspace analysis in biological learning [30] as an example: the dimension of the motor-learning system state equals the number of significant singular values of a matrix built from input-output data. Commercial software, such as the System Identification Toolbox built into Matlab, also offers graphical user interfaces to assist with model order estimation.
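The data-driven route to the state dimension can be sketched as follows: for a hypothetical minimal second-order system, the number of significant singular values of a Hankel matrix built from its impulse-response (Markov) parameters recovers the state dimension, in the spirit of the subspace analysis cited above:

```python
import numpy as np

# A hypothetical minimal second-order system standing in for the plant.
Ad = np.array([[0.9, 0.2], [0.0, 0.8]])
Bd = np.array([[1.0], [1.0]])
Cd = np.array([[1.0, 0.0]])

# Markov (impulse-response) parameters h_k = Cd Ad^k Bd,
# the kind of quantity obtainable from input-output data.
h = [(Cd @ np.linalg.matrix_power(Ad, k) @ Bd).item() for k in range(10)]

# The numerical rank of the Hankel matrix of Markov parameters
# equals the minimal state dimension of the underlying system.
H = np.array([[h[i + j] for j in range(5)] for i in range(5)])
sv = np.linalg.svd(H, compute_uv=False)
order = int(np.sum(sv > 1e-8 * sv[0]))
print(order)
```

With noisy data, the hard threshold above would be replaced by looking for a clear gap in the singular-value spectrum.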
The initial stabilizing gain ${\overline{K}}^{0}$ is required in this work, as in all policy iteration-based RL methods, including [23-28]. Finding the stabilizing gain ${\overline{K}}^{0}$ is a verifiable procedure, since the gain can be fed back to the controlled system as $u(t)={\overline{K}}^{0}\Phi (u(t),y(t))$ and tested. The time spent finding such a gain ${\overline{K}}^{0}$ can also serve to decay $e^{(A-LC)t_i}x(0)$ in the computation error e_{k}(t_{i}).
Robustness to noise is a vital issue for any algorithm that extracts the decision law. The robustness of the control policy obtained with the proposed framework is analyzed in Supplementary information, Section 2D. Depending on the noise, it may be necessary to filter the system output y(t) and action u(t) before sampling; to avoid aliasing, frequency components above the Nyquist frequency, which equals half the sampling rate, should be removed. To counteract noisy signals y(t) and u(t), one feasible solution is to learn a decision law directly for an augmented system that combines the original control system with the extra filter dynamics. For example, as illustrated in the engineering control system of ref. [40], the presented framework works for the augmented system with a filter on the system output.
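As a simple illustration of pre-sampling filtering, the sketch below applies a first-order low-pass filter to a noisy output signal; the filter constant and noise level are illustrative only:

```python
import numpy as np

def lowpass(x, alpha=0.1):
    # First-order low-pass filter: y[k] = (1 - alpha) y[k-1] + alpha x[k].
    y = np.empty_like(x)
    y[0] = x[0]
    for k in range(1, len(x)):
        y[k] = (1 - alpha) * y[k - 1] + alpha * x[k]
    return y

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 1000)
clean = np.sin(t)                                    # slow "system output"
noisy = clean + 0.5 * rng.standard_normal(len(t))    # broadband measurement noise
filtered = lowpass(noisy)
# Mean-squared error versus the clean signal, before and after filtering.
print(np.mean((noisy - clean) ** 2), np.mean((filtered - clean) ** 2))
```

The filter trades a small phase lag for a large reduction of the high-frequency noise power, which is exactly the trade handled rigorously by augmenting the system with the filter dynamics.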
RESULTS
In what follows, we use the algorithm in Figure 2 to search for the optimal decision law based on data captured from dynamical power systems. In power systems, it is important to maintain feedback control regulators of a prescribed structure for stability while also achieving the desired performance. The algorithm in Figure 2 collects the discrete-time data samples from the power system, defines the discrete-time reward, and learns the prescribed control policy. In the learning process, no additional prior knowledge of the power system's model is used to seek the optimal decision law.
The design task is now specified as designing an output-feedback continuous-time control policy via discrete-time rewards to solve the load-frequency regulation problem of power systems [23, 41]. The power system dynamics considered consist of governor, turbine, and generator models, whose outputs are the governor position change, generator output change, and frequency change. An additional model integrates the frequency change to supply the governor model. Linearization is used to obtain the power system dynamics around the normal operating condition. Only the frequency change in the power system, denoted y(t)=Δf(t), is available for measurement, rather than the whole power system state. The system dynamics are shown in Figure 3A, wherein the integral action of the frequency deviation is the control policy u(t) to be designed; the state of the load-frequency regulation system is stacked as $x(t)=[{x}_{1}(t),{x}_{2}(t),{x}_{3}(t),{x}_{4}(t)]^{\text{T}}$, with x_{1}(t)=Δf(t) the incremental frequency change (Hz), ${x}_{2}(t)=\Delta {P}_{g}(t)$ the incremental generator output change (p.u. MW), ${x}_{3}(t)=\Delta {X}_{g}(t)$ the incremental governor position change (p.u. MW), and ${x}_{4}(t)=\Delta E(t)$ the incremental change in voltage angle (rad). The physical parameters for the system dynamics in Figure 3A can be found in refs. [23, 41]. The desired control policy minimizes the following utility over an infinite horizon: $J(Q=1,R=1,u(t),y(t))=\underset{u(t)}{\mathrm{min}}{\int}_{0}^{\infty}[{y}^{\text{T}}(t)y(t)+{u}^{\text{T}}(t)u(t)]\,\text{d}t$ (14) using the input-output data from the power system, together with the discrete-time reward, as shown in Figure 3B. In eq. (14), the choice Q=1, R=1 is for illustration only; the proposed method also applies under other feasible choices of Q and R.
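For a stabilizing linear policy, an infinite-horizon quadratic utility of the form of eq. (14) can be evaluated in closed form through a Lyapunov equation; the sketch below does this for an illustrative second-order system, not the power-system model of Figure 3A:

```python
import numpy as np

def quadratic_cost(A, B, C, K, x0):
    # Infinite-horizon cost J = ∫ (y'y + u'u) dt under u = -K x:
    # J = x0' P x0, where P solves the closed-loop Lyapunov equation
    # (A - BK)' P + P (A - BK) + C'C + K'K = 0.
    Ac = A - B @ K
    Qbar = C.T @ C + K.T @ K
    n = A.shape[0]
    M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
    P = np.linalg.solve(M, -Qbar.reshape(-1, order='F')).reshape(n, n, order='F')
    return float(x0 @ P @ x0)

# Illustrative plant and stabilizing gain (hypothetical values).
A = np.array([[0., 1.], [0., -1.]])
B = np.array([[0.], [1.]])
C = np.array([[1., 0.]])
x0 = np.array([1., 0.])
J = quadratic_cost(A, B, C, np.array([[1., 1.]]), x0)
print(J)
```

Comparing such costs across candidate gains is one way to check that a learned policy indeed improves the utility in eq. (14).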
Figure 3 System modeling for an electric power system and the discrete-time data for learning the control policy. (A) Load frequency control of an electric power system is modeled by its nominal continuous-time system around an operating point specified by a constant load value. Only the partial system state Δf(t) is available for measurement, requiring the policy design to follow the output-feedback principle. The state-space equation of the power system is highlighted in the figure, but the system dynamics are unknown to the control policy designer. (B) The continuous-time control policy is inferred from the discrete-time reward. The time for the power system's evolution is sampled nonuniformly. (C) Sampled input-output data are collected for generating the discrete-time reward.
It can be checked that the considered power system is stabilizable and detectable, which guarantees that the decision law to be learned exists and is unique [23]. The unique decision law is also solved offline, for comparison purposes only. A stabilizing feedback gain ${\overline{K}}^{0}$ is required by our framework and by the existing policy iteration-based methods [23-28]. In power system control design, such a stabilizing gain ${\overline{K}}^{0}$ can be obtained by trial and error, since a candidate gain can be fed back into the system for testing, while the optimal policy associated with Q and R is the target to be learned from the data. Alternative methods for obtaining the stabilizing gain ${\overline{K}}^{0}$ can be found in the literature, e.g., refs. [23, 24, 42, 43].
Now we employ the discrete-time reward to infer a continuous-time control policy of the form of eq. (5) that meets the performance criterion of eq. (14). The histogram of the nonuniform sampling times is given in Figure 3B, which indicates that only the data sampled at those time instants are collected for learning. Such nonuniform sampling makes it difficult to directly convert a discrete-time transfer function into a continuous-time one [44]; thus, searching for the optimal decision law via discretization techniques may not be feasible.
In Figure 3C, the input, output, and reward data are presented against the nonuniform sampling times. Their trajectories are divided into two regions to distinguish between feedforward and feedback learning, which correspond, respectively, to the learning of $\Phi (u(t),y(t))$ and of ${K}_{\text{o}}^{*}$ as formulated in eq. (5).
For the feedforward-learning period, the priority is to reduce the computation error e_{k}(t_{i}). To this end, the input and output data over the first 13 s in Figure 3C are used to learn the feedforward signal $\Phi (u(t),y(t))$. A long feedforward-learning period is preferred over a short one, since a long period makes the computation error e_{k}(t_{i}) small as it vanishes exponentially. During this period, the discrete-time reward need not be supplied, because rewards are used only for feedback-gain learning in the presented framework. For the feedforward learning in Figure 3C, the reward is therefore set to 0.01 so that a logarithmic scale can be used on the y-axis.
During the period of feedback learning (indicated by the orange-red region in Figure 3C), the key task is to collect the input-output data for calculating rewards and for iteratively learning the feedback gain ${K}_{\text{o}}^{*}$. Note that the reward is calculated only after the feedforward learning is accomplished. With the feedforward signal $\Phi (u(t),y(t))$ and the stabilizing gain matrix ${\overline{K}}^{0}$ ready, one constructs an iterative decision law u_{k}(t), where k denotes the iteration step in the policy iteration. The discrete-time reward is then calculated as in eq. (11) with the system output data y(t) and the iterative decision law u_{k}(t). At the end of the data collection, namely at the 20th second, the iterative gain ${\overline{K}}^{k}$ is computed by eq. (12) for k=1, 2, …. The stopping criterion is that the norm of the error between two successive gains ${\overline{K}}^{k}$ and ${\overline{K}}^{k+1}$ is less than 10^{-5}. At the 8th iteration step, this yields the feedback policy ${\overline{K}}^{8}=\left[\begin{array}{cccc}5.0394\cdot {10}^{-8}& 136.0782& 43.6079& 3.9698\\ 63.4867& 57.9988& 11.3416& 0.8173\end{array}\right]$, which converges to the optimal feedback gain ${\overline{K}}^{*}$ computed offline under the assumption that the state-space equation of the power system is known (see Figure 4D). Therefore, the desired feedback gain is well identified by the presented framework. The identified gain ${\overline{K}}^{8}$, together with the feedforward signal $\Phi (u(t),y(t))$, yields the control policy that satisfies the prescribed performance of eq. (14).
Figure 4 Learning from discrete-time rewards. (A, B) Rewards associated with the output y(t) and the action u_{k}(t) at the kth iteration step. The data represent the rewards at the 1st and 8th iteration steps, corresponding to the learning results of the first and final trials. (C) Discrete-time rewards for the different iteration steps; as the iteration step grows, the value of the discrete-time reward decreases. (D) Convergence of the learned control policy. The convergence error decreases as the iteration step grows, so accuracy can be increased simply by running more iterations. The ratio between the gain matrices ${\overline{K}}^{k}$ and ${\overline{P}}^{k}$ reveals the different learning capabilities in approximating the gain matrices from data.
To illustrate the learning process, we present the discrete-time reward and the learned control-policy gain at each iteration in Figure 4. The rewards associated with the action and output at the eighth trial are given in Figure 4A. The rewards used in the first and eighth trials are presented in Figure 4B, where the order of magnitude of the peak reward is reduced from 10^{2} in the first trial to 10^{0} in the eighth. This shows that the control effort, as well as the discrete-time reward, is reduced after learning; such a reduction corresponds to the design goal of minimizing the infinite-horizon utility in eq. (3). The discrete-time rewards for all eight iterations are given in Figure 4C, where the results within the time window from 18.33 to 20 s are displayed. They further indicate that the rewards tend to decrease as the iteration steps proceed. Figure 4D shows that the feedback gain ${\overline{K}}^{k}$ converges to the predetermined gain as the iteration step increases. Let ${\overline{P}}^{k}$ be the unspecified matrix in eq. (10), which corresponds to eq. (26) of Supplementary information, Section 2B. From Figure 4D, the learning ratio between ${\overline{K}}^{k}$ and ${\overline{P}}^{k}$ varies during the policy iteration, implying that the components in eq. (10) differ from each other in learning accuracy. Note that the feedback gain ${\overline{K}}^{k}$ is computationally unique at each iteration (Supplementary information, Section 2C), and the uniqueness of ${\overline{K}}^{k}$ is equivalent to the uniqueness of the learned control policy. The power system design makes clear that our framework has no intermediate stage of identifying dynamical models but directly learns the optimal control policy from data. The presented framework thus provides a data-driven control policy design into which the prescribed system performance requirement is incorporated.
We next compare this work with the existing IRL works based on continuous-time rewards, such as ref. [28]. For the comparison, we use the same parameters and conditions as in Figure 3A. Owing to the different principles of storage and computation, this work requires only about 0.8 s of CPU time to run the iterations on the data collected over the time period [13, 20] s, whereas the IRL-based work requires about 7 s under a fixed sampling interval of 0.01 s. This high efficiency is ensured because our approach removes the computation of integrals required by the existing continuous-time-reward methods. Moreover, as shown in Figure 5, the proposed algorithm's accuracy differs from that of IRL [28] along the iteration step k, even under the same initial conditions, including the same fixed sampling interval (0.01 s). The orders of the iterative errors of policy learning $\parallel {\overline{K}}^{k}-{\overline{K}}^{*}\parallel$ and value learning $\parallel {\overline{P}}^{k}-{\overline{P}}^{*}\parallel$ are reduced to about 10^{-4} and 10^{-6}, respectively, at the eighth trial, whereas they are only about 10^{0} and 10^{-3} with IRL [28]. With more data samples, the traditional design may also achieve high learning accuracy. Therefore, at a given number of samples, our design leveraging the discrete-time reward holds a performance advantage over the continuous-time reward.
Figure 5 Policy and value learning results by the proposed method and by IRL. The solid lines denote the iteration results of IRL [28], and the dashed lines those of the proposed method. Both methods start from the same initial policy and value conditions, while the convergent norms of the policy and value learning errors show that the proposed method achieves higher accuracy than IRL [28].
We further apply the proposed method to a power-grid network with 341 generators, each of which has the following electromechanical dynamics [45]: ${\dot{\delta}}_{i}={\omega}_{i}-{\omega}_{s},\ {\dot{\omega}}_{i}=\frac{1}{{H}_{i}}({D}_{i}({\omega}_{s}-{\omega}_{i})+{P}_{mi}-{P}_{ei})$ (15), where i=1, 2, …, 341; δ_{i} and ω_{i} are, respectively, the ith generator's rotor angle and frequency; ω_{s} denotes the nominal frequency; H_{i} and D_{i} are the inertia and damping constants; and P_{ei} and P_{mi}, respectively, denote the electrical and mechanical power. The electrical power satisfies ${P}_{ei}={E}_{i}{\sum}_{j=1}^{341}{E}_{j}[\text{Re}({\mathcal{A}}_{i,j})\mathrm{cos}({\delta}_{ij})+\text{Im}({\mathcal{A}}_{i,j})\mathrm{sin}({\delta}_{ij})]$, where E is the column vector stacking the internal voltages E_{i}; δ_{ij} denotes the relative angle between generators i and j; $\mathcal{A}=[{\mathcal{A}}_{i,j}]$ is the effective admittance matrix of the grid network, representing the coupling among generators; and the operators Im(·) and Re(·) take, respectively, the imaginary and real parts of a complex number.
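To give a feel for the dynamics in eq. (15), the following sketch integrates a toy two-machine version of the swing equations with forward Euler; all parameter values are illustrative and unrelated to the 341-machine network:

```python
import numpy as np

ws = 2 * np.pi * 60          # nominal frequency (rad/s)
H = np.array([5.0, 4.0])     # inertia constants (illustrative)
D = np.array([1.0, 1.2])     # damping constants (illustrative)
Pm = np.array([0.1, -0.1])   # mechanical power injections (p.u.)
b12 = 1.5                    # coupling susceptance between the two machines

delta = np.array([0.0, 0.05])   # rotor angles, perturbed from equilibrium
omega = np.array([ws, ws])      # frequencies start at nominal
dt = 1e-3
for _ in range(20000):          # 20 s of forward-Euler integration
    # Electrical power exchanged over the single line (2-machine case
    # of the P_ei expression with a purely imaginary admittance).
    p = b12 * np.sin(delta[0] - delta[1])
    Pe = np.array([p, -p])
    ddelta = omega - ws                          # eq. (15), angle dynamics
    domega = (D * (ws - omega) + Pm - Pe) / H    # eq. (15), frequency dynamics
    delta = delta + dt * ddelta
    omega = omega + dt * domega
print(abs(omega - ws).max(), delta[0] - delta[1])
```

With damping present, the frequencies settle back toward ω_s and the relative angle approaches the value balancing the mechanical input, which is the open-loop behavior the learned governor policy then shapes.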
For each local generator in the power grid, its own frequency and the frequencies of the other generators (ω_{i} for i=1, 2, …, 341) are not measurable, which makes the policy design an output-feedback regulation problem. The equilibrium, at which generation matches consumption, follows from the power-flow equation [46]; its numerical solution, such as the generator angle δ_{s} at the equilibrium, can be computed offline, for example with the MATPOWER toolbox [47]. The parameters of the 341-machine power grid are given in ref. [36]. Besides the frequency of the ith generator, the inertia and damping constants D_{i} and H_{i} are also unknown to the proposed method.
A governor is installed on each generator, and the proposed learning method is applied to extract the optimal policy associated with the weighting matrices Q_{i}=1 and R_{i}=1. The angle of each generator is measurable, and its trajectory is shown in Figure 6A, while the frequencies are not measurable for feedback; their trajectories are given in Figure 6B. Both Figure 6A and B validate the effectiveness of regulating the power system state to the equilibrium. The norm errors of policy learning and value learning are shown in Figure 6C, where the order of the error norms is driven to around 10^{-10}, demonstrating the learning convergence for the power-grid network.
Figure 6 Power-grid network regulation. (A, B) From t=0 s to t=1.5 s, all 341 generators in the network operate at the equilibrium with Δδ_{i}=δ_{i}-δ_{s} and Δω_{i} being zeros. At t=1.5 s, exploration noise is added to the multi-machine power network, and the data over the time interval [10, 15] s are collected for policy and value learning. The learned output-feedback policy is installed on each generator governor over [15, 20] s to reach the equilibrium of the power-grid network. (C) The convergence of the policy and value learning is shown for each local generator.
CONCLUSION
We have demonstrated a learning mechanism for extracting the continuous-time optimal decision law from the discrete-time reward through RL. Compared with the integral reward, the discrete-time reward in eq. (2) is computationally efficient, as its discrete-time form is a slice of the ultimate infinite-horizon reward in eq. (3). We have used the discrete-time reward to build a new RL-based framework that guides the search for the decision law with the desired system performance. The search is accomplished using data collected directly from the real-time trajectories of the dynamical systems. Our framework extracts the decision law without an intermediate stage of identifying dynamical models, which is required in system identification-based control policy design. The analytical RL framework that we propose is interpretable and provable. This framework may help to better reveal the physics underlying observed phenomena and to make a system behave in a desired manner.
In summary, we have shown how a discrete-time-reward-based technique can search for the optimal decision law of a dynamical system from data, without prior knowledge of the system's exact model. We have proposed the idea of feeding the state derivative back into the learning process, which distinguishes our result from previous work in the field. Exploiting the state derivative further allows us to establish an analytical RL framework using discrete-time rewards. To achieve this, we divided the search procedure into two stages: one learns a feedforward signal, and the other learns the feedback gain. The combination of the feedforward signal and the feedback gain leads to the discovery of the desired control policy directly from the data. We have demonstrated this method on design problems in power systems. Within the presented framework, we equivalently achieve the linear quadratic control design using output-feedback control based on the action data and output data. Our framework provides a design tool to understand and transform a dynamical system, with potential applications in fields such as complex networks.
Funding
This work was supported by the Guangdong Basic and Applied Basic Research Foundation (2024A1515011936) and the National Natural Science Foundation of China (62320106008).
Author contributions
C.C., L.X., K.X., F.L.L., Y.L. and S.X. designed the research; C.C., L.X. and S.X. performed the research; C.C., L.X. and S.X. contributed new reagents/analytic tools; C.C., L.X., F.L.L. and S.X. analyzed the data; C.C. and L.X. wrote the supporting information; and C.C., L.X., K.X., F.L.L., Y.L. and S.X. wrote the paper.
Conflict of interest
The authors declare no conflict of interest.
Supplementary information
Supplementary file provided by the authors.
The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.
References
Mendel JM, McLaren RW. Reinforcement-learning control and pattern recognition systems. In: Mendel JM, Fu KS (eds). Adaptive, Learning and Pattern Recognition Systems: Theory and Applications. New York: Academic Press, 1970, 287-318.
Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science 2015; 349: 255-260.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521: 436-444.
Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 2018.
Botvinick M, Ritter S, Wang JX, et al. Reinforcement learning, fast and slow. Trends Cogn Sci 2019; 23: 408-422.
Silver D, Huang A, Maddison CJ, et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016; 529: 484-489.
Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge. Nature 2017; 550: 354-359.
Schmidt M, Lipson H. Distilling free-form natural laws from experimental data. Science 2009; 324: 81-85.
Batchelor GK. An Introduction to Fluid Dynamics. Cambridge: Cambridge University Press, 2000.
Ljung L. System Identification: Theory for the User. Upper Saddle River: Prentice-Hall, 1999.
Bongard J, Lipson H. Automated reverse engineering of nonlinear dynamical systems. Proc Natl Acad Sci USA 2007; 104: 9943-9948.
Brunton SL, Proctor JL, Kutz JN. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc Natl Acad Sci USA 2016; 113: 3932-3937.
Sugihara G, May R, Ye H, et al. Detecting causality in complex ecosystems. Science 2012; 338: 496-500.
Ye H, Beamish RJ, Glaser SM, et al. Equation-free mechanistic ecosystem forecasting using empirical dynamic modeling. Proc Natl Acad Sci USA 2015; 112: E1569-E1576.
Daniels BC, Nemenman I. Automated adaptive inference of phenomenological dynamical models. Nat Commun 2015; 6: 8133.
Baird LC. Reinforcement learning in continuous time: Advantage updating. In: Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94). Orlando: IEEE, 1994, 2448-2453.
Doya K. Reinforcement learning in continuous time and space. Neural Comput 2000; 12: 219-245.
Hanselmann T, Noakes L, Zaknich A. Continuous-time adaptive critics. IEEE Trans Neural Netw 2007; 18: 631-647.
Murray J, Cox C, Saeks R, et al. Globally convergent approximate dynamic programming applied to an autolander. In: Proceedings of the 2001 American Control Conference. Arlington: IEEE, 2001, 2901-2906.
Mehta P, Meyn S. Q-learning and Pontryagin's minimum principle. In: Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 2009 28th Chinese Control Conference. Shanghai: IEEE, 2009, 3598-3605.
Vrabie D, Pastravanu O, Abu-Khalaf M, et al. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009; 45: 477-484.
Lewis FL, Vrabie D, Vamvoudakis KG. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Contr Syst Mag 2012; 32: 76-105.
Lewis FL, Vrabie D, Syrmos VL. Optimal Control. Hoboken: John Wiley & Sons, 2012.
Jiang Y, Jiang ZP. Robust Adaptive Dynamic Programming. Hoboken: John Wiley & Sons, 2017.
Kamalapurkar R, Walters P, Rosenfeld J, et al. Reinforcement Learning for Optimal Feedback Control: A Lyapunov-Based Approach. Cham: Springer, 2018.
Chen C, Modares H, Xie K, et al. Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Trans Automat Contr 2019; 64: 4423-4438.
Chen C, Lewis FL, Xie K, et al. Off-policy learning for adaptive optimal output synchronization of heterogeneous multi-agent systems. Automatica 2020; 119: 109081.
Chen C, Xie L, Xie K, et al. Adaptive optimal output tracking of continuous-time systems via output-feedback-based reinforcement learning. Automatica 2022; 146: 110581.
Todorov E, Jordan MI. Optimal feedback control as a theory of motor coordination. Nat Neurosci 2002; 5: 1226-1235.
Shadmehr R, Mussa-Ivaldi S. Biological Learning and Control: How the Brain Builds Representations, Predicts Events, and Makes Decisions. Cambridge: MIT Press, 2012.
Liu YY, Slotine JJ, Barabási AL. Controllability of complex networks. Nature 2011; 473: 167-173.
Ruths J, Ruths D. Control profiles of complex networks. Science 2014; 343: 1373-1376.
Li A, Cornelius SP, Liu YY, et al. The fundamental advantages of temporal networks. Science 2017; 358: 1042-1046.
Zañudo JGT, Yang G, Albert R. Structure-based control of complex networks with nonlinear dynamics. Proc Natl Acad Sci USA 2017; 114: 7234-7239.
Yan G, Tsekenis G, Barzel B, et al. Spectrum of controlling and observing complex networks. Nat Phys 2015; 11: 779-786.
Duan C, Nishikawa T, Motter AE. Prevalence and scalable control of localized networks. Proc Natl Acad Sci USA 2022; 119: e2122566119.
Tao G. Adaptive Control Design and Analysis. Hoboken: John Wiley & Sons, 2003.
Åström KJ, Wittenmark B. Adaptive Control. North Chelmsford: Courier Corporation, 2013.
Ioannou PA, Sun J. Robust Adaptive Control. New York: Courier Corporation, 2012.
Stevens BL, Lewis FL, Johnson EN. Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems. Hoboken: John Wiley & Sons, 2015.
Wang Y, Zhou R, Wen C. Robust load-frequency controller design for power systems. In: Proceedings of the First IEEE Conference on Control Applications. Dayton: IEEE, 1993.
Chen C, Lewis FL, Li B. Homotopic policy iteration-based learning design for unknown linear continuous-time systems. Automatica 2022; 138: 110153.
Franklin DW, Burdet E, Peng Tee K, et al. CNS learns stable, accurate, and efficient movements using a simple algorithm. J Neurosci 2008; 28: 11165-11173.
Marvasti F. Nonuniform Sampling: Theory and Practice. New York: Springer, 2012.
Machowski J, Lubosny Z, Bialek JW, et al. Power System Dynamics: Stability and Control. Hoboken: John Wiley & Sons, 2020.
Kundur PS, Malik OP. Power System Stability and Control. New York: McGraw-Hill Education, 2022.
Zimmerman RD, Murillo-Sanchez CE, Thomas RJ. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Trans Power Syst 2010; 26: 12-19.
All Figures
Figure 1 Schematic framework of the reinforcement learning algorithm using policy iteration for continuous-time dynamical systems. (A) At each time t=t_{i}, for i=1, 2, …, one observes the current output y(t) and action u(t). The sampled input-output data are collected along the trajectory of the dynamical system in real time and are stacked over the time interval [t_{1}, t_{s}] as the discrete-time input-output data U and Y. (B) The input-output data U and Y, associated with the prescribed optimization criterion, are used to update the value estimate in the critic module, based on which the control policy in the actor module is updated. The ultimate goal of this framework is to use the input-output data U and Y to learn the optimal decision law that minimizes the user-defined optimization criterion J(Q, R, u(t), y(t)).

Figure 2 Computational approach for deriving optimal decision laws from data. (A) Preprocess the actions and outputs of the dynamical system and construct the feedforward signals that will be used for the feedback-gain learning and the design of an online real-time control loop (Supplementary information, Section 2A). (B) Measure the input-output data, as well as the feedforward signals, over discrete-time series, based on which the discrete-time data samples are assembled using the tensor product (Supplementary information, Section 2B). (C) This part is central to learning the feedback gain ${K}_{\text{o}}^{*}$ from discrete-time data. First, calculate the Bellman equation for optimality via policy iterations. Then, through policy evaluation and improvement, the optimal feedback gain is obtained from the discrete-time data samples with rigorous mathematical operations and convergence deduction (Supplementary information, Section 2C). Finally, both the feedforward signal in (A) and the feedback gain ${K}_{\text{o}}^{*}$ contribute to the optimal decision law in eq. (5).
