PLOS ONE. 2018 Jul 19;13(7):e0200455. doi: 10.1371/journal.pone.0200455

Multiqubit and multilevel quantum reinforcement learning with quantum technologies

F. A. Cárdenas-López, L. Lamata, J. C. Retamal, E. Solano
Editor: Zoltan Zimboras
PMCID: PMC6053154  PMID: 30024914

Abstract

We propose a protocol to perform quantum reinforcement learning with quantum technologies. At variance with recent results on quantum reinforcement learning with superconducting circuits, in our current protocol coherent feedback during the learning process is not required, enabling its implementation in a wide variety of quantum systems. We consider diverse possible scenarios for an agent, an environment, and a register that connects them, involving multiqubit and multilevel systems, as well as open-system dynamics. We finally propose possible implementations of this protocol in trapped ions and superconducting circuits. The field of quantum reinforcement learning with quantum technologies will enable enhanced quantum control, as well as more efficient machine learning calculations.

Introduction

Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that has attracted increasing attention in recent years. ML usually refers to a computer program which can learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E [1]. In other words, Machine Learning addresses the problem of how a computer algorithm can be constructed to automatically improve with experience. Several applications in this field have been implemented, such as handwriting pattern recognition [2], speech recognition [3], and the development of a computer able to beat an expert Go player [4], just to name a few.

The learning process in ML can be divided into three types: supervised learning, unsupervised learning, and reinforcement learning [5]. In supervised machine learning, an initial data set is used to train the system for later prediction or classification of data. Usually, supervised learning problems are categorized into regression (continuous output) or classification (discrete output). Unsupervised learning addresses problems where labeled training data is not needed and only correlations between subsets of the data (clustering) are considered and analyzed. Finally, reinforcement learning [6] differs from supervised and unsupervised learning in that it takes into account a scalar parameter (the reward) to evaluate the input-output relation in a trial-and-error way. In this case, the system (the so-called “agent”) obtains information from its outer world (the “environment”) to decide how best to optimize itself and adapt to the environment.

Quantum information processing (QIP) could contribute positively to the future development of the machine learning field, with several quantum algorithms for machine learning offering significant possible gains with respect to their classical counterparts [7–11]. More specifically, quantum algorithms have been developed, and in some cases implemented, for supervised and unsupervised learning problems [12–18]. However, quantum reinforcement learning has not been widely explored and just a few results have been obtained up to now [19–26]. Related topics in biomimetic quantum technologies are quantum memristors [27–30], as well as quantum Helmholtz and Boltzmann machines [31–33]. These, together with quantum reinforcement learning, may set the stage for the future development of semi-autonomous quantum devices.

The field of quantum technologies has grown extensively in the past decade. In particular, two architectures which are very promising for the implementation of a quantum computer, in terms of number of qubits and gate fidelities, are trapped ions [34, 35] and superconducting circuits [36–38]. Current technological progress in trapped ions has allowed the implementation of quantum protocols with several ions, involving high-fidelity single- and two-qubit gates as well as high-fidelity readout [39, 40]. Superconducting circuits have also proven to be an excellent platform to perform quantum information processing protocols because of their individual addressing and scalability. Two-qubit quantum gates have achieved fidelities larger than 99% [41, 42] in this platform. Furthermore, technological progress in this architecture has made it possible to build artificial atoms with long coherence times in coplanar [43] and 3D architectures [44], allowing for the development of feedback control with superconducting circuits [45, 46]. This feedback mechanism has inspired protocols for quantum reinforcement learning with superconducting circuits [23], where the feedback-loop control allows one to reward and restart the system to obtain maximal learning fidelity.

Here, we propose a general protocol to perform quantum reinforcement learning with quantum technologies. By general we mean that it goes beyond the context of qubits for embedding information in the agent or the environment. In this sense, and at variance with a previous result [23], we extend the realm of the quantum reinforcement learning protocol to multiqubit, multilevel, and open quantum systems, therefore permitting a wider set of scenarios. Our protocol considers a quantum system (the agent), which interacts with an external quantum system (its environment) via an auxiliary quantum system (a register). The aim of our quantum reinforcement learning protocol is for the agent to acquire information from its environment and adapt to it, via a rewarding mechanism. In this fully quantum scenario, the meaning of the learning process is the establishment of quantum correlations among the parties [21]. In our specific case, the quantum agent aims at attaining maximum quantum state overlap with the environment state, in the sense that local measurements on agent and environment will produce the same outcomes or, equivalently, that the final entangled agent-environment state is invariant under the exchange of these two subsystems. An interpretation of this outcome is that the agent learns the information embedded in the environment state, which is consequently modified from a separable state to one entangled with the agent and registers. After this process we are in a position to evaluate any figure of merit with the measurement outcomes. Optimizing this figure of merit should be associated with a particular learning process, possibly requiring particular actions to be applied on the agent. Another possible result is obtained by considering projective measurements on the register systems. Only after these projective measurements are agent and environment decoupled from the registers, and the protocol assures that the former are in a pure correlated state, without needing to know any information about their initial states.

We analyze the case where the register subspace is larger than the agent and environment subspaces. The inclusion of more elements in the register subspace allows for delaying the application of the rewarding criterion to the end of the quantum protocol. This fact will enable its implementation in a wider variety of quantum platforms, besides superconducting circuits with coherent feedback. We also study quantum reinforcement learning in the case where agent, environment, and register are composed of qudits. In this case, we obtain that the maximal learning fidelity is achieved in a fixed number of steps, independently of the qudit dimension, and that this number scales polynomially with the number of subsystems in the environment subspace. In addition, we analyze quantum reinforcement learning in the situation where the environment is larger than the agent. We highlight two results: the first is obtained when the register has the same number of elements as the environment. In this case, two rewarding criteria are needed to obtain maximal learning fidelity, and the entanglement between the agent and a specific part of the environment is a key resource. The other case is the situation where the register has more elements than the environment. In this case, only one measurement is needed to obtain maximal learning fidelity, and the environment-agent entanglement is not a key resource. Based on this fact, the rewarding criterion is applied at the end of the protocol. Finally, we describe how our quantum learning protocols can be implemented in quantum platforms such as trapped ions and superconducting circuits.

Quantum reinforcement learning protocol with final measurement

Here, we introduce a protocol to perform quantum reinforcement learning, which introduces significant novelties with respect to the existing literature. Unlike a previous quantum reinforcement learning result [23], the protocol described here needs one measurement at the end of the procedure and no feedback, allowing for its implementation in a variety of quantum platforms including ions and photons. The improvement relies on adding more registers than before [23] and making them interact conditionally with each other. The inclusion of ancillary systems has proven to be useful in several implementations of quantum information, because measurements on the ancillary system allow one in principle to obtain information about the main system without destroying it. Moreover, the measurement associated with the rewarding criterion is performed at the end of the protocol. This opens the possibility to implement quantum reinforcement learning protocols in architectures for which implementing coherent feedback may be a challenging problem.

The quantum reinforcement learning protocol described here works in the following way. We first consider an agent and an environment, composed of one qubit each, and two register qubits, see Fig 1. The first step is to encode the environment information in the register states (in the context of classical reinforcement learning this kind of operation is usually called the action). Subsequently, the internal states of the registers interact conditionally with the agent (in classical reinforcement learning this kind of operation is usually called the percept). Finally, an agent-register interaction changes the agent state (partial rewarding mechanism). At this stage the rewarding criterion is satisfied, in the form of a correlated agent-environment state, in the sense that local measurements on agent and environment will produce the same outcomes. On the other hand, the agent-environment system is also entangled with the two registers, and in order to attain a correlated pure state of agent and environment, a single, final measurement may be performed on the two register states. This will produce an agent-environment state maximizing the learning fidelity, defined as $F_{AE}=|\langle\psi_A|\phi_E\rangle|$, where $|\psi_A\rangle$ is the agent state and $|\phi_E\rangle$ is the environment state, both after the protocol.

Fig 1. Proposed protocol to perform quantum reinforcement learning with final measurement.


We consider a set composed of four qubits, corresponding to agent A, environment E, and registers R1 and R2. The considered interactions agent-register, register-register and environment-register consist of CNOT gates. The measurement in the register subspace is denoted by the rightmost box.

To perform our quantum reinforcement learning protocol we consider that initially agent and environment are in arbitrary single-qubit pure states, whereas the register states are in their ground state, namely

\[
|A\rangle=\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A,\qquad
|E\rangle=\alpha_{E0}|0\rangle_E+\alpha_{E1}|1\rangle_E,\qquad
|R\rangle=|0\rangle_1|0\rangle_2, \qquad (1)
\]
\[
|\Psi_0\rangle=|A\rangle\otimes|E\rangle\otimes|R\rangle. \qquad (2)
\]

The first step in the protocol is to extract information from the environment, updating the information in the registers conditionally to the environment state. This process is done by applying a pair of CNOT gates in the environment-register subspace. Here, the first system is the control and the second the target,

\[
|\Psi_1\rangle=U^{\rm CNOT}_{(E,R_2)}U^{\rm CNOT}_{(E,R_1)}|\Psi_0\rangle, \qquad (3)
\]
\[
|\Psi_1\rangle=\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A\right)\left(\alpha_{E0}|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{E1}|1\rangle_E|1\rangle_1|1\rangle_2\right). \qquad (4)
\]

Then, the information encoded in the registers is updated conditionally on the agent state. As the register subspace is larger than the agent subspace, we choose which part of the register subspace the agent will update. Without loss of generality, let us assume that the register R1 is updated. This update is performed by a CNOT gate acting on the AR1 subspace, where the agent state is the control and the register is the target,

\[
\begin{aligned}
|\Psi_2\rangle=&\;U^{\rm CNOT}_{(A,R_1)}|\Psi_1\rangle,\\
|\Psi_2\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|0\rangle_A|1\rangle_E|1\rangle_1|1\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (5)
\]

Subsequently, the register R2 is also updated with respect to the R1 state. This is accomplished by applying a CNOT gate in the register subspace, where R1 acts as control and R2 as target,

\[
\begin{aligned}
|\Psi_3\rangle=&\;U^{\rm CNOT}_{(R_1,R_2)}|\Psi_2\rangle,\\
|\Psi_3\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|0\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (6)
\]

Next, we update the agent state according to the information encoded in the register R1. This is done by applying a CNOT gate in the R1A subspace, where R1 is the control and A is the target,

\[
\begin{aligned}
|\Psi_4\rangle=&\;U^{\rm CNOT}_{(R_1,A)}|\Psi_3\rangle,\\
|\Psi_4\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|1\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|0\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (7)
\]

We point out that, in the previous state, agent and environment are already maximally correlated, in the sense of having the same outcomes with respect to local measurements performed on either of them, or, equivalently, the state is invariant under particle exchange with respect to the agent-environment subsystem. We also remark that this state is general, valid for any initial agent and environment states. The fact that agent and environment get entangled with the two registers allows one to distinguish between identical agent-environment components that originate from different initial states, namely, to distinguish between states arising from $\alpha_{A0}\alpha_{E0}$ or $\alpha_{A1}\alpha_{E0}$, as well as from $\alpha_{A0}\alpha_{E1}$ or $\alpha_{A1}\alpha_{E1}$.

Finally, by performing a projective measurement on the register subspace, the rewarding criterion is satisfied. It is easy to show that, independently of the measurement outcome, the learning fidelity $F_{AE}=|\langle\psi_A|\phi_E\rangle|$ is maximal, given that agent and environment end up in the same state, either $|0\rangle$ or $|1\rangle$. In this case, only one iteration of the protocol is sufficient for the agent to adapt to the environment. Moreover, throughout the protocol, measurements on agent and/or environment are not required, which may allow its implementation in a variety of quantum platforms such as trapped ions, superconducting circuits, and quantum photonics.

In our protocol, we do not need coherent feedback, given that the registers entangle with agent and environment and, as a result, produce the desired agent-environment state that is invariant under permutation. It is true that the entanglement with the registers produces a mixed state in case the register states are discarded, but this is not a drawback in our protocol. Indeed, what our protocol does, for arbitrary initial agent and environment states, which need not be known, is to give a constructive way to produce a final agent-environment state that is perfectly correlated, in the sense of being invariant under permutations in the agent-environment subspace. This state is in general entangled, namely, quantum, and we do not need to perform any measurement on agent and environment during the protocol, so it can equally well work with photons, ions, and superconducting circuits, among others. After the production of the agent-environment-register entangled state, the registers are entangled with agent and environment, but this does not prevent us from measuring the registers at a certain desired time, thereby decoupling agent and environment from them. This way, we will not have measured agent and environment at any time of the protocol, and we can assure that they are perfectly correlated irrespective of their initial states, and without having any prior information about them. This may be useful, e.g., for distributing private keys in quantum cryptography for arbitrary, unknown initial states, without the need to initialize agent and environment in reference states.
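A minimal numerical sketch of the gate sequence of Fig 1 (Eqs (3)-(7)) may help to make the protocol concrete. The helper functions, qubit ordering, and use of plain NumPy state-vector simulation below are our own illustrative assumptions, not part of the original proposal; the check confirms that, for random initial agent and environment states, every register measurement outcome leaves agent and environment with identical local outcomes (maximal learning fidelity).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_qubit():
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)

def apply_cnot(state, control, target, n):
    """CNOT (control -> target) on an n-qubit state vector; qubit 0 is leftmost."""
    psi = state.reshape([2] * n).copy()
    sl = [slice(None)] * n
    sl[control] = 1                                   # act only where the control is |1>
    ax = target - (1 if target > control else 0)      # target axis after slicing out the control
    psi[tuple(sl)] = np.flip(psi[tuple(sl)], axis=ax).copy()   # X on the target
    return psi.reshape(-1)

# qubit ordering: A, E, R1, R2
A, E, R1, R2 = 0, 1, 2, 3
state = np.kron(np.kron(random_qubit(), random_qubit()), np.kron([1, 0], [1, 0]))

# gate sequence of Fig 1 / Eqs (3)-(7)
for c, t in [(E, R1), (E, R2), (A, R1), (R1, R2), (R1, A)]:
    state = apply_cnot(state, c, t, 4)

# project the registers onto each outcome and check the agent-environment correlation
psi = state.reshape(2, 2, 2, 2)
for r1 in (0, 1):
    for r2 in (0, 1):
        branch = psi[:, :, r1, r2]                    # unnormalized A-E state for this outcome
        if np.linalg.norm(branch) > 1e-12:
            # only the |00> and |11> agent-environment components should survive
            assert abs(branch[0, 1]) < 1e-10 and abs(branch[1, 0]) < 1e-10
print("agent and environment always coincide after the register measurement")
```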

Quantum reinforcement learning for multiqubit systems with final measurement

In the previous section, we have shown that, by considering more than just one register, the rewarding criterion in the quantum reinforcement learning algorithm can be applied at the end of our protocol. The same result can be obtained when we consider more complex configurations. Indeed, by assuming that agent and environment are composed of two qubits each, while four qubits act as registers, we show that the rewarding criterion can also be applied at the end of the quantum protocol. Let us illustrate this fact with an analysis for multiqubit agent, environment, and register states,

\[
|A\rangle=\alpha_{A00}|00\rangle_A+\alpha_{A01}|01\rangle_A+\alpha_{A10}|10\rangle_A+\alpha_{A11}|11\rangle_A, \qquad (8)
\]
\[
|E\rangle=\alpha_{E00}|00\rangle_E+\alpha_{E01}|01\rangle_E+\alpha_{E10}|10\rangle_E+\alpha_{E11}|11\rangle_E, \qquad (9)
\]
\[
|R\rangle=|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4, \qquad (10)
\]
\[
|\Psi_0\rangle=|A\rangle\otimes|E\rangle\otimes|R\rangle. \qquad (11)
\]

Following the same procedure described previously, the protocol consists mainly of three types of interactions, as shown in Fig 2. Firstly, we update the registers conditionally to the environment state. More specifically, we consider an interaction between the environment qubits E1 and E2 and the registers R1 and R2, respectively. In this description, the environment qubits act as controls and the registers act as targets in the CNOT gates,

\[
\begin{aligned}
|\Psi_1\rangle=&\;U^{\rm CNOT}_{(E_1,R_1)}U^{\rm CNOT}_{(E_2,R_2)}|\Psi_0\rangle,\\
|\Psi_1\rangle=&\;|A\rangle\otimes\big(\alpha_{E00}|00\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{E01}|01\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\\
&+\alpha_{E10}|10\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{E11}|11\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\big).
\end{aligned} \qquad (12)
\]

Thereafter, we update similarly the remaining registers, that is, we apply a CNOT gate between the environment qubits E1 and E2 and the register qubits R3 and R4, respectively, obtaining

\[
\begin{aligned}
|\Psi_2\rangle=&\;U^{\rm CNOT}_{(E_1,R_3)}U^{\rm CNOT}_{(E_2,R_4)}|\Psi_1\rangle,\\
|\Psi_2\rangle=&\;|A\rangle\otimes\big(\alpha_{E00}|00\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{E01}|01\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{E10}|10\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{E11}|11\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4\big).
\end{aligned} \qquad (13)
\]

The next step consists in updating a part of the register subspace conditionally to the agent state. Thus, the registers R1 and R2 will be updated via A1 and A2, respectively,

\[
\begin{aligned}
|\Psi_3\rangle=&\;U^{\rm CNOT}_{(A_1,R_1)}U^{\rm CNOT}_{(A_2,R_2)}|\Psi_2\rangle,\\
|\Psi_3\rangle=&\;\alpha_{A00}\alpha_{E00}|00\rangle_A|00\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A00}\alpha_{E01}|00\rangle_A|01\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A00}\alpha_{E10}|00\rangle_A|10\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A00}\alpha_{E11}|00\rangle_A|11\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A01}\alpha_{E00}|01\rangle_A|00\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A01}\alpha_{E01}|01\rangle_A|01\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A01}\alpha_{E10}|01\rangle_A|10\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A01}\alpha_{E11}|01\rangle_A|11\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A10}\alpha_{E00}|10\rangle_A|00\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A10}\alpha_{E01}|10\rangle_A|01\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A10}\alpha_{E10}|10\rangle_A|10\rangle_E|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A10}\alpha_{E11}|10\rangle_A|11\rangle_E|0\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A11}\alpha_{E00}|11\rangle_A|00\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A11}\alpha_{E01}|11\rangle_A|01\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A11}\alpha_{E10}|11\rangle_A|10\rangle_E|0\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A11}\alpha_{E11}|11\rangle_A|11\rangle_E|0\rangle_1|0\rangle_2|1\rangle_3|1\rangle_4.
\end{aligned} \qquad (14)
\]

Afterwards, to obtain orthogonal outcomes in the register subspace we perform a pair of CNOT gates in this subspace. The interaction will be between the registers that interact with a common environment, namely, register R1 interacts with R3 because both have interacted with E1. Similarly for R2 and R4, which have interacted with E2. In this case, R1(R2) is the control and R3(R4) is the target.

\[
\begin{aligned}
|\Psi_4\rangle=&\;U^{\rm CNOT}_{(R_1,R_3)}U^{\rm CNOT}_{(R_2,R_4)}|\Psi_3\rangle,\\
|\Psi_4\rangle=&\;\alpha_{A00}\alpha_{E00}|00\rangle_A|00\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A00}\alpha_{E01}|00\rangle_A|01\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\\
&+\alpha_{A00}\alpha_{E10}|00\rangle_A|10\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A00}\alpha_{E11}|00\rangle_A|11\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\\
&+\alpha_{A01}\alpha_{E00}|01\rangle_A|00\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A01}\alpha_{E01}|01\rangle_A|01\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A01}\alpha_{E10}|01\rangle_A|10\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A01}\alpha_{E11}|01\rangle_A|11\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A10}\alpha_{E00}|10\rangle_A|00\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A10}\alpha_{E01}|10\rangle_A|01\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A10}\alpha_{E10}|10\rangle_A|10\rangle_E|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A10}\alpha_{E11}|10\rangle_A|11\rangle_E|0\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A11}\alpha_{E00}|11\rangle_A|00\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4+\alpha_{A11}\alpha_{E01}|11\rangle_A|01\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A11}\alpha_{E10}|11\rangle_A|10\rangle_E|0\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4+\alpha_{A11}\alpha_{E11}|11\rangle_A|11\rangle_E|0\rangle_1|0\rangle_2|1\rangle_3|1\rangle_4.
\end{aligned} \qquad (15)
\]

Finally, we update the agent considering the states of the register in order that the rewarding criterion is satisfied. This is done by applying two CNOT gates in the agent-register subspace, where A1 is controlled by R1 and A2 is controlled by R2,

\[
\begin{aligned}
|\Psi_5\rangle=&\;U^{\rm CNOT}_{(R_1,A_1)}U^{\rm CNOT}_{(R_2,A_2)}|\Psi_4\rangle,\\
|\Psi_5\rangle=&\;\alpha_{A00}\alpha_{E00}|00\rangle_A|00\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A00}\alpha_{E01}|01\rangle_A|01\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\\
&+\alpha_{A00}\alpha_{E10}|10\rangle_A|10\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A00}\alpha_{E11}|11\rangle_A|11\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\\
&+\alpha_{A01}\alpha_{E00}|00\rangle_A|00\rangle_E|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A01}\alpha_{E01}|01\rangle_A|01\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A01}\alpha_{E10}|10\rangle_A|10\rangle_E|1\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A01}\alpha_{E11}|11\rangle_A|11\rangle_E|1\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4\\
&+\alpha_{A10}\alpha_{E00}|00\rangle_A|00\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A10}\alpha_{E01}|01\rangle_A|01\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A10}\alpha_{E10}|10\rangle_A|10\rangle_E|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4+\alpha_{A10}\alpha_{E11}|11\rangle_A|11\rangle_E|0\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A11}\alpha_{E00}|00\rangle_A|00\rangle_E|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4+\alpha_{A11}\alpha_{E01}|01\rangle_A|01\rangle_E|1\rangle_1|0\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A11}\alpha_{E10}|10\rangle_A|10\rangle_E|0\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4+\alpha_{A11}\alpha_{E11}|11\rangle_A|11\rangle_E|0\rangle_1|0\rangle_2|1\rangle_3|1\rangle_4.
\end{aligned} \qquad (16)
\]

From Eq (16), it is straightforward to see that, independently of the measurement outcomes, the learning fidelity is maximal. Moreover, as in the previous case, only one iteration of the quantum reinforcement learning protocol is needed to obtain maximal learning fidelity, $F_{AE}=|\langle\psi_A|\phi_E\rangle|$.

Fig 2. Schematic representation of quantum reinforcement learning protocol for multiqubit systems.


Agent, environment and registers are denoted as A, E and R1, R2, R3 and R4, respectively. The measurement in the register subspace is denoted by the rightmost box.
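As a complement to Eq (16), the eight-qubit circuit of Fig 2 can be checked with a short state-vector simulation. The sketch below is illustrative: the qubit ordering, the cnot helper, and the random-state generation are our own assumptions, not part of the original proposal. It verifies that, for random two-qubit agent and environment states and for all sixteen register outcomes, the agent bit string always coincides with the environment bit string.

```python
import numpy as np

def cnot(state, c, t, n):
    """CNOT (qubit c controls qubit t) acting on a flat 2**n state vector, qubit 0 leftmost."""
    out = state.copy()
    for i in range(len(state)):
        if (i >> (n - 1 - c)) & 1:                  # control bit set
            out[i ^ (1 << (n - 1 - t))] = state[i]  # flip the target bit
    return out

rng = np.random.default_rng(1)
def haar_state(d):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

# ordering: A1 A2 E1 E2 R1 R2 R3 R4
A1, A2, E1, E2, R1, R2, R3, R4 = range(8)
state = np.kron(np.kron(haar_state(4), haar_state(4)), np.eye(16)[0])

# gate sequence of Fig 2 / Eqs (12)-(16)
gates = [(E1, R1), (E2, R2), (E1, R3), (E2, R4),
         (A1, R1), (A2, R2), (R1, R3), (R2, R4),
         (R1, A1), (R2, A2)]
for c, t in gates:
    state = cnot(state, c, t, 8)

# for every register outcome, the A pair must carry the same bit string as the E pair
psi = state.reshape(4, 4, 16)                       # (agent, environment, registers)
for r in range(16):
    branch = psi[:, :, r]
    mismatch = sum(abs(branch[a, e]) ** 2 for a in range(4) for e in range(4) if a != e)
    assert mismatch < 1e-10
print("A1A2 always equals E1E2, for all 16 register outcomes")
```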

Quantum reinforcement learning for qudit systems

So far, we have studied quantum reinforcement learning processes only for two-level systems or pairs of them. However, there are several quantum systems which cannot be described in terms of a two-level system, for instance, quantum harmonic oscillators, electronic energy levels in an ion, and superconducting artificial atoms such as transmons [47], which for some regimes of Josephson energy must be considered as three-level systems. In this context, it is interesting to extend the quantum reinforcement learning protocol developed here to cases where multilevel systems compose the agent, environment, and register.

To perform the previous task, we first need to define the set of logic operations that we will perform on our system. In the qubit case, the main logical operation applied is the CNOT gate, which considers a conditional interaction between two qubits, where one acts as the control while the other acts as the target. The control qubit remains unchanged, whereas the target qubit output is modified by addition modulo 2. It is then natural to define the corresponding logic operation between multilevel systems in terms of addition modulo D, where D stands for the dimension of one subsystem (agent, environment, or register), according to

\[
U|i\rangle_1|j\rangle_2=|i\rangle_1|i\oplus j\rangle_2. \qquad (17)
\]

Here, $i\oplus j$ stands for addition modulo D. This gate is usually known as the XOR gate [48]. For two-dimensional systems, this gate corresponds to the CNOT gate. Nevertheless, for higher-dimensional systems this definition presents several disadvantages. For instance, the XOR gate defined as in Eq (17) is unitary but not Hermitian for D > 2. Moreover, this logical operation is no longer its own inverse. To avoid these problems, the generalized XOR gate (GXOR) has been defined in the literature [48] as

\[
{\rm GXOR}_{1,2}|i\rangle_1|j\rangle_2=|i\rangle_1|i\ominus j\rangle_2, \qquad (18)
\]

where the operation $\ominus$ denotes the difference $i-j$ modulo D. The GXOR gate of Eq (18) does not present the disadvantages pointed out for the definition of Eq (17). That is, the GXOR gate is Hermitian, unitary, and $i\ominus j=0$ only when $i=j$.
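As a quick numerical check of these properties, the GXOR gate of Eq (18) can be written as a D²×D² permutation matrix and compared with the modular-addition gate of Eq (17). The construction below is an illustrative sketch with our own helper names; it verifies that GXOR is unitary and Hermitian, hence its own inverse, while the gate of Eq (17) fails to be Hermitian for D > 2.

```python
import numpy as np

def gxor(D):
    """GXOR gate of Eq (18): |i>|j> -> |i>|(i - j) mod D> on two qudits of dimension D."""
    U = np.zeros((D * D, D * D))
    for i in range(D):
        for j in range(D):
            U[i * D + ((i - j) % D), i * D + j] = 1.0
    return U

def xor_add(D):
    """Modular-addition gate of Eq (17): |i>|j> -> |i>|(i + j) mod D>."""
    U = np.zeros((D * D, D * D))
    for i in range(D):
        for j in range(D):
            U[i * D + ((i + j) % D), i * D + j] = 1.0
    return U

for D in (2, 3, 4):
    G = gxor(D)
    assert np.allclose(G @ G.T, np.eye(D * D))   # unitary (real permutation matrix)
    assert np.allclose(G, G.T)                   # Hermitian, hence its own inverse
    X = xor_add(D)
    print(D, "XOR gate of Eq (17) Hermitian?", np.allclose(X, X.T))   # True only for D = 2
```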

Considering our proposed protocol for single-qubit cases, we show that when we take into account multilevel systems, the number of interactions to obtain maximal learning fidelity is fixed and depends only on the number of agent subsystems in the protocol. Let us illustrate this with an example of multilevel agent-environment-register state,

\[
|\Psi_0\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\rangle_A|m\rangle_E|0\rangle_1|0\rangle_2. \qquad (19)
\]

The first step in our protocol is identical to the equivalent one in the single-qubit case. We update the register conditionally on the environment state, that is, we transfer information of the environment and encode it in the register system. This is done by applying a pair of GXOR gates acting in the environment-register subsystem. In this case, the environment interacts with both registers R1 and R2. The environment acts as control and both registers are targets,

\[
|\Psi_1\rangle=U^{\rm GXOR}_{(E,R_1)}|\Psi_0\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\rangle_A|m\rangle_E|m\rangle_1|0\rangle_2, \qquad (20)
\]
\[
|\Psi_2\rangle=U^{\rm GXOR}_{(E,R_2)}|\Psi_1\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\rangle_A|m\rangle_E|m\rangle_1|m\rangle_2. \qquad (21)
\]

Once the information has been transferred to the register, we update the register R1 based on the agent state. That is, we perform a GXOR gate in the subspace composed of agent and register. Here, the agent acts as the control and the register R1 is the target,

\[
|\Psi_3\rangle=U^{\rm GXOR}_{(A,R_1)}|\Psi_2\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\rangle_A|m\rangle_E|n\ominus m\rangle_1|m\rangle_2. \qquad (22)
\]

Orthogonal outcome measurements in the register subspace are provided by interactions between the registers in this subspace. Thus, we apply a GXOR gate in the register subspace, where R1 is the control and R2 is the target,

\[
|\Psi_4\rangle=U^{\rm GXOR}_{(R_1,R_2)}|\Psi_3\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\rangle_A|m\rangle_E|n\ominus m\rangle_1|(n\ominus m)\ominus m\rangle_2. \qquad (23)
\]

Subsequently, the agent state is updated conditionally to the information encoded in the state of the register R1. The GXOR gate is applied in the register-agent subspace. In this case, R1 is the control and the agent is the target,

\[
|\Psi_5\rangle=U^{\rm GXOR}_{(R_1,A)}|\Psi_4\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|0\ominus m\rangle_A|m\rangle_E|n\ominus m\rangle_1|n\ominus 2m\rangle_2. \qquad (24)
\]

For the case D = 2, we recover the result discussed previously, because 0 ⊖ m = m in that dimension. On the other hand, we are interested in systems with more energy levels, so we need to adapt the protocol to obtain maximal learning fidelity in a fixed number of steps. In this case, we will update the agent subsystem by an iterative interaction with the registers R1 and R2, as shown in Fig 3. Here, the agent always acts as the target, while the registers are the controls. Therefore, we apply a GXOR gate between the register R2 and the agent,

\[
|\Psi_6\rangle=U^{\rm GXOR}_{(R_2,A)}|\Psi_5\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\ominus m\rangle_A|m\rangle_E|n\ominus m\rangle_1|n\ominus 2m\rangle_2. \qquad (25)
\]

Now, by applying a GXOR gate between the register R1 and the agent we obtain,

\[
|\Psi_7\rangle=U^{\rm GXOR}_{(R_1,A)}|\Psi_6\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|0\rangle_A|m\rangle_E|n\ominus m\rangle_1|n\ominus 2m\rangle_2. \qquad (26)
\]

We perform subsequently a GXOR gate in the subspace composed of R2 and agent A,

\[
|\Psi_8\rangle=U^{\rm GXOR}_{(R_2,A)}|\Psi_7\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|n\ominus 2m\rangle_A|m\rangle_E|n\ominus m\rangle_1|n\ominus 2m\rangle_2. \qquad (27)
\]

Finally, applying a GXOR gate on the register-agent subspace, we obtain the desired result. By considering a fixed number of interactions among agent, environment, and register, the learning fidelity becomes maximal independently of the measurement outcome on the register subspace, which can again be carried out at the end of the protocol,

\[
|\Psi_9\rangle=U^{\rm GXOR}_{(R_1,A)}|\Psi_8\rangle=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\alpha_{An}\alpha_{Em}\,|m\rangle_A|m\rangle_E|n\ominus m\rangle_1|n\ominus 2m\rangle_2. \qquad (28)
\]

Thus, in a machine learning protocol where the learning units are composed of multilevel systems (see Fig 3), the number of logical operations required to obtain maximal learning fidelity does not depend on the system dimension.

Fig 3. Quantum reinforcement learning protocol for qudits.


The systems involved are denoted as agent A, environment E and registers R1, R2. In this case, the logical quantum gates which are applied in the learning protocol correspond to GXOR gates. The measurement process in the register subspace is denoted with the rightmost box.

Example

Here, we exemplify how our reinforcement learning protocol works in qudit systems. We consider, without loss of generality, the case for dimension D=4. In this case, the agent-environment-register state has the following form,

\[
|A\rangle=\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A+\alpha_{A2}|2\rangle_A+\alpha_{A3}|3\rangle_A, \qquad (29)
\]
\[
|E\rangle=\alpha_{E0}|0\rangle_E+\alpha_{E1}|1\rangle_E+\alpha_{E2}|2\rangle_E+\alpha_{E3}|3\rangle_E, \qquad (30)
\]
\[
|R\rangle=|0\rangle_1|0\rangle_2, \qquad (31)
\]
\[
|\Psi_0\rangle=|A\rangle\otimes|E\rangle\otimes|R\rangle. \qquad (32)
\]

As mentioned previously, the considered quantum gate is a GXOR gate with subtraction modulo 4. The first step is to update the register according to the environment information,

\[
\begin{aligned}
|\Psi_1\rangle=&\;U^{\rm GXOR}_{(E,R_1)}|\Psi_0\rangle,\\
|\Psi_1\rangle=&\;\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A+\alpha_{A2}|2\rangle_A+\alpha_{A3}|3\rangle_A\right)\left(\alpha_{E0}|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{E1}|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{E2}|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{E3}|3\rangle_E|3\rangle_1|0\rangle_2\right),
\end{aligned} \qquad (33)
\]
\[
\begin{aligned}
|\Psi_2\rangle=&\;U^{\rm GXOR}_{(E,R_2)}|\Psi_1\rangle,\\
|\Psi_2\rangle=&\;\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A+\alpha_{A2}|2\rangle_A+\alpha_{A3}|3\rangle_A\right)\left(\alpha_{E0}|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{E1}|1\rangle_E|1\rangle_1|1\rangle_2+\alpha_{E2}|2\rangle_E|2\rangle_1|2\rangle_2+\alpha_{E3}|3\rangle_E|3\rangle_1|3\rangle_2\right).
\end{aligned} \qquad (34)
\]

Subsequently, the register is updated conditional to the agent state,

\[
\begin{aligned}
|\Psi_3\rangle=&\;U^{\rm GXOR}_{(A,R_1)}|\Psi_2\rangle,\\
|\Psi_3\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|0\rangle_A|1\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A0}\alpha_{E2}|0\rangle_A|2\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E3}|0\rangle_A|3\rangle_E|1\rangle_1|3\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_E|0\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E2}|1\rangle_A|2\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A1}\alpha_{E3}|1\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|2\rangle_A|0\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E1}|2\rangle_A|1\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A2}\alpha_{E2}|2\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|2\rangle_A|3\rangle_E|3\rangle_1|3\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|3\rangle_A|0\rangle_E|3\rangle_1|0\rangle_2+\alpha_{A3}\alpha_{E1}|3\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|3\rangle_A|2\rangle_E|1\rangle_1|2\rangle_2+\alpha_{A3}\alpha_{E3}|3\rangle_A|3\rangle_E|0\rangle_1|3\rangle_2.
\end{aligned} \qquad (35)
\]

Then, to obtain orthogonal outcome measurements in the register basis, we perform an interaction in the register subspace,

\[
\begin{aligned}
|\Psi_4\rangle=&\;U^{\rm GXOR}_{(R_1,R_2)}|\Psi_3\rangle,\\
|\Psi_4\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|0\rangle_A|1\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E2}|0\rangle_A|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E3}|0\rangle_A|3\rangle_E|1\rangle_1|2\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_E|0\rangle_1|3\rangle_2+\alpha_{A1}\alpha_{E2}|1\rangle_A|2\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E3}|1\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|2\rangle_A|0\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E1}|2\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E2}|2\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|2\rangle_A|3\rangle_E|3\rangle_1|0\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|3\rangle_A|0\rangle_E|3\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E1}|3\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|3\rangle_A|2\rangle_E|1\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E3}|3\rangle_A|3\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (36)
\]

Now, we need to apply iterative interactions in the register-agent subspace to update the agent in each step until we get maximal learning fidelity with respect to the environment. We start by performing a GXOR gate between the register R1 and the agent,

\[
\begin{aligned}
|\Psi_5\rangle=&\;U^{\rm GXOR}_{(R_1,A)}|\Psi_4\rangle,\\
|\Psi_5\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|3\rangle_A|1\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E2}|2\rangle_A|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E3}|1\rangle_A|3\rangle_E|1\rangle_1|2\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|0\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|3\rangle_A|1\rangle_E|0\rangle_1|3\rangle_2+\alpha_{A1}\alpha_{E2}|2\rangle_A|2\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E3}|1\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|0\rangle_A|0\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E1}|3\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E2}|2\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|1\rangle_A|3\rangle_E|3\rangle_1|0\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|0\rangle_A|0\rangle_E|3\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E1}|3\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|2\rangle_A|2\rangle_E|1\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E3}|1\rangle_A|3\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (37)
\]

Next, we apply the GXOR gate in the R2-agent subspace,

\[
\begin{aligned}
|\Psi_6\rangle=&\;U^{\rm GXOR}_{(R_2,A)}|\Psi_5\rangle,\\
|\Psi_6\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|3\rangle_A|1\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E2}|2\rangle_A|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E3}|1\rangle_A|3\rangle_E|1\rangle_1|2\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|0\rangle_A|1\rangle_E|0\rangle_1|3\rangle_2+\alpha_{A1}\alpha_{E2}|3\rangle_A|2\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E3}|2\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|2\rangle_A|0\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E1}|1\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E2}|0\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|3\rangle_A|3\rangle_E|3\rangle_1|0\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|3\rangle_A|0\rangle_E|3\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E1}|2\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|1\rangle_A|2\rangle_E|1\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E3}|0\rangle_A|3\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (38)
\]

Afterwards, we perform a GXOR gate between R1 and A,

\[
\begin{aligned}
|\Psi_7\rangle=&\;U^{\rm GXOR}_{(R_1,A)}|\Psi_6\rangle,\\
|\Psi_7\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|0\rangle_A|1\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E2}|0\rangle_A|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E3}|0\rangle_A|3\rangle_E|1\rangle_1|2\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|0\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|0\rangle_A|1\rangle_E|0\rangle_1|3\rangle_2+\alpha_{A1}\alpha_{E2}|0\rangle_A|2\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E3}|0\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|0\rangle_A|0\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E1}|0\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E2}|0\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|0\rangle_A|3\rangle_E|3\rangle_1|0\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|0\rangle_A|0\rangle_E|3\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E1}|0\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|0\rangle_A|2\rangle_E|1\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E3}|0\rangle_A|3\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (39)
\]

Subsequently, an interaction in the R2-agent subspace is performed,

\[
\begin{aligned}
|\Psi_8\rangle=&\;U^{\rm GXOR}_{(R_2,A)}|\Psi_7\rangle,\\
|\Psi_8\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|2\rangle_A|1\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E2}|0\rangle_A|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E3}|2\rangle_A|3\rangle_E|1\rangle_1|2\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|3\rangle_A|1\rangle_E|0\rangle_1|3\rangle_2+\alpha_{A1}\alpha_{E2}|1\rangle_A|2\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E3}|3\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|2\rangle_A|0\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E1}|0\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E2}|2\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|0\rangle_A|3\rangle_E|3\rangle_1|0\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|3\rangle_A|0\rangle_E|3\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E1}|1\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|3\rangle_A|2\rangle_E|1\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E3}|1\rangle_A|3\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (40)
\]

Finally, we apply a GXOR gate between R1 and the agent,

\[
\begin{aligned}
|\Psi_9\rangle=&\;U^{\rm GXOR}_{(R_1,A)}|\Psi_8\rangle,\\
|\Psi_9\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E1}|1\rangle_A|1\rangle_E|3\rangle_1|2\rangle_2+\alpha_{A0}\alpha_{E2}|2\rangle_A|2\rangle_E|2\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E3}|3\rangle_A|3\rangle_E|1\rangle_1|2\rangle_2\\
&+\alpha_{A1}\alpha_{E0}|0\rangle_A|0\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_E|0\rangle_1|3\rangle_2+\alpha_{A1}\alpha_{E2}|2\rangle_A|2\rangle_E|3\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E3}|3\rangle_A|3\rangle_E|2\rangle_1|3\rangle_2\\
&+\alpha_{A2}\alpha_{E0}|0\rangle_A|0\rangle_E|2\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E1}|1\rangle_A|1\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A2}\alpha_{E2}|2\rangle_A|2\rangle_E|0\rangle_1|2\rangle_2+\alpha_{A2}\alpha_{E3}|3\rangle_A|3\rangle_E|3\rangle_1|0\rangle_2\\
&+\alpha_{A3}\alpha_{E0}|0\rangle_A|0\rangle_E|3\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E1}|1\rangle_A|1\rangle_E|2\rangle_1|1\rangle_2+\alpha_{A3}\alpha_{E2}|2\rangle_A|2\rangle_E|1\rangle_1|3\rangle_2+\alpha_{A3}\alpha_{E3}|3\rangle_A|3\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (41)
\]

As we can see, based on the quantum protocol described previously (see Fig 3), we have shown that, for a fixed number of interactions, we obtain maximal learning fidelity even though the system dimension is arbitrary.
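The nine-gate sequence of Eqs (33)-(41) (equivalently, Eqs (20)-(28)) can be verified numerically. The following sketch is illustrative, with our own helper names and simulation choices; for D = 4 it checks that, for random agent and environment qudits, every register outcome leaves agent and environment carrying the same qudit value.

```python
import numpy as np

D = 4
rng = np.random.default_rng(2)

def rand_qudit(d):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def apply_gxor(state, control, target, dims):
    """GXOR: |i>_control |j>_target -> |i>_control |(i - j) mod D>_target."""
    psi = state.reshape(dims)
    out = np.zeros_like(psi)
    for idx in np.ndindex(*dims):
        new = list(idx)
        new[target] = (idx[control] - idx[target]) % dims[target]
        out[tuple(new)] = psi[idx]
    return out.reshape(-1)

dims = (D, D, D, D)                     # A, E, R1, R2
A, E, R1, R2 = 0, 1, 2, 3
state = np.kron(np.kron(rand_qudit(D), rand_qudit(D)), np.kron(np.eye(D)[0], np.eye(D)[0]))

# gate sequence of Eqs (33)-(41)
sequence = [(E, R1), (E, R2), (A, R1), (R1, R2),
            (R1, A), (R2, A), (R1, A), (R2, A), (R1, A)]
for c, t in sequence:
    state = apply_gxor(state, c, t, dims)

psi = state.reshape(dims)
for r1 in range(D):
    for r2 in range(D):
        branch = psi[:, :, r1, r2]
        off_diagonal = np.abs(branch).sum() - np.trace(np.abs(branch))
        assert off_diagonal < 1e-10     # only |m>_A |m>_E components survive
print("agent copies the environment for every register outcome, D =", D)
```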

Quantum reinforcement learning in multiqudit systems

In the previous section, we proved that, for an agent and environment composed of one multilevel system each, the quantum reinforcement learning protocol attains maximal learning fidelity in a fixed number of steps, irrespective of the dimension. Here, using this result, we also prove that, for more than one multilevel system in the agent, environment, and register subspaces, the number of steps is also fixed and scales with the number of individual subsystems that compose agent and environment. To be more specific, in the single-multilevel case the total number of needed steps is nine. For two multilevel systems, we show that the number of required steps is eighteen and, in general, 9n, with n being the number of multilevel subsystems. The initial state of our protocol consists of arbitrary superpositions for both agent and environment, while the registers are in their ground state,

\[
|\Psi_0\rangle=\sum_{n,m=0}^{N-1}\sum_{p,q=0}^{N-1}\alpha_{Anm}\alpha_{Epq}\,|n\rangle_A|m\rangle_A|p\rangle_E|q\rangle_E|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4. \qquad (42)
\]

The first step in the protocol consists in encoding the environment information in the register states. This is done by applying a pair of GXOR gates. The gates are applied in the environment-register subspace, while the interaction in this case is the same as the one described previously. Namely, E1 controls R1 and E2 controls R2.

\[
|\Psi_1\rangle=U^{\rm GXOR}_{(E_2,R_2)}U^{\rm GXOR}_{(E_1,R_1)}|\Psi_0\rangle=\sum_{n,m=0}^{N-1}\sum_{p,q=0}^{N-1}\alpha_{Anm}\alpha_{Epq}\,|n\rangle_A|m\rangle_A|p\rangle_E|q\rangle_E|p\rangle_1|q\rangle_2|0\rangle_3|0\rangle_4. \qquad (43)
\]

Similarly, in the second step we encode the environment information in the other two registers (R3 and R4) through GXOR gates. Here, the control system is the environment while the targets are the registers.

\[
|\Psi_2\rangle=U^{\rm GXOR}_{(E_2,R_4)}U^{\rm GXOR}_{(E_1,R_3)}|\Psi_1\rangle=\sum_{n,m=0}^{N-1}\sum_{p,q=0}^{N-1}\alpha_{Anm}\alpha_{Epq}\,|n\rangle_A|m\rangle_A|p\rangle_E|q\rangle_E|p\rangle_1|q\rangle_2|p\rangle_3|q\rangle_4. \qquad (44)
\]

Subsequently, a part of the register subspace is updated conditional on the agent information. Therefore, we apply a pair of GXOR gates on the agent-register subspace. In this case, agents A1 and A2 are controls and registers R1 and R2 targets.

\[
|\Psi_3\rangle=U^{\rm GXOR}_{(A_2,R_2)}U^{\rm GXOR}_{(A_1,R_1)}|\Psi_2\rangle=\sum_{n,m=0}^{N-1}\sum_{p,q=0}^{N-1}\alpha_{Anm}\alpha_{Epq}\,|n\rangle_A|m\rangle_A|p\rangle_E|q\rangle_E|n\ominus p\rangle_1|m\ominus q\rangle_2|p\rangle_3|q\rangle_4. \qquad (45)
\]

Now, we update the register subspace considering interactions between register components which have been acted upon by the same part of the environment. Namely, the register R3 will be updated with R1 as control (similarly, R4 is controlled by R2).

\[
|\Psi_4\rangle=U^{\rm GXOR}_{(R_2,R_4)}U^{\rm GXOR}_{(R_1,R_3)}|\Psi_3\rangle=\sum_{n,m=0}^{N-1}\sum_{p,q=0}^{N-1}\alpha_{Anm}\alpha_{Epq}\,|n\rangle_A|m\rangle_A|p\rangle_E|q\rangle_E|n\ominus p\rangle_1|m\ominus q\rangle_2|n\ominus 2p\rangle_3|m\ominus 2q\rangle_4. \qquad (46)
\]

Subsequently, we need to apply successive interactions between agent states and register states to obtain maximal learning fidelity. We show that, by applying the same interactions as for the single multilevel case to the triplet formed by the agent A1 and the registers R1 and R3 (similarly, A2 with R2 and R4), the maximal learning fidelity is reached. It is straightforward to show that

\[
\begin{aligned}
|\Psi_9\rangle=&\;U^{\rm GXOR}_{(R_2,A_2)}U^{\rm GXOR}_{(R_1,A_1)}U^{\rm GXOR}_{(R_4,A_2)}U^{\rm GXOR}_{(R_3,A_1)}U^{\rm GXOR}_{(R_2,A_2)}U^{\rm GXOR}_{(R_1,A_1)}U^{\rm GXOR}_{(R_4,A_2)}U^{\rm GXOR}_{(R_3,A_1)}U^{\rm GXOR}_{(R_2,A_2)}U^{\rm GXOR}_{(R_1,A_1)}|\Psi_4\rangle,\\
|\Psi_9\rangle=&\;\sum_{n,m=0}^{N-1}\sum_{p,q=0}^{N-1}\alpha_{Anm}\alpha_{Epq}\,|p\rangle_A|q\rangle_A|p\rangle_E|q\rangle_E|n\ominus p\rangle_1|m\ominus q\rangle_2|n\ominus 2p\rangle_3|m\ominus 2q\rangle_4.
\end{aligned} \qquad (47)
\]

Summarizing, for the case studied in this section, we demonstrate that the number of operations required to obtain maximal learning fidelity does not depend on the learning unit dimension and is equal to eighteen operations, which is twice the number of steps required in the single-qudit case. It is straightforward to realize that the number of operations needed to achieve maximal learning fidelity in a machine learning protocol composed of n subsystems for agent and environment is equal to 9n. Namely, the number of operations scales polynomially, indeed linearly, with the number of subsystems.
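For bookkeeping, the 9n scaling can be tallied per agent-environment pair: two environment-register encodings, one agent-register update, one register-register update, and five register-agent rewarding gates. The following trivial sketch of this arithmetic is our own breakdown of the gate count quoted above.

```python
def gxor_count(n_pairs):
    """Gate count of the multiqudit protocol: 9 GXOR gates per agent-environment pair."""
    per_pair = {"environment->register": 2,   # each E_k is encoded in two registers
                "agent->register": 1,
                "register->register": 1,
                "register->agent": 5}         # iterative rewarding sequence
    return {k: v * n_pairs for k, v in per_pair.items()}, 9 * n_pairs

print(gxor_count(1))   # total 9, the single-qudit case
print(gxor_count(2))   # total 18, the two-qudit case above
```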

Quantum reinforcement learning in larger environments

Up to now, the quantum reinforcement learning protocols described here always consider that the agent and the environment have the same number of subsystems, as well as the same dimension. In these cases, we have shown that, by adding more registers, the quantum protocol improves in the sense that only one iteration and one measurement are enough to obtain maximal learning fidelity. Nevertheless, in more realistic scenarios, the agent must adapt to larger or more complex surroundings. Here, we discuss the situation where the environment has more subsystems than the agent, and therefore a larger dimension. As the environment has more information than the agent, it is expected that not all available surrounding information will be transferred to the agent. Indeed, we prove that, depending on the register-environment interaction, the agent can encode the information from one specific part of the environment. In this case, unlike the protocols previously discussed, we achieve maximal learning fidelity after applying one measurement and a rewarding iteration (feedback).

The proposed quantum protocol is shown in Fig 4. Here, one two-level system forms the agent, while register and environment are each constituted by two qubits. Each environment qubit interacts with one qubit of the register, such that this interaction updates the registers conditionally to the environment information. Then, one part of the register subspace is also updated conditionally to the agent state. Subsequently, we perform a measurement on the register subspace, such that, depending on the measurement outcomes, we apply a conditional operation in the agent-register subspace until the agent adapts to a specific part of the environment. To illustrate this, let us introduce a possible agent-environment-register state of the following form,

\[
|A\rangle=\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A, \qquad (48)
\]
\[
|E\rangle=\alpha_{E00}|00\rangle_E+\alpha_{E01}|01\rangle_E+\alpha_{E10}|10\rangle_E+\alpha_{E11}|11\rangle_E, \qquad (49)
\]
\[
|R\rangle=|0\rangle_1|0\rangle_2, \qquad (50)
\]
\[
|\Psi_0\rangle=|A\rangle\otimes|E\rangle\otimes|R\rangle. \qquad (51)
\]

The first step is to transfer quantum information from the environment onto the registers. This is done by applying a pair of CNOT gates in the environment-register subspaces,

\[
\begin{aligned}
|\Psi_1\rangle=&\;U^{\rm CNOT}_{(E_2,R_2)}U^{\rm CNOT}_{(E_1,R_1)}|\Psi_0\rangle,\\
|\Psi_1\rangle=&\;\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A\right)\left(\alpha_{E00}|00\rangle_E|0\rangle_1|0\rangle_2+\alpha_{E01}|01\rangle_E|0\rangle_1|1\rangle_2+\alpha_{E10}|10\rangle_E|1\rangle_1|0\rangle_2+\alpha_{E11}|11\rangle_E|1\rangle_1|1\rangle_2\right).
\end{aligned} \qquad (52)
\]

Subsequently, the register R1 is updated conditionally to the agent information. Therefore, a CNOT gate is applied in the agent-register subspace, where the agent qubit is the control and the register R1 is the target,

\[
\begin{aligned}
|\Psi_2\rangle=&\;U^{\rm CNOT}_{(A,R_1)}|\Psi_1\rangle,\\
|\Psi_2\rangle=&\;\alpha_{A0}\alpha_{E00}|0\rangle_A|00\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E01}|0\rangle_A|01\rangle_E|0\rangle_1|1\rangle_2+\alpha_{A0}\alpha_{E10}|0\rangle_A|10\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E11}|0\rangle_A|11\rangle_E|1\rangle_1|1\rangle_2\\
&+\alpha_{A1}\alpha_{E00}|1\rangle_A|00\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A1}\alpha_{E01}|1\rangle_A|01\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E10}|1\rangle_A|10\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A1}\alpha_{E11}|1\rangle_A|11\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (53)
\]

Afterwards, we perform a measurement on the register subspace. In this case, the wave function is projected onto the four possible measurement outcomes,

\[
\begin{aligned}
M_1&=\left(\alpha_{A0}\alpha_{E00}|0\rangle_A|00\rangle_E+\alpha_{A1}\alpha_{E10}|1\rangle_A|10\rangle_E\right)|0\rangle_1|0\rangle_2=\left(\alpha_{A0}\alpha_{E00}|0\rangle_A|0\rangle_{E_1}+\alpha_{A1}\alpha_{E10}|1\rangle_A|1\rangle_{E_1}\right)|0\rangle_{E_2}|0\rangle_1|0\rangle_2,\\
M_2&=\left(\alpha_{A0}\alpha_{E01}|0\rangle_A|01\rangle_E+\alpha_{A1}\alpha_{E11}|1\rangle_A|11\rangle_E\right)|0\rangle_1|1\rangle_2=\left(\alpha_{A0}\alpha_{E01}|0\rangle_A|0\rangle_{E_1}+\alpha_{A1}\alpha_{E11}|1\rangle_A|1\rangle_{E_1}\right)|1\rangle_{E_2}|0\rangle_1|1\rangle_2,\\
M_3&=\left(\alpha_{A1}\alpha_{E00}|1\rangle_A|00\rangle_E+\alpha_{A0}\alpha_{E10}|0\rangle_A|10\rangle_E\right)|1\rangle_1|0\rangle_2=\left(\alpha_{A1}\alpha_{E00}|1\rangle_A|0\rangle_{E_1}+\alpha_{A0}\alpha_{E10}|0\rangle_A|1\rangle_{E_1}\right)|0\rangle_{E_2}|1\rangle_1|0\rangle_2,\\
M_4&=\left(\alpha_{A0}\alpha_{E11}|0\rangle_A|11\rangle_E+\alpha_{A1}\alpha_{E01}|1\rangle_A|01\rangle_E\right)|1\rangle_1|1\rangle_2=\left(\alpha_{A0}\alpha_{E11}|0\rangle_A|1\rangle_{E_1}+\alpha_{A1}\alpha_{E01}|1\rangle_A|0\rangle_{E_1}\right)|1\rangle_{E_2}|1\rangle_1|1\rangle_2.
\end{aligned} \qquad (54)
\]

As we can see, the projective measurement on the register subspace leaves the agent and one part of the environment subspace (E1) in an entangled state. At this stage, we can apply the rewarding criterion, which consists in performing a CNOT gate in the register-agent subspace. The register qubit R1 is the control and the agent is the target,

\[
\begin{aligned}
M_1^a&=U^{\rm CNOT}_{(R_1,A)}M_1=\left(\alpha_{A0}\alpha_{E00}|0\rangle_A|0\rangle_{E_1}+\alpha_{A1}\alpha_{E10}|1\rangle_A|1\rangle_{E_1}\right)|0\rangle_{E_2}|0\rangle_1|0\rangle_2,\\
M_2^a&=U^{\rm CNOT}_{(R_1,A)}M_2=\left(\alpha_{A0}\alpha_{E01}|0\rangle_A|0\rangle_{E_1}+\alpha_{A1}\alpha_{E11}|1\rangle_A|1\rangle_{E_1}\right)|1\rangle_{E_2}|0\rangle_1|1\rangle_2,\\
M_3^a&=U^{\rm CNOT}_{(R_1,A)}M_3=\left(\alpha_{A1}\alpha_{E00}|0\rangle_A|0\rangle_{E_1}+\alpha_{A0}\alpha_{E10}|1\rangle_A|1\rangle_{E_1}\right)|0\rangle_{E_2}|1\rangle_1|0\rangle_2,\\
M_4^a&=U^{\rm CNOT}_{(R_1,A)}M_4=\left(\alpha_{A0}\alpha_{E11}|1\rangle_A|1\rangle_{E_1}+\alpha_{A1}\alpha_{E01}|0\rangle_A|0\rangle_{E_1}\right)|1\rangle_{E_2}|1\rangle_1|1\rangle_2.
\end{aligned} \qquad (55)
\]

Finally, we perform a CNOT gate in the agent-register subspace to obtain orthogonal measurement outcomes. The agent qubit is the control and the register qubit R1 is the target, according to

\[
\begin{aligned}
M_1^b&=U^{\rm CNOT}_{(A,R_1)}M_1^a=\alpha_{A0}\alpha_{E00}|0\rangle_A|00\rangle_E|0\rangle_1|0\rangle_2+\alpha_{A1}\alpha_{E10}|1\rangle_A|10\rangle_E|1\rangle_1|0\rangle_2,\\
M_2^b&=U^{\rm CNOT}_{(A,R_1)}M_2^a=\alpha_{A0}\alpha_{E01}|0\rangle_A|01\rangle_E|0\rangle_1|1\rangle_2+\alpha_{A1}\alpha_{E11}|1\rangle_A|11\rangle_E|1\rangle_1|1\rangle_2,\\
M_3^b&=U^{\rm CNOT}_{(A,R_1)}M_3^a=\alpha_{A1}\alpha_{E00}|0\rangle_A|00\rangle_E|1\rangle_1|0\rangle_2+\alpha_{A0}\alpha_{E10}|1\rangle_A|10\rangle_E|0\rangle_1|0\rangle_2,\\
M_4^b&=U^{\rm CNOT}_{(A,R_1)}M_4^a=\alpha_{A1}\alpha_{E01}|0\rangle_A|01\rangle_E|1\rangle_1|1\rangle_2+\alpha_{A0}\alpha_{E11}|1\rangle_A|11\rangle_E|0\rangle_1|1\rangle_2.
\end{aligned} \qquad (56)
\]

In this quantum reinforcement learning protocol, we perform interactions between the environment and the register subspaces. Nevertheless, the agent is updated only regarding the information encoded in register R1. Thus, the maximal learning fidelity is achieved with respect to the first qubit of the environment.

Fig 4. Quantum reinforcement learning for larger environment systems.


The systems involved are denoted as agent A, environment E and registers R1, R2, where E contains now two qubits while A just one. The logical gates applied between the different subsystems are CNOT gates. In this case, to obtain maximal learning fidelity, it is required to perform two separate measurements denoted by the blue boxes.
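The protocol of Fig 4 (Eqs (52)-(56)) can also be simulated directly. In the illustrative sketch below (our own helper names and qubit ordering, not part of the original proposal), the register measurement is emulated by projecting onto each outcome, after which the rewarding CNOTs of Eqs (55)-(56) are applied; in every branch the agent ends up matching the environment qubit E1.

```python
import numpy as np

rng = np.random.default_rng(3)
def rand_state(d):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def cnot(state, c, t, n):
    """CNOT (qubit c -> qubit t) on a flat 2**n state vector, qubit 0 leftmost."""
    out = state.copy()
    for i in range(len(state)):
        if (i >> (n - 1 - c)) & 1:
            out[i ^ (1 << (n - 1 - t))] = state[i]
    return out

# ordering: A, E1, E2, R1, R2
A, E1, E2, R1, R2 = range(5)
state = np.kron(np.kron(rand_state(2), rand_state(4)), np.eye(4)[0])

# encoding and agent update, Eqs (52)-(53)
for c, t in [(E1, R1), (E2, R2), (A, R1)]:
    state = cnot(state, c, t, 5)

# projective measurement of R1R2, followed by the rewarding CNOTs of Eqs (55)-(56)
psi = state.reshape(2, 2, 2, 2, 2)
for r1 in (0, 1):
    for r2 in (0, 1):
        branch = np.zeros_like(psi)
        branch[:, :, :, r1, r2] = psi[:, :, :, r1, r2]   # keep one measurement outcome
        b = branch.reshape(-1)
        if np.linalg.norm(b) < 1e-12:
            continue
        b = cnot(b, R1, A, 5)          # reward: correct the agent
        b = cnot(b, A, R1, 5)          # restore orthogonal register outcomes
        bb = b.reshape(2, 2, 2, 2, 2)
        # the agent must now coincide with the first environment qubit E1
        mismatch = np.abs(bb[0, 1]).sum() + np.abs(bb[1, 0]).sum()
        assert mismatch < 1e-10
print("agent matches E1 in every measurement branch")
```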

Let us now consider another configuration similar to the one studied previously in this article, where the register is formed by a larger number of subsystems than the environment. Here, additionally, the environment we consider is larger than the agent. We prove that, for this system configuration, maximal learning fidelity between the agent and one part of the environment is achieved in one rewarding process. For this configuration, the maximal fidelity does not depend on the entanglement present in the agent-environment subspace. The general agent-register-environment state is

\[
|A\rangle=\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A, \qquad (57)
\]
\[
|E\rangle=\left(\alpha_{E0}|0\rangle_{E_1}+\alpha_{E1}|1\rangle_{E_1}\right)|0\rangle_{E_2}+\left(\beta_{E0}|0\rangle_{E_1}+\beta_{E1}|1\rangle_{E_1}\right)|1\rangle_{E_2}, \qquad (58)
\]
\[
|R\rangle=|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4, \qquad (59)
\]
\[
|\Psi_0\rangle=|A\rangle\otimes|E\rangle\otimes|R\rangle. \qquad (60)
\]

The quantum protocol consists in updating the registers R1,2 conditionally to the environment state E1,2,

\[
\begin{aligned}
|\Psi_1\rangle=&\;U^{\rm CNOT}_{(E_2,R_2)}U^{\rm CNOT}_{(E_1,R_1)}|\Psi_0\rangle,\\
|\Psi_1\rangle=&\;\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A\right)\big(\alpha_{E0}|0\rangle_{E_1}|0\rangle_{E_2}|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{E1}|1\rangle_{E_1}|0\rangle_{E_2}|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4\\
&+\beta_{E0}|0\rangle_{E_1}|1\rangle_{E_2}|0\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4+\beta_{E1}|1\rangle_{E_1}|1\rangle_{E_2}|1\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4\big).
\end{aligned} \qquad (61)
\]

After this, we also update the information of the registers R3,4 conditionally to the environment state E1,2,

\[
\begin{aligned}
|\Psi_2\rangle=&\;U^{\rm CNOT}_{(E_2,R_4)}U^{\rm CNOT}_{(E_1,R_3)}|\Psi_1\rangle,\\
|\Psi_2\rangle=&\;\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A\right)\big(\alpha_{E0}|0\rangle_{E_1}|0\rangle_{E_2}|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{E1}|1\rangle_{E_1}|0\rangle_{E_2}|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4\\
&+\beta_{E0}|0\rangle_{E_1}|1\rangle_{E_2}|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\beta_{E1}|1\rangle_{E_1}|1\rangle_{E_2}|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4\big).
\end{aligned} \qquad (62)
\]

Now, the register R1 is updated conditionally to the agent state,

\[
\begin{aligned}
|\Psi_3\rangle=&\;U^{\rm CNOT}_{(A,R_1)}|\Psi_2\rangle,\\
|\Psi_3\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_{E_1}|0\rangle_{E_2}|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A0}\alpha_{E1}|0\rangle_A|1\rangle_{E_1}|0\rangle_{E_2}|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A0}\beta_{E0}|0\rangle_A|0\rangle_{E_1}|1\rangle_{E_2}|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A0}\beta_{E1}|0\rangle_A|1\rangle_{E_1}|1\rangle_{E_2}|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A1}\alpha_{E0}|1\rangle_A|0\rangle_{E_1}|0\rangle_{E_2}|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_{E_1}|0\rangle_{E_2}|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A1}\beta_{E0}|1\rangle_A|0\rangle_{E_1}|1\rangle_{E_2}|1\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A1}\beta_{E1}|1\rangle_A|1\rangle_{E_1}|1\rangle_{E_2}|0\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4.
\end{aligned} \qquad (63)
\]

Then, the next step would consist in updating a part of the register subspace using the information encoded in the other part. However, this step is not necessary, because the number of terms in Eq (63) is smaller than the number of possible measurement outcomes in the register subspace. Thus, the register is always projected onto orthogonal measurement outcomes. On the other hand, we update the agent state with the information encoded in the register R1. Therefore, we perform a CNOT gate in the register-agent subspace, where the register R1 is the control and the agent is the target,

\[
\begin{aligned}
|\Psi_4\rangle=&\;U^{\rm CNOT}_{(R_1,A)}|\Psi_3\rangle,\\
|\Psi_4\rangle=&\;\alpha_{A0}\alpha_{E0}|0\rangle_A|0\rangle_{E_1}|0\rangle_{E_2}|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A0}\alpha_{E1}|1\rangle_A|1\rangle_{E_1}|0\rangle_{E_2}|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A0}\beta_{E0}|0\rangle_A|0\rangle_{E_1}|1\rangle_{E_2}|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A0}\beta_{E1}|1\rangle_A|1\rangle_{E_1}|1\rangle_{E_2}|1\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4\\
&+\alpha_{A1}\alpha_{E0}|0\rangle_A|0\rangle_{E_1}|0\rangle_{E_2}|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4+\alpha_{A1}\alpha_{E1}|1\rangle_A|1\rangle_{E_1}|0\rangle_{E_2}|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4\\
&+\alpha_{A1}\beta_{E0}|0\rangle_A|0\rangle_{E_1}|1\rangle_{E_2}|1\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4+\alpha_{A1}\beta_{E1}|1\rangle_A|1\rangle_{E_1}|1\rangle_{E_2}|0\rangle_1|1\rangle_2|1\rangle_3|1\rangle_4.
\end{aligned} \qquad (64)
\]

By measuring the register subspace, we obtain that the agent and the environment qubit E1 achieve maximal learning fidelity.
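A similar check applies to this second configuration, where no rewarding feedback is needed: after the six CNOT gates of Eqs (61)-(64), the agent already coincides with E1, and a final register measurement merely decouples the registers. The sketch below is illustrative, with our own helper and ordering conventions.

```python
import numpy as np

rng = np.random.default_rng(5)
def rand_state(d):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def cnot(state, c, t, n):
    """CNOT (qubit c -> qubit t) on a flat 2**n state vector, qubit 0 leftmost."""
    out = state.copy()
    for i in range(len(state)):
        if (i >> (n - 1 - c)) & 1:
            out[i ^ (1 << (n - 1 - t))] = state[i]
    return out

# ordering: A, E1, E2, R1, R2, R3, R4
A, E1, E2, R1, R2, R3, R4 = range(7)
state = np.kron(np.kron(rand_state(2), rand_state(4)), np.eye(16)[0])

# Eqs (61)-(64): encoding, agent update, and a single rewarding CNOT
for c, t in [(E1, R1), (E2, R2), (E1, R3), (E2, R4), (A, R1), (R1, A)]:
    state = cnot(state, c, t, 7)

# without any measurement, the agent already equals E1 in every register branch
psi = state.reshape(2, 2, 2, 16)          # (A, E1, E2, registers)
assert np.abs(psi[0, 1]).sum() + np.abs(psi[1, 0]).sum() < 1e-10
print("agent coincides with E1; the final register measurement just decouples the registers")
```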

Quantum reinforcement learning for mixed states

Let us now consider the situation where the environment evolves under a noise mechanism (for qubit states, noise mechanisms include depolarizing noise as well as amplitude damping). In this case, the density matrix describing the environment state reads

\[
\rho=\begin{pmatrix}\rho_{00} & \rho_{01}\\ \rho_{01}^{*} & \rho_{11}\end{pmatrix}. \qquad (65)
\]

We now focus our attention on the application of the quantum reinforcement learning protocol to this type of state. We will show that, by adding more registers, two main results are obtained. Firstly, even though the environment is in a mixed state, the learning fidelity will be maximal for any measurement outcome in the register basis. Additionally, the measurement outcomes provide relevant information about the coherences of the mixed state. To apply the quantum protocol, we express the mixed state in terms of its (non-unique) purification, as

\[
|\Psi_{E+e}\rangle=\left[\sqrt{\rho_{00}}\,|0\rangle_E+\frac{\rho_{10}}{\sqrt{\rho_{00}}}\,|1\rangle_E\right]|e_1\rangle+\sqrt{\rho_{11}-\frac{|\rho_{10}|^2}{\rho_{00}}}\,|1\rangle_E|e_2\rangle, \qquad (66)
\]
\[
|\psi_e\rangle=\frac{\rho_{10}}{\sqrt{\rho_{00}}}\,|e_1\rangle+\sqrt{\rho_{11}-\frac{|\rho_{10}|^2}{\rho_{00}}}\,|e_2\rangle,\qquad
|\Psi_{E+e}\rangle=\sqrt{\rho_{00}}\,|0\rangle_E|e_1\rangle+\sqrt{\rho_{11}}\,|1\rangle_E|\bar{\psi}_e\rangle. \qquad (67)
\]

Here, $|\bar{\psi}_e\rangle$ is a normalized vector in the purification Hilbert space. As we can see, the coefficients of the quantum state written in the extended Hilbert space (environment plus purification) depend only on the diagonal terms of the mixed state. Moreover, to obtain additional information about the mixed state, we need to perform unitary transformations on it in such a way that the information related to the coherences appears in the diagonal of the transformed state. To be more specific, we need to perform unitary transformations such that the mixed state can be written as follows,

\[
\bar{\rho}\equiv U_y\rho U_y^{\dagger}=\frac{1}{2}
\begin{pmatrix}
1+(\rho_{01}+\rho_{01}^{*}) & \rho_{11}-\rho_{00}+(\rho_{01}-\rho_{01}^{*})\\
\rho_{11}-\rho_{00}-(\rho_{01}-\rho_{01}^{*}) & 1-(\rho_{01}+\rho_{01}^{*})
\end{pmatrix}, \qquad (68)
\]
\[
\tilde{\rho}\equiv U_x\rho U_x^{\dagger}=\frac{1}{2}
\begin{pmatrix}
1-i(\rho_{01}-\rho_{01}^{*}) & \rho_{01}+\rho_{01}^{*}+i(\rho_{11}-\rho_{00})\\
\rho_{01}+\rho_{01}^{*}-i(\rho_{11}-\rho_{00}) & 1+i(\rho_{01}-\rho_{01}^{*})
\end{pmatrix}. \qquad (69)
\]

To carry out this task, we need to add three more registers, where each of them has the function of encoding information about the populations (diagonal) and the real and imaginary parts of the coherence terms, respectively. A possible state for the space composed of agent, mixed environment, and register is given by

\[
|A\rangle=\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A, \qquad (70)
\]
\[
|\Psi_{E+e}\rangle=\sqrt{\rho_{00}}\,|0\rangle_E|e_1\rangle+\sqrt{\rho_{11}}\,|1\rangle_E|\bar{\psi}_e\rangle, \qquad (71)
\]
\[
|R\rangle=|0\rangle_1|0\rangle_2\otimes\frac{1}{\sqrt{3}}\left(|1\rangle_3|0\rangle_4|0\rangle_5+|0\rangle_3|1\rangle_4|0\rangle_5+|0\rangle_3|0\rangle_4|1\rangle_5\right), \qquad (72)
\]
\[
|\Psi_0\rangle=|A\rangle\otimes|\Psi_{E+e}\rangle\otimes|R\rangle. \qquad (73)
\]

The first step is to apply a unitary transformation conditional on the state of the registers R3, R4 and R5. In case the register state is $|1\rangle_3|0\rangle_4|0\rangle_5$, we apply the transformation $U_1=\mathbb{I}_{R_3}\otimes\mathbb{I}_{R_4}\otimes\mathbb{I}_{R_5}$. If the register state is $|0\rangle_3|1\rangle_4|0\rangle_5$, we apply the transformation $U_2=\mathbb{I}_{R_3}\otimes U_y\otimes\mathbb{I}_{R_5}$. Finally, if the register state is $|0\rangle_3|0\rangle_4|1\rangle_5$, the unitary transformation is $U_3=\mathbb{I}_{R_3}\otimes\mathbb{I}_{R_4}\otimes U_x$. Hence, the state after this conditional transformation on the environment reads

\[
\begin{aligned}
|\Psi_1\rangle=&\;\frac{1}{\sqrt{3}}\Big(|A\rangle|\Psi_{E+e}\rangle|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4|0\rangle_5+|A\rangle\,U_y|\Psi_{E+e}\rangle\,|0\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4|0\rangle_5+|A\rangle\,U_x|\Psi_{E+e}\rangle\,|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4|1\rangle_5\Big),\\
|\Psi_1\rangle=&\;\frac{1}{\sqrt{3}}\left(\alpha_{A0}|0\rangle_A+\alpha_{A1}|1\rangle_A\right)\Big[\left(\sqrt{\rho_{00}}\,|0\rangle_E|e_1\rangle+\sqrt{\rho_{11}}\,|1\rangle_E|\bar{\psi}_e\rangle\right)|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4|0\rangle_5\\
&+\left(\sqrt{\tfrac{1}{2}+{\rm Re}(\rho_{01})}\,|0\rangle_E|e_1\rangle+\sqrt{\tfrac{1}{2}-{\rm Re}(\rho_{01})}\,|1\rangle_E|\bar{\psi}_e\rangle\right)|0\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4|0\rangle_5\\
&+\left(\sqrt{\tfrac{1}{2}+{\rm Im}(\rho_{01})}\,|0\rangle_E|e_1\rangle+\sqrt{\tfrac{1}{2}-{\rm Im}(\rho_{01})}\,|1\rangle_E|\bar{\psi}_e\rangle\right)|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4|1\rangle_5\Big].
\end{aligned} \qquad (74)
\]

Afterwards, we apply the quantum protocol as we did in the first section. Namely, we first update the registers conditionally to the information of the environment. Then, we update the register R1 conditionally to the information of the agent. Subsequently, to obtain orthogonal measurement outcomes, we perform CNOT gates in the register subspace (R1 is the control and R2 is the target). Finally, the agent is updated in terms of the information encoded in register R1 (where A is the target and R1 is the control),

\[
\begin{aligned}
|\Psi_5\rangle=\frac{1}{\sqrt{3}}\Big(&\alpha_{A0}\sqrt{\rho_{00}}\,|0\rangle_A|0\rangle_E|e_1\rangle|0\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4|0\rangle_5+\alpha_{A0}\sqrt{\rho_{11}}\,|1\rangle_A|1\rangle_E|\bar{\psi}_e\rangle|1\rangle_1|0\rangle_2|1\rangle_3|0\rangle_4|0\rangle_5\\
&+\alpha_{A1}\sqrt{\rho_{00}}\,|0\rangle_A|0\rangle_E|e_1\rangle|1\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4|0\rangle_5+\alpha_{A1}\sqrt{\rho_{11}}\,|1\rangle_A|1\rangle_E|\bar{\psi}_e\rangle|0\rangle_1|1\rangle_2|1\rangle_3|0\rangle_4|0\rangle_5\\
&+\alpha_{A0}\sqrt{\tfrac{1}{2}+{\rm Re}(\rho_{01})}\,|0\rangle_A|0\rangle_E|e_1\rangle|0\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4|0\rangle_5+\alpha_{A0}\sqrt{\tfrac{1}{2}-{\rm Re}(\rho_{01})}\,|1\rangle_A|1\rangle_E|\bar{\psi}_e\rangle|1\rangle_1|0\rangle_2|0\rangle_3|1\rangle_4|0\rangle_5\\
&+\alpha_{A1}\sqrt{\tfrac{1}{2}+{\rm Re}(\rho_{01})}\,|0\rangle_A|0\rangle_E|e_1\rangle|1\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4|0\rangle_5+\alpha_{A1}\sqrt{\tfrac{1}{2}-{\rm Re}(\rho_{01})}\,|1\rangle_A|1\rangle_E|\bar{\psi}_e\rangle|0\rangle_1|1\rangle_2|0\rangle_3|1\rangle_4|0\rangle_5\\
&+\alpha_{A0}\sqrt{\tfrac{1}{2}+{\rm Im}(\rho_{01})}\,|0\rangle_A|0\rangle_E|e_1\rangle|0\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4|1\rangle_5+\alpha_{A0}\sqrt{\tfrac{1}{2}-{\rm Im}(\rho_{01})}\,|1\rangle_A|1\rangle_E|\bar{\psi}_e\rangle|1\rangle_1|0\rangle_2|0\rangle_3|0\rangle_4|1\rangle_5\\
&+\alpha_{A1}\sqrt{\tfrac{1}{2}+{\rm Im}(\rho_{01})}\,|0\rangle_A|0\rangle_E|e_1\rangle|1\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4|1\rangle_5+\alpha_{A1}\sqrt{\tfrac{1}{2}-{\rm Im}(\rho_{01})}\,|1\rangle_A|1\rangle_E|\bar{\psi}_e\rangle|0\rangle_1|1\rangle_2|0\rangle_3|0\rangle_4|1\rangle_5\Big).
\end{aligned} \qquad (75)
\]

This quantum reinforcement learning protocol exhibits two features. First, by performing projective measurements on registers R1, R2 and R3, we recover the result of the first section, i.e., the learning fidelity is maximal independently of the measurement outcomes in the register subspace. The second feature is that, for specific measurement outcomes in a part of the register subspace, we obtain information about the populations (diagonal) and the coherences (off-diagonal terms) of the mixed state. This feature can be used in problems such as partial cloning, in cases where the system from which we extract information evolves under loss mechanisms.
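The purification of Eqs (66)-(67) can be verified directly: starting from a random single-qubit mixed state, one builds the purified state and checks that tracing out the purification space returns the original density matrix. The sketch below is illustrative; representing the purification basis {|e1⟩, |e2⟩} by the computational basis is our own choice, not part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(4)

# random single-qubit mixed state (environment)
G = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = G @ G.conj().T
rho /= np.trace(rho)

r00, r11 = rho[0, 0].real, rho[1, 1].real
r10 = rho[1, 0]

# purification of Eq (66): [sqrt(r00)|0> + (r10/sqrt(r00))|1>]|e1> + sqrt(r11 - |r10|^2/r00)|1>|e2>
psi = np.zeros((2, 2), dtype=complex)        # axes: environment, purification ancilla
psi[0, 0] = np.sqrt(r00)
psi[1, 0] = r10 / np.sqrt(r00)
psi[1, 1] = np.sqrt(r11 - abs(r10) ** 2 / r00)

assert np.isclose(np.linalg.norm(psi), 1.0)  # the purified state is normalized
rho_back = psi @ psi.conj().T                # partial trace over the purification ancilla
assert np.allclose(rho_back, rho)
print("Tr_e |Psi><Psi| reproduces the environment density matrix")
```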

Analysis of implementation in quantum technologies

An interesting result obtained in this manuscript is that, in most of the cases, for the considered quantum reinforcement learning protocols, adding more registers improves the rewarding process. That is, via a purely unitary evolution, without coherent feedback, a maximally positively-correlated agent-environment state is achieved, in the sense that the final agent contains the same quantum information as the considered final environment. This means that the agent has acquired the needed information about the environment and has accordingly modified it, this being a quantum process. In our formalism, typically, one measurement at the end of the protocol is enough to obtain maximal learning fidelity in one iteration of the process. In this sense, several quantum architectures could benefit from this fact, given that coherent feedback is not needed. For instance, we focus our attention on two prominent platforms, namely, trapped ions and superconducting circuits.

Trapped ions

As we have pointed out throughout the manuscript, the performance of our proposed quantum protocols relies on the quality of the quantum gates between the different subsystems. In this sense, the realization of high-fidelity quantum gates is essential to perform the quantum protocol proposed here. Technological progress in trapped ions has enabled the implementation of single- [49] and two-qubit quantum gates [50] with high fidelity. For single-qubit gates, e.g., a Beryllium hyperfine transition can be driven with microwave fields or lasers, with the error associated with single-qubit gates below $10^{-4}$. For two-qubit gates, the use of either microwaves or a laser beam with modulated amplitude allows for the interaction of both qubits (electronic levels of, e.g., Beryllium or Calcium ions) at the same time. Adiabatic elimination of the motion allows one to obtain maximally entangled states of both ions. The fidelity of trapped-ion two-qubit gates can nowadays exceed 99.9% [51, 52]. Trapped-ion technologies offer long coherence times, which can reach the range of seconds [53] for Calcium ions. In addition, this platform enables state preparation and readout with high fidelity [39, 54, 55]. Here, the use of hyperfine states and microwave fields improves the optical pumping fidelity and the relaxation time T1, allowing one to obtain readout fidelities of 99.9999% [54].

Superconducting circuits

As in trapped ions, technological progress in superconducting circuits has been significant in recent years. For instance, artificial atoms whose coherence times are in the microsecond range have been built in coplanar [43] and 3D architectures [44]. On the other hand, integrated Josephson quantum processors allow one to implement quantum gates between two-level systems even in cases where the qubits do not have identical frequencies, as well as making them interact via a quantum bus [56]. Xmon qubits achieve two-qubit gate fidelities above 99% [41, 42]. These technological advances have enabled feedback-loop control in this platform. This feedback protocol relies on high-fidelity readout, as well as on conditional control on the outcome of a quantum non-demolition measurement [45, 46]. Even though coherent feedback is not required in the quantum reinforcement learning protocols of this paper, it may be a useful ingredient in other quantum reinforcement learning proposals [23].

Discussion

In summary, we propose a protocol to perform quantum reinforcement learning which does not require coherent feedback and, therefore, may be implemented in a variety of quantum technologies. Our learning protocol, being mostly unitary (except for the final register measurement), considers learning in a loose sense: while it does not depend on feedback, the protocol achieves its aim regardless of the initial state of agent and environment. In this aspect, it is general, and achieves a similar goal to Ref. [23] without the need for feedback, enabling its implementation in a variety of quantum platforms. We also point out that one may employ performance measures different from the one considered here, depending on the agent's possible aims. By adding more registers than in previous proposals in the literature [23], the rewarding criterion can be applied at the end of the protocol, while agent and environment need not be measured directly, but only via the registers. We also obtain that, when the considered systems are composed of qudits, the number of steps needed to obtain maximal learning fidelity is fixed, independently of the qudit dimension, and scales polynomially with the number of qudit subsystems. We consider as well environment states which are mixtures, and the agent can also in this case acquire the appropriate information from them. Theoretically, all the considered cases of qubit, multiqubit, qudit, and multiqudit systems have many similarities. Even though the protocols are not directly transformable into one another, a d-dimensional qudit can be rewritten as a log2(d) multiqubit system (for d a power of two), while a multiqudit system with n qudits is equivalent to an n log2(d) multiqubit system. Therefore, in this respect, it is intuitive that the results for all these protocols (namely, that maximal fidelity can be attained) should be related. Nevertheless, it is valuable to show that the protocol can be scaled up to multiqudit systems with many parties and high dimensions, given that this will be an ultimate goal of a scalable quantum device. Implementations of these protocols in trapped ions and superconducting circuits seem feasible with current platforms.

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

We acknowledge support from CEDENNA basal grant No. FB0807 and Dirección de Postgrado USACH (FAC-L), FONDECYT under grant No. 1140194 (JCR), Spanish MINECO/FEDER FIS2015-69983-P and Basque Government IT986-16 (LL and ES), and Ramón y Cajal Grant RYC-2012-11391 (LL).

References

  • 1. Michalski RS, Carbonell JG, Mitchell TM. Machine learning: An artificial intelligence approach. Springer Science & Business Media; 2013. [Google Scholar]
  • 2. Plamondon R, Srihari SN. Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22(1):63–84. 10.1109/34.824821 [DOI] [Google Scholar]
  • 3.Lee KF, Hon HW, Hwang MY, Mahajan S, Reddy R. The SPHINX speech recognition system. In: International Conference on Acoustics, Speech, and Signal Processing,; 1989. p. 445–448 vol.1.
  • 4. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529(7587):484–489. 10.1038/nature16961 [DOI] [PubMed] [Google Scholar]
  • 5. Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach (International Edition). Pearson US Imports & PHIPEs; 2002. [Google Scholar]
  • 6. Sutton RS, Barto AG. Reinforcement learning: An introduction. vol. 1 MIT press Cambridge; 1998. [Google Scholar]
  • 7. Wittek P. Quantum machine learning: what quantum computing means to data mining. Academic Press; 2014. [Google Scholar]
  • 8. Schuld M, Sinayskiy I, Petruccione F. An introduction to quantum machine learning. Contemporary Physics. 2015;56(2):172–185. 10.1080/00107514.2014.964942 [DOI] [Google Scholar]
  • 9.Adcock J, Allen E, Day M, Frick S, Hinchliff J, Johnson M, et al. Advances in quantum machine learning. arXiv preprint arXiv:151202900. 2015;.
  • 10. Biamonte J, Wittek P, Pancotti N, Rebentrost P, Wiebe N, Lloyd S. Quantum Machine Learning. Nature. 2017; 549, 195–202. 10.1038/nature23474 [DOI] [PubMed] [Google Scholar]
  • 11. Dunjko V, Briegel HJ. Machine learning & artificial intelligence in the quantum domain. Rep. Prog. Phys. 2018;81:074001 10.1088/1361-6633/aab406 [DOI] [PubMed] [Google Scholar]
  • 12.Bonner R, Freivalds R. A survey of quantum learning. Quantum Computation and Learning. 2003; p. 106.
  • 13. Aïmeur E, Brassard G, Gambs S. Quantum speed-up for unsupervised learning. Machine Learning. 2013;90(2):261–287. 10.1007/s10994-012-5316-5 [DOI] [Google Scholar]
  • 14.Lloyd S, Mohseni M, Rebentrost P. Quantum algorithms for supervised and unsupervised machine learning. arXiv preprint arXiv:13070411. 2013;.
  • 15. Rebentrost P, Mohseni M, Lloyd S. Quantum Support Vector Machine for Big Data Classification. Phys Rev Lett. 2014;113:130503 10.1103/PhysRevLett.113.130503 [DOI] [PubMed] [Google Scholar]
  • 16. Alvarez-Rodriguez U, Lamata L, Escandell-Montero P, Martín-Guerrero JD, Solano E. Supervised Quantum Learning without Measurements. Scientific Reports. 2017;7(1):13645 10.1038/s41598-017-13378-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Cai XD, Wu D, Su ZE, Chen MC, Wang XL, Li L, et al. Entanglement-Based Machine Learning on a Quantum Computer. Phys Rev Lett. 2015;114:110504 10.1103/PhysRevLett.114.110504 [DOI] [PubMed] [Google Scholar]
  • 18. Li Z, Liu X, Xu N, Du J. Experimental Realization of a Quantum Support Vector Machine. Phys Rev Lett. 2015;114:140504 10.1103/PhysRevLett.114.140504 [DOI] [PubMed] [Google Scholar]
  • 19. Dong D, Chen C, Li H, Tarn TJ. Quantum Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2008;38(5):1207–1220. 10.1109/TSMCB.2008.925743 [DOI] [PubMed] [Google Scholar]
  • 20. Paparo GD, Dunjko V, Makmal A, Martin-Delgado MA, Briegel HJ. Quantum Speedup for Active Learning Agents. Phys Rev X. 2014;4:031002 10.1103/PhysRevX.4.031002 [DOI] [Google Scholar]
  • 21. Dunjko V, Taylor JM, Briegel HJ. Quantum-Enhanced Machine Learning. Phys Rev Lett. 2016;117:130501 10.1103/PhysRevLett.117.130501 [DOI] [PubMed] [Google Scholar]
  • 22.Crawford D, Levit A, Ghadermarzy N, Oberoi JS, Ronagh P. Reinforcement Learning Using Quantum Boltzmann Machines. arXiv preprint arXiv:161205695. 2016;.
  • 23. Lamata L. Basic protocols in quantum reinforcement learning with superconducting circuits. Scientific Reports. 2017;7:1609 10.1038/s41598-017-01711-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Friis N, Melnikov AA, Kirchmair G, and Briegel HJ. Coherent controlization using superconducting qubits. Scientific Reports. 2015;5:18036 10.1038/srep18036 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Dunjko V, Friis N, and H. J. Briegel Quantum-enhanced deliberation of learning agents using trapped ions New J. Phys. 2015;17:023006. [Google Scholar]
  • 26.T. Sriarunothai et al., Speeding-up the decision making of a learning agent using an ion trap quantum processor arXiv:1709.01366.
  • 27. Pfeiffer P, Egusquiza IL, Di Ventra M, Sanz M, Solano E. Quantum memristors. Scientific Reports. 2016;6:29507 EP –. 10.1038/srep29507 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Salmilehto J, Deppe F, Di Ventra M, Sanz M, Solano E. Quantum Memristors with Superconducting Circuits. Scientific Reports. 2017;7:42044 EP –. 10.1038/srep42044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Sanz M, Lamata L, Solano E. Invited article: Quantum memristors in quantum photonics. APL Photonics. 2018;3:080801 10.1063/1.5036596 [DOI] [Google Scholar]
  • 30. Shevchenko SN, Pershin YV, Nori F. Qubit-Based Memcapacitors and Meminductors. Phys Rev Applied. 2016;6:014006 10.1103/PhysRevApplied.6.014006 [DOI] [Google Scholar]
  • 31. Benedetti M, Realpe-Gómez J, Perdomo-Ortiz A. Quantum-assisted Helmholtz machines: A quantum-classical deep learning framework for industrial datasets in near-term devices. Quant. Sci. Tech. 2018;3:034007 10.1088/2058-9565/aabd98 [DOI] [Google Scholar]
  • 32. Benedetti M, Realpe-Gómez J, Biswas R, Perdomo-Ortiz A. Estimation of effective temperatures in quantum annealers for sampling applications: A case study with possible applications in deep learning. Phys Rev A. 2016;94:022308 10.1103/PhysRevA.94.022308 [DOI] [Google Scholar]
  • 33.Perdomo-Ortiz A, Benedetti M, Realpe-Gómez J, Biswas R. Opportunities and challenges for quantum-assisted machine learning in near-term quantum computers. arXiv preprint arXiv:170809757. 2017;.
  • 34. Leibfried D, Blatt R, Monroe C, Wineland D. Quantum dynamics of single trapped ions. Rev Mod Phys. 2003;75:281–324. 10.1103/RevModPhys.75.281 [DOI] [Google Scholar]
  • 35. Haffner H, Roos CF, Blatt R. Quantum computing with trapped ions. Physics Reports. 2008;469(4):155–203. 10.1016/j.physrep.2008.09.003 [DOI] [Google Scholar]
  • 36. Blais A, Gambetta J, Wallraff A, Schuster DI, Girvin SM, Devoret MH, et al. Quantum-information processing with circuit quantum electrodynamics. Phys Rev A. 2007;75:032329 10.1103/PhysRevA.75.032329 [DOI] [Google Scholar]
  • 37. Clarke J, Wilhelm FK. Superconducting quantum bits. Nature. 2008;453(7198):1031–1042. 10.1038/nature07128 [DOI] [PubMed] [Google Scholar]
  • 38. Wendin G. Quantum information processing with superconducting circuits: a review. Rep Prog Phys. 2017; 80:106001 10.1088/1361-6633/aa7e1a [DOI] [PubMed] [Google Scholar]
  • 39. Harty TP, Allcock DTC, Ballance CJ, Guidoni L, Janacek HA, Linke NM, et al. High-Fidelity Preparation, Gates, Memory, and Readout of a Trapped-Ion Quantum Bit. Phys Rev Lett. 2014;113:220501 10.1103/PhysRevLett.113.220501 [DOI] [PubMed] [Google Scholar]
  • 40. Ballance CJ, Harty TP, Linke NM, Sepiol MA, Lucas DM. High-Fidelity Quantum Logic Gates Using Trapped-Ion Hyperfine Qubits. Phys Rev Lett. 2016;117:060504 10.1103/PhysRevLett.117.060504 [DOI] [PubMed] [Google Scholar]
  • 41. Barends R, Kelly J, Megrant A, Veitia A, Sank D, Jeffrey E, et al. Superconducting quantum circuits at the surface code threshold for fault tolerance. Nature. 2014;508(7497):500–503. 10.1038/nature13171 [DOI] [PubMed] [Google Scholar]
  • 42. Barends R, Shabani A, Lamata L, Kelly J, Mezzacapo A, Heras UL, et al. Digitized adiabatic quantum computing with a superconducting circuit. Nature. 2016;534(7606):222–226. 10.1038/nature17658 [DOI] [PubMed] [Google Scholar]
  • 43. Barends R, Kelly J, Megrant A, Sank D, Jeffrey E, Chen Y, et al. Coherent Josephson Qubit Suitable for Scalable Quantum Integrated Circuits. Phys Rev Lett. 2013;111:080502 10.1103/PhysRevLett.111.080502 [DOI] [PubMed] [Google Scholar]
  • 44. Paik H, Schuster DI, Bishop LS, Kirchmair G, Catelani G, Sears AP, et al. Observation of High Coherence in Josephson Junction Qubits Measured in a Three-Dimensional Circuit QED Architecture. Phys Rev Lett. 2011;107:240501 10.1103/PhysRevLett.107.240501 [DOI] [PubMed] [Google Scholar]
  • 45. Ristè D, Bultink CC, Lehnert KW, DiCarlo L. Feedback Control of a Solid-State Qubit Using High-Fidelity Projective Measurement. Phys Rev Lett. 2012;109:240502 10.1103/PhysRevLett.109.240502 [DOI] [PubMed] [Google Scholar]
  • 46.Ristè D, DiCarlo L. Digital feedback in superconducting quantum circuits. arXiv preprint arXiv:150801385. 2015;.
  • 47. Koch J, Yu TM, Gambetta J, Houck AA, Schuster DI, Majer J, et al. Charge-insensitive qubit design derived from the Cooper pair box. Phys Rev A. 2007;76:042319 10.1103/PhysRevA.76.042319 [DOI] [Google Scholar]
  • 48.Alber G, Delgado A, Gisin N, Jex I. Generalized quantum XOR-gate for quantum teleportation and state purification in arbitrary dimensional Hilbert spaces. arXiv preprint quant-ph/0008022. 2000;.
  • 49. Brown KR, Wilson AC, Colombe Y, Ospelkaus C, Meier AM, Knill E, et al. Single-qubit-gate error below 10−4 in a trapped ion. Phys Rev A. 2011;84:030303 10.1103/PhysRevA.84.030303 [DOI] [Google Scholar]
  • 50. Benhelm J, Kirchmair G, Roos CF, Blatt R. Towards fault-tolerant quantum computing with trapped ions. Nat Phys. 2008;4(6):463–466. 10.1038/nphys961 [DOI] [Google Scholar]
  • 51. Gaebler JP, Tan TR, Lin Y, Wan Y, Bowler R, Keith AC, et al. High-Fidelity Universal Gate Set for 9Be+ Ion Qubits. Phys Rev Lett. 2016;117:060505 10.1103/PhysRevLett.117.060505 [DOI] [PubMed] [Google Scholar]
  • 52. Harty TP, Sepiol MA, Allcock DTC, Ballance CJ, Tarlton JE, Lucas DM. High-Fidelity Trapped-Ion Quantum Logic Using Near-Field Microwaves. Phys Rev Lett. 2016;117:140501 10.1103/PhysRevLett.117.140501 [DOI] [PubMed] [Google Scholar]
  • 53. Langer C, Ozeri R, Jost JD, Chiaverini J, DeMarco B, Ben-Kish A, et al. Long-Lived Qubit Memory Using Atomic Ions. Phys Rev Lett. 2005;95:060502 10.1103/PhysRevLett.95.060502 [DOI] [PubMed] [Google Scholar]
  • 54. Myerson AH, Szwer DJ, Webster SC, Allcock DTC, Curtis MJ, Imreh G, et al. High-Fidelity Readout of Trapped-Ion Qubits. Phys Rev Lett. 2008;100:200502 10.1103/PhysRevLett.100.200502 [DOI] [PubMed] [Google Scholar]
  • 55. Noek R, Vrijsen G, Gaultney D, Mount E, Kim T, Maunz P, et al. High speed, high fidelity detection of an atomic hyperfine qubit. Opt Lett. 2013;38(22):4735–4738. 10.1364/OL.38.004735 [DOI] [PubMed] [Google Scholar]
  • 56. Blais A, Huang RS, Wallraff A, Girvin SM, Schoelkopf RJ. Cavity quantum electrodynamics for superconducting electrical circuits: An architecture for quantum computation. Phys Rev A. 2004;69:062320 10.1103/PhysRevA.69.062320 [DOI] [Google Scholar]
