The Bernoulli Generalized Likelihood Ratio test (BGLR) for Non-Stationary Multi-Armed Bandits
Research Seminar at PANAMA, IRISA lab, Rennes
Lilian Besson, PhD student, SCEE team, IETR laboratory, CentraleSupélec in Rennes & SequeL team, CRIStAL laboratory, Inria in Lille
Thursday 6th of June, 2019

Publications associated with this talk
Joint work with my advisor Émilie Kaufmann:

“Analyse non asymptotique d’un test séquentiel de détection de ruptures et application aux bandits non stationnaires” by Lilian Besson & Émilie Kaufmann, to be presented at GRETSI, in Lille (France), in August 2019
perso.crans.org/besson/articles/BK__GRETSI_2019.pdf

“The Generalized Likelihood Ratio Test meets klUCB: an Improved Algorithm for Piece-Wise Non-Stationary Bandits” by Lilian Besson & Émilie Kaufmann
Pre-print on HAL-02006471 and arXiv:1902.01575
Outline of the talk
1. (Stationary) Multi-armed bandits problems
2. Piece-wise stationary multi-armed bandits problems
3. The BGLR test and its finite time properties
4. The BGLR-T + klUCB algorithm
5. Regret analysis
6. Numerical simulations
1. (Stationary) Multi-armed bandits problems
What is a bandit problem?
Multi-armed bandits = sequential decision-making problems in uncertain environments.
Interactive demo: perso.crans.org/besson/phd/MAB_interactive_demo/
Ref: [Bandit Algorithms, Lattimore & Szepesvári, 2019], on tor-lattimore.com/downloads/book/book.pdf
Mathematical model
- Discrete time steps t = 1, . . . , T. The horizon T is fixed and usually unknown.
- At time t, an agent plays the arm A(t) ∈ {1, . . . , K}, then she observes the i.i.d. random reward r(t) ∼ ν_{A(t)}, r(t) ∈ R.
- Usually, we focus on Bernoulli arms ν_k = Bernoulli(µ_k), of mean µ_k ∈ [0, 1], giving binary rewards r(t) ∈ {0, 1}.
- Goal: maximize the sum of rewards $\sum_{t=1}^{T} r(t)$, or maximize the sum of expected rewards $\mathbb{E}\big[\sum_{t=1}^{T} r(t)\big]$.
- Any efficient policy must balance between exploration and exploitation: explore all arms to discover the best one, while exploiting the arms known to be good so far.
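To make this interaction protocol concrete, here is a minimal simulation sketch (not part of the talk; the three arm means and the uniformly random choice of arm are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mus = [0.1, 0.5, 0.9]          # K = 3 Bernoulli arms (illustrative means)
K, T = len(mus), 10_000

rewards = np.zeros(T)
for t in range(T):
    A_t = rng.integers(K)                   # the agent picks an arm A(t) (here: uniformly at random)
    rewards[t] = rng.random() < mus[A_t]    # she observes r(t) ~ Bernoulli(mu_{A(t)})

print("Accumulated reward:", rewards.sum(), "- best achievable in expectation:", max(mus) * T)
```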
Naive solutions
Two examples of bad solutions
i) Pure exploration: play an arm uniformly at random, A(t) ∼ U({1, . . . , K})
⟹ the mean expected reward is only the average of the means:
\[ \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T} r(t)\Big] \;=\; \frac{1}{K}\sum_{k=1}^{K} \mu_k \;<\; \max_k \mu_k \]
ii) Pure exploitation: count the number of samples and the sum of rewards of each arm,
\[ N_k(t) = \sum_{s<t} \mathbb{1}(A(s)=k) \quad\text{and}\quad X_k(t) = \sum_{s<t} r(s)\,\mathbb{1}(A(s)=k), \]
then always play the arm with the best empirical mean so far, A(t) = arg max_k X_k(t)/N_k(t): it can remain stuck forever on a suboptimal arm (a small sketch follows below).
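As a sketch of why pure exploitation is a bad solution (a toy illustration, not from the talk; the arm means are arbitrary), the greedy rule below can lock onto a suboptimal arm forever if its first draws are unlucky:

```python
import numpy as np

def greedy(mus, T, seed=3):
    """Pure exploitation: always play the arm with the best empirical mean so far."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    N = np.ones(K)                                          # N_k(t): number of pulls of arm k
    X = np.array([float(rng.random() < mu) for mu in mus])  # X_k(t): sum of rewards (one initial pull each)
    total = X.sum()
    for t in range(K, T):
        k = int(np.argmax(X / N))            # A(t) = argmax_k X_k(t) / N_k(t)
        r = float(rng.random() < mus[k])     # Bernoulli(mu_k) reward
        N[k] += 1; X[k] += r; total += r
    return total / T

print(greedy([0.1, 0.5, 0.9], T=10_000))     # may stay well below max_k mu_k = 0.9, depending on the first draws
```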
The “Upper Confidence Bound” (UCB) algorithm
- Compute an index for each arm, with an exploration parameter α (> 1/2):
\[ \mathrm{UCB}_k(t) = X_k(t)/N_k(t) + \sqrt{\alpha \log(t)/N_k(t)} \]
- Play the arm of maximal UCB: A(t) = arg max_k UCB_k(t)
⟹ principle of “optimism under uncertainty”
- α balances between exploitation (α → 0) and exploration (α → ∞)
- UCB is efficient: the best arm is identified correctly (with high probability) if there are enough samples (for T large enough)
⟹ the expected reward attains the maximum: for T → ∞,
\[ \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T} r(t)\Big] \to \max_k \mu_k \]
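A minimal, self-contained sketch of this index policy (not the talk’s reference implementation; α = 1 is an arbitrary admissible choice of the exploration parameter):

```python
import numpy as np

def ucb(mus, T, alpha=1.0, seed=0):
    """Run the UCB(alpha) index policy on a Bernoulli bandit with means `mus`."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    N = np.zeros(K)       # N_k(t): number of pulls of arm k
    X = np.zeros(K)       # X_k(t): sum of rewards obtained from arm k
    rewards = np.zeros(T)
    for t in range(T):
        if t < K:
            k = t                                            # pull each arm once to initialize
        else:
            index = X / N + np.sqrt(alpha * np.log(t) / N)   # UCB_k(t)
            k = int(np.argmax(index))                        # optimism: play the largest index
        r = float(rng.random() < mus[k])                     # Bernoulli(mu_k) reward
        N[k] += 1
        X[k] += r
        rewards[t] = r
    return rewards

print(ucb([0.1, 0.5, 0.9], T=10_000).mean())   # approaches max_k mu_k = 0.9 as T grows
```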
Elements of the proof for the UCB algorithm
Elements of proof of convergence (for K Bernoulli arms):
- Suppose the first arm is the best: µ* = µ_1 > µ_2 ≥ . . . ≥ µ_K, and recall the index
\[ \mathrm{UCB}_k(t) = X_k(t)/N_k(t) + \sqrt{\alpha \log(t)/N_k(t)}. \]
- Hoeffding’s inequality gives $\mathbb{P}(\mathrm{UCB}_k(t) < \mu_k) \le O(1/t^{2\alpha})$
⟹ the different UCB_k(t) are true “Upper Confidence Bounds” on the (unknown) µ_k (most of the time); see the short computation below.
- If a suboptimal arm k > 1 is sampled, it implies UCB_k(t) > UCB_1(t) while µ_k < µ_1: Hoeffding’s inequality also proves that any such “wrong ordering” of the UCB_k(t) is unlikely.
- We can prove that each suboptimal arm k (with gap Δ_k = µ* − µ_k > 0) is sampled only o(T) times, hence
\[ \mathbb{E}\Big[\sum_{t=1}^{T} r(t)\Big] \;\underset{T\to\infty}{\longrightarrow}\; \mu^* \times O(T) \;+\; \sum_{k:\Delta_k>0} \mu_k \times o(T). \]
But. . . at which speed do we have this convergence?
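For completeness, here is the one-line computation behind the Hoeffding step, a sketch in which s = N_k(t) is treated as fixed (the full proof takes a union bound over the possible values of s). Hoeffding’s inequality for s i.i.d. rewards in [0, 1] with mean µ_k gives
\[ \mathbb{P}\left( \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log t}{N_k(t)}} < \mu_k \;\middle|\; N_k(t) = s \right) \;\le\; \exp\!\Big( -2 s \cdot \frac{\alpha \log t}{s} \Big) \;=\; \frac{1}{t^{2\alpha}}. \]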
Regret of a bandit algorithm
Measure the performance of an algorithm A by its mean regret R_A(T):
- the difference in accumulated rewards between an “oracle” and A,
- where the “oracle” algorithm always plays the (unknown) best arm k* = arg max_k µ_k (we denote the best mean µ_{k*} = µ*).
Maximizing the sum of expected rewards ⟺ minimizing the regret:
\[ R_A(T) \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} r_{k^*}(t)\Big] - \sum_{t=1}^{T} \mathbb{E}[r(t)] \;=\; T\mu^* - \sum_{t=1}^{T} \mathbb{E}[r(t)]. \]

Typical regime for stationary bandits (lower & upper bounds):
- no algorithm A can obtain a regret better than R_A(T) ≥ Ω(log(T)),
- and an efficient algorithm A obtains R_A(T) ≤ O(log(T)).
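As a quick numerical sanity check of these orders of magnitude (a rough Monte-Carlo estimate, not a result from the talk), one can reuse the `ucb` function sketched earlier and compare its empirical regret T·µ* − Σ_t r(t) with that of the uniformly random policy:

```python
import numpy as np

mus, T, runs = [0.1, 0.5, 0.9], 10_000, 20
mu_star = max(mus)

# Empirical regret R(T) ~ T * mu_star - sum of collected rewards, averaged over independent runs,
# reusing the `ucb(mus, T, alpha, seed)` sketch defined above.
regret_ucb = np.mean([T * mu_star - ucb(mus, T, seed=s).sum() for s in range(runs)])
print("UCB regret     ~", round(regret_ucb))                 # grows like O(log T)
print("Uniform regret ~", T * (mu_star - np.mean(mus)))      # grows linearly with T
```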
Regret of two UCB algorithms
Regret of the UCB algorithm and another algorithm
For any problem with K arms following Bernoulli distributions, of means µ_1, . . . , µ_K ∈ [0, 1], and optimal mean µ*, the UCB algorithm satisfies
\[ R_T^{\mathrm{UCB}} \;\le\; O\Big( \sum_{k \,:\, \mu_k < \mu^*} \frac{\log(T)}{\mu^* - \mu_k} \Big). \]
3. The BGLR test and its finite time properties
Likelihood ratio test for Bernoulli observations
Bernoulli likelihood ratio test
Hypothesis: all distributions are Bernoulli.
The problem boils down to distinguishing
\[ \mathcal{H}_0 : \big(\exists \mu_0 : \forall i \in \mathbb{N}^*,\ X_i \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_0)\big), \]
against the alternative
\[ \mathcal{H}_1 : \big(\exists \mu_0 \neq \mu_1,\ \tau > 1 : X_1, \cdots, X_\tau \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_0) \ \text{and}\ X_{\tau+1}, \cdots \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_1)\big). \]
The Likelihood Ratio statistic for this hypothesis test, after observing X_1, · · · , X_n, is
\[ L(n) \;=\; \frac{\sup_{\mu_0, \mu_1, \tau} \ \ell(X_1, \cdots, X_n \,;\, \mu_0, \mu_1, \tau)}{\sup_{\mu_0} \ \ell(X_1, \cdots, X_n \,;\, \mu_0)}, \]
where ℓ denotes the likelihood of the observations under the corresponding model.
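To make the statistic concrete, here is a small sketch computing log L(n) for Bernoulli observations, using the closed form obtained by plugging the empirical means of the two candidate segments into the likelihood ratio; the detection threshold β(n, δ) analysed in the paper is not reproduced here, so `bernoulli_glr` only returns the statistic:

```python
import numpy as np

def kl_bern(p, q, eps=1e-12):
    """Binary Kullback-Leibler divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bernoulli_glr(X):
    """log L(n) = max_s [ s * kl(mean(X[:s]), mean(X)) + (n - s) * kl(mean(X[s:]), mean(X)) ]."""
    X = np.asarray(X, dtype=float)
    n, mu_all = len(X), X.mean()
    best = 0.0
    for s in range(1, n):                      # candidate change-point position
        mu0, mu1 = X[:s].mean(), X[s:].mean()
        best = max(best, s * kl_bern(mu0, mu_all) + (n - s) * kl_bern(mu1, mu_all))
    return best

# Toy example: a change from mu_0 = 0.2 to mu_1 = 0.8 after 500 observations
rng = np.random.default_rng(1)
X = np.concatenate([rng.random(500) < 0.2, rng.random(500) < 0.8]).astype(float)
print(bernoulli_glr(X))                        # large value: a change is flagged once it exceeds beta(n, delta)
```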