The Bernoulli Generalized Likelihood Ratio test (BGLR) for Non-Stationary Multi-Armed Bandits
Research Seminar at PANAMA, IRISA lab, Rennes
Lilian Besson, PhD student, SCEE team, IETR laboratory, CentraleSupélec in Rennes & SequeL team, CRIStAL laboratory, Inria in Lille
Thursday 6th of June, 2019

Publications associated with this talk
Joint work with my advisor Émilie Kaufmann:

“Analyse non asymptotique d’un test séquentiel de détection de ruptures et application aux bandits non stationnaires” by Lilian Besson & Émilie Kaufmann, to be presented at GRETSI, in Lille (France), in August 2019
perso.crans.org/besson/articles/BK__GRETSI_2019.pdf

“The Generalized Likelihood Ratio Test meets klUCB: an Improved Algorithm for Piece-Wise Non-Stationary Bandits” by Lilian Besson & Émilie Kaufmann
Pre-print on HAL-02006471 and arXiv:1902.01575
Outline of the talk
1. (Stationary) Multi-armed bandits problems
2. Piece-wise stationary multi-armed bandits problems
3. The BGLR test and its finite time properties
4. The BGLR-T + klUCB algorithm
5. Regret analysis
6. Numerical simulations
1. (Stationary) Multi-armed bandits problems
What is a bandit problem?
Multi-armed bandits = sequential decision-making problems in uncertain environments.
Interactive demo: perso.crans.org/besson/phd/MAB_interactive_demo/
Ref: [Bandit Algorithms, Lattimore & Szepesvári, 2019], on tor-lattimore.com/downloads/book/book.pdf
Mathematical model
- Discrete time steps t = 1, . . . , T. The horizon T is fixed and usually unknown.
- At time t, an agent plays the arm A(t) ∈ {1, . . . , K}, then she observes the i.i.d. random reward r(t) ∼ ν_{A(t)}, r(t) ∈ R.
- Usually, we focus on Bernoulli arms ν_k = Bernoulli(µ_k), of mean µ_k ∈ [0, 1], giving binary rewards r(t) ∈ {0, 1}.
- Goal: maximize the sum of rewards $\sum_{t=1}^{T} r(t)$, or maximize the sum of expected rewards $\mathbb{E}\big[\sum_{t=1}^{T} r(t)\big]$.
- Any efficient policy must balance between exploration and exploitation: explore all arms to discover the best one, while exploiting the arms known to be good so far.
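To make this interaction protocol concrete, here is a minimal simulation sketch (not part of the talk; the three arm means and the uniformly random choice of arm are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mus = [0.1, 0.5, 0.9]          # K = 3 Bernoulli arms (illustrative means)
K, T = len(mus), 10_000

rewards = np.zeros(T)
for t in range(T):
    A_t = rng.integers(K)                   # the agent picks an arm A(t) (here: uniformly at random)
    rewards[t] = rng.random() < mus[A_t]    # she observes r(t) ~ Bernoulli(mu_{A(t)})

print("Accumulated reward:", rewards.sum(), "- best achievable in expectation:", max(mus) * T)
```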
Naive solutions
Two examples of bad solutions
i) Pure exploration: play an arm uniformly at random, A(t) ∼ U({1, . . . , K})
⟹ the mean expected reward is only the average of the means:
\[ \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T} r(t)\Big] \;=\; \frac{1}{K}\sum_{k=1}^{K} \mu_k \;<\; \max_k \mu_k \]
ii) Pure exploitation: count the number of samples and the sum of rewards of each arm,
\[ N_k(t) = \sum_{s<t} \mathbb{1}(A(s)=k) \quad\text{and}\quad X_k(t) = \sum_{s<t} r(s)\,\mathbb{1}(A(s)=k), \]
then always play the arm with the best empirical mean so far, A(t) = arg max_k X_k(t)/N_k(t): it can remain stuck forever on a suboptimal arm (a small sketch follows below).
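As a sketch of why pure exploitation is a bad solution (a toy illustration, not from the talk; the arm means are arbitrary), the greedy rule below can lock onto a suboptimal arm forever if its first draws are unlucky:

```python
import numpy as np

def greedy(mus, T, seed=3):
    """Pure exploitation: always play the arm with the best empirical mean so far."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    N = np.ones(K)                                          # N_k(t): number of pulls of arm k
    X = np.array([float(rng.random() < mu) for mu in mus])  # X_k(t): sum of rewards (one initial pull each)
    total = X.sum()
    for t in range(K, T):
        k = int(np.argmax(X / N))            # A(t) = argmax_k X_k(t) / N_k(t)
        r = float(rng.random() < mus[k])     # Bernoulli(mu_k) reward
        N[k] += 1; X[k] += r; total += r
    return total / T

print(greedy([0.1, 0.5, 0.9], T=10_000))     # may stay well below max_k mu_k = 0.9, depending on the first draws
```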
The “Upper Confidence Bound” (UCB) algorithm
- Compute an index for each arm, with an exploration parameter α (> 1/2):
\[ \mathrm{UCB}_k(t) = X_k(t)/N_k(t) + \sqrt{\alpha \log(t)/N_k(t)} \]
- Play the arm of maximal UCB: A(t) = arg max_k UCB_k(t)
⟹ principle of “optimism under uncertainty”
- α balances between exploitation (α → 0) and exploration (α → ∞)
- UCB is efficient: the best arm is identified correctly (with high probability) if there are enough samples (for T large enough)
⟹ the expected reward attains the maximum: for T → ∞,
\[ \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T} r(t)\Big] \to \max_k \mu_k \]
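A minimal, self-contained sketch of this index policy (not the talk’s reference implementation; α = 1 is an arbitrary admissible choice of the exploration parameter):

```python
import numpy as np

def ucb(mus, T, alpha=1.0, seed=0):
    """Run the UCB(alpha) index policy on a Bernoulli bandit with means `mus`."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    N = np.zeros(K)       # N_k(t): number of pulls of arm k
    X = np.zeros(K)       # X_k(t): sum of rewards obtained from arm k
    rewards = np.zeros(T)
    for t in range(T):
        if t < K:
            k = t                                            # pull each arm once to initialize
        else:
            index = X / N + np.sqrt(alpha * np.log(t) / N)   # UCB_k(t)
            k = int(np.argmax(index))                        # optimism: play the largest index
        r = float(rng.random() < mus[k])                     # Bernoulli(mu_k) reward
        N[k] += 1
        X[k] += r
        rewards[t] = r
    return rewards

print(ucb([0.1, 0.5, 0.9], T=10_000).mean())   # approaches max_k mu_k = 0.9 as T grows
```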
Elements of the proof for the UCB algorithm
Elements of proof of convergence (for K Bernoulli arms):
- Suppose the first arm is the best: µ* = µ_1 > µ_2 ≥ . . . ≥ µ_K, and recall the index
\[ \mathrm{UCB}_k(t) = X_k(t)/N_k(t) + \sqrt{\alpha \log(t)/N_k(t)}. \]
- Hoeffding’s inequality gives $\mathbb{P}(\mathrm{UCB}_k(t) < \mu_k) \le O(1/t^{2\alpha})$
⟹ the different UCB_k(t) are true “Upper Confidence Bounds” on the (unknown) µ_k (most of the time); see the short computation below.
- If a suboptimal arm k > 1 is sampled, it implies UCB_k(t) > UCB_1(t) while µ_k < µ_1: Hoeffding’s inequality also proves that any such “wrong ordering” of the UCB_k(t) is unlikely.
- We can prove that each suboptimal arm k (with gap Δ_k = µ* − µ_k > 0) is sampled only o(T) times, hence
\[ \mathbb{E}\Big[\sum_{t=1}^{T} r(t)\Big] \;\underset{T\to\infty}{\longrightarrow}\; \mu^* \times O(T) \;+\; \sum_{k:\Delta_k>0} \mu_k \times o(T). \]
But. . . at which speed do we have this convergence?
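For completeness, here is the one-line computation behind the Hoeffding step, a sketch in which s = N_k(t) is treated as fixed (the full proof takes a union bound over the possible values of s). Hoeffding’s inequality for s i.i.d. rewards in [0, 1] with mean µ_k gives
\[ \mathbb{P}\left( \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log t}{N_k(t)}} < \mu_k \;\middle|\; N_k(t) = s \right) \;\le\; \exp\!\Big( -2 s \cdot \frac{\alpha \log t}{s} \Big) \;=\; \frac{1}{t^{2\alpha}}. \]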
Regret of a bandit algorithm
Measure the performance of an algorithm A by its mean regret R_A(T):
- the difference in accumulated rewards between an “oracle” and A,
- where the “oracle” algorithm always plays the (unknown) best arm k* = arg max_k µ_k (we denote the best mean µ_{k*} = µ*).
Maximizing the sum of expected rewards ⟺ minimizing the regret:
\[ R_A(T) \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} r_{k^*}(t)\Big] - \sum_{t=1}^{T} \mathbb{E}[r(t)] \;=\; T\mu^* - \sum_{t=1}^{T} \mathbb{E}[r(t)]. \]

Typical regime for stationary bandits (lower & upper bounds):
- no algorithm A can obtain a regret better than R_A(T) ≥ Ω(log(T)),
- and an efficient algorithm A obtains R_A(T) ≤ O(log(T)).
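As a quick numerical sanity check of these orders of magnitude (a rough Monte-Carlo estimate, not a result from the talk), one can reuse the `ucb` function sketched earlier and compare its empirical regret T·µ* − Σ_t r(t) with that of the uniformly random policy:

```python
import numpy as np

mus, T, runs = [0.1, 0.5, 0.9], 10_000, 20
mu_star = max(mus)

# Empirical regret R(T) ~ T * mu_star - sum of collected rewards, averaged over independent runs,
# reusing the `ucb(mus, T, alpha, seed)` sketch defined above.
regret_ucb = np.mean([T * mu_star - ucb(mus, T, seed=s).sum() for s in range(runs)])
print("UCB regret     ~", round(regret_ucb))                 # grows like O(log T)
print("Uniform regret ~", T * (mu_star - np.mean(mus)))      # grows linearly with T
```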
Regret of two UCB algorithms
Regret of the UCB algorithm and another algorithm
For any problem with K arms following Bernoulli distributions, of means µ_1, . . . , µ_K ∈ [0, 1], and optimal mean µ*, the UCB algorithm satisfies
\[ R_T^{\mathrm{UCB}} \;\le\; O\Big( \sum_{k \,:\, \mu_k < \mu^*} \frac{\log(T)}{\mu^* - \mu_k} \Big). \]
3. The BGLR test and its finite time properties
Likelihood ratio test for Bernoulli observations
Bernoulli likelihood ratio test
Hypothesis: all distributions are Bernoulli.
The problem boils down to distinguishing
\[ \mathcal{H}_0 : \big(\exists \mu_0 : \forall i \in \mathbb{N}^*,\ X_i \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_0)\big), \]
against the alternative
\[ \mathcal{H}_1 : \big(\exists \mu_0 \neq \mu_1,\ \tau > 1 : X_1, \cdots, X_\tau \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_0) \ \text{and}\ X_{\tau+1}, \cdots \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_1)\big). \]
The Likelihood Ratio statistic for this hypothesis test, after observing X_1, · · · , X_n, is
\[ L(n) \;=\; \frac{\sup_{\mu_0, \mu_1, \tau} \ \ell(X_1, \cdots, X_n \,;\, \mu_0, \mu_1, \tau)}{\sup_{\mu_0} \ \ell(X_1, \cdots, X_n \,;\, \mu_0)}, \]
where ℓ denotes the likelihood of the observations under the corresponding model.
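To make the statistic concrete, here is a small sketch computing log L(n) for Bernoulli observations, using the closed form obtained by plugging the empirical means of the two candidate segments into the likelihood ratio; the detection threshold β(n, δ) analysed in the paper is not reproduced here, so `bernoulli_glr` only returns the statistic:

```python
import numpy as np

def kl_bern(p, q, eps=1e-12):
    """Binary Kullback-Leibler divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bernoulli_glr(X):
    """log L(n) = max_s [ s * kl(mean(X[:s]), mean(X)) + (n - s) * kl(mean(X[s:]), mean(X)) ]."""
    X = np.asarray(X, dtype=float)
    n, mu_all = len(X), X.mean()
    best = 0.0
    for s in range(1, n):                      # candidate change-point position
        mu0, mu1 = X[:s].mean(), X[s:].mean()
        best = max(best, s * kl_bern(mu0, mu_all) + (n - s) * kl_bern(mu1, mu_all))
    return best

# Toy example: a change from mu_0 = 0.2 to mu_1 = 0.8 after 500 observations
rng = np.random.default_rng(1)
X = np.concatenate([rng.random(500) < 0.2, rng.random(500) < 0.8]).astype(float)
print(bernoulli_glr(X))                        # large value: a change is flagged once it exceeds beta(n, delta)
```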