Lecture 23-24 (Wed April 13, Mon April 18): X-armed bandits, Monte Carlo Tree Search Lecture Slides

Figures in these slides were copied from Munos, 2014 (reference below). Main Reference
Remi Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. 2014. <hal-00747575v5> [pdf]

Other related references Bandit based Monte-Carlo Planning, by Levente Kocsis, Csaba Szepesvári, ECML 2006 [pdf]

UCT algorithm: first optimistic (UCB like) approach to Monte Carlo Tree Search

Modification of UCT with Patternsin Monte-Carlo Go.
Sylvain Gelly, Yizao Wang, Remi Munos, Olivier Teytaud.[Research Report] 2006. <inria-00117266v1> [pdf]

Application of Optimistic tree-search algorithm UCT to Computer Go

Lecture 21-22 (Wed April 6, Mon April 11): UCRL2 algorithm. Lecture Slides (Complete slide deck for two lectures: April 6 and 11, includes algorithm description and regret analysis) Lecture notes part I (April 6) Algorithm description Lecture notes part II (April 11) Regret analysis

Lecture 23-24 (Wed April 13, Mon April 18): X-armed bandits, Monte Carlo Tree SearchLecture SlidesFigures in these slides were copied from Munos, 2014 (reference below).

Main Reference

Remi Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. 2014. <hal-00747575v5> [pdf]

Other related references

Bandit based Monte-Carlo Planning, by Levente Kocsis, Csaba Szepesvári, ECML 2006 [pdf]Modification of UCT with Patternsin Monte-Carlo Go.Sylvain Gelly, Yizao Wang, Remi Munos, Olivier Teytaud.[Research Report] 2006. <inria-00117266v1> [pdf]

GoLecture 21-22 (Wed April 6, Mon April 11): UCRL2 algorithm.Lecture Slides (Complete slide deck for two lectures: April 6 and 11, includes algorithm description and regret analysis)

Lecture notes part I (April 6) Algorithm description

Lecture notes part II (April 11) Regret analysis

Main Reference

Near-optimal Regret Bounds for Reinforcement Learning [pdf]

Other closely related references

Optimal Adaptive Policies for Markov Decision Processes

Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

Lecture 20 (Mon April 4): Exploration-exploitation in MDP/reinforcement learning.Lecture NotesLecture 19 (Wed Mar 30): Bayesian bandit problem and Introduction to MDP.Lecture notesReferences

J. C. Gittins Bandit Processes and Dynamic Allocation Indices 1979

Useful textbooks on MDP

Markov Decision Processes: Discrete Stochastic Dynamic Programming by Martin L. PutermanNeuro-Dynamic Programming byDimitri P. BertsekasandJohn TsitsiklisLecture 18 (Mon Mar 28): Proof of Gittins Index theorem Part 2Lecture NotesHand-written notes

References

J. N. Tsitsiklis A short proof to Gittins Index Theorem 1994

Lecture 17 (Wed Mar 23): Proof of Gittins Index theorem Part 1Lecture notesProof of Markovian stationary optimal policy (Lecture slides)

Lecture 16 (Mon Mar 21): Bandits with Markovian inputs, Gittins Index theoremLecture notesLecture 14, 15 (Mon Mar 7, Wed Mar 9): Bandits with constraints and concave objectiveLecture notes Part I (Bandits with Knapsacks)

Lecture notes Part II (Bandits with global concave objective)

Preliminary Lecture Notes (slides)

References

S. Agrawal, N. R. Devanur, "Bandits with concave rewards and convex knapsacks". EC 2014

- bandits with knapsacks, bandits with convex constraints, bandits with global concave objective

S. Agrawal, N. R. Devanur, "Fast algorithms for online stochastic convex programming".Lecture 12, 13 (Mon Feb 29, Wed Mar 2): Dynamic PricingGuest Lecture by Alex SlivkinsLecture Slides Part ILecture Slides Part IILecture 10, 11 (Mon Feb 22, Wed Feb 24): Contextual banditsGuest Lecture by Alekh AgarwalLecture Notes part I

Lecture Notes part II

References

Lecture 8, 9(Mon Feb 15, Wed Feb 17): Linear banditsLecture notes part I

Lecture Notes Part II (Last edited on Mar 24, 2016)

References

(Adversarial)

(Stochastic)

Lecture 6,7 (Mon Feb 8, Wed Feb 10, 2016): Adversarial banditsLecture Notes

References

Lecture 5 (Wednesday Feb 3, 2016): Thompson SamplingLecture notes

Lecture 4 (Monday Feb 1, 2016): Thompson SamplingLecture Notes

References

Lecture 3 (Wednesday Jan 27, 2016) : UCB algorithm, lower boundsLecture notes (Last edited on 1/31/2016)

- includes UCB algorithm and regret analysis

These lecture notes from Robert Kleinberg's class at Cornell contain a proof of problem-independent lower bound discussed in class.References:

Lecture 2 (Monday Jan 25, 2016) : UCB algorithm, lower boundsLecture Notes (Last edited on 1/31/2016)

References:

Lecture 1 (Wednesday Jan 20, 2016):IntroductionSlides