Lecture 23-24 (Wed April 13, Mon April 18): X-armed bandits, Monte Carlo Tree Search
Lecture Slides

Figures in these slides were copied from Munos, 2014 (reference below).
Main Reference
Remi Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. 2014. <hal-00747575v5> [pdf]

Other related references
Bandit based Monte-Carlo Planning, by Levente Kocsis, Csaba Szepesvári, ECML 2006 [pdf]
  • UCT algorithm: the first optimistic (UCB-like) approach to Monte Carlo Tree Search
Modification of UCT with Patterns in Monte-Carlo Go.
Sylvain Gelly, Yizao Wang, Remi Munos, Olivier Teytaud. [Research Report] 2006. <inria-00117266v1> [pdf]
  • Application of the optimistic tree-search algorithm UCT to computer Go
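UCT applies the UCB1 index at every node of the search tree when descending toward a leaf to simulate. A minimal sketch of the per-node selection rule (function name and interface are mine, not from the paper):

```python
import math

def uct_select(counts, values, total, c=math.sqrt(2)):
    """Pick the child maximizing mean value + exploration bonus (UCB1 rule).

    counts[i]: visit count of child i, values[i]: sum of rewards of child i,
    total: visit count of the parent node. Unvisited children are tried first.
    The constant c is a tunable exploration parameter.
    """
    best, best_score = None, -float("inf")
    for i in range(len(counts)):
        if counts[i] == 0:
            return i  # always expand unvisited children first
        score = values[i] / counts[i] + c * math.sqrt(math.log(total) / counts[i])
        if score > best_score:
            best, best_score = i, score
    return best
```

In a full MCTS loop this rule is applied recursively from the root, followed by a rollout and backup of the simulated reward along the visited path.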

Lecture 21-22 (Wed April 6, Mon April 11): UCRL2 algorithm.
Lecture Slides (Complete slide deck for two lectures: April 6 and 11, includes algorithm description and regret analysis)
Lecture notes part I (April 6) Algorithm description
Lecture notes part II (April 11) Regret analysis

Main Reference
Near-optimal Regret Bounds for Reinforcement Learning, by Thomas Jaksch, Ronald Ortner, Peter Auer [pdf]
Other closely related references
Optimal Adaptive Policies for Markov Decision Processes
Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
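UCRL2 maintains a confidence set of plausible MDPs around the empirical reward and transition estimates and acts optimistically within that set. A sketch of the per-(s, a) confidence radii, with constants as stated in Jaksch et al.; treat the exact constants as assumptions to verify against the paper:

```python
import math

def ucrl2_radii(n_visits, S, A, t, delta):
    """Confidence radii used by UCRL2 for a single (s, a) pair.

    n_visits: times (s, a) was played before the current episode,
    S, A: numbers of states and actions, t: current time step,
    delta: confidence parameter. Returns (reward radius, L1 radius
    for the transition probability vector).
    """
    n = max(1, n_visits)
    r_rad = math.sqrt(7 * math.log(2 * S * A * t / delta) / (2 * n))
    p_rad = math.sqrt(14 * S * math.log(2 * A * t / delta) / n)
    return r_rad, p_rad
```

The optimistic MDP is then found by extended value iteration over all models whose rewards and transitions lie within these radii of the empirical estimates.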

Lecture 20 (Mon April 4): Exploration-exploitation in MDP/reinforcement learning.
Lecture Notes

Lecture 19 (Wed Mar 30): Bayesian bandit problem and introduction to MDPs.
Lecture notes
J. C. Gittins, Bandit Processes and Dynamic Allocation Indices, 1979

Useful textbooks on MDPs
Markov Decision Processes: Discrete Stochastic Dynamic Programming by Martin L. Puterman
  • Chapter 3 has many interesting examples of MDPs. The inventory control example discussed in class is from Section 3.2.
Neuro-Dynamic Programming by Dimitri P. Bertsekas and John Tsitsiklis
  • Section 2.4 has many interesting examples. The Tetris game example discussed in class is Example 2.3 page 50.
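Both books center on dynamic programming for MDPs. As a toy illustration of the kind of computation involved, a minimal value-iteration sketch; the interface and the example MDP below are mine, not from either book:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite discounted MDP.

    P: array of shape (A, S, S), P[a, s, s'] = transition probability.
    R: array of shape (S, A), expected immediate reward.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] V[s']
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

For example, in a two-state MDP where action 0 stays put, action 1 swaps states, and only state 1 pays reward 1, the optimal policy moves to state 1 and stays there.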

Lecture 18 (Mon Mar 28): Proof of Gittins Index theorem Part 2
Lecture Notes
Hand-written notes
J. N. Tsitsiklis, A Short Proof of the Gittins Index Theorem, 1994

Lecture 17 (Wed Mar 23): Proof of Gittins Index theorem Part 1
Lecture notes
Proof of Markovian stationary optimal policy (Lecture slides)

Lecture 16 (Mon Mar 21): Bandits with Markovian inputs, Gittins Index theorem
Lecture notes
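For reference, one standard statement of the Gittins index of an arm in state s, under discount factor beta (notation mine):

```latex
\nu(s) \;=\; \sup_{\tau \ge 1}
\frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \beta^{t} r(s_t) \,\middle|\, s_0 = s\right]}
     {\mathbb{E}\left[\sum_{t=0}^{\tau-1} \beta^{t} \,\middle|\, s_0 = s\right]}
```

where the supremum is over stopping times. The Gittins index theorem states that always playing an arm with the largest current index is optimal for the discounted bandit problem with independent Markovian arms.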

Lecture 14, 15 (Mon Mar 7, Wed Mar 9): Bandits with constraints and concave objective
Lecture notes Part I (Bandits with Knapsacks)
Lecture notes Part II (Bandits with global concave objective)
Preliminary Lecture Notes (slides)
S. Agrawal, N. R. Devanur, "Bandits with concave rewards and convex knapsacks". EC 2014
  • bandits with knapsacks, bandits with convex constraints, bandits with global concave objective
S. Agrawal, N. R. Devanur, "Fast algorithms for online stochastic convex programming". SODA 2015
  • studies the full-information setting, where feedback is revealed before the decision is made

Lecture 12, 13 (Mon Feb 29, Wed Mar 2): Dynamic Pricing
Guest Lecture by Alex Slivkins

Lecture Slides Part I
Lecture Slides Part II

Lecture 10, 11 (Mon Feb 22, Wed Feb 24): Contextual bandits
Guest Lecture by Alekh Agarwal

Lecture Notes part I
Lecture Notes part II


Lecture 8, 9 (Mon Feb 15, Wed Feb 17): Linear bandits

Lecture notes part I

Lecture Notes Part II (Last edited on Mar 24, 2016)
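A common optimistic approach to linear bandits (the LinUCB/OFUL family) keeps a ridge-regression estimate of the unknown parameter and adds an ellipsoidal exploration bonus. A hedged sketch; the interface, constant alpha, and variable names are my own, not from the notes:

```python
import numpy as np

def linucb_choose(arms, A, b, alpha=1.0):
    """One step of a LinUCB-style selection rule (sketch).

    arms: list of feature vectors x in R^d,
    A: d x d design matrix (lambda * I + sum of x x^T over past rounds),
    b: sum of reward-weighted features over past rounds.
    Picks the arm maximizing theta_hat . x + alpha * ||x||_{A^{-1}}.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge-regression estimate of the parameter
    scores = [float(x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x))
              for x in arms]
    return int(np.argmax(scores))
```

After observing reward r for the chosen arm x, one would update A += outer(x, x) and b += r * x before the next round.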



Lecture 6, 7 (Mon Feb 8, Wed Feb 10, 2016): Adversarial bandits
Lecture Notes
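The canonical algorithm for adversarial bandits is Exp3: exponential weights driven by importance-weighted reward estimates. A minimal one-round sketch; the exact constants and the mixing term vary across write-ups, so treat the details below as one common variant rather than the notes' exact formulation:

```python
import math
import random

def exp3_step(weights, gamma, reward_fn):
    """One round of Exp3 for adversarial bandits (sketch).

    weights: current arm weights (mutated in place),
    gamma: exploration rate in (0, 1],
    reward_fn(i): returns the observed reward in [0, 1] of the pulled arm i.
    Returns the pulled arm and the sampling distribution used.
    """
    K = len(weights)
    total = sum(weights)
    # mix the exponential-weights distribution with uniform exploration
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    i = random.choices(range(K), weights=probs)[0]
    x_hat = reward_fn(i) / probs[i]            # importance-weighted estimate
    weights[i] *= math.exp(gamma * x_hat / K)  # exponential-weights update
    return i, probs
```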


Lecture 5 (Wednesday Feb 3, 2016): Thompson Sampling
Lecture notes
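For Bernoulli rewards with independent Beta(1, 1) priors, one round of Thompson Sampling reduces to sampling a mean from each posterior and playing the argmax. A minimal sketch (interface mine):

```python
import random

def thompson_bernoulli(successes, failures):
    """One round of Thompson Sampling for Bernoulli bandits (sketch).

    successes[i], failures[i]: observed outcomes of arm i. Under a
    Beta(1, 1) prior the posterior of arm i is Beta(s_i + 1, f_i + 1);
    sample a mean from each posterior and play the arm with the
    largest sample.
    """
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

After observing the chosen arm's reward, one simply increments its success or failure count; the randomness of the posterior samples supplies the exploration.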

Lecture 4 (Monday Feb 1, 2016): Thompson Sampling
Lecture Notes


Lecture 3 (Wednesday Jan 27, 2016): UCB algorithm, lower bounds
Lecture notes (Last edited on 1/31/2016)
  • includes UCB algorithm and regret analysis
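As a companion to the notes, a minimal sketch of the UCB1 selection rule with the usual sqrt(2 ln t / n) bonus; the constant in the bonus differs across treatments, so take it as an assumption:

```python
import math

def ucb1_choose(counts, means, t):
    """UCB1 arm selection (sketch).

    counts[i]: pulls of arm i so far, means[i]: empirical mean reward of
    arm i, t: total pulls so far. Unplayed arms are chosen first;
    otherwise play the arm maximizing mean + sqrt(2 ln t / count).
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i  # play each arm once before using the index
    ucb = [m + math.sqrt(2 * math.log(t) / n)
           for m, n in zip(means, counts)]
    return max(range(len(ucb)), key=lambda i: ucb[i])
```

Note how a rarely played arm can overtake a better-looking one purely through its exploration bonus, which is exactly the mechanism the regret analysis controls.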
These lecture notes from Robert Kleinberg's class at Cornell contain a proof of the problem-independent lower bound discussed in class.


Lecture 2 (Monday Jan 25, 2016) : UCB algorithm, lower bounds

Lecture Notes (Last edited on 1/31/2016)


Lecture 1 (Wednesday Jan 20, 2016): Introduction