From Algorithms to Live By

  • The value of exploitation can only go up over time.
  • The value of exploration, of finding a new favorite, can only go down over time.
  • Taking the future into account, rather than focusing just on the present, drives us toward novelty.
  • People tend to over-explore—to favor the new disproportionately over the best.

Multi-armed bandit problem

  • For every slot machine we know little or nothing about, there is some guaranteed payout rate which, if offered to us in lieu of that machine, will make us quite content never to pull its handle again. This number—which Gittins called the “dynamic allocation index,” and which the world now knows as the Gittins index—suggests an obvious strategy on the casino floor: always play the arm with the highest index. (A rough way such an index can be computed is sketched after this list.)
  • The Gittins index is optimal only under some strong assumptions. It’s based on geometric discounting of future reward, valuing each pull at a constant fraction of the previous one, which a variety of experiments in behavioral economics and psychology suggest people don’t do.
  • With the future weighted nearly as heavily as the present, the value of making a chance discovery, relative to taking a sure thing, goes up even more.
  • And if there’s a cost to switching among options, the Gittins strategy is no longer optimal either.
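
The book tabulates Gittins index values rather than showing how to compute them. One rough way to do it yourself is a finite-horizon dynamic program with a "retirement" option, binary-searching for the guaranteed payout rate that makes pulling and retiring equally attractive. A minimal sketch; the Beta(1, 1) prior, the discount `beta`, and the `horizon` cutoff are all assumptions of this sketch, not prescriptions from the book:

```python
def gittins_index(wins, losses, beta=0.9, horizon=60, tol=1e-4):
    """Approximate the Gittins index of a Bernoulli arm that has paid
    off `wins` times and failed `losses` times, under a uniform
    Beta(1, 1) prior and geometric discounting by `beta`."""

    def value(w, l, lam, depth, memo):
        # Expected discounted value in state (w wins, l losses) when we
        # may retire at any step onto a sure payout of `lam` per pull.
        if depth == 0:
            return 0.0
        key = (w, l, depth)
        if key not in memo:
            retire = lam * (1 - beta ** depth) / (1 - beta)
            p = (w + 1) / (w + l + 2)  # posterior mean payout rate
            pull = (p * (1 + beta * value(w + 1, l, lam, depth - 1, memo))
                    + (1 - p) * beta * value(w, l + 1, lam, depth - 1, memo))
            memo[key] = max(retire, pull)
        return memo[key]

    # Binary-search for the sure rate that makes us indifferent between
    # the arm and the guarantee: that rate is (approximately) the index.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        retire = lam * (1 - beta ** horizon) / (1 - beta)
        if value(wins, losses, lam, horizon, {}) > retire + 1e-12:
            lo = lam  # the arm still beats this guarantee: index is higher
        else:
            hi = lam
    return (lo + hi) / 2


print(gittins_index(0, 0))  # fresh, untried arm: roughly 0.70
```

Note how a fresh arm's index comes out well above its expected payout rate of 0.5: that premium is the value of what you might learn.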

Restless bandit problem

  • The standard multi-armed bandit problem assumes that the probabilities with which the arms pay off are fixed over time.
  • If the probabilities of a payoff on the different arms change over time—what has been termed a “restless bandit”—the problem becomes much harder. (So much harder, in fact, that there’s no tractable algorithm for completely solving it, and it’s believed there never will be.) Part of this difficulty is that it is no longer simply a matter of exploring for a while and then exploiting: when the world can change, continuing to explore can be the right choice. (One simple heuristic for this is sketched after this list.)
  • To live in a restless world requires a certain restlessness in oneself. So long as things continue to change, you must never fully cease exploring.
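
The book stops at the hardness result and doesn't prescribe an algorithm here. One common heuristic from the bandit literature (not from the book) is to discount old observations so estimates can track a drifting world; the exploration bonus then never fully shrinks away, which is exactly the restlessness the last point calls for. A minimal discounted-UCB-style sketch, where `payoff_at` is a hypothetical stand-in for the changing world:

```python
import math
import random


def discounted_ucb(payoff_at, k=2, n_pulls=10_000, gamma=0.99):
    """Play a restless Bernoulli bandit by discounting old evidence.

    `payoff_at(t, arm) -> bool` stands in for the changing world.
    Because gamma < 1 keeps the effective sample size bounded, the
    exploration bonus never vanishes: the player never fully stops
    exploring, as a restless world demands."""
    counts = [0.0] * k  # discounted pull counts per arm
    totals = [0.0] * k  # discounted reward sums per arm
    score = 0.0
    for t in range(n_pulls):
        counts = [gamma * c for c in counts]
        totals = [gamma * s for s in totals]
        if min(counts) < 1e-9:
            arm = counts.index(min(counts))  # seed untried arms first
        else:
            n = sum(counts)
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(n) / counts[i]))
        reward = 1.0 if payoff_at(t, arm) else 0.0
        counts[arm] += 1
        totals[arm] += reward
        score += reward
    return score


def drifting_world(t, arm):
    # Two arms whose payout rates slowly trade places over time.
    return random.random() < 0.5 + 0.3 * math.sin(t / 800 + arm * math.pi)


print(discounted_ucb(drifting_world))
```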

Win-Stay, Lose-Shift algorithm

  • Choose an arm at random, and keep pulling it as long as it keeps paying off. If the arm doesn’t pay off after a particular pull, then switch to the other one.
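
This is the strategy Herbert Robbins studied in the 1950s. A minimal two-armed sketch, with toy payout rates of my choosing:

```python
import random


def win_stay_lose_shift(payout_probs, n_pulls=1_000):
    """Play a two-armed Bernoulli bandit with Win-Stay, Lose-Shift."""
    arm = random.randrange(2)  # choose an arm at random to start
    wins = 0
    for _ in range(n_pulls):
        if random.random() < payout_probs[arm]:
            wins += 1          # win: stay on this arm
        else:
            arm = 1 - arm      # lose: shift to the other one
    return wins


print(win_stay_lose_shift([0.6, 0.4]))  # about 520 wins out of 1,000
```

With these made-up rates the player spends about 60% of pulls on the better arm, for roughly 0.52 reward per pull versus 0.50 for random play: reliably better than chance, but not optimal, since a single unlucky pull abandons the best machine.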

Regret is monotonically increasing

  • First, assuming you’re not omniscient, your total amount of regret will probably never stop increasing, even if you pick the best possible strategy—because even the best strategy isn’t perfect every time.
  • Second, regret will increase at a slower rate if you pick the best strategy than if you pick others; what’s more, with a good strategy regret’s rate of growth will go down over time, as you learn more about the problem and are able to make better choices.
  • Third, and most specifically, the minimum possible regret—again assuming non-omniscience—is regret that increases at a logarithmic rate with every pull of the handle.
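
The result behind this third point is due to Lai and Robbins. One standard statement of their lower bound for a Bernoulli bandit (the notation here is mine, not the book's):

```latex
\liminf_{n \to \infty} \frac{\mathbb{E}[R_n]}{\ln n}
  \;\ge\; \sum_{i \,:\, \Delta_i > 0} \frac{\Delta_i}{D_{\mathrm{KL}}(\mu_i \,\|\, \mu^*)}
```

Here R_n is total regret after n pulls, μ* is the best arm's payout rate, Δ_i = μ* − μ_i is arm i's gap, and D_KL is the Kullback-Leibler divergence between the arms' reward distributions. No strategy can make regret grow more slowly than this logarithmic curve; good strategies match it up to constant factors.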

Upper Confidence Bound algorithm

  • An Upper Confidence Bound algorithm says, quite simply, to pick the option for which the top of the confidence interval is highest (a minimal sketch follows this list).
  • The recommendations given by Upper Confidence Bound algorithms will be similar to those provided by the Gittins index, but they are significantly easier to compute, and they don’t require the assumption of geometric discounting.
  • Upper Confidence Bound algorithms implement a principle that has been dubbed “optimism in the face of uncertainty.”
  • In the long run, optimism is the best prevention for regret.
  • What an explorer trades off for knowledge is pleasure. The Gittins index and the Upper Confidence Bound, as we’ve seen, inflate the appeal of lesser-known options beyond what we actually expect, since pleasant surprises can pay off many times over. But at the same time, this means that exploration necessarily leads to being let down on most occasions. Shifting the bulk of one’s attention to one’s favorite things should increase quality of life.
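
A minimal sketch of UCB1, one standard member of this family; the book doesn't spell out a formula, so the bonus term and parameters below are the usual textbook choice:

```python
import math
import random


def ucb1(payout_probs, n_pulls=10_000):
    """Play a Bernoulli bandit with the UCB1 rule."""
    k = len(payout_probs)
    counts = [0] * k    # pulls per arm
    totals = [0.0] * k  # total reward per arm
    for t in range(1, n_pulls + 1):
        if t <= k:
            arm = t - 1  # pull each arm once so every bound is defined
        else:
            # Estimated mean plus an exploration bonus that is largest
            # for rarely tried arms: the top of the confidence interval.
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


print(ucb1([0.3, 0.5, 0.7]))  # pull counts concentrate on the 0.7 arm
```

The bonus shrinks as an arm is sampled, so optimism fades exactly where evidence accumulates; cumulative regret grows logarithmically, matching the Lai-Robbins bound above up to constants.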

The expert and the generalist

From Salman:

A jack of all trades is a master of none, but oftentimes better than a master of one.

I don’t believe in this model of trying to focus your life down one thing. You’ve got one life. Just do everything you want.
― Naval Ravikant

A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
― Robert Heinlein