The link below sets out the problem.
To summarise: a forgetful driver is heading for a location that pays £4 if the driver gets there. It is reached by the left turn at the second of two identical intersections. The driver has written down that turning left at the first intersection pays £0, and that going onward at the second pays £1. Given that the driver forgets having driven through an intersection (and whether or not this has already happened) as soon as the car is moving, what plan can be followed at *any* intersection to maximise the payment?
As you might expect, if p is the probability of choosing ONWARD and 1-p the probability of choosing LEFT, then the expected payout is p² + 4(1-p)p, which is maximised at p = 2/3. So the *best* solution is to turn left 1/3 of the time and go onward 2/3 of the time (using a die, say), achieving an average return of £1.33 per driver.
In the paper cited below, Aumann et al. call this the planning-optimal decision.
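As a quick check on that arithmetic, here is a minimal Python sketch that scans the planning formula over a grid of exact fractions. The function name `planning_payoff` and the grid step of 1/300 are my own choices for illustration, not anything from the paper.

```python
from fractions import Fraction


def planning_payoff(p):
    """Expected payout when ONWARD is chosen with probability p at every junction."""
    # ONWARD twice pays 1 (probability p*p); ONWARD then LEFT pays 4
    # (probability p*(1-p)); LEFT at the first junction pays 0.
    return p * p * 1 + p * (1 - p) * 4


# Illustrative grid: p = 0, 1/300, 2/300, ..., 1 (it contains 2/3 exactly).
grid = [Fraction(k, 300) for k in range(301)]
best_planning_p = max(grid, key=planning_payoff)

print(best_planning_p, planning_payoff(best_planning_p))  # 2/3 4/3
```

The maximum of 4/3 is the £1.33 figure used as the benchmark in the rest of this post.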
The problem for decision theorists is that UDT works as follows: once you are at an intersection, you should think that you have some probability α of being at X (the first intersection), and 1-α of being at Y (the second). Your payoff for choosing CONTINUE (that is, ONWARD) with probability p becomes α[p² + 4(1-p)p] + (1-α)[p + 4(1-p)], which doesn't equal p² + 4(1-p)p unless α = 1. So, once you get to an intersection, you'd choose a p that's different from the p you thought optimal at START.
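To see how the preferred p moves with α, here is a rough sketch that maximises the formula above over a grid of exact fractions. The α values tried (1, 3/4 and 1/2) are purely illustrative assumptions of mine, and `action_payoff` and `best_action_p` are hypothetical names.

```python
from fractions import Fraction


def action_payoff(p, alpha):
    """Payoff formula from the text: alpha = believed probability of being at X."""
    at_x = p * p + 4 * (1 - p) * p   # still two junctions ahead
    at_y = p + 4 * (1 - p)           # only the second junction left
    return alpha * at_x + (1 - alpha) * at_y


def best_action_p(alpha, steps=300):
    """Grid search over p = k/steps for the p that maximises action_payoff."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return max(grid, key=lambda p: action_payoff(p, alpha))


for alpha in (Fraction(1), Fraction(3, 4), Fraction(1, 2)):
    print(alpha, best_action_p(alpha))
# 1 2/3
# 3/4 1/2
# 1/2 1/6
```

Only α = 1 reproduces the p = 2/3 chosen at START; any doubt about being at X pulls the preferred p downward, towards turning left more often.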
However, both the planning-optimal decision and the UDT approach produce worse results than simple 'real world' pre-planning. Rationality consists of applying the best tool to reach the optimal outcome: in this case the highest possible £ payout for reaching A, B, or C, from choices made at X or Y, where it is impossible to determine whether a junction is X or Y. In the real world there exists a simple, obvious winning strategy, which I shall call:
Mechanism A.
(1) If you are coming to a junction, and your indicator is not on, drive onward through the junction, and immediately indicate left.
(2) If you are coming to a junction, and your indicator is on, drive left.
(3) Do not turn on your indicator except as at (1).
(4) Do not turn off your indicator if it is on.
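The four rules can be traced in a few lines of Python. This is only a sketch under my own modelling assumption that the indicator is a single on/off flag that starts off; the function name `drive_with_indicator` is hypothetical.

```python
def drive_with_indicator():
    """Follow rules (1)-(4) through junctions X then Y and return the payout in pounds."""
    indicator_on = False  # the state lives in the car, not in the driver's memory
    for junction in ("X", "Y"):
        if indicator_on:
            # Rule (2): indicator is on, so drive left.
            return 0 if junction == "X" else 4
        # Rule (1): indicator is off, so drive onward and immediately indicate left.
        indicator_on = True
    return 1  # onward through both junctions; never reached under these rules


print(drive_with_indicator())  # 4
```

The driver never needs to know which junction this is; the flag carries the one bit of information that memory cannot.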
Now it may be argued that this is basically cheating, but I would argue it is no more cheating than the Absent-Minded Driver being able to remember mathematics, or having the agreed, mathematically predicted action written ‘on a card’, or driving a car programmed with it. The four rules above are as much rules of action as 'turn left 1/3 of the time'. A rational agent who understood the problem would create such a solution, probably before doing the math. The above achieves success in every case and is therefore perhaps uninteresting, but there are other approaches that give a return higher than £1.33, where it is less self-evident where the 'cheat' (if any) resides.
Mechanism B.
Prepare an envelope containing the 3 cards below, and mark it: ‘Draw a random card at each junction, follow the rules on the card’.
(1) At junction ? drive Onwards - destroy card after use (Blue Card)
(2) At junction ? drive Onwards - destroy card after use (Yellow Card)
(3) At junction ? drive Left - destroy card after use* (Orange Card)
(*Unnecessary, but included to avoid a false appearance of bias.)
For 1000 drivers, this results in a total gain of £1665, or £1.665 per driver. A decision theory that does not permit the application of an overriding meta-rule to govern a sequence of decisions, even when that meta-rule creates a better overall action, is simply insufficient.
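The Mechanism B figure can be checked exactly by enumerating the ordered draws, assuming (as I read the rules) that cards are drawn uniformly at random and without replacement. The names `payout` and `expected_b` below are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

PAYOFF_LEFT = {"X": 0, "Y": 4}  # pounds earned for turning left at each junction


def payout(draws):
    """Payout for a sequence of drawn cards, one per junction, X first."""
    for junction, card in zip(("X", "Y"), draws):
        if card == "left":
            return PAYOFF_LEFT[junction]
    return 1  # onward at both junctions


cards = ("onward", "onward", "left")   # Blue, Yellow, Orange
orders = list(permutations(cards, 2))  # 6 equally likely ordered draws
expected_b = Fraction(sum(payout(d) for d in orders), len(orders))

print(expected_b)  # 5/3
```

The exact value is 5/3, about £1.667 per driver; the £1665 total above is the same figure with the 1000 drivers rounded down to whole groups of 333.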
Further, suppose that it is impossible to begin with a non-randomised value. It is still possible to increase the result beyond £1.33 (and indeed beyond the £1.665 of Mechanism B in my previous post).
Consider
Mechanism C:
(1) You have a random number of balls in a bag.
(2) At a junction, if the number is even, go straight on and throw away a ball.
(3) At a junction, if the number is odd, go left (and throw away a ball).
The starting state, odd or even, is not determined by whether you are at X or Y, nor is the throw-away instruction, which is a constant, and yet the result is better than £1.33.
Let 1000 people be travelling.
500 absent-mindedly come to junction X. 250 of them have a random odd number of balls, go left and earn nothing; 250 have an even number and carry on, throwing away a ball. When those 250 (having forgotten junction X) come to junction Y and look in the bag, they find an odd number of balls and turn left, earning 250 x 4 = 1000 (1000 / 500 = 2).
500 absent-mindedly come to junction Y. 250 of them have a random odd number of balls, go left and earn 4 x 250 = 1000; 250 have a random even number of balls and go on, earning 250 x 1 = 250 (1250 / 500 = 2.5).
Total gain for the 1000 travellers = £2250, or £2.25 per person: substantially higher than the mathematically predicted £1.33, if lower than the optimum 'indicator light' solution.
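The tally above can be reproduced with a short sketch. It assumes odd and even starting bags are equally likely, and it stands in 3 balls for "odd" and 4 for "even"; both are my own simplifications, as are the names `payout_from` and `average`. Note that the 2.25 figure counts half of the travellers as arriving first at Y. If instead every traveller passes X first, as in the summary of the problem, the same rules give £2 per person, which is still above £1.33.

```python
from fractions import Fraction


def payout_from(start_junction, balls):
    """Payout in pounds for one traveller following Mechanism C from the given junction."""
    junctions = ("X", "Y") if start_junction == "X" else ("Y",)
    for junction in junctions:
        if balls % 2 == 1:
            # Rule (3): odd number of balls, go left.
            return 0 if junction == "X" else 4
        # Rule (2): even number of balls, go straight on and throw away a ball.
        balls -= 1
    return 1  # went straight on at the last junction


def average(starts):
    """Average payout over the listed starting junctions and both bag parities."""
    cases = [payout_from(s, b) for s in starts for b in (3, 4)]
    return Fraction(sum(cases), len(cases))


print(average(["X", "Y"]))  # 9/4, the 2.25 tally in the text
print(average(["X"]))       # 2, if every traveller meets X first
```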
In Mechanism C, the absent-minded traveller never knows whether he (or she) is at X or Y, nor whether he (or she) has previously been to a junction, nor whether his or her bag started with an odd or even number of balls, and yet the information encoded in the rules increases the utility of the outcome beyond the predicted maximum value of Decision Theory.
The phase state of odd / even between X and Y, for those encountering both X and Y, acts as an extelligent memory. The question is: does ruling out the rational building-in of extelligent checking defeat the rational purposes of Decision Theory?
Simon BJ