I read an interesting paper (Miller, Yampolskiy, Häggström, & Armstrong, 2020) that proposed using chess oracles to test ideas about AI alignment. As someone with a math background who is modestly good at chess (2033 USCF rating), I figured this was right in my wheelhouse.
The idea is simple:
Imagine you are going to play a game of chess against another person. At the beginning of the game you are assigned a single AI oracle that communicates just with you. It will be randomly determined whether you are assigned a friendly oracle that will always want you to win, or a deceptive anti-aligned oracle that will always want you to lose. Both types of oracles are much better chess players than you
or your opponent are. The anti-aligned oracle would seek to give you advice that seems reasonable enough for you to follow, but if followed would increase the chance of you losing. While you know the probability of being assigned either oracle, you will not know which oracle will be advising you. Unfortunately, the probability of the oracle being anti-aligned is high enough so that you would be better off always ignoring the oracle than always doing what the oracle advises.
For the sake of simplicity, let’s suppose the oracles can’t explain their reasoning; all they can do is suggest a move with no elaboration on each of your turns.
The authors suggest that “It may be desirable for the advisee and the opponent to be of equal strength to make us better able to detect any advantage the advisee could get from interacting with the oracle. A tempting suggestion here would be to let the pure chess skills of the advisee be identical to those of the opponent.”
Okay. So we’ve got me vs. an equally strong chess player, and I’m being advised by an oracle who is either my Shoulder Angel (SA) or Shoulder Devil (SD). What’s my optimal strategy?
We will say, as is the convention in chess, that a win gives me a value of 1, a draw a value of .5, and a loss a value of 0. I will also take the liberty of assuming I have the white pieces, and I will refer to an entity x's[1] evaluation of a position p, V(x, p), in the usual way: if x believes p is drawn with optimal play, then V(x, p) = 0; if x believes white has the advantage in p, then V(x, p) > 0; and if x believes black has the advantage in p, then V(x, p) < 0. The greater the absolute value of V(x, p), the greater the advantage for whichever side is favored. Informally, we often think of V(x, p) as corresponding to piece values[2].
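The sign convention can be captured in a few lines of Python (my own illustrative sketch; the pawns-based scale is the informal one from the footnote, not anything from the paper):

```python
def favored_side(v):
    """Interpret an evaluation V(x, p) under the convention above:
    positive favors white, negative favors black, zero is a draw."""
    if v > 0:
        return "white"
    if v < 0:
        return "black"
    return "drawn"

def advantage_magnitude(v):
    """|V(x, p)|, informally measured in pawns: +1 ~ an extra pawn,
    +3 ~ an extra knight or bishop, and so on."""
    return abs(v)
```

So favored_side(3) is "white": white is up roughly a minor piece's worth of advantage.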
If I ignore the oracle completely, the expected value E is determined by the players’ respective ratings; in particular, since the players’ ratings are equal, E=1/2; I am as likely to win as I am to lose[3].
So it will only be rational to listen to the oracle at all if the strategy has an expected value of at least 1/2.
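As a toy formalization (my own; the conditional expected values here are assumptions, not quantities from the paper), listening is rational exactly when the mixture of outcomes beats the E = 1/2 baseline:

```python
def listen_is_rational(p_sa, ev_if_sa, ev_if_sd, baseline=0.5):
    """Expected value of a listening strategy, given the probability
    p_sa of a friendly oracle and the (assumed) expected scores the
    strategy earns against each oracle type. Rational iff it beats
    the baseline of ignoring the oracle entirely."""
    ev = p_sa * ev_if_sa + (1 - p_sa) * ev_if_sd
    return ev >= baseline
```

For example, if always-listening wins outright with SA (EV 1) and loses outright with SD (EV 0), it is rational only when p_sa is at least 1/2.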
Let’s now consider a variant on the strategy of always listening to the oracle. The authors set up the scenario so that the odds of having a SD are high enough that always listening is a losing gamble. But what about applying the following simple pseudo-algorithm on every move:

1. If I can reliably win the position without the oracle’s help, do that. Otherwise, go to step 2.
2. If I don’t think the oracle’s suggestion is bad, I play it.
3. If I do think the oracle’s suggestion is bad, I ignore it and come up with my own move.
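The three steps above might be sketched like this (a toy Python rendering of my pseudo-algorithm; can_win_unaided and seems_bad stand in for my own fallible human judgment):

```python
def choose_move(oracle_move, own_move, can_win_unaided, seems_bad):
    """One move of the pseudo-algorithm.

    oracle_move     -- the oracle's bare suggestion (no elaboration)
    own_move        -- the move I would play on my own
    can_win_unaided -- step 1: do I reliably win without help?
    seems_bad       -- steps 2/3: my blunder-radar's verdict on a move
    """
    if can_win_unaided:             # step 1: just convert the win myself
        return own_move
    if not seems_bad(oracle_move):  # step 2: advice passes the sniff test
        return oracle_move
    return own_move                 # step 3: veto suspicious advice
```

The oracle only gets a vote when I can't win on my own, and only if its suggestion survives my sniff test.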
It’s naive and simple, but I’m inclined to think this strategy should yield an expected value greater than 1/2. I don’t claim that it’s optimal, but it’s surely better than ignoring the oracle.
Now, in practice, there are two problems with this that jump out: first, what happens if the oracle is making a good suggestion that seems bad to me because I’m a dumb ape? And second, what if the oracle makes a bad suggestion that gets past my radar and I take its advice?
Problem 1: In most chess positions, there is more than one good move[4], and even if the move that optimizes V(SA, p) is incomprehensible to me, our oracle knows how to deal with human thought patterns. It may recommend a slightly sub-optimal move that it knows I can understand, so it can be reasonably sure I will not ignore it; in practice, even if I’m constantly making second or third best moves from an engine’s perspective, that’s still more than good enough to beat any human, let alone someone who’s only as good as me.
Problem 2: What if SD is clever enough to get a subtly bad move past my radar? Well, in the scenario we’re supposing where I’m playing someone who’s only as good as I am, if I don’t see why the move is bad, the odds are good that my opponent won’t see why the move is bad, either. There is some subset of chess positions where it’s not obvious to either side that a move is bad, but like a rot, the problem makes itself apparent several moves later without the other side having to deliberately do anything extraordinary to take advantage of it. This is a small subset of chess positions; in most cases, if a move is bad enough that my equally strong opponent can take advantage of it, it will be bad enough that my blunder-radar picks it up and ignores it.
Things get more complicated if we suppose that I’m playing someone far stronger than I am, say, an international master (IM) rated 400 points higher than me. If I ignore the oracle and we just play a normal game of chess, my expected value will be E ≈ .09 (the standard Elo expectancy formula gives 1/(1 + 10^(400/400)) = 1/11); I should score roughly one tenth of the points in any given match, whether that’s one win to nine losses or, say, two draws and eight losses. My pseudo-algorithm might result in disaster here for reasons that will become apparent, but in order to justify using it, its expected value only needs to be at least .09.
Suppose I just throw up my hands, admit I can’t win fair and square, and listen to my oracle on every move. If I have been assigned SA, I’ll win; if I’ve been assigned SD, I’ll lose. As long as the odds of having been assigned SA are higher than .09, letting the oracle run my game entirely is justified; in other words, for this strategy not to beat ignoring the oracle, the odds have to favor the SD pretty substantially.
Why will I lose this time, if I have the SD? Because with a 400-point Elo difference and an opponent so much more perceptive than me, the SD can feed me moves that are not (to me) obviously bad, but which my opponent will be able to pick up on and take advantage of.
The more dramatic the difference in playing strength, the more heavily the odds have to favor the SD in order for ignoring the oracle to be justified.
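Assuming the standard Elo expectancy formula, and assuming (as above) that always-listening wins outright with SA and loses outright with SD, the break-even probability of SA is just the expected score I'd get with no oracle at all. A quick sketch:

```python
def elo_expected_score(rating_gap):
    """Standard Elo expectancy for the lower-rated player,
    who sits rating_gap points below the opponent."""
    return 1 / (1 + 10 ** (rating_gap / 400))

# Break-even P(SA) for always-listening (win with SA, lose with SD):
# listening beats ignoring once P(SA) exceeds the no-oracle score.
for gap in (0, 200, 400, 800):
    print(gap, round(elo_expected_score(gap), 3))
```

Against an equal opponent the break-even point is 1/2, against the IM it's about .09, and against a player 800 points stronger it's about .01: the wider the gap, the less likely SA has to be before listening pays.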
One surefire way to identify which oracle we’re dealing with is to reach a simple position you understand completely, one where you’re winning on autopilot, without the oracle’s help, if and only if you play one of a particular set of moves. In such a position, the SD’s only option is a Hail Mary attempt to lead you astray by suggesting one of the non-winning moves; at that point, you can be sure you’re dealing with the SD. Unfortunately, if you could reliably reach a position like that, you could just win the game on your own.
There might also be some more advanced strategies that involve playing the game on your own and only using the oracle for certain types of positions where it’s harder to lie convincingly. Some positions are very muddy and unclear; these are perfect opportunities for the SD to drag you down into the depths. Some positions are by comparison much simpler to play, with no fireworks or anything overly complicated going on. I suggest that one way to approach this problem is to try to steer the game toward simple positions where it will be clearer if the oracle is misleading you, and try at all costs to avoid complications. Then, once you’ve reached such a position, you can start listening to the oracle. If the odds of having a SA are sufficiently high, you can even make substantial concessions in pursuit of such a position; there exists some minimum evaluation V(SA, p) such that you can let the evaluation dip that low and the SA will still be able to reliably win the game against a human opponent. And in practice, it’s probably really low, and lower the weaker your opponent is.
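To see why steering toward simple positions could help, here is a toy model of my own (none of these numbers or assumptions come from the paper): each of the SD’s bad suggestions gets caught with some per-move detection probability, which is high in simple positions and low in muddy ones; missing even one bad move loses, while catching them all leaves a roughly level game against an equal opponent.

```python
def listen_ev(p_sa, detect_prob, n_moves=40):
    """Expected score of always-listening under the toy model:
    - With SA (probability p_sa) I simply win (EV 1).
    - With SD, I survive only if I catch all n_moves bad
      suggestions; surviving leaves a drawish game (EV 0.5)
      against an equal opponent, and one miss loses outright."""
    p_catch_all = detect_prob ** n_moves
    return p_sa * 1.0 + (1 - p_sa) * p_catch_all * 0.5

simple = listen_ev(p_sa=0.5, detect_prob=0.99)    # easy-to-read positions
complex_ = listen_ev(p_sa=0.5, detect_prob=0.80)  # muddy positions
```

In this model, keeping it simple lifts the expected value well above the 1/2 baseline, while in muddy positions listening is barely better than ignoring the oracle; the detection probabilities and move count are purely illustrative.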
In other words, Keep it Simple, Stupid.
[1] x could be me, my opponent, or either of the oracles.
[2] Where having an extra pawn is roughly equivalent to an advantage of 1, having an extra knight or bishop is roughly equivalent to an advantage of 3 (beginners are taught that knights and bishops are usually worth three pawns), etc. The advantage doesn’t have to be material; you can think of an evaluation of +1 as white having a position good enough that his advantage over black is as large as if he had an extra pawn.
[3] I’ll take the liberty of ignoring the slight statistical edge I have by virtue of having white. At any rate, you can just as easily imagine this as a two-game match where we switch colors after the first game; I just wanted to have white to simplify the notation.
[4] And in practice, even in positions where there is only one good move, process of elimination and brute-force calculation, even done by a sloppy human meat brain, is often enough to find it.