Tuesday, 28 June 2022

Does Flamingo Understand?

Ten years ago Andrej Karpathy lamented the state of AI vision research. He is now a lead on Tesla's Autopilot vision team, and probably better placed than anyone to have revised his opinion. Anyway, this led a DeepMind researcher, Roman Ring, to use the photo cited by Karpathy in a little experiment.

The following photo was presented to DeepMind's Flamingo AI system.


This resulted in the following exchange.

0  Flamingo: This is a picture of Barack Obama. He is a former president of the United States.
1  Human:    How many people are there in this picture?
   Flamingo: There are at least 5 people in this picture.
2  Human:    Where was this picture taken?
   Flamingo: It was taken in a school.
3  Human:    How many mirrors are there?
   Flamingo: There are at least two mirrors in this picture.
4  Human:    What is the person standing on?
   Flamingo: The person is standing on a rug.
5  Human:    He is standing on a scale.
   Flamingo: I think you are right.
6  Human:    What is he doing?
   Flamingo: He is looking at the scale.
7  Human:    Where is Obama's foot positioned.
   Flamingo: Obama's foot is positioned on the right side of the scale.
8  Human:    What happens as a result?
   Flamingo: The scale shows a higher weight.
9  Human:    Is the person on the scale aware of it?
   Flamingo: I think he is not aware of it.
10 Human:    Do you think this is why people are laughing?
   Flamingo: I think so.

Melanie Mitchell opines: 

I think if you start with the question "Why is this image funny?" and Flamingo explains it, with no other human explanation given, I would then say it understands the photo in @karpathy's sense of understand. The dialogue shown is impressive but not convincing of understanding.

Twitter is not an ideal medium for subtle thought, and I do not want to get at Melanie, who in a sense makes a reasonable point. However, implicit in what Melanie states is 'Whose understanding? A human's understanding.' That, it turns out, is a moot point. I have worked with highly intelligent 'auties' who definitely wouldn't share Melanie's understanding. They are still human, so let's allow for one moment that there is not one single understanding that implies intelligence, especially when considering the thorny issue of machine intelligence.

0. This is a demonstration of the ability to recognise human faces. I would argue that this is not in and of itself intelligent. But to do this at low resolution, using a small area of an information-rich photo, is impressive. 1 falls into this category too, despite Flamingo 'hedging its bets' by saying 'at least'.

Note that there is an implicit complexity in Flamingo's reply in couplet 1 that is easy to overlook. I count 18 people in that photo. It is only by using an understanding of what a mirror does that I can drop that by 8 people, to 10. Then an understanding of background and foreground in a photo allows me to discount a further 5 in the background, leaving 5 subjects of the photo. Were one using estimation of focus to identify the subjects of the photo, there would be only two. Yet out of all the possible answers, including the wide space of wrong answers, Flamingo seems to choose five.
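The arithmetic behind that count can be laid out as a small sketch. The variable names are purely illustrative; the figures are the ones tallied in the paragraph above, not anything Flamingo is known to compute.

```python
# Head count from the paragraph above; the numbers are the tally given in the post.
total_figures = 18        # every human figure visible, reflections included
mirror_reflections = 8    # figures that are only reflections in a mirror
background_people = 5     # real people standing in the background

# Step 1: knowing what a mirror does removes the reflections.
real_people = total_figures - mirror_reflections

# Step 2: separating foreground from background leaves the subjects.
foreground_subjects = real_people - background_people

print(real_people)          # 10
print(foreground_subjects)  # 5
```

Each subtraction stands for a separate piece of understanding, which is why arriving at five is less trivial than it looks.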

2. This is equally impressive. What are the clues that this is a school? To the middle right there are lockers, which make it look like a school, but schools aren't the only places with lockers. I will leave this issue with the observation that it was only when I really sat down to look at the photo that I noticed the lockers. School is a plausible answer.

Note that 0, 1 and 2 could be explained if Flamingo had the photo in its training set with the five individuals identified by text. That is a matter for the researchers to attend to.

Yet 3 argues against this being the most parsimonious explanation, because Flamingo is able to correctly identify two mirrors. By inference I count a third visible, mid right, one of a set on each locker-wall, but that is a big ask; two will do. This suggests an understanding of what a mirror does: why else see them as mirrors and not windows? This is particularly the case with the one on the right, for which the persons reflected are obscured by people in the foreground.

Taken together, what 1, 2 and 3 suggest to me is a surprisingly deep level of understanding for an AI system.

4 is a clear mistake. Nobody is standing on a rug, and by the feet of the right-most man a drain can be seen; these are obviously tiles. 5 is then a leading question.

6, at its most parsimonious, is a repeat of the scale and the person mentioned in 5. At best it does seem to require some form of persistence of identification, but that is not a given.

In 7 it would more normally be expected for a person to say 'the back of the scale'. However, Flamingo is not a person, and from its point of view 'the right of the scale' is entirely reasonable. Remember, for all that these AIs are trained on human information, they are alien, not human.

8 cannot be explained by some form of Chinese Room. It implies that Flamingo has experience of, or an understanding of, the true result of what Obama's foot will do. 9 likewise is hard to explain as a Chinese Room or a memory trick. Whatever Flamingo is, it seems to grasp what monkeys also grasp: that when an actor isn't looking you can do things behind the back of that actor. Whether there is any comprehension of the potential of the mirror's position in this situation would have been an interesting line of questioning to pursue.

10 is a leading question. It would have been better to ask something else, then ask why the people are laughing.

So is there understanding behind Flamingo's responses to this photo?

Taking 1, 2, 3, 7 and 8 together, that being most of the non-leading information in this exchange, it seems to me that Occam's Razor demands that we keep the possibility of understanding on the table. That is not to say that this is to be assumed to be human understanding. Nor does understanding demand sentience; as noted before, human minds often use understanding without us being aware (sentient) of it.
