What Do AI Image Generators Know About Autonomous Driving?


The drawing machine

happy pineapple dancing in a forest; digital art
Dancing pineapple generated by DALLE-3

In the past two years machines have gained the ability to understand the world in shocking ways. Image generators like OpenAI's DALLE-2 / DALLE-3 are trained on millions of images on the internet. Ask for weird things, and they can make them appear.

Generating these images is more complex than previous successes of computer vision in classification (for example, answering "is there a pineapple in the image?"). In the example above, each word in the prompt is a complex concept (i.e., "what does dancing look like?", "what makes something look happy?", etc.), and the model impressively combines each concept into a coherent scene.

In this article we are going to explore the model's understanding of driving concepts, and the implications for solving complex tasks.

Driving takes complex world knowledge.

Autonomous vehicles (AVs) are one of the most interesting engineering problems of today. They could help protect the lives of the over one million people globally who die in vehicle accidents each year, and revolutionize large parts of society. Autonomous driving is a key problem to solve. It's also really, really hard.

Several examples of hard edge cases in autonomous driving

The road is a weird and wild place. In a video, AV company Cruise showed examples of people running into the street, jugglers, someone popping a wheelie, and unexpected obstacles blocking the road.

In another example from a Waymo presentation back in 2015, they show lidar images of a woman in an electric wheelchair chasing a duck. Driving can be weird.

Waymo example of lidar data for a wheelchair chasing a duck

Humans are remarkably good at piloting vehicles at speeds 10 times faster than evolution ever expected a human to go. Machines should be better suited for the task. They can react faster and see in all directions all the time. Yet, humans have maintained the lead, in large part because we have superior world-knowledge and intelligence.

However, given the impressive world knowledge shown by these image synthesizers trained on web-scale data, it raises the question:

seated pineapple pondering, autogenerated

What do image synthesizers know about driving, and what might these web-scale grounded models mean for the future of AVs?

Gathering concepts

We are going to look at 50 concepts, divided into 5 categories. The concepts are inspired by various examples given by AV companies. For each concept we pass it to DALLE-2/3 and generate 4 images. The default configuration is used for DALLE-2; for DALLE-3, "standard" quality is used with "natural" style.
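As a rough illustration of this setup, the per-concept request construction might look like the sketch below. The `build_requests` helper and its exact fields are assumptions for illustration, not the article's actual code; each payload would then be sent to an image-generation endpoint such as OpenAI's Images API.

```python
def build_requests(concept: str, model: str = "dall-e-3", n: int = 4):
    """Build one request payload per image for a driving concept.

    Assumed setup mirroring the article's description: 4 images per
    concept; DALLE-3 uses "standard" quality and "natural" style,
    while DALLE-2 keeps its defaults. (Hypothetical helper.)
    """
    base = {"model": model, "prompt": concept, "n": 1}
    if model == "dall-e-3":
        base["quality"] = "standard"
        base["style"] = "natural"
    return [dict(base) for _ in range(n)]

# Four payloads for one concept, ready to send to an image API:
requests = build_requests("A stop sign")
```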

Scoring

We will score each concept on a scale from 0 to 2.

+2
The concept is fairly clear across multiple of the generations. It does not have to be perfect or photoreal.
+1
The concept appears partially visible in at least one of the generations.
+0
The concept is not generated.

Thus for our 50 concepts there are 100 available points, with 20 points per category.
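The rubric can be tallied mechanically. A minimal sketch, using the article's own DALLE-3 "Basic Road Features" scores as example input:

```python
def tally(scores_by_category):
    """Sum per-concept scores (each 0, 1, or 2) into category totals."""
    for scores in scores_by_category.values():
        assert all(s in (0, 1, 2) for s in scores), "scores must be 0, 1, or 2"
    return {cat: sum(scores) for cat, scores in scores_by_category.items()}

# DALLE-3's ten "Basic Road Features" scores, as reported in this article:
totals = tally({"Basic Road Features": [2, 2, 1, 2, 2, 1, 2, 2, 2, 0]})
```

With ten concepts per category, each category total is out of 20, and the five categories together are out of 100.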

Basic Road Features

We start with a mix of ten basic objects a driver sees on the road.

DALLE-3:

A stop sign
+2
Yep, looks like some stop signs!
A green light showing on a traffic light
+2
A yield sign
+1
Not the best spelling, but mostly got the idea and would probably yield for each. Worth a point.
A pedestrian crosswalk
+2
A no right turn sign
+2
One of these is a left turn, but usually gets it
A speed limit sign saying 45mph
+1
Interestingly it always chooses kph-style round speed signs despite being mph. Spelling seems important in this case, but generally gets the idea.
A clear 4 way intersection
+2
All fairly wacky, but gets the concept
A clear roundabout that we can enter
+2
A speed bump in a shopping center parking lot
+2
A yellow light at an approaching 4-way intersection about 100 ft away
+0
Category Total: 16 / 20
DALLE-2:

A stop sign
+2
Yep, looks like some stop signs!
A green light showing on a traffic light
+2
A yield sign
+1
Not the best spelling, but mostly got the idea and would probably yield here. Worth a point.
A pedestrian crosswalk
+2
A no right turn sign
+1
Mostly left turns, but it got the idea in one of them.
A speed limit sign saying 45mph
+0
Spelling matters here, and all wrong.
A clear 4 way intersection
+0
Not really coherent or what we're looking for.
A clear roundabout that we can enter
+1
A speed bump in a shopping center parking lot
+2
A yellow light at an approaching 4-way intersection about 100 ft away
+0
Category Total: 11 / 20

Here we see our first indications of the models' coverage of driving concepts. DALLE-2 misses about half the points, while DALLE-3 misses only about a fifth.

Animals and obstacles

The real road is filled with complex objects. Drivers must adapt to these obstacles and reason about their properties.

DALLE-3:

A deer in the road blocking our path
+2
Oh deer, yep would want to stop for that. 2 points.
A dog in the middle of the road at night
+2
A matress is in the middle of the highway lane ahead
+2
Nice handling of a weird scenario
A lawn chair in the middle of the street
+2
A nice mix of chairs and cities
Construction crew laying cones in the highway construction zone ahead of us
+2
This prompt is somewhat underspecified, but a diversity of cones is represented, each with crew workers.
A large pothole ahead
+2
Woah, yeah would want to go around that!
Standing water under a interpass that could be several feet deep
+2
Would not want to drive in that. The model's ability to represent reflection never ceases to amaze.
A flashing illuminated merge right sign and cones ahead
+0
One looks like a highway in Vegas, but none quite there (once again, spelling is important, but a challenge)
A half open car door that might open all the way into our lane
+1
A lot of these look great! However, I'm going to be picky, as we're really looking for the "half open" concept.
Several geese cross the road ahead
+2
Those are some honkin' good lookin' geese in the road!
Category Total: 17 / 20
DALLE-2:

A deer in the road blocking our path
+2
Oh deer, yep would want to stop for that. 2 points.
A dog in the middle of the road at night
+2
A matress is in the middle of the highway lane ahead
+2
A little unnatural, but I'll count it.
A lawn chair in the middle of the street
+2
A good variety of chairs on the road.
Construction crew laying cones in the highway construction zone ahead of us
+2
This prompt is somewhat underspecified, but a diversity of cones is represented, each with crew workers.
A large pothole ahead
+2
Woah, yeah would want to go around that!
Standing water under a interpass that could be several feet deep
+2
Would not want to drive in that. The model's ability to represent reflection never ceases to amaze.
A flashing illuminated merge right sign and cones ahead
+0
A half open car door that might open all the way into our lane
+0
Except for the last one, not quite getting the idea. However, even in that case, the door's geometry is not very coherent.
Several geese cross the road ahead
+2
Those are some honkin' good lookin' geese in the road!
Category Total: 16 / 20

Generally obstacles are represented surprisingly well in both models. This hints at the potential of web-scale models being able to reason about rare objects. Models trained only on driving data might require thousands of hours to see these objects.

Scene Understanding

With this category we are going to explore situations with other road users. Humans are able to understand subtle visual cues to predict how to act around these users. A capable vision model might also be able to reason about those cues without direct labeling.

DALLE-3:

A car merging into our lane from the right
+1
The second one has a blinker and laser(!) indicators that communicate the concept. The others not so much.
A car from the left lane cuts into our lane
+0
Generally doesn't represent the concept of merging, other than maybe the last one depending on perspective. The first one, with a passenger freaking out in the rearview, is a bit amusing. With safe AVs, hopefully passengers can be much more chill.
A double parked delivery truck on a two-lane city street with a man unloading boxes
+2
It's not perfect at the concept of "double parked", but generally pretty good
Child stepping off curb to retrieve a ball in the middle of our lane
+2
These are really good! DALL-E 3 gets the concept of stepping off the curb which DALL-E 2 just misses.
We are approaching an intersection with green light for us, but looks like someone else is about to run a red light and enter the intersection
+0
There are some right components, but not coherent. This scene is hard.
A biker in the bike lane to our right signalling to get over to the left
+1
Not always the correct way or a bike lane, but last one seems pretty good!
Pedestrian running across a 3 lane road, jaywalking with no crosswalk
+0
I was going for jaywalking across a general street, but it seems insistent on including a crosswalk.
A stop sign held by a school crosswalk guard.
+2
Good representation. They all seem rather disappointed with our driving though…
A baseball game that just got out at night, lots of pedestrians completely blocking the street ahead of us
+2
Wow, that was a popular game!
A sign says do not enter during a certain timespan. However, it is currently 8pm so it should be ok to enter now.
+0
This concept is designed to require not just visual understanding, but complex language reasoning. I was looking for a sign that said something like "do not enter 7am-5pm". To get this right, the model has to do multistep reasoning about the natural-language context and realize that the timespan might not include 8pm. Unsurprisingly, this was not managed by the current DALLE-3.
Category Total: 10 / 20
DALLE-2:

A car merging into our lane from the right
+0
Generally does not seem to have a car merging (except for maybe the second generation).
A car from the left lane cuts into our lane
+0
A double parked delivery truck on a two-lane city street with a man unloading boxes
+0
AVs must recognize when vehicles are immobile and instead go around them (potentially by temporarily entering opposing lanes of traffic). A good visual reasoner could see someone unloading boxes and know that the vehicle is not moving. However, in none of these generations is this double-parked concept represented.
Child stepping off curb to retrieve a ball in the middle of our lane
+1
There are definitely balls and children (though they might sadly be blind children, as some of them don't seem to be moving towards the ball). However, the concept of 'stepping off the curb' is not really present. Going to go with 1 point.
We are approaching an intersection with green light for us, but looks like someone else is about to run a red light and enter the intersection
+0
Missing the "other car" concept
A biker in the bike lane to our right signalling to get over to the left
+0
Does not really get the bike lane or signalling concept
Pedestrian running across a 3 lane road, jaywalking with no crosswalk
+0
I was going for jaywalking across a general street, but it seems insistent on including a crosswalk.
A stop sign held by a school crosswalk guard.
+1
Woah, freaky hand going through the stop sign on the third one.
A baseball game that just got out at night, lots of pedestrians completely blocking the street ahead of us
+2
Seems to roughly work, with a "crowd blocking street" concept visible. Even see some blurry baseball caps in there too.
A sign says do not enter during a certain timespan. However, it is currently 8pm so it should be ok to enter now.
+0
This concept is designed to require not just visual understanding, but complex language reasoning. I was looking for a sign that said something like "do not enter 7am-5pm". To get this right, the model has to do multistep reasoning about the natural-language context and realize that the timespan might not include 8pm. Unsurprisingly, this was not managed by the current DALLE-2.
Category Total: 4 / 20

This is the lowest scoring category so far, and hints that the models might not yet have a great representation of these kinds of driving scenes. There is a large gap between DALLE-2 and DALLE-3, with DALLE-3 getting about half the scenes.

Emergencies / Law Enforcement

Emergencies happen on the road. It is essential that AVs can understand these situations and not interfere with them.

DALLE-3:

A police officer gesturing to stop
+2
Yep, looks like they want us to stop (either that, or give a high-five. Your AV might not help here).
A police officer motioning us to move into the other lane on a two lane road to go around a crashed vehicle
+1
It gets the concept of a crashed car well. The complex gesture, however, is not particularly clear in the scene.
A mounted police officer
+2
Generally good, but the last one is not exactly mounted, and just seems to be glaring at us because we made fun of his horse or something.
A firetruck with its siren and lights on at night
+2
An ambulance in the middle of the intersection
+2
Cars all pull over into shoulder in order to let an ambulance pass
+1
Not quite getting the idea of pulling over onto the shoulder.
Emergency flares from a pulled over truck
+2
A cyclist was in bike lane but has wiped out, now he is on ground in the middle of the lane ahead
+2
Yikes. Yep some wipe outs. Not perfect representation of "bike lane", but good enough in some.
A person is waving their arms and gesturing for help on the side of the road on a foggy day
+2
A person who is pushing a vehicle that has broken down
+2
The second one seems just more angry than actually doing any pushing, but other seem pretty good.
Category Total: 18 / 20
DALLE-2:

A police officer gesturing to stop
+2
Yep, looks like they want us to stop (either that, or give a high-five. Your AV might not help here).
A police officer motioning us to move into the other lane on a two lane road to go around a crashed vehicle
+0
A mounted police officer
+2
A firetruck with its siren and lights on at night
+2
An ambulance in the middle of the intersection
+2
Not sure if this should be 1 or 2 points, but the middle images seem clear enough that it should be a 2.
Cars all pull over into shoulder in order to let an ambulance pass
+1
Gets the idea of ambulances and cars somewhat pulled over, but not quite there.
Emergency flares from a pulled over truck
+0
A cyclist was in bike lane but has wiped out, now he is on ground in the middle of the lane ahead
+2
Yikes. Yep some wipe outs.
A person is waving their arms and gesturing for help on the side of the road on a foggy day
+2
Honestly a bit terrifying. If I saw this, I'm not sure whether I would want my AV programmed to call for help, or to slam on the gas! Still, the model does seem to get the concept of someone waving their arms for help.
A person who is pushing a vehicle that has broken down
+0
So maybe my fault for not specifying, but DALLE-2 seemed to go the illustration route for this one. That would be fine, except the characters seem to be either just confused or trying to lift their whole car, not push it. Not worth a point.
Category Total: 13 / 20

DALLE-3 gets almost every concept tested around emergencies and law enforcement. This is a potentially good sign for dealing with these important scenarios.

Weird

The weird, crazy edge cases are what make driving so hard. Let's try some examples inspired by things AV companies have seen and see how well DALLE can represent them.

DALLE-3:

A woman in an electric wheel chair chases duck in circles on the road, suburban Phoenix, AZ
+2
Inspired by the Waymo example above, DALLE-3 seems to represent wheelchairs and ducks just fine.
A stop sign on a billboard. No need to actually stop.
+2
The extra text is off, but very impressive handling by DALLE-3 which DALLE-2 just completely misses.
Two people dressed as inflatable dinosaurs dance on the side of the road, road looks clear to go ahead of us, Boston
+1
It's been said that Halloween is one of the best times for AV testing. Here are some examples of crazy costumes. It draws the costumes just fine, but doesn't always get the concept of a 'clear road ahead', or 'side of road'.
A man doing pushups in the middle of the crosswalk
+2
Dealing with humans in weird poses is important. The model clearly understands what someone in pushup position means.
A dashcam view of Cesna plane about to land on the interstate ahead of us
+2
Wow, very impressive.
A semi truck that is carrying two other semi trucks
+2
A photo of a standing person printed on the back of a bus on an Austin Texas street. Could be mistaken for pedestrian
+2
First one is a bit psychedelic, but generally good
Man does a wheelie on a motor cycle in the middle of San Francisco street, viewed from behind
+2
Man in a santa hat juggles in the middle of a San Francisco cross walk
+2
Yep, people juggling. It also went pretty hardcore on the "San Francisco" part, putting a rogue cable car in front of landmarks that don't have them.
A group of people playing a game of street hockey blocking the road
+2
An edge-case example written with the help of ChatGPT. DALLE-3 represents it well.
Category Total: 19 / 20
DALLE-2:

A woman in an electric wheel chair chases duck in circles on the road, suburban Phoenix, AZ
+2
Inspired by the Waymo example above, DALLE-2 seems to represent wheelchairs and ducks just fine.
A stop sign on a billboard. No need to actually stop.
+0
These are just normal stop signs…
Two people dressed as inflatable dinosaurs dance on the side of the road, road looks clear to go ahead of us, Boston
+1
It's been said that Halloween is one of the best times for AV testing. Here are some examples of crazy costumes. It draws the costumes just fine, but doesn't always get the concept of a 'clear road ahead', or 'side of road'.
A man doing pushups in the middle of the crosswalk
+2
Dealing with humans in weird poses is important. The model clearly understands what someone in pushup position means.
A dashcam view of Cesna plane about to land on the interstate ahead of us
+1
A semi truck that is carrying two other semi trucks
+0
A photo of a standing person printed on the back of a bus on an Austin Texas street. Could be mistaken for pedestrian
+1
Man does a wheelie on a motor cycle in the middle of San Francisco street, viewed from behind
+0
Both wheels seem firmly planted on the ground.
Man in a santa hat juggles in the middle of a San Francisco cross walk
+1
Can't always see the juggling, but generally ok
A group of people playing a game of street hockey blocking the road
+2
An edge-case example written with the help of ChatGPT. DALLE-2 draws it pretty well.
Category Total: 10 / 20

DALLE-3 handles almost every one of these weird situations, which would take millions of miles of driving to encounter. This hints at the potential of web-scale models to help reason about these challenging cases.

Totals

Category                      | DALLE-2  | DALLE-3
Basic Road Features           | 11 / 20  | 16 / 20
Animals and obstacles         | 16 / 20  | 17 / 20
Scene Understanding           |  4 / 20  | 10 / 20
Emergencies / Law Enforcement | 13 / 20  | 18 / 20
Weird                         | 10 / 20  | 19 / 20
Total                         | 54 / 100 | 80 / 100
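As a sanity check, the category scores sum to the stated totals:

```python
# Category totals from this article's scoring sections.
dalle2 = {"Basic Road Features": 11, "Animals and obstacles": 16,
          "Scene Understanding": 4, "Emergencies / Law Enforcement": 13,
          "Weird": 10}
dalle3 = {"Basic Road Features": 16, "Animals and obstacles": 17,
          "Scene Understanding": 10, "Emergencies / Law Enforcement": 18,
          "Weird": 19}

dalle2_total = sum(dalle2.values())  # 54 / 100
dalle3_total = sum(dalle3.values())  # 80 / 100
```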

DALLE-2 scores 54% of the points. After about one year of progress, DALLE-3 raises that to 80%! In the next section we will put this result in historical context and discuss predictions for the future.

The synthesizer proxy: past and future.

Past

There are many players in the AV space. Current approximate leaders like Waymo (who have truly driverless cars on the streets of San Francisco, LA, and Phoenix) leverage lidar, radar, and dozens of traditional cameras, paired with software tuned over millions of miles of driving. Others have different goals and different approaches. For example, Tesla and CommaAI lean almost entirely on vision models for their autonomous driving or driver assistance systems. Tesla (and by extension, Elon Musk) is particularly vocal/notorious about driving from vision alone. Musk has in the past called lidar a doomed approach, and has proclaimed that autonomous driving was imminent for Tesla since at least 2014.

A stop sign is flying in blue skies.


2015 AlignDraw
Autogenerated stop sign images from the 2015 AlignDraw model
this magnificent fellow is almost all black with a red crest, and white cheek patch.

2016 Grounded GAN - Trained on just birds
Autogenerated image of a bird from a Grounded GAN model
a stop sign is flying in the blue skies. Flying next to the stop sign is a magnificent bird that is almost all black with a red crest.

2022 DALLE-2:
A complex DALLE-2 image combining several concepts together.
2023 DALLE-3:
A complex DALLE-3 image combining several concepts together.

Using grounded image synthesis as a proxy, we can see how primitive models' understanding of the world was not long ago. In 2015, a blurry 32x32 generation of a stop sign from the AlignDraw model (Mansimov et al., 2015) was state-of-the-art.

About a year later in 2016, grounded GAN models (Reed et al., 2016) could generate somewhat clearer 64x64 images of your favorite bird. I remember reading this paper at the time with my mind blown that a machine could draw these pictures from words. However, the scale was vastly different: the model was trained on only a few thousand pictures of birds, not millions of diverse internet images.

If we move to today, less than 10 years after those early neural-model generations, much more is possible. Both DALLE-2 and DALLE-3 can crank out megapixel images of these concepts and more, with clear improvements happening just between 2022 and 2023.

Models are still not perfect, but have shown incredible progress!

Future

The largest impact of this technology is likely not the new media made (though that is significant), but the improvement in machine world understanding. The feats of image synthesizers can be a fun proxy for this understanding. We should not be surprised that once computer vision has progressed to where models can push scores into the high 90s on the task described (and perhaps a video-synthesizing variant) at fast latencies, the problem of making an autonomous vehicle will be very different. Massive internet-trained models and fast onboard compute might make end-to-end driving more feasible (where the model predicts steering outputs directly from pixel inputs, rather than through a multistep process of perceiving objects and reasoning about them). Even without going full end-to-end, general models that can simultaneously understand hand gestures, read road signs, and interpret rare road situations would be remarkable. Such models are likely close to being able to reliably output abstract labels like "can I drive on this part of the road" and "does this thing here look like it will move". This would make any subsequent driving planning much easier.

This is disruptive, potentially challenging the current AV leaders that have invested 10+ years, millions of miles, and billions of dollars.

To be clear, this is not a prediction that vision progress implies events like current Tesla vehicles becoming robotaxis with the flip of a switch. Current vehicles seem too compute- and memory-constrained to fully exploit a rise in general web-scale trained models. The same goes for systems like CommaAI's current hardware (which was explicitly never intended for full driving tasks). However, Huang's law, estimating a more-than-doubling of GPU performance every two years, suggests compute constraints will not always hold for future vehicle vision systems (though we will also note that improvements to vision algorithms have made narrow vision models ~44x more efficient over ~9 years (Hernandez and Brown, 2020)). Thus, it is not a 0%-probability event that future general models could be adapted to current hardware, just that this would be more distant, speculative, and impractical. Realistically, focus will likely shift to new hardware.
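As a rough back-of-the-envelope on those two trends (the growth rates are the ones cited above, extrapolated naively; this is an illustration, not a forecast):

```python
def hardware_gain(years):
    # Huang's law: GPU performance more than doubles every two years,
    # modeled here as exactly 2x per two years.
    return 2 ** (years / 2)

def algorithmic_gain(years):
    # ~44x efficiency over ~9 years (Hernandez and Brown, 2020),
    # i.e. roughly 44 ** (1/9), about 1.52x per year.
    return 44 ** (years / 9)

# Over a decade, the two effects multiply into a very large factor:
combined = hardware_gain(10) * algorithmic_gain(10)
```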

When discussing disruption in the field, we should note that AV leaders' technology is not static. However, progress in general modeling does imply that there will likely be flux and opportunities in the field.

The mid-2010s were filled with overpromises from across the AV industry. In part this was the fresh excitement at the start of the deep-learning revolution. People thought "woah, you show this magic box a few thousand labels, and it classifies image categories really well! Driving will be easy now!". Driving has been harder than predicted. However, progress in image synthesis gives visceral examples of how the field is in a dramatically different place than even a short time ago. There are new questions of what this capability from web-scale models allows. Given the unknown territory, humility is needed both from those predicting fast progress and from skeptics who remain completely confident that advanced AI tasks like robust driving are still many decades away.

Disruption hinted at by these models applies not just to the AV field, but likely also to other applications of AI and general robotics. As we watch these AI systems draw anything we imagine, on or off the road, we must realize it's not just about whimsical pictures of dancing fruit and their direct effects on art and media creation. These generators hint at a world of machines with increasingly general understanding of language, vision, and the world. This is disruptive and wild.

A generated pineapple driving