ChatGPT Can Finally See
After months of testing, OpenAI rolled out “advanced voice” mode for ChatGPT back in September. The feature lets you have real-time conversations with ChatGPT: You can interrupt the bot while it’s “speaking” to ask another question, and it understands your tone of voice, which informs both its responses and the inflection it uses. (It’s very creepy when it laughs.)
One feature of advanced voice mode has been missing since launch, however. When OpenAI first announced the perk back in May, it showed off how ChatGPT would be able to access your camera and “see” the world around you. While chatting with the bot, you could point your camera at something, ask a question, and ChatGPT would answer as best it could. Seven months later, this capability is here, and it’s frighteningly impressive.
To access it, you’ll need a paid subscription to ChatGPT—either Plus ($20 per month) or Pro ($200 per month). ChatGPT Team subscribers are also eligible. The feature may not be available on your end right away, even if you pay, since OpenAI is rolling it out over time.
Testing out ChatGPT advanced voice mode’s vision feature
Accessing the camera is pretty straightforward once it rolls out to your account. You launch advanced voice mode the same way you always do, using the waveform icon in the bottom-right of the chat. From here, you’ll see a new camera icon, which, of course, launches the live camera feed. This doesn’t interrupt the chat: You can be in the middle of a conversation with ChatGPT, open the camera, and continue gabbing away, only now with the camera feed as part of the conversation.
The first time I used this, I pointed the camera at a Nintendo Switch box I had nearby, with an iPhone cable and my Magic Trackpad resting on top of it, and asked, “What is this?” ChatGPT said: “It looks like a Nintendo Switch OLED box with some cables and a laptop on top. Are you planning on setting it up?” Two out of three correct, as it mistook my trackpad for a laptop, but hey, close enough. Next up, I pointed it at my water bottle, and asked it to identify what I was highlighting: “That looks like a black Hydro Flask bottle. It’s great for keeping drinks cold or hot! Do you take it with you often?”
I asked a follow-up: “Do you know what model of Hydro Flask this is?” ChatGPT: “I can’t be certain of the exact model, but it looks like one of their wide-mouth bottles, probably around 32 ounces. It’s definitely designed to keep your drinks at the right temperature for hours.” That…is basically right, and I’m not entirely comfortable that it guessed the size correctly, either.
I moved on to my keyboard, which ChatGPT accurately stated was an Apple Magic Keyboard. I asked which keys it could see, and it named a handful, but not all, of the keys I had in frame. So, I asked how many keys it could see, and it said “about 30,” when there were actually 26. So, again, close.
It was able to identify the MagSafe port on my MacBook, as well as the two USB ports and the headphone jack to its right. It recognized the air vent in my ceiling, and the specific type of boots I had by my front door. All in all, it basically recognized everything I tested it on—minus the trackpad.
Advanced voice mode’s sight is fast
But beyond recognition, I think what startled me the most was the speed of these responses. You ask ChatGPT to identify something, and it does, sometimes quicker than if you asked a real person to do it. Sometimes, the bot will hold onto a word for a moment (e.g. “I thiiiiiiiiink that’s a…”), which is probably a trick to buy time while ChatGPT processes the rest of what it wants to say. I’ve also caught it being less sure of itself with its first response: I pointed it at my Magic Mouse, and its first guess was a generic computer mouse. But when I asked what brand it was, it didn’t just specify Apple; it said it was an Apple Magic Mouse, known for its “sleek design” and “touch-sensitive surface.”
All things considered, though, these responses are often near-instantaneous, which speaks to how powerful OpenAI’s models are these days. I’m still largely an AI skeptic, but this was the first development in a while that impressed me—and I’m torn about how I feel about that.
On the one hand, I could see this tech being used for good. Imagine how helpful something like this could be for users who are blind or have impaired vision, especially in a convenient device like smart glasses. Someone could ask their AI assistant which direction they’re facing, have it read the menu at a restaurant, or check whether it’s safe to cross the street. Tech like this could change search for the better, and make it easier to learn new things about the world just by pointing a smartphone camera at a subject.
On the flip side, my mind turns to the negative, especially since AI is still prone to hallucination. As more and more people use this technology, they will inevitably experience the mistakes AI can make, and if they’re relying on the bot to help them with tasks—especially anything that involves their safety—hallucinations can be dangerous. I didn’t experience any large errors, just the trackpad mixup, and Anderson Cooper found that the bot made a mistake on a geometry problem (again, not a huge issue). But it’s a good reminder that as this tech improves rapidly, its inherent flaws raise the stakes for failure.
Perhaps that’s why every live camera session warns you not to use the feature for anything involving safety.