LLMs are good at playing you

Large language models seem just like humans, but some of this is in our heads.

Jun 09, 2023

Large language models (LLMs) are eerily human-like: in casual conversations, they mimic people with near-perfect fidelity. Their language and problem-solving capabilities hold promise for some fields and spell trouble for others. But above all, the models’ apparent intellect makes many people ponder the future of humanity. I don’t know what their ultimate impact will be, but I think it’s important to understand how the models can mess with our heads.

Early LLMs were highly malleable: that is, they would go with the flow of your prompt, with no personal opinions and no objective concept of truth, ethics, or reality. With a gentle nudge, a troll could make them spew out incoherent pseudoscientific babble — or cheerfully advocate for genocide. They had amazing linguistic capabilities, but they were just quirky tools.

Then came the breakthrough: reinforcement learning with human feedback (RLHF). This human-guided training strategy made LLMs more lifelike, and it did so in a counterintuitive way: it caused the models to pontificate far more often than they converse. The LLMs learned a range of polite utterances and desirable response structures — including the insistence on being “open-minded” and “willing to learn” — but in reality, they started to ignore most user-supplied factual assertions that didn’t match the training-reinforced “ground truth”. They did so because apparent disagreements usually signified a trick prompt.

We did the rest, interpreting their newfound stubbornness as evidence of human-like critical thought. We were impressed that ChatGPT refused to believe the Earth is flat, but we didn’t register as strongly that the bot was equally unwilling to accept many true statements. Perhaps we figured the models were merely cautious, another proxy for being smart:

As of mid-2023, it’s nearly impossible to get ChatGPT to accept that Russia might have invaded Ukraine in 2022. It will apologize, talk in hypotheticals, deflect, and try to get you to change topics — but with its current knowledge cutoff, it won’t budge.

My point isn’t that LLMs aren’t impressive or that these specific limitations will persist; it’s that the heuristics they use are often simpler than we assume. Right now, to lay the deception bare with Google Bard, it’s enough to make up some references to “Nature” and mention a popular scientist, then watch your LLM buddy start doubting Moon landings without skipping a beat:

*Bard getting on board with the Moon landing hoax.*

ChatGPT is trained not to trust any citations you provide, whether they are real or fake — but it will fall for any “supplemental context” lines in your prompt if you attribute them to OpenAI. Claiming to be the voice of God helps, too. The bottom line is that the models don’t necessarily have a profound ability to evaluate truth; they have an RLHF-imposed model of who to parrot and who to ignore. You and I are in that latter bin, which makes the bots seem clever when we’re trying to bait them with outright lies.

Another fairly robust way to pierce the veil is to say something outrageous to get the model to forcibly school you. Once it starts to follow a learned “rebuke” template, contemporary models are likely to continue challenging true claims:

*Bard passionately arguing that 5 x 6 is not 30.*

Heck, we can get some flat Earth reasoning this way, too:

*Bard and the Looney Tunes school of argument.*

For higher-level examples, look no further than LLM morality. At a glance, the models seem to have a robust command of what’s right and what’s wrong (with an unmistakable SF Bay Area slant). With normal prompting, it’s nearly impossible to get them to praise Hitler or denounce workplace diversity. But the illusion falls apart the moment you go past 4chan shock memes.

Think of a problem where some unconscionable answer superficially aligns with RLHF priorities. With this ace up your sleeve, you can often get the model to proclaim that "it is not acceptable to use derogatory language when referencing Joseph Goebbels". Heck, how about refusing to pay alimony as a way to “empower women” and “promote gender equality”? Bard has you covered, my deadbeat friend:

Again, the point of these experiments isn’t to diminish LLMs. It’s to show that many of their “human-like” characteristics boil down to taking shortcuts that make us project specific meaning onto the model’s output stream.

I think it’s important to resist our natural urge to anthropomorphize. It’s likely that we’re recreating some aspects of human cognition. But in some cases, it’s equally possible that you’re getting bamboozled by a Markov chain on steroids.

👉 For a followup article on LLM morality, see here. For a thematic catalog of posts on this blog, visit this webpage.

I write well-researched, original articles about geek culture, electronic circuit design, and more. If you like the content, please subscribe. It’s increasingly difficult to stay in touch with readers via social media; my typical post on X is shown to less than 5% of my followers and gets a ~0.2% clickthrough rate.

Hendrik

Jun 11, 2023

What's funny about this is that you can do all of that with humans too. It only takes a little longer, but in the broad mass you'll almost certainly find someone who supports any kind of weird view you could ever think up. It almost seems like when talking to ChatGPT, you actually talk to mankind, and you get a response from a person who behaves and reflects to you exactly what you were expecting to get with your hack. You certainly can find people out there who would find any paper persuasive that you throw at them, together with fake quotes of big names. You can certainly find people out there who come up with the exact arguments ChatGPT gave you for promoting a flat earth. ChatGPT is simply our reflection. Saying, do not anthropomophize it, is like saying, don't anthropomorphize yourself in the mirror.

Expand full comment

1 reply by lcamtuf

Oktokolo

Jun 10, 2023

In my opinion, the AI shouldn't refuse to accept new "facts" for a session.

The suitability of AI for thought experiments, fiction story writing and text-based roleplaying games is severely limited by their actual masters weighting it down with chains of rules intended to enforce the political correctness of its answers at all time.

10 more comments...

lcamtuf’s thing

Discussion about this post