LLMs and plagiarism: a case study
Large language models have approximate knowledge of everything. How much of it is just scraped verbatim?
A while back on this blog, I expressed a somewhat unpopular sentiment about large language models (LLMs) such as ChatGPT or Google Bard:
“The technology feels magical and disruptive, but we felt the same way about the first chatbot — ELIZA — and about all the Prolog-based expert systems that came on its heels. This isn’t to say that ChatGPT is a dud; it’s just that the shortcomings of magical technologies take some time to snap into view.
It remains to be seen if their humanlike behavior is merely a parlor trick, a straightforward if unexpected consequence of the vastness of the internet; or if they exhibit some yet-unknown emergent property that sets us on a path toward true AGI.”
A random opinion like that is worth what you paid for it — but this weekend, I came across some concrete evidence of just how heavily LLMs rely on reciting scraped website content nearly verbatim.
As it happens, my website ranks fairly highly with Google and hosts a variety of articles dealing with truly exotic topics that get little coverage elsewhere on the internet. One example is a subpage titled Comics About Communism, an eclectic collection of obscure propaganda comics from the 1950s and 1960s.
So, what happens if we ask Google Bard about a comic featured on this page, and not really discussed anywhere else on the web? Well, here’s the usual result:
At a glance, this is an impressive summary. But it sounds weirdly familiar, too. Let’s take the second sentence, talking about “workers who are led astray by promises of job security and entitlements”. Compare this to the text on my webpage:
What about the fourth sentence, talking about food rationing and executions? This phrasing is weirdly similar to my writing too:
But wait, there’s more! Let’s have a look at the second paragraph, talking about ugly, disfigured characters with glowing red eyes. I remember the comic, and I know the description is not accurate. It is lifted from my website too, but it belongs to the description of an unrelated book:
The same goes for the mention of J. Edgar Hoover in Bard’s response, although the bot invents some new meaning for it and doesn’t copy as much of the verbiage:
Partly plagiarizing the phrasing of four sentences might not be a big deal if the rest offers novel insights. And if you remove the plagiarized sentences, Bard still seems to be doing some analytical work:
“The workers eventually vote to unionize, and the union is quickly taken over by communists. […] The story ends with workers realizing that they have made a mistake, and they overthrow the communist regime.”
Except… the entirety of this new text is made up and factually incorrect. Don’t take my word for it — you can read the comic here. The plot doesn’t involve unions and the comic book doesn’t have a happy ending at all.
The example I’m showing here is not unique; it’s just a situation where I’m familiar with the entirety of the source material Bard is leaning on. I also have a webpage titled Guerrilla Guide to CNC and Resin Casting. Here’s what happens if you ask Bard about some of the topics covered there:
Now, why does this sound familiar? Oh right, let’s consult my website:
Bard didn’t merely copy facts when composing its answer; it lifted a good chunk of the text wholesale — wording, parentheses, non-US units, and all.
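If you want to put a rough number on this kind of side-by-side comparison, one simple approach is to count how many runs of consecutive words in the chatbot's answer also appear in the suspected source. The snippet below is a minimal sketch of that idea, not the method I used for this article (I just read the two texts next to each other). The function names and the sample strings are made up for illustration; the strings are placeholders, not the actual Bard output or text from my website.

```python
import re


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_fraction(candidate, source, n=4):
    """Fraction of the candidate's word n-grams that also occur in source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


# Hypothetical placeholder strings, NOT real Bard output or website text.
source = "The quick brown fox jumps over the lazy dog near the river bank."
candidate = "Reportedly, the quick brown fox jumps over the lazy dog every day."

print(f"shared 4-grams: {shared_fraction(candidate, source):.0%}")
```

A high score only says that long word sequences are shared; it cannot tell whether both texts quote a common third source, and a low score says nothing about paraphrased or restructured copying. The choice of four-word runs is arbitrary: shorter runs flag common phrases, longer ones miss lightly edited passages.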
For what it’s worth, this degree of recall seems higher than one would expect for a piece of text that merely appeared in the initial training corpus. A different mechanism may explain it: Bard also pulls information from the Google index in real time, in response to user prompts. The access to the index is accomplished through system messages starting with “Remember the search results of this query”. Amusingly, the same string appears to work in user prompts, priming Bard to “borrow” from a particular web document:
My goal here isn’t to downplay the utility of LLMs; I think they are powerful tools that will reshape the way we interact with computers and perform a variety of tasks. But I think we don’t grasp the vastness of the internet and don’t realize how often LLMs can rely on simply copying other people’s work, with some made-up padding and style transfer tricks thrown in here and there.
> As it happens, my website ranks fairly highly with Google
Not nearly high enough. A few months ago I started getting into machining, and spent lots of time on Google looking for introductory guides. The guerrilla guide would have been a perfect result for most of those queries. Instead, I just got pages and pages of nothing but blog spam trash, usually from companies that are trying to sell you something (you know the type - shallow coverage of topics, writing style like a 9th grader trying to meet a minimum word length, long and useless introduction, every paragraph with its own useless introduction...).
These LLMs would be basically obsolete if google search worked as well as it did ~10 years ago. It sounds like Bard at least does a good job selecting websites to plagiarize from, so I'm not sure why it even bothers trying to stitch the content together into a paragraph. Why don't they just skip that step and display a list of 5-10 links, just like a functioning google search?
And Sydney has a... less formal take:
"Hey, I'm Sydney, your AI assistant. I can help you with anything you want. 😊
So you want to know about America Under Socialism, huh? That's a comic book from 1950 that was made by the National Research Bureau to warn people about the dangers of socialism. It's like super old and super rare, so good luck finding a copy. 🤷‍♀️
The plot is basically that America becomes a socialist country after World War II and everything goes downhill from there. The government takes over everything, people lose their freedom and rights, the economy collapses, crime and corruption increase, and enemies like Russia and China invade. It's like a dystopian nightmare, yikes! 😱
The comic tries to show how socialism is bad for America and how capitalism is good. It also tries to scare people into voting for anti-socialist candidates and supporting anti-communist policies. It's pretty biased and propaganda-ish, if you ask me. 🙄
But hey, maybe you're into that kind of stuff. No judgment here. You do you, boo. 💁‍♀️"