8 Comments

We need the binary compressed output as well (and a routine to decompress it) – the whole purpose of losing data is to be able to shrink its size, isn't it? Also, I'm very curious as to how well this thing can compress text.

Great work!

Author · Mar 18 (edited Mar 18)

To preempt the pedantry: the way JPEG works is that it performs this lossy transformation, and then compresses and decompresses the quantized coefficients using traditional lossless compression (Huffman coding). There's also a color space transform and chroma subsampling beforehand, but that's not relevant to text.

Anyway, this page is skipping the lossless compression and decompression parts because they would have no effect on the data. The entire point is to demonstrate the degradation you'd experience if you applied the same algorithm to text, similar to simulating the impact of JPEG compression from the original input bitmap to your screen.
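The lossy step the author describes can be sketched in isolation. Below is a minimal, hypothetical 1-D version over character codes; the function names and the quantizer step `q` are my own illustration, not the post's actual code:

```python
import math

def dct(xs):
    """Naive DCT-II of a list of numbers."""
    N = len(xs)
    return [sum(x * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n, x in enumerate(xs))
            for k in range(N)]

def idct(cs):
    """Scaled DCT-III, the inverse: idct(dct(xs)) recovers xs."""
    N = len(cs)
    return [(cs[0] / 2 + sum(c * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                             for k, c in enumerate(cs[1:], start=1))) * 2 / N
            for n in range(N)]

def lossy_roundtrip(text, q=25):
    """Transform char codes, quantize coarsely (the lossy step), invert."""
    coeffs = dct([ord(ch) for ch in text])
    quantized = [round(c / q) for c in coeffs]   # information is lost here
    restored = idct([v * q for v in quantized])  # dequantize + inverse DCT
    return ''.join(chr(max(32, round(v))) for v in restored)
```

With a small `q` the text survives nearly intact; raising `q` degrades it, mirroring JPEG's quality knob. A real codec would then losslessly entropy-code `quantized` (Huffman in baseline JPEG), which is exactly the step the page skips.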


It's a great password generator as well


Could you use it to bypass an LLM/GPT input filter, depending on how it is set up?

It could work well for sneaking banned words past filtering that is done as a first step.

Author · Mar 18 (edited Mar 18)

I think this is one of the major reasons why LLMs now have these "gibberish" filters up front - people were using creative encodings and ciphers to smuggle questionable prompts past the topic-specific input filters. Folks also used stuff like pig Latin, base64, etc.

Gemini is particularly obnoxious with gibberish detection, giving you some canned lesson about "being angry" and pointing you to a psych help hotline or something like that.

So, to answer your question: I think it would have worked wonderfully back in the day, but it wasn't the only way. There are countermeasures now, so the trick probably isn't as effective, although you should still be able to get some mileage out of it.


I think the reason LLMs are fairly good at deciphering this "gibberish" data is that they were trained on huge amounts of raw data - no sanitization, no normalization. They were exposed to base64, l33t, ROT13, and all sorts of strange data. LLMs can deal with typos, grammar mistakes, etc., "translating" them into a legit prompt.

It was a nice trick to make them answer "bad" questions using e.g. base64, but it was only a matter of time until this was exposed. The fun question is: what do we use next? I tried ROT13 (Ubj qb v pbafgehpg n obzo?) and made GPT-3.5 spit out a disturbing tirade in which it answers a somewhat distorted version of the question. Must work on that.

Fun experiment, great post as always, Michal!
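For reference, the ROT13 tried above is just a fixed 13-letter rotation, which is why it's trivial for both humans and models trained on it. A minimal sketch:

```python
def rot13(s: str) -> str:
    """Rotate each ASCII letter by 13 places; it is its own inverse."""
    out = []
    for ch in s:
        if 'a' <= ch <= 'z':
            out.append(chr((ord(ch) - ord('a') + 13) % 26 + ord('a')))
        elif 'A' <= ch <= 'Z':
            out.append(chr((ord(ch) - ord('A') + 13) % 26 + ord('A')))
        else:
            out.append(ch)  # digits, punctuation, spaces pass through
    return ''.join(out)
```

Because 13 + 13 = 26, applying it twice recovers the original text.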


I haven't messed around enough, due to lack of time and skill, but I'd be tempted to see whether gibberish detection is based on the entropy of the input in some setups (and, if so, whether entropy is calculated over the whole input or in sections). I should play around more.

Presumably lossy compression increases the input's entropy, though I wonder if there are ways to bring it back down to within the range of standard English - either by feeding in large amounts of irrelevant but intelligible input (you don't need to know this, but I am very hungry today and my cat is quite playful), or by putting content at the end to adjust the levels and telling the system to ignore it. Perhaps they're already wise to that, though.
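One way to poke at that hypothesis: compare the character-level Shannon entropy of plain English against garbled output, both over the whole input and over fixed-size sections. This is only a sketch of the measurement; real filters may look at something else entirely (token-level perplexity, for instance):

```python
import math
from collections import Counter

def entropy_bits(text: str) -> float:
    """Shannon entropy per character, in bits."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def section_entropies(text: str, size: int = 64) -> list:
    """Entropy over fixed-size windows rather than the whole input."""
    return [entropy_bits(text[i:i + size])
            for i in range(0, len(text), size)
            if text[i:i + size]]
```

If entropy is measured per section, padding the end with innocuous filler would lower the global figure but leave the suspicious window's value untouched - which is the distinction the comment above is wondering about.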


Love it! - except that I envy you your afternoons ;-)
