Reality:
The AI was trained to answer "3" to this particular question correctly.
Wait until the AI gets burned on a different question. Skeptics will rightfully use it to criticize LLMs for just being stochastic parrots, until LLM developers teach their models to answer it correctly; then the AI bros will use it as proof that it's becoming "more and more human-like".
No but see they’re not skeptics, they’re just haters, and there is no valid criticism of this tech. Sorry.
And also you've just been banned from like twenty places for being A FANATIC "anti-AI shill". Genuinely check the mod log, these fuckers are cultists.
When we see LLMs struggling to identify which letters are in each of the tokens they emit, or to understand a word when there are spaces between each letter, we should compare it to a human struggling to read a word written in IPA (/sʌtʃ əz ðɪs/) even though we understand the same word perfectly well when it's spoken aloud.
But if you’ve learned IPA you can read it just fine
Honey, AI just did something new. It’s time to move the goalposts again.
Maybe OP was low on the priority list for computing power? Idk how this stuff works
Deep reasoning is not needed to count to 3.
It is if you’re creating ragebait.
o3-pro? Damn, that’s an expensive goof
Worked well for me
One of the interesting things I notice about the ‘reasoning’ models is their responses to questions occasionally include what my monkey brain perceives as ‘sass’.
I wonder sometimes if they recognise the triviality of some of the prompts they answer, and subtly throw shade.
One’s going to respond to this with ‘clever monkey! 🐒 Have a banana 🍌.’
What is that font bro…
It's called Sweetpea and my sweetpea picked it out for me. How dare I stick with something my girl picked out for me.
But the fact that you actually care what font someone else uses is sad
Chill bro, it's a joke 💀. It's like when someone uses Comic Sans as a font.
Ohh.
You’re shrodingers douchebag
Got it
I understand it's probably more user-friendly, and yet I still somehow find myself disappointed the answers weren't indexed from zero. Was this LLM written in MATLAB?
Nice Rs.
Is this ChatGPT o3-pro?
ChatGPT 4o
I asked it how many Ts are in the names of presidents since 2000. It said 4 and stated that "Obama" contains 1 T.
Toebama
People who think that LLMs having trouble with these questions is evidence one way or another about how good or bad LLMs are just don't understand tokenization. This is not a symptom of some big-picture, deep problem with LLMs; it's a curious artifact, like compression artifacts in a JPEG image, and it doesn't really matter for the vast majority of applications.
You may hate AI but that doesn’t excuse being ignorant about how it works.
Also, I just checked, and every OpenAI model bigger than 4.1-mini can answer this. I think the joke should emphasize that we developed a wildly power-inefficient way to solve problems that can be answered accurately and efficiently with a single algorithm. Another example is using ChatGPT to do simple calculator math. LLMs are good at specific tasks and really bad at others, but people kinda throw everything at them.
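For the record, the "single algorithm" really is a one-liner; a minimal sketch in plain Python (nothing model-specific assumed):

```python
# Counting a letter is a single linear pass over the string;
# no billion-parameter model required.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # -> 3
```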
And yet they can seemingly spell and count (small numbers) just fine.
The problem is that it's not actually counting anything. It's simply looking for text somewhere in its training data that relates to that word and the number of R's in it. There's no mechanism within the LLM to actually count things; it is not designed with that function. This is not general AI, this is a generative language model that's using its vast, vast store of text to put words together that sound like they answer the question that was asked.
What do you mean by spell fine? They're just emitting the tokens for the words. Like, it's not writing "strawberry", it's writing tokens <302, 1618, 19772>, which correspond to st, raw, and berry respectively. If you ask it to put a space between each letter, that will disrupt the tokenization mechanism, and it's going to be quite liable to make mistakes.
I don’t think it’s really fair to say that the lookup 19772 -> berry counts as the LLM being able to spell, since the LLM isn’t operating at that layer. It doesn’t really emit letters directly. I would argue its inability to reliably spell words when you force it to go letter-by-letter or answer queries about how words are spelled is indicative of its poor ability to spell.
what do you mean by spell fine?
I mean that when you ask them to spell a word they can list every character one at a time.
These sorts of artifacts wouldn't be a huge issue, except that AI is being pushed to the general public as an alternative means of learning basic information. The meme example is obvious to someone with a strong understanding of English, but learners and children might get an artifact and stamp it into their memory, working for years off bad information. A few false things every now and then is not a problem; that's unavoidable in learning. Accumulate thousands over long-term use, however, and your understanding of the world will be coarser, like Swiss cheese with voids so large it can't hold itself up.
You’re talking about hallucinations. That’s different from tokenization reflection errors. I’m specifically talking about its inability to know how many of a certain type of letter are in a word that it can spell correctly. This is not a hallucination per se – at least, it’s a completely different mechanism that causes it than whatever causes other factual errors. This specific problem is due to tokenization, and that’s why I say it has little bearing on other shortcomings of LLMs.
No, I'm talking about human learning and the danger posed by treating an imperfect tool as a reliable source of information, as these companies want people to do.
Whether the erroneous information comes from tokenization or hallucinations is irrelevant when this is already the main source for so many people's learning of, for example, a new language.
Hallucinations aren't relevant to my point here. I'm not claiming that AIs are a good source of information, and I agree that hallucinations are dangerous (either that, or misusing LLMs is dangerous). I also admit that for language learning, artifacts caused by tokenization could be very detrimental to the user.
The point I am making is that LLMs struggling with these kinds of tokenization artifacts is poor evidence for drawing any conclusions about their behaviour on other tasks.
That’s a fair point when these LLMs are restricted to areas where they function well. They have use cases that make sense when isolated from the ethics around training and compute. But the people who made them are applying them wildly outside these use cases.
These are pushed as a solution to every problem for the sake of profit, with intentional ignorance of these issues. If a few errors impact someone, that's just a casualty in the goal of making it profitable. That can't be disentangled from them unless you limit your argument to open-source local compute.
We gotta raise the bar, so they keep struggling to make it “better”
My attempt
0000000000000000
0000011111000000
0000111111111000
0000111111100000
0001111111111000
0001111111111100
0001111111111000
0000011111110000
0000111111000000
0001111111100000
0001111111100000
0001111111100000
0001111111100000
0000111111000000
0000011110000000
0000011110000000
Btw, I refuse to give my money to AI bros, so I don’t have the “latest and greatest”
Tested on ChatGPT o4-mini-high
It sent me this
0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0
0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0
1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0
I asked it to remove the spaces
0001111100000000
0011111111000000
0011111110000000
0111111111100000
0111111111110000
0011111111100000
0001111111000000
0011111100000000
0111111111100000
1111111111110000
1111111111110000
1111111111110000
1111111111110000
0011100111000000
0111000011100000
1111000011110000
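If you want to eyeball these grids without squinting at digits, here's a minimal sketch that reflows a flattened 0/1 string into rows and renders it with block characters (paste the digits from the comment above into the call):

```python
def render(flat: str, width: int = 16) -> None:
    # Reflow a flattened 0/1 bitmap into rows of block characters.
    for i in range(0, len(flat), width):
        print("".join("█" if c == "1" else " " for c in flat[i:i + width]))

# Demo with the first two rows from above; pass all 256 digits for the full image.
render("0001111100000000" + "0011111111000000")
```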
I guess I just murdered a bunch of trees and killed a random dude with the water it used, but it looks good
I really like checking these myself to make sure it’s true. I WAS NOT DISAPPOINTED!
(Total Rs is 8. But the LOGIC ChatGPT pulls out is ……. remarkable!)
This is deepseek model right? OP was posting about GPT o3
Yes this is a small(ish) offline deepseek model
Try it with o4-mini-high. It's made to think more like a human, checking its answer and working step by step, rather than just kinda guessing one like here
“Let me know if you’d like help counting letters in any other fun words!”
Oh well, these newish calls for engagement sure reach ridiculous extremes sometimes.
I want an option to select a Marvin the Paranoid Android mood: "there's your answer, now if you could leave me to wallow in self-pity"
Lol someone could absolutely do that as a character card.
Here I am, emissions the size of a small country, and they ask me to count letters…
AI is amazing, we’re so fucked.
/s
Singularity is here
How many times do I have to spell it out for you, ChatGPT? S-T-R-A-R-W-B-E-R-R-Y-R
Now ask how many asses there are in assassinations
It works if you use a reasoning model… but yeah, still ass
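For the record, the non-LLM answer is another one-liner; a sketch that counts overlapping matches with a regex lookahead:

```python
import re

# Zero-width lookahead counts matches even when they share letters.
print(len(re.findall(r"(?=ass)", "assassinations")))  # -> 2
```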
Ohh god, I never thought to ask reasoning models.
DeepSeek-R1 7b was gold too
Oh god, the asses multiplied 🤣🤣🤣
It's painful how very Reddit that is…
So,
Now,
Alright,
Then 14b. Man, sooo close…
I wonder how Qwen 3 performs, since it apparently surpasses DeepSeek
I don't have any other models pulled down; if it's open I'll try it and report back here
Alr
It did quite well for this.
And people are trusting these things to do jobs / parts of jobs that humans used to do.
Humans are pretty dumb sometimes lol
It’s far better at the use of there, their, and they’re.
The average US citizen couldn't craft a professional-sounding document if their life depended on it.
It's not better than a professional at anything, but the average human is far below that bar.
I like the way you worded that a lot
Man AI is ass at this
*laugh track*