The ‘strawberrry’ problem: How to overcome AI’s limitations

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

By now, large language models (LLMs) like ChatGPT and Claude have become an everyday word across the globe. Many people have started worrying that AI is coming for their jobs, so it is ironic to see almost all LLM-based systems flounder at a straightforward task: Counting the number of “r”s in the word “strawberry.” They are not exclusively failing at the alphabet “r”; other examples include counting “m”s in “mammal”, and “p”s in “hippopotamus.” In this article, I will break down the reason for these failures and provide a simple workaround.

LLMs are powerful AI systems trained on vast amounts of text to understand and generate human-like language. They excel at tasks like answering questions, translating languages, summarizing content and even generating creative writing by predicting and constructing coherent responses based on the input they receive. LLMs are designed to recognize patterns in text, which allows them to handle a wide range of language-related tasks with impressive accuracy.

Despite their prowess, failing at counting the number of “r”s in the word “strawberry” is a reminder that LLMs are not capable of “thinking” like humans. They do not process the information we feed them like a human would.

Conversation with ChatGPT and Claude about the number of “r”s in strawberry.

Almost all the current high performance LLMs are built on transformers. This deep learning architecture doesn’t directly ingest text as their input. They use a process called tokenization, which transforms the text into numerical representations, or tokens. Some tokens might be full words (like “monkey”), while others could be parts of a word (like “mon” and “key”). Each token is like a code that the model understands. By breaking everything down into tokens, the model can better predict the next token in a sentence.

LLMs don’t memorize words; they try to understand how these tokens fit together in different ways, making them good at guessing what comes next. In the case of the word “hippopotamus,” the model might see the tokens of letters “hip,” “pop,” “o” and “tamus”, and not know that the word “hippopotamus” is made of the letters — “h”, “i”, “p”, “p”, “o”, “p”, “o”, “t”, “a”, “m”, “u”, “s”.

A model architecture that can directly look at individual letters without tokenizing them may potentially not have this problem, but for today’s transformer architectures, it is not computationally feasible.

Further, looking at how LLMs generate output text: They predict what the next word will be based on the previous input and output tokens. While this works for generating contextually aware human-like text, it is not suitable for simple tasks like counting letters. When asked to answer the number of “r”s in the word “strawberry”, LLMs are purely predicting the answer based on the structure of the input sentence.

Here’s a workaround

While LLMs might not be able to “think” or logically reason, they are adept at understanding structured text. A splendid example of structured text is computer code, of many many programming languages. If we ask ChatGPT to use Python to count the number of “r”s in “strawberry”, it will most likely get the correct answer. When there is a need for LLMs to do counting or any other task that may require logical reasoning or arithmetic computation, the broader software can be designed such that the prompts include asking the LLM to use a programming language to process the input query.

Conclusion

A simple letter counting experiment exposes a fundamental limitation of LLMs like ChatGPT and Claude. Despite their impressive capabilities in generating human-like text, writing code and answering any question thrown at them, these AI models cannot yet “think” like a human. The experiment shows the models for what they are, pattern matching predictive algorithms, and not “intelligence” capable of understanding or reasoning. However, having a prior knowledge of what type of prompts work well can alleviate the problem to some extent. As the integration of AI in our lives increases, recognizing its limitations is crucial for responsible usage and realistic expectations of these models.

Chinmay Jog is a senior machine learning engineer at Pangiam.

DataDecisionMakers

Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including the technical people doing data work, can share data-related insights and innovation.

If you want to read about cutting-edge ideas and up-to-date information, best practices, and the future of data and data tech, join us at DataDecisionMakers.

You might even consider contributing an article of your own!