These days everyone’s boss seems to want some form of GenAI in their products. That doesn’t always make sense. However, understanding when it does and when it doesn’t is not obvious even for us experts, and nearly impossible for everyone else.
How can we help our colleagues understand the pros and cons of this tech, and figure out when and how it makes sense to use it?
In this post I am going to outline a narrative that explains LLMs without technicalities and helps you frame some high-level technical decisions, such as RAG vs finetuning, or which specific model size to use, in a way that a non-technical audience can not only grasp but also reason about. We’ll start by “translating” a few terms into their “human equivalent” and then use this metaphor to reason about the differences between RAG and finetuning.
Let’s dive in!
LLMs are high-school students Link to heading
Large Language Models are often described as “super-intelligent” entities that know far more than any human could possibly know. This leads many people to believe that they are also extremely intelligent and can reason about anything in a super-human way. The reality is very different: LLMs can memorize and repeat far more facts than humans can, but their ability to reason is often inferior to that of the average person.
Rather than describing LLMs as all-knowing geniuses, it’s much better to frame them as an average high-school student. They’re not the smartest humans on the planet, but they can help a lot if you guide them through the process. And just as a normal person might, sometimes they forget things, and occasionally they remember them wrong.
Some LLMs are smarter than others Link to heading
Language models are not all born equal. Some are inherently able to do more complex reasoning, to learn more facts and to talk more smoothly in more languages.
The “IQ” of an LLM can be approximated, more or less, by its parameter count. An LLM with 7 billion parameters is almost always less clever than a 40 billion parameter model, will have a harder time learning facts, and will be harder to reason with.
However, just like with real humans, there are exceptions. Recent “small” models can easily outperform older and larger ones, thanks to improvements in the way they’re built. Also, some small models are very good at a very specialized job and can outperform a large, general-purpose model on that task.
LLMs learn by “studying” Link to heading
Another similarity to human students is that LLMs also learn all the facts they know by “going to school” and studying a ton of general and unrelated facts. This is what training an LLM means. It implies that, just like students, an LLM needs a lot of varied material to study from. This material is what is usually called “training data” or the “training dataset”.
They can also learn more than they currently know and specialize in a topic: all they need to do is study it further. This is what finetuning represents, and as you would expect, it also needs some study material. This is normally called the “finetuning data” or “finetuning dataset”.
The distinction between training and finetuning is not so much about how it’s done, but mostly about the size and contents of the dataset required. The initial training usually takes a lot of time, computing power, and tons of very varied data, just like what’s needed to bring a baby up to the level of a high-schooler. Finetuning instead looks like preparing for a specific test or exam: the study material is much smaller and a lot more specific.
Keep in mind that, just like for humans, studying more can make a student a bit smarter, but it won’t make them a genius. In many cases, no amount of training and/or finetuning can close the gap between the 7 billion parameter version of an LLM and the 40 billion one.
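If you do want to peek past the metaphor, here is a minimal sketch of what “studying a specific subject” (finetuning) can look like in practice. It assumes the Hugging Face transformers and datasets libraries; the model name distilgpt2 and the two example texts are illustrative placeholders for a small “student” model and its study material, not a real recipe.

```python
# Minimal finetuning sketch (assumes the `transformers` and `datasets` libraries;
# the model name and example texts below are illustrative only).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilgpt2"  # a small "student" model, chosen just for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 style models have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# The "finetuning dataset": a handful of domain-specific study material.
# Real finetuning needs far more (and far better) examples than this.
texts = [
    "Q: What does RAG stand for? A: Retrieval augmented generation.",
    "Q: What is a hallucination? A: A made-up answer that sounds plausible.",
]

def tokenize(batch):
    encoded = tokenizer(batch["text"], truncation=True,
                        padding="max_length", max_length=64)
    encoded["labels"] = encoded["input_ids"].copy()  # causal LM: learn to reproduce the text
    return encoded

dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True)

# "Studying" the new material: a short, cheap training run on top of the
# knowledge the model already acquired during its original training.
args = TrainingArguments(output_dir="finetuned-student",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=dataset).train()
```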
Every chat is an exam Link to heading
One of the most common use cases for LLMs is question answering, an NLP task where users ask questions to the model and expect a correct answer back. The fact that the answer must be correct means that this interaction is very similar to an exam: the LLM is being tested by the user on its knowledge.
This means that, just like a student, when the LLM is used directly it has to rely on its own knowledge to answer the question. If it studied the topic well, it will answer accurately most of the time. However, if it didn’t study the subject, it will do what students always do: make up something that sounds legit, hoping the teacher won’t notice how little they know. This is what we call hallucinations.
When the user knows the correct answer, the LLM’s reply can be graded, just like in a real exam, to help the LLM improve. This process is called evaluation. Just like with humans, there are many ways in which the answer can be graded: the LLM can be graded on the accuracy of the facts it recalled, on the fluency of its answer, or on the correctness of a reasoning exercise, like a math problem. These ways of grading an LLM are called metrics.
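For readers who want a concrete picture, here is a minimal sketch of one such metric: exact-match accuracy over a small Q&A “exam”. The ask_llm function is a hypothetical placeholder for whatever model or API you actually use; the questions are made up for illustration.

```python
# Minimal sketch of "grading the exam": an exact-match accuracy metric.
# `ask_llm` is a hypothetical stand-in for a real model or API call.
def ask_llm(question: str) -> str:
    raise NotImplementedError("replace with a real model call")

def exact_match_accuracy(exam: list[tuple[str, str]]) -> float:
    """Fraction of questions where the model's answer matches the expected one."""
    correct = 0
    for question, expected in exam:
        answer = ask_llm(question)
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(exam)

# The "exam": questions whose answers the grader already knows.
exam = [
    ("In which year did the French Revolution start?", "1789"),
    ("What is the capital of Japan?", "Tokyo"),
]
# print(exact_match_accuracy(exam))  # e.g. 0.5 if one answer out of two is right
```

Exact match is only one possible metric; fluency or reasoning correctness would be graded differently, but the idea is the same: compare the student’s answer against what the teacher expects.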
Making the exams easier Link to heading
Hallucinations are very dangerous if the user doesn’t know what the LLM was supposed to reply, so they need to be reduced, possibly eliminated entirely. It’s like we really need the students to pass the exam with flying colors, no matter how much they studied.
Luckily there are many ways to help our student succeed. One way to improve the score is, naturally, to make them study more and better. Giving them more time to study (more finetuning) and better material (better finetuning datasets) is one good way to make LLMs reply correctly more often. The issue is that this method is expensive, because it needs a lot of computing power and high quality data, and the student may still forget something during the exam.
We can make the exams even easier by converting them into open-book exams. Instead of asking the students to memorize all the facts and recall them during the exam, we can let them bring the book and look up the information they need when the teacher asks the question. This method can be applied to LLMs too and is called RAG, which stands for “retrieval augmented generation”.
RAG has a few interesting properties. First of all, it can make it very easy even for “dumb”, small LLMs to recall nearly all the important facts correctly and consistently. By letting your students carry their history books into the exam, all of them will be able to tell you the date of any historical event by just looking it up, regardless of how smart they are or how much they studied.
RAG doesn’t need a lot of data, but you need an efficient way to access it. In our metaphor, you need a well structured book with a good index to help the student find the correct facts when asked, or they might fail to find the information they need when they’re quizzed.
A trait that makes RAG unique is that it can be used to keep the LLM up to date with information that can’t be “studied” because it changes too fast. Let’s imagine a teacher who wants to quiz the students about today’s stock prices. They can’t expect the pupils to know them if they don’t have access to the latest financial data. Even if they were to study the prices every hour, the result would be quite pointless, because the knowledge they acquire becomes irrelevant immediately and might even confuse them.
Last but not least, RAG can be used together with finetuning. Just as a teacher can make students study the topic and then also bring the book to the exam to make sure they will answer correctly, you can also use RAG and finetuning together.
However, there are situations where RAG doesn’t help. For example, it’s pointless if the questions are asked in a language that the LLM doesn’t know, or if the exam is made of tasks that require complex reasoning. This is true for human students too: books won’t help them understand a foreign language well enough to take an exam in it, and won’t be useful for cracking a hard math problem. For these sorts of exams the students just need to be smart and study more, which in LLM terms means that you should prefer a large model and you probably need to finetune it.
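For the technically curious, here is a minimal sketch of what the “open-book exam” looks like in code. The ask_llm function is a hypothetical placeholder for a real model call, and the word-overlap retriever and the tiny “book” are deliberately naive stand-ins for a proper index and document store.

```python
# Minimal RAG sketch: find the most relevant "page", hand it to the model
# together with the question, and let the model answer from that context.
# `ask_llm` is a hypothetical stand-in for a real model or API call.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

# The "book" the student is allowed to bring to the exam.
book = [
    "The French Revolution started in 1789.",
    "ACME Corp's stock closed at 123.45 USD today.",
]

def retrieve(question: str, pages: list[str]) -> str:
    """Return the page sharing the most words with the question (a toy index)."""
    q_words = set(question.lower().split())
    return max(pages, key=lambda page: len(q_words & set(page.lower().split())))

def answer_with_rag(question: str) -> str:
    page = retrieve(question, book)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context: {page}\n"
        f"Question: {question}\n"
    )
    return ask_llm(prompt)

# answer_with_rag("When did the French Revolution start?")
```

Notice that the model itself is untouched: RAG only changes what lands in the prompt, which is why it helps with recall but doesn’t make the student any smarter.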
Telling a story Link to heading
Let’s recap the terminology we used:
- The LLM is a student
- The LLM’s IQ corresponds to its parameter count
- Training an LLM is the same as making a student go to school
- Finetuning means making it specialize in a subject by studying only books and other material on that subject
- A training dataset is the books and material the student needs to study on
- User interactions are like school exams
- Evaluating an LLM means scoring its answers as if they were responses to a test
- A metric is a type of evaluation that focuses on a specific trait of the answer
- A hallucination is a wrong answer that the LLM makes up, just like a student would, in order to try to pass an exam when it doesn’t know the answer or can’t recall it in that moment
- RAG (retrieval augmented generation) is like an open-book exam: it gives the LLM access to some material on the question’s topic, so it won’t need to hallucinate an answer. It will help the LLM recall facts, but it won’t make it smarter.
Drawing a parallel with a human student can make it a lot easier to explain to a non-technical audience why some decisions were taken.
For example, it might not be obvious why RAG is cheaper than finetuning, given that both need domain-specific data. By explaining that RAG is like an open-book exam versus a closed-book one, the difference becomes clearer: the students need less time and effort to prepare, and they’re less likely to make trivial mistakes if they can bring the book with them to the exam.
Another example is hallucinations. It’s difficult for many people to understand why LLMs don’t like to say “I don’t know”, until they realise that from the LLM’s perspective every question is like an exam: better to make something up than to admit they’re unprepared! And so on.
Building a shared, simple intuition of how LLMs work is a very powerful tool. Next time you’re asked to explain a technical decision related to LLMs, building a story around it may get the message across much more effectively and help everyone be on the same page. Give it a try!