Local RAG LLM Summarization: Mistral vs Llama 3

Luis M. Bracamontes
6 min read · May 14, 2024


Large Language Models (LLMs) have taken the world by storm, and the pace at which advances occur is mind-blowing. From larger and larger context windows (the amount of information the LLM can use to produce its output) to very sophisticated agentic workflows and patterns that enable LLMs to actually solve business problems, the progress is remarkable. Additionally, the development of open-source frameworks like LangChain and DSPy has allowed the creation of very interesting projects.

Among the most promising applications for LLMs in the industry, one can find Content Generation, Q&A, Information Retrieval, Virtual Assistants, Copilots (especially for coding), and the main purpose of this article, Summarization. By the end of the article, you will know how to set up an agentic evaluation pipeline for summarization with Mistral 7B and Llama 3 8B. You will also have a better understanding of which one performs better at that task!

Measuring the Amount of Information in a Summary

While there is no formal mathematical model for the specific task of summarization using LLMs (at least not to my knowledge), the closest analogue I can think of is the Information Distance from information theory:

ID(x, y) = min{ |p| : p(x) = y and p(y) = x }

Here p represents the shortest program that converts one object (a string, in this case) into a reduced representation of the same object (another string), where len(y) << len(x) while the information content is preserved, i.e. x and y are informationally equivalent. This means that if one reads the summary and later the original dialogue, no new information should be gained. This is a significant challenge: one needs to minimize the difference between a ground-truth summary y and the generated summary ŷ. For that, we'll test two LLMs in the role of p: Llama 3 and Mistral.

x is the input string to the LLM, and it represents a sequence of characters coming from a conversation, documents, or notes.

y is a human-generated summary of x.

ŷ represents the output of the LLM when it is tasked with creating a summary of x.

Finally, the distance between ŷ and y will be computed using the cosine similarity of their embeddings.
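
As a concrete reference, here is a minimal sketch of that comparison. The embedding model is my assumption (the article does not pin one down); any locally served sentence-embedding model would work the same way, and the score is expressed as a distance (0 means identical, 1 means dissimilar) to match the scoring used later in the article.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model, not specified in the article

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def summary_distance(generated: str, ground_truth: str) -> float:
    """Return 1 - cosine similarity between the two summary embeddings (0 = identical)."""
    emb = embedder.encode([generated, ground_truth])
    cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
    return 1.0 - float(cos)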

Tooling and Data

In order to test which LLM is better at summarization, we will consider a few tools that will help us achieve that.

First, it’s important to define a dataset. For the purposes of this exercise, we will be using dialogsum, a dataset of dialogues between different individuals paired with their corresponding human-written summaries. Most of the data will come from the test split, as we are only interested in checking whether LLMs can produce summaries as good as human ones; specifically, the first version of the first 100 test samples. From the train split, I’ll take roughly 4 examples to serve as few-shot demonstrations of the desired output in the prompt.
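
For orientation, this is roughly how that data could be pulled in. The Hugging Face Hub ID and the field names are assumptions on my part; any local copy of DialogSum would do.

from datasets import load_dataset

# Assumed Hub mirror of DialogSum; field names ("dialogue", "summary") are assumptions too
dialogsum = load_dataset("knkarthick/dialogsum")
test_samples = dialogsum["test"].select(range(100))   # first 100 test dialogues
few_shot = dialogsum["train"].select(range(4))        # ~4 train examples for the prompt

for ex in few_shot:
    print(ex["dialogue"][:80], "->", ex["summary"][:80])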

As hinted in the title, our main focus will be testing two different open-source LLMs: Mistral and Llama 3. To make things more interesting, we’ll also be testing tools that enable the deployment of LLMs locally, making them more accessible to developers and ensuring all data remains private ;) I’m referring to LM Studio and Ollama. LM Studio is interesting because it allows the user to deploy and interact with LLMs in the same environment. While it has a vast list of open-source models to choose from (thanks to the huggingface community), the actual application is not open source :/. However, that’s not a big deal. Additionally, it serves models mimicking OpenAI’s API, so if you are already using that API and want to debug or test before actually paying for ChatGPT, this is a great option. Ollama is fully open source but does not have an integrated GUI with a chat. However, the community is significant, which makes it very easy to find support quickly. It also runs a server through a CLI and deploys a local endpoint.
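
A rough sketch of pointing LangChain at those two local servers is shown below. The ports are the tools' defaults on my machine and the model names are placeholders, so adjust both to whatever you actually have loaded.

from langchain_openai import ChatOpenAI
from langchain_community.chat_models import ChatOllama

# LM Studio exposes an OpenAI-compatible endpoint (default port 1234)
lm_studio_llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",            # placeholder; LM Studio ignores the key
    model="mistral-7b-instruct",    # whatever model is loaded in the GUI
)

# Ollama serves its own API on port 11434
ollama_llm = ChatOllama(model="llama3")

print(ollama_llm.invoke("Summarize: A: Hi! B: Hey, how are you?").content)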

Another important tool worth mentioning is LangChain, and more specifically the ecosystem around it that lets users create, observe, and debug complex pipelines and workflows for LLMs. The main components used for our testing are langchain, langgraph, and langsmith.
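
If you want the LangSmith traces for the runs below, a minimal setup looks roughly like this; the project name is made up and you need your own API key.

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "dialogsum-summarization"   # hypothetical project name
# From here on, every chain/graph invocation gets traced automatically.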

Implementation

Agentic workflows leveraging chains are becoming more common thanks to recent advances like CoT, ReAct, and few-shot learning, which prove the actual utility of LLMs for solving different problems. Few-shot learning can yield great accuracy without retraining a model! Almost as important is the development of software projects and libraries that enable the rapid implementation of the ideas expressed in those papers, specifically LangChain. Before jumping into the code, let’s reason (no pun intended) about the list of problems that we need to solve before building the graph:

1. Generate the summary from documents.

2. Analyze the summary for hallucinations and missing relevant information.

3. Reflect about the problems encountered during the previous step.

4. Fetch the most similar summary from the ground-truth test data using the generated summary.

Based on the previous list, the graph to generate relevant summaries looks like this:

From the figure above, we can see that we need four nodes and two conditional edges: one node to generate the summary, one edge to check for hallucinations and misses, another edge that decides whether the flow goes to the retry and reflect nodes, and finally, an evaluator node that finds the most similar ground-truth summary and its score.

The workflow is implemented using LangGraph. The state and the graph are defined as follows:

from typing import Optional, TypedDict


class GraphState(TypedDict):
    """
    Represents the state of a summarization graph.

    Attributes:
        summary: The generated summary of the dialogue.
        dialogue: The dialogue to summarize.
        similar_summary: The ground-truth summary most similar to the generated one.
        similar_id: The ID of that most similar summary.
        retries: The number of retries.
        reflections: The reflections about the generated summary.
        similarity_score: Cosine similarity score of the summaries.
    """
    summary: str
    dialogue: str
    similar_summary: str
    similar_id: str
    retries: int
    reflections: Optional[dict]
    similarity_score: Optional[float]

from langgraph.graph import StateGraph, END

# Summarization graph
workflow = StateGraph(GraphState)
workflow.add_node('summary', generate_summary)
workflow.add_node('evaluator', get_similar_summary)
workflow.add_node('reflector', reflect_summary)
workflow.add_node('retry', retry)

# Build the graph
workflow.set_entry_point('summary')
workflow.add_conditional_edges('summary',
                               hallucination_grader_edge,
                               {
                                   'grounded': 'evaluator',
                                   'hallucinates': 'retry'
                               })
workflow.add_conditional_edges('retry',
                               max_retries,
                               {
                                   'reflect': 'reflector',
                                   'eval': 'evaluator'
                               })
workflow.add_edge('reflector', 'summary')
workflow.add_edge('evaluator', END)

app = workflow.compile()

The function hallucination_grader_edge runs another agent that verifies the output of the summary node and, depending on its evaluation, routes the workflow.


### Conditional edge: hallucinations
def hallucination_grader_edge(state: GraphState) -> str:
    """
    Hallucination grader edge.

    Args:
        state: The state of the summary graph.

    Returns:
        The hallucination decision ('grounded' or 'hallucinates').
    """
    print("---HALLUCINATION GRADER---")
    summary = state['summary']
    dialogue = state['dialogue']
    score = hallucination_grader.invoke({"summary": summary, "dialogue": dialogue})
    grade = score['score']

    if grade == 'yes':
        print("---DECISION: GENERATION DOES NOT HALLUCINATE/GROUNDED---")
        return 'grounded'

    print("---DECISION: [WARNING] GENERATION DOES HALLUCINATE, RETRY---")
    return 'hallucinates'
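
The hallucination_grader chain itself is defined elsewhere in the project. A minimal sketch of what such a grader could look like is below, assuming an Ollama-served model with JSON output; the prompt wording is mine, not the article's.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_community.chat_models import ChatOllama

# Hypothetical grader prompt: 'yes' means the summary is grounded in the dialogue
grader_prompt = ChatPromptTemplate.from_template(
    "You are a grader checking whether a summary is grounded in a dialogue.\n"
    "Dialogue:\n{dialogue}\n\nSummary:\n{summary}\n\n"
    "Answer with a JSON object containing a single key 'score' whose value is "
    "'yes' if every claim in the summary is supported by the dialogue, otherwise 'no'."
)

hallucination_grader = (
    grader_prompt
    | ChatOllama(model="llama3", format="json", temperature=0)
    | JsonOutputParser()
)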

The max_retries function simply keeps count of how many times the summary has been regenerated. Once the maximum number of retries has been reached, the workflow goes straight to the evaluator to avoid infinite loops.
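
The routing logic is not shown in the article, but it could look roughly like this; the cap of 3 retries is my assumption, and the retry node is assumed to increment the counter.

# Hypothetical sketch of the retry-limit edge; the cap of 3 is my assumption.
MAX_RETRIES = 3

def max_retries(state: GraphState) -> str:
    """Route to the reflector until the retry budget is exhausted, then evaluate."""
    if state['retries'] < MAX_RETRIES:
        return 'reflect'
    return 'eval'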

Results

As mentioned before, I will be using a smaller sample than the full dialogsum test set. For this exercise I only take the first 100 dialogues and the first ground-truth version of each [test_0_1, …, test_99_1], since there are three different ground-truth versions per dialogue. Accuracy is measured as whether the top-1 retrieved ground-truth summary is the closest to the generated summary per the ID measure, i.e., whether it belongs to the same dialogue. The quality of the summary is the retrieved score: a score of 0 would mean that the generated summary is exactly the same as the human-annotated ground truth, while a score of 1 would suggest that the summaries are very distant or dissimilar.
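
As a rough sketch, the evaluation loop could be wired up like this, reusing the compiled graph and test samples from the earlier snippets; the ID format and the state keys are assumptions based on the naming in the article.

results = []
for i, sample in enumerate(test_samples):
    # Invoke the compiled graph on one dialogue; keys follow the GraphState above
    state = app.invoke({"dialogue": sample["dialogue"], "retries": 0, "reflections": None})
    results.append({
        "correct": state["similar_id"] == f"test_{i}_1",   # top-1 retrieval hit (assumed ID format)
        "score": state["similarity_score"],
    })

accuracy = sum(r["correct"] for r in results) / len(results)
avg_score = sum(r["score"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.2f}, Average Similarity Score: {avg_score:.3f}")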

For Llama3, we have the following outcome:

Accuracy: 0.93
Average Similarity Score: 0.701
Number of hallucinations: 6

For Mistral, we have the following outcome:

Accuracy: 0.83
Average Similarity Score: 0.7002
Number of hallucinations: 0

We have a clear winner: Llama 3, which achieved 93% accuracy! However, it’s interesting how close they are when it comes to the quality of the summaries, as both scored an average of 0.70 in similarity. Since only four examples were taken from the dialogsum train data to teach the LLM how to generate summaries in the prompt, it can be argued that Llama 3 is the better learner. However, the quality of its generated output is no different from Mistral’s! Another thing to point out is that Mistral did not flag any hallucinations in its own generated summaries, which seems odd, but after careful review it appears to be accurate. Does this mean that some LLMs are better at reflecting than others?
