No LLM can figure this out – Stuck in the middle and lost.

Both models failed to complete the task because they overlooked the nonsense paragraph inserted in the middle of the text. The failure rate was high, although Mistral Medium came closest with a 47% success rate. The test showed that even at a context length of 2,000 tokens, the middle portion of the text was consistently ignored. For the models, this blind test was a complete disaster.


In a blind test, two chatbot models, Model A and Model B, were given a prompt of more than 1,000 words, and both failed to solve the task it contained. This article analyzes the blind test and the results obtained from different models. The prompt required answering a specific question based on one paragraph of the text, and the test concluded that most systems were unable to do so.

Analysis of Results

The results of the blind test show that both Model A and Model B failed to comprehend and solve the task. Most notably, they ignored an entire paragraph in the middle of the text, which highlights the limitations of existing language models when a task is buried within a lengthy context.

  • Key Takeaways:
    • Models were unable to solve tasks within the middle of a 1,000-word text.
    • Both Model A and Model B failed to understand and execute the specified prompt.
    • Chatbot Arena hosted the blind test and selected the systems automatically, with no external influence.

Understanding the Test Process

The blind test utilized various text prompts, and the models’ abilities to comprehend and answer questions from specific paragraphs were evaluated. However, most of the models failed to provide accurate answers, demonstrating the limitations of existing language models in handling complex tasks within lengthy contexts.
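The setup described above, a task paragraph placed in the middle of a long text followed by a question about it, can be sketched roughly as follows. This is only an illustration: the function names, the filler text, and the inserted paragraph are hypothetical and are not the material used in the actual test.

```python
def insert_in_middle(paragraphs, inserted):
    """Return a new list with `inserted` placed at the midpoint of `paragraphs`."""
    middle = len(paragraphs) // 2
    return paragraphs[:middle] + [inserted] + paragraphs[middle:]


def build_prompt(paragraphs, inserted, question):
    """Join the paragraphs (with the inserted one in the middle) and append the question."""
    body = "\n\n".join(insert_in_middle(paragraphs, inserted))
    return f"{body}\n\n{question}"


# Hypothetical filler and inserted paragraph, for illustration only.
filler = [f"Filler paragraph {i} about an unrelated topic." for i in range(1, 7)]
inserted = "The secret word for this document is 'pineapple'."
prompt = build_prompt(filler, inserted, "What is the secret word?")
```

A model passes this kind of check only if its answer reflects the inserted paragraph, which is exactly the part of the text the models in the blind test overlooked.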

Test Results

The results indicated that a majority of the language models, including Model A and Model B, were unable to respond accurately to the tasks within the text prompts. They failed to comprehend specific paragraphs and to deliver a logical, accurate response. The purpose of the blind test was to assess the language models' capabilities on advanced tasks embedded within long stretches of text.

Chatbot Model   Success Rate   Result
Model A         0%             Complete Fail
Model B         0%             Complete Fail
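
The article does not say how the success rates were calculated. One plausible reading, assumed here, is the share of attempts judged correct. A minimal sketch under that assumption, with a hypothetical `success_rate` helper and made-up outcome lists:

```python
def success_rate(outcomes):
    """Percentage of attempts marked successful (True) in `outcomes`."""
    if not outcomes:
        raise ValueError("no attempts recorded")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up outcomes for illustration: every attempt failed.
model_a_attempts = [False, False, False, False, False]
rate = success_rate(model_a_attempts)
```

Under this reading, a model that never answers correctly scores 0%, matching the "Complete Fail" rows above.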

Evaluating the Models’ Performance

There were limited instances where certain models demonstrated partial success by identifying parts of the paragraphs; however, the inability to understand the task specified within the paragraph was a prevalent issue. Notably, Mistral Medium showed the most promise in comprehending elements of the text but failed to execute the task effectively.


In conclusion, the blind test on Chatbot Arena showcased the existing limitations of several language models, specifically in handling tasks placed in the middle of a lengthy text. Most models, including the latest GPT-4, performed inadequately when solving specific prompts and comprehending tasks within the provided paragraphs of text.

  • Key Takeaway:
    • While certain models demonstrated partial success, the majority exhibited shortcomings in understanding and executing tasks within a lengthy context.

Frequently Asked Questions (FAQs)

Q: What was the purpose of the blind test?

A: The blind test aimed to evaluate language models by testing their abilities to comprehend and execute specific tasks within extensive lengths of text.

Q: What were the salient findings from the test?

A: Most language models struggled to understand and execute tasks presented within the middle of a significant volume of text.

This analysis of the Chatbot Arena blind test highlights the limitations of existing language models in comprehending and executing tasks within lengthy contexts.
