
@salgo60
Last active May 23, 2023 18:41
Long tail


Can you explain the problems with LLMs and the "coverage gap on the long tail"?

ChatGPT

Sure. The terms you are using refer to Large Language Models (LLMs) and a specific problem with their training and performance known as the "coverage gap on the long tail".

Large Language Models like GPT-4 are trained on a vast range of internet text. But despite being trained on diverse data, they may still have gaps in their knowledge. This can happen for several reasons, and one notable one is the "coverage gap on the long tail".

The term "long tail" in this context refers to the frequency distribution of the data. In the distribution of the topics that the model was trained on, there are a few topics (the "head") that are very common, and many topics (the "tail") that are rare. This is reminiscent of a power-law distribution, which has a long tail.

The "coverage gap on long tail" refers to the problem where the model might not perform as well when asked about topics in the "long tail" of the distribution - the less frequent, more specific, or niche topics. Because these topics were less prevalent in the training data, the model had fewer opportunities to learn about them.

As a result, while the LLM may generate plausible-sounding responses about these topics, it may not handle them with the depth, accuracy, or nuance it brings to more common ones. This limitation is inherent to how these models are trained: they are fundamentally bound by their training data. If a topic is not well represented in that data, the model's understanding of it, and its ability to generate correct and insightful responses about it, will be weaker.

Also, for many long-tail topics the available training data may be very sparse, or the information itself may be complex, ambiguous, or conflicting. These topics often require specialized, expert knowledge. An LLM can generate text based on patterns it has seen in its training data, but it cannot truly understand or learn beyond that data, which can lead to inaccuracies or gaps in the information it provides about these topics.
