Hypothetical Embeddings Explained

Large Language Models (LLMs) can be combined with embeddings to create accurate knowledge retrieval systems. Hypothetical Document Embeddings, or HyDE¹, can be used to provide a question-answering interface over a given database of facts that feels like an LLM-powered chatbot, but largely without the risk of hallucinations.

In LLM-only systems, such as search engines powered by RLHF-tuned models, there is a risk of the core model confidently stating wrong facts or making assertions that go beyond, or even contradict, the available ground truth (e.g. a search index). This problem is commonly known as hallucination.

There are several ways to engineer prompts to minimize this effect, and combined with instruction fine-tuning they are generally successful in limiting hallucinations. In systems where fact retrieval is the primary use case, though, the problem can be almost entirely eliminated.

HyDE-powered search relies on embeddings: dense vector representations of (in this case) textual data. OpenAI embeddings, for example, can effectively encode abstract qualities of short- and long-form text (up to roughly 8,000 tokens per embedding) in 1,536-dimensional vectors (in the case of text-embedding-ada-002). These vectors can then be stored in vector similarity search (VSS) indexes (e.g. HNSW-based indexes such as RediSearch) that facilitate efficient VSS or hybrid VSS and full-text queries.
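As a rough sketch of what indexing could look like, the snippet below uses hnswlib as a stand-in for whichever VSS index is actually in place (RediSearch, for example), and a placeholder embed function in place of a real embedding call; the fact strings are toy examples:

import hnswlib
import numpy as np

EMBEDDING_DIM = 1536  # assumed here; depends on the embedding model used

def embed(text: str) -> np.ndarray:
    # Placeholder for a real embedding call (e.g. a request to an embeddings API).
    # A deterministic fake vector keeps the sketch runnable without network access.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(EMBEDDING_DIM)
    return vec / np.linalg.norm(vec)

# Toy corpus; in practice these would come from the fact database.
facts = [
    "The war on drugs has contributed to mass incarceration in the U.S.",
    "Prohibition pushes the drug trade into violent black markets.",
    "Criminalization discourages users from seeking medical help.",
]

# Build an HNSW index over the fact embeddings, using cosine distance.
index = hnswlib.Index(space="cosine", dim=EMBEDDING_DIM)
index.init_index(max_elements=len(facts), ef_construction=200, M=16)
index.add_items(np.stack([embed(f) for f in facts]), np.arange(len(facts)))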

However, these systems are then limited to retrieving similar facts. A user interface could prompt the user for a fact, which is then embedded and used as a query vector to retrieve the nearest (most similar) facts from the index. This requires the user to already have a possible answer in mind, which is much less intuitive than asking a question.
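Continuing the sketch above, similar-fact lookup then amounts to embedding the user's statement and running a nearest-neighbour query against the index (retrieve_similar is a hypothetical helper, not a library function):

def retrieve_similar(statement: str, k: int = 2) -> list[str]:
    # Embed the user-supplied statement and return the k most similar indexed facts.
    labels, _distances = index.knn_query(embed(statement), k=k)
    return [facts[i] for i in labels[0]]

# The user already has to phrase something answer-like for this to work well:
print(retrieve_similar("Drug convictions keep millions of people in prison."))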

Instead of embedding the user input directly, a sufficiently capable LLM can be used to generate hypothetical answers, which are in turn embedded and used for retrieval. The LLM will likely hallucinate details in its answers, but since these answers are never relayed to the user as fact, those errors are inconsequential. The embedding is far more sensitive to the structure and language of the answers than to their factual correctness.

An example prompt (here for LLaMA-65B) might look like this:

Three possible answers to the question 'what are the negative effects of the war on drugs?' are:
1.

with an example output:

1. Incarcerated people in the U.S. have exceeded two million and the great majority of the people in jail are there for drug related crimes. In many cases, these people don't get to go to school and learn the skills they need to lead healthy, productive lives.
2. People all over the world die daily, in violent confrontations, related to the trade of illegal drugs.
3. Millions of people, including children, are sick with diseases caused by using contaminated, illegal drugs. These people are spreading disease to others.
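
Turning this pattern into code is mostly a matter of building the prompt string and splitting the completion back into individual answers. The sketch below assumes the model keeps the numbered-list format; the completion string is just an abbreviated copy of the example output above, standing in for a real LLM call:

import re

def hyde_prompt(question: str) -> str:
    # Reproduces the prompt pattern shown above.
    return f"Three possible answers to the question '{question}' are:\n1."

def parse_numbered_answers(completion: str) -> list[str]:
    # The completion continues after "1.", so prepend it before splitting on "N." markers.
    parts = re.split(r"^\s*\d+\.\s*", "1. " + completion.strip(), flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

# Abbreviated version of the example output above:
completion = (
    "Incarcerated people in the U.S. have exceeded two million ...\n"
    "2. People all over the world die daily, in violent confrontations ...\n"
    "3. Millions of people, including children, are sick with diseases ..."
)
answers = parse_numbered_answers(completion)  # three hypothetical answers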

Depending on the quality of the embeddings and how much the answers differ, either one embedding per answer (averaged afterwards) or a single embedding of the entire answer list can be used. For non-fiction fact retrieval (excerpts from books), I found it helpful to include the question itself in the embedding of each hypothetical answer, since this type of content often poses a question only to immediately answer it.
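
As a sketch of this last step, reusing embed, index and facts from the earlier snippets and the parsed answers from above, with the averaging-versus-single-embedding choice left as a flag:

def hyde_query_vector(question: str, answers: list[str],
                      average: bool = True, include_question: bool = True) -> np.ndarray:
    # Optionally prefix each hypothetical answer with the question itself, which helped
    # for question-then-answer style source material as described above.
    texts = [f"{question} {a}" if include_question else a for a in answers]
    if average:
        # One embedding per hypothetical answer, averaged into a single query vector.
        vec = np.stack([embed(t) for t in texts]).mean(axis=0)
    else:
        # Or embed the whole answer list as a single document.
        vec = embed("\n".join(texts))
    return vec / np.linalg.norm(vec)

query = hyde_query_vector("what are the negative effects of the war on drugs?", answers)
labels, _ = index.knn_query(query, k=2)
print([facts[i] for i in labels[0]])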

Notes

  1. HyDE was first introduced in Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2022), where InstructGPT is prompted zero-shot to generate the hypothetical documents.