Part 2 - Week 4
Mrinmaya Sachan
Retrieval-Augmented Language Models (RALMs)
Motivation
- Limitations of LMs as factual databases:
- Parametric LMs store knowledge implicitly in weights — difficult to inspect, update, or guarantee correctness.
- They can hallucinate facts, especially for rare or long-tail knowledge.
- Updating requires retraining or fine-tuning, which is costly and may cause catastrophic forgetting.
- External Knowledge Bases (KBs):
- Structured (e.g., Wikidata, WordNet) or unstructured (e.g., Wikipedia, news archives).
- Queried at inference time to ground LM outputs in verifiable evidence.
- Benefits:
- Improves factual accuracy and trustworthiness.
- Enables citing sources.
- Easier to update and control content.
- Reduces risk of leaking private training data.
LAMA Probe — Language Model Analysis
- Purpose: Evaluate factual and commonsense knowledge encoded in LMs.
- Method:
- Construct cloze-style prompts from KB triples (head entity, relation, tail entity).
- Example: (France, capital, Paris) → “The capital of France is [MASK].”
- Use datasets drawn from sources such as Wikidata, Google-RE, T-REx, and SQuAD (see the probe sketch at the end of this section).
- Findings:
- LMs can recall some facts but performance varies by relation type and frequency in training data.
- Implication: Motivates augmenting LMs with retrieval to improve factuality and updateability.
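As referenced above, a minimal sketch of a LAMA-style cloze probe, assuming the Hugging Face `transformers` library; the model choice and toy triples are illustrative, not the original LAMA data:

```python
# Minimal LAMA-style probe: query a masked LM with cloze prompts built from KB triples.
# Model choice is illustrative; LAMA used several LMs and larger relation sets.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Toy KB triples: (head entity, cloze template, tail entity).
triples = [
    ("France", "The capital of {} is [MASK].", "Paris"),
    ("Dante", "{} was born in [MASK].", "Florence"),
]

for head, template, tail in triples:
    prompt = template.format(head)
    predictions = fill_mask(prompt, top_k=5)
    top_tokens = [p["token_str"].strip() for p in predictions]
    hit = tail.lower() in [t.lower() for t in top_tokens]
    print(f"{prompt} -> {top_tokens} (gold: {tail}, hit@5: {hit})")
```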
Knowledge-Enhanced Language Models
Parametric Knowledge Integration
- Definition: Knowledge is stored in model parameters after training/fine-tuning.
- Approaches:
- Entity-aware embeddings:
- KnowBERT: Integrates entity embeddings from KBs (WordNet, Wikipedia) into BERT.
- Uses entity linking to map tokens to KB entities.
- Knowledge attention + recontextualization layer fuses entity and token embeddings.
- Improves perplexity, recall, and downstream task performance.
- ERNIE: Similar integration of entity and fact embeddings.
- Intermediate memory layers:
- KGLM: Augments LM with a latent variable for entities, conditioning generation on KB facts.
- kNN-LM: At inference, retrieves nearest neighbor hidden states from a datastore built over training data.
- Entity-marked pretraining:
- WKLM: Marks entity mentions in text and replaces some with other entities to encourage entity discrimination.
- Limitations:
- Updating knowledge requires retraining.
- KB coverage and entity linking errors can limit performance.
Non-Parametric Knowledge Integration
- Definition: Knowledge is retrieved from an external source at inference time.
- Advantages:
- Smaller LMs + retrieval can outperform larger LMs on knowledge-intensive tasks.
- Easy to update KB without retraining LM.
- Supports explicit citations and content control.
Retrieval Components
Retriever
- Sparse retrieval:
- TF–IDF:
- Represents documents as sparse vectors of term weights.
- Weight: \(\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}\).
- Score: \(\text{score}(q, d) = \sum_{t \in q} \frac{\text{tf-idf}(t, d)}{|d|}\).
- Efficient via inverted index; works well when query–doc term overlap is high.
- Limitations: Ignores semantics beyond exact matches; sensitive to stopwords and morphology.
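A toy sketch of TF–IDF scoring that follows the formulas above; the corpus and query are illustrative, and a real system would score via an inverted index rather than touching every document:

```python
# Toy TF-IDF retrieval: tf-idf(t, d) = tf(t, d) * log(N / df(t)),
# with the final score length-normalized by |d|, as in the slides.
import math
from collections import Counter

docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency per term

def score(query, doc_tokens):
    tf = Counter(doc_tokens)
    s = 0.0
    for t in query.split():
        if t in tf:  # only matching terms contribute
            s += tf[t] * math.log(N / df[t])
    return s / len(doc_tokens)  # length normalization

query = "capital of france"
ranked = sorted(range(N), key=lambda i: score(query, tokenized[i]), reverse=True)
print([docs[i] for i in ranked])
```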
- Dense retrieval:
- Encodes queries and documents into dense vectors in a shared embedding space.
- Similarity via dot product or cosine similarity.
- Dense Passage Retrieval (DPR):
- Dual-encoder: separate encoders for questions and passages.
- Trained with contrastive loss to bring matching Q–P pairs closer and push non-matching apart.
- Enables semantic matching beyond lexical overlap.
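A sketch of the DPR-style contrastive objective with in-batch negatives, assuming PyTorch; random tensors stand in for the outputs of the two BERT encoders:

```python
# DPR-style in-batch contrastive training step (toy encoder outputs).
import torch
import torch.nn.functional as F

def contrastive_loss(q_vecs, p_vecs):
    """q_vecs, p_vecs: (batch, dim); row i of p_vecs is the positive passage
    for query i, and all other rows act as in-batch negatives."""
    scores = q_vecs @ p_vecs.T                 # (batch, batch) dot-product similarities
    labels = torch.arange(q_vecs.size(0))      # diagonal entries = matching Q-P pairs
    return F.cross_entropy(scores, labels)     # softmax over passages for each query

# Toy usage with random "encoder outputs".
q = torch.randn(8, 128)
p = torch.randn(8, 128)
print(contrastive_loss(q, p).item())
```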
Reader / Generator
- Consumes query + retrieved docs to produce answer.
- Can be:
- Extractive: Selects answer span from retrieved text.
- Abstractive: Generates answer conditioned on retrieved evidence.
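A minimal extractive-reader sketch using a `transformers` question-answering pipeline; the model choice and passage are illustrative:

```python
# Extractive reader: select an answer span from a retrieved passage.
from transformers import pipeline

reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

retrieved_passage = "Paris is the capital and most populous city of France."
result = reader(question="What is the capital of France?", context=retrieved_passage)
print(result["answer"], result["score"])  # span text plus the model's confidence
```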
Fusion of Retrieved Knowledge
Interpolation — kNN-LM
- Build a datastore of \((\text{key}, \text{value})\) pairs from training set hidden states.
- At inference:
- Retrieve top-\(k\) nearest keys to current hidden state.
- Form probability distribution over next tokens from retrieved values.
- Interpolate with LM’s own distribution: \[ p_{\text{final}} = \lambda p_{\text{kNN}} + (1 - \lambda) p_{\text{LM}} \]
- Pros: Improves rare word prediction.
- Cons: High memory and compute cost for nearest neighbor search.
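A toy sketch of the interpolation step at a single decoding position, assuming NumPy; the datastore, hidden state, and LM distribution are random stand-ins (a real system runs approximate nearest-neighbor search, e.g. FAISS, over millions of keys):

```python
# kNN-LM at one step: p_final = lambda * p_kNN + (1 - lambda) * p_LM.
import numpy as np

vocab_size, dim, k, lam = 100, 16, 4, 0.25

# Datastore: keys are context hidden states, values are the observed next tokens.
keys = np.random.randn(1000, dim)
values = np.random.randint(0, vocab_size, size=1000)

def knn_distribution(hidden, temperature=1.0):
    dists = np.linalg.norm(keys - hidden, axis=1)   # L2 distance to every key
    nn = np.argsort(dists)[:k]                      # top-k nearest neighbors
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for idx, w in zip(nn, weights):
        p[values[idx]] += w                         # aggregate weight per next token
    return p

hidden = np.random.randn(dim)
p_lm = np.random.dirichlet(np.ones(vocab_size))     # stand-in LM distribution
p_final = lam * knn_distribution(hidden) + (1 - lam) * p_lm
print(p_final.argmax(), p_final.sum())              # still a valid distribution
```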
Concatenation — REALM
- Treat retrieval as latent variable \(z\): \[ p(y|x) = \sum_{z} p(y|x, z) \, p(z|x) \]
- Components:
- Neural retriever \(p(z|x)\).
- Knowledge-augmented encoder \(p(y|x, z)\).
- Pretraining:
- Masked language modeling with retrieval.
- Retriever and encoder trained jointly.
- Index updated periodically; sum over \(z\) approximated with top-\(k\) retrieved docs.
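A sketch of the top-\(k\) approximation to the marginal above, assuming PyTorch; the retriever scores and encoder logits are random stand-ins for REALM's actual components:

```python
# REALM-style marginalization: p(y|x) ~= sum over top-k z of p(y|x, z) * p(z|x).
import torch
import torch.nn.functional as F

def marginal_answer_distribution(retrieval_scores, answer_logits):
    """retrieval_scores: (k,) relevance scores for top-k docs -> p(z|x) via softmax.
    answer_logits: (k, vocab) encoder logits for y given x and each document z."""
    p_z = F.softmax(retrieval_scores, dim=0)            # p(z|x) over retrieved docs
    p_y_given_z = F.softmax(answer_logits, dim=-1)      # p(y|x, z) per document
    return (p_z.unsqueeze(-1) * p_y_given_z).sum(dim=0) # marginalize out z

scores = torch.randn(5)          # toy retriever scores for k = 5 documents
logits = torch.randn(5, 1000)    # toy encoder logits over a 1000-token vocabulary
p_y = marginal_answer_distribution(scores, logits)
print(p_y.shape, p_y.sum())      # torch.Size([1000]), sums to 1
```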
Cross-Attention — RETRO
- Retrieve \(k\) nearest chunks for each input segment.
- At intermediate Transformer layers, use cross-attention to attend to retrieved chunk embeddings.
- Benefits:
- Scales to large corpora without increasing parametric memory.
- Achieves strong performance with fewer parameters than comparable LMs.
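A sketch of the cross-attention fusion step, assuming PyTorch's `nn.MultiheadAttention`; the shapes, chunk sizes, and residual wiring are simplified relative to the actual RETRO block:

```python
# RETRO-style fusion: input token states attend to encoded retrieved-chunk states.
import torch
import torch.nn as nn

dim, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=n_heads, batch_first=True)

hidden = torch.randn(1, 32, dim)          # (batch, input chunk length, dim)
retrieved = torch.randn(1, 2 * 16, dim)   # (batch, k=2 neighbor chunks x 16 tokens, dim)

# Queries come from the input sequence; keys/values come from the retrieved chunks.
out, _ = cross_attn(query=hidden, key=retrieved, value=retrieved)
hidden = hidden + out                     # residual fusion, as in a Transformer layer
print(hidden.shape)                       # torch.Size([1, 32, 64])
```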
Open Challenges
- No consensus on optimal retriever–reader integration (early vs late fusion, cross-attention vs concatenation).
- Multi-step retrieval needed for complex reasoning and multi-hop QA.
- Dense retrievers require large, high-quality training data; domain adaptation remains challenging.
- Retrieval adds inference overhead; efficient ANN search and caching are active research areas.
- Need better benchmarks for factuality, attribution, and reasoning in RALMs.