Part 2 - Week 4
Mrinmaya Sachan
Retrieval-Augmented Language Models (RALMs)
Motivation
- Limitations of LMs as factual databases:
- Parametric LMs store knowledge implicitly in weights — difficult to inspect, update, or guarantee correctness.
- They can hallucinate facts, especially for rare or long-tail knowledge.
- Updating requires retraining or fine-tuning, which is costly and may cause catastrophic forgetting.
- External Knowledge Bases (KBs):
- Structured (e.g., Wikidata, WordNet) or unstructured (e.g., Wikipedia, news archives).
- Queried at inference time to ground LM outputs in verifiable evidence.
- Benefits:
- Improves factual accuracy and trustworthiness.
- Enables citing sources.
- Easier to update and control content.
- Reduces risk of leaking private training data.
LAMA Probe — Language Model Analysis
- Purpose: Evaluate factual and commonsense knowledge encoded in LMs.
- Method:
- Construct cloze-style prompts from KB triples (head entity, relation, tail entity).
- Example: (France, capital, Paris) → “The capital of France is [MASK].”
- Use datasets drawn from sources such as Wikidata, Google-RE, T-REx, and SQuAD (see the probe sketch at the end of this section).
- Findings:
- LMs can recall some facts but performance varies by relation type and frequency in training data.
- Implication: Motivates augmenting LMs with retrieval to improve factuality and updateability.
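As referenced above, a minimal sketch of a LAMA-style cloze probe, assuming the Hugging Face `transformers` library; the model choice and toy triples are illustrative, not the original LAMA data:

```python
# Minimal LAMA-style probe: query a masked LM with cloze prompts built from KB triples.
# Model choice is illustrative; LAMA used several LMs and larger relation sets.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Toy KB triples: (head entity, cloze template, tail entity).
triples = [
    ("France", "The capital of {} is [MASK].", "Paris"),
    ("Dante", "{} was born in [MASK].", "Florence"),
]

for head, template, tail in triples:
    prompt = template.format(head)
    predictions = fill_mask(prompt, top_k=5)
    top_tokens = [p["token_str"].strip() for p in predictions]
    hit = tail.lower() in [t.lower() for t in top_tokens]
    print(f"{prompt} -> {top_tokens} (gold: {tail}, hit@5: {hit})")
```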
Knowledge-Enhanced Language Models
Parametric Knowledge Integration
- Definition: Knowledge is stored in model parameters after training/fine-tuning.
- Approaches:
- Entity-aware embeddings:
- KnowBERT: Integrates entity embeddings from KBs (WordNet, Wikipedia) into BERT.
- Uses entity linking to map tokens to KB entities.
- Knowledge attention + recontextualization layer fuses entity and token embeddings.
- Improves perplexity, recall, and downstream task performance.
- ERNIE: Similar integration of entity and fact embeddings.
- Intermediate memory layers:
- KGLM: Augments LM with a latent variable for entities, conditioning generation on KB facts.
- kNN-LM: At inference, retrieves nearest neighbor hidden states from a datastore built over training data.
- Entity-marked pretraining:
- WKLM: Marks entity mentions in text and replaces some with other entities to encourage entity discrimination.
- Limitations:
- Updating knowledge requires retraining.
- KB coverage and entity linking errors can limit performance.
Non-Parametric Knowledge Integration
- Definition: Knowledge is retrieved from an external source at inference time.
- Advantages:
- Smaller LMs + retrieval can outperform larger LMs on knowledge-intensive tasks.
- Easy to update KB without retraining LM.
- Supports explicit citations and content control.
Retrieval Components
Retriever
- Sparse retrieval:
- TF–IDF:
- Represents documents as sparse vectors of term weights.
- Weight: \(\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}\).
- Score: \(\text{score}(q, d) = \sum_{t \in q} \frac{\text{tf-idf}(t, d)}{|d|}\).
- Efficient via inverted index; works well when query–doc term overlap is high.
- Limitations: Ignores semantics beyond exact matches; sensitive to stopwords and morphology.
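A toy sketch of TF–IDF scoring that follows the formulas above; the corpus and query are illustrative, and a real system would score via an inverted index rather than touching every document:

```python
# Toy TF-IDF retrieval: tf-idf(t, d) = tf(t, d) * log(N / df(t)),
# with the final score length-normalized by |d|, as in the slides.
import math
from collections import Counter

docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency per term

def score(query, doc_tokens):
    tf = Counter(doc_tokens)
    s = 0.0
    for t in query.split():
        if t in tf:  # only matching terms contribute
            s += tf[t] * math.log(N / df[t])
    return s / len(doc_tokens)  # length normalization

query = "capital of france"
ranked = sorted(range(N), key=lambda i: score(query, tokenized[i]), reverse=True)
print([docs[i] for i in ranked])
```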
- Dense retrieval:
- Encodes queries and documents into dense vectors in a shared embedding space.
- Similarity via dot product or cosine similarity.
- Dense Passage Retrieval (DPR):
- Dual-encoder: separate encoders for questions and passages.
- Trained with contrastive loss to bring matching Q–P pairs closer and push non-matching apart.
- Enables semantic matching beyond lexical overlap.
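A sketch of the DPR-style contrastive objective with in-batch negatives, assuming PyTorch; random tensors stand in for the outputs of the two BERT encoders:

```python
# DPR-style in-batch contrastive training step (toy encoder outputs).
import torch
import torch.nn.functional as F

def contrastive_loss(q_vecs, p_vecs):
    """q_vecs, p_vecs: (batch, dim); row i of p_vecs is the positive passage
    for query i, and all other rows act as in-batch negatives."""
    scores = q_vecs @ p_vecs.T                 # (batch, batch) dot-product similarities
    labels = torch.arange(q_vecs.size(0))      # diagonal entries = matching Q-P pairs
    return F.cross_entropy(scores, labels)     # softmax over passages for each query

# Toy usage with random "encoder outputs".
q = torch.randn(8, 128)
p = torch.randn(8, 128)
print(contrastive_loss(q, p).item())
```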
Reader / Generator
- Consumes query + retrieved docs to produce answer.
- Can be:
- Extractive: Selects answer span from retrieved text.
- Abstractive: Generates answer conditioned on retrieved evidence.
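A minimal extractive-reader sketch using a `transformers` question-answering pipeline; the model choice and passage are illustrative:

```python
# Extractive reader: select an answer span from a retrieved passage.
from transformers import pipeline

reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

retrieved_passage = "Paris is the capital and most populous city of France."
result = reader(question="What is the capital of France?", context=retrieved_passage)
print(result["answer"], result["score"])  # span text plus the model's confidence
```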
Fusion of Retrieved Knowledge
Interpolation — kNN-LM
- Build a datastore of \((\text{key}, \text{value})\) pairs from training set hidden states.
- At inference:
- Retrieve top-\(k\) nearest keys to current hidden state.
- Form probability distribution over next tokens from retrieved values.
- Interpolate with LM’s own distribution: \[ p_{\text{final}} = \lambda p_{\text{kNN}} + (1 - \lambda) p_{\text{LM}} \]
- Pros: Improves rare word prediction.
- Cons: High memory and compute cost for nearest neighbor search.
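A toy sketch of the interpolation step at a single decoding position, assuming NumPy; the datastore, hidden state, and LM distribution are random stand-ins (a real system runs approximate nearest-neighbor search, e.g. FAISS, over millions of keys):

```python
# kNN-LM at one step: p_final = lambda * p_kNN + (1 - lambda) * p_LM.
import numpy as np

vocab_size, dim, k, lam = 100, 16, 4, 0.25

# Datastore: keys are context hidden states, values are the observed next tokens.
keys = np.random.randn(1000, dim)
values = np.random.randint(0, vocab_size, size=1000)

def knn_distribution(hidden, temperature=1.0):
    dists = np.linalg.norm(keys - hidden, axis=1)   # L2 distance to every key
    nn = np.argsort(dists)[:k]                      # top-k nearest neighbors
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for idx, w in zip(nn, weights):
        p[values[idx]] += w                         # aggregate weight per next token
    return p

hidden = np.random.randn(dim)
p_lm = np.random.dirichlet(np.ones(vocab_size))     # stand-in LM distribution
p_final = lam * knn_distribution(hidden) + (1 - lam) * p_lm
print(p_final.argmax(), p_final.sum())              # still a valid distribution
```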
Concatenation — REALM
- Treat retrieval as latent variable \(z\): \[ p(y|x) = \sum_{z} p(y|x, z) \, p(z|x) \]
- Components:
- Neural retriever \(p(z|x)\).
- Knowledge-augmented encoder \(p(y|x, z)\).
- Pretraining:
- Masked language modeling with retrieval.
- Retriever and encoder trained jointly.
- Index updated periodically; sum over \(z\) approximated with top-\(k\) retrieved docs.
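A sketch of the top-\(k\) approximation to the marginal above, assuming PyTorch; the retriever scores and encoder logits are random stand-ins for REALM's actual components:

```python
# REALM-style marginalization: p(y|x) ~= sum over top-k z of p(y|x, z) * p(z|x).
import torch
import torch.nn.functional as F

def marginal_answer_distribution(retrieval_scores, answer_logits):
    """retrieval_scores: (k,) relevance scores for top-k docs -> p(z|x) via softmax.
    answer_logits: (k, vocab) encoder logits for y given x and each document z."""
    p_z = F.softmax(retrieval_scores, dim=0)            # p(z|x) over retrieved docs
    p_y_given_z = F.softmax(answer_logits, dim=-1)      # p(y|x, z) per document
    return (p_z.unsqueeze(-1) * p_y_given_z).sum(dim=0) # marginalize out z

scores = torch.randn(5)          # toy retriever scores for k = 5 documents
logits = torch.randn(5, 1000)    # toy encoder logits over a 1000-token vocabulary
p_y = marginal_answer_distribution(scores, logits)
print(p_y.shape, p_y.sum())      # torch.Size([1000]), sums to 1
```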
Cross-Attention — RETRO
- Retrieve \(k\) nearest chunks for each input segment.
- At intermediate Transformer layers, use cross-attention to attend to retrieved chunk embeddings.
- Benefits:
- Scales to large corpora without increasing parametric memory.
- Achieves strong performance with fewer parameters than comparable LMs.
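A sketch of the cross-attention fusion step, assuming PyTorch's `nn.MultiheadAttention`; the shapes, chunk sizes, and residual wiring are simplified relative to the actual RETRO block:

```python
# RETRO-style fusion: input token states attend to encoded retrieved-chunk states.
import torch
import torch.nn as nn

dim, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=n_heads, batch_first=True)

hidden = torch.randn(1, 32, dim)          # (batch, input chunk length, dim)
retrieved = torch.randn(1, 2 * 16, dim)   # (batch, k=2 neighbor chunks x 16 tokens, dim)

# Queries come from the input sequence; keys/values come from the retrieved chunks.
out, _ = cross_attn(query=hidden, key=retrieved, value=retrieved)
hidden = hidden + out                     # residual fusion, as in a Transformer layer
print(hidden.shape)                       # torch.Size([1, 32, 64])
```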
Open Challenges
- No consensus on optimal retriever–reader integration (early vs late fusion, cross-attention vs concatenation).
- Multi-step retrieval needed for complex reasoning and multi-hop QA.
- Dense retrievers require large, high-quality training data; domain adaptation remains challenging.
- Retrieval adds inference overhead; efficient ANN search and caching are active research areas.
- Need better benchmarks for factuality, attribution, and reasoning in RALMs.