Week 6

Published

Tuesday, March 25, 2025

Paper Overview

  • Title: “Multi-Agent Cooperation and the Emergence of (Natural) Language”
  • Authors: Angeliki Lazaridou, Alexander Peysakhovich, Marco Baroni (DeepMind, FAIR, Trento)
  • Venue: ICLR 2017
  • Problem: Can AI agents invent language through interaction?
  • Method: Referential games + Deep Reinforcement Learning
  • Focus: Emergence, grounding, semantic properties

Motivation & Background

  • Limitations of Passive Learning:

    • Supervised learning (text corpora) misses the function of language.
    • Need interactive learning for conversational AI.
    • Language as a tool for coordination (Austin, Wittgenstein).
  • Multi-Agent Games as Solution:

    • Agents learn language by needing to communicate.
    • Referential Game: Simple coordination task.
    • Emergence from “tabula rasa”.
  • Connections:

    • Cybernetics/Shannon: Communication serves goal-directed coordination; the reward signal acts as a feedback loop; symbols form an information channel that reduces the receiver’s uncertainty.
    • Wittgenstein: Language-games, meaning = use, Referential game as a language-game.
    • SHRDLU (contrast): emergent symbols vs. pre-programmed symbols and rules. This paper learns from interaction rather than a symbolic-logic base, with a potential scalability advantage.

Technical Details: The Referential Game

  • Setup:

    • Sender sees target image + distractor.
    • Sends symbol from vocabulary \(V\).
    • Receiver sees both images + symbol.
    • Guesses target. Reward if correct.
  • Agent Architectures:

    • Simple Feed-Forward NNs.
    • Sender: “Agnostic” vs. “Informed” (inductive bias).
    • Receiver: Maps symbol + images to choice.
  • Formalization:

    • Two players: Sender \(S\), Receiver \(R\).
    • Input: Pair of images \((i_L, i_R)\), one is target \(t \in \{L, R\}\). Images represented by VGG ConvNet features.
    • Sender sees images and target \(t\), chooses symbol \(s \in V\) (Vocabulary size \(K=10\) or \(100\)). Policy \(s(\theta_S(i_L, i_R, t))\). Uses discretization (sampling from Gibbs distribution).
    • Receiver sees images (random order) and symbol \(s\), guesses target \(\hat{t} \in \{L, R\}\). Policy \(r(i_L, i_R, s)\).
    • Reward: \(R=1\) if \(\hat{t} = t\), \(R=0\) otherwise. Shared reward.
    • Learning: Reinforcement Learning (REINFORCE algorithm) to minimize \(- \mathbb{E}_{\hat{t} \sim r}[R(\hat{t})]\).

Experiments & Results

  • Exp 1: Does Communication Emerge?

    • Result: Yes! High success rate (>99%).
    • Informed sender learns faster, uses more symbols.
  • Exp 2: Semantic Properties?

    • Measure: Cluster Purity (vs. McRae categories).
    • Visualize: t-SNE of object features, colored by symbol.
    • Result: Significant purity, semantic clustering observed.
  • Exp 3: Forcing Abstraction

    • Tweak: Sender/Receiver see different images of same class.
    • Removes low-level features as common knowledge.
    • Result: Still works, slightly better purity/clustering.
  • Exp 4: Grounding in Natural Language

    • Method: Hybrid Training (Game + Supervised Image Labeling).
    • Share representation between tasks.
    • Result: High success + Dramatic purity increase (70%).
  • Exp 5: Human Interpretability

    • Test: Train on ImageNet, play game on ReferItGame images.
    • Humans guess target based on agent’s word (symbol).
    • Result: 68% human success rate (>> chance).
  • Key Finding: Flexible Semantics. Agents/Humans use words flexibly (“metonymy”).
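Exp 2's cluster-purity measure can be sketched as follows: group objects by the symbol the sender uses for them, then count how many objects in each symbol cluster belong to that cluster's majority gold category. A minimal sketch with toy data; the paper evaluates against McRae categories, and the function and example labels here are illustrative.

```python
from collections import Counter

def cluster_purity(symbols, categories):
    """Purity of symbol-induced clusters against gold categories.

    symbols[i]    -- symbol the sender uses for object i
    categories[i] -- gold (e.g. McRae) category of object i
    """
    clusters = {}
    for s, c in zip(symbols, categories):
        clusters.setdefault(s, []).append(c)
    # For each cluster, credit the size of its majority category.
    correct = sum(Counter(cs).most_common(1)[0][1] for cs in clusters.values())
    return correct / len(symbols)

# Toy example: symbol 0 mostly marks animals, symbol 1 marks vehicles.
syms = [0, 0, 0, 1, 1, 1]
cats = ["animal", "animal", "vehicle", "vehicle", "vehicle", "vehicle"]
print(cluster_purity(syms, cats))  # 5/6, since one "vehicle" sits in the animal cluster
```

Purity is 1.0 when every symbol maps to a single category; chance-level purity depends on the category distribution, which is why the paper reports significance against shuffled baselines.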

Philosophical Stance & Assumptions

  • Explicit: Interaction needed, language is functional (Austin, Clark).

  • Implicit: Meaning from coordination, simple agents -> complex behavior, human concepts as benchmark.

  • Stance (framework characterization):

    • Functionalist/Pragmatist: Meaning is use/coordination.
    • Emergentist: Complexity arises from simple interactions.
    • Empiricist: Learning is experience-driven.
    • Connectionist: Neural-network implementation.
    • Implicitly Computationalist: Communication and learning treated as computational processes.

Strengths & Limitations

  • Strengths:

    • Demonstrates de novo emergence.
    • Highlights role of interaction & environment.
    • Path towards grounding/human alignment.
  • Limitations:

    • Simple language (no syntax, compositionality).
    • Simple task/world.
    • Meaning depth? (Intentionality, truth).
    • Scalability?
    • Reliance on human categories for evaluation.

Open Questions & Discussion Points

  • Can systems like this create syntax and compositionality?
  • Can RL models scale to more complex environments?
  • Does pre-training image representations on human-labeled data limit discovery of novel semantics?
  • Is this truly “language,” or just an efficient encoding scheme over symbols?
  • Are the models robust, or do they latch onto spurious correlations?