Week 6
Paper Overview
- Title: “Multi-Agent Cooperation and the Emergence of (Natural) Language”
- Authors: Lazaridou, Peysakhovich, Baroni (DeepMind, FAIR, Trento)
- Venue: ICLR 2017
- Problem: Can AI agents invent language through interaction?
- Method: Referential games + Deep Reinforcement Learning
- Focus: Emergence, grounding, semantic properties
Motivation & Background
Limitations of Passive Learning:
- Supervised learning (text corpora) misses the function of language.
- Need interactive learning for conversational AI.
- Language as a tool for coordination (Austin, Wittgenstein).
Multi-Agent Games as Solution:
- Agents learn language by needing to communicate.
- Referential Game: Simple coordination task.
- Emergence from “tabula rasa”.
Connections:
- Cybernetics/Shannon: Communication -> Coordination (Goal-directed), feedback loop (reward signal), information channel (symbols reduce uncertainty).
- Wittgenstein: Language-games, meaning = use, Referential game as a language-game.
- SHRDLU (contrast): emergence vs. pre-programmed symbols and rules. This paper learns from interaction rather than a symbolic-logic base, with a potential scalability advantage.
Technical Details: The Referential Game
Setup:
- Sender sees target image + distractor.
- Sends symbol from vocabulary \(V\).
- Receiver sees both images + symbol.
- Guesses target; shared reward if correct (one round is sketched below).
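A minimal sketch of one round in Python, treating `send` and `receive` as opaque policy callables over precomputed image features (both names are hypothetical, not from the paper):

```python
import random

def play_round(send, receive, target_feats, distractor_feats):
    """One round of the referential game; `send` and `receive` are
    hypothetical policy callables (names not from the paper)."""
    symbol = send(target_feats, distractor_feats)        # s in V
    # Receiver sees the two images in a random order, plus the symbol.
    target_pos = random.randint(0, 1)
    pair = ([target_feats, distractor_feats] if target_pos == 0
            else [distractor_feats, target_feats])
    guess = receive(pair[0], pair[1], symbol)            # 0 (left) or 1 (right)
    return 1.0 if guess == target_pos else 0.0           # shared reward
```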
Agent Architectures:
- Simple Feed-Forward NNs.
- Sender: “Agnostic” vs. “Informed” variants, differing in the inductive bias of how the target is encoded.
- Receiver: Maps symbol + images to a choice (see the sketch after this list).
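A simplified PyTorch sketch of the two agents, loosely in the spirit of the paper's agnostic sender and its receiver; the layer sizes (4096-dim VGG features, 50-dim embeddings, vocabulary of 10) are illustrative and the architectural details are reduced:

```python
import torch
import torch.nn as nn

class Sender(nn.Module):
    """Simplified agnostic-style sender: embeds each image with a
    shared layer, concatenates, and scores every symbol in V."""
    def __init__(self, feat_dim=4096, hidden=50, vocab=10):
        super().__init__()
        self.img_embed = nn.Linear(feat_dim, hidden)
        self.to_vocab = nn.Linear(2 * hidden, vocab)

    def forward(self, target_feats, distractor_feats):
        t = torch.sigmoid(self.img_embed(target_feats))
        d = torch.sigmoid(self.img_embed(distractor_feats))
        return self.to_vocab(torch.cat([t, d], dim=-1))   # logits over V

class Receiver(nn.Module):
    """Receiver: embeds symbol and images into one space and points
    at the image whose embedding best matches the symbol's."""
    def __init__(self, feat_dim=4096, hidden=50, vocab=10):
        super().__init__()
        self.sym_embed = nn.Embedding(vocab, hidden)
        self.img_embed = nn.Linear(feat_dim, hidden)

    def forward(self, img_left, img_right, symbol):
        s = self.sym_embed(symbol)                          # (B, hidden)
        imgs = torch.stack([self.img_embed(img_left),
                            self.img_embed(img_right)], 1)  # (B, 2, hidden)
        return torch.einsum('bh,bkh->bk', s, imgs)          # logits over {L, R}
```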
Formalization:
- Two players: Sender \(S\), Receiver \(R\).
- Input: Pair of images \((i_L, i_R)\), one is target \(t \in \{L, R\}\). Images represented by VGG ConvNet features.
- Sender sees images and target \(t\), chooses symbol \(s \in V\) (vocabulary size \(K = 10\) or \(100\)). Policy \(s(\theta_S(i_L, i_R, t))\); the discrete symbol is obtained by sampling from a Gibbs distribution over vocabulary scores.
- Receiver sees images (random order) and symbol \(s\), guesses target \(\hat{t} \in \{L, R\}\). Policy \(r(i_L, i_R, s)\).
- Reward: \(R=1\) if \(\hat{t} = t\), \(R=0\) otherwise. Shared reward.
- Learning: Reinforcement learning (REINFORCE) to minimize \(-\mathbb{E}_{\hat{t} \sim r}[R(\hat{t})]\), i.e., to maximize the expected shared reward (one update is sketched below).
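A sketch of one REINFORCE update under these definitions, assuming the `Sender`/`Receiver` modules above and a single `optimizer` over both agents' parameters; no variance-reduction baseline is used, for brevity:

```python
import torch

def reinforce_step(sender, receiver, optimizer, t_feats, d_feats):
    """One REINFORCE update on a batch (sketch, no baseline).
    t_feats/d_feats: (B, 4096) VGG features of target/distractor."""
    sym_dist = torch.distributions.Categorical(logits=sender(t_feats, d_feats))
    symbol = sym_dist.sample()                      # discrete bottleneck

    # Receiver sees the images in a random order (one flip per batch,
    # for brevity) together with the sampled symbol.
    flip = bool(torch.rand(()) < 0.5)
    left, right = (d_feats, t_feats) if flip else (t_feats, d_feats)
    guess_dist = torch.distributions.Categorical(
        logits=receiver(left, right, symbol))
    guess = guess_dist.sample()

    reward = (guess == int(flip)).float()           # shared R in {0, 1}

    # Policy gradient: -R * (log pi_S(s) + log pi_R(t_hat))
    loss = -(reward * (sym_dist.log_prob(symbol)
                       + guess_dist.log_prob(guess))).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```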
Experiments & Results
Exp 1: Does Communication Emerge?
- Result: Yes! High success rate (>99%).
- Informed sender learns faster, uses more symbols.
Exp 2: Semantic Properties?
- Measure: Cluster purity against McRae concept categories (computed as in the sketch after this list).
- Visualize: t-SNE of object features, colored by symbol.
- Result: Significant purity, semantic clustering observed.
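Cluster purity here groups objects by the symbol the sender uses for them, then scores the fraction of objects falling in their cluster's majority gold category. A minimal sketch with made-up assignments:

```python
from collections import Counter

def cluster_purity(symbols, categories):
    """Purity of the symbol-induced clustering against gold categories
    (e.g., McRae concept categories): for each symbol, count the most
    frequent gold category among objects assigned that symbol."""
    total = 0
    for sym in set(symbols):
        gold = [c for s, c in zip(symbols, categories) if s == sym]
        total += Counter(gold).most_common(1)[0][1]
    return total / len(symbols)

# Illustrative usage with made-up assignments for 6 objects:
print(cluster_purity([0, 0, 1, 1, 1, 2],
                     ['animal', 'animal', 'tool', 'tool', 'animal', 'tool']))
# -> 5/6 ≈ 0.83
```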
Exp 3: Forcing Abstraction
- Tweak: Sender and receiver see different images of the same class.
- Removes low-level features as common knowledge.
- Result: Still works, slightly better purity/clustering.
Exp 4: Grounding in Natural Language
- Method: Hybrid training (referential game + supervised image labeling).
- The image representation is shared between the two tasks (see the sketch after this list).
- Result: High game success plus a dramatic increase in cluster purity (to 70%).
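A sketch of the hybrid objective, reusing `sender`, `receiver`, `optimizer`, and `reinforce_step` from the earlier sketches; the `classifier` head, the alternation schedule, and the sizes are illustrative assumptions, with the sender's `img_embed` layer standing in for the shared representation:

```python
import torch
import torch.nn as nn

# Hypothetical labeling head on the sender's shared 50-dim image
# embedding (its parameters would also be added to `optimizer`).
classifier = nn.Linear(50, 100)
xent = nn.CrossEntropyLoss()

def hybrid_step(game_batch, labeled_batch):
    """Alternate a game update with a supervised labeling update;
    both pass through the sender's shared `img_embed` layer."""
    reward = reinforce_step(sender, receiver, optimizer, *game_batch)

    feats, labels = labeled_batch                 # VGG features, class ids
    h = torch.sigmoid(sender.img_embed(feats))    # shared representation
    loss = xent(classifier(h), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward, loss.item()
```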
Exp 5: Human Interpretability
- Test: Train on ImageNet, play game on ReferItGame images.
- Humans guess target based on agent’s word (symbol).
- Result: 68% human success rate (>> chance).
Key Finding: Flexible semantics. Both agents and humans use words flexibly (e.g., metonymy).
Philosophical Stance & Assumptions
Explicit: Interaction needed, language is functional (Austin, Clark).
Implicit: Meaning from coordination, simple agents -> complex behavior, human concepts as benchmark.
Stance & Framework Characterization:
- Functionalist/Pragmatist about meaning (meaning is use/coordination).
- Emergentist about the origin of complexity (complex behavior from simple interactions).
- Empiricist about learning (experience-driven).
- Connectionist in implementation.
- Implicitly Computationalist (communication and learning treated as computational processes).
Strengths & Limitations
Strengths:
- Demonstrates de novo emergence.
- Highlights role of interaction & environment.
- Path towards grounding/human alignment.
Limitations:
- Simple language (single symbols; no syntax or compositionality).
- Simple task/world.
- Depth of meaning? (Intentionality, truth conditions.)
- Scalability?
- Reliance on human categories for evaluation.
Open Questions & Discussion Points
- Can systems like this create syntax and compositionality?
- Can RL models scale to more complex environments?
- Does pre-training image representations on human-labeled data limit discovery of novel semantics?
- Is this truly “language,” or just an efficient encoding of symbols?
- Are the models robust, or do they rely on spurious correlations?