Published

Friday, August 2, 2024

Temporal Point Processes

Introduction

The main difference from regular time series is that the exact time of each event matters, not just the ordering of events.

  • Time series: Measurements at regular intervals
  • Temporal point processes: Asynchronous time with events at irregular intervals

Key Characteristics:

  • Distribution over time points
  • Discrete event sequence in continuous time
  • Both the number of events and their locations in time are random

We can model it autoregressively:

  • Predict the time of the next event given the history \(H_t = \{ t_j \mid t_j < t \}\)
  • \(H_t\) depends on the specific realization \(\{ t_1, \ldots, t_n \}\) observed so far
  • We denote the conditional density \(p^*(t) = p(t \mid H_t)\)
  • \(p^*(t) \, dt\) is the probability that the next event occurs in \([t, t + dt)\) and that no event occurs between the last event and \(t\)

Conditional Density:

  • Cumulative distribution function (CDF): \[ F^*(t) = \int_{t_{i-1}}^t p^*(u) \, du \] Probability that the next event occurs between the last event \(t_{i-1}\) and \(t\)

  • Survival function: \[ S^*(t) = 1 - F^*(t) = \int_t^\infty p^*(u) \, du \] Probability that the next event is after \(t\)

  • Conditional intensity function: \[ \lambda^*(t) = \frac{p^*(t)}{S^*(t)}, \qquad \lambda^*(t) \, dt = \frac{P(\text{event in } [t, t + dt) \text{ and no event in } [t_{i-1}, t))}{P(\text{no event in } [t_{i-1}, t))} \] Probability of the next event in \([t, t + dt)\) given that there is no event in \([t_{i-1}, t)\)

    • Intuitive interpretation: Number of expected events per unit of time
    • Not a probability

The likelihood of the entire sequence with \(t_1, t_2, \ldots, t_n\) events on \([0, T]\) is:

\[ p(t_1, t_2, \ldots, t_n) = p^*(t_1) \cdot p^*(t_2) \cdots p^*(t_n) \cdot S^*(T) = \lambda^*(t_1) \, \lambda^*(t_2) \cdots \lambda^*(t_n) \exp \left( - \int_0^T \lambda^*(t) \, dt \right) \]

\[ p^*(t_1) = \lambda^*(t_1) \exp \left( - \int_0^{t_1} \lambda^*(t) \, dt \right) \]
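This form follows directly from the definition of the intensity: writing \(\lambda^*(t) = p^*(t) / S^*(t) = -\frac{d}{dt} \log S^*(t)\) and integrating from the last event \(t_{i-1}\) (where \(S^* = 1\)) gives

\[ S^*(t) = \exp \left( - \int_{t_{i-1}}^{t} \lambda^*(u) \, du \right), \qquad p^*(t) = \lambda^*(t) \exp \left( - \int_{t_{i-1}}^{t} \lambda^*(u) \, du \right), \]

which, chained over all events on \([0, T]\) (with lower limit \(0\) for the first event), yields the likelihood above.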

Challenges:

  • The integral over the intensity is usually hard to compute
  • The number of events can vary

Models Based on Conditional Intensity Functions

  • More flexible and interpretable
  • Easy to combine different TPPs with different \(\lambda^*(t)\) to produce a new TPP
  • Efficient sampling

Homogeneous Poisson Process (HPP)

  • Simplest possible model
  • \(\lambda^*(t) = \mu\)
  • \(p^*(t) = \mu \exp(-\mu (t - t_{i-1}))\), where \(t_{i-1}\) is the last event, is an exponential distribution

We can interpret it as modeling inter-event times that are i.i.d. exponentially distributed.

Sampling:

  • Generate an exponential random variable with rate \(\mu\) and add it to the last event time
  • It doesn’t depend on the history (memoryless property), so we can sample all events at once
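A minimal sketch of this sampling scheme (NumPy; the parameter values are illustrative assumptions):

```python
import numpy as np

def sample_hpp(mu, T, rng=None):
    """Sample event times of a homogeneous Poisson process with rate mu on [0, T]."""
    rng = np.random.default_rng() if rng is None else rng
    times = []
    t = 0.0
    while True:
        t += rng.exponential(1.0 / mu)  # inter-event time ~ Exp(mu)
        if t > T:
            break
        times.append(t)
    return np.array(times)

events = sample_hpp(mu=2.0, T=10.0)  # on average mu * T = 20 events
```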

Inhomogeneous Poisson Process (IPP)

  • Intensity function is time-dependent: \(\lambda^*(t) = \mu(t) \geq 0\)
  • Still doesn’t depend on the history (memoryless property)
  • Captures global trends (e.g., cars on the road during rush hour)

Expected Number of Events:

\[ N([a, b]) \sim \text{Poisson} \left( \int_a^b \mu(t) \, dt \right) \]

Simulating an IPP:

  • Simulating inter-event times is hard and requires integration
  • Better alternative: Thinning algorithm
    • Find a constant upper bound \(\bar{\lambda} \geq \max_t \mu(t)\)
    • Sample candidate times from an HPP with rate \(\bar{\lambda}\)
    • Accept each candidate \(t\) with probability \(\mu(t) / \bar{\lambda}\)
    • Essentially acceptance-rejection sampling (see the sketch after this list)
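A minimal sketch of the thinning algorithm (NumPy; the example intensity \(\mu(t)\) and the bound are illustrative assumptions):

```python
import numpy as np

def sample_ipp_thinning(mu_fn, lam_bar, T, rng=None):
    """Sample an inhomogeneous Poisson process on [0, T] by thinning.

    mu_fn:   intensity function mu(t), must satisfy mu(t) <= lam_bar on [0, T]
    lam_bar: constant upper bound on the intensity
    """
    rng = np.random.default_rng() if rng is None else rng
    times = []
    t = 0.0
    while True:
        t += rng.exponential(1.0 / lam_bar)     # candidate from HPP with rate lam_bar
        if t > T:
            break
        if rng.uniform() < mu_fn(t) / lam_bar:  # accept with probability mu(t) / lam_bar
            times.append(t)
    return np.array(times)

# Example: "rush hour" intensity that peaks at t = 5
mu = lambda t: 1.0 + 4.0 * np.exp(-((t - 5.0) ** 2))
events = sample_ipp_thinning(mu, lam_bar=5.0, T=10.0)
```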

Hawkes Process

  • Self-exciting process
  • Example: Tweets, where one tweet triggers another, then a pause until another tweet

\[ \lambda^*(t) = \mu + \alpha \sum_{t_i < t} k(t - t_i) \]

  • Triggering kernel: \(k(t - t_i) = \beta \exp(-\beta (t - t_i))\) for \(t > t_i\)
  • Parameters: \(\mu, \alpha, \beta\)
  • Depends on the history
  • Exhibits clustered, bursty behavior
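Simulation is not described above; a standard choice is Ogata's thinning algorithm, which exploits the fact that the exponential-kernel intensity only decays between events. A minimal sketch (NumPy; parameter values are illustrative assumptions):

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity lambda*(t) of a Hawkes process with exponential kernel."""
    past = np.asarray(history)
    past = past[past < t]
    return mu + alpha * np.sum(beta * np.exp(-beta * (t - past)))

def sample_hawkes_ogata(mu, alpha, beta, T, rng=None):
    """Simulate a Hawkes process on [0, T] with Ogata's thinning algorithm."""
    rng = np.random.default_rng() if rng is None else rng
    events, t = [], 0.0
    while True:
        # Between events the exponential-kernel intensity only decays, so the
        # intensity just after the current time upper-bounds it until the next event.
        lam_bar = hawkes_intensity(t + 1e-10, events, mu, alpha, beta)
        t += rng.exponential(1.0 / lam_bar)
        if t > T:
            break
        if rng.uniform() < hawkes_intensity(t, events, mu, alpha, beta) / lam_bar:
            events.append(t)  # accept the candidate as a real event
    return np.array(events)

events = sample_hawkes_ogata(mu=0.5, alpha=0.8, beta=1.0, T=50.0)
```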

Parameter Estimation in TPPs

  • Maximize the likelihood of the observed data over the parameters (iid learning setting)

\[ \log p(t_1, t_2, \ldots, t_n) = \sum_i \log \lambda^*(t_i) - \int_0^T \lambda^*(t) \, dt \]

  • Various optimization algorithms
  • Simple models like the HPP allow for closed-form solutions (e.g., \(\hat{\mu} = n / T\) for the HPP on \([0, T]\))
  • For Hawkes, we can use convex optimization
  • We can use gradient descent with modifications for constraints
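For the Hawkes process with the exponential kernel above, the log-likelihood can be evaluated exactly, since the integral of the intensity has a closed form. A minimal sketch (NumPy; the O(n²) loop and parameter values are illustrative assumptions):

```python
import numpy as np

def hawkes_log_likelihood(events, mu, alpha, beta, T):
    """Log-likelihood of a Hawkes process with exponential kernel on [0, T].

    log L = sum_i log lambda*(t_i) - integral_0^T lambda*(t) dt
    """
    events = np.asarray(events)
    log_lam_sum = 0.0
    for i, t_i in enumerate(events):
        past = events[:i]
        lam = mu + alpha * np.sum(beta * np.exp(-beta * (t_i - past)))
        log_lam_sum += np.log(lam)
    # integral of mu is mu * T; each event adds alpha * (1 - exp(-beta (T - t_i)))
    integral = mu * T + alpha * np.sum(1.0 - np.exp(-beta * (T - events)))
    return log_lam_sum - integral

ll = hawkes_log_likelihood(np.array([1.0, 2.5, 2.7, 6.0]), mu=0.5, alpha=0.8, beta=1.0, T=10.0)
```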

Advantages and Disadvantages of Using Intensity:

  • Advantages: Easy to interpret, easy to sample from, and easy to define models with simple dynamics
  • Disadvantages: Numerical integration might be intractable, and it is not clear how to define flexible models with arbitrary dynamics

We now revisit RNNs as a way to model the conditional density directly.

Directly Model the Conditional Density \(p^*(t \mid H_t)\) with an RNN:

  • Every time an event happens, we feed \(\tau_i = t_i - t_{i-1}\) to the RNN
  • Use the hidden state to model the history
  • Use it to model the parameters of the \(p^*\) distribution
  • \(p^*(t) = p(t \mid H_t) = p(t \mid h_i)\)

How to Model \(p^*\):

  • Simple non-negative distributions: Exponential, Gamma, Weibull, Gompertz
  • Use a mixture of distributions: Convex combinations of densities
  • If we pick the exponential distribution, the model is neither an HPP nor an IPP, since the rate depends on the history through the hidden state and changes after every event (see the sketch after this list)
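A minimal sketch of such a model with a single exponential distribution for \(p^*\) (PyTorch; the architecture, names, and the handling of the first event are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNTPP(nn.Module):
    """Encode the history with a GRU and parameterize p*(tau) as an exponential distribution."""

    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.to_rate = nn.Linear(hidden_size, 1)

    def forward(self, inter_times):
        # inter_times: (batch, seq_len) inter-event times tau_i = t_i - t_{i-1}
        x = inter_times.unsqueeze(-1)
        h, _ = self.rnn(x)                  # h_i summarizes the history H_{t_i}
        rate = F.softplus(self.to_rate(h))  # positive rate of the exponential p*
        return rate.squeeze(-1)

    def neg_log_likelihood(self, inter_times):
        rate = self(inter_times)
        # shift: the rate predicted after event i-1 scores the next inter-event time tau_i;
        # the rate for the very first event is fixed to 1.0 here (an assumption)
        rate = torch.cat([torch.ones_like(rate[:, :1]), rate[:, :-1]], dim=1)
        log_p = torch.log(rate) - rate * inter_times  # log density of Exp(rate)
        return -log_p.sum(dim=-1).mean()

model = RNNTPP()
taus = torch.rand(8, 20)  # dummy batch of inter-event times
loss = model.neg_log_likelihood(taus)
```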

Normalizing Flows:

  • Use transformations such as \(\exp(\cdot)\) or the softplus \(\log(1 + \exp(\cdot))\) to ensure that the output is positive, and combine them with other transformations to get a flexible distribution

Problems with Autoregressive TPPs:

  • Error accumulation
  • Computational complexity
  • Hard to model long-range interactions

Alternative: ADD and THIN - Diffusion for TPPs

  • Noising: Start from a data sample \(\mathbf{t}_0 \sim \lambda_0\)
  • For every diffusion step \(n\):
    • Thin \(\mathbf{t}_{n-1}\): randomly remove events from the sequence
    • Add: superpose events sampled from an HPP onto the thinned sequence
  • After enough steps the data is completely noised, i.e., distributed according to the HPP
  • End: Noised data \(\tilde{\mathbf{t}}_n \sim \lambda_{\text{HPP}}\)
  • Learn the denoising process
  • Now we take the entire sequence into account instead of generating autoregressively (a sketch of a single noising step follows below)
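A minimal sketch of one noising step (NumPy; the per-step keep probability and HPP rate are illustrative assumptions, not values from the original method):

```python
import numpy as np

def noising_step(events, keep_prob, hpp_rate, T, rng=None):
    """One forward (noising) step: thin the current sequence, then add HPP events."""
    rng = np.random.default_rng() if rng is None else rng
    # Thin: keep each existing event independently with probability keep_prob
    kept = events[rng.uniform(size=len(events)) < keep_prob]
    # Add: superpose events from a homogeneous Poisson process on [0, T]
    n_new = rng.poisson(hpp_rate * T)
    added = rng.uniform(0.0, T, size=n_new)
    return np.sort(np.concatenate([kept, added]))

t_0 = np.array([0.5, 0.7, 0.9, 4.2, 4.3])  # clean data sample
t_1 = noising_step(t_0, keep_prob=0.9, hpp_rate=0.1, T=10.0)
```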

Marked TPPs

  • Most common type: Categorical mark (e.g., the type of a tweet)
  • Each event has a category
  • Events of different classes can influence each other
  • We can also have a distribution over the types: the hidden state can parameterize a categorical distribution over the mark (see the sketch below)
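A minimal sketch of that last point (PyTorch; the names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

hidden_size, num_marks = 32, 5
to_mark_logits = nn.Linear(hidden_size, num_marks)

h_i = torch.randn(1, hidden_size)                  # hidden state summarizing the history
mark_dist = torch.distributions.Categorical(logits=to_mark_logits(h_i))
mark = mark_dist.sample()                          # sampled event type
log_prob = mark_dist.log_prob(mark)                # enters the marked log-likelihood
```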