Temporal Point Processes
Introduction
The main difference is that the exact time of each event now matters, not just the ordering of events.
- Time series: Measurements at regular intervals
- Temporal point processes: Asynchronous time with events at irregular intervals
Key Characteristics:
- Distribution over time points
- Discrete event sequence in continuous time
- Both location and time are random
We can model it autoregressively:
- Predict the time of the next event given the history \(H_t = \{ t_j \mid t_j < t \}\)
- \(H_t\) depends on the specific realization of past events \(\{ t_1, \ldots, t_n \}\) observed before \(t\)
- We denote the conditional density \(p^*(t) = p(t \mid H_t)\)
- \(p^*(t) \, dt\) is the probability that the next event falls in \([t, t + dt]\), i.e., there is an event in \([t, t + dt]\) and no event between the last event and \(t\)
Conditional Density:
Cumulative distribution function (CDF): \[ F^*(t) = \int_{t_n}^t p^*(u) \, du \] Probability that the next event occurs before \(t\), i.e., between the last event \(t_n\) and \(t\)
Survival function: \[ S^*(t) = 1 - F^*(t) = \int_t^\infty p^*(u) \, du \] Probability that the next event occurs after \(t\)
Conditional intensity function: \[ \lambda^*(t) = \frac{p^*(t)}{S^*(t)}, \qquad \lambda^*(t) \, dt = \frac{P(\text{event in } [t, t + dt] \text{ and no event in } [t_n, t))}{P(\text{no event in } [t_n, t))} \] Probability of an event in \([t, t + dt]\) given that there is no event in \([t_n, t)\)
- Intuitive interpretation: Expected number of events per unit of time
- Not a probability (it can be larger than 1)
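As a quick consistency check of these definitions, suppose the conditional density is exponential (this will be the homogeneous Poisson process below): if \(p^*(t) = \mu \exp(-\mu (t - t_n))\) for \(t > t_n\), then
\[ S^*(t) = \exp(-\mu (t - t_n)), \qquad \lambda^*(t) = \frac{p^*(t)}{S^*(t)} = \mu, \]
so a constant intensity corresponds exactly to exponentially distributed inter-event times.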
The likelihood of an entire sequence of events \(t_1, t_2, \ldots, t_n\) observed on \([0, T]\) is:
\[ p(t_1, t_2, \ldots, t_n) = p^*(t_1) \cdot p^*(t_2) \cdots p^*(t_n) \cdot S^*(T) = \lambda^*(t_1) \, \lambda^*(t_2) \cdots \lambda^*(t_n) \exp \left( - \int_0^T \lambda^*(t) \, dt \right) \]
since each factor can be written in terms of the intensity, e.g.
\[ p^*(t_1) = \lambda^*(t_1) \exp \left( - \int_0^{t_1} \lambda^*(t) \, dt \right) \]
Challenges:
- The integral \(\int_0^T \lambda^*(t) \, dt\) often has no closed form and is expensive to approximate
- The number of events varies from sequence to sequence
Models Based on Conditional Intensity Functions
- More flexible and interpretable
- Easy to combine different TPPs with different \(\lambda^*(t)\) to produce a new TPP
- Efficient sampling
Homogeneous Poisson Process (HPP)
- Simplest possible model
- \(\lambda^*(t) = \mu\)
- \(p^*(t) = \mu \exp(-\mu (t - t_n))\) for \(t > t_n\), where \(t_n\) is the last event, i.e., an exponential distribution
We can interpret it as modeling the inter-event times that are exponentially distributed.
Sampling:
- Generate an exponential random variable with rate \(\mu\) and add it to the last event time
- It doesn’t depend on the history (memoryless property), so we can sample all events at once
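A minimal sketch of this sampling scheme in Python (NumPy); the rate mu and window length T are chosen purely for illustration:

```python
import numpy as np

def sample_hpp(mu, T, rng=None):
    """Sample event times of a homogeneous Poisson process with rate mu on [0, T]."""
    rng = np.random.default_rng() if rng is None else rng
    events = []
    t = 0.0
    while True:
        t += rng.exponential(scale=1.0 / mu)  # exponential inter-event time
        if t > T:
            break
        events.append(t)
    return np.array(events)

events = sample_hpp(mu=2.0, T=10.0)  # on average mu * T = 20 events
```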
Inhomogeneous Poisson Process (IPP)
- Intensity function is time-dependent: \(\lambda^*(t) = \mu(t) \geq 0\)
- Still doesn’t depend on the history (memoryless property)
- Captures global trends (e.g., cars on the road during rush hour)
Expected Number of Events:
\[ N([a, b]) \sim \text{Poisson} \left( \int_a^b \mu(t) \, dt \right) \]
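As a hedged illustration, this integral can be approximated numerically; here with a hypothetical intensity \(\mu(t) = 2 + \sin(t)\):

```python
import numpy as np

mu = lambda t: 2.0 + np.sin(t)               # hypothetical time-dependent intensity
a, b = 0.0, 10.0
grid = np.linspace(a, b, 1001)
expected_events = np.trapz(mu(grid), grid)   # E[N([a, b])] = integral of mu(t) over [a, b]
count = np.random.poisson(expected_events)   # one realization of the event count
```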
Simulating an IPP:
- Simulating inter-event times is hard and requires integration
- Better alternative: Thinning algorithm
- Find a constant upper bound on the intensity: \(\bar{\lambda} \geq \max_t \mu(t)\)
- Sample candidate times from an HPP with rate \(\bar{\lambda}\)
- Accept each candidate \(t\) with probability \(\mu(t) / \bar{\lambda}\)
- Essentially acceptance-rejection sampling
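A minimal sketch of the thinning algorithm, reusing the hypothetical \(\mu(t) = 2 + \sin(t)\) from above (so \(\bar{\lambda} = 3\) is a valid upper bound):

```python
import numpy as np

def sample_ipp_thinning(mu, lam_bar, T, rng=None):
    """Sample an inhomogeneous Poisson process with intensity mu(t) <= lam_bar on [0, T]."""
    rng = np.random.default_rng() if rng is None else rng
    events = []
    t = 0.0
    while True:
        t += rng.exponential(scale=1.0 / lam_bar)   # candidate time from HPP with rate lam_bar
        if t > T:
            break
        if rng.uniform() < mu(t) / lam_bar:         # accept with probability mu(t) / lam_bar
            events.append(t)
    return np.array(events)

events = sample_ipp_thinning(lambda t: 2.0 + np.sin(t), lam_bar=3.0, T=10.0)
```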
Hawkes Process
- Self-exciting process
- Example: Tweets, where one tweet triggers another, then a pause until another tweet
\[ \lambda^*(t) = \mu + \alpha \sum_{i : t_i < t} k_w(t - t_i) \]
- Triggering kernel: \(k_w(t - t_i) = \beta \exp(-\beta (t - t_i)) \text{ for } t > t_i\)
- Parameters: \(\mu, \alpha, \beta\)
- Depends on the history
- Exhibits clustered, bursty behavior
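A minimal sketch of evaluating this conditional intensity with the exponential triggering kernel; the parameter values are purely illustrative:

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity of a Hawkes process with exponential kernel at time t."""
    history = np.asarray(history)
    past = history[history < t]                        # only events before t excite the intensity
    return mu + alpha * np.sum(beta * np.exp(-beta * (t - past)))

lam = hawkes_intensity(t=5.0, history=[1.0, 2.5, 4.8], mu=0.5, alpha=0.8, beta=1.0)
```

Since the intensity only decays between events with this kernel, its value right after the most recent event is a valid local upper bound, so sampling can again be done by thinning (Ogata's algorithm).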
Parameter Estimation in TPPs
- Maximize the likelihood of the observed data over the parameters (iid learning setting)
\[ \log P(t_1, t_2, \ldots, t_n) = \sum_i \log \lambda^*(t_i) - \int_0^T \lambda^*(t) \, dt \]
- Various optimization algorithms
- Simple models like the HPP allow for closed-form solutions (see the derivation after this list)
- For Hawkes, we can use convex optimization
- We can use gradient descent with modifications for constraints
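For example, the closed-form HPP solution follows directly from this log-likelihood: with \(\lambda^*(t) = \mu\),
\[ \log p(t_1, \ldots, t_n) = n \log \mu - \mu T, \qquad \frac{\partial}{\partial \mu} \left( n \log \mu - \mu T \right) = \frac{n}{\mu} - T = 0 \;\Rightarrow\; \hat{\mu} = \frac{n}{T}, \]
i.e., the estimated rate is the observed number of events per unit of time.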
Advantages and Disadvantages of Using Intensity:
- Advantages: Easy to interpret, easy to sample from, easy to define models with simple dynamics
- Disadvantages: Numerical integration might be intractable; not clear how to define flexible models with arbitrary dynamics
This motivates revisiting RNNs to model the conditional density directly.
Directly Model the Conditional Density \(p^*(t \mid H_t)\) with an RNN:
- Every time an event happens, we feed \(\tau_i = t_i - t_{i-1}\) to the RNN
- Use the hidden state to model the history
- Use it to model the parameters of the \(p^*\) distribution
- \(p^*(t) = p(t \mid H_t) = p(t \mid h_i)\)
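A minimal sketch of this autoregressive model in PyTorch, assuming (purely for illustration) a GRU over inter-event times and an exponential \(p^*\) whose rate is obtained from the hidden state via a softplus:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNTPP(nn.Module):
    """Autoregressive TPP: the hidden state encodes the history and parameterizes p*."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.rate_head = nn.Linear(hidden_size, 1)

    def neg_log_likelihood(self, taus):
        # taus: (batch, seq_len) inter-event times tau_i = t_i - t_{i-1}
        x = taus.unsqueeze(-1)
        h, _ = self.rnn(x)
        # h_{i-1} parameterizes the distribution of tau_i; use a zero state for the first event
        h = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)
        rate = F.softplus(self.rate_head(h)).squeeze(-1)   # positive rate of the exponential
        log_p = torch.log(rate) - rate * taus              # log density of Exp(rate) at tau_i
        # note: the survival term for the end of the observation window is omitted in this sketch
        return -log_p.sum(dim=-1).mean()

model = RNNTPP()
taus = torch.rand(8, 20)                 # toy batch of inter-event times
loss = model.neg_log_likelihood(taus)
loss.backward()
```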
How to Model \(p^*\):
- Simple non-negative distributions: Exponential, Gamma, Weibull, Gompertz
- Use a mixture of distributions: Convex combinations of densities
- If we pick the exponential distribution, the model is neither an HPP nor an IPP, since the rate \(\mu\) depends on the history and changes after every event
Normalizing Flows:
- Use transformations such as \(\exp(\cdot)\) or the softplus \(\log(1 + \exp(\cdot))\) to ensure that the output is positive, and combine them with other transformations to get a flexible distribution over inter-event times
Problems with Autoregressive TPPs:
- Error accumulation
- Computational complexity
- Hard to model long-range interactions
Alternative: ADD and THIN - Diffusion for TPPs
- Noising: Data sample \(t_0 \sim \lambda_0\)
- For every diffusion step \(n\):
- Thin \(t_{n-1}\): randomly remove events from the current sequence
- Add: superpose events sampled from an HPP onto the thinned sequence
- After enough steps, the data is completely noised and distributed according to the HPP
- End: Noised data \(\tilde{t}_n \sim \lambda_{\text{HPP}}\)
- Learn the denoising process
- The denoising model takes the entire sequence into account, so the approach is not autoregressive
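A minimal sketch of one forward (noising) step under the assumptions above, with a keep probability keep_prob and an added-noise HPP rate lam_add chosen purely for illustration (the actual schedules in the Add-and-Thin method are more involved):

```python
import numpy as np

def noising_step(events, keep_prob, lam_add, T, rng=None):
    """One 'thin then add' step: drop events at random, then superpose HPP noise events."""
    rng = np.random.default_rng() if rng is None else rng
    kept = events[rng.uniform(size=len(events)) < keep_prob]   # thinning
    n_new = rng.poisson(lam_add * T)                             # number of noise events
    added = rng.uniform(0.0, T, size=n_new)                      # HPP events are uniform on [0, T]
    return np.sort(np.concatenate([kept, added]))

noisy = noising_step(np.array([1.0, 2.5, 4.8]), keep_prob=0.8, lam_add=0.5, T=10.0)
```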
Marked TPPs
- Most common type: Categorical mark (type of tweet)
- Each event has a category
- Events of different classes can influence each other
- We can place a distribution over the types: the hidden state can parameterize a categorical distribution over the mark of the next event
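Continuing the PyTorch sketch from above, a hypothetical mark head that turns the hidden state into a categorical distribution over event types:

```python
import torch
import torch.nn as nn

hidden_size, num_marks = 32, 5             # illustrative sizes
mark_head = nn.Linear(hidden_size, num_marks)

h_i = torch.randn(1, hidden_size)          # hidden state summarizing the history H_t
logits = mark_head(h_i)
mark_dist = torch.distributions.Categorical(logits=logits)
next_mark = mark_dist.sample()             # predicted type of the next event
```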