Science Advances. 2025 Oct 10;11(41):eadw4544. doi: 10.1126/sciadv.adw4544

Concise network models of memory dynamics reveal explainable patterns in path data

Rohit Sahasrabuddhe 1,2,*, Renaud Lambiotte 1,*, Martin Rosvall 3
PMCID: PMC12513435  PMID: 41071875

Abstract

Network methods capture the interplay between structure and dynamics of complex systems across scales by modeling indirect interactions as random walks. However, path data from real-world systems frequently exhibit memory effects that this first-order Markov model fails to capture. Although higher-order Markov models can capture these effects, they grow rapidly in size and require large amounts of data, making them prone to overfitting some parts and underfitting others in systems with uneven coverage. To address this challenge, we construct concise networks from path data by interpolating between first-order and second-order Markov models. We prioritize simplicity and interpretability by creating state nodes that capture prominent second-order effects and by proposing a transparent measure that balances model size and quality. Our concise networks reveal large-scale memory patterns in both synthetic and real-world systems while remaining far simpler than full second-order models.


Concise network models interpolating between Markov orders balance simplicity and quality of fit to reveal patterns in path data.

INTRODUCTION

Networks model complex systems by abstracting their components as nodes and interactions as edges (1). These edges often represent flows of some quantity, such as passengers between airports, information between people, or workers between occupations. Although edges represent only direct flows, a defining feature of networks is their ability to capture indirect flows through paths or walks. This capacity allows researchers to analyze complex systems across scales (2), revealing mesoscale structures such as communities and roles and macroscale patterns such as hierarchies and rankings. Capturing these patterns assumes that flows are transitive: Given flow from node i to node j and from j to node k, there is implied indirect flow from i to k through j, often modeled as a first-order Markov process.

In a first-order network model of trajectory data, nodes represent system components and edges capture transition rates between them. However, sequence data from many real-world systems exhibit higher-order dependencies (3–6) that a first-order model fails to capture. To address this issue, researchers have developed network models with memory (7–11), extending the network toolbox to contexts where the Markovian assumption breaks down. A natural extension is a second-order network, which incorporates one-step memory using state nodes of the form j|i—indicating arrival at j from i, where j and i are nodes in the first-order network, first-order nodes for short. This approach generalizes to build networks with fixed-order memory of any order, enabling models that capture increasingly complex dependencies.

Modeling an entire system with a single Markov order imposes a strong constraint because real systems often exhibit dependencies spanning multiple orders. Combining Markov models up to a maximum order can improve next-element prediction in real-world systems (10, 12). However, uneven observation density can cause such models to overfit some regions and underfit others. An alternative is to add state nodes only where needed (9, 13). Although these approaches help mitigate overfitting, they rely on heuristics to estimate the importance of memory effects, limiting their interpretability and scalability even for moderately sized systems. Effective models must strike a balance between simplicity and predictive accuracy (14).

In this work, we build principled and interpretable models that interpolate between first-order and second-order networks. Using nonnegative matrix factorization (NMF) to identify prominent modes in second-order dynamics, we construct coarse-grained state nodes that represent these patterns instead of individual predecessors. To balance quality and complexity, we introduce a simple performance measure and incorporate an empirical Bayes-style regularization to mitigate overfitting. We validate our method on synthetic data and apply it to two real-world systems: air travel and information spread. In both cases, the method captures essential memory effects with a small number of interpretable state nodes, providing insights beyond those revealed by first-order models.

RESULTS

Concise network models of path data

Understanding system dynamics requires models that balance simplicity with explanatory power. Our approach tackles three competing demands: capturing memory effects that first-order models miss, preventing the overfitting that plagues second-order models, and maintaining the interpretability that users need. We construct concise and interpretable network representations that capture key memory effects in path data, interpolating between first-order and second-order Markov models. To this end, we split first-order nodes into state nodes that capture prominent second-order effects. We illustrate the steps we follow for first-order node j with an example in Fig. 1. Consider data in the form of trajectories through a network, each of which is a sequence of nodes. We extract the flow through j into matrix A of observation counts such that A_ki is the frequency of the trigram i → j → k. We omit the dependence on j from our notation because the following steps are carried out separately for each j and use i and k to refer to predecessors and successors, respectively. Overlap in the sets of predecessors and successors has no effect on our method. Let matrix M(2) be the maximum likelihood estimate (MLE) of second-order transition rates, obtained by normalizing the columns of A.

\[ M^{(2)}_{ki} \equiv \frac{A_{ki}}{\sum_{k} A_{ki}} \tag{1} \]
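Eq. 1 is a plain column normalization of the count matrix. A minimal numpy sketch (the function name and toy counts are ours, not from the paper's code):

```python
import numpy as np

def mle_second_order(A):
    """Eq. 1: column-normalize trigram counts A, where A[k, i] counts
    the trigram i -> j -> k for a fixed middle node j."""
    A = np.asarray(A, dtype=float)
    col_sums = A.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # guard unobserved predecessors
    return A / col_sums

# Toy node: 3 successors (rows) x 2 predecessors (columns).
A = np.array([[8, 0],
              [2, 5],
              [0, 5]])
M2 = mle_second_order(A)
print(M2[:, 0])  # → [0.8 0.2 0. ]
```

Each column of the result is the MLE of the out-distribution of one predecessor.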

Fig. 1. Method schematic.


We illustrate our method using transitions through a first-order node (top left). (i) Input data: Matrix A counts observations, with rows showing successors and columns showing predecessors. This example shows three blocks; predecessors connect primarily to successors in the corresponding block, and the first and second blocks share more connections. Two predecessors in each block have sparse data. (ii) MLE: The MLE M(2) of the second-order transition probabilities overfits to the undersampled predecessors and creates unreliable transition probabilities. (iii) Regularization: Our prior smooths these extremes and creates target second-order dynamics in matrix X that balance observed patterns with statistical reliability. (iv) Pattern discovery: We create state nodes by decomposing X using convex NMF into transition matrices X^out and X^in , which capture transitions out of and into the state nodes. (v) Concise models: Models with two and three state nodes recover the planted patterns. Two state nodes merge the similar first and second blocks, whereas three state nodes distinguish them. In both cases, well-sampled predecessors transition mainly to single state nodes whereas undersampled predecessors spread across multiple state nodes. (vi) Quality assessment: Flow overlap measures how well our approximation X^ captures the target X . In this example, adding more than three state nodes achieves minimal improvement, indicating that three state nodes capture the essential large-scale patterns.

The MLE is susceptible to overfitting (see Fig. 1), which we mitigate by introducing a prior.

The prior

Let vector M(1) be the MLE of first-order transition rates from j,

\[ M^{(1)}_{k} \equiv \frac{\sum_{i} A_{ki}}{\sum_{k,i} A_{ki}} \tag{2} \]

measuring the probability of moving from j to k independently of i. We apply a Dirichlet prior to each column M^(2)_i with parameters M(1) and strength μ > 0 (see Methods). This corresponds to assigning a pseudocount of μ M^(1)_k to A_ki. The matrix X is the posterior mean of the second-order transition rates

\[ X_{ki} \equiv \frac{A_{ki} + \mu M^{(1)}_{k}}{\sum_{k} A_{ki} + \mu} \tag{3} \]

We can write the column X_i as

\[ X_{i} = \left( \frac{\mu}{\mu + n_{i}} \right) M^{(1)} + \left( 1 - \frac{\mu}{\mu + n_{i}} \right) M^{(2)}_{i} \tag{4} \]

where n_i = ∑_k A_ki. This interpolation balances memory and regularization: High μ pushes each column in X toward the memoryless model M(1), whereas low μ preserves the patterns in M(2). In cases where system knowledge cannot inform the strength of the prior μ, we use leave-one-out cross-validation (see Methods). The prior is stronger for undersampled predecessors, regularizing their transition rates (see Fig. 1). X contains the regularized dynamics through j that we aim to model with interpretable components.
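A small numpy sketch of Eqs. 2 to 4 makes the interpolation concrete (the helper name and toy counts are ours):

```python
import numpy as np

def regularized_dynamics(A, mu):
    """Eq. 3: posterior-mean second-order rates X under a Dirichlet prior
    with parameters M1 (the first-order MLE of Eq. 2) and strength mu."""
    A = np.asarray(A, dtype=float)
    M1 = A.sum(axis=1) / A.sum()  # first-order MLE over successors
    n = A.sum(axis=0)             # observations per predecessor
    X = (A + mu * M1[:, None]) / (n + mu)
    return X, M1, n

# Two successors (rows), two predecessors (columns).
A = np.array([[8., 1.],
              [2., 9.]])
X, M1, n = regularized_dynamics(A, mu=5.0)
print(X[:, 0])  # column 0 is pulled toward M1 = [0.45, 0.55]
```

Writing out Eq. 4 for each column recovers exactly the same matrix, which is a quick sanity check on an implementation.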

Constructing state nodes

With reliable target dynamics established, we tackle the core challenge: discovering X’s hidden behavioral patterns that enable efficient representation. We approximate the target dynamics in X by splitting j into state nodes that capture prominent second-order patterns. Unlike full second-order state nodes that represent specific predecessor-successor pairs, our state nodes capture behavioral modes. For air traffic, for example, predecessor-successor pairs correspond to specific routes like arriving at Denver from Chicago, whereas our state nodes capture broader patterns such as eastbound transit or westbound transit. Each state node α has estimated transition probabilities from predecessors and to successors. P^(i → α) is the probability that a random walker arriving at j from i uses α. P^(α → k) is the probability that it moves to k from α.

\[ X_{ki} \approx \hat{X}_{ki} \equiv \sum_{\alpha} \hat{P}(i \to \alpha) \, \hat{P}(\alpha \to k) \tag{5} \]

We can rewrite this expression as

\[ \hat{X} = \hat{X}^{\mathrm{out}} \left( \hat{X}^{\mathrm{in}} \right)^{\top}, \quad \text{where} \quad \left( \hat{X}^{\mathrm{in}} \right)_{i\alpha} \equiv \hat{P}(i \to \alpha), \quad \left( \hat{X}^{\mathrm{out}} \right)_{k\alpha} \equiv \hat{P}(\alpha \to k) \tag{6} \]

For a given number of state nodes r ≤ min(number of predecessors, number of successors), we use convex NMF to find the optimal approximation such that X^in and X^out are transition matrices (see Methods). We have omitted the dependence of the estimated matrices on r to simplify notation. Crucially, each column of X^out is a convex combination of the columns of X. This ensures that the state nodes model plausible behavior (see Fig. 1).

Model quality

For each first-order node, the trade-off between model size and description quality is made in the choice of r. This choice connects directly to our regularization: When μ preserves complex patterns, we need high r to capture them; when μ creates uniform behavior, low r suffices. We define a simple and intuitive measure of quality of fit to guide this choice in situations where it cannot be made with system knowledge. We define the flow overlap of two discrete probability distributions over the same domain as the probability mass in the same elements

\[ \text{flow overlap}(p, q) \equiv \sum_{i} \min\left( p(i), q(i) \right) \tag{7} \]

where p and q are discrete probability distributions over a set indexed by i. We can extend flow overlap to transition matrices as the (weighted) mean flow overlap of every column. For the rank r solution X^ , flow overlap becomes

\[ \text{flow overlap}(X, \hat{X}) \equiv \frac{1}{\sum_{i} n_{i}} \sum_{k,i} n_{i} \min\left( X_{ki}, \hat{X}_{ki} \right) \tag{8} \]

This quantity ranges from 0 to 1 and measures the fraction of flow through the first-order node that our approximation X^ captures. When the target dynamics of all the predecessors are similar, possibly caused by a strong prior, the r = 1 solution achieves high flow overlap. At the other extreme, when each predecessor has very different behavior, flow overlap will remain low until r approaches the number of predecessors or successors. Our method works best when the target dynamics contain large-scale patterns, as in our example, where flow overlap increases rapidly until rank 3, after which adding state nodes captures no important new behavior (Fig. 1). We pick the optimal number of state nodes as the lowest r such that the flow overlap reaches a threshold.
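Flow overlap is cheap to compute; a numpy sketch of Eqs. 7 and 8 (the function name and toy matrices are ours):

```python
import numpy as np

def flow_overlap(X, X_hat, n):
    """Eq. 8: weighted mean over columns of the per-column overlap
    sum_k min(X_ki, X_hat_ki), weighting column i by its count n[i]."""
    per_column = np.minimum(X, X_hat).sum(axis=0)  # Eq. 7 per predecessor
    return float(per_column @ n / n.sum())

X = np.array([[0.8, 0.1],
              [0.2, 0.9]])
X_hat = np.array([[0.7, 0.3],
                  [0.3, 0.7]])
n = np.array([3.0, 1.0])
print(flow_overlap(X, X_hat, n))  # → 0.875
```

An exact approximation gives overlap 1, and the measure degrades gracefully as probability mass is misplaced.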

Constructing the network

We can create state nodes independently in parallel for each first-order node. In large systems, we can also restrict the creation of state nodes to a subset of important first-order nodes. We put the concise network model together by linking state node αi to βj with weight

\[ \hat{P}(\alpha_{i} \to \beta_{j}) = \left( \hat{X}^{\mathrm{out},i} \right)_{j \alpha_{i}} \left( \hat{X}^{\mathrm{in},j} \right)_{i \beta_{j}} = \hat{P}(\alpha_{i} \to j) \, \hat{P}(i \to \beta_{j}) \tag{9} \]

where the superscript indicates the first-order node. Convex NMF tends to create sparse factors (15). Many of the entries of X^in and X^out will be close to 0, meaning that predecessors and successors interact primarily with a subset of state nodes. However, because the NMF is likely to converge before the weights are exactly 0, state nodes have dense neighborhoods with many low-weight edges. We recommend trimming these low-importance edges, for which we implement a simple threshold-based approach (see Methods).

Synthetic experiments

We examine the performance of our method on synthetic first-order nodes with 50 predecessors and successors each. Defining nm modes—distributions over the successors—we sample data for each predecessor from a convex combination of modes with weights drawn from a symmetric distribution with spread controlled by c > 0. The distribution is uniform for c = 1, and higher values decrease the variance. Intuitively, for c ≪ 1, each predecessor closely resembles a single mode, whereas for c ≫ 1, it is a uniform mixture of them. We vary nm ∈ {2, 5, 10} and c ∈ {0.5, 1.0, 1.5} and generate 25 synthetic first-order nodes for each pair (see Methods for details).

First, we assess whether creating state nodes improves the quality of fit. For nm = 2 and 5, flow overlap increases rapidly with the addition of state nodes, plateauing at a high value after nm state nodes (Fig. 2B). Although this elbow-like behavior is less clear for nm = 10, flow overlap reaches high values of more than 0.8 with 10 state nodes. For the same nm, solutions with fewer than nm state nodes do better for higher c. This is expected because predecessors tend to be similar to each other for high values of c.

Fig. 2. Synthetic experiments.


(A) Typical example of X for nm = 2 modes and concentration parameter c = 0.5. Most predecessors closely resemble a single mode. (B) Median flow overlap as a function of the number of state nodes. We note that the single-state node solution corresponds to a first-order model. (C) Flow overlap similarity between the state nodes created by our method (solid) and the baseline (hatched) with the predecessors closest to the planted modes. We distinguish nm with colors and c with transparency.

Second, we explore the interpretability of the model with nm state nodes. Because the planted modes are extreme behaviors unlikely to be observed in the data, we do not want the state nodes to match them exactly. Instead, we compare the state nodes to the predecessors, which are most similar to the modes (see Methods). We report the mean flow overlap similarity (Eq. 7) of the best matching of state nodes with these predecessors (Fig. 2C). As a baseline, we randomly generate 50 representations with nm state nodes for each first-order node (see Methods). Not only do we outperform the baseline for all values of the parameters, but we also achieve objectively high values of similarity (Fig. 2C).

Transit flow through airports

To validate our method on real-world systems, we start with air traffic because airports handle millions of passengers whose itineraries create clear memory effects. Although second-order memory is important in modeling the flow of passengers through airports (8, 16, 17), full second-order networks can be impractically large and are prone to overfitting. Here, we investigate whether our concise network model can capture key memory effects with interpretable state nodes and offer insights beyond a first-order network. We use open-source data from the US Bureau of Transport Statistics to construct a dataset of around 4.3 million domestic transits through 435 airports in the United States. We are interested specifically in the role of airports as transit hubs and only consider nonreturn transits of the form i → j → k where i ≠ k (see Methods).

The five largest airports by transit volume—in Atlanta, Dallas–Fort Worth, Denver, Charlotte, and Chicago—account for 42% of all transits, making it crucial to capture potential memory effects in their traffic. However, they have an average of 170 connections each, and a full second-order model is impractical. We explore whether we can build a more concise model by identifying meaningful modes of behavior (see note S2.1). Flow overlap increases rapidly between rank 1 and rank 2 or 3, indicating large-scale patterns that can be modeled with only a few state nodes. For instance, adding two state nodes for Denver increases flow overlap from 0.60 to 0.74 by capturing intuitive behavior—passengers arriving from the east are likely to continue westward and vice versa (Fig. 3A). We observe similar patterns for the other large national hubs, revealing that passengers use them to travel between distant regions.

Fig. 3. State nodes and connectivity in the airport network.


(A) Denver with two state nodes. For either state node, we plot with gray circles the five predecessors [respectively, (resp.) successors] with the highest (X^in)_iα [resp. (X^out)_kα] from among the 20 most-observed predecessors (resp. successors) of Denver. The state nodes are denoted as blue and orange squares with edges of the same color. Dashed (resp. solid) lines are edges to (resp. from) state nodes. The black star marks the location of Denver. (B) Gain in three-leg connectivity (y axis)—log10(ρc/ρfo)—against the distance between o and d (x axis) for (o,d) pairs of airports ranked 11 to 20 by the number of transits. The black dashed line denotes no gain. (C) Distribution of gain in three-leg displacement—log10(δc/δfo)—for all origin airports except the 10 largest hubs. The black dashed line denotes no gain. The orange dashed line marks the mean.

First-order networks cannot model this behavior, and modeling memory effects changes connectivity analysis in transport systems. We analyze this by constructing (i) a first-order network Gfo and (ii) a concise memory network Gc with state nodes for the 10 largest airports, which account for 57% of the transits (see Methods). Using a flow overlap threshold of 0.7, Gc has 27 state nodes for these airports and is much smaller than a network with full second-order models for them, which would have 1467 state nodes.

Modeling a passenger’s itinerary as a random walk on the network, we define three-leg connectivity ρfo(o,d) [respectively, (resp.) ρc(o,d)] as the probability that a passenger starting at origin o reaches destination d in three or fewer steps on the first-order (resp. concise) network (see Methods). Comparing Gc to Gfo, the gain in connectivity is log10(ρc/ρfo). For (o,d) pairs in the next 10 largest airports, the gain (i) increases with the geographic distance between o and d (Pearson r = 0.88, Kendall τ = 0.74, both with P value < 10^−16) and (ii) is positive for all pairs of airports more than 2500 km apart (Fig. 3B). This result shows that first-order models systematically underestimate connectivity between distant airports, missing how passengers navigate the network through major hubs to traverse continental distances efficiently (fig. S12).

Covering larger distances on Gc is not unique to journeys from big airports. Let δfo(o) [resp. δc(o)] be the expected displacement from origin o after three steps on Gfo (resp. Gc). The gain in three-leg displacement—log10(δc/δfo)—is positive for most origin airports (Fig. 3C). The mean gain of 0.02 is significantly greater than 0 (one-tailed t test statistic = 27.2, P value < 10^−16). By capturing the role of large national hubs in routing traffic across distant regions, Gc shows that passengers can travel long distances in a few flights.

Group structure in information flow

Information flow is an important process in the analysis of social networks, where individuals are modeled as nodes and their interactions as edges. In reality, people interact in many different contexts, and whom you pass information on to likely depends on whom you got it from. Using social networks of co-work and friendship among 71 lawyers at a firm (18), we generate synthetic trajectories where information received from a friend (resp. co-worker) is passed on to a friend (resp. co-worker). From these, we construct (i) the first-order network Gfo , (ii) a concise network Gc with flow overlap threshold = 0.9, having two state nodes each for 52 individuals, and (iii) the second-order network Gso (see Methods).

Identifying group structures is key to understanding information spread. We use Infomap (19), a flow-based community detection tool, to find groups of people within which information circulates rapidly before spreading to the rest of the network (see Methods). The three communities of Gfo (Fig. 4A) correlate with work-related metadata. Office location splits the individuals in community 3 from the rest, who are then divided into litigators and corporate lawyers (table S4).

Fig. 4. Group structure of information flow in the social network.


Communities identified by Infomap in the (A) first-order and (B) concise networks. The circles are first-order nodes with colors indicating community membership. The legend includes community sizes, with the two smallest communities with fewer than five nodes each in Gc labeled Other. Node positions are determined by the spring layout applied to Gfo . (C) Jaccard similarity of first-order nodes in each community of Gc (x axis) with those of Gfo (y axis; top) and Gso (y axis; bottom). The cells are ordered by community label, and their width (resp. height) is proportional to the size of the community in Gc (resp. Gfo and Gso).

For higher-order networks, Infomap works at a state node level, allowing communities to overlap at first-order nodes (20). Gc has seven communities (Fig. 4B), of which two are very small. Communities 1 to 3 are nonoverlapping and are the same groups of co-workers identified in Gfo (Fig. 4C, top). Communities 4 and 5 overlap with 1 and 2 and are novel to Gc . They are friendship groups of people who work in the same office and correspond exactly to communities in the friendship network (note S3.3). This analysis shows that Gc effectively captures overlapping social circles, revealing distinct yet interconnected groups of friends and co-workers. Despite having more than seven times as many nodes, the community structure of Gso is very similar to that of Gc (Fig. 4C, bottom), showing that we can model the important memory effects with far fewer nodes.

DISCUSSION

We introduce a method to construct concise network models from path data. Using NMF, we create interpretable state nodes that reveal large-scale patterns in second-order dynamics. To prevent overfitting, we incorporate an empirical Bayes-style regularization and define a simple performance measure that balances model size and quality. Our approach produces compact networks that retain essential memory effects in empirical data at a fraction of the complexity of full second-order models.

Although we prioritize simplicity and interpretability, our framework is modular. Each stage of the pipeline can be adjusted to suit the system under study—for instance, by using cross-validation or statistical model selection criteria to choose the number of state nodes, applying more sophisticated methods for edge pruning, or experimenting with alternative regularizers and loss functions in the matrix factorization step. These extensions offer promising directions for future work.

The flexibility and scalability of our method make it well suited for analyzing path data across diverse domains. A natural extension is to incorporate higher-order memory effects. Similar techniques could also help identify and model nonstationary patterns in paths or build compact representations of temporal networks.

METHODS

Constructing concise networks

The prior

We prevent overfitting the second-order transition probabilities by imposing a prior on the MLE. Each column of M(2) is a distribution over the successors, which can be viewed as the event probability parameters of a multinomial distribution. For the prior, we choose its conjugate prior, the Dirichlet distribution. This ensures that the posterior is also a Dirichlet distribution, whose mean we use to assemble the columns of X.

We choose the MLE of the first-order transition rates M(1) as the parameters of the Dirichlet prior. This ensures that, when the prior is strong, the target dynamics resemble the memoryless model. Here, we assume that M(1) is a good estimate of first-order transition probabilities. This assumption is not essential to our framework. For instance, in cases where data are very sparse, a uniform prior or one conserving node degree might be more suitable. However, data that are insufficient to estimate a first-order model are unlikely to be usable for higher-order modeling.

Leave-one-out cross-validation to pick μ

The choice of the prior strength is critical as high values of μ wash out memory effects whereas low values fail to remove the noise from M(2). In cases where system knowledge cannot inform the choice, we use leave-one-out cross-validation (21). From Eq. 3, the likelihood of observing i → j → k in a model trained on all other transitions is

\[ \frac{A_{ki} - 1 + \mu M^{(1)}_{k}}{n_{i} - 1 + \mu} \tag{10} \]

We pick the μ that maximizes the log-likelihood of every observed transition in a model trained on all the others, i.e.

\[ \mu^{*} = \arg\max_{\mu} \sum_{k,i} A_{ki} \log \frac{A_{ki} - 1 + \mu M^{(1)}_{k}}{n_{i} - 1 + \mu} \tag{11} \]

Our choice of leave-one-out cross-validation is motivated by its closed-form objective function and ease of optimization. Using other methods such as k-fold cross-validation to estimate the parameter(s) of the prior does not affect the rest of our method.
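Eq. 11 has a closed form that is easy to evaluate, so a simple grid search suffices as a stand-in for whatever optimizer is used in practice (function names and the grid are ours). Strongly memoryful counts should favor a weak prior:

```python
import numpy as np

def loo_log_likelihood(A, mu):
    """Eq. 11 objective: leave-one-out log-likelihood of observed trigrams."""
    A = np.asarray(A, dtype=float)
    M1 = A.sum(axis=1) / A.sum()
    n = A.sum(axis=0)
    # Probability of each observed transition in a model trained on the rest.
    P = (A - 1 + mu * M1[:, None]) / (n - 1 + mu)
    mask = A > 0
    return float(np.sum(A[mask] * np.log(P[mask])))

def pick_mu(A, grid):
    """Grid-search sketch: return the mu in `grid` maximizing Eq. 11."""
    return max(grid, key=lambda mu: loo_log_likelihood(A, mu))

# Perfectly memoryful counts: each predecessor has its own successor.
A = np.array([[50., 0.],
              [0., 50.]])
print(pick_mu(A, [0.1, 1.0, 10.0, 100.0]))  # → 0.1
```

Here the cross-validated choice keeps the prior weak, preserving the strong second-order pattern.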

Convex NMF to construct state nodes

NMF is a tool to create low-rank approximations of matrices with interpretable factors. The goal of NMF is to obtain factorizations of the form X ≈ X^ = FG^⊤, where F, G ≥ 0. Convex NMF constrains the space of solutions by requiring the columns of F to lie in the column space of X. Thus, we search for a factorization X^ = XWG^⊤, where W, G ≥ 0. We use the multiplicative updates from (15) to optimize the loss function

\[ \left\| \hat{X} - X \right\|^{2} = \sum_{k,i} \left( \hat{X}_{ki} - X_{ki} \right)^{2} \tag{12} \]

In each iteration, we update

\[ G \leftarrow G \odot \sqrt{ \left( X^{\top} X W \right) \oslash \left( G W^{\top} X^{\top} X W \right) } \]
\[ W \leftarrow W \odot \sqrt{ \left( X^{\top} X G \right) \oslash \left( X^{\top} X W G^{\top} G \right) } \]

where ⊙ and ⊘ are element-wise multiplication and division, respectively, until the solution converges to a local optimum [see ref. (15) for proofs of correctness and convergence]. The solution is not unique. For instance, we can write XWG^⊤ = X(WA^−1)(GA^⊤)^⊤ for a suitable invertible matrix A, giving alternative solutions with the same loss. Requiring that the matrix factors are transition matrices eliminates this degeneracy. We set

\[ \hat{X}^{\mathrm{out}} = X W D_{W}^{-1} \tag{13} \]
\[ \hat{X}^{\mathrm{in}} = G D_{W} \tag{14} \]

where DW is the diagonal matrix of the column sums of W . The columns of X^out are convex combinations of the columns of X , making X^out column stochastic. The row sums of X^in will be close to 1, and we normalize them. For each r, we pick the solution with the lowest loss among several candidates with different initializations.
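A compact numpy sketch of the update loop and the rescaling of Eqs. 13 and 14 (random positive initialization and the small epsilon are our simplifications; the paper initializes with k-means, see Initialization):

```python
import numpy as np

def convex_nmf(X, r, n_iter=1000, seed=0):
    """Convex NMF X ~ X W G^T via multiplicative updates, then rescaling
    (Eqs. 13-14) so the factors are transition matrices."""
    rng = np.random.default_rng(seed)
    n_pred = X.shape[1]
    W = rng.random((n_pred, r)) + 0.1
    G = rng.random((n_pred, r)) + 0.1
    XtX = X.T @ X
    eps = 1e-12
    for _ in range(n_iter):
        G *= np.sqrt((XtX @ W) / (G @ (W.T @ XtX @ W) + eps))
        W *= np.sqrt((XtX @ G) / (XtX @ W @ (G.T @ G) + eps))
    D = W.sum(axis=0)                  # column sums of W
    X_out = (X @ W) / D                # Eq. 13: column stochastic
    X_in = G * D                       # Eq. 14: rows ~ P(i -> state node)
    X_in /= X_in.sum(axis=1, keepdims=True)
    return X_out, X_in

# Rank-2 structure: two blocks of similar predecessors.
X = np.array([[0.9, 0.8, 0.1, 0.2],
              [0.1, 0.2, 0.9, 0.8]])
X_out, X_in = convex_nmf(X, r=2)
```

Because each column of X_out is a convex combination of the (stochastic) columns of X, column stochasticity holds by construction; the row normalization of X_in is the small correction discussed above.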

Initialization

We use the equivalence of convex NMF to soft k-means clustering (note S1) to pick good initial values. We initialize G to a smoothed version of the k-means clustering of the columns of X into r clusters

\[ G_{i\alpha} = \begin{cases} 1, & \text{if column } i \in \text{cluster } \alpha \\ 0.2, & \text{otherwise} \end{cases} \tag{15} \]

We initialize W to a row-normalized version of G.
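The smoothed k-means initialization of Eq. 15 can be sketched with a few Lloyd iterations in numpy (our mini k-means stands in for whatever clustering routine is used in practice; the 0.2 smoothing value is from Eq. 15):

```python
import numpy as np

def init_G(X, r, seed=0, smooth=0.2):
    """Eq. 15: cluster the columns of X with a small k-means, then set
    G[i, a] = 1 if column i is in cluster a and `smooth` otherwise."""
    rng = np.random.default_rng(seed)
    cols = X.T  # one row per predecessor
    centers = cols[rng.choice(len(cols), r, replace=False)]
    for _ in range(20):  # Lloyd iterations
        labels = np.argmin(((cols[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for a in range(r):
            if np.any(labels == a):
                centers[a] = cols[labels == a].mean(axis=0)
    G = np.full((len(cols), r), smooth)
    G[np.arange(len(cols)), labels] = 1.0
    return G

X = np.array([[0.9, 0.8, 0.1, 0.2],
              [0.1, 0.2, 0.9, 0.8]])
G = init_G(X, r=2)
```

The smoothing keeps all entries strictly positive so the multiplicative updates can move any weight away from its initial value.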

Convergence

The loss is guaranteed to be nonincreasing under the multiplicative updates (15). In the analyses in this work, we declare convergence when the relative decrease in loss over 10 iterations is less than 10−4.

Trimming the neighborhood of state nodes

We implement a simple method to trim out low-importance edges from the neighborhood of state nodes, improving the interpretability and sparsity of the network. For a first-order node j with rank r, the importance of state node α to predecessor i is (X^in)_iα, the probability that a trajectory from i passes through α. We retain edge i → α if

\[ \left( \hat{X}^{\mathrm{in}} \right)_{i\alpha} \geq \frac{1}{r} \times \sigma \tag{16} \]

where σ is a parameter that controls the strictness of the trimming such that smaller values retain more edges. Similarly, we keep α → k if

\[ \frac{ \left( \hat{X}^{\mathrm{out}} \right)_{k\alpha} }{ \sum_{\beta} \left( \hat{X}^{\mathrm{out}} \right)_{k\beta} } \geq \frac{1}{r} \times \sigma \tag{17} \]
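Under our reading of Eqs. 16 and 17 (keep a predecessor edge when its weight exceeds σ/r, and a successor edge when the state node carries at least a σ/r share of the state-node flow into that successor), the trimming is a pair of thresholds; a numpy sketch with illustrative matrices:

```python
import numpy as np

def trim_edges(X_in, X_out, sigma):
    """Zero out low-importance state-node edges (Eqs. 16 and 17)."""
    r = X_in.shape[1]
    # Eq. 16: predecessor -> state node edges below sigma / r are dropped.
    X_in = np.where(X_in >= sigma / r, X_in, 0.0)
    # Eq. 17: state node -> successor edges with a small share of the
    # successor's incoming state-node flow are dropped.
    row = X_out.sum(axis=1, keepdims=True)
    share = np.divide(X_out, row, out=np.zeros_like(X_out), where=row > 0)
    X_out = np.where(share >= sigma / r, X_out, 0.0)
    return X_in, X_out

X_in = np.array([[0.95, 0.05],
                 [0.50, 0.50]])
X_out = np.array([[0.9, 0.02],
                  [0.1, 0.98]])
Xi, Xo = trim_edges(X_in, X_out, sigma=0.2)  # threshold 0.2 / 2 = 0.1
```

Stricter σ sparsifies the state-node neighborhoods further, at the cost of discarding more flow.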

Synthetic examples

Synthetic data

We generate synthetic first-order nodes with 50 predecessors and successors each. For each predecessor, we sample 1000 observations from its “true” out-distribution over the successors, which is a random combination of nm modes. These planted modes are uniform distributions over disjoint equally sized subsets of the successors. For instance, when nm = 2, the two modes are uniform distributions over 25 successors each. A predecessor’s true distribution is a convex combination of modes with weights drawn from a symmetric Dirichlet distribution with concentration parameter c. Increasing c makes predecessors more similar on average. We vary nm ∈ {2, 5, 10} and c ∈ {0.5, 1.0, 1.5}, generating 25 first-order nodes for each combination.
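The generator described above is straightforward to sketch in numpy (the function name and seed are ours):

```python
import numpy as np

def synthetic_node(n_modes, c, n_succ=50, n_pred=50, n_obs=1000, seed=0):
    """Sample one synthetic first-order node: planted modes are uniform
    over disjoint blocks of successors; each predecessor mixes them with
    symmetric-Dirichlet(c) weights and emits n_obs multinomial draws."""
    rng = np.random.default_rng(seed)
    block = n_succ // n_modes
    modes = np.zeros((n_modes, n_succ))
    for m in range(n_modes):
        modes[m, m * block:(m + 1) * block] = 1.0 / block
    weights = rng.dirichlet(c * np.ones(n_modes), size=n_pred)
    true_dists = weights @ modes  # one out-distribution per predecessor
    # A[k, i] counts draws from predecessor i's true distribution.
    A = np.stack([rng.multinomial(n_obs, p) for p in true_dists], axis=1)
    return A, modes

A, modes = synthetic_node(n_modes=2, c=0.5)
```

Small c concentrates the Dirichlet weights near a single mode, reproducing the block structure visible in Fig. 2A.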

Interpretability of state nodes

Particularly for higher values of c, predecessors are unlikely to closely resemble single modes. Therefore, the planted modes are not realistic behavior given the data and we do not want state nodes to match them. Instead, we evaluate the interpretability of state nodes by comparing them to the “extreme” predecessors that are most similar to the modes. For each mode, we identify the closest predecessor by finding the column of X that has the highest flow overlap with it. We report the mean flow overlap of the best matching of the state nodes and this set of predecessors.
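Because the number of state nodes is small, the best matching can be found by brute force over permutations, using the flow overlap of Eq. 7 as the similarity (function name and toy distributions are ours):

```python
import numpy as np
from itertools import permutations

def best_matching_overlap(state_dists, target_dists):
    """Mean flow overlap (Eq. 7) of the best one-to-one matching between
    the columns of state_dists and target_dists. Brute force over
    permutations is fine for the small ranks used here."""
    r = state_dists.shape[1]
    best = 0.0
    for perm in permutations(range(r)):
        score = np.mean([np.minimum(state_dists[:, a], target_dists[:, b]).sum()
                         for b, a in enumerate(perm)])
        best = max(best, score)
    return float(best)

targets = np.array([[0.9, 0.1],
                    [0.1, 0.9]])
# Recovery up to relabeling of the state nodes still scores 1.0.
print(best_matching_overlap(targets[:, ::-1], targets))  # → 1.0
```

The matching makes the score invariant to the arbitrary ordering of state nodes returned by the NMF.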

The baseline

Each state node in our model is a convex combination of the columns of X with weights discovered by the NMF. For a baseline, we create the same number of state nodes with weights distributed uniformly at random. We generate 50 instances of the baseline for each synthetic first-order node.

We have included the code to replicate this experiment in our GitHub repository (https://github.com/rohit-sahasrabuddhe/concise-networks) and on Zenodo (https://doi.org/10.5281/zenodo.15599985).

Transit flow through airports

Data description

Our dataset is a 10% sample of domestic flight itineraries in the United States in the first quarter of 2023 (22). We discard all itineraries with just one leg and parse the rest into transits through each airport. Because we are interested in the role of airports as transit hubs, we remove return transits of the form i → j → i. This leaves 4,340,809 transits between 435 airports. Of these, 379 have transits through them, whereas the rest only serve as sources or destinations. We plot the distribution of transit volume (fig. S1) and the location of the airports (fig. S2) in the Supplementary Materials.

Constructing the networks

We create the first-order network Gfo by setting rank = 1 for each first-order node. For the concise memory network Gc , we create state nodes for the 10 largest transit hubs (table S1) with flow overlap threshold 0.7. Gc has two state nodes each for Atlanta, Dallas–Fort Worth, Denver, Charlotte, Chicago, Seattle, and Houston, three for Minneapolis–St Paul, and five each for Phoenix and Las Vegas.

Backboning

We remove low-importance edges in two stages. First, we trim the neighborhoods of the state nodes in Gc with σ = 0.05. Next, we use the disparity filter (23) to backbone both networks with a disparity filter score threshold of 0.01 (see note S2.2). We ensure that both networks remain weakly connected. After backboning, Gfo has 435 nodes and 9249 edges and Gc has 452 nodes and 12,429 edges.

Connectivity analysis

We model an itinerary as a discrete time random walk on the network. Let Tfo be the row-stochastic adjacency matrix of Gfo , where we give the three airports with no out-edges self-loops with weight = 1.

Three-leg connectivity

The probability that a passenger at origin o reaches destination d in three or fewer steps on Gfo is given by

\[ \rho_{\mathrm{fo}}(o, d) \equiv \left[ \left( T_{\mathrm{fo}}^{d} \right)^{3} \right]_{od} \tag{18} \]

where T_fo^d is T_fo modified to make d an absorbing state. Specifically, (T_fo^d)_di = 0 for all i ≠ d and (T_fo^d)_dd = 1. We define ρc(o,d) similarly for cases where o and d have only one state each. In Results, we investigate the gain in three-leg connectivity between (o,d) pairs from the airports ranked 11 to 20 by transit volume (table S2).
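Eq. 18 is a matrix power of the absorbing-state chain; a numpy sketch on a toy chain (the function name and toy matrix are ours):

```python
import numpy as np

def three_leg_connectivity(T, o, d):
    """Eq. 18: make destination d absorbing, take the third matrix power
    of the row-stochastic T, and read the (o, d) entry."""
    Td = T.astype(float).copy()
    Td[d, :] = 0.0
    Td[d, d] = 1.0
    return np.linalg.matrix_power(Td, 3)[o, d]

# Toy network: 0 -> 1, then a 50/50 split at node 1; node 2 is absorbing.
T = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
print(three_leg_connectivity(T, 0, 2))  # → 0.5
```

Making d absorbing ensures the entry counts the probability of reaching d within three steps rather than the probability of being at d at exactly step three.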

Three-leg displacement

The expected three-leg displacement from o for Gfo is

$$\delta_{\mathrm{fo}}(o) \equiv \sum_{d} \left[\left(T_{\mathrm{fo}}\right)^{3}\right]_{od} \times \mathrm{distance}(o, d) \qquad (19)$$

We define δc similarly. In Results, we investigate the gain in three-leg displacement for all origins o except the 10 largest airports. In both analyses, we use Haversine distance to quantify geographic separation.
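Equation 19 combines the three-step transition probabilities with Haversine distances. A sketch with approximate, illustrative airport coordinates:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points given
    in degrees of latitude and longitude."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlam = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlam / 2) ** 2
    return 2.0 * radius_km * np.arcsin(np.sqrt(a))

def three_leg_displacement(T, coords, o):
    """Expected geographic distance from origin o after three steps:
    three-step probability of each destination times its distance."""
    probs = np.linalg.matrix_power(np.asarray(T, dtype=float), 3)[o]
    dists = np.array([haversine(*coords[o], *coords[d]) for d in range(len(coords))])
    return float(probs @ dists)

# Approximate (lat, lon) for ATL, DFW, and DEN; values are illustrative
coords = [(33.64, -84.43), (32.90, -97.04), (39.86, -104.67)]
T = [[0.0, 0.6, 0.4],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
d0 = three_leg_displacement(T, coords, 0)
```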

Group structure in information flow

Data description

The Lazega law firm data (18) contain social networks of three relationships—co-work, friendship, and advice—between 71 lawyers at a corporate law firm. We use metadata on the practice (litigation or corporate) and office location (Boston, Hartford, or Providence) of the individuals (table S3).

Synthetic trajectories

We start with two separate social networks, Gw of co-work ties and Gf of friendship ties. For simplicity, we make them undirected by discarding nonreciprocated edges. Gw and Gf contain 378 and 176 edges, respectively, of which 74 are shared. We generate trigrams i → j → k through every node j using a second-order Markov process that models information spreading separately along co-work and friendship links. If i is only a co-worker (resp. friend) of j, then k is chosen uniformly at random from the set of j's co-workers (resp. friends). If i is both, k is picked from the disjoint union of j's friends and co-workers. For each j, we generate 1000 trigrams from each i.
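The sampling rule can be sketched as follows; the adjacency sets and the function name are illustrative, not the paper's implementation.

```python
import random

# Toy relationship sets around a single middle node j = 1 (illustrative):
# node 0 is both a co-worker and a friend of 1, node 2 only a co-worker,
# node 3 only a friend.
cowork = {1: {0, 2}}
friends = {1: {0, 3}}

def sample_trigram(i, j, cowork, friends, rng):
    """Sample one trigram i -> j -> k. If i relates to j through only one
    relationship, draw k from that relationship's neighbors of j; if
    through both, draw from the disjoint union (shared ties appear twice)."""
    cw, fr = cowork[j], friends[j]
    if i in cw and i in fr:
        pool = sorted(cw) + sorted(fr)   # disjoint union of the two sets
    elif i in cw:
        pool = sorted(cw)
    else:
        pool = sorted(fr)
    return (i, j, rng.choice(pool))

rng = random.Random(0)
ks_both = {sample_trigram(0, 1, cowork, friends, rng)[2] for _ in range(200)}
ks_cw = {sample_trigram(2, 1, cowork, friends, rng)[2] for _ in range(200)}
```

In the disjoint-union case, a tie that is both co-work and friendship is counted once per relationship, so shared neighbors are twice as likely to be chosen.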

Constructing the networks

We create Gfo by setting rank = 1 for each first-order node. For Gc , we pick a flow overlap threshold of 0.9 and create two state nodes each for 52 nodes. The full second-order network Gso has one state node per in-neighbor of each first-order node, with X̂^out = X and X̂^in = I, the identity matrix. We trim the neighborhoods of the state nodes in both higher-order networks with σ = 0.05. Gfo has 71 nodes and 960 edges, Gc has 123 nodes and 1293 edges, and Gso has 960 nodes and 12,384 edges.

Community structure

Community detection is a long-studied task in network science with a plethora of approaches (24). We use Infomap (19), a method designed for networks of flow, and set its Markov time parameter to 0.9 (see note S3.1).

Infomap works at the state-node level, allowing first-order nodes to belong to multiple communities. Viewing a community as a set of first-order nodes, we compare communities across networks (Fig. 4C) using Jaccard similarity. For sets A and B,

$$\mathrm{Jaccard\ similarity}(A, B) \equiv \frac{|A \cap B|}{|A \cup B|} \qquad (20)$$
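Equation 20 is direct to compute on node sets; a minimal sketch (the empty-set convention is ours):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two node sets."""
    a, b = set(a), set(b)
    if not (a or b):
        return 0.0            # convention: two empty sets are dissimilar
    return len(a & b) / len(a | b)

# Two communities sharing two of their four distinct nodes
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```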

Acknowledgments

We thank J. Smiljanić, N. Pedreschi, R. Bernardo Madrid, and T. LaRock for helpful discussions. We are grateful to the referees for insightful feedback.

Funding: This work was supported by the Mathematical Institute Scholarship, University of Oxford (R.S.); EPSRC grants EP/V013068/1, EP/V03474X/1, and EP/Y028872/1 (R.L.); and Swedish Research Council grant 2023-03705 (M.R.).

Author contributions: Conceptualization: R.S., R.L., and M.R. Methodology: R.S. Investigation: R.S., R.L., and M.R. Visualization: R.S. Supervision: R.L. and M.R. Writing—original draft: R.S. Writing—review and editing: R.S., R.L., and M.R.

Competing interests: The authors declare that they have no competing interests.

Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. The software is implemented in Python and available in a GitHub repository (https://github.com/rohit-sahasrabuddhe/concise-networks), which is deposited on Zenodo (DOI: https://doi.org/10.5281/zenodo.15599985). It depends on standard open-source libraries including NumPy (25), pandas (26), NetworkX (27), and SciPy (28). The datasets are open source and freely available. The airline data (22) are maintained by the US Bureau of Transportation Statistics. We access the Lazega law firm data (18) from M. de Domenico’s network repository (29). We include the nonreturn airport transits we use in our analysis and the Lazega network structure in our code release. We use the open-source implementation of the disparity filter for network backboning maintained by M. Coscia (30).

Supplementary Materials

This PDF file includes:

Supplementary Notes S1 to S3

Figs. S1 to S14

Tables S1 to S4

References

sciadv.adw4544_sm.pdf (5.8MB, pdf)

REFERENCES AND NOTES

1. M. Newman, Networks (Oxford Univ. Press, 2018).
2. R. Lambiotte, M. T. Schaub, Modularity and Dynamics on Complex Networks (Cambridge Univ. Press, 2021).
3. F. Chierichetti, R. Kumar, P. Raghavan, T. Sarlos, “Are web users really Markovian?,” in Proceedings of the 21st International Conference on World Wide Web (Association for Computing Machinery, 2012), pp. 609–618.
4. Kareiva P., Shigesada N., Analyzing insect movement as a correlated random walk. Oecologia 56, 234–238 (1983).
5. M. R. Meiss, F. Menczer, S. Fortunato, A. Flammini, A. Vespignani, “Ranking web sites with real user traffic,” in Proceedings of the 2008 International Conference on Web Search and Data Mining (Association for Computing Machinery, 2008), pp. 65–76.
6. R. West, J. Leskovec, “Human wayfinding in information networks,” in Proceedings of the 21st International Conference on World Wide Web (Association for Computing Machinery, 2012), pp. 619–628.
7. Butts C. T., Revisiting the foundations of network analysis. Science 325, 414–416 (2009).
8. Rosvall M., Esquivel A. V., Lancichinetti A., West J. D., Lambiotte R., Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014).
9. Xu J., Wickramarathne T. L., Chawla N. V., Representing higher-order dependencies in networks. Sci. Adv. 2, e1600028 (2016).
10. I. Scholtes, “When is a network a network? Multi-order graphical model selection in pathways and temporal networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery, 2017), pp. 1037–1046.
11. Lambiotte R., Rosvall M., Scholtes I., From networks to optimal higher-order models of complex systems. Nat. Phys. 15, 313–320 (2019).
12. Gote C., Casiraghi G., Schweitzer F., Scholtes I., Predicting variable-length paths in networked systems using multi-order generative models. Appl. Netw. Sci. 8, 68 (2023).
13. Saebi M., Xu J., Kaplan L. M., Ribeiro B., Chawla N. V., Efficient modeling of higher-order dependencies in networks: From algorithm to application for anomaly detection. EPJ Data Sci. 9, 15 (2020).
14. Queiros J., Coquidé C., Queyroi F., Toward random walk-based clustering of variable-order networks. Netw. Sci. 10, 381–399 (2022).
15. Ding C. H. Q., Li T., Jordan M. I., Convex and semi-nonnegative matrix factorizations. IEEE Trans. Pattern Anal. Mach. Intell. 32, 45–55 (2008).
16. Salnikov V., Schaub M. T., Lambiotte R., Using higher-order Markov models to reveal flow-based communities in networks. Sci. Rep. 6, 23194 (2016).
17. T. LaRock, V. Nanumyan, I. Scholtes, G. Casiraghi, T. Eliassi-Rad, F. Schweitzer, “HYPA: Efficient detection of path anomalies in time series data on networks,” in Proceedings of the 2020 SIAM International Conference on Data Mining (SIAM, 2020), pp. 460–468.
18. E. Lazega, The Collegial Phenomenon: The Social Mechanisms of Cooperation Among Peers in a Corporate Law Partnership (Oxford Univ. Press, 2001).
19. D. Edler, A. Holmgren, M. Rosvall, The MapEquation software package (2024); https://mapequation.org.
20. Edler D., Bohlin L., Rosvall M., Mapping higher-order network flows in memory and multilayer networks with Infomap. Algorithms 10, 112 (2017).
21. Zhai C., Lafferty J., A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 179–214 (2004).
22. U.S. Bureau of Transportation Statistics, Airline Origin and Destination Survey (2023); https://transtats.bts.gov/DataIndex.asp.
23. Serrano M. Á., Boguná M., Vespignani A., Extracting the multiscale backbone of complex weighted networks. Proc. Natl. Acad. Sci. U.S.A. 106, 6483–6488 (2009).
24. Fortunato S., Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
25. Harris C. R., Millman K. J., van der Walt S. J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N. J., Kern R., Picus M., Hoyer S., van Kerkwijk M. H., Brett M., Haldane A., del Río J. F., Wiebe M., Peterson P., Gérard-Marchant P., Sheppard K., Reddy T., Weckesser W., Abbasi H., Gohlke C., Oliphant T. E., Array programming with NumPy. Nature 585, 357–362 (2020).
26. W. McKinney, “Data structures for statistical computing in Python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt, J. Millman, Eds. (SciPy, 2010), pp. 51–56.
27. A. A. Hagberg, D. A. Schult, P. J. Swart, “Exploring network structure, dynamics, and function using NetworkX,” in Proceedings of the 7th Python in Science Conference, G. Varoquaux, T. Vaught, J. Millman, Eds. (SciPy, 2008), pp. 11–15.
28. Virtanen P., Gommers R., Oliphant T. E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., van der Walt S. J., Brett M., Wilson J., Millman K. J., Mayorov N., Nelson A. R. J., Jones E., Kern R., Larson E., Carey C. J., Polat İ., Feng Y., Moore E. W., Plas J. V., Laxalde D., Perktold J., Cimrman R., Henriksen I., Quintero E. A., Harris C. R., Archibald A. M., Ribeiro A. H., Pedregosa F., van Mulbregt P., SciPy 1.0 Contributors, SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
29. M. de Domenico, Multilayer network repository; https://manliodedomenico.com/data.php.
30. M. Coscia, Network backboning (2017); https://michelecoscia.com/?page_id=287.
31. C. Ding, X. He, H. D. Simon, “On the equivalence of nonnegative matrix factorization and spectral clustering,” in Proceedings of the 2005 SIAM International Conference on Data Mining (SIAM, 2005), pp. 606–610.
32. Rosvall M., Bergstrom C. T., Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. U.S.A. 105, 1118–1123 (2008).
33. Kheirkhahzadeh M., Lancichinetti A., Rosvall M., Efficient community detection of network flows for varying Markov times and bipartite networks. Phys. Rev. E 93, 032309 (2016).
