Abstract
Hyper-relational knowledge graphs build upon traditional knowledge graphs to enhance the diversity and complexity of information representation. They achieve this by integrating multi-dimensional auxiliary information with standard triples. However, this characteristic introduces certain challenges to the task of N-ary Fact Link Prediction. Unlike binary relational knowledge representations, N-ary Facts have more complex and varied expression forms. To address the insufficient utilization of heterogeneous graph structure information in existing N-ary Fact representation methods, this paper proposes an N-ary Graph Transformer (NAGT) model. This model incorporates a new attention mechanism based on N-ary structural bias. By improving the representation of N-ary heterogeneous graphs, it more accurately identifies key associations in recommendation scenarios. Experimental validation on the JF17K, WikiPeople, and WD50K datasets demonstrates that the NAGT model outperforms comparative methods in extracting structural information. It effectively completes the knowledge graph and shows both efficiency and robustness in experiments on the N-ary Fact Link Prediction task.
Keywords: knowledge graph, Hyper-relational knowledge graph, Graph neural network, N-ary fact, Link prediction
Subject terms: Engineering, Mathematics and computing
Introduction
Knowledge Graph (KG) is a fundamental artificial intelligence technology that connects perceptual intelligence with cognitive intelligence. It plays a critical role in knowledge representation and reasoning, especially in intelligent recommendation systems 1,2. Currently, large-scale KG systems like Freebase 3,4, Wikidata 5, and Google Knowledge Graph enable more accurate personalized recommendations by constructing networks of associations among users, items, and attributes. However, the incompleteness of KG continues to limit the efficiency of these recommendations. To address this limitation, hyper-relational knowledge graphs (HKG) integrate multi-dimensional auxiliary information with standard triples based on KG, enhancing the diversity and complexity of information representation. Despite this advancement, the challenge of incomplete data persists. The absence of associated entities highlights the importance of accurate Link Prediction (LP) in HKG. LP involves inferring unobserved potential relationships within the graph, which has a direct impact on the performance of recommendation systems relying on this graph for interest reasoning 6–9.
Most current research on LP in KG centers around triples that consist of binary relations. However, in consumer contexts, N-ary facts–representing relationships involving more than two entities–are also quite common. In fact, over one-third of entities are involved in N-ary facts 10. These N-ary facts typically consist of a main triple accompanied by multiple sets of key-value pairs, as shown in Fig. 1. The number of entities in N-ary facts is not fixed, resulting in more complex expressions. Different key-value pairs associated with the same triple may carry entirely different meanings. Consequently, effectively expressing N-ary facts has become a crucial challenge in improving the reasoning capabilities of recommendation systems.
Fig. 1.

Examples of triple relational facts and N-ary relational facts.
Existing methods for representing N-ary facts each have their limitations. Liu et al.11 represent N-ary facts in the form of role-entity pairs, classifying entity types through semantic roles. However, they struggle to capture the dynamic changes of attribute weights in recommendation scenarios. Wang et al.12 represent multi-element facts as a heterogeneous graph, treating relations and entities as vertices and including four types of edges. This heterogeneous graph representation is intuitive and easy to understand. However, it overlooks certain structural information. For instance, the subject and object in the main triple are not directly connected and are heavily influenced by the relation, while the importance of the value in the key-value pair to the relation in the main triple is also underestimated. It is worth noting that even in the field of binary LP, integrating attributes and structure is crucial for enhancing predictive performance. For example, Nasiri et al.13, in their work on LP in protein-protein interaction networks, integrated node attributes with topological information through attributed graph embedding, significantly alleviating network sparsity issues. However, this method is only applicable to binary relations and cannot be directly adapted to model the complex associations between key-value pairs and the main triple in N-ary relations. This further underscores the unique challenges of "structure-attribute collaborative representation" in N-ary scenarios. Therefore, we propose to optimize the representation of N-ary fact heterogeneous graphs and redesign a new heterogeneous graph representation scheme, as shown in Fig. 2.
Fig. 2.
Example of N-ary fact heterogeneous graph.
The structural information of N-ary relational heterogeneous graphs significantly impacts the accuracy of recommendation reasoning, making its full extraction a challenging task. The GRAN model 12 adopts a fully connected attention module to depict interactions between vertices and introduces edge-aware attention biases to process feature information for different edge types. However, its structure-ignoring variant (GRAN-complete) demonstrated a 5.6% decrease in Mean Reciprocal Rank (MRR) compared to the heterogeneous structure-aware variant (GRAN-hete) on the high-arity WikiPeople task, along with a 3.4% drop in MRR for entity prediction on the JF17K dataset. This indicates that GRAN is still insufficient in extracting the structural information of heterogeneous graphs. The Graph Transformer architecture, with its ability to capture complex relationships and process heterogeneous data effectively, offers a promising solution to this issue. It can accurately model the hierarchical associations among users, items, and attributes, thus adapting to the multi-dimensional reasoning needs of recommendation systems.
To address the issues described above, this paper proposes a model called the N-ary Graph Transformer (NAGT). The essence of this model is its innovative attention mechanism, which is based on N-ary structural bias, aimed at improving the identification of key associations in recommendation scenarios. The model utilizes graph neural networks to extract local interaction information between users and items while incorporating the structural features of scene attributes into the attention calculation through a mapping function. This approach enables the recommendation system to more accurately capture the potential interests of cold-start users. Experimental results demonstrate that the model performs exceptionally well on N-ary LP datasets, confirming its effectiveness in completing KG and enhancing the cold-start handling capability of recommendation systems.
The main contributions of this paper are as follows:
We propose a novel N-ary heterogeneous graph representation method.
We propose a Graph Transformer architecture scheme to address the challenge of structural information extraction from N-ary heterogeneous graphs.
We propose a new multi-dimensional structure-biased attention mechanism, which gives the model stronger capability in extracting heterogeneous graph structure information.
Related work
Hyper-relational knowledge graph
Traditional KGs struggle to accurately capture the complete semantics of complex facts due to their binary relational structure. HKGs address this limitation by extending to N-ary facts; however, they introduce new challenges in terms of representation learning and computational efficiency 14–17. To tackle these challenges, researchers have developed various approaches. For instance, Rosso et al.18 used convolutional neural networks (CNN) to integrate qualifier information and distinguish the importance of hyper-relational facts. However, their approach has limitations in capturing long-range dependencies within complex N-ary structures. Similarly, Yu et al.19 replaced the graph convolutional network (GCN) aggregation module with layer normalization to enhance efficiency, but this comes with a trade-off between efficiency and the preservation of structural information. The GRAN model, proposed by Wang et al.12, effectively captures both local and global dependencies through its edge-aware attention mechanism, demonstrating excellent performance in N-ary relation prediction. However, it still faces limitations when dealing with complex and variable N-ary fact structures. Specifically, it struggles to model dynamic structures and has a relatively coarse semantic granularity of edge types. This limitation is evident in the results on the WikiPeople dataset, where the model's mean reciprocal rank (MRR) for predicting attribute values (0.713) is significantly higher than for predicting subjects/objects (0.505), indicating considerable room for improvement in modeling the complex structural variations within facts. Lastly, Shomer et al.20 proposed the QUAD framework based on the StarE model, introducing multiple aggregators to enhance the processing of multi-attribute information.
The dual-level attention mechanism proposed by Luo et al.21, while innovative, suffers from computational burden and scalability issues due to its introduced graph structure. Despite significant progress in HKG research, existing methods remain hampered by their insufficient extraction and utilization of heterogeneous graph structural information, which has become a critical bottleneck constraining the performance of N-ary fact modeling. These approaches neither delve into the type-attribute associations among heterogeneous nodes nor adequately consider the semantic hierarchy of heterogeneous edges. Consequently, performance gains are limited (MRR < 0.5%) in scenarios with sparse qualifiers. Furthermore, due to the shallow mining of heterogeneous features, performance improvements in key prediction tasks are mostly below 2%, ultimately impacting downstream task performance20–22. Mohammadi et al.23 proposed a knowledge tracing model based on a temporal hypergraph memory network, which captures complex dependencies through the multi-entity relational modeling capability of hypergraph structures and dynamically updates semantic states using memory units. However, the model lacks a dedicated structure designed for the hierarchical main-triple/key-value-pair relationships in N-ary facts and does not incorporate the semantic information of edge types in heterogeneous graphs. Therefore, we propose a novel N-ary heterogeneous graph representation method to address the problem of insufficient extraction and utilization of structural information in N-ary heterogeneous graphs.
Graph neural networks
Graph Neural Networks (GNN) have been widely used in graph representation learning and have achieved state-of-the-art performance in tasks such as node classification and LP24. The rise of neural networks has provided new ideas for KG modeling25. Current mainstream methods mainly include the use of Fully Connected Networks (FCN), Convolutional Neural Networks (CNN), Transformers, and GNN to encode the associations between elements in N-ary facts26. These neural network-based techniques have shown significant advantages in capturing interactions between elements in hyper-relational facts, effectively promoting progress in modeling complex relational structures27. In research in this field, scholars have proposed a series of research methods. Luo et al.28 proposed the first RHKH model based on the original format, which mines tuple associations through associative hypergraph neural networks and preserves structural and combinatorial information. The HyConvE model proposed by Wang et al.29 uses convolutional neural networks to build an embedding framework, enhancing the extraction of local patterns from structured data. Li et al.30 expanded hypergraph neural networks to integrate position-aware information, strengthening the encoding of hypergraph structural dependencies. The MetaNIR model proposed by Wei et al.31 is based on a meta-learning framework and uses GNN to generate embeddings of unseen elements, solving the generalization problem in inductive scenarios. However, although the FSFDW method proposed by Nasiri et al.32 can enforce consistency between topology and attributes in attributed graph clustering through regularization terms, it relies on manually designed loss functions and fails to capture dynamically changing structural dependencies within N-ary facts.
Given that neural networks have demonstrated significant effectiveness and feasibility in knowledge hypergraph modeling, this study adopts graph neural networks to extract local structural information, aiming to fully utilize the inherent structural characteristics of hyper-relational facts and thereby improve the scoring performance of relevant node pairs in LP tasks.
Graph transformer
The Transformer architecture, first proposed by Vaswani et al.33, employs an encoder that effectively captures long-range dependencies in data through the synergistic interaction of self-attention mechanisms and feed-forward networks34. A key advantage of Graph Transformer over traditional GNN lies in its ability to overcome the limitations of local neighborhood aggregation and avoid strict structural inductive biases. Conventional GNN typically aggregate information from a node’s immediate neighbors, which restricts their receptive field and may fail to capture long-range dependencies or complex relational patterns beyond local substructures. In contrast, Graph Transformer employs a global self-attention mechanism, allowing each node to attend to all other nodes in the graph. This enables the model to dynamically capture both local and global interactions without being constrained by predefined neighborhood ranges. Moreover, while GNN often rely on strong inductive biases–such as the assumption that graph structure is locally Euclidean or that relationships are primarily neighborhood-based–Graph Transformer minimizes such structural priors. Instead of hard-coding aggregation rules or dependency ranges, it learns node interactions entirely from data through attention weights. This makes it particularly suitable for heterogeneous graphs like N-ary relational facts, where entities and relations exhibit diverse and complex semantic associations that cannot be adequately captured by localized aggregation alone. This capability has motivated significant research interest in adapting the model for graph-structured data. In terms of heterogeneous graph information extraction, Hu et al.35 proposed a heterogeneous graph Transformer architecture, enabling efficient and scalable training of Web-scale heterogeneous graphs. 
Lai et al.36 combined bidirectional encoder representations from Transformers with Graph Transformer, reducing noise interference through an adjacent attention mechanism. Ying et al.37 pointed out that the correct encoding of graph structures is the core of applying Transformers to graph data. The Graphormer they designed integrates three types of encodings–spatial, edge, and centrality–into the attention mechanism, significantly improving the effect of graph representation learning. The GraTransDRP model by Chu et al.36 uses Graph Transformer to efficiently extract drug representations while reducing information redundancy. For heterogeneous graphs with complex entities and relationships such as KG, Peyman et al.38 designed a Transformer architecture with multiple self-attention heads to capture multi-directional interactions between entities and relationships. Shi et al.39 proposed a general KG embedding framework called TGformer. Wang et al.40 proposed HyperSAT, a structure-aware Transformer for HKG. HyperSAT incorporates heterogeneous attention biases and introduces a direction layer based on TransE to capture structural information. Li et al.41 proposed the DHRL4HKG model, which effectively models one-to-many relationships and complex dependencies in HKG by constructing a dual-hypergraph structure, achieving significant performance improvement in LP tasks. Graph Transformer overcomes the limitations of local neighborhood aggregation and avoids strict structural inductive biases42. Given its excellent performance in graph structure information extraction, this study proposes a Graph Transformer architecture scheme, aiming to solve the problem of extracting structural information from N-ary heterogeneous graphs.
Methodology
This section presents the NAGT model, with applications in domains such as consumer electronics recommendation. In response to the complex correlations among multiple entities such as users, commodities, and attributes in the consumer electronics field, the model proposes an improved N-ary relational heterogeneous graph representation method to more accurately depict the interaction characteristics of multiple entities. The model operates as follows: the N-ary heterogeneous graph serves as input and is first converted into vector representations by the embedding layer; features are then extracted by the Graph Transformer encoder with the N-ary structural bias attention mechanism, which captures latent dependencies among entities and enhances feature interaction. In terms of the training strategy for the N-ary LP task, the model obtains the representation of missing nodes through the readout layer, outputs the probability distribution through the prediction layer, and uses the cross-entropy loss function for optimization. This model is compatible with the needs of consumer electronics recommendation and helps to improve recommendation performance.
N-ary heterogeneous graph
The primary challenge of N-ary relation LP is forming N-ary relation facts into an appropriate representation. In this paper, we modify the N-ary heterogeneous graph representation proposed by Wang et al.12.
An N-ary fact consists of a primary triple and m additional attribute key-value pairs, defined as F = (s, r, o, {(a_i : v_i)}_{i=1}^m), where s, o, v_i ∈ E and r, a_i ∈ R, with E and R being the sets of entities and relations, respectively. We establish a fixed node ordering for the graph G to facilitate the structural bias computation. The nodes are indexed as (s, r, o, a_1, v_1, …, a_m, v_m). Consequently, nodes with indices 0, 1, 2 correspond to the primary triple (s, r, o), and the i-th attribute key-value pair (a_i, v_i) is located at indices 2i+1 and 2i+2, respectively.
We follow the definition of12 to convert an N-ary fact into a heterogeneous graph G = (V, L). First, the node set V of heterogeneous graph G consists of E and R in F, representing two types of nodes: entity-type and relation-type. The link set L contains six types of undirected edges, which are:
1 subject-relation edge (s, r),
1 object-relation edge (o, r),
1 subject-object edge (s, o),
m relation-attribute edges (r, a_i),
m attribute-value edges (a_i, v_i),
m relation-value edges (r, v_i).
Different from GRAN12, we added two types of edges: relation-value and subject-object. The N-ary fact heterogeneous graph is shown in Fig. 2. This aims to strengthen the correlation between s and o, and reduce the negative impact on s and o caused by the noise of r. At the same time, the new relation type, relation-value, also enhances the positive influence of v on r. This adjustment makes the representation of the N-ary fact heterogeneous graph more comprehensive.
To illustrate our graph construction method, consider a toy N-ary fact: (Leonardo_DiCaprio, won, Academy_Award, for: Titanic, year: 2014). Figure 3 demonstrates how this fact is converted into our proposed heterogeneous graph representation: the primary triple (Leonardo_DiCaprio, won, Academy_Award) forms the core, with the additional attribute key-value pairs (for, Titanic) and (year, 2014) connected via the six edge types defined above.
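As a concrete illustration, the conversion above can be sketched in a few lines of Python. The function name `build_nary_graph` and the tuple-based edge encoding are our own illustrative choices, not the paper's released code:

```python
# Hypothetical sketch of the heterogeneous-graph construction: the fixed node
# ordering puts the primary triple at indices 0,1,2 and pair i at 2i+1, 2i+2.

def build_nary_graph(s, r, o, qualifiers):
    """Convert an N-ary fact (s, r, o, {(a_i, v_i)}) into (nodes, typed undirected edges)."""
    nodes = [s, r, o]
    edges = [
        (0, 1, "subject-relation"),   # (s, r)
        (2, 1, "object-relation"),    # (o, r)
        (0, 2, "subject-object"),     # (s, o)  -- edge type added w.r.t. GRAN
    ]
    for i, (a, v) in enumerate(qualifiers, start=1):
        ai, vi = 2 * i + 1, 2 * i + 2
        nodes += [a, v]
        edges += [
            (1, ai, "relation-attribute"),  # (r, a_i)
            (ai, vi, "attribute-value"),    # (a_i, v_i)
            (1, vi, "relation-value"),      # (r, v_i) -- edge type added w.r.t. GRAN
        ]
    return nodes, edges

nodes, edges = build_nary_graph(
    "Leonardo_DiCaprio", "won", "Academy_Award",
    [("for", "Titanic"), ("year", "2014")],
)
# yields 3 + 2m nodes and 3 + 3m typed edges for m = 2 qualifiers
```

For the toy fact this produces 7 nodes and 9 edges, matching the six edge types enumerated above.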
Fig. 3.
N-ary Fact Heterogeneous Graph Construction.
N-ary graph transformer
The proposed NAGT model performs an N-ary LP task. It aims to predict a missing element from an N-ary fact, i.e., to predict the primary object of the incomplete N-ary fact (s, r, ?, {(a_i : v_i)}_{i=1}^m). As described in the previous section, we convert the incomplete N-ary fact into a heterogeneous graph, where the missing element '?' is denoted by a particular token [MASK]. This heterogeneous graph will serve as input to the NAGT model. It will first obtain a vector representation of each node through an embedding layer. These vectors are then fed into a graph transformer encoder with L layers. Finally, the encoded missing element vector is fed into a prediction layer to obtain the probability distribution of the element. The overall structure of the NAGT model is shown in Fig. 4.
Fig. 4.
The NAGT model.
Algorithm 1.
Forward Propagation of NAGT Model
As detailed in Algorithm 1, the integration pipeline of the NAGT model proceeds as follows. First, all nodes in the heterogeneous graph are converted into embedding vectors. These vectors are then fed into an L-layer Graph Transformer encoder. Within each layer, two operations are performed in parallel:
- A GNN layer processes the graph adjacency relations to generate node representations h_i that encapsulate local structural information, which will be used for subsequent structure bias computation.
- The multi-head attention mechanism is modified to incorporate our bias information. Specifically, for each node pair (i, j), the attention score is determined by three components: the standard query-key dot product; the edge-aware bias e_ij^K based on edge types; and the structure-aware bias b_ij^str based on the structural roles of the node pair.
Finally, the value vectors after weighted summation are likewise augmented with the corresponding edge biases e_ij^V and structure biases b_ij^{str,V}. This design enables the attention mechanism to simultaneously perceive both semantic (via edge types) and topological (via node roles) information of the graph. After L layers of such processing, the final representation of the [MASK] node is read out and passed to the prediction layer.
Graph transformer encoder
Using a transformer model on heterogeneous graph data has attracted much attention. It can alleviate some of the limitations of graph neural networks, e.g., reducing over-smoothing and over-squashing, capturing complex relationships between nodes, and handling different types of nodes and edges. Most graph transformer models are enhancements of the transformer encoder module. Therefore, we first briefly introduce the computational details of the transformer encoder.
From the functional point of view, the core role of the transformer encoder is to extract features, and its most critical part is the multi-head attention module, which is composed of multiple self-attention modules. It can be formulated as:

head_h = softmax( (X W_h^Q)(X W_h^K)^T / √d_k ) (X W_h^V)    (1)

MultiHead(X) = Concat(head_1, …, head_H) W^O    (2)

where W_h^Q, W_h^K, W_h^V and W^O are learnable projection matrices, h = 1 to H with H the number of attention heads, and Concat(·) denotes concatenation. The outputs MultiHead(X) of multi-head attention are then passed to a residual connection and normalization layer, as:

X̃ = LayerNorm(X + MultiHead(X))    (3)

Then it passes through a Feed Forward Network (FFN), a two-layer fully connected layer; the first layer uses a ReLU activation, and the second layer uses none. The corresponding formulas are as follows:

FFN(X̃) = σ(X̃ W_1 + b_1) W_2 + b_2    (4)

X^(l) = LayerNorm(X̃ + FFN(X̃))    (5)

where W_1, W_2, b_1, b_2 are learnable parameters, σ denotes the ReLU activation function, and X^(l) represents the output of the l-th encoder block.
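For readers who prefer code to formulas, Eqs. (1)–(5) can be sketched as a single encoder block in plain numpy. The weight shapes, helper names, and two-head configuration below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Minimal numpy sketch of one Transformer encoder block:
# multi-head attention + Add&Norm + FFN + Add&Norm.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_block(X, Wq, Wk, Wv, Wo, W1, b1, W2, b2, H=2):
    n, d = X.shape
    dk = d // H
    heads = []
    for h in range(H):                          # Eq. (1): per-head scaled dot-product attention
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(dk))
        heads.append(A @ V)
    M = np.concatenate(heads, axis=-1) @ Wo     # Eq. (2): concat heads, project with W^O
    Xt = layer_norm(X + M)                      # Eq. (3): residual + LayerNorm
    F = np.maximum(Xt @ W1 + b1, 0) @ W2 + b2   # Eq. (4): FFN, ReLU only on first layer
    return layer_norm(Xt + F)                   # Eq. (5): residual + LayerNorm

rng = np.random.default_rng(0)
n, d, H = 5, 8, 2
dk = d // H
Wq = rng.normal(size=(H, d, dk)); Wk = rng.normal(size=(H, d, dk))
Wv = rng.normal(size=(H, d, dk)); Wo = rng.normal(size=(d, d))
W1 = rng.normal(size=(d, 4 * d)); b1 = np.zeros(4 * d)
W2 = rng.normal(size=(4 * d, d)); b2 = np.zeros(d)
out = encoder_block(rng.normal(size=(n, d)), Wq, Wk, Wv, Wo, W1, b1, W2, b2, H)
```

Stacking L such blocks yields the plain encoder that the following subsection augments with N-ary structural biases.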
Existing approaches to incorporating graph information into Transformers can be divided into three typical ways43: (i) GNN as Auxiliary Modules, (ii) Improved Positional Embedding from Graphs, and (iii) Improved Attention Matrix from Graphs. To fully extract the structure information of an N-ary fact heterogeneous graph, this paper proposes a new attention mechanism based on N-ary structure bias by improving the attention matrix of the graph.
N-ary structure-biased attention mechanism
N-ary fact heterogeneous graphs differ from ordinary graphs because they have a specific structural form, and the aggregated information at each node differs. In an N-ary fact, the primary triple is affected by the additional attribute key-value pair information. In contrast, each additional attribute key-value pair should also consider the information of the primary triple but does not need to pay much attention to other key-value pairs. For example, the a_i node (or v_i node) is more affected by the primary triple (s, r, o) but less affected by other additional key-value pairs (a_j, v_j), j ≠ i. The original Transformer framework44, which utilizes a fully connected attention mechanism, cannot extract this particular structural information. Therefore, to solve this problem, we propose an N-ary structure-biased attention mechanism based on N-ary fact heterogeneous graphs. It aims to augment the corresponding node-pair scores using the N-ary fact structure. It can be formulated as:
α_ij = softmax_j( q_i (k_j + e_ij^K)^T / √d_k + b_ij^str )    (6)

where e_ij^K denotes the edge-biased key, b_ij^str denotes the structure bias, and softmax_j(·) is the softmax function taken over j. Specifically, we use the GNN to extract the local information and add the structure information to the attention score calculation through a mapping function.
h_i^(l+1) = σ( Σ_j (D^{-1/2} A D^{-1/2})_ij h_j^(l) W_g^(l) )    (7)

b_ij^str = φ(i, j; h_i, h_j)    (8)

z_i = Σ_j α_ij ( v_j + e_ij^V + b_ij^{str,V} )    (9)

where h_i^(l) is the representation of node i in the GNN at layer l, A is the adjacency matrix, D is the degree matrix corresponding to A, and W_g^(l) together with the attention projections W^Q, W^K, W^V are weight matrices; φ is the mapping function that injects the extracted local structural information into the attention computation, and it reduces to the learnable scalar biases defined below. b^str1 and b^str2 are two different types of structure-aware attentive biases that we design for the structural characteristics of N-ary facts. As mentioned above, we divide structure types into two categories: (i) when node i is a primary triple element, it interacts with all nodes; (ii) when node i belongs to an additional attribute key-value pair, it only interacts with the primary triple nodes and its paired key-value node. Therefore, we define b_ij^str as follows:
b_ij^str =
  b^str1,  if i, j ∈ {s, r, o} (case 1),
  b^str1,  if exactly one of i, j ∈ {s, r, o}, or i and j form the same key-value pair (a_k, v_k) (case 2),
  b^str2,  otherwise (case 3).    (10)

The assignment of the b^str1 and b^str2 biases is designed to reflect the inherent semantic hierarchy of N-ary facts. We posit that the most critical interactions occur within the primary triple (case 1) and between a key-value pair and the primary triple it qualifies (case 2). Therefore, these direct, semantically strong associations are governed by the b^str1 parameters.
In contrast, interactions between different key-value pairs (case 3, "other") often represent secondary, indirect relationships. Assigning them a separate set of parameters b^str2 allows the model to adaptively learn the appropriate attention level for these cross-qualifier connections, preventing the model from over-emphasizing potentially spurious or weak correlations.
All structural bias terms (the score-side b^str1, b^str2 and their value-side counterparts) are implemented as learnable scalar parameters. They are initialized to zero and updated via gradient descent during training. This formulation allows the model to discover whether a specific structural relationship should be generally enhanced (resulting in a positive learned value) or suppressed (resulting in a negative learned value) within the attention mechanism, providing the necessary flexibility to model complex fact structures.
In addition, we also follow the edge-biased attention setting of12. Different from12, there are six edge types in this paper. The formula is shown below:

e_ij^K = E^K[t(i, j)],  e_ij^V = E^V[t(i, j)]    (11)

where t(i, j) ∈ {1, …, 6} indexes the type of the edge between nodes i and j, and E^K, E^V are learnable edge-type embedding tables. Structure-aware attentive biases focus on structure information so that the NAGT model can deal with the information of different substructures of heterogeneous graphs. Edge-aware attentive biases enable the NAGT model to deal with different types of edge information between two nodes. These two biases complement each other and lead to a more comprehensive understanding of N-ary fact heterogeneous graphs by the NAGT model.
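A toy sketch of how one row of the biased attention score might be assembled from Eqs. (6), (10) and (11). The function names, the dictionary-based edge-type lookup, and the scalar bias values are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Six edge types of the N-ary heterogeneous graph (Section "N-ary heterogeneous graph").
EDGE_TYPES = ["subject-relation", "object-relation", "subject-object",
              "relation-attribute", "attribute-value", "relation-value"]

def structure_bias(i, j, b_str1, b_str2):
    """Piecewise structural bias of Eq. (10) under the fixed node ordering:
    0,1,2 = (s, r, o); pair k sits at indices 2k+1 (a_k) and 2k+2 (v_k)."""
    primary = {0, 1, 2}
    if i in primary or j in primary:
        return b_str1                              # cases 1 & 2: involve the primary triple
    if abs(i - j) == 1 and min(i, j) % 2 == 1:
        return b_str1                              # same key-value pair (a_k, v_k)
    return b_str2                                  # case 3: cross-qualifier interaction

def biased_attention_row(q, K, edge_bias_K, edge_type, i, b_str1=0.0, b_str2=0.0):
    """Eq. (6) for one query node i: softmax_j of q.(k_j + e_ij^K)/sqrt(d) + b_ij^str."""
    d = q.shape[0]
    logits = np.empty(len(K))
    for j in range(len(K)):
        e_k = edge_bias_K.get(edge_type.get((i, j)), 0.0)  # e_ij^K lookup, 0 if no edge
        logits[j] = q @ (K[j] + e_k) / np.sqrt(d) + structure_bias(i, j, b_str1, b_str2)
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With zero keys and a positive learned b_str1, a qualifier node such as a_1 (index 3) attends more to the primary triple and its paired value than to other qualifiers, which is exactly the inductive bias Eq. (10) encodes.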
Complexity analysis
The time and space complexity of a single Graph Transformer layer in NAGT is primarily dominated by the global self-attention mechanism, which is O(n²d) in time and O(n²) in space for a graph with n nodes and hidden dimension d, aligning with the standard Transformer complexity. While this poses a challenge for extremely large graphs, we argue that the effective computational footprint is mitigated by two key factors in our specific application.
First, the size n of the input heterogeneous graph for a single N-ary fact is typically small (n = 3 + 2m for m qualifiers, usually less than 20). Therefore, the absolute cost per fact remains manageable.
Second, and more importantly, the introduced N-ary structural biases act as an implicit sparsification mechanism. While the attention matrix is computed densely, the strong prior provided by b^str1 and b^str2 guides the model to quickly converge towards a focused, near-sparse attention distribution during training. This reduces the effective epochs-to-convergence and combats overfitting by preventing the model from wasting capacity on learning noisy, long-range dependencies between unrelated nodes in the sparse fact representation.
Algorithm 2.
N-ary Structure-Biased Attention
Attention visualization
To demonstrate the effect of our structure-biased attention, we visualize the attention patterns for the toy example from Fig. 3. Figure 5 compares the attention distribution of a standard Graph Transformer and our NAGT model when predicting the missing object [MASK] in the incomplete fact (Leonardo_DiCaprio, won, [MASK], for: Titanic, year: 2014).
Fig. 5.
Attention Pattern Comparison.
Attention visualization comparing (a) standard Graph Transformer and (b) our NAGT model on the toy example. The standard Transformer produces a diffuse attention pattern, while NAGT’s structure-biased attention focuses strongly on semantically relevant connections, particularly between the relation won and the object Academy_Award, as well as the supporting qualifiers for and Titanic.
As shown in Fig. 5b, our NAGT model, guided by the structural biases, allocates high attention weights to semantically critical edges: specifically between the relation won and the candidate object Academy_Award, and between won and the qualifier for that provides essential context. In contrast, the standard Transformer (Fig. 5a) produces a more uniform and less interpretable attention distribution. This visualization confirms that our bias mechanism successfully steers the model’s focus toward the most relevant structural components for accurate link inference.
Prediction strategy
Passing through the L-layer encoder, the NAGT model outputs the representation x_i^(L) of each node in the N-ary heterogeneous graph. The representation h_[MASK] of the missing node [MASK] is obtained through a readout layer. We then feed h_[MASK] into a prediction layer to predict its probability distribution p over entities in E or relations in R. In our NAGT model, the prediction layer consists of two linear transformation layers and a softmax function. The formula is:

p = softmax( W^(2) σ(W^(1) h_[MASK] + b^(1)) + b^(2) )    (12)

where W^(1), W^(2), b^(1), b^(2) are freely learnable. In the LP task, we use the cross-entropy loss as our loss function. The formula is as follows:

L = − Σ_t y_t log p_t    (13)

where y is the one-hot label of the true missing element.
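A minimal numpy sketch of Eqs. (12)–(13), assuming a ReLU between the two linear layers and a one-hot target; all names and shapes are illustrative:

```python
import numpy as np

# Illustrative prediction layer (two linear layers + softmax) and cross-entropy loss.

def predict(h_mask, W1, b1, W2, b2):
    """Map the [MASK] representation to a distribution over |E| candidate entities."""
    z = np.maximum(h_mask @ W1 + b1, 0)      # first linear layer with ReLU
    logits = z @ W2 + b2                     # second linear layer, no activation
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # probability distribution p, Eq. (12)

def cross_entropy(p, target_idx):
    """Eq. (13) with a one-hot label: only the true element contributes."""
    return -np.log(p[target_idx])

rng = np.random.default_rng(1)
d, num_entities = 16, 100
p = predict(rng.normal(size=d),
            rng.normal(size=(d, d)), np.zeros(d),
            rng.normal(size=(d, num_entities)), np.zeros(num_entities))
loss = cross_entropy(p, target_idx=42)
```

In training, the loss would be averaged over a batch of masked facts and backpropagated through the encoder.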
Structure-biased attention
To demonstrate the superiority of our proposed N-ary Structure-Biased Attention (S_biased) over standard Transformers in semantic modeling, we introduce the structure bias in N-ary heterogeneous graphs. Let G = (V, L) denote an N-ary fact heterogeneous graph, where the node set V includes entity and relation nodes, and the edge set L comprises six edge types. We define the structure bias matrix B^str as:

B^str_ij = b^str1 if nodes i and j lie within, or attach to, the primary triple (or form the same key-value pair); b^str2 otherwise.    (14)

In standard Transformers, attention weights rely solely on node embedding similarity, ignoring structural roles. With the introduction of B^str, the attention score becomes:

Ã = softmax( Q K^T / √d_k + B^str )    (15)

This mechanism effectively enhances semantic distinction between the primary triple and attribute key-value pairs, reducing attention interference from irrelevant nodes. By grouping nodes according to their structural roles, we can demonstrate that under the effect of B^str, attention weights within the same semantic group are significantly higher than those across groups. This effect can be quantified by the reduction in entropy of the attention distribution.
Let
denote the primary triple nodes and
denote the attribute key-value pair nodes. The attention entropy for node
is defined as:
![]() |
16 |
With structure bias, the attention distribution becomes more peaked towards structurally relevant nodes, thus reducing
compared to the standard Transformer baseline. This entropy reduction indicates improved semantic separability.
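A toy NumPy illustration of the biased attention of Eq. (15) and the entropy of Eq. (16); the group assignment and bias magnitude below are arbitrary choices for demonstration, not the learned values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def biased_attention(Q, K, bias):
    """Eq. (15): scaled dot-product scores plus the additive structure
    bias b_ij determined by the edge type between nodes i and j."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]) + bias)

def attention_entropy(A):
    """Eq. (16): per-node entropy of the attention distribution."""
    return -(A * np.log(A + 1e-12)).sum(axis=-1)

# Toy graph: nodes 0-2 form the primary triple, nodes 3-4 a key-value pair.
rng = np.random.default_rng(0)
n, d = 5, 16
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
group = np.array([0, 0, 0, 1, 1])            # structural roles
bias = np.where(group[:, None] == group[None, :], 4.0, 0.0)

A_plain = biased_attention(Q, K, np.zeros((n, n)))
A_bias = biased_attention(Q, K, bias)
# Attention concentrates within each semantic group, so mean entropy drops.
```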
Semantic-capture
Let $G$ be an N-ary heterogeneous graph with multiple relation and attribute types. Using a standard Transformer, the attention matrix $A^{\mathrm{std}}$ may introduce noise due to full connectivity. In contrast, with S_biased attention, the resulting attention matrix $A^{S}$ satisfies:

$$\left\| A^{S} - A^{*} \right\|_F \le \left\| A^{\mathrm{std}} - A^{*} \right\|_F \tag{17}$$

where $A^{*}$ is the ideal semantic attention matrix. This inequality indicates that the S_biased mechanism brings the attention distribution closer to the ideal semantic structure.

By treating $b_{ij}$ as a structural prior, it can be modeled as a graph regularizer that constrains the attention distribution using graph Laplacian properties. Let $L = I - D^{-1/2} A_G D^{-1/2}$ be the normalized Laplacian of the heterogeneous graph $G$, where $A_G$ is its adjacency matrix and $D$ the degree matrix. The structure bias can be viewed as imposing a smoothness constraint on the attention weights:

$$\mathcal{R}(A) = \operatorname{tr}\!\left(A^{\top} L A\right) \tag{18}$$

The S_biased attention minimizes a composite objective combining the standard Transformer loss with this regularization term. Following the analysis in45, this regularization improves the fit to the underlying semantic structure, thereby reducing the Frobenius norm distance to the ideal attention matrix $A^{*}$. In scenarios with sparse relations or cold-start entities, standard Transformers are susceptible to noise from irrelevant edges. The S_biased mechanism, by constraining the attention scope through the structural bias, enhances the identification of critical relationships.
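The Laplacian smoothness term of Eq. (18) can be sketched as follows; the trace form tr(AᵀLA) is our reading of the regularizer, illustrated on a 4-node cycle graph:

```python
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def smoothness(A, L):
    """Eq. (18): tr(A^T L A); small when attention varies smoothly over
    the graph, large when it jumps between non-adjacent nodes."""
    return np.trace(A.T @ L @ A)

# 4-node cycle graph (regular, so uniform attention is maximally smooth).
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
L = normalized_laplacian(adj)

A_uniform = np.full((4, 4), 0.25)   # perfectly smooth attention, R(A) = 0
A_spiky = np.eye(4)                 # attention concentrated on single nodes
```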
Experiments
Experiment setup
Environment and dataset
To evaluate the performance of the NAGT model, we constructed a learning framework based on PyTorch and PyTorch Geometric in the experiments. A 12-layer Graph Transformer was adopted, with each layer containing 8 attention heads and a hidden dimension of 512. The model used the Adam optimizer, and the batch size for each dataset on a 32G V100 GPU was 1024 samples. To evaluate the model’s generalization capability and performance in N-ary LP tasks, the experiments were conducted based on three representative standard datasets for N-ary LP tasks. These datasets cover scenarios with different domains, data scales, and relationship complexities, enabling a comprehensive assessment of the model’s performance limits and applicability.
The experiments were carried out on three representative standard N-ary LP task datasets, namely JF17K46, WikiPeople47, and WD50K48. The JF17K dataset is carefully extracted from the massive Freebase knowledge base, covering a wide range of knowledge categories and reflecting complex knowledge association structures; it provides an extensive testing scenario for the model in handling multi-source heterogeneous knowledge LP tasks. Both WikiPeople and WD50K are derived from Wikidata, which, as the world’s largest free collaborative knowledge base, features authoritative, comprehensive, and continuously updated data. WikiPeople focuses on person-related information, containing a large number of person entities and their relationships, and is well suited to verifying the model’s ability to predict entity relationships in a specific domain. WD50K, on the other hand, has a distinctive data structure in terms of statement qualifiers, which can effectively test the accuracy of the model in LP for complex semantic relationships.
In the experiments of this paper, we used the original dataset partitioning method, dividing each dataset into a training set, a validation set, and a test set. Since the original JF17K dataset does not provide a validation set, 20% of the data was extracted from its training set to serve as the validation set, ensuring the integrity and scientific rigor of the experimental process.
In addition, considering that the proportion of N-ary facts (N > 2) in the above three datasets is relatively low, which may affect accurate verification of the N-ary LP effect, we further analyzed the WD50K(100) dataset. This dataset is extracted from WD50K and contains only multi-ary relational facts with N > 2, allowing a more focused test of the model’s prediction performance in complex multi-ary relational scenarios. Specific information about these datasets is shown in Table 1.
Table 1.
Dataset statistics, where the columns indicate the number of all facts, N-ary facts with N > 2 (with percentage), entities, relations, facts in the train/valid/test sets, and all possible arities, respectively.
| Dataset | All facts | Higher-arity facts(%) | Entities | Relations | Train | Valid | Test | Arity |
|---|---|---|---|---|---|---|---|---|
| JF17K | 100,947 | 46,320 (45.9) | 28,645 | 501 | 76,379 | – | 24,568 | 2–6 |
| Wikipeople | 382,229 | 44,315 (11.6) | 47,765 | 193 | 305,725 | 38,223 | 38,281 | 2–9 |
| WD50K | 236,507 | 32,167 (13.6) | 47,156 | 532 | 166,435 | 23,913 | 46,159 | 2–67 |
| WD50K(100) | 31,314 | 31,314 (100) | 18,792 | 279 | 22,738 | 3,279 | 5,297 | 3–67 |
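Since JF17K ships without an official validation set, the 20% hold-out split described earlier can be sketched as follows (the shuffle seed is an illustrative choice):

```python
import random

def split_train_valid(facts, valid_ratio=0.2, seed=42):
    """Hold out a validation set from the training facts (used here for
    JF17K, which has no official validation split)."""
    facts = list(facts)
    random.Random(seed).shuffle(facts)   # deterministic shuffle
    n_valid = int(len(facts) * valid_ratio)
    return facts[n_valid:], facts[:n_valid]

train, valid = split_train_valid(range(100))
```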
Baselines
To evaluate the performance of the NAGT model proposed in this paper in N-ary LP tasks, this paper selects relatively representative models in this field for comparative experiments. The selection of models is based on their importance and wide recognition in the field of N-ary LP, covering different technical paradigms such as extensions of traditional binary LP methods, models based on convolutional neural networks, models utilizing neural modules, and Transformer fusion models, so as to better evaluate the NAGT model.
m-TransH10: As a classic generalized version of the TransH method in the scenario of N-ary facts, this model first realizes the direct extension of TransH from binary to N-ary scenarios. It projects all entities onto the hyperplane corresponding to a specific relationship and measures the rationality of N-ary facts by calculating the weighted sum of projected embeddings.
RAE46: It generalizes the binary relationship LP method TransH to higher-order scenarios. By measuring the compatibility between n entity values to evaluate the effectiveness of high-order N-ary facts, it provides a unique technical path for the application of the TransH framework in complex multi-ary scenarios.
NaLP47: A typical method using a convolutional neural network (CNN) architecture to handle N-ary LP tasks. It innovatively converts N-ary facts into a role-value pair structure and extracts deep features through convolution operations to achieve prediction. This design idea provides a unique technical paradigm for multi-ary relationship representation based on deep learning.
NeuInfer49 and HINGE50: As typical representatives of neural module fusion methods for N-ary fact reasoning, they handle complex multi-ary relationships through a three-stage framework of ”primary triple validity measurement, auxiliary information compatibility evaluation, and multi-module output fusion”, providing a landmark technical paradigm for N-ary fact modeling that combines neural and symbolic approaches. The hierarchical module fusion mechanism proposed by NeuInfer and the attention-weighted fusion strategy designed by HINGE represent two typical approaches to neural module collaborative reasoning, and both are widely cited as neural-method benchmarks in recent N-ary knowledge reasoning research. Each uses neural modules to measure the validity of the primary triple and its compatibility with each auxiliary description, then combines the module outputs into an overall score for a fact.

StarE48: As a representative model integrating a message passing mechanism with the Transformer architecture, it uses CompGCN as a message passing encoder to capture local semantic associations between entities and relations, models global dependencies among multi-ary entities through a Transformer decoder, and finally outputs the validity score of N-ary facts.
Hy-Transformer51 and GRAN12: As typical representatives of directly applying Transformer models to N-ary scenarios, they realize the application of the standard Transformer from different dimensions. Hy-Transformer enhances the model’s ability to capture multi-ary semantics by designing a qualifier prediction auxiliary task, while GRAN adapts the attention mechanism to the link structure of N-ary input. These two strategies constitute the representative technical routes of ”auxiliary task enhancement” and ”structural adaptation optimization” for Transformers in N-ary LP tasks.
HyperSAT40: A recently proposed hypergraph-based semantic parsing model and structure-aware Transformer for HKG that incorporates heterogeneous attention biases and direction information, making it particularly suitable for handling complex semantic relationships and logical reasoning tasks.
The above baseline methods cover the mainstream technical paths in the field of N-ary LP, enabling us to verify the performance of the NAGT model from multiple perspectives. They can not only evaluate its prediction accuracy in different N-ary scenarios, but also test its adaptability and generalization ability to multi-ary knowledge structures. Subsequent experiments will quantitatively compare the performance of each method using standard evaluation metrics such as Mean Reciprocal Rank (MRR) and Hits@k (k = 1,10), clearly highlighting the technical advantages and innovative value of the NAGT model in handling N-ary LP tasks.
Implementation details and hyperparameters
To ensure the reproducibility of our experiments, we provide a detailed description of our implementation and the hyperparameters used for training the NAGT model. Our code is implemented in PyTorch and PyTorch Geometric, and has been made publicly available at https://doi.org/10.5281/zenodo.17440593.
The hyperparameters were tuned on the validation set of each dataset. The key hyperparameter settings for the NAGT model across all datasets are summarized in Table 2 below.
Table 2.
Hyperparameter settings for the NAGT model.
| Hyperparameter | Value / Description |
|---|---|
| Optimizer | Adam |
| Learning Rate | 0.001 |
| Batch Size | 1024 |
| Graph Transformer Layers | 12 |
| Attention Heads per Layer | 8 |
| Hidden Dimension | 512 |
| Activation Function | ReLU |
| Dropout Rate | 0.1 |
| GNN Layers for Structure Bias | 2 |
| Weight Initialization | Xavier Uniform |
| Gradient Clipping | Norm of 1.0 |
| Training Epochs | 200 (Early stopping with patience=10 on validation MRR) |
Note: For the WD50K(100) dataset, which contains exclusively higher-arity facts, we observed that a slightly reduced learning rate of 0.0005 led to more stable convergence. This is also documented in the provided code.
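The early-stopping rule from Table 2 (patience of 10 epochs on validation MRR) can be sketched minimally as:

```python
class EarlyStopping:
    """Stop training once validation MRR has not improved for
    `patience` consecutive epochs (patience=10 in Table 2)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_mrr):
        """Call after each validation pass; returns True when training
        should stop."""
        if val_mrr > self.best:
            self.best = val_mrr
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In the training loop, `step` is invoked once per epoch after evaluating on the validation set; the epoch cap of 200 bounds training even when the MRR keeps improving.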
Evaluation metrics
We subdivide each prediction task into subject/object prediction in the original triple and all entity prediction in the N-ary fact to test the model’s predictive ability, thereby more systematically revealing the model’s capability to capture and reason about relationships with different structures.
In terms of selecting evaluation metrics, we adopt MRR and Hits@k (k=1,10), which are widely recognized in KG LP tasks, as quantitative criteria. Among them, MRR comprehensively reflects the overall ranking performance of the model in entity prediction by calculating the average of the reciprocals of the ranks of correct entities in all test samples; a higher value indicates better average prediction accuracy of the model. Hits@k counts the proportion of correct entities in test samples that appear among the top k positions in the prediction results. This metric intuitively reflects the accuracy of the model in the top prediction results: Hits@1 is used to evaluate the model’s precise prediction ability, while Hits@10 measures the model’s recall performance within a certain fault-tolerant range. By combining these two types of metrics, the model’s prediction effect can be objectively quantified from different dimensions, ensuring the comprehensiveness and reliability of the evaluation results.
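For reference, both metric families can be computed directly from the 1-indexed rank of each gold entity in the model’s prediction list:

```python
def mrr(ranks):
    """Mean reciprocal rank of the gold entities (1-indexed ranks)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test cases whose gold entity appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 2, 11]   # toy ranks of four gold entities
```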
To ensure the robustness of our results, we perform statistical significance tests using the Wilcoxon signed-rank test over 5 independent runs with different random seeds. This non-parametric test is used to determine whether NAGT’s performance improvements over baseline methods are statistically significant. We report statistical significance at p < 0.05 level, indicated by asterisks (*) in the results tables.
Experiment results
Through experiments on three representative datasets, namely JF17K, Wikipeople, and WD50K, this paper compares and analyzes the performance of the NAGT model on the two subtasks of ”All Entities” and ”Subject/Object” to verify its applicability and advantages in different scenarios, as shown in Table 3.
(1) JF17K dataset: Comprehensiveness in basic scenarios.
Table 3.
Comparison of NAGT with other models, composed of entity prediction accuracy on JF17K, WikiPeople and WD50K.
| Model | JF17K | Wikipeople | WD50K | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All Entities | Subject/Object | All Entities | Subject/Object | All Entities | Subject/Object | |||||||||||||
| MRR | H@1 | H@10 | MRR | H@1 | H@10 | MRR | H@1 | H@10 | MRR | H@1 | H@10 | MRR | H@1 | H@10 | MRR | H@1 | H@10 | |
| m-TransH | 0.102 | 0.069 | 0.168 | 0.206 | 0.206 | 0.462 | – | – | – | 0.063 | 0.063 | 0.300 | – | – | – | – | – | – |
| RAE | 0.310 | 0.219 | 0.504 | 0.215 | 0.215 | 0.466 | 0.172 | 0.102 | 0.320 | 0.058 | 0.058 | 0.306 | – | – | – | – | – | – |
| NaLP | 0.366 | 0.290 | 0.516 | 0.221 | 0.165 | 0.331 | 0.338 | 0.272 | 0.466 | – | – | – | 0.224 | 0.158 | 0.330 | – | – | – |
| NeuInfer | 0.473 | 0.397 | 0.618 | 0.449 | 0.361 | 0.624 | 0.333 | 0.259 | 0.477 | – | – | – | 0.228 | 0.162 | 0.341 | 0.243 | 0.176 | 0.377 |
| HINGE | 0.517 | 0.436 | 0.675 | 0.431 | 0.342 | 0.611 | 0.350 | 0.282 | 0.467 | 0.342 | 0.272 | 0.463 | 0.232 | 0.164 | 0.343 | – | – | – |
| StarE | 0.542 | 0.454 | 0.685 | 0.574 | 0.496 | 0.725 | 0.378 | 0.265 | 0.542 | – | – | – | – | – | – | 0.349 | 0.271 | 0.496 |
| Hyper2 | – | – | – | 0.583 | 0.500 | 0.746 | – | – | – | 0.461 | 0.391 | 0.597 | – | – | – | – | – | – |
| Hy-Transformer | – | – | – | 0.582 | 0.501 | 0.742 | – | – | – | – | – | – | – | – | – | 0.356 | 0.281 | 0.498 |
| GRAN | 0.645 | 0.571 | 0.792 | 0.609 | 0.530 | 0.765 | 0.479 | 0.410 | 0.604 | 0.461 | 0.370 | 0.594 | 0.370 | 0.302 | 0.500 | 0.336 | 0.266 | 0.470 |
| HyperSAT | – | – | – | – | – | – | 0.496* | 0.430* | 0.613* | 0.493* | 0.427* | 0.610* | 0.380 | 0.306 | 0.493 | 0.345 | 0.270 | 0.489 |
| NAGT(Ours) | 0.657* | 0.583* | 0.803* | 0.617* | 0.539* | 0.773* | 0.482 | 0.415 | 0.607 | 0.482 | 0.419 | 0.599 | 0.400* | 0.328* | 0.540* | 0.365* | 0.292* | 0.510* |
Best results in each tasks are in bold.
In the ”All Entities” task of the JF17K dataset, NAGT achieves an MRR of 0.657, H@1 of 0.583, and H@10 of 0.803. In the ”Subject/Object” subtask, NAGT yields an MRR of 0.617, H@1 of 0.539, and H@10 of 0.773. All these results are significantly better than those of the second-best model, GRAN, verifying the effectiveness of NAGT in structural information extraction and subject/object entity recognition tasks. Moreover, during the process of entity relationship modeling, NAGT can more efficiently integrate structural information, reduce noise interference, and improve prediction accuracy.
(2) Wikipeople dataset: Accurate breakthrough in person relationship scenarios.
The Wikipeople dataset focuses on person-related scenarios. In the ”All Entities” task, NAGT achieves an MRR of 0.482, H@1 of 0.415, and H@10 of 0.607. In the ”Subject/Object” subtask, NAGT yields an MRR of 0.482, H@1 of 0.419, and H@10 of 0.599, all of which are better than those of the second-best model, GRAN. This indicates that for the characteristics of ”many-to-many and implicit associations” in person relationships, NAGT can more effectively mine complex relationships, improve the logicality, accuracy, and coverage of predictions, thereby solving the problems of ”sparse relationships and ambiguous semantics” in person-related scenarios.
(3) WD50K dataset: Balancing efficiency and accuracy in large-scale KG.
WD50K is a dataset with a large amount of data and complex relationships. Large-scale graphs are prone to problems of ”information overload and relationship redundancy”. In the ”All Entities” task, NAGT achieves an MRR of 0.400, H@1 of 0.328, and H@10 of 0.540. In the ”Subject/Object” subtask, NAGT yields an MRR of 0.365, H@1 of 0.292, and H@10 of 0.510, which are better than those of the second-best model, GRAN. This indicates that when processing WD50K, NAGT can filter out invalid associations and focus on core relationships through an efficient mechanism for compressing and extracting structural information, providing a more reliable basis for entity prediction.
(4) Comparison with HyperSAT.
The inclusion of HyperSAT 40 provides an important comparison point, as both methods employ structure-aware Transformer architectures. As shown in Table 3, NAGT consistently outperforms HyperSAT across all datasets and tasks. Specifically, on the WD50K dataset, NAGT improves MRR by 2.0 points for ”All Entities” prediction and 2.0 points for ”Subject/Object” prediction compared to HyperSAT. This performance advantage can be attributed to several key differences in our approaches. While HyperSAT incorporates heterogeneous attention biases and direction information, our NAGT model introduces a more fine-grained N-ary structure-biased attention mechanism that explicitly models the hierarchical relationships between primary triples and key-value pairs. Additionally, our approach does not require the complex subgraph sampling strategy used in HyperSAT, making NAGT more computationally efficient while maintaining superior performance.
The statistical significance tests (Wilcoxon signed-rank, p < 0.05) confirm that NAGT’s improvements over HyperSAT are statistically significant across multiple runs, validating the effectiveness of our proposed structural biases. NAGT performs well on all datasets, indicating that the model has a good understanding of N-ary LP tasks and that its efficiency in structural information extraction exceeds that of existing methods. NAGT provides an efficient and robust model option for N-ary LP tasks, and its design ideas offer important reference value for improving the practicality of KG reasoning models.
Ablation experiment
To verify the effect of the N-ary structure-biased attention mechanism (S_biased) and the N-ary heterogeneous graph (Hete_G) in the NAGT model on N-ary LP performance, we perform ablation experiments on the WD50K(100) dataset, all of whose facts are N-ary facts with N > 2. The experimental results of the ablation study are shown in Table 4.
Table 4.
Results of the N-ary LP Ablation Experiment on the NAGT Model.
| Model | Hete_G | S_biased | Modules | PE | WD50K(100) | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | All Entities | | | Subject/Object | | |
| | | | | | MRR | H@1 | H@10 | MRR | H@1 | H@10 |
| NaLP | – | – | – | – | 0.458 | 0.398 | 0.563 | – | – | – |
| HINGE | – | – | – | – | 0.492 | 0.417 | 0.636 | – | – | – |
| Hy-Transformer | – | – | – | – | 0.562 | 0.499 | 0.677 | – | – | – |
| StarE | – | – | – | – | 0.654 | 0.588 | 0.777 | – | – | – |
| GRAN | – | – | – | – | 0.657 | 0.608 | 0.748 | 0.614 | 0.563 | 0.713 |
| NAGT | ✓ | ✓ | – | – | 0.688 | 0.636 | 0.792 | 0.644 | 0.588 | 0.753 |
| NAGT | ✓ | ✗ | – | – | 0.676 | 0.615 | 0.788 | 0.634 | 0.573 | 0.747 |
| NAGT | ✗ | ✓ | – | – | 0.674 | 0.625 | 0.767 | 0.631 | 0.579 | 0.729 |
| Graph Transformer | ✗ | ✗ | before | – | 0.680 | 0.627 | 0.775 | 0.636 | 0.581 | 0.735 |
| Graph Transformer | ✗ | ✗ | alternatively | – | 0.674 | 0.622 | 0.774 | 0.630 | 0.574 | 0.737 |
| Graph Transformer | ✗ | ✗ | parallel | – | 0.671 | 0.619 | 0.772 | 0.626 | 0.569 | 0.736 |
| Graph Transformer | ✗ | ✗ | before | LapPE | 0.682 | 0.631 | 0.779 | 0.639 | 0.585 | 0.742 |
| Graph Transformer | ✗ | ✗ | before | SVDPE | 0.683 | 0.629 | 0.786 | 0.638 | 0.581 | 0.750 |
The results of the ablation experiment on the WD50K(100) dataset show that, in the verification of the two core modules S_biased and Hete_G of the NAGT model, turning off either module individually degrades the performance of NAGT on both the ”All Entities” and ”Subject/Object” tasks to varying degrees. The results indicate that the synergistic effect of S_biased and Hete_G significantly improves N-ary LP accuracy. Especially in complex N-ary fact scenarios, their effect on extracting and integrating structural information is more prominent. Meanwhile, compared with other baseline models, the NAGT with both modules enabled shows significant advantages for N-ary facts with N > 2, effectively enhancing the model’s ability to understand and utilize N-ary structural information and providing a more efficient technical solution for the completion and reasoning of complex KG.
In a standard Transformer applied to a sparse graph, the model can easily overfit by latching onto spurious statistical correlations between unrelated nodes due to the lack of structural constraints. Our method counters this by injecting a strong structural prior. This prior, derived from the intrinsic semantics of N-ary facts, acts as a powerful regularizer. It constrains the hypothesis space, steering the model towards semantically plausible connections (e.g., between a subject and its primary relation) from the outset. This reduces its propensity to memorize noise in the sparse training data. The significant performance gain of the full NAGT model over its ablated variants (Table 4) on the WD50K dataset, which contains many complex and sparse facts, provides empirical evidence that our approach improves generalization rather than promoting overfitting.
Analysis of graph transformer
To verify the impact of different structural information on the final performance of N-ary tasks, we compared the NAGT model with structural information extraction methods based on Graph Transformer. The experiment covered three types of combination modes between GNN and Transformer (before/alternatively/parallel) and introduced two typical positional embedding strategies (LapPE: Laplacian eigenvector embedding; SVDPE: adjacency matrix SVD decomposition embedding) to comprehensively verify the gain value of structural information.
From the experimental results on the WD50K(100) dataset (Table 4), among the three combination modes of GNN and Transformer without positional embedding, the ”before” mode performed relatively better, achieving an MRR of 0.680 in the ”All Entities” prediction task. However, its overall performance was still weaker than that of the NAGT model with both S_biased and Hete_G modules enabled, which achieved an MRR of 0.688.
After introducing LapPE or SVDPE into the ”before” configuration, the performance of Graph Transformer improved further. With LapPE, the MRR of the ”All Entities” task reached 0.682; with SVDPE, it reached 0.683. This indicates that supplementing structural information with positional embeddings can enhance the ability of Graph Transformer to capture the topological features of N-ary heterogeneous graphs, indirectly verifying that structural information has a positive effect on model performance.
However, regardless of the combination mode adopted by Graph Transformer or whether positional embedding is introduced, the performance of its optimal configuration (SVDPE + before, MRR: 0.683) is still lower than that of NAGT with both modules enabled (MRR: 0.688). This proves that under the collaborative design of S_biased and Hete_G, NAGT can more accurately mine the structural information contained in N-ary relationships, and is more suitable for N-ary LP tasks compared with traditional Graph Transformer methods. At the same time, it also re-verifies that structural information plays a key role in improving model performance in N-ary LP tasks.
Parameter sensitivity and robustness
We provide a theoretical analysis of NAGT robustness based on its design principles. The structure-biased attention mechanism introduces a strong inductive bias that guides the model toward semantically meaningful patterns from the outset. This prior acts as a regularizer, constraining the hypothesis space and making the learning process less sensitive to specific hyperparameter settings. The consistent performance gains observed across all datasets (Table 3) and the significant drop when removing key components (Table 4) provide indirect evidence of this robustness. The model performs well on both JF17K and WD50K, suggesting stability across different fact arities and graph densities. Furthermore, the GNN used for structural feature extraction is shallow, which naturally limits its sensitivity to depth variations while being sufficient to capture the local neighborhood information in typically small N-ary fact graphs.
The consistent performance of NAGT across JF17K, WikiPeople, and WD50K, datasets with differing proportions of N-ary facts, relational complexities, and domains, suggests a degree of generalizability. However, future work must extend this evaluation to include newer benchmarks featuring noisy and inductive scenarios to fully establish the model’s broader applicability.
Analysis and discussion
To provide deeper insights into why NAGT outperforms baseline methods and to thoroughly understand its strengths and limitations, this section presents a comprehensive comparative analysis from multiple perspectives, including structural representation, attention mechanisms, and generalization capability.
Comparative analysis with baseline methods
We categorize the baseline methods into several technical paradigms and analyze NAGT’s advantages over each category:
Traditional Translation-based Models (e.g., m-TransH, RAE): These methods treat N-ary facts as flat structures and lack explicit modeling of the hierarchical relationships between the primary triple and attribute key-value pairs. In contrast, NAGT’s heterogeneous graph representation explicitly distinguishes between core triple elements and auxiliary qualifiers, enabling better semantic discrimination.
CNN-based Approaches (e.g., NaLP): While capable of capturing local interactions, CNNs struggle with long-range dependencies and complex relational patterns in heterogeneous graphs. NAGT leverages the global self-attention mechanism of Transformers to model interactions between all nodes in the graph, overcoming this limitation.
Neural Module Fusion Methods (e.g., NeuInfer, HINGE): These approaches enhance reasoning through modular design, but their static composition of modules may not adapt well to dynamically varying graph structures. NAGT’s structure-biased attention dynamically adjusts focus based on node roles and edge types, offering greater flexibility.
GNN-based Models (e.g., StarE): GNNs excel at local neighborhood aggregation but often suffer from over-smoothing in deep architectures. NAGT combines GNN-derived structural features with Transformer-based global attention, preserving both local structural information and high-order dependencies.
Transformer-based Methods (e.g., GRAN, HyperSAT): While these methods incorporate graph structural information, their attention bias designs are relatively coarse-grained. NAGT introduces a fine-grained N-ary structural bias that explicitly models the semantic hierarchy between primary triples and key-value pairs, leading to more accurate semantic capture.
Core advantages of NAGT
Hierarchical Graph Representation: The addition of subject-object and relation-value edges strengthens the internal cohesion of the primary triple and enhances the positive influence of attribute values on the relation, reducing noise propagation through relation nodes.
Structure-Biased Attention Mechanism: By integrating both topological (via GNN-extracted node roles) and semantic (via edge types) biases into attention computation, NAGT achieves more focused and semantically meaningful attention distributions. The entropy reduction analysis (Section 3.3) and attention visualization (Fig. 5) confirm this improved focus.
Generalization and Regularization Effect: The structural biases act as a powerful regularizer, constraining the hypothesis space and guiding the model toward semantically plausible connections. This is particularly beneficial in sparse relation or cold-start scenarios, as evidenced by NAGT’s robust performance across datasets with varying arities and relational complexities.
Qualitative Analysis: To gain deeper insights beyond quantitative metrics, we conducted a post-hoc analysis of model predictions on complex facts from the WD50K test set. We observe that NAGT demonstrates superior performance particularly on high-arity facts with multiple, interdependent qualifiers. For instance, in predicting the object for a fact like (Researcher, received, Award, for: Theory, field: Physics, year: 2019), NAGT’s structure-biased attention effectively distributes focus across the core relation (received) and the crucial contextual qualifiers (for, field). In contrast, strong baselines like GRAN occasionally fail to capture these complex interactions, leading to over-reliance on a single attribute. This observation aligns with our design intuition that explicitly modeling the hierarchical structure between the primary triple and its qualifiers is beneficial for fine-grained reasoning.
Conclusion
To address the issue of insufficient utilization of heterogeneous graph structural information in N-ary fact representation methods, this study, considering the complex structural characteristics of N-ary facts composed of a main triple and multiple sets of additional attribute key-value pairs, designs an improved N-ary heterogeneous graph representation scheme. This scheme addresses the issues of missing or redundant structural information in traditional representations, making the graph structure more in line with the semantic logic of N-ary facts. Meanwhile, to fully utilize the structural information of N-ary heterogeneous graphs, the study proposes the NAGT model and designs a new attention mechanism based on N-ary structural bias, enabling comprehensive understanding of N-ary fact heterogeneous graphs and further enhancing the recognition accuracy of key entity associations in cold-start scenarios. Experimental results show that on the three benchmark datasets JF17K, Wikipeople, and WD50K, the NAGT model significantly outperforms mainstream baseline methods such as m-TransH and GRAN in indicators like MRR and Hits@k (k=1,10), and performs prominently in the complex multi-ary relationship scenarios with N > 2 contained in the WD50K(100) dataset. This verifies the superiority of the NAGT model in structural information extraction, provides an effective solution for N-ary LP tasks, contributes to KG completion, and enhances the performance of recommendation systems.
Future studies could extend the NAGT framework to model temporal dynamics in evolving KG, enhance its interpretability through explainable attention mechanisms, and improve computational efficiency for web-scale applications via sparse attention techniques. To fully validate the broader applicability and robustness of our approach, future work should include testing on newer benchmarks that feature more challenging scenarios. Additionally, exploring inductive learning capabilities for unseen entities and integration with large language models represent valuable avenues for advancing HKG completion. We believe the NAGT model provides a solid foundation for these future research endeavors.
Acknowledgements
This work was supported in part by the Science and Technology Development Fund, Macau SAR (0009/2024/ITP1); the National Natural Science Foundation of China (62162054); the Natural Science Foundation of Guangxi (2025GXNSFAA069497, 2025GXNSFAA069688); and the Innovation Project of Guangxi Graduate Education (YCSW2023188).
Author contributions
Conceptualization, B.H. and H.G.; methodology, G.P. and X.T.; software, J.L. and D.X.; validation, J.L. and X.T.; formal analysis, H.G. and X.T.; resources, H.G. and G.P.; data curation, J.L. and X.T.; writing-original draft preparation, X.T. and J.L.; writing-review and editing, B.H., H.G., G.P., X.T., D.X. and J.L.; visualization, J.L. and X.T.; supervision, H.G. and B.H.; project administration, B.H.; funding acquisition, B.H. and H.G. All authors have read and agreed to the published version of the manuscript.
Data availability
The JF17K, WikiPeople, and WD50K datasets used in this study are publicly available from their original sources as cited in the manuscript. The source code and scripts necessary to reproduce the reported results are publicly available in the "Graph Transformer for LP on N-ary Facts" repository at https://doi.org/10.5281/zenodo.17440593.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Xiongjie Tao, Email: 2240004281@student.must.edu.mo.
Hui Guo, Email: guohui@gxuwz.edu.cn.
References
- 1. Chen, X., Jia, S. & Xiang, Y. A review: Knowledge reasoning over knowledge graph. Expert Syst. Appl. 141, 112948 (2020).
- 2. Zhang, Q. et al. Knowgpt: Knowledge graph based prompting for large language models. In Advances in Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 6052–6080 (Curran Associates Inc., Red Hook, 2024).
- 3. Bollacker, K., Evans, C., Paritosh, P., Sturge, T. & Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD '08, 1247–1250 (Association for Computing Machinery, New York, NY, USA, 2008).
- 4. Bollacker, K., Evans, C., Paritosh, P., Sturge, T. & Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 1247–1250 (2008).
- 5. Vrandečić, D. & Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 78–85 (2014).
- 6. Li, W., Zhou, Y., Han, D., Feng, Z. & Zhou, M. Structural optimization and sequence interaction enhancement for hyper-relational knowledge graphs. In Advanced Intelligent Computing Technology and Applications (eds Huang, D.-S. et al.) 261–273 (Springer Nature Singapore, Singapore, 2024).
- 7. Fatemi, B., Taslakian, P., Vazquez, D. & Poole, D. Knowledge hypergraphs: Prediction beyond binary relations (2020).
- 8. Li, J., Luo, X., Lu, G. & Zhang, S. Dhrl4hkg: A dual-hypergraph representation learning for hyper-relational knowledge graphs. Knowl.-Based Syst. 113886 (2025).
- 9. Li, J., Luo, X., Lu, G. & Zhang, S. Hyper-relational knowledge representation learning with multi-hypergraph disentanglement. In Proceedings of the ACM on Web Conference 2025, 3288–3299 (2025).
- 10. Wen, J., Li, J., Mao, Y., Chen, S. & Zhang, R. On the representation and embedding of knowledge bases beyond binary relations. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 1300–1307 (2016).
- 11. Liu, Y., Yao, Q. & Li, Y. Role-aware modeling for n-ary relational knowledge bases. In Proceedings of the Web Conference 2021, 2660–2671 (2021).
- 12. Wang, Q., Wang, H., Lyu, Y. & Zhu, Y. Link prediction on n-ary relational facts: A graph-based approach. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 396–407 (2021).
- 13. Nasiri, E., Berahmand, K., Rostami, M. & Dabiri, M. A novel link prediction algorithm for protein-protein interaction networks by attributed graph embedding. Comput. Biol. Med. 137, 104772 (2021).
- 14. Hu, Z., Gutiérrez-Basulto, V., Xiang, Z., Li, R. & Pan, J. Z. Hyperformer: Enhancing entity and relation interaction for hyper-relational knowledge graph completion. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 803–812 (2023).
- 15. Liu, Y. et al. Self-supervised dynamic hypergraph recommendation based on hyper-relational knowledge graph. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1617–1626 (2023).
- 16. Di, S., Yao, Q. & Chen, L. Searching to sparsify tensor decomposition for n-ary relational data. In Proceedings of the Web Conference 2021, 4043–4054 (2021).
- 17. Li, Z., Wang, C., Wang, X., Chen, Z. & Li, J. HJE: Joint convolutional representation learning for knowledge hypergraph completion. IEEE Trans. Knowl. Data Eng. 36, 3879–3892 (2024).
- 18. Rosso, P., Yang, D. & Cudré-Mauroux, P. Beyond triplets: Hyper-relational knowledge graph embedding for link prediction. In Proceedings of the Web Conference 2020, 1885–1896 (2020).
- 19. Yu, D. & Yang, Y. Improving hyper-relational knowledge graph completion. arXiv:2104.08167 (2021).
- 20. Shomer, H., Jin, W., Li, J., Ma, Y. & Liu, H. Learning representations for hyper-relational knowledge graphs. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining, 253–257 (2023).
- 21. Luo, H. et al. Hahe: Hierarchical attention for hyper-relational knowledge graphs in global and local level. arXiv:2305.06588 (2023).
- 22. Li, Z. et al. Hysae: An efficient semantic-enhanced representation learning model for knowledge hypergraph link prediction. In Proceedings of the ACM on Web Conference 2025, 86–97 (2025).
- 23. Mohammadi, M., Berahmand, K., Sadiq, S. & Khosravi, H. Knowledge tracing with a temporal hypergraph memory network. In International Conference on Artificial Intelligence in Education, 77–85 (Springer, 2025).
- 24. Yun, S., Jeong, M., Kim, R., Kang, J. & Kim, H. J. Graph transformer networks. In Advances in Neural Information Processing Systems 32 (2019).
- 25. Sun, H., Wei, M., Li, J., Xu, X. & Zhou, B. A fact prediction model based on hyper graph and neural network. In 2024 5th International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE), 880–883 (IEEE, 2024).
- 26. Wei, J. et al. A survey of link prediction in n-ary knowledge graphs. arXiv:2506.08970 (2025).
- 27. Wei, J., Guan, S., Jin, X., Guo, J. & Cheng, X. Few-shot link prediction on hyper-relational facts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 7196–7207 (2024).
- 28. Luo, H. et al. Hahe: Hierarchical attention for hyper-relational knowledge graphs in global and local level. arXiv:2305.06588 (2023).
- 29. Wang, C., Wang, X., Li, Z., Chen, Z. & Li, J. Hyconve: A novel embedding model for knowledge hypergraph link prediction with convolutional neural networks. In Proceedings of the ACM Web Conference 2023, 188–198 (2023).
- 30. Li, M., Shi, X., Qiao, C., Zhang, T. & Jin, H. Hyperbolic hypergraph neural networks for multi-relational knowledge hypergraph representation. arXiv:2412.12158 (2024).
- 31. Wei, J., Guan, S., Jin, X., Guo, J. & Cheng, X. Inductive link prediction in n-ary knowledge graphs. In Proceedings of the 31st International Conference on Computational Linguistics, 8885–8896 (2025).
- 32. Berahmand, K. et al. Relative entropy-based regularized non-negative matrix factorization for attributed graph clustering. ACM Trans. Knowl. Discov. Data 19(9), 1–28 (2025).
- 33. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (2017).
- 34. Zhang, T. et al. Transg-net: Transformer and graph neural network based multi-modal data fusion network for molecular properties prediction. Appl. Intell. 53, 16077–16088 (2023).
- 35. Hu, Z., Dong, Y., Wang, K. & Sun, Y. Heterogeneous graph transformer. In Proceedings of the Web Conference 2020, 2704–2710 (2020).
- 36. Lai, P.-T. & Lu, Z. Bert-gt: Cross-sentence n-ary relation extraction with bert and graph transformer. Bioinformatics 36, 5678–5685 (2020).
- 37. Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).
- 38. Baghershahi, P., Hosseini, R. & Moradi, H. Self-attention presents low-dimensional knowledge graph embeddings for link prediction. Knowl.-Based Syst. 260, 110124 (2023).
- 39. Shi, F., Li, D., Wang, X., Li, B. & Wu, X. Tgformer: A graph transformer framework for knowledge graph embedding. IEEE Trans. Knowl. Data Eng. (2024).
- 40. Wang, J., Chen, H. & Zhang, W. Structure-aware transformer for hyper-relational knowledge graph completion. Expert Syst. Appl. 277, 126992 (2025).
- 41. Li, J., Luo, X., Lu, G. & Zhang, S. Dhrl4hkg: A dual-hypergraph representation learning for hyper-relational knowledge graphs. Knowl.-Based Syst. 113886 (2025).
- 42. Chen, C. et al. A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective. IEEE Trans. Pattern Anal. Mach. Intell. (2024).
- 43. Min, E. et al. Transformer for graphs: An overview from architecture perspective. arXiv:2202.08455 (2022).
- 44. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 1 (2017).
- 45. Velickovic, P. et al. Graph attention networks. stat 1050, 48550 (2017).
- 46. Zhang, R., Li, J., Mei, J. & Mao, Y. Scalable instance reconstruction in knowledge bases via relatedness affiliated embedding. In Proceedings of the 2018 World Wide Web Conference, 1185–1194 (2018).
- 47. Guan, S., Jin, X., Wang, Y. & Cheng, X. Link prediction on n-ary relational data. In The World Wide Web Conference, 583–593 (2019).
- 48. Galkin, M., Trivedi, P., Maheshwari, G., Usbeck, R. & Lehmann, J. Message passing for hyper-relational knowledge graphs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7346–7359 (2020).
- 49. Guan, S., Jin, X., Guo, J., Wang, Y. & Cheng, X. Neuinfer: Knowledge inference on n-ary facts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6141–6151 (2020).
- 50. Rosso, P., Yang, D. & Cudré-Mauroux, P. Beyond triplets: Hyper-relational knowledge graph embedding for link prediction. In Proceedings of the Web Conference 2020, 1885–1896 (2020).
- 51. Yu, D. & Yang, Y. Improving hyper-relational knowledge graph completion. arXiv:2104.08167 (2021).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The JF17K, WikiPeople, and WD50K datasets used in this study are publicly available from their original sources as cited in the manuscript. The source code and scripts necessary to reproduce the reported results are publicly available in the "Graph Transformer for LP on N-ary Facts" repository at https://doi.org/10.5281/zenodo.17440593.