Abstract
The task of pose-guided person image generation has long faced two major challenges: high-frequency texture details tend to blur or vanish during appearance transfer, and the semantic identity of the person is difficult to maintain consistently under pose changes. To address these issues, this paper proposes a diffusion-based generative framework that integrates frequency awareness and global semantic alignment. The framework consists of two core modules: a multi-level fractional-order Gabor frequency-aware network, which accurately extracts and reconstructs high-frequency texture features such as hair strands and fabric wrinkles and enhances image detail fidelity through fractional-order filtering and complex-domain modeling; and a global semantic–pose alignment module, which uses a cross-modal attention mechanism to establish a global mapping between pose features and appearance semantics, ensuring pose-driven semantic alignment and appearance consistency. Together, these two modules ensure that the generated results maintain structural integrity and natural textures even under complex pose variations and large-angle rotations. Experimental results on the DeepFashion and Market-1501 datasets demonstrate that the proposed method outperforms existing state-of-the-art approaches in terms of SSIM, FID, and perceptual quality, validating the model's effectiveness in enhancing texture fidelity and semantic consistency.
Keywords: pose-guided person image synthesis, diffusion models, fractional Gabor filters, frequency-aware feature extraction, global semantic–pose alignment, high-fidelity texture reconstruction
1. Introduction
Pose-guided person image generation (PGPIS) aims to generate person images with a target pose while maintaining consistent identity information (as shown in Figure 1). This task requires that the generated images match the target pose while preserving identity consistency and semantic coherence and exhibiting realistic texture details and high visual quality. PGPIS has significant applications in various fields. For instance, in animation production, it can automatically generate high-resolution character images in different poses, saving the time spent on manual drawing. It also plays an important role in virtual try-on, game character customization, and special effects production, where it helps increase automation, reduce production costs, and provide users with an enhanced visual experience. In PGPIS, accurately transferring high-frequency texture details and maintaining stable person-specific semantic information are the key challenges for achieving high-fidelity generation. The former requires the reproduction of details such as fabric textures, hair strands, and skin texture, while the latter necessitates precise pose alignment and the avoidance of semantic distortion. To address these challenges, current research primarily follows two technological paths: Generative Adversarial Networks (GANs) [1] and diffusion models [2].
Figure 1.
Illustration of pose-guided person image synthesis.
GAN-based approaches pioneered the PGPIS paradigm and explored how to balance appearance detail and pose control. XingGAN [3] processes appearance and pose in dual branches and fuses them via collaborative attention, but its complex structure and imperfect fusion lead to remaining visual artifacts. MARLM-GAN [4] improves global relationships through inter-activation and residual linear modeling, yet struggles with large pose changes and occlusions due to the limitations of single-image information. RSA-GAN [5] incorporates human parsing priors to enhance realism but incurs high training costs and performs poorly in low-resolution or complex scenes. UPG-GAN [6] combines CycleGAN and VAE to enable unpaired training and diverse generation, though it suffers from low resolution and weak handling of rare poses and facial areas. DPIG [7] disentangles the foreground, background, and pose in a two-stage framework, but insufficient structural consistency and the absence of skip connections limit detail reconstruction. Despite their foundational role, GANs face inherent issues—including instability, mode collapse, limited diversity, and difficulty modeling complex semantics and high-frequency textures—making them less suitable for modern high-resolution PGPIS tasks. Diffusion models alleviate these issues through stable Markov-based optimization, avoiding mode collapse and supporting rich semantic conditioning, yet still struggle with fine-grained texture control, component-level semantics, and occlusion handling. Although recent diffusion-based improvements have targeted PGPIS, they continue to expose new limitations.
PIDM [8] encodes multi-scale textures and injects them into UNet cross-attention to link source appearance with target pose, but it lacks explicit body part modeling and struggles with high-frequency detail under large pose changes. IMAGPose [9] combines the VAE and CLIP encoders for texture and semantic extraction, yet VAE compression causes detail loss and CLIP lacks component-level understanding, leading to incorrect texture mappings. CFLD [10] decouples multi-scale features with Swin Transformer and refines them via attention, but still fails to model complex garment components, often causing texture misalignment. PoCoLD [11] performs latent diffusion guided by DensePose, but VAE downsampling discards details and lacks multi-view consistency. HumanSD [12] conditions diffusion on text and pose but does not extract appearance features, making fine textures difficult to preserve. RePoseDM [13] attempts texture–pose alignment via attention, yet feature correspondence remains weak, leading to misalignment and distortion under large pose variations. DNAF [14] injects Swin-extracted style features as constant conditions and uses noise-aware features for detail enhancement, but may cause perceptual distortion. UPGPT [15] extracts region-specific features via CLIP, but the pipeline is complex and CLIP’s semantic bias harms fine-grained texture fidelity. TCAN [16] uses an appearance UNet pre-trained on Realistic Vision to improve consistency, though the system is heavy, costly, and reliant on multiple cooperating modules.
In summary, existing source image processing methods mainly fall into two categories (as shown in Figure 2): direct and multi-level encoding in the real domain, and direct embedding strategies lacking semantic alignment mechanisms. However, both categories struggle with preserving fine textures and achieving component-level semantic understanding. Most rely on compressed pre-trained encoders (e.g., VAE, CLIP, Swin Transformer, and ControlNet) that model appearance only in the real domain while ignoring magnitude–phase information, leading to high-frequency detail loss. Consequently, although global structures can be produced, facial blurriness and texture distortion remain common. Even multi-scale or noise-enhanced methods suffer from texture misalignment and regional detail loss. Moreover, current approaches lack precise body part understanding and fail to achieve interactive alignment between pose and appearance features, resulting in inconsistent mapping between the source’s appearance and the target pose.
Figure 2.
Analysis of limitations in existing source image processing. On the right, direct encoding methods (e.g., VAE) are shown, while the left illustrates condition injection methods (e.g., Swin Transformer). Current approaches suffer from a lack of frequency modeling and semantic alignment mechanisms, resulting in semantic misalignment and blurred high-frequency details. The “sad face” icons denote the specific loss of texture fidelity in these methods.
Despite providing a powerful generative prior for person image generation, diffusion models still face the dual challenge of detail preservation and semantic alignment when processing source images. Achieving high-fidelity texture reconstruction and precise component-level semantic control without sacrificing generation efficiency remains a key direction for future research.
General-purpose pre-trained encoders model texture details only in the real domain, which limits texture detail preservation and component-level semantic understanding. To overcome these limitations, this work proposes a texture feature extraction network based on fractional-order Gabor frequency awareness (MLGFN). This approach abandons traditional real-domain modeling and adopts a multi-level complex frequency-aware architecture to capture texture details from multiple perspectives. By utilizing fractional-order Gabor filters, both the amplitude and phase information of textures are captured simultaneously. The combination of frequency-aware units and frequency-domain self-attention mechanisms enables the more precise modeling of texture directional features and long-range dependencies. Compared to traditional methods, MLGFN significantly improves the retention of high-frequency textures, enhances the fidelity of facial features and fabric details, and provides more accurate component-level semantic understanding. To address issues of semantic distortion and pose misalignment, we also introduce a Global Semantic–Pose Alignment Module (GSPAM). Through a cross-attention mechanism, this module enables adaptive matching between the pose and the global semantic features, effectively improving structural consistency and texture fidelity under pose variations and in complex clothing scenarios. The source code is available at: https://github.com/bing32475/FrGPose, accessed on 15 February 2026. The main contributions of this paper are as follows:
We propose a texture feature extraction network based on fractional-order Gabor frequency awareness, enhancing the capture of both amplitude and phase features of textures. This reduces high-frequency detail loss from aggressive downsampling in traditional encoders, improving the fidelity of fine-grained structures such as fabric, hair, and facial textures.
We design a global semantic–pose alignment mechanism based on cross-attention, which combines global appearance semantics from the Swin Transformer with deep pose features. By aligning the semantic pose in the global feature domain for image reconstruction, this method ensures structural consistency and texture continuity in scenarios with large pose variations and complex garments.
An end-to-end multi-scale feature fusion generation framework is proposed, combining frequency-aware texture networks with global pose–appearance alignment. This unified framework optimizes texture detail preservation and semantic structure modeling, achieving superior visual quality and semantic consistency across multiple datasets.
2. Related Work
2.1. Texture Preservation Approaches for PGPIS
In PGPIS, preserving fine textures under pose changes remains a significant challenge, and recent progress highlights the importance of frequency-domain cues for improving detail fidelity. Early GAN-based methods adopted encoder–decoder architectures but suffered from training instability and poor texture accuracy. Diffusion-based approaches have improved robustness, yet they remain insufficient in generating high-frequency details. These limitations indicate that relying solely on image–space features is inadequate and that frequency-domain signals are essential for high-fidelity texture recovery. However, existing frequency-domain methods lack effective directional feature extraction and amplitude–phase joint modeling. The Res FFT-Conv Block [17] improves global consistency but utilizes isotropic FFT features lacking directional selectivity, while Yang et al. [18] separate amplitude and phase using FFC but lack complex-domain modeling, thereby limiting control over structured textures.
2.2. Global Semantic–Pose Alignment Mechanism
Accurate semantic and pose alignment remains a fundamental challenge in person image generation. Existing appearance flow methods achieve alignment by predicting coordinate offsets; however, they are susceptible to flow distortion under large pose variations, leading to structural and textural deformation. While attention-based alignment methods improve semantic correspondence, they often result in semantic ambiguity in detailed regions due to a lack of fine-grained spatial constraints or difficulties in modeling complex spatial relations. Furthermore, although recent studies attempt to decouple style and texture, or introduce strong supervision signals, they are constrained by the absence of explicit structural priors and detail degradation during latent diffusion, making it difficult to ensure alignment accuracy and global consistency under large-scale or non-rigid transformations.
To address these challenges, we propose a unified framework comprising a fractional-order Gabor frequency-aware network and a GSPAM module. The Gabor network leverages fractional filters and complex-domain attention to capture directionally selective texture primitives, ensuring high-fidelity structural preservation. Concurrently, GSPAM aligns global appearance with pose features via cross-attention and multi-level fusion. This synergy establishes fine-grained correspondences, effectively minimizing semantic fragmentation and pose drift in complex regions such as joints and clothing boundaries.
3. The Proposed Framework
3.1. Preliminaries
3.1.1. Stable Diffusion
We first review the foundations that our framework builds upon, including Stable Diffusion and two classic time–frequency transforms. Stable Diffusion (Rombach et al., 2022) [2] is an efficient diffusion probabilistic model that operates in the latent space to reduce computational cost, enabling high-resolution image synthesis. Its core lies in the noise addition and denoising processes. In this work, we extend its conditional injection mechanism by introducing dual guidance with semantic–pose alignment features and appearance detail encoding. Specifically, the noise addition process follows forward diffusion, where Gaussian noise is gradually added to the latent representation $z_0$. For a time step $t$, the noise addition formula is
$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ | (1)
where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative variance schedule coefficient, and $\beta_t$ represents the noise schedule (either linear or cosine). The denoising process reverses the diffusion via a noise prediction network $\epsilon_\theta$, trained to predict the added noise:
$\mathcal{L} = \mathbb{E}_{z_0, c, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right]$ | (2)
Here, $c$ represents the conditional input, which is injected into the UNet through cross-attention. The iterative denoising process begins from pure noise $z_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and progressively restores $z_0$, which is then decoded into the image $\hat{x} = D(z_0)$ by the VAE decoder $D$. This formulation ensures the stability of the generation process. Based on this framework, we fine-tune the model to integrate semantic–pose alignment and appearance detail perception features, achieving the simultaneous optimization of texture fidelity and pose consistency.
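As a concrete illustration of Equation (1), the forward noising step can be sketched in a few lines of NumPy. The linear schedule bounds (`beta_start`, `beta_end`) are common illustrative defaults, not values taken from this paper:

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    # Cumulative schedule coefficient: alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    # Eq. (1): z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
abar = linear_alpha_bar()
z0 = rng.standard_normal((4, 32, 32))   # a toy latent-space sample
z_t, eps = q_sample(z0, 500, abar, rng)
```

Because $\bar{\alpha}_t$ decays monotonically toward zero, $z_t$ interpolates from the clean latent to (almost) pure Gaussian noise as $t \to T$.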
3.1.2. Gabor Transform and Fractional Fourier Transform
The 2D Gabor transform is a widely used time–frequency analysis method that performs localized frequency decomposition by multiplying a Gaussian window with a complex exponential wave. Unlike the global Fourier transform, it preserves local spatial–frequency information, enabling effective analysis of regional signal variations. For a 2D signal $f(x, y)$, the Gabor transform is defined as
$G_f(u, v, \omega_1, \omega_2) = \iint f(x, y)\, g^{*}(x - u, y - v)\, e^{-j(\omega_1 x + \omega_2 y)}\, dx\, dy$ | (3)
where $*$ denotes the complex conjugate operator. The window function $g(x, y)$ is typically chosen as a Gaussian function, which localizes the signal in the spatial domain, while the complex exponential term performs the frequency-domain analysis. The window function is expressed as
$g(x, y) = \exp\left( -\frac{x^2 + \gamma^2 y^2}{2\sigma^2} \right)$ | (4)
where $\gamma$ is the aspect ratio controlling the scale difference between the two spatial directions, and $\sigma$ determines the width of the window function. To construct the Gabor filter, this real-valued envelope is modulated by a complex sinusoidal carrier. The specific 2D Gabor filter kernel is formulated as
$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left( -\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2} \right) \exp\left( j \left( 2\pi \frac{x'}{\lambda} + \psi \right) \right)$ | (5)
where $\lambda$ represents the wavelength of the sinusoidal factor, $\theta$ denotes the orientation of the parallel stripes, and $\psi$ is the phase offset. The rotated coordinates are defined as $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$. This transform allows local frequency analysis in multiple orientations and is widely applied in image edge detection, texture analysis, and other computer vision tasks.
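The kernel in Equation (5) can be sampled directly on a discrete grid. The following NumPy sketch uses illustrative parameter values (`lam`, `sigma`, `gamma` are not taken from the paper); the magnitude of the resulting complex kernel is just the Gaussian envelope, which peaks at the grid center:

```python
import numpy as np

def gabor_kernel(size=15, lam=4.0, theta=0.0, psi=0.0, sigma=3.0, gamma=0.5):
    # Complex 2D Gabor kernel: Gaussian envelope times complex sinusoid (Eq. 5).
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2 * sigma**2))
    carrier = np.exp(1j * (2 * np.pi * x_r / lam + psi))
    return envelope * carrier

k = gabor_kernel()
```

Varying `theta` rotates the stripes of the real part, which is what gives Gabor banks their orientation selectivity.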
The traditional Fourier transform (FT) can be viewed as a 90° rotation of the signal in the time–frequency plane, while the fractional Fourier transform (FRFT) introduces a more flexible joint time–frequency analysis framework by allowing arbitrary rotations controlled by the fractional order parameter $p$, making it suitable for the analysis of local and non-stationary signals.
The FRFT is mathematically defined as
$F_p(u) = \int_{-\infty}^{\infty} K_p(u, t)\, f(t)\, dt$ | (6)
where the kernel function $K_p(u, t)$ is given by
$K_p(u, t) = \sqrt{1 - j\cot\alpha}\, \exp\left( j\pi \left( u^2 \cot\alpha - 2ut\csc\alpha + t^2 \cot\alpha \right) \right), \quad \alpha = \frac{p\pi}{2}$ | (7)
Here, $p$ represents the fractional order, which determines the rotation angle $\alpha = p\pi/2$ in the time–frequency plane. When $p = 1$, the FRFT reduces to the conventional Fourier transform, and when $p = 0$, it corresponds to the original signal.
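The reduction to the ordinary Fourier transform at $p = 1$ is easy to verify numerically by sampling the kernel pointwise. The sketch below is illustrative and sidesteps the degenerate orders where $\cot\alpha$ diverges ($p$ near 0 or 2):

```python
import numpy as np

def frft_kernel(u, t, p):
    # FRFT kernel K_p(u, t) sampled pointwise, with alpha = p * pi / 2.
    alpha = p * np.pi / 2.0
    cot, csc = 1.0 / np.tan(alpha), 1.0 / np.sin(alpha)
    A = np.sqrt(1.0 - 1j * cot)
    return A * np.exp(1j * np.pi * (u**2 * cot - 2 * u * t * csc + t**2 * cot))

# At p = 1: cot(pi/2) = 0 and csc(pi/2) = 1, so K_1(u, t) = exp(-2j*pi*u*t),
# i.e., the ordinary Fourier kernel.
k = frft_kernel(0.5, 0.25, 1.0)
```

For intermediate orders the kernel keeps unit-modulus chirp factors, so $|K_p(u, t)|$ depends only on $p$, not on $(u, t)$.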
In the two-dimensional case, the FRFT performs analysis by rotating the time–frequency plane along both axes. The 2D FRFT is defined as
$F_{p_1, p_2}(u, v) = \iint K_{p_1}(u, x)\, K_{p_2}(v, y)\, f(x, y)\, dx\, dy$ | (8)
where $K_{p_1}$ and $K_{p_2}$ are the FRFT kernels along the x- and y-directions, respectively. This formulation enables the representation of 2D signal characteristics at arbitrary rotation angles in the time–frequency plane.
3.1.3. Fractional Gabor Transform
The fractional Gabor transform (FGT) integrates the FRFT with the conventional Gabor transform, enabling local frequency analysis in arbitrary orientations through a rotated time–frequency plane. By applying Gabor filtering within the FRFT domain, FGT offers more flexible and precise local representations, making it well suited for directional or non-stationary signal analysis.
The general definition of the 2D fractional Gabor transform is as follows:
$G_{p_1, p_2}(u, v, \omega_1, \omega_2) = \iint f(x, y)\, g_p^{*}(x - u, y - v)\, K_{p_1}(\omega_1, x)\, K_{p_2}(\omega_2, y)\, dx\, dy$ | (9)
where $f(x, y)$ is the input signal, $g_p$ is the fractional Gabor window function, $K_{p_1}$ and $K_{p_2}$ are the FRFT kernels in the x- and y-directions, $(u, v)$ denotes the spatial displacement (i.e., the center position of the window), and $(\omega_1, \omega_2)$ represents the rotated frequency coordinates. This transform effectively combines the rotational property of the FRFT with the localization capability of the Gabor window, enabling the robust extraction of features such as oblique textures and image edges. The fractional Gabor window function combines a Gaussian envelope with fractional-order phase modulation and is expressed as
$g_p(x, y) = \exp\left( -\frac{x^2 + \gamma^2 y^2}{2\sigma^2} \right) \exp\left( j\, \frac{x^2 + y^2}{2} \cot\alpha \right)$ | (10)
The first term provides the Gaussian envelope controlling the window scale, while the second introduces fractional-order phase modulation determined by the FRFT rotation angle $\alpha$. This enables orientation-adaptive time–frequency resolution, making the window effective for signals with strong directionality or complex frequency patterns.
While lighter representations like Short-Term Fourier Transform (STFT) or standard Wavelets are effective for stationary texture analysis, they often lack flexibility in capturing the non-stationary and direction-variant characteristics of clothing textures under complex pose deformations. Fabric wrinkles and patterns (e.g., stripes) often exhibit varying orientations that do not align with fixed coordinate axes. The rationale for employing the fractional Gabor transform lies in its unique ability to perform time–frequency analysis in a rotated plane determined by the fractional order . This provides an additional degree of freedom to optimally align the analysis window with the varying principal directions of the texture, offering a more compact and sparse representation for non-rigidly deformed patterns than isotropic filters or standard wavelets.
3.2. Multi-Level Fractional Gabor Frequency Network
In PGPIS, accurately extracting high-frequency details such as wrinkles, skin texture, and hair is essential for realism and consistency. Existing pre-trained encoders (e.g., VGG and CLIP) often lose fine textures and lack component-level semantic understanding. To address these limitations, we introduce a feature generation framework based on fractional-order Gabor filtering and frequency-domain attention (Figure 3). Specifically, the final-layer semantic features $A$ are extracted via the Swin Transformer, multi-scale pose features are extracted by lightweight ResNet blocks [19], and multi-scale texture features are extracted by our proposed MLGFN. Within the MLGFN, fractional Gabor kernels capture direction- and scale-sensitive texture responses, and a complex-domain attention mechanism aligns global phase information, thereby enabling the accurate preservation of fine details, such as clothing, hair, and edges, even under large pose changes. Finally, our proposed GSPAM outputs the aligned features obtained by aligning the semantic features with the pose features.
Figure 3.
Overview of the proposed FreqPose framework. The source image is processed by the Multi-Level Gabor Frequency Network (MLGFN) (orange block) to extract frequency-aware texture features. These features, together with the pose features from the Pose Encoder (blue block), are fused via the Global Semantic–Pose Alignment Module (GSPAM) (purple block) before being injected into the denoising UNet. The 'Fire' icon denotes trainable modules, while the 'Snowflake' icon denotes frozen pre-trained weights.
We apply transformations to the input at different angles. For each direction $\theta_k$, the rotated coordinates are defined as
$x'_{\theta_k} = x\cos\theta_k + y\sin\theta_k, \qquad y'_{\theta_k} = -x\sin\theta_k + y\cos\theta_k$ | (11)
The angles $\theta_k$ are used to capture the spatial relationships of textures through four different directions. Based on Euler's formula, we extend the Gabor window function in Equation (10) to the following form for each angle $\theta_k$ and each scale $s$:
$g_p^{\theta_k, s}(x, y) = \exp\left( -\frac{x'^2_{\theta_k} + \gamma^2 y'^2_{\theta_k}}{2 s^2} \right) \left[ \cos\left( \Phi_{\theta_k} \right) + j \sin\left( \Phi_{\theta_k} \right) \right], \quad \Phi_{\theta_k} = 2\pi \frac{x'_{\theta_k}}{\lambda} + \frac{x^2 + y^2}{2} \cot\alpha$ | (12)
The scale $s$ is used to capture the spatial relationships of textures through four different scales. Although the fractional Gabor transform is defined in the continuous domain, our implementation operates in a discrete space compatible with convolutional neural networks. We discretize the continuous kernel by sampling it on a fixed grid of size $k \times k$. This yields the real and imaginary parts of the Gabor kernel:
$K^{\mathrm{re}}_{\theta_k, s}[m, n] = \mathrm{Re}\left( g_p^{\theta_k, s}(m, n) \right), \qquad K^{\mathrm{im}}_{\theta_k, s}[m, n] = \mathrm{Im}\left( g_p^{\theta_k, s}(m, n) \right)$ | (13)
where $m, n \in \{-\lfloor k/2 \rfloor, \ldots, \lfloor k/2 \rfloor\}$. Then, we apply normalization:
$K \leftarrow \frac{K}{\left\| K \right\|_2 + \varepsilon}$ | (14)
Here, $p$ is the fractional-order parameter, $\theta_k$ and $s$ represent the direction and scale, respectively, and $\varepsilon$ is a stability constant. Compared to the traditional Gabor kernel, fractional-order control allows for the flexible adjustment of the frequency band shape, enhancing the filter's ability to respond to both fine-grained and coarse-grained textures. This is particularly important for regions such as fabric folds and hair, enabling the model to capture continuous texture patterns across scales rather than focusing solely on local strong textures.
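A minimal sketch of such a discretized kernel bank is shown below. The exact phase term and the tying of the wavelength to the scale are assumptions for illustration; the four orientations and four scales mirror the text:

```python
import numpy as np

def fractional_gabor_bank(size=11, p=0.8,
                          thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4),
                          scales=(1.0, 2.0, 4.0, 8.0), gamma=0.5, eps=1e-6):
    # Bank of discrete fractional Gabor kernels over 4 orientations x 4 scales.
    # The fractional order p contributes a chirp-like phase via cot(alpha).
    alpha = p * np.pi / 2.0
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    bank = []
    for theta in thetas:
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        for s in scales:
            env = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * s**2))
            # carrier phase + fractional chirp term (assumed form)
            phase = 2 * np.pi * xr / s + 0.5 * (x**2 + y**2) / np.tan(alpha)
            k = env * np.exp(1j * phase)
            k = k / (np.linalg.norm(k) + eps)   # Eq. (14) normalization
            bank.append(k)
    return np.stack(bank)

bank = fractional_gabor_bank()
```

The normalization keeps each branch's response magnitude comparable, which the text notes is important for stable aggregation later on.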
Based on the above kernels, complex convolution is performed on the input feature $X$:
$Y^{\mathrm{re}}_{\theta_k, s} = X * K^{\mathrm{re}}_{\theta_k, s}, \qquad Y^{\mathrm{im}}_{\theta_k, s} = X * K^{\mathrm{im}}_{\theta_k, s}$ | (15)
Here, $*$ denotes the convolution operation, followed by aggregation along the direction and scale dimensions:
$Y = \sum_{\theta_k} \sum_{s} \left( Y^{\mathrm{re}}_{\theta_k, s} + j\, Y^{\mathrm{im}}_{\theta_k, s} \right)$ | (16)
where $\sum$ denotes the summation operator. This approach not only extracts the magnitude features of the texture but, more importantly, preserves the phase information, thereby capturing the directionality, continuity, and regularity of the texture. This means that when the pose of the person undergoes deformation, local structures such as striped fabrics, hair direction, and clothing folds can still be well preserved, alleviating misalignment and blurring issues.
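Equations (15) and (16) amount to sliding-window filtering with a complex kernel followed by a sum over branches. A naive, unoptimized NumPy sketch (using cross-correlation, as CNN frameworks do):

```python
import numpy as np

def complex_conv2d(X, K):
    # 'Same'-size sliding-window filtering of a real map X with a complex
    # kernel K; real and imaginary responses share the same window (Eq. 15).
    kh, kw = K.shape
    ph, pw = kh // 2, kw // 2
    Xp = np.pad(X, ((ph, ph), (pw, pw)))
    out = np.zeros(X.shape, dtype=complex)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = np.sum(Xp[i:i + kh, j:j + kw] * K)
    return out

def aggregate(X, kernels):
    # Eq. (16): sum the complex responses over all direction/scale branches.
    return sum(complex_conv2d(X, K) for K in kernels)

X = np.arange(16.0).reshape(4, 4)
delta = np.zeros((3, 3), dtype=complex)
delta[1, 1] = 1.0                       # identity kernel for a sanity check
out = complex_conv2d(X, delta)
Y = aggregate(X, [delta, 1j * delta])
```

In practice this would be implemented as grouped convolutions with fixed complex kernel banks rather than explicit Python loops.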
We further apply the GaborFPU to split the input channels into four groups for parallel filtering, followed by fusion through convolution and residual mapping:
$F = \phi\left( \mathrm{Conv}\left( G_1(X_1) \oplus G_2(X_2) \oplus G_3(X_3) \oplus G_4(X_4) \right) \right) + \mathcal{S}(X)$ | (17)
Here, $\phi$ represents the ReLU activation, $\mathcal{S}$ denotes the shortcut mapping, $G_i$ is the Gabor filtering branch applied to the channel group $X_i$, and $\oplus$ stands for concatenation. Because each branch processes a distinct orientation and scale, the network learns complementary directional responses; the residual path guarantees that the original feature map is preserved, preventing over-filtering. Through parallel multi-directional filtering and residual design, the GaborFPU enhances texture direction diversity while maintaining overall appearance stability (as shown in Figure 4), providing a structurally richer foundation for the subsequent attention mechanism.
Figure 4.
Architectural diagrams of the FractionalGaborFilter, GaborSingle, and GaborFPU modules.
In global frequency-domain modeling, we use FrequencyAttention (as shown in Figure 5) to capture texture phase dependencies across different locations. Let the complex output of GaborSingle be $Z = Z^{\mathrm{re}} + j Z^{\mathrm{im}}$. We first perform dimensionality reduction on its real and imaginary parts using convolution and apply complex normalization to obtain
$\hat{Z} = \mathrm{CNorm}\left( \mathrm{Conv}(Z^{\mathrm{re}}) + j\, \mathrm{Conv}(Z^{\mathrm{im}}) \right)$ | (18)
This normalization mitigates the issue of amplitude imbalance across different directional and scale branches, resulting in more stable attention aggregation. Subsequently, we rearrange $\hat{Z}$ into a multi-head format, where $d = h \times d_h$, $h$ is the number of heads, and $d_h$ is the dimension per head:
$\hat{Z} \in \mathbb{C}^{N \times d} \;\rightarrow\; \hat{Z} \in \mathbb{C}^{h \times N \times d_h}$ | (19)
To perform attention computation in the complex space, we calculate the scaled dot-product attention scores for the real and imaginary parts separately:
$A^{\mathrm{re}} = \frac{Q^{\mathrm{re}} \left( K^{\mathrm{re}} \right)^{\top}}{T \sqrt{d_h}}, \qquad A^{\mathrm{im}} = \frac{Q^{\mathrm{im}} \left( K^{\mathrm{im}} \right)^{\top}}{T \sqrt{d_h}}$ | (20)
Here, $T$ is the temperature coefficient, and $Q$, $K$, and $V$ are the complex query, key, and value projections of $\hat{Z}$. We apply softmax separately to the real and imaginary attention scores, which empirically yields more stable gradients than a joint complex softmax. After applying softmax to $A^{\mathrm{re}}$ and $A^{\mathrm{im}}$, the weighted matrices are obtained, with the corresponding real and imaginary parts of the output being
$O^{\mathrm{re}} = \mathrm{softmax}\left( A^{\mathrm{re}} \right) V^{\mathrm{re}}, \qquad O^{\mathrm{im}} = \mathrm{softmax}\left( A^{\mathrm{im}} \right) V^{\mathrm{im}}$ | (21)
Finally, the complex output is obtained as
$O = O^{\mathrm{re}} + j\, O^{\mathrm{im}}$ | (22)
This separate computation of the real and imaginary components is more stable than directly applying complex multiplication, while clearly distinguishing in-phase correlations (the real part) from quadrature-phase correlations (the imaginary part). This enables the model to explicitly capture phase alignment relationships within textures, which provides a clear advantage in maintaining texture continuity under large pose variations.
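The split real/imaginary attention of Equations (20)–(22) can be sketched for a single head as follows; the projection layers are omitted and the input is used directly as query, key, and value for brevity:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_complex_attention(Q, K, V, temp=1.0):
    # Eqs. (20)-(22): scaled dot-product attention computed independently on
    # the real and imaginary parts, with a separate softmax for each branch.
    d = Q.shape[-1]
    A_re = softmax(Q.real @ K.real.T / (temp * np.sqrt(d)))
    A_im = softmax(Q.imag @ K.imag.T / (temp * np.sqrt(d)))
    return A_re @ V.real + 1j * (A_im @ V.imag)

rng = np.random.default_rng(1)
Z = rng.standard_normal((6, 8)) + 1j * rng.standard_normal((6, 8))
O = split_complex_attention(Z, Z, Z)
```

Because each softmax row is a convex combination, the real output stays within the range of the real values it attends over, which is one reason this split formulation behaves more stably than a joint complex product.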
Figure 5.
Framework of the FrequencyAttention module.
The magnitude is computed as $|O| = \sqrt{\left( O^{\mathrm{re}} \right)^2 + \left( O^{\mathrm{im}} \right)^2}$. Finally, we take the complex magnitude of $O$ and fuse it with the channel attention result. The fused output is then formed through convolution and normalization to produce the frequency attention output:
$F_{\mathrm{freq}} = \mathrm{Norm}\left( \mathrm{Conv}\left( |O| \odot \mathrm{CA}(|O|) \right) \right)$ | (23)
where $\odot$ denotes the Hadamard product, $|\cdot|$ represents the modulus operator, and $\mathrm{CA}(\cdot)$ is the channel attention mapping. Channel attention suppresses irrelevant frequency bands and highlights the texture regions related to person identity features, such as fabric patterns, lace, logos, and decorative designs, making the output more focused and expressive. Finally, we fuse $F_{\mathrm{freq}}$ with the input feature $X$ through a residual connection:
$F_{\mathrm{out}} = F_{\mathrm{freq}} + X$ | (24)
The method achieves dual texture modeling at both the local and global levels. Fractional-order Gabor filtering provides direction-controllable and scale-adaptive complex texture responses at the local level, while FrequencyAttention establishes phase consistency and long-range texture dependencies at the global level. By applying normalization and channel attention, the model’s ability to focus on key high-frequency features is enhanced. The combination of both allows the network to simultaneously preserve the fine structure of local textures and the consistency of global appearance when handling complex pose transformations, significantly improving the clarity and identity fidelity of the generated results.
3.3. Global Semantic–Pose Alignment Module
In pose-guided portrait generation, achieving deep alignment between pose structures and semantic features remains challenging. Existing methods often use simple feature concatenation or linear fusion, which fails to preserve spatial correspondences in the deep semantic space, leading to local misalignment and texture drift under large pose variations.
To effectively align the appearance features of the source image with the target pose, we propose a global semantic–pose alignment module (as shown in Figure 6). This module employs a multi-depth alignment architecture to align the high-level semantic features of the source image with the global spatial guidance information from the target pose map. The aligned features are then injected into the UNet backbone network for pose-guided image synthesis.
Figure 6.
Framework of the Global Semantic–Pose Alignment module.
First, the pose features $P$ undergo 2D average pooling and a convolution projection to reduce spatial redundancy and align the channel dimensions, resulting in the deepest layer of pose features:
$P_d = \mathrm{Conv}\left( \mathrm{AvgPool}(P) \right)$ | (25)
Then, $P_d$ is flattened into a sequence of length $N_q$, followed by layer normalization to obtain the initial query representation:
$Q_0 = \mathrm{LN}\left( \mathrm{Flatten}(P_d) \right)$ | (26)
Here, $\mathrm{LN}$ denotes the layer normalization operation, and $\mathrm{Flatten}$ represents the flattening operation. To retain the spatial relationship corresponding to the query, learnable position embeddings $E_q$ are introduced and added to the query features during subsequent attention computation. This design explicitly preserves topological structural information in pose queries, ensuring that the model does not lose spatial priors of the pose during the global mapping process, thereby enhancing its ability to locate key pose regions.
The deep semantic features of appearance from the Swin Transformer are represented as $A$. To strengthen spatial awareness, key-value position embeddings $E_{kv}$ are added to obtain $\tilde{A} = A + E_{kv}$. This operation allows the appearance features to provide clear spatial anchors in cross-attention, alleviating the misalignment issues caused by significant pose changes.
During the cross-attention phase, let $Q_l$ be the query at the $l$-th layer. The query, key, and value are first computed through linear transformations, as follows:
$Q = \left( Q_l + E_q \right) W_Q, \qquad K = \tilde{A} W_K, \qquad V = \tilde{A} W_V$ | (27)
Here, $W_Q$, $W_K$, and $W_V$ are the projection matrices. Let the number of attention heads be $h$, with the dimension of each head $d_h = d / h$. Then, the output of the cross-attention is
$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_h}} \right) V$ | (28)
The enhanced query representation is then obtained through a residual connection:
$Q'_l = Q_l + \mathrm{CrossAttn}(Q, K, V)$ | (29)
With this design, pose queries can establish precise correspondences with appearance features on a global scale, achieving structural-level cross-domain alignment. This cross-attention-based global alignment method is more effective than simple feature concatenation, as it fully utilizes the source image information and exhibits stronger robustness to large-scale pose variations. To further enhance the internal consistency of the aligned pose, the model applies self-attention to the query features after the cross-attention stage. Specifically, the query, key, and value features are obtained through linear transformations:
$\hat{Q} = Q'_l W'_Q, \qquad \hat{K} = Q'_l W'_K, \qquad \hat{V} = Q'_l W'_V$ | (30)
We then compute self-attention as follows:
$\mathrm{SelfAttn}\left( \hat{Q}, \hat{K}, \hat{V} \right) = \mathrm{softmax}\left( \frac{\hat{Q} \hat{K}^{\top}}{\sqrt{d_h}} \right) \hat{V}$ | (31)
Then, the query is updated through residual connections and a feedforward network:
$Q''_l = Q'_l + \mathrm{SelfAttn}\left( \hat{Q}, \hat{K}, \hat{V} \right), \qquad Q_{l+1} = Q''_l + \mathrm{FFN}\left( Q''_l \right)$ | (32)
Here, FFN denotes the feedforward network. The introduction of self-attention establishes rich semantic relationships within the pose query, ensuring that local regions remain highly coordinated and continuous after alignment, thus avoiding issues such as feature fragmentation and local drift. During the output stage, to further enhance the preservation of appearance information, a residual alignment mechanism at the encoding end is introduced, where $A$ is passed through a linear transformation,
$A' = A W_A$ | (33)
and the sequence length is aligned to $N_q$ through adaptive average pooling:
$A'' = \mathrm{AdaptiveAvgPool}\left( A' \right)$ | (34)
Finally, we obtain
$F_{\mathrm{align}} = Q_L + A''$ | (35)
where $Q_L$ denotes the output of the last Transformer block. The residual fusion between the high-level features of the Swin Transformer and the pose queries provides a direct appearance feedback pathway beyond attention alignment, which enhances the preservation of critical appearance information and improves both gradient stability and training convergence. The entire alignment process is composed of four sequential stages, namely pose querying, global cross-attention, local self-attention, and residual fusion, through which structural and appearance information are efficiently coupled. The cross-attention mechanism ensures global geometric and appearance correspondence, the self-attention mechanism refines local semantic consistency, and the residual pathway further stabilizes information propagation. This design effectively reduces structural misalignment and texture drift under complex pose variations, achieving higher alignment accuracy and better detail fidelity.
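The four GSPAM stages can be condensed into one illustrative NumPy function. All projection matrices here are random stand-ins for learned weights, and mean pooling over equal-length groups stands in for adaptive average pooling:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Single-head scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def gspam_stage(pose_q, app, W):
    # (1) cross-attention from pose queries to appearance keys/values + residual,
    # (2) self-attention over the updated queries + residual,
    # (3) residual fusion with a pooled linear projection of the appearance
    #     sequence. W holds random stand-ins for learned projections.
    q = pose_q + attend(pose_q @ W["q"], app @ W["k"], app @ W["v"])
    q = q + attend(q @ W["sq"], q @ W["sk"], q @ W["sv"])
    app_proj = app @ W["r"]
    # pool the appearance sequence down to the query length
    app_pooled = app_proj.reshape(q.shape[0], -1, q.shape[1]).mean(axis=1)
    return q + app_pooled

rng = np.random.default_rng(2)
d, Lq, Lk = 16, 4, 8
W = {k: rng.standard_normal((d, d)) / np.sqrt(d)
     for k in ("q", "k", "v", "sq", "sk", "sv", "r")}
pose_q = rng.standard_normal((Lq, d))
app = rng.standard_normal((Lk, d))
aligned = gspam_stage(pose_q, app, W)
```

Stacking several such stages (sharing the appearance sequence, updating the queries) mirrors the multi-layer structure described in the text.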
3.4. Optimization Objectives and Sampling Strategies
In this work, the initial latent state of the real image is defined as z₀ = E(x₀), where E denotes the VAE encoder. Therefore, the mean squared error loss in Equation (2) can be re-expressed as
| L_mse = E[‖ε − ε_θ(z_t, t, c_s, c_p)‖²] | (36) |
To facilitate the transfer from the source pose to the target pose, as suggested in [20], a source-to-source self-reconstruction task is introduced during the training process. The corresponding reconstruction loss is expressed as
| L_rec = E[‖ε − ε_θ(z̃_t, t, c_s, p_s)‖²] | (37) |
Here, z̃₀ = E(x_s) denotes the latent encoding of the source image, and z̃_t represents the latent representation obtained at time step t after noise is injected into z̃₀. Therefore, the overall objective function can be expressed as
| L = L_mse + L_rec | (38) |
The overall objective is thus the sum of the mean squared error loss and the reconstruction loss. Once the conditional latent diffusion model is trained, the inference process can begin. Specifically, sampling starts from Gaussian noise, i.e., z_T ∼ N(0, I), and at each time step t the denoising network is applied to reverse Equation (1), thereby obtaining the predicted latent representation z_{t−1}. To simultaneously enhance the guidance effects of the source image appearance and the target pose, this study adopts a classifier-free guidance accumulation strategy, which is computed as follows:
| ε̂ = ε_θ(z_t, ∅, ∅) + w_s [ε_θ(z_t, c_s, ∅) − ε_θ(z_t, ∅, ∅)] + w_p [ε_θ(z_t, c_s, c_p) − ε_θ(z_t, c_s, ∅)] | (39) |
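A guidance accumulation of this kind can be sketched as a pure function of the three denoiser outputs. Note this follows the common PIDM-style dual-guidance formulation; the exact weighting in Equation (39) and the default weights used here are assumptions.

```python
import numpy as np

def cfg_accumulate(eps_uncond, eps_src, eps_full, w_src=2.0, w_pose=2.0):
    # Dual classifier-free guidance accumulation (assumed form):
    # start from the unconditional prediction, push toward the
    # source-appearance condition with weight w_src, then toward the
    # full (appearance + pose) condition with weight w_pose.
    return (eps_uncond
            + w_src * (eps_src - eps_uncond)
            + w_pose * (eps_full - eps_src))

rng = np.random.default_rng(1)
e0, e1, e2 = (rng.standard_normal(8) for _ in range(3))
guided = cfg_accumulate(e0, e1, e2)
```

With both weights set to 1 the accumulation telescopes to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, which makes the role of each weight easy to check.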
When the source image is unavailable, a learnable vector is used as the conditional embedding. During training, the conditions c_s and c_p are randomly dropped with a fixed probability to achieve regularized learning. If the target pose is missing, the output of the pose adapter is set to zero. To accelerate the sampling process, this study adopts the DDIM scheduler [21] with 50 sampling steps, consistent with [8,11]. Finally, the generated image is obtained through the VAE decoder, i.e., x̂ = D(z₀). Additionally, the distribution strategy of the time steps follows a cubic function form:
| t = (1 − u³) · T,  u ∼ U(0, 1) | (40) |
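The cubic reweighting can be sketched as follows. The exact functional form is not preserved in this excerpt, so the t = (1 − u³)·T schedule below is an assumption based on the common cubic timestep trick; it biases samples toward large t, i.e., the early, high-noise sampling stages.

```python
import numpy as np

def sample_timesteps(n, T=1000, seed=0):
    # Cubic reweighting (assumed form): t = (1 - u^3) * T, u ~ U(0, 1).
    # The induced density of t grows toward t = T, so the high-noise
    # (early sampling) stages are visited more often during training.
    u = np.random.default_rng(seed).uniform(0.0, 1.0, size=n)
    return (1.0 - u ** 3) * T

t = sample_timesteps(100_000)
frac_high = (t >= 500).mean()   # analytically P(t >= T/2) = 0.5**(1/3) ~ 0.794
```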
Overall, this strategy increases the probability of the time steps falling in the early sampling stages, thereby enhancing the model’s guidance capability, promoting faster convergence, and effectively reducing training time. In summary, the proposed MultiLevelGaborNet is outlined in Algorithm 1.
| Algorithm 1: Multi-Level Gabor Frequency Network |
|---|
| Input: Source image; fractional order; angle set; scale set. |
| Output: Multi-level frequency-aware features. |
4. Experimental Results and Discussion
4.1. Experimental Procedure
All the experiments follow the In-Shop Clothes Retrieval benchmark on DeepFashion [22] with the protocol of [11,21], using resolutions of 256 × 176 and 512 × 352. The dataset contains 52,712 images, split into 101,966 training and 8570 testing pairs following [23]. Market-1501 [24], which has lower resolution and more complex backgrounds, uses 263,632 training and 12,000 testing pairs under the same split strategy [23].
Quantitative evaluation employs FID [25], LPIPS [26], SSIM [27], and PSNR, where FID and LPIPS measure deep feature realism and perceptual similarity, while SSIM and PSNR assess pixel-level reconstruction. Following [11], user studies adopt the Jab indices [8,28,29] to compute the proportion of images judged best, and R2G/G2R metrics [23,30] quantify real–generated indistinguishability.
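Of the pixel-level metrics above, PSNR has a compact closed form; a minimal NumPy sketch (assuming 8-bit images with a peak value of 255):

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    # Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE),
    # where MSE is the mean squared error between the two images.
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM, LPIPS, and FID require windowed statistics or deep feature extractors and are typically computed with library implementations rather than by hand.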
4.2. Implementation Details
All the experiments were conducted on four NVIDIA RTX 4090 GPUs with 24 GB of memory, using Python 3.10, PyTorch 2.0 [31], and the HuggingFace Diffusers framework [32]. The core generative model was based on Stable Diffusion v1.5 [2]. The input image resolution was standardized to 256 × 256. Table 1 summarizes the default settings and the number of trainable parameters for each component.
Table 1.
The number of trainable parameters for each component.
| Component | Default | Trainable Params |
|---|---|---|
| MLGFN | – | 28.0 M |
| GSPAM | – | 103.3 M |
| Swin-B | – | 87.0 M |
| – | – | 18.4 M |
| – | – | 30.6 M |
| – | – | 10.3 M |
During training, the Adam optimizer [33] was employed for a total of 160 epochs, with the initial learning rate scaled according to the overall batch size. A linear warm-up strategy was applied for the first 1000 steps, followed by a learning rate decay by a factor of 0.1 at the 80th epoch. During inference, to enhance structural consistency and pose control, we adopted a classifier-free guidance mechanism, setting the pose prompt weight and the appearance prompt weight to the same value. Moreover, to improve model robustness and generalization, conditional perturbation training was applied by randomly dropping the conditional inputs with a fixed probability during training.
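The warm-up and decay schedule described above can be sketched as a small function. The base learning rate here is an illustrative value, as the paper's exact initial rate is not preserved in this excerpt.

```python
def learning_rate(step, epoch, base_lr=1e-4, warmup_steps=1000, decay_epoch=80):
    # Linear warm-up over the first `warmup_steps` optimizer steps,
    # followed by a single x0.1 decay at `decay_epoch`.
    # base_lr is an illustrative value, not the paper's reported setting.
    lr = base_lr * min(1.0, step / warmup_steps)
    if epoch >= decay_epoch:
        lr *= 0.1
    return lr
```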
4.3. Comparative Experiments
4.3.1. Quantitative Results
DeepFashion: On the DeepFashion dataset, the proposed method was compared with existing PGPIS baselines, including PATN [23], ADGAN [20], GFLA [34], PISE [35], SPGNet [28], DPTN [36], NTED [37], CASD [29], PIDM [8], PoCoLD [11], and CFLD [10]. To ensure a fair comparison, the results of these methods were obtained using the officially released source code and pre-trained checkpoints provided by the respective authors. Among them, PATN, ADGAN, and GFLA are representative GAN-based approaches, whereas PoCoLD, PIDM, and CFLD are diffusion model-based PGPIS methods. The evaluation was conducted on test images with resolutions of 256 × 176 and 512 × 352.
Table 2 and Table 3 report the quantitative evaluation results of our method on the DeepFashion dataset at two resolutions, 256 × 176 and 512 × 352. The comparison metrics include FID, LPIPS, SSIM, and PSNR. Overall, our method consistently outperforms existing methods across all four metrics at both resolutions, demonstrating comprehensive performance advantages.
Table 2.
Comparison of our method with advanced approaches on the DeepFashion dataset at a resolution of 256 × 176.
| Method | Venue | FID ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|---|---|
| Evaluate on 256 × 176 resolution | |||||
| PATN | CVPR19 | 20.731 | 0.2531 | 0.6719 | - |
| ADGAN | CVPR20 | 14.545 | 0.2245 | 0.6738 | - |
| GFLA | CVPR20 | 9.817 | 0.1863 | 0.7062 | - |
| PISE | CVPR21 | 11.520 | 0.2251 | 0.6437 | - |
| SPGNet | CVPR21 | 16.183 | 0.2256 | 0.6959 | 17.222 |
| DPTN | CVPR22 | 17.409 | 0.2087 | 0.6978 | 17.811 |
| NTED | CVPR22 | 8.518 | 0.1757 | 0.7148 | 17.740 |
| CASD | ECCV22 | 13.137 | 0.1781 | 0.7223 | 17.880 |
| PIDM | CVPR23 | 6.812 | 0.2106 | 0.6631 | 15.630 |
| PoCoLD | ICCV23 | 8.069 | 0.1642 | 0.7314 | - |
| CFLD | CVPR24 | 6.804 | 0.1519 | 0.7368 | 18.235 |
| Ours | - | 6.741 | 0.1408 | 0.7687 | 18.839 |
| Ground Truth | - | 7.847 | 0.0000 | 1.0000 | - |
Table 3.
Comparison of our method with advanced approaches on the DeepFashion dataset at a resolution of 512 × 352.
| Method | Venue | FID ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|---|---|
| Evaluate on 512 × 352 resolution | |||||
| CoCosNet2 | CVPR21 | 13.315 | 0.2271 | 0.7226 | - |
| NTED | CVPR22 | 7.621 | 0.1998 | 0.7365 | 17.385 |
| PoCoLD | ICCV23 | 8.411 | 0.1923 | 0.7433 | - |
| CFLD | CVPR24 | 7.150 | 0.1817 | 0.7488 | 17.645 |
| Ours | - | 7.008 | 0.1672 | 0.7694 | 18.110 |
It is noteworthy that even compared to diffusion model-based methods (e.g., PIDM and PoCoLD), our approach still shows superior performance. While these models have made improvements in image realism, their limited conditional control mechanisms prevent them from fully integrating the global semantic information of the source image with the structural features of the target pose, leading to only marginal improvements in perceptual similarity metrics. In contrast, our MLGFN and GSPAM significantly enhance the accuracy of source image detail and structure restoration, while maintaining high image realism.
In the overall architecture, MLGFN achieves the multi-scale modeling of high-frequency features, such as clothing textures and hand details, through fractional-order Gabor filters and frequency-domain attention mechanisms, improving the detail fidelity and structural stability of the generated images. GSPAM establishes a global mapping between pose and appearance features at the semantic level, using cross-modal feature interaction and attention alignment mechanisms to precisely fuse the appearance information of the person with the target pose, alleviating the appearance distortion issues under large pose variations. The synergy of these two components ensures that the model maintains geometric consistency and structural integrity in regions such as joints and hands, even in complex scenes, and significantly improves structural consistency metrics (e.g., SSIM). This demonstrates exceptional cross-pose generalization and high-fidelity generation performance.
4.3.2. Market-1501
On the Market-1501 dataset at 128 × 64 resolution (Table 4), our method was benchmarked against state-of-the-art models including GFLA, XingGAN, SPGNet, PIDM, CFLD, and DPTN. Across all four quantitative metrics, our approach consistently achieves the best performance, with particularly large gains in FID and LPIPS, indicating superior realism, distribution alignment, and perceptual detail preservation. The higher SSIM and PSNR scores further confirm stronger structural consistency and pixel-level reconstruction accuracy. These results collectively demonstrate the effectiveness of the proposed multi-level fractional-order Gabor frequency-aware network and semantic–pose alignment module in restoring fine details and achieving robust cross-pose generalization for low-resolution pedestrian image generation in complex backgrounds.
Table 4.
Comparison of the proposed method with state-of-the-art approaches on the Market-1501 dataset at a resolution of 128 × 64.
| Method | Venue | FID ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|---|---|
| Evaluate on 128 × 64 resolution | |||||
| GFLA | CVPR20 | 19.741 | 0.2825 | 0.2808 | 14.337 |
| XingGAN | ECCV20 | 22.520 | 0.3049 | 0.3044 | 14.446 |
| SPGNet | CVPR21 | 23.517 | 0.2787 | 0.3139 | 14.489 |
| DPTN | CVPR22 | 18.996 | 0.2712 | 0.2854 | 14.521 |
| PIDM | CVPR23 | 14.451 | 0.2415 | 0.3054 | - |
| CFLD | CVPR24 | 11.972 | 0.2636 | 0.2854 | 15.337 |
| Ours | - | 11.821 | 0.2451 | 0.3263 | 15.349 |
| Ground Truth | - | 4.845 | 0.0000 | 1.0000 | - |
4.3.3. Qualitative Results
Figure 7 presents a visual comparison of the results obtained by our method and other state-of-the-art methods, including SPGNet, DPTN, NTED, CASD, GFLA, XingGAN, and PIDM.
Figure 7.
Visual comparison between our method and state-of-the-art methods.
As shown in Figure 7 (rows 1–2), our method demonstrates superior structural preservation and high-frequency texture restoration on source images with rich details, effectively suppressing artifacts such as fractures and blurriness. This benefit mainly comes from the MLGFN, whose adaptive fractional-order frequency-domain filtering and multi-branch frequency-aware modeling enable the precise capture of non-stationary textures. In structure-sensitive regions—including hands, clothing edges, and patterned textures—as shown in Figure 7 (rows 3–4), traditional GAN- and diffusion-based baselines often suffer from disconnections, blurring, or misalignment, whereas the proposed GSPAM establishes a global semantic mapping between pose and appearance through cross-modal attention and Swin Transformer-based feature interactions, achieving accurate appearance alignment and improved structural coherence. Even under large pose variations or occlusions, as shown in Figure 7 (rows 5–6), combining MLGFN’s multi-level frequency-aware features with GSPAM’s semantic–pose alignment representations forms a unified conditional modulation for the Stable Diffusion denoising process, ensuring consistent structure and realistic texture reconstruction in challenging scenarios. Additional qualitative results are provided in Figure 8.
Figure 8.
Generated results of our method.
To comprehensively evaluate the computational cost and scalability of the proposed framework, we compared model parameters, Floating Point Operations (FLOPs), and inference latency against the state-of-the-art diffusion baseline PIDM, the pose-constrained latent diffusion model PoCoLD, and the lightweight method CFLD. The experiments were performed on a single NVIDIA RTX 4090 GPU at the two standard resolutions of 256 × 176 and 512 × 352.

Quantitative comparison: As presented in Table 5, our FreqPose model contains 277.6 M parameters, making it significantly more parameter efficient than both PIDM (688.0 M) and PoCoLD (395.9 M), while remaining comparable to the lightweight CFLD (248.2 M). At the lower resolution (256 × 176), our method requires 112.4 G FLOPs and 0.52 s per image, which is highly competitive with PoCoLD (100.0 G; 0.50 s) and CFLD (89.5 G; 0.45 s). The marginal overhead (∼15% in latency vs. CFLD) is primarily attributed to the multi-level fractional Gabor calculations in the MLGFN module, yet it remains well within the range for real-time applications. At the higher resolution (512 × 352), the computational advantage over heavy diffusion models becomes more pronounced: PIDM requires 4.50 s per image, making it impractical for efficient deployment, whereas FreqPose achieves high-fidelity generation in 1.87 s, an inference speed comparable to lighter methods.

Trade-off analysis: While FreqPose introduces a slight computational cost over the lightest baselines, this trade-off is justified by the substantial performance gains. For instance, although PoCoLD achieves a similar inference speed to our method, its texture fidelity is limited (SSIM 0.743 in Table 3). In contrast, FreqPose achieves superior quality (SSIM 0.769), outperforming PoCoLD and CFLD by a clear margin. The proposed framework thus strikes a favorable balance, delivering state-of-the-art generation quality with a compact parameter size and inference speeds suitable for practical deployment.
Table 5.
Efficiency and performance comparison with state-of-the-art methods at different resolutions. Metrics are measured on a single NVIDIA RTX 4090. FreqPose achieves the best trade-off, maintaining efficient inference speed at high resolutions while delivering superior texture fidelity.
| Method | Params (M) | Efficiency (256 × 176) | | | Efficiency (512 × 352) | | | Performance (512 × 352) | |
|---|---|---|---|---|---|---|---|---|---|
| | | FLOPs (G) | Time (s) | Mem (GB) | FLOPs (G) | Time (s) | Mem (GB) | FID ↓ | SSIM ↑ |
| PIDM | 688.0 | 185.3 | 1.25 | 8.5 | 741.2 | 4.50 | 14.2 | - | - |
| PoCoLD | 395.9 | 100.0 | 0.50 | 6.8 | 400.0 | 1.80 | 10.5 | 8.41 | 0.743 |
| CFLD | 248.2 | 89.5 | 0.45 | 5.8 | 358.0 | 1.62 | 9.0 | 7.15 | 0.748 |
| FreqPose (Ours) | 277.6 | 112.4 | 0.52 | 9.2 | 449.6 | 1.87 | 13.5 | 7.01 | 0.769 |
4.4. Ablation Experiments
To validate the superiority of the proposed MLGFN and GSPAM in texture feature extraction and semantic alignment, we conducted a series of systematic ablation experiments across multiple feature encoding baselines. As shown in Table 6, we configured six setups. Baseline B1 uses a VAE to directly encode the source image. B2 employs the CLIP image encoder to extract features from the source image. B3 follows mainstream diffusion approaches [8,11], utilizing multi-scale features extracted by general pre-trained models such as the Swin Transformer as conditional embeddings. B4 uses the proposed multi-level fractional-order Gabor frequency-aware network as the core texture feature extractor, with its output serving as the bias term for the query in the UNet cross-attention mechanism to achieve fine-grained control of texture details. B5 employs the traditional CLIP image encoder for texture extraction while incorporating the proposed GSPAM for semantic alignment. Finally, B6 represents the complete model proposed in this paper.
Table 6.
Ablation experiment.
| Method | Embedding | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|
| B1 | VAE | 0.2021 | 0.6948 |
| B2 | CLIP | 0.2093 | 0.6948 |
| B3 | Swin-B | 0.1652 | 0.7121 |
| B4 | MLGFN | 0.1468 | 0.7399 |
| B5 | GSPAM | 0.1493 | 0.7128 |
| B6 (Ours) | FreqPose | 0.1408 | 0.7687 |
The experimental results (visualized in Figure 9) indicate that the VAE- and CLIP-based configurations (B1 and B2) perform the weakest in texture restoration, with even simple fabric textures failing to be fully preserved, highlighting the clear modal gap between general-purpose visual encoders and diffusion generation tasks. While the Swin Transformer architecture used in B3 has advantages in semantic understanding, it still has inherent limitations in high-frequency texture modeling. In contrast, the proposed frequency-aware network (B4) shows significant improvements in reconstruction metrics such as SSIM and LPIPS, demonstrating its strength in high-frequency detail capture and restoration. Its performance not only surpasses all baselines but also underscores the advantage, in generation tasks, of a frequency-aware module specifically designed for texture characteristics over general pre-trained encoders. B5 introduces the Global Semantic–Pose Alignment Module on top of the CLIP encoder, explicitly establishing cross-modal correspondences between pose and appearance in the high-level semantic space; this improves structural consistency and global coherence, further enhancing the overall perceptual quality of cross-pose person image synthesis.
Figure 9.
Visualization of ablation experiment.
Fine-grained Ablation Study. To further investigate the effectiveness of the internal designs within MLGFN and GSPAM, we conducted a component-wise ablation study on the DeepFashion dataset. The results are reported in Table 7.
Table 7.
Component-wise ablation study of MLGFN and GSPAM on the DeepFashion dataset. The results validate the necessity of each internal design choice.
| Module | Variant | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|
| MLGFN | w/o Fractional Order (fixed to 1) | 0.155 | 0.748 |
| w/o Frequency Attention | 0.151 | 0.753 | |
| GSPAM | w/o Cross-Attention (Concat) | 0.162 | 0.741 |
| w/o Self-Attention | 0.148 | 0.758 | |
| Full Model | Proposed Method | 0.1408 | 0.7687 |
(1) Effectiveness of MLGFN Components: We analyzed two variants to validate the texture extraction module:
w/o Fractional Order: We fixed the fractional order to 1, effectively degrading the fractional Gabor filters to standard Gabor filters. As shown in Table 7, SSIM drops from 0.769 to 0.748. This decline confirms that the rotational flexibility provided by the fractional order is crucial for capturing non-stationary textures like fabric wrinkles, which standard integer-order filters fail to model effectively.
w/o Frequency Attention: Removing the frequency-domain attention mechanism leads to a performance drop (LPIPS increases to 0.151). This indicates that the global phase alignment captured by the attention module is essential for maintaining texture continuity and reducing artifacts.
(2) Effectiveness of GSPAM Components: We evaluated the contribution of the cross-modal alignment strategy:
w/o Cross-Attention: Replacing the cross-attention mechanism with simple feature concatenation results in a significant degradation (SSIM drops to 0.741). This validates our hypothesis that direct concatenation cannot establish precise spatial correspondence between the source appearance and target pose, leading to structural misalignment.
w/o Self-Attention: Removing the subsequent self-attention layer also hurts performance (SSIM 0.758), suggesting that refining local semantic consistency within the pose queries is beneficial for generating coherent body structures.
4.5. User Study
To evaluate the perceptual realism of the generated images and the superiority of our method, we conducted two user studies following the PIDM [8] standard, with a total of 120 participants:
(1) Realism Test (R2G/G2R):
The participants were asked to determine whether the images were real photographs or model-generated. The results, shown in Figure 10, reveal that our method achieved the highest G2R score, significantly outperforming the comparison methods. The generated images were perceived as more realistic, with textures, structures, and lighting details more closely resembling real images. The lower R2G score indicates that participants were still able to accurately identify real images, validating that the model improvements stem from objective perceptual optimization. This advantage is attributed to MLGFN's precise modeling of high-frequency textures and GSPAM's constraints on structural consistency.
Figure 10.
Our method is compared with the excellent methods in terms of subjective metrics.
(2) Subjective Preference Test (Jab):
Participants selected, from the outputs of multiple methods, the image most consistent with the target pose. Our method achieved the highest Jab score, a clear improvement over the next best method, excelling in texture reconstruction, pose alignment, and semantic consistency. Qualitative feedback highlighted that our method produced natural transitions in joint connections and edge details, demonstrating the synergistic advantage of MLGFN and GSPAM.
In summary, both in terms of perceptual realism (G2R/R2G) and subjective preference (Jab), our method significantly outperforms existing approaches, showcasing the visual credibility and application potential of high-fidelity cross-pose person image generation.
5. Conclusions
This study systematically addresses the issues of visual distortion and structural inconsistency in pose-guided person image generation, arising from insufficient texture modeling and inadequate semantic alignment. To tackle these challenges, we propose a diffusion-based generative framework that integrates frequency awareness and semantic alignment. By introducing a MLGFN and a semantic–pose alignment module, our method significantly enhances the model’s ability to perform high-frequency texture modeling and cross-pose semantic mapping. MLGFN employs fractional-order Gabor filtering and complex-domain self-attention to robustly extract clothing textures and edge details in the frequency domain, while the semantic–pose alignment module leverages hierarchical semantic features from the Swin Transformer to achieve precise correspondence between appearance and target pose. The experimental results on the DeepFashion and Market-1501 datasets demonstrate that our method outperforms existing approaches in multiple objective metrics and subjective visual quality, showing clear advantages in complex texture reconstruction and large pose transformation tasks. Future work will explore the integration of explicit 3D human priors and multimodal semantic guidance mechanisms to further enhance the model’s generalization and generation stability.
Acknowledgments
We thank the Faculty of Information Engineering and Automation, Kunming University of Science and Technology for providing computational resources.
Author Contributions
Conceptualization, M.W. and B.W.; methodology, M.W. and B.W.; investigation, B.W.; formal analysis, B.W.; validation, H.C., J.R. and X.T.; data curation, H.C. and J.R.; visualization, X.T.; resources, M.W.; writing—original draft preparation, M.W. and B.W.; writing—review and editing, M.W. and B.W.; supervision, M.W. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available datasets were analyzed in this study. The data can be found at the following sources: DeepFashion (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html, accessed on 27 June 2016) and Market-1501 (https://www.kaggle.com/datasets/pengcw1/market-1501, accessed on 7 December 2015). The project code used to support the findings of this study is available at (https://github.com/bing32475/FrGPose) (accessed on 15 February 2026).
Conflicts of Interest
The authors declare no conflicts of interest.
Funding Statement
The work was supported by the National Natural Science Foundation of China (62466031) and the Faculty of Information Engineering and Automation, Kunming University of Science and Technology.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1.Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative adversarial networks. Commun. ACM. 2020;63:139–144. doi: 10.1145/3422622. [DOI] [Google Scholar]
- 2.Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B. High-resolution image synthesis with latent diffusion models; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA. 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- 3.Tang H., Bai S., Zhang L., Torr P.H., Sebe N. Proceedings of the European Conference on Computer Vision. Springer; Cham, Switzerland: 2020. Xinggan for person image generation; pp. 717–734. [Google Scholar]
- 4.Liu J., Zhu Y. Mutually activated residual linear modeling GAN for pose-guided person image generation. Neurocomputing. 2022;514:451–463. doi: 10.1016/j.neucom.2022.09.089. [DOI] [Google Scholar]
- 5.Liu J., Weng Z., Zhu Y. Precise region semantics-assisted GAN for pose-guided person image generation. CAAI Trans. Intell. Technol. 2024;9:665–678. doi: 10.1049/cit2.12255. [DOI] [Google Scholar]
- 6.Chen X., Song J., Hilliges O. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2019) Computer Vision Foundation (CVF); New York, NY, USA: 2019. Unpaired pose guided human image generation. [Google Scholar]
- 7.Ma L., Sun Q., Georgoulis S., Van Gool L., Schiele B., Fritz M. Disentangled person image generation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–23 June 2018; pp. 99–108. [Google Scholar]
- 8.Bhunia A.K., Khan S., Cholakkal H., Anwer R.M., Laaksonen J., Shah M., Khan F.S. Person image synthesis via denoising diffusion model; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada. 17–24 June 2023; pp. 5968–5976. [Google Scholar]
- 9.Shen F., Tang J. Imagpose: A unified conditional framework for pose-guided person generation. Adv. Neural Inf. Process. Syst. 2024;37:6246–6266. [Google Scholar]
- 10.Lu Y., Zhang M., Ma A.J., Xie X., Lai J. Coarse-to-fine latent diffusion for pose-guided person image synthesis; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 16–22 June 2024; pp. 6420–6429. [Google Scholar]
- 11.Han X., Zhu X., Deng J., Song Y.Z., Xiang T. Controllable person image synthesis with pose-constrained latent diffusion; Proceedings of the IEEE/CVF International Conference on Computer Vision; Paris, France. 1–6 October 2023; pp. 22768–22777. [Google Scholar]
- 12.Ju X., Zeng A., Zhao C., Wang J., Zhang L., Xu Q. Humansd: A native skeleton-guided diffusion model for human image generation; Proceedings of the IEEE/CVF International Conference on Computer Vision; Paris, France. 1–6 October 2023; pp. 15988–15998. [Google Scholar]
- 13.Khandelwal A. Reposedm: Recurrent pose alignment and gradient guidance for pose guided image synthesis; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 17–18 June 2024; pp. 2495–2504. [Google Scholar]
- 14.Guo L., Song K., Xu M., Lai H. DNAF: Diffusion with Noise-Aware Feature for Pose-Guided Person Image Synthesis; Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME); Niagara Falls, ON, Canada. 15–19 July 2024; pp. 1–6. [Google Scholar]
- 15.Cheong S.Y., Mustafa A., Gilbert A. Upgpt: Universal diffusion model for person image generation, editing and pose transfer; Proceedings of the IEEE/CVF International Conference on Computer Vision; Paris, France. 2–6 October 2023; pp. 4173–4182. [Google Scholar]
- 16.Kim J., Kim M.J., Lee J., Choo J. Proceedings of the European Conference on Computer Vision. Springer; Cham, Switzerland: 2024. Tcan: Animating human images with temporally consistent pose guidance using diffusion models; pp. 326–342. [Google Scholar]
- 17.Wu J., Si S., Wang J., Xiao J. Improving Human Image Synthesis with Residual Fast Fourier Transformation and Wasserstein Distance; Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN); Padua, Italy. 18–23 July 2022; pp. 1–8. [Google Scholar]
- 18.Yang G., Liu W., Liu X., Gu X., Cao J., Li J. Delving into the frequency: Temporally consistent human motion transfer in the fourier space; Proceedings of the 30th ACM International Conference on Multimedia; Lisboa, Portugal. 10–14 October 2022; pp. 1156–1166. [Google Scholar]
- 19.He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 770–778. [Google Scholar]
- 20.Men Y., Mao Y., Jiang Y., Ma W.Y., Lian Z. Controllable person image synthesis with attribute-decomposed gan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 13–19 June 2020; pp. 5084–5093. [Google Scholar]
- 21.Salimans T., Goodfellow I., Zaremba W., Cheung V., Radford A., Chen X. Improved techniques for training gans. Adv. Neural Inf. Process. Syst. 2016;29:2234. [Google Scholar]
- 22.Liu Z., Luo P., Qiu S., Wang X., Tang X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 1096–1104. [Google Scholar]
- 23. Zhu Z., Huang T., Shi B., Yu M., Wang B., Bai X. Progressive pose attention transfer for person image generation; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA. 15–20 June 2019; pp. 2347–2356.
- 24. Zheng L., Shen L., Tian L., Wang S., Wang J., Tian Q. Scalable person re-identification: A benchmark; Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile. 7–13 December 2015; pp. 1116–1124.
- 25. Heusel M., Ramsauer H., Unterthiner T., Nessler B., Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium; Proceedings of the 31st International Conference on Neural Information Processing Systems; Long Beach, CA, USA. 4–9 December 2017.
- 26. Zhang R., Isola P., Efros A.A., Shechtman E., Wang O. The unreasonable effectiveness of deep features as a perceptual metric; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–23 June 2018; pp. 586–595.
- 27. Wang Z., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004;13:600–612. doi: 10.1109/TIP.2003.819861.
- 28. Lv Z., Li X., Li X., Li F., Lin T., He D., Zuo W. Learning semantic person image generation by region-adaptive normalization; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 20–25 June 2021; pp. 10806–10815.
- 29. Zhou X., Yin M., Chen X., Sun L., Gao C., Li Q. Cross attention based style distribution for controllable person image synthesis; Proceedings of the European Conference on Computer Vision; Springer; Cham, Switzerland. 2022; pp. 161–178.
- 30. Ma L., Jia X., Sun Q., Schiele B., Tuytelaars T., Van Gool L. Pose guided person image generation. Adv. Neural Inf. Process. Syst. 2017;30:406–416.
- 31. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019;32:8024–8035.
- 32. von Platen P., Patil S., Lozhkov A., Cuenca P., Lambert N., Rasul K., Davaadorj M., Nair D., Paul S., Liu S., et al. Diffusers: State-of-the-Art Diffusion Models (Version 0.12.1) [Python]. 2022. [(accessed on 4 February 2026)]. Available online: https://github.com/huggingface/diffusers.
- 33. Kingma D.P., Ba J. Adam: A method for stochastic optimization. arXiv. 2014. doi: 10.48550/arXiv.1412.6980.
- 34. Ren Y., Yu X., Chen J., Li T.H., Li G. Deep image spatial transformation for person image generation; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 13–19 June 2020; pp. 7690–7699.
- 35. Zhang J., Li K., Lai Y.K., Yang J. PISE: Person image synthesis and editing with decoupled GAN; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 20–25 June 2021; pp. 7982–7990.
- 36. Zhang P., Yang L., Lai J.H., Xie X. Exploring dual-task correlation for pose guided person image generation; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA. 18–24 June 2022; pp. 7713–7722.
- 37. Ren Y., Fan X., Li G., Liu S., Li T.H. Neural texture extraction and distribution for controllable person image synthesis; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA. 18–24 June 2022; pp. 13535–13544.
Data Availability Statement
Publicly available datasets were analyzed in this study. The data can be found at the following sources: DeepFashion (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html, accessed on 27 June 2016) and Market-1501 (https://www.kaggle.com/datasets/pengcw1/market-1501, accessed on 7 December 2015). The project code used to support the findings of this study is available at https://github.com/bing32475/FrGPose (accessed on 15 February 2026).