Abstract
The explosive growth of medical data poses significant challenges for storage and sharing. Current compression techniques utilizing Implicit Neural Representations (INRs) effectively strike a balance between encoding accuracy and compression ratio, yet they suffer from slow encoding speeds. By contrast, data-driven compressors encode fast but rely heavily on the training data and cannot generalize well. To develop a practical compression tool that overcomes all these limitations, we introduce Shape-Texture Decoupled Compression (DeepSTD), which targets datasets of the same modality and body part and decouples the variations into shape and texture components for separate encoding. Disentangling the two components facilitates designing encoding strategies suited to their respective characteristics—swift shape encoding based on INRs and effective data-driven texture encoding. The proposed approach combines the advantages of INR-based and data-driven models to achieve high fidelity, fast encoding speed, and good generalizability. Comprehensive evaluations on large-scale Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) datasets demonstrate superior performance across encoding quality, compression ratio, and speed. Moreover, with features such as parallel acceleration across multiple Graphics Processing Units (multi-GPU), flexible control of the compression ratio, and broad applicability, DeepSTD offers a robust and efficient solution for the pressing demands of modern medical data compression.
Subject terms: Research data, Medical research
The authors introduce Shape-Texture Decoupled Compression (DeepSTD), a practical compression tool for medical imaging data. They provide evaluations on imaging datasets of different modalities, offering a route towards a solution for the pressing demands of modern medical data compression.
Introduction
As the resolution of medical images increases with advances in imaging technologies1–7 and healthcare access expands8–11, the data volume in medical institutions grows significantly12–17. For example, the CTSpine1K dataset18, consisting of just 1,005 Computed Tomography (CT) scans, requires nearly 200 GB of storage, while estimates based on CT scanners in the USA19 suggest that at least 30,000 TB of CT data are generated annually. Many other widely used modalities, such as Magnetic Resonance Imaging (MRI), ultrasound, and X-ray, produce even more data, creating major storage and sharing challenges20–23. In addition, various information technologies have been applied to medical image processing, data mining, and large-scale data management, but limited storage space and bandwidth often force downsampling or filtering of datasets24. Although high-capacity servers25,26 offer some relief, user-side storage and bandwidth constraints continue to slow research progress and raise costs. Therefore, efficient data encoding is crucial to reduce the volume of massive medical image data, which will advance the diverse fields related to medical images.
The explosive growth of medical data necessitates high-performance data compression techniques to accommodate the limited storage and bandwidth. Existing algorithms can be broadly categorized into three classes: (i) Conventional image and video compression techniques, such as H.26427, HEVC28, and VVC29, which are primarily intended for natural scenes, and thus struggle with medical image data, leading to either low compression ratios or noticeable artifacts due to the distinct characteristics of medical images. (ii) Data-driven compression techniques design deep neural networks30–32 and leverage supervised training to derive optimal compression specifications from extensive datasets, and have demonstrated enhanced performance compared to conventional methods, yet they cannot generalize well across various datasets and scenarios. (iii) Implicit neural representation (INR)-based techniques utilize the powerful representational capacity of deep neural networks to represent the original data as a continuous function, which pursues an optimal network to map each spatial coordinate towards its corresponding image intensity33. Among these options, the INR methods strike the best balance between data fidelity and compression ratio, but their reliance on time-consuming optimization processes limits their effectiveness in settings like online clinical diagnostics and real-time monitoring. While substantial advancements have been made, each of these methods comes with its own set of benefits and drawbacks. Few existing compressors can effectively combine fast compression with high data fidelity at sizable compression ratios.
Texture analysis has long been a fundamental tool for understanding medical images. Haralick et al. proposed gray-level co-occurrence matrix (GLCM)-based texture features, which have since been extended and widely used in medical image analysis tasks, including tumor characterization, tissue classification, and disease diagnosis34,35. In parallel, the Digital Imaging and Communications in Medicine (DICOM) standard has provided a unified format for the storage and transmission of medical images, along with metadata describing acquisition parameters and patient-specific context, which are crucial for texture interpretation in clinical settings. These developments indicate a strong connection between texture features and DICOM image properties: textures capture local intensity variations, while DICOM preserves the overall quality of the acquisition. Building on these foundations, our shape-texture decoupled compression revisits the challenge from a representational perspective, where the “shape” component reflects global structural variations and the “texture” component corresponds to fine-scale intensity patterns, effectively bridging classical texture analysis with contemporary neural representations.
INR-based encoding runs slowly because medical data encompass large individual variations and rich details, so the encoder needs a long time to fit the complex function mapping spatial coordinates to intensities33,36–38. To accelerate the encoding process, we propose to focus on datasets of the same region and modality, which are widely available in clinical applications, and to exploit their within-group similarity for fast compression. Here we draw inspiration from research on medical templates39,40, where global anatomical variability and local intensity fluctuations are modeled as separable components. This idea echoes classical image processing principles of shape-texture separation (e.g., Haralick’s texture descriptors and DICOM metadata34). In our framework, shape is represented as a smooth deformation field well-suited for implicit neural representations, while texture—after template alignment—becomes a more homogeneous residual pattern suitable for data-driven encoding. This conceptually grounded decoupling ensures that each component is efficiently compressed by a dedicated algorithm, thereby avoiding the entanglement of shape and texture. In this paradigm, the data compression process can be revisited from a template-based perspective, i.e., encoding and decoding are performed with respect to a reference template of the data group. Benefiting from the shape-texture decoupling, supported by a template-based anatomical prior, one can adopt proper encoding strategies separately for the two components. The template provides a spatial reference ensuring that shape encoding captures smooth anatomical deformations while texture encoding operates on anatomically aligned data, thus enhancing fidelity and compression efficiency.
Building on this analysis, we propose a decompositional encoding scheme—Shape-Texture Decoupled Compression (DeepSTD). On the one hand, the shape difference resembles an elastic deformation and can be fitted efficiently by an INR. On the other hand, after geometric alignment, the diversity within the data group is largely reduced and can be modeled well by a pre-trained network learned in a data-driven manner. The registration and decreased texture variation also mitigate the generalization issue of data-driven compressors. As illustrated in Fig. 1a, we divide the data encoding task into two distinct phases: shape encoding, which models the shape differences between the data and the template with two INRs; and texture encoding, which encodes the texture after the shape is aligned with the template. The shape and texture information are concatenated to form the final compressed data. In implementation, we specially designed a Cycle INR algorithm tailored for shape encoding and a 3DTCMPE (3D Transformer-CNN Mixed Encoder with Positional Encoding) algorithm optimized for texture encoding. This decoupled approach leverages the strengths of INRs for fast low-frequency encoding38 while enabling efficient and effective data-driven compression of textures, and thus simultaneously achieves accelerated encoding, high compression efficiency, and high compression accuracy. The decoupled strategy also retains the advantage of INRs in precisely controlling the compression ratio while overcoming the limitation of data-driven methods that heavily rely on extensive training datasets. To handle the complex and diverse shapes inherent in medical image data, we further introduce a progressive encoding strategy. Moreover, for cases with limited medical training data, we developed data augmentation and knowledge distillation strategies, enhancing the robustness and effectiveness of our approach in these scenarios.
Fig. 1. The schematic of DeepSTD and its compression performance on the CT and MRI data.
a DeepSTD decouples the medical data compression into two stages: shape encoding and texture encoding. In the former stage, we solve an optimization problem to obtain the optimal pair of forward and backward deformation fields, both parameterized as implicit neural representations (INRs). The forward deformation field registers the given data to the template while the backward deformation field is stored as shape information. In the latter stage, the registered data is passed through a pre-trained Transformer-CNN mixed encoder to extract texture information. Finally, the shape and texture information are concatenated together as the compressed data. b Visualization of an exemplar decompressed CT data by DeepSTD, for a sample from the CTSpine1K dataset18. The leftmost is a 3D rendering of the decompressed CT, with APSILR indicating the orientations: Anterior (A), Posterior (P), Superior (S), Inferior (I), Left (L), and Right (R). The middle column shows three representative slices highlighted by blue, green, and orange frames, alongside their respective residues with respect to the original data on the rightmost column. c Visualization of a decompressed MRI data volume from the Amos dataset50, with the same layout as in (b). Panels b and c show the selected templates for CTSpine1K and Amos datasets, respectively. d Comparison of DeepSTD's compression quality and speed against the state-of-the-art (SOTA) algorithms, performed on the CTSpine1K dataset (CT modality) at 256 × compression and evaluated in terms of Mean Absolute Error (MAE) and Structural Similarity Index (SSIM). e DeepSTD's performance in comparison to baseline compressors on the Amos dataset (MRI modality) at 128 × compression, using the same legend as in (d).
DeepSTD has been extensively evaluated on large-scale CT and MRI datasets, demonstrating superior performance across encoding quality, compression ratio, and speed. As the algorithm with the highest encoding quality, DeepSTD delivers a two-order-of-magnitude speed improvement over the best INR-based algorithm—BRIEF33, across all tested datasets and compression ratios. DeepSTD also applies to a wide range of medical images. These results establish DeepSTD as the leading solution for modern medical data compression. Beyond this exceptional performance, DeepSTD offers practical features that enhance its utility: it supports acceleration with multiple graphics processing units (multi-GPU), significantly speeding up the encoding and decoding processes, and can be trained effectively even with limited training data by leveraging data augmentation and knowledge distillation strategies. Furthermore, it provides stable and flexible control over compression ratios, ensuring a precise fit to the available storage or bandwidth.
The key contributions of this work can be summarized as follows:
Proposal of a shape-texture decoupled compression scheme (DeepSTD). We introduce a template for data of the same organ(s) and modality, and decouple the given data into shape and texture components with respect to the template for separate encoding. The proposed scheme uses INR-based and data-driven compression algorithms to compress the shape and texture components, respectively, aggregating the advantages of both regimes for leading compression performance.
Design of a medical-image-specific algorithm. We propose a compressor that encompasses a Cycle INR optimized for shape encoding and a 3DTCMPE algorithm designed for texture encoding, implemented progressively to handle the complex and diverse variations in medical data.
Leading performance across diverse data. DeepSTD demonstrates superior performance on various medical data in terms of a joint assessment of data fidelity, compression ratio, and speed, achieving up to a two-order-of-magnitude speed improvement over the existing best-performance compressor while maintaining the best encoding quality.
DeepSTD’s advantageous features supporting its practical applications. DeepSTD allows multi-GPU acceleration, efficient training from limited data, stable and flexible compression ratio control, and wide applicability across diverse medical images, establishing it as a versatile and efficient solution for modern medical data compression.
Results
The principle of the proposed DeepSTD
Shape-Texture Decoupled Compression (DeepSTD) decouples the shape and texture variations in a specific data set (the same modality and body part) and conducts compact encoding separately using methodologies suited to their characteristics. Encoding and decoding are conducted with respect to an introduced template that describes the “average” or “typical” shape of this set, and the complete workflow of DeepSTD is shown in Fig. 1a. During encoding, for a given data volume, DeepSTD first extracts and compactly encodes its geometrical deviations from the template as the shape information. Then the given data is aligned with the template to remove shape individuality and undergoes texture encoding to retrieve the texture information. Finally, the shape and texture information are packaged into the compressed data. The decoding process is the reverse of encoding: the texture and shape information are unpacked from the encoded file, followed by texture decoding and shape decoding (morphing back to the original shape) sequentially.
The rationale behind decoupling the encoding process into shape and texture components is twofold. First, shape information is predominantly smooth and can be easily represented as a continuous function, which is inherently suitable for efficient encoding using INRs and supports fast, high-performance encoding. Second, after being registered to the same template, the textures of different samples are more semantically aligned, which enables data-driven methods to efficiently remove redundancies and achieve both high fidelity and a high compression ratio. Eliminating the shape variations also helps to resolve the generalization issue faced by data-driven algorithms. Together, these two principles underpin the proposed DeepSTD framework, enabling accelerated encoding, enhanced compression quality, and high robustness across datasets. In the following, we provide a detailed explanation of the key designs of the shape and texture encoding algorithms, which specifically consider the unique characteristics of medical data.
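To make the two-stage workflow concrete, the encode/decode pipeline described above can be sketched in a few lines of Python. This is an illustrative schematic, not the released implementation; the codec interfaces (`shape_codec.encode`, `texture_codec.encode`, etc.) are hypothetical placeholders standing in for the Cycle INR and 3DTCMPE modules.

```python
from dataclasses import dataclass

@dataclass
class CompressedData:
    shape_code: bytes    # INR parameters of the backward deformation field
    texture_code: bytes  # latent code from the data-driven texture encoder

def compress(volume, template, shape_codec, texture_codec):
    """Two-stage DeepSTD-style encoding (illustrative sketch)."""
    # 1) Shape encoding: fit the forward/backward deformation fields
    #    with respect to the template; keep the backward field as shape code.
    forward_field, shape_code = shape_codec.encode(volume, template)
    # 2) Texture encoding: encode the volume after alignment to the template.
    aligned = forward_field(volume)
    texture_code = texture_codec.encode(aligned)
    return CompressedData(shape_code, texture_code)

def decompress(data, shape_codec, texture_codec):
    """Decoding reverses the two stages: texture first, then shape."""
    aligned = texture_codec.decode(data.texture_code)
    backward_field = shape_codec.decode(data.shape_code)
    return backward_field(aligned)  # morph back to the original shape
```

The concatenation of `shape_code` and `texture_code` corresponds to the packaged compressed file in Fig. 1a.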
The shape encoding module aims to encode the shape differences between the given data and a predefined template. To facilitate encoding and decoding, we define a pair of reciprocal mappings—a forward deformation field that registers the given data to the predefined template and a backward field that deforms conversely. We then designed an algorithm to pursue these two reciprocal deformation fields and store the backward deformation field as the shape code of the given data. We employ INRs38 to represent the deformation fields, i.e., encoding the fields with INR parameters. To ensure the reciprocity of the forward and backward deformations, we draw inspiration from the cyclic registration algorithm proposed in41 and simultaneously optimize both fields by introducing a cyclic loss function that encourages the deformation fields to approximate inverse transformations of each other. The joint optimization of the two fields outperforms sequential optimization in computational complexity, as shown in Supplementary Fig. 9 (a). Benefiting from their capability and suitability for describing continuous 3D morphing, INRs provide superior registration accuracy, fewer parameters, and higher encoding speed than other methods, as shown in Supplementary Fig. 8. Considering that even for the same body part, medical image data exhibit significant shape variations—such as varying body types of different individuals, pre- and post-surgical changes, the appearance of lesions, or slight variations in imaging regions—we introduced a gradual degradation strategy to enhance the robustness of shape encoding, downsampling the data before shape encoding and redesigning the INR hyperparameters to focus on the geometrical deformations and disregard texture details. The contribution of this strategy is illustrated in Supplementary Fig. 9 (b), demonstrating consistent performance improvements.
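The cyclic loss above penalizes any departure of the composed forward and backward fields from the identity map. A minimal numerical sketch (assuming the fields are callables mapping coordinate arrays to coordinate arrays; in the actual method both are INRs and the loss also includes registration terms):

```python
import numpy as np

def cycle_loss(forward, backward, coords):
    """Cycle-consistency penalty: composing the forward and backward
    deformation fields should approximately yield the identity map.
    `forward`/`backward` take and return coordinate arrays."""
    # data -> template -> data round trip
    err_fb = backward(forward(coords)) - coords
    # template -> data -> template round trip
    err_bf = forward(backward(coords)) - coords
    # mean squared deviation from the identity over both round trips
    return float(np.mean(err_fb ** 2) + np.mean(err_bf ** 2))
```

When the two fields are exact inverses the loss vanishes; during joint optimization, minimizing this term pushes them toward perfect reciprocity.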
The texture encoding module aims to represent the geometrically aligned medical images compactly. After being registered to the template, the data share high similarity and one can learn a dedicated encoder in a data-driven manner. Referring to the common textures of the training set as an internal “template”, this data-driven texture encoder-decoder effectively encodes the texture differences between the shape-aligned data and the template. Specifically, we propose a Transformer-CNN mixed texture encoder, which is built on the state-of-the-art (SOTA) image encoding algorithm TCM42, featuring two major improvements: (i) extension from 2D to 3D, to fully leverage the 3D spatial structure of the medical data and thus raise the encoding quality; (ii) introduction of positional encoding, which divides the given data volume into blocks to circumvent the demanding memory and computational consumption when compressing large-scale data, and associates each data block’s position with respect to the reference shape template. These two improvements deliver moderate gains in encoding quality and significant boosts in encoding speed, as illustrated in Supplementary Fig. 11 (a). The 3DTCMPE algorithm also outperforms other SOTA methods in both encoding quality and speed, as shown in Supplementary Fig. 10.
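The block partitioning with positional codes can be sketched as follows. This is an illustrative simplification (the actual 3DTCMPE feeds the position codes into TCM’s Swin Transformer module; the normalization scheme here is an assumption):

```python
import numpy as np

def partition_with_positions(volume, block):
    """Split a 3D volume into blocks and attach each block's normalized
    position in the template frame as a positional code (illustrative)."""
    D, H, W = volume.shape
    blocks = []
    for z in range(0, D, block):
        for y in range(0, H, block):
            for x in range(0, W, block):
                pos = (z / D, y / H, x / W)  # normalized positional code
                blocks.append((pos, volume[z:z + block, y:y + block, x:x + block]))
    return blocks
```

Because each block carries its location relative to the shape template, the encoder can exploit anatomical context while keeping per-block memory and compute bounded.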
It is worth emphasizing that shape registration plays a crucial role in the success of the 3DTCMPE algorithm, since before registration the intensity variations among the set of individual data volumes are highly complex and beyond the capabilities of both CNN- and Transformer-based deep neural networks. As experimentally shown in Supplementary Fig. 6 (b), the performance of 3DTCMPE improves significantly with shape encoding.
DeepSTD Achieves High Performance and Fast Compression
DeepSTD achieves high encoding efficiency by decoupling the encoding task into two problems that can be processed independently and efficiently. While maintaining high compression quality, DeepSTD achieves a speed improvement of two orders of magnitude over the benchmark compressor with the best quality, BRIEF33, and one order of magnitude over the runner-up competitor, VVC29, across various compression ratios. These attributes make DeepSTD well-suited for modern medical data applications, which require high-precision, high-ratio, and high-speed encoding. The exceptional performance of DeepSTD has been comprehensively validated on large-scale datasets in both CT and magnetic resonance modalities, demonstrating its robustness and adaptability in diverse medical imaging scenarios.
First, we evaluated the performance of our approach on highly representative CT data from the CTSpine1K dataset18, a torso CT dataset covering the upper part of the human body. The average data size is 512 × 512 × 504 pixels, and we selected the first 787 data samples in this experiment, with 630 samples for training and 157 for testing. The dataset occupies approximately 194 GB. Considering that CT pixel values have physical significance, we evaluated the pixel-level compression loss in terms of Mean Absolute Error (MAE) and Structural Similarity Index (SSIM) between the compressed and original data. Note that we do not use the Peak Signal-to-Noise Ratio (PSNR) because the large black background regions in the images can artificially inflate PSNR values. To further assess the loss from a medical perspective, we use the Dice Similarity Coefficient (DSC) and the 95% Hausdorff Distance (HD95) between the organ segmentation results obtained before and after compression.
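For reference, the pixel-level and segmentation metrics can be expressed compactly; the definitions below are standard formulations, not the authors' exact evaluation code:

```python
import numpy as np

def mae_hu(original, decompressed):
    """Mean Absolute Error over the volume, in Hounsfield units for CT."""
    diff = original.astype(np.float64) - decompressed.astype(np.float64)
    return float(np.mean(np.abs(diff)))

def dice(seg_a, seg_b):
    """Dice Similarity Coefficient between two binary segmentation masks:
    2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(seg_a, seg_b).sum()
    return float(2.0 * inter / (seg_a.sum() + seg_b.sum()))
```

Here the DSC is computed between segmentations of the original and decompressed volumes, so it measures how well the compression preserves anatomically relevant structure rather than raw intensities.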
DeepSTD achieves 256 × compression at high data fidelity, which significantly reduces the difficulty of data storage and sharing. For example, downloading the entire CTSpine1K dataset (787 data samples) from a dataset website with a 500 KB/s bandwidth would typically take 5 days, and network instability may cause download failures. After applying our method, the same data can be downloaded in just 26 minutes under the same network conditions, as shown in Fig. 2a. Even at a 256 × compression ratio, our algorithm achieves almost lossless encoding performance. In terms of pixel-level data fidelity, the average MAE is 6.8 HU and SSIM is 0.993, as shown in Fig. 1b and Fig. 2b–d. From the medical perspective, we compared the organ segmentation accuracy and achieved an average DSC of 90.55 and HD95 of 4.52, as shown in Fig. 2b and e. Here the organ segmentation was performed using the UNETR++43 trained on the Synapse dataset44 for Liver, Stomach, and Spleen, and on the CTSpine1K dataset18 for the Spine. As for the encoding efficiency, DeepSTD achieves an average encoding speed of 3.1 MB/s, processing about 1,604,963 voxels per second (assuming 16-bit precision) on a workstation equipped with an AMD Ryzen Threadripper 3970X CPU and NVIDIA RTX 3090 GPUs. On average, encoding a 512 × 512 × 504-pixel volume takes 82 seconds, as shown in Fig. 2f. Encoding the 787 CT volumes totaling 194 GB takes approximately 18 hours.
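The transfer-time figures above follow from simple bandwidth arithmetic; a sketch assuming decimal units (the figures quoted in the text are rounded, and real downloads add protocol overhead):

```python
def transfer_time_s(size_bytes, bandwidth_bytes_per_s):
    """Minimum transmission time, ignoring overhead and retries."""
    return size_bytes / bandwidth_bytes_per_s

GB, KB = 1e9, 1e3
raw_size = 194 * GB        # CTSpine1K, first 787 samples
bandwidth = 500 * KB       # 500 KB/s link

days_raw = transfer_time_s(raw_size, bandwidth) / 86400      # days, uncompressed
min_256x = transfer_time_s(raw_size / 256, bandwidth) / 60   # minutes at 256x
```

With these assumptions the uncompressed download sits in the multi-day range, while the 256 × compressed version fits in tens of minutes, consistent with the comparison in Fig. 2a.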
Fig. 2. DeepSTD achieved high efficiency CT data compression, demonstrated on the CTSpine1K18 dataset.
a The estimated file sizes and minimum transmission time with 500 KB/s bandwidth before and after applying DeepSTD to compress the first 787 data samples by 256 × . b The encoding quality of DeepSTD at 256 × compression ratio, in terms of intensity fidelity—average Mean Absolute Error (MAE) of 6.8 HU and Structural Similarity Index (SSIM) of 0.993—and organ segmentation precision—Dice Similarity Coefficient (DSC) of 90.55 and 95% Hausdorff Distance (HD95) of 4.52. These box plots share the same legend and show the distribution of the metrics from 157 samples: M (median, center line), Q1 and Q3 (first and third quartiles, box bounds), and Min and Max (whisker extents). Here the organ segmentation was performed using the UNETR++43 trained on the Synapse dataset44 for the Liver, Stomach, and Spleen, and on the CTSpine1K dataset for the Spine. c, d Visual comparison of DeepSTD's intensity fidelity after 256 × compression. We present the axial cross-sectional images at depth 511 (c) and depth 472 (d) of data #0001 in the leftmost column, zoomed-in comparisons of two regions of interest (ROIs) in the middle columns, and their profiles in the rightmost column. e Visual comparison of the organ segmentation precision before and after applying DeepSTD for 256 × compression. We display the axial cross-sectional image at depth 497 of data sample #0431, with the color-coded segmentation results overlaid on top. The differences between segmentation results on the original and compressed versions are also shown alongside for better visualization, with true positives (TP), false positives (FP), and false negatives (FN) highlighted in different colors. f Encoding (left) and decoding (right) time breakdown of DeepSTD between shape and texture, recorded at 256 × compression ratio using a single RTX 3090 GPU. g–i Comparison of DeepSTD and other baseline algorithms at different compression ratios, in terms of intensity fidelity—MAE and SSIM (g), downstream organ segmentation precision—DSC and HD95 (h), and compression speed (i).
We conducted a comprehensive comparison with SOTA algorithms across three dimensions: encoding quality, compression ratio, and speed. The compared methods include (i) traditional image compression algorithms—JPEG45 and JPEG200046; (ii) video compression algorithms—HEVC28 and VVC29; (iii) an implicit representation-based algorithm—BRIEF33; (iv) explicit representation-based methods—3DGS47,48; and (v) data-driven deep learning methods—DVC49 and TCM42. The results are shown in Fig. 2g–i. Note that the JPEG algorithm cannot achieve 256 × compression, so its result is not plotted. From the comparison, one can see that DeepSTD outperforms all SOTA algorithms in encoding quality across compression ratios ranging from 32 × to 256 × . Moreover, as the algorithm with the highest encoding quality, DeepSTD achieves significantly higher efficiency than the two best-quality competitors: 37 to 139 times faster than BRIEF across compression ratios from 32 × to 256 × , and 18 to 46 times faster than VVC. These advancements make DeepSTD exceptionally well-suited for modern medical data applications.
Second, we tested our algorithm on MRI data from the Amos dataset50, an MRI dataset that targets the abdomen as the imaging region and includes multiple organs. We started with the 1,200 MRI scans provided as supplementary data and filtered out low-resolution scans with spatial resolution worse than 4 mm or longitudinal resolution worse than 20 mm. The remaining 1,144 scans were rescaled to a uniform resolution of 2 mm × 2 mm × 10 mm, and the average size of the rescaled scans is 387 × 362 × 44 pixels. We split these data into a training set of 915 samples and a test set of 229 samples. The total file size of the dataset is approximately 13 GB. We used the same evaluation metrics as for the CT data above, except that MAE is not used to evaluate pixel-level fidelity because MRI pixel values do not have physical significance.
DeepSTD achieves a compression ratio of 128 × , which significantly reduces the difficulty of data sharing. With a 500 KB/s bandwidth, downloading the entire Amos dataset (preprocessed) typically takes 8 hours and may even fail if the network is unstable. After compression, the same data can be downloaded within just 4 minutes, as shown in Fig. 3a. Even at such a high compression ratio, DeepSTD achieves an average SSIM of 0.993, as shown in Figs. 1c and 3b–d, and an average DSC of 87.8 and HD95 of 9.1, as shown in Fig. 3b and e. The organ segmentation was performed using the UNETR++43 trained on the labeled Amos dataset50 for the Liver, Stomach, and Spleen. Regarding efficiency, DeepSTD achieves an average encoding speed of 0.3 MB/s, encoding an average of 152,445 voxels per second (assuming 16-bit precision). Encoding a 387 × 362 × 44-pixel volume takes 40 seconds, as shown in Fig. 3f, and compressing the entire preprocessed Amos dataset (totaling 13 GB) takes approximately 13 hours.
Fig. 3. DeepSTD achieved high efficiency MRI data compression, demonstrated on the Amos dataset50.
a The reduction in file size and minimum transmission time with 500 KB/s bandwidth by applying our compression algorithm for 128 × compression to the preprocessed Amos dataset. b The encoding quality of DeepSTD on 229 samples from the Amos dataset at 128 × compression ratio, in terms of intensity fidelity—an average Structural Similarity Index (SSIM) of 0.993—and organ segmentation precision—an average Dice Similarity Coefficient (DSC) of 87.8 and 95% Hausdorff Distance (HD95) of 9.1. Here the plots share the same legend as Fig. 2b, and the organ segmentation was performed using the UNETR++43 trained on the labeled Amos dataset for the Liver, Stomach, and Spleen. c, d Visual comparison of DeepSTD's intensity fidelity at 128 × compression for two axial cross-sectional images at depth 225 (c) and depth 270 (d) of data sample #7298, with zoomed-in comparisons of three regions of interest (ROIs) and corresponding line profiles alongside for clear demonstration. e Visual comparison of the downstream organ segmentation before and after applying DeepSTD for 128 × compression. We present the axial cross-sectional images at depth 29 of data sample #7200, with color-coded segmentation results overlaid on top, and place the differences between segmentations on the original and compressed data alongside. Additionally, we show zoomed-in comparisons of three ROIs for a clearer view. f The encoding/decoding time breakdown of DeepSTD between shape and texture information, recorded at 128 × compression using a single RTX 3090 GPU. g–i Comparison of DeepSTD against baselines in terms of intensity fidelity—SSIM (g), segmentation precision—DSC and HD95 (h), and encoding speed (i) at four different compression ratios.
We conducted a comprehensive comparison with the same SOTA algorithms as in the previous CT compression task; the results are shown in Fig. 3g–i. One arrives at the same conclusion: DeepSTD outperforms all SOTA algorithms in compression quality across compression ratios from 32 × to 256 × . While achieving the highest encoding quality, DeepSTD also holds similar efficiency advantages as in CT, being 44 to 165 times faster than BRIEF at compression ratios ranging from 32 × to 256 × .
Besides, DeepSTD demonstrates broad applicability across datasets from other imaging regions and modalities. (i) The CheXpert-small dataset51, an X-ray dataset targeting chest imaging, consists of 2D data, differing from the 3D datasets used previously. To accommodate 2D data, we adapted DeepSTD’s texture encoder-decoder to a 2D version. We randomly selected 10,000 frontal-view samples from the study1 subset of the dataset for training, with the compression ratio set to 128 × ; the results are shown in Supplementary Fig. 5. (ii) The Brain MRI dataset, a subset of the NYU fastMRI dataset52,53, contains 700 randomly selected T2-weighted reconstructed samples. The compression ratio was set to 128 × , and the results are presented in Supplementary Fig. 6. (iii) The Knee MRI dataset, another subset of the NYU fastMRI dataset52,53, contains 460 randomly selected samples; the results at 128 × compression are shown in Supplementary Fig. 7.
DeepSTD owns beneficial features supporting its practicality
One can encode and decode the texture information using multiple GPUs for acceleration. The texture encoder-decoder operates in two modes: the full-data mode directly processes the entire data volume after it is geometrically aligned with the template; the block-based mode assigns a positional encoding to each data block according to its spatial location and feeds the position code into TCM’s Swin Transformer module to extract the corresponding texture information. In the block-based mode, different data blocks can be encoded in parallel on multiple GPUs, accelerating the encoding process. As shown in Fig. 4a, we evaluated the encoding time on the CTSpine1K and Amos datasets using different numbers of GPUs (1, 2, 4, and 8). The results demonstrate a linear decrease in texture encoding time with the number of GPUs, while the shape encoding time remains nearly constant, since parallel processing was applied only to texture encoding in this experiment. This indicates that scalability is favorable but bounded by communication overhead and the non-parallelizable shape optimization, reflecting the key computational trade-offs of multi-GPU acceleration. Texture decoding can similarly be accelerated by data division and multi-GPU processing. Shape decoding also benefits from multi-GPU parallelization by distributing the inference of the deformation fields over different regions on multiple GPUs. This significantly accelerates the decoding of the shape information, as shown in Fig. 4b, in which we evaluated the decoding time on the CTSpine1K and Amos datasets using varying numbers of GPUs (1, 2, 4, and 8) and observed a linear reduction in decoding time with the GPU number.
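Because the blocks are independent after positional encoding, parallel dispatch is straightforward. A minimal sketch follows, using a thread pool as a stand-in for per-GPU workers (in a real multi-GPU setup each worker would pin one device; `encode_block` is a hypothetical per-block encoder):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_blocks_parallel(blocks, encode_block, n_workers):
    """Encode independent texture blocks in parallel. Each worker would
    own one GPU in practice; here a thread pool stands in for dispatch."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # executor.map preserves input order, so the returned codes
        # line up with the input blocks for later concatenation.
        return list(pool.map(encode_block, blocks))
```

Since per-block work dominates and the blocks share no state, encoding time should fall roughly linearly with the worker count, mirroring the multi-GPU scaling reported in Fig. 4a.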
Fig. 4. DeepSTD possesses beneficial features enhancing practicality.
a, b The encoding (a) and decoding (b) speed of DeepSTD when implemented with different numbers of GPUs, calculated on the CTSpine1K and Amos datasets with 256 × and 128 × compression, respectively. Here we plot the statistics on the datasets and the figures for a specific example with dashed lines. c The time breakdown of DeepSTD's training stage among three subtasks: selecting the reference shape, aligning shapes, and training the encoder/decoder. The times for CT (left) and MRI (right) data are respectively calculated on the CTSpine1K training set with 256 × compression and the Amos training set with 128 × compression, using 4 RTX 3090 GPUs. d The error bar (mean ± standard deviation) of DeepSTD's performance in terms of SSIM after different training durations, averaged over three runs with different random seeds on the same dataset and compression ratio as in c. The vertical dashed line indicates the training time adopted in this paper. For better readability, two horizontal axes are provided: one for the number of iterations and the other for training time. e DeepSTD's performance and training time in cases with limited training data, with "Sufficient" and "Limited" denoting the full Amos training dataset and a small training set consisting of 50 randomly selected samples, respectively. Additionally, we tested the contribution of two proposed countermeasures, data augmentation and knowledge distillation, as well as their combination, denoted as "Data Aug.", "Know. Dis.", and "Data Aug. + Know. Dis.". All experiments were conducted with a compression ratio of 128 ×. f DeepSTD's stable and flexible control over the compression ratio. Left: a boxplot displaying the achieved compression ratios of different compressors for a given 256 × target compression ratio, tested on 50 random data from the Amos test set. Right: DeepSTD's final achieved compression ratios at 20 different target compression ratios ranging from 32 × to 512 ×, tested on 50 random data from the Amos test set. The left panel uses the same legend as Figs. 2 and 3, while the error bars in the right panel indicate the mean ± standard deviation.
The training of DeepSTD is highly efficient: training a DeepSTD for the CTSpine1K data takes only 25 hours using 4 RTX 3090 GPUs, and that for the Amos data takes around 22 hours. Note that we can employ more GPUs to shorten the training time further. We quantitatively analyzed the time consumption of each stage during training, as shown in Fig. 4c. The most time-intensive stage is the training of the texture encoder-decoder ("Training Encoder Decoder" in the figure), which accounts for approximately 96% of the total training time. The second most time-consuming stage is aligning the training data with the reference shape ("Aligning Shape" in the figure), contributing about 4% of the total time. Finally, the reference shape selection stage ("Selecting Reference Shape") takes only 0.01% of the training time. Additionally, we recorded the performance changes of our algorithm under varying training durations, as shown in Fig. 4d, to visualize the convergence.
For certain modalities, collecting training data may be more challenging than for widely used torso CT or MRI. Insufficient training data can significantly degrade the performance of data-driven methods. To simulate this scenario, we randomly selected 50 samples from the Amos training set to form a reduced training set and re-trained our algorithm on this smaller dataset. As shown in Fig. 4e, the performance of our algorithm declined. To mitigate the impact of limited training data, we propose two strategies to enhance DeepSTD’s performance: (i) data augmentation, in which we apply randomly generated deformation fields to the training data to create a more diverse set of training samples; (ii) knowledge distillation, in which we learn a teacher model from a dataset comprising sufficient data of a similar imaging modality and distill its knowledge into a student model designed for the target modality with scarce data. Each strategy independently improves performance, and combining them yields further gains. However, these improvements come at the cost of increased training time, as shown in Fig. 4e.
Efficient management of large-scale medical data calls for a compressor that makes full use of the available storage or transmission budget, so a flexible approach that allows users to set the desired compression ratio is in high demand. DeepSTD features flexible control of the final compression ratio. To validate this point, we set a target compression ratio of 256 × on 50 random data from the Amos test set and configured all comparison algorithms through pre-experiments to achieve this target ratio. The deviations from the target compression ratio of all compression algorithms are shown in the left panel of Fig. 4f. The plot shows that DeepSTD achieves a stable compression ratio, unaffected by variations in data size or content.
This stability is jointly achieved in two stages. First, the compression ratio of texture encoding is content-independent, with deviations typically less than 0.3% across CT and MRI datasets when zero-padding is needed to match block dimensions. Second, shape encoding compensates for deviations in texture encoding by adjusting the number of neurons in the INR network structure. Since shape encoding is implemented by solving an optimization problem, DeepSTD can flexibly adapt to given bandwidth or storage constraints by adjusting the compression ratio and controlling encoding quality, as shown in the right panel of Fig. 4f.
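The neuron-count adjustment above reduces to a small budget calculation: the INR's parameter count must fit whatever byte budget remains after the texture latent is stored. The sketch below illustrates this under stated assumptions (a 3-hidden-layer SIREN, fp16 weight storage, and a hypothetical 400 kB texture latent); the paper's actual sizing rule may differ.

```python
def siren_param_count(in_dim, hidden, layers, out_dim):
    """Parameters of a fully connected SIREN: weights + biases per layer."""
    n = in_dim * hidden + hidden                      # input layer
    n += (layers - 1) * (hidden * hidden + hidden)    # hidden layers
    n += hidden * out_dim + out_dim                   # output layer
    return n

def pick_hidden_width(shape_budget_bytes, in_dim=3, layers=3, out_dim=3,
                      bytes_per_param=2):             # fp16 storage assumed
    """Largest hidden width whose INR weights fit the remaining budget."""
    width = 1
    while siren_param_count(in_dim, width + 1, layers, out_dim) * bytes_per_param <= shape_budget_bytes:
        width += 1
    return width

original_bytes = 512 * 512 * 504 * 2      # a CTSpine1K-sized 16-bit volume
target_ratio = 256
texture_bytes = 400_000                   # hypothetical size of the latent y
budget = original_bytes // target_ratio - texture_bytes
width = pick_hidden_width(budget)         # widest SIREN that still hits 256x
```

Because the width is chosen against the exact remaining budget, the final compressed size lands on the target regardless of the data content, which is the behavior shown in Fig. 4f.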
Discussion
This article reports Shape-Texture Decoupled Compression (DeepSTD) for efficient medical image storage and sharing, which introduces a template for the images of the same imaging modality and body part(s) and defines each image with respect to the template. This definition disentangles the shape and texture features in the individual-specific image and represents them with a morphing field and shape-aligned texture, respectively, which facilitates adopting proper encoding schemes for both components separately. We note that the interdependence between the two components is asymmetric. Inaccuracies in shape encoding can lead to residual edge artifacts in texture encoding, while errors in texture encoding tend to remain localized and do not compromise the integrity of shape recovery. This highlights the importance of precise registration in minimizing the workload on the texture encoder and maintaining overall fidelity. In addition, anatomical variations or pathological conditions underrepresented in the training data may cause localized degradations in registration or texture fidelity. Nevertheless, these errors remain bounded and spatially confined and do not compromise the overall anatomical plausibility of the decompressed images. Additionally, in extreme cases, we can evaluate the compression loss and retain the compression residues of these rare samples to compensate for the compression loss and even incrementally update the compressor to enhance the robustness. In this way, DeepSTD maintains a favorable trade-off between compression ratio and image quality: shape encoding with INRs ensures precise ratio control and efficient low-frequency representation, while texture encoding after shape alignment reduces redundancy and allows high-fidelity compression. 
The proposed approach therefore fully explores the respective advantages of INR-based and data-driven compression techniques and bypasses their disadvantages, achieving high ratios (up to 256 × ) without compromising diagnostic quality. Besides, the proposed encoding scheme is a general framework open to coming compression algorithms for both shape and aligned textures.
Under the above framework, we also developed a medical-image-specific algorithm. For shape encoding, we utilize a Cycle INR to address the challenges of encoding shape differences with respect to the predefined template, solving both forward and backward registration problems simultaneously. The algorithm optimizes the deformation fields through the INR parameters, allowing for efficient encoding of the shape information. To ensure that the shape encoding captures meaningful anatomical variations rather than arbitrary deformations, we adopt three safeguards: (i) a cycle-consistency loss to enforce reciprocal deformation, (ii) Jacobian determinant regularization to maintain smooth and invertible fields, and (iii) a gradual degradation strategy to highlight large-scale anatomical structures. Together, these constraints guide the model toward encoding biologically plausible shape differences. For texture encoding, we developed the 3DTCMPE algorithm, a data-driven encoder-decoder that efficiently encodes the texture after alignment with the template. The algorithm extends the SOTA TCM algorithm to 3D and introduces positional encoding to better handle large-scale data, delivering significant gains in encoding quality and speed. Additionally, data augmentation and knowledge distillation strategies improve robustness under limited training data, ensuring the applicability of 3DTCMPE across diverse medical imaging scenarios. The proposed synergy between INR-based and autoencoder-based modules stems directly from the shape-texture decoupling: INRs efficiently encode the low-frequency morphing field that aligns individual scans to a shared template, while the subsequent shape alignment enables data-driven encoders to compactly and robustly capture fine-scale textures.
This hybridization bypasses the limitations of either approach in isolation, achieving both high efficiency and strong generalization.
DeepSTD has been rigorously evaluated on large-scale CT and MRI datasets, demonstrating exceptional performance in terms of encoding quality, compression ratio, and speed. Notably, DeepSTD outperforms existing algorithms, achieving up to 165 times the speed of BRIEF and 46 times the speed of VVC while maintaining superior encoding quality. These results establish DeepSTD as a leading solution for modern medical image compression, where precision and efficiency are both essential. In the CT modality (CTSpine1K dataset), DeepSTD achieved a compression ratio of 256 × with minimal loss in both pixel-level accuracy (MAE, SSIM) and medical relevance (DSC, HD95). Similarly, on the MRI modality (Amos dataset), DeepSTD delivered a 128 × compression ratio with high SSIM and DSC scores. Furthermore, DeepSTD demonstrates broad applicability, performing effectively across diverse medical datasets, including chest X-ray, brain MRI, and knee MRI, making it a versatile tool for various medical images. These findings highlight DeepSTD’s superior comprehensive performance relative to SOTA compressors and its robustness across diverse medical imaging modalities.
Another notable strength of DeepSTD is its practicality. It can efficiently leverage multiple GPUs for accelerated encoding and decoding, significantly reducing processing time for large datasets. With a fast training process of 22-25 hours using 4 GPUs, DeepSTD is well-suited for rapid deployment in real-world applications. Additionally, its ability to handle limited training data through data augmentation and knowledge distillation ensures stable performance even in data-scarce environments. DeepSTD also provides flexible control over compression ratios, adapting to different data sizes and storage or bandwidth constraints while maintaining high-fidelity compression.
While DeepSTD demonstrates impressive performance, there are several avenues for further extensions. Future work could accelerate encoding and decoding by exploring more computationally efficient network architectures54,55, particularly for texture encoding and decoding, which would reduce the overall processing time. In the shape encoding phase, meta-learning techniques could be employed to accelerate INR optimization, enabling faster convergence and more efficient encoding. More efficient training strategies, such as leveraging synthetic data or semi-supervised learning, could also be considered to circumvent the constraints of limited training data. These improvements would boost DeepSTD’s fast adaptability to diverse imaging modalities lacking sufficient training data or computation resources, and thus enhance DeepSTD’s practicality in real-world medical scenes. Additionally, improving the generalization ability of the data-driven algorithms is a promising direction. Despite the current framework’s robust performance across various datasets, one can further explore methods like transfer learning, few-shot learning, and domain adaptation to increase generalizability to unseen data, particularly in data-scarce scenarios.
At the same time, several limitations of the current framework should be acknowledged. While DeepSTD has been validated across multiple anatomical imaging modalities, its robustness against out-of-distribution variations (e.g., unseen pathologies, scanner types, or acquisition protocols) has not been systematically established. To partially address this, we conducted a cross-protocol validation on the Amos MRI dataset, which revealed mild but noticeable performance degradation under unseen acquisition protocols. By contrast, such an evaluation was not performed on the CTSpine1K dataset due to its relative homogeneity in scanning protocols. Furthermore, the reliance on anatomical registration makes the current framework unsuitable for functional imaging modalities (e.g., fMRI, PET), where voxel intensities lack stable anatomical correspondence. Addressing these limitations may require domain-adaptation strategies or alternative representations tailored for functional data.
In summary, DeepSTD obtains a good trade-off among compression performance, efficiency, and flexibility, and represents a significant advancement in medical image compression, offering a highly efficient and flexible solution for compact data storage and sharing in clinical settings.
Methods
Encoding shape and texture
Given the data D and the reference template R, we initially pad D to match the dimensions of R and downsample both by a factor of 2. We then apply a cyclic Implicit Neural Representation (INR) registration algorithm to the downsampled D and R to obtain the forward deformation field ΦF and the backward deformation field ΦB. The forward deformation field ΦF transforms the shape of D toward that of R, while the backward deformation field ΦB performs the inverse transformation. Finally, the forward-transformed data Dwarped is computed as

$$D_{\mathrm{warped}}(x) = D\left[\Phi_F(x)\right], \tag{1}$$

with [ ⋅ ] denoting coordinate-to-intensity indexing, and then passed to subsequent texture encoding. The backward deformation field ΦB is stored as shape information in the compressed data and will be used for the inverse deformation during decoding.
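The coordinate-to-intensity indexing of Eq. (1) can be sketched as follows. This is a simplified illustration using nearest-neighbour lookup on a numpy array; a practical implementation would use trilinear interpolation, which the source does not detail here.

```python
import numpy as np

def warp(data, phi):
    """Warp `data` by sampling it at the deformed coordinates phi(x),
    i.e. D_warped(x) = D[phi(x)], with nearest-neighbour indexing.
    `phi` has shape (*data.shape, 3) holding absolute voxel coordinates."""
    idx = np.rint(phi).astype(int)
    # Clamp to the volume so out-of-range coordinates stay valid.
    for d in range(3):
        idx[..., d] = np.clip(idx[..., d], 0, data.shape[d] - 1)
    return data[idx[..., 0], idx[..., 1], idx[..., 2]]

# Identity field plus a constant shift of one voxel along the first axis,
# i.e. u_F(x) = (1, 0, 0).
D = np.arange(27, dtype=np.float32).reshape(3, 3, 3)
grid = np.stack(np.meshgrid(*[np.arange(3)] * 3, indexing="ij"), axis=-1)
phi_f = grid.copy()
phi_f[..., 0] += 1
D_warped = warp(D, phi_f)               # each slice pulled from one slice deeper
```

The same routine applied with ΦB instead of ΦF realizes the inverse transformation used at decoding time (Eq. 4).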
After shape encoding, we use a trained encoder to extract a latent variable y as the texture information:

$$y = g_{\mathrm{enc}}\left(D_{\mathrm{warped}}\right), \tag{2}$$

where $g_{\mathrm{enc}}$ represents the encoder. The latent variable y is stored in the compressed data as texture information. At this point, the encoding process is complete, and the compressed data concatenates the shape and texture information.
Decoding texture and shape
After obtaining the compressed data, the texture information y is extracted, and a trained decoder is used to reconstruct the shape-aligned data from the texture information:

$$\hat{D}_{\mathrm{warped}} = g_{\mathrm{dec}}(y), \tag{3}$$

where $g_{\mathrm{dec}}$ represents the decoder. The reconstructed data at this point has a shape similar to the reference shape. Next, the backward deformation field ΦB is extracted from the compressed data, and we restore the original shape of the data as

$$\hat{D}(x) = \hat{D}_{\mathrm{warped}}\left[\Phi_B(x)\right], \tag{4}$$

where $\hat{D}$ is the final decompressed data.
Selecting template for shape registration
The training sample with the widest imaging range is selected as the template to maximize detail coverage and facilitate registration. To avoid bias from noisy or atypical scans, we exclude samples with severe artifacts or strong non-uniformity, and apply a gradual degradation strategy in the Cycle-INR encoder to focus on global geometry while down-weighting local irregularities. This makes the method robust to common MRI artifacts and outlier shapes. The selected templates for the CTSpine1K and Amos datasets are visualized in Fig. 1b and c, respectively. The imaging range of each sample is evaluated based on its dimensions and imaging resolution. For a sample set with an identical imaging range, one can use a randomly selected sample as the template. In a sensitivity study on AMOS MRI (Supplementary Table 2) at a fixed compression ratio (128 ×), the widest-range exemplar yields the best overall fidelity, a random exemplar is moderately worse, and a simple average template (constructed by rigidly aligning five volumes and voxel-wise averaging) performs noticeably worse in SSIM/DSC and HD95, likely due to template blurring from unresolved inter-subject anatomical variability and the resulting increase in residual complexity56.
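Since the imaging range is evaluated from dimensions and resolution, template selection reduces to comparing the physical extent each scan covers. A minimal sketch, assuming the range is summarized by the physical volume (voxel counts times voxel spacing); the sample dictionaries and field names are illustrative.

```python
from math import prod

def imaging_range(dims, spacing):
    """Physical extent (mm) per axis: number of voxels x voxel resolution."""
    return tuple(d * s for d, s in zip(dims, spacing))

def select_template(samples):
    """Pick the sample covering the widest imaging range, measured here as
    the physical volume spanned by the scan."""
    return max(samples, key=lambda s: prod(imaging_range(s["dims"], s["spacing"])))

samples = [
    {"id": "a", "dims": (512, 512, 300), "spacing": (1.0, 1.0, 1.0)},
    {"id": "b", "dims": (512, 512, 504), "spacing": (1.0, 1.0, 1.0)},
    {"id": "c", "dims": (256, 256, 504), "spacing": (1.0, 1.0, 2.0)},
]
template = select_template(samples)     # "b": widest physical coverage
```

In practice this ranking would be applied after filtering out artifact-heavy scans, as described above.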
Cycle-INR based shape registration
We build on the cycle INR registration algorithm proposed in ref. 41 to compute both forward and backward deformation fields between the target image and the template. The forward deformation field is denoted as ΦF(x) = x + uF(x), which maps coordinates x in R to x + uF(x) in D. Similarly, we define the backward deformation field ΦB(x) = x + uB(x) to map coordinates from D back to R.
Here, the functions uF and uB provide the deformation vectors at specific coordinates, and both are parameterized by SIREN networks with an identical structure. The SIREN network is a fully connected network utilizing sinusoidal activation functions, functioning as an implicit neural representation. We simultaneously optimize ΦF(x) and ΦB(x) as shown in Supplementary Fig. 12 (a). The objective function is as follows:

$$\mathcal{L} = \mathcal{L}_F + \mathcal{L}_B + \alpha\left(\mathcal{L}_{\mathrm{reg}}^{F} + \mathcal{L}_{\mathrm{reg}}^{B}\right) + \beta\left(\mathcal{L}_{\mathrm{cyc}}^{F} + \mathcal{L}_{\mathrm{cyc}}^{B}\right), \tag{5}$$

where α and β are weight coefficients. $\mathcal{L}_F$ and $\mathcal{L}_B$ respectively measure the registration accuracy of the forward and backward transformations by calculating

$$\mathcal{L}_F = -\frac{1}{bs}\sum_{x} \mathrm{NCC}\left(R(x),\, D\left[\Phi_F(x)\right]\right), \qquad \mathcal{L}_B = -\frac{1}{bs}\sum_{x} \mathrm{NCC}\left(D(x),\, R\left[\Phi_B(x)\right]\right), \tag{6}$$

where NCC is the normalized cross-correlation, x are sampled coordinates, and bs is the batch size. We adopt the Jacobian determinant regularization as the regularization term for the deformation fields:

$$\mathcal{L}_{\mathrm{reg}}^{F} = \frac{1}{bs}\sum_{x} \left|\det\left(J_{\Phi_F}(x)\right) - 1\right|, \qquad \mathcal{L}_{\mathrm{reg}}^{B} = \frac{1}{bs}\sum_{x} \left|\det\left(J_{\Phi_B}(x)\right) - 1\right|. \tag{7}$$

The cycle loss terms $\mathcal{L}_{\mathrm{cyc}}^{F}$ and $\mathcal{L}_{\mathrm{cyc}}^{B}$ are designed to encourage the two deformation fields to approximate each other’s inverse transformation, and are defined as

$$\mathcal{L}_{\mathrm{cyc}}^{F} = \frac{1}{bs}\sum_{x} \left\|\Phi_B\left(\Phi_F(x)\right) - x\right\|_2^2, \qquad \mathcal{L}_{\mathrm{cyc}}^{B} = \frac{1}{bs}\sum_{x} \left\|\Phi_F\left(\Phi_B(x)\right) - x\right\|_2^2. \tag{8}$$

We adopt the squared L2 norm here because it strongly penalizes large deviations, stabilizing joint optimization and helping to maintain anatomically plausible deformations in medical image registration. Additionally, we employ SIREN37 as the architecture for the INRs.
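The two best-understood terms of the objective, the NCC similarity and the squared-L2 cycle consistency, can be sketched directly. This is an illustrative numpy version operating on plain arrays and coordinate functions, not the batched GPU implementation; the constant-translation fields at the bottom are toy stand-ins for the SIREN-parameterized ΦF and ΦB.

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two intensity arrays."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def cycle_loss(phi_f, phi_b, coords):
    """Squared-L2 cycle consistency: phi_b(phi_f(x)) should return to x,
    averaged over the sampled coordinate batch."""
    round_trip = phi_b(phi_f(coords))
    return float(np.mean(np.sum((round_trip - coords) ** 2, axis=-1)))

coords = np.random.default_rng(0).uniform(0, 1, size=(1024, 3))
shift = np.array([0.1, 0.0, -0.05])
phi_f = lambda x: x + shift             # toy forward field: constant translation
phi_b = lambda x: x - shift             # exact inverse of phi_f -> zero cycle loss
```

With exact inverse fields the cycle term vanishes, which is the fixed point the joint optimization of Eq. (5) is pushed toward.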
Gradual degradation strategy for robust registration
Medical data may exhibit significant shape variations across samples, such as differences in body types, pre- and post-surgical changes, or variations in the imaging regions. To address these shape variations and improve the robustness of shape encoding, we adopt a gradual degradation strategy. Specifically, we initially set the hyperparameter Ω of the SIREN network to its default value, e.g., Ω = 10 for the CTSpine1K dataset and Ω = 12 for the Amos dataset. If, after 100 optimization steps, the registration loss remains above a predefined threshold, we reduce the value of Ω by 2 and re-encode the data. The registration process is repeated until either the loss falls below the threshold or Ω drops below 5. The hyperparameter Ω of the SIREN network determines the spectral range of the implicit neural representation36. Lowering the value of Ω enables the registration process to focus on larger-scale alignment while ignoring smaller shape details57, thereby preventing the registration process from being overly influenced by unique shapes, which could lead to erroneous results. The threshold is determined from the training process: we use the 75th percentile of the registration loss values on the training data as the threshold for testing. Unless otherwise specified, the total optimization time for shape encoding is set to be the same as the encoding time for texture encoding in all experiments.
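The control flow of the gradual degradation strategy can be sketched as a short retry loop. The stub `encode_fn` and its loss table are purely illustrative (standing in for 100 SIREN optimization steps at a given Ω), and the interpretation that re-encoding is triggered while the loss stays above the threshold follows the reading adopted above.

```python
def gradual_degradation(encode_fn, omega_init, threshold, omega_min=5, step=2):
    """Lower the SIREN frequency Omega until registration succeeds (loss
    below threshold) or Omega would drop below omega_min.
    `encode_fn(omega)` runs the optimization and returns the registration loss."""
    omega = omega_init
    while True:
        loss = encode_fn(omega)
        if loss < threshold or omega - step < omega_min:
            return omega, loss
        omega -= step                   # re-encode with a smoother, lower-frequency INR

# Stub encoder whose loss improves as Omega decreases (illustrative only).
losses = {10: 0.9, 8: 0.6, 6: 0.2}
omega, loss = gradual_degradation(lambda w: losses[w], omega_init=10, threshold=0.3)
```

Here the loop settles at Ω = 6, the first setting whose loss clears the threshold, mirroring how lower Ω trades fine shape detail for robust large-scale alignment.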
Architecture of texture encoder
The encoder architecture, as shown in Supplementary Fig. 12 (b), adopts the classical autoencoder structure, consisting of alternating feature extraction modules and down-sampling modules connected in sequence. For the feature extraction module, we follow the design of the Transformer-CNN Mixture (TCM) block from ref. 58, which utilizes parallel CNN and Transformer paths for feature extraction. At the module’s input, a 1 × 1 × 1 convolution layer is applied to increase the data dimensionality (i.e., the number of channels). The data is then split along the channel dimension into two parts: one processed by a CNN and the other by a Transformer. The CNN path consists of two sequential groups of 3 × 3 × 3 convolution layers and Leaky ReLU activation layers, with the entire structure employing residual connections. The Transformer path adopts the Swin Transformer Block structure proposed in ref. 59. The feature extraction module as a whole also utilizes residual connections. For the down-sampling module, we use the Residual Block with Stride (RBS) proposed in ref. 42, which includes a strided convolution with a kernel size of 3 × 3 × 3 and a stride of 2 for down-sampling, followed by a Leaky ReLU activation layer and another 3 × 3 × 3 convolution layer. The entire module employs residual connections to expand the receptive field.
The texture encoder-decoder has two modes: full-data encoding/decoding mode and block-based encoding/decoding mode. In the full-data encoding/decoding mode, a single texture encoder-decoder is used to encode and decode the entire data, which is the default mode we adopt. However, the drawback is that it cannot utilize multi-GPU acceleration. In the block-based encoding/decoding mode, multiple texture encoders are used, each responsible for encoding and decoding a fixed data subregion. The advantage of this mode is that different encoders can parallelize the encoding/decoding process across multiple GPUs, speeding up the overall operation. This approach is also employed when the memory capacity of a single GPU is insufficient. Specifically, in the block-based encoding/decoding mode, each block is assigned a learnable positional encoding according to its relative position. This positional encoding is then input into the Swin Transformer Block of the TCM Block, enabling the encoder to recognize each block’s relative position and extract texture information accordingly. The above working modes both benefit from the intensified data similarity introduced by the shape registration.
Architecture of the texture decoder
The structure of the decoder is shown in Supplementary Fig. 12 (b), which adopts a design similar to the encoder, with the primary difference being the replacement of down-sampling modules with up-sampling modules. For the up-sampling module, we used the Residual Block with Upsampling (RBU) proposed in ref. 42, employing subpixel convolution instead of transposed convolution as upsampling units to retain more details. This module also employs residual connections to expand the receptive field.
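The sub-pixel convolution used in the RBU works by first producing r³ times as many channels with an ordinary convolution and then rearranging them into a higher-resolution grid. The rearrangement step, the 3D analogue of pixel shuffle, can be sketched as follows (a numpy illustration of the data movement only, not of the paper's full RBU module).

```python
import numpy as np

def pixel_shuffle_3d(x, r):
    """Rearrange (C*r^3, D, H, W) features into (C, D*r, H*r, W*r), the
    3D analogue of sub-pixel convolution's upsampling step."""
    c_r3, d, h, w = x.shape
    c = c_r3 // (r ** 3)
    x = x.reshape(c, r, r, r, d, h, w)
    x = x.transpose(0, 4, 1, 5, 2, 6, 3)   # interleave the r-factors with the spatial axes
    return x.reshape(c, d * r, h * r, w * r)

feat = np.random.default_rng(0).standard_normal((16, 4, 4, 4))
up = pixel_shuffle_3d(feat, r=2)           # -> (2, 8, 8, 8)
```

Because every output voxel is copied from a distinct learned channel, no information is averaged away, which is why sub-pixel upsampling tends to retain more detail than transposed convolution.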
Standard training procedure
Before training, all samples are first aligned to the template via Cycle INR optimization to ensure consistent geometry across the dataset. The encoder and decoder are then trained simultaneously, and the loss function is defined as the pixel-wise difference between the decoded data and the original version:

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N}\sum_{i=1}^{N} \left\| g_{\mathrm{dec}}\left(g_{\mathrm{enc}}\left(D_i\right)\right) - D_i \right\|_2^2, \tag{9}$$

where N is the total number of training samples, $g_{\mathrm{enc}}$ is the encoder, and $g_{\mathrm{dec}}$ is the decoder. The Adam60 optimizer is used for training with a batch size of 2, and the learning rate is set to 1 × 10−4.
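The reconstruction objective above can be sketched as a plain batch computation. The squared pixel-wise error and the toy linear encoder/decoder pair are illustrative assumptions; the real genc/gdec are the TCM-based networks described earlier.

```python
import numpy as np

def reconstruction_loss(batch, g_enc, g_dec):
    """Mean pixel-wise reconstruction error over a batch (cf. Eq. 9);
    a squared difference is used here as the pixel-wise distance."""
    return float(np.mean([np.mean((g_dec(g_enc(d)) - d) ** 2) for d in batch]))

# Toy encoder/decoder pair that are exact inverses of each other.
g_enc = lambda d: d * 0.5
g_dec = lambda y: y * 2.0
batch = [np.ones((4, 4, 4)), np.full((4, 4, 4), 3.0)]
loss = reconstruction_loss(batch, g_enc, g_dec)
```

With an exact inverse pair the loss is zero; training drives the learned encoder-decoder toward this ideal on the shape-aligned data.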
Data augmentation and knowledge distillation for limited training data
For modalities with limited data due to image collection difficulties, we adopted two strategies to train the encoder and decoder. On the one hand, we applied data augmentation by generating random deformation fields to dynamically alter the shape and texture of the training data during training. Specifically, we adopted the method proposed in ref. 61. Random Gaussian parameters are generated to create a deformation field that is then applied to the next 10 training samples to introduce shape and texture variations. This strategy ensures diversity in the training set, helping to improve the robustness and generalization ability of the encoder-decoder.
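A simplified version of this deformation-based augmentation is sketched below: Gaussian offsets are drawn on a coarse grid and upsampled to a dense displacement field, which is then applied with nearest-neighbour resampling. The grid size, σ, and the nearest-neighbour upsampling are simplifying assumptions standing in for the smoothed fields of ref. 61.

```python
import numpy as np

def random_deformation(shape, grid=4, sigma=2.0, seed=None):
    """Sample a smooth-ish random displacement field: Gaussian offsets on a
    coarse grid, nearest-neighbour upsampled to the full volume (a
    simplified stand-in for properly smoothed random fields)."""
    rng = np.random.default_rng(seed)
    field = rng.normal(0.0, sigma, size=(grid, grid, grid, 3))
    for ax, s in enumerate(shape):
        field = np.repeat(field, s // grid, axis=ax)
    return field                         # shape (*shape, 3), in voxel units

def augment(volume, field):
    """Apply the displacement field with nearest-neighbour resampling."""
    grid_idx = np.stack(np.meshgrid(*[np.arange(s) for s in volume.shape],
                                    indexing="ij"), axis=-1)
    idx = np.rint(grid_idx + field).astype(int)
    for d in range(3):
        idx[..., d] = np.clip(idx[..., d], 0, volume.shape[d] - 1)
    return volume[idx[..., 0], idx[..., 1], idx[..., 2]]

vol = np.random.default_rng(0).random((16, 16, 16))
aug = augment(vol, random_deformation(vol.shape, seed=1))
```

Regenerating the field every few samples, as in the text, keeps the effective training set diverse without storing any extra data.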
On the other hand, we employed knowledge distillation for effective training on a small dataset. First, we identified a similar modality with abundant training data. Using the method described in “Standard training procedure”, we trained an encoder-decoder on it, called the teacher model. Next, we trained an encoder-decoder for the target modality, called the student model. In addition to the loss function defined in Equation (9), we introduced a loss that leverages the intermediate-layer features of the teacher model as supervision to guide the training of the student model:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rec}} + \alpha\, \mathcal{L}_{\mathrm{dis}}, \tag{10}$$

$$\mathcal{L}_{\mathrm{dis}} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} \left\| F_j^{T}\left(D_i\right) - F_j^{S}\left(D_i\right) \right\|_2^2, \tag{11}$$

where α is a weighting factor, M is the total number of TCM blocks in the encoder-decoder, $F_j^{T}(D_i)$ represents the features extracted by the j-th TCM block of the teacher model from data Di, and $F_j^{S}(D_i)$ represents the features extracted by the j-th TCM block of the student model from the same data. For the knowledge distillation experiments (Fig. 4e), the teacher model was trained on the full Amos training set, while the student model was trained on a reduced subset consisting of 50 randomly selected samples. In Equation (10), the weighting factor for the distillation loss was set to α = 0.5. Note that when training the teacher model, if data block partitioning is required, we set the positional encoding to random noise to prevent the teacher model from learning incorrect shape information from a different modality.
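The combined distillation objective can be sketched compactly. The squared-error feature distance and the toy constant feature maps are illustrative; in the real setting the lists would hold the intermediate activations of the M TCM blocks of the teacher and student networks on the same input.

```python
import numpy as np

def distillation_loss(teacher_feats, student_feats):
    """Feature-matching distillation term (cf. Eq. 11): mean squared
    difference between teacher and student block features, averaged over
    the M blocks."""
    return float(np.mean([np.mean((t - s) ** 2)
                          for t, s in zip(teacher_feats, student_feats)]))

def total_loss(rec_loss, teacher_feats, student_feats, alpha=0.5):
    """Combined objective (cf. Eq. 10) with the alpha = 0.5 used in Fig. 4e."""
    return rec_loss + alpha * distillation_loss(teacher_feats, student_feats)

# Hypothetical features from M = 3 blocks of the teacher and the student.
t = [np.ones((8, 8)) for _ in range(3)]
s = [np.zeros((8, 8)) for _ in range(3)]
loss = total_loss(0.1, t, s)            # 0.1 + 0.5 * 1.0
```

Weighting the feature term lets the scarce-data student inherit the teacher's texture representations without being forced to copy its outputs exactly.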
Comparison methods
The baseline algorithms adopted for performance comparison and their implementations are described here.
As for medical image registration algorithms, we chose six algorithms.
Affine and B-spline: We used Elastix software62 for their implementation.
SyN: We adopted the implementation provided by ANTs63.
CorrField, VoxelMorph, CycleMorph, and INR: We used the official implementations of CorrField64, VoxelMorph65, CycleMorph66, and INR38.
To validate our compression performance, we compared both conventional compression tools and recent deep-learning-based ones.
JPEG: We used the OpenCV implementation of JPEG45. To ensure compatibility, we normalize the data to the range [0, 255] before encoding.
JPEG2000: We utilized the Glymur implementation of JPEG200046. Similar to JPEG, we normalize the data to the range [0, 255] before encoding.
HEVC: We employed the FFmpeg implementation of HEVC28. We identified that using the gray16 encoding format produced better results; therefore, we normalize the data to the range [0, 65535] before encoding.
VVC: We used the VTM implementation of VVC29. Since the 16-bit format results in excessive computational overhead, making it impractical for experiments, we encode the data using the 8-bit format instead.
BRIEF, 3DGS, DVC, and TCM: We adopted the official implementations of BRIEF33, 3DGS47, DVC49, and TCM42.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Acknowledgements
We acknowledge Dr. Q. Cao and Dr. Y. Cheng for their valuable suggestions on the manuscript. This work is jointly supported by the National Key R&D Program of China (Grant No. 2024YFF0505703) and the National Natural Science Foundation of China (Grant No. 62088102).
Author contributions
J.S., R.Y., T.X., and Y.C. conceived this project. R.Y. designed the DeepSTD architecture. J.S. supervised this research. R.Y., T.X., Y.C., and J.S. conducted the experiments and data analysis. All the authors participated in the writing of this paper.
Peer review
Peer review information
Nature Communications thanks Seung Kwan Kang, Amit Shakya and Jian Zhang for their contribution to the peer review of this work. A peer review file is available.
Data availability
For performance comparison of image registration, we used the Synapse dataset44 that can be downloaded from this link. For experiments on torso CT data, we used the CTSpine1K dataset18 accessible at this link. The CTSpine1K is a CT data set that covers the upper part of the human body. The average data size is 512 × 512 × 504 pixels, highly representative within the CT modality. We selected the first 787 data samples and divided them into a training set of 630 samples and a test set of 157 samples. For experiments on torso MRI data, we used the Amos dataset50 accessible at this link. The Amos dataset is an MRI modality dataset that targets the abdominal region and includes multiple organs. We started with the 1,200 MRI scans provided as supplementary data by the authors, filtered the scans with an imaging resolution worse than 4 mm and a longitudinal resolution worse than 20 mm, and rescaled the remaining 1,144 scans to a uniform resolution of 2 mm × 2 mm × 10 mm. The average size of the rescaled scans is 387 × 362 × 44 pixels, and we split the dataset into a training set containing 915 samples and a test set containing 229 samples. For performance testing on chest radiograph data, we used the CheXpert-small dataset51 that can be downloaded from this link, and randomly selected 10,000 frontal view samples from the “study1” subset to train the compressor. For performance testing on brain and knee MRI data, we used the NYU fastMRI dataset52,53 that can be downloaded from this link. We randomly selected 700 T2 weighted reconstructed data from the “Brain MRI” and 460 data from the “Knee MRI” subset for respective model training, with both compression ratios at 128 × .
Code availability
The code is available on GitHub at https://github.com/RichealYoung/DeepSTD.git.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-026-68292-9.
References
- 1. Lin, L. et al. High-speed three-dimensional photoacoustic computed tomography for preclinical research and clinical translation. Nat. Commun. 12, 882 (2021).
- 2. Feinberg, D. A. et al. Next-generation MRI scanner designed for ultra-high-resolution human brain imaging at 7 Tesla. Nat. Methods 20, 2048–2057 (2023).
- 3. Boulant, N. et al. In vivo imaging of the human brain with the Iseult 11.7-T MRI scanner. Nat. Methods 21, 2013–2016 (2024).
- 4. Liu, Y. et al. Single-point mutated lanmodulin as a high-performance MRI contrast agent for vascular and kidney imaging. Nat. Commun. 15, 9834 (2024).
- 5. He, Y. et al. Perovskite computed tomography imager and three-dimensional reconstruction. Nat. Photonics 18, 1052–1058 (2024).
- 6. Gambini, L. et al. Video frame interpolation neural network for 3D tomography across different length scales. Nat. Commun. 15, 7962 (2024).
- 7. Hou, B. et al. Materials innovation and electrical engineering in X-ray detection. Nat. Rev. Electr. Eng. 1, 639–655 (2024).
- 8. Edelman Saul, E. et al. The challenges of implementing low-dose computed tomography for lung cancer screening in low- and middle-income countries. Nat. Cancer 1, 1140–1152 (2020).
- 9. Ye, A., Deng, Y., Li, X. & Shao, G. The impact of informatization development on healthcare services in China. Sci. Rep. 14, 31041 (2024).
- 10. Tammemägi, M. C. et al. Risk-based lung cancer screening performance in a universal healthcare setting. Nat. Med. 30, 1054–1064 (2024).
- 11. Liu, X. et al. Improving access to cardiovascular care for 1.4 billion people in China using telehealth. npj Digital Med. 7, 376 (2024).
- 12. Zheng, Q. et al. Large-scale long-tailed disease diagnosis on radiology images. Nat. Commun. 15, 10147 (2024).
- 13. Kim, C. et al. Transparent medical image AI via an image-text foundation model grounded in medical literature. Nat. Med. 30, 1154–1165 (2024).
- 14. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
- 15. Sun, Y., Wang, L., Li, G., Lin, W. & Wang, L. A foundation model for enhancing magnetic resonance images and downstream segmentation, registration and diagnostic tasks. Nat. Biomed. Eng. (2024).
- 16. Wang, J. et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med. (2024).
- 17. Wang, C. et al. Data-driven risk stratification and precision management of pulmonary nodules detected on chest computed tomography. Nat. Med. 30, 3184–3195 (2024).
- 18. Deng, Y. et al. CTSpine1K: a large-scale dataset for spinal vertebrae segmentation in computed tomography. arXiv:2105.14711 (2024).
- 19. Stewart, C. Number of computer tomography (CT) scanners in selected countries as of 2021 (2023). Accessed: 2024-12-29.
- 20. Suthakar, U., Magnoni, L., Smith, D. R., Khan, A. & Andreeva, J. An efficient strategy for the collection and storage of large volumes of data for computation. J. Big Data 3, 21 (2016).
- 21. Kanwal, S., Khan, F. Z., Lonie, A. & Sinnott, R. O. Investigating reproducibility and tracking provenance – a genomic workflow case study. BMC Bioinforma. 18, 337 (2017).
- 22. Mammoliti, A. et al. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat. Commun. 12, 5797 (2021).
- 23. DeWitt, P. E., Rebull, M. A. & Bennett, T. D. Open source and reproducible and inexpensive infrastructure for data challenges and education. Sci. Data 11, 8 (2024).
- 24. McDole, K. et al. In toto imaging and reconstruction of post-implantation mouse development at the single-cell level. Cell 175, 859–876.e33 (2018).
- 25. Boergens, K. M. et al. webKnossos: efficient online 3D data annotation for connectomics. Nat. Methods 14, 691–694 (2017).
- 26. Wiseman, S. A FAIR platform for data-sharing. Nat. Neurosci. 24, 1640 (2021).
- 27. Wiegand, T., Sullivan, G. J., Bjontegaard, G. & Luthra, A. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13, 560–576 (2003).
- 28. Sullivan, G. J., Ohm, J.-R., Han, W.-J. & Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 22, 1649–1668 (2012).
- 29. Bross, B. et al. Overview of the Versatile Video Coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 31, 3736–3764 (2021).
- 30. Yang, Y., Bamler, R. & Mandt, S. Improving inference for neural image compression. In Neural Information Processing Systems, vol. 33, 573–584 (2020).
- 31. Hinton, G. E. & Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Annual Conference on Computational Learning Theory, 5–13 (1993).
- 32. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. arXiv:1312.6114 (2013).
- 33. Yang, R. et al. Sharing massive biomedical data at magnitudes lower bandwidth using implicit neural function. Proc. Natl. Acad. Sci. 121, e2320870121 (2024).
- 34. Haralick, R. M., Shanmugam, K. & Dinstein, I. Textural features for image classification. IEEE Trans. Syst., Man, Cybern. SMC-3, 610–621 (1973).
- 35. Castellano, G., Bonilha, L., Li, L. M. & Cendes, F. Texture analysis of medical images. Clin. Radiol. 59, 1061–1069 (2004).
- 36. Yang, R. et al. SCI: a spectrum concentrated implicit neural compression for biomedical data. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 4774–4782 (2023).
- 37. Sitzmann, V., Martel, J. N. P., Bergman, A. W., Lindell, D. B. & Wetzstein, G. Implicit neural representations with periodic activation functions. In Neural Information Processing Systems, vol. 33, 7462–7473 (2020).
- 38. Wolterink, J. M., Zwienenberg, J. C. & Brune, C. Implicit neural representations for deformable image registration. In Proceedings of The 5th International Conference on Medical Imaging with Deep Learning, vol. 172 of Proceedings of Machine Learning Research, 1349–1359 (PMLR, 2022).
- 39. Rood, J. E., Maartens, A., Hupalowska, A., Teichmann, S. A. & Regev, A. Impact of the Human Cell Atlas on medicine. Nat. Med. 28, 2486–2496 (2022).
- 40. Goetz, L. H. & Schork, N. J. Personalized medicine: motivation, challenges, and progress. Fertil. Steril. 109, 952–963 (2018).
- 41. van Harten, L. D., Stoker, J. & Išgum, I. Robust deformable image registration using cycle-consistent implicit representations. IEEE Trans. Med. Imaging 43, 784–793 (2024).
- 42. Liu, J., Sun, H. & Katto, J. Learned image compression with mixed transformer-CNN architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14388–14397 (2023).
- 43. Shaker, A. et al. UNETR++: delving into efficient and accurate 3D medical image segmentation. IEEE Trans. Med. Imaging 43, 3377–3390 (2024).
- 44. Landman, B. et al. MICCAI multi-atlas labeling beyond the cranial vault – workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault – Workshop Challenge, vol. 5, 12 (2015).
- 45. Wallace, G. K. The JPEG still picture compression standard. IEEE Trans. Consum. Electron. 38, xviii–xxxiv (1992).
- 46. Skodras, A., Christopoulos, C. & Ebrahimi, T. The JPEG 2000 still image compression standard. IEEE Signal Process. Mag. 18, 36–58 (2001).
- 47. Kerbl, B., Kopanas, G., Leimkuehler, T. & Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42 (2023).
- 48. Zhang, X. et al. GaussianImage: 1000 FPS image representation and compression by 2D Gaussian splatting. In Computer Vision – ECCV 2024, 327–345 (Springer Nature Switzerland, Cham, 2025).
- 49. Lu, G. et al. DVC: an end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11006–11015 (2019).
- 50. Ji, Y. et al. AMOS: a large-scale abdominal multi-organ benchmark for versatile medical image segmentation. arXiv:2206.08023 (2022).
- 51. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), 590–597 (AAAI Press, 2019).
- 52. Knoll, F. et al. fastMRI: a publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning. Radiology: Artif. Intell. 2, e190007 (2020).
- 53. Zbontar, J. et al. fastMRI: an open dataset and benchmarks for accelerated MRI. arXiv:1811.08839 (2018).
- 54. Koyuncu, A. B., Jia, P., Boev, A., Alshina, E. & Steinbach, E. Efficient ContextFormer: spatio-channel window attention for fast context modeling in learned image compression. IEEE Trans. Circuits Syst. Video Technol. 34, 7498–7511 (2024).
- 55. Arezki, B., Mokraoui, A. & Feng, F. Efficient image compression using advanced state space models. In 2024 IEEE 26th International Workshop on Multimedia Signal Processing (MMSP), 1–6 (2024).
- 56. Yoon, U., Fonov, V. S., Perusse, D. & Evans, A. C. The effect of template choice on morphometric analysis of pediatric brain data. NeuroImage 45, 769–777 (2009).
- 57. Li, X. et al. Continuous spatial-temporal deformable image registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods. arXiv:2405.00430 (2024).
- 58. Cheng, Z., Sun, H., Takeuchi, M. & Katto, J. Learned image compression with discretized Gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020).
- 59. Liu, Z. et al. Swin Transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10012–10022 (2021).
- 60. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. arXiv:1412.6980 (2014).
- 61. Li, J. et al. Gaussian primitives for deformable image registration. arXiv:2406.03394 (2024).
- 62. Klein, S., Staring, M., Murphy, K., Viergever, M. A. & Pluim, J. P. W. elastix: a toolbox for intensity-based medical image registration. IEEE Trans. Med. Imaging 29, 196–205 (2010).
- 63. Avants, B., Epstein, C., Grossman, M. & Gee, J. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12, 26–41 (2008).
- 64. Heinrich, M. P., Handels, H. & Simpson, I. J. A. Estimating large lung motion in COPD patients by symmetric regularised correspondence fields. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 338–345 (Springer International Publishing, Cham, 2015).
- 65. Balakrishnan, G., Zhao, A., Sabuncu, M. R., Guttag, J. & Dalca, A. V. VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38, 1788–1800 (2019).
- 66. Kim, B. et al. CycleMorph: cycle consistent unsupervised deformable image registration. Med. Image Anal. 71, 102036 (2021).