Author manuscript; available in PMC: 2025 Dec 26.
Published in final edited form as: IEEE Trans Med Imaging. 2025 Oct;44(10):3973–3983. doi: 10.1109/TMI.2024.3482228

Core-Periphery Multi-Modality Feature Alignment for Zero-Shot Medical Image Analysis

Xiaowei Yu 1, Lu Zhang 2, Zihao Wu 3, Dajiang Zhu 4,*
PMCID: PMC12740343  NIHMSID: NIHMS2119674  PMID: 39418140

Abstract

Multi-Modality learning, exemplified by the language-image pair pre-trained CLIP model, has demonstrated remarkable performance in enhancing zero-shot capabilities and has gained significant attention recently. However, simply applying language-image pre-trained CLIP to medical image analysis encounters substantial domain shifts, resulting in severe performance degradation due to inherent disparities between natural (non-medical) and medical image characteristics. To address this challenge and uphold or even enhance CLIP’s zero-shot capability in medical image analysis, we develop a novel approach, Core-Periphery feature alignment for CLIP (CP-CLIP), to model medical images and corresponding clinical text jointly. To achieve this, we design an auxiliary neural network whose structure is organized by the core-periphery (CP) principle. This auxiliary CP network not only aligns medical image and text features into a unified latent space more efficiently but also ensures that the alignment is driven by principles of brain network organization. In this way, our approach effectively mitigates the performance degradation and further enhances CLIP’s zero-shot performance in medical image analysis. More importantly, the proposed CP-CLIP exhibits excellent explanatory capability, enabling the automatic identification of critical disease-related regions in clinical analysis. Extensive experiments and evaluation across five public datasets covering different diseases underscore the superiority of our CP-CLIP in zero-shot medical image prediction and critical feature detection, showing its promising utility for multimodal feature alignment in current medical applications.

Index Terms—: Zero-Shot, CLIP, Feature Alignment, Multi-Modality, Core-Periphery, Brain-inspired AI

I. Introduction

Multi-Modality learning has emerged as a promising approach to enhance the understanding and analysis of complex tasks by leveraging information from multiple sources. There has been a growing research interest focusing on integrating textual modalities into vision models [17] [22]. The synergy between image and text modalities offers mutual benefits, enhancing modeling and reasoning capabilities, and aligns closely with the multimodal perceptual environment of the human brain [21]. One notable advancement in this field is the pre-trained CLIP (Contrastive Language-Image Pre-training) model, which has demonstrated remarkable performance in various tasks by jointly learning from language and image data [13]. The CLIP model aligns image and text embeddings in the latent space through contrastive learning on a dataset comprising 400 million image-text pairs sourced from a diverse range of publicly accessible online platforms. The fusion of text and image modalities has significantly improved zero-shot capabilities, allowing the model to generalize to unseen tasks or domains without explicit training [14] [6] [23].

Nevertheless, despite its remarkable accuracy and feature extraction capability, CLIP’s zero-shot performance heavily relies on large-scale, high-quality image-text paired datasets [24]. Creating such datasets poses significant challenges, especially in specialized domains like healthcare and radiology, where data is not only scarce but often presents distinct patterns in both image and text components compared to natural images and text that CLIP is trained on [25]. That is, there exists a significant domain shift between natural (non-medical) and medical images [20]. Additionally, CLIP’s reliance solely on contrastive loss for extracting image and text features imposes limitations on its ability to align these features effectively [4] [19] [26]. Thus, there is an increasing need to enhance CLIP with additional mechanisms that can not only improve the multimodality feature alignment between image and text features but also leverage CLIP’s zero-shot capability on downstream tasks with limited datasets [27] [28] [29].

To address the zero-shot performance degradation of the CLIP model in the medical imaging domain, our strategy aims to improve the effectiveness of aligning multimodal features in latent space by developing a novel information exchange mechanism for neural networks. This mechanism is inspired by the Core-Periphery (CP) organization that universally exists in brain functional networks of humans and other mammals [1]. It has been widely confirmed that the CP organization can effectively promote the efficiency of information transmission and communication for biologically integrative processing [5] [53]. In general, CP organization is composed of two qualitatively distinct components: a dense “core” of nodes that are strongly interconnected with one another, allowing for integrative information processing and the rapid transmission of messages, and a sparse “periphery” of nodes that are sparsely connected to the core [30]. In this work, we aim to incorporate the CP principle into model design to effectively guide neural networks in aligning the text and image features extracted from CLIP, consolidating them into a unified latent space. In this way, our CP-CLIP can align the critical features from multimodal data in latent space more efficiently with limited paired image-text samples, thereby alleviating the performance degradation of the original CLIP and simultaneously facilitating the identification of potential disease-related regions. The main idea of CP-CLIP is shown in Fig. 1. In vanilla CLIP, the image and text are encoded into the image feature space and text feature space separately by different encoders. The feature alignment is achieved through a contrastive loss. Instead, our CP-CLIP further encodes the image embeddings and text embeddings into a unified core-periphery aligned feature space using the designed core-periphery principle guided neural network, and then implements contrastive learning on the core-periphery aligned image and text features.

Fig. 1:

The distinction between CLIP (top panel) and CP-CLIP (bottom panel) lies in their approach to feature alignment. CLIP utilizes a single contrastive learning loss to regulate feature alignment, while CP-CLIP first maps image and text embeddings to a unified core-periphery aligned feature space and then regulates feature alignment accordingly.

We have applied our proposed CP-CLIP to five public medical datasets under zero-shot scenarios, and the experimental results show that CP-CLIP improves the zero-shot performance of CLIP by 13.34% on ChestXray, 2.50% on SIIM-ACR, 6.04% on INbreast, 0.40% on CheXpert5×200, and 2.72% on TMED. Additionally, it effectively identifies critical disease-related regions. The primary contributions of this work are as follows:

  • We bridge deep neural networks and brain science by applying the brain-inspired Core-Periphery structure to neural network design, resulting in a core-periphery principle guided neural network.

  • We integrate the core-periphery principle guided neural network into the CLIP model, introducing CP-CLIP for aligning medical image and text features, thereby establishing effective multi-modal feature alignment.

  • We demonstrate the improved zero-shot performance of CP-CLIP through extensive experiments on five publicly accessible medical datasets, achieving enhanced zero-shot accuracy and the ability to identify disease-related areas. This also significantly enhances the interpretability of the CLIP model, enabling precise identification of disease-relevant regions.

II. Related Work

A. CP Structure

The core-periphery paradigm delineates a topology within a graph-theoretic framework where central (“core”) nodes exhibit a high degree of interconnectivity, whereas peripheral nodes maintain sparse connections, both among themselves and with the “core”. This structural archetype has been extensively recognized and employed across a multitude of disciplinary boundaries, such as sociological network analysis [31] [2], economic systems [32], and the biological sciences, where it serves to model the intricate lattice of protein interactions [33]. Within the realm of brain science, empirical evidence substantiates the existence of a core-periphery schema underpinning cerebral dynamics [1]. Functional neural networks are similarly corroborated to manifest this architecture [7]. A vanguard investigation has illuminated this core-periphery dichotomy in the human brain through an anatomical lens [34], positing that gyri and sulci, the salient morphological features of cortical folding, synergistically form a core-periphery network. Such an organization is posited to augment the efficacy of neural signal propagation.

The Core-Periphery (CP) principle proposes that in complex networks such as brain networks, nodes (or neurons, in the context of neural networks) are organized into a densely connected “core” and a sparsely connected “periphery”. In the context of common latent spaces, such as those used in multimodal models like CLIP, these dynamics can influence how different modalities (e.g., text and images) are integrated. Nodes in the core are highly interconnected and responsible for preserving the most critical information across different modalities. In a common latent space, core-core interactions could represent strong, consistent features that are shared across modalities, ensuring robust representation. Core-periphery interaction reflects the exchange of information between central and peripheral nodes. In latent spaces, core-periphery dynamics might help in mapping unique features of one modality onto the shared space, supporting cross-modal translation or transfer learning. Nodes in the periphery are less connected and often represent specialized or less critical information. In a common latent space, these interactions may correspond to modality-specific nuances or less prominent features that do not strongly influence the shared representation.

B. Contrastive Learning

Contrastive learning is a machine learning technique that trains models to distinguish between similar and dissimilar data samples [35]. It is particularly powerful in the field of unsupervised and self-supervised learning, where labeled data is scarce or expensive to obtain. The core idea behind contrastive learning is to learn representations by enforcing that similar or related samples stay close to each other in the representation space, while dissimilar or unrelated samples are pushed apart [36].

Contrastive learning hinges on the definition of positive and negative pairs. Positive pairs are data samples that are considered similar, such as different augmentations of the same image or sentence [37] [4] [38] [39]. Negative pairs, on the other hand, are dissimilar data samples, such as augmentations from different images or sentences. Contrastive learning then uses a contrastive loss function, such as the triplet loss [40], to optimize the distances between positive and negative pairs in the representation space. Contrastive learning has gained significant attention in the fields of computer vision and natural language processing due to its ability to leverage large amounts of unlabeled data effectively. In computer vision, it has been used to develop state-of-the-art models for image tasks without the need for extensively annotated datasets [41] [13].
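As a concrete illustration of the positive/negative-pair idea, the sketch below implements a standard triplet loss in PyTorch; the tensor names (`anchor`, `positive`, `negative`) and the margin value are illustrative assumptions rather than a specific formulation from [40].

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor toward its positive and push it away from its
    negative by at least `margin` in the embedding space."""
    d_pos = F.pairwise_distance(anchor, positive)   # distance to similar sample
    d_neg = F.pairwise_distance(anchor, negative)   # distance to dissimilar sample
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```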

C. Feature Alignment

Feature alignment refers to techniques that make features extracted from different datasets, or from different views of the same dataset, compatible with each other. In other words, encoded features can be matched with others in the feature space according to their similarities. This is crucial in domain shift scenarios, where zero-shot capability is desired on samples from new distributions that a pre-trained model never saw before. Feature alignment plays an important role in the generalization of multi-modality large language models [13]. Since the same concepts can be represented by different modalities, by processing and aligning these features, these models are able to understand information from multiple modalities and execute downstream tasks in new modalities.

One common idea of feature alignment is to learn a shared feature space where the representations of data from different modalities or views are indistinguishable. Techniques such as Canonical Correlation Analysis (CCA) [42] and Adversarial Discriminative Domain Adaptation (ADDA) [43] have been proposed to achieve this. These methods often minimize some measures of distance between feature distributions, such as the Maximum Mean Discrepancy (MMD), or leverage adversarial training to make the domains indistinguishable to a domain classifier.
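As a toy illustration of such a distance measure, the snippet below computes a linear-kernel MMD estimate between two feature batches; practical MMD-based alignment typically uses RBF or other characteristic kernels, so this is only a minimal sketch.

```python
import torch

def linear_mmd(features_a: torch.Tensor, features_b: torch.Tensor) -> torch.Tensor:
    """Squared MMD with a linear kernel: the squared distance between the
    mean embeddings of the two feature distributions."""
    return ((features_a.mean(dim=0) - features_b.mean(dim=0)) ** 2).sum()
```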

D. CLIP for Medical Imaging-Text

CLIP (Contrastive Language–Image Pre-training) [13] is a multimodal pre-training algorithm proposed by OpenAI, continuing the legacy of the GPT series in leveraging substantial scale to achieve remarkable performance. This model represents a multimodal paradigm, concurrently processing image and text modalities. The training objective is constructed through the computation of similarity between feature vectors from these two distinct modalities. To train this model, OpenAI amassed a dataset comprising over 400 million image-text pairs. Demonstrating exceptional performance across a variety of multimodal tasks, such as image retrieval and image classification, CLIP has shown that even through unsupervised learning, it can achieve performance comparable to that of mainstream supervised algorithms in many areas. The conceptual simplicity of CLIP, coupled with its impressive performance, underscores the substantial potential for future development in multimodal models [44].

In the rapidly evolving domain of medical imaging and diagnostics, the application of CLIP models has demonstrated significant potential [19] [45]. These models are adept at interpreting and understanding medical images in conjunction with textual descriptions, thus enabling nuanced recognition patterns that closely mimic the diagnostic process of medical professionals. However, the efficacy and robustness of such models are intrinsically tied to the volume and quality of the training data. Acquiring large-scale, annotated medical datasets therefore poses a critical challenge for successfully training CLIP models in the medical domain. These datasets should not only be large but also carefully curated to accurately represent the diverse range of clinical cases, ensuring an effective correlation between medical imaging features and their corresponding textual descriptions.

III. Method

An overview of the CP-CLIP framework is shown in Fig. 2. CP-CLIP comprises four essential components: the pre-trained image and text encoders, the generation of core-periphery graphs (CP graphs), the CP graph guided neural network, and the integration of the CP-guided neural network into CLIP. The details of each component are elaborated in the following sections.

Fig. 2:

The CP-CLIP framework. Part (a): The core-periphery principle guided feature alignment. Latent features extracted from the image and text encoders are subsequently projected into a unified latent space via the core-periphery principle guided multilayer perceptron neural network. Part (b): The communication of information and the pattern of neuron connections within the core-periphery principle guided neural network are determined by the generated core-periphery graphs.

A. Pre-trained Image and Text Encoders

To leverage the feature extraction ability of pre-trained models and reduce computational costs, we adopt the pre-trained ResNet50 as the default image encoder and the pre-trained BERT as the text encoder [50] [47]. In the training stage, each subject encompasses two modalities: a medical image and the corresponding clinical report. For each subject, we therefore have an image-text pair $(I, T)$.

Image Encoder

We encode the image into an embedding $u_I \in \mathbb{R}^D$ via the image encoder $E_I$. A projection head then maps the raw embedding to $u \in \mathbb{R}^P$:

$$u_I = E_I(I), \quad u = f_I(u_I) \tag{1}$$

where $f_I$ is the projection head of the image encoder.

Text Encoder

Similarly, we encode the clinical report into an embedding $u_T \in \mathbb{R}^M$ via the text encoder $E_T$, and then project $u_T$ to $t \in \mathbb{R}^P$, which can be formulated as:

$$u_T = E_T(T), \quad t = f_T(u_T) \tag{2}$$

where $f_T$ is the projection head of the text encoder. This results in an embedding dimension $P$ identical to that of the image encoder, making it well-suited for contrastive learning. In the inference stage, as clinical reports may not be available for zero-shot medical datasets, we use the prompt “An image of [mask].” for classification tasks, where “[mask]” represents possible disease labels.
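The prompt-based zero-shot inference can be sketched as follows; `encode_image` and `encode_text` are hypothetical wrappers around the encoders and projection heads above (in CP-CLIP, the resulting embeddings are additionally passed through the CP network introduced in the following subsections).

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Score an image against one prompt per candidate disease label and
    return the label with the highest cosine similarity."""
    prompts = [f"An image of {name}." for name in class_names]

    with torch.no_grad():
        u = F.normalize(encode_image(image.unsqueeze(0)), dim=-1)  # (1, P)
        t = F.normalize(encode_text(prompts), dim=-1)              # (C, P)
        logits = u @ t.t()                                         # (1, C)

    return class_names[logits.argmax(dim=-1).item()]
```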

B. Core-Periphery Graph Generation

The core-periphery neural network in CP-CLIP is controlled by the generated Core-Periphery graphs (CP graphs). We introduce the CP graph generation process, which generates a diverse range of CP graphs within the graph space defined by core ratios. The core ratio is defined as the proportion of core nodes to all nodes. It is worth mentioning that in a vanilla neural network, neurons are fully connected, meaning each neuron is connected to all other neurons. Therefore, the connections in a vanilla neural network can be represented by complete graphs, with a core ratio of 1.0. To generate graphs with a Core-Periphery (CP) property [31] [2], we define CP graphs as having nodes categorized into core and periphery nodes. Core nodes, acting as information integration hubs, are connected to all nodes, while periphery nodes are solely connected to core nodes. Denoting the total number of nodes as $N$ and the core ratio as $p \in (0, 1]$, we calculate the number of core nodes as $n = N \times p$ and the number of periphery nodes as $m = N \times (1 - p)$. Note that when the core ratio equals 1.0, the CP graphs degrade to complete graphs, implying that the CP-guided multilayer perceptron network reverts to its vanilla form.

Based on the above analysis, the adjacency matrix AN×N of the generated CP graphs can be expressed as:

$$A_{i,j} = \begin{cases} 1, & \text{if } i \le n \text{ or } j \le n \\ 0, & \text{if } i > n \text{ and } j > n \end{cases} \tag{3}$$

where 1 signifies the presence of an edge between nodes $i$ and $j$, and 0 indicates no edge between the nodes. By employing various core ratios, denoted by different combinations of $n$ and $m$, a wide range of candidate graphs can be generated within the graph space. Examples of CP graphs and a complete graph are shown in Fig. 3. As shown in the figure, the CP graphs exhibit different connection patterns under different core ratios.
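A minimal sketch of this CP graph generation is given below, assuming core nodes occupy the first $n$ indices and that $n$ is obtained by rounding $N \times p$; the function name is illustrative.

```python
import numpy as np

def cp_adjacency(num_nodes: int, core_ratio: float) -> np.ndarray:
    """Build the CP adjacency matrix: core nodes connect to every node,
    periphery nodes connect only to core nodes. A core ratio of 1.0
    recovers the complete graph of a vanilla fully connected layer."""
    n_core = max(1, round(num_nodes * core_ratio))
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    adj[:n_core, :] = 1.0   # rows of core nodes: edges to all nodes
    adj[:, :n_core] = 1.0   # columns of core nodes: edges from all nodes
    return adj

# Example: 10 nodes with core ratio 0.3 -> 3 core nodes, 7 periphery nodes;
# the lower-right 7x7 periphery-periphery block stays zero.
A = cp_adjacency(10, 0.3)
```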

Fig. 3:

Examples of Core-Periphery Graphs with different core ratios. The first row displays the graphs, while the second row shows their corresponding adjacency matrices. The core nodes are shown in red, and the periphery nodes are shown in blue. In the adjacency matrices, the white area denotes 0, indicating no connection, while the black area denotes 1, representing connections between nodes.

C. Core-Periphery Principle Guided Neural Network

In this work, we integrate the core-periphery principle into neural networks through the Core-Periphery Graph Guided Multilayer Perceptron (MLP) Neural Network. In CP graphs, core nodes maintain connections to all other nodes, while periphery nodes only connect to the core nodes. To integrate the CP principle into the organization of the multilayer perceptron neural network, we reschedule the neuron connections based on the generated CP graphs. Here, neurons are considered as nodes, and connections between neurons are regarded as edges. This approach allows us to represent neural networks as graphs and utilize the generated Core-Periphery (CP) graph to guide connections between the nodes. Following this representation paradigm, a complete graph can represent a vanilla multilayer perceptron network. Similarly, we incorporate the Core-Periphery principle into the multilayer perceptron architecture by substituting the complete graph with the generated CP graphs. The new connection rules can then be redefined: a CP graph can be represented by $G = (V, E)$, with node set $V = \{\nu_1, \dots, \nu_n\}$, edge set $E \subseteq \{(\nu_i, \nu_j) \mid \nu_i, \nu_j \in V\}$, and adjacency matrix $A$. The information exchange in the CP graph guided multilayer perceptron network for a specific node $i$ at the $r$-th layer is defined as:

$$z_i^{(r+1)} = \sigma^{(r)}\left(\sum_{j \in \mathcal{N}(i)} w_{ij}^{(r)} z_j^{(r)}\right) \tag{4}$$

where $\sigma(\cdot)$ is the activation function, $z_i^{(r+1)}$ and $z_j^{(r)}$ are the features stored in the nodes, $w_{ij}^{(r)}$ is the weight of the edge connecting nodes $i$ and $j$, and $\mathcal{N}(i)$ is the set of neighborhood nodes of node $i$. We can rewrite Eq. 4 in matrix form as:

$$Z^{(r+1)} = \sigma^{(r)}\left(\left(A \odot W^{(r)}\right) Z^{(r)}\right) \tag{5}$$

where $Z$ is the feature matrix, $W$ is the weight matrix, and $\odot$ denotes element-wise matrix multiplication.

Each node corresponds to one or multiple neurons. We propose the following neuron assignment pipeline to map the original neurons to the nodes [51]: for a CP graph with $N$ nodes, each node is assigned either $\lfloor M/N \rfloor + 1$ or $\lfloor M/N \rfloor$ neurons, where $M$ is the dimension of a specific layer. For example, if we utilize a CP graph with 5 nodes for a layer with 196 dimensions, the 5 nodes will have 40, 39, 39, 39, and 39 neurons, respectively. If, instead, we employ a CP graph with $M$ nodes for $M$ dimensions, each node corresponds to exactly one neuron.
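Putting Eq. (5) and the neuron assignment together, a CP-masked linear layer can be sketched as below; the class and helper names are hypothetical, and ReLU is assumed as the activation $\sigma$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def assign_neurons(num_neurons: int, num_nodes: int):
    """Split neurons across nodes as evenly as possible,
    e.g. 196 neurons over 5 nodes -> [40, 39, 39, 39, 39]."""
    base, extra = divmod(num_neurons, num_nodes)
    return [base + 1 if i < extra else base for i in range(num_nodes)]

def expand_mask(node_adj: torch.Tensor, out_dim: int, in_dim: int) -> torch.Tensor:
    """Expand the node-level CP adjacency to a neuron-level (out_dim, in_dim)
    weight mask by repeating each entry over its assigned neurons."""
    n = node_adj.shape[0]
    rows = torch.repeat_interleave(node_adj, torch.tensor(assign_neurons(out_dim, n)), dim=0)
    return torch.repeat_interleave(rows, torch.tensor(assign_neurons(in_dim, n)), dim=1)

class CPLinear(nn.Module):
    """Fully connected layer whose weights are masked by the CP graph,
    mirroring Z^(r+1) = sigma((A ⊙ W^(r)) Z^(r)) in Eq. (5)."""
    def __init__(self, in_dim: int, out_dim: int, node_adj: torch.Tensor):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.register_buffer("mask", expand_mask(node_adj, out_dim, in_dim))

    def forward(self, x):
        return F.relu(F.linear(x, self.linear.weight * self.mask, self.linear.bias))
```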

D. Core-Periphery Feature Alignment for CLIP

We integrate the CP graph guided neural network into the CLIP model to enhance the fusion of modality information from both images and texts, mapping them into a unified latent space. This framework, visually represented in Fig. 2, is termed CP-CLIP. After extracting features from images and texts using CLIP, their embeddings are fed into the CP network with shared weights. This process encourages the embeddings of both modalities to converge into a unified latent space, thereby facilitating the alignment of features between them.

For a mini-batch of images and texts $I$ and $T$, the embeddings extracted from CLIP are represented as $Z_I$ and $Z_T$. We refer to the core-periphery graph guided network as $f_{cp}$. Then, the CP network maps the text and image embeddings to a unified space as follows:

$$\hat{Z}_I = f_{cp}(Z_I), \quad \hat{Z}_T = f_{cp}(Z_T) \tag{6}$$

The logits are obtained from the cosine similarity between the embeddings of images and texts, which have been aligned by the CP network. This can be formulated as follows:

$$s = \hat{Z}_I \hat{Z}_T^{\top} \tag{7}$$

where $\top$ denotes the transpose. For an image $i$, the scaled cosine similarity is obtained by normalizing across the logits, which represent the scores or probabilities associated with the image’s relevance to various text descriptions:

$$y_{ij}^{I \to T} = \frac{\exp(s_{ij}/\tau)}{\sum_{j=1}^{N_{\text{batch}}} \exp(s_{ij}/\tau)} \tag{8}$$

where $\tau$ is the learnable temperature, similar to CLIP [13], and $j \in [1, N_{\text{batch}}]$ indexes the batch of texts. Likewise, we can compute $y_{ji}^{T \to I}$ and arrive at the loss function:

$$\mathcal{L} = -\frac{1}{2}\log y_{ij}^{I \to T} - \frac{1}{2}\log y_{ji}^{T \to I} \tag{9}$$
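Equations (6)-(9) amount to a CLIP-style symmetric cross-entropy over the CP-aligned embeddings, which can be sketched as follows; `cp_net` and `log_tau` are hypothetical handles for the shared CP network and the learnable temperature.

```python
import torch
import torch.nn.functional as F

def cp_clip_loss(z_img, z_txt, cp_net, log_tau):
    """Symmetric contrastive loss on CP-aligned embeddings (Eqs. 6-9).
    z_img, z_txt: (B, D) CLIP embeddings of B matched image-text pairs."""
    zi = F.normalize(cp_net(z_img), dim=-1)   # Eq. (6), image branch
    zt = F.normalize(cp_net(z_txt), dim=-1)   # Eq. (6), text branch

    logits = zi @ zt.t() / log_tau.exp()      # Eqs. (7)-(8): temperature-scaled cosine similarities
    targets = torch.arange(zi.size(0), device=zi.device)  # matched pairs lie on the diagonal

    # Eq. (9): average of the image-to-text and text-to-image cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```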

IV. Results

In this section, we conduct extensive experiments to evaluate the performance of the proposed CP-CLIP model on various medical image datasets under zero-shot scenarios. We begin by introducing the datasets utilized in this study and providing training details. Subsequently, we apply CP-CLIP to five medical datasets in a zero-shot manner for image classification tasks and to identify critical areas potentially related to diseases. Additionally, we conduct ablation studies to investigate the influence of different core ratios and different vision backbones.

A. Datasets and Training Details

  • MIMIC-CXR: The MIMIC Chest X-ray (MIMIC-CXR) database [11] offers a comprehensive public repository of chest radiographs in DICOM format, accompanied by free-text radiology reports. This extensive dataset comprises 377,110 images across 227,835 radiographic studies, positioning it as a pivotal resource for text-image pair analysis within the medical field.

  • ChestXray: The NIH ChestX-ray dataset [18] encompasses an extensive collection of 112,120 chest X-ray images derived from 30,805 distinct patients, each annotated with disease labels. Predominantly, this dataset focuses on two critical categories: pneumonia and normal states.

  • SIIM-ACR: The SIIM-ACR dataset [48] is facilitated by the Society for Imaging Informatics in Medicine (SIIM) in collaboration with the American College of Radiology (ACR). The dataset is composed of 12,047 radiographic images, among which 2,669 are annotated to delineate the presence or absence of pneumothorax, categorically differentiated into two distinct classes: normal lung function and collapsed lung, the latter indicative of pneumothorax.

  • INbreast: The INbreast database [12] is a mammographic database, with images acquired at the Breast Centre of Hospital de São João, Porto, Portugal. INbreast has a total of 115 cases (410 images), of which 90 cases are from women with both breasts (4 images per case) and 25 cases are from mastectomy patients (2 images per case).

  • CheXpert5×200: The CheXpert dataset [10] is a large public dataset for chest radiograph interpretation, designed for developing algorithms to automate the reading and interpretation of chest X-ray images. This dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. Labels were automatically extracted from radiology reports using an NLP tool. In this paper, we select 5 classes (atelectasis, cardiomegaly, consolidation, edema, pleural effusion), each with 200 chest radiographs, to form CheXpert5×200.

  • TMED: The TMED (Tufts Medical Echocardiogram Dataset) dataset [9] consists of 599 studies from 577 unique patients, some of whom underwent multiple studies on distinct days. All patients have an aortic stenosis (AS) diagnostic label (none, early AS, or significant AS). Additionally, some images from each study have view label annotations. After data cleaning, there are a total of 24,964 images available for zero-shot classification.

We initialize CP-CLIP with the pre-trained CLIP weights and subsequently train both CP-CLIP and CLIP models on the MIMIC-CXR dataset [11], which comprises medical text-image pairs. The image encoder in CLIP and the CP network are trained during fine-tuning, while the text encoder in CLIP remains frozen. Subsequently, we evaluate their zero-shot performance on five additional medical image datasets. Note that there are no available text descriptions for the other five datasets. For classification tasks, we create text descriptions in the format “An image of [mask].” Unlike supervised learning settings on the five datasets, we utilize all images in both the training and testing splits as testing images in the zero-shot scenarios. This significantly increases the size of the testing dataset, effectively demonstrating the efficacy of our CP-CLIP. The default number of layers for the CP graph guided MLP is set to 1. Training involves 10 epochs with a batch size of 128 on Titan GPUs, utilizing the AdamW optimizer and a cosine learning rate scheduler [49]. We search across a wide range of core ratios, from 0.1 to 1.0 with an interval of 0.1. Learning rates are 1e−8 for the CLIP model and 1e−3 for the CP network.
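A hedged sketch of this fine-tuning setup is shown below; `clip_model.image_encoder`, `clip_model.text_encoder`, and `cp_net` are hypothetical module names used only for illustration.

```python
import torch

def build_optimizer(clip_model, cp_net, epochs: int = 10):
    """Freeze the text encoder and train the image encoder and CP network
    with separate learning rates under AdamW and a cosine schedule."""
    for p in clip_model.text_encoder.parameters():
        p.requires_grad = False

    optimizer = torch.optim.AdamW([
        {"params": clip_model.image_encoder.parameters(), "lr": 1e-8},
        {"params": cp_net.parameters(), "lr": 1e-3},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Core ratios are swept from 0.1 to 1.0 in steps of 0.1, and the best-performing
# ratio is reported per dataset.
core_ratios = [round(0.1 * k, 1) for k in range(1, 11)]
```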

B. Classification Results

We evaluate the trained CP-CLIP model under a zero-shot setting to assess the model’s multimodal representation robustness and generalizability. Specifically, we select five unseen medical imaging datasets covering breast images, chest X-ray images, and ultrasound heart images, and compare CP-CLIP against two baselines: CLIP and MedCLIP. Note that MedCLIP was well trained on the MIMIC-CXR dataset [19]. The zero-shot classification results are presented in Table I. CP-CLIP, along with the two baselines, was trained on the MIMIC-CXR dataset. As depicted in Table I, compared to vanilla CLIP, CP-CLIP demonstrates superior classification accuracy across the datasets, showing a 13.34% improvement on ChestXray, a 2.50% improvement on SIIM-ACR, a 6.04% improvement on INbreast, a 0.40% improvement on CheXpert5×200, and a 2.72% improvement on TMED, highlighting the efficacy of the CP principle guided feature alignment. Compared to the competitive MedCLIP, which is optimized for medical data, our CP-CLIP also achieves higher or comparable classification accuracy on four of the five datasets.

TABLE I:

Comparison of zero-shot classification performance among CLIP, MedCLIP, and CP-CLIP across five medical image datasets. Pretrained ResNet50 and BERT are used as image and text encoders, respectively, in both CP-CLIP and CLIP. Results reported for CP-CLIP represent the best performance achieved across various core ratios. Balanced accuracy is reported in percentage. The best results are in bold.

Model ChestXray (2 classes) SIIM-ACR (2 classes) INbreast (3 classes) CheXpert5×200 (5 classes) TMED (3 classes)
CLIP 49.50 47.50 34.67 21.80 33.00
MedCLIP 68.31 50.00 33.56 12.90 33.33
CP-CLIP 62.84 50.00 40.71 22.20 35.72

C. Critical Area Identification

To investigate how the CP mechanism enhances feature alignment, we utilized Grad-CAM [15] visualization on images from various datasets to analyze how the model processes core information during inference. Here, we assume that the core information refers to the critical area closely related to the disease status. We present the visualization results in Figures 4, 5, and 6. As shown in Fig. 4, the original CLIP struggles to identify critical areas for disease identification. For instance, in the case of malignancy from the INbreast dataset, the area identified by CLIP is far from the cancer area and unrelated to breast cancer identification. In contrast, the CP-CLIP model effectively identifies the cancer area in the image, demonstrating effective feature alignment between the word “malignant” in the text and the cancer areas in the image. Another example is the benign case in INbreast: compared to CLIP, our CP-CLIP not only focuses on lesion regions but also considers minor areas for comprehensive reasoning. For chest X-ray images, such as ChestXray and SIIM-ACR, CLIP tends to focus on the spine, whereas our CP-CLIP model can adjust to focus on the lung areas. In the CheXpert5×200 dataset, depicted in Fig. 5, CLIP often identifies unrelated areas such as the spine and background. In contrast, our CP-CLIP effectively mitigates CLIP’s tendency for shortcuts and accurately identifies disease-related areas. In TMED, as depicted in Fig. 6, CP-CLIP effectively focuses on the aortic valves across three different clinical statuses, whereas CLIP fails to identify these critical areas.

Fig. 4:

Visualization comparison of CLIP and CP-CLIP on the ChestXray, SIIM-ACR, and INbreast datasets. The left column displays the original images, the middle column illustrates the critical areas identified by CLIP, and the right column showcases the critical areas identified by CP-CLIP. CP-CLIP highlights critical disease-related areas, while CLIP tends to identify shortcuts.

Fig. 5:

The visualization comparison between CLIP and CP-CLIP on the CheXpert5×200 dataset. The left column displays the original images, the middle column illustrates the critical areas identified by CLIP, and the right column showcases the critical areas identified by CP-CLIP.

Fig. 6:

The visualization comparison between CLIP and CP-CLIP on the TMED dataset. The left column displays the original images, the middle column illustrates the critical areas identified by CLIP, and the right column showcases the critical areas identified by CP-CLIP.

The identification of critical areas demonstrates the strong interpretability of the proposed CP-CLIP model. This interpretability helps explain why CP-CLIP achieves higher classification performance, as reported in Table I, by effectively identifying the crucial areas related to disease status. These visualization results further demonstrate the effectiveness of leveraging our core-periphery principle guided feature alignment in conjunction with contrastive loss, as opposed to using contrastive loss alone.
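For reference, a minimal Grad-CAM computation in the spirit of [15] can be sketched as below, assuming a convolutional image encoder; `target_layer` (e.g., the last convolutional block), the encoder handles, and the choice of the prompt-similarity score as the explained quantity are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(image, prompt_emb, image_encoder, cp_net, target_layer):
    """Heatmap of the regions that drive the similarity between the image
    embedding and a chosen class-prompt embedding."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    img_emb = F.normalize(cp_net(image_encoder(image.unsqueeze(0))), dim=-1)
    score = (img_emb * F.normalize(prompt_emb, dim=-1)).sum()  # cosine similarity to the prompt
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = torch.relu((weights * feats["a"]).sum(dim=1))    # channel-weighted feature map
    return cam / (cam.max() + 1e-8)                        # normalized saliency map
```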

D. Ablation Study

To comprehensively investigate the influence of different core ratios of CP graphs on classification performance, as well as the classification performance of different vision backbones, we conducted ablation studies on CP-CLIP with various core ratios and vision backbones. The results are presented in Fig. 7. As shown in Fig. 7, the impact of the core ratio on classification performance can vary depending on the dataset and model architecture: within the same vision backbone, different core ratios lead to varying zero-shot classification accuracy. For instance, in the upper left part of the RN50 backbone, the best core ratio on the ChestXray dataset is 0.7, whereas, on the INbreast dataset, the best core ratio is 0.5. We can also observe that different vision backbones impact the model performance. For example, among the four backbones, RN50 exhibits the best performance on the INbreast dataset in terms of the highest zero-shot accuracy, while ViT-B/32 demonstrates the best performance on the SIIM-ACR dataset, and RN101 performs the best on the CheXpert5×200 dataset. Another interesting finding is that feature alignment guided by a sparse CP graph usually outperforms feature alignment guided by a complete graph without sparsity. This is particularly evident as the best zero-shot accuracy tends to appear under conditions where the core ratio is smaller than 1.0. For example, the highest zero-shot accuracy with ViT-B/16 occurs at a core ratio of 0.7 rather than 1.0. These results demonstrate the superiority of our proposed CP-CLIP model, which outperforms the baselines across a wide range of core ratios. Additional zero-shot performance results for the original CLIP model across these five public datasets are presented in Table II. In Table II, CLIP is not fine-tuned on the MIMIC-CXR dataset, and our CP-CLIP again demonstrates superior performance compared to CLIP. We can conclude that CP-CLIP exhibits superiority over CLIP regardless of whether CLIP is fine-tuned on medical image datasets. We also present the best zero-shot accuracy achieved with different vision backbones on the five datasets in Table III for quantitative comparison.

Fig. 7:

The impact of varying core ratios on zero-shot classification accuracy across the five datasets under different vision backbones.

TABLE II:

Comparison of classification performance between CP-CLIP and original CLIP on medical image datasets using RN50 as the vision backbone. CP-CLIP results represent the best performance across various core ratios. Balanced accuracy is presented in percentage.

Model ChestXray (2 classes) SIIM-ACR (2 classes) INbreast (3 classes) CheXpert5×200 (5 classes) TMED (3 classes)
CLIP 50.00 50.00 33.26 20.00 33.36
CP-CLIP 62.84 50.00 40.71 22.20 35.72

TABLE III:

Classification comparison among various vision backbones. CP-CLIP results are based on optimal core ratios. Balanced accuracy is presented as a percentage.

Vision Backbone ChestXray (2 classes) SIIM-ACR (2 classes) INbreast (3 classes) CheXpert5×200 (5 classes) TMED (3 classes)
RN50 62.84 50.00 40.71 22.20 35.72
RN101 50.00 50.00 36.82 23.10 36.67
ViT-B/16 50.00 58.77 37.85 19.80 34.96
ViT-B/32 67.07 56.23 36.61 21.40 37.38

Furthermore, to evaluate the impact of different numbers of CP layers and to compare CP with other sparsity methods, we conducted ablation studies and present the results in Table IV and Table V, respectively. CP-CLIP demonstrated superior performance on the SIIM-ACR, INbreast, and TMED datasets when utilizing a single CP layer. In contrast, the model performed better on the ChestXray and CheXpert5×200 datasets with three CP layers. Additionally, CP sparsity outperformed the other two sparsity methods, namely random sparsity and the Watts-Strogatz method [52]. To demonstrate the effectiveness of implementing the contrastive loss on the core-periphery aligned space, we compared it against applying the contrastive loss on both the CP-aligned and CLIP spaces. The results are shown in Table VI. For most of the datasets, CP-CLIP achieved better performance with features solely aligned in the CP space, indicating that the features were well aligned by the core-periphery network.

TABLE IV:

Classification comparison with different numbers of CP layers using the RN50 as the vision backbone. Balanced accuracy is presented as a percentage.

CP Layer ChestXray (2 classes) SIIM-ACR (2 classes) INbreast (3 classes) CheXpert5×200 (5 classes) TMED (3 classes)
1 layer 62.84 50.00 40.71 22.20 35.72
2 layer 51.10 44.85 34.71 20.40 33.33
3 layer 73.71 48.29 33.54 23.30 33.33

TABLE V:

Classification comparison of Core-periphery with other sparse methods. Using RN50 as the vision backbone. Balanced accuracy is presented as a percentage.

Sparse method ChestXray (2 classes) SIIM-ACR (2 classes) INbreast (3 classes) CheXpert5×200 (5 classes) TMED (3 classes)
Core-periphery 62.84 50.00 40.71 22.20 35.72
Random Sparse 60.45 44.59 33.63 18.40 33.50
Watts-Strogatz 61.10 50.00 33.33 20.00 33.33

TABLE VI:

Classification comparison between contrastive loss on CP space versus CP and CLIP space. Using RN50 as the vision backbone. Balanced accuracy is presented as a percentage.

Alignment ChestXray (2 classes) SIIM-ACR (2 classes) INbreast (3 classes) CheXpert5×200 (5 classes) TMED (3 classes)
CP Space 62.84 50.00 40.71 22.20 35.72
CLIP Space 50.00 50.00 33.26 20.00 33.36
CP Space+CLIP space 65.81 46.44 40.70 19.90 33.20

V. Discussion and Limitation

Since our core-periphery principle-guided network is inspired by brain networks, where different networks responsible for various tasks (such as vision and movement) have different core ratios [34], we have adopted a similar approach by experimenting with different core ratios to find the optimal ratio for specific datasets or tasks.

In this paper, we assign artificial network neurons to core and periphery nodes using an existing neuron assignment strategy [51], which treats all neurons equally. This approach may not be optimal. Exploring improved neuron assignment strategies is a promising direction for future research. Future work could consider the importance of neurons, assigning more critical neurons to core nodes and less important ones to periphery nodes. This paper focuses on integrating the brain-inspired Core-periphery principle into CLIP rather than exploring different neuron assignment strategies.

VI. Conclusion

In this study, we introduce CP-CLIP, a novel framework that incorporates the core-periphery principle into the CLIP model to enhance multi-modality feature alignment in medical applications. CP-CLIP constructs an auxiliary core-periphery graph guided neural network specifically tailored for zero-shot medical image analysis. This auxiliary network enhances the fine-grained alignment between image and text embeddings, guiding the model’s attention to focus on crucial information. Experimental results across five diverse medical image datasets validate the effectiveness of CP-CLIP in medical image analysis.

Acknowledgments

This work was supported by the National Institutes of Health (R01AG075582 and RF1NS128534).

Contributor Information

Xiaowei Yu, The University of Texas at Arlington, Arlington, TX 76019, USA.

Lu Zhang, Department of Computer Science, Indiana University Indianapolis, IN 46202, USA.

Zihao Wu, School of Computing, University of Georgia, Athens, GA 30602 USA.

Dajiang Zhu, The University of Texas at Arlington, Arlington, TX 76019, USA.

References

  • [1].Bassett DS, Wymbs NF, Rombach MP, Porter MA, Mucha PJ and Grafton ST, “Task-based core-periphery organization of human brain dynamics,” in PLoS computational biology, vol. 9, no. 9, pp. e1003171, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Cattani G and Ferriani S, “A core/periphery perspective on individual creative performance: Social networks and cinematic achievements in the hollywood film industry,” in Organization science, vol. 19, no. 6, pp. 824–844, 2008. [Google Scholar]
  • [3].Holme P, “Core-periphery organization of complex networks,” in Physical Review E, vol. 72, no. 4, pp. 046111, 2005. [DOI] [PubMed] [Google Scholar]
  • [4].Chen T, Kornblith S, Norouzi M, and Hinton G, “A simple framework for contrastive learning of visual representations,” International conference on machine learning, pp. 1597–1607, Nov. 2020. [Google Scholar]
  • [5].Csermely P, London A, Wu L, and Uzzi B, “Structure and dynamics of core/periphery networks,” Journal of Complex Networks, vol. 1, no. 2, pp. 93–123, 2013. [Google Scholar]
  • [6].Esmaeilpour S, Liu B, Robertson E, and Shu L, “Zero-shot out-of-distribution detection based on the pre-trained model clip,” Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 6, pp. 6568–6576, 2022. [Google Scholar]
  • [7].Gu S, Xia CH, Ciric R, Moore TM, Gur RC, Gur RE, Satterthwaite TD and Bassett DS, “Unifying the notions of modularity and core–periphery structure in functional brain networks during youth,” Cerebral Cortex, vol. 30, no. 3, pp. 1087–1102, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Huang Z, Long G, Wessler B, and Hughes MC, “A new semi-supervised learning benchmark for classifying view and diagnosing aortic stenosis from echocardiograms,” Machine Learning for Healthcare Conference, pp. 614–647, 2021. [PMC free article] [PubMed] [Google Scholar]
  • [9].Huang Z, Long G, Wessler B, and Hughes MC, “TMED 2: a dataset for semi-supervised classification of echocardiograms,” Data-Perf: Benchmarking Data for Data-Centric AI Workshop, 2022. [Google Scholar]
  • [10].Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H et al. , “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 1, pp. 590–597, 2019. [Google Scholar]
  • [11].Johnson A, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C, Mark RG and Horng S, “MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports,” Scientific data, vol. 6, no. 1, pp. 317, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Moreira IC, Amaral I, Domingues I, Cardoso A, Cardoso MJ, and Cardoso JS, “Inbreast: toward a full-field digital mammographic database,” Academic Radiology, vol. 19, no. 2, pp. 236–248, 2012. [DOI] [PubMed] [Google Scholar]
  • [13].Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G et al. , “Learning transferable visual models from natural language supervision,” International conference on machine learning, pp. 8748–8763, 2021. [Google Scholar]
  • [14].Sanghi A, Hang C, Lambourne JG, Wang Y, Cheng C, Fumero M, and Malekshan KR, “Clip-forge: Towards zero-shot text-to-shape generation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18603–18613, 2022. [Google Scholar]
  • [15].Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, and Batra D, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” Proceedings of the IEEE international conference on computer vision, pp. 618–626. 2017. [Google Scholar]
  • [16].Wollek A, Graf R, Čečatka S, Fink N, Willem T, Sabel BO and Lasser T, “Attention-based saliency maps improve interpretability of pneumothorax classification,” Radiology: Artificial Intelligence, vol. 5, no. 2, pp. e220187, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y and Bashlykov N et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [Google Scholar]
  • [18].Wang X, Peng Y, Lu L, Lu Z, Bagheri M, and Summers RM, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. 2017. [Google Scholar]
  • [19].Wang Z, Wu Z, Agarwal D, and Sun J, “MedCLIP: Contrastive Learning from Unpaired Medical Images and Texts,” EMNLP, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Zhang L, Liu Z, Zhang L, Wu Z, Yu X, Holmes J, Feng H et al. , “Generalizable and promptable artificial intelligence model to augment clinical delineation in radiation oncology,” Medical Physics, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Zhao L, Zhang L, Wu Z, Chen Y, Dai H, Yu X, Liu Z et al. , “When brain-inspired ai meets agi,” Meta-Radiology, pp. 100005, 2023. [Google Scholar]
  • [22].Liu J, Dian R, Li S, and Liu H, “SGFusion: A saliency guided deep-learning framework for pixel-level image fusion,” Information Fusion, vol. 91, pp. 205–214, 2023. [Google Scholar]
  • [23].Lu H, Huo Y, Ding M, Fei N and Lu Z, “Cross-modal contrastive learning for generalizable and efficient image-text retrieval,” Machine Intelligence Research, vol. 20, no. 4, pp. 569–582, 2023. [Google Scholar]
  • [24].Guo Z, Zhang R, Qiu L, Ma X, Miao X, He X and Cui B, “Calip: Zero-shot enhancement of clip with parameter-free attention,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, pp. 746–754, 2023. [Google Scholar]
  • [25].You K, Gu J, Ham J, Park B, Kim J, Hong EK, Baek W and Roh B, “Cxr-clip: Toward large scale chest x-ray language-image pre-training,” International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 101–111, 2023. [Google Scholar]
  • [26].Gao P, Geng S, Zhang R, Ma T, Fang R, Zhang Y, Li H and Qiao Y, “Clip-adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024. [Google Scholar]
  • [27].Zhou Z, Lei Y, Zhang B, Liu L, and Liu Y, “Zegclip: Towards adapting clip for zero-shot semantic segmentation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11175–11185, 2023. [Google Scholar]
  • [28].Wang Z, Liang J, He R, Xu N, Wang Z, and Tan T, “Improving zero-shot generalization for clip with synthesized prompts,” Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3032–3042, 2023. [Google Scholar]
  • [29].Jiao S, Wei Y, Wang Y, Zhao Y, and Shi H, “Learning mask-aware clip representations for zero-shot segmentation,” Advances in Neural Information Processing Systems, vol. 36, pp. 35631–35653, 2023. [Google Scholar]
  • [30].Yanchenko E and Sengupta S, “Core-periphery structure in networks: A statistical exposition,” Statistic Surveys, vol. 17, pp. 42–74, 2023. [Google Scholar]
  • [31].Borgatti SP and Everett MG, “Models of core/periphery structures,” Social networks, vol. 21, no. 4, pp. 375–395, 2000. [Google Scholar]
  • [32].Kostoska O, Sonja S, Jovanovski P, and Kocarev L, “Core-periphery structure in sectoral international trade networks: A new approach to an old theory,” PloS one, vol. 15, no. 4, pp. e0229547, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Luo F, Li B, Wan X, and Scheuermann RL, “Core and periphery structures in protein interaction networks,” BMC bioinformatics, vol. 10, pp. 1–11, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Yu X, Zhang L, Dai H, Zhao L, Lyu Y, Wu Z, Liu T, and Zhu D, “Gyri vs. sulci: Disentangling brain core-periphery functional networks via twin-transformer,” arXiv preprint arXiv:2302.00146, 2023. [Google Scholar]
  • [35].Dosovitskiy A, Springenberg JT, Riedmiller M, and Brox T, “Discriminative unsupervised feature learning with convolutional neural networks,” Advances in neural information processing systems, vol. 27, 2014. [DOI] [PubMed] [Google Scholar]
  • [36].Tian Y, Krishnan D, and Isola P, “Contrastive multiview coding,” Computer Vision–ECCV 2020: 16th European Conference, pp. 776–794, 2020. [Google Scholar]
  • [37].Wu Z, Xiong Y, Yu SX, and Lin D, “Unsupervised feature learning via non-parametric instance discrimination,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733–3742, 2018. [Google Scholar]
  • [38].Bachman P, Hjelm RD, and Buchwalter W, “Learning representations by maximizing mutual information across views,” Advances in neural information processing systems, 2019. [Google Scholar]
  • [39].Mikolov T, Sutskever I, Chen K, Corrado GS, and Dean J, “Distributed representations of words and phrases and their compositionality,” Advances in neural information processing systems, 2013. [Google Scholar]
  • [40].Yu B, Liu T, Gong M, Ding C, and Tao D, “Correcting the triplet selection bias for triplet loss,” Proceedings of the European Conference on Computer Vision (ECCV), pp. 71–87, 2018. [Google Scholar]
  • [41].Zhang Y, Jiang H, Miura Y, Manning CD, and Langlotz CP, “Contrastive learning of medical visual representations from paired images and text,” Machine Learning for Healthcare Conference, pp. 2–25, PMLR, 2022. [Google Scholar]
  • [42].Andrew G, Arora R, Bilmes J, and Livescu K, “Deep canonical correlation analysis,” International conference on machine learning, pp. 1247–1255, PMLR, 2013. [Google Scholar]
  • [43].Tzeng E, Hoffman J, Saenko K, and Darrell T, “Adversarial discriminative domain adaptation,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7167–7176, 2017. [Google Scholar]
  • [44].Zhang S, Shi E, Wu L, Wang R, Yu S, Liu Z, Xu S, Liu T, and Zhao S, “Differentiating brain states via multi-clip random fragment strategy-based interactive bidirectional recurrent neural network,” Neural Networks, vol. 165, pp. 1035–1049, 2023. [DOI] [PubMed] [Google Scholar]
  • [45].Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, Wang Y, and Xie W, “Pmc-clip: Contrastive language-image pre-training using biomedical documents,” International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 525–536, 2023. [Google Scholar]
  • [46].Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M et al. , “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. [Google Scholar]
  • [47].Devlin J, Chang M, Lee K, and Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [Google Scholar]
  • [48].Society for imaging informatics in medicine: Siim-acr pneumothorax segmentation. https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation, 2019. [Google Scholar]
  • [49].Loshchilov I, and Hutter F, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016. [Google Scholar]
  • [50].He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016. [Google Scholar]
  • [51].You J, Leskovec J, He K, and Xie S, “Graph structure of neural networks,” International Conference on Machine Learning, pp. 10881–10891, 2020. [Google Scholar]
  • [52].Watts DJ and Strogatz SH, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998. [DOI] [PubMed] [Google Scholar]
  • [53].Yu X, Zhang L, Dai H, Lyu Y, Zhao L, Wu Z, Liu D, Liu T, and Zhu D, “Core-periphery principle guided redesign of self-attention in transformers,” arXiv preprint arXiv:2303.15569, 2023. [Google Scholar]
