Poultry Science. 2024 May 21;103(8):103765. doi: 10.1016/j.psj.2024.103765

CarcassFormer: an end-to-end transformer-based framework for simultaneous localization, segmentation and classification of poultry carcass defect

Minh Tran, Sang Truong, Arthur F. A. Fernandes, Michael T. Kidd, Ngan Le

Abstract

In the food industry, assessing the quality of poultry carcasses during processing is a crucial step. This study proposes an effective approach for automating the assessment of carcass quality without requiring skilled labor or inspector involvement. The proposed system is based on machine learning (ML) and computer vision (CV) techniques, enabling automated defect detection and carcass quality assessment. To this end, an end-to-end framework called CarcassFormer is introduced. It is built upon a Transformer-based architecture designed to effectively extract visual representations while simultaneously detecting, segmenting, and classifying poultry carcass defects. Our proposed framework is capable of analyzing imperfections resulting from production and transport welfare issues, as well as processing plant stunner, scalder, picker, and other equipment malfunctions.

To benchmark the framework, a dataset of 7,321 images was initially acquired, which contained both single and multiple carcasses per image. In this study, the performance of the CarcassFormer system is compared with other state-of-the-art (SOTA) approaches on the classification, detection, and segmentation tasks. Through extensive quantitative experiments, our framework consistently outperforms existing methods, demonstrating remarkable improvements across various evaluation metrics such as AP, AP@50, and AP@75. Furthermore, the qualitative results highlight the strengths of CarcassFormer in capturing fine details, including feathers, and accurately localizing and segmenting carcasses with high precision. To facilitate further research and collaboration, the source code and trained models will be made publicly available upon acceptance.

Key words: carcass defect, detection, segmentation, classification, defect automation

INTRODUCTION

Increased consumption of poultry products is expected to be central to global food security over the upcoming 30 yr, given the efficiency of poultry production and broad consumer acceptance. The Food and Agriculture Organization of the United Nations (2005/2007) has projected that poultry production will increase by more than 100 percent by the year 2050, with the tonnage of poultry products, primarily broiler chickens, surpassing 180 million tons from a current projection of just over 80 million tons Alexandratos and Bruinsma (2012). Numerous studies have demonstrated increasing annual poultry consumption rates, mainly due to relatively inexpensive price, nutritional value, and health benefits Elam (2022). In the U.S., broiler chicken efficiency of feed utilization has increased 7% from 2021 to the present at a similar slaughter age of between 47 and 48 d across the decade Council (March 18, 2021). With annualized increases in broiler production, concomitant increases in labor are necessary to maintain meat production supply chain efficiency. In addition to the costs of a larger workforce and workforce development, many poultry companies are suffering from labor shortages Kaminski (2020); Wu et al. (2022). Another drawback of relying on people in poultry processing is inconsistency in carcass evaluation. Many companies use assembly lines staffed by employees to inspect the quality of chicken carcasses, which leaves room for human error and can result in miscategorized carcass defects. As a result, numerous agriculture industries, including poultry production facilities and poultry processing plants, are researching and investing in automated robotic technologies to improve processing, labor wellbeing, and profit Ren et al. (2020); Ahlin (2022); Park et al. (2022). Further, numerous automation technologies have recently offered noticeable economic benefits to agricultural production Jin et al. (2021).

In the era of precision agriculture, Machine Learning (ML) and Computer Vision (CV) have emerged as high-performance computing technologies that are creating new opportunities to improve broiler management, production, and identification of processing defects with non-invasive, low-cost techniques Aydin (2017); Caldas-Cueva et al. (2021). In this study, the focus was on utilizing modern ML and CV, that is, Deep Learning, to analyze chicken carcasses after scalding, picking, and removal of head and feet in processing plants. Visual inspection is one of the most basic but also most important steps in controlling meat quality before the product is prepared, packaged, and distributed to the market. Image processing and classification within poultry processing plants can optimize such systems, in addition to heightening food safety. Hence, our proposed intelligent and automated system will analyze and improve poultry processing concomitantly with increased data acquisition. Our computer vision system functions as an automated detection model capable of classifying defects and contaminated carcasses. While detection, segmentation, and classification are widespread tasks in computer vision Dong et al. (2021); Zhou et al. (2021); Le et al. (2022), they have focused on domains such as autonomous driving Le et al. (2017b,c); Janai et al. (2020); Tong et al. (2020); Truong et al. (2022); Nguyen et al. (2022), surveillance Gabeur et al. (2020); Wray et al. (2021); Yamazaki et al. (2022); Vo et al. (2022), biometrics Le and Savvides (2016); Le et al. (2017a); Duong et al. (2019a,b); Quach et al. (2022), medical imaging Han et al. (2017); Le et al. (2018); Tran et al. (2022b); Le et al. (2023); Thang Pham et al. (2023); Nguyen et al. (2023), and amodal understanding Tran et al. (2022a), which mainly target humans, cars, generic objects, faces, and human organs. None of them target the analysis of poultry carcass condemnations and defects. One of the main reasons is the lack of publicly available data.

In the context of poultry carcass analysis, distinguishing between single and multiple carcasses in an image is a crucial step for accurate quality assessment. To achieve this, the problem was approached as an instance segmentation task, involving the localization of individual instances. Additionally, mask classification was performed to determine whether a single poultry carcass was defective or not. While per-pixel classification (e.g., FCN Long et al. (2015), Unet-based approaches Ronneberger et al. (2015); Zhou et al. (2018); Ibtehaz and Rahman (2020); Le et al. (2021); Tran et al. (2022b)) applies a classification loss to each output pixel and partitions an image into regions of different classes, mask classification (e.g., Mask-RCNN He et al. (2017), DETR Zhu et al. (2020)) predicts a set of binary masks, each associated with a single class prediction. In recent years, there has been significant growth in the adoption of the Transformer architecture Vaswani et al. (2017) for semantic segmentation tasks. This trend is underscored by numerous approaches that have leveraged Transformer models, demonstrating state-of-the-art performance in the field. Notable examples include DETR Carion et al. (2020), SegFormer Xie et al. (2021), Mask2Former Cheng et al. (2022), FASeg He et al. (2023a), and Mask DINO Li et al. (2023). In this paper, the question of how to simultaneously handle both mask classification and pixel-level classification is addressed. To address this question, we leverage the recent Transformer technique Vaswani et al. (2017) and propose CarcassFormer, which aims to simultaneously localize poultry carcasses on moving shackles, segment the poultry carcass body, and classify defective or contaminated carcasses. To develop CarcassFormer, an experiment was set up at the University of Arkansas-Agricultural Experiment Station Pilot Processing Plant on the poultry research farm by placing a camera adjacent to the shackles of carcasses moving along a processing line. Each poultry carcass in the view of the camera is analyzed by localizing it with a bounding box, segmenting its boundary, and classifying it to determine its imperfections. Any unapproved birds are then reworked. Notably, a bird is considered defective if it has one of the following issues: feathers, un-clean/dirty appearance, skin peel, broken wings, or broken legs. The annotation requirements follow instructions provided by the USDA.

Our contribution is 3-fold as follows:

  • Dataset: A dataset containing a total of 7,321 images of poultry carcasses collected at a pilot processing plant. The images in this diverse dataset contain real-world examples of chickens across a range of ages, sizes, and numbers of chickens per frame. The dataset has been carefully annotated by 3 experts.

  • Methodology: We propose CarcassFormer, an effective end-to-end Transformer-based framework for simultaneously localizing poultry carcass regions, segmenting carcass areas, and determining carcasses with imperfections. CarcassFormer is based on a Transformer-based Unet architecture.

  • Pretrained Models and Code: We will release our pre-trained model and source code of CarcassFormer for research purposes.

Related Work

Image Segmentation

Image segmentation is a critical computer vision task that involves dividing an image into different regions based on visual features. This process can be accomplished through either semantic segmentation or instance segmentation. Semantic segmentation categorizes pixels into multiple classes, for example, foreground and background, but does not differentiate between different object instances of the same class. Popular semantic segmentation models include the Fully Convolutional Network (FCN) Long et al. (2015) and its variants, such as the U-Net family Ronneberger et al. (2015); Zhou et al. (2018); Ibtehaz and Rahman (2020); Le et al. (2021), as well as the Pyramid Scene Parsing Network (PSPNet) Zhao et al. (2017) and DeepLabV3 Chen et al. (2018).

In contrast, instance segmentation aims to detect and segment individual objects by providing a unique segmentation mask for each object. There are 2 types of instance segmentation approaches: 2-stage and 1-stage methods. Two-stage approaches, such as top-down Cai and Vasconcelos (2018); Chen et al. (2019); Cheng et al. (2020) and bottom-up methods Arnab and Torr (2016); Chen et al. (2017); Newell et al. (2017), detect bounding boxes first and then perform segmentation within each region of interest. On the other hand, one-stage approaches, such as anchor-based methods Li et al. (2017); Bolya et al. (2019) and anchor-free methods Ying et al. (2019); Chen et al. (2020); Lee and Park (2020), perform both detection and segmentation simultaneously, resulting in lower time consumption. Anchor-based one-stage approaches generate class-agnostic candidate masks on candidate regions and extract instances from a semantic branch. However, these approaches rely heavily on predefined anchors, which are sensitive to hyper-parameters. To address this issue, anchor-free one-stage methods eliminate anchor boxes and use corner/center points instead. Moreover, based on their feature backbone and learning mechanism, various approaches to instance segmentation can be categorized into either Convolutional Neural Network (CNN)-based or Transformer-based approaches, as follows.

CNN-Based Instance Segmentation

The idea of “detect then segment” has dominated the instance segmentation task via 2-stage methods. In particular, Mask R-CNN He et al. (2017) is the most representative work. Based on the priority of detection and segmentation, there are 2 groups in this category: top-down methods and bottom-up methods. The former first predicts a bounding box for each object and then generates an instance mask within each bounding box O Pinheiro et al. (2015); He et al. (2017). The latter associates pixel-level projections with each object instance and adopts a postprocessing procedure to distinguish each instance Arnab and Torr (2016); Kong and Fowlkes (2018). While the top-down methods mainly rely on the detection results and are prone to systematic artifacts on overlapping instances, the bottom-up methods depend on the performance of postprocessing and tend to suffer from under-segmentation or over-segmentation problems Fathi et al. (2017). With a large amount of pixel-wise mask annotations, fully-supervised instance segmentation methods have achieved great performance. However, pixel-wise mask annotation is labor intensive (e.g., 22 h to label 1,000 segmentation masks Lin et al. (2014)). Thus, weakly-supervised Zhu et al. (2016); Zhou (2018) and semi-supervised Van Engelen and Hoos (2020) methods have been proposed. CNN-based image segmentation has been extended to multiple Computer Vision tasks, including amodal segmentation Li and Malik (2016), salient object detection Fan et al. (2019), human segmentation Zhang et al. (2019), and soft biometrics Luu et al. (2016). Surveys of CNN-based instance segmentation can be found in Hafiz and Bhat (2020); Gu et al. (2022).

Transformer in Computer Vision

The Transformer was first introduced by Vaswani et al. (2017) for language translation and obtained State-Of-The-Art (SOTA) results in many other language processing tasks. Recently, many models Carion et al. (2020); Liu et al. (2022); Li et al. (2022) have successfully applied the Transformer concept to computer vision and achieved promising performance. The core idea behind the transformer architecture Vaswani et al. (2017) is the self-attention mechanism, which captures long-range relationships; it has obtained state-of-the-art results in many Natural Language Processing (NLP) tasks. In addition, Transformers are well suited for parallelization, facilitating training on large datasets. The Transformer has been successfully applied to enrich global information in various Computer Vision tasks such as image recognition Dosovitskiy et al. (2020); Touvron et al. (2021), object detection Carion et al. (2020); Zhu et al. (2020); Sun et al. (2021), image segmentation Ye et al. (2019); Zheng et al. (2021); Tran et al. (2022a), action localization Vo et al. (2021, 2022), and video captioning Yamazaki et al. (2022, 2023). DETR Zhu et al. (2020) is the first model that uses the Transformer as an end-to-end, query-based object detector, with a bipartite-matching loss and a set prediction objective. Inspired by Zhu et al. (2020); Cheng et al. (2021), which use end-to-end prediction objectives and successfully address multiple tasks without modifying the architecture, loss, or training procedure, the merits of the Transformer were inherited and CarcassFormer was proposed. Our network is an end-to-end Transformer-based framework that simultaneously tackles both segmentation and classification tasks.

Transformer-based networks have also found application in addressing detection and segmentation challenges within poultry science. Lin et al. (2022) proposes a vision transformer model to screen the breeding performance of roosters by analyzing correlations between cockscomb characteristics and semen quality, aiming to overcome the time-consuming and error-prone nature of human-based screening. Hu et al. (2023) improves pig segmentation in farming environments using a grouped transformer attention module with Mask R-CNN networks and data augmentation. Zhao et al. (2023) proposes a real-time mutton multipart classification and detection method using Swin-Transformer. He et al. (2023b) presents Residual-Transformer-Fine-Grained (ResTFG), a model merging transformer and CNN for precise classification of 7 chicken Eimeria species from microscopic images.

MATERIALS AND METHODS

Data Collection

The data was collected at the University of Arkansas pilot processing plant (Fayetteville, AR). Multiple broiler chicken products at different ages were processed using standard commercial practices and following rigorous animal-handling procedures that are in compliance with federal and institutional regulations regarding proper animal care practices FASS (2010). The video-capturing system was set up in the area after feather picking and before chilling and evisceration. We decided on this system placement so that 3 common kinds of defects that can occur during normal processing could be evaluated, namely tearing of the skin, presence of feathers, and broken/disjointed bones.

To obtain the dataset, named CarcassDefect, a camera was set up in front of the shackle line, while a black curtain was hung behind the shackle. Videos were recorded at 10 frames per second. The camera setup is visualized in Figures 2 and 3. In the end, a total of 7,321 images were collected, comprising 4,279 single carcass images and 3,042 multiple carcass images. Figure 4 illustrates some images from our CarcassDefect dataset, which exhibits large diversity in carcass scale (small and large carcasses), the number of carcasses per image (a single carcass or multiple carcasses per image), and defect type (tearing of the skin, feathers, broken/disjointed bones), etc.

Figure 2. Camera setup for data collection. A black curtain is hung behind the shackle to provide a certain contrast to the carcasses. A camera is placed to capture the carcasses within the black curtain.

Figure 3. An overview image of the shooting location. The black curtain is hung on the wall behind the shackle.

Figure 4. Illustrations of the collected data, which comprise (a) a single carcass/instance per image/frame; (b) multiple carcasses/instances per image/frame; (c) carcasses/instances at different scales/resolutions. The carcasses exhibit various defects such as tearing of the skin, feathers, and broken/disjointed bones.

Data Annotation

Upon acquiring the video data, the next crucial step is to annotate the images extracted from the footage to generate training data for the detection, segmentation, and classification tasks. Annotation involves labeling each frame with bounding boxes for detection, masks for segmentation, and labels for classification. This process enables the computer vision system to learn from the annotated data, which enhances its ability to perform these tasks accurately and efficiently. The data annotation process is illustrated in Figure 5. The annotated data are saved in a JSON file and follow the COCO format Lin et al. (2014), as demonstrated in Figure 6; a minimal code sketch of this structure is also given after the list below. In this COCO format, the data are described as follows:

  • Categories: defined as “normal” and “defect,” the labels present in the dataset. The defect class is assigned to a carcass that has “feathers,” “broken wings,” “broken legs,” or “peeled skin.”

  • Images: frames extracted from recorded videos. images is a list of objects with metadata about each image. An object includes the following keys:
    • id: a unique identifier that differentiates each image within the list. It can be defined from the file name.
    • file_name: the name of the image file, as shown in the example in Figure 6.
    • width: the image width, for example, 950 pixels.
    • height: the image height, for example, 960 pixels.
    • date_captured: the date and time when the image was captured.
  • Annotations: contain all metadata about the labels related to an object, namely a bounding box, a segmentation mask, and a classification label.
    • id: the index of the instance.
    • image_id: the index of the corresponding image. This image_id corresponds to id in images.
    • category_id: the category id defined in categories. In our case, category_id is either '1' (normal) or '0' (defect).
    • iscrowd: set to 1 if there are multiple instances/carcasses in the image; otherwise set to 0 for a single instance/carcass in the image.
    • area: the area of the instance in the image.
    • bbox: the bounding box giving an object's location, represented as [xmin, ymin, width, height], where the (xmin, ymin) coordinates correspond to the top-left corner of the object and (width, height) are its width and height. In the example shown in Figure 6, xmin = 27, ymin = 0, width = 546, height = 731.
    • segmentation: the segmentation mask, specified by run-length encoded (RLE) values Golomb (1966).
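To make the structure described above concrete, the following is a minimal, hedged sketch of one image entry and one annotation entry in the COCO-style JSON file. The file name, date, area value, and RLE string are hypothetical placeholders; only the bbox and image size values are taken from the example in Figure 6.

```python
import json

# Hedged sketch of the COCO-style annotation structure described above.
# File name, date, area, and the RLE counts string are hypothetical placeholders.
coco_like = {
    "categories": [
        {"id": 0, "name": "defect"},
        {"id": 1, "name": "normal"},
    ],
    "images": [
        {
            "id": 1,
            "file_name": "frame_000001.jpg",   # hypothetical file name
            "width": 950,
            "height": 960,
            "date_captured": "2022-01-01 10:00:00",
        }
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,                     # refers to the id field in images
            "category_id": 1,                  # 1 = normal, 0 = defect
            "iscrowd": 0,                      # 0: single carcass in this image
            "area": 250000,                    # instance area (placeholder value)
            "bbox": [27, 0, 546, 731],         # [xmin, ymin, width, height], as in Figure 6
            "segmentation": {"size": [960, 950], "counts": "<RLE string>"},
        }
    ],
}

with open("carcass_defect_annotations.json", "w") as f:
    json.dump(coco_like, f, indent=2)
```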

Figure 5. An illustration of the data annotation process. Each frame from the recorded video is annotated with bounding boxes for detection, masks for segmentation, and defect labels for classification.

Figure 6. A demonstration of data annotation in a JSON file following the COCO format.

The data statistics of our CarcassDefect dataset are shown in Tables 1 and 2. Table 1 shows the distribution of images between the single carcass per image and multiple carcasses per image settings across the train, validation, and test splits. Table 2 shows the distribution of instances between the normal and defect categories for both settings.

Table 1.

The distribution of the images in the CarcassDefect dataset with regard to normal and defective carcasses, at both a single carcass per frame and multiple carcasses per frame.

           Single carcass per image   Multiple carcasses per image   Total
Trainset   3,017                      2,115                          5,132
Valset     754                        535                            1,289
Testset    508                        392                            900
Total      4,279                      3,042                          7,321

Table 2.

The distribution of the instances in the CarcassDefect dataset with regard to normal and defective carcasses, at both a single carcass per frame and multiple carcasses per frame.

           Single carcass per image   Multiple carcasses per image   Total
           Normal      Defect         Normal      Defect
Trainset   1,302       1,715          1,571       1,842              6,430
Valset     355         399            422         466                1,642
Testset    320         188            359         267                1,134
Total      1,977       2,302          2,352       2,575              9,206

Proposed Method

In the sections below, the proposed end-to-end transformer-based framework, termed CarcassFormer, is introduced for chicken carcass detection, segmentation, and carcass defect classification. Figure 1 illustrates the flowchart of our CarcassFormer network consisting of 4 key parts: Backbone, Pixel Decoder, Multi-Scale Transformer Encoder, and Masked-Attention Transformer Decoder. To train CarcassFormer, the stochastic gradient descent (SGD) optimizer was utilized with a learning rate of 0.0001 and a batch size of 4 over 100 epochs. The experiments were conducted using an Intel(R) Core(TM) i9-10980XE 3.00GHz CPU and a Quadro RTX 8000 GPU.
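As a rough illustration of the training configuration stated above (SGD optimizer, learning rate 0.0001, batch size 4, 100 epochs), the hedged PyTorch-style sketch below uses a tiny stand-in module and random tensors in place of the actual CarcassFormer model, its losses, and the CarcassDefect data pipeline, which are not reproduced here.

```python
import torch
import torch.nn as nn

# Hedged sketch of the stated optimization setup (SGD, lr = 1e-4, batch size 4,
# 100 epochs). A small convolution and random tensors stand in for the real
# CarcassFormer model, its losses, and the CarcassDefect data loader.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)           # stand-in for CarcassFormer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(100):
    images = torch.randn(4, 3, 256, 256)                    # dummy batch of 4 frames
    targets = torch.randn(4, 8, 256, 256)                   # dummy supervision signal
    loss = nn.functional.mse_loss(model(images), targets)   # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```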

Figure 1. Top: Overall flowchart of our proposed CarcassFormer consisting of 4 components: 1. network backbone; 2. pixel decoder; 3. mask-attention transformer decoder; 4. instance mask and class prediction. Bottom: details of the third component, the mask-attention transformer decoder.

Backbone

A backbone network, a foundational architecture employed for feature extraction, is typically pretrained on a variety of tasks and has demonstrated its effectiveness across various domains. AlexNet Krizhevsky et al. (2017) is regarded as the inaugural Deep Learning (DL) backbone. The VGG family, which includes VGG-16 and VGG-19 Simonyan and Zisserman (2014), is one of the most prevalent backbones utilized in computer science endeavors. In contrast to AlexNet and VGG, ResNets He et al. (2016) are based on Convolutional Neural Networks (CNNs) and were developed concomitantly with the introduction of residual networks. ResNet variants, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-151, are extensively employed for object detection and semantic segmentation tasks. Following the advent of ResNets, numerous other CNN-based backbones have been proposed, including Inception Szegedy et al. (2015), DenseNet Huang et al. (2017), DarkNet Lin et al. (2013), ShuffleNet Zhang et al. (2018), MobileNet Howard et al. (2017), and YOLO Huang et al. (2018). Recently, there has been a significant advancement in backbone architectures, incorporating the transformer architecture Vaswani et al. (2017), along with leveraging the multiscale features of ResNet. Prominent examples of these advancements include ViT Dosovitskiy et al. (2020), PvT Zhao et al. (2017), and Swin Liu et al. (2021). In the present work, ResNet (i.e., ResNet-18, 34, 50) He et al. (2016) and Swin (i.e., Swin-T) are employed as the backbone network.

By utilizing ResNets He et al. (2016) as the backbone network, an input image $I$ with dimensions $H \times W$ is transformed into a multi-scale feature set $F$, specifically a set of 4 feature maps. These feature maps are represented as $F_1 \in \mathbb{R}^{C_{F_1} \times \frac{H}{4} \times \frac{W}{4}}$, $F_2 \in \mathbb{R}^{C_{F_2} \times \frac{H}{8} \times \frac{W}{8}}$, $F_3 \in \mathbb{R}^{C_{F_3} \times \frac{H}{16} \times \frac{W}{16}}$, and $F_4 \in \mathbb{R}^{C_{F_4} \times \frac{H}{32} \times \frac{W}{32}}$, where $C_{F_1}, C_{F_2}, C_{F_3}, C_{F_4}$ denote the numbers of channels.
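As an illustration of the multi-scale feature extraction just described, the hedged sketch below uses torchvision's ResNet-50 and its public feature-extraction utility to produce the 4 feature maps at strides 4, 8, 16, and 32; the exact backbone wrapper used in CarcassFormer may differ.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Hedged sketch: extract the 4 multi-scale feature maps F1..F4 from a ResNet-50
# backbone at strides 4, 8, 16, and 32, as described in the text.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "F1", "layer2": "F2", "layer3": "F3", "layer4": "F4"},
)

image = torch.randn(1, 3, 512, 512)        # dummy input image of size H x W
features = extractor(image)
for name, f in features.items():
    print(name, tuple(f.shape))
# For ResNet-50, the channel counts C_F1..C_F4 are 256, 512, 1024, 2048 and the
# spatial sizes are H/4 x W/4, H/8 x W/8, H/16 x W/16, H/32 x W/32.
```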

Pixel Decoder

This module enhances the multi-scale features of an image by utilizing the 4 feature maps from the backbone. It consists of 2 parts: the multi-scale transformer encoder and the per-pixel embeddings module. In general, the multi-scale transformer encoder uses an attention mechanism to learn the correlation between the multi-scale feature maps $F_1, F_2, F_3, F_4$. This results in corresponding, richer encoded feature maps $D_1, D_2, D_3, D_4$. Meanwhile, the per-pixel embeddings module takes the encoded feature map $D_1$ to compute the per-pixel embeddings $\mathcal{E}_{pixel}$ of the image.

Multi Scale Transformer Encoder

This module takes the backbone features, ordered from low to high resolution (i.e., $F_4, F_3, F_2, F_1$), and processes them in a hierarchical fashion. These features first go through an embedding projection $f_E$ to obtain the flattened embedded features $S_i$ with a consistent channel size $C_e$. Note that the value of $C_e$ is equal to $C_{F_1}$; this is specifically intended for computing the per-pixel embeddings in the Perpixel Embeddings Module section.

$S_i = f_E(F_i)$ (1)

where $i \in \{4,3,2,1\}$, $S_i \in \mathbb{R}^{H_i W_i \times C_e}$, and $f_E$ is a $1 \times 1$ convolution layer with an output dimension of $C_e$, followed by a flatten layer. The purpose of the flatten layer is to prepare $S_i$ as input for a transformer layer, which requires a sequence of embedding features rather than spatial features.
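A minimal sketch of the embedding projection $f_E$ in Equation 1, assuming a PyTorch implementation: a $1 \times 1$ convolution maps the input channels to $C_e$, and the spatial dimensions are then flattened into a token sequence. The channel sizes are illustrative only.

```python
import torch
import torch.nn as nn

# Hedged sketch of f_E: a 1x1 convolution that maps a feature map F_i with C_Fi
# channels to C_e channels, followed by flattening of the spatial dimensions so
# the output is a sequence of H_i*W_i embedding vectors of size C_e.
class EmbeddingProjection(nn.Module):
    def __init__(self, in_channels: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.proj(feat)                    # (B, C_e, H_i, W_i)
        return x.flatten(2).transpose(1, 2)    # (B, H_i*W_i, C_e)

f_E = EmbeddingProjection(in_channels=2048, embed_dim=256)   # illustrative sizes
S_4 = f_E(torch.randn(1, 2048, 16, 16))                      # S_4 with shape (1, 256, 256)
```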

To investigate the correlated feature embeddings between different levels, the flattened embedding features $S_i$ from each multi-scale level are concatenated and passed through a transformer encoder. This involves merging the flattened embeddings from different levels into a single input sequence for the transformer encoder.

$S = [S_i]_{i \in \{4,3,2,1\}}$ (2)

where $S \in \mathbb{R}^{K \times C_e}$ and $K$ is the total number of embedding features.

However, since the current embedding features $S$ are flattened out of their original spatial shapes and concatenated from multiple levels, they do not include information about the spatial location and scale level of each feature. To address this issue, each embedding feature in $S$ is supplemented with 2 types of learnable encoding. The first is a positional encoding that provides spatial information about the original location of each feature within the image. The second is a level encoding that enables the transformer encoder to distinguish between features from different scales. By incorporating these encodings, spatial and scale-level information is preserved during the calculation process. Let the learnable positional encodings be denoted as $P$ and the learnable level encodings as $L$, where $P$ and $L$ share the same shape as $S$, i.e., $P, L \in \mathbb{R}^{K \times C_e}$.

Let the multi-scale transformer encoder be denoted as $f_G$; this transformer follows the architecture designed in Dosovitskiy et al. (2020). This transformer encoder produces learned features from the input sequence. It takes a sequence of embedded features and outputs encoded features that capture the relationships between the elements. These encoded features retain important information while removing redundancy. The encoder computes each encoded feature using a self-attention mechanism Vaswani et al. (2017), allowing it to selectively focus on the most relevant features and capture long-range dependencies, making it effective for enriching multi-scale features. Formally, the correlated feature embeddings between different levels, $E \in \mathbb{R}^{K \times C_e}$, are computed by passing $S$, $L$, and $P$ through $f_G$.

$E = f_G(S, P, L)$ (3)

The correlated feature embedding $E$ is divided into groups based on the multi-scale level, denoted as $E_i$ where $i \in \{4,3,2,1\}$ and $E_i \in \mathbb{R}^{H_i W_i \times C_e}$. Next, each $E_i$ is restored to its original spatial shape by unflattening, resulting in an output enriched multi-scale feature map $D_i$.

$D_i = \mathrm{unflatten}(E_i)$ (4)

where $D_i \in \mathbb{R}^{C_e \times H_i \times W_i}$.

In summary, this module takes the multi-scale feature maps $F_4, F_3, F_2, F_1$ and outputs the enriched multi-scale feature maps $D_1, D_2, D_3, D_4$, which capture the correlations and important information while removing redundancy.
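The following condensed, hedged sketch ties the steps of this module together: the flattened multi-scale embeddings are concatenated, learnable positional and level encodings are added, a transformer encoder is applied, and the output is split and unflattened back into the enriched maps $D_i$. A vanilla nn.TransformerEncoder and a fixed-size learnable positional table stand in for the exact encoder used in CarcassFormer, and the depth and sizes are illustrative.

```python
import torch
import torch.nn as nn

# Hedged sketch of the multi-scale transformer encoder f_G with learnable
# positional (P) and level (L) encodings; a vanilla nn.TransformerEncoder is a
# stand-in for the exact architecture used in CarcassFormer.
class MultiScaleEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_levels=4, max_tokens=2048):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)           # depth is illustrative
        self.pos_enc = nn.Parameter(torch.zeros(max_tokens, embed_dim))     # P
        self.level_enc = nn.Parameter(torch.zeros(num_levels, embed_dim))   # L

    def forward(self, flat_feats, spatial_shapes):
        # flat_feats: list of (B, H_i*W_i, C_e) tensors ordered F4..F1
        tokens, lengths = [], []
        for lvl, s in enumerate(flat_feats):
            tokens.append(s + self.level_enc[lvl])        # add level encoding
            lengths.append(s.shape[1])
        S = torch.cat(tokens, dim=1)                      # (B, K, C_e)
        S = S + self.pos_enc[: S.shape[1]]                # add positional encoding
        E = self.encoder(S)                               # correlated embeddings, Eq. (3)
        outs, start = [], 0                               # split per level and unflatten, Eq. (4)
        for (h, w), k in zip(spatial_shapes, lengths):
            E_i = E[:, start:start + k]
            outs.append(E_i.transpose(1, 2).reshape(E.shape[0], -1, h, w))
            start += k
        return outs                                       # enriched maps D_4..D_1

shapes = [(4, 4), (8, 8), (16, 16), (32, 32)]             # toy resolutions for F4..F1
flat = [torch.randn(1, h * w, 256) for (h, w) in shapes]
D = MultiScaleEncoder()(flat, shapes)                     # list of (1, 256, h, w) maps
```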

Perpixel Embeddings Module

This section describes the second stage of the Pixel Decoder, where the per-pixel embedding $\mathcal{E}_{pixel}$ is computed. This module takes the encoded feature map $D_1$ from the Multi Scale Transformer Encoder module as input. The per-pixel embedding $\mathcal{E}_{pixel}$ is computed as follows:

$\mathcal{E}_{pixel} = f_U(D_1)$ (5)

The function $f_U$ is a sequence of two $2 \times 2$ transposed convolutional layers with stride 2, which scales up the spatial shape of $D_1$ by a factor of 4 to the original image's spatial shape.

As a result, $\mathcal{E}_{pixel}$ has a dimension of $\mathbb{R}^{C_e \times H \times W}$. Intuitively, each pixel feature of $\mathcal{E}_{pixel}$ represents both the semantic and the mask classification feature of the corresponding pixel of the original image.
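A hedged sketch of $f_U$ under the description above: two $2 \times 2$ transposed convolutions with stride 2 that upsample $D_1$ from 1/4 of the input resolution back to $H \times W$; any normalization or activation layers are omitted.

```python
import torch
import torch.nn as nn

# Hedged sketch of f_U: two 2x2 transposed convolutions with stride 2 that
# upsample D_1 by a factor of 4, producing the per-pixel embedding E_pixel
# with C_e channels for every pixel of the original image.
C_e = 256
f_U = nn.Sequential(
    nn.ConvTranspose2d(C_e, C_e, kernel_size=2, stride=2),   # 1/4 -> 1/2 resolution
    nn.ConvTranspose2d(C_e, C_e, kernel_size=2, stride=2),   # 1/2 -> full resolution
)

D_1 = torch.randn(1, C_e, 128, 128)     # D_1 for a 512 x 512 input image
E_pixel = f_U(D_1)                      # shape (1, C_e, 512, 512)
```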

Mask-Attention Transformer Decoder

Mask Predictor

To predict the segmented masks of possible instances in an image, the per-pixel embeddings $\mathcal{E}_{pixel}$ were utilized. These embeddings represent both the semantic and mask classification features of each corresponding pixel of the original image.

Then, the prediction process involves learning $N$ per-segment query embeddings $Q \in \mathbb{R}^{N \times C_e}$, which represent the features of the maximum $N$ possible instances in the image. Each instance query embedding correlates with every single pixel feature in $\mathcal{E}_{pixel}$ to determine whether the pixel belongs to the corresponding instance or not. Therefore, the predicted instance segmentation masks are derived as follows:

$M = f_P(Q, \mathcal{E}_{pixel})$ (6)

where $M \in \mathbb{R}^{N \times H \times W}$ comprises the $N$ masks of the $N$ possible instances in the image. The Mask Predictor $f_P$ is a simple dot product over the feature channel $C_e$, followed by a sigmoid activation.
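A hedged sketch of the mask predictor $f_P$ in Equation 6: a dot product between the $N$ query embeddings and the per-pixel embeddings over the channel dimension $C_e$, followed by a sigmoid. The shapes are illustrative.

```python
import torch

# Hedged sketch of f_P: each of the N query embeddings is dotted with every
# per-pixel embedding over the channel dimension C_e, and a sigmoid turns the
# scores into N soft instance masks of size H x W.
def mask_predictor(Q: torch.Tensor, E_pixel: torch.Tensor) -> torch.Tensor:
    # Q: (B, N, C_e); E_pixel: (B, C_e, H, W) -> masks: (B, N, H, W)
    return torch.einsum("bnc,bchw->bnhw", Q, E_pixel).sigmoid()

Q = torch.randn(1, 100, 256)             # N = 100 queries, C_e = 256
E_pixel = torch.randn(1, 256, 512, 512)
M = mask_predictor(Q, E_pixel)           # (1, 100, 512, 512)
```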

Mask-Attention Transformer Decoder

The Mask-Attention Transformer Decoder $f_T$ was employed to obtain effective per-segment query embeddings $Q \in \mathbb{R}^{N \times C_e}$ that represent instances in the image. This decoder applies attention to the image features, allowing it to decode the per-segment query embeddings and capture the instance mask features.

The third component in our overall flowchart (Figure 1) illustrates the procedure of applying the Mask-Attention Transformer Decoder. In general, this module decodes $N$ per-segment query embeddings $Q \in \mathbb{R}^{N \times C_e}$ from the encoded feature maps $D_1, D_2, D_3, D_4$. These query embeddings represent the features of the maximum $N$ possible instances in the image. The decoding procedure is performed recurrently, with each step treated as a layer (denoted as $l$) and beginning at $l = 0$. Since the encoded feature maps $D$ have 4 levels, $D_1, D_2, D_3, D_4$, the recurrent process runs 4 times, progressing from the lowest to the highest resolution. At recurrent step $l$, the encoded feature map $D_{4-l}$ is used: $D_4$ at $l = 0$, $D_3$ at $l = 1$, $D_2$ at $l = 2$, and $D_1$ at $l = 3$. At each layer, the queries $Q_{l+1}$ are decoded from the previous layer's queries $Q_l$ and the corresponding encoded feature map.

Additionally, a predicted mask $M_l$ is computed using the current query embeddings $Q_l$ and the per-pixel embeddings $\mathcal{E}_{pixel}$. The resulting mask is then interpolated to the same size as the current feature map $D_{4-l}$. This mask is used as an attention mechanism that helps the query embeddings focus on the most salient parts of the feature maps. Specifically, during the decoding process, the attention mask is applied to the encoded feature map $D_{4-l}$, allowing the query embeddings to selectively attend to the regions of the feature map that are most relevant to the instance being decoded. Formally, at each recurrent step:

$M_l = f_P(Q_l, \mathcal{E}_{pixel})$
$Q_{l+1} = f_T(Q_l, D_{4-l}, M_l)$ (7)
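The recurrent procedure in Equation 7 can be sketched as the loop below. A single nn.MultiheadAttention layer stands in for the full masked-attention decoder layer $f_T$ (the real layer also contains self-attention and a feed-forward block), and the 0.5 threshold used to binarize the attention mask is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the recurrent masked-attention decoding in Equation 7.
C_e, N, num_heads = 256, 100, 8
cross_attn = nn.MultiheadAttention(C_e, num_heads, batch_first=True)   # stand-in for f_T

def mask_predictor(Q, E_pixel):                                        # f_P: dot product + sigmoid
    return torch.einsum("bnc,bchw->bnhw", Q, E_pixel).sigmoid()

def decode(Q, D_list, E_pixel):
    # D_list holds the encoded maps ordered [D4, D3, D2, D1] (low -> high resolution)
    for D in D_list:
        M = mask_predictor(Q, E_pixel)                                 # M_l at full resolution
        M = F.interpolate(M, size=D.shape[-2:], mode="bilinear", align_corners=False)
        attn_mask = M.flatten(2) < 0.5                                 # True = do not attend
        attn_mask[attn_mask.all(dim=-1)] = False                       # avoid fully-masked queries
        attn_mask = attn_mask.repeat_interleave(num_heads, dim=0)      # one copy per attention head
        kv = D.flatten(2).transpose(1, 2)                              # (B, H_i*W_i, C_e)
        Q = Q + cross_attn(Q, kv, kv, attn_mask=attn_mask)[0]          # Q_{l+1}
    return Q

Q_0 = torch.randn(1, N, C_e)
E_pixel = torch.randn(1, C_e, 256, 256)
D_list = [torch.randn(1, C_e, s, s) for s in (8, 16, 32, 64)]          # D4..D1 (toy sizes)
Q_final = decode(Q_0, D_list, E_pixel)
```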

Instance Mask and Class Prediction

The procedure can be visualized using the fourth component of the overall flowchart, as shown in Figure 1. In this step, the decoded query embeddings $Q_L$ (where $L = 3$) and the per-pixel embeddings $\mathcal{E}_{pixel}$ are utilized to compute the output instance segmentation masks, denoted as $M_{final}$. These masks, represented by the tensor $M_{final}$, correspond to the $N$ possible instances within the image.

To generate the masks, the function $f_P$ takes $Q_L$ and $\mathcal{E}_{pixel}$ as inputs:

$M_{final} = f_P(Q_L, \mathcal{E}_{pixel})$ (8)

Moreover, alongside the masks, the semantic class of each instance is predicted using another function, $f_C$. This function is implemented as a Multi-Layer Perceptron (MLP) with 2 hidden layers. It takes the per-segment embeddings $Q_L$ as input and produces $N$ semantic class predictions, represented by $C_{final} \in \mathbb{R}^{N \times C}$. Here, $C$ represents the number of semantic categories.

The prediction of semantic classes can be expressed as:

$C_{final} = f_C(Q_L)$ (9)

By combining these steps, this module is able to generate both the instance segmentation masks ($M_{final}$) and the semantic class labels for each instance ($C_{final}$) using the decoded instance queries ($Q_L$) and the per-pixel embeddings ($\mathcal{E}_{pixel}$).
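A hedged sketch of the two prediction heads described above: $f_C$ as an MLP with 2 hidden layers applied to the decoded queries $Q_L$, and $f_P$ reused for the final masks. The hidden width and the number of classes (C = 2 for normal and defect) are illustrative.

```python
import torch
import torch.nn as nn

# Hedged sketch of the final prediction step: the class head f_C is an MLP with
# 2 hidden layers over the decoded queries Q_L, and the mask head f_P is the
# same dot-product-plus-sigmoid predictor as before.
C_e, N, num_classes = 256, 100, 2
f_C = nn.Sequential(
    nn.Linear(C_e, C_e), nn.ReLU(),
    nn.Linear(C_e, C_e), nn.ReLU(),
    nn.Linear(C_e, num_classes),
)

def f_P(Q, E_pixel):
    return torch.einsum("bnc,bchw->bnhw", Q, E_pixel).sigmoid()

Q_L = torch.randn(1, N, C_e)                # decoded instance queries
E_pixel = torch.randn(1, C_e, 512, 512)     # per-pixel embeddings
M_final = f_P(Q_L, E_pixel)                 # (1, N, H, W) instance masks
C_final = f_C(Q_L)                          # (1, N, C) class logits
```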

Metrics

We adopt Average Precision (AP) to evaluate the method. AP quantifies how well the model is able to precisely locate and classify objects (e.g., defect or normal) within an image. The AP computation from MSCOCO Lin et al. (2014) was followed.

In the recognition task, each image is associated with a single prediction for classification, so evaluating the model is straightforward: the accuracy metric measures the ratio of correct predictions. In the field of object detection and classification, on the other hand, a prediction comprises a bounding box or a segmentation mask that locates the object, along with the predicted category for that object. To determine whether a prediction is correct, 2 criteria are considered. First, the prediction must have an Intersection over Union (IoU) value greater than a threshold $\epsilon$ when compared to the ground-truth box or mask of the object. Second, the prediction must correctly classify the category of the object. In addition, for each image, a method can output multiple predictions, and the number of predictions can be higher or lower than the number of actual objects within the image. Thus, precision and recall metrics are taken into account. Precision (Pre) is the ratio of correct predictions to the total number of predictions; it measures how precise the model's predictions are in terms of correctly detecting objects. Recall (Rec) is the ratio of correct predictions to the total number of actual objects within the image; it measures how comprehensive the model's predictions are in terms of capturing all the objects present.

Specifically, AP is computed at different IoU thresholds $\epsilon$, which determine what is considered a correct prediction. For instance, when the threshold $\epsilon$ is set to 50%, the metric is denoted AP@50. Consider an image with a list of ground-truth objects $A = \{a_1, a_2, ..., a_n\}$ and a method that generates $m$ predictions $B = \{b_1, b_2, ..., b_m\}$. The predictions are sorted in descending order of their confidence scores. The sorted list $B$ is iterated through, and at each step the correctness of the prediction $b_i$ (where $i \in \{1, 2, ..., m\}$) is determined by checking whether the category is correctly matched and the IoU is greater than the specified threshold $\epsilon$. The number of correct predictions up to this step is tracked as $C_i$, from which the precision $Pre_i$ and recall $Rec_i$ at each step are computed. The iteration stops when $Rec_i = 1$, indicating that all the objects have been captured, or when all the predictions have been iterated through. AP@$\epsilon$ is then computed as follows:

$AP@\epsilon = \int_0^1 Pre(Rec) \, dRec$ (10)

The reported AP in our tables is the average of the AP values from AP@50 to AP@95 with a step size of 5%, as depicted in Equation 11. This provides a comprehensive evaluation of the model's performance across different IoU thresholds and recall levels.

$AP = \frac{1}{10} \sum_{\epsilon \in \{0.50, 0.55, \ldots, 0.95\}} AP@\epsilon$ (11)
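The averaging in Equations 10 and 11 can be illustrated with the short sketch below, which integrates precision over recall for one IoU threshold and then averages over the thresholds 0.50 to 0.95 in steps of 0.05. The precision and recall values used here are illustrative placeholders, not results from the paper.

```python
import numpy as np

# Hedged sketch of Equations 10 and 11: AP@eps is the area under the
# precision-recall curve at IoU threshold eps, and the reported AP is the mean
# of AP@eps over eps = 0.50, 0.55, ..., 0.95.
def ap_at_threshold(recall: np.ndarray, precision: np.ndarray) -> float:
    order = np.argsort(recall)
    r, p = recall[order], precision[order]
    # trapezoidal approximation of the integral of Pre(Rec) dRec
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

thresholds = np.arange(0.50, 0.951, 0.05)
recall = np.linspace(0.0, 1.0, 11)
precision = np.linspace(1.0, 0.6, 11)       # dummy precision values for every threshold
ap = np.mean([ap_at_threshold(recall, precision) for _ in thresholds])
print(f"AP averaged over {len(thresholds)} IoU thresholds: {ap:.3f}")
```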

We also quantify the model's complexity using 3 key metrics: the number of floating-point operations (FLOPs), the count of model parameters (Params), and the frames processed per second (FPS). FLOPs are computed as an average over 100 testing images. FPS is evaluated on a Quadro RTX 8000 GPU with a batch size of 1, calculated as the average runtime across the entire validation set, inclusive of postprocessing time.

RESULTS AND DISCUSSION

Implementation Details

The implementation of the pixel decoder in this study involves the use of an advanced multi-scale deformable attention Transformer as described in Zhu et al. (2020) (Deformable DETR). Specifically, it is applied to feature maps with resolutions of 1/8, 1/16, and 1/32. A simple upsampling layer with a lateral connection is then employed on the final 1/8 feature map to generate the per-pixel embedding feature map at resolution 1/4. The Transformer encoder used in this study is configured with $L = 3$ and a set of 100 queries.

Quantitative Performance and Comparison

In this section, our proposed CarcassFormer was evaluated on the CarcassDefect dataset (section Data Collection), which consists of 2 subsets corresponding to a single carcass per image and multiple carcasses per image. The performance of CarcassFormer on various metrics, as shown in Table 3 and Table 4 for different tasks, is reported in detail. CarcassFormer is then compared with both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), as well as Transformer-based networks, namely Mask DINO Li et al. (2023), Mask2Former Cheng et al. (2022), and QueryInst Fang et al. (2021). The comparison was conducted using 3 different backbone networks: ResNet-34, ResNet-50, and Swin-T. The performance for the 2 subsets is reported: a single carcass per image (Table 5, Table 6) and multiple carcasses per image (Tables 7, 8, and 9). For each table, the metrics for detection, classification, segmentation, and model complexity are reported, as defined in section Metrics.

Table 3.

Detailed performance of CarcassFormer on the single carcass per image dataset for both detection and segmentation, where APnormal and APdefect include classification results.

Backbone Task AP AP50 AP75 AP95 APnormal APdefect Params FLOPs FPS
ResNet 34 Detection 97.70 98.23 98.23 92.89 98.02 97.38 41M 274G 5.1
Segmentation 99.22 99.22 99.22 99.22 100.00 98.45
ResNet 50 Detection 95.18 95.18 95.18 95.18 94.02 96.34 41M 274G 5.1
Segmentation 98.43 98.43 98.43 98.43 99.79 97.06
Swin-T Detection 95.69 95.79 95.93 95.32 94.23 97.15 46M 281G 4.5
Segmentation 97.77 98.65 98.93 98.15 99.11 96.42

Table 5.

Performance comparison of CarcassFormer with both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), and transformer-based networks, namely QueryInst Fang et al. (2021), Mask2Former Cheng et al. (2022), and Mask DINO Li et al. (2023), on the detection, classification, and segmentation tasks. The comparison is conducted with the ResNet-34 backbone network in the case of a single carcass per image.

Net. Method Venue | Detection and classification: AP, APnormal, APdefect | Segmentation: AP, APnormal, APdefect | Model complexity: Params, FLOPs, FPS
CNN-based Mask R-CNN He et al. (2017) 2017 79.73 87.82 71.65 81.36 85.41 77.31 41M 204G 5.8
HTC Chen et al. (2019) 2019 89.00 95.30 82.60 82.30 86.30 78.40 109M 290G 4.3
Transformer-based QueryInst Fang et al. (2021) 2021 90.40 98.00 82.90 82.20 88.10 76.30 45M 279G 4.8
Mask2Former Cheng et al. (2022) 2022 58.33 58.42 58.24 75.32 92.08 58.57 41M 272G 5.2
Mask DINO Li et al. (2023) 2023 88.12 95.11 81.12 77.67 83.45 71.89 49M 278G 5.1
CarcassFormer (Ours) 97.70 98.02 97.38 99.22 100.00 98.45 41M 274G 5.1

Net. denotes Network architecture.

Table 6.

Performance comparison of CarcassFormer with both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), and transformer-based networks, namely QueryInst Fang et al. (2021), Mask2Former Cheng et al. (2022), and Mask DINO Li et al. (2023), on the detection, classification, and segmentation tasks. The comparison is conducted with the ResNet-50 backbone network in the case of a single carcass per image.

Net. Method Venue | Detection and classification: AP, APnormal, APdefect | Segmentation: AP, APnormal, APdefect | Model complexity: Params, FLOPs, FPS
CNN-based Mask R-CNN He et al. (2017) 2017 80.35 90.15 70.56 84.19 88.39 80.00 44M 207G 5.3
HTC Chen et al. (2019) 2019 88.10 96.20 80.00 84.30 89.00 79.70 112M 294G 3.9
Transformer-based QueryInst Fang et al. (2021) 2021 64.60 75.50 53.60 72.70 78.60 66.70 48M 281G 4.4
Mask2Former Cheng et al. (2022) 2022 85.05 91.23 78.87 85.11 91.23 78.99 44M 276G 4.8
Mask DINO Li et al. (2023) 2023 85.12 91.44 79.10 86.13 92.11 80.15 52M 280G 4.6
CarcassFormer (Ours) 95.18 94.02 96.34 98.43 99.79 97.06 44M 278G 4.6

Net. denotes Network architecture. The best score in each table is highlighted in bold.

Table 7.

Performance comparison of CarcassFormer with both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), and transformer-based networks, namely QueryInst Fang et al. (2021), Mask2Former Cheng et al. (2022), and Mask DINO Li et al. (2023), on the detection, classification, and segmentation tasks. The comparison is conducted with the ResNet-34 backbone network in the case of multiple carcasses per image.

Net. Method Venue | Detection and classification: AP, APnormal, APdefect | Segmentation: AP, APnormal, APdefect | Model complexity: Params, FLOPs, FPS
CNN-based Mask R-CNN He et al. (2017) 2017 77.08 84.33 69.83 74.81 79.00 70.63 41M 204G 5.8
HTC Chen et al. (2019) 2019 77.80 89.70 65.90 74.00 79.10 68.90 109M 290G 4.3
Transformer-based QueryInst Fang et al. (2021) 2021 84.10 89.60 78.70 83.20 87.70 78.70 45M 279G 4.8
Mask2Former Cheng et al. (2022) 2022 53.86 54.00 53.72 71.69 85.39 58.00 41M 272G 5.2
Mask DINO Li et al. (2023) 2023 68.44 74.55 62.33 78.80 88.26 69.33 49M 278G 5.1
CarcassFormer (Ours) 89.72 93.48 85.96 98.23 98.77 97.68 41M 274G 5.1

Net. denotes Network architecture. The best score in each table is highlighted in bold.

Table 8.

Performance comparison of CarcassFormer with both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), and transformer-based networks, namely QueryInst Fang et al. (2021), Mask2Former Cheng et al. (2022), and Mask DINO Li et al. (2023), on the detection, classification, and segmentation tasks. The comparison is conducted with the ResNet-50 backbone network in the case of multiple carcasses per image.

Net. Method Venue | Detection and classification: AP, APnormal, APdefect | Segmentation: AP, APnormal, APdefect | Model complexity: Params, FLOPs, FPS
CNN-based Mask R-CNN He et al. (2017) 2017 78.76 85.61 75.73 80.67 85.61 75.73 44M 207G 5.3
HTC Chen et al. (2019) 2019 77.40 83.50 71.40 74.90 77.70 72.10 112M 294G 3.9
Transformer-based QueryInst Fang et al. (2021) 2021 60.90 67.70 54.00 60.40 66.90 54.00 48M 281G 4.4
Mask2Former Cheng et al. (2022) 2022 73.35 88.03 58.66 75.54 90.93 60.15 44M 276G 4.8
Mask DINO Li et al. (2023) 2023 76.22 84.11 68.33 79.93 92.12 67.74 52M 280G 4.6
CarcassFormer (Ours) 90.45 93.42 87.49 98.96 99.15 98.76 44M 278G 4.6

Net. denotes Network architecture. The best score in each table is highlighted in bold.

Table 9.

Performance comparison of CarcassFormer with Mask R-CNN He et al. (2017) and Mask2Former Cheng et al. (2022) on the detection, classification, and segmentation tasks. The comparison is conducted with the Swin-T backbone network in the case of multiple carcasses per image.

Method Venue | Detection and classification: AP, APnormal, APdefect | Segmentation: AP, APnormal, APdefect | Model complexity: Params, FLOPs, FPS
Mask R-CNN He et al. (2017) 2017 78.82 86.12 71.52 81.22 87.12 75.32 46M 230G 4.8
Mask2Former Cheng et al. (2022) 2022 73.10 88.14 58.05 77.68 92.89 62.47 46M 280G 4.6
CarcassFormer (Ours) 89.34 94.18 84.50 98.70 99.47 97.92 46M 281G 4.5

Net. denotes Network architecture. The best score in each table is highlighted in bold.

Detailed Quantitative Performance

The detailed performance of our CarcassFormer is reported in Table 3 and Table 4, corresponding to the 2 subsets: a single carcass per image and multiple carcasses per image. In each subset, our CarcassFormer network was examined with 3 different backbone networks: ResNet-34, ResNet-50, and Swin-T. For both the detection and segmentation tasks, Average Precision (AP) is reported at AP@50, AP@75, AP@95, and AP[50:95] (referred to as AP). Regarding detection and classification, APnormal and APdefect were evaluated for the normal and defect classes. The results in the 2 tables (Table 3, Table 4) underscore the remarkable performance of our model across various backbones and tasks, with every configuration achieving an AP of over 85 for all metrics. Additionally, it becomes evident that the multiple carcasses per image dataset presents greater challenges compared to the single carcass per image dataset. This observation is substantiated by a noticeable decline in performance metrics when handling multiple overlapping carcasses per image, as opposed to the single carcass per image scenario.

Table 4.

Detailed performance of CarcassFormer on the multiple carcasses per image dataset for both detection and segmentation, where APnormal and APdefect include classification results.

Backbone Task AP AP50 AP75 AP95 APnormal APdefect Params FLOPs FPS
ResNet 34 Detection 89.72 91.45 91.45 78.51 93.48 85.96 41M 274G 5.1
Segmentation 98.23 99.34 98.86 92.55 98.77 97.68
ResNet 50 Detection 90.45 91.55 91.55 83.41 93.42 87.49 41M 274G 5.1
Segmentation 98.96 99.98 99.48 94.36 99.15 98.76
Swin-T Detection 89.34 91.27 91.73 79.11 94.18 84.50 46M 281G 4.5
Segmentation 98.70 99.29 98.32 93.10 99.47 97.92

Single Carcass Per Image

Table 5, Table 6 present the performance on a single carcass per image using ResNet-34 and ResNet-50, respectively.

Table 5 compares the performance of CarcassFormer with existing methods on ResNet-34. In the first group, HTC Chen et al. (2019) obtains better performance than Mask R-CNN He et al. (2017), whereas our CarcassFormer gains significant performance margins over both HTC Chen et al. (2019) and Mask R-CNN He et al. (2017). Taking HTC Chen et al. (2019) as an example, CarcassFormer outperforms HTC with 8.70% higher AP for detection, 2.72% higher AP for normal carcass classification, 14.78% higher AP for defect carcass classification, 16.92% higher AP for segmentation, 13.70% higher segmentation AP for normal carcass, and 20.05% higher segmentation AP for defect carcass. In the second group, QueryInst Fang et al. (2021) obtains better performance than Mask2Former Cheng et al. (2022) and Mask DINO Li et al. (2023), while our CarcassFormer obtains the best performance. Compared to QueryInst Fang et al. (2021), CarcassFormer gains 7.30% higher AP for detection, 0.02% higher AP for normal carcass classification, 14.48% higher AP for defect carcass classification, 17.02% higher AP for segmentation, 11.90% higher segmentation AP for normal carcass, and 22.15% higher segmentation AP for defect carcass.

Table 6 compares the performance of CarcassFormer with existing methods on ResNet-50.

In the first group, HTC Chen et al. (2019) outperforms Mask R-CNN He et al. (2017), whereas CarcassFormer outperforms HTC Chen et al. (2019) with significant margins, including 7.08% higher AP for detection, 16.34% higher AP for defect carcass classification, 14.13% higher AP for segmentation, 10.79% higher segmentation AP for normal carcass, and 17.36% higher segmentation AP for defect carcass, while being comparable to HTC Chen et al. (2019) on normal carcass classification. In the second group, while Mask2Former Cheng et al. (2022) and Mask DINO Li et al. (2023) obtain much better performance than QueryInst Fang et al. (2021), our CarcassFormer outperforms Mask DINO Li et al. (2023) with 10.06% higher AP for detection, 2.58% higher AP for normal carcass classification, 17.24% higher AP for defect carcass classification, 13.30% higher AP for segmentation, 7.68% higher segmentation AP for normal carcass, and 16.91% higher segmentation AP for defect carcass.

Multiple Carcasses Per Image

Table 7, Table 8 present the performance on multiple carcasses per image using ResNet-34 and ResNet-50, respectively.

Table 7 compares the performance of CarcassFormer with existing methods on ResNet-34. In the first group, while Mask R-CNN He et al. (2017) and HTC Chen et al. (2019) are quite comparable on all tasks, our CarcassFormer achieves large performance gains. Taking HTC Chen et al. (2019) as an example, CarcassFormer achieves 11.92% higher AP for detection, 3.78% higher AP for normal carcass classification, 20.06% higher AP for defect carcass classification, 24.23% higher AP for segmentation, 19.67% higher segmentation AP for normal carcass, and 28.79% higher segmentation AP for defect carcass. In the second group, while QueryInst Fang et al. (2021) outperforms Mask2Former Cheng et al. (2022) and Mask DINO Li et al. (2023), our CarcassFormer obtains better performance than QueryInst Fang et al. (2021) by notable margins, that is, 5.62% higher AP for detection, 3.88% higher AP for normal carcass classification, 7.26% higher AP for defect carcass classification, 15.03% higher AP for segmentation, 11.07% higher segmentation AP for normal carcass, and 18.98% higher segmentation AP for defect carcass.

Table 8 compares the performance of CarcassFormer with existing methods on ResNet-50. In the first group, while Mask R-CNN He et al. (2017) outperforms HTC Chen et al. (2019), our CarcassFormer achieves the best performance, with 11.69% higher AP for detection, 7.81% higher AP for normal carcass classification, 11.76% higher AP for defect carcass classification, 18.29% higher AP for segmentation, 13.54% higher segmentation AP for normal carcass, and 23.03% higher segmentation AP for defect carcass compared to Mask R-CNN He et al. (2017). In the second group, while Mask2Former Cheng et al. (2022) and Mask DINO Li et al. (2023) obtain better performance than QueryInst Fang et al. (2021), our CarcassFormer achieves the best performance. It gains 14.23% higher AP for detection, 9.31% higher AP for normal carcass classification, 19.16% higher AP for defect carcass classification, 19.03% higher AP for segmentation, 7.03% higher segmentation AP for normal carcass, and 31.02% higher segmentation AP for defect carcass compared to the second-best method, Mask DINO Li et al. (2023).

Table 9 compares the performance of CarcassFormer with Mask2Former Cheng et al. (2022) and Mask R-CNN He et al. (2017) on Swin-T. In comparison to the CNN-based method Mask R-CNN He et al. (2017), our approach yields significant improvements across various performance metrics. Specifically, we observe a 10.52% increase in average precision (AP) for detection, an 8.06% enhancement for normal carcass classification, a 12.98% boost for defect carcass classification, a remarkable 17.48% rise for segmentation, as well as notable gains of 12.35% and 22.6% in segmentation AP for normal and defect carcasses, respectively. In comparison with the transformer-based network Mask2Former, our CarcassFormer achieves significantly better performance than both Mask2Former and Mask R-CNN. Indeed, it gains 16.24% higher AP for detection, 6.04% higher AP for normal carcass classification, 26.45% higher AP for defect carcass classification, 21.02% higher AP for segmentation, 6.58% higher segmentation AP for normal carcass, and 35.45% higher segmentation AP for defect carcass.

Model Complexity. Analysis of model complexity reveals that our method exhibits comparable complexity to the majority of existing methods. However, it consistently delivers notable performance enhancements across diverse tasks. Specifically, in the case of ResNet-34, as illustrated in Table 5, Table 7, our model has the smallest number of model parameters, equivalent to that of Mask R-CNN He et al. (2017) and Mask2Former Cheng et al. (2022), while maintaining comparable FLOPs and FPS with these models. However, our model exhibits a significant performance advantage over both. This trend is similarly observed for ResNet-50, as shown in Table 6, Table 8, and for Swin-T, as depicted in Table 9, where our model demonstrates comparable model complexity but yields substantial performance gains compared to other methods.

Qualitative Performance and Comparison

Based on the quantitative comparison in the section Quantitative Performance and Comparison, Mask R-CNN He et al. (2017) was selected from the first group, and Mask2Former Cheng et al. (2022) was chosen from the second group for the qualitative comparison. Specifically, the qualitative comparison is reported on both the detection and segmentation tasks, with a greater emphasis on the defect cases, namely feathers and skin tearing.

Single Carcass Per Image

Figure 7 presents a qualitative performance comparison among 3 models: Mask R-CNN (a), Mask2Former (b), and our proposed CarcassFormer (c), on a single defective carcass where feathers are present. While Mask R-CNN segments the global content well, it fails to segment details such as feathers. Mask2Former performs better than Mask R-CNN in capturing details, but it still has difficulty with fine details, which can be seen at high resolution. Moreover, Mask R-CNN and Mask2Former exhibited a tendency to under-localize the carcass, as observed from detected bounding boxes that did not encompass the entire carcass and boundary details such as wings and feathers. In contrast, our CarcassFormer not only accurately localizes the carcass with a fitting bounding box but is also capable of segmenting details at high resolution.

Figure 7. Performance comparison of (A) Mask R-CNN He et al. (2017), (B) Mask2Former Cheng et al. (2022), and (C) our CarcassFormer on a defective single carcass with feathers. In the Segmentation column, notable parts with feathers are highlighted. Compared with Mask R-CNN and Mask2Former, our CarcassFormer localizes the carcass with a more accurate bounding box and segments the carcass with more detail on the feathers.

Figure 8 depicts a qualitative performance comparison among 3 models: Mask R-CNN (a), Mask2Former (b), and our proposed CarcassFormer (c), on a single carcass with 2 defects: feathers and skin tearing. Although Mask2Former performs better than Mask R-CNN in localizing the carcass with skin tearing, it still has difficulty localizing all feather details. Conversely, our CarcassFormer accurately localizes the carcass with a fitting bounding box and is also capable of segmenting details at high resolution.

Figure 8. Performance comparison of (A) Mask R-CNN He et al. (2017), (B) Mask2Former Cheng et al. (2022), and (C) our CarcassFormer on 2 defects: a single carcass with skin tearing on the back and feathers. In the Segmentation column, notable parts with feathers and skin tearing are highlighted. Compared with Mask R-CNN and Mask2Former, our CarcassFormer not only detects the feathers well but also accurately localizes the carcass with all of its skin tearing.

Multiple Carcasses Per Image

In this section, the qualitative performance on images with multiple, overlapping carcasses is reported. Figure 9 illustrates the qualitative performance of 3 models: Mask R-CNN (a), Mask2Former (b), and our proposed CarcassFormer (c), on multiple overlapping carcasses with feathers. Both Mask R-CNN and Mask2Former struggle to accurately segment and localize each individual carcass, especially in cases where the feathers of different carcasses overlap. Mask R-CNN lacks detail in the segmentation results, while Mask2Former, despite performing better in localizing individual carcasses, still fails to capture all the details accurately. On the contrary, our CarcassFormer excels at accurately segmenting each individual carcass and capturing the details of feathers, even in complex scenarios of overlap. In Figure 10, the qualitative performance comparison of the 3 models on multiple, overlapping carcasses with both feathers and skin tearing is presented. Here, it is evident that Mask R-CNN and Mask2Former both face significant challenges in accurately localizing individual carcasses and providing detailed segmentation, especially for small objects such as feathers and areas of skin tearing. In stark contrast, our CarcassFormer performs outstandingly in these complex situations. It not only accurately localizes each carcass but also segments the fine details of feathers and skin tearing areas, thereby providing a comprehensive and detailed segmentation output.

Figure 9.

Performance comparison between (A) Mask R-CNN He et al. (2017), (B) Mask2Former Cheng et al. (2022), and (C) our CarcassFormer on overlapping carcasses with feathers. In the Segmentation column, notable parts with feathers are highlighted. Mask R-CNN not only lacks detail in the segmentation results but also fails to localize individual carcasses. Although Mask2Former performs better than Mask R-CNN in localizing individual carcasses, it still struggles to accurately segment all details. In contrast, our CarcassFormer simultaneously segments carcasses with detail and accurately localizes individual carcasses.

Figure 10.

Performance comparison between (A) Mask R-CNN He et al. (2017), (B) Mask2Former Cheng et al. (2022), and (C) our CarcassFormer on overlapping carcasses with feathers and skin tearing. In the Segmentation column, notable parts where feathers occur are highlighted. Both Mask R-CNN and Mask2Former struggle to accurately localize individual carcasses and provide detailed segmentation, especially for fine details such as feathers. In contrast, our CarcassFormer excels at simultaneously segmenting carcasses with fine detail and accurately localizing individual carcasses.

CONCLUSIONS

In conclusion, an end-to-end Transformer-based network for checking carcass quality, CarcassFormer, has been described. Our CarcassFormer is designed with 4 components: a Network Backbone to extract visual features, a Pixel Decoder to utilize feature maps at various scales, a Mask-Attention Transformer Decoder to predict the segmented masks of all possible instances, and an Instance Mask and Class Prediction module to provide the segmentation mask and corresponding label of each individual instance. To benchmark the proposed CarcassFormer network, a realistic dataset was collected at a poultry processing plant. The dataset contained various defects, including feathers, broken/disjointed bones, and skin tearing, in settings of both a single carcass per image and multiple carcasses per image, with carcasses at various ages and sizes. CarcassFormer was evaluated against both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), and Transformer-based networks, namely Mask2Former Cheng et al. (2022) and QueryInst Fang et al. (2021), on detection, classification, and segmentation tasks using 2 backbone networks, ResNet-34 and ResNet-50. The extensive qualitative and quantitative experiments showed that our CarcassFormer outperforms existing methods by remarkable margins on the AP, AP@50, and AP@75 metrics.
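To make the data flow through these 4 components concrete, the following is a minimal PyTorch sketch of how such a design could be composed. It is an illustrative assumption rather than the released CarcassFormer implementation: the module names, the single-scale pixel decoder, and the plain transformer decoder (standing in for the mask-attention decoder) are simplifications; in particular, a real pixel decoder would fuse feature maps from several scales.

```python
# Hypothetical sketch of the 4-component design described above; NOT the
# released CarcassFormer code, only an illustration of the data flow.
import torch
import torch.nn as nn
import torchvision


class CarcassFormerSketch(nn.Module):
    def __init__(self, num_classes=2, num_queries=20, dim=256):
        super().__init__()
        # 1) Network Backbone: extract visual features (ResNet-50 trunk here).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # 2) Pixel Decoder: project backbone features into per-pixel embeddings
        #    (single scale here; the paper's decoder uses multiple scales).
        self.pixel_decoder = nn.Sequential(
            nn.Conv2d(2048, dim, kernel_size=1),
            nn.GroupNorm(32, dim),
            nn.ReLU(inplace=True),
        )
        # 3) Mask-Attention Transformer Decoder: learnable instance queries
        #    attend to pixel features (plain nn.TransformerDecoder stand-in).
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(layer, num_layers=6)
        # 4) Instance Mask and Class Prediction heads.
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images)                 # (B, 2048, H/32, W/32)
        pixel_emb = self.pixel_decoder(feats)         # (B, dim, h, w)
        memory = pixel_emb.flatten(2).transpose(1, 2)  # (B, h*w, dim)
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        q = self.transformer_decoder(q, memory)       # (B, Q, dim)
        class_logits = self.class_head(q)             # per-query class label
        mask_logits = torch.einsum("bqc,bchw->bqhw", self.mask_embed(q), pixel_emb)
        return class_logits, mask_logits              # per-query masks + labels
```

In this sketch, each query produces one candidate instance (a mask and a class label), which mirrors the "all possible instances" behavior described above; queries assigned to the "no object" class are discarded at inference time.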

Our current CarcassFormer system operates solely on image-based inputs, limiting our ability to track carcasses across frames. While our model can currently determine whether a carcass is defective or not, it lacks the capability to identify specific types of defects, such as feathers around the carcass, feathers on the skin, flesh abnormalities, or broken wings. In our future endeavors, we aim to expand our research to include video analysis, enabling us to track carcasses across frames and thereby enhance the scalability of our system to process a larger volume of carcasses. Additionally, we intend to implement finer-grained defect detection to precisely identify the nature of defects present. This enhancement will provide more detailed insights into the types of defects observed, facilitating improved diagnosis and statistical analysis.

  • Dataset: A poultry carcass dataset was acquired, comprising a total of 7,321 images gathered from real-world environments and spanning diverse chicken ages, chicken sizes, and numbers of chickens per frame. The dataset has been carefully annotated by 3 experts.

  • Methodology: CarcassFormer, an effective end-to-end Transformer-based framework, was proposed for simultaneously localizing poultry carcass regions, segmenting carcass areas, and identifying carcasses with imperfections. CarcassFormer is based on a Transformer-based U-Net architecture.

    Our CarcassFormer is designed with 4 components: Network Backbone to extract visual features, Pixel Decoder to utilize feature maps from various scales, Mask-Attention Transformer Decoder to predict the segmented masks of all instances, and Instance Mask and Class Prediction to provide the segmentation mask and corresponding label of each individual instance. The extensive experiments showed that CarcassFormer outperforms both CNN-based networks, namely Mask R-CNN He et al. (2017) and HTC Chen et al. (2019), and Transformer-based networks, namely Mask2Former Cheng et al. (2022) and QueryInst Fang et al. (2021), on both ResNet-34 and ResNet-50 backbones across the AP, AP@50, and AP@75 metrics (a minimal evaluation sketch is given after this list).

  • Pretrained models and Code: The pre-trained models and source code of CarcassFormer are available for research purposes at: https://github.com/UARK-AICV/CarcassFormer.
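As referenced in the methodology item above, the AP, AP@50, and AP@75 figures follow the standard COCO evaluation protocol. Below is a minimal evaluation sketch using pycocotools; the annotation and prediction file names are placeholders assumed for illustration and are not files shipped with the released repository.

```python
# Minimal COCO-style evaluation sketch (AP, AP@50, AP@75) with pycocotools.
# File names are hypothetical placeholders for COCO-format ground truth and
# model predictions.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("carcass_val_annotations.json")               # ground-truth annotations
coco_dt = coco_gt.loadRes("carcassformer_predictions.json")  # model predictions

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")  # use "bbox" for detection AP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP (IoU=0.50:0.95), AP@50, AP@75, among others

ap, ap50, ap75 = evaluator.stats[0], evaluator.stats[1], evaluator.stats[2]
```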

DISCLOSURES

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: uark.edu; cobbvantress.com

ACKNOWLEDGMENTS

This material is based upon work supported by Cobb Vantress Inc., the National Science Foundation (NSF) under Award No OIA-1946391 RII Track-1, NSF 2119691 AI SUSTEIN.

REFERENCES

  1. Ahlin K. The robotic workbench and poultry processing 2.0. Anim. Front. 2022;12:49–55. doi: 10.1093/af/vfab079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Alexandratos N., Bruinsma J. Agriculture Development Economics Division. Food and Agriculture Organization of the United Nations; Rome, Italy: 2012. World Agriculture Towards 2030/2050: The 2012 Revision. [Google Scholar]
  3. Arnab A., Torr P.H.S. Bottom-up instance segmentation using deep higher-order CRFs. In: Wilson R.C., Hancock E.R., Smith W.A.P. (Eds.), Proceedings of the British Machine Vision Conference 2016, BMVC 2016; York, UK, September 19-22, 2016; BMVA Press; 2016. http://www.bmva.org/bmvc/2016/papers/paper019/index.html. [Accessed 20 June 2024]
  4. Aydin A. Development of an early detection system for lameness of broilers using computer vision. Comp. Electr. Agric. 2017;136:140–146. [Google Scholar]
  5. Bolya D., Zhou C., Xiao F., Lee Y.J. YOLACT: real-time instance segmentation. Pages 9156–9165 in 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019; Seoul, Korea (South), October 27 - November 2, 2019; IEEE; 2019. [Google Scholar]
  6. Cai Z., Vasconcelos N. Cascade R-CNN: delving into high quality object detection. Pages 6154–6162 in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018; Salt Lake City, UT, June 18-22, 2018; IEEE Computer Society; 2018. [Google Scholar]
  7. Caldas-Cueva J.P., Mauromoustakos A., Sun X., Owens C.M. Detection of woody breast condition in commercial broiler carcasses using image analysis. Poult. Sci. 2021 doi: 10.1016/j.psj.2020.12.074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Carion N., Massa F., Synnaeve G., Usunier N., Kirillov A., Zagoruyko S. End-to-end object detection with transformers. Pages 213–229 in Computer Vision–ECCV 2020: 16th European Conference; Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16; Springer; 2020. [Google Scholar]
  9. Chen H., Sun K., Tian Z., Shen C., Huang Y., Yan Y. Blendmask: top-down meets bottom-up for instance segmentation. Pages 8570–8578 in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; CVPR 2020, Seattle, WA, June 13-19, 2020,; IEEE; 2020. [Google Scholar]
  10. Chen K., Pang J., Wang J., Xiong Y., Li X., Sun S., Feng W., et al. Pages 4974–4983 in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. Hybrid task cascade for instance segmentation. [Google Scholar]
  11. Chen L.C., Papandreou G., Kokkinos I., Murphy K., Yuille A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transact. Pattern Analysis Machine Intelligence. 2017;40:834–848. doi: 10.1109/TPAMI.2017.2699184. [DOI] [PubMed] [Google Scholar]
  12. Chen L.C., Zhu Y., Papandreou G., Schroff F., Adam H. Pages 801–818 in Proceedings of the European Conference on Computer Vision (ECCV) 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. [Google Scholar]
  13. Cheng B., Misra I., Schwing A.G., Kirillov A., Girdhar R. Pages 1290–1299 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. Masked-attention mask transformer for universal image segmentation. [Google Scholar]
  14. Cheng B., Schwing A., Kirillov A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inform. Proc. Syst. 2021;34:17864–17875. [Google Scholar]
  15. Cheng T., Wang X., Huang L., Liu W. Pages 660–676 in European Conference on Computer Vision. Springer; 2020. Boundary-preserving mask R-CNN. [Google Scholar]
  16. Dong S., Wang P., Abbas K. A survey on deep learning and its applications. Comp. Sci. Rev. 2021;40 [Google Scholar]
  17. Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., et al. International Conference on Learning Representations (ICLR) 2020. An image is worth 16x16 words: transformers for image recognition at scale. [Google Scholar]
  18. Duong C.N., Quach K.G., Jalata I., Le N., Luu K. Pages 1–6 in 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS) IEEE; 2019. Mobiface: a lightweight deep learning face recognition on mobile devices. [Google Scholar]
  19. Duong C.N., Quach K.G., Luu K., Le T., Savvides M., Bui T.D. Learning from longitudinal face demonstration—where tractable deep modeling meets inverse reinforcement learning. Int. J. Comp. Vision. 2019;127:957–971. [Google Scholar]
  20. Elam, T. E., 2022. Live chicken production trends.
  21. Fan R., Cheng M.M., Hou Q., Mu T.J., Wang J., Hu S.M. Pages 6103–6112 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. S4net: Single stage salient-instance segmentation. [Google Scholar]
  22. Fang, Y., Yang, S., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., Liu, W., 2021. Instances as queries, Pages 6910–6919 in: Proceedings of the IEEE/Cvf International Conference on Computer Vision.
  23. FASS. Federation of Animal Science Societies. 3rd ed. 2010. Guide for the care and use of agricultural animals in research and teaching. http://www.fass.org [Google Scholar]
  24. Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H. O., Guadarrama, S., Murphy, K. P., 2017. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277.
  25. Gabeur V., Sun C., Alahari K., Schmid C. Multi-modal transformer for video retrieval. Pages 214–229 in European Conference on Computer Vision; Springer; 2020. [Google Scholar]
  26. Golomb S. Run-length encodings (corresp.) IEEE Transact. Inform. Theory. 1966;12:399–401. [Google Scholar]
  27. Gu W., Bai S., Kong L. A review on 2d instance segmentation based on deep neural networks. Image Vision Computing. 2022;120 [Google Scholar]
  28. Hafiz A.M., Bhat G.M. A survey on instance segmentation: state of the art. Int. J. Multimedia Inform. Retr. 2020;9:171–189. [Google Scholar]
  29. Han L., Le T.H.N., Savvides M. Pages 224–231 in Medical Imaging 2017: Biomedical Applications in Molecular, Structural, and Functional Imaging. SPIE; 2017. An automatic cells detection and segmentation. [Google Scholar]
  30. He K., Gkioxari G., Dollár P., Girshick R. Pages 2961–2969 in Proceedings of the IEEE International Conference on Computer Vision. 2017. Mask R-CNN. [Google Scholar]
  31. He K., Zhang X., Ren S., Sun J. Pages 770–778 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Deep residual learning for image recognition. [Google Scholar]
  32. He H., Cai J., Pan Z., Liu J., Zhang J., Tao D., Zhuang B. Pages 11299–11308 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. Dynamic focus-aware positional queries for semantic segmentation. [Google Scholar]
  33. He P., Chen Z., He Y., Chen J., Hayat K., Pan J., Lin H. A reliable and low-cost deep learning model integrating convolutional neural network and transformer structure for fine-grained classification of chicken Eimeria species. Poult. Sci. 2023;102 doi: 10.1016/j.psj.2022.102459. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  35. Hu Z., Yang H., Yan H. Attention-guided instance segmentation for group-raised pigs. Animals. 2023;13:2181. doi: 10.3390/ani13132181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q. Pages 4700–4708 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. Densely connected convolutional networks. [Google Scholar]
  37. Huang R., Pedoeem J., Chen C. Pages 2503–2510 in 2018 IEEE International Conference on Big Data (Big Data) IEEE; 2018. Yolo-lite: a real-time object detection algorithm optimized for non-gpu computers. [Google Scholar]
  38. Ibtehaz N., Rahman M.S. Multiresunet: Rethinking the u-net architecture for multimodal biomedical image segmentation. Neural Networks. 2020;121:74–87. doi: 10.1016/j.neunet.2019.08.025. [DOI] [PubMed] [Google Scholar]
  39. Janai J., Güney F., Behl A., Geiger A., et al. Computer vision for autonomous vehicles: problems, datasets and state of the art. Found. Trends® Comp. Graph. Vision. 2020;12:1–308. [Google Scholar]
  40. Jin Y., Liu J., Xu Z., Yuan S., Li P., Wang J. Development status and trend of agricultural robot technology. Int. J. Agric. Biol. Eng. 2021;14:1–19. [Google Scholar]
  41. Kaminski D.M. Michigan State University; 2020. Re-Moo-Ving Barriers Within Labor: Exploring Current Events Related to Dairy and Poultry Labor Markets. [Google Scholar]
  42. Kong S., Fowlkes C.C. Pages 9018–9028 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. Recurrent pixel embedding for instance grouping. [Google Scholar]
  43. Krizhevsky A., Sutskever I., Hinton G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM. 2017;60:84–90. [Google Scholar]
  44. Le D., Truong S., Brijesh P., Adjeroh D., Le N. scl-st: supervised contrastive learning with semantic transformations for multiple lead ecg arrhythmia classification. IEEE J. Biomed. Health Inform. 2023 doi: 10.1109/JBHI.2023.3246241. [DOI] [PubMed] [Google Scholar]
  45. Le N., Bui T., Vo-Ho V.K., Yamazaki K., Luu K. Narrow band active contour attention model for medical segmentation. Diagnostics. 2021;11:1393. doi: 10.3390/diagnostics11081393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Le N., Rathour V.S., Yamazaki K., Luu K., Savvides M. Deep reinforcement learning in computer vision: a comprehensive survey. Art. Intell. Rev. 2022:1–87. [Google Scholar]
  47. Le T., Gummadi R., Savvides M. Pages 646–653 in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2018. Deep recurrent level set for segmenting brain tumors. [Google Scholar]
  48. Le T.H.N., Savvides M. A novel shape constrained feature-based active contour model for lips/mouth segmentation in the wild. Pattern Recognition. 2016;54:23–33. [Google Scholar]
  49. Le T.H.N., Luu K., Zhu C., Savvides M. Semi self-training beard/moustache detection and segmentation simultaneously. Image Vision Comp. 2017;58:214–223. [Google Scholar]
  50. Le T.H.N., Quach K.G., Zhu C., Duong C.N., Luu K., Savvides M. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 39–46) 2017. Robust hand detection and classification in vehicles and in the wild. [Google Scholar]
  51. Le T.H.N., Zhu C., Zheng Y., Luu K., Savvides M. Deepsafedrive: a grammar-aware driver parsing approach to driver behavioral situational awareness (db-saw) Pattern Recognition. 2017;66:229–238. [Google Scholar]
  52. Lee Y., Park J. Centermask: Real-time anchor-free instance segmentation. Pages 13903–13912 in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; CVPR 2020, Seattle, WA, June 13-19, 2020; IEEE; 2020. [Google Scholar]
  53. Li F., Zhang H., Liu S., Guo J., Ni L.M., Zhang L. Pages 13619–13627 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. Dn-detr: Accelerate detr training by introducing query denoising. [Google Scholar]
  54. Li F., Zhang H., Xu H., Liu S., Zhang L., Ni L.M., Shum H.Y. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. [Google Scholar]
  55. Li K., Malik J. Amodal instance segmentation. Pages 677–693 in Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14; Springer; 2016. [Google Scholar]
  56. Li Y., Qi H., Dai J., Ji X., Wei Y. Fully convolutional instance-aware semantic segmentation. Pages 4438–4446 in 2017 IEEE Conference on Computer Vision and Pattern Recognition; CVPR 2017, Honolulu, HI, July 21-26, 2017; IEEE Computer Society; 2017. [Google Scholar]
  57. Lin, M., Chen, Q., Yan, S., 2013. Network in network. arXiv preprint arXiv:1312.4400.
  58. Lin T.Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C.L. Microsoft coco: common objects in context. Pages 740–755 in Computer Vision–ECCV 2014: 13th European Conference; Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13; Springer; 2014. [Google Scholar]
  59. Lin X., Yan Q., Wu C., Chen Y. Pages 37–42 in Proceedings of the 2022 5th International Conference on Sensors, Signal and Image Processing. 2022. Judgment model of cock reproductive performance based on vison transformer. [Google Scholar]
  60. Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L., 2022. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329.
  61. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows, Pages 10012–10022 in Proceedings of the IEEE/CVF International Conference on Computer Vision.
  62. Long J., Shelhamer E., Darrell T. Pages 3431–3440 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. Fully convolutional networks for semantic segmentation. [DOI] [PubMed] [Google Scholar]
  63. Luu K., Zhu C., Bhagavatula C., Le T.H.N., Savvides M. A deep learning approach to joint face detection and segmentation. Adv. Face Detect. Facial Image Anal. 2016:1–12. [Google Scholar]
  64. Newell A., Huang Z., Deng J. Associative embedding: End-to-end learning for joint detection and grouping. Adv. Neural Inform. Proces. Syst. 2017;30 [Google Scholar]
  65. Nguyen P., Quach K.G., Duong C.N., Le N., Nguyen X.B., Luu K. Pages 2569–2578 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. Multi-camera multiple 3d object tracking on the move for autonomous vehicles. [Google Scholar]
  66. Nguyen T.P., Pham T.T., Nguyen T., Le H., Nguyen D., Lam H., Nguyen P., Fowler J., Tran M.T., Le N. Pages 1981–1990 in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023. Embryosformer: Deformable transformer and collaborative encoding-decoding for embryos stage development classification. [Google Scholar]
  67. O Pinheiro P.O., Collobert R., Dollár P. Learning to segment object candidates. Adv. Neural Inform. Process. Syst. 2015;28 [Google Scholar]
  68. Park M., Britton D., Daley W., McMurray G., Navaei M., Samoylov A., Usher C., Xu J. Artificial intelligence, sensors, robots, and transportation systems drive an innovative future for poultry broiler and breeder management. Anim. Front. 2022;12:40–48. doi: 10.1093/af/vfac001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Quach K.G., Le N., Duong C.N., Jalata I., Roy K., Luu K. Non-volume preserving-based fusion to group-level emotion recognition on crowd videos. Pattern Recognition. 2022;128 [Google Scholar]
  70. Ren G., Lin T., Ying Y., Chowdhary G., Ting K. Agricultural robotics research applicable to poultry production: a review. Comp. Electr. Agric. 2020;169 [Google Scholar]
  71. Ronneberger O., Fischer P., Brox T. Pages 234–241 in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015. U-net: Convolutional networks for biomedical image segmentation. [Google Scholar]
  72. Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  73. Sun P., Zhang R., Jiang Y., Kong T., Xu C., Zhan W., Tomizuka M., Li L., Yuan Z., Wang C., et al. Pages 2881–2890 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. Sparse r-cnn: end-to-end object detection with learnable proposals. [Google Scholar]
  74. Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D., Erhan D., Vanhoucke V., Rabinovich A. Pages 1–9 in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Going deeper with convolutions. [Google Scholar]
  75. Thang Pham, T., Brecheisen, J., Nguyen, A., Nguyen, H., Le, N., 2023. I-ai: a controllable & interpretable ai system for decoding radiologists’ intense focus for accurate cxr diagnoses. arXiv e-prints, arXiv–2309.
  76. Tong K., Wu Y., Zhou F. Recent advances in small object detection based on deep learning: a review. Image Vision Comp. 2020;97 [Google Scholar]
  77. Touvron H., Cord M., Douze M., Massa F., Sablayrolles A., Jégou H. Pages 10347–10357 in International Conference on Machine Learning. PMLR; 2021. Training data-efficient image transformers & distillation through attention. [Google Scholar]
  78. Tran M., Vo K., Yamazaki K., Fernandes A., Kidd M., Le N. British Machine Vision Conference (BMVC) 2022. Aisformer: amodal instance segmentation with transformer. [Google Scholar]
  79. Tran M., Vo-Ho V.K., Le N.T. Pages 4392–4398 in 2022 26th International Conference on Pattern Recognition (ICPR) 2022. 3dconvcaps: 3dunet with convolutional capsule encoder for medical image segmentation. [Google Scholar]
  80. Truong T.D., Chappa R.T.N., Nguyen X.B., Le N., Dowling A.P., Luu K. Pages 2850–2856 in 2022 26th International Conference on Pattern Recognition (ICPR) IEEE; 2022. Otadapt: optimal transport-based approach for unsupervised domain adaptation. [Google Scholar]
  81. Van Engelen J.E., Hoos H.H. A survey on semi-supervised learning. Machine Learning. 2020;109:373–440. [Google Scholar]
  82. Vaswani A., Shazeer N., et al. Attention is all you need. NIPS. 2017:5998–6008. [Google Scholar]
  83. Vo K., Joo H., Yamazaki K., Truong S., Kitani K., Tran M.T., Le N. AEI: actors-environment interaction with adaptive attention for temporal action proposals generation. BMVC. 2021 [Google Scholar]
  84. Vo K., Truong S., Yamazaki K., Raj B., Tran M.T., Le N. Aoe-net: entities interactions modeling with adaptive attention mechanism for temporal action proposals generation. Int. J. Computer Vision. 2022:1–22. [Google Scholar]
  85. Wray M., Doughty H., Damen D. Pages 3650–3660 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. On semantic similarity in video retrieval. [Google Scholar]
  86. Wu D., Cui D., Zhou M., Ying Y. Information perception in modern poultry farming: a review. Comp. Electr. Agric. 2022;199 [Google Scholar]
  87. Xie E., Wang W., Yu Z., Anandkumar A., Alvarez J.M., Luo P. Segformer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inform. Process. Syst. 2021;34:12077–12090. [Google Scholar]
  88. Yamazaki K., Truong S., Vo K., Kidd M., Rainwater C., Luu K., Le N. Pages 3656–3661 in 2022 IEEE International Conference on Image Processing (ICIP) IEEE; 2022. Vlcap: vision-language with contrastive learning for coherent video paragraph captioning. [Google Scholar]
  89. Yamazaki K., Vo K., Truong S., Raj B., Le N. Vltint: visual-linguistic transformer-in-transformer for coherent video paragraph captioning. AAAI. 2023 [Google Scholar]
  90. Ye L., Rochan M., Liu Z., Wang Y. Pages 10502–10511 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. Cross-modal self-attention network for referring image segmentation. [Google Scholar]
  91. Ying, H., Huang, Z., Liu, S., Shao, T., Zhou, K., 2019. Embedmask: embedding coupling for one-stage instance segmentation. ArXiv preprint abs/1912.01954. https://arxiv.org/abs/1912.01954. [Accessed 20 June. 2024]
  92. Zhang S.H., Li R., Dong X., Rosin P., Cai Z., Han X., Yang D., Huang H., Hu S.M. Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. 2019. Pose2seg: detection free human instance segmentation. [Google Scholar]
  93. Zhang X., Zhou X., Lin M., Sun J. Pages 6848–6856 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. [Google Scholar]
  94. Zhao H., Shi J., Qi X., Wang X., Jia J. Pages 2881–2890 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. Pyramid scene parsing network. [Google Scholar]
  95. Zhao S., Bai Z., Wang S., Gu Y. Research on automatic classification and detection of mutton multi-parts based on swin-transformer. Foods. 2023;12:1642. doi: 10.3390/foods12081642. [DOI] [PMC free article] [PubMed] [Google Scholar]
  96. Zheng S., Lu J., Zhao H., Zhu X., Luo Z., Wang Y., Fu Y., Feng J., Xiang T., Torr P.H., et al. Pages 6881–6890 in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. [Google Scholar]
  97. Zhou S.K., Le H.N., Luu K., Nguyen H.V., Ayache N. Deep reinforcement learning in medical imaging: a literature review. Med. Image Analysis. 2021;73 doi: 10.1016/j.media.2021.102193. [DOI] [PubMed] [Google Scholar]
  98. Zhou Z.H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018;5:44–53. [Google Scholar]
  99. Zhu C., Zheng Y., Luu K., Hoang Ngan Le T., Bhagavatula C., Savvides M. Pages 25–33 in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016. Weakly supervised facial analysis with dense hyper-column features. [Google Scholar]
  100. Zhou Z., Rahman Siddiquee M.M., Tajbakhsh N., Liang J. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pp. 3-11. Springer International Publishing. 2018. Unet++: A nested u-net architecture for medical image segmentation. [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2020. Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
