Data in Brief. 2025 Apr 28;60:111594. doi: 10.1016/j.dib.2025.111594

Cauliflower leaf diseases: A computer vision dataset for smart agriculture

Sabbir Hossain Durjoy 1, Md Emon Shikder 1, Md Mehedi Hasan Shoib 1, Md Hasan Imam Bijoy 1
PMCID: PMC12175232  PMID: 40534720

Abstract

Cauliflower is one of the most widely consumed vegetables in the world. It is rich in vitamins, antioxidants, and fibre, which support digestion and the immune system and help reduce inflammation. Farmers commonly face cauliflower leaf diseases that are difficult to diagnose in their early stages and that can spread rapidly across entire fields. The resulting harvest losses make crop protection more laborious and resource-intensive, and farmers often respond by applying large amounts of pesticides and other harmful chemicals to secure a reliable yield. This is costly and harms both crop quality and the environment. In this article, we introduce a dataset of 2,661 cauliflower leaf images intended to accelerate research on this problem and to improve disease monitoring, diagnosis, and prevention. The images were collected between November 2024 and January 2025 and are categorized into three classes: Healthy, Insect Holes, and Black Rot, each reflecting a specific condition that affects plant health. The pictures were captured at different locations in Bangladesh, on different dates, under different weather conditions and temperatures, and with different devices. To improve data quality, we processed the dataset in several steps so that it reflects real-world conditions and is ready for training: the images were resized to a standard size of 3000 × 3000 pixels, brightness was adjusted to make leaf details easier to discern, and duplicate and poor-quality images were removed. The dataset is intended to support agricultural research, precision agriculture, and disease management, and it can be used to train machine learning and deep learning models for early detection of cauliflower leaf diseases and for automated monitoring and decision support. It also has potential for practical, real-time use: it can support the development of mobile apps or automated systems that let farmers identify diseases at an early stage and act immediately without on-site expert knowledge, and it can be combined with smart farming equipment such as drones and sensors to monitor large fields in real time.

Keywords: Cauliflower leaf disease, Deep learning, Image dataset, Agricultural dataset, Disease detection


Specifications Table

Subject: Computer Sciences
Specific subject area: Deep Learning, Computer Vision, Image Processing, Image Classification, Machine Learning.
Type of data: Image (.JPG)
Data collection: The Cauliflower Leaf Disease Dataset was collected from two distinct locations in Bangladesh: Zailla, Singair, Manikganj, and Dattapara, Ashulia, Savar, Dhaka. A total of 2,661 high-resolution images were captured, representing three categories of cauliflower leaves: Healthy, Insect Holes, and Black Rot. The dataset was collected between November 2024 and January 2025, ensuring diverse weather conditions, temperatures, and lighting variations. To enhance the dataset’s quality, images were resized to 3000 × 3000 pixels, brightness was adjusted for better visibility, and duplicate or poor-quality images were removed. This dataset provides a balanced collection to support deep learning models in agricultural disease detection and precision farming.
Captured using: (i) Redmi 10 Pro Max, (ii) OnePlus 8T, (iii) Tecno Spark 7, (iv) Redmi Note 10, (v) Huawei P30 Lite
Data source location: The images were collected from the following geographic locations:
  • 1. Zailla, Singair, Manikganj (Latitude: 23°47′46.11"N, Longitude: 90°13′15.73"E)
  • 2. Dattapara, Ashulia, Savar, Dhaka (Latitude: 23°52′26.3"N, Longitude: 90°19′06.3"E)
Data accessibility: Repository name: Mendeley Data
Data identification number: 10.17632/x995snz7p3.1
Direct URL to data: https://data.mendeley.com/datasets/x995snz7p3/1
The dataset is publicly available and can be accessed via the provided Mendeley Data repository link.
Related research article: None

1. Value of the Data

  • This dataset holds high-resolution images of healthy and damaged cauliflower leaves, which provide a wealth of material for the development and validation of computer vision-based automated disease detection systems.

  • It allows machine learning and deep learning techniques to be used for accurate classification and diagnosis of affected leaves, enhancing precision agriculture.

  • The dataset can be used to build disease detection applications that help farmers accurately identify and manage plant diseases, which will benefit the agriculture sector, researchers, wholesalers, consumers, and agricultural policymakers.

  • Researchers can employ feature extraction and selection methods to identify the most significant visual patterns distinguishing healthy leaves from affected ones.

  • By enabling early detection of diseases, this dataset contributes to reducing crop loss, optimizing disease management strategies, and improving agricultural productivity.

2. Background

Cauliflower is a vegetable rich in important nutrients. It has high amounts of vitamins C and K and notable amounts of vitamins A, B1, and B9. It also provides ample fibre and antioxidants, which help maintain healthy digestion, boost the immune system, and reduce inflammation [1]. In addition, cauliflower contains useful minerals such as iron, magnesium, phosphorus, and potassium. It is low in fat and carbohydrates while being rich in fibre, water, and vitamins, which makes it a healthy option for any type of diet [2]. However, farmers face leaf diseases that spread quickly and reduce crop yield. These diseases are hard to detect early, which leads to more infections and increased pesticide use, raising costs and harming the environment. To help address this problem, this dataset was created to assist researchers in developing software that can automatically detect cauliflower leaf diseases. The dataset includes images collected from two locations in Bangladesh, taken in different weather conditions and at different growth stages of the cauliflower plants. This makes the dataset more applicable to disease detection under varied real-world conditions. It is a useful resource for scientists and researchers who want to improve computer-assisted plant disease detection, and it can be used to train and test machine learning algorithms that aid in developing better cauliflower crop protection tools.

3. Data Description

Cauliflower is the world's second most popular “cole” crop after cabbage, but it holds the top spot in Bangladesh [3]. It is largely grown by farmers as a seasonal crop in spring and fall. To support automated detection of plant disease, we provide a Cauliflower Leaf Disease Dataset with 2,661 high-resolution images that were collected from Bangladesh's Savar and Manikganj regions. Fig. 1 shows the geographical locations where our dataset was collected.

Fig. 1. Geographic areas where the Cauliflower Leaf Disease Dataset was collected.

The images span the natural variability in cauliflower leaf health under differing environmental conditions, so that the dataset is rich and realistic. All photos were taken in natural light, matching the conditions that farmers actually experience. Our dataset covers three of the most significant conditions of cauliflower leaves: Healthy leaves, Insect Hole leaves, and Black Rot-infected leaves [4]. Table 1 shows the number of images per class.

Table 1. Statistics of the cauliflower leaf dataset.

| Serial No | Classes (Leaf) | Number of Images |
|---|---|---|
| 1 | Healthy | 934 |
| 2 | Insect Hole | 639 |
| 3 | Black Rot | 1088 |
| Total | | 2661 |
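
The class counts in Table 1 can be turned into class shares and a simple imbalance ratio. The following is a minimal sketch using only the counts reported above; the inverse-frequency weighting at the end is an illustrative assumption, not a step described by the authors.

```python
# Class counts copied from Table 1.
counts = {"Healthy": 934, "Insect Hole": 639, "Black Rot": 1088}

total = sum(counts.values())

# Share of each class in the whole dataset.
shares = {name: n / total for name, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency class weights: total / (number_of_classes * class_count).
# This weighting scheme is an assumption for illustration only.
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

for name in counts:
    print(f"{name:12s} share={shares[name]:.3f} weight={weights[name]:.3f}")
print(f"imbalance ratio (largest / smallest): {imbalance_ratio:.2f}")
```

Weights of this kind are one common way to compensate for uneven class sizes in a training loss; rarer classes receive larger weights.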

Table 2 shows a summary of the collected dataset, including the number of images, leaf types, short descriptions, and sample images.

Table 2. Summary of the cauliflower leaf disease dataset.

| Serial No | Class Name | Number of Images | Description | Sample Image |
|---|---|---|---|---|
| 1 | Healthy | 934 | These leaves are fresh and green, with no disease or damage. | [image] |
| 2 | Insect Hole | 639 | Insect holes in cauliflower leaves are usually caused by caterpillars, such as cabbage worms or loopers. Common symptoms are irregular holes, bite marks, and possible edge damage. | [image] |
| 3 | Black Rot | 1088 | Black rot is a bacterial disease caused by Xanthomonas campestris pv. campestris. Affected leaf tissue dies and discolours to black. Common symptoms are yellow V-shaped lesions, dark veins, and leaf wilting. | [image] |

Cauliflower is a seasonal vegetable. We collected our dataset images from November 2024 to January 2025. The images were captured at different locations in Bangladesh, under varying weather conditions, dates, temperatures, and with different devices. This information is presented in Table 3.

Table 3. Image collection details (date, location, temperature, weather and devices).

| Class | Date | Location | Temperature | Weather | Devices |
|---|---|---|---|---|---|
| Black Rot | 10 November, 2024 | Manikganj | 31°C | Sunny | Huawei P30 Lite (50%), Redmi Note 10 (25%), Tecno Spark 7 (25%) |
| Healthy | 29 November, 2024 | Savar | 30°C | Sunny | Redmi 10 Pro Max (40%), OnePlus 8T (60%) |
| Insect Hole | 2 December, 2024 | Savar | 28°C | Sunny | Redmi 10 Pro Max (55%), OnePlus 8T (45%) |
| Black Rot | 14 December, 2024 | Savar | 24°C | Foggy | Redmi 10 Pro Max (20%), OnePlus 8T (80%) |
| Healthy | 28 December, 2024 | Manikganj | 25°C | Foggy | Huawei P30 Lite (80%), OnePlus 8T (10%), Redmi Note 10 (10%) |
| Insect Hole | 6 January, 2025 | Manikganj | 28°C | Sunny | Redmi 10 Pro Max (20%), Huawei P30 Lite (80%) |

One existing dataset is very similar to ours: “VegNet: An Organized Dataset of Cauliflower Disease for a Sustainable Agro-based Automation System” by U. Sara et al. [5]. It contains images categorized into different classes of cauliflower disease. That dataset is very helpful but limited in some respects. Our dataset consists of a higher number of images, categorized into three classes: Healthy, Insect Holes, and Black Rot. Table 4 compares the two datasets by the number of images in each class, indicating how our dataset provides more diversified data.

Table 4. Comparison with available datasets of cauliflower leaf (number of images per class).

| Classes | Our Dataset | U. Sara et al. [5] |
|---|---|---|
| Healthy | ✔ (934) | ✔ (206) |
| Insect Hole | ✔ (639) | ✔ (177) |
| Black Rot | ✔ (1088) | ✔ (100) |

Table 4 shows that our dataset is larger and contains a greater number of images for each class than the dataset by U. Sara et al. [5]. We collected images from different locations in Bangladesh, including Savar and Manikganj, whereas the dataset by U. Sara et al. [5] was limited to a single location, Manikganj. The wider range provides more diverse environmental conditions, making our dataset more representative and more robust for practical use. Another key difference is the duration of data capture: 57 days in our case compared with 10 days in theirs, which allowed us to capture images at different growth stages and in different weather conditions. Our dataset also offers higher image resolution, with images of 3000 × 3000 pixels compared to 500 × 500 pixels in U. Sara et al. [5]. Each of our images therefore contains 36 times as many pixels, and this higher resolution allows for better feature extraction, which can improve the accuracy of deep learning-based disease detection models. By offering a more diverse, higher-quality, and well-rounded dataset, our work contributes to more precise and reliable cauliflower leaf disease classification.
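
The factor of 36 and the per-class differences follow directly from the figures quoted above. As a small worked check, assuming only the image dimensions and the Table 4 counts:

```python
# Image dimensions quoted in the comparison above (width, height in pixels).
ours_size = (3000, 3000)
vegnet_size = (500, 500)

# Ratio of total pixels per image.
pixel_ratio = (ours_size[0] * ours_size[1]) / (vegnet_size[0] * vegnet_size[1])

# Per-class image counts copied from Table 4.
ours = {"Healthy": 934, "Insect Hole": 639, "Black Rot": 1088}
vegnet = {"Healthy": 206, "Insect Hole": 177, "Black Rot": 100}

# How many times more images this dataset has in each class.
count_ratio = {name: ours[name] / vegnet[name] for name in ours}

print(f"pixels per image: {pixel_ratio:.0f}x")
for name, ratio in count_ratio.items():
    print(f"{name:12s} {ratio:.2f}x as many images")
```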

4. Experimental Design, Materials and Methods

4.1. Experimental design

We collected images for our dataset from various locations in Bangladesh, including Savar and Manikganj. Images of cauliflower leaves were captured in different light conditions using different mobile phones to ensure that our dataset is representative of real-world diversity. The dataset separates leaves into three classes: Healthy, Insect Holes, and Black Rot. The geographic locations of the collection sites are as follows:

  • 1. Zailla, Singair, Manikganj (Latitude: 23°47′46.11"N, Longitude: 90°13′15.73"E)

  • 2. Dattapara, Ashulia, Savar, Dhaka (Latitude: 23°52′26.3"N, Longitude: 90°19′06.3"E)
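
For mapping or geospatial tooling, the coordinates above usually need to be in decimal degrees. The sketch below converts the two sites from degrees, minutes, and seconds and estimates the straight-line distance between them with the haversine formula. The function names and the mean Earth radius of 6371 km are assumptions made for this illustration.

```python
import math

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Site coordinates as listed above.
manikganj = (dms_to_decimal(23, 47, 46.11, "N"), dms_to_decimal(90, 13, 15.73, "E"))
savar = (dms_to_decimal(23, 52, 26.3, "N"), dms_to_decimal(90, 19, 6.3, "E"))

distance_km = haversine_km(*manikganj, *savar)
print(f"Manikganj site: {manikganj[0]:.6f}, {manikganj[1]:.6f}")
print(f"Savar site:     {savar[0]:.6f}, {savar[1]:.6f}")
print(f"Approximate distance between sites: {distance_km:.1f} km")
```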

After completing the image collection, we applied several pre-processing methods to our dataset: background removal, resizing of each image, de-duplication, and removal of low-quality images, so that only high-quality data remained for model training. We split the dataset into two parts: 80% for training and 20% for validation. We then trained a pre-trained deep learning model, MobileNetV2, to classify the different types of cauliflower leaves. In the model evaluation process, we checked its accuracy, precision, and recall. This evaluation indicates how reliably the trained model detects cauliflower leaf diseases at an early stage. Fig. 2 shows how a machine learning model categorizes the images into the classes of the dataset.

Fig. 2. Method used to evaluate cauliflower leaf diseases.
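
The 80/20 split and the precision and recall checks described above can be sketched in plain Python. The text does not say whether the split was stratified by class, so the per-class splitting below, the fixed random seed, and both function names are assumptions for illustration; the MobileNetV2 training itself is not shown.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One label per image, using the class counts from Table 1.
labels = ["Healthy"] * 934 + ["Insect Hole"] * 639 + ["Black Rot"] * 1088
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), "training images,", len(val_idx), "validation images")
```

Splitting each class separately keeps the class proportions of the validation set close to those of the full dataset, which matters when the classes are of unequal size.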

4.2. Materials (Camera Specification)

To ensure diverse image quality and real-world usability, different devices were used to take the dataset images. The variation in camera specifications yields a diverse dataset, with images captured at different resolutions and in different light conditions, making it better suited to real-world cauliflower disease detection. Table 5 lists the camera specifications for each device.

Table 5. Camera specifications for each device.

| Device Name | Camera Resolution (MP) | Aperture | Image Dimensions |
|---|---|---|---|
| Redmi 10 Pro Max | 108 | f/1.9 | 3000 × 3000 |
| OnePlus 8T | 48 | f/1.7 | 3000 × 3000 |
| Tecno Spark 7 | 16 | f/1.8 | 2992 × 2992 |
| Redmi Note 10 | 48 | f/1.8 | 3000 × 3000 |
| Huawei P30 Lite | 24 | f/1.8 | 3000 × 4000 |

4.3. Image pre-processing and classification

Initially, we collected our dataset's images from different geographical locations in Bangladesh. Next, we followed a series of steps: saving the images, resizing them to a single size, improving brightness, labeling, and classifying them to obtain the best possible quality. These steps keep the data close to real conditions and help a model handle new data efficiently. This process puts the dataset in good form for training a precise and reliable machine learning model for recognizing and classifying the different conditions that occur in cauliflower leaves. It supports agriculture by improving disease control and the monitoring of plant health. Fig. 3 shows the workflow of cauliflower leaf image pre-processing and classification.

  • 1. Image acquisition: To create a balanced and realistic dataset, we first identified different locations in Bangladesh where cauliflower plants were being grown. We collected images from different areas to cover different environmental conditions and growth stages. For a proper mixture, we selected healthy and infected cauliflower leaves carefully so that they were not damaged. We used multiple smartphone cameras to take high-resolution photos of the leaves on the plants in natural lighting. This avoided shadows and reflections and kept the photos clear and true to life for efficient classification.

  • 2. Image organization: After taking pictures of cauliflower leaves, we stored them safely and organized them into a specific folder on a computer. This careful organization made the next steps of the project easier and more efficient.

  • 3. Image pre-processing: During this step, we applied a variety of techniques to make our dataset more uniform and of better quality. All the images were resized to a standard size of 3000 × 3000 pixels for consistency. Brightness was increased by a factor of 1.2 so that the images became clearer. All duplicate and poor-quality images were removed to keep only high-quality images. These pre-processing steps ensured that our dataset reflects real-world scenarios and is therefore better suited to training classification models. Fig. 4 shows the pre-processing result for each category in the dataset.

  • 4. Image labeling: After completing the pre-processing steps, we carefully labelled each image as one of three classes: Healthy, Black Rot, or Insect Hole. This careful labeling makes classification more accurate and allows machine learning models to be trained and tested more effectively for cauliflower leaf disease detection.

  • 5. Classification: After completing the labeling process, we sorted the images into separate folders based on their classes: Healthy, Black Rot, and Insect Hole. This organization simplifies data handling, pre-processing, and the training and testing of machine learning models for disease detection.

Fig. 3. Workflow of image pre-processing and classification.

Fig. 4. Processed images from the Cauliflower Leaf Disease Dataset.
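
Two of the pre-processing steps described in this section, duplicate removal and the brightness increase by a factor of 1.2, can be illustrated with the standard library alone. The text does not state how duplicates were detected or which tool adjusted the brightness, so the exact-byte hashing and the per-channel scaling with clipping below are assumptions, and both function names are hypothetical. Resizing to 3000 × 3000 pixels would normally be done with an image library and is not shown.

```python
import hashlib
from pathlib import Path

def remove_exact_duplicates(paths):
    """Keep the first file seen for each distinct SHA-256 digest of the file bytes."""
    seen, unique = set(), []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

def brighten(pixels, factor=1.2):
    """Scale 8-bit RGB values by `factor`, clipping each channel at 255.

    `pixels` is a list of rows, each row a list of (R, G, B) tuples.
    """
    return [
        [tuple(min(255, round(channel * factor)) for channel in pixel) for pixel in row]
        for row in pixels
    ]

# A 1 x 2 toy "image": bright channels saturate at 255 after scaling.
sample = [[(100, 200, 250), (10, 20, 30)]]
print(brighten(sample))
```

Hashing file bytes only catches byte-identical copies; near-duplicates, such as the same leaf photographed twice, would need a perceptual comparison instead.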

4.4. Dataset Structure

The cauliflower leaf disease dataset images are divided into two folders: Original Image and Processed Image. Each folder contains the same category-based subfolders: “Healthy”, “Black Rot”, and “Insect Hole”. Each folder contains 2,661 images categorized into these three classes. Fig. 5 presents a visual representation of the cauliflower leaf disease dataset organization. In the Original Image folder, images are categorized into three classes: Healthy leaves, which are fresh and green; Insect Hole, caused by caterpillars; and Black Rot, caused by Xanthomonas campestris pv. campestris.

Fig. 5. Cauliflower Leaf Diseases Dataset Organization.

Similarly, the Processed Image folder is divided into the same three categories. However, the images in this folder have undergone pre-processing. We applied several pre-processing techniques, including background removal, resizing to 3000 × 3000 pixels, and increasing brightness by a factor of 1.2. Pre-processing helps the machine learning model work better by improving the quality of the images.
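
After downloading the dataset, the folder layout described in this section can be checked by counting the images in each class subfolder. This is a minimal sketch: the folder names are taken from the description above, but their exact spelling in the downloaded archive, the `.jpg`/`.jpeg` extensions, and the function name `count_images` are assumptions.

```python
from pathlib import Path

# Folder names as described in the text; the archive's exact spelling may differ.
SPLITS = ("Original Image", "Processed Image")
CLASSES = ("Healthy", "Insect Hole", "Black Rot")
IMAGE_SUFFIXES = {".jpg", ".jpeg"}

def count_images(root):
    """Count image files per (top-level folder, class) pair under `root`."""
    root = Path(root)
    counts = {}
    for split in SPLITS:
        for cls in CLASSES:
            folder = root / split / cls
            if folder.is_dir():
                counts[(split, cls)] = sum(
                    1 for p in folder.iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
                )
            else:
                # A missing folder is reported as zero images rather than an error.
                counts[(split, cls)] = 0
    return counts
```

Calling `count_images` on the extracted dataset directory and comparing the result with the per-class counts in Table 1 (934, 639, and 1088) is a quick way to confirm that the download is complete.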

4.5. Data Annotation

The process of data annotation and labeling was conducted by Professor Dr. M. A. Rahim, Head of the Department of Agricultural Science, and Dr. ATM Majharul Mannan, Assistant Professor of the Department of Agricultural Science, at Daffodil International University (DIU), Dhaka, Bangladesh. With extensive expertise in plant disease identification and classification, they ensured a rigorous approach to labeling. The annotation process followed these key steps:

  • 1. Quality screening: Each image was carefully examined to ensure clear and adequate representation of disease symptoms. Low-quality images or those lacking sufficient detail were removed from the dataset.

  • 2. Categorization: Images were classified based on visible symptoms such as discoloration, necrosis, and structural deformations. The categories were Healthy, Black Rot, and Insect Hole.

  • 3. Validation: After the initial classification, the labeled images were re-evaluated to ensure accuracy and consistency throughout the dataset.

This expert-driven annotation process supports the creation of a high-quality dataset that is well suited to machine learning applications in plant disease detection and classification.

Limitations

The dataset has several limitations. The images were taken with ordinary smartphone cameras, which may not offer the best image quality and can make it harder for a model to detect fine details. Because the data was recorded at different times, brightness, colour, and leaf condition vary, which may affect the consistency of the dataset. The dataset includes only three classes (Healthy, Insect Holes, Black Rot), which is limited for real-world use. The data was collected from only two locations in Bangladesh, which limits how well it represents other regions with different climates and disease patterns. It was also collected over a short period (November 2024 to January 2025) and therefore lacks the seasonal variation needed to build a strong and reliable model. Finally, the image distribution is imbalanced across classes, which can bias a model; augmentation techniques can be applied to balance the classes. We will address these limitations in a future version of the dataset.
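
To give a sense of scale for the balancing step mentioned above, the sketch below computes how many augmented images each class would need in order to match the largest class. Balancing up to the largest class is an assumption made for illustration; the text does not specify a target size or which augmentation operations would be used.

```python
import math

# Class counts copied from Table 1.
counts = {"Healthy": 934, "Insect Hole": 639, "Black Rot": 1088}

# Assumed target: bring every class up to the size of the largest one.
target = max(counts.values())

# Number of augmented images to generate per class.
extra_needed = {name: target - n for name, n in counts.items()}

# Upper bound on augmented copies per original image if the work is spread evenly.
copies_per_image = {name: math.ceil(extra / counts[name]) for name, extra in extra_needed.items()}

for name in counts:
    print(f"{name:12s} needs {extra_needed[name]:4d} augmented images "
          f"(at most {copies_per_image[name]} per original)")
```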

Ethics Statement

We confirm that our study followed all ethical guidelines. No plants, animals, or people were harmed during the research. Also, we did not collect any data from social media. All authors agree to follow the ethical rules needed for publishing in Data in Brief.

Credit Author Statement

Sabbir Hossain Durjoy: conceptualization, data curation, methodology, writing original draft; Md. Emon Shikder: conceptualization, methodology, writing original draft, data curation; Md Mehedi Hasan Shoib: conceptualization, methodology, writing original draft; Md Hasan Imam Bijoy: conceptualization, supervision, formal analysis, writing - review & editing.

Acknowledgements

We sincerely thank Professor Dr. M. A. Rahim, Head of the Department of Agricultural Science, and Dr. ATM Majharul Mannan, Assistant Professor of the Department of Agricultural Science, at Daffodil International University (DIU), Dhaka, Bangladesh, for their valuable help in checking the data. Their helpful advice and strong support played a key role in completing this project.

This research did not receive any funding from public, commercial, or non-profit organizations.

Declaration of Competing Interest

The authors confirm that they have no financial interests or personal connections that could have influenced the work in this paper.

Data Availability

References

  • 1. Alaba T.E., Holman J.M., Ishaq S.L., Li Y. Current knowledge on the preparation and benefits of cruciferous vegetables as relates to in vitro, in vivo, and clinical models of inflammatory bowel disease. Curr. Dev. Nutr. 2024;8. doi: 10.1016/J.CDNUT.2024.102160.
  • 2. Singh B.K., Singh B., Singh P.M. Breeding cauliflower: a review. Int. J. Vegetable Sci. 2018;24:58–84. doi: 10.1080/19315260.2017.1354242.
  • 3. Chatterjee S. Screening of cauliflower genotypes against economically important diseases and disorder in mid hilly regions of Himachal Pradesh. Int. J. Pure Appl. Biosci. 2018;6:774–778. doi: 10.18782/2320-7051.6277.
  • 4. Shah F.M., Razaq M., Ali Q., Shad S.A., Aslam M., Hardy I.C.W. Field evaluation of synthetic and neem-derived alternative insecticides in developing action thresholds against cauliflower pests. Sci. Rep. 2019;9:1–13. doi: 10.1038/s41598-019-44080-y.
  • 5. Sara U., Rajbongshi A., Shakil R., Akter B., Uddin M.S. VegNet: an organized dataset of cauliflower disease for a sustainable agro-based automation system. Data Brief. 2022;43. doi: 10.1016/j.dib.2022.108422.

Articles from Data in Brief are provided here courtesy of Elsevier