Data in Brief. 2024 Nov 1;57:111097. doi: 10.1016/j.dib.2024.111097

A novel automated cloud-based image dataset for high-throughput phenotyping in weed classification

Sunil G C a, Cengiz Koparan a, Arjun Upadhyay a, Mohammed Raju Ahmed a, Yu Zhang a, Kirk Howatt b, Xin Sun a
PMCID: PMC11599996  PMID: 39605934

Abstract

Deep learning-based weed detection data management involves data acquisition, data labeling, model development, and model evaluation phases. Of these phases, data acquisition and data labeling are labor-intensive and time-consuming steps in building robust models. In addition, low temporal variation of crops and weeds in the datasets is one of the limiting factors for effective weed detection model development. This article describes the cloud-based automatic data acquisition system (CADAS), which captures weed and crop images at fixed time intervals so that plant growth stages are taken into account for weed identification. The CADAS was developed by integrating fifteen digital cameras in the visible spectrum with the gPhoto2 library, external storage, cloud storage, and a computer running a Linux operating system. The dataset from the CADAS contains six weed species and eight crop species for weed and crop detection. A dataset of 2000 images per weed and crop species was publicly released. Raw RGB images underwent a cropping process guided by bounding box annotations to generate individual JPG images for crop and weed instances. In addition to the cropped images, 200 raw images with label files were released publicly. This dataset holds potential for investigating challenges in deep learning-based weed and crop detection in agricultural settings. Additionally, these data could be used by researchers along with field data to boost model performance by reducing the data imbalance problem.

Keywords: Automated data acquisition, Cloud computing, Computer vision, Deep learning, Weed and crop detection


Specifications Table

Subject Weed and crop dataset generated by a data acquisition system for precision weed management
Specific subject area Weed Identification, Data Acquisition System, Computer Vision, Deep Learning
Type of data Raw and cropped JPG weed and crop images.
Data collection The dataset consists of six weed species (horseweed: Conyza canadensis, kochia: Bassia scoparia, palmer amaranth: Amaranthus palmeri, ragweed: Ambrosia artemisiifolia, redroot pigweed: Amaranthus retroflexus, and waterhemp: Amaranthus tuberculatus) and eight crop species (blackbean: Phaseolus vulgaris, canola: Brassica napus, corn: Zea mays, field pea: Pisum sativum, flax: Linum usitatissimum, lentil: Lens culinaris, soybean: Glycine max, and sugar beet: Beta vulgaris). Cloud based automatic data acquisition system was used to automate data collection.
Data source location Waldron Greenhouse, North Dakota State University
Fargo, North Dakota, United States of America
Data accessibility Repository name: Mendeley Data
Direct URL to data: https://data.mendeley.com/datasets/hs7d7kpd3z/2

1. Value of the Data

  • This dataset includes images of eight major crops commonly grown in North Dakota, USA, alongside six problematic weed species that pose a challenge for farmers. These data could be used alongside field data to improve computer vision models for weed and crop detection.

  • This dataset includes a wide variety of crop and weed species, allowing researchers to investigate specific crops or weeds of interest. For instance, a researcher in the Midwest region of the USA interested in addressing Palmer amaranth issues in soybean can use only the soybean and Palmer amaranth data.

  • These data could be useful in creating software applications that effectively identify weed and crop species by name.

  • This dataset contains phenotypically similar weed species such as waterhemp, redroot pigweed, and Palmer amaranth. Therefore, it could help in developing more robust algorithms for cases where weed species are visually similar.

  • When combined with existing publicly accessible datasets, this dataset offers a comprehensive resource for developing state-of-the-art deep learning models for crop and weed discrimination.

2. Data Description

This article presents RGB images of eight commonly grown crop species and six troublesome weed species in North Dakota, USA. The eight crop species are blackbean, canola, corn, field pea, flax, lentil, soybean, and sugar beet; the six weed species are horseweed, palmer amaranth, redroot pigweed, ragweed, kochia, and waterhemp. The dataset is organized into two main folders: crops and weeds. The crops folder contains eight subfolders for the individual crop species, while the weeds folder has six subfolders representing the different weed species. Each of the six weed and eight crop subfolders contains 2000 JPG images. Because the cropped weed and crop objects in the raw images vary in size, the resulting cropped images also vary in size. In addition to the cropped images, a folder (raw images with labeled files) containing 200 raw JPG images with corresponding bounding box information (in txt format) is provided, along with a classes.txt file listing the weed and crop species names. Fig. 1 illustrates the structure of the data folders and files. These data were gathered to conduct various studies on weed classification and to develop deep learning models for weed detection with ground robotics in the field [[1], [2], [3]]. Due to their variety of categories and large number of images, the labeled raw images are well-suited for developing computer vision algorithms that can accurately detect weeds and crops. Conversely, the cropped images can be used to develop and test various deep learning feature extractor models for image classification.
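The folder layout described above can be inspected programmatically. The sketch below tallies JPG images per species subfolder; the top-level folder names (crops, weeds) follow the description above, while lowercase .jpg extensions are an assumption about the released files.

```python
from pathlib import Path

def count_images_per_class(root):
    """Count JPG images in each crops/<species> and weeds/<species> subfolder.

    Assumes the layout described in the article: root/crops/<species>/*.jpg
    and root/weeds/<species>/*.jpg (folder and extension names are assumptions
    based on the repository description).
    """
    counts = {}
    for group in ("crops", "weeds"):
        group_dir = Path(root) / group
        if not group_dir.is_dir():
            continue
        for species_dir in sorted(group_dir.iterdir()):
            if species_dir.is_dir():
                # Count only JPG files directly inside the species folder
                counts[f"{group}/{species_dir.name}"] = sum(
                    1 for _ in species_dir.glob("*.jpg"))
    return counts
```

For the released dataset, each of the fourteen species keys would be expected to map to 2000.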

Fig. 1.

Fig 1

Folder structure of the data repository that is publicly available.

3. Experimental Design, Materials and Methods

3.1. Crops and weeds image acquisition system design

The main components of this study were twelve Canon EOS T7 and three EOS 90D visible spectrum cameras (Canon Inc., Tokyo, Japan) (Table 1), a desktop computer (Dell, Round Rock, USA) used as the central control unit, a four-terabyte (4 TB) external hard drive (Transcend, Taipei, Taiwan) as the image storage device, USB cables for camera control and image access, power adapters to keep the cameras running continuously, a Verizon wireless device (Verizon Wireless, New York, USA) for internet connection, and camera mounting hardware. The system set up in a laboratory environment is shown in Fig. 2.

Table 1.

Canon EOS 90D and Canon EOS T7 camera specifications.

Specification | EOS 90D | EOS T7
Sensor resolution | Actual: 34.4 megapixel; effective: 32.5 megapixel (6960 × 4640) | Actual: 24.7 megapixel; effective: 24.1 megapixel (6000 × 4000)
Image sensor | 22.3 × 14.8 mm (APS-C) CMOS | 22.3 × 14.9 mm (APS-C) CMOS
Image stabilization | Yes | No

Fig. 2.

Fig 2

Automated image acquisition system components: a) Canon EOS Rebel T7, b) Canon EOS 90D, c) desktop computer with Linux OS, d) AC battery adapters for continuous power, e) LCD display, f) data storage disk, g) USB hubs, and h) power extension cords.

The cameras were connected with USB extension cables to a desktop computer (Dell, Round Rock, USA) with an Intel processor (Core i7-3770 CPU @ 3.40 GHz) and 8 GB memory running a Linux operating system (Fig. 3, Fig. 4). The USB extension cables allowed both sending trigger signals to the cameras and retrieving images at set time intervals. This control was achieved with the gPhoto2 (http://www.gphoto.org/) image acquisition software (IAS). gPhoto2 allowed control of multiple cameras while simultaneously downloading and storing images in a designated directory on the desktop computer or external hard disk. gPhoto2 is a set of software applications and libraries for digital cameras on Unix-like systems (Linux, FreeBSD, NetBSD, macOS).

Fig. 3.

Fig 3

Block diagram of the automated image acquisition system and its components: external storage device, cloud, desktop with Ubuntu 20.04, and Canon camera; arrows show the direction of image flow from the camera.

Fig. 4.

Fig 4

Automated image acquisition system greenhouse setup: a) greenhouse benches with weed and crop pots, b) digital camera, c) desktop computer with Linux OS attached to an external storage drive.

3.2. Image acquisition system bash script

A custom bash script was developed to parse the USB ports, execute a command to capture and download images to the local system, and move these images to the external storage device and AWS (Amazon Web Services) S3 cloud storage. A detailed flow chart of the system script is provided in Fig. 5. First, the cameras were connected to the desktop computer through a USB hub. Second, the script scanned all cameras connected to the USB hub, and a directory was created for each camera to store its images. Inside a single loop over the 15 cameras, each image was captured, moved to the external storage device, and copied to an AWS S3 bucket. To resolve "camera busy" errors, the process (identified by its process identification number, PID) that prevented a camera from taking pictures was terminated in each loop. Finally, the system waited for a user-specified time interval (t = 30 min) before starting the next round of image acquisition; this interval is configurable. A 30-min interval was used because plants shift slightly, and light variations in the transparent greenhouse alter the appearance of images upon close observation. The total time taken to capture an image, copy it to the AWS S3 bucket, and move it to external storage was recorded. This system was named the "Cloud-based Automatic Data Acquisition System" (CADAS). The advantages of using the cloud include automating data acquisition and allowing users to access the data from home or the office without visiting the greenhouse. Additionally, cloud log files help verify that the cameras are functioning properly, saving time on data collection. Once acquired, the data remain unchanged in the cloud.
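The acquisition loop described above can be sketched as follows. This is a minimal illustration in Python rather than the original bash script, assuming gphoto2 and the AWS CLI are installed and on the PATH; the base directory name, bucket name, and filename pattern are hypothetical, and the PID-termination step is omitted for brevity.

```python
import subprocess
import time
from pathlib import Path

INTERVAL_S = 30 * 60  # user-configurable capture interval (t = 30 min)

def detect_cameras():
    """List camera USB ports via `gphoto2 --auto-detect` (one 'usb:...' port per line)."""
    out = subprocess.run(["gphoto2", "--auto-detect"],
                         capture_output=True, text=True, check=True).stdout
    return [line.split()[-1] for line in out.splitlines() if "usb:" in line]

def camera_dirs(ports, base="images"):  # base directory name is hypothetical
    """One storage directory per detected camera, keyed by its USB port."""
    return {port: Path(base) / port.replace(":", "_").replace(",", "_")
            for port in ports}

def capture_round(ports, bucket="s3://example-bucket"):  # bucket name is hypothetical
    """Capture one image per camera, then sync it to the S3 bucket,
    mirroring the per-camera loop in Fig. 5."""
    for port, out_dir in camera_dirs(ports).items():
        out_dir.mkdir(parents=True, exist_ok=True)
        # Trigger the camera and download the image over USB
        subprocess.run(["gphoto2", "--port", port,
                        "--capture-image-and-download",
                        "--filename", str(out_dir / "%Y%m%d-%H%M%S.jpg")],
                       check=True)
        # Copy this camera's images to cloud storage
        subprocess.run(["aws", "s3", "cp", str(out_dir),
                        f"{bucket}/{out_dir.name}", "--recursive"], check=True)

def run_forever(interval_s=INTERVAL_S):
    """Repeat capture rounds at the configured interval, as in the flowchart."""
    while True:
        capture_round(detect_cameras())
        time.sleep(interval_s)
```

A call to `run_forever()` would reproduce the timed loop; in practice the original system also moved images to the external drive and killed any blocking PID before each capture.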

Fig. 5.

Fig 5

Flowchart showing the Linux bash script algorithm for automatic image acquisition.

Fifteen cameras were mounted over the four benches, which were planted with the six weed and eight crop species. Fig. 6 shows a raw image of weed and crop plants on a bench taken from the camera mounted above it. After image acquisition was complete, data cleaning was performed to remove poor images from the datasets. Finally, a single image from each day was labeled using the LabelImg [4] software (Fig. 6). The labeling process designated each crop and weed plant as a distinct object. Because the camera position was constant, the remaining images for that day were labeled automatically using a Python script. A Python cropping script then used the labeled images to create class-specific datasets. To crop images into individual representations of crops and weeds, the published label text files were processed with a Python script. Fig. 7 depicts how individual weed and crop images were extracted from the raw images. Two research studies [2,3] were published from the data acquired with the CADAS. Additionally, greenhouse images were mixed with field images to develop a site-specific weed control system [1]. While the full dataset used in [2] is not included in this article, this release includes a randomly selected dataset of 28,000 images (2000 per species) from the six weed and eight crop species (Fig. 6). Additionally, a sample of 200 raw images with corresponding label files was uploaded, which could be mixed with field images for developing site-specific weed control systems [1].
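The cropping step converts each normalized bounding box in a label txt file into pixel coordinates before extracting the object. A minimal sketch of that conversion is below; the YOLO-style line format (class x_center y_center width height, all normalized to 0-1) is an assumption consistent with LabelImg's YOLO export, not a confirmed detail of the published label files.

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert one label line 'cls xc yc w h' (normalized 0-1, assumed
    YOLO-style format) into an integer pixel box (left, top, right, bottom)
    suitable for cropping, clamped to the image bounds."""
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    # Center/size -> corner coordinates, scaled to pixels
    left = int((xc - w / 2) * img_w)
    top = int((yc - h / 2) * img_h)
    right = int((xc + w / 2) * img_w)
    bottom = int((yc + h / 2) * img_h)
    box = (max(0, left), max(0, top), min(img_w, right), min(img_h, bottom))
    return int(cls), box
```

With Pillow, for example, `Image.open(raw_path).crop(box)` would then save each object as its own JPG under the subfolder named by the class index in classes.txt.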

Fig. 6.

Fig 6

Flowchart illustrating the process of extracting a cropped image containing six weeds and eight crops from the original image.

Fig. 7.

Fig 7

Flowchart depicting the methodology for cropping raw images to extract individual object classes, including relevant mathematical formulations.

Limitations

The dataset primarily consists of images of young plants (early growth stages). Using these images to train a classification model might not yield accurate results for later growth stages. Additionally, the greenhouse environment used for dataset generation may limit model performance in field conditions. To enhance model robustness, combining this dataset with field data is recommended. Finally, the publicly available CADAS dataset is a subset of the original data, and the provided cropped images do not perfectly represent the complete raw image set.

Ethics Statement

This dataset was captured with the CADAS and does not involve human or animal experiments, nor data from any social media platform.

Acknowledgments

This material is based upon work partially supported by the U.S. Department of Agriculture, agreement number 58-6064-8-023. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the U.S. Department of Agriculture. This work was supported by the USDA National Institute of Food and Agriculture, Hatch project number ND01487.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • 1. Upadhyay A., Sunil G.C., Zhang Y., Koparan C., Sun X. Development and evaluation of a machine vision and deep learning-based smart sprayer system for site-specific weed management in row crops: an edge computing approach. J. Agric. Food Res. 2024;18. doi: 10.1016/j.jafr.2024.101331.
  • 2. Sunil G.C., Zhang Y., Howatt K., Schumacher L.G., Sun X. Multi-species weed and crop classification comparison using five different deep learning network architectures. J. ASABE. 2024;67(2):43–55. doi: 10.13031/ja.15590.
  • 3. Sunil G.C., Koparan C., Ahmed M.R., Zhang Y., Howatt K., Sun X. A study on deep learning algorithm performance on weed and crop species identification under different image background. Artif. Intell. Agric. 2022;6:242–256. doi: 10.1016/j.aiia.2022.11.001.
  • 4. Lin T. LabelImg. 2015. https://github.com/tzutalin/labelImg.
