Abstract
This data paper presents an image dataset of 8432 high-quality images of Tamarindus indica [1] (tamarind), categorized into six types: Shelled Healthy Single, Shelled Healthy Multiple, Unshelled Healthy Single, Unshelled Healthy Multiple, Shelled Unhealthy Single, and Shelled Unhealthy Multiple. The collection is intended primarily to support agricultural research and machine learning applications for quality identification and evaluation. Images in each category vary in brightness and orientation, and all were taken under controlled conditions. The dataset is a useful resource for training and evaluating computer vision and machine learning models for Tamarindus indica quality assessment. Potential agricultural applications include rapid, localized quality evaluation, with scope for broader industry adoption if the approach is adapted to other crops. We invite researchers to explore this dataset to improve plant quality assessment methods and to contribute to trustworthy automated systems for Tamarindus indica quality evaluation.
Keywords: Tamarindus indica, Tamarind, Agricultural research, Machine learning, Quality assessment, Healthy and unhealthy categories, Shelled and unshelled tamarind
Specifications Table
| Subject | Computer Sciences |
| Specific subject area | Agronomy & Crop Science. |
| Data format | Raw |
| Type of data | Image |
| Data collection | The tamarind dataset consists of 8432 high-quality images of Tamarindus indica, categorized into six distinct types: Shelled Healthy Single, Shelled Healthy Multiple, Unshelled Healthy Single, Unshelled Healthy Multiple, Shelled Unhealthy Single, and Shelled Unhealthy Multiple. All images were manually categorized and saved in clearly labeled folders with sequential filenames for ease of access. Images were captured from multiple angles and at varying brightness levels to simulate real-world scenarios, within a controlled lighting setup that kept acquisition consistent. All images are stored in JPG format and resized to either landscape (1440 × 1080 pixels) or portrait (810 × 1080 pixels) dimensions, with a horizontal and vertical resolution of 72 dpi and 24-bit color depth. The dataset is intended as a resource for both academic research and practical applications in the food industry. |
| Data source location | Kendur, Taluka Shirur, District Pune, Maharashtra 412403, India. Latitude: 18.787195, Longitude: 74.023510 |
| Data accessibility | Repository name: Tamarind Image Dataset: Healthy and Unhealthy Categories; Data identification number: 10.17632/thd2zdrdzx.1; Direct URL to data: https://data.mendeley.com/datasets/thd2zdrdzx/1 |
| Related research article | None |
1. Value of the Data
• The dataset consists of 8432 high-quality JPG images of Tamarindus indica, categorized into six health and shell-based classes.
• Images are captured under diverse orientations, lighting conditions, and pod states (shelled/unshelled, healthy/unhealthy), simulating real-world agricultural scenarios. The dataset is publicly available and organized in a clear directory structure, enabling easy access for researchers and practitioners.
• Applicable across disciplines including agronomy, food science, computer vision, environmental studies, and plant pathology.
• Facilitates faster quality control, grading, and inspection in the tamarind supply chain, enhancing efficiency for producers and distributors.
• Enables interdisciplinary innovation, particularly in developing AI-based systems for fruit quality monitoring and agricultural decision support.
• The high-quality, organized dataset advances research by supporting automated quality evaluation for agriculture and food safety, and fostering interdisciplinary studies.
2. Background
Relevant prior datasets [1–8] have demonstrated the utility of structured image datasets in machine learning applications for agricultural and food quality analysis, underscoring the significance of developing comprehensive datasets to enhance research and innovation across multiple domains.
Tamarindus indica (tamarind) is a versatile and widely cultivated tropical fruit, renowned for its nutritional value and diverse applications in culinary, medicinal, and industrial contexts. Native to Tropical Africa, tamarind is extensively grown in regions like India, where it thrives under rainfed conditions and serves as a common avenue tree, providing both fruit and timber. India is the largest producer and exporter of tamarind globally, with significant cultivation in states like Tamil Nadu, Karnataka, and Andhra Pradesh. The fruit's pulp is rich in antioxidants, vitamins, and minerals, contributing to its potential health benefits, including antioxidant, anti-inflammatory, and antimicrobial properties. Tamarind has been a staple in traditional medicine for centuries, used as a digestive aid, fever reducer, and pain reliever. Its pulp has been employed as a laxative, digestive aid, and remedy for biliousness and bile disorders. In modern research, tamarind is studied for its potential medicinal uses, including its antioxidant and anti-inflammatory effects, which can protect against diseases such as heart disease, cancer, and diabetes [9].
Tamarind is widely cultivated not only in India but also in tropical regions across the world. While this dataset was collected in a local Indian context, the classification structure (shelled, unshelled, healthy, unhealthy) is broadly applicable, making it useful for researchers globally. In the culinary world, tamarind is a key ingredient in many dishes, particularly in South Indian cuisine, where it adds a unique tang to various recipes. It is used to make condiments like chutney and can be consumed as fresh pulp, paste, juice, or in powdered form. The fruit's versatility and nutritional profile make it a valued dietary component, with reported benefits such as improving digestion, supporting heart health, and aiding in weight management. Given its importance in agriculture, food science, and traditional medicine, as well as its significant economic role as a major export commodity, developing robust methods for assessing tamarind quality is crucial. This dataset is categorized into six types (Shelled Healthy Single, Shelled Healthy Multiple, Unshelled Healthy Single, Unshelled Healthy Multiple, Shelled Unhealthy Single, and Shelled Unhealthy Multiple) and provides a comprehensive resource for training machine learning models aimed at automating quality evaluation. By facilitating the development of accurate and efficient quality assessment systems, it supports advancements in agricultural research and food industry practices, ultimately benefiting both domestic and international markets [10].
Several publicly available datasets, such as FruitNet [5], VegNet [6], and the Citrus dataset [7], focus on fruit or vegetable quality assessment using images. However, most lack the shelled/unshelled and healthy/unhealthy classification that is specific to tamarind. Our dataset bridges this gap by offering fine-grained labeling and in-field imaging to simulate real-world market conditions. Compared to existing datasets, ours also focuses on a less-studied yet economically significant fruit, Tamarindus indica, which is used globally in food and medicine. We anticipate that our dataset will enable both crop-specific models and adaptable frameworks for other legumes and pod fruits.
While the tamarind dataset originates from India, the classification schema and imaging methodology are applicable to similar crops worldwide, making it valuable for international researchers. This dataset contributes to research in computer vision, food science, and agriculture by filling a gap on tamarind quality. Possible uses include automated grading systems and improved techniques for disease identification and agricultural quality control. The categorization structure reflects the common forms in which tamarind is sold and evaluated in the Indian market, ensuring practical relevance for downstream applications.
3. Data Description
All 8432 images were captured using a high-resolution smartphone camera (Samsung Galaxy F23 5G) under controlled lighting, resized to standard resolutions (1440 × 1080 for landscape, 810 × 1080 for portrait) with 72 dpi and 24-bit color depth, and stored in JPG format for efficient processing. Images are organized into six folders: Shelled Healthy Single Tamarind, Shelled Healthy Multiple Tamarind, Unshelled Multiple Healthy Tamarind, Unshelled Single Healthy Tamarind, Unhealthy Multiple Tamarind, and Unhealthy single Tamarind. Shelled tamarind refers to the fruit after the hard outer shell has been removed, revealing the edible pulp inside. Unshelled tamarind, on the other hand, is the entire fruit with its shell intact. The shell itself is often discarded but can be used for other purposes, such as making extracts with antioxidant properties.
The images were taken in a controlled setting to keep quality and resolution consistent across the dataset. They are saved in JPG format, whose lossy compression reduces file size with little visible loss of quality; the format was chosen for its balance of image quality, file size, and cross-platform compatibility. These properties make the dataset well suited to computer vision and machine learning applications, where large image collections must be stored and processed efficiently.
For ease of access and efficient processing, every image is placed in a folder clearly labeled with the class to which it belongs. Table 1 shows the number of images in each category.
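As an illustration of how the folder names could be turned into structured labels, the following Python sketch parses each folder name into shell state, health state, and pod count. The function name `parse_folder_name` and the tuple layout are illustrative assumptions, not part of the dataset. The two unhealthy folders carry no shell keyword, so the sketch assumes they hold shelled tamarind, in line with the class names given in the abstract.

```python
# Folder names as listed in the text.
FOLDERS = [
    "Shelled Healthy Single Tamarind",
    "Shelled Healthy Multiple Tamarind",
    "Unshelled Multiple Healthy Tamarind",
    "Unshelled Single Healthy Tamarind",
    "Unhealthy Multiple Tamarind",
    "Unhealthy single Tamarind",
]


def parse_folder_name(name):
    """Split a folder name into (shell, health, count) labels."""
    words = name.lower().split()
    health = "unhealthy" if "unhealthy" in words else "healthy"
    count = "multiple" if "multiple" in words else "single"
    # Assumption: folders without a shell keyword (the two unhealthy
    # folders) hold shelled tamarind, as the abstract's class names suggest.
    shell = "unshelled" if "unshelled" in words else "shelled"
    return shell, health, count


if __name__ == "__main__":
    for folder in FOLDERS:
        print(folder, "->", parse_folder_name(folder))
```

Separating the three attributes in this way would let a model be trained on the six combined classes or on a single attribute, such as healthy versus unhealthy.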
Table 1.
Quantitative breakdown: image count per tropical fruit category.
| Name of the Dataset Folder | Image Count |
|---|---|
| Shelled Healthy Multiple Tamarind | 971 |
| Shelled Healthy Single Tamarind | 3237 |
| Unhealthy Multiple Tamarind | 822 |
| Unhealthy single Tamarind | 1202 |
| Unshelled Multiple Healthy Tamarind | 776 |
| Unshelled Single Healthy Tamarind | 1424 |
| Total | 8432 |
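The counts in Table 1 are uneven across folders, which can matter when training a classifier. The sketch below computes each class's share of the dataset and a set of inverse-frequency class weights. The weighting formula (total divided by the number of classes times the class count) is a common convention that we assume here for illustration; it is not a step described in this paper.

```python
# Image counts per folder, copied from Table 1.
CLASS_COUNTS = {
    "Shelled Healthy Multiple Tamarind": 971,
    "Shelled Healthy Single Tamarind": 3237,
    "Unhealthy Multiple Tamarind": 822,
    "Unhealthy single Tamarind": 1202,
    "Unshelled Multiple Healthy Tamarind": 776,
    "Unshelled Single Healthy Tamarind": 1424,
}


def class_shares(counts):
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(CLASS_COUNTS)
    weights = balanced_class_weights(CLASS_COUNTS)
    for name, n in CLASS_COUNTS.items():
        print(f"{name:38s} {n:5d} {shares[name]:6.1%} weight={weights[name]:.2f}")
```

With these weights, the largest folder receives a weight below 1 and the smallest a weight above 1, so each class contributes equally to a weighted loss.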
The dataset supports the development of robust algorithms for disease identification and fruit quality classification by offering a wide variety of images taken under varying conditions.
The images are of high quality and provide a clear, detailed visual depiction of the tamarind samples. Fig. 1 presents the directory structure of the dataset, organized by category folder. Table 2 provides representative images from the healthy and unhealthy categories.
Fig. 1.
Directory structure of the Tamarind image dataset.
Table 2.
Sample images of Tamarind dataset.
4. Experimental Design, Materials and Methods
4.1. Experimental design
Creating the tamarind dataset involved five main steps: selection of tamarind, image acquisition, image classification, image preprocessing, and image organization. Fig. 2 shows the stage-by-stage progress of the dataset, from selection to storage.
Fig. 2.
Stage-by-stage progress: dataset selections to storage.
4.1.1. Selection of tamarind
The initial step in creating the dataset involved defining its structure and selecting relevant categories of tamarind based on their appearance and health state. This process resulted in six unique classes: Shelled Healthy Single, Shelled Healthy Multiple, Shelled Unhealthy Single, Shelled Unhealthy Multiple, Unshelled Healthy Single, and Unshelled Healthy Multiple. These categories were chosen to capture a comprehensive range of tamarind conditions, ensuring the dataset's applicability in various agricultural and food science applications.
4.1.2. Sampling strategy and categorization rationale
The tamarind samples in this dataset were selected through a purposive sampling approach designed to capture diverse quality conditions while reflecting typical market variability. The following details define our sampling methodology:
• Sampling Population: Tamarind fruits sourced from retail and agricultural environments in Pune district, Maharashtra, India. Specifically:
  ○ Five local retail markets: Kendur, Shirur, Wagholi, Shikrapur, and Ranjangaon.
  ○ Two orchards located in Shirur taluka.
• Sampling Frame: Approximately 850–1000 individual tamarind pods were manually screened and selected to represent the six predefined dataset categories.
• Sample Size: The dataset was created from an estimated total of 850–1000 tamarind pods, ensuring balanced representation across all health and shell-type categories.
• Inclusion Criteria: Pods exhibiting clearly visible healthy or unhealthy characteristics (e.g., discoloration, shriveling, spoilage marks). Both shelled and unshelled pods, in single and multiple formations, consistent with market presentation.
• Exclusion Criteria:
  ○ Pods with ambiguous or indeterminate health condition.
  ○ Pods that were excessively damaged or deteriorated beyond visual classification.
• Handling of Sampling Limitations and Bias: The purposive sampling strategy was chosen to maximize category diversity relevant for machine learning applications rather than random or proportional sampling. By deliberately sourcing from both orchard and retail contexts, potential bias from single-source sampling was reduced. While the dataset does not encompass all markets within the region, this intentional sampling approach focused on controlled diversity over exhaustive coverage to support machine learning model development.
4.1.3. Image acquisition
To ensure consistent lighting and high-quality images, a high-resolution smartphone camera was used to capture photographs in a controlled environment. The tamarind samples were arranged and photographed from multiple perspectives to capture variations in appearance, brightness, and orientation. This approach was intended to make the images representative of real-world tamarind fruit, providing a robust foundation for subsequent analysis and modelling.
4.1.4. Image classification
Following image acquisition, each image was manually classified into one of the six predefined categories based on the tamarind's shell state, health state, and whether it showed a single pod or multiple pods. To facilitate clear organization and easy access, each class was assigned to its own folder. The images were saved in JPG format, which balances file size and quality, and resized to 1440 × 1080 pixels (landscape) or 810 × 1080 pixels (portrait) at 72 dpi. Basic preprocessing tasks, such as cropping and adjusting brightness where necessary, were performed using IrfanView (64-bit, version 4.62). The preprocessing step ensured that the images were standardized and ready for further analysis.
4.1.5. Image storage
After classification and preprocessing, the photographs were organized into clearly labelled folders. Each image was sequentially named to ensure clarity and ease of access. To facilitate distribution and ensure compatibility across multiple platforms, the dataset contains only JPG images. This organization and formatting make the dataset a practical resource for researchers and practitioners in agriculture, food science, and related fields.
As Fig. 2 illustrates, the development of the tamarind dataset involves several critical phases. The first phase, tamarind selection, involves determining the markets and the kinds of fruit to be photographed, so that only suitable samples are selected for imaging. During the image acquisition phase, a white sheet or other uniform background is used and a high-quality camera is positioned under consistent lighting to preserve uniformity between images.
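The sequential naming step can be sketched as follows. The paper does not state the filename pattern, so the `prefix_0001.jpg` scheme, the four-digit padding, and the function names below are hypothetical choices for illustration only. The sketch first builds a rename plan and applies it only if no target name already exists.

```python
from pathlib import Path


def plan_sequential_names(folder, prefix, extensions=(".jpg", ".jpeg")):
    """Return (old_path, new_path) pairs numbering images 1..N in sorted order."""
    folder = Path(folder)
    images = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    # Zero-pad the counter so that filenames sort in numeric order.
    digits = max(4, len(str(len(images))))
    return [
        (old, folder / f"{prefix}_{i:0{digits}d}.jpg")
        for i, old in enumerate(images, start=1)
    ]


def apply_plan(plan):
    """Rename files according to the plan, refusing to overwrite anything."""
    pending = [(old, new) for old, new in plan if old != new]
    # Check every target first so that nothing is renamed if any would clash.
    for _, new in pending:
        if new.exists():
            raise FileExistsError(f"target already exists: {new}")
    for old, new in pending:
        old.rename(new)
```

Building the plan separately from applying it allows the proposed names to be reviewed before any file on disk is changed.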
4.2. Materials or specification of image acquisition system
Camera Details and Image Acquisition
To collect high-quality images for the tamarind dataset, we used the Samsung Galaxy F23 5G (SM-E236B) Android smartphone. This device has a 50-megapixel (f/1.8) rear primary camera with the Sony IMX 582 1/2″ sensor, which is known for its image quality and low-light performance. The camera's high resolution allowed us to capture detailed images of the tamarind samples, so that small features remained clearly visible.
To ensure consistent and high-quality images, we followed standard image acquisition procedures. All images were captured using the rear camera, which provided the best possible resolution and clarity. The images were then processed to a standard resolution of 1440 × 1080 pixels in Landscape mode and 810 × 1080 pixels in Portrait mode. This resizing helped maintain a consistent format across the dataset while preserving the essential details of each image. The images were saved in JPG format, which offers an optimal balance between file size and image quality, making them suitable for storage and processing.
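The choice between the two standard sizes can be expressed as a small rule. The sketch below is an illustration under assumptions, not the authors' IrfanView workflow: it picks the target size from the source orientation and reports how much the aspect ratio would change. Treating square images as landscape is our assumption, and the 4000 × 3000 frame in the demo is a hypothetical example.

```python
# Standard sizes stated in the text, as (width, height) in pixels.
LANDSCAPE_SIZE = (1440, 1080)
PORTRAIT_SIZE = (810, 1080)


def target_size(width, height):
    """Pick the dataset's standard size from the source orientation."""
    # Assumption: square images are treated as landscape.
    return LANDSCAPE_SIZE if width >= height else PORTRAIT_SIZE


def aspect_ratio_change(width, height):
    """Ratio of target to source aspect ratio; 1.0 means no distortion."""
    target_w, target_h = target_size(width, height)
    return (target_w / target_h) / (width / height)


if __name__ == "__main__":
    # Hypothetical source frames, used only to exercise the functions.
    for w, h in [(4000, 3000), (3000, 4000), (1920, 1080)]:
        print((w, h), "->", target_size(w, h), aspect_ratio_change(w, h))
```

Both standard sizes have a 4:3 or 3:4 aspect ratio, so a source frame with the same ratio can be resized without distortion, while other ratios would need cropping first. The returned size could then be passed to an image library's resize call, such as Pillow's `Image.resize`.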
The use of a mobile device like the Samsung Galaxy F23 5G for image acquisition highlights the accessibility and convenience of creating high-quality datasets without the need for specialized equipment. This approach can be particularly beneficial for researchers and practitioners working in resource-constrained environments or those requiring rapid data collection.
Additional Considerations
In addition to the technical specifications of the camera, we also considered environmental factors during image acquisition. The images were taken under controlled lighting conditions to minimize variability and to ensure that the dataset accurately represents the real-world condition of tamarind fruit. This approach is intended to make the dataset reliable and useful for training machine learning models aimed at automating tamarind quality assessment. Fig. 3 shows the timeline for the collection of the Tamarindus indica dataset.
Fig. 3.
Timeline for the collection of the Tamarindus indica dataset.
4.3. Methods
The tamarind dataset was developed through a systematic approach. The six defined tamarind categories capture diverse tamarind conditions. High-resolution images were captured using a Samsung Galaxy F23 5G under controlled lighting conditions, focusing on multiple views. Images were resized to standard resolutions (1440 × 1080 pixels for Landscape and 810 × 1080 pixels for Portrait) and saved in JPG format, with basic preprocessing tasks like cropping and brightness adjustment performed. Images were manually annotated and classified into their respective categories, then organized into labelled folders, sequentially named for clarity, and compressed for distribution. This approach ensures a comprehensive and well-organized dataset suitable for agricultural research and food science applications. This methodology facilitates the efficient creation and assessment of machine learning models aimed at assessing fruit quality, hence advancing research in agriculture and the food sector.
Ultimately, the images were saved in JPG format at a resolution of 72 dpi (dots per inch) and arranged into folders according to health and shell type.
4.4. Evaluation framework
A thorough evaluation is needed to determine whether the dataset is useful for developing dependable and accurate models for tamarind classification. We use accuracy, precision, recall, and F1-score to give a comprehensive picture of model performance. We applied the MobileNetV2, VGG16, and ResNet50 architectures, which are well known for their effectiveness in image recognition, to the Tamarindus indica images. Before training on our dataset, the pretrained models reached only 38 % (MobileNetV2), 15 % (VGG16), and 6 % (ResNet50) accuracy.
After training on our dataset, notable gains were observed. Table 3 and Table 4 show that MobileNetV2 reached 94 % accuracy, VGG16 reached 98 %, and ResNet50 reached 99 %.
Table 3.
Accuracy values for the Tamarindus indica (Imli) dataset.
| Model | Accuracy before Training | Accuracy after Training on our dataset [1] |
|---|---|---|
| MobileNetV2 | 38 % | 94 % |
| VGG16 | 15 % | 98 % |
| ResNet50 | 6 % | 99 % |
Table 4.
Confusion matrix before and after training with different model.
The confusion matrices for the pretrained models, before and after training on the Tamarindus indica dataset, are presented in Table 4.
VGG16, ResNet50, and MobileNetV2 were chosen due to their high post-training accuracy and F1 scores, indicating strong performance in correctly classifying Tamarindus indica data. The metrics are derived by comparing predicted and actual labels: accuracy measures overall correctness, while the F1 score reflects the balance between precision and recall.
Limitations
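For readers who wish to reproduce such metrics from a confusion matrix, the following sketch computes accuracy and per-class precision, recall, and F1 score using their standard definitions. The 3 × 3 matrix in the demo is a made-up toy example, not data from this paper, and the function names are illustrative.

```python
def accuracy(cm):
    """Overall accuracy; cm[i][j] counts true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    num_classes = len(cm)
    results = []
    for c in range(num_classes):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp  # rest of the row: true c, predicted otherwise
        fp = sum(cm[r][c] for r in range(num_classes)) - tp  # rest of the column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1 for _, _, f1 in per_class_metrics(cm)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy confusion matrix for illustration only (not from the paper).
    toy = [[8, 2, 0], [1, 9, 0], [0, 1, 9]]
    print("accuracy:", accuracy(toy))
    print("macro F1:", macro_f1(toy))
    for label, metrics in enumerate(per_class_metrics(toy)):
        print(label, metrics)
```

Because the class sizes in this dataset are uneven, reporting macro-averaged F1 alongside accuracy may give a fairer view of performance on the smaller classes.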
While this dataset provides a comprehensive collection of tamarind images across six distinct categories, expanding the dataset to include more regional varieties or larger samples from different environments could further enhance its applicability and generalization in real-world scenarios. Additionally, capturing images under varying environmental conditions could improve model robustness.
Ethics Statement
Human or animal participants are not used in this study. It thus conforms, as far as ethical issues are concerned, to all pertinent criteria and recommendations supplied by Data in Brief.
CRediT authorship contribution statement
Amol Bhosle: Conceptualization, Writing – review & editing. Deepali Godse: Conceptualization, Writing – original draft. Sandip Thite: Conceptualization, Writing – review & editing. Kailas Patil: Supervision, Methodology, Data curation, Writing – original draft. Touhid Bhuiyan: Supervision, Writing – review & editing.
Acknowledgements
We are grateful to Washington University of Science and Technology, Alexandria, USA and Vishwakarma University, Pune, India for their support and provision of necessary resources during this research endeavour. In addition to academic expertise, both Dr. Sandip Thite and Dr. Amol Bhosale are active practitioners in agriculture, directly involved in the cultivation and assessment of tamarind. Their field experience significantly informed the dataset curation process.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Contributor Information
Kailas Patil, Email: kailas.patil@vupune.ac.in.
Touhid Bhuiyan, Email: touhid.bhuiyan@wust.edu.
Data Availability
References
- 1. Thite S., Patil K. Tamarind image dataset: healthy and unhealthy categories (Version 1) [Dataset]. Mendeley Data. 2024. doi: 10.17632/thd2zdrdzx.1.
- 2. Thite S., Suryawanshi Y., Patil K., Chumchu P. Coconut (Cocos nucifera) tree disease dataset: a dataset for disease detection and classification for machine learning applications. Data Brief. 2023;51. doi: 10.1016/j.dib.2023.109690.
- 3. Chumchu P., Patil K. Dataset of cannabis seeds for machine learning applications. Data Brief. 2023;47. doi: 10.1016/j.dib.2023.109257.
- 4. Jadhav R., Suryawanshi Y., Bedmutha Y., Patil K., Chumchu P. Mint leaves: dried, fresh, and spoiled dataset for condition analysis and machine learning applications. Data Brief. 2023;51. doi: 10.1016/j.dib.2023.109717.
- 5. Meshram V., Patil K. FruitNet: Indian fruits image dataset with quality for machine learning applications. Data Brief. 2022;40. doi: 10.1016/j.dib.2021.107686. ISSN 2352-3409.
- 6. Suryawanshi Y., Patil K., Chumchu P. VegNet: dataset of vegetable quality images for machine learning applications. Data Brief. 2022;45. doi: 10.1016/j.dib.2022.108657.
- 7. Huang M., Chen Y. Citrus dataset for image classification. Data Brief. 2023;51. doi: 10.1016/j.dib.2023.109628.
- 8. Thite S., Patil K., Jadhav R., Suryawanshi Y., Chumchu P. Empowering agricultural research: a comprehensive custard apple (Annona squamosa) disease dataset for precise detection. Data Brief. 2024;53. doi: 10.1016/j.dib.2024.110078.
- 9. Rao Y.S., Mathew K.M. Tamarind. In: Peter K.V., editor. Handbook of Herbs and Spices. Second Edition. Woodhead Publishing; 2012. pp. 512–533.
- 10. Passos R.S.F.T., de Sousa C.C.A., da Silva M.C.A., Herrero A.M., Ruiz-Capillas C., Cavalheiro C.P. Tamarind (Tamarindus indica L.) components as a sustainable replacement for pork meat in Frankfurter sausages. Foods. 2025;14(2):197. doi: 10.3390/foods14020197.