Full article title Data management challenges for artificial intelligence in plant and agricultural research
Journal F1000Research
Author(s) Williamson, Hugh F.; Brettschneider, Julia; Caccamo, Mario; Davey, Robert P.; Goble, Carole; Kersey, Paul J.; May, Sean; Morris, Richard J.; Ostler, Richard
Author affiliation(s) University of Exeter, University of Warwick, National Research Institute of Brewing, Earlham Institute, University of Manchester, Royal Botanic Gardens, University of Nottingham, John Innes Centre, Rothamsted Research, Alan Turing Institute, University of Edinburgh
Primary contact S dot Leonelli at exeter dot ac dot uk
Editors Ezer, Daphne; Witteveen, Joeri
Year published 2023
Volume and issue 10
Article # 324
DOI 10.12688/f1000research.52204.2
ISSN 2046-1402
Distribution license Creative Commons Attribution 4.0 International
Website https://f1000research.com/articles/10-324/v2
Download https://f1000research.com/articles/10-324/v2/pdf (PDF)

Abstract

Artificial intelligence (AI) is increasingly used within plant science, yet it is far from being routinely and effectively implemented in this domain. Particularly relevant to the development of novel food and agricultural technologies is the development of validated, meaningful, and usable ways to integrate, compare, and visualize large, multi-dimensional datasets from different sources and scientific approaches. After a brief summary of the reasons for the interest in data science and AI within plant science, the paper identifies and discusses eight key challenges in data management that must be addressed to further unlock the potential of AI in crop and agronomic research, and particularly the application of machine learning (ML), which holds much promise for this domain.

Keywords: data science, plant science, crop science, agricultural research, machine learning, data management, data quality, data sharing

Introduction

Data science is central to the development of plant and agricultural research and its application to social and environmental problems of a global scale, such as food security, biodiversity, and climate change. Artificial intelligence (AI) offers great potential towards elucidating and managing the complexity of biological data, organisms, and systems. It constitutes a particularly promising approach for the plant sciences, which are marked by the distinctive challenge of understanding not only complex genotype-environment (GxE) interactions that span multiple scales from the cellular through the microbiome to climate systems, but also GxE interactions with rapidly shifting human management practices (GxExM) in agricultural and other settings, whose reliance on digital innovations is growing at a rapid pace.[1][2] Accordingly, examples of useful applications of AI—and particularly machine learning (ML)—to plant science contexts are increasing, with the COVID-19 pandemic crisis further accelerating interest in this approach.[3]

Nevertheless, we are still far from a research landscape in which AI can be routinely and effectively implemented. A key obstacle concerns the development and implementation of effective and reliable data management strategies. Developing reliable and reproducible AI applications depends on having validated, meaningful, and usable ways to integrate large, multi-dimensional datasets from different sources and scientific approaches. This is especially relevant to the development of novel food and agricultural technologies, which rely on research from diverse fields including fundamental plant biology, crop research, conservation science, soil science, plant pathology, pest/pollinator ecology and management, water and land management, climate modelling, agronomy, and economics.

This paper explores data-related challenges to potential applications of AI in plant science, with particular attention paid to the analysis of GxExM interactions of relevance to crop science and agricultural implementations. It brings together the experiences of an interdisciplinary set of researchers from the plant and agricultural sciences, the engineering and computational sciences, and the social studies of science, all of whom are working with complex datasets spanning genomic, physiological, and environmental data and computational methods of analysis. The first part of the paper provides a brief overview of contemporary AI and data science applications within plant science, with particular attention paid to the UK and European landscape where the authors are based. The second part identifies and discusses eight challenges in data management that must be addressed to further unlock the potential of AI for plant science and agronomic research. We conclude with a reflection on how transdisciplinary and international collaborations on data management can foster impactful and socially responsible AI in this domain.

AI in plant research: Current status and challenges

Following wider trends in the biosciences, both basic and applied plant sciences have increasingly emphasized data-intensive modes of research over the last two decades.[4][5][6] The capacity to measure biological complexity at the molecular, organismal, and environmental scales has increased dramatically, as demonstrated by[7]:

  • advances in high-throughput genomics and norms and tools that have supported the development of a commons of publicly shared genomic data;
  • the development of platforms for high-throughput plant phenotyping in the laboratory, the greenhouse, and the field; and
  • the proliferation of remote sensing devices on crop-growing fields.

Such platforms and associated data generation have contributed to a booming AI industry in commercial agriculture, focused on the delivery of “precision” farming strategies, with estimates that the market will be worth US $1.55 billion by 2025.[8] Indeed, AI applications in plant research and agriculture have so far primarily benefited large-scale industrial farming[9], with R&D investment focused on commodity crops such as wheat, rice and maize; high-value horticulture crops such as soft fruits; and the enhancement of large-scale orchards and vineyards. In addition to this, however, the amount and type of data being collected, alongside advancements in AI methods, offer the opportunity to ask and address new questions of great importance to plant scientists and agricultural stakeholders around the world.[10]

AI is the field of study and development of computer hardware and software that perform functions, such as problem solving or learning, which have traditionally been considered properties of intelligent life. A range of research fields have contributed to the development of AI, currently the most prominent of which is ML, the design of algorithms for data processing, prediction, and decision support that are able to learn from a priori (“supervised”), inductive (“unsupervised”), and reward-based (“reinforcement”) experience.[11][a] This approach is particularly significant for applications that do not require an exact understanding of how the algorithm has reached its decision, as long as it has predictive power and it is possible to reproduce it.[12]

ML has been the dominant AI technology applied to plant and agricultural research so far. Many successful examples come from bioinformatics, where, for instance, researchers may not need to know why a sequence of amino acids was classified as alpha-helical in structure as long as they know how reliable that prediction is. Indeed, ML has been widely used in the analysis of sequence data, for example to identify signal peptides and functional domains in amino-acid sequences via neural nets and profile hidden Markov models, such as Pfam and SMART[13], as well as other classic examples.[14] One key example from genomics that goes back to the 1990s is the use of models to identify genes and predict their functions based on training data from multiple species.[15][16][17] This has ongoing relevance for orphan and non-model crop research, where experimental approaches such as CRISPR knockouts to identify and validate gene function for individual species may not be feasible or cost-effective, but results may be inferred from experiments in model species.[17] Other challenges in genomics that can be addressed include the inference of gene regulatory networks[18] and the identification of pathogen virulence effector genes from genomic sequence data.[19] Thus, ML can help to identify correlations not readily picked up by more traditional approaches and in turn suggest fruitful directions for further research. To date, whether or not correlations have biological meaning typically needs to be ascertained via experiment and/or observational data.[20][21] Efforts towards explainable AI are, however, gaining momentum, and both methodological and computational techniques are emerging which promise to support biological use of ML.[22]

Alongside applications in genomics, AI offers new opportunities for linking genotypes to phenotypes.[1] Image-based plant phenotyping has proven a particularly fertile area for the application of ML techniques, with the rapid development of non-destructive methods for the evaluation of plant responses to biotic and abiotic stress[23][24][25] and estimation of photosynthetic capacity[26], as well as a variety of feature detection, counting, classification, and semantic segmentation tasks.[27] With the arrival of deep supervised convolutional networks, the performance of ML algorithms in predicting leaf counts improved considerably.[28] Convolutional neural networks (CNNs) were also shown to be capable of performing challenging tasks of point feature detection[29] and pixelwise segmentation[30][31] on both roots and shoots in a variety of imaging modalities in both laboratory and field environments.[32] These technologies present substantial new opportunities for analyzing and understanding GxExM interactions through the integration of high-throughput phenotyping data with other forms of research data, including genomic, field evaluation, and climatic data. As well as addressing fundamental research questions, AI applications in this area offer the opportunity to understand and improve a range of practical activities, from crop breeding to agricultural management. Three are discussed in more detail here, while Table 1 collates all ML and AI application examples discussed in this paper.

Table 1. Examples of machine learning (ML) and artificial intelligence (AI) applications in plant and agricultural science discussed in this paper, and methods used in those referenced papers.
Example | Section discussed | Key ML/AI methods used | Source(s)
Gene identification and function prediction across species | This section | Various; see citations in review paper | Zou et al. 2019[17]
Inference of gene regulatory networks | This section | Bayesian networks, random forest, Markov random fields, tree-based models, dynamic factor graph models | Mochida et al. 2018[18]
Identification of pathogen virulence effector genes from genomic sequence data | This section | Support vector machine, random forest, convolutional neural networks, ensemble learning, Bayesian networks, tree-based models | Sperschneider 2019[19]
Non-destructive evaluation of plant responses to biotic and abiotic stress | This section | Support vector machine, artificial neural networks, convolutional neural networks | Singh et al. 2016[23]; Mohanty et al. 2016[24]; Ramcharan et al. 2017[25]
Automatic estimation of photosynthetic capacity | This section | Artificial neural networks, support vector machine, least absolute shrinkage and selection operator (LASSO), random forest, Gaussian process regression | Fu et al. 2019[26]
Convolutional neural networks for plant phenotyping image analysis | This section | Convolutional neural networks, support vector machine, random forest, encoder-decoder model, multi-loss multi-resolution network, deep residual network | Jiang & Li 2020[27]; Dobrescu et al. 2017[28]; Pound et al. 2017[29]; Yasrab et al. 2019[30]; Soltaninejad et al. 2020[31]
Augmenting genomic selection models in plant breeding with machine learning | This section | Bayesian regularized neural networks, radial basis function neural networks, reproducing kernel Hilbert space, random forest regression | Gonzalez-Camacho et al. 2018[33]; Harfouche et al. 2019[2]
Prediction of soil characteristics from near-infrared and mid-infrared soil spectroscopy data | This section | Regularized linear models, support vector machines, tree-based models | Data Study Group Team 2020[34]
Automatic identification of crop pest insects using bioacoustics data | This section | Support vector machines, random forest, randomized trees classifier, gradient boosting classifier | Potamitis et al. 2015[35]
Automatic digitization of herbaria specimens and specimen metadata | Next section | Convolutional neural networks | Carranza-Rojas et al. 2017[36]; Younis et al. 2018[37]
Leaf-counting models for plant phenotyping image analysis | This and next section | Multi-task learning, adversarial learning, layerwise relevance propagation, guided back propagation | Dobrescu et al. 2017, 2019, 2020[28][38][39]; Giuffrida et al. 2019[40]
Computer Vision Problems in Plant Phenotyping (CVPPP) workshops | Next section | Various; see citations in review paper | Tsaftaris & Scharr 2019[41]
Image analysis for automatic disease diagnosis in multiple crops using PlantVillage Nuru | Next section | Convolutional neural networks | Ramcharan et al. 2019[42]

One example of AI opportunities is found with genomic selection. Genomic selection (GS) is an approach for estimating breeding values for individual plants that can guide breeders’ decisions for selection and crossing[43], based on modelling associations between quantitative traits and a genome-wide set of markers. Accuracy of predictive models for GS and rate of genetic gain can be increased by employing ML, although the utility of ML in comparison to existing statistical models varies depending on the characteristics of the trait of interest.[33] A promising opportunity for the improvement of GS lies in using ML for the integration and the analysis of data from different omics layers (e.g., proteomics, metabolomics, and metagenomics) that mediate between genotype and phenotype, facilitating the prediction of quantitative traits based on biological mechanisms rather than genetic marker associations and thereby increasing the reliability and utility of models for a wider range of populations than is currently possible.[2]
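To make the idea of modelling associations between a quantitative trait and a genome-wide set of markers more concrete, the following is a minimal sketch rather than any method used in the cited studies. It fits a ridge regression (a classical penalized linear baseline, not one of the neural network or kernel methods listed in Table 1) to simulated marker genotypes and uses the fitted marker effects to predict the trait for held-out lines. The data, effect sizes, penalty value, and all function names are assumptions made purely for illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ridge(markers, phenotypes, penalty=1.0):
    """Return (intercept, marker effects) minimising squared error + penalty * sum(effect**2)."""
    n, p = len(markers), len(markers[0])
    # Centre markers and phenotypes so that the intercept is not penalised.
    col_means = [sum(row[j] for row in markers) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in markers]
    yc = [y - y_mean for y in phenotypes]
    # Normal equations: (X'X + penalty * I) effects = X'y
    xtx = [[sum(r[j] * r[k] for r in xc) + (penalty if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y for r, y in zip(xc, yc)) for j in range(p)]
    effects = solve(xtx, xty)
    intercept = y_mean - sum(m * e for m, e in zip(col_means, effects))
    return intercept, effects


def predict(intercept, effects, genotype):
    """Predicted trait value for one line from its marker genotype."""
    return intercept + sum(g * e for g, e in zip(genotype, effects))


def simulate(n_lines, true_effects, rng, noise_sd=1.0):
    """Toy data: marker genotypes coded 0/1/2 and phenotypes with Gaussian noise."""
    markers = [[rng.choice((0, 1, 2)) for _ in true_effects] for _ in range(n_lines)]
    phenotypes = [sum(g * e for g, e in zip(row, true_effects)) + rng.gauss(0, noise_sd)
                  for row in markers]
    return markers, phenotypes


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
TRUE_EFFECTS = [1.0, -0.8, 0.6, 0.5, -0.4] + [0.0] * 15  # simulated, not real estimates
train_x, train_y = simulate(200, TRUE_EFFECTS, rng)
test_x, test_y = simulate(100, TRUE_EFFECTS, rng)
intercept, effects = fit_ridge(train_x, train_y, penalty=1.0)
predicted = [predict(intercept, effects, g) for g in test_x]
print(f"Predictive correlation on held-out lines: {correlation(predicted, test_y):.2f}")
```

In practice, lines would be ranked by their predicted values to support selection decisions, and the correlation on held-out lines gives one rough measure of predictive accuracy against which more flexible ML models could be compared.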

A second example concerns long-term experiments. Long-term experiments (LTE)—where the same crop or crop rotation is grown for many years subject to a range of different management or treatment options—have an important place in agricultural research. Data from these experiments enable separation of agronomic and environmental (weather) influences on crop yield, as well as soil health over time, and have done much to influence modern farming practices.[44][45] The "Classical Experiments" at Rothamsted Research[46][47] are important examples. The data from these experiments, some of which were started in 1843, are available and documented in the Electronic Rothamsted Archive (e-RA) data resource.[48] Data from LTEs continue to be the subject of new analytical methods[49], yet remain a relatively untapped resource for knowledge discovery, in part because of the complexity of the experimental designs and the difficulty in accounting properly for the changes that might have occurred during their lifespans. To make LTEs more accessible for knowledge discovery, a recent initiative was launched by the Global Long Term Experiment Network to catalogue LTEs using a standard metadata schema. The use of ML methods combining data from LTEs with local weather data might, for example, reveal hidden patterns in the data linked to long-term or higher order interactions within the data, which could provide useful insights into the impact of future climate change.

Agricultural monitoring is a vital third example. AI offers many opportunities to improve the cost and labor efficiency of longstanding research and monitoring tasks in research and agricultural settings. While such possibilities are most developed in commercial agricultural settings, there are many opportunities too for the public research sector, as well as for small or non-commercial farmers, for example in agricultural settings where there is limited access to relevant scientific expertise. Take, for example, the assessment of soil health, which is a key driver of crop yields. Wet soil chemistry analyses are both expensive and time-consuming, and they are generally not accessible to growers in low- and middle-income countries. Using near-infrared (NIR) and mid-infrared (MIR) soil spectroscopy data, ML models can be developed that predict soil characteristics and nutrient content and that are faster and cheaper to run than such analyses.[34] Such models could be integrated with plant physiology models in the future to predict optimal crop performance in a given soil, and they open the possibility of developing hand-held soil devices for use directly by farmers or local advisors in countries where lab access and resources are limited. In another example of agricultural monitoring, conventional suction and light traps for monitoring the appearance and migration of airborne insects, including crop pests, currently require manual identification. Such methods can be augmented by ML models trained to recognize and classify insect species based on bioacoustics data[35], connected to in-field sonic sensors. Such developments are directed at increasing the scalability of insect pest monitoring networks and potentially removing the need for manual steps for some insect species.

While these three examples offer grounds for optimism about using AI and ML applications in plant and agricultural science, the effective implementation of these and similar methods depends in large measure on establishing a favorable data landscape, consisting of the networks and practices of sourcing, managing, and maintaining data. This is particularly important for research undertaken outside of resource-intensive commercial sites, including research in and for low- and middle-income countries. Identifying the primary challenges faced by users and would-be users of AI in the contemporary data landscape of plant science is necessary in order to understand the possibilities and limitations afforded by AI for public as well as private plant and agricultural research. Here we build on the experiences of leading UK-based researchers in these areas to identify and discuss eight key data challenges, summarized in Table 2. These challenges span technical, social, and governmental domains, and will require concerted international and transdisciplinary efforts from a range of stakeholders to address.

Table 2. Synoptic view of the data challenges of effectively implementing ML and AI in plant and agricultural science, possible solutions, and what can be lost and gained by investment in those areas.
Data challenges | Solutions | Risks | Payoff | Trade-offs
Heterogeneity of data types and sources in biology and agriculture | Implement FAIR (findability, accessibility, interoperability, and reusability) principles[50] for all data types. Acknowledge and reward data sources. | Inconsistent standardization between domains and communities | New possibilities for multi-scale analysis integrating diverse data types | There are difficulties in implementing standards while retaining domain-specific insights.
Selection and digitization of data that is viable for AI applications | Provide clear and accessible guidance on data requirements for AI. Develop new procedures for priority setting and selecting data. | High labor costs of digitization and analysis on resources that may not prove to be significant | AI tools and outputs that push forward the cutting edge of plant science research | Data management procedures may take up a considerable budget and effort.
Ensuring sufficient linkage between biological materials and data used for AI applications | Have clear documentation of material provenance when producing data and throughout analytical workflows. | Increased documentation costs and exposure of commercially or otherwise sensitive materials | Clear understanding of the biological scope of AI tools | Analysis of documentation around materials requires specific expertise and effort.
Standardization and curation of data and related software to a level appropriate for AI applications | Develop and use shared semantic standards. Standardize data at the point of collection. | Potential to lose system-specific information that does not fit common standard | Reusable multi-source datasets and easier validation and sharing between groups | Some plant data (e.g., phenotypic observations) remain very difficult to standardize.
Obtaining training and adequate ground truth data for model validation and development | Ensure that data quality benchmarking is tailored to analytical purposes. Expand collections of ground truth and training datasets. | Data quality assessment requires error estimates and information on data collection, which are often lacking. | Reproducible and sound inferences with clear scope of validity | Tailoring data to specific research goals runs counter to the popular narrative of AI relying on "representative" training data and "generalizable" solutions.
Access to and use of computing and modeling platforms, and related expertise | Make software and models open and adaptable where appropriate, and/or have clear documentation on their scope. Provide researchers with full workflows, not just software. | Software used outside its range of proven usefulness and danger of extrapolation and overfitting | A suite of tools with clearly marked utility and relevance for a wide range of analytical tasks in the plant sciences | There are difficulties in getting the required know-how to travel together with software and models.
Improving responsible data access | Open access to datasets held by government and research institutions. Implement data governance regimes to protect sensitive data and ensure benefit sharing. | “Digital feudalism”; unequal distribution of benefits from public or personal data | Greater data resources of direct relevance to agricultural and other plant science applications | There are ongoing difficulties in identifying and implementing non-exploitative, equitable models for data sharing.
Engagement across plant scientists, data scientists, and other stakeholders | Invest in and promote data services for plant scientists. Additionally, promote plant science problems, especially GxE interactions, to ML researchers. Identify and invest in grand challenges and engagement. | High cost with potentially limited impact unless closely targeted to needs and interests of researchers and wider stakeholders | Greater community participation in the development of ML as a resource for plant science | There is long-term investment involved, and its value depends on active and regular engagement of stakeholders.

In the remainder of the paper, we review these challenges in detail, drawing on a range of examples from fundamental and translational plant science. Several of the challenges are shared with the biosciences more broadly, reflecting the conditions and complexity of biological research, while others are specific to plant science and agriculture. In the conclusion, we offer some reflections on how these challenges could be overcome.

Data challenges

Data diversity and continuing obstacles to data sharing

Biological research tends to be very fragmented compared with other sciences, and biological data is highly heterogeneous as a result.[6][51][52][53] A key reason for this is the attention paid by biologists to the unique characteristics of the target systems that they are studying: different species of mushrooms, bacteria, trees, ferns, and mammals can behave and interact with their environment in fundamentally different ways, which in turn affects their different structures, functioning and reproduction. Biodiversity thus encourages the production of research methods and instruments specifically tailored to the "endless forms most beautiful" in question, with different laboratories producing data in a wide variety of ways. Added to this, there is the multiplicity of purposes for which biological research is conducted, which in the plant and crop sciences include the production of genetically engineered crops, understanding growth conditions, improving crop yield, and identifying medically useful compounds, many of which also require the study of key environmental features such as soil and climate conditions. Moreover, the translation of plant research into agronomic spaces is made especially complex by the multiplicity of stakeholders, with breeders focused on the specific conditions in their target markets, farmers producing a large variety of data of potential research interest as part of their everyday work, and many companies working in agritech (including companies producing sensing devices for farms), although many data producers remain secretive around their own data practices and datasets.

Furthermore, there is a divergence between the large emphasis on omics data within academic plant science and the equally strong focus on phenotypic data for crop evaluation favored in more applied domains, which is only partly mitigated by ongoing efforts to bridge this gap and exploit the complementary nature of these data resources through integration and interoperability.

Last but not least, there is no consensus on data formats, standards and methods of analysis. Datasets are typically collected with a specific hypothesis or practical use in mind, with much data not generated in machine-readable formats and data standards rarely prioritized when developing new methods or technologies. Data circulation is also limited, due to a lack of targeted incentives and necessary infrastructures as well as a general reluctance from researchers to share their data beyond their immediate communities of collaborators. Many research funders and institutions do not yet provide concrete incentives to make data publicly available, including rewards and resources to match the significant labor involved. This has significant implications for researchers, especially given the competitive culture predominant within the life sciences and the well-founded fear that spending resources on data curation may lower the publication rate of any one group, with negative effects on their reputation and future endeavors.[4][54]

This fragmented data landscape limits the opportunities for the application of AI to plant research and agronomy. For example, when object recognition software is applied to human faces, relatively homogeneous reference sets of photographs are available for training, but equivalent data is not available when the same technologies are aimed at identifying morphological traits in plants. The introduction of the FAIR principles[50]—stating that data should be findable, accessible, interoperable, and reusable—has greatly helped to address some of these issues.[b] Some organizations are promoting the “FAIRification” of data using semantic web technologies (e.g., https://www.go-fair.org), but even more limited forms of annotation, semantification, and standardization would significantly facilitate applications within more restricted domains. Many molecular biology data are already integrated in structured, curated, and interlinked public repositories[55], which are widely used by the research community. This is not surprising given the historical ties between the development of sequencing technologies and the emergence of computation[53][56][57] and related database standards and classification initiatives[58], often starting with data from model organisms grown in standard conditions (like Arabidopsis thaliana) with large associated research communities.[59]

At the same time, many other types of data are not as standardized, and the heterogeneity of data formats and methods across different areas of the life sciences is likely to affect the ways in which FAIR principles are implemented. Such differential adoption of FAIR principles and resources may, again, constrain the potential for ML to integrate data across multiple domains. Indeed, while the FAIR data principles are increasingly being applied across the plant sciences[60][61][62], different projects have developed different elements of FAIR depending on their specific goals and context. Some applications, such as FAIDARE (FAIR Data-finder for Agronomic REsearch)[63], have focused on findability. Others, such as the Crop Ontology and related ontologies in the Planteome project, have focused on interoperability and semantic standards. AI and ML applications depend heavily on the interoperability and reusability dimensions of FAIR, but these have received less attention overall than findability and accessibility. As well as the semantic efforts mentioned above, more recent initiatives such as BrAPI (Breeding API)[64] and MIAPPE (Minimum Information about a Plant Phenotyping Experiment)[65] have addressed these aspects in a more targeted way.
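As a rough illustration of why the interoperability and reusability dimensions matter in practice, the sketch below checks a dataset description for missing metadata, grouped by FAIR dimension. The field names are hypothetical placeholders chosen for this example; they are not taken from MIAPPE, BrAPI, or any other standard, and a real check would follow whichever schema a community has agreed on.

```python
# Hypothetical metadata fields grouped by FAIR dimension (illustrative only;
# not the field names of MIAPPE, BrAPI, or any published schema).
REQUIRED_FIELDS = {
    "findable": ["identifier", "title"],
    "accessible": ["access_url", "license"],
    "interoperable": ["trait_ontology_terms", "file_format"],
    "reusable": ["collection_method", "units", "provenance"],
}


def fair_gaps(record):
    """Return {dimension: [missing fields]} for a dataset description.

    A field counts as missing if it is absent or empty.
    """
    gaps = {}
    for dimension, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[dimension] = missing
    return gaps


# An invented dataset description that can be found and downloaded,
# but lacks the information needed to combine it with other data.
example_record = {
    "identifier": "example-dataset-001",
    "title": "Wheat canopy images, 2019 field season",
    "access_url": "https://example.org/datasets/001",
    "license": "CC BY 4.0",
    "trait_ontology_terms": [],
    "file_format": "PNG",
    "collection_method": "fixed-height camera rig",
    "units": "",
    "provenance": "plot-level sample identifiers in accompanying CSV",
}

for dimension, missing in fair_gaps(example_record).items():
    print(f"{dimension}: missing {', '.join(missing)}")
```

The invented record passes the findability and accessibility checks but fails on ontology terms and units, which is the pattern described above: data that can be located and downloaded yet cannot easily be integrated into a multi-source ML analysis.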

Acknowledging and rewarding those who generate data would go a long way towards encouraging effective data sharing. One approach to this issue is exemplified by the Annotated Crop Image Database[66], which is set up to show only fragments of annotated images of plant phenotypes, without necessarily showing the detailed metadata that would allow others to reuse those images for biological research. This encourages biologists to share their data as early as possible to support the development of methods such as feature detection, while at the same time protecting those data from reuse by other biologists for as long as it is needed for the original data producers to publish their own results. This is only one among many possible solutions to adequate acknowledgement of data sourcing, with other approaches favoring early data publication (for instance in data journals) as a way to reward data producers while also fast-tracking data sharing. The Research Data Alliance is one among many organizations engaged in developing conventions and methods to reassure those providing data that their own research and publications will not be adversely affected, such as the CARE and TRUST principles.[67][68] It is imperative that such guidance is visibly implemented and that researchers are trained to understand its significance for their own work and data management strategies.

Selecting and digitizing data

Given the wide variety of data types, formats, and sources in the plant sciences, determining which data resources could be selected for AI-informed analysis constitutes a serious challenge. Are there datasets of immediate potential if suitably curated, and what metadata is needed to describe datasets so that their suitability for inclusion in a given analysis can be assessed? The achievement of clear criteria and priorities for data selection is a crucial issue given the considerable amount of work required to digitize, curate, and process datasets and related metadata. Such criteria should consider the ML task at hand, the scientific goals, and the concerns of individuals and groups holding the data.

Consider herbarium specimens as a promising potential substrate for ML. Collectively, the world’s herbaria contain an estimated 392,353,689 plant specimens as of December 2019[69], associated with metadata describing the place and time of their collection. ML can be used to infer useful information from the physical and molecular characteristics of the specimens to support automatic identification of plants[36], or to find material with potentially useful traits.[37] Recent efforts have combined specimen images, their associated metadata (including descriptive labeling), and associated field images.[36] These approaches could be used to monitor ex situ conservation efforts, to track changes in natural and farmed distribution of species in response to environmental changes, to trace the spread of invasive weeds, or many other applications not strictly related to crop research. However, many herbaria are only partially digitized, if at all. Most specimens have not been imaged or subjected to molecular analysis, and even basic metadata is often not databased, existing only as hand-written or typed annotations attached to the physical specimen. As a result, even taking an inventory of stock is not possible, and the material can only be accessed by physical visit. Thus, while the new technologies of imaging, molecular analysis, and ML have created new possibilities to exploit these historic collections[70], these will remain unrealized until the information they contain is extracted, digitized, and made publicly available, tasks which are very labor-intensive.

Interestingly, ML itself may be able to help solve this problem: the transcription of physical herbarium labels may be supported by the use of ML to interpret handwriting. A useful step towards this is the recent production of a benchmark dataset of transcribed herbarium labels[71], which could be used to assess the performance of algorithms. This does not, however, help to address questions of data selection. Researchers still need to decide which specimens and related data/metadata to prioritize given limited resources and the vast scale of existing collections. In turn, the selection of usable and relevant data and digitization of records is tightly associated with the prioritization of research problems and questions on which to work. There is relatively little investment in improving procedures and methods in this area, and yet there is a need for processes through which researchers explicitly consider and debate which data should take precedence and why. Without such processes, the ensemble of data being curated risks being patchy and fragmentary, the random result of individual efforts by separate and uncoordinated projects rather than of a community effort to locate and invest in data of most relevance to all. Indeed, without such processes, pressure to use automatic methods, and to be seen using them, can aggravate the problem, with researchers investing resources in the creation of large datasets without considering whether and how those data could be used.

Linking data to material samples

Clear reporting on the relation between digital data and material samples—the seeds, germplasm, and other biological sources to which data are associated—is vital to the interpretation, reuse, and reproducibility of results[5][53], as well as constituting a major source for data in the first place.[72] Moreover, the use of plant science to inform agriculture and related domains such as forestry is predicated on understanding and utilizing the widest possible range of biological variation between and within species. For example, crop breeding is dependent on having access to a large pool of traits that can be incorporated in new varieties that are resistant to changing climates, diseases, and stresses.[73] Applications of AI to plant and related data must be designed in such a way that data, models, and other outputs can be linked back to the material samples on which scientific research and biotechnological applications depend.

This has proved to be problematic. While a vast number of accessions of crops and crop wild relatives are held in genebanks worldwide, the corresponding data records for this global resource are often limited at both a scientific and operational level. Some progress has been made in promoting data deposition and thereby indexing of resources in overarching international plant genetic diversity databases such as EURISCO[74], which provides information about more than two million accessions of crop plants and their wild relatives, preserved ex situ by almost 400 institutes, including both passport data and phenotypic data. However, meeting the disparate needs of users, donors, funders, and other stakeholders in such indexing databases remains difficult. Within the international phenotyping community, information systems are being developed that require all objects, including individual plants, to be allocated a persistent URI.[75] This increase in specificity has the potential to increase connectivity between phenomic data and the samples from which it was obtained, but comes with a significant overhead cost, and to date it is only feasible in indoor, highly mechanized environments.

Legacy systems do not always lend themselves to easy integration and can make consistent matching of appropriate terms and datatypes between originating resources difficult. There can be competing arguments for the most appropriate, efficient, or scientifically accurate representation or classification of data and characteristics to meet perceived audiences. A reluctance or inability to reinvent domain-specific resource catalogues is also understandable given the range of operational concerns that inform the management of live resources. Genebank databases have been iteratively customized to user requirements and/or contractual constraints over a period of many decades. There may be significant conflicts between visibility and dissemination drivers for commercial and public collections and even for separately donated materials within those resources. There may also be concerns about third-party use of collated data or perceived availability of materials, particularly where there may be implicit or implied intellectual property, or regulatory compliance benchmarks for benefit sharing obligations. A precautionary principle not to include portions of the biobank collection may also apply when downstream use of data or implied ownership of downstream discovered characteristics are considered by the biobank review panel considering inclusion in such an external index.

Within plant phenotyping facilities, the legacy problem arises from the historic variations in metadata collection. In particular, useful linkage of phenomic data to samples requires details of the growth environment to also be collected. This is now attracting significant interest, and methods and standards for, e.g., illumination conditions are emerging.[76] Inclusion in some databases is now conditional on capture of specified levels of environmental information. Legacy data, however, often lack such information, and the variations in plant structure and performance introduced by environmental conditions—even within well-controlled environments—means that simple linking of genotype to phenotype is insufficient.

Between them, these issues can reduce data donation to a lowest common denominator of permitted and approximated metadata overlap for a subset of holdings, often a simple index or indicator of materials which merely points to the originating collection and may not permit broader aggregation of recorded characteristics. This can be insufficient for useful exploitation of the resource by specialist researchers and will often render an aggregation site unpopular or secondary to the primary biobanks. The problem is further exacerbated when considering the very large number of valuable land races, crop wild relatives (CWRs), and heritage varieties preserved in-situ at herbaria, botanical gardens, and conservation sites. Moreover, and despite significant work invested in creating genotyping panels and populations for many different species, a lack of phenotypic data about accessions has limited the utilization of this diversity and constrained understanding of the genetics of complex traits, leading to a phenomics “bottleneck.”[77] An increasing number of high-throughput phenotyping platforms are being constructed, in which large quantities of data about individual plants are collected, integrated, and analyzed with the help of ML techniques (especially on multispectral and RGB imaging data). These phenotyping platforms are at the forefront of materials-data linkage and biodiversity studies in plant science, and yet they are often unavailable beyond the institute or research group that developed them, for reasons ranging from data size to commercial protections.

The challenges of managing the relationship to material samples are not limited to datasets, but also include models. The accuracy of GS (see the previous discussion on GS in the "AI in plant research" section) for a given breeding population is strongly dependent on genotypic and phenotypic data collected from closely related populations, which are used to train models.[78] Robust linkage between models and the material samples for which they have been optimized, combined with pedigree data and made available via public infrastructures, will be important to enhance the accuracy and utility of GS modelling through greater transparency, comparison, and reuse of models for related breeding materials or traits. Thus, the usefulness of AI-informed analysis of digital data is tied to investments in the development and maintenance of material samples—including those kept in seed banks and herbaria—and key germplasm metadata such as those captured by the Multi-Crop Passport Descriptors.[79]

Standardizing data and metadata

Standards ensure that data are collected in formats and with labels that can be understood by users, whether human or machine, as well as ensuring that a necessary minimum of contextual information (or metadata) is recorded about the methods through which data were generated and the environmental and experimental conditions in which they were acquired. Providing metadata labels and labeling/annotating individual data points with semantic standards both present major challenges for the use of ML and AI, although these challenges can differ in nature (e.g., the type and choice of standards required) and scale (e.g., labeling data points requires substantially more labor than assembling appropriate metadata). Nevertheless, many of the key issues of how to develop appropriate standards for labeling that fulfil the needs of different user communities and are widely adopted by those communities are shared between the two areas, and are increasingly approached through coordinated effort.

Consider this example of a dataset from orchard management. A two-year study of 19 orchards in New York state collected data on the effects of conventional pesticide use on the wild bee community visiting apple (Malus domestica) within a gradient of percentage natural area in the landscape.[80] ML techniques, such as hidden Markov models, can effectively be used to model the behavior of pollinators based on movement data, especially between orchards and natural habitats. This in turn could inform decision support tools for scheduling the use of pesticides to limit their effect on pollinators, using data collected from individual trees by remote sensing technology. However, the dataset presents several issues for reuse. Each orchard was to be visited twice for data collection, once before and once after blossoming, but in the first year some data were not collected due to cancelled visits. While this was not a problem for this study, where the focus was on the bee count in the second year, for a study with a different objective the incomplete data could be problematic. Moreover, dates are annotated relative to bloom rather than as calendar dates, and a key variable is the Bee Impact Quotient (BIQ) for each individual pesticide, along with other scores derived from it. These measures are appropriate for a study on pollinators, but may be less suitable for a study measuring different impacts on the ecosystem, such as plants or biodiversity. Without a preliminary discussion of standards for future data reuse at the start of the study, and incentives to ensure that the scientists involved are given credit for developing data resources of wider interest than for their own project, such considerations are not taken into account, and data collection cannot yield standardized, machine-readable labels for individual data points that can be aggregated and reused within other projects.
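The bloom-relative dating issue can be illustrated with a minimal sketch. It assumes that a bloom date was recorded for each orchard and year, which the original dataset may or may not provide; the function name and the example values are hypothetical and are not taken from the study.

```python
from datetime import date, timedelta

def to_calendar_date(bloom_date: date, days_from_bloom: int) -> date:
    """Convert an offset relative to bloom into a calendar date.

    Negative offsets mean days before bloom, positive offsets days after.
    """
    return bloom_date + timedelta(days=days_from_bloom)

# Hypothetical bloom date for one orchard in one season (not from the study).
example_bloom = date(2012, 5, 1)

# A visit recorded as "10 days before bloom" becomes an absolute date.
example_visit = to_calendar_date(example_bloom, -10)
```

Without the recorded bloom date the conversion is impossible, which is why this kind of contextual metadata needs to be captured at collection time.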

To counter such issues and help researchers to signpost more clearly the characteristics and expected utility of datasets, there has been considerable progress in developing semantic standards for data and metadata by a variety of transnational organizations and initiatives. For phenomic research, such efforts include the Ontologies community of practice of the CGIAR[81], which manages the Crop Ontology and the Food and Agriculture Organization’s AGROVOC thesaurus. Distinctive to both initiatives is not only the standardization of terms for field studies, but also the attempt to develop terminologies that bridge the expertise of the multiple stakeholders in agricultural field trials, including farmers, breeders, and scientists, and link different languages.[c] Initiatives such as ELIXIR, the Research Data Alliance Agricultural Data Interest Group, GODAN, and the project PHENOME-EMPHASIS provide precious collaborative venues to improve plant data standards beyond molecular omics and experiments. Notable concrete examples include projects such as the Breeding API (BrAPI)[64] and MIAPPE (Minimum Information about a Plant Phenotyping Experiment)[65], the latter fostered by ELIXIR as a way to improve consensus on ways to annotate data generated by phenotypic experiments; the Working Group on Integrating Genomic and Phenotypic Data for Crop and Forest Plants coordinated by ELIXIR-EXCELERATE[83]; and the efforts to standardize the collection and interoperability of field data in the CGIAR’s AgroFIMS, the open-source FieldBook application[84] and the Grassroots information infrastructure of the BBSRC Designing Future Wheat program. Efforts such as the COPO platform also implement semantic standards, including MIAPPE, in user interfaces to aid data brokering, which underpins the availability of well-described datasets that can in turn power AI/ML studies.[85]

These initiatives, and a shift in research culture more generally, are playing a central role in establishing wider attention to and use of best practice in standards to deliver impact through AI/ML.[4] Ensuring that these standards are not implemented retrospectively, but are instead adopted before data are actually produced, remains a key challenge. In this respect, companies that develop scientific instruments and research software have a crucial role to play. This is particularly evident in the case of data generated by remote sensing technologies, where the most prominent standards concern the technical levels of imaging and data processing rather than data curation. For example, a recent study of the impact of oil palm plantation in Indonesia based on a range of sources of satellite imagery attempted to assess the impact that historical changes in land use had on greenhouse gas emissions.[86] An outcome of the study was that comparison of the outputs from different remote sensing sources was severely compromised, not because of changes in satellite technology but because land use was classified inconsistently between the different remote sensing campaigns.
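One common way to compare campaigns with inconsistent land-use classes is to map each campaign's labels onto a shared legend. The sketch below is illustrative only: the campaign names, class labels, and mappings are hypothetical and do not come from the cited study, and any real crosswalk would need expert review because such mappings can lose information.

```python
# Hypothetical crosswalk from campaign-specific labels to one shared legend.
CROSSWALK = {
    "campaign_a": {
        "oil palm": "plantation",
        "primary forest": "forest",
        "secondary forest": "forest",
    },
    "campaign_b": {
        "tree crop": "plantation",  # assumption: lossy, merges several crops
        "forest": "forest",
        "shrub": "other",
    },
}

def harmonize(campaign: str, label: str) -> str:
    """Map a campaign-specific label to the shared legend.

    Unmapped labels raise an error instead of being guessed silently.
    """
    try:
        return CROSSWALK[campaign][label]
    except KeyError:
        raise ValueError(f"No mapping for {label!r} in {campaign!r}") from None
```

Failing loudly on unmapped labels is a deliberate choice here, since a silent default class would hide exactly the inconsistency the comparison is meant to expose.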

Evaluating the quality of reference data

Developing reliable ML tools is dependent on having adequate reference data (also referred to as ground-truth or training data) for model validation. Obtaining or accessing reference data for complex field environments poses a distinct challenge, due to the scale of data collection required and the associated problem that the high value of such data means that it is frequently held behind restrictions on access or licensing agreements (see later the subsection "Managing data access responsibly"). Purpose-built platforms such as Rothamsted Research’s North Wyke Farm Platform allow the monitoring and control of multiple agricultural and environmental variables, from plant growth through soil health and water flows, generating detailed, multi-scalar data.[87] Such facilities are expensive and few, however.

Given that the generation of new data specifically for the purpose of training and benchmarking can be expensive and time-consuming, reusing already published data for these purposes is desirable. Implementing good data and metadata standards can reduce the cost and time of reuse, but standardization alone does not allow the creation of benchmark datasets on demand. The utility and accuracy of algorithms are dependent on the quality of the datasets used to train them. Without sufficiently broad and unbiased training sets, algorithms will not have wide general applicability. It is therefore necessary to address statistical aspects of datasets in addition to the data management and stewardship principles described in the previous section. Data quality benchmarking has played a central role, for example, in genomics, with projects such as the MAQC/SEQC[88], the MicroArray/Sequencing Quality Control initiative by the FDA, but quality standards also depend on the potential implications of decisions taken based on the information contained in the data. For example, the evaluation of ecological risks associated with GM crops or pesticide use needs to be based on more robust data than those procured via fundamental research. While published datasets such as those in genomics repositories, citizen science platforms, or ecological data banks typically have undergone some quality checks, these are tailored to the requirements of the original context of the data collection. Reusing datasets to develop an algorithm serving a changed purpose requires a fresh assessment of the suitability and quality of the dataset.

The following questions can provide guidance for finding out whether representativeness and resolution requirements are fit for the specific context and purpose of the algorithm that is being trained with the data:

  1. Are the variables used by the algorithm (or sufficiently close surrogates) included in the dataset? Are the measurement methods sufficiently accurate and precise for this purpose? Have they been taken on a sufficiently elementary unit rather than on an aggregated level only? If the data collection covers a time period, have the measurements been taken sufficiently frequently?
  2. Are the records complete? If not, are records simply missing at random or are there any patterns in the absence that might skew the results obtained by the algorithm?
  3. Is the sampling method used to collect the data subject to any selection biases that were negligible for the conclusions of the original study, but could impact the results or interpretation of the algorithm?
  4. Where data has been gathered from human experts (e.g., in image annotation for phenotyping), has subjective bias been identified? Is the set of experts used sufficient to capture possibly conflicting views? Have the annotators understood and been provided with appropriate tools for the task?
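The second question, on patterns in missing records, can be given a first rough check by comparing the share of missing values across groups such as sites or years. The following is a minimal sketch with hypothetical field names and made-up records. A large difference between groups suggests that values are not missing at random, but it is not a formal statistical test.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, value_key):
    """Return the fraction of missing (None or absent) values per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for rec in records:
        tally = counts[rec[group_key]]
        tally[1] += 1
        if rec.get(value_key) is None:
            tally[0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}

# Made-up records: site B is missing most of its crop height measurements.
records = [
    {"site": "A", "crop_height": 41.0},
    {"site": "A", "crop_height": 39.5},
    {"site": "B", "crop_height": None},
    {"site": "B", "crop_height": None},
    {"site": "B", "crop_height": 44.2},
]
rates = missing_rate_by_group(records, "site", "crop_height")
```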

One example to illustrate quality issues in data reuse concerns the British Farm Scale Evaluations (FSE), which analyzed the effect of genetically modified herbicide-tolerant varieties of beet, oilseed rape, and maize, and that of comparable conventional varieties on the abundance and diversity of arable plants and invertebrates.[89][d] The dataset consists of complex time courses reliant on farmers’ assessments. While measurement of weed cover, crop cover, crop height and pollinators followed protocols, the schedule for taking the measurements throughout the year was chosen by the individual land managers, which made comparisons difficult. It was pointed out that: extra data assessing "whether there is evidence of biodiversity harm from the use of the GM crop and herbicide regime" should have been collected[90]; no definitive yield component had been included, which makes it difficult to use this dataset for trade-offs between environmental and economic targets; and pesticide data is given as product application rates, which makes interpretation of these numbers difficult for future studies. Another major issue is missing data due to vandalism, a foot and mouth disease outbreak and unknown reasons, in some cases showing systematic patterns of incompleteness.

Indeed, a related issue is the breadth of data used to train models: whether they sufficiently represent the variation and diversity of target species or populations. Use of computer-generated images of plants in order to enlarge the image datasets used to train deep learning computer vision algorithms for phenotyping is increasingly common.[91][92][93][94] Whether or not such methods could feasibly be used to generate training data that sufficiently reflected the complexity of field environments is another question. The expansion of field phenotyping—including attempts to capture, integrate, and analyze imaging captured by drones and other sensing technologies—is likely to be necessary for this task. Progress in the latter area is rapid, although it is still constrained by the expensive and technically challenging nature of both experiments and associated data annotation practices.[95][96][97]

Using software and models across scales, species, and environments

When developing effective AI solutions in plant-related research, access to adequate software and modeling platforms is as necessary as access to high-quality data. Software and models need to be implemented on digital environments that their users have access to or are willing to pay for. Accessibility, especially for large-scale AI, is key: researchers need access to the computing and data platforms needed to power AI at a reasonable price, in order for such research to be scalable. Where possible, software should also be portable for use across digital environments, so as to accommodate researchers working in different systems; and it should be approachable by users with a range of experience in handling and analyzing data.

A key obstacle here is the fact that researchers who can formulate the biological problems are often not those developing ML algorithms. An example is the use of targeted software to explore existing data in search for new targets for experimental investigation. The KnetMiner resource, for instance, assembles a suite of software and data integration methods aimed at sifting through the biological literature and public data resources to explore relationships between datasets and species, especially in cases where multiple traits are connected to multiple genes. The application of these exploratory methods to key crops such as wheat, sorghum, and sugarcane has already resulted in the identification and further study of important agronomic traits.[98] At the same time, it requires expert tailoring whenever targeting a new species, including biologically informed assessment of which datasets used for key crops can be applied to other crops. Indeed, not all models will work directly off the shelf on a new dataset/problem. Giuffrida et al.[40] devised an algorithm that adapts a leaf counting model on new data without requiring annotations and without requiring the availability of the original training dataset. As model complexity increases, however, so does the opacity of the models. Dobrescu et al.[38] sought to develop mechanisms that help peek into what ML models learn in tasks of object counting (and in particular leaves), part of a growing research field which promises to make CNN methods better understood, increasing trust in the insights they provide.

It is then not enough to provide researchers with software alone. Rather, this must be supported with workflows that incorporate the whole lifecycle of data preparation, validation, and analysis, and which can be operationalized with minimal friction.[99] Algorithms and data models must also be articulated at the right level of abstraction, to resolve what could be perceived as a new form of "translation gap" between the cutting edge of data science and the frontiers of plant research. Software and models created for very particular uses will have higher requirements for data quality and annotation, creating a barrier to reuse.[100][101] Some models need to be flexible enough to work across the multiple scales that characterize both biological work in general and agricultural research in particular, including between species and between different environments. An example is the John Innes Centre’s work to create data resources with the appropriate software that enable the transfer of learning from model organisms (e.g., Arabidopsis) to non-model organisms (e.g., Brassica crops).[102][103][104][105][106] Many crops have large complex polyploid genomes, one of many factors that can make the direct transfer of knowledge problematic. ML approaches are being developed that can accommodate large transcriptomic and phenotypic datasets collected from many individuals, populations, and species. These in turn can be exploited to identify similarities and differences in the regulation of developmental transitions in response to environmental stimuli.[104][105] Bringing foundational plant science into the crop space is crucial, yet key challenges remain at every level from gene activity and function through networks, tissue behavior, and plant physiology to field-level behavior.

The need to operate across multiple scales has long been acknowledged to require trade-offs between accuracy and generality.[107] Indeed, most models are designed to address specific questions and will not be applicable across scales. This raises questions around whether and how the results of such modeling efforts can be linked and integrated. The goal of ML is not to maximize how well the model does on the training data, but how well it performs on a testing set. The testing set is used to ascertain how well the model will generalize to an unseen data source and thus “in the wild.” After all, the aim is to create AI/ML that generalizes well to unseen data or unseen tasks. For example, how will a model that is trained to count plant leaves perform when tasked to count leaves in different images of the same plant family (different illumination), or a different plant family, or even a different task (e.g., seed counting)? The ability of models to generalize is largely governed by the quality of the internal data representations the model has learned to fulfil the task. If one relies on supervised ML, then these representations will be tuned to the specific task and will have difficulty in generalizing. Here multitask and meta-learning can help as they tend to learn representations that can more easily generalize.[39]
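The question of how a leaf-counting model performs on a different plant family can be made concrete by holding out whole groups, rather than random images, for testing. The sketch below shows only the split itself; the sample structure, the field name "family", and the values are hypothetical.

```python
def split_by_group(samples, group_key, held_out_groups):
    """Split samples so that held-out groups appear only in the test set."""
    held = set(held_out_groups)
    train = [s for s in samples if s[group_key] not in held]
    test = [s for s in samples if s[group_key] in held]
    return train, test

# Hypothetical image records labeled with plant family and leaf count.
samples = [
    {"image": "img_001", "family": "Brassicaceae", "leaves": 8},
    {"image": "img_002", "family": "Brassicaceae", "leaves": 11},
    {"image": "img_003", "family": "Solanaceae", "leaves": 6},
    {"image": "img_004", "family": "Solanaceae", "leaves": 7},
]

# Train on one family and test on another that the model has never seen.
train_set, test_set = split_by_group(samples, "family", ["Solanaceae"])
```

A random split would place images of both families in the training set and would therefore overstate how well the model transfers to an unseen family.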

However, if we rely on annotations to drive this supervised learning of data representations, one must ask whether data quality and annotation play a key role. In ML, data cleaning and preparation take a considerable amount of time. Even more time consuming is annotation/labeling of individual data points. Approaches to relieve the data annotation effort include semi-supervised, self-supervised, and multitask learning. These methods aim to learn representations by leveraging unlabeled data, or correlations and self-similarity of the data themselves, or correlations between tasks. Considerable and notable improvements have been made outside and within plant sciences, and particularly in image-based phenotyping. Yet even these methods rely on some annotations. Thus, one must consider whether noise (errors) in annotation has an effect. Learning with label noise, as it is colloquially known in ML, is a mathematical framework that aims to learn a good model even when labels may be noisy, i.e., have errors.[108] Recently, Giuffrida et al.[109] went to the extreme of assessing such levels of noise amongst expert and even novice annotators (citizen scientists). The findings are promising: despite the presence of noise, as long as multiple annotations of the same datum by diverse individuals exist, models can be learned.
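To illustrate why multiple annotations of the same datum help, the sketch below combines several annotators' leaf counts into a single consensus value with the median, and reports the spread as a crude indicator of disagreement. This is a simple aggregation chosen for illustration; it is not the learning-with-label-noise framework or the method of the cited study, and the counts are made up.

```python
from statistics import median

def consensus_count(counts):
    """Combine several annotators' counts for one image.

    Returns (median count, range of counts). The median is robust to a
    single outlying annotator, and the range flags images needing review.
    """
    if not counts:
        raise ValueError("need at least one annotation")
    return median(counts), max(counts) - min(counts)

# Made-up counts from five annotators for one image, one of them an outlier.
centre, spread = consensus_count([7, 8, 8, 9, 12])
```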

One must consider errors in annotation not only in providing data and metadata labels for the datasets to which ML is applied, but also in how ML outputs will be used to support statistical hypotheses. In this respect, an error such as labeling the metadata of a mutant as a control will propagate considerably through the pipeline. Consistent records of experimental conditions will thus help ensure that such errors are minimized. Here ML can also help identify errors.[22] An ML algorithm can actually act as a calibration method: outputs of an ML model which are suddenly inconsistent point to data inputs that are out of distribution. Determining whether such out-of-distribution data are due to errors in the data or metadata, or arise because the ML model is encountering data it was not trained on (in which case it could be updated), necessitates human intervention, and this in turn creates a viable checkpoint in the development of robust data processing pipelines.
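A minimal sketch of such a checkpoint follows, assuming that a set of model outputs from trusted, previously reviewed data is available as a reference. It flags new outputs that fall far from the reference distribution using a z-score. The threshold of 3 and all values are illustrative assumptions, and practical out-of-distribution detection usually needs more than a single summary statistic.

```python
from statistics import mean, stdev

def flag_for_review(reference_outputs, new_outputs, z_threshold=3.0):
    """Return indices of new outputs that deviate strongly from the reference.

    Flagged items are candidates for human review. The check cannot tell
    whether the cause is a data error, a metadata error, or novel data.
    """
    mu = mean(reference_outputs)
    sigma = stdev(reference_outputs)
    if sigma == 0:
        raise ValueError("reference outputs have no variation")
    return [
        i for i, y in enumerate(new_outputs)
        if abs(y - mu) / sigma > z_threshold
    ]

# Made-up predicted leaf counts: a reviewed reference batch and a new batch.
reference = [8, 9, 7, 8, 10, 9, 8]
flagged = flag_for_review(reference, [8, 9, 31, 7])
```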

Managing data access responsibly

Access to appropriate datasets is necessary for the application of AI tools to complex environmental and biological research topics, yet it clearly depends on factors well beyond scientific need, including intellectual property regimes, data governance by specific institutions, and consideration of the rights and risks involved in data sharing for those who produce the materials from which data are extracted and/or may suffer the social and economic consequences of specific applications of data analysis.[110] Legal constraints such as intellectual property controls and licensing regimes can and often do put the data beyond the financial means of lower-resourced researchers and institutions, or place restrictions on the use of the data that make the kind of wide-ranging data mining required for AI application difficult if not impossible to implement.[111] Given the distinctive landscape of intellectual property rights, contracts, and the need to find incentives for data sharing that respond to imperatives of commercial competition, finding ways to make data usable to a range of actors without necessarily sharing it is likely to become increasingly important. In biomedicine, initiatives such as DataSHIELD have been developed in which users are able to run analyses on a dataset via an intermediary platform without having direct access to the source data.[112] Such efforts allow the anonymization of data and removal of patient/volunteer personal information, which are recognized as important issues in biomedical research. Similar initiatives such as the Open Algorithms (OPAL) project, developed in relation to commercially sensitive data[113], have recently been promoted in agricultural research forums such as the CGIAR Big Data in Agriculture Convention, but their uptake remains to be determined.
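The general idea of analysis without direct data access can be sketched as an interface that releases only aggregates, and only when enough records contribute to them. The class below is a toy illustration with hypothetical names and a hypothetical minimum group size; it does not reproduce the API or disclosure controls of DataSHIELD or OPAL, and in Python the underscore-prefixed attributes are private by convention only. A real deployment would run on a server controlled by the data holder.

```python
from statistics import mean

class AggregateOnlyDataset:
    """Toy wrapper that answers aggregate queries but never returns rows."""

    def __init__(self, rows, min_group_size=5):
        self._rows = list(rows)                # held by the data owner
        self._min_group_size = min_group_size  # assumed disclosure threshold

    def mean_of(self, column, where=lambda row: True):
        """Return the mean of a column over rows matching the filter."""
        values = [row[column] for row in self._rows if where(row)]
        if len(values) < self._min_group_size:
            # Refuse small groups, which could reveal individual records.
            raise PermissionError("group too small to release an aggregate")
        return mean(values)

# Made-up farm yields in tonnes per hectare.
farms = AggregateOnlyDataset(
    {"farm": i, "yield_t_ha": y}
    for i, y in enumerate([3.0, 4.0, 5.0, 4.0, 3.0, 5.0])
)
overall_mean = farms.mean_of("yield_t_ha")
```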

Research institutions including universities have often kept data from widespread access, with even data produced by publicly funded studies remaining either unknown or inaccessible to other researchers. This is partly explained by lack of investment in the platforms, curation expertise, and training required to ensure data sharing and facilitate analysis, and partly due to enduring confusion around legal accountability of research institutions vis-a-vis the requirements of governments, data protection laws, private sponsors (including public-private-partnerships) and public funders, not to speak of the fact that researchers often operate within international networks where different national legislation and expectations may apply.

Data access must also be balanced against ethical concerns that have recently arisen around the reuse of data and materials collected in low-income countries and/or low-resourced research environments. With reference to the longer history of colonial exploitation of indigenous agricultural knowledge to support market-driven growth in high-income nations[114], international institutions, including the World Data Systems, CODATA, and the CGIAR, have pointed to the potential for indiscriminate data access to accelerate so-called “digital feudalism”: the exploitation of more vulnerable members of the agricultural research network by better-resourced and more powerful actors (such as Alphabet/Google) who can effectively appropriate such data. The opportunities afforded by AI, while holding the potential to benefit many stakeholders, also create new commercial incentives for such exploitation.

Key areas for negotiation include access and benefit sharing agreements and the protection of sensitive data, for example, where they include location or certain kinds of farm production data. In the biomedical field, strong regimes of governance and ethics have been developed for data protection and legislating the acceptable uses of data[115], and these may provide a model for the plant sciences. However, plant data poses several different challenges to human biomedical data, notably the fact that much of the data utilized in basic and translational plant research does not come under the more protected category of personal data, but is frequently covered instead by contract law.[116]

Engaging experts beyond one's domain

Despite the increased use of ML expertise and tools and the example set by some highly visible projects, collaboration between cutting-edge data science research groups and plant science communities is not yet commonplace.[117][118] On the one hand, this is due to the poor visibility of plant science datasets and problems to the data science community, in comparison to more prominent biomedical or environmental data and challenges. On the other hand, plant researchers need a better understanding of how algorithms work and what can legitimately be expected from the outputs of AI and ML. It is necessary to upskill researchers on the types of semantic annotation available and on the minimum annotations with which datasets must be labelled in order to make them machine-readable, in the first instance, and usable with specific algorithms. Providing researchers in the plant sciences with a minimum of fundamental knowledge about such matters, preferably from an early stage in their careers, will facilitate the deployment of AI in the field and assist decision-making around the issues of data selection and management described above, while also acting as an incentive towards the implementation of standards in the production and use of plant data.

One example of combining community-wide incentives with collaboration and upskilling is the series of “data challenges” organized in conjunction with the Computer Vision Problems in Plant Phenotyping (CVPPP) workshops, held at various international computer vision conferences since 2014. The first challenges were built around a curated dataset of images of rosette plants, including Arabidopsis and tobacco, taken in a controlled experimental setting, that could be used to test algorithms for leaf detection, segmentation, and counting. This dataset, provided with expert annotation and full metadata, was presented alongside clear problem statements for computer vision researchers to work with, as well as scripts for preprocessing and for computing performance metrics, thereby minimizing the costs of engagement. Phenotyping problems were mapped onto appropriate computer vision terminology, for example leaf segmentation to multi-instance segmentation, and the workshops were organized to facilitate research likely to lead to publications for participants. These efforts resulted in wider visibility of the Arabidopsis dataset among the computer vision community as an important benchmark in the development of multi-instance segmentation and object counting tools[41]; educated ML researchers in the potential of plant data; and highlighted the potential of computer vision (and AI tools more generally) in addressing long-standing plant research questions.[e] At the same time, this example highlights the significant effort involved in developing closer collaboration between these two research communities, since presenting the dataset required extensive preparation by the organizers (who needed an understanding of both areas of work to effectively set up the challenge).
In addition to supporting access and use of specific software, hardware, and workflows, there are benefits to be gained from supporting engagement with tools around which a community has developed, particularly when the users may lack technical background in ML/software engineering. Access to other users’ experiences and opinions is likely to be very valuable here, whether it is informal or through training material and events.
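
As an illustration of the kind of performance metrics that scripts accompanying such data challenges can provide, the following minimal Python sketch computes two leaf-counting scores (signed and absolute difference in count) and a symmetric best Dice score for multi-instance leaf segmentation. It is not the CVPPP organizers' code: the function names, the representation of each leaf mask as a set of (row, column) pixel coordinates, and the toy data are assumptions made here purely for illustration.

```python
"""Illustrative leaf-counting and leaf-segmentation metrics (hypothetical helper names)."""


def count_difference(predicted, actual):
    """Mean signed difference in leaf count (predicted minus actual) across images."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("expected two non-empty lists of equal length")
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


def absolute_count_difference(predicted, actual):
    """Mean absolute difference in leaf count across images."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("expected two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def dice(mask_a, mask_b):
    """Dice overlap between two masks, each given as a set of (row, col) pixels."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def best_dice(leaves_a, leaves_b):
    """Average, over the leaves in leaves_a, of the best Dice match found in leaves_b."""
    if not leaves_a or not leaves_b:
        return 0.0
    return sum(max(dice(a, b) for b in leaves_b) for a in leaves_a) / len(leaves_a)


def symmetric_best_dice(predicted_leaves, true_leaves):
    """Minimum of the two directional best-Dice scores."""
    return min(
        best_dice(predicted_leaves, true_leaves),
        best_dice(true_leaves, predicted_leaves),
    )


if __name__ == "__main__":
    # Toy data (assumed): leaf counts for three images, and one image with two leaves.
    predicted_counts, actual_counts = [5, 7, 6], [6, 7, 4]
    true_leaves = [{(0, 0), (0, 1)}, {(2, 0), (2, 1)}]
    predicted_leaves = [{(0, 0), (0, 1)}, {(2, 0)}]

    print("difference in count:", count_difference(predicted_counts, actual_counts))
    print("absolute difference:", absolute_count_difference(predicted_counts, actual_counts))
    print("symmetric best Dice:", symmetric_best_dice(predicted_leaves, true_leaves))
```

Reporting the signed and absolute differences together helps distinguish systematic over- or under-counting from errors that merely cancel out across images, while the symmetric score penalizes both missed and spurious leaves.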

It is crucial to extend this engagement beyond the sphere of professional scientists to include other stakeholders in food systems, including farmers, agronomy advisors, plant breeders, food manufacturers and suppliers, nutritionists, and others. Without dialogue with and among stakeholders, it is hard to identify the priority areas (the social-scientific needs and challenges) where there is greatest opportunity for AI applications to achieve impact. Mapping the stakeholder networks for specific forms of data-intensive plant research is a labor-intensive but important endeavor[120], as demonstrated in large projects such as ELIXIR that devoted significant efforts towards developing transparent and robust mapping services. Government representatives, funding agencies, and industrial partners need to be engaged in the development of any data infrastructures and services. The involvement of industrial partners in particular is crucial, both because they own key data resources and because they use the resulting tools and apply their outputs. There is a strong need for increased governance and related norms ensuring the delivery of public goods from those organizations that see data as a key part of their commercial activities, similar to what the Food and Agriculture Organization has been spearheading in the case of plant genetic resources. If the field is to provide advantages to a wider range of socio-economic actors, small and medium-sized enterprises (SMEs) also need to be represented in future discussions and governance strategies around data access and protections.
In developing its agri-tech strategy[121], the UK government identified the key role of data and placed the development of an agri-tech center dedicated to data integration and access (Agrimetrics[122]) as central to its wider development of centers of agricultural innovation.[123] Such collaboration has also been envisioned, for instance, in the work of the Agrisemantics working group within the Agricultural Data Interest Group of the RDA[124] and the CGIAR Communities of Practice bringing together stakeholders to discuss data standards and semantics.[81] This engagement is crucial to ensure that academic expertise is informing and contributing to food security on the ground. Equally important is for public academic research, typically targeted to a wider range of topics, crops, and applications, to be directed towards stakeholder needs. For instance, PlantVillage Nuru, a free smartphone app that uses automated image analysis and recognition with a phone camera for immediate disease diagnosis in cassava and several other crop species[42], is targeted at farmers in the developing world and was specifically designed, in consultation with farmers’ representatives, to be usable offline and with minimal external input. This resulted in wide uptake and positive feedback due to the accessibility of the app to farmers and the usefulness of its contents and design.

Conclusion: What data landscape do we need for plant-related AI?

We have reviewed eight data challenges that need to be urgently confronted in order to support the application of AI and ML tools to plant-related research (see Table 2). With specific reference to the UK and Europe, where our work is based, we discussed examples of good practice, including efforts to articulate data standards, algorithms, and models at the right level of abstraction, in order to fit existing research questions and also address the gaps separating cutting-edge data science from the frontiers of plant research. Building on such examples, we pointed to the need for a more systemic change in how research in this domain is conducted, incentivized, supported, and regulated. We highlighted the importance of developing data services aiming to make data available and usable to people. This is particularly important in relation to environmental data of relevance to plant research, on which there has been much less focus compared to the tools already present to cope with genomic data. We pointed to the need for substantive investment in the development and maintenance of data infrastructures, standards, and software, as well as venues and training programs aimed at fostering collaboration among the diverse expertise required (and especially exposure to data science for plant scientists and breeders); the identification of relevant stakeholders, including industry, governmental agencies, local breeders, and indigenous communities where appropriate; and substantive engagement with those stakeholders. We stressed the difficulties in implementing these approaches within the highly fragmented biological data landscape, and the even more complex ensemble of public and private sponsors involved in research on crops.
Despite marked advances in data availability, infrastructures, and analytics, many plant researchers remain unaware of the extent to which AI tools could support their work, and they do not actively participate in the effort to produce reliable data for the community.

One way to shift incentives and support a substantive culture change among researchers could be to foster international and transdisciplinary collaboration around big projects with clear use cases, e.g., a “moonshot” equivalent to the Human Genome Project or the search for the Higgs particle in physics. Big science of this kind has a strong track record in driving the development of standards and epistemic cultures, as well as bringing together international partners to maximize the strengths of different regions and approaches.[125] The agronomic domain may need one such big project to create traction and new forms of collaboration, especially given the importance of driving adoption of common standards across research communities as diverse as data scientists working on algorithms, molecular biologists focused on genetic engineering, and crop scientists engaged in field experiments. Targets for such a moonshot project could range from addressing the phosphate crisis and its impact on agricultural yield, or developing a fully digital farm modelling an existing experimental station, to the development of ecosystem services using multiple metrics.

An alternative approach would be to focus on a key feature of ML that has been lacking in previously dominant technologies: its ability to both generalize and transfer between domains once specific, targeted solutions are found to well-defined problems. Once an ML strategy has been identified for a given task, exposure to further examples of that task typically improves performance, sometimes even when the details and environment are significantly different. Rather than identifying a moonshot biological challenge, which runs the risk of creating more tools tuned to specific research questions, an explicit search for capabilities needed across a range of plant and agricultural science scenarios could inform the identification of Technological Grand Challenges facing this community. These could be used both to spread innovation across the community and to engage colleagues from other disciplines. This approach could learn from other areas of research that have fared better in the development and application of AI, such as biomedicine. Repurposing some of the insights and infrastructures created in that domain would also be very useful for plant-focused science, including in tackling ethical and governance issues associated with the protection, sharing, and reuse of plant data.

Any future strategy for the development and application of AI in plant-focused research will need to have data curation at its center, rather than as an afterthought. Making plant data FAIR is crucial. This in turn requires both technical work on standards, reference data, software, and modeling, and organizational work towards establishing norms and venues for appropriate data governance (including on the terms of ownership, and access to and reuse of data), as well as engagement with the widest possible spectrum of relevant stakeholders. Most importantly, it requires collaboration towards tailoring the technologies to the challenges posed by the green domain and the role of plants in relation to food systems and environmental sustainability. The opportunities immediately available in terms of AI applications may not necessarily be what plant research and agronomy need. There is a need to foster collaboration between fundamental researchers, data scientists, algorithm developers, and end users in order to identify and maximize opportunities in this domain. Notably, while overcoming challenges to the effective use of AI will require changing practices and networks, it is important that such changes should not detrimentally affect what has already been successful. Existing communities of practice (such as the ELIXIR plant science community in Europe and the RDA agriculture-related groups at the global level) provide valuable sources of expertise and collaboration, and disrupting these risks creating more obstacles to good practice than benefits. We should note that making data FAIR along these lines will not resolve all issues of comparability and interoperability across experiments, given the enormous variability in settings and the number of variables involved, all of which are regularly updated to reflect local conditions. 
Carrying out a meta-analysis of data across experiments using AI will thus always require calibration and adjustments to allow for the specific sites, purposes, and conditions of the study.

Last but not least, improvements in data management may help identify and account for ethical and societal issues of relevance to agronomy and food production. There has been widespread concern that the adoption of ML tools implies a decrease in the oversight and control retained by humans on the interpretation of results, including the assessment of the potential implications of any resulting actions for stakeholder communities such as farmers, breeders, and consumers. This has been flanked by worry around documenting the provenance of data and rewarding the efforts involved in generating the materials and conditions for data collection, especially where results are extracted from farming communities in deprived areas. Practical solutions to these concerns require concerted effort from data producers and curators, research institutions, data infrastructures, and international governance.[110] For instance, the impact of specific crop varieties on diverse landscapes is considered by AgroFIMS and other tools developed by the CGIAR, while the allocation of ownership claims and rewards attached to discovery is incorporated into the Global Information System (GLIS) of the International Treaty on Plant Genetic Resources for Food and Agriculture. Thus, data management strategies can help to ensure that the environmental, social, and economic impact of AI tools is built into all applications.

Abbreviations, acronyms, and initialisms

  • AI: artificial intelligence
  • BrAPI: Breeding API
  • CNN: convolutional neural network
  • CVPPP: Computer Vision Problems in Plant Phenotyping
  • CWR: crop wild relative
  • e-RA: Electronic Rothamsted Archive
  • FAIDARE: FAIR Data-finder for Agronomic REsearch
  • FAIR: findable, accessible, interoperable, and reusable
  • GS: genomic selection
  • GxE: genotype-environment interaction
  • GxExM: genotype-environment-management interaction
  • LASSO: least absolute shrinkage and selection operator
  • LTE: long-term experiments
  • MIAPPE: Minimum Information about a Plant Phenotyping Experiment
  • MIR: mid-infrared
  • ML: machine learning
  • NIR: near-infrared

Acknowledgements

With many thanks to the two referees for their constructive and helpful reports, we have revised the paper by: (1) adding nuance to the sometimes too optimistic conclusions concerning proposed solutions to data challenges for AI in plant science; (2) clarifying the scope of some of the claims; and (3) adding a table outlining some key examples and applications of AI in plant science.

Funding

HFW and SL were funded via the From Field Data to Global Indicators project from the Alan Turing Institute, under EPSRC grant EP/N510129/1. STM was funded via BBSRC grant BB/P024068/1, The Nottingham Arabidopsis Stock Centre (arabidopsis.info). SAT was funded via MRC grant MR/R025746/1, "PhenomUK - Crop Phenotyping: from Sensors to Knowledge". TP was funded under H2020-EU projects 739514 EMPHASIS-PREP and 731013 EPPN2020. JB was funded by NERC Small Grant - Landscape Decisions “JDec – Joint decision models for citizens, crops, and environment” (Grant Reference NE/T004134/1).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data availability

No data are associated with this article.

Competing interest

No competing interests were disclosed.

Footnotes

  1. We recognise that there is disagreement over whether ML can always be classified as AI, given that the application of ML techniques often requires extensive manual feature extraction in order to process data more effectively for analysis. In this regard, ML may be considered closer to statistical methods than to AI. For the purposes of this paper, where many of the data challenges are shared between existing methods of ML and AI sensu stricto, we will treat the two as a continuum of techniques where "AI" is the more encompassing and general term.
  2. In short, the existence of the data should be published, procedures for accessing the data should be available, sufficient metadata should be provided to allow the data to be understood and appropriately repurposed, and common formats and APIs should be used to facilitate the integration of different datasets.
  3. Semantic standards that recognise and incorporate this diversity of knowledge will be a necessary bedrock for any applications of AI and ML that are envisioned to work for diverse user bases, and for preventing implicit bias towards the terminology, scope or aims of dominant research groups.[82]
  4. The datasets are published by CEH as a collection here. Each crop dataset has its own DOI, and the metadata gives a summary of measurements/data available, plus an extra dataset for management data.
  5. Another successful initiative by the CVPPP is the Global Wheat Detection Kaggle Competition launched to broaden engagement in summer 2020, which received over 2000 entries.[119]

References

  1. Wang, Hai; Cimen, Emre; Singh, Nisha; Buckler, Edward (1 April 2020). "Deep learning for plant genomics and crop improvement" (in en). Current Opinion in Plant Biology 54: 34–41. doi:10.1016/j.pbi.2019.12.010. https://linkinghub.elsevier.com/retrieve/pii/S1369526619301256. 
  2. Harfouche, Antoine L.; Jacobson, Daniel A.; Kainer, David; Romero, Jonathon C.; Harfouche, Antoine H.; Scarascia Mugnozza, Giuseppe; Moshelion, Menachem; Tuskan, Gerald A. et al. (1 November 2019). "Accelerating Climate Resilient Plant Breeding by Applying Next-Generation Artificial Intelligence" (in en). Trends in Biotechnology 37 (11): 1217–1235. doi:10.1016/j.tibtech.2019.05.007. https://linkinghub.elsevier.com/retrieve/pii/S0167779919301143. 
  3. King, B. (2020). "Big Data in Agriculture". CGIAR Platform for Big Data in Agriculture. CGIAR. https://www.cgiar.org/annual-report/performance-report-2020/big-data-in-agriculture/. 
  4. Leonelli, Sabina; Davey, Robert P.; Arnaud, Elizabeth; Parry, Geraint; Bastow, Ruth (6 June 2017). "Data management and best practice for plant science" (in en). Nature Plants 3 (6): 17086. doi:10.1038/nplants.2017.86. ISSN 2055-0278. https://www.nature.com/articles/nplants201786. 
  5. Leonelli, Sabina (2016). Data-centric biology: a philosophical study. Chicago ; London: The University of Chicago Press. ISBN 978-0-226-41633-5. 
  6. Leonelli, Sabina (5 April 2019). "The challenges of big data biology" (in en). eLife 8: e47381. doi:10.7554/eLife.47381. ISSN 2050-084X. PMC6450665. PMID 30950793. https://elifesciences.org/articles/47381. 
  7. Tardieu, François; Cabrera-Bosquet, Llorenç; Pridmore, Tony; Bennett, Malcolm (1 August 2017). "Plant Phenomics, From Sensors to Knowledge" (in en). Current Biology 27 (15): R770–R783. doi:10.1016/j.cub.2017.05.055. https://linkinghub.elsevier.com/retrieve/pii/S0960982217306218. 
  8. Market Reports World (26 March 2019). "Global Artificial Intelligence (AI) in Agriculture Market Size, Status and Forecast 2019–2025". https://www.marketreportsworld.com/global-artificial-intelligence-ai-in-agriculture-market-13268433. 
  9. Carbonell, Isabelle M. (31 March 2016). "The ethics of big data in big agriculture" (in en). Internet Policy Review 5 (1). doi:10.14763/2016.1.405. ISSN 2197-6775. https://policyreview.info/node/405. 
  10. Ainali, Katerina; Tsiligiridis, Theodore (29 October 2018). Su, Ruidan. ed. "Remote sensing Big AgriData for food availability". 2018 International Conference on Image and Video Processing, and Artificial Intelligence (Shanghai, China: SPIE): 17. doi:10.1117/12.2327014. ISBN 978-1-5106-2310-1. https://www.spiedigitallibrary.org/conference-proceedings-of-spie/10836/2327014/Remote-sensing-Big-AgriData-for-food-availability/10.1117/12.2327014.full. 
  11. Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill series in computer science. New York: McGraw-Hill. ISBN 978-0-07-042807-2. 
  12. Napoletani, D.; Panza, M.; Struppa, D. C. (1 February 2011). "Agnostic Science. Towards a Philosophy of Data Analysis" (in en). Foundations of Science 16 (1): 1–20. doi:10.1007/s10699-010-9186-7. ISSN 1233-1821. http://link.springer.com/10.1007/s10699-010-9186-7. 
  13. El-Gebali, Sara; Mistry, Jaina; Bateman, Alex; Eddy, Sean R; Luciani, Aurélien; Potter, Simon C; Qureshi, Matloob; Richardson, Lorna J et al. (8 January 2019). "The Pfam protein families database in 2019" (in en). Nucleic Acids Research 47 (D1): D427–D432. doi:10.1093/nar/gky995. ISSN 0305-1048. PMC6324024. PMID 30357350. https://academic.oup.com/nar/article/47/D1/D427/5144153. 
  14. Larrañaga, Pedro; Calvo, Borja; Santana, Roberto; Bielza, Concha; Galdiano, Josu; Inza, Iñaki; Lozano, José A.; Armañanzas, Rubén et al. (1 March 2006). "Machine learning in bioinformatics" (in en). Briefings in Bioinformatics 7 (1): 86–112. doi:10.1093/bib/bbk007. ISSN 1467-5463. https://academic.oup.com/bib/article/7/1/86/264025. 
  15. Hayes, William S.; Borodovsky, Mark (1 November 1998). "How to Interpret an Anonymous Bacterial Genome: Machine Learning Approach to Gene Identification" (in en). Genome Research 8 (11): 1154–1171. doi:10.1101/gr.8.11.1154. ISSN 1088-9051. http://genome.cshlp.org/lookup/doi/10.1101/gr.8.11.1154. 
  16. Birney, Ewan; Clamp, Michele; Durbin, Richard (1 May 2004). "GeneWise and Genomewise" (in en). Genome Research 14 (5): 988–995. doi:10.1101/gr.1865504. ISSN 1088-9051. PMC479130. PMID 15123596. http://genome.cshlp.org/lookup/doi/10.1101/gr.1865504. 
  17. Zou, Quan; Sangaiah, Arun Kumar; Mrozek, Dariusz (4 October 2019). "Editorial: Machine Learning Techniques on Gene Function Prediction". Frontiers in Genetics 10: 938. doi:10.3389/fgene.2019.00938. ISSN 1664-8021. PMC6788354. PMID 31636657. https://www.frontiersin.org/article/10.3389/fgene.2019.00938/full. 
  18. Mochida, Keiichi; Koda, Satoru; Inoue, Komaki; Nishii, Ryuei (29 November 2018). "Statistical and Machine Learning Approaches to Predict Gene Regulatory Networks From Transcriptome Datasets". Frontiers in Plant Science 9: 1770. doi:10.3389/fpls.2018.01770. ISSN 1664-462X. PMC6281826. PMID 30555503. https://www.frontiersin.org/article/10.3389/fpls.2018.01770/full. 
  19. Sperschneider, Jana (1 October 2020). "Machine learning in plant–pathogen interactions: empowering biological predictions from field scale to genome scale" (in en). New Phytologist 228 (1): 35–41. doi:10.1111/nph.15771. ISSN 0028-646X. https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.15771. 
  20. Leonelli, S (1 April 2014). "What difference does quantity make? On the epistemology of Big Data in biology" (in en). Big Data & Society 1 (1): 205395171453439. doi:10.1177/2053951714534395. ISSN 2053-9517. PMC4340542. PMID 25729586. http://journals.sagepub.com/doi/10.1177/2053951714534395. 
  21. Smith, G.N.; Cordes, J. (2019). The 9 pitfalls of data science (1st ed.). New York, NY: Oxford University Press. ISBN 978-0-19-884439-6. 
  22. Schramowski, Patrick; Stammer, Wolfgang; Teso, Stefano; Brugger, Anna; Herbert, Franziska; Shao, Xiaoting; Luigs, Hans-Georg; Mahlein, Anne-Katrin et al. (12 August 2020). "Making deep neural networks right for the right scientific reasons by interacting with their explanations" (in en). Nature Machine Intelligence 2 (8): 476–486. doi:10.1038/s42256-020-0212-3. ISSN 2522-5839. https://www.nature.com/articles/s42256-020-0212-3. 
  23. Singh, Arti; Ganapathysubramanian, Baskar; Singh, Asheesh Kumar; Sarkar, Soumik (1 February 2016). "Machine Learning for High-Throughput Stress Phenotyping in Plants" (in en). Trends in Plant Science 21 (2): 110–124. doi:10.1016/j.tplants.2015.10.015. https://linkinghub.elsevier.com/retrieve/pii/S1360138515002630. 
  24. Mohanty, Sharada P.; Hughes, David P.; Salathé, Marcel (22 September 2016). "Using Deep Learning for Image-Based Plant Disease Detection". Frontiers in Plant Science 7: 1419. doi:10.3389/fpls.2016.01419. ISSN 1664-462X. PMC5032846. PMID 27713752. http://journal.frontiersin.org/article/10.3389/fpls.2016.01419/full. 
  25. Ramcharan, Amanda; Baranowski, Kelsee; McCloskey, Peter; Ahmed, Babuali; Legg, James; Hughes, David P. (27 October 2017). "Deep Learning for Image-Based Cassava Disease Detection". Frontiers in Plant Science 8: 1852. doi:10.3389/fpls.2017.01852. ISSN 1664-462X. PMC5663696. PMID 29163582. http://journal.frontiersin.org/article/10.3389/fpls.2017.01852/full. 
  26. Fu, Peng; Meacham-Hensold, Katherine; Guan, Kaiyu; Bernacchi, Carl J. (3 June 2019). "Hyperspectral Leaf Reflectance as Proxy for Photosynthetic Capacities: An Ensemble Approach Based on Multiple Machine Learning Algorithms". Frontiers in Plant Science 10: 730. doi:10.3389/fpls.2019.00730. ISSN 1664-462X. PMC6556518. PMID 31214235. https://www.frontiersin.org/article/10.3389/fpls.2019.00730/full. 
  27. Jiang, Yu; Li, Changying (1 January 2020). "Convolutional Neural Networks for Image-Based High-Throughput Plant Phenotyping: A Review" (in en). Plant Phenomics 2020: 2020/4152816. doi:10.34133/2020/4152816. ISSN 2643-6515. PMC7706326. PMID 33313554. https://spj.science.org/doi/10.34133/2020/4152816. 
  28. Dobrescu, Andrei; Giuffrida, Mario Valerio; Tsaftaris, Sotirios A. (1 October 2017). "Leveraging Multiple Datasets for Deep Leaf Counting". 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (Venice, Italy: IEEE): 2072–2079. doi:10.1109/ICCVW.2017.243. ISBN 978-1-5386-1034-3. http://ieeexplore.ieee.org/document/8265453/. 
  29. Pound, Michael P.; Atkinson, Jonathan A.; Townsend, Alexandra J.; Wilson, Michael H.; Griffiths, Marcus; Jackson, Aaron S.; Bulat, Adrian; Tzimiropoulos, Georgios et al. (1 October 2017). "Deep machine learning provides state-of-the-art performance in image-based plant phenotyping" (in en). GigaScience 6 (10). doi:10.1093/gigascience/gix083. ISSN 2047-217X. PMC5632296. PMID 29020747. https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/gix083/4091592. 
  30. Yasrab, Robail; Atkinson, Jonathan A; Wells, Darren M; French, Andrew P; Pridmore, Tony P; Pound, Michael P (1 November 2019). "RootNav 2.0: Deep learning for automatic navigation of complex plant root architectures" (in en). GigaScience 8 (11): giz123. doi:10.1093/gigascience/giz123. ISSN 2047-217X. PMC6839032. PMID 31702012. https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giz123/5614712. 
  31. Soltaninejad, Mohammadreza; Sturrock, Craig J.; Griffiths, Marcus; Pridmore, Tony P.; Pound, Michael P. (2020). "Three Dimensional Root CT Segmentation Using Multi-Resolution Encoder-Decoder Networks". IEEE Transactions on Image Processing 29: 6667–6679. doi:10.1109/TIP.2020.2992893. ISSN 1057-7149. https://ieeexplore.ieee.org/document/9091908/. 
  32. Gao, Junfeng; French, Andrew P.; Pound, Michael P.; He, Yong; Pridmore, Tony P.; Pieters, Jan G. (1 December 2020). "Deep convolutional neural networks for image-based Convolvulus sepium detection in sugar beet fields" (in en). Plant Methods 16 (1): 29. doi:10.1186/s13007-020-00570-z. ISSN 1746-4811. PMC7059384. PMID 32165909. https://plantmethods.biomedcentral.com/articles/10.1186/s13007-020-00570-z. 
  33. González‐Camacho, Juan Manuel; Ornella, Leonardo; Pérez‐Rodríguez, Paulino; Gianola, Daniel; Dreisigacker, Susanne; Crossa, José (1 July 2018). "Applications of Machine Learning Methods to Genomic Selection in Breeding Wheat for Rust Resistance" (in en). The Plant Genome 11 (2): 170104. doi:10.3835/plantgenome2017.11.0104. ISSN 1940-3372. https://acsess.onlinelibrary.wiley.com/doi/10.3835/plantgenome2017.11.0104. 
  34. Data Study Group Team (29 April 2020) (in en). Data Study Group Network Final Report: Rothamsted Research. doi:10.5281/zenodo.3775489. https://zenodo.org/record/3775489. 
  35. Potamitis, Ilyas; Rigakis, Iraklis; Fysarakis, Konstantinos (6 November 2015). Dickens, Joseph Clifton. ed. "Insect Biometrics: Optoacoustic Signal Processing and Its Applications to Remote Monitoring of McPhail Type Traps" (in en). PLOS ONE 10 (11): e0140474. doi:10.1371/journal.pone.0140474. ISSN 1932-6203. PMC4636391. PMID 26544845. https://dx.plos.org/10.1371/journal.pone.0140474. 
  36. Carranza-Rojas, Jose; Goeau, Herve; Bonnet, Pierre; Mata-Montero, Erick; Joly, Alexis (1 December 2017). "Going deeper in the automated identification of Herbarium specimens" (in en). BMC Evolutionary Biology 17 (1): 181. doi:10.1186/s12862-017-1014-z. ISSN 1471-2148. PMC5553807. PMID 28797242. http://bmcevolbiol.biomedcentral.com/articles/10.1186/s12862-017-1014-z. 
  37. Younis, Sohaib; Weiland, Claus; Hoehndorf, Robert; Dressler, Stefan; Hickler, Thomas; Seeger, Bernhard; Schmidt, Marco (2 October 2018). "Taxon and trait recognition from digitized herbarium specimens using deep convolutional neural networks" (in en). Botany Letters 165 (3-4): 377–383. doi:10.1080/23818107.2018.1446357. ISSN 2381-8107. https://www.tandfonline.com/doi/full/10.1080/23818107.2018.1446357. 
  38. Dobrescu, A.; Giuffrida, M.V.; Tsaftaris, S.A. (2019). "Understanding Deep Neural Networks for Regression in Leaf Counting". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops: 4321–29. https://openaccess.thecvf.com/content_CVPRW_2019/html/CVPPP/Dobrescu_Understanding_Deep_Neural_Networks_for_Regression_in_Leaf_Counting_CVPRW_2019_paper.html. 
  39. Dobrescu, Andrei; Giuffrida, Mario Valerio; Tsaftaris, Sotirios A. (28 February 2020). "Doing More With Less: A Multitask Deep Learning Approach in Plant Phenotyping". Frontiers in Plant Science 11: 141. doi:10.3389/fpls.2020.00141. ISSN 1664-462X. PMC7093010. PMID 32256503. https://www.frontiersin.org/article/10.3389/fpls.2020.00141/full. 
  40. Giuffrida, M.V.; Dobrescu, A.; Doerner, P. et al. (2019). "Leaf Counting Without Annotations Using Adversarial Unsupervised Domain Adaptation". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops: 1–10. https://openaccess.thecvf.com/content_CVPRW_2019/html/CVPPP/Giuffrida_Leaf_Counting_Without_Annotations_Using_Adversarial_Unsupervised_Domain_Adaptation_CVPRW_2019_paper.html. 
  41. Tsaftaris, Sotirios A.; Scharr, Hanno (1 February 2019). "Sharing the Right Data Right: A Symbiosis with Machine Learning" (in en). Trends in Plant Science 24 (2): 99–102. doi:10.1016/j.tplants.2018.10.016. https://linkinghub.elsevier.com/retrieve/pii/S1360138518302498. 
  42. Ramcharan, Amanda; McCloskey, Peter; Baranowski, Kelsee; Mbilinyi, Neema; Mrisho, Latifa; Ndalahwa, Mathias; Legg, James; Hughes, David P. (20 March 2019). "A Mobile-Based Deep Learning Model for Cassava Disease Diagnosis". Frontiers in Plant Science 10: 272. doi:10.3389/fpls.2019.00272. ISSN 1664-462X. PMC6436463. PMID 30949185. https://www.frontiersin.org/article/10.3389/fpls.2019.00272/full. 
  43. Crossa, José; Pérez-Rodríguez, Paulino; Cuevas, Jaime; Montesinos-López, Osval; Jarquín, Diego; de los Campos, Gustavo; Burgueño, Juan; González-Camacho, Juan M. et al. (1 November 2017). "Genomic Selection in Plant Breeding: Methods, Models, and Perspectives" (in en). Trends in Plant Science 22 (11): 961–975. doi:10.1016/j.tplants.2017.08.011. https://linkinghub.elsevier.com/retrieve/pii/S136013851730184X. 
  44. Poulton, Paul; Johnston, Johnny; Macdonald, Andy; White, Rodger; Powlson, David (1 June 2018). "Major limitations to achieving “4 per 1000” increases in soil organic carbon stock in temperate regions: Evidence from long‐term experiments at Rothamsted Research, United Kingdom" (in en). Global Change Biology 24 (6): 2563–2584. doi:10.1111/gcb.14066. ISSN 1354-1013. PMC PMC6001646. PMID 29356243. https://onlinelibrary.wiley.com/doi/10.1111/gcb.14066. 
  45. Jensen, Johannes L.; Schjønning, Per; Watts, Christopher W.; Christensen, Bent T.; Obour, Peter B.; Munkholm, Lars J. (1 April 2020). "Soil degradation and recovery – Changes in organic matter fractions and structural stability" (in en). Geoderma 364: 114181. doi:10.1016/j.geoderma.2020.114181. PMC PMC7043339. PMID 32255839. https://linkinghub.elsevier.com/retrieve/pii/S0016706119310572. 
  46. Parolini, Giuditta (1 May 2015). "The Emergence of Modern Statistics in Agricultural Science: Analysis of Variance, Experimental Design and the Reshaping of Research at Rothamsted Experimental Station, 1919–1933" (in en). Journal of the History of Biology 48 (2): 301–335. doi:10.1007/s10739-014-9394-z. ISSN 0022-5010. http://link.springer.com/10.1007/s10739-014-9394-z. 
  47. Rothamsted Research (2018) (in en). Guide to the Classical and other Long-term Experiments, Datasets and Sample Archive. Rothamsted Research Ltd, Lawes Agricultural Trust, E-RA Curator Team. Rothamsted Research. doi:10.23637/rothamsted-long-term-experiments-guide-2018. http://www.era.rothamsted.ac.uk/eradoc/book/248. 
  48. Perryman, Sarah A. M.; Castells-Brooke, Nathalie I. D.; Glendining, Margaret J.; Goulding, Keith W. T.; Hawkesford, Malcolm J.; Macdonald, Andy J.; Ostler, Richard J.; Poulton, Paul R. et al. (15 May 2018). "The electronic Rothamsted Archive (e-RA), an online resource for data from the Rothamsted long-term experiments" (in en). Scientific Data 5 (1): 180072. doi:10.1038/sdata.2018.72. ISSN 2052-4463. PMC PMC5952867. PMID 29762552. https://www.nature.com/articles/sdata201872. 
  49. Addy, John W.G.; Ellis, Richard H.; Macdonald, Andy J.; Semenov, Mikhail A.; Mead, Andrew (1 April 2020). "Investigating the effects of inter-annual weather variation (1968–2016) on the functional response of cereal grain yield to applied nitrogen, using data from the Rothamsted Long-Term Experiments" (in en). Agricultural and Forest Meteorology 284: 107898. doi:10.1016/j.agrformet.2019.107898. PMC PMC7079297. PMID 32308247. https://linkinghub.elsevier.com/retrieve/pii/S0168192319305118. 
  50. 50.0 50.1 Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem et al. (15 March 2016). "The FAIR Guiding Principles for scientific data management and stewardship" (in en). Scientific Data 3 (1): 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC PMC4792175. PMID 26978244. https://www.nature.com/articles/sdata201618. 
  51. Hey, Anthony J. G., ed. (2009). The fourth paradigm: data-intensive scientific discovery. Redmond, Washington: Microsoft Research. ISBN 978-0-9825442-0-4. 
  52. Marx, Vivien (13 June 2013). "The big challenges of big data" (in en). Nature 498 (7453): 255–260. doi:10.1038/498255a. ISSN 0028-0836. https://www.nature.com/articles/498255a. 
  53. 53.0 53.1 53.2 Strasser, Bruno J. (2019). Collecting experiments: making big data biology. Chicago: The University of Chicago Press. ISBN 978-0-226-63499-9. 
  54. European Commission. Directorate General for Research and Innovation. (2018). Mutual learning exercise: open science : altmetrics and rewards : Horizon 2020 policy support facility. LU: Publications Office. doi:10.2777/468970. ISBN 978-92-79-82005-2. https://data.europa.eu/doi/10.2777/468970. 
  55. Rigden, Daniel J; Fernández, Xosé M (8 January 2020). "The 27th annual Nucleic Acids Research database issue and molecular biology database collection" (in en). Nucleic Acids Research 48 (D1): D1–D8. doi:10.1093/nar/gkz1161. ISSN 0305-1048. PMC PMC6943072. PMID 31906604. https://academic.oup.com/nar/article/48/D1/D1/5695332. 
  56. November, Joseph Adam (2012). Biomedical computing: digitizing life in the United States. Johns Hopkins University studies in historical and political science. Baltimore: Johns Hopkins University Press. ISBN 978-1-4214-0468-4. 
  57. Stevens, Hallam (2013). Life out of sequence: a data-driven history of bioinformatics. Chicago: The University of Chicago Press. ISBN 978-0-226-08017-8. 
  58. Mackenzie, Adrian; Waterton, Claire; Ellis, Rebecca; Frow, Emma K.; McNally, Ruth; Busch, Lawrence; Wynne, Brian (1 September 2013). "Classifying, Constructing, and Identifying Life: Standards as Transformations of “The Biological”" (in en). Science, Technology, & Human Values 38 (5): 701–722. doi:10.1177/0162243912474324. ISSN 0162-2439. http://journals.sagepub.com/doi/10.1177/0162243912474324. 
  59. Leonelli, Sabina; Ankeny, Rachel A. (1 March 2012). "Re-thinking organisms: The impact of databases on model organism biology" (in en). Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 43 (1): 29–36. doi:10.1016/j.shpsc.2011.10.003. https://linkinghub.elsevier.com/retrieve/pii/S1369848611000793. 
  60. Rodríguez-Iglesias, Alejandro; Rodríguez-González, Alejandro; Irvine, Alistair G.; Sesma, Ane; Urban, Martin; Hammond-Kosack, Kim E.; Wilkinson, Mark D. (12 May 2016). "Publishing FAIR Data: An Exemplar Methodology Utilizing PHI-Base". Frontiers in Plant Science 7. doi:10.3389/fpls.2016.00641. ISSN 1664-462X. PMC PMC4922217. PMID 27433158. http://journal.frontiersin.org/Article/10.3389/fpls.2016.00641/abstract. 
  61. Pommier, C.; Michotey, C.; Cornut, G.; Roumet, P.; Duchêne, E.; Flores, R.; Lebreton, A.; Alaux, M. et al. (1 January 2019). "Applying FAIR Principles to Plant Phenotypic Data Management in GnpIS" (in en). Plant Phenomics 2019: 1671403. doi:10.34133/2019/1671403. ISSN 2643-6515. PMC PMC7718628. PMID 33313522. https://spj.science.org/doi/10.34133/2019/1671403. 
  62. Reiser, Leonore; Harper, Lisa; Freeling, Michael; Han, Bin; Luan, Sheng (1 September 2018). "FAIR: A Call to Make Published Data More Findable, Accessible, Interoperable, and Reusable" (in en). Molecular Plant 11 (9): 1105–1108. doi:10.1016/j.molp.2018.07.005. https://linkinghub.elsevier.com/retrieve/pii/S1674205218302399. 
  63. "FAIR Data-finder for Agronomic REsearch". Unité de Recherche en Génomique-Info. 2023. https://urgi.versailles.inrae.fr/faidare/. 
  64. 64.0 64.1 Selby, Peter; Abbeloos, Rafael; Backlund, Jan Erik; Basterrechea Salido, Martin; Bauchet, Guillaume; Benites-Alfaro, Omar E; Birkett, Clay; Calaminos, Viana C et al. (15 October 2019). Wren, Jonathan. ed. "BrAPI—an application programming interface for plant breeding applications" (in en). Bioinformatics 35 (20): 4147–4155. doi:10.1093/bioinformatics/btz190. ISSN 1367-4803. PMC PMC6792114. PMID 30903186. https://academic.oup.com/bioinformatics/article/35/20/4147/5418796. 
  65. 65.0 65.1 Papoutsoglou, Evangelia A.; Faria, Daniel; Arend, Daniel; Arnaud, Elizabeth; Athanasiadis, Ioannis N.; Chaves, Inês; Coppens, Frederik; Cornut, Guillaume et al. (1 July 2020). "Enabling reusability of plant phenomic datasets with MIAPPE 1.1" (in en). New Phytologist 227 (1): 260–273. doi:10.1111/nph.16544. ISSN 0028-646X. PMC PMC7317793. PMID 32171029. https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.16544. 
  66. "Annotated Crop Image Database". University of Nottingham. 2023. https://plantimages.nottingham.ac.uk/. 
  67. "CARE Principles for Indigenous Data Governance". Global Indigenous Data Alliance. 2023. https://www.gida-global.org/care#. 
  68. Lin, Dawei; Crabtree, Jonathan; Dillo, Ingrid; Downs, Robert R.; Edmunds, Rorie; Giaretta, David; De Giusti, Marisa; L’Hours, Hervé et al. (14 May 2020). "The TRUST Principles for digital repositories" (in en). Scientific Data 7 (1): 144. doi:10.1038/s41597-020-0486-7. ISSN 2052-4463. https://www.nature.com/articles/s41597-020-0486-7. 
  69. Thiers, B.M. (10 January 2020). "The World’s Herbaria 2019: A Summary Report Based on Data from Index Herbariorum" (PDF). Index Herbariorum. https://sweetgum.nybg.org/science/docs/The_Worlds_Herbaria_2019.pdf. 
  70. Soltis, Pamela S. (1 September 2017). "Digitization of herbaria enables novel research" (in en). American Journal of Botany 104 (9): 1281–1284. doi:10.3732/ajb.1700281. ISSN 0002-9122. https://bsapubs.onlinelibrary.wiley.com/doi/10.3732/ajb.1700281. 
  71. Dillen, Mathias; Groom, Quentin; Chagnoux, Simon; Güntsch, Anton; Hardisty, Alex; Haston, Elspeth; Livermore, Laurence; Runnel, Veljo et al. (8 February 2019). "A benchmark dataset of herbarium specimen images with label data". Biodiversity Data Journal 7: e31817. doi:10.3897/BDJ.7.e31817. ISSN 1314-2828. PMC PMC6396854. PMID 30833825. https://bdj.pensoft.net/article/31817/. 
  72. Bebber, Daniel P.; Carine, Mark A.; Davidse, Gerrit; Harris, David J.; Haston, Elspeth M.; Penn, Malcolm G.; Cafferty, Steve; Wood, John R. I. et al. (7 June 2012). "Big hitting collectors make massive and disproportionate contribution to the discovery of plant species" (in en). Proceedings of the Royal Society B: Biological Sciences 279 (1736): 2269–2274. doi:10.1098/rspb.2011.2439. ISSN 0962-8452. PMC PMC3321708. PMID 22298844. https://royalsocietypublishing.org/doi/10.1098/rspb.2011.2439. 
  73. Hufford, Matthew B.; Berny Mier y Teran, Jorge C.; Gepts, Paul (29 April 2019). "Crop Biodiversity: An Unfinished Magnum Opus of Nature" (in en). Annual Review of Plant Biology 70 (1): 727–751. doi:10.1146/annurev-arplant-042817-040240. ISSN 1543-5008. https://www.annualreviews.org/doi/10.1146/annurev-arplant-042817-040240. 
  74. "EURISCO - Finding Seeds for the future". Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung. 2023. https://eurisco.ipk-gatersleben.de/apex/eurisco_ws/r/eurisco/home. 
  75. Neveu, Pascal; Tireau, Anne; Hilgert, Nadine; Nègre, Vincent; Mineau‐Cesari, Jonathan; Brichet, Nicolas; Chapuis, Romain; Sanchez, Isabelle et al. (1 January 2019). "Dealing with multi‐source and multi‐scale information in plant phenomics: the ontology‐driven Phenotyping Hybrid Information System" (in en). New Phytologist 221 (1): 588–601. doi:10.1111/nph.15385. ISSN 0028-646X. PMC PMC6585972. PMID 30152011. https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.15385. 
  76. Cabrera‐Bosquet, Llorenç; Fournier, Christian; Brichet, Nicolas; Welcker, Claude; Suard, Benoît; Tardieu, François (1 October 2016). "High‐throughput estimation of incident light, light interception and radiation‐use efficiency of thousands of plants in a phenotyping platform" (in en). New Phytologist 212 (1): 269–281. doi:10.1111/nph.14027. ISSN 0028-646X. https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.14027. 
  77. Araus, José Luis; Cairns, Jill E. (1 January 2014). "Field high-throughput phenotyping: the new crop breeding frontier" (in en). Trends in Plant Science 19 (1): 52–61. doi:10.1016/j.tplants.2013.09.008. https://linkinghub.elsevier.com/retrieve/pii/S1360138513001994. 
  78. Spindel, Jennifer E.; McCouch, Susan R. (1 December 2016). "When more is better: how data sharing would accelerate genomic selection of crop plants" (in en). New Phytologist 212 (4): 814–826. doi:10.1111/nph.14174. ISSN 0028-646X. https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.14174. 
  79. Alercia, A.; Diulgheroff, S.; Mackay, M. (2015). "FAO/Bioversity multi-crop passport descriptors V.2.1.". Food and Agriculture Organization of the United Nations. https://www.genesys-pgr.org/descriptorlists/0cd31350-234b-4ebf-80bc-fc65f14f7541. 
  80. Park, Mia G.; Blitzer, E. J.; Gibbs, Jason; Losey, John E.; Danforth, Bryan N. (22 June 2015). "Negative effects of pesticides on wild bee communities can be buffered by landscape context" (in en). Proceedings of the Royal Society B: Biological Sciences 282 (1809): 20150299. doi:10.1098/rspb.2015.0299. ISSN 0962-8452. PMC PMC4590442. PMID 26041355. https://royalsocietypublishing.org/doi/10.1098/rspb.2015.0299. 
  81. 81.0 81.1 Arnaud, Elizabeth; Laporte, Marie-Angélique; Kim, Soonho; Aubert, Céline; Leonelli, Sabina; Miro, Berta; Cooper, Laurel; Jaiswal, Pankaj et al. (1 October 2020). "The Ontologies Community of Practice: A CGIAR Initiative for Big Data in Agrifood Systems" (in en). Patterns 1 (7): 100105. doi:10.1016/j.patter.2020.100105. PMC PMC7660444. PMID 33205138. https://linkinghub.elsevier.com/retrieve/pii/S2666389920301392. 
  82. Arnaud, Elizabeth; Laporte, Marie-Angélique; Kim, Soonho; Aubert, Céline; Leonelli, Sabina; Miro, Berta; Cooper, Laurel; Jaiswal, Pankaj et al. (1 October 2020). "The Ontologies Community of Practice: A CGIAR Initiative for Big Data in Agrifood Systems" (in en). Patterns 1 (7): 100105. doi:10.1016/j.patter.2020.100105. PMC PMC7660444. PMID 33205138. https://linkinghub.elsevier.com/retrieve/pii/S2666389920301392. 
  83. "EXCELERATE WP7: Integrating Genomic and Phenotypic Data for Crop and Forest Plants". ELIXIR, Wellcome Genome Campus. 2023. https://elixir-europe.org/about-us/how-funded/eu-projects/excelerate/wp7. 
  84. Rife, Trevor W.; Poland, Jesse A. (1 July 2014). "Field Book: An Open‐Source Application for Field Data Collection on Android" (in en). Crop Science 54 (4): 1624–1627. doi:10.2135/cropsci2013.08.0579. ISSN 0011-183X. https://acsess.onlinelibrary.wiley.com/doi/10.2135/cropsci2013.08.0579. 
  85. Shaw, Felix; Etuk, Anthony; Minotto, Alice; Gonzalez-Beltran, Alejandra; Johnson, David; Rocca-Serra, Phillipe; Laporte, Marie-Angélique; Arnaud, Elizabeth et al. (2 June 2020). "COPO: a metadata platform for brokering FAIR data in the life sciences" (in en). F1000Research 9: 495. doi:10.12688/f1000research.23889.1. ISSN 2046-1402. https://f1000research.com/articles/9-495/v1. 
  86. van Beijma, Sybrand; Chatterton, Julia; Page, Susan; Rawlings, Chris; Tiffin, Richard; King, Henry (4 July 2018). "The challenges of using satellite data sets to assess historical land use change and associated greenhouse gas emissions: a case study of three Indonesian provinces" (in en). Carbon Management 9 (4): 399–413. doi:10.1080/17583004.2018.1511383. ISSN 1758-3004. https://www.tandfonline.com/doi/full/10.1080/17583004.2018.1511383. 
  87. Orr, R. J.; Murray, P. J.; Eyles, C. J.; Blackwell, M. S. A.; Cardenas, L. M.; Collins, A. L.; Dungait, J. A. J.; Goulding, K. W. T. et al. (1 July 2016). "The North Wyke Farm Platform: effect of temperate grassland farming systems on soil moisture contents, runoff and associated water quality dynamics" (in en). European Journal of Soil Science 67 (4): 374–385. doi:10.1111/ejss.12350. ISSN 1351-0754. PMC PMC5103177. PMID 27867310. https://bsssjournals.onlinelibrary.wiley.com/doi/10.1111/ejss.12350. 
  88. SEQC/MAQC-III Consortium (1 September 2014). "A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium" (in en). Nature Biotechnology 32 (9): 903–914. doi:10.1038/nbt.2957. ISSN 1087-0156. PMC PMC4321899. PMID 25150838. https://www.nature.com/articles/nbt.2957. 
  89. Firbank, L. G.; Heard, M. S.; Woiwod, I. P.; Hawes, C.; Haughton, A. J.; Champion, G. T.; Scott, R. J.; Hill, M. O. et al. (1 February 2003). "An introduction to the Farm‐Scale Evaluations of genetically modified herbicide‐tolerant crops" (in en). Journal of Applied Ecology 40 (1): 2–16. doi:10.1046/j.1365-2664.2003.00787.x. ISSN 0021-8901. https://besjournals.onlinelibrary.wiley.com/doi/10.1046/j.1365-2664.2003.00787.x. 
  90. House of Commons Environmental Audit Committee (2 March 2004). "GM Foods—Evaluating the Farm Scale Trials, Second Report of Session 2003–04" (PDF). The Stationery Office. https://publications.parliament.uk/pa/cm200304/cmselect/cmenvaud/90/90.pdf. 
  91. Ubbens, Jordan; Cieslak, Mikolaj; Prusinkiewicz, Przemyslaw; Stavness, Ian (1 December 2018). "The use of plant models in deep learning: an application to leaf counting in rosette plants" (in en). Plant Methods 14 (1): 6. doi:10.1186/s13007-018-0273-z. ISSN 1746-4811. PMC PMC5773030. PMID 29375647. https://plantmethods.biomedcentral.com/articles/10.1186/s13007-018-0273-z. 
  92. Humphreys, Mike W.; Doonan, John H.; Boyle, Roger; Rodriguez, Anyela C.; Marley, Christina L.; Williams, Kevin; Farrell, Markku S.; Brook, Jason et al. (1 November 2018). "Root imaging showing comparisons in root distribution and ontogeny in novel Festulolium populations and closely related perennial ryegrass varieties" (in en). Food and Energy Security 7 (4): e00145. doi:10.1002/fes3.145. ISSN 2048-3694. PMC PMC6360931. PMID 30774947. https://onlinelibrary.wiley.com/doi/10.1002/fes3.145. 
  93. Toda, Yosuke; Okura, Fumio; Ito, Jun; Okada, Satoshi; Kinoshita, Toshinori; Tsuji, Hiroyuki; Saisho, Daisuke (15 April 2020). "Training instance segmentation neural network with synthetic datasets for crop seed phenotyping" (in en). Communications Biology 3 (1): 173. doi:10.1038/s42003-020-0905-5. ISSN 2399-3642. PMC PMC7160130. PMID 32296118. https://www.nature.com/articles/s42003-020-0905-5. 
  94. Atanbori, John; French, Andrew P.; Pridmore, Tony P. (1 February 2020). "Towards infield, live plant phenotyping using a reduced-parameter CNN" (in en). Machine Vision and Applications 31 (1-2): 2. doi:10.1007/s00138-019-01051-7. ISSN 0932-8092. PMC PMC6917635. PMID 31894176. http://link.springer.com/10.1007/s00138-019-01051-7. 
  95. Fahlgren, Noah; Gehan, Malia A; Baxter, Ivan (1 April 2015). "Lights, camera, action: high-throughput plant phenotyping is ready for a close-up" (in en). Current Opinion in Plant Biology 24: 93–99. doi:10.1016/j.pbi.2015.02.006. https://linkinghub.elsevier.com/retrieve/pii/S1369526615000266. 
  96. Coppens, Frederik; Wuyts, Nathalie; Inzé, Dirk; Dhondt, Stijn (1 August 2017). "Unlocking the potential of plant phenotyping data through integration and data-driven approaches" (in en). Current Opinion in Systems Biology 4: 58–63. doi:10.1016/j.coisb.2017.07.002. PMC PMC7477990. PMID 32923745. https://linkinghub.elsevier.com/retrieve/pii/S2452310017300069. 
  97. Rosenqvist, Eva; Großkinsky, Dominik K.; Ottosen, Carl-Otto; van de Zedde, Rick (28 February 2019). "The Phenotyping Dilemma—The Challenges of a Diversified Phenotyping Community". Frontiers in Plant Science 10: 163. doi:10.3389/fpls.2019.00163. ISSN 1664-462X. PMC PMC6403123. PMID 30873188. https://www.frontiersin.org/article/10.3389/fpls.2019.00163/full. 
  98. Hassani-Pak, Keywan; Singh, Ajit; Brandizi, Marco; Hearnshaw, Joseph; Amberkar, Sandeep; Phillips, Andrew L.; Doonan, John H.; Rawlings, Chris (3 April 2020) (in en). KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species. doi:10.1101/2020.04.02.017004. http://biorxiv.org/lookup/doi/10.1101/2020.04.02.017004. 
  99. Bechhofer, Sean; De Roure, David; Gamble, Matthew; Goble, Carole; Buchan, Iain (6 July 2010). "Research Objects: Towards Exchange and Reuse of Digital Knowledge" (in en). Nature Precedings. doi:10.1038/npre.2010.4626.1. ISSN 1756-0357. https://www.nature.com/articles/npre.2010.4626.1. 
  100. Tiwari, Krishna; Kananathan, Sarubini; Roberts, Matthew G; Meyer, Johannes P; Sharif Shohan, Mohammad Umer; Xavier, Ashley; Maire, Matthieu; Zyoud, Ahmad et al. (10 August 2020) (in en). Reproducibility in systems biology modelling. doi:10.1101/2020.08.07.239855. http://biorxiv.org/lookup/doi/10.1101/2020.08.07.239855. 
  101. Stanford, Natalie J; Wolstencroft, Katherine; Golebiewski, Martin; Kania, Renate; Juty, Nick; Tomlinson, Christopher; Owen, Stuart; Butcher, Sarah et al. (1 December 2015). "The evolution of standards and data management practices in systems biology" (in en). Molecular Systems Biology 11 (12): 851. doi:10.15252/msb.20156053. ISSN 1744-4292. PMC PMC4704484. PMID 26700851. https://www.embopress.org/doi/10.15252/msb.20156053. 
  102. Jones, D. Marc; Wells, Rachel; Pullen, Nick; Trick, Martin; Irwin, Judith A.; Morris, Richard J. (1 October 2018). "Spatio‐temporal expression dynamics differ between homologues of flowering time genes in the allopolyploid Brassica napus" (in en). The Plant Journal 96 (1): 103–118. doi:10.1111/tpj.14020. ISSN 0960-7412. PMC PMC6175450. PMID 29989238. https://onlinelibrary.wiley.com/doi/10.1111/tpj.14020. 
  103. Jones, D. Marc; Olson, Tjelvar S. G.; Pullen, Nick; Wells, Rachel; Irwin, Judith A.; Morris, Richard J. (1 December 2020). "The oilseed rape developmental expression resource: a resource for the investigation of gene expression dynamics during the floral transition in oilseed rape" (in en). BMC Plant Biology 20 (1): 344. doi:10.1186/s12870-020-02509-x. ISSN 1471-2229. PMC PMC7374918. PMID 32693783. https://bmcplantbiol.biomedcentral.com/articles/10.1186/s12870-020-02509-x. 
  104. 104.0 104.1 Calderwood, Alexander; Hepworth, Jo; Woodhouse, Shannon; Bilham, Lorelei; Jones, D. Marc; Tudor, Eleri; Ali, Mubarak; Dean, Caroline et al. (27 August 2020) (in en). Comparative transcriptomics identifies differences in the regulation of the floral transition between Arabidopsis and Brassica rapa cultivars. doi:10.1101/2020.08.26.266494. http://biorxiv.org/lookup/doi/10.1101/2020.08.26.266494. 
  105. 105.0 105.1 Calderwood, Alexander; Lloyd, Andrew; Hepworth, Jo; Tudor, Eleri H.; Jones, D. Marc; Woodhouse, Shannon; Bilham, Lorelei; Chinoy, Catherine et al. (1 March 2021). "Total FLC transcript dynamics from divergent paralogue expression explains flowering diversity in Brassica napus" (in en). New Phytologist 229 (6): 3534–3548. doi:10.1111/nph.17131. ISSN 0028-646X. PMC PMC7986421. PMID 33289112. https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.17131. 
  106. "ORDER: Oilseed Rape Developmental Expression Resource". John Innes Centre. 2023. Archived from the original on 29 September 2023. https://web.archive.org/web/20230929122811/https://order.jic.ac.uk/. 
  107. Levins, R. (1966). "The Strategy of Model Building in Population Biology". American Scientist 54 (4): 421–31. https://www.jstor.org/stable/27836590. 
  108. Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K. et al. (2013). "Learning with Noisy Labels". NeurIPS Proceedings: 1–9. https://papers.nips.cc/paper_files/paper/2013/hash/3871bd64012152bfb53fdf04b401193f-Abstract.html. 
  109. Giuffrida, M. Valerio; Chen, Feng; Scharr, Hanno; Tsaftaris, Sotirios A. (1 December 2018). "Citizen crowds and experts: observer variability in image-based plant phenotyping" (in en). Plant Methods 14 (1): 12. doi:10.1186/s13007-018-0278-7. ISSN 1746-4811. PMC PMC5806457. PMID 29449872. https://plantmethods.biomedcentral.com/articles/10.1186/s13007-018-0278-7. 
  110. 110.0 110.1 Williamson, Hugh F.; Leonelli, Sabina, eds. (2023) (in en). Towards Responsible Plant Data Linkage: Data Challenges for Agricultural Research and Development. Cham: Springer International Publishing. doi:10.1007/978-3-031-13276-6. ISBN 978-3-031-13275-9. https://link.springer.com/10.1007/978-3-031-13276-6. 
  111. Jefferson, Osmat A; Köllhofer, Deniz; Ehrich, Thomas H; Jefferson, Richard A (1 November 2015). "The ownership question of plant gene and genome intellectual properties" (in en). Nature Biotechnology 33 (11): 1138–1143. doi:10.1038/nbt.3393. ISSN 1087-0156. https://www.nature.com/articles/nbt.3393. 
  112. Murtagh, M.J.; Demir, I.; Jenkings, K.N.; Wallace, S.E.; Murtagh, B.; Boniol, M.; Bota, M.; Laflamme, P. et al. (2012). "Securing the Data Economy: Translating Privacy and Enacting Security in the Development of DataSHIELD" (in en). Public Health Genomics 15 (5): 243–253. doi:10.1159/000336673. ISSN 1662-4246. https://www.karger.com/Article/FullText/336673. 
  113. Roca, T.; Letouzé, E. (18 July 2016). "Open algorithms: A new paradigm for using private data for social good". Devex. https://www.devex.com/news/open-algorithms-a-new-paradigm-for-using-private-data-for-social-good-88434. 
  114. Carney, Judith Ann (2001). Black rice: the African origins of rice cultivation in the Americas. Cambridge, Mass. London: Harvard University Press. ISBN 978-0-674-00834-2. 
  115. Hilgartner, Stephen (2017). Reordering life: knowledge and control in the genomics revolution. Inside technology. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03586-6. 
  116. Wiseman, L.; Sanderson, J.; Robb, L. (2018). "Rethinking Ag Data Ownership". Farm Policy Journal 15 (1): 71–77. https://research-repository.griffith.edu.au/bitstream/handle/10072/382094/WisemanPUB6097.pdf?sequence=1. 
  117. Henkhaus, Natalie; Bartlett, Madelaine; Gang, David; Grumet, Rebecca; Jordon‐Thaden, Ingrid; Lorence, Argelia; Lyons, Eric; Miller, Samantha et al. (1 August 2020). "Plant science decadal vision 2020–2030: Reimagining the potential of plants for a healthy and sustainable future" (in en). Plant Direct 4 (8): e00252. doi:10.1002/pld3.252. ISSN 2475-4455. PMC PMC7459197. PMID 32904806. https://onlinelibrary.wiley.com/doi/10.1002/pld3.252. 
  118. Stevens, R.; Taylor, V.; Nichols, J. et al. (2019). "AI for Science: Report on the Department of Energy (DOE) Town Halls on Artificial Intelligence (AI) for Science". Argonne National Laboratory. https://anl.app.box.com/s/bpp2xokglo8z8qiw7qzmgtsnmhree4p0. 
  119. University of Saskatchewan (18 August 2020). "Global Wheat Detection". Kaggle. https://www.kaggle.com/c/global-wheat-detection. 
  120. The Open Research Data Task Force (July 2018). "Realising the potential: Final report of the Open Research Data Task Force" (PDF). https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/775006/Realising-the-potential-ORDTF-July-2018.pdf. 
  121. Government of the United Kingdom (30 October 2015). "Agricultural technologies (agri-tech) strategy". https://www.gov.uk/government/collections/agricultural-technologies-agri-tech-strategy. 
  122. "Agrimetrics". Agrimetrics Ltd. 2023. https://www.agrimetrics.co.uk/. 
  123. "Agri-Tech Centres". Innovate UK. 2023. https://agritechcentres.com/. 
  124. "Semantics - The Way to Reconcile Points of View and Data" (JPG). RD Alliance. https://www.rd-alliance.org/system/files/documents/SEMANTICS-RICE_poster_LD.jpg. 
  125. Leonelli, Sabina; Ankeny, Rachel A. (1 July 2015). "Repertoires: How to Transform a Project into a Research Community" (in en). BioScience 65 (7): 701–708. doi:10.1093/biosci/biv061. ISSN 1525-3244. PMC PMC4580990. PMID 26412866. http://academic.oup.com/bioscience/article/65/7/701/258233/Repertoires-How-to-Transform-a-Project-into-a. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation and updates to spelling and grammar (including to the title). In some cases, important information was missing from the references, and that information was added. The original article lists references in alphabetical order; this version lists them in order of appearance, by design. Many of the footnotes in the original are URLs essentially acting as references; for this version, a majority of the original footnotes were turned into full citations, making the citation list longer and footnotes shorter. The original put examples of AI opportunities into Boxes 1–3; this version converted that content into inline paragraphs to keep the text flow smooth. In the original, the terms "Global South" and "Global North" are used to describe "research undertaken outside of resource-intensive commercial sites" and research from areas of the world that are resource-intensive, respectively; these terms are largely unwarranted as synonyms of "Third World" and "First World." For this version, "low- and middle-income nations" is used in place of "Global South," guided by the authors' use of that phrasing in the prior paragraph on agricultural monitoring (originally Box 3), and "high-income nations" is used for "Global North." Tables 1 and 2 in the original are swapped in this version, based on order of mention. Multiple citation URLs (e.g., link to the ORDER database, link to the CGIAR Big Data in Agriculture database, link to the European Commission report) were dead at the time of loading, and archived versions or alternate versions of the URLs were used for this version.