An open access benchmark dataset of herbarium specimen images with label data


ICEDIG project results are more than just project deliverables; they also include other research outputs related to digitisation.

Lead author Mathias Dillen from the Meise Botanic Garden, in collaboration with colleagues from the ICEDIG consortium and beyond, recently published a data paper in Biodiversity Data Journal, for which they compiled a benchmark dataset of 1,800 herbarium specimen images with their corresponding transcribed data. The specimens were chosen to reflect multiple potential obstacles to transcription (e.g. differences in language, text format, specimen age and nomenclatural type status). Different methods of data transcription are currently under development, including crowdsourcing and artificial intelligence, but data extraction and interpretation still pose many difficulties.

This benchmark dataset of images may be used as a defined and documented set of herbarium specimens for the further development and optimisation of transcription methods. The specimen data and images are publicly available under a Creative Commons Zero (CC0) waiver, and the data are stored permanently online. A link is provided in the open access publication, which can be found here:



