Swiss Art Research Infrastructure (SARI)

A survey of the computer vision landscape: models, pipelines, data formats and provenance

Computer vision (CV) plays a pivotal role in the digital transformation of the galleries, libraries, archives and museums (GLAM) sector, because acquiring any document, whether visual or textual, necessarily passes through an intermediate image format (i.e. digitisation images). Applying CV technologies to GLAM materials offers considerable potential both for research and for the transformation of archival practices, especially by helping curators create and enrich descriptive metadata for cultural heritage documents. At the same time, CV models have evolved rapidly over the last decade, with the emergence of multimodal and foundation models creating new research opportunities in image-heavy domains such as art or architectural history. However, while semantic standards such as CIDOC-CRM are becoming widely adopted by GLAM institutions for describing their collections, their use for modelling the data and annotations produced by CV models remains limited.

In light of this situation, the objective of the present survey, conducted within the framework of the ORDEA project, is twofold. Firstly, it will form the basis for a user-oriented workshop exploring use cases for the surveyed CV models in art history and architectural history research. Secondly, it will inform the creation of a set of "semantic recipes" for modelling provenance in data resulting from CV pipelines.

This survey is organised into three main sections. Section 1 (Tasks, pipelines and applications) provides an overview of the CV tasks most relevant for GLAM data (e.g. object detection and image similarity), and showcases selected examples of existing CV pipelines that integrate multiple tasks to produce innovative user-facing applications (examples include GallicaPix, ArtVision and Bilder der Schweiz Online). Section 2 (CV Models) surveys existing computer vision models, focusing specifically on their application to GLAM data in a research setting. Particular emphasis is placed on the latest advancements in the domain, including multimodal models and the emergence of pre-trained foundation models for computer vision. Lastly, Section 3 (Semantic modelling of provenance and annotations) reviews current approaches to the semantic modelling of digital provenance and discusses the primary affordances of the most widely used CV formats in terms of data representation.