Description
When is a machine learning model performing well? From a computer science perspective, this question can be answered quite simply with evaluation metrics. A statistically well-performing model forms the basis for any further research, in real-world as well as in humanities domains. However, domain-specific tasks and in-depth case studies, such as those in digital art history, often reveal specific biases and shortcomings, especially in pre-trained models. Additional domain-specific benchmarking is therefore often necessary to evaluate a model's performance. And suddenly, performing well is no longer only a statistical problem but a humanities problem as well: How representative is the dataset for art-historical research questions? What role does the canonicity of certain painters play in the training data as well as in the benchmarking dataset? How generalizable are the features of a domain as heterogeneous as art, especially in more nuanced subsets?
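To illustrate how an aggregate score can conceal exactly this canonicity problem, consider the following sketch. The labels and predictions are invented for demonstration, and the two artists merely stand in for a canonical and a less canonical painter: overall accuracy reaches 90%, yet the underrepresented class is recalled in only two out of ten cases.

    # Invented example: a per-class report exposes what the aggregate score hides.
    from sklearn.metrics import classification_report

    # 90 works by a canonical painter, 10 by a less canonical one (hypothetical).
    y_true = ["Monet"] * 90 + ["Vallotton"] * 10
    # The model predicts the canonical painter almost everywhere.
    y_pred = ["Monet"] * 88 + ["Vallotton"] * 2 + ["Monet"] * 8 + ["Vallotton"] * 2

    # Micro accuracy is 0.90, but recall for the minority class is only 0.20.
    print(classification_report(y_true, y_pred, digits=2))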
In this submission, we present a systematic benchmarking of various pre-trained state-of-the-art transformer models for the visual arts domain. Multimodal vision-language models such as CLIP have previously shown good performance on natural-image tasks, which motivates a systematic evaluation of their usability in the visual arts. Addressing the gap between statistics and art history, the benchmark is accompanied by a discussion of dataset- and architecture-specific challenges. With large, deeply annotated datasets lacking, the evaluation task often boils down to the available metadata, such as style, genre, and artist labels, thereby glossing over nuanced differences between sub-genres or artists and falling back on well-documented information already obvious to art historians. The quality of the image data may also affect the quality of a benchmark: museum databases, for example, often offer higher-quality images than large datasets such as WikiArt. Furthermore, the transformer architecture itself adds challenges, such as handling subject-specific terms in image descriptions. We address these problems with domain-specific tasks, proposing a framework for additional evaluation on a more qualitative, art-historical level.
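As a concrete illustration of the zero-shot setup underlying such metadata-based tasks, the following sketch classifies a single painting by style with a pre-trained CLIP model. The checkpoint, prompt template, image path, and label subset are illustrative assumptions, not the configuration of our benchmark.

    # Minimal zero-shot style classification with a pre-trained CLIP model.
    # Checkpoint, prompt template, labels, and image path are illustrative.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    styles = ["Impressionism", "Baroque", "Cubism", "Ukiyo-e"]  # hypothetical subset
    prompts = [f"a painting in the style of {s}" for s in styles]

    image = Image.open("painting.jpg")  # placeholder path
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds the image's similarity to each style prompt;
    # softmax turns it into a probability distribution over the label set.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(styles, probs.squeeze().tolist())))

Note that the choice of prompt template alone (e.g., "a painting in the style of ..." versus the bare style name) can shift such zero-shot results, which is one of the architecture-specific challenges the discussion above refers to.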