In the dynamic field of Digital Humanities (DH), research methodologies have traditionally focused on textual data. The introduction of deep learning, however, has expanded this focus by facilitating the automated analysis and labelling of visual materials. Although powerful, these initial methods required large training datasets. The field experienced another shift with the development of multimodal deep learning architectures such as Contrastive Language-Image Pre-training (CLIP). More recent advances have integrated GPT-inspired conversational interfaces for visual analysis, significantly broadening the scope of multimodal research. As a result, humanists now stand on the brink of fully embracing computational visual analysis.
This keynote aims to spotlight these advancements and to probe their alignment with multimodal theory. In doing so, it seeks to understand their ramifications for humanistic engagement with visual media. Tracing this alignment brings us to a crossroads, where we must grapple with pressing questions of practicality, adaptability, and choice. Can the humanistic community keep pace with these swift technological evolutions? And, more fundamentally, is there an imperative to keep pace at all, or should we gravitate towards more established techniques that offer greater control and explainability?