MVRMLM 2024

Multimodal Video Retrieval and Multimodal Language Modelling

co-located with the International Conference on Multimedia Retrieval (ICMR) 2024
Phuket, Thailand, June 10-14, 2024

Welcome

Videos are being generated in enormous numbers and volumes: by the public, by video conferencing tools (e.g. Teams, Zoom, Webex), and by TV broadcasters such as the BBC. These videos may be stored in public archives such as YouTube or in proprietary archives such as the BBC Rewind Archive. Searching video archives without relying on pre-defined metadata such as titles, tags, and viewer notes remains a challenge.

The workshop proposers are undertaking an EPSRC-funded research project, Multimodal Video Search by Examples (MVSE), to tackle this challenge. The project is motivated by a use case from the BBC: “locate clips with person X in setting Y talking about subject Z”. This use case is difficult to answer with keyword-based search. The MVSE project conducts research on image and video analysis techniques that enable search with a face image and an audio sample of X, a scene image of Y, and a text phrase describing subject Z, where the modalities are person (face or voice), context, and topic.
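To make the search paradigm concrete, the sketch below shows one simple way such example-based multimodal queries could be scored: each query example is embedded, and candidate clips are ranked by a weighted (late) fusion of per-modality cosine similarities. This is a minimal illustration only, not the MVSE implementation; the embedding dimension, the random vectors standing in for encoder outputs, and the fusion weights are all placeholder assumptions.

    import numpy as np

    # Placeholder: in a real system each modality (face, voice, scene, topic)
    # would have its own pretrained encoder producing a fixed-size embedding.
    EMBED_DIM = 512
    MODALITIES = ["face", "voice", "scene", "topic"]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def score_clip(query_embs, clip_embs, weights):
        """Late fusion: weighted average of per-modality cosine similarities.
        Modalities absent from the query are simply skipped."""
        total, weight_sum = 0.0, 0.0
        for m, q in query_embs.items():
            total += weights[m] * cosine(q, clip_embs[m])
            weight_sum += weights[m]
        return total / weight_sum if weight_sum else 0.0

    # Toy archive of 3 clips, with random vectors standing in for real embeddings.
    rng = np.random.default_rng(0)
    archive = [{m: rng.normal(size=EMBED_DIM) for m in MODALITIES} for _ in range(3)]

    # A query: "person X (face + voice) in setting Y (scene) talking about Z (topic)".
    query = {m: rng.normal(size=EMBED_DIM) for m in MODALITIES}
    weights = {"face": 1.0, "voice": 1.0, "scene": 0.5, "topic": 0.5}

    ranked = sorted(range(len(archive)),
                    key=lambda i: score_clip(query, archive[i], weights),
                    reverse=True)
    print("Ranked clip indices:", ranked)

In practice the random vectors would come from pretrained per-modality encoders and the fusion weights would be tuned or learned; the sketch only fixes the shape of the idea.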

Insights have been gained, a demonstrator has been built, and a video archive has been studied in depth. The proposers believe it is time to share with the research community our vision, findings, and techniques, in order to accelerate the research and impact of an important research area: multimodal video search by examples.

Topics

  • Image and video retrieval
  • Image and video segmentation
  • Image and video embedding
  • Multimodal information retrieval
  • Multimodal language modelling
  • Human centred AI for multimedia search

Keynote Speakers

Assoc. Prof. Klaus Schoeffmann (University of Klagenfurt, Austria) & Prof. Cathal Gurrin (Dublin City University, Ireland)

Bio: Dr. Klaus Schoeffmann is an Associate Professor at the Institute of Information Technology (ITEC) at Klagenfurt University, Austria, where he received his habilitation in Computer Science in 2015. He holds a PhD and an MSc in Computer Science. His research focuses on video content understanding, deep learning, computer vision, multimedia retrieval, and interactive multimedia. He has secured research funding of over €2M (from FWF, KWF, and industrial partners) and has graduated 6 PhD students and supervised numerous MSc dissertations. He has a Google Scholar h-index of 33 from over 3,500 citations to his research works, and he is a founder and co-organizer of the annual Video Browser Showdown (VBS) and the annual Lifelog Search Challenge (LSC). He is a member of the IEEE and the ACM, and a regular reviewer for international conferences and journals in the fields of multimedia and medical imaging. He has been the program co-chair of MMM 2021, CBMI 2021, ACM ICMR 2020, MMM 2018, and CBMI 2013, the demo and video co-chair of ACM MM 2020, the open-source software competition chair of ACM MM 2019, and the general co-chair of MMM 2012, and he will be the general co-chair of ACM ICMR 2024 and ACM MM 2025.

Bio: Prof. Cathal Gurrin is a Professor of Computer Science and a lifelogger. He is the Head of the ADAPT Centre at Dublin City University, a Funded Investigator of the Insight Centre, and the director of the Human Media Archives research group. He was previously the deputy head of the School of Computing. His interests include personal analytics and lifelogging. He publishes in information retrieval (IR), with a particular focus on how people access information from pervasive computing devices. He has captured a continuous personal digital memory since 2006 using a wearable camera, and has logged hundreds of millions of other sensor readings.

Title: From Concepts to Embeddings: Charting the Use of AI in Digital Video and Lifelog Search over the Last Decade

Abstract: In the past decade, the field of interactive multimedia retrieval has undergone a transformative evolution driven by advances in artificial intelligence (AI). This keynote talk will explore the journey from early concept-based retrieval systems to the sophisticated embedding-based techniques that dominate the landscape today. By examining the progression of such AI-driven approaches at both the VBS (Video Browser Showdown) and the LSC (Lifelog Search Challenge), we will highlight the pivotal role of comparative benchmarking in accelerating innovation and establishing performance standards. We will also look ahead to potential future developments in interactive multimedia retrieval benchmarking, including emerging trends, the integration of multimodal data, and the comparative benchmarking challenges that lie ahead for our community.


Prof. Huiyu Zhou, University of Leicester

Bio: Prof. Huiyu Zhou received a Bachelor of Engineering degree in Radio Technology from Huazhong University of Science and Technology, China, and a Master of Science degree in Biomedical Engineering from the University of Dundee, United Kingdom. He was awarded a Doctor of Philosophy degree in Computer Vision by Heriot-Watt University, Edinburgh, United Kingdom. He is currently a full Professor at the School of Computing and Mathematical Sciences, University of Leicester, United Kingdom. He has published over 500 peer-reviewed papers in the field. His research has been supported by UK EPSRC, ESRC, AHRC, MRC, EU, Innovate UK, the Royal Society, the British Heart Foundation, the Leverhulme Trust, the Puffin Trust, Alzheimer’s Research UK, Invest NI, and industry. Homepage: https://le.ac.uk/people/huiyu-zhou

Title: Video Understanding for Behavioural Analysis

Abstract: Video understanding has emerged as a powerful tool in behavioural analysis, offering innovative methodologies to capture and interpret complex behaviours from visual data. This talk explores the techniques used in video understanding, including machine learning, deep learning, and computer vision, and how they address the challenges in this field, such as accurately detecting and tracking multiple subjects, recognising subtle and nuanced behaviours, and managing large volumes of video data. The applications of video understanding extend across numerous sectors, from multimedia and healthcare to security, with the potential to revolutionise behavioural analysis and beyond.

Prof. Zhou will share his experience and insights in video understanding for behavioural analysis. He will present a case study on Parkinson’s disease (PD) diagnosis to demonstrate the capability of video understanding in healthcare, and will describe methodologies developed to analyse behaviours in both animals (e.g., mice) and humans. These include pioneering techniques for detecting and tracking single and multiple mice, recognising individual and social behaviours, conducting comprehensive social behaviour analysis, and distinguishing between normal and PD-afflicted mice by examining their interactions and movements. He will conclude with a vision for the future of video understanding in behavioural analysis, outlining anticipated advancements in technology and methodology, the potential for broader applications, and ongoing research efforts aimed at overcoming current limitations.


Prof. Mark Plumbley, University of Surrey

Bio: Prof. Mark Plumbley is Professor of Signal Processing at the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, in Guildford, UK. He is an expert on analysis and processing of audio, using a wide range of signal processing and machine learning methods. He led the first international data challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), and is a co-editor of the book “Computational Analysis of Sound Scenes and Events” (Springer, 2018). He currently holds a 5-year EPSRC Fellowship “AI for Sound” on automatic recognition of everyday sounds. He is a Member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing, and a Fellow of the IET and IEEE.

Title: Machine Learning for Everyday Sounds: Recognition, Captioning, Visualization, Separation and Generation of Audio

Abstract: The last few years have seen a rapid increase of interest in machine learning for everyday sounds. Starting a decade ago with acoustic scene classification and sound event detection, the challenges and workshops on Detection and Classification of Acoustic Scenes and Events (DCASE) have brought together researchers from academia and industry to establish a new research community. In this talk, I will highlight some of the recent work taking place in this area at the University of Surrey, including pretrained audio neural networks (PANNs), audio captioning, audio visualization, audio source separation, and audio generation (AudioLDM). I will also mention some cross-cutting issues such as dataset collection and algorithm efficiency, and discuss how we might design future audio machine learning applications for the benefit of people and society.

Programme

Time | Type | Speaker | Title
08:30-09:00 | Registration | |
09:30-09:50 | Welcome and Opening Speech | Hui Wang, Queen’s University Belfast | Multimodal Video Search by Examples
09:50-10:30 | Keynote 1 | Klaus Schoeffmann (University of Klagenfurt, Austria) & Cathal Gurrin (Dublin City University, Ireland) | From Concepts to Embeddings: Charting the Use of AI in Digital Video and Lifelog Search over the Last Decade
10:30-11:00 | Conference Tea Break | |
11:00-11:30 | Keynote 2 | Huiyu Zhou, University of Leicester | Video Understanding for Behavioural Analysis
11:30-13:00 | Session 1 | |
11:30-11:50 | SCUT | Wing Ng, South China University of Technology | Semi-supervised Concept Preserving Hashing for Image Retrieval in Non-stationary Data Environment
11:50-12:10 | Surrey | Mohamed Faheem Thanveer, University of Surrey | Deep Fisher-Vector Descriptors for Image Retrieval and Scene Recognition
12:10-12:40 | Cambridge | Mengjie Qian, University of Cambridge | Speaker Retrieval in the Wild
12:40-13:00 | QUB | Abbas Haider, Queen’s University Belfast | Multi Modal Fusion for Video Retrieval based on CLIP Guide Feature Alignment
13:00-14:00 | Conference Lunch | |
14:00-15:00 | Keynote 3 | Mark Plumbley, University of Surrey | Machine Learning for Everyday Sounds: Recognition, Captioning, Visualization, Separation and Generation of Audio
15:00-17:00 | Session 2 | |
15:00-15:20 | | Dan Li, Southwest Jiaotong University | Optimal Reversible Color Transform for Multimedia Image and Video Processing
15:20-15:40 | | Jiawen Zhang, Fujian Normal University | Hashing Orthogonal Constraint Loss for Multi-Label Image Retrieval
15:40-16:00 | | Jiawen Zhang, Fujian Normal University | Multi-Proxy Deep Hashing for Image Retrieval
16:00-16:20 | | Guangmin Li, Hubei Normal University | Multi-attention Fusion for Multimodal Sentiment Classification
16:20-16:40 | | Ji Huang, Southwest Jiaotong University | Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion

Important Dates

  • Paper Submission: May 1, 2024 (Final – no further extension)
  • Paper Notification: May 6, 2024
  • Camera-Ready Submission: May 8, 2024
  • Workshop: June 14, 2024

Submission Instructions

Paper format

All papers must be formatted according to the ACM proceedings style. LaTeX and Microsoft Word templates for this format are available from the ACM website. If you use LaTeX, please use sigconf.tex as the template.

Complying with double-blind review

In a “Double-blind Review” process, authors should not know the names of the reviewers of their papers, and reviewers should not know the names of the authors. Please prepare your paper in a way that preserves the anonymity of the authors, namely:

  • Do not put your names under the title,
  • Avoid using phrases such as “our previous work” when referring to earlier publications by the authors,
  • Remove information that may identify the authors in the acknowledgements (e.g., co-workers and grant IDs),
  • Check supplemental material for information that may reveal the authors’ identity,
  • Avoid providing links to Websites that identify the authors.

Length of the paper

Please ensure your submission is of appropriate length. Both short and long papers must not exceed 6 pages, including references.

All submissions can be made via EasyChair:

https://easychair.org/conferences/?conf=mvrmlm2024

Organizers

  • Prof. Hui Wang, Queen’s University Belfast
  • Prof. Josef Kittler, University of Surrey
  • Prof. Mark Gales, University of Cambridge
  • Dr. Rob Cooper, BBC
  • Prof. Maurice Mulvenna, Ulster University
  • Prof. Wing Ng, South China University of Technology
  • Dr. Yang Hua, Queen’s University Belfast
  • Dr. Richard Gault, Queen’s University Belfast
  • Dr. Abbas Haider, Queen’s University Belfast
  • Dr. Guanfeng Wu, Southwest Jiaotong University

Contact

For any questions regarding the MVRMLM2024 workshop, please contact the Workshop Chair, Prof. Hui Wang (h.wang@qub.ac.uk).