MVRMLM 2024

Multimodal Video Retrieval and Multimodal Language Modelling

co-located with the International Conference on Multimedia Retrieval (ICMR) 2024
Phuket, Thailand, June 10-14, 2024

Welcome

Videos are being generated in enormous numbers and volumes: by the public, by video conferencing tools (e.g. Teams, Zoom, Webex), and by TV broadcasters such as the BBC. These videos may be stored in public archives such as YouTube or in proprietary archives such as the BBC Rewind Archive. Searching video archives without relying on pre-defined metadata such as titles, tags, and viewer notes remains a challenge.

The workshop proposers are undertaking an EPSRC-funded research project, Multimodal Video Search by Examples (MVSE), to tackle this challenge. The project is motivated by a use case from the BBC: “locate clips with person X in setting Y talking about subject Z”. This use case is difficult to answer with keyword-based search. The MVSE project conducts research on image and video analysis techniques that enable search with a face image and an audio sample of X, a scene image of Y, and a text phrase describing subject Z, where the modalities are person (face or voice), context, and topic.
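To make the search paradigm concrete, the sketch below shows one simple way such example-based multimodal queries could be scored: each query example is embedded, and candidate clips are ranked by a weighted (late) fusion of per-modality cosine similarities. This is a minimal illustration only, not the MVSE implementation; the embedding dimension, the random vectors standing in for encoder outputs, and the fusion weights are all placeholder assumptions.

    import numpy as np

    # Placeholder: in a real system each modality (face, voice, scene, topic)
    # would have its own pretrained encoder producing a fixed-size embedding.
    EMBED_DIM = 512
    MODALITIES = ["face", "voice", "scene", "topic"]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def score_clip(query_embs, clip_embs, weights):
        """Late fusion: weighted average of per-modality cosine similarities.
        Modalities absent from the query are simply skipped."""
        total, weight_sum = 0.0, 0.0
        for m, q in query_embs.items():
            total += weights[m] * cosine(q, clip_embs[m])
            weight_sum += weights[m]
        return total / weight_sum if weight_sum else 0.0

    # Toy archive of 3 clips, with random vectors standing in for real embeddings.
    rng = np.random.default_rng(0)
    archive = [{m: rng.normal(size=EMBED_DIM) for m in MODALITIES} for _ in range(3)]

    # A query: "person X (face + voice) in setting Y (scene) talking about Z (topic)".
    query = {m: rng.normal(size=EMBED_DIM) for m in MODALITIES}
    weights = {"face": 1.0, "voice": 1.0, "scene": 0.5, "topic": 0.5}

    ranked = sorted(range(len(archive)),
                    key=lambda i: score_clip(query, archive[i], weights),
                    reverse=True)
    print("Ranked clip indices:", ranked)

In practice the random vectors would come from pretrained per-modality encoders and the fusion weights would be tuned or learned; the sketch only fixes the shape of the idea.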

Insights have been gained, a demonstrator has been built, and a video archive has been studied in depth. The proposers believe it is time to share with the research community our vision, findings, and techniques, in order to accelerate the research and impact of an important research area: multimodal video search by examples.

Topics

  • Image and video retrieval
  • Image and video segmentation
  • Image and video embedding
  • Multimodal information retrieval
  • Multimodal language modelling
  • Human centred AI for multimedia search

Keynote Speakers

Assoc. Prof. Klaus Schoeffmann (University of Klagenfurt, Austria) & Prof. Cathal Gurrin (Dublin City University, Ireland)

Bio: Dr. Klaus Schoeffmann is an Associate Professor at the Institute of Information Technology (ITEC) at Klagenfurt University, Austria, where he received his habilitation in Computer Science in 2015. He holds a PhD and an MSc in Computer Science. His research focuses on video content understanding, deep learning, computer vision, multimedia retrieval, and interactive multimedia. He has secured research funding of over €2M (from FWF, KWF, and industrial partners) and has graduated 6 PhD students and supervised numerous MSc dissertations. He has a Google Scholar h-index of 33 from over 3,500 citations to his research works, and he is a founder and co-organizer of the annual Video Browser Showdown (VBS) and the annual Lifelog Search Challenge (LSC). He is a member of the IEEE and the ACM, and a regular reviewer for international conferences and journals in the fields of multimedia and medical imaging. He has been the program co-chair of MMM 2021, CBMI 2021, ACM ICMR 2020, MMM 2018, and CBMI 2013, the demo and video co-chair of ACM MM 2020, the open-source software competition chair of ACM MM 2019, and the general co-chair of MMM 2012, and he will be the general co-chair of ACM ICMR 2024 and ACM MM 2025.

Bio: Prof. Cathal Gurrin is a Professor of Computer Science and a lifelogger. He is the Head of the ADAPT Centre at Dublin City University, a Funded Investigator of the Insight Centre, and the director of the Human Media Archives research group. He was previously the deputy head of the School of Computing. His interests include personal analytics and lifelogging. He publishes in information retrieval (IR), with a particular focus on how people access information from pervasive computing devices. He has captured a continuous personal digital memory since 2006 using a wearable camera, and has logged hundreds of millions of other sensor readings.

Title: From Concepts to Embeddings: Charting the Use of AI in Digital Video and Lifelog Search over the Last Decade

Abstract: In the past decade, the field of interactive multimedia retrieval has undergone a transformative evolution driven by advances in artificial intelligence (AI). This keynote talk will explore the journey from early concept-based retrieval systems to the sophisticated embedding-based techniques that dominate the landscape today. By examining the progression of such AI-driven approaches at both the VBS (Video Browser Showdown) and the LSC (Lifelog Search Challenge), we will highlight the pivotal role of comparative benchmarking in accelerating innovation and establishing performance standards. We will also look ahead to potential future developments in interactive multimedia retrieval benchmarking, including emerging trends, the integration of multimodal data, and the comparative benchmarking challenges that lie ahead for our community.


Prof. Huiyu Zhou, University of Leicester

Bio: Prof. Huiyu Zhou received a Bachelor of Engineering degree in Radio Technology from Huazhong University of Science and Technology, China, and a Master of Science degree in Biomedical Engineering from the University of Dundee, United Kingdom. He was awarded a Doctor of Philosophy degree in Computer Vision by Heriot-Watt University, Edinburgh, United Kingdom. He is currently a full Professor at the School of Computing and Mathematical Sciences, University of Leicester, United Kingdom. He has published over 500 peer-reviewed papers in the field. His research has been supported by UK EPSRC, ESRC, AHRC, MRC, EU, Innovate UK, the Royal Society, the British Heart Foundation, the Leverhulme Trust, the Puffin Trust, Alzheimer’s Research UK, Invest NI, and industry. Homepage: https://le.ac.uk/people/huiyu-zhou

Title: Video Understanding for Behavioural Analysis

Abstract: Video understanding has emerged as a powerful tool in behavioural analysis, offering innovative methodologies to capture and interpret complex behaviours from visual data. This talk explores the techniques used in video understanding, including machine learning, deep learning, and computer vision, and how they address the challenges in this field, such as accurately detecting and tracking multiple subjects, recognising subtle and nuanced behaviours, and managing large volumes of video data. The applications of video understanding extend across numerous sectors, from multimedia and healthcare to security, with the potential to revolutionise behavioural analysis and beyond.

Prof. Zhou will share his experience and insights in video understanding for behavioural analysis. He will present a case study on Parkinson’s disease (PD) diagnosis to demonstrate the capability of video understanding in healthcare, and will describe methodologies developed to analyse behaviours in both animals (e.g., mice) and humans. These include pioneering techniques for detecting and tracking single and multiple mice, recognising individual and social behaviours, conducting comprehensive social behaviour analysis, and distinguishing between normal and PD-afflicted mice by examining their interactions and movements. He will conclude with a vision for the future of video understanding in behavioural analysis, outlining anticipated advancements in technology and methodology, the potential for broader applications, and ongoing research efforts aimed at overcoming current limitations.


Prof. Mark Plumbley, University of Surrey

Bio: Prof. Mark Plumbley is Professor of Signal Processing at the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, in Guildford, UK. He is an expert on analysis and processing of audio, using a wide range of signal processing and machine learning methods. He led the first international data challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), and is a co-editor of the book “Computational Analysis of Sound Scenes and Events” (Springer, 2018). He currently holds a 5-year EPSRC Fellowship “AI for Sound” on automatic recognition of everyday sounds. He is a Member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing, and a Fellow of the IET and IEEE.

Title: Machine Learning for Everyday Sounds: Recognition, Captioning, Visualization, Separation and Generation of Audio

Abstract: The last few years have seen a rapid increase of interest in machine learning for everyday sounds. Starting a decade ago with acoustic scene classification and sound event detection, the challenges and workshops on Detection and Classification of Acoustic Scenes and Events (DCASE) have brought together researchers from academia and industry to establish a new research community. In this talk, I will highlight some of the recent work taking place in this area at the University of Surrey, including pretrained audio neural networks (PANNs), audio captioning, audio visualization, audio source separation, and audio generation (AudioLDM). I will also mention some cross-cutting issues such as dataset collection and algorithm efficiency, and discuss how we might design future audio machine learning applications for the benefit of people and society.

Programme

Time | Type | Speaker | Title
08:30-09:00 | Registration | |
09:30-09:50 | Welcome and Opening Speech | Hui Wang, Queen’s University Belfast | Multimodal Video Search by Examples
09:50-10:30 | Keynote 1 | Klaus Schoeffmann (University of Klagenfurt, Austria) & Cathal Gurrin (Dublin City University, Ireland) | From Concepts to Embeddings: Charting the Use of AI in Digital Video and Lifelog Search over the Last Decade
10:30-11:00 | Conference Tea Break | |
11:00-11:30 | Keynote 2 | Huiyu Zhou, University of Leicester | Video Understanding for Behavioural Analysis
11:30-13:00 | Session 1 | |
11:30-11:50 | SCUT | Wing Ng, South China University of Technology | Semi-supervised Concept Preserving Hashing for Image Retrieval in Non-stationary Data Environment
11:50-12:10 | Surrey | Mohamed Faheem Thanveer, University of Surrey | Deep Fisher-Vector Descriptors for Image Retrieval and Scene Recognition
12:10-12:40 | Cambridge | Mengjie Qian, University of Cambridge | Speaker Retrieval in the Wild
12:40-13:00 | QUB | Abbas Haider, Queen’s University Belfast | Multi Modal Fusion for Video Retrieval based on CLIP Guide Feature Alignment
13:00-14:00 | Conference Lunch | |
14:00-15:00 | Keynote 3 | Mark Plumbley, University of Surrey | Machine Learning for Everyday Sounds: Recognition, Captioning, Visualization, Separation and Generation of Audio
15:00-17:00 | Session 2 | |
15:00-15:20 | | Dan Li, Southwest Jiaotong University | Optimal Reversible Color Transform for Multimedia Image and Video Processing
15:20-15:40 | | Jiawen Zhang, Fujian Normal University | Hashing Orthogonal Constraint Loss for Multi-Label Image Retrieval
15:40-16:00 | | Jiawen Zhang, Fujian Normal University | Multi-Proxy Deep Hashing for Image Retrieval
16:00-16:20 | | Guangmin Li, Hubei Normal University | Multi-attention Fusion for Multimodal Sentiment Classification
16:20-16:40 | | Ji Huang, Southwest Jiaotong University | Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion

Important Dates

  • Paper Submission: May 1, 2024 (Final – no further extension)
  • Paper Notification: May 6, 2024
  • Camera-Ready Submission: May 8, 2024
  • Workshop: June 14, 2024

Submission Instructions

Paper format

All papers must be formatted according to the ACM proceedings style. LaTeX and Microsoft Word templates for this format are available from the ACM website. If you use LaTeX, please use sigconf.tex as the template.

Complying with double-blind review

In a “Double-blind Review” process, authors should not know the names of the reviewers of their papers, and reviewers should not know the names of the authors. Please prepare your paper in a way that preserves the anonymity of the authors, namely:

  • Do not put your names under the title,
  • Avoid using phrases such as “our previous work” when referring to earlier publications by the authors,
  • Remove information that may identify the authors in the acknowledgements (e.g., co-workers and grant IDs),
  • Check supplemental material for information that may reveal the authors’ identity,
  • Avoid providing links to Websites that identify the authors.

Length of the paper

Please ensure your submission is of appropriate length. Both short and long papers must not exceed 6 pages, including references.

All submissions can be made via EasyChair:

https://easychair.org/conferences/?conf=mvrmlm2024

Organizers

  • Prof. Hui Wang, Queen’s University Belfast
  • Prof. Josef Kittler, University of Surrey
  • Prof. Mark Gales, University of Cambridge
  • Dr. Rob Cooper, BBC
  • Prof. Maurice Mulvenna, Ulster University
  • Prof. Wing Ng, South China University of Technology
  • Dr. Yang Hua, Queen’s University Belfast
  • Dr. Richard Gault, Queen’s University Belfast
  • Dr. Abbas Haider, Queen’s University Belfast
  • Dr. Guanfeng Wu, Southwest Jiaotong University

Contact

For any questions regarding the MVRMLM2024 workshop, please contact the Workshop Chair, Prof. Hui Wang (h.wang@qub.ac.uk).