MVRMLM 2024
Multimodal Video Retrieval and Multimodal Language Modelling
Co-located with the International Conference on Multimedia Retrieval (ICMR) 2024
Phuket, Thailand, June 10-14, 2024
Welcome
Videos are being generated in large numbers and volumes by the public, by video conferencing tools (e.g. Teams, Zoom, Webex), and by TV broadcasters such as the BBC. These videos may be stored in public archives such as YouTube or in proprietary archives such as the BBC Rewind archive. Searching such archives without relying on pre-defined metadata such as titles, tags, and viewer notes remains a challenge.
The workshop proposers are undertaking an EPSRC-funded research project, Multimodal Video Search by Examples (MVSE), to tackle this challenge. The project is motivated by a use case from the BBC: “locate clips with person X in setting Y talking about subject Z”. Such a query is difficult to answer with keyword-based search. The MVSE project therefore researches image and video analysis techniques that enable search with a face image and an audio recording of X, a scene image of Y, and a text phrase describing subject Z, where the modalities are person (face or voice), context, and topic.
Insights have been gained, a demonstrator has been built, and a video archive has been studied in depth. The proposers believe it is now time to share their vision, findings, and techniques with the research community, in order to accelerate research on, and the impact of, an important research area: multimodal video search by examples.
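As a concrete illustration of the query model above, the sketch below scores archive clips against a multi-example query by weighted late fusion of per-modality cosine similarities. This is purely illustrative and not the MVSE implementation: the random vectors stand in for real face, voice, scene, and topic embeddings, and the fusion weights are arbitrary choices.

```python
# Illustrative sketch of multimodal video search by examples via weighted
# late fusion. NOT the MVSE implementation: the embeddings below are random
# stand-ins for real encoder outputs, and the weights are arbitrary.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def score_clip(query: dict, clip: dict, weights: dict) -> float:
    # Weighted late fusion of per-modality similarities. Modalities
    # missing from the query or the clip are simply skipped.
    score, total = 0.0, 0.0
    for modality, q_emb in query.items():
        if modality in clip:
            w = weights.get(modality, 1.0)
            score += w * cosine(q_emb, clip[modality])
            total += w
    return score / total if total else 0.0


# Toy run: rank two clips against a "person X, setting Y, subject Z"
# query, with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)


def emb() -> np.ndarray:
    return rng.standard_normal(128)


query = {"face": emb(), "voice": emb(), "scene": emb(), "topic": emb()}
archive = [{m: emb() for m in query} for _ in range(2)]
weights = {"face": 1.0, "voice": 1.0, "scene": 0.5, "topic": 0.5}

ranking = sorted(range(len(archive)),
                 key=lambda i: score_clip(query, archive[i], weights),
                 reverse=True)
print("clip ranking:", ranking)
```

In a real system the per-modality embeddings would come from dedicated encoders and be indexed ahead of time, so that only the fusion and ranking steps run at query time.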
Topics:
- Image and video retrieval
- Image and video segmentation
- Image and video embedding
- Multimodal information retrieval
- Multimodal language modelling
- Human-centred AI for multimedia search
Keynote Speakers
Assoc. Prof. Klaus Schoeffmann (University of Klagenfurt, Austria) & Prof. Cathal Gurrin (Dublin City University, Ireland)
Bio: Dr. Klaus Schoeffmann is an Associate Professor at the Institute of Information Technology (ITEC) at Klagenfurt University, Austria, where he received his habilitation in Computer Science in 2015. He holds a PhD and an MSc in Computer Science. His research focuses on video content understanding, deep learning, computer vision, multimedia retrieval, and interactive multimedia. He has secured research funding of over €2M (from FWF, KWF, and industrial partners) and has graduated six PhD students and supervised numerous MSc dissertations. He has a Google Scholar h-index of 33 from over 3,500 citations to his research works, and he is founder and co-organizer of the annual Video Browser Showdown (VBS) and the annual Lifelog Search Challenge (LSC). He is a member of the IEEE and the ACM, and a regular reviewer for international conferences and journals in the fields of multimedia and medical imaging. He has been the program co-chair of MMM 2021, CBMI 2021, ACM ICMR 2020, MMM 2018, and CBMI 2013, the demo & video co-chair of ACM MM 2020, the open-source software competition chair of ACM MM 2019, and the general co-chair of MMM 2012, and will be the general co-chair of ACM ICMR 2024 and ACM MM 2025.
Bio: Dr. Cathal Gurrin is a Professor of Computer Science and a lifelogger. He is the Head of the ADAPT Centre at Dublin City University, a Funded Investigator of the Insight Centre, and the director of the Human Media Archives research group. He was previously the deputy head of the School of Computing. His interests include personal analytics and lifelogging. He publishes in information retrieval (IR), with a particular focus on how people access information from pervasive computing devices. He has captured a continuous personal digital memory since 2006 using a wearable camera and has logged hundreds of millions of other sensor readings.
Title: From Concepts to Embeddings: Charting the Use of AI in Digital Video and Lifelog Search over the Last Decade
Abstract: In the past decade, the field of interactive multimedia retrieval has undergone a transformative evolution driven by advances in artificial intelligence (AI). This keynote talk will explore the journey from early concept-based retrieval systems to the sophisticated embedding-based techniques that dominate the landscape today. By examining the progression of such AI-driven approaches at both the VBS (Video Browser Showdown) and the LSC (Lifelog Search Challenge), we will highlight the pivotal role of comparative benchmarking in accelerating innovation and establishing performance standards. We will also look ahead to potential future developments in interactive multimedia retrieval benchmarking, including emerging trends, the integration of multimodal data, and the future comparative benchmarking challenges within our community.
Prof. Huiyu Zhou, University of Leicester
Bio: Dr. Huiyu Zhou received a Bachelor of Engineering degree in Radio Technology from Huazhong University of Science and Technology, China, and a Master of Science degree in Biomedical Engineering from the University of Dundee, United Kingdom. He was awarded a Doctor of Philosophy degree in Computer Vision from Heriot-Watt University, Edinburgh, United Kingdom. Dr. Zhou is currently a full Professor at the School of Computing and Mathematical Sciences, University of Leicester, United Kingdom. He has published over 500 peer-reviewed papers in the field. His research has been, or is being, supported by the UK EPSRC, ESRC, AHRC, MRC, the EU, Innovate UK, the Royal Society, the British Heart Foundation, the Leverhulme Trust, the Puffin Trust, Alzheimer’s Research UK, Invest NI, and industry. Homepage: https://le.ac.uk/people/huiyu-zhou
Title: Video Understanding for Behavioural Analysis
Abstract: Video understanding has emerged as a powerful tool in behavioural analysis, offering innovative methodologies to capture and interpret complex behaviours from visual data. This talk explores the techniques used in video understanding, including machine learning, deep learning, and computer vision, and how they address the challenges in this field, such as accurately detecting and tracking multiple subjects, recognising subtle and nuanced behaviours, and managing large volumes of video data. The applications of video understanding extend across numerous sectors, from multimedia and healthcare to security, with the potential to revolutionise behavioural analysis and beyond.
Prof. Zhou will share his experience and insights in video understanding for behavioural analysis. He will present a case study on Parkinson’s disease (PD) diagnosis to demonstrate the capability of video understanding in healthcare. He will describe methodologies developed to analyse behaviours in both animals (e.g. mice) and humans. These include pioneering techniques for detecting and tracking single and multiple mice, recognising individual and social behaviours, conducting comprehensive social behaviour analysis, and distinguishing between normal and PD-afflicted mice by examining their interactions and movements. He will conclude with a vision for the future of video understanding in behavioural analysis, along with an outline of anticipated advancements in technology and methodology, the potential for broader applications, and the ongoing research efforts aimed at overcoming current limitations.
Prof. Mark Plumbley, University of Surrey
Bio: Prof. Mark Plumbley is Professor of Signal Processing at the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, in Guildford, UK. He is an expert on analysis and processing of audio, using a wide range of signal processing and machine learning methods. He led the first international data challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), and is a co-editor of the book “Computational Analysis of Sound Scenes and Events” (Springer, 2018). He currently holds a 5-year EPSRC Fellowship “AI for Sound” on automatic recognition of everyday sounds. He is a Member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing, and a Fellow of the IET and IEEE.
Title: Machine Learning for Everyday Sounds: Recognition, Captioning, Visualization, Separation and Generation of Audio
Abstract: The last few years have seen a rapid increase of interest in machine learning for everyday sounds. Starting a decade ago with acoustic scene classification and sound event detection, the challenges and workshops on Detection and Classification of Acoustic Scenes and Events (DCASE) have brought together researchers from academia and industry to establish a new research community. In this talk, I will highlight some of the recent work taking place in this area at the University of Surrey, including pretrained audio neural networks (PANNs), audio captioning, audio visualization, audio source separation, and audio generation (AudioLDM). I will also mention some cross-cutting issues such as dataset collection and algorithm efficiency, and discuss how we might design future audio machine learning applications for the benefit of people and society.
Programme
Time | Type | Speaker | Title |
---|---|---|---|
08:30-09:00 | Registration | | |
09:30-09:50 | Welcome and Opening Speech | Hui Wang, Queen’s University Belfast | Multimodal Video Search by Examples |
09:50-10:30 | Keynote 1 | Klaus Schoeffmann (University of Klagenfurt, Austria) & Cathal Gurrin (Dublin City University, Ireland) | From Concepts to Embeddings: Charting the Use of AI in Digital Video and Lifelog Search over the Last Decade |
10:30-11:00 | Conference Tea Break | | |
11:00-11:30 | Keynote 2 | Huiyu Zhou, University of Leicester | Video Understanding for Behavioural Analysis |
11:30-13:00 | Session 1 | | |
11:30-11:50 | SCUT | Wing Ng, South China University of Technology | Semi-supervised Concept Preserving Hashing for Image Retrieval in Non-stationary Data Environment |
11:50-12:10 | Surrey | Mohamed Faheem Thanveer, University of Surrey | Deep Fisher-Vector Descriptors for Image Retrieval and Scene Recognition |
12:10-12:40 | Cambridge | Mengjie Qian, University of Cambridge | Speaker Retrieval in the Wild |
12:40-13:00 | QUB | Abbas Haider, Queen’s University Belfast | Multi Modal Fusion for Video Retrieval based on CLIP Guide Feature Alignment |
13:00-14:00 | Conference Lunch | | |
14:00-15:00 | Keynote 3 | Mark Plumbley, University of Surrey | Machine Learning for Everyday Sounds: Recognition, Captioning, Visualization, Separation and Generation of Audio |
15:00-17:00 | Session 2 | | |
15:00-15:20 | | Dan Li, Southwest Jiaotong University | Optimal Reversible Color Transform for Multimedia Image and Video Processing |
15:20-15:40 | | Jiawen Zhang, Fujian Normal University | Hashing Orthogonal Constraint Loss for Multi-Label Image Retrieval |
15:40-16:00 | | Jiawen Zhang, Fujian Normal University | Multi-Proxy Deep Hashing for Image Retrieval |
16:00-16:20 | | Guangmin Li, Hubei Normal University | Multi-attention Fusion for Multimodal Sentiment Classification |
16:20-16:40 | | Ji Huang, Southwest Jiaotong University | Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion |
Important Dates
- Paper Submission: May 1, 2024 (Final – no further extension)
- Paper Notification: May 6, 2024
- Camera-Ready Submission: May 8, 2024
- Workshop: June 14, 2024
Submission Instructions
Paper format
All papers must be formatted according to the ACM proceedings style. LaTeX and Microsoft Word templates for this format are available from the ACM website. If you use LaTeX, please use sigconf.tex as the template.
Complying with double-blind review
In a “Double-blind Review” process, authors should not know the names of the reviewers of their papers, and reviewers should not know the names of the authors. Please prepare your paper in a way that preserves the anonymity of the authors, namely:
- Do not put your names under the title,
- Avoid using phrases such as “our previous work” when referring to earlier publications by the authors,
- Remove information that may identify the authors in the acknowledgements (e.g., co-workers and grant IDs),
- Check supplemental material for information that may identify the authors’ identity,
- Avoid providing links to Websites that identify the authors.
Length of the paper
Please ensure your submission is of appropriate length. Both short and long papers must not exceed 6 pages, including references.
All submissions can be made via EasyChair:
https://easychair.org/conferences/?conf=mvrmlm2024
Organizers
- Prof. Hui Wang, Queen’s University Belfast
- Prof. Josef Kittler, University of Surrey
- Prof. Mark Gales, University of Cambridge
- Dr. Rob Cooper, BBC
- Prof. Maurice Mulvenna, Ulster University
- Prof. Wing Ng, South China University of Technology
- Dr Yang Hua, Queen’s University Belfast
- Dr Richard Gault, Queen’s University Belfast
- Dr Abbas Haider, Queen’s University Belfast
- Dr Guanfeng Wu, Southwest Jiaotong University
Contact
For any questions regarding the MVRMLM2024 workshop, please contact the Workshop Chair, Prof Hui Wang (h.wang@qub.ac.uk).