Continual Learning meets Multimodal Foundation Models:
Fundamentals and Advances

In conjunction with ACM MM 2024

1 November 2024 (09:00-17:30)

Location: Melbourne, Australia

Call For Papers

In recent years, with the advancement of multimodal foundation models (MMFMs), there has been growing interest in enhancing their generalization abilities through continual learning (CL), enabling them to process diverse data types, from text to visuals, and to continuously update their capabilities based on real-time inputs. Despite significant advances in both the theory and applications of continual learning, the community still faces serious challenges. Our workshop aims to provide a venue where academic researchers and industry practitioners can come together to discuss the principles, limitations, and applications of MMFMs in continual learning for multimedia, and to promote understanding of MMFMs in continual learning, innovative algorithms, and research on new multimodal technologies and applications.

Scope and Topics
Topics of interest include, but are not limited to:

  • Lifelong / Continual / Incremental / Online Learning
  • Few-shot & Transfer Learning related to Continual Learning
  • Applications and use-cases of Continual Learning
  • Meta-learning & Curriculum Learning & Active Learning
  • Reinforcement Learning and Robotics in Continual Learning
  • Ethical and Safety considerations for machines that can learn continuously
  • Continuous domain adaptation / Test-time adaptation
  • Vision / Sound / Speech / Language Foundation Models in any possible combination
  • Self / Semi / Weakly supervised training of MMFMs
  • Multi-task and Continual Learning for MMFMs
  • Efficient training and inference of MMFMs
  • Parameter-efficient fine-tuning, prompting, and adapters for MMFMs
  • Generative MMFMs (e.g., text-to-image / video / 3D generation)
  • Ethics, risks, and fairness of MMFMs
  • Benchmarks, scenarios, evaluation protocols, and metrics for the above topics

Keynote Speakers

Piotr Koniusz is a Principal Research Scientist in the Machine Learning Research Group at Data61/CSIRO and an Honorary Associate Professor at the Australian National University. He obtained a BSc in Telecommunications and Software Engineering in 2004 from the Warsaw University of Technology, Poland, and a PhD in Computer Vision in 2013 from CVSSP, University of Surrey, UK. From 2013 to 2015, he was a postdoctoral researcher with the LEAR team at INRIA, France. Dr. Koniusz’s research interests center on representation learning and learning-to-learn paradigms, with a focus on contrastive, incremental, and few-shot learning across various data modalities. His contributions to the field have been recognized with the Sang Uk Lee Best Student Paper Award at ACCV’22 and the Runner-up APRS/IAPR Best Student Paper Award at DICTA’22. He has served as a Workshop Program Co-Chair for NeurIPS’23.

Qianru Sun is an Associate Professor of Computer Science at the School of Computing and Information Systems (SCIS), Singapore Management University (SMU). From 2018 to 2019, she was a research fellow working with Prof. Tat-Seng Chua at the National University of Singapore and Prof. Dr. Bernt Schiele at the MPI for Informatics. From 2016 to 2018, she held the Lise Meitner Award Fellowship and worked with Prof. Dr. Bernt Schiele and Prof. Dr. Mario Fritz at the MPI for Informatics. She received her Ph.D. from Peking University in 2016, where her thesis was advised by Prof. Hong Liu. In 2014, she visited the research group of Prof. Tatsuya Harada at the University of Tokyo. Her research interests are computer vision and machine learning.

Prof. Gholamreza (Reza) Haffari is a distinguished academic at Monash University's Department of Data Science & Artificial Intelligence, where he also serves as Director of the Vision & Language Group. He holds a Ph.D. in Computer Science from Simon Fraser University and has extensively researched and developed generative artificial intelligence systems, with a focus on low-level perception and high-level reasoning from multimodal data sources. His work spans several crucial areas, including the continual knowledge alignment of large language models, the safety and alignment of AI systems with human values, and the development of LLM-based conversational agents. His innovative approaches in AI have earned him recognition such as the ARC Future Fellowship and multiple awards from prestigious associations. Prof. Haffari's leadership extends to his role as a senior committee member for numerous international AI conferences. His research, backed by substantial funding from organizations such as DARPA, Google, and Amazon, aims to develop trustworthy AI technologies in critical areas such as digital health and law. His ongoing projects and collaborations continue to set benchmarks in the AI research community, fostering developments that align technological advances with societal needs and values.

Dr. Lingqiao Liu is an Associate Professor at the School of Computer Science at The University of Adelaide, Australia. He is also an Academic Member of the Australian Institute for Machine Learning. His research spans machine learning, computer vision, and natural language processing. His primary objective is to develop practical machine learning systems that are both data-efficient and generalizable for real-world applications. His current research focuses on low-supervision machine learning, including semi-supervised learning, unsupervised learning, and few-shot/zero-shot learning. Additionally, he is interested in creating generalizable machine learning systems, exploring areas such as domain generalization and compositional generalization. His work has significant applications in computer vision, including dense prediction, fine-grained recognition, and content generation, as well as in natural language processing, particularly in low-resource NLP and the generalization of NLP systems. In recognition of his contributions, A/Prof Liu received the ARC DECRA (Discovery Early Career Researcher Award) and the University of Adelaide Research Fellowship in 2016.

Dr. Tongtong Wu is a Postdoctoral Research Fellow at Monash University, working with Prof. Reza Haffari; he holds a jointly supervised Ph.D. from Monash University and Southeast University. His research, which focuses on the co-evolution of LLMs, data, and knowledge, has attracted widespread interest from industry and received support, including a Monash Seed Grant. He has published over ten papers at conferences such as ICLR, ACL, EMNLP, AAAI, and IJCAI. He has long served as a program committee member for major conferences, including ICML, ICLR, NeurIPS, ACL ARR, ACM MM, and AAAI.

Program

The workshop includes four invited talks and four oral paper presentations, and runs as a full-day meeting.


09:00-09:05

Opening Remarks

09:05-09:50

Keynote: Adaptation Without Forgetting: Repurposing Foundation Models for Zero-Shot and Few-Shot Semantic Segmentation
Lingqiao Liu, The University of Adelaide

Abstract: Foundation vision models, trained either in a supervised or an unsupervised manner, possess extensive knowledge about diverse object appearances. These models are often adapted to new computer vision tasks, such as transitioning from classification to segmentation, by adding extra parameters. In practice, however, this adaptation often relies on a limited set of object categories, causing the system to overfit to the seen categories and to forget the foundation model's knowledge about other categories. In this talk, we present our recent efforts to address this challenge, focusing on zero-shot and few-shot semantic segmentation. Our findings demonstrate that parameter-efficient tuning, carefully designed loss functions, and specific inputs to the newly added module can significantly enhance performance compared to the straightforward extension of foundation models.

09:50-10:35

Keynote: Adapting Foundation Models: A Case Study on Remote Sensing Imagery
Qianru Sun, Singapore Management University

Abstract: Large visual models, such as CLIP and Stable Diffusion (SD), demonstrate remarkable performance in general image recognition and generation tasks. Their continual learning involves two aspects: enhancing their performance with more natural images as input, and adapting them to specialized image domains. Our research targets the latter, using remote sensing (RS) imagery as a use case. RS, which relies on specialized satellites, presents challenges in image annotation and suffers from data scarcity and class imbalance, especially in certain spectral bands. Adapting models in this domain often leads to strong biases, where features of major classes overshadow those of minor classes. To address this, we recently introduced debLoRA, a generic training approach compatible with various low-rank model adaptation methods (such as LoRA) that produces debiased features. In this talk, we will delve into this method and present the results achieved.

10:35-11:00

Morning Tea

11:00-11:45

Keynote: Evolving AI: Advancing Continual Learning in Large Language Models
Gholamreza (Reza) Haffari, Monash University
Tongtong Wu, Monash University

Abstract: Continual learning with large language models (LLMs) is crucial for enabling AI systems to adapt and evolve in real time, maintaining and enhancing knowledge without succumbing to catastrophic forgetting, thereby ensuring sustained operational efficiency and relevance. This talk explores the integration of continual learning with large language models across multimodal information sources. We begin by reviewing traditional continual learning, illustrating its application in text, image, and speech extraction and in multimodal knowledge graph construction. We then redefine continual learning for LLMs, focusing on overcoming catastrophic forgetting and enhancing knowledge retention through continual pre-training, instruction tuning, and alignment. Looking ahead, we discuss challenges such as data evolution and contamination, and propose innovations in architectures and learning paradigms, including the evolution of language agents and proactive continual learning.

11:45-12:30

Keynote: Zero- and Few-Shot Keypoint Detection: From Modulation to Multimodal Prompting
Piotr Koniusz, Australian National University

Abstract: Keypoint detection has been an important topic in computer vision for over 20 years. Early methods relied on unsupervised techniques such as Hessian or Harris corner detectors or SIFT interest points. Modern keypoint detection can be performed within a few-shot learning paradigm, where annotated support keypoints (e.g., paw, nose, ears, eyes) are detected in an unannotated query image. Applications of such keypoints include pose estimation, fine-grained recognition, and pose warping. In this talk, I will discuss our earlier work on few-shot keypoint detection (FSKD), which can generalize to unseen animal species (e.g., training on dogs, testing on cats) and unseen keypoint types (e.g., training on paws, testing on ears). I will also cover how saliency maps and DINO can enhance attention in keypoint detection, how the traditional modulation and detection stages can be streamlined into a single step, and how contrastive learning can improve performance. Finally, I will explain our recent work on multimodal (image, text) keypoint prompting with CLIP for generalized zero- and few-shot keypoint detection.

12:30-14:00

Lunch

14:00-14:30

Fast and Accurate Continual Test Time Domain Adaptation

14:30-15:00

Incremental Image Generation with Diffusion Models by Label Embedding Initialization and Fusion

15:00-15:30

EAGLE Network: A Novel Incremental Learning Framework for Detecting Unknown Logos in Open-World Environments

15:30-16:00

Afternoon Tea

16:00-16:30

FAM-Logo: Forward Compatible Multimodal Framework for Few-Shot Logo Incremental Classification

16:30-17:15

Panel & Closing Remarks

Submission

  • CL-24 will be held in conjunction with ACM MM 2024.
  • Accepted papers will be presented at the workshop, and authors retain the right to submit them to journals.
  • We invite submissions of original research papers addressing, but not limited to, the topics listed above. Submissions should adhere to the ACM Multimedia 2024 formatting guidelines and will undergo a rigorous peer-review process. The template can be found via:
  • Submissions may vary in length from 4 to 8 pages, with up to 2 additional pages permitted for references. There is no distinction between long and short papers; authors are free to determine the appropriate length for their paper.
  • Papers must be submitted via:

Organizers

Program Committee

Wenbin Li, Nanjing University
Qi Fan, Nanjing University
Rui Yan, Nanjing University
Hongguang Zhang, Systems Engineering Institute, AMS
Lei Wang, University of Wollongong
Jinhui Tang, Nanjing University of Science and Technology
Jiebo Luo, University of Rochester

Student Organizers

Peng Huang, Nanjing University of Science and Technology
Zhiping Wu, Nanjing University
Shangge Liu, Nanjing University

Important Dates

  • Paper Submission Deadline: 19 July 2024

  • Extended Paper Submission Deadline: 27 July 2024

  • Paper Acceptance Notification: 5 August 2024

  • Camera-Ready Deadline: 19 August 2024

Contacts

Contact the Organizing Committee: woods.cl.acm.mm@gmail.com