In recent years, advances in multimodal foundation models (MMFMs) have spurred growing interest in enhancing their generalization abilities through continual learning (CL), so that they can process diverse data types, from text to visuals, and continuously update their capabilities from real-time inputs. Despite significant progress in the theory and applications of continual learning, the community still faces serious challenges. Our workshop aims to provide a venue where academic researchers and industry practitioners can come together to discuss the principles, limitations, and applications of multimodal foundation models in continual learning for multimedia applications, and to promote understanding of these models, innovative algorithms, and research on new multimodal technologies and applications.
Scope and Topics
Topics of interest include, but are not limited to:
The workshop will include 3 invited talks and 4 or more oral paper presentations, and is planned as a full-day meeting.
Time | Programme
10 min | Welcome + Opening
30 min | Keynote Speaker
Keynote: Zero- and Few-shot Keypoint Detection: from modulation to multimodal prompting
Keypoint detection has been an important topic in computer vision for over 20 years.
Early methods relied on unsupervised techniques such as Hessian or Harris corner detectors, or SIFT interest points.
Modern keypoint detection can now be performed within a few-shot learning paradigm, where annotated support keypoints (e.g., paw, nose, ears, eyes) are detected in an unannotated query.
Applications of such keypoints include pose estimation, fine-grained recognition, and pose warping. In this talk, I will discuss our earlier work on few-shot keypoint detection (FSKD) that can generalize to unseen animal species (e.g., training on dogs, testing on cats) and keypoint types (e.g., training on paws, testing on ears).
I will also cover how saliency maps and DINO can enhance attention in keypoint detection, how the traditional modulation and detection stages can be streamlined into a single step, and how contrastive learning can improve performance.
Finally, I will explain our recent work on multimodal (image, text) keypoint prompting using CLIP for generalized zero- and few-shot keypoint detection.
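As a rough illustration of the few-shot setting described in this abstract, the sketch below (a hypothetical toy, not the speaker's method) matches the feature descriptor of one annotated support keypoint against every location of a query feature map by cosine similarity; the descriptors and map sizes are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_keypoint(support_feat, query_feats):
    """Return the (row, col) in the query feature map whose descriptor
    is most similar to the support keypoint descriptor."""
    best, best_pos = -1.0, None
    for r, row in enumerate(query_feats):
        for c, feat in enumerate(row):
            s = cosine(support_feat, feat)
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos

# Toy 2x2 query feature map with 3-dim descriptors.
query = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.7, 0.7, 0.0]],
]
support = [0.0, 0.9, 0.1]  # descriptor at an annotated support keypoint
pos = match_keypoint(support, query)  # → (0, 1)
```

Because matching is by feature similarity rather than class labels, this style of detector can in principle transfer to unseen species and keypoint types, which is the generalization the talk addresses.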
30 min | Keynote Speaker
Keynote: Class Incremental Learning for Image Classification
Our most recent work revisits the concept of "phase", which is the primary cause of unrealistic data distribution shifts in Class-Incremental Learning (CIL) settings.
We thus eliminate the phase concept and propose a non-stationary data stream with class sampling distributions that shift at every time step. The entry time point of each class is random. This design respects the "rise-and-fall" nature described by Gunderson (2002) and introduces two new challenges for CIL.
First, the non-stationarity of the data requires models to identify recent dynamics and adopt appropriate learning strategies for both memorization and adaptation. Second, at any given time point, the proposed stream may exhibit an extremely imbalanced data distribution, introducing a strong bias towards the dominant class.
We address these challenges by introducing a novel Rate-dependent Coreset Selection (RdCS) method. For the first challenge, we tie the RdCS to a real-time rate indicating the intensity of the distribution shift, providing an adaptive selection strategy. For the second challenge, we design the RdCS to operate on biased validation steps.
We will showcase some results of the CIL of image classification in this talk.
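The stream construction described in this abstract (random per-class entry points, a "rise-and-fall" class weight, and a sampling distribution that shifts at every time step instead of in discrete phases) can be illustrated with a toy sampler. The triangular weight profile and all constants below are illustrative assumptions, not the authors' design:

```python
import random

def class_weight(t, entry, peak=5.0, span=20.0):
    """Rise-and-fall weight of one class at time t: zero before its
    random entry point, then a triangular rise and decay back to zero."""
    if t < entry or t > entry + span:
        return 0.0
    half = span / 2.0
    d = abs(t - (entry + half))
    return peak * (1.0 - d / half)

def sample_batch(t, entries, batch_size, rng):
    """Draw a (possibly heavily imbalanced) batch whose class sampling
    distribution shifts at every time step -- there are no fixed phases."""
    weights = [class_weight(t, e) for e in entries]
    if sum(weights) == 0.0:
        return []
    classes = list(range(len(entries)))
    return rng.choices(classes, weights=weights, k=batch_size)

rng = random.Random(0)
entries = [rng.uniform(0, 50) for _ in range(5)]  # random entry time per class
batch = sample_batch(25.0, entries, batch_size=8, rng=rng)
```

At any single time step, only the classes whose weight curves are currently "up" appear, so the batch can be dominated by one class: exactly the imbalance challenge the abstract raises.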
30 min | Keynote Speaker
Keynote: Evolving AI: Advancing Continual Learning in Large Language Models
Continual learning with large language models (LLMs) is crucial for enabling AI systems to adapt and evolve in real-time, maintaining and enhancing knowledge without succumbing to catastrophic forgetting, thereby ensuring sustained operational efficiency and relevance.
This report explores the integration of continual learning with large language models across multi-modal information sources. We begin by reviewing traditional continual learning, illustrating its application in text, image, and speech extraction, and multi-modal knowledge graph construction.
We then redefine continual learning for LLMs, focusing on overcoming catastrophic forgetting and enhancing knowledge retention through continual pre-training, instruction tuning, and alignment. Looking ahead, we discuss challenges such as data evolution and contamination, and propose innovations in architectures and learning paradigms, including the evolution of language agents and proactive continual learning.
30 min | Keynote Speaker
Keynote: Adaptation Without Forgetting: Repurposing Foundation Models for Zero-Shot and Few-Shot Semantic Segmentation
Foundation vision models, trained either in a supervised or unsupervised manner, possess extensive knowledge about diverse object appearances.
These models are often adapted to new computer vision tasks, such as transitioning from classification to segmentation, by introducing additional parameters.
In practice, however, this adaptation often relies on a limited set of object categories, causing the system to overfit to the seen categories and to forget the foundation model's knowledge about other categories.
In this talk, we present our recent efforts to address this challenge, with a focus on zero-shot and few-shot semantic segmentation applications.
Our findings demonstrate that parameter-efficient tuning, carefully designed loss functions, and specific input for the newly added module can significantly enhance performance compared to the straightforward extension of foundation models.
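A minimal sketch of the parameter-efficient idea mentioned above: freeze the pretrained backbone and train only a small added module, so gradients never touch the foundation model's weights and its knowledge of unseen categories is preserved. The `Adapter` below is an invented toy, not the speaker's module:

```python
def frozen_backbone(x):
    """Stand-in for a pretrained foundation model's feature extractor;
    its weights are never updated during adaptation."""
    return [2.0 * v for v in x]

class Adapter:
    """Small trainable module added on top of the frozen backbone.
    Only these few parameters are tuned, which limits forgetting."""
    def __init__(self, dim):
        self.scale = [1.0] * dim  # trainable
        self.bias = [0.0] * dim   # trainable

    def __call__(self, feats):
        return [s * f + b for s, f, b in zip(self.scale, feats, self.bias)]

    def sgd_step(self, feats, grad_out, lr=0.1):
        # Gradients flow only into the adapter, never into the backbone.
        for i, (f, g) in enumerate(zip(feats, grad_out)):
            self.scale[i] -= lr * g * f
            self.bias[i] -= lr * g

adapter = Adapter(2)
feats = frozen_backbone([1.0, 2.0])
out = adapter(feats)  # → [2.0, 4.0] before any tuning
```

Which inputs the new module receives, and how its loss is designed, are exactly the choices the talk argues matter for zero- and few-shot segmentation performance.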
...
...
10 min | Coffee Break
...
...
30 min | Round Table Discussion
Contact the Organizing Committee: woods.cl.acm.mm@gmail.com