In recent years, with the advancement of multimodal foundation models (MMFMs), there has been growing interest in enhancing their generalization abilities through continual learning (CL), enabling them to process diverse data types, from text to visuals, and to continuously update their capabilities from real-time inputs. Despite significant advances in the theory and applications of continual learning, the community still faces serious challenges. Our workshop provides a venue where academic researchers and industry practitioners can come together to discuss the principles, limitations, and applications of multimodal foundation models in continual learning for multimedia, to deepen understanding of multimodal foundation models in continual learning, and to promote innovative algorithms and research on new multimodal technologies and applications.
Scope and Topics
Topics of interest include, but are not limited to:
The workshop will include four invited talks and four paper presentations, running from 8:45 a.m. to 12:30 p.m.
Time | Programme
08:45-08:50 | Opening Remarks
10:10-10:50 | Keynote Speaker
Keynote: Continual Learning for Multi-modal Human-centric Applications
Continual learning is a fundamental mechanism for enabling long-term adaptation and knowledge accumulation in intelligent systems. However, in the era of large-scale pretraining, it remains a critical challenge to extend continual learning to dynamic, heterogeneous real-world environments. This talk presents our recent efforts on continual learning in pretrained models, with a focus on two key directions. First, we propose a modality-heterogeneous continual pretraining framework for multi-modal physiological signal generation, enabling real-time and robust monitoring of human health conditions. Second, inspired by the spatial cognition mechanisms of the biological brain, we develop embodied agents that construct and refine cognitive maps through continuous collection of spatial knowledge, thus equipping multi-modal language models with strong long-horizon generalization in complex environments. Together, these advances point toward the development of brain-inspired embodied intelligence with lifelong adaptability.
09:30-10:10 | Keynote Speaker
Keynote (online): Continual Learning of Visual Representations
Continually learning and acquiring new concepts from a dynamically changing environment is an important requirement for an artificial intelligence system. Existing deep learning methods fail to achieve this goal and suffer from significant performance degradation when retrained on a new dataset. We discuss the main approaches to continual learning: regularization, architecture expansion, and replay mechanisms. A series of recent approaches to the continual learning of image tasks will be introduced during the plenary lecture, and experimental results will be provided. Limitations of existing continual learning systems will also be discussed, together with directions for future research.
08:50-09:30 | Keynote Speaker
Keynote: From Context to Parameters: Generalization and Transfer in Modalities and Domains
Multimodal foundation models (MFMs) are increasingly deployed in dynamic, open-world environments where they must generalize to new tasks and modalities and transfer knowledge across diverse domains. Achieving this requires tackling two complementary challenges: adapting in context for immediate, task-specific generalization, and evolving parameters for long-term retention and scalable transfer. In this keynote, I will present effective strategies through the lens of context and parameter adaptations. On the context side, I will discuss how multimodal models can leverage demonstrations and structured reasoning to generalize on the fly, adapting to new tasks without additional training. On the parameter side, I will examine how models can evolve to retain prior knowledge and expand to new modalities and domains, enabling continual learning over time.
10:50-11:30 | Keynote Speaker
Keynote: Generalizing vision-language models to novel domains
Vision-language pretraining has enabled powerful vision-language models (VLMs) with strong zero-shot capabilities. Yet, their performance drops in domain-specific tasks, motivating research on transferring and generalizing VLM knowledge to downstream applications. This talk briefly reviews generalization settings, methodologies, and benchmarks, categorizing approaches into prompt-based, parameter-based, and feature-based methods. We also discuss our recent research on generalizing VLMs to novel domains.
11:30-11:43 | Morning Tea
11:43-11:51 | SR-ML: A Sequence-level Routing with Mixed Low-rank Experts Framework for Continual Learning
11:51-11:59 | Low Altitude-R1: Exploring the Upper Limits of Target Detection in Low-altitude Scenarios with Reinforcement Learning
11:59-12:07 | LaST-LoRA: Adaptive Knowledge Reuse and Latent Subspace Tracking for Continual Learning
12:07-12:15 | NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation
12:15-12:30 | Panel & Closing Remarks
Contact the Organizing Committee: woods.cl.acm.mm@gmail.com