Unified model can interpret various input modalities -- text, image, video, audio, and motion sensor data -- and generate textual responses
The recent surge in the development of Large Language Models (LLMs) has paved the way for advanced machine understanding and representation of human language. Building upon this advancement, a new model called Any-Modality Augmented Language Model (AnyMAL) has been introduced. This unified model can interpret various input modalities—text, image, video, audio, and IMU motion sensor data—and generate textual responses. By inheriting the robust text-based reasoning abilities of renowned LLMs like LLaMA-2, AnyMAL translates modality-specific signals into a unified textual space using a pre-trained aligner module.
AnyMAL stands out from its predecessors, which often combined text with just one other modality. Instead, it relies on a collection of multi-modal encoders trained to convert data from many different sources into the text embedding space of an LLM. Key contributions of AnyMAL include projection layers pre-trained on large datasets across various modalities, and a fine-tuning process that uses a manually collected multimodal instruction set. This dataset covers a wide range of topics beyond standard Q&A formats, serving both as training material and as a benchmark for complex multimodal reasoning.
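For illustration, a single entry in such a multimodal instruction set might pair a non-text input with an instruction and a target response. The sketch below is an assumption for clarity; the field names and file path are hypothetical and not taken from the AnyMAL release.

```python
# A minimal sketch of how one multimodal instruction-tuning example could be
# structured. The keys ("modality", "media_path", "instruction", "response")
# are illustrative assumptions, not AnyMAL's actual schema.
example = {
    "modality": "image",                        # could also be "video", "audio", or "imu"
    "media_path": "samples/kitchen_scene.jpg",  # raw signal fed to the modality encoder
    "instruction": "Suggest a dish I could cook with the ingredients shown here.",
    "response": "With the tomatoes, basil, and pasta on the counter, you could ...",
}

# During fine-tuning, the media is encoded, projected into the LLM's token
# embedding space, and prepended to the tokenized instruction; the model is
# then trained to generate the response text.
```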
The model's capability is evident in its zero-shot performance. Evaluations report clear gains in both automatic accuracy metrics and human ratings across a range of tasks and modalities. AnyMAL also compares favorably with other models in the literature, including a +7.0% accuracy improvement on VQAv2 and a +14.5% CIDEr improvement on AudioCaps.
Under the hood, AnyMAL's pre-training modality alignment plays a pivotal role. The model is pre-trained on paired multimodal data, with a lightweight adapter for each modality that projects input signals into the text token embedding space of a designated LLM. This creates a unified token embedding space representing both text and other modalities. To scale training to large datasets efficiently, quantization strategies are applied, substantially reducing memory requirements.
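As a rough illustration, the alignment step can be thought of as a small trainable projection that maps frozen modality-encoder features into the LLM's token embedding space as a handful of "soft tokens". The PyTorch sketch below is a simplified assumption: the class name, dimensions, and the plain linear projection are chosen for clarity and do not reflect AnyMAL's exact adapter design.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Hypothetical lightweight adapter: projects frozen encoder features
    into the LLM's token embedding space as a fixed number of soft tokens."""

    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        # A single linear projection is an assumption; the real module could
        # be a small MLP or a resampler-style block instead.
        self.proj = nn.Linear(encoder_dim, llm_dim * num_tokens)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [batch, encoder_dim] pooled output of a frozen modality encoder
        batch = features.shape[0]
        tokens = self.proj(features).view(batch, self.num_tokens, -1)
        return tokens  # [batch, num_tokens, llm_dim], ready to prepend to text embeddings

# Usage sketch: project image features (assumed dim 1024) into a LLaMA-2-sized
# embedding space (assumed dim 4096), then concatenate with text token embeddings.
adapter = ModalityAdapter(encoder_dim=1024, llm_dim=4096)
image_features = torch.randn(2, 1024)     # stand-in for frozen encoder output
soft_tokens = adapter(image_features)     # [2, 32, 4096]
text_embeds = torch.randn(2, 16, 4096)    # stand-in for embedded prompt tokens
llm_inputs = torch.cat([soft_tokens, text_embeds], dim=1)
# Only the adapter's parameters are trained in this alignment stage; the LLM
# (optionally loaded in reduced precision to save memory) stays frozen.
```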
Through rigorous experimentation, AnyMAL has demonstrated its capabilities in both captioning tasks for different modalities and intricate multimodal reasoning tasks. Whether providing in-depth image captions or drawing logical conclusions from multi-input instructions, AnyMAL showcases its aptitude as a next-generation language model, setting new benchmarks in multimodal language processing.
Webdesk AI News: AnyMAL Multi-Modal Language Model, September 27, 2023