NExT-GPT: Open Source Multimodal AI Technology Challenges Big Tech Giants
The National University of Singapore (NUS) and Tsinghua University have joined forces to develop NExT-GPT, an open-source multimodal AI model that aims to rival offerings from industry giants like OpenAI and Google. By combining text, images, audio, and video in conversation, NExT-GPT supports more natural interactions than text-only models.
The team behind NExT-GPT describes it as an any-to-any system, meaning it can accept inputs in any of these modalities and respond in any of them as well. Because the model is open source, users can customize and extend it for their own needs, potentially pushing it well beyond its original capabilities.
So, how does NExT-GPT work? The model uses separate encoder modules to map inputs such as images and audio into representations the core language model can process alongside text. The researchers also apply a technique called modality-switching instruction tuning to strengthen cross-modal reasoning, enabling seamless transitions between different types of inputs during a conversation.
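The encoding step can be sketched roughly as follows. This is a minimal illustration, not NExT-GPT's actual code: the encoder stand-in, the dimensions, and the projection layer are all assumptions, standing in for a frozen multimodal encoder whose features are projected into the language model's embedding space.

```python
import numpy as np

ENCODER_DIM = 1024   # assumed size of the frozen encoder's feature vector
LLM_DIM = 4096       # assumed size of the language model's token embeddings

rng = np.random.default_rng(0)
# Small trainable projection from encoder space into LLM embedding space.
projection = rng.normal(scale=0.02, size=(ENCODER_DIM, LLM_DIM))

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen multimodal encoder (e.g. ImageBind)."""
    # Collapse the image to a fixed-size, normalized feature vector.
    flat = image.reshape(-1)
    pooled = np.resize(flat, ENCODER_DIM)
    return pooled / (np.linalg.norm(pooled) + 1e-8)

def project_to_llm_space(features: np.ndarray) -> np.ndarray:
    """Map encoder features to a 'soft token' the LLM can consume."""
    return features @ projection

image = rng.random((224, 224, 3))   # a dummy RGB image
soft_token = project_to_llm_space(encode_image(image))
print(soft_token.shape)  # (4096,)
```

In a real system the projected vector would be interleaved with ordinary text-token embeddings in the LLM's input sequence, which is what lets one core model reason over mixed-modality conversations.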
NExT-GPT employs unique signal tokens for each input and output modality, allowing flexible any-to-any conversion. These tokens accompany ordinary text generation and trigger the production of non-text outputs such as images and videos. Dedicated decoders handle each output modality: Stable Diffusion for images, AudioLDM for audio, and Zeroscope for video. Vicuna serves as the base large language model (LLM), and ImageBind handles input encoding.
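The routing idea can be sketched like this. The token names (`<IMG>`, `<AUD>`, `<VID>`), the separator, and the decoder registry below are illustrative assumptions, not NExT-GPT's real tokens or interfaces; the point is only that special tokens in the LLM's output dispatch segments to the matching decoder.

```python
# Hypothetical registry mapping modality signal tokens to decoder calls.
# In a real system these would invoke Stable Diffusion, AudioLDM, and
# Zeroscope; here they return placeholder strings.
DECODERS = {
    "<IMG>": lambda prompt: f"[StableDiffusion image for: {prompt}]",
    "<AUD>": lambda prompt: f"[AudioLDM audio for: {prompt}]",
    "<VID>": lambda prompt: f"[Zeroscope video for: {prompt}]",
}

def dispatch(llm_output: str) -> list:
    """Split LLM output into plain text and decoder outputs."""
    results = []
    for segment in llm_output.split("|"):
        segment = segment.strip()
        for token, decoder in DECODERS.items():
            if segment.startswith(token):
                # Route the rest of the segment to the matching decoder.
                results.append(decoder(segment[len(token):].strip()))
                break
        else:
            results.append(segment)  # ordinary text passes through
    return results

out = dispatch("Here is your picture | <IMG> a cat on a surfboard")
print(out)
```

This mirrors the article's description: text responses flow through unchanged, while signal tokens trigger a non-text generator for that modality.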
Despite training only about 1% of its total parameters, NExT-GPT achieves remarkable flexibility in any-to-any conversion. The remaining parameters belong to frozen, pretrained modules, which keeps training efficient.
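A back-of-the-envelope sketch of that parameter budget: freeze the large pretrained modules and train only the small projection layers, so the trainable share stays around 1%. The module names and parameter counts below are made-up round numbers for illustration, not NExT-GPT's actual sizes.

```python
# (parameter_count, is_trainable) per module; counts are illustrative.
modules = {
    "imagebind_encoder":  (1_200_000_000, False),  # frozen
    "vicuna_llm":         (7_000_000_000, False),  # frozen
    "diffusion_decoders": (2_500_000_000, False),  # frozen
    "input_projections":  (60_000_000, True),      # trainable
    "output_projections": (40_000_000, True),      # trainable
}

total = sum(n for n, _ in modules.values())
trainable = sum(n for n, t in modules.values() if t)
print(f"trainable share: {trainable / total:.2%}")  # about 1%
```

With these assumed numbers, roughly 100 million of 10.8 billion parameters are updated during training, which is why the approach is so cheap relative to training a multimodal model from scratch.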
Although a demo site for NExT-GPT has been established, availability remains intermittent. Nonetheless, NExT-GPT presents itself as a compelling open-source alternative for creators seeking to harness the power of multimodal AI. Multimodality is crucial for enabling more natural interactions, and by open-sourcing NExT-GPT, researchers are providing a platform for the community to propel AI to new heights.
As tech giants like Google and OpenAI launch their own multimodal AI products, NExT-GPT introduces healthy competition in the field. Its ability to process multiple modalities and generate coherent responses holds great potential for advancing conversational AI. By embracing openness and collaboration, NExT-GPT brings researchers, developers, and enthusiasts together to shape the future of AI.