
MM1, a Family of Multimodal AI Models With Up to 30 Billion Parameters, Is Being Developed by Apple Researchers

In a pre-print paper published online on March 14, Apple researchers presented their work on a multimodal large language model (LLM) for artificial intelligence (AI). The paper describes how the foundation model was trained on both text-only data and images to achieve advanced multimodal capabilities. The Cupertino-based tech giant’s new advances in AI follow CEO Tim Cook’s statement during the company’s earnings call that AI features might be released later this year.

The pre-print version of the research paper has been published on arXiv, an open-access online repository for scholarly papers. Papers posted there, however, are not peer-reviewed. Although the paper makes no mention of Apple, the project is believed to be connected to the company, since most of the researchers listed are affiliated with Apple’s machine learning (ML) division.

The project the researchers are working on is MM1, a family of multimodal models with up to 30 billion parameters. The paper’s authors call it a “performant multimodal LLM (MLLM)” and note that careful decisions about image encoders, the vision-language connector, other architecture elements, and the training data were needed to build an AI model that can comprehend both text- and image-based inputs.

As an example, the paper states: “We demonstrate that achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results, requires a careful mix of image-caption, interleaved image-text, and text-only data for large-scale multimodal pre-training.”

To put it simply, the AI model is still in the pre-training phase: it has not yet received enough training to produce the intended results. This phase involves designing the model’s workflow and how it will eventually process data through the algorithm and AI architecture. The Apple researchers incorporated computer vision into the model by means of image encoders and a vision language connector. After running tests with a mix of image-caption, interleaved image-text, and text-only data sets, the team found the results comparable to those of other models at the same stage.

Although this is a significant breakthrough, the research paper offers insufficient evidence to conclude that Apple will integrate a multimodal AI chatbot into its operating system. At this point it is difficult to say whether the model is multimodal only in its inputs or also in its outputs (i.e., whether it can generate AI images). It can be said, however, that the tech giant has made significant progress toward developing a native generative AI foundation model, if the results prove consistent following peer review.

Categories: Technology
Kajal Chavan