Meta has just presented a preview of Chameleon, its new family of models designed to be natively multimodal. The launch is a strategic response by Meta to growing competition in generative artificial intelligence, and in particular to the models released by its rivals, first among them OpenAI.
An innovative approach to multimodality
Unlike the usual approach of training a separate model for each modality and then merging their outputs through aggregators, an approach known as "late fusion", Chameleon adopts a "mixed-modal early-fusion" architecture. This means the model is designed from the ground up to learn from an interleaved mix of images, text, code, and other modalities.
Chameleon converts images into discrete tokens, much as language models do with words, and uses a unified vocabulary of text, code, and image tokens. This allows the same architecture to be applied to sequences containing both image and text tokens, letting the model reason over and generate interleaved image-and-text sequences without modality-specific components.
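To make this concrete, here is a minimal sketch of early-fusion tokenization: an image is quantized into discrete codes that are offset into the same id space as text tokens, so one transformer can consume the combined sequence. All names and values below are illustrative assumptions, not Meta's actual API; the real system uses a learned vector-quantized image tokenizer.

```python
import hashlib

TEXT_VOCAB_SIZE = 65_536        # assumed text vocabulary size (illustrative)
IMAGE_TOKENS_PER_IMAGE = 1024   # the paper reports 1024 discrete tokens per image

def quantize_image(image_bytes: bytes) -> list[int]:
    """Placeholder for a learned image tokenizer (a VQ-style model in the
    paper) mapping an image to a fixed-length list of codebook indices.
    Here we just derive deterministic fake codes from the raw bytes."""
    digest = hashlib.sha256(image_bytes).digest()
    return [digest[i % len(digest)] for i in range(IMAGE_TOKENS_PER_IMAGE)]

def build_mixed_sequence(text_ids: list[int], image_bytes: bytes) -> list[int]:
    # Offset image codes past the text vocabulary so text and image tokens
    # occupy disjoint slices of one unified vocabulary, then splice them
    # into a single token stream for the transformer.
    image_ids = [TEXT_VOCAB_SIZE + code for code in quantize_image(image_bytes)]
    return text_ids + image_ids
```

In this framing, generating an image is no different from generating text: the model simply emits ids from the image slice of the vocabulary, which a decoder then maps back to pixels.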
According to the researchers, the model most similar to Chameleon is Google's Gemini, which also uses an early-fusion approach.
Overcoming training and evaluation challenges
While mixed-modal early fusion offers clear advantages, it also presents significant challenges during training and when scaling the model. To address them, Meta's researchers applied a range of architectural modifications and innovative training techniques.
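One stabilization technique the Chameleon paper describes is query-key normalization, in which queries and keys are normalized before the attention product to keep logits bounded. The sketch below is a minimal illustration of that idea, not Meta's implementation; class names, dimensions, and the choice of LayerNorm are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Minimal sketch of query-key normalization for self-attention."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Normalizing queries and keys per head bounds the attention
        # logits, which helps keep mixed-modal training numerically stable.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, tokens, head_dim).
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, d))
```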
Chameleon's training proceeds in two stages, using a dataset of 4.4 trillion tokens spanning text, image-text pairs, and interleaved sequences of text and images. Meta trained 7-billion- and 34-billion-parameter versions of the model over more than 5 million hours on Nvidia A100 80 GB GPUs.
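A rough sketch of what such a two-stage recipe might look like as a configuration is shown below. Only the data types and the two-stage structure come from the reported setup; the stage names, mixture weights, and learning rates are assumptions for illustration.

```python
# Hypothetical two-stage pre-training plan; numbers are illustrative.
TRAINING_PLAN = {
    "stage_1": {
        # Large-scale mix of text, image-text pairs, and interleaved
        # text/image documents (~4.4 trillion tokens reported overall).
        "data_mix": {"text": 0.5, "image_text_pairs": 0.3, "interleaved": 0.2},
        "learning_rate": 1e-4,
    },
    "stage_2": {
        # Continued training on a curated mix at a lower learning rate,
        # a common pattern for second-stage pre-training.
        "data_mix": {"curated_mix": 1.0},
        "learning_rate": 1e-5,
    },
}

MODEL_SIZES = ["7B", "34B"]  # parameter counts reported for Chameleon
```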
Experiments show that Chameleon achieves state-of-the-art performance across a range of capabilities, including visual question answering (VQA) and image captioning, surpassing models such as Flamingo, IDEFICS, and Llava-1.5. Chameleon also remains competitive on text-only benchmarks, matching models such as Mixtral 8x7B and Gemini-Pro.
Towards an open multimodal future
As OpenAI and Google release new multimodal models of their own, Meta could stand out by offering an open alternative to proprietary models. Moreover, Chameleon's mixed-modal early-fusion approach could inspire new research directions for more advanced models, particularly as additional modalities are integrated.
Meta's researchers note that "Chameleon represents a significant step towards realizing the vision of a unified foundation model capable of flexibly reasoning over and generating multimodal content".