Azure AI milestone: New foundation model Florence v1.0 pushing vision and vision-language state of the art

By the Project Florence team 

With the new computer vision foundation model Florence v1.0, the Project Florence team set the new state of the art on the popular leaderboards TextCaps Challenge 2021, nocaps, Kinetics-400/Kinetics-600 action classification, and the OK-VQA Leaderboard. 

Florence v1.0—along with recent milestones in Neural Text-to-Speech and question answering—is part of a larger Azure AI mission to provide relevant, meaningful AI solutions and services that work better for people because they better capture how people learn and work—with improved vision, knowledge understanding, and speech capabilities. At the center of these efforts is XYZ-code, a joint representation of three cognitive attributes: monolingual text (X), audio or visual sensory signals (Y), and multilingual (Z). For more information about these efforts, read the XYZ-code blog post. 

Developing AI that operates more like people do has been a challenging but exciting journey. We take a holistic, people-centered approach to learning and understanding by using multimodality. Our approach examines the relationship between three attributes of human cognition—monolingual text (X), audio or visual sensory cues (Y), and multilingual (Z)—and brings them together under XYZ-code, a common representation that enables AI to speak, hear, see, and understand better. The goal is to create pretrained foundation models that learn common representations of different modalities and support a wide range of downstream AI tasks. With the ability to leverage additional external domain knowledge, such models can underpin AI systems that interpret and interact with the world more like people do. 

To achieve the ambitious goal of XYZ-code, Microsoft Azure Cognitive Services launched Project Florence in May 2020 to advance its large-scale multi-task, multimodal computer vision services. Last year, the Project Florence team achieved its first milestone, reaching state-of-the-art performance on the nocaps benchmark. Compared with image descriptions provided by people, captions generated by the AI system for the same images were more detailed and precise. Such capability is a key component of the Microsoft mission of inclusive and accessible technology. Today, we’re thrilled to announce another important milestone: Florence v1.0, a computer vision foundation model that successfully scales across a large variety of vision and vision-language tasks.

Florence v1.0 demonstrates superior performance on challenging tasks such as zero-shot image classification, image/text retrieval, open-set object detection, and visual question answering. We’ve achieved a new state of the art by large margins on a wide range of benchmarks. Supported by Florence v1.0, we’ve also achieved the new state of the art on multiple popular vision and vision-language leaderboards, including COCO object detection and Kinetics-400/Kinetics-600 action classification.
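
To make the zero-shot setup concrete, here is a minimal sketch of how a dual-encoder model of this kind can classify images without task-specific training: class names become text prompts, both modalities are embedded into a shared space, and each image is assigned the class of its nearest prompt. The encoders below are random stand-ins, not Florence’s actual architecture, and all names are illustrative.

```python
# Minimal sketch of zero-shot classification with a dual-encoder model.
# The encoders are toy stand-ins; Florence's real encoders are CoSwin
# (image) and a Transformer language model (text).
import torch
import torch.nn.functional as F

class ToyImageEncoder(torch.nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 224 * 224, dim)

    def forward(self, images):  # images: (batch, 3, 224, 224)
        return self.proj(images.flatten(1))

class ToyTextEncoder(torch.nn.Module):
    def __init__(self, vocab=10_000, dim=512):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab, dim)

    def forward(self, token_ids):  # token_ids: (num_prompts, seq_len)
        return self.emb(token_ids)

image_encoder, text_encoder = ToyImageEncoder(), ToyTextEncoder()

# Class names become text prompts; their embeddings act as classifier weights.
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a boat"]
token_ids = torch.randint(0, 10_000, (len(prompts), 8))  # stand-in tokenizer

images = torch.randn(4, 3, 224, 224)  # stand-in image batch
img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(token_ids), dim=-1)

# Cosine similarity between every image and every class prompt;
# the highest-scoring prompt is the predicted class.
logits = img_emb @ txt_emb.t()  # (4, 3)
print(logits.argmax(dim=-1))
```

Because the classifier weights are just text embeddings, new classes can be added by writing new prompts, with no retraining of the image encoder.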

Figure: The workflow of Florence v1.0. An image-text dataset is curated from the internet and used to pretrain a language encoder and an image encoder (CoSwin) via unified contrastive learning. The pretrained model is then adapted to four task families: classification/retrieval; object-level representation, via a Dynamic Head adapter; fine-grained vision-language representation, via a METER adapter; and video representation, via a Video CoSwin adapter. All are served through a unified vision stack for deployment. Learn more about Florence v1.0 in the research paper.
Florence v1.0 leverages data curation, unified learning, a Transformer architecture comprising an image encoder and a language encoder, and adaptation. It can be integrated into modern computer vision systems to power real-world vision and multimedia applications. Compared with existing image-text pretraining models, which are mainly limited to cross-modal shared representations for classification and retrieval (illustrated by the light-green adaptation module above), Florence expands the representation to support object detection, modalities beyond RGB such as image depth, and video.
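
The pretraining signal behind this shared image-text space is contrastive. As a rough illustration (not the team’s actual training code), the sketch below implements the standard bidirectional InfoNCE loss used by CLIP-style dual encoders; Florence’s unified contrastive learning generalizes this objective so that image-label data can be folded in alongside image-text pairs, with images sharing a label treated as positives.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Bidirectional InfoNCE over a batch of matched image-text pairs.

    Matched pairs sit on the diagonal of the similarity matrix; every
    other entry in the same row/column serves as a negative.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarities
    targets = torch.arange(logits.size(0))            # positives on diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random embeddings stand in for encoder outputs.
loss = paired_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```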

Florence v1.0: From research to application

Project Florence’s mission is to take the advancements being made in areas such as feature representation learning, transfer learning, and model architecture search and turn them into applications that can empower our partners and customers to achieve more with Azure Cognitive Services. Florence v1.0 and other AI breakthroughs achieved so far are being transferred to the cloud platform, helping to improve model quality for image captioning, tagging, and customized object detection. 

The Florence image captioning model is available to customers via the computer vision offering of Azure Cognitive Services, which is part of Azure AI. It enables developers to incorporate alt text more easily, helping them improve the accessibility of their own products and services. The Florence image captioning model is also being incorporated into Seeing AI, an app that identifies text, objects, and people in a user’s surroundings, and into Microsoft Word, Outlook, and PowerPoint on various platforms.
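
For developers, the call pattern looks like the following minimal sketch, which uses the Azure Computer Vision SDK for Python to request a caption for an image. The endpoint, key, and image URL are placeholders, and we make no claim here about which service versions are backed by the Florence model.

```python
# Minimal sketch: requesting an image caption from the computer vision
# offering of Azure Cognitive Services. Endpoint, key, and URL are
# placeholders; install with
# `pip install azure-cognitiveservices-vision-computervision`.
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

client = ComputerVisionClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    CognitiveServicesCredentials("<your-subscription-key>"),
)

description = client.describe_image("https://example.com/photo.jpg", max_candidates=1)
for caption in description.captions:
    # Each candidate caption carries a confidence score, useful for
    # deciding whether to surface it as alt text.
    print(f"{caption.text} (confidence: {caption.confidence:.2f})")
```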

The Florence image tagging model is also available through the Azure Cognitive Services computer vision offering. It’s being incorporated into OneDrive to empower the photo search and recommendation experience for millions of users.

The Florence models can be further adapted with additional customer data through model fine-tuning. This moves us closer to our ambition of “custom vision for all”—that is, providing developers and customers with tools to build and improve models customized to meet their unique needs—so that the Florence model can recognize new vision objects with only few-shot fine-tuning.
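
As a rough illustration of the few-shot pattern (not the actual Azure fine-tuning pipeline), the sketch below freezes a pretrained backbone and trains only a lightweight classification head on a handful of labeled examples per new class; every component here is a stand-in.

```python
import torch
import torch.nn.functional as F

# Stand-in for a pretrained vision backbone whose weights stay frozen.
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, stride=2),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False  # keep the pretrained representations intact

head = torch.nn.Linear(16, 3)  # 3 new customer-defined classes
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

few_shot_images = torch.randn(12, 3, 64, 64)   # e.g., 4 examples per class
few_shot_labels = torch.tensor([0, 1, 2] * 4)

for _ in range(50):  # a short training loop is enough at this scale
    opt.zero_grad()
    feats = backbone(few_shot_images)          # frozen features
    loss = F.cross_entropy(head(feats), few_shot_labels)
    loss.backward()                            # gradients flow to the head only
    opt.step()
```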

Petrobras is using Spatial Analysis on its drillship for worker health and safety scenarios, such as trespassing detection, personal protective equipment (PPE) compliance, and detection of people under suspended objects. The system runs in the Atlantic Ocean under varied weather and lighting conditions and with a poor internet connection. The Project Florence team’s AI analyzes feeds from six high-resolution (3,072 x 2,048 pixel) cameras and detects people as small as 0.01 percent of the frame. Spatial Analysis supports deployment options that fit the security needs of a company’s individual system. In the case of Petrobras, the system leverages the edge to enable real-time computer vision solutions while keeping data in the customer’s preferred trusted space—that is, on its premises. More broadly, during the continued development of Spatial Analysis, the Project Florence team employs a variety of privacy-preserving techniques, such as face blurring to minimize the risk of people in the videos being identified, when customers have agreed to share data for the purposes of improving the service.
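
For scale, a 3,072 x 2,048 frame contains roughly 6.3 million pixels, so a person occupying 0.01 percent of the frame covers only about 630 pixels, an area on the order of 25 x 25 pixels.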

These achievements help pave the way toward supplying AI models themselves as a service in production, and they contribute to many ongoing projects—from Intelligent photo for Microsoft 365 to planogram compliance for Industry Cloud to spatial analysis for Microsoft Dynamics 365.

We’ll have more updates in the coming months. Please check out our project page to learn more about our technology and latest advancements. 

Note on Responsible AI 

Like other publicly available models, Florence models are trained on billions of pages of publicly available text and images and hence may have picked up biases around gender, race, and more from these public documents. Mitigating negative effects from these biases is a difficult, industry-wide issue, and Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these Microsoft AI principles into practice throughout the company and have taken extensive precautionary measures to prevent these implicit biases from being exhibited when the models are used in our products. We strongly encourage developers to do the same by putting appropriate guardrails and mitigations in place before taking these models to production. 

Acknowledgment 

This research was conducted by the Project Florence team under Azure Cognitive Services, in close collaboration with the Microsoft Research Deep Learning Group. Thanks to the Office of the Chief Technology Officer, Integrated Training Platform, AI Framework, and DeepSpeed teams for making this great accomplishment possible. Thanks to Luis Vargas for coordination and Microsoft Research Asia for its help and collaboration. Thanks also to Jianfeng Gao, Baining Guo, Michael Zeng, Yumao Lu, Zicheng Liu, Ce Liu, and Xuedong Huang for their leadership and support. 
