This post has been republished via RSS; it originally appeared at: Microsoft Research.
Kosmos-2.5: A Multimodal Literate Model
Current large language models (LLMs) primarily focus on textual information and cannot understand visual information. However, advancements in the field of multimodal large language models (MLLMs) aim to address this limitation. MLLMs combine visual and textual information within a single Transformer-based model, enabling the model to learn and generate content based on both modalities.
While existing MLLMs have mainly focused on natural images with lower resolutions, the exploration of text images requires further investigation. Incorporating text images into the training process and developing models based on textual and visual information can unlock new possibilities for multimodal applications involving high-resolution text-intensive images.
In a new paper: Kosmos-2.5: A Multimodal Literate Model, researchers from Microsoft present Kosmos-2.5, an MLLM for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in Markdown format. The model can be adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning. This work paves the way for the future scaling of MLLMs.
Evaluation of Dependency Structure for Multivariate Weather Predictors using Copulas
In the Global South, climate change is driving more frequent and severe weather events such as droughts, floods, and storms. This leads to crop failures, food insecurity, and job loss. These effects are expected to increase in intensity, further disadvantaging marginalized communities and exacerbating existing inequalities. The need for prevention and adaptation is urgent. But despite advances in machine learning and numerical modeling, accurate weather forecasting remains challenging, due to complex interactions among atmospheric and oceanic variables.
In a new paper: Evaluation of Dependency Structure for Multivariate Weather Predictors using Copulas, researchers from Microsoft explore the potential of vine copulas to explain complex relationships of different weather variables in three African locations. Copulas separate marginal distributions from the dependency structure, offering a flexible way to model dependence between random variables for improved risk assessments and simulations. Vine copulas are based on a variety of bivariate copulas, including Gaussian, Student’s t, Clayton, Gumbel, and Frank copulas. They are effective in high-dimensional problems and offer a hierarchy of trees to express conditional dependence. The researchers propose applying this framework within subseasonal forecasting models to enhance the prediction of different weather events or variables.
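To make the idea of separating marginal distributions from the dependency structure concrete, here is a minimal Python sketch. It is not the paper's method: it uses a single bivariate Gaussian copula rather than a vine, and the variable names, the correlation `RHO`, and both marginals (exponential "rainfall", normal "temperature") are hypothetical choices made only for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform
EPS = 1e-12                # keeps uniforms strictly inside (0, 1)


def gaussian_copula_sample(rho, n, rng):
    """Draw n pairs of uniforms whose dependence follows a Gaussian copula."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u = min(max(STD_NORMAL.cdf(z1), EPS), 1.0 - EPS)
        v = min(max(STD_NORMAL.cdf(z2), EPS), 1.0 - EPS)
        pairs.append((u, v))
    return pairs


def kendall_tau(xs, ys):
    """Kendall's tau by direct pair counting (O(n^2), fine for small n)."""
    concordant = discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical settings, chosen only for this illustration.
RHO = -0.6                      # dependence parameter of the copula
RAIN_MEAN = 5.0                 # mean of the exponential "rainfall" marginal
TEMP_DIST = NormalDist(25, 3)   # normal "temperature" marginal

rng = random.Random(0)
pairs = gaussian_copula_sample(RHO, 400, rng)

# Step 1: the copula fixes the dependence on the uniform scale.
us = [u for u, _ in pairs]
vs = [v for _, v in pairs]

# Step 2: any marginals can be attached through their inverse CDFs.
rain = [-RAIN_MEAN * math.log(1.0 - u) for u in us]
temp = [TEMP_DIST.inv_cdf(v) for v in vs]

tau_uniform = kendall_tau(us, vs)
tau_weather = kendall_tau(rain, temp)
tau_theory = (2.0 / math.pi) * math.asin(RHO)  # known result for a Gaussian copula

print(tau_uniform, tau_weather, tau_theory)
```

Because the inverse CDFs are increasing, the rank-based dependence (Kendall's tau) measured on the uniform scale carries over to the "weather" scale, whatever marginals are chosen. A vine copula extends this by chaining many such bivariate building blocks in a hierarchy of trees.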
Adaptive Training System
Adaptive training has been defined as training in which the problem, stimulus, or task is varied as a function of how well the trainee performs. Researchers have shown that this type of training outperforms comparable training that is non-adaptive or fixed across a range of populations and learning contexts. Virtual reality offers new opportunities for applying this type of training and has already demonstrated its effectiveness across a variety of simulated tasks. By using a computational model of the training process, we can derive recommendations for optimal scenario difficulty, resulting in faster and enhanced training.
In a new paper: Adaptive Training System, researchers from Microsoft propose an adaptive training algorithm that accelerates the training process based on a parametric model of trainees and training scenarios. The proposed approach makes trial-by-trial recommendations on optimal scenario difficulty selections to maximize improvements in the trainee’s absolute skill level. The Adaptive Training System is applied to the task of training pilots on a virtual reality flight simulator. The system was designed for scenarios varying in difficulty from easy, with full visibility, to flight in fog with side wind, which is difficult even for experienced pilots.
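The trial-by-trial recommendation loop described above can be sketched in Python as follows. The paper's actual parametric model is not given here, so everything in this sketch is an assumption made for illustration: the logistic success model, the rule that a successful trial moves skill toward the scenario's difficulty, the `learning_rate`, and the difficulty grid `DIFFICULTIES`.

```python
import math
import random

# Hypothetical difficulty scale: 0.0 (easy, full visibility) to 10.0 (hardest).
DIFFICULTIES = [i * 0.5 for i in range(21)]


def success_probability(skill, difficulty):
    """Assumed logistic model: harder scenarios are less likely to succeed."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))


def expected_gain(skill, difficulty, learning_rate=0.3):
    """Assumed gain: a success moves skill part of the way toward the difficulty."""
    stretch = max(difficulty - skill, 0.0)
    return success_probability(skill, difficulty) * learning_rate * stretch


def recommend_difficulty(skill):
    """Pick the grid difficulty with the largest expected skill improvement."""
    return max(DIFFICULTIES, key=lambda d: expected_gain(skill, d))


def run_session(skill, trials, rng, learning_rate=0.3):
    """Simulate a trainee who follows the recommendation on every trial."""
    for _ in range(trials):
        difficulty = recommend_difficulty(skill)
        if rng.random() < success_probability(skill, difficulty):
            skill += learning_rate * max(difficulty - skill, 0.0)
    return skill


final_skill = run_session(skill=2.0, trials=60, rng=random.Random(0))
print(recommend_difficulty(2.0), final_skill)
```

Under these assumptions the recommended scenario is always somewhat harder than the trainee's current level: too easy and there is little to gain, too hard and success is unlikely. A real system would estimate the trainee's skill from observed performance rather than treat it as known.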
CodePlan: Repository-level Coding using LLMs and Planning
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase involve pervasive edits across the entire repository of code. These activities are formulated as repository-level coding tasks.
Large language model-powered coding assistants, like GitHub Copilot, have succeeded in offering high-quality solutions to localized coding problems. But repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is interdependent and the entire repository may be too large to fit into the prompt.
In a new paper: CodePlan: Repository-level Coding using LLMs and Planning, researchers from Microsoft frame LLM-driven repository-level coding as a planning problem, where the goal is to take the repository from its initial state to a target state whose specifications are provided in natural language. They present CodePlan, a task-agnostic framework that solves this problem by synthesizing a multi-step chain of edits, where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes, and task-specific instructions. This research evaluates the effectiveness of CodePlan on two repository-level tasks, package migration (C#) and temporal code edits (Python), and shows that CodePlan aligns more closely with the ground truth than the baselines.
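One plausible way to picture a multi-step chain of edits is a worklist over a dependency map, sketched below in Python. This is an illustration under stated assumptions, not CodePlan's implementation: the function `plan_edits`, the `dependents` map, the location names, and the `fake_llm_edit` stub (which stands in for a real LLM call) are all hypothetical.

```python
from collections import deque


def plan_edits(seed_locations, dependents, edit_fn, instruction):
    """Apply edits step by step, scheduling dependents of every changed location.

    dependents maps a code location to the locations that depend on it.
    edit_fn(location, context) stands in for an LLM call and returns True
    if it changed the code at that location.
    """
    pending = deque(seed_locations)
    visited = set()
    history = []  # locations changed so far, in order

    while pending:
        location = pending.popleft()
        if location in visited:
            continue
        visited.add(location)

        # Context for this step: the task plus what has already been changed.
        context = {
            "instruction": instruction,
            "previous_edits": list(history),
        }
        if edit_fn(location, context):
            history.append(location)
            # A change here may affect code that depends on this location.
            for dependent in dependents.get(location, []):
                if dependent not in visited:
                    pending.append(dependent)
    return history


# Hypothetical repository: parse() is used by main() and by a test.
dependents = {
    "api.py:parse": ["cli.py:main", "tests.py:test_parse"],
    "cli.py:main": ["tests.py:test_cli"],
}


def fake_llm_edit(location, context):
    """Stub for an LLM call: pretend test_parse needs no change."""
    return location != "tests.py:test_parse"


chain = plan_edits(["api.py:parse"], dependents, fake_llm_edit,
                   "Migrate parse() to the new package API")
print(chain)
```

The point of the sketch is that the plan is not fixed up front: each edit can add new steps, and each step sees the edits that came before it.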
The intimacy triple bind: Structural inequalities and relational labor in the influencer industry
Social media content creators, or influencers, depend heavily on their ability to cultivate and maintain an invested audience-community. They are encouraged to practice “relational labor,” commodifying their personalities, lives and tastes in order to build authentic self-brands and intimacy with audiences.
In a new article, a researcher from Microsoft draws on an ethnographic study of the London influencer industry to examine relational labor through an intersectional feminist lens, exploring the ways in which structural inequalities shape relationships between creators and their audiences. Managing audience relationships is harder for marginalized creators – especially those making stigmatized and less brandable content genres – who are at higher risk of trolling and harassment.
This article explores four key tactics for managing such conditions: (1) leaning into making rather than being content; (2) (dis)engaging with anti-fans through silence; (3) retreating into private community spaces, away from the exposure of public platforms; and, in parallel, (4) turning off public comments.