This post has been republished via RSS; it originally appeared at: Microsoft Research.
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Large language model (LLM) inference consists of two distinct phases: the prefill phase, which processes the input prompt, and the decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates graphics processing unit (GPU) compute even at small batch sizes, the decode phase results in low compute utilization because it generates only one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to pipeline bubbles.
In a new paper: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, researchers from Microsoft present a solution to these challenges that yields significant improvements in inference performance across models and hardware. SARATHI employs chunked-prefills, which split a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. Chunked-prefills allow constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles.
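The batching idea described above can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `Batch`, `chunk_prefill`, and `build_batches` are hypothetical, the prefill chunk is assumed to occupy one batch slot, each decode request is assumed to produce one token per batch it joins, and the final chunk is simply allowed to be shorter when the prompt length is not a multiple of the chunk size.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Batch:
    prefill_chunk: range    # prompt token positions covered by this chunk
    decode_ids: List[str]   # decode requests piggybacking on this batch


def chunk_prefill(prompt_len: int, chunk_size: int) -> List[range]:
    """Split a prompt into chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        range(start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def build_batches(
    prompt_len: int,
    chunk_size: int,
    decode_remaining: Dict[str, int],
    batch_slots: int,
) -> List[Batch]:
    """Build one batch per prefill chunk and fill spare slots with decodes.

    decode_remaining maps a (hypothetical) request id to the number of
    tokens that request still has to generate.
    """
    if batch_slots < 1:
        raise ValueError("batch_slots must be at least 1")
    remaining = dict(decode_remaining)   # do not mutate the caller's dict
    decode_slots = batch_slots - 1       # assumption: the chunk uses one slot
    batches = []
    for chunk in chunk_prefill(prompt_len, chunk_size):
        # Pick unfinished decode requests, up to the number of free slots.
        active = [rid for rid, left in remaining.items() if left > 0]
        active = active[:decode_slots]
        for rid in active:
            remaining[rid] -= 1          # one token generated per batch
        batches.append(Batch(chunk, active))
    return batches


batches = build_batches(
    prompt_len=1024,
    chunk_size=256,
    decode_remaining={"a": 2, "b": 5, "c": 1},
    batch_slots=3,
)
for b in batches:
    print(len(b.prefill_chunk), b.decode_ids)
```

In this toy setting, a single 1,024-token prefill yields four batches rather than one, so decode requests get four opportunities to piggyback, and every batch carries the same amount of prefill work.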
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
Controllable video generation has gained significant attention in recent years. However, two main limitations persist. First, most existing works focus on only one of text, image, or trajectory-based control, which prevents fine-grained control in videos. Second, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models’ capability to process open-domain images and effectively handle complex curved trajectories.
In a new paper: DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory, researchers from Microsoft propose an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, DragNUWA simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, the researchers propose trajectory modeling with three aspects: a trajectory sampler (TS) to enable open-domain control of arbitrary trajectories, a multiscale fusion (MF) to control trajectories in different granularities, and an adaptive training (AT) strategy to generate consistent videos following trajectories. Their experiments demonstrate DragNUWA’s superior performance in fine-grained control in video generation.
DragNUWA is purely a research project and there are no current plans to incorporate DragNUWA into a product. Any further research will continue to follow Microsoft AI principles.
Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals
Understanding cortical responses to human visual perception has emerged as a research hotspot. Yet the underlying mechanism of how human visual perception is intertwined with cognition is still a mystery. Thanks to recent advances in both neuroscience and artificial intelligence, researchers have been able to record visually evoked brain activity and mimic the visual perception ability through computational approaches.
In a new paper: Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals, researchers from Microsoft reconstruct observed images based on portably accessible brain signals, i.e., electroencephalography (EEG) data. Since EEG signals are dynamic time series and are notoriously noisy, processing them and extracting useful information requires dedicated effort. The researchers propose a comprehensive pipeline, named NeuroImagen, which incorporates a novel multi-level perceptual information decoding step to draw multi-grained and heterogeneous outputs from the given EEG data. A pretrained latent diffusion model then leverages the extracted semantic information to reconstruct the high-resolution visual stimuli images. The experimental results illustrate the effectiveness of image reconstruction and superior quantitative performance of the proposed method.