RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation
Summary: Whole-body pose estimation is a challenging task that requires simultaneously predicting keypoints for the body, hands, face, and feet, recovering fine-grained pose information for the entire human body. This information is crucial for human-centric perception, generation, and a range of downstream applications. This work introduces RTMW, a series of high-performance models for both 2D and 3D whole-body pose estimation. Building upon the RTMPose architecture, RTMW incorporates a feature pyramid network (FPN) and a hierarchical encoding module (HEM) to capture pose information across body parts of different scales. Trained on a rich collection of manually aligned human keypoint annotations and enhanced through a two-stage distillation process, RTMW delivers strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency. The RTMW-l model achieves 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. The study also explores RTMW's capabilities in 3D whole-body pose estimation using an image-based monocular coordinate-classification approach.
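Since the summary leans on coordinate classification, a brief sketch may help: RTMPose-style (SimCC) heads turn keypoint localization into classification over discretized bins along each image axis, rather than heatmap regression. Below is a minimal decoding sketch assuming per-axis logits and a sub-pixel bin split factor; the shapes and split ratio are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def decode_simcc(x_logits: np.ndarray, y_logits: np.ndarray, split_ratio=2.0):
    """Decode SimCC-style per-axis classification logits into (x, y) keypoints.

    x_logits: (num_keypoints, W * split_ratio) logits over horizontal bins.
    y_logits: (num_keypoints, H * split_ratio) logits over vertical bins.
    split_ratio: sub-pixel bin split factor (assumed value, for illustration).
    """
    # argmax over bins gives the most likely bin index per keypoint
    x_bins = x_logits.argmax(axis=1)
    y_bins = y_logits.argmax(axis=1)
    # map bin indices back to pixel coordinates in the model input space
    xs = x_bins / split_ratio
    ys = y_bins / split_ratio

    # simple confidence: softmax probability of the chosen bin
    def bin_conf(logits, idx):
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)
        return p[np.arange(len(idx)), idx]

    conf = np.minimum(bin_conf(x_logits, x_bins), bin_conf(y_logits, y_bins))
    return np.stack([xs, ys], axis=1), conf

# e.g. the 133 whole-body keypoints of COCO-Wholebody, for a 288x384 input
kpts, scores = decode_simcc(np.random.randn(133, 576), np.random.randn(133, 768))
```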
Inference Performance Optimization for Large Language Models on CPUs
Summary: Large language models (LLMs) have demonstrated remarkable performance and promise across a wide range of tasks. Nevertheless, deploying high-performance LLMs in resource-constrained environments remains a significant challenge. Given the limited availability of GPU hardware, and the financial and hardware constraints associated with LLMs, exploring CPU-based alternatives and optimizing inference performance are essential. This paper presents a readily deployable solution for accelerating LLM inference on CPUs. The proposed approach reduces KV cache size while preserving precision. A distributed inference optimization strategy is also introduced and implemented using the oneAPI Collective Communications Library (oneCCL). The paper further outlines CPU-specific optimization techniques and applies tailored optimizations to commonly used LLM models.
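The summary does not detail the paper's exact KV cache compression scheme; as a hedged illustration of the general idea (shrinking the cache while preserving precision), the sketch below quantizes cached key/value tensors to int8 with per-token scales and dequantizes on read. The tensor layout and function names are assumptions for illustration.

```python
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Quantize a (seq_len, num_heads, head_dim) float KV tensor to int8.

    Per-token absmax scaling keeps error bounded while cutting memory
    roughly 4x versus float32 (about 2x versus float16).
    """
    flat = kv.reshape(kv.shape[0], -1)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # avoid division by zero on all-zero rows
    q = np.clip(np.round(flat / scale), -127, 127).astype(np.int8)
    return q.reshape(kv.shape), scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float KV tensor from int8 values and scales."""
    flat = q.reshape(q.shape[0], -1).astype(np.float32) * scale
    return flat.reshape(q.shape)

kv = np.random.randn(1024, 32, 128).astype(np.float32)  # cached keys/values
q, s = quantize_kv(kv)
err = np.abs(dequantize_kv(q, s) - kv).max()
print(f"int8 KV cache, max abs reconstruction error: {err:.4f}")
```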
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Summary: Visual instruction tuning has significantly advanced the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs primarily focus on single-image tasks, leaving their application to multi-image scenarios relatively unexplored. Additionally, previous LMM research has addressed different scenarios separately, hindering the development of generalized models with emerging capabilities. To address these limitations, LLaVA-NeXT-Interleave is introduced, capable of handling Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios within a unified LMM framework. By adopting an interleaved data format as a general template, the M4-Instruct dataset is compiled, encompassing 1,177.6k samples across 4 primary domains, 14 tasks, and 41 datasets. Furthermore, the LLaVA-Interleave Bench is curated to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave demonstrates superior performance on multi-image, video, and 3D benchmarks while maintaining performance on single-image tasks. Notably, the model exhibits emerging capabilities such as task transferability across different settings and modalities.
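To make the "interleaved data format as a general template" concrete, a hypothetical M4-Instruct-style training sample might look like the following; the field names and placeholder convention are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical interleaved sample: images and text share one token stream,
# with <image> placeholders marking where visual tokens are spliced in.
sample = {
    "images": ["frame_000.jpg", "frame_015.jpg", "frame_030.jpg"],  # video frames
    "conversations": [
        {
            "role": "user",
            "content": "<image>\n<image>\n<image>\nWhat changes across these frames?",
        },
        {
            "role": "assistant",
            "content": "The car moves from the left lane into the right lane.",
        },
    ],
}

# The same template covers multi-view 3D (images = camera views) and
# single-image multi-patch (images = crops of one picture).
```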
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Summary: Open-source large multimodal models (LMMs) have faced several limitations: the need for adapters to align visual representations with pre-trained large language models (LLMs), restriction to single-modal generation, and reliance on separate diffusion models for visual modeling and generation. To address these challenges, Anole is introduced: an open, autoregressive, native large multimodal model for interleaved image-text generation. Built upon Meta AI's Chameleon with a data-efficient and parameter-efficient fine-tuning strategy, Anole demonstrates high-quality and coherent multimodal generation capabilities.
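A native autoregressive LMM in this style emits text tokens and discrete image tokens from a single transformer, switching modality on special tokens, with generated image codes mapped back to pixels by a learned detokenizer rather than a separate diffusion model. A minimal decoding-loop sketch under those assumptions follows; the token IDs, model interface, and detokenizer are hypothetical.

```python
# Hypothetical special token IDs delimiting an image span in the stream.
EOS, BOI, EOI = 50000, 50001, 50002  # end-of-sequence, begin/end-of-image

def generate_interleaved(model, prompt_ids, detokenize_image, max_len=4096):
    """Autoregressively decode one mixed stream of text and image tokens.

    model.next_token and detokenize_image are assumed interfaces: a greedy
    next-token predictor and a VQ-style decoder from codes to pixels.
    """
    ids, outputs, img_buf, in_image = list(prompt_ids), [], [], False
    while len(ids) < max_len:
        tok = model.next_token(ids)
        ids.append(tok)
        if tok == EOS:
            break
        if tok == BOI:                 # enter image mode: start collecting codes
            in_image, img_buf = True, []
        elif tok == EOI and in_image:  # leave image mode: decode codes to pixels
            outputs.append(("image", detokenize_image(img_buf)))
            in_image = False
        elif in_image:
            img_buf.append(tok)
        else:
            outputs.append(("text", tok))
    return outputs
```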
iLLM-TSC: Integrating reinforcement learning and large language models for traffic signal control policy improvement
Summary: Urban congestion remains a significant challenge, with traffic signal control (TSC) emerging as a promising remedy. While reinforcement learning (RL) has proven effective at modeling TSC as a Markov Decision Process, existing RL-based systems often overlook imperfect observations caused by communication issues and omit rare real-world events from the reward function. To address these limitations, a novel framework integrating a large language model (LLM) with RL is proposed. The framework handles elements overlooked by the reward function and compensates for gaps in state information, thereby improving RL agent policies. The RL component first makes decisions based on observed data; the LLM then evaluates these decisions for reasonableness, and unreasonable decisions are adjusted accordingly. Notably, the integration can be seamlessly incorporated into existing RL-based TSC systems without requiring modifications. Extensive testing demonstrates a 17.5% reduction in average waiting time under degraded communication conditions compared to traditional RL methods, highlighting the approach's potential to advance practical RL applications in intelligent transportation systems.
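The decision flow described above is straightforward to sketch: the RL policy proposes an action from (possibly degraded) observations, an LLM judges whether the action is reasonable given a textual description of the scene, and unreasonable actions are replaced by the LLM's suggestion. The prompt format, model call, and helper names below are illustrative assumptions, not the paper's implementation.

```python
def control_step(rl_policy, llm, observation, describe):
    """One signal-control step: RL proposes, LLM sanity-checks and may override.

    rl_policy : maps an observation to a candidate signal phase (action).
    llm       : callable taking a prompt string and returning a text verdict
                (a hypothetical chat-completion wrapper).
    describe  : renders the raw observation as natural language for the LLM.
    """
    action = rl_policy(observation)  # RL decision from observed (noisy) state
    prompt = (
        f"Traffic state: {describe(observation)}\n"
        f"Proposed signal phase: {action}\n"
        "Is this decision reasonable? Answer 'yes' or give a better phase number."
    )
    verdict = llm(prompt).strip().lower()
    if verdict.startswith("yes"):
        return action                             # keep the RL decision
    return parse_phase(verdict, fallback=action)  # adopt the LLM's adjustment

def parse_phase(text, fallback):
    """Extract a phase index from the LLM reply; fall back to RL's choice."""
    for token in text.split():
        if token.isdigit():
            return int(token)
    return fallback
```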