Trending Research

UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

ali-vilab/unianimate-dit • 15 Apr 2025

Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment.

Image Animation

157

1.50 stars / hour

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

lizonghang/prima.cpp • 7 Apr 2025

The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices.

Quantization

403

1.46 stars / hour

REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

End2End-Diffusion/REPA-E • 14 Apr 2025

We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process.

103

1.00 stars / hour

Liquid: Language Models are Scalable Multi-modal Generators

foundationvision/liquid • 5 Dec 2024

We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language.

Language Modeling, Language Modelling +2

515

0.99 stars / hour

Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion

nevsnev/fgdvi • 1 Dec 2024

Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch.

Denoising, Optical Flow Estimation +1

189

0.98 stars / hour

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

microsoft/bitnet • 17 Feb 2025

The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs.

13,883

0.77 stars / hour

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

sakanaai/ai-scientist-v2 • 10 Apr 2025

AI is increasingly playing a pivotal role in transforming how scientific discoveries are made.

scientific discovery

640

0.74 stars / hour

UniK3D: Universal Camera Monocular 3D Estimation

lpiccinelli-eth/UniK3D • 20 Mar 2025

Monocular 3D estimation is crucial for visual perception.

Ranked #2 on Monocular Depth Estimation on KITTI Eigen split

3D Reconstruction, Disentanglement +1

373

0.72 stars / hour

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

bytedance/ui-tars • 21 Jan 2025

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations).

4,129

0.71 stars / hour

BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

HorizonRobotics/BIP3D • 22 Nov 2024

In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments.

3D visual grounding

111

0.68 stars / hour
