Publications

You can also find my articles on my Google Scholar profile.

ZeroGUI

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Yang, C.*, Su, S.*, Liu, S.*, Dong, X.*, Yu, Y.*, Su, W.*, ... & Dai, J.
Preprint / Paper / Model / Code
ZeroGUI is a scalable, online learning framework that trains vision-based GUI Agents without human supervision by using Vision-Language Models (VLMs) for automatic task generation, reward estimation, and reinforcement learning.

HoVLE

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Tao, C.*, Su, S.*, Zhu, X.*, Zhang, C., Chen, Z., Liu, J., ... & Dai, J.
CVPR 2025 / Paper / Model
HoVLE is a high-performance monolithic Vision-Language Model that uses a insightful holistic embedding module to effectively integrate vision and language, outperforming previous models.

Data Scaling Laws

Learning 1D Causal Visual Representation with De-focus Attention Networks

Tao, C.*, Zhu, X.*, Su, S.*, Lu, L., Tian, C., Luo, X., ... & Dai, J.
NeurIPS 2024 / Paper / Code
De-focus Attention Networks are introduced to address the “over-focus” issue in 1D causal vision models by using several inspiring ideas, enabling 1D causal vision models to match 2D models in performance.