Hang Hua

Email:

hhua2 [A-T] cs.rochester [D-O-T] edu

<aside> <img src="notion://custom_emoji/16b80f2e-3c9f-4c04-963e-223f53bce4d5/155c5b68-f629-802a-a0fa-007a2920d110" alt="notion://custom_emoji/16b80f2e-3c9f-4c04-963e-223f53bce4d5/155c5b68-f629-802a-a0fa-007a2920d110" width="40px" /> **Google Scholar**

</aside>

<aside> <img src="https://prod-files-secure.s3.us-west-2.amazonaws.com/16b80f2e-3c9f-4c04-963e-223f53bce4d5/cd361d2a-9856-49d3-bfa3-abc0d900c2fa/25231.png" alt="https://prod-files-secure.s3.us-west-2.amazonaws.com/16b80f2e-3c9f-4c04-963e-223f53bce4d5/cd361d2a-9856-49d3-bfa3-abc0d900c2fa/25231.png" width="40px" /> GitHub

</aside>

<aside> <img src="notion://custom_emoji/16b80f2e-3c9f-4c04-963e-223f53bce4d5/155c5b68-f629-80e1-adbc-007af1442f89" alt="notion://custom_emoji/16b80f2e-3c9f-4c04-963e-223f53bce4d5/155c5b68-f629-80e1-adbc-007af1442f89" width="40px" /> LinkedIn

</aside>

👋 Hi!

I am Hang Hua, a research scientist at the MIT-IBM Watson AI Lab. I obtained my PhD from the University of Rochester, advised by Prof. Jiebo Luo (Fellow of ACM/AAAI/IEEE/NAI/AIMBE/IAPR/SPIE). Prior to UR, I obtained my master's degree from Peking University and my bachelor's degree from South China University of Technology.

🌋 Research Interests

My research focuses on GenAI, with a particular emphasis on Multimodal LLMs (MLLMs) and Pre-trained Language Models (PLMs). I investigate core limitations of MLLMs and PLMs, such as compositionality, fine-grained visual perception, robustness, and reasoning, that cannot be overcome by scaling alone. To address these challenges, I develop diagnostic benchmarks to assess MLLMs' capabilities and design new MLLMs with enhanced competencies. More specifically:

Building New Diagnostic Benchmarks: I have systematically studied the failures of MLLMs in fine-grained visual compositional perception and reasoning. By developing new diagnostic benchmarks, I highlight the limitations of large-scale MLLMs, analyze the underlying causes, and offer insights for improving future model design and training. Related publications include (1) MMComposition, (2) VidComposition (CVPR 2025), (3) MMPerspective (NeurIPS 2025), (4) MMIG-Bench (NeurIPS 2025), and (5) FineMatch (ECCV 2024).
Designing New MLLMs with Enhanced Capabilities: I advance the frontiers of MLLMs by addressing critical real-world challenges, particularly in contexts where high-resolution and detailed visual information is crucial to models' capabilities. My work focuses on (1) enhancing models' compositional and regional visual understanding, e.g., FineCaption (CVPR 2025), and (2) enabling MLLMs to process long video sequences efficiently for fine-grained temporal understanding, e.g., V2Xum-LLM (AAAI 2025) and VideoXum (TMM 2023).
Improving the Generalization and Robustness of Pre-trained Language Models: For pre-trained language models such as BERT and T5, fine-tuning is unstable when only a small number of training samples is available. This brittleness often shows up as sensitivity to random seeds. I investigate the relationship between noise stability and the generalizability of BERT, and propose the LNSR (NAACL 2021) and In-manifold LNSR (TNNLS 2023) regularizers for smoothing the function learned by the network.

<aside> 🌟 What’s NEW

☑️ Feb. 02, 2026 🚀🚀🚀 Two papers accepted by ICLR 2026!

☑️ Sep. 18, 2025 🚀🚀🚀 Three papers accepted by NeurIPS 2025!

☑️ Apr. 07, 2025 📣📣 Please check out our new model and paper: CAT-V, a new fine-grained object-centric video captioning model!

☑️ Feb. 26, 2025 📣 🚀🚀 Two papers (FineCaption, VidComposition) accepted by CVPR 2025. See you in Nashville!

☑️ Jan. 02, 2025 📣 Excited to introduce our new survey paper — Generative AI for Cel-Animation: A Survey.

☑️ Dec. 09, 2024 🚀🚀 Two papers (**V2Xum-LLM, AVicuna**) accepted by AAAI 2025—see you in Philadelphia!

☑️ Nov. 21, 2024 📣 We propose FineCaption, a novel vision-language model with improved capabilities in attribute-aware regional captioning, regional dense captioning, and comprehensive global image captioning.

☑️ Oct. 2, 2024 📣 We release MMComposition, a new benchmark for evaluating the compositionality of MLLMs.

☑️ Jul. 1, 2024 🚀🚀 FineMatch is accepted by ECCV 2024!

</aside>

📜 Selected Publications

Please see my Google Scholar profile for the full list.

(*: equal contribution, 🔥: highlight)

🔥PROMPTCAP: Prompt-Guided Image Captioning for VQA with GPT-3

Hang Hua*, Yushi Hu*, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo

ICCV 2023. [paper][code]

🔥MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo

ArXiv 2024. [paper][code]

🔥FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo

CVPR 2025. [paper][code]

🔥V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo

AAAI 2025. [paper][code]

🔥FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

ECCV 2024. [paper][code]

🔥Generative AI for Cel-Animation: A Survey

Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Chenliang Xu

ArXiv 2025. [paper][code]

🔥VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yunlong Tang*, Junjia Guo*, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu

CVPR 2025. [paper][code]

PromptFix: You Prompt and We Fix the Photo

Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, Jiebo Luo

NeurIPS 2024. [paper][code]

BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical Analysis

Shuhang Lin, Wenyue Hua, Lingyao Li, Jianchao Ji, Lizhou Fan, Hang Hua, Jiebo Luo, Yongfeng Zhang

EMNLP 2024 Demo Track. [paper][code]

Emo-Avatar: Efficient Monocular Video Style Avatar through Texture Rendering

Pinxin Liu, Luchuan Song, Daoan Zhang, Hang Hua, Yunlong Tang, Huaijin Tu, Jiebo Luo, Chenliang Xu

3D Vision 2025. [paper]

Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization

Hang Hua, Xingjian Li, Dejing Dou, Chengzhong Xu, Jiebo Luo

TNNLS 2023 (IF: 10.4). [paper]

VideoXum: Cross-modal Visual and Textural Summarization of Videos

Hang Hua*, Jingyang Lin*, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo

TMM 2023 (IF: 8.4). [paper][code]

Noise Stability Regularization for Improving BERT Fine-tuning

Hang Hua, Xingjian Li, Dejing Dou, Chengzhong Xu, Jiebo Luo

NAACL-HLT 2021. [paper]

Controllable Unsupervised Text Attribute Transfer via Editing Entangled Latent Representation

Ke Wang, Hang Hua, Xiaojun Wan

NeurIPS 2019. [paper][code]

📚 Professional Service:

Workshop/Contest Organizer:
- IEEE Workshop on Visual-Language Alignment in Text-Guided Multi-Modal Generation
- VQualA Competition @ ICCV 2025
Program Committee / Reviewer: ACL Rolling Review, ICCV 2025, ICLR 2025, ECCV 2024, CVPR 2024-2026, NeurIPS 2023-2025, ACL 2023-2025, EMNLP 2021-2024, AAAI 2023-2026, ACM Multimedia 2024-2025, WACV 2026, AISTATS 2024-2025, AACL 2020-2023, ICASSP 2023-2024, ACM Multimedia Asia 2021-2025, ACM Transactions on Intelligent Systems and Technology, ACM Transactions on the Web, IEEE TCSVT.

🏆Awards: