1. Baichuan-Omni-1.5 Technical Report
- Author
Li, Yadong, Liu, Jun, Zhang, Tao, Chen, Song, Li, Tianpeng, Li, Zehuan, Liu, Lijun, Ming, Lingfeng, Dong, Guosheng, Pan, Da, Li, Chong, Fang, Yuanbo, Kuang, Dongdong, Wang, Mingrui, Zhu, Chenglin, Zhang, Youwei, Guo, Hongyu, Zhang, Fengyu, Wang, Yuran, Ding, Bowen, Song, Wei, Li, Xu, Huo, Yuqi, Liang, Zheng, Zhang, Shusen, Wu, Xin, Zhao, Shuai, Xiong, Linchu, Wu, Yozhen, Ye, Jiahui, Lu, Wenhao, Li, Bowen, Zhang, Yan, Zhou, Yaqi, Chen, Xin, Su, Lei, Zhang, Hongda, Chen, Fuzhong, Dong, Xuezhen, Nie, Na, Wu, Zhiying, Xiao, Bin, Li, Ting, Dang, Shunya, Zhang, Ping, Sun, Yijia, Wu, Jincheng, Yang, Jinjie, Lin, Xionghai, Ma, Zhi, Wu, Kegeng, Li, Jia, Yang, Aiyuan, Liu, Hui, Zhang, Jianqiang, Chen, Xiaoxi, Ai, Guangwei, Zhang, Wentao, Chen, Yicong, Huang, Xiaoqin, Li, Kun, Luo, Wenjing, Duan, Yifei, Zhu, Lingling, Xiao, Ran, Su, Zhe, Pu, Jiani, Wang, Dian, Jia, Xu, Zhang, Tianyu, Ai, Mengyu, Wang, Mang, Qiao, Yujing, Zhang, Lei, Shen, Yanjun, Yang, Fan, Zhen, Miao, Zhou, Yijie, Chen, Mingyang, Li, Fei, Zhu, Chenzheng, Lu, Keer, Zhao, Yaqi, Liang, Hao, Li, Youquan, Qin, Yanzhao, Sun, Linzhuang, Xu, Jianhua, Sun, Haoze, Lin, Mingan, Zhou, Zenan, and Chen, Weipeng
- Subjects
Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent, high-quality interaction across modalities without compromising the capabilities of any single modality, we prioritize optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, we design an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the multimodal large language model (MLLM). Lastly, we design a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
- Published
2025