<?xml version="1.1" encoding="utf-8"?>
<article xsi:noNamespaceSchemaLocation="http://jats.nlm.nih.gov/publishing/1.1/xsd/JATS-journalpublishing1-mathml3.xsd" dtd-version="1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><front><journal-meta><journal-id journal-id-type="publisher-id">ETI</journal-id><journal-title-group><journal-title>Education and Teaching Innovation</journal-title></journal-title-group><issn>2995-4894</issn><eissn>2995-4908</eissn><publisher><publisher-name>Art and Design</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.61369/ETI.2025100037</article-id><article-categories><subj-group subj-group-type="heading"><subject>Article</subject></subj-group></article-categories><title>AI-Driven Data Annotation: A Paradigm Revolution from Assistant to Leader</title><url>https://artdesignp.com/journal/ETI/3/10/10.61369/ETI.2025100037</url><author>何源</author><pub-date pub-type="publication-year"><year>2025</year></pub-date><volume>3</volume><issue>10</issue><history><date date-type="pub"><published-time>2025-10-20</published-time></date></history><abstract>As the starting point of AI model training, data annotation directly determines the upper bound of an algorithm's capability through its efficiency and quality. Purely manual annotation suffers from high cost, poor scalability, and unstable quality, creating an "annotation bottleneck" that constrains the development of AI. This paper systematically reviews the evolution of AI-enabled data annotation and advances a core thesis: AI's role in data annotation is shifting from "assistant" to "protagonist". We first analyze the cost-quality-scale trilemma of traditional annotation. We then examine the AI-assisted annotation paradigm represented by Active Learning, assessing its effectiveness and the inherent dilemma of the "active learning paradox". Next, we focus on the new paradigm represented by LLMs and vision foundation models, analyzing how zero-shot and few-shot capabilities are reshaping annotation workflows, as well as the new problems they introduce, such as the "generalization-specialization gap". Finally, we discuss algorithmic bias and model reliability under the new paradigm and look ahead to new modes of human-machine collaboration. We argue that the core of future data annotation lies in building a strong human-machine collaborative ecosystem, in which foundation models handle work at scale while human experts provide higher-level supervision, governance, and alignment.</abstract><keywords>data annotation,artificial intelligence,active learning,foundation models,large language models,human-machine collaboration</keywords></article-meta></front><body/><back/></article>
