<?xml version="1.1" encoding="utf-8"?>
<article xsi:noNamespaceSchemaLocation="http://jats.nlm.nih.gov/publishing/1.1/xsd/JATS-journalpublishing1-mathml3.xsd" dtd-version="1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><front><journal-meta><journal-id journal-id-type="publisher-id">ETI</journal-id><journal-title-group><journal-title>Education and Teaching Innovation</journal-title></journal-title-group><issn>2995-4894</issn><eissn>2995-4908</eissn><publisher><publisher-name>Art and Design</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.61369/ETI.2025100037</article-id><article-categories><subj-group subj-group-type="heading"><subject>Article</subject></subj-group></article-categories><title>AI-Driven Data Annotation: A Paradigm Revolution from Assistant to Leader</title><url>https://artdesignp.com/journal/ETI/3/10/10.61369/ETI.2025100037</url><author>何源</author><pub-date pub-type="publication-year"><year>2025</year></pub-date><volume>3</volume><issue>10</issue><history><date date-type="pub"><published-time>2025-10-20</published-time></date></history><abstract>As the starting point of AI model training, data annotation directly determines the upper bound of an algorithm's capability through its efficiency and quality. Purely manual annotation suffers from high cost, poor scalability, and unstable quality, creating an "annotation bottleneck" that constrains the development of AI. This paper systematically reviews the evolution of AI-enabled data annotation and advances a core thesis: AI's role in data annotation is shifting from "assistant" to "protagonist". We first analyze the cost-quality-scale trilemma of traditional annotation. We then examine the AI-assisted annotation paradigm represented by Active Learning, assessing its effectiveness and the inherent dilemma of the "active learning paradox". Next, we focus on the new paradigm represented by LLMs and vision foundation models, analyzing how zero-shot and few-shot capabilities are reshaping annotation workflows, as well as the new problems they introduce, such as the "generalization-specialization gap". Finally, we discuss algorithmic bias and model reliability under the new paradigm and look ahead to new modes of human-machine collaboration. We argue that the core of future data annotation lies in building a strong human-machine collaborative ecosystem, in which foundation models handle work at scale while human experts provide higher-level supervision, governance, and alignment.</abstract><keywords>data annotation,artificial intelligence,active learning,foundation models,large language models,human-machine collaboration</keywords></article-meta></front><body/><back/></article>
