2026/3/31 9:20:48

Interview Prep: Training a Tokenizer


张小明

Front-end Developer


1 Code
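The heavy lifting in the script below is done by `trainers.BpeTrainer` from the Hugging Face `tokenizers` library. As background, the merge loop that BPE training performs can be sketched in a few lines of plain Python. This is a toy illustration on character tuples with a made-up three-word corpus, not part of the original script; the real trainer operates on raw bytes and is heavily optimized:

```python
# Toy sketch of the BPE training loop: repeatedly count adjacent symbol pairs
# across the corpus and merge the most frequent pair into a new symbol.
from collections import Counter

def bpe_train(words, num_merges):
    """words: list of strings; returns the learned merge rules in order."""
    corpus = Counter(tuple(w) for w in words)  # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        merged = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

merges = bpe_train(["hello", "hell", "help"], num_merges=3)
print(merges)  # [('h', 'e'), ('he', 'l'), ('hel', 'l')]
```

The shared prefix "hel" is assembled greedily, one most-frequent pair at a time, which is exactly the behaviour `vocab_size` caps in the real trainer: training stops once the vocabulary (initial alphabet + merges + special tokens) reaches that size.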

# Note: It is not recommended to re-train the tokenizer ("vocabulary"); MiniMind
# already includes one, and this script is for learning and reference only.
# Models trained with different vocabularies produce mutually incompatible
# outputs, which reduces model reusability in the community.
import os
import json

from tokenizers import decoders, models, pre_tokenizers, trainers, Tokenizer

DATA_PATH = '../dataset/pretrain_hq.jsonl'
TOKENIZER_DIR = '../model_learn_tokenizer/'
VOCAB_SIZE = 6400


def get_texts(data_path):
    """Yield the 'text' field of each JSONL record."""
    with open(data_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= 10000:
                break  # experimental: use only the first 10000 lines for a quick test
            data = json.loads(line)
            yield data['text']


def train_tokenizer(data_path, tokenizer_dir, vocab_size):
    # Byte-level BPE: every input byte is representable, so no true <unk> is needed.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<|im_start|>", "<|im_end|>"],
        show_progress=True,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
    )
    texts = get_texts(data_path)
    tokenizer.train_from_iterator(texts, trainer=trainer)
    tokenizer.decoder = decoders.ByteLevel()

    # Special tokens are assigned the first ids, in the order given to the trainer.
    assert tokenizer.token_to_id("<|endoftext|>") == 0
    assert tokenizer.token_to_id("<|im_start|>") == 1
    assert tokenizer.token_to_id("<|im_end|>") == 2

    os.makedirs(tokenizer_dir, exist_ok=True)
    tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json"))
    tokenizer.model.save(tokenizer_dir)

    config = {
        "add_bos_token": False,
        "add_eos_token": False,
        "add_prefix_space": False,
        "added_tokens_decoder": {
            "0": {"content": "<|endoftext|>", "lstrip": False, "normalized": False, "rstrip": False, "single_word": False, "special": True},
            "1": {"content": "<|im_start|>", "lstrip": False, "normalized": False, "rstrip": False, "single_word": False, "special": True},
            "2": {"content": "<|im_end|>", "lstrip": False, "normalized": False, "rstrip": False, "single_word": False, "special": True}
        },
        "additional_special_tokens": [],
        "bos_token": "<|im_start|>",
        "clean_up_tokenization_spaces": False,
        "eos_token": "<|im_end|>",
        "legacy": True,
        "model_max_length": 32768,
        "pad_token": "<|endoftext|>",
        "sp_model_kwargs": {},
        "spaces_between_special_tokens": False,
        "tokenizer_class": "PreTrainedTokenizerFast",
        "unk_token": "<|endoftext|>",
        # ChatML-style (Qwen-convention) chat template, including tool-calling support.
        "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0].role == 'system' %}\n        {{- messages[0].content + '\\n\\n' }}\n    {%- endif %}\n    {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' -%}\n        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n    {%- else -%}\n        {{- '<|im_start|>system\\nYou are a helpful assistant<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = '' %}\n    {%- endif %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n        {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role + '\\n' + content }}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- '\\n' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- '<tool_call>\\n{\"name\": \"' }}\n                {{- tool_call.name }}\n                {{- '\", \"arguments\": ' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- '}\\n</tool_call>' }}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n    {%- if enable_thinking is defined and enable_thinking is false %}\n        {{- '<think>\\n\\n</think>\\n\\n' }}\n    {%- endif %}\n{%- endif %}"
    }
    with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
    print("Tokenizer training completed.")


def eval_tokenizer(tokenizer_dir):
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    # The Chinese test messages deliberately exercise multi-byte UTF-8 handling.
    messages = [
        {"role": "system", "content": "你是一个优秀的聊天机器人,总是给我正确的回应!"},
        {"role": "user", "content": '你来自哪里?'},
        {"role": "assistant", "content": '我来自地球'}
    ]
    new_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    print('-' * 100)
    print(new_prompt)
    print('-' * 100)
    print('tokenizer vocab size:', len(tokenizer))

    model_inputs = tokenizer(new_prompt)
    print('encoded length:', len(model_inputs['input_ids']))

    response = tokenizer.decode(model_inputs['input_ids'], skip_special_tokens=False)
    print('decode round-trip consistent:', response == new_prompt, "\n")

    print('-' * 100)
    print('streaming decode (byte buffering) test:')
    input_ids = model_inputs['input_ids']
    token_cache = []
    for tid in input_ids:
        token_cache.append(tid)
        current_decode = tokenizer.decode(token_cache)
        # Flush the buffer only once it decodes to complete UTF-8 (no U+FFFD).
        if current_decode and '\ufffd' not in current_decode:
            display_ids = token_cache[0] if len(token_cache) == 1 else token_cache
            raw_tokens = [tokenizer.convert_ids_to_tokens(int(t)) for t in token_cache]
            print(f'Token ID: {str(display_ids):15} -> Raw: {str(raw_tokens):20} -> Decode Str: {current_decode}')
            token_cache = []


if __name__ == '__main__':
    train_tokenizer(DATA_PATH, TOKENIZER_DIR, VOCAB_SIZE)
    eval_tokenizer(TOKENIZER_DIR)
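The byte-buffering check in `eval_tokenizer` exists because a byte-level BPE token can end partway through a multi-byte UTF-8 character, so decoding a partial token buffer yields the U+FFFD replacement character. A minimal stdlib-only sketch of the effect (the sample character is an arbitrary illustration, not taken from the script):

```python
# One CJK character occupies three UTF-8 bytes; a byte-level tokenizer may split
# those bytes across token boundaries, so a partial decode is not valid text.
text = "地"                  # U+5730
data = text.encode("utf-8")  # b'\xe5\x9c\xb0'

# Decoding only the first two bytes cannot form a complete character,
# producing the U+FFFD replacement character ...
partial = data[:2].decode("utf-8", errors="replace")
assert "\ufffd" in partial

# ... so the streaming loop keeps buffering until the decode is clean, then flushes.
full = data.decode("utf-8", errors="replace")
assert full == text
print(repr(partial), "->", repr(full))
```

This is the same heuristic the script applies at the token level: append ids to `token_cache`, decode, and only print and reset the cache once the decoded string contains no replacement character.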