[Python Basics in 20 Lectures] Chapter 17: Regular Expressions

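The author, 智算菩萨, writes about artificial intelligence, Python programming, audio and video processing, and UI form application design, with the aim of breaking technology down in an accessible way, from beginner basics to advanced practice. The blog currently runs five technical columns of original articles.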

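17.1 Regular Expression Basics

A regular expression is a formal language for describing string patterns. Python's re module provides regular expression support. Commonly used functions: re.search() finds the first match anywhere in the string, re.match() matches only at the start of the string, re.fullmatch() requires the entire string to match, and re.findall() returns all non-overlapping matches.

The difference between search and match: match only tries the beginning of the string, while search scans every position. Both return a match object, or None when nothing matches. findall returns a list of the matched strings; if the pattern contains one capturing group it returns that group's text for each match, and if it contains several groups it returns a list of tuples.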
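The return shapes described above are easy to mix up, so here is a minimal sketch. The sample string and variable names are made up for illustration.

```python
import re

log = "id=42 name=alice; id=7 name=bob"  # hypothetical sample string

# No group: findall returns the full matches as strings.
no_group = re.findall(r"id=\d+", log)

# One group: findall returns only that group's text.
one_group = re.findall(r"id=(\d+)", log)

# Several groups: findall returns one tuple per match.
two_groups = re.findall(r"id=(\d+) name=(\w+)", log)

# match only tries position 0; search scans the whole string.
at_start = re.match(r"name", log)
anywhere = re.search(r"name", log)

print(no_group)         # ['id=42', 'id=7']
print(one_group)        # ['42', '7']
print(two_groups)       # [('42', 'alice'), ('7', 'bob')]
print(at_start)         # None
print(anywhere.span())  # (6, 10)
```

Because match and search return None on failure, check the result before calling group() on it.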

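17.2 Regular Expression Syntax

Metacharacters: . matches any character except a newline, ^ matches the start, $ matches the end, * matches 0 or more times, + matches 1 or more times, ? matches 0 or 1 time, {n,m} matches n to m times, [] defines a character set, | means "or", () groups, and \ escapes the next character.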
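A few of these metacharacters in action, as a small sketch. The test strings are invented for illustration.

```python
import re

# An unescaped dot matches any character except a newline.
loose = re.fullmatch(r"3.12", "3x12") is not None
# Escaping the dot restricts it to a literal ".".
strict = re.fullmatch(r"3\.12", "3x12") is not None

# ^ and $ anchor the pattern; {2,4} allows two to four repetitions.
four_digits = re.search(r"^\d{2,4}$", "2024") is not None
five_digits = re.search(r"^\d{2,4}$", "20245") is not None

# | picks between alternatives; [] lists the allowed characters.
animals = re.findall(r"cat|dog", "cat dog bird")
greys = re.findall(r"gr[ae]y", "gray grey groy")

# ? makes the preceding item optional.
colours = re.findall(r"colou?r", "color colour")

print(loose, strict)             # True False
print(four_digits, five_digits)  # True False
print(animals)                   # ['cat', 'dog']
print(greys)                     # ['gray', 'grey']
print(colours)                   # ['color', 'colour']
```

Writing patterns as raw strings (r"...") keeps Python from interpreting the backslashes before the re module sees them.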

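Predefined character classes: \d matches a digit, \w a word character, \s a whitespace character, and \b a word boundary. Greedy and non-greedy quantifiers: *, +, ? and {n,m} are greedy by default (they match as much as possible); appending ? (as in *?, +?, ??, {n,m}?) makes them non-greedy (they match as little as possible).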
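The greedy and non-greedy behaviour is easiest to see side by side. This is a small sketch on made-up input:

```python
import re

html = "<b>one</b> and <b>two</b>"  # hypothetical sample string

# Greedy: .* runs to the last </b> it can reach.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? stops at the first </b>.
lazy = re.findall(r"<b>.*?</b>", html)

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \d+ collects digit runs; \s+ splits on any run of whitespace.
numbers = re.findall(r"\d+", "a1 b22 c333")
fields = re.split(r"\s+", "a  b\tc\nd")

print(greedy)       # ['<b>one</b> and <b>two</b>']
print(lazy)         # ['<b>one</b>', '<b>two</b>']
print(whole_words)  # ['cat', 'cat']
print(numbers)      # ['1', '22', '333']
print(fields)       # ['a', 'b', 'c', 'd']
```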

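17.3 Groups and Substitution

Use () to create a capturing group; its content is available through group(n), where group(0) is the whole match. (?P<name>pattern) creates a named group. (?:pattern) creates a non-capturing group. re.sub(pattern, repl, string) performs substitution; repl can be a string or a function that receives the match object and returns the replacement text.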
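As an illustration of named groups and both kinds of repl, here is a short sketch. The date pattern, the sample strings, and the helper name double are made up for the example.

```python
import re

date_re = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

m = re.search(date_re, "released on 2024-01-15")
by_name = m.group("year")   # named access
by_index = m.group(2)       # named groups are still numbered
as_dict = m.groupdict()

# String repl: \g<name> refers back to a named group.
swapped = re.sub(date_re, r"\g<day>/\g<month>/\g<year>", "2024-01-15")

# Function repl: called once per match with the match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# (?:...) groups for repetition without capturing, so findall
# returns the whole match; with (...) it returns the last capture.
non_capturing = re.findall(r"(?:ab)+", "ababab ab")
capturing = re.findall(r"(ab)+", "ababab ab")

print(by_name, by_index)  # 2024 01
print(as_dict)            # {'year': '2024', 'month': '01', 'day': '15'}
print(swapped)            # 15/01/2024
print(doubled)            # 6 apples and 20 pears
print(non_capturing)      # ['ababab', 'ab']
print(capturing)          # ['ab', 'ab']
```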

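re.split(pattern, string) splits a string wherever the pattern matches. re.compile(pattern) precompiles a regular expression into a pattern object, which makes repeated use cheaper and gives the pattern a name; the module-level functions also cache recently used patterns, so the gain shows mostly in hot loops. Matching flags: re.IGNORECASE ignores case, re.MULTILINE makes ^ and $ match at every line, and re.DOTALL makes . match newlines as well.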
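A compiled pattern exposes the same methods as the module-level functions. The following is a minimal sketch; the log text and names such as error_re are invented for illustration.

```python
import re

# Compile once and reuse; flags are combined with |.
error_re = re.compile(r"^error: (.+)$", re.IGNORECASE | re.MULTILINE)

log = "Error: disk full\ninfo: ok\nERROR: timeout"
errors = error_re.findall(log)

# Without DOTALL the dot refuses to cross a newline.
dot_plain = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# split on a comma or semicolon, swallowing surrounding spaces.
parts = re.split(r"\s*[,;]\s*", "a , b;c ,d")

print(errors)                      # ['disk full', 'timeout']
print(dot_plain, dot_all is not None)  # None True
print(parts)                       # ['a', 'b', 'c', 'd']
```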

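17.4 Practical Regular Expression Techniques

Validating an email address: [\w.+-]+@[\w-]+\.[\w.-]+. Validating a mainland China mobile number: 1[3-9]\d{9}. Extracting the content of an HTML tag, for example a div: <div>(.*?)</div>. Matching an IP address: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}. These patterns are deliberately loose: the IP pattern, for instance, also accepts 999.999.999.999, so range checks have to be done separately. Use re.fullmatch() to validate a whole string and re.search() or re.findall() to extract matches from longer text.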
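One way to turn these patterns into validators is sketched below. The function names are hypothetical, and the octet range check is an addition that the patterns themselves do not perform.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"1[3-9]\d{9}")
# Same IP pattern, with each octet captured so it can be range-checked.
IP_RE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")

def is_valid_email(s):
    return EMAIL_RE.fullmatch(s) is not None

def is_valid_phone(s):
    # fullmatch rejects strings with extra characters around the number.
    return PHONE_RE.fullmatch(s) is not None

def is_valid_ipv4(s):
    m = IP_RE.fullmatch(s)
    # The pattern alone accepts 999.1.1.1, so check every octet.
    return m is not None and all(int(octet) <= 255 for octet in m.groups())

print(is_valid_email("test@example.com"))  # True
print(is_valid_phone("13800138000"))       # True
print(is_valid_phone("138001380001"))      # False
print(is_valid_ipv4("192.168.1.100"))      # True
print(is_valid_ipv4("999.1.1.1"))          # False
```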

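Advice for writing regular expressions: pin down the requirement first, then build the pattern step by step; use re.VERBOSE mode to add comments and improve readability; test with a simple pattern first, then add constraints one at a time. Regular expressions are not a cure-all: for complex text-parsing tasks, string methods or a dedicated parsing library are sometimes the better fit.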
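As one example of reaching for a parser instead, the sketch below extracts visible text from HTML with the standard library's html.parser module rather than with regex substitutions. The class name TextExtractor, the helper extract_text, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

sample = "<h1>Title</h1><!-- note --><p>a &lt; b</p><script>alert('x')</script>"
print(extract_text(sample))  # ['Title', 'a < b']
```

The parser handles comments, entity references such as &lt;, and script content without any extra patterns, which is where regex-only approaches tend to become fragile.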

Complete code

"""
Chapter 17: Regular expressions
Demonstrates the re module, matching flags, groups, substitution, and common patterns
"""
import re

# ============================================================
# 1. Basic matching
# ============================================================
print("=" * 50)
print(" Basic matching")
print("=" * 50)

text = "Python 3.12 发布于 2023 年 10 月，Python 2.7 已停止维护。"

# search: find the first match anywhere in the string
m = re.search(r"Python \d+\.\d+", text)
print(f"search: {m.group() if m else 'None'}")

# match: match only at the start of the string
m = re.match(r"Python", text)
print(f"match: {m.group() if m else 'None'}")
m = re.match(r"\d+", text)
print(f"match digits: {m.group() if m else 'None'}")

# fullmatch: the whole string must match
m = re.fullmatch(r"\d+", "12345")
print(f"fullmatch: {m.group() if m else 'None'}")

# findall: list every match
results = re.findall(r"Python", text)
print(f"findall 'Python': {results}")
results = re.findall(r"\d+", text)
print(f"findall digits: {results}")

# finditer: iterate over match objects
for m in re.finditer(r"\d+\.\d+", text):
    print(f"  finditer: {m.group()} at {m.span()}")

# ============================================================
# 2. Matching flags
# ============================================================
print("\n" + "=" * 50)
print(" Matching flags")
print("=" * 50)

text = "Hello\nWorld\nPython"

# re.MULTILINE: ^ and $ match at every line.
# The results are computed outside the f-strings because a backslash
# inside an f-string expression is a syntax error before Python 3.12.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
print(f"MULTILINE ^\\w+: {line_starts}")
print(f"MULTILINE \\w+$: {line_ends}")

# re.DOTALL: . also matches newlines
html = "<div>\n Hello\n</div>"
print(f"DOTALL: {re.findall(r'<div>(.*?)</div>', html, re.DOTALL)}")

# re.IGNORECASE: ignore case
text = "Python python PYTHON"
print(f"IGNORECASE: {re.findall(r'python', text, re.IGNORECASE)}")

# re.VERBOSE: allow whitespace and comments inside the pattern
pattern = r"""
    \b          # word boundary
    \d{4}       # four-digit year
    [-/]        # separator
    \d{1,2}     # month
    [-/]        # separator
    \d{1,2}     # day
    \b          # word boundary
"""
date_text = "日期: 2024-01-15 和 2023/12/25"
print(f"VERBOSE: {re.findall(pattern, date_text, re.VERBOSE)}")

# Combining flags
flags = re.IGNORECASE | re.MULTILINE
text = "Hello\nhello\nWORLD"
print(f"combined flags: {re.findall(r'^hello', text, flags)}")

# ============================================================
# 3. Groups
# ============================================================
print("\n" + "=" * 50)
print(" Groups")
print("=" * 50)

# Basic groups ("岁" means "years old" in the sample data)
text = "张三: 25岁, 北京; 李四: 30岁, 上海"
pattern = r"(\w+):\s*(\d+)岁"
matches = re.findall(pattern, text)
print(f"findall groups: {matches}")

# Named groups
pattern = r"(?P<name>\w+):\s*(?P<age>\d+)岁"
for m in re.finditer(pattern, text):
    print(f"  {m.group('name')}, age {m.group('age')}")

# Non-capturing group (?:...)
text = "abc123def456"
captured = re.findall(r"(\d+)", text)
non_captured = re.findall(r"(?:\d+)", text)
print(f"capturing: {captured}")
print(f"non-capturing: {non_captured}")

# Several groups accessed by index
text = "2024-01-15"
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", text)
if m:
    print(f"full match: {m.group(0)}")
    print(f"year: {m.group(1)}, month: {m.group(2)}, day: {m.group(3)}")
    print(f"groups: {m.groups()}")

# ============================================================
# 4. Substitution
# ============================================================
print("\n" + "=" * 50)
print(" Substitution")
print("=" * 50)

text = "Hello   World    Python   Programming"

# sub: replace every match
result = re.sub(r"\s+", " ", text)
print(f"sub multiple spaces → single space: '{result}'")

# subn: also return the number of replacements
result, count = re.subn(r"\s+", " ", text)
print(f"subn: '{result}', replaced {count} times")

# Replacement with a function
def replacer(match):
    word = match.group()
    return word.upper() if len(word) > 3 else word

text = "the quick brown fox jumps over the lazy dog"
result = re.sub(r"\b\w+\b", replacer, text)
print(f"function replacement: '{result}'")

# Backreference: \1 repeats whatever group 1 matched
text = "hello hello world world python python"
result = re.sub(r"(\w+) \1", r"\1", text)
print(f"dedupe: '{result}'")

# ============================================================
# 5. Splitting
# ============================================================
print("\n" + "=" * 50)
print(" Splitting")
print("=" * 50)

text = "one, two; three|four\tfive"
result = re.split(r"[,;|\t]+", text)
print(f"split: {result}")

# Keep the separators by capturing them
result = re.split(r"([,;|\t]+)", text)
print(f"split keeping separators: {result}")

# Limit the number of splits
text = "a:b:c:d:e"
result = re.split(r":", text, maxsplit=2)
print(f"maxsplit=2: {result}")

# ============================================================
# 6. Common patterns
# ============================================================
print("\n" + "=" * 50)
print(" Common patterns")
print("=" * 50)

patterns = {
    "Email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "Mobile number": r"1[3-9]\d{9}",
    "IP address": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
    "URL": r"https?://[\w\-._~:/?#\[\]@!$&'()*+,;=%]+",
    "Date": r"\d{4}[-/]\d{1,2}[-/]\d{1,2}",
    "Chinese name": r"[\u4e00-\u9fa5]{2,4}",
    "ID card number": r"\d{17}[\dXx]",
}

test_texts = {
    "Email": "联系我: test@example.com 或 admin@mail.cn",
    "Mobile number": "电话: 13800138000 和 19912345678",
    "IP address": "服务器: 192.168.1.100 和 10.0.0.1",
    "URL": "访问 https://www.example.com/path?q=test",
    "Date": "日期: 2024-01-15 和 2023/12/25",
    "Chinese name": "姓名: 张三和李四",
    "ID card number": "ID: 110101199001011234",
}

for name, pattern in patterns.items():
    text = test_texts[name]
    matches = re.findall(pattern, text)
    print(f"{name}: {matches}")

# ============================================================
# 7. Practical example: text cleaner
# ============================================================
print("\n" + "=" * 50)
print(" Practical example: text cleaner")
print("=" * 50)

dirty_text = """
<html>
<body>
    <h1>Python 教程</h1>
    <p>  这是第一段。  有很多   多余的    空格。  </p>
    <p>  联系邮箱: test@example.com  </p>
    <p>  电话: 13800138000  </p>
    <!-- 这是注释 -->
    <script>alert('xss')</script>
</body>
</html>
"""


class TextCleaner:
    def __init__(self, text):
        self.text = text

    def remove_html_tags(self):
        self.text = re.sub(r"<[^>]+>", "", self.text)
        return self

    def remove_html_comments(self):
        self.text = re.sub(r"<!--.*?-->", "", self.text, flags=re.DOTALL)
        return self

    def normalize_whitespace(self):
        # Collapse runs of spaces and tabs, then drop blank lines
        self.text = re.sub(r"[ \t]+", " ", self.text)
        self.text = re.sub(r"\n\s*\n", "\n", self.text)
        return self

    def remove_script_content(self):
        self.text = re.sub(
            r"<script.*?>.*?</script>",
            "",
            self.text,
            flags=re.DOTALL | re.IGNORECASE,
        )
        return self

    def extract_emails(self):
        return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", self.text)

    def extract_phones(self):
        return re.findall(r"1[3-9]\d{9}", self.text)

    def clean(self):
        # Comments and scripts go first so their contents are not left
        # behind as plain text once the tags are stripped
        return (
            self.remove_html_comments()
            .remove_script_content()
            .remove_html_tags()
            .normalize_whitespace()
            .text.strip()
        )


cleaner = TextCleaner(dirty_text)
clean_text = cleaner.clean()
print(f"Cleaned text:\n{clean_text}")
print(f"\nExtracted emails: {cleaner.extract_emails()}")
print(f"Extracted phones: {cleaner.extract_phones()}")

print("\n[Chapter 17] All examples finished")

Experiment log

The following is the actual output from running the code above:

==================================================
 Basic matching
==================================================
search: Python 3.12
match: Python
match digits: None
fullmatch: 12345
findall 'Python': ['Python', 'Python']
findall digits: ['3', '12', '2023', '10', '2', '7']
  finditer: 3.12 at (7, 11)
  finditer: 2.7 at (35, 38)

==================================================
 Matching flags
==================================================
MULTILINE ^\w+: ['Hello', 'World', 'Python']
MULTILINE \w+$: ['Hello', 'World', 'Python']
DOTALL: ['\n Hello\n']
IGNORECASE: ['Python', 'python', 'PYTHON']
VERBOSE: ['2024-01-15', '2023/12/25']
combined flags: ['Hello', 'hello']

==================================================
 Groups
==================================================
findall groups: [('张三', '25'), ('李四', '30')]
  张三, age 25
  李四, age 30
capturing: ['123', '456']
non-capturing: ['123', '456']
full match: 2024-01-15
year: 2024, month: 01, day: 15
groups: ('2024', '01', '15')

==================================================
 Substitution
==================================================
sub multiple spaces → single space: 'Hello World Python Programming'
subn: 'Hello World Python Programming', replaced 3 times
function replacement: 'the QUICK BROWN fox JUMPS OVER the LAZY dog'
dedupe: 'hello world python'

==================================================
 Splitting
==================================================
split: ['one', ' two', ' three', 'four', 'five']
split keeping separators: ['one', ',', ' two', ';', ' three', '|', 'four', '\t', 'five']
maxsplit=2: ['a', 'b', 'c:d:e']

==================================================
 Common patterns
==================================================
Email: ['test@example.com', 'admin@mail.cn']
Mobile number: ['13800138000', '19912345678']
IP address: ['192.168.1.100', '10.0.0.1']
URL: ['https://www.example.com/path?q=test']
Date: ['2024-01-15', '2023/12/25']
Chinese name: ['姓名', '张三和李']
ID card number: ['110101199001011234']

==================================================
 Practical example: text cleaner
==================================================
Cleaned text:
Python 教程
 这是第一段。 有很多 多余的 空格。
 联系邮箱: test@example.com
 电话: 13800138000

Extracted emails: ['test@example.com']
Extracted phones: ['13800138000']

[Chapter 17] All examples finished
