Learning Python: Week 4 Summary

Regular Expressions

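Python offers two ways to use regular expressions:

Call a module-level function to do the matching directly, without creating a regular expression object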

match
fullmatch

Create a regular expression object (Pattern) and do the matching by calling its methods

compile

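Example: a website requires that a username consist only of letters, digits, and underscores and be 6 to 20 characters long. How do we check whether a username is valid?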

import re


username = input('Enter a username: ')
username_pattern = re.compile(r'^\w{6,20}$')
print(type(username_pattern))
matcher = username_pattern.match(username)
print(type(matcher))
if matcher is None:
    print('Invalid username!')
else:
    print(matcher.group())
# The same check with the module-level function instead of a Pattern object:
# matcher = re.match(r'\w{6,20}$', username)
# if matcher is None:
#     print('Invalid username!')
# else:
#     print(matcher)
#     print(matcher.group())


# qq = input('Enter a QQ number: ')
# matcher = re.fullmatch(r'[1-9]\d{4,10}', qq)
# if matcher is None:
#     print('Invalid QQ number!')
# else:
#     print(matcher)
#     print(matcher.group())
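
One caveat about the username check: for `str` patterns in Python 3, `\w` also matches non-ASCII letters and digits, so a username written in Chinese characters would pass. The sketch below assumes the intent is ASCII letters, digits, and underscores only, and uses `fullmatch` together with the `re.ASCII` flag. The helper name `is_valid_username` and the sample usernames are made up for illustration.

```python
import re

# Hypothetical helper, not part of the original notes.
# Assumption: only ASCII letters, digits, and underscores should be accepted.
USERNAME_PATTERN = re.compile(r'\w{6,20}', flags=re.ASCII)


def is_valid_username(username):
    """Return True if username is 6-20 ASCII letters, digits, or underscores."""
    # fullmatch requires the whole string to match, so ^ and $ are not needed
    return USERNAME_PATTERN.fullmatch(username) is not None


print(is_valid_username('jack_2105'))      # True
print(is_valid_username('abc'))            # False: too short
print(is_valid_username('用戶名用戶名'))     # False because of re.ASCII
```

Without `re.ASCII`, the last call would return `True`, since each of those characters counts as a word character.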

import re
content = """Emergency number: 110, our class is Python-2105,
my QQ is 123456, and my mobile number is 15581572054. Thanks!"""
# matcher = re.search(r'1[3-9]\d{9}', content)
# if not matcher:
#     print('No mobile number found')
# else:
#     print(matcher.group())

pattern = re.compile(r'\d+')
matcher = pattern.search(content)
while matcher:
    print(matcher.group())
    print(matcher.start(), matcher.end())
    matcher = pattern.search(content, matcher.end())

results = pattern.findall(content)
for result in results:
    print(result)

results = re.findall(r'\d+', content)
for result in results:
    print(result)
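
The `while` loop with repeated `search` calls can also be written with `finditer`, which yields one match object per match and so keeps the start and end positions that `findall` discards. This is a minimal sketch; the sample string is a shortened stand-in for the text used above.

```python
import re

# Shortened sample text, made up for this sketch
content = 'Emergency number: 110, my QQ is 123456, my mobile number is 15581572054.'

# finditer yields match objects for non-overlapping matches, left to right
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', content)]
for text, start, end in found:
    print(text, start, end)
```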

The table below briefly summarizes some of the basic symbols used in regular expressions.

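Symbol	Meaning	Example	Notes
`.`	Matches any character except a newline	`b.t`	Matches bat / but / b#t / b1t, etc.
`\w`	Matches a letter, digit, or underscore	`b\wt`	Matches bat / b1t / b_t, etc., but not b#t
`\s`	Matches a whitespace character (including \r, \n, \t, etc.)	`love\syou`	Matches love you
`\d`	Matches a digit	`\d\d`	Matches 01 / 23 / 99, etc.
`\b`	Matches a word boundary	`\bThe\b`
`^`	Matches the start of the string	`^The`	Matches strings that start with The
`$`	Matches the end of the string	`\.exe$`	Matches strings that end with .exe
`\W`	Matches any character that is not a letter, digit, or underscore	`b\Wt`	Matches b#t / b@t, etc., but not but / b1t / b_t
`\S`	Matches a non-whitespace character	`love\Syou`	Matches love#you, etc., but not love you
`\D`	Matches a non-digit	`\d\D`	Matches 9a / 3# / 0F, etc.
`\B`	Matches a position that is not a word boundary	`\Bio\B`
`[]`	Matches any single character from the set	`[aeiou]`	Matches any one vowel
`[^]`	Matches any single character not in the set	`[^aeiou]`	Matches any one character that is not a vowel
`*`	Matches 0 or more times	`\w*`
`+`	Matches 1 or more times	`\w+`
`?`	Matches 0 or 1 time	`\w?`
`{N}`	Matches exactly N times	`\w{3}`
`{M,}`	Matches at least M times	`\w{3,}`
`{M,N}`	Matches at least M and at most N times	`\w{3,6}`
`|`	Alternation	`foo|bar`	Matches foo or bar
`(?#)`	Comment
`(exp)`	Matches exp and captures it in an automatically numbered group
`(?P<name>exp)`	Matches exp and captures it in a group named name (Python syntax)
`(?:exp)`	Matches exp without capturing the matched text
`(?=exp)`	Matches a position that is followed by exp	`\b\w+(?=ing)`	Matches danc in I'm dancing
`(?<=exp)`	Matches a position that is preceded by exp	`(?<=\bdanc)\w+\b`	Matches the first ing in I love dancing and reading
`(?!exp)`	Matches a position that is not followed by exp
`(?<!exp)`	Matches a position that is not preceded by exp
`*?`	Repeats any number of times, but as few times as possible	`a.*b` `a.*?b`	Applied to aabab, the former matches the whole string aabab, while the latter matches the two substrings aab and ab
`+?`	Repeats 1 or more times, but as few times as possible
`??`	Repeats 0 or 1 time, but as few times as possible
`{M,N}?`	Repeats M to N times, but as few times as possible
`{M,}?`	Repeats M or more times, but as few times as possible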
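
A few rows of the table are easier to see in action. The following sketch runs the greedy and lazy patterns on aabab, the lookahead example on I'm dancing, and a named group in Python's `(?P<name>...)` spelling. The phone-number string is made up for illustration.

```python
import re

# Greedy vs. lazy: .* takes as much as possible, .*? as little as possible
greedy = re.findall(r'a.*b', 'aabab')
lazy = re.findall(r'a.*?b', 'aabab')
print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']

# Lookahead: match a word stem only when "ing" follows it
stem = re.search(r'\b\w+(?=ing)', "I'm dancing")
print(stem.group())  # danc

# Named groups: Python spells them (?P<name>...); the sample number is made up
m = re.search(r'(?P<area>\d{3})-(?P<number>\d{8})', 'Tel: 028-12345678')
print(m.group('area'), m.group('number'))  # 028 12345678
```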

Regular Expression Capture Groups

Getting news titles and links from a web page

import re
from urllib.parse import urljoin

import requests

# Match the whole <a> tag, but capture only the parts in parentheses (the capture groups)
pattern = re.compile(r'<a\s.*?href="(.+?)".*?title="(.+?)".*?>')
resp = requests.get('https://www.sohu.com/')
results = pattern.findall(resp.text)
for href, title in results:
    print(title)
    # urljoin resolves relative and protocol-relative links
    # and leaves absolute URLs (including other sites) unchanged
    print(urljoin('https://www.sohu.com/', href))
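
To see what the capture groups return without making a network request, the same pattern can be tried on a small HTML string. This is only a sketch: the HTML snippet and its URLs are invented, and the group names `href` and `title` are added here for readability.

```python
import re

# Made-up HTML standing in for a downloaded page
html = '''
<a class="news" href="/a/1001" title="First headline">First</a>
<a href="https://example.com/b/2002" target="_blank" title="Second headline">Second</a>
'''

pattern = re.compile(r'<a\s.*?href="(?P<href>.+?)".*?title="(?P<title>.+?)".*?>')

# With several groups in the pattern, findall returns a list of tuples
pairs = pattern.findall(html)
print(pairs)

# finditer gives match objects, so the groups can be read by name
links = {m.group('title'): m.group('href') for m in pattern.finditer(html)}
print(links)
```

Regular expressions are adequate for quick experiments like this one, but they are fragile against real-world HTML, where attribute order and quoting vary.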

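Python's Support for Regular Expressions

Python provides the re module for regular expression operations. The core functions and flags of the re module are listed below.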

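Function	Description
`compile(pattern, flags=0)`	Compiles a regular expression and returns a pattern object
`match(pattern, string, flags=0)`	Matches the pattern at the start of the string; returns a match object on success, otherwise `None`
`search(pattern, string, flags=0)`	Searches the string for the first place where the pattern matches; returns a match object on success, otherwise `None`
`split(pattern, string, maxsplit=0, flags=0)`	Splits the string on the separators matched by the pattern and returns a list
`sub(pattern, repl, string, count=0, flags=0)`	Replaces the parts of the string that match the pattern with the given string; `count` limits the number of replacements
`fullmatch(pattern, string, flags=0)`	The whole-string version of `match` (the match must span the string from start to end)
`findall(pattern, string, flags=0)`	Finds all matches of the pattern in the string and returns a list of strings (a list of tuples if the pattern has more than one group)
`finditer(pattern, string, flags=0)`	Finds all matches of the pattern in the string and returns an iterator of match objects
`purge()`	Clears the cache of implicitly compiled regular expressions
`re.I` / `re.IGNORECASE`	Flag for case-insensitive matching
`re.M` / `re.MULTILINE`	Flag for multi-line matching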
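
The short sketch below exercises a few entries from the table that the earlier examples do not cover: `split`, `sub` with `count`, and the two flags. All input strings are made up for illustration.

```python
import re

# split: break a string on any run of commas, semicolons, or whitespace
parts = re.split(r'[,;\s]+', 'apple, banana;cherry  durian')
print(parts)  # ['apple', 'banana', 'cherry', 'durian']

# sub with count: replace only the first two matches
masked = re.sub(r'\d', '#', 'a1b2c3', count=2)
print(masked)  # a#b#c3

# re.I: ignore case while matching
any_case = re.findall(r'python', 'Python PYTHON python', flags=re.I)
print(any_case)  # ['Python', 'PYTHON', 'python']

# re.M: ^ and $ match at the start and end of every line, not just the string
first_words = re.findall(r'^\w+', 'one two\nthree four', flags=re.M)
print(first_words)  # ['one', 'three']
```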

Filtering Inappropriate Content

import re

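# Sample text mixing a person's name with Chinese and English profanity
content = '馬化騰是一個沙雕煞筆，FUck you！'
pattern = re.compile(r'[傻沙煞][逼筆雕鄙]|馬化騰|fuck|shit', flags=re.IGNORECASE)
# Equivalent one-off call without a precompiled pattern:
# modified_content = re.sub(r'[傻沙煞][逼筆雕鄙]|馬化騰|fuck|shit', '*', content, flags=re.I)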
modified_content = pattern.sub('*', content)
print(modified_content)
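
The code above replaces every hit with a single `*`. If the masked text should keep its original length, `sub` also accepts a function as the replacement. The following is a minimal sketch of that variant; the word list and the names `BLOCKED`, `mask`, and `censor` are hypothetical.

```python
import re

# Hypothetical, deliberately mild word list used only for illustration
BLOCKED = re.compile(r'darn|heck', flags=re.IGNORECASE)


def mask(match):
    """Return as many asterisks as the matched word has characters."""
    return '*' * len(match.group())


def censor(text):
    # When repl is a function, sub calls it once per match object
    return BLOCKED.sub(mask, text)


print(censor('DARN it, what the Heck!'))  # **** it, what the ****!
```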
