re.1 - Common Regular Expression Rules - 有解無憂

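I. What a Regular Expression Is

Definition: an advanced pattern-matching mechanism for text that supports searching, replacing, and similar operations. A regular expression is itself a string made up of ordinary characters and special symbols.
How matching works: ordinary characters and special symbols together describe the repetition, position, and kind of characters, so that a single pattern matches a whole class of strings.
Characteristics: convenient for text processing, supported by many languages, and flexible in use.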

II. Regex Syntax

re.findall(pattern,string)

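- Purpose: find every match of a regular expression in a string
- Parameters
  - pattern: the regular expression
  - string: the target string
- Return value: a list of all non-overlapping matches (if the pattern contains capture groups, the list holds the group contents instead of the whole match)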
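The return value of re.findall depends on whether the pattern has capture groups. The sketch below illustrates the three cases; the sample text and variable names are made up for illustration.

```python
import re

text = "x=1, y=22"  # hypothetical sample input

# No capture group: each item is the whole matched text.
whole = re.findall(r"\w=\d+", text)

# One capture group: each item is only the group's text.
values = re.findall(r"\w=(\d+)", text)

# Two capture groups: each item is a tuple of the groups.
pairs = re.findall(r"(\w)=(\d+)", text)

print(whole)   # ['x=1', 'y=22']
print(values)  # ['1', '22']
print(pairs)   # [('x', '1'), ('y', '22')]
```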

III. Regular Expression Patterns

1. Ordinary characters

Characters: a B c
Matching rule: each character matches itself

In [15]: re.findall('ab','abcdaefabcdef')
Out[15]: ['ab', 'ab']

In [16]: re.findall('你好','你好,北京')
Out[16]: ['你好']

2. Alternation (or)

Metacharacter: |
Matching rule: matches either of the regular expressions on the two sides of |

In [24]: re.findall('ab|cd',"abcdef")
Out[24]: ['ab', 'cd']
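One detail worth knowing about |: Python tries the alternatives from left to right and stops at the first one that succeeds, so the order of the alternatives can change the result. A small sketch with an invented input:

```python
import re

# The left alternative wins as soon as it matches, even if a later
# alternative would have matched more text.
short_first = re.findall("ab|abc", "abc")
long_first = re.findall("abc|ab", "abc")

print(short_first)  # ['ab']
print(long_first)   # ['abc']
```

When one alternative is a prefix of another, listing the longer one first is usually what is wanted.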

3. Match a single character

Metacharacter: .
Matching rule: matches any single character except a newline

f.o --> foo fao

In [25]: re.findall('f.o',"foo fao")
Out[25]: ['foo', 'fao']
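The "except a newline" part can be checked directly. The sketch below uses an invented string and the standard re.DOTALL flag, which lets . match a newline as well.

```python
import re

text = "f\no foo"  # hypothetical input: the first "f" is followed by a newline

# By default "." refuses to match "\n", so only "foo" is found.
default = re.findall("f.o", text)

# With re.DOTALL, "." also matches "\n".
dotall = re.findall("f.o", text, re.DOTALL)

print(default)  # ['foo']
print(dotall)   # ['f\no', 'foo']
```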

4. Match the start of the string

Metacharacter: ^
Matching rule: matches the position at the start of the target string

In [29]: re.findall('^Jame',"Jame is a boy")
Out[29]: ['Jame']

5. Match the end of the string

Metacharacter: $
Matching rule: matches the position at the end of the target string

In [32]: re.findall('Jame$',"Hi,Jame")
Out[32]: ['Jame']
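For multi-line text, ^ and $ only anchor to the whole string unless the standard re.MULTILINE flag is given, in which case they also match at the start and end of each line. A minimal sketch, with a made-up three-line input:

```python
import re

text = "Hi\nJame is a boy\nBye,Jame"  # hypothetical multi-line input

# Without flags, ^ only matches at the very start of the string.
start_default = re.findall("^Jame", text)
# With re.MULTILINE, ^ also matches right after each newline.
start_multiline = re.findall("^Jame", text, re.MULTILINE)

# Likewise, $ only matches at the end of the string by default ...
end_default = re.findall("boy$", text)
# ... and at the end of every line with re.MULTILINE.
end_multiline = re.findall("boy$", text, re.MULTILINE)

print(start_default, start_multiline)  # [] ['Jame']
print(end_default, end_multiline)      # [] ['boy']
```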

6. Repetition: zero or more

Metacharacter: *
Matching rule: matches the preceding character 0 or more times

fo* --> fooooooooooo f

In [34]: re.findall('fo*',"fooooooabceffo")
Out[34]: ['foooooo', 'f', 'fo']

7. Repetition: one or more

Metacharacter: +
Matching rule: matches the preceding character 1 or more times

fo+ --> fooooooooooo fo

In [37]: re.findall('fo+',"fooooooabceffo")
Out[37]: ['foooooo', 'fo']

8. Repetition: zero or one

Metacharacter: ?
Matching rule: matches the preceding character 0 or 1 times

fo? --> f fo

In [43]: re.findall('fo?',"fooooooabceffo")
Out[43]: ['fo', 'f', 'fo']

9. Repetition: exact count

Metacharacter: {n}
Matching rule: matches the preceding character exactly n times

fo{3} --> fooo

In [46]: re.findall('fo{3}',"fooooooabceffo")
Out[46]: ['fooo']

10. Repetition: bounded count

Metacharacter: {m,n}
Matching rule: matches the preceding character between m and n times, inclusive

fo{2,4} --> foo fooo foooo

In [49]: re.findall('fo{2,4}',"fooooooabceffoo")
Out[49]: ['foooo', 'foo']
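To compare the five repetition operators side by side, the sketch below runs each of them against the same target string used in most of the sessions above. The dictionary name is an illustrative choice.

```python
import re

text = "fooooooabceffo"
patterns = ["fo*", "fo+", "fo?", "fo{3}", "fo{2,4}"]

# Map each pattern to the list of matches it produces on the same text.
results = {p: re.findall(p, text) for p in patterns}

for pattern, matches in results.items():
    print(f"{pattern:8} {matches}")

# fo*      ['foooooo', 'f', 'fo']
# fo+      ['foooooo', 'fo']
# fo?      ['fo', 'f', 'fo']
# fo{3}    ['fooo']
# fo{2,4}  ['foooo']
```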

11. Match a character set

Metacharacter: [character set]
Matching rule: matches any single character in the set

[abc123] --> a b c 1 2 3
[a-z] [A-Z] [0-9]
[$#_a-zA-Z]

In [50]: re.findall('[A-Z][a-z]*',"Hi,This is Lua")
Out[50]: ['Hi', 'This', 'Lua']

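12. Match a negated character set

Metacharacter: [^...]
Matching rule: matches any single character other than the ones listed

[^abc] --> any single character other than a, b, or c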
[^a-z]

In [61]: re.findall('[^ ]+',"This is a test")
Out[61]: ['This', 'is', 'a', 'test']
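Character sets and negated sets combine naturally with the repetition operators. As a rough sketch, the patterns below pick identifier-like names out of an invented line of code, then pick out the remaining operator characters with a negated set.

```python
import re

line = "x1 = _tmp + 42"  # hypothetical input

# A letter or underscore, followed by any number of letters, digits, underscores.
names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line)

# Anything that is not a letter, digit, underscore, or space.
operators = re.findall(r"[^A-Za-z0-9_ ]", line)

print(names)      # ['x1', '_tmp']
print(operators)  # ['=', '+']
```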

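13. Match any digit / non-digit character

Metacharacters: \d \D
Matching rules:

\d matches any single digit character; this is [0-9] under re.ASCII, and for str patterns in Python 3 it also covers other Unicode decimal digits
\D matches any single non-digit character, i.e. the complement of \d

In [63]: re.findall(r'\d+',"2018年就快過去,2019馬上到來")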
Out[63]: ['2018', '2019']
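In Python 3, \d on a str pattern is not limited to ASCII digits. The sketch below uses full-width digits (written as \uff14\uff12 escapes) in a made-up string to show the difference the standard re.ASCII flag makes.

```python
import re

text = "42 and \uff14\uff12"  # ASCII "42" plus full-width "42"

# Default (Unicode) matching: both runs of digits are found.
unicode_digits = re.findall(r"\d+", text)

# re.ASCII restricts \d to [0-9].
ascii_digits = re.findall(r"\d+", text, re.ASCII)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
```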

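14. Match any word / non-word character

Metacharacters: \w \W
Matching rules:

\w matches a word character (letters, digits, underscore; for str patterns this includes Unicode letters such as CJK characters)
\W matches any non-word character

In [71]: re.findall(r'\w+',"PORT#1234,Error 44% 下降")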
Out[71]: ['PORT', '1234', 'Error', '44', '下降']
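\W is the complement of \w, so it picks out exactly the separators that the session above skipped. A short sketch on the same string, also using re.split (a standard re function not otherwise covered here) to cut the string at those separators:

```python
import re

text = "PORT#1234,Error 44% 下降"

# Runs of non-word characters: the separators between the words.
separators = re.findall(r"\W+", text)

# Splitting on those runs yields the same words that \w+ finds.
words = re.split(r"\W+", text)

print(separators)  # ['#', ',', ' ', '% ']
print(words)       # ['PORT', '1234', 'Error', '44', '下降']
```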

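15. Match any whitespace / non-whitespace character

Metacharacters: \s \S
Matching rules:

\s matches any whitespace character [ \r\n\t\v\f]
\S matches any non-whitespace character

In [72]: re.findall(r'\w+\s+\w+',"hello world")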
Out[72]: ['hello world']

In [74]: re.findall(r'^\S+',"Terna-123#H xxxxxxx")
Out[74]: ['Terna-123#H']
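A common use of \s+ is collapsing or splitting on runs of mixed whitespace. The sketch below assumes a made-up messy string and uses the standard re.sub and re.split functions.

```python
import re

messy = "hello \t world\n\nagain"  # hypothetical input with mixed whitespace

# Replace every run of whitespace with a single space.
normalized = re.sub(r"\s+", " ", messy)

# Or split on the runs to get the individual tokens.
tokens = re.split(r"\s+", messy)

print(normalized)  # hello world again
print(tokens)      # ['hello', 'world', 'again']
```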

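16. Match the start and end of the string

Metacharacters: \A \Z
Matching rules:

\A matches the position at the start of the string
\Z matches the position at the end of the string

In [80]: re.findall(r'\A\d+-\d+\Z',"1000-15000")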
Out[80]: ['1000-15000']

Full match (exact match): ensures that the regular expression matches the entire content of the target string
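A full match can also be requested with the standard re.fullmatch function instead of writing \A and \Z by hand. The sketch below compares three approaches on invented inputs; note that $ tolerates a trailing newline while \Z and re.fullmatch do not.

```python
import re

candidates = ["1000-15000", "1000-15000\n", "x1000-15000"]  # hypothetical inputs

checks = {}
for s in candidates:
    checks[s] = (
        re.search(r"\A\d+-\d+\Z", s) is not None,  # explicit \A ... \Z
        re.search(r"^\d+-\d+$", s) is not None,    # ^ ... $
        re.fullmatch(r"\d+-\d+", s) is not None,   # re.fullmatch
    )

for s, result in checks.items():
    print(repr(s), result)

# '1000-15000'   (True, True, True)
# '1000-15000\n' (False, True, False)
# 'x1000-15000'  (False, False, False)
```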

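17. Match word / non-word boundaries

Metacharacters: \b \B
Matching rules:

\b matches a word boundary (the position where a word character meets a non-word character)
\B matches a position that is not a word boundary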

In [85]: re.findall(r'\bis\b',"This is a boy")
Out[85]: ['is']

In [86]: re.findall(r'\Bis',"This is a boy")
Out[86]: ['is']
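The r prefix in the two sessions above matters: in an ordinary Python string literal, \b is the backspace character, so the regex engine never sees a word-boundary marker. A small sketch of the difference:

```python
import re

plain = "\bis\b"   # Python turns each \b into a backspace character
raw = r"\bis\b"    # the backslashes reach the regex engine intact

print(len(plain), len(raw))  # 4 6

# The plain pattern looks for backspace + "is" + backspace and finds nothing.
plain_matches = re.findall(plain, "This is a boy")
raw_matches = re.findall(raw, "This is a boy")

print(plain_matches)  # []
print(raw_matches)    # ['is']
```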

18. Metacharacter summary

Match a single character: . [...] [^...] \d \D \w \W \s \S

Repetition: * + ? {n} {m,n}

Position: ^ $ \A \Z \b \B

Other: | () \

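19. Escaping in regular expressions

Regex special symbols: . * + ? ^ $ () [] {} | \

To match a special symbol literally, put a \ in front of it
e.g. to match the character . use \.

Target string    Regular expression    Python string literal
$10              \$\d+                 "\\$\\d+"

Raw string: backslash escapes in the string literal are not processed, so the regular expression can be written as-is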

r'\$\d+' ==> '\\$\\d+'
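The equivalence above can be verified in Python, and the standard re.escape function can add the backslashes automatically when a literal string has to be matched. A minimal sketch with an invented price string:

```python
import re

# The raw string and the doubled-backslash string are the same pattern.
same = r"\$\d+" == "\\$\\d+"

# re.escape puts a backslash in front of the special character "$".
escaped = re.escape("$10")

text = "cost: $10 or $100"  # hypothetical input
prices = re.findall(r"\$\d+", text)
literal_hits = re.findall(escaped, "cost: $10")

print(same)          # True
print(escaped)       # \$10
print(prices)        # ['$10', '$100']
print(literal_hits)  # ['$10']
```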

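20. Greedy and non-greedy matching

Greedy mode: by default, repetition in a regular expression always matches as much of the following text as possible

* + ? {m,n}

Non-greedy (lazy) mode: matches as little as possible

Greedy --> non-greedy: *? +? ?? {m,n}?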

In [105]: re.findall(r'ab+?',"abbbbbbbbb")
Out[105]: ['ab']
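The difference is easiest to see when a delimiter appears more than once. The sketch below uses an invented tag-like string: the greedy pattern runs to the last >, while the lazy one stops at the first.

```python
import re

text = "<a><b>"  # hypothetical input

# Greedy: .+ grabs as much as possible, so the match spans both tags.
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first ">" that lets the pattern succeed.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```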

........ (to be added)

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/166778.html
