HTML parsing with BeautifulSoup does not work as expected

I am using Python 3 and the BeautifulSoup module, version 4.9.3. I am trying to use this package to practice parsing some simple HTML.

The string I have is as follows:

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''

I use BeautifulSoup as follows:

from bs4 import BeautifulSoup
import re

x = BeautifulSoup(text, "html.parser")

Then I use the following script to experiment with Beautiful Soup's features:

for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")

The result (at least for the first iteration) is unexpected:

<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text


<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here

Why is the string attribute of the first li tag None, while the string attribute of the inner p tag is not None?

Similarly, if I do this:

x.find_all('li', string=re.compile('text'))

I get only one result (the second tag).

But if I do this:

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))

I get 2 results (both tags).

A uj5u.com user replied:

The documentation explains:

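If a tag has only one child, and that child is a NavigableString, the child is made available as .string.

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.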
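The three rules above can be checked directly against the HTML from the question. This is a minimal sketch; the variable names (html, soup, first_li, second_li, rule1, rule2, rule3) are chosen here for illustration and do not appear in the original post.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
first_li, second_li = soup.find_all("li")

# Rule 1: the inner <p> has exactly one child, a NavigableString.
rule1 = first_li.p.string

# Rule 2: the second <li> has exactly one child, a <p> that itself has a .string.
rule2 = second_li.string

# Rule 3: the first <li> has two children (a <p> and a string), so .string is None.
rule3 = first_li.string

print(rule1)
print(rule2)
print(rule3)
```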

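Let's apply these rules to your question:

Why is the string attribute of the first li tag None, while the string attribute of the inner p tag is not None?

The inner p tag satisfies rule #1: it has exactly one child, and that child is a NavigableString, so .string returns that child.

The first li satisfies rule #3: it has more than one child (the p tag and the trailing text), so .string would be ambiguous.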
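When .string is None because a tag has several children, the text is still reachable by other means. The sketch below is not part of the original answer; it assumes you want all of the text under the first li and uses the .strings generator and get_text(), with illustrative variable names.

```python
from bs4 import BeautifulSoup

# Only the first <li> from the question, which has two children.
html = "<li><p>Some text</p>is put here</li>"
li = BeautifulSoup(html, "html.parser").li

# .strings yields every string found anywhere under the tag, in document order.
pieces = list(li.strings)

# get_text() concatenates those strings; the argument is the separator placed between them.
joined = li.get_text(" ")

print(pieces)
print(joined)
```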

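As for your second question, let's consult the documentation for the string= argument of .find_all():

With string you can search for strings instead of tags. ... Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string.

Your first example: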

x.find_all('li', string=re.compile('text'))
# [<li><p>And other text is put here</p></li>]

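This searches for all li tags whose .string matches the regular expression. But we have already seen that the first li's .string is None, so it does not match.

Your second example: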

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))
# ['Some text']
# ['And other text is put here']

This searches for all strings contained anywhere in each li tree. For the first tree, li.p.string exists and matches, even though li.string does not.
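The difference between the two searches can be spelled out with explicit list comprehensions. This is a sketch under the behavior described above, not Beautiful Soup's internal implementation; the names pattern, by_own_string and by_any_descendant are hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("text")

# Mirrors the first example: keep an <li> only if its own .string exists and matches.
by_own_string = [
    li for li in soup.find_all("li")
    if li.string is not None and pattern.search(li.string)
]

# Mirrors the second example: keep an <li> if any string beneath it matches.
by_any_descendant = [
    li for li in soup.find_all("li")
    if li.find(string=pattern) is not None
]

print(len(by_own_string))
print(len(by_any_descendant))
```

If the goal is to select every li that contains the word anywhere in its subtree, the second form is the one to use.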


Tags: Python, html, parsing, beautifulsoup
