My HTML code contains nested lists like this:
<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>
I need to parse them so that they look like this:
Apple
Pear
    Cherry
    Orange
        Pineapple
Banana
I tried using BeautifulSoup, but I am stuck on how to account for the nesting in my code.
Example, where x contains the HTML code listed above:
import bs4

soup = bs4.BeautifulSoup(x, "html.parser")
for ul in soup.find_all("ul"):
    for li in ul.find_all("li"):
        li.replace_with(" {}\n".format(li.text))
A uj5u.com user replied:
You can use recursion:
import bs4
from bs4 import BeautifulSoup as soup

s = """
<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>
"""

def indent(d, c=0):
    # Join the non-blank text nodes that are direct children of this tag.
    if (text := ''.join(i for i in d.contents if isinstance(i, bs4.NavigableString) and i.strip())):
        yield f'{" " * c} {text}'
    # Recurse into child tags, one nesting level deeper.
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from indent(i, c + 1)

print('\n'.join(indent(soup(s, 'html.parser').ul)))
Output:
  Apple
  Pear
   Cherry
   Orange
    Pineapple
  Banana
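The recursive idea does not depend on BeautifulSoup. As a minimal sketch, assuming the markup is well-formed enough to be parsed as XML (true for the sample, not for HTML in general), the same walk can be written with the standard library's xml.etree.ElementTree. The names HTML, indent_items and unit are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (assumed to be well-formed XML).
HTML = """
<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>
"""

def indent_items(ul, depth=0, unit="    "):
    """Yield one line per <li>, indented once per nested <ul> level."""
    for child in ul:
        if child.tag == "li":
            yield unit * depth + (child.text or "").strip()
        elif child.tag == "ul":
            # A nested list: recurse one level deeper.
            yield from indent_items(child, depth + 1, unit)

root = ET.fromstring(HTML.strip())
result = "\n".join(indent_items(root))
print(result)
```

Here the depth is passed down as an argument, so each nested list only needs to know how deep its parent was.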
A uj5u.com user replied:
This is a bit of a hack, but you can use lxml instead:
import lxml.html as lh
from lxml import etree

uls = """[your html above]"""
doc = lh.fromstring(uls)
tree = etree.ElementTree(doc)
for e in doc.iter('li'):
    # The XPath of each <li> contains one 'ul' step per enclosing list.
    path = tree.getpath(e)
    print(' ' * path.count('ul'), e.text)
Output:
  Apple
  Pear
   Cherry
   Orange
    Pineapple
  Banana
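The lxml answer derives the indentation from how many ul steps appear in each item's path. The same depth-counting idea can be sketched with only the standard library by tracking how many ul tags are open while the parser streams through the markup. This is a swapped-in technique (html.parser.HTMLParser, not lxml), and ListIndenter, HTML and unit are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser

# Sample markup from the question.
HTML = """
<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>
"""

class ListIndenter(HTMLParser):
    """Collect <li> text, indented by the number of enclosing <ul> tags."""

    def __init__(self, unit="    "):
        super().__init__()
        self.unit = unit
        self.depth = 0       # number of <ul> elements currently open
        self.in_li = False   # True while inside an <li>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "li":
            self.in_li = True
            # The outermost list has depth 1, so it gets no indentation.
            self.lines.append(self.unit * max(self.depth - 1, 0))

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Ignore whitespace between tags; keep text inside list items.
        if self.in_li and data.strip():
            self.lines[-1] += data.strip()

parser = ListIndenter()
parser.feed(HTML)
parser.close()
result = "\n".join(parser.lines)
print(result)
```

Because this parser is event-based, it never builds a tree, and it tolerates markup that is not valid XML. The sketch does not try to handle text that follows a nested list inside the same li.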
A uj5u.com user replied:
I think it would be easier to convert the HTML string to Markdown with custom bullets. This can be done with markdownify:
import markdownify
formatted_html = markdownify.markdownify(x, bullets=[' ', ' ', ' '], strip="ul")
Result:
Apple
Pear
Cherry
Orange
Pineapple
Banana
Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/364391.html
