How can I use Python to convert HTML hyperlinks to plain text, as in the following?
<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>
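My current code looks like this, but the package does not seem to do the job by itself; it only converts the main HTML text elements to plain text, without the links: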
from html2text import html2text
# Triple quotes keep the double quotes around the href from ending the string early
text = html2text("""<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>""")
print(text)
# Result I wanted: "Hello world, it's foo bar time - https://google.com/"
# Result I got: "Hello world, it's foo bar time"
It would really help if someone could find a solution.
Reply from a uj5u.com user:
Take a look at html.parser; this standard-library module should cover what you need.
An example from the documentation:
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()
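The documentation example above only prints parser events, so here is a minimal sketch of how the same callbacks could produce the "text - URL" output the question asks for. The names LinkTextParser and html_to_text_with_links are made up for this illustration and are not part of html.parser, and the sketch assumes links are not nested in unusual ways.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the document text, appending ' - <href>' after each link's text.

    Hypothetical helper class for illustration; not part of the standard library.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments in document order
        self.href_stack = []  # href values of the <a> tags currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f" - {href}")

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def html_to_text_with_links(html):
    """Return the text of `html` with each link rendered as 'text - href'."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(html_to_text_with_links(
    '<p>Hello world, it\'s <a href="https://google.com">foo bar time</a></p>'
))
# Hello world, it's foo bar time - https://google.com
```

This needs no third-party install, at the cost of writing the tag handling yourself.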
Reply from a uj5u.com user:
You can use Beautiful Soup (the bs4 package):
from bs4 import BeautifulSoup
spam = """<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>
<p>Hello world, it's <a href="https://stackoverflow.com">spam eggs</a></p>"""
soup = BeautifulSoup(spam, 'html.parser')
for a_tag in soup.find_all('a'):
    a_tag.replace_with(f"{a_tag.text} - {a_tag.get('href')}")
print(soup.text)
Output:
Hello world, it's foo bar time - https://google.com
Hello world, it's spam eggs - https://stackoverflow.com
Note that you can build on this from here. See tag.replace_with() and tag.unwrap() in the Beautiful Soup documentation.
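As a rough sketch of how replace_with() and unwrap() might be combined: links that have an href are replaced by "text - href", while <a> tags without an href are unwrapped so only their text remains. The function name links_to_text and the example.com URL are made up for this illustration, and the sketch assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup


def links_to_text(html):
    """Hypothetical helper: render links as 'text - href' and return plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.get("href")
        if href:
            # Swap the whole <a> element for a plain string
            a_tag.replace_with(f"{a_tag.get_text()} - {href}")
        else:
            # No href: drop the tag itself but keep its contents
            a_tag.unwrap()
    return soup.get_text()


print(links_to_text(
    '<p>See <a href="https://example.com">docs</a> and '
    '<a name="top">this anchor</a>.</p>'
))
# See docs - https://example.com and this anchor.
```

Without the href check, an anchor lacking the attribute would come out as "this anchor - None".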
Reply from a uj5u.com user:
You can use the BeautifulSoup module.
from bs4 import BeautifulSoup
html = "<p>Hello world, it's <a href='https://google.com'>foo bar time</a></p>"
soup = BeautifulSoup(html, features="html.parser")
text = soup.get_text()
url_part = soup.find('a')
url_str = url_part['href']
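# An f-string avoids the extra spaces that print() inserts between separate arguments
print(f"{text} - {url_str}")
To import the module, you need to install it first: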
pip install beautifulsoup4
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/382305.html
