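I am parsing a text in which every word has been turned into a link. The problem is that the punctuation marks are not part of the content of the <a> tags; they just sit outside the tags, so I don't know how to retrieve the punctuation as well.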
<table>
<tbody>
<tr>
<td>
<a href="#">Lorem</a>
", "
<a href="#">Ipsum</a>
": "
<a href="#">dolor</a>
"."
</td>
<td>...</td>
</tr>
<tr>
<td>
<a href="#">sit</a>
"? '"
<a href="#">amet</a>
"' "
<a href="#">consectetur</a>
"..."
</td>
<td>...</td>
</tr>
<tr>
<td>
<a href="#">adipisicing</a>
"-"
<a href="#">elit</a>
"; "
<a href="#">Molestias</a>
"!"
</td>
<td>...</td>
</tr>
</tbody>
</table>
Here is the parser:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRows in soup.select("table > tbody > tr"):
    # Only the <a> elements of the first cell are visited,
    # so the text between them (the punctuation) is never collected.
    for word in tableRows.find("td").select("a"):
        words.append(word.text)
print(words)
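Answer from a uj5u.com user:
The text between the a elements belongs to the parent td element itself.
You can get the text directly from the td element, as follows: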
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
    # .text concatenates every text node in the row,
    # including the text that sits between the <a> tags.
    words.append(tableRow.text)
print(words)
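To make the point about text nodes concrete, here is a minimal sketch that runs without a browser. It is not the answer's method: it swaps BeautifulSoup for the standard library's html.parser so that it has no dependencies, and SAMPLE_HTML and the FirstCellTokens class are hypothetical names modelled on the first row of the question's table. It shows that the punctuation arrives as separate text nodes inside the td, in document order, between the link texts.

```python
from html.parser import HTMLParser

# Hypothetical sample modelled on the first row of the question's table.
SAMPLE_HTML = (
    '<table><tbody><tr>'
    '<td><a href="#">Lorem</a>, <a href="#">Ipsum</a>: <a href="#">dolor</a>.</td>'
    '<td>...</td>'
    '</tr></tbody></table>'
)


class FirstCellTokens(HTMLParser):
    """Collect link texts and the text between links from the first <td> of each row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of tokens per <tr>
        self.cell_index = -1  # index of the current <td> within its row
        self.in_cell = False  # True only while inside the first <td> of a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self.cell_index = -1
        elif tag == "td":
            self.cell_index += 1
            self.in_cell = self.cell_index == 0

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Link text and the punctuation between links both arrive here as text nodes.
        token = data.strip()
        if self.in_cell and token and self.rows:
            self.rows[-1].append(token)


parser = FirstCellTokens()
parser.feed(SAMPLE_HTML)
parser.close()
tokens = parser.rows
print(tokens)  # [['Lorem', ',', 'Ipsum', ':', 'dolor', '.']]
```

The second cell ("...") is skipped on purpose, mirroring the find("td") call in the question's parser.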
UPD
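If you want the punctuation marks as separate items, you can split the table row text on spaces. The following code should do that and also strip leading and trailing whitespace from each piece.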
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
    tableRowtext = tableRow.text
    # Split on single spaces, then trim surrounding whitespace from each piece.
    rowTexts = [x.strip() for x in tableRowtext.split(' ')]
    words.append(rowTexts)
print(words)
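Splitting on spaces only separates a punctuation mark from a word when there is whitespace between them in the page. If there is none (for example "Lorem," with the comma directly after the link), a regular expression is one possible alternative. The sketch below is an assumption-laden illustration, not part of the original answer: row_texts is a hypothetical stand-in for the strings you would get from tableRow.text, and TOKEN_PATTERN is a name introduced here.

```python
import re

# Hypothetical row texts, standing in for the values of tableRow.text.
row_texts = ["Lorem, Ipsum: dolor.", "sit? 'amet' consectetur..."]

# \w+ matches a run of word characters;
# [^\w\s]+ matches a run of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

words = [TOKEN_PATTERN.findall(text) for text in row_texts]
print(words)
# [['Lorem', ',', 'Ipsum', ':', 'dolor', '.'],
#  ['sit', '?', "'", 'amet', "'", 'consectetur', '...']]
```

Adjacent punctuation characters such as "..." stay together as one token here; drop the trailing + in the second alternative if each character should be its own item.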
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/529588.html
