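I'm new to HTML and parsing, so apologies if I use the wrong terminology here. I asked a similar question before and got some helpful answers. I have the following HTML snippet, made up of two tables and two table header rows (there are more rows, but they are not relevant to this post):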
<body>
<table>
<tr class="header">
<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
</tr>
<tr>
<!--Many more rows-->
</tr>
</table>
<table>
<tr class="header">
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
</tr>
<tr>
<!--Many more rows-->
</tr>
</table>
</body>
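I'm trying to parse this with Python 3.6 and BeautifulSoup4 and extract the text into a list. My problem is that I want a separate list for each block. My current code seems to search for and find all <th> tags, not just the ones in the first table.
Here is what I have: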
def parse_html(self):
    """ Parse the html file """
    with open(self.html_path) as f:
        soup = BeautifulSoup(f, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        # Find each row in the table
        rows = table.find_all_next('tr')
        for row in rows:
            # Find each column in the row
            cols = row.find_all_next('th')
            for col in cols:
                # Print each cell
                print(col)  # This is where it seems to be finding every <th>
            break  # Break just to do the first row (seems not to work?)
Question: how can I modify this code so that it only finds the <th> tags in the current row, rather than in every row?
Thanks for any help!
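Answer from a uj5u.com user:
Use .find_all instead of .find_all_next. The .find_all method searches only the descendants of the element it is called on, whereas .find_all_next searches everything that comes after that element in the document, which is why every <th> was being matched.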
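To see the difference concretely, here is a minimal sketch. The two-table html_doc is a hypothetical, reduced version of the question's HTML, and the variable names are for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical two-table document, reduced from the question's HTML.
html_doc = """
<table><tr><th>Heading 1</th><th>Heading 2</th></tr></table>
<table><tr><th>Diff Header 1</th><th>Diff Header 2</th></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_table = soup.find("table")

# find_all() only looks at the descendants of first_table.
inside = first_table.find_all("th")

# find_all_next() looks at everything that comes after the start of
# first_table in the document, so the second table is included too.
after = first_table.find_all_next("th")

print(len(inside))  # 2
print(len(after))   # 4
```

Called on the first table, find_all_next also returns the headers of the second table, which matches the behaviour described in the question.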
If html_doc is the HTML snippet from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
tables = soup.find_all("table")
for table in tables:
    # Find each row in the table
    rows = table.find_all("tr")
    for row in rows:
        # Find only the header cells that belong to this row
        cols = row.find_all("th")
        for col in cols:
            print(col)
    # Separator printed once per table
    print("-" * 80)
Prints:
<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
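The question asked for the header text in a separate list per table, rather than printed tags. One possible way to get there, sketched under a few assumptions: the html_doc below is a shortened stand-in for the question's HTML, headers_per_table is a name made up for this example, and joining multi-line headers with a space is only one choice of separator.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (assumption: the header row
# is always marked with class="header", as in the question).
html_doc = """
<table>
<tr class="header">
<th><strong>Heading 1</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
</tr>
</table>
<table>
<tr class="header">
<th><strong>Diff Header 1</strong></th>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

headers_per_table = []  # one inner list of header strings per <table>
for table in soup.find_all("table"):
    header_row = table.find("tr", class_="header")
    if header_row is None:
        continue  # skip tables without a header row
    # get_text(" ", strip=True) joins the text pieces of a cell with a
    # space and drops whitespace-only pieces such as the newline between
    # the two <p> elements.
    headers = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
    headers_per_table.append(headers)

print(headers_per_table)
```

Each inner list then holds the header text of one table, so headers_per_table[0] belongs to the first table and headers_per_table[1] to the second.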
