Scrape the text content of a tag, but not the text of other tags inside the first <td>

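I need to get text from the first <td> of each <tr>, but not all of it: only the text inside <a> tags and the text that sits outside any other tag. In the example below, the text I need is written as "yyy"/"y" and the text I don't need as "zzz".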

<table>
  <tbody>
    <tr>
      <td>
        <b>zzz</b>
        <a href="#">yyy</a>
        "y"
        <a href="#">yyy</a>
        <sup>zzz</sup>
        <a href="#">yyy</a>
        <a href="#">yyy</a>
        "y"
      </td>
      <td>
        zzzzz
      </td>
    </tr>
  </tbody>
</table>

This is what I have so far:

words = []
for tableRows in soup.select("table > tbody > tr"):
  tableData = tableRows.find("td").text
  text = [word.strip() for word in tableData.split(' ')]
  words.append(text)
print(words)

But this code picks up all of the text from the <td>: ["zzz", "yyyy", "yyyy", "zzz", "yyyy"].

Answer from a uj5u.com user:

Try:

from bs4 import BeautifulSoup, Tag, NavigableString

html_doc = """\
<table>
  <tbody>
    <tr>
      <td>
        <b>zzz</b>
        <a href="#">yyy</a>
        "y"
        <a href="#">yyy</a>
        <sup>zzz</sup>
        <a href="#">yyy</a>
        <a href="#">yyy</a>
        "y"
      </td>
      <td>
        zzzzz
      </td>
    </tr>
  </tbody>
</table>"""

soup = BeautifulSoup(html_doc, "html.parser")

for td in soup.select("td:nth-of-type(1)"):
    for c in td.contents:
        if isinstance(c, Tag) and c.name == "a":
            print(c.text.strip())
        elif isinstance(c, NavigableString):
            c = c.strip()
            if c:
                print(c)

Prints:

yyy
"y"
yyy
yyy
yyy
"y"

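soup.select("td:nth-of-type(1)") selects only the first <td> of each row.
Then we iterate over the .contents of that <td>.
if isinstance(c, Tag) and c.name == "a" checks whether the item is a Tag and whether that Tag is an <a>.
elif isinstance(c, NavigableString) checks whether the item is a plain string.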
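
If you want the values in a list instead of printing them, the same two checks can be wrapped in a small helper. This is only a sketch: the function name wanted_text and the shortened HTML are made up for illustration and are not part of the answer above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened, hypothetical version of the table from the question.
html_doc = """
<table><tbody><tr>
  <td><b>zzz</b><a href="#">yyy</a> "y" <sup>zzz</sup><a href="#">yyy</a></td>
  <td>zzzzz</td>
</tr></tbody></table>
"""


def wanted_text(td):
    """Collect the text of direct <a> children and bare strings of td, in order."""
    parts = []
    for c in td.contents:
        if isinstance(c, Tag) and c.name == "a":
            # Text inside an <a> tag is wanted.
            parts.append(c.get_text(strip=True))
        elif isinstance(c, NavigableString):
            # Bare strings are wanted unless they are only whitespace.
            s = c.strip()
            if s:
                parts.append(s)
    return parts


soup = BeautifulSoup(html_doc, "html.parser")
# One list of strings per row, taken from the first <td> of that row.
rows = [wanted_text(td) for td in soup.select("td:nth-of-type(1)")]
print(rows)
```

Because only direct children are inspected, text in <b> and <sup> is skipped without having to list those tag names.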

Answer from a uj5u.com user:

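Based on your example, iterate over the children of the td tag. Keep a child if its name is a or None (plain strings have no tag name), then check that the child actually has text and append it.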

words = []

for item in soup.select("table > tbody > tr"):
    # item.td is the first <td> in the row
    for child in item.td.children:
        # plain strings have name None; keep those and <a> tags
        if child.name == "a" or child.name is None:
            text = child.text.strip()
            if text:
                words.append(text)
print(words)

Output:

['yyy', '"y"', 'yyy', 'yyy', 'yyy', '"y"']
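
A different way to get the same kind of result, if you don't mind modifying the parsed tree, is to delete the unwanted child tags first and then read whatever text is left. Note that this is a swapped-in technique rather than what either answer does, and the shortened HTML below is hypothetical. It is a sketch and assumes the soup object is not needed intact afterwards.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table from the question.
html_doc = """
<table><tbody><tr>
  <td>
    <b>zzz</b>
    <a href="#">yyy</a>
    "y"
    <sup>zzz</sup>
    <a href="#">yyy</a>
  </td>
  <td>zzzzz</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

words = []
for row in soup.select("table > tbody > tr"):
    td = row.find("td")  # first <td> of the row
    # Delete every direct child tag that is not <a>; this mutates the tree.
    for tag in td.find_all(True, recursive=False):
        if tag.name != "a":
            tag.decompose()
    # What remains is the <a> text plus the bare strings, already stripped.
    words.extend(td.stripped_strings)

print(words)
```

The trade-off is that the <b> and <sup> elements are gone from soup after the loop, so the read-only loops in the answers are safer if you still need the original markup.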

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/529587.html

Tags: Python, Selenium, parsing, web scraping, Beautiful Soup
