Excluding a span with BS4 - Python - 有解無憂

So I'm trying to exclude (rather than extract) the information contained in a span. Here is the HTML:

<li><span>Type:</span> Cardiac Ultrasound</li>

Here is my code:

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
        description_elements = description_el.find('span')
        for el in description_elements: 
            curr_el = {}
            key = el.replace(':', '')
            print(el)
            print(description_el.text.replace(' ', ''))

listing_soup is basically the whole page (in my example, the HTML above). When I run this, I get:

Type:
Type: CardiacUltrasound

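As you can see, for some odd reason :P the span text is still there, even though .text yields a str and I call replace() on it.

Edit: Sorry. My goal is to create a bunch of dictionaries where the key is the span's content and the value is the text that follows it.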

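Reply from a uj5u.com user:

Note: be careful with "create a bunch of dictionaries", because a dictionary cannot have duplicate keys. You can have a list of dictionaries, though, in which case duplicates across dictionaries don't matter (they still matter within each individual dictionary).

Option 1:

Use .next_sibling (an attribute, not a method call):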
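To make the point about duplicate keys concrete, here is a minimal sketch using made-up (key, value) pairs. The second value and the names pairs, as_single_dict and as_list_of_dicts are illustrative assumptions, not something from the question.

```python
# Hypothetical (key, value) pairs, as they might be scraped from several <li> tags.
pairs = [
    ("Type", "Cardiac Ultrasound"),
    ("Type", "Vascular Ultrasound"),  # same key again (made-up value)
]

# A single dict keeps only the last value seen for a repeated key.
as_single_dict = dict(pairs)

# A list of one-entry dicts keeps every pair, duplicates included.
as_list_of_dicts = [{key: value} for key, value in pairs]

print(as_single_dict)
print(as_list_of_dicts)
```

Which shape is right depends on whether the same key can appear more than once on a page.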

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':', '')
    v = description_el.find('span').next_sibling.strip()
    
    print(k)
    print(v)
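One caveat with .next_sibling: it can be None (nothing follows the span) or another tag, and neither supports .strip(). The following is a rough sketch of a more defensive variant; the second <li> and the pairs list are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup; the second <li> is a made-up case with no text after the span.
html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li>
<li><span>Empty:</span></li>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

pairs = []
for li in soup.find(class_='item_description').find_all('li'):
    span = li.find('span')
    if span is None:
        continue  # no key in this <li>
    sibling = span.next_sibling
    # next_sibling may be None or another Tag; only strip plain strings
    value = sibling.strip() if isinstance(sibling, NavigableString) else ''
    pairs.append((span.get_text().replace(':', '').strip(), value))

print(pairs)
```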

Option 2:

Simply take the text from description_el and .split(':') it. That gives you the 2 elements you want (if I read your question correctly).

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    descText = description_el.text.split(':', 1)
    k = descText[0].strip()
    v = descText[-1].strip()
    
    print(k)
    print(v)

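Option 3:

Get the <span> text. Remove the <span>. Then take the text that is left in the <li>. Since you don't want to extract, though, this may not be useful to you.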

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':','')
    description_el.find('span').extract()
    v = description_el.text.strip()
    
    print(k)
    print(v)

Output:

Type
Cardiac Ultrasound
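Option 3 removes the <span> from the parsed tree, which may be unwanted if the soup is reused later. One possible workaround, sketched here under the assumption that copying each <li> is acceptable, is to call extract() on a copy made with copy.copy(); the name li_copy is illustrative.

```python
import copy
from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

for description_el in listing_soup.find(class_='item_description').find_all('li'):
    # Work on a detached copy so the original tree keeps its <span>.
    li_copy = copy.copy(description_el)
    span = li_copy.find('span')
    k = span.get_text().replace(':', '').strip()
    span.extract()
    v = li_copy.get_text().strip()

    print(k)
    print(v)

# The original document still contains the <span>.
print(listing_soup.find('span') is not None)
```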

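Reply from a uj5u.com user:

To extract a tag's text without the content of its child tags, you can use the approach from this answer. In general, you just need to iterate over the <li> tags and take the text from those that contain a child <span>.

Code: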

from bs4 import BeautifulSoup, NavigableString

html = """<html><body>
<li><span>Key1:</span> Value1</li>
<li><span>Key2:</span> Value2</li>
<li>NoKeyValue</li>
<li><span>Key3:</span> Value3</li>
<li><span>Key4:</span> Value4</li>
</body></html>"""

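result = {}
for li in BeautifulSoup(html, "html.parser").find_all("li"):
    span = li.find("span")
    if span:  # skip <li> tags that have no <span> key
        # Key: span text without the colon; value: the <li>'s own strings only
        result[span.text.strip(" :")] = \
            "".join(e for e in li if isinstance(e, NavigableString)).strip()

print(result)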

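The same idea can be wrapped in a small helper. This is only a sketch: own_text is a hypothetical name, and it relies on find_all(string=True, recursive=False), which returns just the strings that are direct children of the tag.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return the text directly inside tag, ignoring text in child tags."""
    return "".join(tag.find_all(string=True, recursive=False)).strip()


html = """<ul>
<li><span>Key1:</span> Value1</li>
<li>NoKeyValue</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
result = {}
for li in soup.find_all("li"):
    span = li.find("span")
    if span:  # only <li> tags with a <span> contribute a key
        result[span.get_text().strip(" :")] = own_text(li)

print(result)
```

Keeping the string filtering in one function makes it easy to reuse on other tags, not just <li>.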