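I am scraping a website with BeautifulSoup and Python, and the page has more than 100 span tags. I want to remove every pair of consecutive span tags in which the first span contains the text "READ MORE:" and the second span is some arbitrary string.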
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
<span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
<span>READ MORE: </span>,
<span>Long queues form at airports as one million Aussies set to fly this Easter</span>,
<span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
<span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
<span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
<span>READ MORE: </span>,
<span>Four female backpackers killed in horror highway crash</span>,
<span>The court also heard he had earned the title of a serial traffic offender.</span>,
<span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
<span>Watfa will serve at least two years and three months for manslaughter.</span>,
<span>He will be eligible for parole in early 2024.</span>
For example, I want to remove the following 4 tags:
<span>READ MORE: </span>,
<span>Long queues form at airports as one million Aussies set to fly this Easter</span>
<span>READ MORE: </span>,
<span>Four female backpackers killed in horror highway crash</span>
The output should be:
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
<span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
<span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
<span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
<span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
<span>The court also heard he had earned the title of a serial traffic offender.</span>,
<span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
<span>Watfa will serve at least two years and three months for manslaughter.</span>,
<span>He will be eligible for parole in early 2024.</span>
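I would greatly appreciate it if someone could help me understand the logic in Python. Cheers.
Reply from a uj5u.com user:
Assuming you are scraping the text of each article on a news site, you should change your strategy.
Use .decompose() to clean the tree of the elements you do not want to scrape: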
for e in soup.select('span:-soup-contains("READ MORE")'):
    # remove the span that follows the "READ MORE" marker first, then the marker itself
    e.find_next('span').decompose()
    e.decompose()
Then select the body of the article and extract its text:
soup.select_one('.article__body-croppable').get_text(' ', strip=True)
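This results in:
A driver has been jailed over the death of a baby boy who was sitting on his lap during a car crash in Sydney's south-west. Two cars collided at low speed in Lurnea on February 25, 2019. The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa. Peter Watfa has been jailed for at least two years and three months. (9News) Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred. The baby boy suffered fatal injuries when the driver's airbag deployed. A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care". An 11-month-old boy died in the crash. (9News) The court also heard he had earned the title of a serial traffic offender. In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs. Watfa will serve at least two years and three months for manslaughter. He will be eligible for parole in early 2024.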
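The two steps above can be tried end to end on a small sample. This is only a sketch: the HTML snippet is a shortened, made-up stand-in for the real page, the helper name strip_read_more is hypothetical, and it uses find_next_sibling plus a None check (instead of find_next) so that a trailing "READ MORE" marker does not raise an error. The :-soup-contains selector needs a reasonably recent soupsieve.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the article markup (assumption: the spans are
# siblings inside the .article__body-croppable container).
html = """
<div class="article__body-croppable">
  <span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>
  <span>READ MORE: </span>
  <span>Long queues form at airports as one million Aussies set to fly this Easter</span>
  <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>
  <span>READ MORE: </span>
  <span>Four female backpackers killed in horror highway crash</span>
  <span>He will be eligible for parole in early 2024.</span>
</div>
"""


def strip_read_more(soup):
    """Remove each "READ MORE" span together with the span right after it."""
    for marker in soup.select('span:-soup-contains("READ MORE")'):
        teaser = marker.find_next_sibling("span")
        if teaser is not None:  # a marker at the very end has no teaser
            teaser.decompose()
        marker.decompose()
    return soup


soup = strip_read_more(BeautifulSoup(html, "html.parser"))
body = soup.select_one(".article__body-croppable")

# Plain text of the cleaned article body, pieces joined by single spaces.
article_text = body.get_text(" ", strip=True)

# The spans that survived, in case the tags themselves are still needed.
remaining = [span.get_text(strip=True) for span in body.select("span")]
print(article_text)
```

Because the tree is modified in place, any later select or get_text call on the same soup sees the cleaned version.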
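In fact, you could also iterate over your ResultSet and build a new list of <span> elements, but I do not think this is the best option (the i == 0 check keeps results[i-1] from wrapping around to the last element for the first item):
[x for i, x in enumerate(results) if 'READ MORE' not in x.text and (i == 0 or 'READ MORE' not in results[i-1].text)]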
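The same filter can be written as an explicit loop, which some readers may find easier to follow than the indexed comprehension. A minimal sketch, assuming results comes from find_all('span') on markup like the question's; the sample HTML and the helper name drop_read_more_pairs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample in the same shape as the spans in the question.
html = """
<span>First paragraph.</span>
<span>READ MORE: </span>
<span>Teaser headline one</span>
<span>Second paragraph.</span>
<span>READ MORE: </span>
<span>Teaser headline two</span>
<span>Last paragraph.</span>
"""

results = BeautifulSoup(html, "html.parser").find_all("span")


def drop_read_more_pairs(spans, marker="READ MORE"):
    """Keep only spans that are neither a marker nor directly after one."""
    kept = []
    prev_was_marker = False  # nothing precedes the first span
    for span in spans:
        is_marker = marker in span.get_text()
        if not is_marker and not prev_was_marker:
            kept.append(span)
        prev_was_marker = is_marker
    return kept


kept = drop_read_more_pairs(results)
kept_texts = [span.get_text() for span in kept]
print(kept_texts)
```

Unlike the decompose() approach, this leaves the parsed tree untouched and only returns a filtered list of the original tags.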
