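My current code scrapes each field separately, but I want to map each time to its title.
Since the page does not put the time and the title inside the same class, how can this mapping be done?
This follows up on this question - link (my question uses an example where the time and title lists have unequal lengths).
Reference URL: https://ash.confex.com/ash/2021/webprogram/WALKS.html
Sample expected output:
5:00 PM-6:00 PM,ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
5:00 PM-6:00 PM,ASH Poster Walk on Healthcare Quality Improvement
and so on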
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
Answer from a uj5u.com user:
This could be another option:
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
#times = soup.select('.time')
for a in productlist:
    title = a.text
    # find_previous returns the nearest matching tag that precedes this link
    time = a.find_previous('h3').text
    date = a.find_previous('h4').text
    print(title, date, time, end = "\n")
Output:
ASH Poster Walk on What's Hot in Sickle Cell Disease
Wednesday, December 15, 2021
10:00 AM-11:00 AM
ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Healthcare Quality Improvement
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Natural Killer Cell-Based Immunotherapy
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Pediatric Non-malignant Hematology Highlights
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Clinical Trials In Progress
Thursday, December 16, 2021
10:00 AM-11:00 AM
ASH Poster Walk on Financial Toxicity in Hematologic Malignancies
Thursday, December 16, 2021
10:00 AM-11:00 AM
ASH Poster Walk on Diversity, Equity, and Inclusion in Hematologic Malignancies and Cell Therapy
Thursday, December 16, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Emerging Research in Immunotherapies
Thursday, December 16, 2021
5:00 PM-6:00 PM
ASH Poster Walk on the Spectrum of Hemostasis and Thrombosis Research
Thursday, December 16, 2021
5:00 PM-6:00 PM
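To get the comma-separated "time,title" lines the question asks for, the same find_previous idea can be wrapped in a small function. The following is only a sketch that runs offline: SAMPLE_HTML and pair_times_with_titles are made-up names, and the markup is an assumption about the page layout (date in h4, time slot in h3, one div.itemtitle per session) inferred from the selectors above, so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the real markup may differ.
SAMPLE_HTML = """
<div class="content">
  <h4>Wednesday, December 15, 2021</h4>
  <h3>10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">ASH Poster Walk on What's Hot in Sickle Cell Disease</a></div>
  <h3>5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">ASH Poster Walk on Healthcare Quality Improvement</a></div>
  <div class="itemtitle"><a href="#">ASH Poster Walk on Natural Killer Cell-Based Immunotherapy</a></div>
</div>
"""


def pair_times_with_titles(html):
    """Return one (time, title) tuple per title link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.select("div.itemtitle > a"):
        # find_previous searches backwards in document order, so it returns
        # the closest <h3> that appears before this link (or None).
        h3 = a.find_previous("h3")
        time_slot = h3.get_text(strip=True) if h3 else ""
        pairs.append((time_slot, a.get_text(strip=True)))
    return pairs


pairs = pair_times_with_titles(SAMPLE_HTML)
for time_slot, title in pairs:
    print(f"{time_slot},{title}")
```

Because every title looks up its own heading, it does not matter that there are fewer time slots than titles, which is the unequal-length case mentioned in the question. Note that some titles contain commas themselves, so a CSV writer that quotes fields may be safer than plain string joining if the output is meant to be parsed later.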
Answer from a uj5u.com user:
Try this:
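# Reuses the `soup` object built in the question's code.
content = soup.find('div', {"class": "content"})
times = content.find_all("h3")
output = []
for i,h3 in enumerate(times):
    # Walk the siblings that follow this time heading
    for j in h3.next_siblings:
        if j.name:  # skip bare text nodes, which have no tag name
            if j.name == "h3":
                break  # the next time slot starts here
            j = j.text.replace('\n', '')
            output.append(f"{times[i].text}, {j}")
print(output)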
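A variation on the same idea, in case the titles are not direct siblings of the h3 headings: select headings and title links together and walk them in document order, remembering the most recent date and time. This is a sketch and not the answer's method; SAMPLE_HTML and rows_in_document_order are hypothetical names, and the markup is an assumed layout, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: dates in <h4>, time slots in <h3>, titles in
# <div class="itemtitle"><a>. The real page may be structured differently.
SAMPLE_HTML = """
<div class="content">
  <h4>Wednesday, December 15, 2021</h4>
  <h3>5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">ASH Poster Walk on Healthcare Quality Improvement</a></div>
  <div class="itemtitle"><a href="#">ASH Poster Walk on Natural Killer Cell-Based Immunotherapy</a></div>
  <h4>Thursday, December 16, 2021</h4>
  <h3>10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">ASH Poster Walk on Clinical Trials In Progress</a></div>
</div>
"""


def rows_in_document_order(html):
    """Return (date, time, title) tuples by tracking the latest headings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    current_date = ""
    current_time = ""
    # A comma-separated selector list matches any of the alternatives;
    # the matches come back in document order.
    for el in soup.select("h4, h3, div.itemtitle > a"):
        text = el.get_text(strip=True)
        if el.name == "h4":
            current_date = text
            current_time = ""  # a new date starts with no time slot yet
        elif el.name == "h3":
            current_time = text
        else:
            rows.append((current_date, current_time, text))
    return rows


rows = rows_in_document_order(SAMPLE_HTML)
for date, time_slot, title in rows:
    print(f"{time_slot},{title}")
```

Tracking the date separately keeps h4 headings from being mistaken for titles, which can happen when every non-h3 sibling is appended to the output.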
