Web scraping: columns of different lengths

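My current code scrapes each column separately, but I want to map each time to its titles.

Since the page does not put the time and the title under the same class, how can this mapping be done?

This is a follow-up to this question (link); my question uses an example in which the times and the titles differ in length.

Reference URL: https://ash.confex.com/ash/2021/webprogram/WALKS.html

Sample expected output:

5:00 PM-6:00 PM, ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease

5:00 PM-6:00 PM, ASH Poster Walk on Healthcare Quality Improvement

and so on.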

import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'

res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')

A uj5u.com user replied:

This might be one alternative:

import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'

res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

productlist = soup.select('div.itemtitle > a')

for a in productlist:
    title = a.text
    # find_previous walks backwards through the document, so each title
    # picks up the nearest preceding time (<h3>) and date (<h4>) heading.
    time = a.find_previous('h3').text
    date = a.find_previous('h4').text
    print(title, date, time)

Output:

ASH Poster Walk on What's Hot in Sickle Cell Disease 
Wednesday, December 15, 2021
 10:00 AM-11:00 AM

ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Healthcare Quality Improvement 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Natural Killer Cell-Based Immunotherapy 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Pediatric Non-malignant Hematology Highlights 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Clinical Trials In Progress 
Thursday, December 16, 2021
 10:00 AM-11:00 AM

ASH Poster Walk on Financial Toxicity in Hematologic Malignancies 
Thursday, December 16, 2021
 10:00 AM-11:00 AM

ASH Poster Walk on Diversity, Equity, and Inclusion in Hematologic Malignancies and Cell Therapy 
Thursday, December 16, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Emerging Research in Immunotherapies 
Thursday, December 16, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on the Spectrum of Hemostasis and Thrombosis Research 
Thursday, December 16, 2021
 5:00 PM-6:00 PM
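
To get the "time, title" format asked for in the question, the same find_previous idea can be tried offline. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup that assumes the page puts a date in an h4, a time slot in an h3, and each title in a div.itemtitle link, and pair_times_with_titles is a made-up helper name, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption about the page layout): a date in <h4>,
# a time slot in <h3>, followed by one or more div.itemtitle entries.
SAMPLE_HTML = """
<div class="content">
  <h4>Wednesday, December 15, 2021</h4>
  <h3> 10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk A</a></div>
  <h3> 5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">Walk B</a></div>
  <div class="itemtitle"><a href="#">Walk C</a></div>
</div>
"""


def pair_times_with_titles(html):
    """Return (time, title) tuples, one per title link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.select("div.itemtitle > a"):
        # Nearest <h3> that appears before this link in document order.
        heading = a.find_previous("h3")
        time_text = heading.get_text(strip=True) if heading else ""
        rows.append((time_text, a.get_text(strip=True)))
    return rows


rows = pair_times_with_titles(SAMPLE_HTML)
for time_text, title in rows:
    print(f"{time_text}, {title}")
```

Because the loop runs over the titles rather than the times, it does not matter that there are fewer time headings than titles: every title looks up its own heading.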

A uj5u.com user replied:

Try this:

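# Assumes `soup` was built as in the question.
content = soup.find('div', {"class": "content"})
times = content.find_all("h3")
output = []
for h3 in times:
    # Walk the siblings that follow this time heading.
    for sibling in h3.next_siblings:
        if sibling.name:  # skip bare text nodes between tags
            if sibling.name == "h3":
                break  # the next time slot starts here
            title = sibling.text.replace('\n', '')
            output.append(f"{h3.text}, {title}")
print(output)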
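
The sibling walk can also keep the titles grouped per time slot instead of flattening them into strings. The following is a rough sketch under assumptions: SAMPLE_HTML is hypothetical markup in which headings and titles are direct siblings inside div.content, group_titles_by_time is an illustrative name, and siblings without a link (such as an h4 date heading) are skipped, which the answer above does not do.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup (an assumption): headings and title blocks are
# direct siblings inside div.content, and time slots repeat across days.
SAMPLE_HTML = """
<div class="content">
<h4>Wednesday, December 15, 2021</h4>
<h3>10:00 AM-11:00 AM</h3>
<div class="itemtitle"><a href="#">Walk A</a></div>
<h3>5:00 PM-6:00 PM</h3>
<div class="itemtitle"><a href="#">Walk B</a></div>
<div class="itemtitle"><a href="#">Walk C</a></div>
<h4>Thursday, December 16, 2021</h4>
<h3>10:00 AM-11:00 AM</h3>
<div class="itemtitle"><a href="#">Walk D</a></div>
</div>
"""


def group_titles_by_time(html):
    """Return a list of (time, [titles]) pairs in page order."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    groups = []
    for h3 in content.find_all("h3"):
        titles = []
        for sibling in h3.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # whitespace or other text between tags
            if sibling.name == "h3":
                break  # reached the next time slot
            link = sibling.find("a")
            if link is not None:  # ignore siblings such as <h4> dates
                titles.append(link.get_text(strip=True))
        groups.append((h3.get_text(strip=True), titles))
    return groups


groups = group_titles_by_time(SAMPLE_HTML)
for time_text, titles in groups:
    print(time_text, titles)
```

A list of pairs is used rather than a dict keyed by time, since the same time slot can occur on more than one day and a dict would silently overwrite the earlier entry.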
