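I'm new to XML and BeautifulSoup, and I'm trying to get a dataset of clinical trials using ClinicalTrials.gov's new API, which returns a list of trials as XML. I tried using find_all() the way I normally do with HTML, but I haven't had the same luck. I've tried a few other approaches, such as converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract every NCTId and the official title of each clinical trial listed in the XML file. I know I could convert the whole thing to a string and use a regular expression, but I want to learn how to parse XML properly. Any help is appreciated!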
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
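Reply from a uj5u.com user:
find_all() takes the tag name and the attributes as separate arguments, so you can filter on the attribute like this: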
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
Then iterate over the results to get the text of each one, for example:
official_titles = [result.text for result in m1_officialtitle]
For more information, see the BeautifulSoup documentation.
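Collecting the IDs and the titles in two separate lists can misalign them if a study has no official title. A minimal sketch of one way around that is to look up both fields inside each study element. The XML below is a hypothetical, heavily simplified stand-in for the API response, and field_text is an illustrative helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified sample; the real response nests more levels.
SAMPLE_XML = """
<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="IdentificationModule">
        <Field Name="NCTId">NCT00000001</Field>
        <Field Name="OfficialTitle">First Hypothetical Trial</Field>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="IdentificationModule">
        <Field Name="NCTId">NCT00000002</Field>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>
"""


def field_text(study, field_name):
    """Return the stripped text of the first matching field, or None."""
    # The tag and attribute names are lowercase because the "lxml" HTML
    # parser lowercases them; attribute values keep their original case.
    tag = study.find("field", attrs={"name": field_name})
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(SAMPLE_XML, "lxml")

# One (NCTId, OfficialTitle) tuple per study; a missing field becomes None.
pairs = [
    (field_text(study, "NCTId"), field_text(study, "OfficialTitle"))
    for study in soup.find_all("fullstudy")
]

for nct_id, title in pairs:
    print(nct_id, title)
```

With a live response, passing response.content to BeautifulSoup in place of SAMPLE_XML should work the same way, assuming each study is wrapped in a FullStudy element as in the sample.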
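Reply from a uj5u.com user:
Search for the field tag in lowercase and pass name as an attribute through attrs. Lowercase is needed because the "lxml" parser treats the document as HTML and lowercases tag and attribute names. This works with BeautifulSoup alone, with no need for etree: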
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
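Since the question asks how to parse the XML as XML, here is a minimal sketch using the standard library's xml.etree.ElementTree, which keeps the original case of tag and attribute names. The sample document is a hypothetical, simplified stand-in for the API response, and extract_fields is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the real response nests more levels.
SAMPLE_XML = """
<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="IdentificationModule">
        <Field Name="NCTId">NCT00000001</Field>
        <Field Name="OfficialTitle">First Hypothetical Trial</Field>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="IdentificationModule">
        <Field Name="NCTId">NCT00000002</Field>
        <Field Name="OfficialTitle">Second Hypothetical Trial</Field>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>
"""


def extract_fields(xml_text, field_name):
    """Return the text of every <Field Name="field_name"> element."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath supports [@attribute='value'] predicates.
    matches = root.findall(f".//Field[@Name='{field_name}']")
    return [(element.text or "").strip() for element in matches]


nct_ids = extract_fields(SAMPLE_XML, "NCTId")
official_titles = extract_fields(SAMPLE_XML, "OfficialTitle")
print(nct_ids)
print(official_titles)
```

For a live response, response.content could be passed in place of SAMPLE_XML. If you would rather stay with BeautifulSoup, constructing it with the "xml" parser instead of "lxml" also preserves case, so the search would use "Field" and "Name".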
