How to parse XML with namespaces in its tags using BeautifulSoup?

I have an XML link ( http://api.worldbank.org/v2/countries ) that contains the following data:

<wb:countries xmlns:wb="http://www.worldbank.org" page="1" pages="6" per_page="50" total="299">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
<wb:adminregion id="" iso2code=""/>
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code=""/>
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity/>
<wb:longitude/>
<wb:latitude/>
</wb:country>
</wb:countries>

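I am trying to parse the income level, but it returns None. How can I reach the text inside the XML element (for example: High income) using BeautifulSoup? I tried this code, but it does not work properly!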

import requests
#import re
from bs4 import BeautifulSoup
url = 'http://api.worldbank.org/v2/countries'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
countries= soup.findAll('wb:country')
for country in countries:
    name = country.find("wb:name").text
    code = country.find('wb:iso2code').text
    incomeLevel = country.find('wb:incomeLevel', {"iso2code":"XD"})
    print(f"{name}, {code}, {incomeLevel}")

Answer from a uj5u.com user:

Thank you for posting the question. I think there are a few mistakes in your code; the points below should help you correct them.

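Instead of findAll, use the newer method name find_all from the bs4 API. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
As the second argument to BeautifulSoup, pass "lxml-xml" or simply "xml". This tells BeautifulSoup to parse the document as XML; otherwise it is parsed as HTML, which lowercases tag names such as wb:incomeLevel, whereas your goal is to parse and extract content from an XML file. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser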

Please see the following code snippet :)

import requests
from bs4 import BeautifulSoup

url = 'http://api.worldbank.org/v2/countries'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
countries = soup.find_all('wb:country')
for country in countries:
    name = country.find('wb:name').text
    code = country.find('wb:iso2Code').text
    income_level = country.find('wb:incomeLevel').text
    print(f'Name: {name} Code: {code} Income Level: {income_level}')

Answer from a uj5u.com user:

Here is how to do it with the ElementTree module from the standard library's xml package, taking the namespace into account:

from xml.etree import ElementTree as ET

ns = {'wb': 'http://www.worldbank.org'}

countries = ET.parse('input.xml').getroot()

for country in countries.findall('wb:country', namespaces=ns):
    name = country.find("wb:name", namespaces=ns).text
    code = country.find('wb:iso2Code', namespaces=ns).text
    
    incomeLevel = None
    for x in country.findall('wb:incomeLevel', namespaces=ns):
        if x.get('iso2code') == 'XD':
            incomeLevel = x.text
            break
    
    print(f"{name}, {code}, {incomeLevel}")

When I run it on the sample you provided, saved as input.xml, I get:

Aruba, AW, High income
Africa Eastern and Southern, ZH, None
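
For a quick check without a file on disk, the same idea can be sketched against an inline string. The `sample` text and the helper name `income_levels` are hypothetical and exist only for illustration; `ET.fromstring`, `findtext`, and the `namespaces` mapping are standard `xml.etree.ElementTree` features. The attribute predicate `[@iso2code='XD']` stands in for the manual loop used above.

```python
from xml.etree import ElementTree as ET

# Hypothetical inline sample modelled on the data in the question.
sample = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
  </wb:country>
</wb:countries>"""

# Maps the prefix used in the search paths to the namespace URI.
NS = {"wb": "http://www.worldbank.org"}


def income_levels(xml_text):
    """Return (name, code, income level) tuples for every country."""
    root = ET.fromstring(xml_text)
    rows = []
    for country in root.findall("wb:country", NS):
        # findtext returns the element text, or None if no element matches.
        name = country.findtext("wb:name", namespaces=NS)
        code = country.findtext("wb:iso2Code", namespaces=NS)
        # Only an incomeLevel whose iso2code attribute is "XD" matches here.
        level = country.findtext("wb:incomeLevel[@iso2code='XD']", namespaces=NS)
        rows.append((name, code, level))
    return rows


for row in income_levels(sample):
    print(row)
```

The second country has no incomeLevel with iso2code="XD", so its third field comes back as None, matching the output shown above.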

Answer from a uj5u.com user:

What is happening?

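BeautifulSoup has trouble with the namespaced XML tags because the soup was built with an HTML parser, which is why you get None. In addition, you will not get the text unless you access .text on the result. So it is a combination of the two.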
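
The effect can be reproduced on a small inline sample instead of the live API. This is only a sketch: the `sample` string is hypothetical, and Python's built-in `html.parser` is used here as the HTML-mode parser in place of the `lxml` parser from the question (both lowercase tag names in HTML mode). The `xml` parser still requires `lxml` to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample modelled on the World Bank response.
sample = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:name>Aruba</wb:name>
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
  </wb:country>
</wb:countries>"""

# HTML mode: tag names are lowercased, so the mixed-case lookup finds nothing.
html_soup = BeautifulSoup(sample, "html.parser")
html_hit = html_soup.find("wb:incomeLevel")
html_lower_hit = html_soup.find("wb:incomelevel")

# XML mode: case is preserved, so the prefixed mixed-case name matches.
xml_soup = BeautifulSoup(sample, "xml")
xml_hit = xml_soup.find("wb:incomeLevel")

print(html_hit)             # None
print(html_lower_hit.text)  # High income
print(xml_hit.text)         # High income
```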

How to fix it?

Pass xml as the parser to BeautifulSoup so that namespaces are handled correctly and the document is treated as valid XML rather than HTML:

soup = BeautifulSoup(response.content, 'xml')

Add .text to your result:

incomeLevel = country.find('incomeLevel').text

If you only want the countries that have an incomeLevel with iso2code="XD", change your selection and use CSS selectors instead of find_all():

countries = soup.select('country:has(incomeLevel[iso2code="XD"])')
for country in countries:
    name = country.find("name").text
    code = country.find('iso2Code').text
    incomeLevel = country.find('incomeLevel').text
    print(f"{name}, {code}, {incomeLevel}")

Note: The xml parser is case sensitive, so find('iso2code') does not work; you have to change it to find('iso2Code').
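
If you prefer the selector to name the namespace explicitly, CSS selectors also accept the `prefix|tag` form together with a `namespaces` mapping passed to `select()`. The following is a sketch on a hypothetical inline `sample`, not the live API; the `NS` dictionary and variable names are assumptions for illustration, and the `xml` parser requires `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample with one matching and one non-matching country.
sample = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:name>Aruba</wb:name>
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
  </wb:country>
  <wb:country id="AFE">
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
  </wb:country>
</wb:countries>"""

# Maps the prefix used in the selector to the namespace URI.
NS = {"wb": "http://www.worldbank.org"}

soup = BeautifulSoup(sample, "xml")

# Keep only countries with a child incomeLevel whose iso2code is "XD".
high_income = soup.select(
    'wb|country:has(> wb|incomeLevel[iso2code="XD"])',
    namespaces=NS,
)
names = [country.find("name").text for country in high_income]
print(names)
```

Only the first country should be selected here, since the second one carries iso2code="NA".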

Example

Note: In new code, prefer the current syntax find_all() over the outdated findAll().

import requests
#import re
from bs4 import BeautifulSoup
url = 'http://api.worldbank.org/v2/countries'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
countries= soup.find_all('country')
for country in countries:
    name = country.find("name").text
    code = country.find('iso2Code').text
    incomeLevel = country.find('incomeLevel').text
    print(f"{name}, {code}, {incomeLevel}")

Output

Aruba, AW, High income
Africa Eastern and Southern, ZH, Aggregates
Afghanistan, AF, Low income
Africa, A9, Aggregates
Africa Western and Central, ZI, Aggregates
Angola, AO, Lower middle income
Albania, AL, Upper middle income
Andorra, AD, High income
Arab World, 1A, Aggregates
United Arab Emirates, AE, High income
...

Tags: Python, xml, web-scraping, beautifulsoup
