The XML looks like this:
<Section>
<ContainerBlockElement>
<UnorderedList>
<ListItem>
<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>
</ListItem>
</UnorderedList>
<UnorderedList>
<ListItem>
<Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"</URLLink></Paragraph>
</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"></URLLink></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
<ListItem>Don't do blablabla</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>
I want to extract all the text data inside each ContainerBlockElement, but the child tags and their structure differ every time.
Expected output:
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
Update: I have now added a new element at the end of the XML above.
<ContainerBlockElement>
<Paragraph>Apply the newer update in: <URLLink LinkURL="www.newerupdate.com"></URLLink></Paragraph>
</ContainerBlockElement>
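With this change, @ACHRAF's answer prints the lines out of order. It is order-sensitive, so it cannot be used to process differently structured XML files.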
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Apply the newer update in: www.newerupdate.com
Don't do this
Don't do that
Don't do blablabla
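The expected output should follow the order of the XML. In addition, the program should be able to tell which lines belong to the same ContainerBlockElement. (For example, I need "Follow these rules:", "Don't do this", "Don't do that", and "Don't do blablabla" to end up in the same array.)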
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
Apply the newer update in: www.newerupdate.com
A uj5u.com user replied:
First, your example contains an error in the URLLink element:
<URLLink LinkURL="www.software1.com"</URLLink>
should be
<URLLink LinkURL="www.software1.com"/>
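To see why this correction matters, here is a minimal sketch using only the standard library. The helper name `is_well_formed` is made up for illustration; it simply reports whether `xml.etree.ElementTree` can parse a snippet.

```python
import xml.etree.ElementTree as ET

# The tag as written in the question: the start tag is never closed with '>' or '/>'
broken = '<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>'
# The corrected, self-closing form
fixed = '<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>'

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(broken))  # False
print(is_well_formed(fixed))   # True
```

Until the tag is corrected, any parser-based extraction will fail on the whole file, not just on that one element.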
Complete example:
<Section>
<ContainerBlockElement>
<UnorderedList>
<ListItem>
<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
</ListItem>
</UnorderedList>
<UnorderedList>
<ListItem>
<Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
<ListItem>Don't do blablabla</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>
To extract the data, you can do this:
from xml.etree import ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
# Collect the parents whose direct children carry the text:
# list items, containers, and unordered lists
results = (root.findall('ContainerBlockElement/UnorderedList/ListItem')
           + root.findall('ContainerBlockElement')
           + root.findall('ContainerBlockElement/UnorderedList'))
for elem in results:
    for e in elem:
        # e.text is None when the element has no leading text at all
        text = (e.text or '').strip()
        if not text:
            continue  # skip children that hold only whitespace
        url_link = e.find('./URLLink')
        if url_link is None:
            print(text)
        else:
            print(text + " " + url_link.attrib['LinkURL'])
Output:
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
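The asker's update requires document order and one array per ContainerBlockElement. The following is one possible sketch with the standard library only, not the answer author's method. It assumes the corrected self-closing `<URLLink/>` form, uses a shortened inline sample instead of `test.xml`, and ignores any text that follows a URLLink inside the same paragraph. The function name `extract_groups` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample (assumption: same shape as the question's XML, with fixed URLLink tags)
XML = """<Section>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Apply the newer update in: <URLLink LinkURL="www.newerupdate.com"/></Paragraph>
</ContainerBlockElement>
</Section>"""

def extract_groups(root):
    """Return one list of lines per ContainerBlockElement, in document order."""
    groups = []
    for container in root.findall('ContainerBlockElement'):
        lines = []
        # iter() yields the container and all its descendants in document order
        for elem in container.iter():
            text = (elem.text or '').strip()
            if not text:
                continue  # whitespace-only or text-less element
            link = elem.find('URLLink')
            if link is not None:
                # append the link target stored in the attribute
                text += ' ' + link.get('LinkURL', '')
            lines.append(text)
        groups.append(lines)
    return groups

groups = extract_groups(ET.fromstring(XML))
for lines in groups:
    print(lines)
```

Because each container is walked separately, the nesting depth of the child tags does not matter, and a container appended at the end of the file stays at the end of the result.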
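A uj5u.com user replied:
In addition to the correction mentioned in @ACHRAF's answer, I would also suggest using lxml instead of ElementTree, because lxml has better XPath support:
from lxml import etree
doc = etree.parse('file.xml')
for entry in doc.xpath('//Paragraph'):
    # URL stored in the attribute of a child URLLink, if any
    link_target = entry.xpath('./URLLink/@LinkURL')
    # text of any UnorderedList that follows this paragraph at the same level
    ul_target = entry.xpath('./following-sibling::UnorderedList//text()')
    link = link_target[0] if link_target else ''
    ul = " ".join(ul_target) if ul_target else ''
    print(entry.text, link, ul)
Output: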
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
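A uj5u.com user replied:
To get the elements that have actual text or a URLLink child, use this XPath:
/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]
* matches any element node.
[URLLink or text()[normalize-space()]] is a predicate that keeps only elements with a direct URLLink child element, or with a direct text() child that is not just whitespace.
Then use Python to extract the text() and the URLLink.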
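The Python step is left open above, so here is one way it might look with lxml. This is a sketch under assumptions: the XPath is split into a container part and a descendant part so that results can be grouped per ContainerBlockElement, the sample is a shortened inline version of the question's XML with corrected URLLink tags, and `extract_groups` is a made-up name.

```python
from lxml import etree

# Shortened sample (assumption: same shape as the question's XML)
XML = """<Section>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>"""

# Descendant half of the XPath from the answer, evaluated relative to each container
DESCENDANTS = './/*[URLLink or text()[normalize-space()]]'

def extract_groups(root):
    """Return one list of lines per ContainerBlockElement, in document order."""
    groups = []
    for container in root.xpath('/Section/ContainerBlockElement'):
        lines = []
        for elem in container.xpath(DESCENDANTS):
            # the element's own non-blank text nodes
            text = ' '.join(t.strip() for t in elem.xpath('text()') if t.strip())
            # link targets of direct URLLink children
            links = elem.xpath('URLLink/@LinkURL')
            lines.append(' '.join([text] + links).strip())
        groups.append(lines)
    return groups

groups = extract_groups(etree.fromstring(XML))
for lines in groups:
    print(lines)
```

XPath results come back in document order, so the output follows the order of the file however the child tags are nested.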
