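I am scraping PubMed XML files with Python's xml.etree.ElementTree. HTML formatting elements embedded in the text cause only a fragment of the text to be returned for a given XML element. For example, the following XML element returns only the text before the italic tag.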
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
Here is sample code that runs but does not return the full record when it contains HTML.
import xml.etree.ElementTree as ET

xmldata = 'directory/to/data.xml'
tree = ET.parse(xmldata)
root = tree.getroot()

titles = {}
for i in range(len(root)):
    for child in root[i].iter():
        if child.tag == 'ArticleTitle':
            # .text holds only the text before the first child element (e.g. <i>)
            title = child.text
            titles[i] = title
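I have also tried lxml.etree with something like child.xpath('//AbstractText/text()'). That returns all the text in the file as list elements, but there is no obvious way to recombine those elements into the original abstracts (i.e., 3 abstracts might return 3x list elements).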
Answer:
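The answer is itertext(), which collects the inner text of an element, including the text inside its child elements.
So the code would look like this: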
import xml.etree.ElementTree as ET
from io import StringIO
raw_data="""
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
"""
tree = ET.parse(StringIO(raw_data))
root = tree.getroot()
# The element contains a child element, which is why .text stops at <i>
for e in root.findall("."):
    print(e.text, type(e))
Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <class 'xml.etree.ElementTree.Element'>
Using itertext():
"".join(root.find(".").itertext()) # "".join(element.itertext())
'Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which Microdochium species are the most harmful.'
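To keep each abstract together instead of getting one flat list for the whole file, itertext() can be applied per article. The following is only a minimal sketch: the SAMPLE document, the helper names full_text and collect_abstracts, and the choice to join multiple AbstractText sections with a space are assumptions made for illustration, not part of the original question or of the PubMed format definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that imitates a PubMed export; real files carry many more fields.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <Article>
      <ArticleTitle>First <i>title</i></ArticleTitle>
      <Abstract>
        <AbstractText>Snow mold is caused by fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
      </Abstract>
    </Article>
  </PubmedArticle>
  <PubmedArticle>
    <Article>
      <ArticleTitle>Second title</ArticleTitle>
      <Abstract>
        <AbstractText>Background text.</AbstractText>
        <AbstractText>Results with <sub>2</sub> markup.</AbstractText>
      </Abstract>
    </Article>
  </PubmedArticle>
</PubmedArticleSet>"""


def full_text(element):
    """Return all text inside element, including text nested in child tags."""
    return "".join(element.itertext())


def collect_abstracts(root):
    """Map each top-level article index to its complete abstract text."""
    abstracts = {}
    for i, article in enumerate(root):
        # An article may have several AbstractText sections; join them with a space.
        parts = [full_text(node) for node in article.iter("AbstractText")]
        abstracts[i] = " ".join(parts)
    return abstracts


root = ET.fromstring(SAMPLE)
abstracts = collect_abstracts(root)
print(abstracts[0])
```

Because the loop walks one top-level child at a time, every abstract stays attached to its own article index, which avoids the regrouping problem described for the '//AbstractText/text()' XPath approach.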
