我正在嘗試使用 pandas 讀取 xml 檔案read_xml,但努力在輸出中包含父屬性。這可能使用read_xml還是我需要使用不同的決議器?示例 xml,我在下面嘗試過和想要的輸出。
import pandas as pd
xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature" >
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
<VT t="2020-08-18T22:30:00">3.2</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed" >
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
<VT t="2020-08-18T22:30:00">22.0</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''
# Getting all VT values, but no parent attributes
pd.read_xml(xml, xpath='.//VT')
"""
t VT
0 2020-08-18T20:30:00 3.2
1 2020-08-18T21:30:00 3.3
2 2020-08-18T22:30:00 3.2
3 2020-08-18T20:30:00 21.7
4 2020-08-18T21:30:00 21.8
5 2020-08-18T22:30:00 22.0
"""
# Alternative to read one channel at the time
# But want to avoid opening file several times since they can be large
# For example like this, and then build the dataframe in a loop
pd.read_xml(xml, xpath='.//ChannelData[@channelId="1"]/Values/VT')
"""
t VT
0 2020-08-18T20:30:00 3.2
1 2020-08-18T21:30:00 3.3
2 2020-08-18T22:30:00 3.2
"""
所需輸出的示例(可以包含所有或選擇的父屬性),理想情況下只讀取 xml 檔案一次
t VT channelId LocationId
0 2020-08-18T20:30:00 3.2 1 123
1 2020-08-18T21:30:00 3.3 1 123
2 2020-08-18T22:30:00 3.2 1 123
3 2020-08-18T20:30:00 21.7 2 123
4 2020-08-18T21:30:00 21.8 2 123
5 2020-08-18T22:30:00 22.0 2 123
uj5u.com熱心網友回復:
使用 ElementTree - 見下文
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature" >
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
<VT t="2020-08-18T22:30:00">3.2</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed" >
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
<VT t="2020-08-18T22:30:00">22.0</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''
data = []
root = ET.fromstring(xml)
for ld in root.findall('.//LocationData'):
location = ld.attrib['LocationId']
for cd in ld.findall('ChannelData'):
channel = cd.attrib['channelId']
for vt in cd.findall('.//VT'):
data.append({'t': vt.attrib['t'],'VT': vt.text,'channelId':channel,'LocationId':location})
df = pd.DataFrame(data)
print(df)
輸出
t VT channelId LocationId
0 2020-08-18T20:30:00 3.2 1 123
1 2020-08-18T21:30:00 3.3 1 123
2 2020-08-18T22:30:00 3.2 1 123
3 2020-08-18T20:30:00 21.7 2 123
4 2020-08-18T21:30:00 21.8 2 123
5 2020-08-18T22:30:00 22.0 2 123
uj5u.com熱心網友回復:
一種選擇是使路徑通用,從 LocationData 節點開始盡可能多地捕獲:
(pd
.read_xml(xml, xpath='.//*')
.dropna(how = 'all', axis = 1)
.assign(channelId = lambda df: df.channelId.ffill(),
LocationId = lambda df: df.LocationId.ffill(),
name = lambda df: df.name.ffill())
.dropna(subset='t')
)
LocationId name channelId VT t
3 123.0 Temperature 1.0 3.2 2020-08-18T20:30:00
4 123.0 Temperature 1.0 3.3 2020-08-18T21:30:00
5 123.0 Temperature 1.0 3.2 2020-08-18T22:30:00
8 123.0 Speed 2.0 21.7 2020-08-18T20:30:00
9 123.0 Speed 2.0 21.8 2020-08-18T21:30:00
10 123.0 Speed 2.0 22.0 2020-08-18T22:30:00
轉載請註明出處,本文鏈接:https://www.uj5u.com/yidong/459322.html
上一篇:錯誤的屬性值轉換?
