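I am trying to build a DataFrame from multiple nested XML files and append the data from all of them to a single DataFrame. I know the structure of the DataFrame and have already defined it.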
import xml.etree.ElementTree as ET
import pandas as pd

tree_list = []
details = ['FirstName', 'LastName', 'City', 'Country']
for file in bucket_list:
    obj = s3.Object(s3_bucket_name, file)
    data = obj.get()['Body'].read()
    tree_list.append(ET.ElementTree(ET.fromstring(data)))
def parse_XML(list_of_trees, df_cols):
    for tree in tree_list:
        xroot = tree.getroot()
        rows = []
        for node in xroot:
            res = []
            for el in df_cols[0:]:
                if node is not None and node.find(el) is not None:
                    res.append(node.find(el).text)
                else:
                    res.append(None)
            rows.append({df_cols[i-1]: res[i-1]
                         for i, _ in enumerate(df_cols)})
    out_df = pd.DataFrame(rows, columns=df_cols)
    return out_df

parse_XML(tree_list, details)
In my output DataFrame I get only the information from the last file that was read, plus several blank rows, as shown below:
FirstName  LastName  City        Country
Ted        Mosbey    Washington  USA
None       None      None        None
None       None      None        None
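What changes should I make to the code so that it reads all the files, appends them to the DataFrame, and drops the unnecessary rows? Any suggestions for processing the files efficiently are appreciated.
XML sample: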
<PD>
  <Clt>
    <PType>xxxx</PType>
    <PNumber>xxxxx</PNumber>
    <UID>xxxx</UID>
    <TEfd>xxxxx</TEfd>
    <TExd>xxxxxx</TExd>
    <DID>xxxxx</DID>
    <CType>xxxxx</CType>
    <FirstName>Ted</FirstName>
    <MiddleName></MiddleName>
    <LastName>Mosbey</LastName>
    <MailingAddrLocation>Home</MailingAddrLocation>
    <AddressLine1>3435</AddressLine1>
    <AddressLine2>Columbia RD</AddressLine2>
    <AddressLine3></AddressLine3>
    <City>Washington</City>
    <State>DC</State>
    <ZipCode>20009</ZipCode>
    <Country>USA</Country>
    <Pr>
      <PrType>xxxxx</PrType>
      <PrName>xxxxxx</PrName>
      <PrID>xxxxxx</PrID>
    </Pr>
  </Clt>
</PD>
Answer from a uj5u.com community member:
Now that I have a sample of your data, I tested this and it works for me in the way I think you want:
def parse_XML(list_of_trees, df_cols):
    def get_el(el_list):
        # findall() returns a list, which is empty when nothing matched
        if not el_list:
            return None
        if len(el_list) > 1:
            return [el.text for el in el_list]
        return el_list[0].text

    rows = []  # created once, so rows from every file accumulate
    for tree in list_of_trees:
        xroot = tree.getroot()
        for node in xroot:
            # ".//" searches all descendants of node, not only direct children
            res = [get_el(node.findall(f".//{el}")) for el in df_cols]
            # skip nodes that contain none of the requested elements
            if any(value is not None for value in res):
                rows.append(res)
    out_df = pd.DataFrame(rows, columns=df_cols)
    return out_df
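
To check the idea without access to S3, here is a minimal sketch that runs the same row-collecting logic on two in-memory documents. The helper name `collect_rows`, the second person's data, and the `<Meta>` element are all made up for illustration; they stand in for the real files and for whatever sibling nodes were producing the blank rows. It uses only the standard library and stops at the list of row dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents standing in for the files read from S3.
XML_DOCS = [
    "<PD><Clt><FirstName>Ted</FirstName><LastName>Mosbey</LastName>"
    "<City>Washington</City><Country>USA</Country></Clt>"
    "<Meta><Source>demo</Source></Meta></PD>",
    "<PD><Clt><FirstName>Robin</FirstName><LastName>Scherbatsky</LastName>"
    "<City>Vancouver</City><Country>Canada</Country></Clt></PD>",
]


def collect_rows(xml_docs, cols):
    """Return one dict per child of each root that holds any requested field."""
    rows = []  # created once, outside the loop, so every document contributes
    for doc in xml_docs:
        root = ET.fromstring(doc)
        for node in root:
            # findtext() returns None when no matching element exists
            row = {col: node.findtext(f".//{col}") for col in cols}
            # drop nodes (such as <Meta>) that have none of the fields
            if any(value is not None for value in row.values()):
                rows.append(row)
    return rows


details = ["FirstName", "LastName", "City", "Country"]
rows = collect_rows(XML_DOCS, details)
for row in rows:
    print(row)
```

Passing the result to `pd.DataFrame(rows, columns=details)` should then give one row per `<Clt>` element across all files. Building the full list first and creating the DataFrame once at the end is usually cheaper than appending to a DataFrame inside the loop.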
