I'm new to Python and have a file.xml with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>blue dog w short hair</DESCRIPTION_SHORT>
<DESCRIPTION_LONG>blue dog w short hair and unlimitied zoomies</DESCRIPTION_LONG>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE>
<FNAME>Hair</FNAME>
<FVALUE>short</FVALUE>
</FEATURE>
<FEATURE>
<FNAME>Colour</FNAME>
<FVALUE>blue</FVALUE>
</FEATURE>
<FEATURE>
<FNAME>Legs</FNAME>
<FVALUE>4</FVALUE>
</FEATURE>
</PRODUCT_FEATURES>
</HEADER>
I'm using a very simple snippet (below) to convert it to file_export.csv:
import pandas as pd
df = pd.read_xml("file.xml")
# df
df.to_csv("file_export.csv", index=False)
The problem is that I end up with a table like this:
DESCRIPTION_SHORT DESCRIPTION_LONG FEATURE
blue dog w short hair blue dog w short hair and unlimitied zoomies NaN
I tried removing the FEATURE element, but then the last FNAME and FVALUE seem to overwrite(?) the earlier ones, presumably because they share the same names:
DESCRIPTION_SHORT DESCRIPTION_LONG FNAME FVALUE
blue dog w short hair blue dog w short hair and unlimitied zoomies None NaN
None None Legs 4.0
What do I need to add to my code to show the nested elements, including their text? Like this:
DESCRIPTION_SHORT DESCRIPTION_LONG FEATURE FNAME FVALUE
blue dog w short hair blue dog w short hair and unlimitied zoomies NaN Hair short
blue dog w short hair blue dog w short hair and unlimitied zoomies NaN Colour blue
blue dog w short hair blue dog w short hair and unlimitied zoomies NaN Legs 4
Thanks in advance!!
Answer from a uj5u.com user:
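First, the sample XML in your question (and probably your actual XML) is not well suited to read_xml(), which works best on shallow, flat documents. In a case like this you are better off using a real XML parser and handing its output to pandas.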
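Also, I don't think your desired output is very efficient: in your example, the short and long descriptions are each repeated 3 times for no apparent reason.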
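If the long layout from the question is needed anyway, one way to get it is to walk the tree yourself and repeat the detail fields for every FEATURE. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree instead of lxml, the helper name header_to_long_rows is made up for illustration, and the XML is the sample from the question inlined as a string so the snippet runs on its own. It leaves out the empty FEATURE column shown in the desired table.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Sample document from the question, inlined so the sketch is self-contained.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>blue dog w short hair</DESCRIPTION_SHORT>
<DESCRIPTION_LONG>blue dog w short hair and unlimitied zoomies</DESCRIPTION_LONG>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE><FNAME>Hair</FNAME><FVALUE>short</FVALUE></FEATURE>
<FEATURE><FNAME>Colour</FNAME><FVALUE>blue</FVALUE></FEATURE>
<FEATURE><FNAME>Legs</FNAME><FVALUE>4</FVALUE></FEATURE>
</PRODUCT_FEATURES>
</HEADER>"""


def header_to_long_rows(header):
    """Return one dict per FEATURE, repeating the PRODUCT_DETAILS fields.

    Hypothetical helper, not part of pandas or ElementTree.
    """
    # Direct children of PRODUCT_DETAILS become the repeated columns.
    details = {child.tag: child.text for child in header.find("PRODUCT_DETAILS")}
    return [
        {
            **details,
            "FNAME": feature.findtext("FNAME"),
            "FVALUE": feature.findtext("FVALUE"),
        }
        for feature in header.iter("FEATURE")
    ]


# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(xml_text.encode("utf-8"))
long_df = pd.DataFrame(header_to_long_rows(root))
print(long_df)
```

All values come back as strings here, so a numeric FVALUE such as 4 would need an explicit conversion if that matters.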
Having said all that, I would suggest something like this.
Assume your actual XML contains several pets, for example:
inventory="""<?xml version="1.0" encoding="UTF-8"?>
<doc>
<HEADER>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>green cat w short hair</DESCRIPTION_SHORT>
<DESCRIPTION_LONG>green cat w short hair and unlimitied zoomies</DESCRIPTION_LONG>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE>
<FNAME>Hair</FNAME>
<FVALUE>medium</FVALUE>
</FEATURE>
<FEATURE>
<FNAME>Colour</FNAME>
<FVALUE>green</FVALUE>
</FEATURE>
<FEATURE>
<FNAME>Legs</FNAME>
<FVALUE>14</FVALUE>
</FEATURE>
</PRODUCT_FEATURES>
</HEADER>
<!-- the HEADER element from your question goes here -->
</doc>"""
from lxml import etree
import pandas as pd
# lxml rejects str input that carries an encoding declaration, so parse bytes
doc = etree.XML(inventory.encode())
pets = doc.xpath('//HEADER')
# column names: the PRODUCT_DETAILS tags, then the FNAME values of the first HEADER
headers = [elem.tag for elem in doc.xpath('//HEADER[1]//PRODUCT_DETAILS//*')]
headers.extend(doc.xpath('//HEADER[1]//FNAME/text()'))
rows = []
for pet in pets:
    # the two descriptions first, then every FVALUE in document order
    row = [pet.xpath(f'.//{headers[0]}/text()')[0],
           pet.xpath(f'.//{headers[1]}/text()')[0]]
    row.extend(pet.xpath('.//FVALUE/text()'))
    rows.append(row)
If you want to be more adventurous and use XPath 2.0 (which lxml doesn't support) along with more list comprehensions, you could try this:
from elementpath import select
expression1 = '//HEADER[1]/string-join((./PRODUCT_DETAILS//*/name(),./PRODUCT_FEATURES//FNAME),",")'
expression2 = '//HEADER/string-join((./PRODUCT_DETAILS//*,./PRODUCT_FEATURES//FVALUE),",")'
headers = [h.split(',') for h in select(doc, expression1 )]
rows= [r.split(',') for r in select(doc, expression2)]
In either case:
pd.DataFrame(rows,columns=headers)
should output:
DESCRIPTION_SHORT DESCRIPTION_LONG Hair Colour Legs
0 green cat w short hair green cat w short hair and unlimitied zoomies medium green 14
1 blue dog w long hair blue dog w long hair and limitied zoomies short blue 4
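One caveat with positional rows like the ones above: they line up only if every HEADER lists the same features in the same order. A possible variation, sketched below under that concern, builds one dict per HEADER keyed by FNAME, so pandas aligns columns by name and fills gaps with NaN. The helper name header_to_wide_row and the trimmed sample data (with deliberately mismatched features) are made up for illustration, and the standard library's xml.etree.ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical, trimmed inventory: the two pets list different features
# in a different order on purpose.
inventory = """<doc>
<HEADER>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>green cat w short hair</DESCRIPTION_SHORT>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE><FNAME>Hair</FNAME><FVALUE>medium</FVALUE></FEATURE>
<FEATURE><FNAME>Legs</FNAME><FVALUE>14</FVALUE></FEATURE>
</PRODUCT_FEATURES>
</HEADER>
<HEADER>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>blue dog w short hair</DESCRIPTION_SHORT>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE><FNAME>Legs</FNAME><FVALUE>4</FVALUE></FEATURE>
<FEATURE><FNAME>Colour</FNAME><FVALUE>blue</FVALUE></FEATURE>
</PRODUCT_FEATURES>
</HEADER>
</doc>"""


def header_to_wide_row(header):
    """Return one dict per HEADER: detail tags plus one key per FNAME.

    Hypothetical helper, not part of pandas or ElementTree.
    """
    row = {child.tag: child.text for child in header.find("PRODUCT_DETAILS")}
    for feature in header.iter("FEATURE"):
        # Each feature name becomes a column; its value fills the cell.
        row[feature.findtext("FNAME")] = feature.findtext("FVALUE")
    return row


doc = ET.fromstring(inventory)
wide_df = pd.DataFrame([header_to_wide_row(h) for h in doc.iter("HEADER")])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = wide_df.to_csv(index=False)
print(csv_text)
```

To produce the file from the question, the last step would be wide_df.to_csv("file_export.csv", index=False) instead.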
