Is there a way to use pandas.read_xml() without a namespace URI/URL?

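In my XML file [studentinfo.xml], some tags have namespace prefixes. Is there a way to iterate through the XML file and parse the tag contents [all sibling and child tags] without defining the namespace URI/URL?

If you have another way of parsing the XML file that does not use pandas, I am open to any and all solutions.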

<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
    <stu:StudentScreening>
        <st:name>Sam Davies</st:name>
        <st:age>15</st:age>
        <st:hair>Black</st:hair>
        <st:eyes>Blue</st:eyes>
        <st:grade>10</st:grade>
        <st:teacher>Draco Malfoy</st:teacher>
        <st:dorm>Innovation Hall</st:dorm>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Cassie Stone</st:name>
        <st:age>14</st:age>
        <st:hair>Science</st:hair>
        <st:grade>9</st:grade>
        <st:teacher>Luna Lovegood</st:teacher>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Derek Brandon</st:name>
        <st:age>17</st:age>
        <st:eyes>green</st:eyes>
        <st:teacher>Ron Weasley</st:teacher>
        <st:dorm>Hogtie Manor</st:dorm>
    </stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>

Here is my code:

import pandas as pd
from bs4 import BeautifulSoup
with open('studentinfo.xml', 'r') as f:
    file = f.read()  

def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('info')
    items_length = len(all_items)
    for index, info in enumerate(all_items):
        StudentName = info.find('<st:name>').text
        Age = info.find('<st:age>').text
        Hair = info.find('<st:hair>').text
        Eyes = info.find('<st:eyes>').text
        Grade = info.find('<st:grade>').text
        Teacher = info.find('<st:teacher>').text
        Dorm = info.find('<st:dorm>').text
        row = {
            'StudentName': StudentName,
            'Age': Age,
            'Hair': Hair,
            'Eyes': Eyes,
            'Grade': Grade,
            'Teacher': Teacher,
            'Dorm': Dorm
        }
        
        df1 = df1.append(row, ignore_index=True)
        print(f'Appending row {index + 1} of {items_length}')
    
    return df1

Expected output:

	Name	Age	Hair	Eyes	Grade	Teacher	Dorm
0	Sam Davies	15	Black	Blue	10	Draco Malfoy	Innovation Hall
1	Cassie Stone	14	Science	N/A	9	Luna Lovegood	N/A
2	Derek Brandon	17	N/A	green	N/A	Ron Weasley	Hogtie Manor

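A helpful uj5u.com user replied:

You are about 90% of the way there. I just fixed a few things:

all_items: look for StudentScreening instead of info
info.find() statements: search by the bare tag name (no prefix, no angle brackets) and handle missing values
pd.concat(): used instead of df1.append()
Finally, call the function parse_xml

Here is the code: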

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# Read in the XML file
with open('studentinfo.xml', 'r') as f:
    file = f.read()  

def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('StudentScreening')
    for index, info in enumerate(all_items):

        row = {
            'StudentName': info.find('name').text if info.find('name') else np.nan,
            'Age': info.find('age').text if info.find('age') else np.nan,
            'Hair': info.find('hair').text if info.find('hair') else np.nan, 
            'Eyes': info.find('eyes').text if info.find('eyes') else np.nan,
            'Grade': info.find('grade').text if info.find('grade') else np.nan,
            'Teacher': info.find('teacher').text if info.find('teacher') else np.nan,
            'Dorm': info.find('dorm').text if info.find('dorm') else np.nan
        }
        
        df1 = pd.concat([df1, pd.Series(row).to_frame().T], ignore_index=True)
        
    
    return df1  


print(parse_xml(file))

Output:

     StudentName Age     Hair   Eyes Grade        Teacher             Dorm
0     Sam Davies  15    Black   Blue    10   Draco Malfoy  Innovation Hall
1   Cassie Stone  14  Science    NaN     9  Luna Lovegood              NaN
2  Derek Brandon  17      NaN  green   NaN    Ron Weasley     Hogtie Manor
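
As an aside, the same missing-value handling can be sketched with the standard library alone. This is not part of the answer above: it assumes the prefixes have already been stripped from the tag names, SAMPLE is a shortened, hypothetical stand-in for studentinfo.xml, and FIELDS and extract_rows are names made up for illustration. Element.findtext() returns a default when a child tag is absent, so no `if ... else` guard is needed.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for studentinfo.xml with the prefixes already removed
# (an assumption for this sketch; the real file still carries st:/stu:).
SAMPLE = """
<StudentBreakdown>
  <Studentdata>
    <StudentScreening>
      <name>Sam Davies</name>
      <age>15</age>
      <eyes>Blue</eyes>
    </StudentScreening>
    <StudentScreening>
      <name>Cassie Stone</name>
      <age>14</age>
    </StudentScreening>
  </Studentdata>
</StudentBreakdown>
"""

# Hypothetical subset of the fields, kept short for the example.
FIELDS = ["name", "age", "eyes"]


def extract_rows(xml_text):
    """Return one dict per StudentScreening; absent tags map to None."""
    root = ET.fromstring(xml_text)
    rows = []
    for student in root.iter("StudentScreening"):
        # findtext() yields the default instead of raising when the tag is missing
        rows.append({field: student.findtext(field, default=None) for field in FIELDS})
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
# {'name': 'Sam Davies', 'age': '15', 'eyes': 'Blue'}
# {'name': 'Cassie Stone', 'age': '14', 'eyes': None}
```

A list of dicts like this can be handed to pd.DataFrame(rows) in one go, which also avoids concatenating inside the loop.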

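A helpful uj5u.com user replied:

Here is another answer, this time using ElementTree:

Import the XML file
Remove the namespace prefixes using the regex sub() method
Convert the XML file into an ElementTree
Iterate over the relevant nodes to extract the required information
Convert to a DataFrame with the expected output

Here is the code: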

import xml.etree.ElementTree as ET
import re
import pandas as pd

# import the file
with open('studentinfo.xml', 'r') as f:
    file = f.read()  
    
# remove namespace prefixes    
file = re.sub(r'stu?:', '', file)

# Extract the XML into an ElementTree
root = ET.fromstring(file)

# translation between XML tags and column names
column_names = {'name': 'StudentName', 
                'age': 'Age', 
                'hair': 'Hair', 
                'eyes': 'Eyes', 
                'grade': 'Grade', 
                'teacher': 'Teacher', 
                'dorm': 'Dorm'}

# Extract the relevant information from the ElementTree
results = []
temp_dict = {}
for node in root.iter():
    if node.tag in column_names.keys():
        temp_dict[column_names[node.tag]] = node.text 
        
    elif len(temp_dict) > 0:
        results.append(temp_dict.copy())
        temp_dict.clear()

results.append(temp_dict.copy())    

# Create a dataframe from the extracted information
df = pd.DataFrame(results)
print(df)

Output:

     StudentName Age     Hair   Eyes Grade        Teacher             Dorm
0     Sam Davies  15    Black   Blue    10   Draco Malfoy  Innovation Hall
1   Cassie Stone  14  Science    NaN     9  Luna Lovegood              NaN
2  Derek Brandon  17      NaN  green   NaN    Ron Weasley     Hogtie Manor
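
A note on the prefix-stripping step, offered as a sketch rather than as a correction of the answer: ElementTree rejects undeclared prefixes outright, which is why they have to go before parsing, and the pattern stu?: also matches inside ordinary text. The snippet below uses a hypothetical one-student string (the dorm value "West: Innovation Hall" is invented to show the problem) and a narrower pattern, TAG_PREFIX, that only strips a prefix directly after "<" or "</". It does not handle prefixed attributes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical snippet; the dorm text deliberately contains "st:" ("West:").
RAW = (
    "<stu:StudentScreening>"
    "<st:name>Sam Davies</st:name>"
    "<st:dorm>West: Innovation Hall</st:dorm>"
    "</stu:StudentScreening>"
)

# 1. Undeclared prefixes make the strict parser fail ("unbound prefix").
try:
    ET.fromstring(RAW)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

# 2. The broad pattern also eats the "st:" inside the text content.
broad = re.sub(r"stu?:", "", RAW)

# 3. A narrower pattern: drop a prefix only right after "<" or "</".
TAG_PREFIX = re.compile(r"(</?)[A-Za-z_][\w.-]*:")
cleaned = TAG_PREFIX.sub(r"\1", RAW)

root = ET.fromstring(cleaned)
print(parsed_raw)                          # False
print("We Innovation Hall" in broad)       # True: the text was damaged
print(root.findtext("dorm"))               # West: Innovation Hall
```

For the sample file in the question both patterns give the same result, since none of its text values contain "st:" or "stu:".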

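Although you said you would accept a solution that does not use pandas, I actually think pandas is the simplest and cleanest approach.

Here is my answer using pandas: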

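import io
import re
import pandas as pd

# import the file
with open('studentinfo.xml', 'r') as f:
    file = f.read()

# remove namespace prefixes
file = re.sub(r'stu?:', '', file)

# read the XML, selecting the nodes of interest; wrapping the text in
# StringIO avoids the literal-string deprecation in recent pandas versions
df = pd.read_xml(io.StringIO(file), xpath="//StudentScreening")
print(df)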

Output:

     StudentName Age     Hair   Eyes Grade        Teacher             Dorm
0     Sam Davies  15    Black   Blue    10   Draco Malfoy  Innovation Hall
1   Cassie Stone  14  Science    NaN     9  Luna Lovegood              NaN
2  Derek Brandon  17      NaN  green   NaN    Ron Weasley     Hogtie Manor
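
If you would rather keep the prefixes than strip them, another option is to bind them to placeholder URIs before parsing. The following is only a sketch under stated assumptions: the urn:example:... URIs are invented, RAW is a shortened copy of the sample file without the XML declaration, and the root tag is assumed to be exactly <stu:StudentBreakdown> with no attributes.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file (assumption: no XML declaration,
# root start tag has no attributes).
RAW = """<stu:StudentBreakdown>
<stu:Studentdata>
    <stu:StudentScreening>
        <st:name>Sam Davies</st:name>
        <st:age>15</st:age>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Cassie Stone</st:name>
    </stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>
"""

# Placeholder URIs invented for this sketch; any unique strings would do.
NS = {"stu": "urn:example:stu", "st": "urn:example:st"}

# Build xmlns declarations and splice them into the root start tag.
decls = "".join(f' xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
patched = RAW.replace("<stu:StudentBreakdown>", f"<stu:StudentBreakdown{decls}>", 1)

root = ET.fromstring(patched)

# Query with the same prefix-to-URI mapping; missing tags come back as None.
students = root.findall(".//stu:StudentScreening", NS)
names = [s.findtext("st:name", namespaces=NS) for s in students]
ages = [s.findtext("st:age", namespaces=NS) for s in students]
print(names)  # ['Sam Davies', 'Cassie Stone']
print(ages)   # ['15', None]
```

pandas.read_xml() also has a namespaces parameter, so the patched text could in principle be passed to it with a prefixed xpath; I have not verified how the resulting column names come out.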

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/531570.html
