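In my XML file [studentinfo.xml], some tags have namespace prefixes. Is there a way to traverse the XML file and parse the tag contents [all sibling and child tags] without defining the namespace URI/URL?
If you have another way of parsing the XML file that does not use pandas, I am open to any and all solutions.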
<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
<stu:StudentScreening>
<st:name>Sam Davies</st:name>
<st:age>15</st:age>
<st:hair>Black</st:hair>
<st:eyes>Blue</st:eyes>
<st:grade>10</st:grade>
<st:teacher>Draco Malfoy</st:teacher>
<st:dorm>Innovation Hall</st:dorm>
</stu:StudentScreening>
<stu:StudentScreening>
<st:name>Cassie Stone</st:name>
<st:age>14</st:age>
<st:hair>Science</st:hair>
<st:grade>9</st:grade>
<st:teacher>Luna Lovegood</st:teacher>
</stu:StudentScreening>
<stu:StudentScreening>
<st:name>Derek Brandon</st:name>
<st:age>17</st:age>
<st:eyes>green</st:eyes>
<st:teacher>Ron Weasley</st:teacher>
<st:dorm>Hogtie Manor</st:dorm>
</stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>
Here is my code:
import pandas as pd
from bs4 import BeautifulSoup
with open('studentinfo.xml', 'r') as f:
    file = f.read()
def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('info')
    items_length = len(all_items)
    for index, info in enumerate(all_items):
        StudentName = info.find('<st:name>').text
        Age = info.find('<st:age>').text
        Hair = info.find('<st:hair>').text
        Eyes = info.find('<st:eyes>').text
        Grade = info.find('<st:grade>').text
        Teacher = info.find('<st:teacher>').text
        Dorm = info.find('<st:dorm>').text
        row = {
            'StudentName': StudentName,
            'Age': Age,
            'Hair': Hair,
            'Eyes': Eyes,
            'Grade': Grade,
            'Teacher': Teacher,
            'Dorm': Dorm
        }
        df1 = df1.append(row, ignore_index=True)
        print(f'Appending row %s of %s' %(index + 1, items_length))
    return df1
Expected output:
|   | Name | Age | Hair | Eyes | Grade | Teacher | Dorm |
|---|---|---|---|---|---|---|---|
| 0 | Sam Davies | 15 | Black | Blue | 10 | Draco Malfoy | Innovation Hall |
| 1 | Cassie Stone | 14 | Science | N/A | 9 | Luna Lovegood | N/A |
| 2 | Derek Brandon | 17 | N/A | green | N/A | Ron Weasley | Hogtie Manor |
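A uj5u.com user replied:
You were about 90% of the way there. I just fixed a few things:
- all_items: look for StudentScreening instead of info
- info.find() statements: search by the bare tag name and handle missing values
- pd.concat(): used instead of df1.append(), which is deprecated and was removed in pandas 2.0
- Finally, call the function parse_xml
Here is the code: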
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
# Read in the XML file
with open('studentinfo.xml', 'r') as f:
    file = f.read()
def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('StudentScreening')
    for index, info in enumerate(all_items):
        row = {
            'StudentName': info.find('name').text if info.find('name') else np.nan,
            'Age': info.find('age').text if info.find('age') else np.nan,
            'Hair': info.find('hair').text if info.find('hair') else np.nan,
            'Eyes': info.find('eyes').text if info.find('eyes') else np.nan,
            'Grade': info.find('grade').text if info.find('grade') else np.nan,
            'Teacher': info.find('teacher').text if info.find('teacher') else np.nan,
            'Dorm': info.find('dorm').text if info.find('dorm') else np.nan
        }
        df1 = pd.concat([df1, pd.Series(row).to_frame().T], ignore_index=True)
    return df1
print(parse_xml(file))
Output:
StudentName Age Hair Eyes Grade Teacher Dorm
0 Sam Davies 15 Black Blue 10 Draco Malfoy Innovation Hall
1 Cassie Stone 14 Science NaN 9 Luna Lovegood NaN
2 Derek Brandon 17 NaN green NaN Ron Weasley Hogtie Manor
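The repeated "find the tag, otherwise use a placeholder" pattern above can be folded into a small helper. This is only a sketch: text_or_default is a hypothetical name, not part of BeautifulSoup or pandas, and the demo record is a prefix-free stand-in parsed with the standard library's ElementTree so that it runs without extra packages. The helper only assumes an object with a .find() method whose result has a .text attribute. It compares against None explicitly, because with ElementTree a bare truthiness test can misfire for elements that have no children.

```python
import xml.etree.ElementTree as ET


def text_or_default(node, tag, default=None):
    """Return the text of the first matching child, or `default` if absent."""
    found = node.find(tag)
    # Explicit None check rather than truthiness (hypothetical helper)
    return found.text if found is not None else default


# Prefix-free stand-in for one StudentScreening record (no eyes, no dorm)
record = ET.fromstring(
    "<StudentScreening>"
    "<name>Cassie Stone</name><age>14</age><hair>Science</hair>"
    "<grade>9</grade><teacher>Luna Lovegood</teacher>"
    "</StudentScreening>"
)

fields = ["name", "age", "hair", "eyes", "grade", "teacher", "dorm"]
# One dictionary per record; missing tags become None
row = {field: text_or_default(record, field) for field in fields}
print(row)
```

Passing default=float("nan") instead of None would mirror the np.nan placeholder used in the answer.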
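A uj5u.com user replied:
Here is another answer, using ElementTree:
- Import the XML file
- Remove the namespace prefixes using the regex sub() method
- Convert the XML file into an ElementTree
- Iterate over the relevant nodes to extract the required information
- Convert to a DataFrame with the expected output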
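The prefix-removal step matters because ElementTree is namespace-aware: a prefix with no matching xmlns declaration makes the parser reject the document. A minimal sketch of that failure, using a one-element excerpt made up for illustration:

```python
import xml.etree.ElementTree as ET

# One element with an undeclared prefix, as in studentinfo.xml
snippet = "<st:name>Sam Davies</st:name>"

try:
    ET.fromstring(snippet)
    parsed = True
except ET.ParseError as err:
    # The parser refuses the prefix because no xmlns:st declaration exists
    parsed = False
    print(f"ParseError: {err}")
```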
Here is the code for these steps:
import xml.etree.ElementTree as ET
import re
import pandas as pd
# import the file
with open('studentinfo.xml', 'r') as f:
    file = f.read()
# remove namespace prefixes
file = re.sub(r'stu?:', '', file)
# Extract the XML into an ElementTree
root = ET.ElementTree(ET.fromstring(file)).getroot()
# translation between XML tags and column names
column_names = {'name': 'StudentName',
                'age': 'Age',
                'hair': 'Hair',
                'eyes': 'Eyes',
                'grade': 'Grade',
                'teacher': 'Teacher',
                'dorm': 'Dorm'}
# Extract the relevant information from the ElementTree
results = []
temp_dict = {}
for node in root.iter():
    if node.tag in column_names.keys():
        temp_dict[column_names[node.tag]] = node.text
    elif len(temp_dict) > 0:
        results.append(temp_dict.copy())
        temp_dict.clear()
results.append(temp_dict.copy())
# Create a dataframe from the extracted information
df = pd.DataFrame(results)
print(df)
Output:
StudentName Age Hair Eyes Grade Teacher Dorm
0 Sam Davies 15 Black Blue 10 Draco Malfoy Innovation Hall
1 Cassie Stone 14 Science NaN 9 Luna Lovegood NaN
2 Derek Brandon 17 NaN green NaN Ron Weasley Hogtie Manor
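The pattern stu?: is tied to these two prefixes and would also match the same characters inside text content. A slightly more defensive variation, offered as a sketch rather than as the answer's method, strips any prefix only where it directly follows < or </, then reads each StudentScreening record with findtext(), which returns None for a missing tag. The XML below is an abbreviated excerpt of studentinfo.xml, and the regex deliberately ignores prefixed attributes, which this file does not have.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated excerpt of studentinfo.xml (two records, some tags omitted)
xml_text = """<stu:StudentBreakdown>
<stu:Studentdata>
<stu:StudentScreening>
<st:name>Sam Davies</st:name>
<st:age>15</st:age>
<st:dorm>Innovation Hall</st:dorm>
</stu:StudentScreening>
<stu:StudentScreening>
<st:name>Cassie Stone</st:name>
<st:age>14</st:age>
</stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>"""

# Drop "prefix:" only at the start of an opening or closing tag name
clean = re.sub(r"(</?)[A-Za-z_][\w.-]*:", r"\1", xml_text)

root = ET.fromstring(clean)
fields = ["name", "age", "hair", "eyes", "grade", "teacher", "dorm"]
# One dictionary per record; findtext() yields None when a tag is absent
rows = [
    {field: record.findtext(field) for field in fields}
    for record in root.iter("StudentScreening")
]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows) if a table is wanted, or used directly when pandas is not available.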
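Although you stipulated that you do not want to use pandas, I actually think it would be the simplest and cleanest approach.
Here is my answer using pandas: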
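import re
from io import StringIO
import pandas as pd
# import the file
with open('studentinfo.xml', 'r') as f:
    file = f.read()
# remove namespace prefixes
file = re.sub(r'stu?:', '', file)
# read the XML, selecting the node of interest; the string is wrapped in
# StringIO because recent pandas versions deprecate passing literal XML text
df = pd.read_xml(StringIO(file), xpath="//StudentScreening")
print(df)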
Output:
StudentName Age Hair Eyes Grade Teacher Dorm
0 Sam Davies 15 Black Blue 10 Draco Malfoy Innovation Hall
1 Cassie Stone 14 Science NaN 9 Luna Lovegood NaN
2 Derek Brandon 17 NaN green NaN Ron Weasley Hogtie Manor
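If rewriting the tag names feels too invasive, one alternative that neither answer uses is to leave the prefixes alone and declare them with placeholder URIs on the root element, then match tags with the {*} wildcard that ElementTree's find methods accept (Python 3.8 and later). This is a sketch under assumptions: the urn:placeholder:... URIs are invented purely to satisfy the parser, and the XML is a one-record excerpt of studentinfo.xml.

```python
import re
import xml.etree.ElementTree as ET

# One-record excerpt of studentinfo.xml with its undeclared prefixes intact
xml_text = (
    "<stu:StudentBreakdown><stu:Studentdata>"
    "<stu:StudentScreening>"
    "<st:name>Derek Brandon</st:name><st:age>17</st:age>"
    "<st:eyes>green</st:eyes>"
    "</stu:StudentScreening>"
    "</stu:Studentdata></stu:StudentBreakdown>"
)

# Collect every prefix that appears in an opening tag
prefixes = sorted(set(re.findall(r"<([A-Za-z_][\w.-]*):", xml_text)))
# Placeholder declarations; these URIs are made up for illustration
declarations = "".join(f' xmlns:{p}="urn:placeholder:{p}"' for p in prefixes)
# Append the declarations to the first opening tag, i.e. the root element
patched = re.sub(
    r"<[A-Za-z_][\w.:-]*",
    lambda m: m.group(0) + declarations,
    xml_text,
    count=1,
)

root = ET.fromstring(patched)
# "{*}tag" matches the tag name in any namespace
rows = [
    {field: record.findtext("{*}" + field) for field in ("name", "age", "eyes", "dorm")}
    for record in root.findall(".//{*}StudentScreening")
]
print(rows)
```

The trade-off is that the parsed tags now carry the placeholder namespace, so every lookup needs the {*} form, whereas stripping the prefixes keeps the lookups plain.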
