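I have multiple XML files from PubMed. Here are a few of them.
How can I parse them and get these columns into a single data frame? If an article has multiple authors, I want each author on a separate row.
Expected output (should include all authors):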
Title Year ArticleTitle LastName ForeName
Nature 2021 Inter-mosaic ... Roy Suva
Nature 2021 Inter-mosaic ... Pearson John
Nature 2021 Neural dynamics Pearson John
Nature 2021 Neural dynamics Mooney Richard
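Answer from a uj5u.com user:
First of all, what you want is doable. Something like this should work for your second file, and you can handle the other files by wrapping the code in a for loop: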
from lxml import etree
import pandas as pd

doc = etree.parse('file.xml')
columns = ['Title', 'ArticleDate', 'ArticleTitle', 'LastName', 'ForeName']

# Article-level fields: take the first match for each tag.
title = doc.xpath(f'//{columns[0]}/text()')[0]
year = doc.xpath(f'//{columns[1]}//Year/text()')[0]
article_title = doc.xpath(f'//{columns[2]}/text()')[0]

# One row per author, repeating the article-level fields.
rows = []
for auth in doc.xpath('//Author'):
    last_name = auth.xpath(f'{columns[3]}/text()')[0]
    fore_name = auth.xpath(f'{columns[4]}/text()')[0]
    rows.append([title, year, article_title, last_name, fore_name])

df = pd.DataFrame(rows, columns=columns)
print(df)
Output (for 34671166.xml):
Title ArticleDate ArticleTitle LastName ForeName
0 Nature 2021 Neural dynamics underlying birdsong practice a... Singh Alvarado Jonnathan
1 Nature 2021 Neural dynamics underlying birdsong practice a... Goffinet Jack
2 Nature 2021 Neural dynamics underlying birdsong practice a... Michael Valerie
3 Nature 2021 Neural dynamics underlying birdsong practice a... Liberti William
4 Nature 2021 Neural dynamics underlying birdsong practice a... Hatfield Jordan
5 Nature 2021 Neural dynamics underlying birdsong practice a... Gardner Timothy
6 Nature 2021 Neural dynamics underlying birdsong practice a... Pearson John
7 Nature 2021 Neural dynamics underlying birdsong practice a... Mooney Richard
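The for loop mentioned in the answer could look roughly like the sketch below. It is only one possible arrangement, not the original author's code: the helper names `parse_pubmed_file` and `load_pubmed_folder` are made up for illustration, and it assumes each file follows the usual PubMed layout with `PubmedArticle` elements containing `Journal/Title`, `ArticleDate/Year`, `ArticleTitle` and `AuthorList/Author`. It also swaps the `xpath(...)[0]` lookups for `findtext`, which returns None rather than raising when a tag is missing.

```python
from pathlib import Path

import pandas as pd
from lxml import etree

COLUMNS = ["Title", "Year", "ArticleTitle", "LastName", "ForeName"]


def parse_pubmed_file(path):
    """Return one [Title, Year, ArticleTitle, LastName, ForeName] row per author."""
    doc = etree.parse(str(path))
    rows = []
    # A file may hold several articles, so scope every lookup to one article.
    for article in doc.xpath("//PubmedArticle"):
        title = article.findtext(".//Journal/Title")
        year = article.findtext(".//ArticleDate/Year")
        article_title = article.findtext(".//ArticleTitle")
        for auth in article.xpath(".//AuthorList/Author"):
            # findtext returns None instead of raising when a tag is missing,
            # e.g. for group authors that only have a CollectiveName.
            rows.append([title, year, article_title,
                         auth.findtext("LastName"), auth.findtext("ForeName")])
    return rows


def load_pubmed_folder(folder):
    """Parse every *.xml file in `folder` into a single data frame."""
    rows = []
    for path in sorted(Path(folder).glob("*.xml")):
        rows.extend(parse_pubmed_file(path))
    return pd.DataFrame(rows, columns=COLUMNS)
```

Calling `load_pubmed_folder("pubmed_xml")`, where `pubmed_xml` is a placeholder folder name, would then give one data frame covering all files. Note that `findtext` only returns the text before any inline markup, so an article title containing tags such as `<i>` would come back truncated.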
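That said, I'm not sure a data frame with each author on a separate row suits the kind of data you have. In this example there are 8 co-authors, so information such as the article title is repeated unnecessarily 8 times. You could instead give each author their own set of columns, but then you run into trouble when one article has 3 co-authors and another has 10...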
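If the repetition is a concern, one option (not part of the original answer, just a sketch) is to keep one row per article and collapse the authors into a single column with a pandas `groupby`. The example below uses two rows copied from the output above in place of the full frame; the `Author` and `Authors` column names and the `"; "` separator are arbitrary choices.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the full per-author frame.
df = pd.DataFrame(
    [
        ["Nature", "2021", "Neural dynamics underlying birdsong practice a...", "Pearson", "John"],
        ["Nature", "2021", "Neural dynamics underlying birdsong practice a...", "Mooney", "Richard"],
    ],
    columns=["Title", "ArticleDate", "ArticleTitle", "LastName", "ForeName"],
)

# Build one "ForeName LastName" string per author, then collect them per article.
df["Author"] = df["ForeName"] + " " + df["LastName"]
per_article = (
    df.groupby(["Title", "ArticleDate", "ArticleTitle"], sort=False)["Author"]
    .agg("; ".join)
    .reset_index(name="Authors")
)
print(per_article)
```

This keeps the article-level fields unrepeated and copes with any number of co-authors, at the cost of having to split the `Authors` string again if you later need per-author rows.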
