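I want to convert the data in this PDF of Oregon State University salaries into a Pandas dataframe. As you can see, the file is structured so that what should become the variable names repeats throughout the file, each followed by a colon and the column value for that observation. I have already extracted every line of the file with PDFBox and cleaned the contents so that they end up in a list of tuples, as in the example below. What I basically want to do is take this list of tuples, where the first element of each tuple is the (repeating) variable name and the second element is the variable's value, and turn it into a dataframe in which every value is aligned under its corresponding column.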
data = [('Name', ' Abbas, Houssam'),
('First Hired', ' 31-DEC-2018'),
('Home Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
('Adj Service Date', ' 31-DEC-2018'),
('Job Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
('Job Type', ' P'),
('Job Title', ' Assistant Professor'),
('Posn-Suff', ' C18336-00'),
('Rank', ' Assistant Professor'),
('Rank Effective Date', ' 31-DEC-2018'),
('Appt Begin Date', ' 31-DEC-2018'),
('Appt Percent', ' 100'),
('Appt End Date', ' N/A'),
('Annual Salary Rate', ' 92961.00 9 mo'),
('Name', ' Abbasi, Bahman'),
('First Hired', ' 01-AUG-2017'),
('Home Orgn', ' LCB - Acad Prog / Student Aff'),
('Adj Service Date', ' 01-AUG-2017'),
('Job Orgn', ' EMM - Sch of Mech/Ind/Mfg Engr'),
('Job Type', ' O'),
('Job Title', ' Assistant Professor'),
('Posn-Suff', ' C18194-00'),
('Rank', ' Assistant Professor'),
('Rank Effective Date', ' 16-SEP-2017'),
('Appt Begin Date', ' 16-SEP-2017'),
('Appt Percent', ' 100'),
('Appt End Date', ' N/A'),
('Annual Salary Rate', ' 97659.00 9 mo'),
('Job Orgn', ' LCB - Acad Prog / Student Aff'),
('Job Type', ' P'),
('Job Title', ' Assistant Professor'),
('Posn-Suff', ' C11566-00'),
('Rank', ' Assistant Professor'),
('Rank Effective Date', ' 16-SEP-2017'),
('Appt Begin Date', ' 16-SEP-2020'),
('Appt Percent', ' 100'),
('Appt End Date', ' 15-JUN-2021'),
('Annual Salary Rate', ' 98811.00 9 mo')]
The desired final dataframe (an abbreviated version) would be:
| Name | First Hired | Home Orgn | Adj Service Date |
| Abbas, Houssam | 31-DEC-2018 | ESE - Sch Elect Engr/Comp Sci | 31-DEC-2018 |
| Abbasi, Bahman | 01-AUG-2017 | LCB - Acad Prog / Student Aff | 01-AUG-2017 |
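I have tried variations of the solutions described here, here, and here, but without any luck. Any suggestions (possibly including a different way of handling the original raw data) would be greatly appreciated!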
Answer:
You only need to convert the list of tuples into a list of dictionaries. Pandas will do the rest:
import pandas as pd
data = [('Name', ' Abbas, Houssam'),
('First Hired', ' 31-DEC-2018'),
('Home Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
('Adj Service Date', ' 31-DEC-2018'),
('Job Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
('Job Type', ' P'),
('Job Title', ' Assistant Professor'),
('Posn-Suff', ' C18336-00'),
('Rank', ' Assistant Professor'),
('Rank Effective Date', ' 31-DEC-2018'),
('Appt Begin Date', ' 31-DEC-2018'),
('Appt Percent', ' 100'),
('Appt End Date', ' N/A'),
('Annual Salary Rate', ' 92961.00 9 mo'),
('Name', ' Abbasi, Bahman'),
('First Hired', ' 01-AUG-2017'),
('Home Orgn', ' LCB - Acad Prog / Student Aff'),
('Adj Service Date', ' 01-AUG-2017'),
('Job Orgn', ' EMM - Sch of Mech/Ind/Mfg Engr'),
('Job Type', ' O'),
('Job Title', ' Assistant Professor'),
('Posn-Suff', ' C18194-00'),
('Rank', ' Assistant Professor'),
('Rank Effective Date', ' 16-SEP-2017'),
('Appt Begin Date', ' 16-SEP-2017'),
('Appt Percent', ' 100'),
('Appt End Date', ' N/A'),
('Annual Salary Rate', ' 97659.00 9 mo'),
('Job Orgn', ' LCB - Acad Prog / Student Aff'),
('Job Type', ' P'),
('Job Title', ' Assistant Professor'),
('Posn-Suff', ' C11566-00'),
('Rank', ' Assistant Professor'),
('Rank Effective Date', ' 16-SEP-2017'),
('Appt Begin Date', ' 16-SEP-2020'),
('Appt Percent', ' 100'),
('Appt End Date', ' 15-JUN-2021'),
('Annual Salary Rate', ' 98811.00 9 mo')]
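# Build one dict per person: every 'Name' key starts a new row.
rows = []
for k, v in data:
    if k == 'Name':
        rows.append({})
    # Later keys are added to the most recent row; a repeated key overwrites.
    rows[-1][k] = v
#print(rows)
df = pd.DataFrame(rows)
print(df)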
Output:
Name First Hired ... Appt End Date Annual Salary Rate
0 Abbas, Houssam 31-DEC-2018 ... N/A 92961.00 9 mo
1 Abbasi, Bahman 01-AUG-2017 ... 15-JUN-2021 98811.00 9 mo
[2 rows x 14 columns]
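Note that in the output above, the row for Abbasi, Bahman shows the values of his second job (`15-JUN-2021`, `98811.00 9 mo`): the job fields repeat within his record, so the later values overwrite the earlier ones. If one row per job is wanted instead, a minimal sketch could look like the following. It rests on assumptions that are not confirmed by the source: that the four fields in `PERSON_KEYS` appear once per person, and that a repeated job-level key signals the start of a new job. The names `PERSON_KEYS`, `tuples_to_job_rows`, and `sample` are hypothetical, and `sample` is a trimmed subset of the data from the question.

```python
import pandas as pd

# Assumption: these fields describe the person, not an individual job.
PERSON_KEYS = {"Name", "First Hired", "Home Orgn", "Adj Service Date"}


def tuples_to_job_rows(pairs):
    """Return one dict per job, copying the person-level fields onto each row."""
    rows = []
    person = {}  # person-level fields of the current record
    job = {}     # job-level fields collected for the current job
    for key, value in pairs:
        value = value.strip()  # drop the leading space left by the extraction
        if key == "Name":
            # A new person starts: flush whatever was collected so far.
            if person or job:
                rows.append({**person, **job})
            person, job = {}, {}
        if key in PERSON_KEYS:
            person[key] = value
        else:
            if key in job:
                # Assumption: a repeated job key means a second job begins.
                rows.append({**person, **job})
                job = {}
            job[key] = value
    if person or job:
        rows.append({**person, **job})
    return rows


# Trimmed subset of the question's data, for illustration only.
sample = [
    ("Name", " Abbas, Houssam"),
    ("First Hired", " 31-DEC-2018"),
    ("Job Orgn", " ESE - Sch Elect Engr/Comp Sci"),
    ("Posn-Suff", " C18336-00"),
    ("Name", " Abbasi, Bahman"),
    ("First Hired", " 01-AUG-2017"),
    ("Job Orgn", " EMM - Sch of Mech/Ind/Mfg Engr"),
    ("Posn-Suff", " C18194-00"),
    ("Job Orgn", " LCB - Acad Prog / Student Aff"),
    ("Posn-Suff", " C11566-00"),
]

job_rows = tuples_to_job_rows(sample)
df_jobs = pd.DataFrame(job_rows)
print(df_jobs)
```

With this sample, Abbasi, Bahman ends up on two rows, one per position number, and each row repeats his name and hire date. Whether that layout or the one-row-per-person layout is preferable depends on what the salary analysis needs.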
