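I have to read several large Excel files and try to clean up the data.
I'm down to the last problem: some cells contain multiple rows, or rather, I guess some cells span multiple rows.
It looks like this: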
Index Col1 Col2 Col3
1 row1 row1 row1
2 row1.1
3 row1.2
4 row2 row2 row3
When I use pandas.read_excel(filename), or pandas.ExcelFile followed by sheet.parse(sheetname), it of course reads indices 2 and 3 as mostly empty rows.
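To make the shape of the problem concrete, here is a minimal sketch that rebuilds the sample above as a DataFrame without needing the workbook. It assumes the extra values (row1.1, row1.2) sit in Col2, which the flattened table above does not actually confirm, and the frame is only a hypothetical stand-in for what read_excel returns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed sheet: rows 2 and 3 carry a value
# only in Col2 (an assumption), so the other cells come back as NaN.
df = pd.DataFrame(
    {
        "Col1": ["row1", np.nan, np.nan, "row2"],
        "Col2": ["row1", "row1.1", "row1.2", "row2"],
        "Col3": ["row1", np.nan, np.nan, "row3"],
    },
    index=[1, 2, 3, 4],
)

# Rows whose Col1 is missing are the continuation rows to be merged upward.
continuation = df["Col1"].isna()
print(df[continuation])
```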
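How would I merge indices 2 and 3 into 1 based on the rows that Col1 spans?
To be clear, my question is: how can I read in an Excel file and merge rows based on the rows the first column spans? Is that even possible?
Thanks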
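Reply from a uj5u.com user:
I'm not aware of anything built into pandas that does this, because frankly Excel isn't meant to be used this way, though people tend to abuse it anyway. Man, I hate Excel... but that's a topic for another thread.
I think your best bet is to define a custom function based on logic you know applies to these files. I'm currently dealing with a variety of badly formatted Excel files myself, so I'm very familiar with this kind of garbage.
Here is my suggestion, based on my understanding of the data and what you're asking for. It may need to be changed depending on the specifics of your files.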
# df is the frame returned by pandas.read_excel; object dtype lets a cell hold a list
df = df.astype(object)
lvi = None  # index label of the last valid row seen so far
check_cols = ['Col1']  # columns that must be filled for a row to count as valid; adjust to your file
for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If all the checked cells in the row are valid, remember the row as a
    # reference in case the following rows are not.
    if s[check_cols].notna().all():
        lvi = i
        continue
    if lvi is None:
        continue  # stray row before any valid row: nothing to merge it into
    # here is the critical part
    extra_vals = s[s.notna()]  # find cells in row that have actual values
    for col in extra_vals.index:
        # I'm collecting the values in a list here since I don't know
        # your values or how they need to be handled exactly.
        prev = df.at[lvi, col]
        merged = prev if isinstance(prev, list) else [prev]
        merged.append(extra_vals[col])  # list.append returns None, so don't assign its result
        # write the merged list back into the last valid row
        df.at[lvi, col] = merged
# drop extra rows:
df = df.dropna(axis=0, subset=check_cols)
Hope this works for you!
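If the row-by-row loop turns out to be too slow on large files, a vectorized alternative is to number the groups with a cumulative sum and aggregate each group. This is a sketch of a different technique, not the approach above: the function name merge_continuation_rows, the comma separator, and the sample frame are all assumptions made for illustration, and every merged cell comes back as a string.

```python
import numpy as np
import pandas as pd


def merge_continuation_rows(df, key_col="Col1", sep=","):
    """Fold rows whose key_col is empty into the nearest valid row above."""
    # Every non-null key starts a new group; the running count of non-null
    # keys therefore labels each row with the group it belongs to.
    group_id = df[key_col].notna().cumsum().rename("group")

    # Rows that appear before the first valid key have nowhere to go; skip them.
    has_parent = group_id > 0

    def join_cells(values):
        # Join the non-empty cells of one column within one group.
        return sep.join(values.dropna().astype(str))

    merged = df[has_parent].groupby(group_id[has_parent]).agg(join_cells)
    return merged.reset_index(drop=True)


# Hypothetical sample mirroring the question, with the extras assumed in Col2.
df = pd.DataFrame(
    {
        "Col1": ["row1", np.nan, np.nan, "row2"],
        "Col2": ["row1", "row1.1", "row1.2", "row2"],
        "Col3": ["row1", np.nan, np.nan, "row3"],
    }
)
merged = merge_continuation_rows(df)
print(merged)
```

Joining into strings sidesteps the question of storing lists inside cells, at the cost of losing the original dtypes; swap join_cells for another aggregation if you need something else.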
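Reply from a uj5u.com user:
@LiamFiddler's answer is correct, but it needed a few tweaks to work in my case, since I'm combining numbers in the same cell and outputting to CSV as strings. I'm posting mine in case it helps someone here.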
last_valid = None
check_cols = ['Col1']  # if you only need to check a subset of cols for validity, do it here
df = df.astype(str)  # convert all columns to strings, as I have to combine numbers in the same cell
df = df.replace('nan', '')  # turn the 'nan' strings created above back into blank strings
for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If all the checked cells in the row are valid, keep the row as a
    # reference in case the following rows are not.
    if all(s[check_cols] != ''):
        lvi, last_valid = i, s
        # need to store index and series so we can go back and replace it
        continue
    else:  # here is the critical part
        extra_vals = s[s != '']  # find cells in row that have actual values
        for col in extra_vals.index:
            # separate by whatever you wish; a list was causing issues
            last_valid[col] = last_valid[col] + "," + extra_vals[col]
        # replace that row in the dataframe (lvi is an index label, so use .loc)
        df.loc[lvi, :] = last_valid
# drop extra rows:
df = df[df['Col1'] != ''].reset_index(drop=True)
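One edge case worth watching in the concatenation step: if the cell in the valid row is blank, joining with a separator leaves a leading comma in the output. A small helper along these lines avoids that; append_value is a hypothetical name, not part of pandas or of the answer above.

```python
def append_value(current, extra, sep=","):
    """Join two cell strings, skipping the separator when current is blank."""
    if current == "":
        return extra
    return current + sep + extra


# A blank cell in the valid row no longer produces ",5".
print(append_value("", "5"))
print(append_value("1", "2"))
```

In the loop, the concatenation line would then read last_valid[col] = append_value(last_valid[col], extra_vals[col]).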
