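I am trying to load a very large CSV file (about 25 million rows) into pandas. I read the file in chunks of 100,000 rows and, for each chunk, attach a newly created dataframe that counts how many times certain words occur in one column of the chunk. When I save the first chunk, everything works: the chunk and the new dataframe are concatenated side by side. For some reason, however, the second chunk is concatenated diagonally. By that I mean the result now has 200,000 rows: the chunk's columns are empty in the first 100,000 rows, and the new dataframe sits alongside those first 100,000 rows. How can I fix this so that each chunk is concatenated side by side with its new dataframe and saved to a separate CSV file?
My code: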
import pandas as pd
from pandas.core.frame import DataFrame
chunk = 1
for df in pd.read_csv('all_comments_data.csv', chunksize=100000):
    dict_to_append = {}
    with open('conflict_words.txt') as f:
        for word in f.readlines():
            dict_to_append[word.strip()] = []
    index = 0
    for comment in df['comment'].to_numpy():
        word_list = str(comment).split(" ")
        for conflict_word in dict_to_append.keys():
            dict_to_append[conflict_word].append(word_list.count(conflict_word))
        print(index)
        index += 1
    df_to_append = pd.DataFrame(dict_to_append)
    final_df = pd.concat([pd.DataFrame(df), df_to_append], axis=1)
    final_df.to_csv(f"all_comments_data_with_conflict_scores_{chunk}.csv")
    chunk += 1
What I need dataframes to look like:
---------------------------
| | |
| chunk | new dframe |
| | |
---------------------------
What the dataframes look like after the first chunk:
---------------------------
| | |
| | new dframe |
| | |
---------------------------
| | |
| chunk | |
| | |
---------------------------
Answer from a uj5u.com user:
When pd.concat runs along columns (axis=1), pandas tries to match rows by their index:
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 3, 5])
df2 = pd.DataFrame({"A": [4, 5, 6], "B": [7, 8, 9]}, index=[2, 4, 6])
df3 = pd.concat([df1, df2], axis=1)
print(df3)
     A    B    A    B
1  1.0  4.0  NaN  NaN
2  NaN  NaN  4.0  7.0
3  2.0  5.0  NaN  NaN
4  NaN  NaN  5.0  8.0
5  3.0  6.0  NaN  NaN
6  NaN  NaN  6.0  9.0
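The same mismatch appears in the question because chunks returned by pd.read_csv with chunksize keep a running row index, while a dataframe built from a dict starts again at 0. A minimal sketch, assuming a small in-memory CSV (the csv_text variable and the chunk size of 3 are made up for illustration and stand in for all_comments_data.csv and 100,000):

```python
import io

import pandas as pd

# Hypothetical stand-in for all_comments_data.csv: six rows, one column.
csv_text = "comment\n" + "\n".join(f"comment {i}" for i in range(6))

chunk_indexes = []
for df in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Each chunk continues the row numbering of the whole file.
    chunk_indexes.append(list(df.index))

print(chunk_indexes)  # [[0, 1, 2], [3, 4, 5]]
```

With 100,000-row chunks, the second chunk is therefore indexed 100000 to 199999, while df_to_append is indexed 0 to 99999, so no rows match and the two frames end up diagonal to each other.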
If you want chunk and new dframe to sit side by side after the concat, you need to make sure they both have the same row index.
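One way to do that, sketched under assumptions rather than offered as the only fix, is to build the new dataframe with the chunk's own index. The sample comments, the conflict_words list (standing in for conflict_words.txt), and the chunk size of 3 below are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for all_comments_data.csv.
csv_text = "comment\n" + "\n".join(
    ["a fight", "no issue", "fight fight", "war", "calm", "war fight"]
)
# Hypothetical word list standing in for conflict_words.txt.
conflict_words = ["fight", "war"]

results = []
for df in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Count each conflict word in every comment of this chunk.
    counts = {
        word: [str(comment).split(" ").count(word) for comment in df["comment"]]
        for word in conflict_words
    }
    # Reuse the chunk's index so concat lines the rows up one to one.
    df_to_append = pd.DataFrame(counts, index=df.index)
    final_df = pd.concat([df, df_to_append], axis=1)
    results.append(final_df)
```

An alternative is to call df.reset_index(drop=True) on the chunk before concatenating, which renumbers its rows from 0 so they match the new dataframe. In the original loop, each final_df would then be written with to_csv using the chunk counter in the file name, as the question already does.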