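I've been looking at this, this and this thread, but I still can't work out how to make the following more efficient:
I have a DataFrame with course names and the university that offers each one:
df_courses:
| | course name | university |
|---|---|---|
| 0 | name of course one | university one |
| 1 | name of course two | university one |
| 2 | "name of course three, with comma" | university two |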
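I also have another DataFrame containing student enrollments:
df_enrollments:
| | enrollments | student email |
|---|---|---|
| 0 | name of course one, name of course two | [email protected] |
| 1 | name of course two, name of course three | [email protected] |
| 2 | name of course three, with comma, name of course one, name of course two | [email protected] |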
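What I want is a new DataFrame with one row per student enrollment:
df_all_enrollments:
| | course name | student email |
|---|---|---|
| 0 | name of course one | [email protected] |
| 1 | name of course two | [email protected] |
| 2 | name of course two | [email protected] |
| 3 | "name of course three, with comma" | [email protected] |
| 4 | "name of course three, with comma" | [email protected] |
| 5 | name of course one | [email protected] |
| 6 | name of course two | [email protected] |
The main problem is the course names that contain commas.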
What I'm doing now to get this result is to make a list from df_courses['course name'], then loop over that list, filter df_enrollments['enrollments'] with str.contains for each course, and add a new column with the course name:

    import pandas as pd

    courses = df_courses['course name'].to_list()
    df_all_enrollments = pd.DataFrame()
    for i in courses:
        # Rows whose enrollments string contains this course name (literal, case-sensitive match)
        df_all_enroll = df_enrollments.loc[df_enrollments['enrollments'].str.contains(i, na=False, regex=False, case=True)]
        df_all_enroll.insert(1, 'Course Name', i)
        df_all_enrollments = pd.concat([df_all_enrollments, df_all_enroll])

Until now this approach has worked, but I'm wondering if there's a more efficient way to perform this task.
Any help will be greatly appreciated.
A uj5u.com user replied:
Is this what you expect:

    # Split each enrollments string on ', ' and give every piece its own row
    courses = df_enrollments['enrollments'].str.split(', ')
    df_all_enrollments = df_enrollments.assign(**{'course name': courses}) \
                                       .explode('course name', ignore_index=True)
    print(df_all_enrollments)

Output:

    >>> df_all_enrollments
                                             enrollments      student email           course name
    0             name of course one, name of course two  [email protected]    name of course one
    1             name of course one, name of course two  [email protected]    name of course two
    2           name of course two, name of course three  [email protected]    name of course two
    3           name of course two, name of course three  [email protected]  name of course three
    4  name of course three, name of course one, name...  [email protected]  name of course three
    5  name of course three, name of course one, name...  [email protected]    name of course one
    6  name of course three, name of course one, name...  [email protected]    name of course two

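One thing to check before relying on a plain split: it treats every `', '` as a separator, so a course name that itself contains a comma gets cut into pieces. Here is a small sketch of that, using made-up sample rows modelled on the question (the e-mail addresses are placeholders, not from the original data):

```python
import pandas as pd

# Hypothetical sample rows modelled on the question; e-mails are placeholders.
df_enrollments = pd.DataFrame({
    "enrollments": [
        "name of course one, name of course two",
        "name of course three, with comma, name of course one",
    ],
    "student email": ["a@example.com", "b@example.com"],
})

# Same split-then-explode idea as in the answer above.
pieces = df_enrollments["enrollments"].str.split(", ")
exploded = df_enrollments.assign(**{"course name": pieces}).explode(
    "course name", ignore_index=True
)

# The second row is cut into three pieces, and two of them
# ("name of course three" and "with comma") are not real course names.
print(exploded["course name"].tolist())
```

So this approach looks fine when no course name contains `', '`, but it probably needs something else for data like the question's third course.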
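A uj5u.com user replied:
Well, I have a possible answer.
The original approach took 6.6 minutes to process 41541 rows and output a DataFrame and a CSV with 240603 rows.
The answer proposed by Corralien is absolutely correct, but my data has some names that contain commas (for example, "Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning").
The first problem (finding which course names match each row of df_enrollments) was solved with a list comprehension:

    df_enrollments['Course Name'] = [[y for y in courses if y in x] for x in df_enrollments['enrollments']]

The result is a column holding a list of courses for each row, which I then exploded to get the expected result:

    df_enrollments = df_enrollments.explode('Course Name')
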
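For anyone who wants to try the two steps end to end, here is a minimal sketch on sample data modelled on the question. The e-mail addresses are placeholders, and I'm assuming the enrollment strings contain the full course name, comma included:

```python
import pandas as pd

# Hypothetical sample data modelled on the question; e-mails are placeholders.
df_courses = pd.DataFrame({
    "course name": [
        "name of course one",
        "name of course two",
        "name of course three, with comma",
    ],
    "university": ["university one", "university one", "university two"],
})
df_enrollments = pd.DataFrame({
    "enrollments": [
        "name of course one, name of course two",
        "name of course two, name of course three, with comma",
        "name of course three, with comma, name of course one, name of course two",
    ],
    "student email": ["a@example.com", "b@example.com", "c@example.com"],
})

courses = df_courses["course name"].to_list()

# Step 1: for each enrollment string, keep every known course name found in it.
matched = [[c for c in courses if c in text] for text in df_enrollments["enrollments"]]

# Step 2: one row per (student, course) pair.
df_all_enrollments = (
    df_enrollments.assign(**{"course name": matched})
    .explode("course name", ignore_index=True)
    .loc[:, ["course name", "student email"]]
)
print(df_all_enrollments)
```

Two things to be aware of: within each student the courses come out in the order of `courses`, not the order they appear in the string, and because `in` is a plain substring test, a course whose name is contained in a longer course name would match both.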
Now it takes only 98.73 seconds to do everything :D
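A possible variation, not taken from the answers above: build one regular expression from the known course names and let `Series.str.findall` pull them out of each string. This is only a sketch on made-up data (placeholder e-mails, and it assumes the full course names appear verbatim in the enrollment strings). I haven't timed it against the list comprehension, so it's worth measuring on real data:

```python
import re
import pandas as pd

# Hypothetical sample data modelled on the question; e-mails are placeholders.
courses = [
    "name of course one",
    "name of course two",
    "name of course three, with comma",
]
df_enrollments = pd.DataFrame({
    "enrollments": [
        "name of course one, name of course two",
        "name of course two, name of course three, with comma",
        "name of course three, with comma, name of course one, name of course two",
    ],
    "student email": ["a@example.com", "b@example.com", "c@example.com"],
})

# Escape each name so commas and other characters are matched literally.
# Longest names go first so a longer title is tried before any shorter
# title it might contain.
pattern = "|".join(re.escape(c) for c in sorted(courses, key=len, reverse=True))

# findall returns, per row, the list of course names in order of appearance.
found = df_enrollments["enrollments"].str.findall(pattern)

df_all_enrollments = (
    df_enrollments.assign(**{"course name": found})
    .explode("course name", ignore_index=True)
    .loc[:, ["course name", "student email"]]
)
print(df_all_enrollments)
```

Unlike the substring test, this keeps the courses in the order they appear in each string, and each stretch of text is matched at most once.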
