How to perform an inner join on 2 files

I have 2 files: one contains questions, and the other contains (multiple) answers to each question.

For example:

Questions:

Q1,What is the name of your son?
Q2,How are you today?

Answers:

A1,Q1,George
A2,Q1,David
A1,Q2,Good
A2,Q2,Nice
A3,Q2,Amazing

The output I am trying to get is:

Q1,What is the name of your son? A1,George
Q1,What is the name of your son? A2,David
Q2,How are you today? A1,Good
Q2,How are you today? A2,Nice
Q2,How are you today? A3,Amazing

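I have been looking for a solution but could not find a simple way to do it.

I tried mapping with the QID as the key and either the question or the AID plus answer as the value (e.g. Q1 How are you today? || Q1 A1,Good), then in the reduce step finding the question (the only value that does not start with "Ax") and prepending it to each answer. The idea seems wrong, though (and it does not work, but that is another matter).

I hope someone can help me with this. Thanks!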

Reply from a uj5u.com community member:

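I would do a map-side join with Q_id as the key. (This applies when one of the two files fits in memory.) It is more efficient than running a full map-reduce job.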

(map-side join example)
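As a rough illustration of the map-side idea, here is a minimal sketch in plain Java, with no Hadoop dependency. It assumes the questions file is the one that fits in memory; the class name MapSideJoinSketch and its methods are hypothetical and only mimic what a mapper would do after loading the small file into a lookup table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Hadoop class: it imitates a mapper that has
// already loaded the small file (questions) into memory.
final class MapSideJoinSketch {

    // Build the in-memory lookup Q_id -> question text from lines like "Q1,What is ...".
    static Map<String, String> loadQuestions(List<String> questionLines) {
        Map<String, String> byId = new HashMap<>();
        for (String line : questionLines) {
            String[] parts = line.split(",", 2); // split on the first comma only
            byId.put(parts[0], parts[1]);
        }
        return byId;
    }

    // Stream the answer lines ("A1,Q1,George") and attach the matching question.
    static List<String> join(List<String> questionLines, List<String> answerLines) {
        Map<String, String> questions = loadQuestions(questionLines);
        List<String> out = new ArrayList<>();
        for (String line : answerLines) {
            String[] parts = line.split(",", 3); // A_id, Q_id, answer text
            String question = questions.get(parts[1]);
            if (question != null) { // inner join: drop answers without a question
                out.add(parts[1] + "," + question + " " + parts[0] + "," + parts[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> questions = Arrays.asList(
                "Q1,What is the name of your son?",
                "Q2,How are you today?");
        List<String> answers = Arrays.asList(
                "A1,Q1,George", "A2,Q1,David",
                "A1,Q2,Good", "A2,Q2,Nice", "A3,Q2,Amazing");
        join(questions, answers).forEach(System.out::println);
    }
}
```

Because the lookup happens per answer line, no shuffle or reduce step is needed; in a real job the small file would have to be shipped to every mapper first.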

If neither file fits in memory, use map > reduce, keyed on Q_id.
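The map > reduce variant can be sketched the same way in plain Java. This is only a simulation under stated assumptions: the "Q|" and "A|" tags, the class name ReduceSideJoinSketch, and the TreeMap standing in for the shuffle are all made up for illustration and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a reduce-side join; a real job would use
// Hadoop Mapper and Reducer classes instead of these static methods.
final class ReduceSideJoinSketch {

    // "Map" phase: emit (Q_id, tagged value). The tag records which file
    // the value came from, so the reducer can tell question from answer.
    static void map(String line, boolean isQuestion, Map<String, List<String>> shuffle) {
        String key;
        String value;
        if (isQuestion) {
            String[] p = line.split(",", 2); // Q_id, question text
            key = p[0];
            value = "Q|" + p[1];
        } else {
            String[] p = line.split(",", 3); // A_id, Q_id, answer text
            key = p[1];
            value = "A|" + p[0] + "," + p[2];
        }
        // Grouping values by key stands in for the framework's shuffle.
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // "Reduce" phase: for one Q_id, find the question and pair it with every answer.
    static List<String> reduce(String qId, List<String> values) {
        String question = null;
        List<String> answers = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("Q|")) {
                question = v.substring(2);
            } else {
                answers.add(v.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        if (question != null) { // inner join: both sides must be present
            for (String a : answers) {
                out.add(qId + "," + question + " " + a);
            }
        }
        return out;
    }

    // Run both phases over in-memory lines and collect the joined rows.
    static List<String> join(List<String> questionLines, List<String> answerLines) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // keys sorted, as after a shuffle
        for (String q : questionLines) map(q, true, shuffle);
        for (String a : answerLines) map(a, false, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

Tagging each value with its source is what makes the reducer's job unambiguous; relying on the answer ID prefix alone breaks as soon as a question text happens to start with the same characters.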

Here is an example of what it would look like in Spark:

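// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._  // provides .toDF on a Seq of tuples

// Questions, keyed by Q_id.
val questions = Seq(
        ("Q1","What is the name of your son?"),
        ("Q2","How are you today?"))
    .toDF("Q_id","text")
// Answers; each row carries the Q_id of the question it belongs to.
val answers = Seq(
        ("A1","Q1","George"),
        ("A2","Q1","David"),
        ("A1","Q2","Good"),
        ("A2","Q2","Nice"),
        ("A3","Q2","Amazing"))
    .toDF("A_id","Q_id","answer_text")
// Inner join on Q_id, keeping a single copy of the key column.
questions
    .join(answers, questions("Q_id") === answers("Q_id"))
    .select(
        questions("Q_id"),
        questions("text"),
        answers("A_id"),
        answers("answer_text"))
    .show()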

+----+--------------------+----+-----------+
|Q_id|                text|A_id|answer_text|
+----+--------------------+----+-----------+
|  Q1|What is the name ...|  A1|     George|
|  Q1|What is the name ...|  A2|      David|
|  Q2|  How are you today?|  A1|       Good|
|  Q2|  How are you today?|  A2|       Nice|
|  Q2|  How are you today?|  A3|    Amazing|
+----+--------------------+----+-----------+


Tags: Java, Hadoop, MapReduce
