按官方檔案寫了以下代碼:
//讀取資料并轉化成rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
val count = hBaseRDD.count()
println(count)
hBaseRDD.foreach{case (_,result) =>{
//獲取行鍵
val key = Bytes.toString(result.getRow)
//通過列族和列名獲取列
val name = Bytes.toString(result.getValue("A".getBytes,"姓名".getBytes))
val age = Bytes.toInt(result.getValue("A".getBytes,"年齡".getBytes))
println("Row key:"+key+"FileName :"+姓名+" 年齡:"+age)
}}
執行成功。但是未能列印出每條資料的具體資訊,日志如下:
18/05/14 16:26:50 INFO DAGScheduler: ResultStage 0 (count at test.scala:64) finished in 2.515 s
18/05/14 16:26:50 INFO DAGScheduler: Job 0 finished: count at test.scala:64, took 2.642359 s
38
18/05/14 16:26:50 INFO SparkContext: Starting job: foreach at test.scala:71
18/05/14 16:26:50 INFO DAGScheduler: Got job 1 (foreach at test.scala:71) with 1 output partitions
18/05/14 16:26:50 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at test.scala:71)
18/05/14 16:26:50 INFO DAGScheduler: Parents of final stage: List()
18/05/14 16:26:50 INFO DAGScheduler: Missing parents: List()
18/05/14 16:26:50 INFO DAGScheduler: Submitting ResultStage 1 (NewHadoopRDD[0] at newAPIHadoopRDD at test.scala:60), which has no missing parents
18/05/14 16:26:50 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.1 KB, free 897.2 MB)
18/05/14 16:26:50 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1334.0 B, free 897.2 MB)
18/05/14 16:26:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.251.6.153:56001 (size: 1334.0 B, free: 897.6 MB)
18/05/14 16:26:50 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/05/14 16:26:50 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (NewHadoopRDD[0] at newAPIHadoopRDD at test.scala:60) (first 15 tasks are for partitions Vector(0))
18/05/14 16:26:50 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/05/14 16:26:50 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 10.124.130.14, executor 2, partition 0, ANY, 4919 bytes)
18/05/14 16:26:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.124.130.14:54410 (size: 1334.0 B, free: 366.3 MB)
18/05/14 16:26:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.124.130.14:54410 (size: 29.8 KB, free: 366.3 MB)
18/05/14 16:26:52 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2453 ms on 10.124.130.14 (executor 2) (1/1)
18/05/14 16:26:52 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/05/14 16:26:52 INFO DAGScheduler: ResultStage 1 (foreach at test.scala:71) finished in 2.454 s
18/05/14 16:26:52 INFO DAGScheduler: Job 1 finished: foreach at test.scala:71, took 2.469872 s
紅色字體是HBASE表的總行數,但是下面的foreach沒有任何資料列印出來,查了好幾天了沒找到問題所在
請大神賜教
uj5u.com熱心網友回復:
{case (_,result) =>{是什么鬼,你result里什么值都沒吧,你怎么取值
uj5u.com熱心網友回復:
我也懷疑過這個問題,但是應該不是:
1、上面的count可以得到正確的行數
2、網上的例子都是這樣的https://blog.csdn.net/u013468917/article/details/52822074
還有可能是什么原因?
uj5u.com熱心網友回復:
沒人知道嗎,自己頂一下uj5u.com熱心網友回復:
result里面沒有值吧uj5u.com熱心網友回復:
資料進了hbase嗎uj5u.com熱心網友回復:
RDD的操作是惰性操作的吧,foreach能夠觸發操作嗎?uj5u.com熱心網友回復:
RDD里面的map和foreach里面的操作是分布式的,在里面用print本來就是打不出log的。你要打出來得先collect回來再foreach就可以了。uj5u.com熱心網友回復:
你foreach算子是在executor上執行的。你可以在SparkHistoryServer頁面,找到對應的Application,然后看Executor頁面,查看stdout/stderr,就看到列印資訊了uj5u.com熱心網友回復:
hBaseRDD.foreach{case (_,result) 改為hBaseRDD.collect().foreach{case (_,result) 應該就可以了
uj5u.com熱心網友回復:
hBaseRDD.collect().foreach{case (_,result轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/49083.html
標籤:Spark
上一篇:華為云計算交流平臺
