I have the following RDD:
x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
"group: NT=app_1,hadoop-exec,sparkConnection,Ready",
"group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
"group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
"group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
I only want the first part of each element of this RDD (everything before the first comma), as in the following example:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
To do this, I am trying the following:
//Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
//Try to use regex on it, this regex is to get until the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))
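This is what I am trying, but it does not work for me: I just get an array of Booleans. What am I doing wrong?
I am new to Apache Spark with Scala. If you need anything else, please let me know. Thanks in advance!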
Answer from a uj5u.com user:
Try this.
val x: Array[String] =
  Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
// sc is the SparkContext (predefined in spark-shell)
val rdd = sc.parallelize(x)
// Keep only the text before the first comma of each line
val result = rdd.map(line => line.split(",")(0))
result.collect().foreach(println)
Output:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
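The Boolean array in the question comes from `String.matches`, which only reports whether the whole string matches the pattern and never returns the matched text. Below is a minimal sketch of the difference in plain Scala, with no Spark needed. The object name `FirstFieldSketch` and the helper `firstField` are made up for illustration. It uses `split(",", 2)`, a small variation on the answer above, on the assumption that lines consisting only of commas could occur: with the limit argument, the result always has a first element.

```scala
// Plain Scala collections stand in for the RDD here; the function passed
// to Array.map could be passed to RDD.map in the same way.
object FirstFieldSketch {
  val sample: Array[String] = Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR"
  )

  // String.matches answers "does the WHOLE string match?", so the result
  // is one Boolean per line. The pattern ends at a comma but the lines do
  // not, so every flag is false.
  val flags: Array[Boolean] = sample.map(_.matches("""(^(.+?),)"""))

  // Hypothetical helper: keep everything before the first comma.
  // The limit 2 stops splitting after the first comma and keeps empty
  // fields, so index 0 exists even for a line such as ",".
  def firstField(line: String): String = line.split(",", 2)(0)

  val firsts: Array[String] = sample.map(firstField)
}
```

With an RDD, the same helper would be used as `rdd.map(FirstFieldSketch.firstField)`.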
Answer from a uj5u.com user:
Try this regular expression:
^\s*([^,]+)(_\w+)?
To apply this regular expression to your example, you can try:
import org.apache.spark.sql.Row
val arr = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
  "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
  "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
val rd_var = spark.sparkContext.parallelize(arr.map(Row(_)))
// Triple quotes keep the backslashes literal; .unanchored lets the
// pattern match a prefix of the line instead of the whole line
val pattern = """^\s*([^,]+)(_\w+)?""".r.unanchored
rd_var.map {
  case Row(str: String) => str match {
    case pattern(gr1, _) => gr1
  }
}.foreach(println(_))
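One point worth knowing when using a Scala `Regex` in a `match`: as an extractor it must match the entire input, unless the regex is converted with `.unanchored`. A pattern that stops at the first comma therefore does not match a full line by default. Here is a small sketch of the three behaviours in plain Scala. The object and method names are illustrative only, and the pattern is a simplified version of the one above with a single group.

```scala
import scala.util.matching.Regex

object RegexExtractSketch {
  // Simplified pattern: optional leading whitespace, then everything
  // up to (not including) the first comma.
  val firstFieldPattern: Regex = """^\s*([^,]+)""".r

  // findFirstMatchIn searches inside the string, so a prefix match is enough.
  def viaFind(line: String): Option[String] =
    firstFieldPattern.findFirstMatchIn(line).map(_.group(1))

  // As an extractor, a Regex has to match the WHOLE string.
  def viaAnchored(line: String): Option[String] = line match {
    case firstFieldPattern(first) => Some(first)
    case _                        => None
  }

  // .unanchored relaxes the extractor to a "find" style match.
  private val relaxed = firstFieldPattern.unanchored
  def viaUnanchored(line: String): Option[String] = line match {
    case relaxed(first) => Some(first)
    case _              => None
  }
}
```

For the line `group: NT=app_1,hadoop-exec,sparkConnection,Ready`, `viaFind` and `viaUnanchored` return `Some("group: NT=app_1")`, while `viaAnchored` returns `None` because the text after the comma is left unmatched.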
Answer from a uj5u.com user:
Using an RDD:
import org.apache.spark.sql.{Row, SparkSession}
val spark = SparkSession.builder().master("local[1]").getOrCreate()
// Group 1: everything before the first comma; group 2: the rest of the line
val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r
val el = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
  "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
  "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
def main(args: Array[String]): Unit = {
  val rdd = spark.sparkContext.parallelize(el.map(Row(_)))
  rdd.map {
    case Row(str: String) => str match {
      case pattern(gr1, _) => gr1
    }
  }.foreach(println(_))
}
This prints:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
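Note that a `match` with a single regex case throws a `scala.MatchError` for any line the pattern does not cover, for example a line with no comma. If such lines might occur (an assumption, since the sample data has none), one option is `collect` with a partial function, which silently drops the lines that do not match. The sketch below uses a plain `Seq` and an illustrative object name; `RDD` also has a `collect` overload that takes a partial function.

```scala
object SafeExtractSketch {
  // Same pattern as in the answer above: first field, comma, rest of line.
  private val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r

  // The last two lines are made-up inputs that the pattern does not match.
  val lines: Seq[String] = Seq(
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "no-comma-here",
    ""
  )

  // collect applies the partial function only where it is defined,
  // so non-matching lines are skipped instead of raising MatchError.
  val firstFields: Seq[String] =
    lines.collect { case pattern(first, _) => first }
}
```

Whether dropping malformed lines is acceptable depends on the data; logging or counting them may be preferable in a real job.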
