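I have the DataFrame below, where colA and colB contain strings. I want to check whether colB contains any substring of the value in colA. The values may contain commas or spaces, but as long as any part of colB's string overlaps with colA's, it counts as a match. For example, row 1 below has an overlap ("bc") and row 2 does not.
I was thinking of splitting the values into arrays, but the delimiter is not constant. Can someone help shed light on how to do this? Many thanks for your help.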
+---+-------+-----------+
| id|colA   | colB      |
+---+-------+-----------+
|  1|abc d  | bc, z     |
|  2|abcde  | hj f      |
+---+-------+-----------+
A uj5u.com user replied:
You can split both columns with a regular expression and then create a UDF that checks for substrings.
Example:
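from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)

# Split on a whitespace character (optionally preceded by a comma) or on a hyphen.
split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))

def mapper(a, b):
    # Collect every token of B that occurs as a substring of some token of A.
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result

df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
df.printSchema()
df.show(truncate=False)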
Result:
root
 |-- A: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- B: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------+-----------+---+-------+
|A       |B          |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1  |[bc, d]|
|[abc, d]|[acb, abc] |2  |[abc]  |
|[abcde] |[hj, f, ab]|3  |[ab]   |
+--------+-----------+---+-------+
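
The split-then-check logic above can also be tried without a Spark session. The following is a minimal plain-Python sketch, not the answerer's code: the names DELIMITERS, tokens, find_overlaps and has_overlap are made up for illustration, and the delimiter set (commas, whitespace, hyphens) is an assumption based on the sample data. It drops empty tokens, because an empty string is a substring of every string and would otherwise always count as a match.

```python
import re

# Assumed delimiters: any run of commas, whitespace, or hyphens.
DELIMITERS = re.compile(r"[,\s-]+")

def tokens(value):
    """Split a string on the assumed delimiters and drop empty tokens."""
    return [t for t in DELIMITERS.split(value) if t]

def find_overlaps(col_a, col_b):
    """Return the tokens of col_b that are substrings of some token of col_a."""
    a_tokens = tokens(col_a)
    return [b for b in tokens(col_b) if any(b in a for a in a_tokens)]

def has_overlap(col_a, col_b):
    """True if at least one token of col_b overlaps with col_a."""
    return bool(find_overlaps(col_a, col_b))

print(find_overlaps("abc d", "bc, z"))  # ['bc']
print(has_overlap("abcde", "hj f"))     # False
```

If this behaviour is what you need, has_overlap could presumably be wrapped with F.udf and BooleanType to get the yes/no column the question asks for.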
A uj5u.com user replied:
You can use a custom UDF to implement the intersection logic, as shown below.
Data preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

# `sql` was not defined in the original snippet; build it from the active SparkContext.
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}
mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
|  1|abc d|bc, z|
|  2|abcde| hj f|
+---+-----+-----+
UDF
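def str_intersect(x, y):
    # Character-level intersection: the characters that occur in both strings.
    # Spaces and commas count as characters, and the order of a set is not guaranteed.
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(str_intersect, StringType())
sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()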
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
|  1|abc d|bc, z|      bc |
|  2|abcde| hj f|     null|
+---+-----+-----+---------+
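
Note that the set-based UDF above compares individual characters, so a shared space or comma also counts as an intersection (the trailing space in the first row's result comes from that). The sketch below is one possible variation, not the answerer's code: the function name char_overlap and its ignore parameter are made up for illustration. It removes the assumed delimiter characters before reporting a match and sorts the result so the output is deterministic.

```python
def char_overlap(x, y, ignore=" ,"):
    """Return the characters shared by x and y, sorted, or None if there are none.

    Characters listed in `ignore` (assumed delimiters) are not counted as matches.
    """
    if x is None or y is None:
        return None  # a null column value has no overlap
    common = (set(x) & set(y)) - set(ignore)
    return "".join(sorted(common)) or None

print(char_overlap("abc d", "bc, z"))  # bc
print(char_overlap("abcde", "hj f"))   # None
# Without the ignore set, these two strings would match on the shared space.
print(char_overlap("abc d", "hj f"))   # None
```

This still matches on single characters rather than on contiguous substrings, so "abc" and "cab" would be reported as overlapping on "abc". Whether that is acceptable depends on how "overlap" is defined for your data.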
Tags: dataframe scala apache-spark spark apache-spark-sql
