公司可以選擇一個路段。部分由開始和結束表示。
pyspark 資料框如下:
-------------------- ---------- --------
|Road company |start(km) |end(km) |
-------------------- ---------- --------
|classA |1 |3 |
|classA |4 |7 |
|classA |10 |15 |
|classA |16 |20 |
|classB |1 |3 |
|classB |4 |7 |
|classB |10 |15 |
-------------------- ---------- --------
B 類公司將首先選擇路段。對于 A 類條目,應該與 B 類有重疊。也就是說,A 類公司無法選擇已被 B 類(公司)選擇的路段。結果應如下所示:
-------------------- ---------- --------
|Road company |start(km) |end(km) |
-------------------- ---------- --------
|classA |16 |20 |
|classB |1 |3 |
|classB |4 |7 |
|classB |10 |15 |
-------------------- ---------- --------
distinct() 函式不支持將幀分成幾個部分來應用不同的操作。我應該怎么做才能實作它?
uj5u.com熱心網友回復:
如果您可以部分分配道路部分,這里有一個不同(非常相似)的策略:
start="start(km)"
end="end(km)"
def emptyDFr():
schema = StructType([
StructField(start,IntegerType(),True),
StructField(end,IntegerType(),True),
StructField("Road company",StringType(),True),
StructField("ranged",IntegerType(),True)
])
return spark.createDataFrame(sc.emptyRDD(), schema)
def dummyData():
return sc.parallelize([["classA",1,3],["classA",4,7],["classA",8,15],["classA",16,20],["classB",1,3],["classB",4,7],["classB",8,17]]).toDF(['Road company','start(km)','end(km)'])
df = dummyData()
df.cache()
df_ordered = df.orderBy(when(col("Road company") == "classB", 1)
.when(col("Road company") == "classA", 2)
.when(col("Road company") == "classC", 3)
).select("Road company").distinct()
# create the sequence of kilometers that cover the 'start' to 'end'
ranged = df.withColumn("range", explode(sequence( col(start), col(end) )) )
whatsLeft = ranged.select( col("range") ).distinct()
result = emptyDFr()
#Only use collect() on small countable sets of data.
for company in df_ordered.collect():
taken = ranged.where(col("Road company") == lit(company[0]))\
.join(whatsLeft, ["range"])
whatsLeft = whatsLeft.subtract( taken.select( col("range") ) )
result = result.union( taken.select( col("range") ,col(start), col(end),col("Road company") ) )
#convert our result back to the 'original style' of records with starts and ends.
result.groupBy( start, end, "Road company").agg(count("ranged").alias("count") )\
#figure out math to see if you got everything you asked for.
.withColumn("Partial", ((col(end) lit(1)) - col(start)) != col("count"))\
.withColumn("Maths", ((col(end) lit(1)) - col(start))).show() #helps show why this works not requried.
uj5u.com熱心網友回復:
如果您可以依靠部分永遠不會重疊的事實,則可以使用以下邏輯解決此問題。您可能會優化它以依賴“開始(公里)”。但是如果你說的比這更深入,它可能會更復雜。
from pyspark.sql.functions col, when
from pyspark.sql.types import *
def emptyDF():
schema = StructType([
StructField("start(km)",IntegerType(),True),
StructField("end(km)",IntegerType(),True),
StructField("Road company",StringType(),True)
])
return spark.createDataFrame(sc.emptyRDD(), schema)
def dummyData():
return sc.parallelize([["classA",1,3],["classA",4,7],["classA",8,15],["classA",16,20],["classB",1,3],["classB",4,7],["classB",8,15]]).toDF(['Road company','start(km)','end(km)'])
df = dummyData()
df.cache()
df_ordered = df.orderBy(when(col("Road company") == "classB", 1)
.when(col("Road company") == "classA", 2)
.when(col("Road company") == "classC", 3)
).select("Road company").distinct()
whatsLeft = df.select( col("start(km)") ,col("end(km)") ).distinct()
result = emptyDF()
#Only use collect() on small countable sets of data.
for company in df_ordered.collect():
taken = df.where(col("Road company") == lit(company[0]))\
.join(whatsLeft, ["start(km)" ,"end(km)"])
whatsLeft = whatsLeft.subtract( taken.drop( col("Road company") ) )
result = result.union( taken )
result.show()
--------- ------- ------------
|start(km)|end(km)|Road company|
--------- ------- ------------
| 1| 3| classB|
| 4| 7| classB|
| 8| 15| classB|
| 16| 20| classA|
--------- ------- ------------
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/481486.html
