Spark SQL: getting sample rows with a WHERE clause

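Is it possible to get a sample of n rows from a query that has a where clause?

I tried the tablesample function below, but I ended up with records only from the first partition, '2021-09-14'.

select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample (100 rows)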

Answer from a uj5u.com user:

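You can use monotonically_increasing_id or rand to generate an additional column, which can then be used to order your dataset and produce the sample you need.

The two functions can be used together or on their own.

In addition, you can use a LIMIT clause to pick the required N records.

Note: orderBy is a costly operation.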
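To make the idea concrete, here is a plain-Python analogue of "add a random column, order by it, keep the first N rows". It is only a sketch and does not use Spark; the function name `sample_by_random_key` is hypothetical. It also shows where the cost comes from: every row has to be sorted before the first N can be taken.

```python
import random


def sample_by_random_key(rows, n, seed=None):
    """Pick up to n rows by sorting on a per-row random key (illustration only)."""
    rng = random.Random(seed)
    # Attach an independent random key to every row (the analogue of a RAND() column).
    keyed = [(rng.random(), row) for row in rows]
    # Sorting the whole collection is the expensive step (the analogue of ORDER BY).
    keyed.sort(key=lambda pair: pair[0])
    # Keep the first n rows (the analogue of LIMIT n).
    return [row for _, row in keyed[:n]]


rows = list(range(20))
print(sample_by_random_key(rows, 5, seed=1))
```

Because each row gets its own random key, the rows that end up first after sorting form a random sample without replacement.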

Data preparation

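from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumption: `sql` is a SparkSession (the original post does not show how it was created)
sql = SparkSession.builder.getOrCreate()

input_str = """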
1   2/12/2019   114 2
2   3/5/2019    116 1
3   3/3/2019    120 6
4   3/4/2019    321 10
6   6/5/2019    116 1
7   6/3/2019    116 1
8   10/1/2019   120 3
9   10/1/2019   120 3
10  10/1/2020   120 3
11  10/1/2020   120 3
12  10/1/2020   120 3
13  10/1/2022   120 3
14  10/1/2021   120 3
15  10/6/2019   120 3
""".split()

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id  ship_date   customer_id quantity".split()))
            
n = len(input_values)

# Group the flat list of tokens into rows of 4 values each
input_list = [tuple(input_values[i:i + 4]) for i in range(0, n, 4)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))

sparkDF.show()

+-----------+----------+-----------+--------+
|shipment_id| ship_date|customer_id|quantity|
+-----------+----------+-----------+--------+
|          1|2019-12-02|        114|       2|
|          2|2019-05-03|        116|       1|
|          3|2019-03-03|        120|       6|
|          4|2019-04-03|        321|      10|
|          6|2019-05-06|        116|       1|
|          7|2019-03-06|        116|       1|
|          8|2019-01-10|        120|       3|
|          9|2019-01-10|        120|       3|
|         10|2020-01-10|        120|       3|
|         11|2020-01-10|        120|       3|
|         12|2020-01-10|        120|       3|
|         13|2022-01-10|        120|       3|
|         14|2021-01-10|        120|       3|
|         15|2019-06-10|        120|       3|
+-----------+----------+-----------+--------+

Order By - Monotonically Increasing ID & Rand

sparkDF.createOrReplaceTempView("shipment_table")

sql.sql("""
SELECT
 *
FROM (
    SELECT 
        *
        ,monotonically_increasing_id() as increasing_id
        ,RAND(10) as random_order
    FROM shipment_table
    WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
    ORDER BY monotonically_increasing_id() DESC ,RAND(10) DESC
    LIMIT 5
)
""").show()

+-----------+----------+-----------+--------+-------------+-------------------+
|shipment_id| ship_date|customer_id|quantity|increasing_id|       random_order|
+-----------+----------+-----------+--------+-------------+-------------------+
|         15|2019-06-10|        120|       3|   8589934593|0.11682250456449328|
|          9|2019-01-10|        120|       3|   8589934592|0.03422639313807285|
|          8|2019-01-10|        120|       3|            6| 0.8078688178371882|
|          7|2019-03-06|        116|       1|            5|0.36664222617947817|
|          6|2019-05-06|        116|       1|            4|    0.2093704977577|
+-----------+----------+-----------+--------+-------------+-------------------+
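
In the query above, increasing_id is unique for every row, so it alone decides the order and RAND(10) never gets to break a tie; the five rows returned are simply the ones with the highest IDs. If a random pick is what you want, one option is to order by RAND only. The sketch below builds such a query as text. The helper name `build_sample_query` and its parameters are hypothetical, and it assumes the table name and the filter come from trusted input, because they are interpolated directly into the SQL.

```python
def build_sample_query(table, where_clause, n, seed=None):
    """Return Spark SQL text that picks n rows at random from the filtered table."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Passing a seed makes the random values reproducible for the same data layout.
    rand_call = "rand()" if seed is None else f"rand({int(seed)})"
    return (
        f"SELECT * FROM {table} "
        f"WHERE {where_clause} "
        f"ORDER BY {rand_call} "
        f"LIMIT {int(n)}"
    )


query = build_sample_query(
    "shipment_table",
    "ship_date BETWEEN '2019-01-01' AND '2019-12-31'",
    5,
    seed=10,
)
print(query)
# Assuming `sql` is the Spark session used earlier in this answer:
# sql.sql(query).show()
```

The filter runs before the ordering, so only the rows matching the WHERE clause take part in the sort.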

Answer from a uj5u.com user:

If you are using a Dataset, there is built-in functionality for this, as outlined in the documentation:

sample(withReplacement: Boolean, fraction: Double): Dataset[T]

Returns a new Dataset by sampling a fraction of rows, using a random seed.

withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].

Since

    1.6.0
Note

    This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.

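To use it, filter the dataset by whatever criteria you are looking for, then sample the result. If you need an exact number of rows rather than a fraction, you can follow the call to sample with limit(n), where n is the number of rows to return.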
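
The filter, sample, limit chain described above might look like this in PySpark. This is a minimal sketch: the function name `sample_n_rows` is hypothetical, and `df` is assumed to be a PySpark DataFrame. Note that limit(n) only caps the result, so if the sampled fraction yields fewer than n rows, fewer than n rows come back; choose the fraction generously.

```python
def sample_n_rows(df, condition, n, fraction, seed=None):
    """Filter df, sample a fraction of the matches, and keep at most n rows."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0.0, 1.0]")
    # 1. Keep only the rows matching the criteria (a SQL string or a Column expression).
    filtered = df.filter(condition)
    # 2. Draw a random sample without replacement; the size is approximate, not exact.
    sampled = filtered.sample(withReplacement=False, fraction=fraction, seed=seed)
    # 3. Cap the result at n rows.
    return sampled.limit(n)


# Example call, assuming `df` holds the table from the question:
# sample_n_rows(df, "ts in ('2021-09-14', '2021-09-15')", 100, 0.5, seed=42).show()
```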

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/311433.html

Tags: apache-spark apache-spark-sql
