加權隨機選擇-有解無憂

請。我有兩個最常見的名字和姓氏的表。每個表基本上有兩個欄位：

表

CREATE TABLE "common_first_name" (
    "first_name" text PRIMARY KEY, --The text representing the name
    "ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.     
    "inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
    "updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);

CREATE TABLE "common_last_name" (
    "last_name" text PRIMARY KEY, --The text representing the name
    "ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.     
    "inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
    "updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);

PS：TOP 1 名稱僅出現約 1.8% 的時間。每個表有 1000 行。

函式（偽，未就緒）

CREATE OR REPLACE FUNCTION create_sample_data(p_number_of_records INT)
    RETURNS VOID
    AS $$
DECLARE
    SUM_OF_WEIGHTS CONSTANT INT := 100;
BEGIN

    FOR i IN 1..coalesce(p_number_of_records, 0) LOOP
      --Get the random first and last name but taking in consideration their probability (RATIO)round(random()*SUM_OF_WEIGHTS); 
      --create_person (random_first_name || ' ' || random_last_name);
    END LOOP;
END
$$
LANGUAGE plpgsql VOLATILE;

PS：每個名稱（每表）的所有比率總和為 100%。

我想運行一個函式 N 次并獲取一個名字和一個姓氏來創建示例資料......兩個表每個都有 1000 行。

樣本大小可以是 1000 個全名到 1000000 個名字，所以如果有一種“快速”的方法來執行這個隨機加權函式，那就更好了。

關于如何在 PL/PGSQL 中執行此操作的任何建議？

我在 SUPABASE.IO 上使用 PG 13.3。

謝謝

uj5u.com熱心網友回復：

鑒于輸入資料集較小，在純 SQL 中執行此操作很簡單。使用 CTE 為每個 common_FOO_name 表中的每一行構建下限和上限列，然后使用 generate_series() 生成亂數集。將所有內容連接在一起，并使用邊界之間的隨機值作為 WHERE 子句。

with first_names_weighted as (
  select first_name,
    sum(ratio) over (order by first_name) - ratio as lower_bound,
    sum(ratio) over (order by first_name) as upper_bound
  from common_first_name
),
last_names_weighted as (
  select last_name,
    sum(ratio) over (order by last_name) - ratio as lower_bound,
    sum(ratio) over (order by last_name) as upper_bound
  from common_last_name
),
randoms as (
  select random() * (select sum(ratio) from common_first_name) as f_random,
         random() * (select sum(ratio) from common_last_name) as l_random
  from generate_series(1, 32)
)
select r, first_name, last_name
from randoms r
cross join first_names_weighted f
cross join last_names_weighted l
where f.lower_bound <= r.f_random and r.f_random <= f.upper_bound
  and l.lower_bound <= r.l_random and r.l_random <= l.upper_bound;

更改傳遞給的值generate_series()以控制要生成的名稱數量。如果它是一個函式很重要，您可以只使用 LANGAUGE SQL 函式定義來引數化該數字：

https://www.db-fiddle.com/f/mmGQRhCP2W1yfhZTm1yXu5/3

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/349168.html

標籤：PostgreSQL 算法 plpgsql 概率分布超基地

上一篇：使用Doobie映射多對多關系

下一篇：sql,javax.persistence.criteria：根據字串串列檢查具有逗號分隔字串值的資料庫表列