PostgresQL查詢在大資料集上運行時間太長-有解無憂

我有一張token_count桌子。

docid | tokenid | tf | log_ave_tf
  1   |    1    | 1  |  null
  1   |    2    | 2  |  null
  2   |    1    | 3  |  null
  2   |    2    | 1  |  null

ALTER TABLE token_count
ADD COLUMN log_ave_tf real;

我正在嘗試計算log_ave_tf列的值，公式如下：

log_ave_tf = (1 log(tf)) / (1 log(average tf for document))

這是我正在運行的代碼：

UPDATE token_count tc 
SET log_ave_tf = (1   log(2, tf)) / (1 log(2, subquery.avg_tf))
FROM (
    
          SELECT docid, avg(tf) as avg_tf
          FROM token_count
          GROUP BY docid
    ) subquery 
WHERE subquery.docid = tc.docid;

在 1000 個檔案資料集上用了 1 分半鐘，嘗試在 100 000 個檔案資料集（3600 萬行token_count）上運行它，并且在 5 小時后不得不取消查詢。我需要它來處理 400 萬個檔案資料集。有沒有辦法優化這個查詢，這樣它就不會花費太多時間？

對包含 1000 個檔案的資料集進行解釋（分析、緩沖區、格式化文本）：

"Update on token_count tc  (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=89287.844..89287.847 rows=0 loops=1)"
"  Buffers: shared hit=2319962 read=13056 dirtied=17040 written=922"
"  ->  Hash Join  (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=768.179..83652.020 rows=1128913 loops=1)"
"        Hash Cond: (tc.docid = subquery.docid)"
"        Buffers: shared hit=32402 read=8796 dirtied=1 written=922"
"        ->  Seq Scan on token_count tc  (cost=0.00..31888.13 rows=1128913 width=30) (actual time=0.089..702.652 rows=1128913 loops=1)"
"              Buffers: shared hit=16206 read=4393 written=922"
"        ->  Hash  (cost=37552.67..37552.67 rows=888 width=96) (actual time=767.982..767.983 rows=1001 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 93kB"
"              Buffers: shared hit=16196 read=4403 dirtied=1"
"              ->  Subquery Scan on subquery  (cost=37532.69..37552.67 rows=888 width=96) (actual time=766.111..767.517 rows=1001 loops=1)"
"                    Buffers: shared hit=16196 read=4403 dirtied=1"
"                    ->  HashAggregate  (cost=37532.69..37543.79 rows=888 width=36) (actual time=766.105..767.119 rows=1001 loops=1)"
"                          Group Key: token_count.docid"
"                          Batches: 1  Memory Usage: 321kB"
"                          Buffers: shared hit=16196 read=4403 dirtied=1"
"                          ->  Seq Scan on token_count  (cost=0.00..31888.13 rows=1128913 width=8) (actual time=0.010..231.895 rows=1128913 loops=1)"
"                                Buffers: shared hit=16196 read=4403 dirtied=1"
"Planning Time: 0.222 ms"
"Execution Time: 89288.014 ms"

uj5u.com熱心網友回復：

這是回填的經典案例，最好通過創建新表而不是就地更新來完成。先決條件是您需要在要執行此操作的表空間上有 2X 空間。

整個操作應分 3 個步驟的多輪進行。創建臨時表。

create temp table tf_avg_tab_temp
( docid int,
  avg_tf real
  )

create temp table tf_avg_log_tab_temp
(
  docid int,
  log_avg_tf real
)

原表：

create table token_count
(
  docid int,
  tokenid int,
  tf int
)

決賽桌：

create table token_count_final
(
  docid int,
  tokenid int,
  tf int,
  log_avg_tf real
)

每一輪都應該針對一定范圍的 docid 進行，比如 100 萬：

步驟 1：加載表 tf_avg_tab_temp。

insert into tf_avg_tab_temp select docid, avg(tf) as avg_tf FROM token_count where docid between 1 and 1000000 GROUP BY docid;

[注意：要加快此程序，您可以在 docid 上創建索引。]

步驟 2：使用步驟 1 中的值，加載表 tf_avg_log_tab_temp。

insert into tf_avg_log_tab_temp select docid,(1   log(2, tf)) / (1 log(2,avg_tf)) log_avg_tf from tf_avg_tab_temp;

步驟 3：通過連接步驟 2 中獲得的原始表和資料集來加載最終表：

insert into token_count_final 
select a.docid, a.tokenid, a.tf , b.log_avg_tf from token_count a inner join tf_avg_log_tab_temp b on a.docid = b.docid ;

TEMP 表需要在每一輪之后被截斷，下一次迭代應該從接下來的百萬個 docid 開始......在這里說 docid 1000001 到 2000000

您可以重命名最終表并洗掉原始表。重命名后分析表。

轉載請註明出處，本文鏈接：https://www.uj5u.com/qiye/380157.html

標籤：sql PostgreSQL的

上一篇：使用洗掉/更新/插入觸發器計算總和

下一篇：如果記錄存在從下一行插入（Sql）