我有一張token_count桌子。
docid | tokenid | tf | log_ave_tf
1 | 1 | 1 | null
1 | 2 | 2 | null
2 | 1 | 3 | null
2 | 2 | 1 | null
ALTER TABLE token_count
ADD COLUMN log_ave_tf real;
我正在嘗試計算log_ave_tf列的值,公式如下:
log_ave_tf = (1 log(tf)) / (1 log(average tf for document))
這是我正在運行的代碼:
UPDATE token_count tc
SET log_ave_tf = (1 log(2, tf)) / (1 log(2, subquery.avg_tf))
FROM (
SELECT docid, avg(tf) as avg_tf
FROM token_count
GROUP BY docid
) subquery
WHERE subquery.docid = tc.docid;
在 1000 個檔案資料集上用了 1 分半鐘,嘗試在 100 000 個檔案資料集(3600 萬行token_count)上運行它,并且在 5 小時后不得不取消查詢。我需要它來處理 400 萬個檔案資料集。有沒有辦法優化這個查詢,這樣它就不會花費太多時間?
對包含 1000 個檔案的資料集進行解釋(分析、緩沖區、格式化文本):
"Update on token_count tc (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=89287.844..89287.847 rows=0 loops=1)"
" Buffers: shared hit=2319962 read=13056 dirtied=17040 written=922"
" -> Hash Join (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=768.179..83652.020 rows=1128913 loops=1)"
" Hash Cond: (tc.docid = subquery.docid)"
" Buffers: shared hit=32402 read=8796 dirtied=1 written=922"
" -> Seq Scan on token_count tc (cost=0.00..31888.13 rows=1128913 width=30) (actual time=0.089..702.652 rows=1128913 loops=1)"
" Buffers: shared hit=16206 read=4393 written=922"
" -> Hash (cost=37552.67..37552.67 rows=888 width=96) (actual time=767.982..767.983 rows=1001 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 93kB"
" Buffers: shared hit=16196 read=4403 dirtied=1"
" -> Subquery Scan on subquery (cost=37532.69..37552.67 rows=888 width=96) (actual time=766.111..767.517 rows=1001 loops=1)"
" Buffers: shared hit=16196 read=4403 dirtied=1"
" -> HashAggregate (cost=37532.69..37543.79 rows=888 width=36) (actual time=766.105..767.119 rows=1001 loops=1)"
" Group Key: token_count.docid"
" Batches: 1 Memory Usage: 321kB"
" Buffers: shared hit=16196 read=4403 dirtied=1"
" -> Seq Scan on token_count (cost=0.00..31888.13 rows=1128913 width=8) (actual time=0.010..231.895 rows=1128913 loops=1)"
" Buffers: shared hit=16196 read=4403 dirtied=1"
"Planning Time: 0.222 ms"
"Execution Time: 89288.014 ms"
uj5u.com熱心網友回復:
這是回填的經典案例,最好通過創建新表而不是就地更新來完成。先決條件是您需要在要執行此操作的表空間上有 2X 空間。
整個操作應分 3 個步驟的多輪進行。創建臨時表。
create temp table tf_avg_tab_temp
( docid int,
avg_tf real
)
create temp table tf_avg_log_tab_temp
(
docid int,
log_avg_tf real
)
原表:
create table token_count
(
docid int,
tokenid int,
tf int
)
決賽桌:
create table token_count_final
(
docid int,
tokenid int,
tf int,
log_avg_tf real
)
每一輪都應該針對一定范圍的 docid 進行,比如 100 萬:
步驟 1:加載表 tf_avg_tab_temp。
insert into tf_avg_tab_temp select docid, avg(tf) as avg_tf FROM token_count where docid between 1 and 1000000 GROUP BY docid;
[注意:要加快此程序,您可以在 docid 上創建索引。]
步驟 2:使用步驟 1 中的值,加載表 tf_avg_log_tab_temp。
insert into tf_avg_log_tab_temp select docid,(1 log(2, tf)) / (1 log(2,avg_tf)) log_avg_tf from tf_avg_tab_temp;
步驟 3:通過連接步驟 2 中獲得的原始表和資料集來加載最終表:
insert into token_count_final
select a.docid, a.tokenid, a.tf , b.log_avg_tf from token_count a inner join tf_avg_log_tab_temp b on a.docid = b.docid ;
TEMP 表需要在每一輪之后被截斷,下一次迭代應該從接下來的百萬個 docid 開始......在這里說 docid 1000001 到 2000000
您可以重命名最終表并洗掉原始表。重命名后分析表。
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/380157.html
標籤:sql PostgreSQL的
