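Translator's note:
MySQL users typically set innodb_buffer_pool_size to about 80% of system memory, handing most of the operating system's memory to InnoDB's buffer pool. PostgreSQL works differently: it uses a double-caching model that combines the instance's own buffers (shared_buffers) with the operating system's cache (which effective_cache_size describes to the planner). This is a fairly fundamental difference between the two.
Caching is a core component of any database. shared_buffers determines how much memory the instance itself uses for caching. effective_cache_size allocates nothing: it is an estimate of how much memory is available for caching data, including the file system cache, and the optimizer uses it when choosing an execution plan.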

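The following articles explain the relationship between shared buffers and the operating system's file cache (OS cache) fairly well and can serve as references: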
https://www.cybertec-postgresql.com/en/effective_cache_size-what-it-means-in-postgresql/
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
https://severalnines.com/database-blog/architecture-and-tuning-memory-postgresql-databases
https://distributedsystemsauthority.com/optimizing-postgresql-shared-buffers/
https://www.enterprisedb.com/edb-docs/d/edb-postgres-advanced-server/user-guides/user-guide/10/EDB_Postgres_Advanced_Server_Guide.1.24.html
https://stackoverflow.com/questions/42478488/resize-shared-buffer-size-in-postgresql-hosted-in-aws-rds
https://devcenter.heroku.com/articles/understanding-postgres-data-caching

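The material I found covers only the theory, such as the problems that can arise when shared_buffers or effective_cache_size is set too high or too low. In the end I still have not worked out the rationale behind the recommended values for shared_buffers (25% of system memory) and effective_cache_size (50% of system memory). I would welcome pointers from anyone who knows.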

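The translation follows.

Original article: https://postgreshelp.com/postgresql_shared_buffers/

This article answers the following question in detail: how much memory should you give to shared buffers in PostgreSQL, and why?
Bonus! Why does my RDS PostgreSQL use 25% of system memory for shared buffers while Aurora PostgreSQL uses 75%? The answer is here.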

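Understanding OS Cache vs Shared Buffers in PostgreSQL

Before we begin, answer one question: what does the BGWriter do in PostgreSQL?
If your answer is "it writes dirty buffers to disk", that is wrong.
It actually writes dirty buffers to the OS cache, and a separate system call then flushes the pages from the OS cache to disk.
Still not clear? Think of it this way:
Because of its lightweight design, PostgreSQL has to rely heavily on the OS cache. It depends on the operating system to know about the file system, the disk layout, and how to read and write the data files.
The figure below gives a rough idea of how data moves between disk and shared buffers.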

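Here, whenever you issue a "select * from emp", the data is first loaded into the OS cache and then into shared buffers.
Likewise, when dirty buffers are flushed to disk, the pages are first written to the OS cache and then flushed to disk by a separate system call named fsync().
PostgreSQL therefore duplicates part of what the OS does, which means the OS cache and shared_buffers can hold the same pages.
This can waste space, but remember that the OS cache uses a simple LRU rather than a database-optimized clock sweep algorithm. Once a page is hit in shared_buffers, the read never reaches the OS cache, so any duplicate copy held there goes unused and is easily evicted.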

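Can I influence how the operating system flushes dirty pages to disk?

Yes. The parameter is in postgresql.conf: bgwriter_flush_after (integer), default 512 kB on Linux.
Whenever the background writer has written more than this amount of data, PostgreSQL attempts to force the OS to issue these writes to the underlying storage.
Doing so limits the amount of dirty data in the kernel's page cache, which reduces the likelihood of stalls when an fsync is issued at the end of a checkpoint or when the OS writes data back in large batches in the background.
If the value is specified without units, it is taken as blocks, that is, BLCKSZ bytes, typically 8 kB.
The bgwriter is not alone here: the checkpointer process and user backend processes can also write dirty buffers from shared buffers to the OS cache.
For those, the checkpoint_flush_after and backend_flush_after parameters respectively influence when the OS is asked to flush (a discussion of checkpoints and backend processes is beyond the scope of this article).
See also: A comprehensive guide to the checkpoint process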
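The unit rule above (a value without units is counted in blocks of BLCKSZ bytes) is easy to get wrong, so here is a small arithmetic sketch. The function names are hypothetical helpers, not part of PostgreSQL, and the 8 kB block size is an assumption that holds only for a default build.

```python
BLCKSZ = 8192  # assumed default block size in bytes; a custom build may differ


def flush_after_blocks(value_kb: int, block_size: int = BLCKSZ) -> int:
    """Convert a *_flush_after value given in kB to a number of blocks."""
    return (value_kb * 1024) // block_size


def flush_after_kb(blocks: int, block_size: int = BLCKSZ) -> int:
    """Convert a unit-less *_flush_after value (blocks) to kB."""
    return blocks * block_size // 1024


if __name__ == "__main__":
    # The 512 kB default quoted above, expressed in 8 kB blocks.
    print(flush_after_blocks(512))
    # And the reverse conversion for a unit-less setting of 64.
    print(flush_after_kb(64))
```

Under these assumptions, 512 kB corresponds to 64 blocks, so `bgwriter_flush_after = 64` and `bgwriter_flush_after = 512kB` would describe the same threshold.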

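What if too little memory is left for the OS cache?

As mentioned above, once a page is marked dirty it is flushed to the OS cache and then written to disk.
This gives the OS more freedom to schedule I/O according to the incoming traffic. If the OS cache is small, the OS cannot reorder writes and optimize I/O.
This matters most for write-heavy workloads, so the size of the OS cache is important too.
Translator's note:
Leaving too little cache for the operating system, as described here, means giving most of the memory to shared buffers. Apart from the reasons above, none of the many sources I consulted clearly explains the ratio between the OS cache and shared buffers for a PostgreSQL instance, or the reasoning behind it.
For example, what exactly is the rationale for giving shared buffers 25% of total memory and setting effective_cache_size to 50% of total memory? Why this ratio rather than 50:50 or 70:30?
A similar question is answered here: https://www.postgresql-archive.org/Increased-shared-buffer-setting-lower-hit-ratio-td5826899.html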
Regarding the unexpected decrease of performance after increasing shared_buffers - that's actually quite common behavior.
First, the management of shared buffers is not free, and the more pieces you need to manage the more expensive it is.
Also, by using larger shared buffers you make that memory unusable for page cache etc.
There are also other negative consequences - double buffering, accumulating more changes for a checkpoint etc.

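What if too little memory is given to shared buffers?

Quite simply, although the OS cache is used for caching, the actual database operations are performed in shared buffers. It is therefore a good idea to provide enough space in shared buffers.
Translator's note: the following is translated from https://distributedsystemsauthority.com/optimizing-postgresql-shared-buffers/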
PostgreSQL shared_buffers versus operating system cache
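If you are used to a database where most of the system memory is given to the database and the OS cache is bypassed through synchronous and direct writes, you should not set up PostgreSQL the same way.
Doing so is counterproductive in PostgreSQL. For example, PostgreSQL stores commit log information in the pg_clog directory (renamed pg_xact in PostgreSQL 10). This data is read and written regularly, and PostgreSQL leaves it to the operating system to optimize access to it.
So why not give all the memory to the operating system?
The main reason the PostgreSQL shared buffer cache can do better than the OS cache is the way it keeps a usage count for each buffer.
This lets a buffer earn a "popularity" score from 0 to 5, and the higher the score, the less likely the buffer is to leave the cache.
Whenever the database looks for something to evict to make room for data it needs, it decrements the usage count of the buffers it passes over. Every increase in the usage count makes the block harder to evict. This implementation is called the clock-sweep algorithm.
A typical OS cache gives a buffer only one or two chances before its data is evicted.
The OS usually uses some form of LRU algorithm. If the database has data that is used frequently, keeping that data in the database's shared RAM is likely to work better than keeping it in the operating system's.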
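The usage-count behaviour described above can be illustrated with a toy simulation. This is a minimal sketch of the clock-sweep idea only, not PostgreSQL's implementation: the class and method names are hypothetical, there is no pinning, locking, or free list, and setting the count to 1 when a page is loaded is a simplifying assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_USAGE = 5  # the "popularity" score in the text runs from 0 to 5


@dataclass
class Buffer:
    page: Optional[str] = None  # which page this slot holds, if any
    usage: int = 0              # usage count, capped at MAX_USAGE


class ClockSweepCache:
    """Toy buffer pool: a fixed ring of buffers and one sweeping hand."""

    def __init__(self, size: int) -> None:
        self.buffers: List[Buffer] = [Buffer() for _ in range(size)]
        self.hand = 0

    def _find(self, page: str) -> Optional[Buffer]:
        for buf in self.buffers:
            if buf.page == page:
                return buf
        return None

    def access(self, page: str) -> bool:
        """Return True on a hit, False if the page had to be loaded."""
        buf = self._find(page)
        if buf is not None:
            # A hit raises the popularity score, up to the cap.
            buf.usage = min(buf.usage + 1, MAX_USAGE)
            return True
        victim = self._sweep()
        victim.page = page
        victim.usage = 1  # assumption of this sketch: new pages start at 1
        return False

    def _sweep(self) -> Buffer:
        """Advance the hand, decrementing counts, until one is already 0."""
        while True:
            buf = self.buffers[self.hand]
            self.hand = (self.hand + 1) % len(self.buffers)
            if buf.usage == 0:
                return buf
            buf.usage -= 1


if __name__ == "__main__":
    cache = ClockSweepCache(size=3)
    for _ in range(5):
        cache.access("hot")          # build up a high usage count
    for page in ["a", "b", "c", "d"]:
        cache.access(page)           # one-off pages compete for two slots
    print("hot still cached:", cache.access("hot"))
```

In this run the frequently used page survives several rounds of one-off reads because each sweep only lowers its count by one, whereas a page touched once is evicted on the first pass that finds it at zero.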

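So what is the optimal value?

The PostgreSQL documentation suggests 25% of system memory as a reasonable starting value for shared buffers on a dedicated database server, and you can adjust it to suit your environment (a change to shared_buffers takes effect only after a server restart).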
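As a worked example of the rule of thumb, the sketch below turns a total RAM figure into starting values. The function name is hypothetical, the 25% figure is the one quoted above, and the 50% figure for effective_cache_size is the one mentioned in the translator's note; treat both as starting points to be tuned, not as fixed answers.

```python
def suggest_memory_settings(total_ram_mb: int,
                            shared_fraction: float = 0.25,
                            cache_fraction: float = 0.50) -> dict:
    """Return starting values for the two settings discussed in this article."""
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return {
        # Fraction of RAM handed to PostgreSQL's own buffer cache.
        "shared_buffers": f"{int(total_ram_mb * shared_fraction)}MB",
        # Planner hint only: nothing is allocated for this value.
        "effective_cache_size": f"{int(total_ram_mb * cache_fraction)}MB",
    }


if __name__ == "__main__":
    # A 16 GB server, expressed in MB.
    for name, value in suggest_memory_settings(16 * 1024).items():
        print(f"{name} = {value}")
```

For a 16 GB server this yields 4096MB for shared_buffers and 8192MB for effective_cache_size, in a form that can be pasted into postgresql.conf.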

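How do I view the contents of shared buffers?

The pg_buffercache extension lets us inspect the data in shared buffers in real time. It collects information from shared_buffers and exposes it through the pg_buffercache view.
CREATE EXTENSION pg_buffercache;
After installing the extension, run the following query to inspect the contents of shared buffers.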

SELECT c.relname
  , pg_size_pretty(count(*) * 8192) as buffered
  , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
  , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE pg_relation_size(c.oid) > 0
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;

The output looks similar to the following:

postgres=# SELECT c.relname
postgres-#   , pg_size_pretty(count(*) * 8192) as buffered
postgres-#   , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
postgres-#   , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
postgres-#  FROM pg_class c
postgres-#  INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
postgres-#  INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
postgres-#  WHERE pg_relation_size(c.oid) > 0
postgres-#  GROUP BY c.oid, c.relname
postgres-#  ORDER BY 3 DESC
postgres-#  LIMIT 10;
          relname          |  buffered  | buffers_percent | percent_of_relation
---------------------------+------------+-----------------+---------------------
 pg_operator               | 80 kB      |             0.1 |                71.4
 pg_depend_reference_index | 96 kB      |             0.1 |                27.9
 pg_am                     | 8192 bytes |             0.0 |               100.0
 pg_amproc                 | 24 kB      |             0.0 |               100.0
 pg_cast                   | 8192 bytes |             0.0 |                50.0
 pg_depend                 | 64 kB      |             0.0 |                14.0
 pg_index                  | 32 kB      |             0.0 |               100.0
 pg_description            | 40 kB      |             0.0 |                14.3
 pg_language               | 8192 bytes |             0.0 |               100.0
 pg_amop                   | 40 kB      |             0.0 |                83.3
(10 rows)

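How can I see what data is actually cached at the operating system level?

To check the data cached at the operating system level, we need to install the pgfincore package.
It is an external module that reports how the operating system caches pages. It is quite low-level, but very powerful.
Download pgfincore and follow the steps below.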

As root user: 
 
export PATH=/usr/local/pgsql/bin:$PATH   # Set the path so that pg_config can be found.
tar -xvf pgfincore-v1.1.1.tar.gz 
cd pgfincore-1.1.1 
make clean 
make 
make install 
 
Now connect to PG and run below command
postgres=# CREATE EXTENSION pgfincore;

Now run the following query to check the buffers at the operating system level.

select c.relname,pg_size_pretty(count(*) * 8192) as pg_buffered, 
 round(100.0 * count(*) / 
           (select setting 
            from pg_settings 
            where name='shared_buffers')::integer,1)
       as pgbuffer_percent,
       round(100.0*count(*)*8192 / pg_table_size(c.oid),1) as percent_of_relation,
       ( select round( sum(pages_mem) * 4 /1024,0 )
         from pgfincore(c.relname::text) ) 
         as os_cache_MB , 
         round(100 * ( 
               select sum(pages_mem)*4096 
               from pgfincore(c.relname::text) )/ pg_table_size(c.oid),1) 
         as os_cache_percent_of_relation,
         pg_size_pretty(pg_table_size(c.oid)) as rel_size 
 from pg_class c 
 inner join pg_buffercache b on b.relfilenode=c.relfilenode 
 inner join pg_database d on (b.reldatabase=d.oid and d.datname=current_database()
            and c.relnamespace=(select oid from pg_namespace where nspname='public')) 
 group by c.oid,c.relname 
 order by 3 desc limit 30;

The output looks similar to the following:

relname |pg_buffered|pgbuffer_per|per_of_relation|os_cache_mb|os_cache_per_of_relation|rel_size 
---------+-----------+------------+---------------+-----------+------------------------+-------- 
 emp     | 4091 MB   | 99.9       | 49.3          | 7643      | 92.1                   | 8301 MB

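pg_buffered: how much of the relation is buffered in the PostgreSQL buffer cache
pgbuffer_percent: pg_buffered / total_buffer_size * 100
percent_of_relation: pg_buffered / total_relation_size * 100
os_cache_mb: how much of the relation is cached by the OS, in MB
Here our emp table is 8301 MB in size. 92% of it is in the OS cache while 49.3% of it is in shared buffers, so roughly 50% of the data is cached twice.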
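The figures in this result can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 4096-byte OS page size is an assumption taken from the "* 4" and "* 4096" factors in the query. The bounds come from simple inclusion-exclusion: two cached portions of one relation cannot overlap by less than their sum minus 100%, nor by more than the smaller portion.

```python
OS_PAGE_BYTES = 4096  # assumed OS page size, matching the factors in the query


def pages_to_mb(pages_mem: int, page_bytes: int = OS_PAGE_BYTES) -> float:
    """Convert a count of cached OS pages to megabytes."""
    return pages_mem * page_bytes / (1024 * 1024)


def double_cached_bounds(pg_percent: float, os_percent: float) -> tuple:
    """Bounds (in percent of the relation) on data cached in both places."""
    lower = max(0.0, pg_percent + os_percent - 100.0)
    upper = min(pg_percent, os_percent)
    return lower, upper


if __name__ == "__main__":
    # Values for the emp table from the sample output above.
    lower, upper = double_cached_bounds(pg_percent=49.3, os_percent=92.1)
    print(f"double-cached share: between {lower:.1f}% and {upper:.1f}%")
    # 7643 MB of OS cache corresponds to this many 4 kB pages.
    print(pages_to_mb(7643 * 256))
```

For the sample row, between 41.4% and 49.3% of the table is held in both caches; the article's "roughly 50%" corresponds to the upper bound.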
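Bonus!

Why does Aurora PostgreSQL set shared buffers to 75% of available memory?

For RDS DB instances, the default in the DB parameter group is set to 25% of total memory, whereas for Aurora DB instances the default in the DB parameter group is set to 75% of total memory.
This is because Aurora PostgreSQL eliminates double buffering and does not use the file system cache.
As a result, Aurora PostgreSQL can improve performance by increasing shared_buffers. With Aurora PostgreSQL, keeping the shared_buffers DB parameter at its default of 75% is a best practice.
As we all know, work_mem, maintenance_work_mem, and other local memory components are not part of shared buffers. On Aurora, if your application needs a large work_mem or a large number of client connections, you may find that shared_buffers has to be set below 75%.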
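To see why a large work_mem or many connections can force shared_buffers below 75%, here is a rough budgeting sketch. It is a deliberately crude model, not an Aurora formula: the function name is hypothetical, it assumes each connection uses exactly one work_mem allocation at the same time, and it ignores maintenance_work_mem and every other consumer of memory.

```python
def local_memory_headroom_mb(total_ram_mb: int,
                             shared_fraction: float,
                             connections: int,
                             work_mem_mb: int) -> float:
    """Memory left after shared_buffers and one work_mem per connection."""
    shared_mb = total_ram_mb * shared_fraction
    return total_ram_mb - shared_mb - connections * work_mem_mb


if __name__ == "__main__":
    ram_mb = 64 * 1024  # a hypothetical 64 GB instance
    for fraction in (0.75, 0.50):
        left = local_memory_headroom_mb(ram_mb, fraction,
                                        connections=500, work_mem_mb=64)
        print(f"shared_buffers at {fraction:.0%}: headroom {left:.0f} MB")
```

With 500 connections at 64 MB each, the 75% setting leaves a negative headroom in this model while 50% leaves a small positive one, which is the kind of situation the paragraph above describes.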


Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/1170.html


A Comprehensive Guide to PostgreSQL Shared Buffers (translation)