PostgreSQL pg_clog fsync Frequency Analysis
This article analyzes when pg_clog needs to call fsync.
First, an introduction to pg_clog quoted from the PostgreSQL wiki:
https://wiki.postgresql.org/wiki/Hint_Bits
Some details here are in src/backend/access/transam/README:
1. "pg_clog records the commit status for each transaction that has been assigned an XID."
2. "Transactions and subtransactions are assigned permanent XIDs only when/if they first do something that requires one --- typically, insert/update/delete a tuple, though there are a few other places that need an XID assigned."
pg_clog is updated only at sub or main transaction end. When the transactionid is assigned the page of the clog that contains that transactionid is checked to see if it already exists and if not, it is initialised.
pg_clog is allocated in pages of 8kB apiece (actually the page size equals BLCKSZ, so it is not necessarily 8 kB; see the code analysis below).
Each transaction needs 2 bits, so on an 8 kB page there is space for 4 transactions/byte * 8k bytes = 32k transactions.
On allocation, pages are zeroed, which is the bit pattern for "transaction in progress".
So when a transaction starts, it only needs to ensure that the pg_clog page that contains its status is allocated, but it need not write anything to it.
In 8.3 and later, this happens not when the transaction starts, but when the Xid is assigned (i.e. when the transaction first calls a read-write command).
In previous versions it happens when the first snapshot is taken, normally on the first command of any type with very few exceptions.
This means that one transaction in every 32K writing transactions *does* have to do extra work when it assigns itself an XID, namely create and zero out the next page of pg_clog.
And that doesn't just slow down the transaction in question, but the next few guys that would like an XID but arrive on the scene while the zeroing-out is still in progress.
This probably contributes to reported behavior that the transaction execution time is subject to unpredictable spikes.
A CLOG page has to be extended every 32K transactions. Each extension zero-fills the new page and may require a pg_fsync; compared with fsyncing the XLOG this should be fairly lightweight, but it can still cause unpredictable response-time spikes, because if a backend blocks while extending the CLOG page, every session waiting for that clog page is affected too.
CLOG pages don't make their way out to disk until the internal CLOG buffers are filled, at which point the least recently used buffer there is evicted to permanent storage.
In other words, only when the CLOG buffers have no free slot is a dirty page chosen from the CLOG buffer slots and written out; that is the moment a pg_fsync is issued.
The rest of this article walks through the code to see how pg_clog ends up calling pg_fsync to flush dirty pages.
Every time a new transaction ID is allocated, ExtendCLOG is called. If the CLOG page computed from the transaction ID does not exist yet, it is extended. Not every extension calls pg_fsync, though: checkpoints flush the clog buffers to disk, so only when none of the clog buffers has been flushed clean by the time the new CLOG page is requested does the backend have to pick a page itself and call pg_fsync to flush the corresponding pg_clog file.
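To summarize the call path analyzed below (PostgreSQL 9.4 source), from XID allocation down to the fsync:

GetNewTransactionId()                    -- varsup.c
  ExtendCLOG()                           -- clog.c, only for the first XID of a new page
    ZeroCLOGPage()
      SimpleLruZeroPage()                -- slru.c
        SlruSelectLRUPage()              -- returns an empty or clean slot if one exists
          SlruInternalWritePage()        -- reached only when every usable slot is dirty
            SlruPhysicalWritePage()
              pg_fsync()                 -- fd.c, because do_fsync = true for pg_clog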
src/backend/access/transam/varsup.c
/*
* Allocate the next XID for a new transaction or subtransaction.
*
* The new XID is also stored into MyPgXact before returning.
*
* Note: when this is called, we are actually already inside a valid
* transaction, since XIDs are now not allocated until the transaction
* does something. So it is safe to do a database lookup if we want to
* issue a warning about XID wrap.
*/
TransactionId
GetNewTransactionId(bool isSubXact)
{
......
/*
* If we are allocating the first XID of a new page of the commit log,
* zero out that commit-log page before returning. We must do this while
* holding XidGenLock, else another xact could acquire and commit a later
* XID before we zero the page. Fortunately, a page of the commit log
* holds 32K or more transactions, so we don't have to do this very often.
*
* Extend pg_subtrans too.
*/
ExtendCLOG(xid);
ExtendSUBTRANS(xid);
......
ExtendCLOG(xid) extends the clog page. It calls TransactionIdToPgIndex to compute the remainder of the XID modulo CLOG_XACTS_PER_PAGE; if the remainder is not 0, no extension is needed.
src/backend/access/transam/clog.c
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
/*
* Make sure that CLOG has room for a newly-allocated XID.
*
* NB: this is called while holding XidGenLock. We want it to be very fast
* most of the time; even when it's not so fast, no actual I/O need happen
* unless we're forced to write out a dirty clog or xlog page to make room
* in shared memory.
*/
void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
/*
* No work except at first XID of a page. But beware: just after
* wraparound, the first XID of page zero is FirstNormalTransactionId.
*/
if (TransactionIdToPgIndex(newestXact) != 0 && // non-zero remainder means no extension is needed
!TransactionIdEquals(newestXact, FirstNormalTransactionId))
return;
pageno = TransactionIdToPage(newestXact);
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
LWLockRelease(CLogControlLock);
}
ZeroCLOGPage(pageno, true) calls SimpleLruZeroPage to extend and initialize the CLOG page, and writes an XLOG record about it.
/*
* Initialize (or reinitialize) a page of CLOG to zeroes.
* If writeXlog is TRUE, also emit an XLOG record saying we did this.
*
* The page is not actually written, just set up in shared memory.
* The slot number of the new page is returned.
*
* Control lock must be held at entry, and will be held at exit.
*/
static int
ZeroCLOGPage(int pageno, bool writeXlog)
{
int slotno;
slotno = SimpleLruZeroPage(ClogCtl, pageno);
if (writeXlog)
WriteZeroPageXlogRec(pageno);
return slotno;
}
SimpleLruZeroPage(ClogCtl, pageno) calls SlruSelectLRUPage(ctl, pageno) to choose a slot from the clog shared buffers.
src/backend/access/transam/slru.c
/*
* Initialize (or reinitialize) a page to zeroes.
*
* The page is not actually written, just set up in shared memory.
* The slot number of the new page is returned.
*
* Control lock must be held at entry, and will be held at exit.
*/
int
SimpleLruZeroPage(SlruCtl ctl, int pageno)
{
SlruShared shared = ctl->shared;
int slotno;
/* Find a suitable buffer slot for the page */
slotno = SlruSelectLRUPage(ctl, pageno);
Assert(shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
(shared->page_status[slotno] == SLRU_PAGE_VALID &&
!shared->page_dirty[slotno]) ||
shared->page_number[slotno] == pageno);
/* Mark the slot as containing this page */
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
SlruRecentlyUsed(shared, slotno);
/* Set the buffer to zeroes */
MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
/* Set the LSNs for this new page to zero */
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
shared->latest_page_number = pageno;
return slotno;
}
SlruSelectLRUPage(SlruCtl ctl, int pageno) selects an empty slot from the clog buffers; if there is no empty slot, it has to call SlruInternalWritePage(ctl, bestvalidslot, NULL) to write out a shared buffer page.
/*
* Select the slot to re-use when we need a free slot.
*
* The target page number is passed because we need to consider the
* possibility that some other process reads in the target page while
* we are doing I/O to free a slot. Hence, check or recheck to see if
* any slot already holds the target page, and return that slot if so.
* Thus, the returned slot is *either* a slot already holding the pageno
* (could be any state except EMPTY), *or* a freeable slot (state EMPTY
* or CLEAN).
*
* Control lock must be held at entry, and will be held at exit.
*/
static int
SlruSelectLRUPage(SlruCtl ctl, int pageno)
{
......
/* See if page already has a buffer assigned */ // if the target page is already in a buffer slot, return that slot; no pg_fsync needed
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
return slotno;
}
......
/* If the page is not already buffered, pick the least recently used page from the clog buffers; the slot holding the latest page is never chosen, and slots that are not I/O-busy are preferred.
* If we find any EMPTY slot, just select that one. Else choose a
* victim page to replace. We normally take the least recently used
* valid page, but we will never take the slot containing
* latest_page_number, even if it appears least recently used. We
* will select a slot that is already I/O busy only if there is no
* other choice: a read-busy slot will not be least recently used once
* the read finishes, and waiting for an I/O on a write-busy slot is
* inferior to just picking some other slot. Testing shows the slot
* we pick instead will often be clean, allowing us to begin a read at
* once.
*
* Normally the page_lru_count values will all be different and so
* there will be a well-defined LRU page. But since we allow
* concurrent execution of SlruRecentlyUsed() within
* SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
* acquire the same lru_count values. In that case we break ties by
* choosing the furthest-back page.
*
* Notice that this next line forcibly advances cur_lru_count to a
* value that is certainly beyond any value that will be in the
* page_lru_count array after the loop finishes. This ensures that
* the next execution of SlruRecentlyUsed will mark the page newly
* used, even if it's for a page that has the current counter value.
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
int this_delta;
int this_page_number;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY) // if an empty slot has appeared in the meantime, return that slotno
return slotno;
this_delta = cur_count - shared->page_lru_count[slotno];
if (this_delta < 0)
{
/*
* Clean up in case shared updates have caused cur_count
* increments to get "lost". We back off the page counts,
* rather than trying to increase cur_count, to avoid any
* question of infinite loops or failure in the presence of
* wrapped-around counts.
*/
shared->page_lru_count[slotno] = cur_count;
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
if (this_page_number == shared->latest_page_number)
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID) // page is valid and not currently being read or written
{
if (this_delta > best_valid_delta ||
(this_delta == best_valid_delta &&
ctl->PagePrecedes(this_page_number,
best_valid_page_number)))
{
bestvalidslot = slotno;
best_valid_delta = this_delta;
best_valid_page_number = this_page_number;
}
}
else
{
if (this_delta > best_invalid_delta ||
(this_delta == best_invalid_delta &&
ctl->PagePrecedes(this_page_number,
best_invalid_page_number)))
{
bestinvalidslot = slotno; // when every page is I/O-busy, we have no choice but to pick one of the busy slots
best_invalid_delta = this_delta;
best_invalid_page_number = this_page_number;
}
}
}
/*
* If all pages (except possibly the latest one) are I/O busy, we'll
* have to wait for an I/O to complete and then retry. In that
* unhappy case, we choose to wait for the I/O on the least recently
* used slot, on the assumption that it was likely initiated first of
* all the I/Os in progress and may therefore finish first.
*/
if (best_valid_delta < 0) // no SLRU_PAGE_VALID page was found: all pages are busy with I/O
{
SimpleLruWaitIO(ctl, bestinvalidslot);
continue;
}
/*
* If the selected page is clean, we're set.
*/
if (!shared->page_dirty[bestvalidslot]) // if the page is no longer dirty (e.g. a checkpoint flushed it), return it directly
return bestvalidslot;
......
Only when all of the steps above fail to find an empty (or clean) slot does the backend have to flush a dirty page itself (pg_fsync is called from SlruInternalWritePage).
/*
* Write the page. Note that the third argument, fdata, is NULL here.
*/
SlruInternalWritePage(ctl, bestvalidslot, NULL);
......
SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata) calls SlruPhysicalWritePage to perform the write.
/*
* Write a page from a shared buffer, if necessary.
* Does nothing if the specified slot is not dirty.
*
* NOTE: only one write attempt is made here. Hence, it is possible that
* the page is still dirty at exit (if someone else re-dirtied it during
* the write). However, we *do* attempt a fresh write even if the page
* is already being written; this is for checkpoints.
*
* Control lock must be held at entry, and will be held at exit.
*/
static void
SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
{
......
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
......
The SLRU page status codes:
/*
* Page status codes. Note that these do not include the "dirty" bit.
* page_dirty can be TRUE only in the VALID or WRITE_IN_PROGRESS states;
* in the latter case it implies that the page has been re-dirtied since
* the write started.
*/
typedef enum
{
SLRU_PAGE_EMPTY, /* buffer is not in use */
SLRU_PAGE_READ_IN_PROGRESS, /* page is being read in */
SLRU_PAGE_VALID, /* page is valid and not being written */
SLRU_PAGE_WRITE_IN_PROGRESS /* page is being written out */
} SlruPageStatus;
SlruPhysicalWritePage(ctl, pageno, slotno, fdata) is where the SlruCtlData structure for pg_clog comes into play: its do_fsync is true.
/*
* Physical write of a page from a buffer slot
*
* On failure, we cannot just ereport(ERROR) since caller has put state in
* shared memory that must be undone. So, we return FALSE and save enough
* info in static variables to let SlruReportIOError make the report.
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
* SimpleLruFlush, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
* SimpleLruFlush.
*/
static bool
SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
SlruFlush fdata);
......
int fd = -1;
......
// if the file does not already exist, it is created automatically
if (fd < 0)
{
/*
* If the file doesn't already exist, we should create it. It is
* possible for this to need to happen when writing a page that's not
* first in its segment; we assume the OS can cope with that. (Note:
* it might seem that it'd be okay to create files only when
* SimpleLruZeroPage is called for the first page of a segment.
* However, if after a crash and restart the REDO logic elects to
* replay the log from a checkpoint before the latest one, then it's
* possible that we will get commands to set transaction status of
* transactions that have already been truncated from the commit log.
* Easiest way to deal with that is to accept references to
* nonexistent files here and in SlruPhysicalReadPage.)
*
* Note: it is possible for more than one backend to be executing this
* code simultaneously for different pages of the same file. Hence,
* don't use O_EXCL or O_TRUNC or anything like that.
*/
SlruFileName(ctl, path, segno);
fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY,
S_IRUSR | S_IWUSR);
......
/*
* If not part of Flush, need to fsync now. We assume this happens
* infrequently enough that it's not a performance issue.
*/
if (!fdata) // fdata is passed in as NULL and ctl->do_fsync is true, so the pg_fsync below gets called.
{
if (ctl->do_fsync && pg_fsync(fd)) // do_fsync is true for pg_clog and multixact.
{
slru_errcause = SLRU_FSYNC_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
return false;
}
}
The code involved in ctl->do_fsync && pg_fsync(fd):
src/include/access/slru.h
/*
* SlruCtlData is an unshared structure that points to the active information
* in shared memory.
*/
typedef struct SlruCtlData
{
SlruShared shared;
/*
* This flag tells whether to fsync writes (true for pg_clog and multixact
* stuff, false for pg_subtrans and pg_notify).
*/
bool do_fsync;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
* need to use comparison of TransactionIds here in order to do the right
* thing with wraparound XID arithmetic.
*/
bool (*PagePrecedes) (int, int);
/*
* Dir is set during SimpleLruInit and does not change thereafter. Since
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
src/backend/access/transam/slru.c
......
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
LWLock *ctllock, const char *subdir)
......
ctl->do_fsync = true; /* default behavior */ // when an SLRU is initialized, do_fsync defaults to true.
......
Below is the clog SLRU initialization; note that it does not override do_fsync, so it remains true.
src/backend/access/transam/clog.c
/*
* Number of shared CLOG buffers.
*
* Testing during the PostgreSQL 9.2 development cycle revealed that on a
* large multi-processor system, it was possible to have more CLOG page
* requests in flight at one time than the number of CLOG buffers which existed
* at that time, which was hardcoded to 8. Further testing revealed that
* performance dropped off with more than 32 CLOG buffers, possibly because
* the linear buffer search algorithm doesn't scale well.
*
* Unconditionally increasing the number of CLOG buffers to 32 did not seem
* like a good idea, because it would increase the minimum amount of shared
* memory required to start, which could be a problem for people running very
* small configurations. The following formula seems to represent a reasonable
* compromise: people with very low values for shared_buffers will get fewer
* CLOG buffers as well, and everyone else will get 32.
*
* It is likely that some further work will be needed here in future releases;
* for example, on a 64-core server, the maximum number of CLOG requests that
* can be simultaneously in flight will be even larger. But that will
* apparently require more than just changing the formula, so for now we take
* the easy way out.
*/
Size
CLOGShmemBuffers(void)
{
return Min(32, Max(4, NBuffers / 512));
}
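For example, with shared_buffers = 128MB (NBuffers = 16384) this formula gives Min(32, Max(4, 32)) = 32 CLOG buffers, while a very small shared_buffers = 8MB (NBuffers = 1024) gives only Min(32, Max(4, 2)) = 4.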
void
CLOGShmemInit(void)
{
ClogCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
CLogControlLock, "pg_clog");
}
Below is the subtrans SLRU initialization; it sets do_fsync = false, so extending a pg_subtrans page does not need to call pg_fsync.
src/backend/access/transam/subtrans.c
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", NUM_SUBTRANS_BUFFERS, 0,
SubtransControlLock, "pg_subtrans");
/* Override default assumption that writes should be fsync'd */
SubTransCtl->do_fsync = false;
}
multixact.c does not override do_fsync either, so multixact writes also need fsync.
MultiXactShmemInit(void)@src/backend/access/transam/multixact.c
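To summarize (consistent with the comment in slru.h quoted above): pg_clog and pg_multixact (both offsets and members) keep do_fsync = true, while pg_subtrans and pg_notify set it to false.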
The pg_fsync code:
src/backend/storage/file/fd.c
/*
* pg_fsync --- do fsync with or without writethrough
*/
int
pg_fsync(int fd)
{
/* #if is to skip the sync_method test if there's no need for it */
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
return pg_fsync_writethrough(fd);
else
#endif
return pg_fsync_no_writethrough(fd);
}
/*
* pg_fsync_no_writethrough --- same as fsync except does nothing if
* enableFsync is off
*/
int
pg_fsync_no_writethrough(int fd)
{
if (enableFsync)
return fsync(fd);
else
return 0;
}
/*
* pg_fsync_writethrough
*/
int
pg_fsync_writethrough(int fd)
{
if (enableFsync)
{
#ifdef WIN32
return _commit(fd);
#elif defined(F_FULLFSYNC)
return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
errno = ENOSYS;
return -1;
#endif
}
else
return 0;
}
From the code analysis above: when a clog page is extended and there is no empty slot in the CLOG buffers, the backend process has to flush a CLOG page itself, and that is where the call to pg_fsync comes from.
A clog page is the same size as the database block size (BLCKSZ), 8 kB by default (unless it was changed when the server was compiled), up to a maximum of 32 kB.
Each transaction needs 2 bits in pg_clog to record its status (in progress, committed, aborted, or subcommitted).
So an 8 kB clog page can hold the status of 32K transactions; in other words, the clog has to be extended by one page every 32K transactions.
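At the roughly 33,000 write transactions per second measured in the pgbench test below, that works out to about one clog page extension per second, which matches the interval between the pg_fsync calls captured with SystemTap later in this article.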
下麵的代碼是clog的一些常用宏。
src/backend/access/transam/clog.c
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
* everywhere else in Postgres.
*
* Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
* CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
* and CLOG segment numbering at
* 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
* explicit notice of that fact in this module, except when comparing segment
* and page numbers in TruncateCLOG (see CLOGPagePrecedes).
*/
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
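As a quick illustration of the arithmetic, here is a standalone sketch (not PostgreSQL code) that mirrors the macros above, assuming the default BLCKSZ of 8192:

```c
#include <stdio.h>

#define BLCKSZ               8192                           /* assumed default block size */
#define CLOG_XACTS_PER_BYTE  4
#define CLOG_XACTS_PER_PAGE  (BLCKSZ * CLOG_XACTS_PER_BYTE) /* 32768 */

int
main(void)
{
	unsigned int xids[] = {100, 32767, 32768, 65536};

	for (int i = 0; i < 4; i++)
	{
		unsigned int xid = xids[i];

		/* page number and index within the page, as TransactionIdToPage/TransactionIdToPgIndex compute them */
		printf("xid %u -> page %u, index %u%s\n",
			   xid,
			   xid / CLOG_XACTS_PER_PAGE,
			   xid % CLOG_XACTS_PER_PAGE,
			   (xid % CLOG_XACTS_PER_PAGE) == 0 ?
			   "  (first XID of a page: ExtendCLOG zeroes a new page)" : "");
	}
	return 0;
}
```

Run, this prints page 0 for XIDs 100 and 32767, and page 1 / index 0 and page 2 / index 0 for 32768 and 65536, i.e. the XIDs whose allocation triggers a page extension.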
Check the database block size:
postgres@digoal-> pg_controldata |grep block
Database block size: 8192
WAL block size: 8192
We can use stap (SystemTap) to trace whether pg_fsync is called. If you want to observe the backend process flushing dirty clog pages itself, make the checkpoint interval large and keep the number of clog shared buffer pages small (the default formula is shown below); you will then see the backend process flushing dirty clog pages.
Size
CLOGShmemBuffers(void)
{
return Min(32, Max(4, NBuffers / 512));
}
Functions to trace:
src/backend/access/transam/slru.c
SlruPhysicalWritePage
......
SlruFileName(ctl, path, segno);
fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY,
S_IRUSR | S_IWUSR);
......
src/backend/storage/file/fd.c
OpenTransientFile
pg_fsync(fd)
The stap script:
```shell
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
```
Start a pgbench job that calls txid_current() to allocate new transaction IDs.
postgres@digoal-> cat 7.sql
select txid_current();
Roughly 32K requests per second:
postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
progress: 242.0 s, 32567.3 tps, lat 0.030 ms stddev 0.179
progress: 243.0 s, 33656.6 tps, lat 0.029 ms stddev 0.038
progress: 244.0 s, 33948.1 tps, lat 0.029 ms stddev 0.021
progress: 245.0 s, 32996.8 tps, lat 0.030 ms stddev 0.046
progress: 246.0 s, 34156.7 tps, lat 0.029 ms stddev 0.015
progress: 247.0 s, 33259.5 tps, lat 0.029 ms stddev 0.074
progress: 248.0 s, 32979.6 tps, lat 0.030 ms stddev 0.043
progress: 249.0 s, 32892.6 tps, lat 0.030 ms stddev 0.039
progress: 250.0 s, 33090.7 tps, lat 0.029 ms stddev 0.020
progress: 251.0 s, 33238.3 tps, lat 0.029 ms stddev 0.017
progress: 252.0 s, 32341.3 tps, lat 0.030 ms stddev 0.045
progress: 253.0 s, 31999.0 tps, lat 0.030 ms stddev 0.167
progress: 254.0 s, 33332.6 tps, lat 0.029 ms stddev 0.056
progress: 255.0 s, 30394.6 tps, lat 0.032 ms stddev 0.027
progress: 256.0 s, 31862.7 tps, lat 0.031 ms stddev 0.023
progress: 257.0 s, 31574.0 tps, lat 0.031 ms stddev 0.112
Trace the backend process:
postgres@digoal-> ps -ewf|grep postgres
postgres 2921 1883 29 09:37 pts/1 00:00:05 pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
postgres 2924 1841 66 09:37 ? 00:00:13 postgres: postgres postgres [local] SELECT
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 2924 >./stap.log 2>&1
Extract the pg_clog-related trace results from the log:
0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
An fsync occurs roughly once per second:
postgres=# select 1441503928836-1441503927731;
?column?
----------
1105
(1 row)
postgres=# select 1441503929819-1441503928836;
?column?
----------
983
(1 row)
The pgbench output above shows about 32,000 transactions per second, which is exactly the number of transactions one clog page holds (the block size is 8 kB in this example).
(Each transaction needs 2 bits, so each byte stores 4 transactions.)
8192*4=32768
If you want to observe the case where the backend process does not flush dirty clog buffer pages, shorten the checkpoint interval (or run checkpoints manually) and also enlarge the number of clog buffer pages, for example:
Size
CLOGShmemBuffers(void)
{
return Min(1024, Max(4, NBuffers / 2));
}
With the same stap script, you will then no longer observe the backend process flushing dirty clog pages itself.
Based on the analysis above, if you find backend processes fsyncing pg_clog frequently, a few optimizations can be considered.
1. Every extension of a pg_clog file changes the file size, so at that point even an fdatasync from the backend process has to write the file system metadata journal (taking EXT4 as an example, and assuming the data mount option is not writeback); journal commits are serialized across the whole file system, so this easily causes stalls.
So it helps somewhat if the backend process, when picking a victim clog page, avoids the most recent page numbers (ideally, avoids pages in the most recent clog segment file altogether).
Another approach is to first call sync_file_range with SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER, which does not write metadata, and only call pg_fsync after the data has been written out; this reduces the time spent waiting for the data portion of the fsync (a sketch follows after this list).
2. Preallocate pg_clog files. A single pg_clog segment file holds SLRU_PAGES_PER_SEGMENT (32) pages, i.e. 32 * BLCKSZ (256 kB with 8 kB blocks). One could try preallocating the whole file instead of growing it, and changing its size, on every page extension.
3. Defer the backend process's fsync requests to checkpoint processing.
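A minimal sketch of the sync_file_range idea from point 1 above (Linux-specific; the helper name flush_clog_segment is hypothetical, and this is not code from the PostgreSQL source tree):

```c
/*
 * Hypothetical sketch: push dirty data of a clog segment to disk with
 * sync_file_range(), which does not touch file system metadata, so the
 * final fsync() has less data writeback to wait for.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int
flush_clog_segment(int fd, off_t offset, off_t nbytes)
{
	/* write back the dirty range and wait for it, without a metadata sync */
	if (sync_file_range(fd, offset, nbytes,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER) != 0)
		return -1;

	/* metadata (e.g. size changes from extending the file) still needs fsync */
	return fsync(fd);
}
```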
[References]
https://wiki.postgresql.org/wiki/Hint_Bits
https://blog.163.com/digoal@126/blog/static/1638770402015840480734/
src/backend/access/transam/varsup.c
src/backend/access/transam/clog.c
src/backend/access/transam/slru.c
src/include/access/slru.h
src/backend/access/transam/subtrans.c
src/backend/storage/file/fd.c