Alibaba Cloud Technical Community [Yunqi]

A Demo Implementation of an Oracle RAC-like Product on PostgreSQL

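Amazon's Aurora database engine supports an architecture with a single copy of storage, one writer, and multiple readers.
This architecture resembles Oracle RAC in that the storage is shared, but only one instance can perform writes; the other instances can only perform reads.
Compared with a traditional replication-based setup with one primary and multiple readers, it saves storage cost and network bandwidth.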

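We can use PostgreSQL's hot standby mode to simulate this shared-storage architecture with one writer and multiple readers, but a few points need attention. A hot standby also writes to the database: during recovery, for example, it modifies the control file, data files, and so on, and these writes are redundant here.
In addition, a lot of state is held in memory, so the in-memory state has to be kept up to date as well.
The following also need attention: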

pg_xlog
pg_log
pg_clog
pg_multixact
postgresql.conf
recovery.conf
postmaster.pid

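In the end, implementing the one-primary, multiple-standby architecture requires changes to the PG kernel:
1. These files should exist once per instance:
postgresql.conf, recovery.conf, postmaster.pid, pg_control
2. The hot standby should not perform the actual recovery work, but it still has to update its own in-memory state (the current OID, XID, and so on) and its own pg_control.
3. Across instances, OS dirty pages and database shared buffer dirty pages must be synchronized from the primary to the standby nodes.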

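Simulation:
Without changing any code, start multiple instances on the same host for testing. This runs into some problems. (The problems, and the code changes that fix them, are described later.)
Primary instance configuration files: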

 # vi postgresql.conf
listen_addresses='0.0.0.0'
port=1921
max_connections=100
unix_socket_directories='.'
ssl=on
ssl_ciphers='EXPORT40'
shared_buffers=512MB
huge_pages=try
max_prepared_transactions=0
max_stack_depth=100kB
dynamic_shared_memory_type=posix
max_files_per_process=500
wal_level=logical
fsync=off
synchronous_commit=off
wal_sync_method=open_datasync
full_page_writes=off
wal_log_hints=off
wal_buffers=16MB
wal_writer_delay=10ms
checkpoint_segments=8
archive_mode=off
archive_command='/bin/date'
max_wal_senders=10
max_replication_slots=10
hot_standby=on
wal_receiver_status_interval=1s
hot_standby_feedback=on
enable_bitmapscan=on
enable_hashagg=on
enable_hashjoin=on
enable_indexscan=on
enable_material=on
enable_mergejoin=on
enable_nestloop=on
enable_seqscan=on
enable_sort=on
enable_tidscan=on
log_destination='csvlog'
logging_collector=on
log_directory='pg_log'
log_truncate_on_rotation=on
log_rotation_size=10MB
log_checkpoints=on
log_connections=on
log_disconnections=on
log_duration=off
log_error_verbosity=verbose
log_line_prefix='%i
log_statement='none'
log_timezone='PRC'
autovacuum=on
log_autovacuum_min_duration=0
autovacuum_vacuum_scale_factor=0.0002
autovacuum_analyze_scale_factor=0.0001
datestyle='iso,
timezone='PRC'
lc_messages='C'
lc_monetary='C'
lc_numeric='C'
lc_time='C'
default_text_search_config='pg_catalog.english'

 # vi recovery.done
recovery_target_timeline='latest'
standby_mode=on
primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'

 # vi pg_hba.conf
local   replication     postgres                                trust
host    replication     postgres 127.0.0.1/32            trust

Start the primary instance

postgres@digoal-> pg_ctl start

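To start the read-only instance, postmaster.pid must be deleted first. Newer PostgreSQL versions include a patch that shuts the database down automatically if this file is removed, so either avoid the latest PostgreSQL or remove that patch first.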

postgres@digoal-> cd $PGDATA
postgres@digoal-> mv recovery.done recovery.conf

postgres@digoal-> rm -f postmaster.pid
postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"

Check the current control file state. The read-only instance has modified the control file, as described earlier.

postgres@digoal-> pg_controldata |grep state
Database cluster state:               in archive recovery

Connect to the primary instance, create a table, and insert test data.

psql -p 1921
postgres=# create table test1(id int);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10

Query the inserted data on the read-only instance.

postgres@digoal-> psql -h 127.0.0.1 -p 1922
postgres=# select * from test1;
 id 
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

After the primary instance runs a checkpoint, the control file state changes back to production.

psql -p 1921
postgres=# checkpoint;
CHECKPOINT

postgres@digoal-> pg_controldata |grep state
Database cluster state:               in production

But once a checkpoint is run on the read-only instance, the state changes back to recovery.

postgres@digoal-> psql -h 127.0.0.1 -p 1922
psql (9.4.4)
postgres=# checkpoint;
CHECKPOINT

postgres@digoal-> pg_controldata |grep state
Database cluster state:               in archive recovery

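Note one problem with the example above. With streaming replication, the standby copies XLOG records from the primary over the network and overwrites the corresponding offsets in the same XLOG files that have already been written. This is a problem because the primary may end up seeing inconsistent data. For example, if a data block has been modified several times and the read-only instance overwrites it with an older version during recovery, the primary then sees the old version of the block. This is addressed later by preventing the read-only instance from writing recovered data.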

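On the other hand, a PostgreSQL standby reads XLOG for recovery from three places (streaming, pg_xlog, and restore_command), so in a shared-storage environment there is no need for streaming replication at all; reading directly from the pg_xlog directory is enough.
Edit recovery.conf and comment out the following parameter: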

 # primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'
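As a rough illustration of why commenting out primary_conninfo keeps the standby on pg_xlog, the sketch below models the source selection in a very simplified form, loosely following the state machine described in the xlog.c comment quoted in the references. `WalSource` and `next_source_on_failure` are names made up for this sketch, and the real logic in WaitForWALToBecomeAvailable has more states and conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified WAL sources; the names are this sketch's own. */
typedef enum {
    SRC_ARCHIVE_OR_PG_XLOG,   /* restore_command or the pg_xlog directory */
    SRC_STREAM                /* walreceiver connected to the primary */
} WalSource;

/*
 * Pick the source to try next after reading from `current` failed.
 * Without primary_conninfo there is nothing to stream from, so the
 * standby keeps polling the archive and pg_xlog.
 */
WalSource next_source_on_failure(WalSource current, bool have_primary_conninfo)
{
    assert(current == SRC_ARCHIVE_OR_PG_XLOG || current == SRC_STREAM);
    if (current == SRC_ARCHIVE_OR_PG_XLOG && have_primary_conninfo)
        return SRC_STREAM;
    return SRC_ARCHIVE_OR_PG_XLOG;   /* loop back to the first step */
}
```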

Restart the read-only instance.

pg_ctl stop -m fast
postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"

Retest data consistency:
Primary instance

postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10

Read-only instance

postgres=# select count(*) from test1;
 count 
-------
    60
(1 row)

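So far, several problems remain unsolved:
1 . The standby still has to perform recovery, and the writes that recovery generates grow with the number of read-only instances.
Recovery does have one benefit: it solves the dirty page problem, because dirty pages in the primary's shared buffers no longer need to be synchronized separately to the read-only instances.
Recovery also introduces a serious bug: replay may operate on the same data page that the primary is currently modifying.
Or replay may restore a block to an older state when the primary has in fact already updated it, leaving the block inconsistent. If the read-only instance is then shut down and the primary is shut down immediately afterwards, the block is still inconsistent when the database comes back up.
2 . The standby still modifies the control file.
3 . To start an instance in the same $PGDATA, postmaster.pid has to be deleted first.
4 . An instance whose postmaster.pid has been deleted can only be shut down by finding the pid of the postgres main process and sending it a signal with kill -s 15, 2, or 3 (SIGTERM, SIGINT, or SIGQUIT, matching the pg_ctl shutdown modes below).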

static void
set_mode(char *modeopt)
{
        if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)
        {
                shutdown_mode = SMART_MODE;
                sig = SIGTERM;
        }
        else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)
        {
                shutdown_mode = FAST_MODE;
                sig = SIGINT;
        }
        else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)
        {
                shutdown_mode = IMMEDIATE_MODE;
                sig = SIGQUIT;
        }
        else
        {
                write_stderr(_("%s: unrecognized shutdown mode \"%s\"\n"), progname, modeopt);
                do_advice();
                exit(1);
        }
}
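Since pg_ctl cannot stop an instance whose postmaster.pid is gone, the same mapping can be applied by hand. The following is a minimal sketch assuming a POSIX system; `shutdown_signal_for_mode` and `stop_postmaster` are hypothetical helpers written for this illustration, and the PID must be found separately (for example with ps).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* Map a pg_ctl-style shutdown mode to its signal; -1 if unrecognized. */
int shutdown_signal_for_mode(const char *mode)
{
    assert(mode != NULL);
    if (strcmp(mode, "s") == 0 || strcmp(mode, "smart") == 0)
        return SIGTERM;
    if (strcmp(mode, "f") == 0 || strcmp(mode, "fast") == 0)
        return SIGINT;
    if (strcmp(mode, "i") == 0 || strcmp(mode, "immediate") == 0)
        return SIGQUIT;
    return -1;
}

/*
 * Send the signal for `mode` to the postmaster whose PID was found by hand.
 * Returns 0 on success, -1 on a bad mode, a bad PID, or if kill() fails.
 */
int stop_postmaster(pid_t postmaster_pid, const char *mode)
{
    int sig = shutdown_signal_for_mode(mode);

    if (sig < 0 || postmaster_pid <= 0)
        return -1;
    return kill(postmaster_pid, sig);
}
```

Calling `stop_postmaster(pid, "fast")` is then the equivalent of `kill -s 2` on that PID.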

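5 . When the primary removes a rel page, replay on the read-only instance fails with an error saying that the rel page referenced by the xlog record does not exist (an invalid page). This is another problem caused by the read-only instance having to replay the log. It is very easy to reproduce: just drop a table.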

2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,20,,2015-10-09 13:29:15 CST,1/0,0,WARNING,01000,"page 8 of relation base/151898/185251 does not exist",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"report_invalid_page, xlogutils.c:67",""  
2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,21,,2015-10-09 13:29:15 CST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"log_invalid_page, xlogutils.c:91",""  

This error can be bypassed for now by commenting out the block below, so that the demo can continue.

src/backend/access/transam/xlogutils.c
/* Log a reference to an invalid page */
static void
log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
                                 bool present)
{
  //////
        /*
         * Once recovery has reached a consistent state, the invalid-page table
         * should be empty and remain so. If a reference to an invalid page is
         * found after consistency is reached, PANIC immediately. This might seem
         * aggressive, but it's better than letting the invalid reference linger
         * in the hash table until the end of recovery and PANIC there, which
         * might come only much later if this is a standby server.
         */
        //if (reachedConsistency)
        //{
        //      report_invalid_page(WARNING, node, forkno, blkno, present);
        //      elog(PANIC, "WAL contains references to invalid pages");
        //}

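6 . Because this example runs within a single operating system, the OS dirty page cache problem did not come up. Across different hosts, the OS dirty page cache has to be either synchronized or eliminated, for example by using direct IO or a cluster file system such as gfs2.
Turning this into a product requires solving at least the problems above.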

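First, solve the problems of the aurora instance writing data files, writing the control file, and running checkpoints.
1 . Add a startup parameter that indicates whether this instance is an aurora instance (that is, a read-only instance)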

 # vi src/backend/utils/misc/guc.c
/******** option records follow ********/

static struct config_bool ConfigureNamesBool[] =
{
        {
                {"aurora", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
                        gettext_noop("Marks this instance as a read-only aurora instance."),
                        NULL
                },
                &aurora,
                false,
                NULL, NULL, NULL
        },

2 . Add the variable

 # vi src/include/postmaster/postmaster.h
extern bool aurora;

3 . Prevent the aurora instance from updating the control file

 # vi src/backend/access/transam/xlog.c
 #include "postmaster/postmaster.h"
bool aurora;

void
UpdateControlFile(void)
{
        if (aurora) return;

4 . Prevent the bgwriter process on the aurora instance from writing dirty buffers

 # vi src/backend/postmaster/bgwriter.c
 #include "postmaster/postmaster.h"
/* aurora is defined once in xlog.c; postmaster.h declares it extern */

/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
BackgroundWriterMain(void)
{
  //////
        pg_usleep(1000000L);

        /*
         * If an exception is encountered, processing resumes here.
         *
         * See notes in postgres.c about the design of this coding.
         */
        if (!aurora && sigsetjmp(local_sigjmp_buf, 1) != 0)
        {

  //////
                /*
                 * Do one cycle of dirty-buffer writing.
                 */
                if (!aurora) {
                can_hibernate = BgBufferSync();
  //////
                }
                pg_usleep(1000000L);
        }
}

5 . Prevent the checkpointer process on the aurora instance from running checkpoints

 # vi src/backend/postmaster/checkpointer.c
 #include "postmaster/postmaster.h"
/* aurora is defined once in xlog.c; postmaster.h declares it extern */
  //////
/*
 * Main entry point for checkpointer process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
CheckpointerMain(void)
{
  //////
        /*
         * Loop forever
         */
        for (;;)
        {
                bool            do_checkpoint = false;
                int                     flags = 0;
                pg_time_t       now;
                int                     elapsed_secs;
                int                     cur_timeout;
                int                     rc = 0; /* stays 0 on aurora instances, where WaitLatch is skipped */

                pg_usleep(100000L);

                /* Clear any already-pending wakeups */
                if (!aurora)  ResetLatch(&MyProc->procLatch);

                /*
                 * Process any requests or signals received recently.
                 */
                if (!aurora) AbsorbFsyncRequests();

                if (!aurora && got_SIGHUP)
                {
                        got_SIGHUP = false;
                        ProcessConfigFile(PGC_SIGHUP);

                        /*
                         * Checkpointer is the last process to shut down, so we ask it to
                         * hold the keys for a range of other tasks required most of which
                         * have nothing to do with checkpointing at all.
                         *
                         * For various reasons, some config values can change dynamically
                         * so the primary copy of them is held in shared memory to make
                         * sure all backends see the same value.  We make Checkpointer
                         * responsible for updating the shared memory copy if the
                         * parameter setting changes because of SIGHUP.
                         */
                        UpdateSharedMemoryConfig();
                }
                if (!aurora && checkpoint_requested)
                {
                        checkpoint_requested = false;
                        do_checkpoint = true;
                        BgWriterStats.m_requested_checkpoints++;
                }
                if (!aurora && shutdown_requested)
                {
                        /*
                         * From here on, elog(ERROR) should end with exit(1), not send
                         * control back to the sigsetjmp block above
                         */
                        ExitOnAnyError = true;
                        /* Close down the database */
                        ShutdownXLOG(0, 0);
                        /* Normal exit from the checkpointer is here */
                        proc_exit(0);           /* done */
                }

                /*
                 * Force a checkpoint if too much time has elapsed since the last one.
                 * Note that we count a timed checkpoint in stats only when this
                 * occurs without an external request, but we set the CAUSE_TIME flag
                 * bit even if there is also an external request.
                 */
                now = (pg_time_t) time(NULL);
                elapsed_secs = now - last_checkpoint_time;
                if (!aurora && elapsed_secs >= CheckPointTimeout)
                {
                        if (!do_checkpoint)
                                BgWriterStats.m_timed_checkpoints++;
                        do_checkpoint = true;
                        flags |= CHECKPOINT_CAUSE_TIME;
                }

                /*
                 * Do a checkpoint if requested.
                 */
                if (!aurora && do_checkpoint)
                {
                        bool            ckpt_performed = false;
                        bool            do_restartpoint;

                        /* use volatile pointer to prevent code rearrangement */
                        volatile CheckpointerShmemStruct *cps = CheckpointerShmem;

                        /*
                         * Check if we should perform a checkpoint or a restartpoint. As a
                         * side-effect, RecoveryInProgress() initializes TimeLineID if
                         * it's not set yet.
                         */
                        do_restartpoint = RecoveryInProgress();

                        /*
                         * Atomically fetch the request flags to figure out what kind of a
                         * checkpoint we should perform, and increase the started-counter
                         * to acknowledge that we've started a new checkpoint.
                         */
                        SpinLockAcquire(&cps->ckpt_lck);
                        flags |= cps->ckpt_flags;
                        cps->ckpt_flags = 0;
                        cps->ckpt_started++;
                        SpinLockRelease(&cps->ckpt_lck);

                        /*
                         * The end-of-recovery checkpoint is a real checkpoint that's
                         * performed while we're still in recovery.
                         */
                        if (flags & CHECKPOINT_END_OF_RECOVERY)
                                do_restartpoint = false;
  //////
                        ckpt_active = false;
                }

                /* Check for archive_timeout and switch xlog files if necessary. */
                if (!aurora) CheckArchiveTimeout();
                /*
                 * Send off activity statistics to the stats collector.  (The reason
                 * why we re-use bgwriter-related code for this is that the bgwriter
                 * and checkpointer used to be just one process.  It's probably not
                 * worth the trouble to split the stats support into two independent
                 * stats message types.)
                 */
                if (!aurora) pgstat_send_bgwriter();

                /*
                 * Sleep until we are signaled or it's time for another checkpoint or
                 * xlog file switch.
                 */
                now = (pg_time_t) time(NULL);
                elapsed_secs = now - last_checkpoint_time;
                if (elapsed_secs >= CheckPointTimeout)
                        continue;                       /* no sleep for us ... */
                cur_timeout = CheckPointTimeout - elapsed_secs;
                if (!aurora && XLogArchiveTimeout > 0 && !RecoveryInProgress())
                {
                        elapsed_secs = now - last_xlog_switch_time;
                        if (elapsed_secs >= XLogArchiveTimeout)
                                continue;               /* no sleep for us ... */
                        cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
                }

                if (!aurora) rc = WaitLatch(&MyProc->procLatch,
                                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                           cur_timeout * 1000L /* convert to ms */ );

                /*
                 * Emergency bailout if postmaster has died.  This is to avoid the
                 * necessity for manual cleanup of all postmaster children.
                 */
                if (rc & WL_POSTMASTER_DEATH)
                        exit(1);
        }
}
  //////
/* SIGINT: set flag to run a normal checkpoint right away */
static void
ReqCheckpointHandler(SIGNAL_ARGS)
{
        int                     save_errno;

        /* aurora instances ignore checkpoint requests */
        if (aurora)
                return;
        save_errno = errno;

        checkpoint_requested = true;
        if (MyProc)
                SetLatch(&MyProc->procLatch);

        errno = save_errno;
}
  //////
/*
 * AbsorbFsyncRequests
 *              Retrieve queued fsync requests and pass them to local smgr.
 *
 * This is exported because it must be called during CreateCheckPoint;
 * we have to be sure we have accepted all pending requests just before
 * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
 * non-checkpointer processes, do nothing if not checkpointer.
 */
void
AbsorbFsyncRequests(void)
{
        CheckpointerRequest *requests = NULL;
        CheckpointerRequest *request;
        int                     n;

        if (!AmCheckpointerProcess() || aurora)
                return;
  //////

6 . Prevent the checkpoint command from being run manually on the aurora instance.

 # vi src/backend/tcop/utility.c
 #include "postmaster/postmaster.h"
/* aurora is defined once in xlog.c; postmaster.h declares it extern */

  //////
void
standard_ProcessUtility(Node *parsetree,
                                                const char *queryString,
                                                ProcessUtilityContext context,
                                                ParamListInfo params,
                                                DestReceiver *dest,
                                                char *completionTag)
{
  //////
                case T_CheckPointStmt:
                        if (!superuser() || aurora)
                                ereport(ERROR,
                                                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                                                 errmsg("must be superuser to do CHECKPOINT")));

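After making the code changes above, recompile. The result is now close to a demo.
The aurora instance no longer updates the control file, writes data files, or runs checkpoints, which is the result we want.
When starting the read-only instance, add the parameter aurora=true to start it as an aurora instance.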

pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922 -c aurora=true"

Turning this into a product still requires working out many details, though. This is only a demo. Keep it up, Alibaba Cloud RDS folks.

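There is also a safer approach to the shared-storage, multiple-reader architecture, which requires storing two copies of the data:
One copy is the primary instance's storage. The primary works on it alone, and no other instance touches it.
The other copy belongs to the standby. This copy serves as the shared storage used by the multiple read-only instances.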

[References]
1 . https://aws.amazon.com/cn/rds/aurora/
2 . src/backend/access/transam/xlog.c

/*
 * Open the WAL segment containing WAL position 'RecPtr'.
 *
 * The segment can be fetched via restore_command, or via walreceiver having
 * streamed the record, or it can already be present in pg_xlog. Checking
 * pg_xlog is mainly for crash recovery, but it will be polled in standby mode
 * too, in case someone copies a new segment directly to pg_xlog. That is not
 * documented or recommended, though.
 *
 * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
 * prepare to read WAL starting from RedoStartLSN after this.
 *
 * 'RecPtr' might not point to the beginning of the record we're interested
 * in, it might also point to the page or segment header. In that case,
 * 'tliRecPtr' is the position of the WAL record we're interested in. It is
 * used to decide which timeline to stream the requested WAL from.
 *
 * If the record is not immediately available, the function returns false
 * if we're not in standby mode. In standby mode, waits for it to become
 * available.
 *
 * When the requested record becomes available, the function opens the file
 * containing it (if not open already), and returns true. When end of standby
 * mode is triggered by the user, and there is no more WAL available, returns
 * false.
 */
static bool
WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                                        bool fetching_ckpt, XLogRecPtr tliRecPtr)
{
  //////
        static pg_time_t last_fail_time = 0;
        pg_time_t       now;

        /*-------
         * Standby mode is implemented by a state machine:
         *
         * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
         *        pg_xlog (XLOG_FROM_XLOG)
         * 2. Check trigger file
         * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
         * 4. Rescan timelines
         * 5. Sleep 5 seconds, and loop back to 1.
         *
         * Failure to read from the current source advances the state machine to
         * the next state.
         *
         * 'currentSource' indicates the current state. There are no currentSource
         * values for "check trigger", "rescan timelines", and "sleep" states,
         * those actions are taken when reading from the previous source fails, as
         * part of advancing to the next state.
         *-------
         */

Last updated: 2017-04-01 13:37:07
