Estimating Database Performance from Micro, Macro, and Precise Perspectives (Do Your Own Product Selection and Budgeting)
Tags
PostgreSQL , PPAS , Greenplum , HybridDB for PostgreSQL , performance , estimation , pgbench , statistics , explain algorithms , statistics export/import
Background
When preparing a budget, an indispensable step is estimating how much hardware is needed.
The business side is usually asked to provide some numbers, such as user count, PV, and UV. But such estimates rely purely on experience; the method is crude and inaccurate.
So how do you estimate how much hardware, or rather what hardware specification, is needed to support your future business?
For the PostgreSQL family of database products, I will introduce three estimation methods:
1. Micro estimation (relatively accurate)
2. Macro estimation (helpful for product selection, less helpful for sizing; rather rough)
3. Precise estimation (the most accurate, but it requires deep familiarity with the business and an accurate grasp of future bottlenecks)
I. Micro estimation
When we interact with the database through SQL, how does the database actually execute the SQL?
It first parses the SQL, then generates the candidate execution paths, picks the optimal one, and executes it; the crucial step is picking the optimal path. PostgreSQL has a cost-based optimizer (CBO): it chooses by cost.
This mentions cost, so how is cost computed? The scan method and the statistics are combined to estimate how many data blocks and how many records must be scanned, and those estimates are fed into the cost-estimation formula of the corresponding scan method.
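For example (a minimal sketch; the table t and its contents are made up for illustration), EXPLAIN shows which path the cost-based optimizer chose and at what estimated cost:
create table t (id int primary key, info text);
insert into t select generate_series(1, 1000000), 'hello';
analyze t;
explain select * from t where id = 1;    -- few rows qualify: the cheaper Index Scan is chosen
explain select * from t where id > 10;   -- almost all rows qualify: the Seq Scan is cheaper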
What costs make up a QUERY
1. Total cost consists of:
IO cost and CPU cost.
2. IO cost consists of:
sequential IO cost and random IO cost.
3. CPU cost consists of:
the cost of fetching tuples or ITEMs from indexes, TOAST indexes, heap tables, and TOAST tables;
the cost of operators and functions processing rows;
the cost of processing JOINs, and so on.
How a QUERY executes and passes cost along
Once the plan is generated, the QUERY is executed along the execution tree.
The tree consists of a number of nodes; execution moves from one node to the next, like a relay race.
So what is passed between nodes?
The Path data structure, which mainly carries (rows, startup_cost, total_cost) for each node:
rows: how many qualifying rows this node outputs to the next node.
startup_cost: the cost for this node to produce its first qualifying record.
total_cost: the cost for this node to produce all qualifying records.
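These three fields are exactly what EXPLAIN prints on every plan node. A hypothetical example (the numbers are illustrative only):
explain select * from t order by info;
                             QUERY PLAN
---------------------------------------------------------------------
 Sort  (cost=117010.84..119510.84 rows=1000000 width=14)
   Sort Key: info
   ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=14)
-- Each node reports (cost=startup_cost..total_cost rows=... width=...).
-- The Sort must consume and sort its entire input before emitting the
-- first row, so its startup_cost is nearly as large as its total_cost;
-- the Seq Scan can emit its first row immediately, so its startup_cost is 0.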
What kinds of execution nodes are there
There are many kinds of execution nodes; they can be read off from the cost-estimation code:
src/backend/optimizer/path/costsize.c
/*
* cost_seqscan
* Determines and returns the cost of scanning a relation sequentially.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_samplescan
* Determines and returns the cost of scanning a relation using sampling.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_gather
* Determines and returns the cost of gather path.
*
* 'rel' is the relation to be operated upon
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
* 'rows' may be used to point to a row estimate; if non-NULL, it overrides
* both 'rel' and 'param_info'. This is useful when the path doesn't exactly
* correspond to any particular RelOptInfo.
*/
cost_gather(GatherPath *path, PlannerInfo *root, RelOptInfo *rel,
ParamPathInfo *param_info, double *rows)
/*
* cost_gather_merge
* Determines and returns the cost of gather merge path.
*
* GatherMerge merges several pre-sorted input streams, using a heap that at
* any given instant holds the next tuple from each stream. If there are N
* streams, we need about N*log2(N) tuple comparisons to construct the heap at
* startup, and then for each output tuple, about log2(N) comparisons to
* replace the top heap entry with the next tuple from the same stream.
*/
cost_gather_merge(GatherMergePath *path, PlannerInfo *root, RelOptInfo *rel,
ParamPathInfo *param_info, Cost input_startup_cost, Cost input_total_cost, double *rows)
/*
* cost_index
* Determines and returns the cost of scanning a relation using an index.
*
* 'path' describes the indexscan under consideration, and is complete
* except for the fields to be set by this routine
* 'loop_count' is the number of repetitions of the indexscan to factor into
* estimates of caching behavior
*
* In addition to rows, startup_cost and total_cost, cost_index() sets the
* path's indextotalcost and indexselectivity fields. These values will be
* needed if the IndexPath is used in a BitmapIndexScan.
*
* NOTE: path->indexquals must contain only clauses usable as index
* restrictions. Any additional quals evaluated as qpquals may reduce the
* number of returned tuples, but they won't reduce the number of tuples
* we have to fetch from the table, so they don't reduce the scan cost.
*/
cost_index(IndexPath *path, PlannerInfo *root, double loop_count, bool partial_path)
/*
* cost_bitmap_heap_scan
* Determines and returns the cost of scanning a relation using a bitmap
* index-then-heap plan.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
* 'bitmapqual' is a tree of IndexPaths, BitmapAndPaths, and BitmapOrPaths
* 'loop_count' is the number of repetitions of the indexscan to factor into
* estimates of caching behavior
*
* Note: the component IndexPaths in bitmapqual should have been costed
* using the same loop_count.
*/
cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info, Path *bitmapqual, double loop_count)
/*
* cost_bitmap_tree_node
* Extract cost and selectivity from a bitmap tree node (index/and/or)
*/
cost_bitmap_tree_node(Path *path, Cost *cost, Selectivity *selec)
/*
* cost_bitmap_and_node
* Estimate the cost of a BitmapAnd node
*
* Note that this considers only the costs of index scanning and bitmap
* creation, not the eventual heap access. In that sense the object isn't
* truly a Path, but it has enough path-like properties (costs in particular)
* to warrant treating it as one. We don't bother to set the path rows field,
* however.
*/
cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root)
/*
* cost_bitmap_or_node
* Estimate the cost of a BitmapOr node
*
* See comments for cost_bitmap_and_node.
*/
cost_bitmap_or_node(BitmapOrPath *path, PlannerInfo *root)
/*
* cost_tidscan
* Determines and returns the cost of scanning a relation using TIDs.
*
* 'baserel' is the relation to be scanned
* 'tidquals' is the list of TID-checkable quals
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_tidscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
List *tidquals, ParamPathInfo *param_info)
/*
* cost_subqueryscan
* Determines and returns the cost of scanning a subquery RTE.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_functionscan
* Determines and returns the cost of scanning a function RTE.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_functionscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_tablefuncscan
* Determines and returns the cost of scanning a table function.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_tablefuncscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_valuesscan
* Determines and returns the cost of scanning a VALUES RTE.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
cost_valuesscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_ctescan
* Determines and returns the cost of scanning a CTE RTE.
*
* Note: this is used for both self-reference and regular CTEs; the
* possible cost differences are below the threshold of what we could
* estimate accurately anyway. Note that the costs of evaluating the
* referenced CTE query are added into the final plan as initplan costs,
* and should NOT be counted here.
*/
cost_ctescan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)
cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info)
/*
* cost_recursive_union
* Determines and returns the cost of performing a recursive union,
* and also the estimated output size.
*
* We are given Paths for the nonrecursive and recursive terms.
*/
cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
/*
* cost_sort
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*
* If the total volume exceeds sort_mem, we switch to a tape-style merge
* algorithm. There will still be about t*log2(t) tuple comparisons in
* total, but we will also need to write and read each tuple once per
* merge pass. We expect about ceil(logM(r)) merge passes where r is the
* number of initial runs formed and M is the merge order used by tuplesort.c.
* Since the average initial run should be about sort_mem, we have
* disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
* cpu = comparison_cost * t * log2(t)
*
* If the sort is bounded (i.e., only the first k result tuples are needed)
* and k tuples can fit into sort_mem, we use a heap method that keeps only
* k tuples in the heap; this will require about t*log2(k) tuple comparisons.
*
* The disk traffic is assumed to be 3/4ths sequential and 1/4th random
* accesses (XXX can't we refine that guess?)
*
* By default, we charge two operator evals per tuple comparison, which should
* be in the right ballpark in most cases. The caller can tweak this by
* specifying nonzero comparison_cost; typically that's used for any extra
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
* 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
* 'sort_mem' is the number of kilobytes of work memory allowed for the sort
* 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
*
* NOTE: some callers currently pass NIL for pathkeys because they
* can't conveniently supply the sort keys. Since this routine doesn't
* currently do anything with pathkeys anyway, that doesn't matter...
* but if it ever does, it should react gracefully to lack of key data.
* (Actually, the thing we'd most likely be interested in is just the number
* of sort keys, which all callers *could* supply.)
*/
cost_sort(Path *path, PlannerInfo *root, List *pathkeys,
Cost input_cost, double tuples, int width, Cost comparison_cost, int sort_mem, double limit_tuples)
/*
* cost_append
* Determines and returns the cost of an Append node.
*
* We charge nothing extra for the Append itself, which perhaps is too
* optimistic, but since it doesn't do any selection or projection, it is a
* pretty cheap node.
*/
cost_append(Path *path, List *subpaths, int num_nonpartial_subpaths)
/*
* cost_merge_append
* Determines and returns the cost of a MergeAppend node.
*
* MergeAppend merges several pre-sorted input streams, using a heap that
* at any given instant holds the next tuple from each stream. If there
* are N streams, we need about N*log2(N) tuple comparisons to construct
* the heap at startup, and then for each output tuple, about log2(N)
* comparisons to replace the top entry.
*
* (The effective value of N will drop once some of the input streams are
* exhausted, but it seems unlikely to be worth trying to account for that.)
*
* The heap is never spilled to disk, since we assume N is not very large.
* So this is much simpler than cost_sort.
*
* As in cost_sort, we charge two operator evals per tuple comparison.
*
* 'pathkeys' is a list of sort keys
* 'n_streams' is the number of input streams
* 'input_startup_cost' is the sum of the input streams' startup costs
* 'input_total_cost' is the sum of the input streams' total costs
* 'tuples' is the number of tuples in all the streams
*/
cost_merge_append(Path *path, PlannerInfo *root, List *pathkeys,
int n_streams, Cost input_startup_cost, Cost input_total_cost, double tuples)
/*
* cost_material
* Determines and returns the cost of materializing a relation, including
* the cost of reading the input data.
*
* If the total volume of data to materialize exceeds work_mem, we will need
* to write it to disk, so the cost is much higher in that case.
*
* Note that here we are estimating the costs for the first scan of the
* relation, so the materialization is all overhead --- any savings will
* occur only on rescan, which is estimated in cost_rescan.
*/
cost_material(Path *path, Cost input_startup_cost,
Cost input_total_cost, double tuples, int width)
/*
* cost_agg
* Determines and returns the cost of performing an Agg plan node,
* including the cost of its input.
*
* aggcosts can be NULL when there are no actual aggregate functions (i.e.,
* we are using a hashed Agg node just to do grouping).
*
* Note: when aggstrategy == AGG_SORTED, caller must ensure that input costs
* are for appropriately-sorted input.
*/
cost_agg(Path *path, PlannerInfo *root, AggStrategy aggstrategy,
const AggClauseCosts *aggcosts, int numGroupCols, double numGroups, Cost input_startup_cost, Cost input_total_cost, double input_tuples)
/*
* cost_windowagg
* Determines and returns the cost of performing a WindowAgg plan node,
* including the cost of its input.
*
* Input is assumed already properly sorted.
*/
cost_windowagg(Path *path, PlannerInfo *root, List *windowFuncs,
int numPartCols, int numOrderCols, Cost input_startup_cost, Cost input_total_cost, double input_tuples)
/*
* cost_group
* Determines and returns the cost of performing a Group plan node,
* including the cost of its input.
*
* Note: caller must ensure that input costs are for appropriately-sorted
* input.
*/
cost_group(Path *path, PlannerInfo *root, int numGroupCols, double numGroups,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples)
/*
* cost_subplan
* Figure the costs for a SubPlan (or initplan).
*
* Note: we could dig the subplan's Plan out of the root list, but in practice
* all callers have it handy already, so we make them pass it.
*/
cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
/*
* cost_rescan
* Given a finished Path, estimate the costs of rescanning it after
* having done so the first time. For some Path types a rescan is
* cheaper than an original scan (if no parameters change), and this
* function embodies knowledge about that. The default is to return
* the same costs stored in the Path. (Note that the cost estimates
* actually stored in Paths are always for first scans.)
*
* This function is not currently intended to model effects such as rescans
* being cheaper due to disk block caching; what we are concerned with is
* plan types wherein the executor caches results explicitly, or doesn't
* redo startup calculations, etc.
*/
cost_rescan(PlannerInfo *root, Path *path, Cost *rescan_startup_cost, /* output parameters */
Cost *rescan_total_cost)
/*
* cost_qual_eval
* Estimate the CPU costs of evaluating a WHERE clause.
* The input can be either an implicitly-ANDed list of boolean
* expressions, or a list of RestrictInfo nodes. (The latter is
* preferred since it allows caching of the results.)
* The result includes both a one-time (startup) component,
* and a per-evaluation component.
*/
cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root)
/*
* cost_qual_eval_node
* As above, for a single RestrictInfo or expression.
*/
cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root)
cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
How the cost of each node is estimated
Remember the relay baton mentioned above? The baton carries rows, which is crucial:
rows tells the next node roughly how many rows it will have to process.
But rows alone is not enough; cost factors are also needed, because each row may additionally involve operator evaluation, parallel-worker overhead, and so on.
These factors are defined as follows:
src/backend/optimizer/path/costsize.c
* seq_page_cost Cost of a sequential page fetch
* random_page_cost Cost of a non-sequential page fetch
* cpu_tuple_cost Cost of typical CPU time to process a tuple
* cpu_index_tuple_cost Cost of typical CPU time to process an index tuple
* cpu_operator_cost Cost of CPU time to execute an operator or function
* parallel_tuple_cost Cost of CPU time to pass a tuple from worker to master backend
* parallel_setup_cost Cost of setting up shared memory for parallelism
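These factors are ordinary configuration parameters, so they can be inspected and overridden with plain SQL, for example:
select name, setting from pg_settings
where name in ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
  'cpu_index_tuple_cost', 'cpu_operator_cost',
  'parallel_tuple_cost', 'parallel_setup_cost', 'effective_cache_size');
set random_page_cost = 1.1;  -- e.g. a session-level override suited to SSD storage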
Cost calculation diagram
For worked examples of the optimizer's cost estimation, see the documentation and the source:
https://www.postgresql.org/docs/10/static/planner-stats-details.html
src/backend/optimizer/path/costsize.c
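The simplest case, reproduced from the documentation linked above (the regression-test table tenk1, which occupies 358 pages and holds 10000 rows):
EXPLAIN SELECT * FROM tenk1;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
-- total_cost = (relpages * seq_page_cost) + (reltuples * cpu_tuple_cost)
--            = (358 * 1.0) + (10000 * 0.01) = 458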
From cost to execution time
Note that cost is an abstract quantity with no inherent link to time, but we can tie the two together.
That requires calibration: tuning the cost factors so that the reported cost equals the execution-time value.
I covered how to calibrate in an earlier article; please refer to:
《優化器成本因子校對 - PostgreSQL explain cost constants alignment to timestamp》
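The core of the loop is simple (a sketch only; see the article above for the full procedure): take a query whose plan you understand, compare estimated cost with actual time, and scale the factors until the two agree.
explain select count(*) from t;          -- note the estimated total cost, e.g. 16925.00
explain analyze select count(*) from t;  -- note the actual time, e.g. 98 ms
-- If this plan is dominated by sequential IO, scale seq_page_cost by
-- (actual_time / estimated_cost); repeat with CPU-bound and random-IO-bound
-- queries for the other factors, until reported cost reads as milliseconds.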
How to estimate performance when the user has no data
The method is actually simple; we only need the business side to provide a few things:
1. The table definitions
2. The SQL to be evaluated
3. Statistics; specifically, the columns annotated below.
View "pg_catalog.pg_stats"
Column | Type | Collation | Nullable | Default
------------------------+----------+-----------+----------+---------
schemaname | name | | |
tablename | name | | |
attname | name | | |
inherited | boolean | | |
null_frac | real | | | fraction of NULLs
avg_width | integer | | | average entry width in bytes
n_distinct | real | | | number (or ratio) of distinct values; -1 means all values are unique
most_common_vals | anyarray | | | most common values (MCVs)
most_common_freqs | real[] | | | frequencies of the most common values
histogram_bounds | anyarray | | | quantile bounds (column values) dividing the rows into equal-sized buckets
correlation | real | | | linear correlation between physical storage order and column values
most_common_elems | anyarray | | | for multi-value types (e.g. arrays), the most common elements
most_common_elem_freqs | real[] | | | frequencies of the most common elements
elem_count_histogram | real[] | | | histogram of distinct non-null element counts per row
Because pg_stats content can be exported and imported, all of this works without any actual data; the Postgres Pro edition provides exactly such a feature.
https://postgrespro.com/docs/postgresproee/9.6/dump-stat.html
4. The calibrated cost factors
* seq_page_cost Cost of a sequential page fetch
* random_page_cost Cost of a non-sequential page fetch
* cpu_tuple_cost Cost of typical CPU time to process a tuple
* cpu_index_tuple_cost Cost of typical CPU time to process an index tuple
* cpu_operator_cost Cost of CPU time to execute an operator or function
* parallel_tuple_cost Cost of CPU time to pass a tuple from worker to master backend
* parallel_setup_cost Cost of setting up shared memory for parallelism
int effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
With the above ingredients, running EXPLAIN on the SQL gives us the estimated execution time,
from which TPS and similar figures can then be derived.
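Putting the pieces together (a sketch; the statistics import relies on the Postgres Pro dump/restore feature mentioned above, and the table and query here are hypothetical):
-- on an empty instance:
-- 1. create the tables and indexes from the business side's DDL
-- 2. import the provided pg_stats content for those tables
-- 3. apply the calibrated cost factors, e.g.:
set seq_page_cost = 1.0;
set random_page_cost = 1.1;
-- 4. EXPLAIN the candidate SQL; with calibrated factors the reported
--    total cost reads directly as an estimated execution time:
explain select * from orders where user_id = 123;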
From execution time to TPS
There are several cases:
1. When CPU is the bottleneck: TPS = number of cores × (1 second / execution time).
2. When IO is the bottleneck: TPS = (disk bandwidth or IO capacity) / (read-write throughput or IO per query).
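For example (illustrative numbers only): with 32 cores and a CPU-bound query taking 2 ms, TPS ≈ 32 × (1000 / 2) = 16,000; if instead each query reads 8 MB and the storage sustains 2 GB/s, TPS ≈ 2048 MB/s ÷ 8 MB ≈ 256.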
II. Macro estimation
Macro estimation works from the characteristics of the products themselves.
Characteristics of the Greenplum and PostgreSQL product lines
1. RDS PostgreSQL 10 suits 10 TB ~ 100 TB workloads that are mainly OLTP with some OLAP, very close to the scenarios Oracle covers.
SQL:2011 compliant; million+ tpmC.
Supports multi-core parallel query.
Supports readable and writable OSS object-storage external tables.
Supports common and extended data types: JSON(B), hstore (KV), PostGIS spatial, pgRouting (routing, graph search), arrays, the ltree tree type, HLL cardinality estimation, smlar, imgsmlr, etc.
Supports a streaming-SQL plugin.
Supports a time-series plugin.
Supports btree, hash, GIN, GiST, SP-GiST, BLOOM, and BRIN indexes.
Supports server-side programming in PL/pgSQL and SQL.
Supports analytic syntax (multidimensional aggregation, window queries) and recursive queries (tree queries, graph search, and similar scenarios). Supports full-text search, fuzzy queries, similarity queries, and regex queries. Supports array-similarity and image-similarity search.
1.1 Suitable business scenarios:
TB+ scale OLTP (online transaction processing) + OLAP (real-time analytics).
Fuzzy, similarity, and regex search
Full-text search
IoT
Stream processing
Social applications
Graph search
Independent-event analysis
Hot/cold data separation
Asynchronous messaging
Similarity search over multi-value types and image feature vectors
Real-time data cleansing
GIS applications
Real-time search on arbitrary fields
... ...
1.2 Key strengths: features, stability, performance, high availability, reliability, Oracle compatibility, HTAP.
2. HybridDB for PostgreSQL (derived and improved from GPDB, the open-source edition of Greenplum) suits PB-scale real-time OLAP: the classic large-scale data warehouse.
SQL:2008 compliant; compatible with TPC-H and TPC-DS; decades of commercial field experience.
Supports readable and writable OSS object-storage external tables.
Supports common and extended data types: JSON, PostGIS spatial, arrays, HLL.
Supports bitmap, hash, and btree indexes.
Supports server-side programming in PL/Java.
Supports analytic syntax (multidimensional aggregation, window queries, MADlib machine learning) and full-text search syntax.
Supports column store, row store, compression, and hybrid storage.
Supports 4-stage aggregation and automatic inter-node redistribution.
Supports horizontal scale-out.
2.1 Suitable business scenarios:
PB+ scale real-time analytics (traditional statistics; multidimensional pivoting and audience selection across time, space, and other attributes; JOINs across arbitrary tables and dimensions).
2.2 Key strengths: analytic SQL compatibility, features, stability, performance, high availability, scalability.
3. RDS PPAS 9.6
PPAS, the commercial PostgreSQL edition: a TB+ scale OLTP+OLAP database, SQL:2011 compliant and Oracle compatible (SQL syntax, functions, PL/SQL stored procedures), with single-node multi-core parallel query and million+ tpmC.
Suitable business scenarios:
Migrating off Oracle at low cost with minimal application changes.
TB+ scale OLTP (online transaction processing) + OLAP (real-time analytics).
Key strengths: features, stability, performance, high availability, Oracle compatibility, HTAP.
III. Precise measurement
Precise measurement requires a clear understanding of the business: know its data structures and business logic, and simulate its transaction requests.
Benchmark method
Based on the actual business, design a test model, simulate the transaction requests, and benchmark to obtain the numbers.
Example
The two articles referenced above designed such a scenario, covering how to design the schema and the QUERYs, how to write the test scripts, and how to run the benchmark.
Here is one more example.
1. Design the table structure
create table a(id int8 primary key, info text, crt_time timestamp);
2. Design the SQL
insert into a values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info, crt_time=excluded.crt_time;
3. Design the test script
pgbench supports several random-number distributions, \sleep to simulate client-side business logic, and multiple threads; see the pgbench documentation for details.
vi test.sql
\set id random(1,100000000)
insert into a values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info, crt_time=excluded.crt_time;
4. Run the benchmark (choosing connection count and duration)
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 37100343
latency average = 0.103 ms
latency stddev = 0.282 ms
tps = 309166.975398 (including connections establishing)
tps = 309180.511436 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set id random(1,100000000)
0.103 insert into a values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info, crt_time=excluded.crt_time;
Documentation for pgbench, the PostgreSQL benchmarking client:
https://www.postgresql.org/docs/9.6/static/pgbench.html
Industry-standard benchmarks
1. TPC-B
The built-in test model of PostgreSQL's pgbench client is TPC-B(-like). See the pgbench documentation; it is straightforward.
https://www.postgresql.org/docs/9.6/static/pgbench.html
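For example (the scale factor and client counts here are arbitrary choices):
pgbench -i -s 1000                                   # initialize the built-in TPC-B(-like) tables at scale factor 1000
pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 120    # run the built-in script for 120 seconds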
2. pgbench for sysbench
This tests a number of scenarios popular in the MySQL world:
《PostgreSQL 使用 pgbench 測試 sysbench 相關case》
3. TPC-C
TPC-C is the industry-standard OLTP benchmark and involves fairly complex queries.
4. LinkBench
LinkBench is a Facebook test model for graph-related writes and queries:
《facebook linkbench 測試PostgreSQL社交關係圖譜場景性能》
IV. Some common performance metrics
Summary
This article presented three methods for estimating database performance, and hence the hardware investment required, as the business grows.
1. Micro estimation (relatively accurate)
Once the application is built, the schema and the QUERYs are fixed; only the data is not. The data can be estimated together with the business side: how many records, how many distinct values, what the correlation looks like, what the most common values look like, and so on.
Combining the calibrated cost factors, the statistics, the schema, and the queries yields the execution time of each kind of QUERY, from which you can assess how much hardware is needed to reach the target TPS.
2. Macro estimation (helpful for product selection, less helpful for sizing; rather rough)
Macro estimation suits product selection, since it is merely a summary of the characteristics of the various products.
3. Precise estimation (the most accurate, but it requires deep familiarity with the business and an accurate grasp of future bottlenecks)
This can be done as early as the start of business development, and it is relatively accurate:
design test scripts from the schema and the business logic, then extrapolate from the measured results according to the expected business growth.
Finally, the article also listed several industry-standard benchmark methods and some existing benchmark results for reference.
Last updated: 2017-09-24 16:33:19