畫像圈人 + 目標人群FEED行為透視

標簽

PostgreSQL , Greenplum , HybridDB for PostgreSQL

背景

本文講的是這樣的場景：畫像圈人和人群行為透視的結合。

數據量

標簽數 5W+   
  
用戶數 10W+   
  
用戶行為數據 10億+

業務目標：根據標簽圈出用戶群體，對這些用戶的行為進行透視分析。

第一步，圈人

第一步是圈出用戶群體，如果這一步的數據量非常龐大，則可以參考如下文章：

1、《多字段，任意組合(0建模) - 毫秒級實時圈人》

2、《萬億級營銷(圈人)邁入毫秒時代 - 實時推薦係統數據庫設計》

3、用阿裏雲RDS varbitx支持萬億user_tags級實時圈人

《阿裏雲RDS for PostgreSQL varbitx插件與實時畫像應用場景介紹》

《基於阿裏雲RDS PostgreSQL 打造實時用戶畫像推薦係統》

4、《音視圖(泛內容)網站透視分析 DB設計 - 阿裏雲(RDS、HybridDB) for PostgreSQL最佳實踐》

第一步加速的核心是高速定位到人群，前麵的案例中用到的技術點：

GIN倒排索引、倒置BITMAP、BITMAP SCAN、metascan、BRIN索引等，目標是降低掃描，聚集等。

但是，本案例的第一步數據量並不大，而是10幾萬，即使直接裸算也是很快的（並行）。

第二步，人群行為透視

第二步根據第一步得到的UID，在行為日誌中搜索對應人群的行為，並進行透視。涉及的數據量較大，但依舊有優化方法。

優化點，人群行為數據聚集存儲，例如按UID聚集，通過METASCAN、BRIN索引等方法降低行為數據表的IO掃描量。因為行為數據通常是APPEND ONLY的，因此按時間分區，按UID整理過去分區數據是可行的。

對比HybridDB 和 RDS PostgreSQL

在不涉及前麵提到的優化手段的前提下，對比一下HybridDB和RDS PostgreSQL的性能。讓用戶更加了解這兩個產品的差異和適應場景。

HybridDB for PostgreSQL

1、規格：

HDB PG 48個計算單元

2、建表、生成測試數據（畫像表10萬，行為表10億）

create table a(id int, tag int[]) with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, CHECKSUM=off);  
  
create table a_log (id int, c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int) with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, CHECKSUM=off);  
  
create or replace function gen_rand_ints(int, int, int) returns int[] as $$    
  select array(select (random()*($2-$1))::int+$1 from generate_series(1,$3));    
$$ language sql strict;    
  
insert into a select id, gen_rand_ints(1,50000,300) from generate_series(1,100000) t(id);  
  
insert into a_log select random()*100000, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500 from generate_series(1,1000000000);

3、空間占用

postgres=# select pg_size_pretty(pg_relation_size('a'));  
 pg_size_pretty   
----------------  
 76 MB  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('a_log'));  
 pg_size_pretty   
----------------  
 19 GB  
(1 row)

4、透視查詢語句，執行計劃

select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
  
postgres=# explain select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
                                                 QUERY PLAN                                                    
-------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=13869664.73..13869664.74 rows=1 width=48)  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=13869664.20..13869664.70 rows=1 width=48)  
         ->  Aggregate  (cost=13869664.20..13869664.21 rows=1 width=48)  
               ->  Hash Join  (cost=5031.25..13860616.22 rows=18850 width=16)  
                     Hash Cond: a_log.id = a.id  
                     ->  Append-only Columnar Scan on a_log  (cost=0.00..11344275.00 rows=20833334 width=20)  
                     ->  Hash  (cost=5030.00..5030.00 rows=3 width=4)  
                           ->  Append-only Columnar Scan on a  (cost=0.00..5030.00 rows=3 width=4)  
                                 Filter: tag @> '{1}'::integer[]  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
(11 rows)

5、透視查詢耗時，約6秒。

postgres=# select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
    sum    |       avg        | min | max   
-----------+------------------+-----+-----  
 754891765 | 250.136526278975 |   0 | 500  
(1 row)  
Time: 5983.161 ms

RDS for PostgreSQL 多核並行

1、規格：

RDS PG 10 獨占物理機

2、建表、生成測試數據（畫像表10萬，行為表10億）

create table a(id int, tag int[]) ;  
  
create table a_log (id int, c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int) ;  
  
create or replace function gen_rand_ints(int, int, int) returns int[] as $$    
  select array(select (random()*($2-$1))::int+$1 from generate_series(1,$3));    
$$ language sql strict;    
  
insert into a select id, gen_rand_ints(1,50000,300) from generate_series(1,100000) t(id);  
  
insert into a_log select random()*100000, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500, random()*500 from generate_series(1,1000000000);

3、空間占用

postgres=# select pg_size_pretty(pg_relation_size('a'));  
 pg_size_pretty   
----------------  
 125 MB  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('a_log'));  
 pg_size_pretty   
----------------  
 71 GB  
(1 row)

4、並行透視查詢語句，執行計劃

alter table a set (parallel_workers =4);  
alter table a_log set (parallel_workers =32);  
set max_parallel_workers_per_gather =32;  
  
select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
  
postgres=# explain select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
                                             QUERY PLAN                                               
----------------------------------------------------------------------------------------------------  
 Finalize Aggregate  (cost=2763595.31..2763595.32 rows=1 width=48)  
   ->  Gather  (cost=2763594.98..2763594.99 rows=32 width=48)  
         Workers Planned: 32  
         ->  Partial Aggregate  (cost=2763594.98..2763594.99 rows=1 width=48)  
               ->  Hash Join  (cost=5256.25..2762060.42 rows=153456 width=16)  
                     Hash Cond: (a_log.id = a.id)  
                     ->  Parallel Seq Scan on a_log  (cost=0.00..2638082.08 rows=31250008 width=20)  
                     ->  Hash  (cost=5250.00..5250.00 rows=500 width=4)  
                           ->  Seq Scan on a  (cost=0.00..5250.00 rows=500 width=4)  
                                 Filter: (tag @> '{1}'::integer[])  
(10 rows)

5、並行透視查詢耗時，約4.5秒。

postgres=# select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
    sum    |         avg          | min | max   
-----------+----------------------+-----+-----  
 787224592 | 249.9961077211050058 |   0 | 500  
(1 row)  
  
Time: 4550.672 ms (00:04.551)

RDS for PostgreSQL 非並行

1、非並行透視查詢語句，執行計劃

set max_parallel_workers_per_gather =0;  
  
select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
  
postgres=# explain select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
                                    QUERY PLAN                                      
----------------------------------------------------------------------------------  
 Aggregate  (cost=16179053.37..16179053.38 rows=1 width=48)  
   ->  Hash Join  (cost=5256.25..16129947.57 rows=4910580 width=16)  
         Hash Cond: (a_log.id = a.id)  
         ->  Seq Scan on a_log  (cost=0.00..12325584.56 rows=1000000256 width=20)  
         ->  Hash  (cost=5250.00..5250.00 rows=500 width=4)  
               ->  Seq Scan on a  (cost=0.00..5250.00 rows=500 width=4)  
                     Filter: (tag @> '{1}'::integer[])  
(7 rows)

2、非並行透視查詢耗時，約115秒。

postgres=# select sum(c1), avg(c2), min(c3), max(c4) from a_log join a on (a_log.id=a.id and a.tag @> array[1]);  
    sum    |         avg          | min | max   
-----------+----------------------+-----+-----  
 787224592 | 249.9961077211050058 |   0 | 500  
(1 row)  
  
Time: 115492.869 ms (01:55.493)

小結

產品	記錄數	空間占用
HDB PG 列存+壓縮	10億(11列)	19 GB
PG 10 堆表	10億(11列)	71 GB

產品	並行度、計算節點數	JOIN+數組過濾+透視耗時
HDB PG	48	6秒
PG 10	32	4.5秒
PG 10	1	115秒

Greenplum和PostgreSQL兩個產品的特色

1、PostgreSQL 10 適合以10TB ~ 100TB，OLTP為主，OLAP為輔的場景。與Oracle覆蓋的場景非常類似。

兼容SQL:2011，百萬+級tpmC。

支持多核並行計算。

支持可讀寫的OSS對象存儲外部表。

支持常用類型、擴展數據類型：JSON(B)、Hstore(KV), PostGIS空間數據庫、pgrouting(路由,圖式搜索)、數組、ltree樹類型、HLL估值類型, smlar, imgsmlr等。

支持SQL流計算插件

支持時序插件

支持btree, hash, gin, gist, sp-gist, bloom, brin等索引。

支持plpgsql, sql服務端編程。

支持分析型語法（多維計算、窗口查詢）、遞歸查詢(樹形查詢、圖式搜索、等場景)。支持文本全文檢索、模煳查詢、相似查詢、正則查詢。支持數組相似查詢，圖像相似查詢。

1.1 適合業務場景：

 TB+級OLTP(在線事務處理)+OLAP(實時分析)。  

 模煳查詢、相似搜索、正則搜索  

 全文檢索  

 物聯網  

 流式數據處理  

 社交  

 圖式搜索  

 獨立事件分析  

 冷熱分離  

 異步消息  

 多值類型、圖像特征值 相似搜索  

 實時數據清洗  

 GIS應用  

 任意字段實時搜索  

 ... ...

1.2 主打：功能、穩定性、性能、高可用、可靠性、Oracle兼容性、HTAP。

2、HybridDB for PostgreSQL(Greenplum開源版GPDB改進而來) 適合PB級實時OLAP，非常典型的海量數倉。

兼容SQL:2008，兼容TPC-H，TPC-DS。有數十年的商業化曆練經驗。

支持可讀寫的OSS對象存儲外部表

支持常用類型、擴展數據類型：JSON、PostGIS空間數據庫、數組、HLL估值類型。

支持bitmap, hash, btree索引。

支持pljava服務端編程。

支持分析型語法（多維計算、窗口查詢、MADlib機器學習）、支持全文檢索語法。

支持列存儲、行存儲、壓縮、混合存儲。

支持4階段聚合，支持節點間自動重分布。

支持水平擴容。

2.1 適合業務場景：

PB+級實時分析。（傳統統計；時間、空間、屬性多維屬性透視、圈人；任意表、任意維度JOIN；）

2.2 主打：分析型SQL兼容性、功能、穩定性、性能、高可用、擴展性。

最後更新：2017-09-24 16:33:09

畫像圈人 + 目標人群FEED行為透視

標簽

背景

第一步，圈人

第二步，人群行為透視

對比HybridDB 和 RDS PostgreSQL

HybridDB for PostgreSQL

RDS for PostgreSQL 多核並行

RDS for PostgreSQL 非並行

小結

Greenplum和PostgreSQL兩個產品的特色

上一篇：空間|時間|對象圈人 + 目標人群透視 - 暨PostgreSQL 10與Greenplum的對比和選擇

下一篇：如何防止網站被黑?

相關內容

熱門內容

最新內容

畫像圈人 + 目標人群FEED行為透視

標簽

背景

第一步，圈人

第二步，人群行為透視

對比HybridDB 和 RDS PostgreSQL

HybridDB for PostgreSQL

RDS for PostgreSQL 多核並行

RDS for PostgreSQL 非並行

小結

Greenplum和PostgreSQL兩個產品的特色

上一篇： 空間|時間|對象 圈人 + 目標人群透視 - 暨PostgreSQL 10與Greenplum的對比和選擇

下一篇： 如何防止網站被黑?

相關內容

熱門內容

最新內容

上一篇：空間|時間|對象圈人 + 目標人群透視 - 暨PostgreSQL 10與Greenplum的對比和選擇

下一篇：如何防止網站被黑?