377 技術社區[雲棲]

啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合

標簽

PostgreSQL , 商品最佳組合 , 阿裏雲HybridDB

背景

購買早餐時，包子和豆漿、茶葉蛋是最佳搭檔嗎？

為什麼紙尿褲和啤酒是最佳搭檔？

這些問題在積累了一定的訂單數據後，是可以挖掘出來的。這個問題實際上是4.8號PostgreSQL社區杭州區活動上，一位社區的朋友提出來的，如何使用PostgreSQL找出最佳搭配的商品。

實際上有一個專業的推薦數據庫，支持多種推薦算法，也可以很好的解決這個問題。

《推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹》

但是本文不打算使用RecDB這個產品來解決這樣的問題。而是使用統計的方法能得出結論。

本文統計方法限製

本文涉及的統計方法隻能用於計算直接關聯的商品（表現為在同一個訂單中的數據）的最佳組合。

如果你要計算間接關聯的商品組合，例如A用戶買了1,2，B用戶買了2,3，實際上1,3是存在間接關聯關係的。那麼你需要用到recDB中提到的推薦算法，或者使用類似圖式搜索。

參考

《金融風控、公安刑偵、社會關係、人脈分析等需求分析與數據庫實現 - PostgreSQL圖數據庫場景應用》

場景虛構

假設有10萬商品ID，虛構一批用戶的購買或購物車記錄，每一條訂單或購物車記錄中包含5到15個商品。一共構建約1100萬條這樣的記錄。

建表

postgres=# create unlogged table buy (pay_id int8, item_id int[]);  
CREATE TABLE

造數據

創建一個函數，用於插入buy表，（5到15個商品的數組）

create or replace function f() returns void as $$    
declare    
begin    
  for i in 5..15 loop    
    insert into buy (item_id) select array_agg((100000*random())::int8) from generate_series(1,i);    
  end loop;    
end;    
$$ language plpgsql strict;

使用pgbench，生成1100萬記錄

vi test.sql    
select f();    
    
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 100 -j 100 -t 10000    
  
transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 100  
number of threads: 100  
number of transactions per client: 10000  
number of transactions actually processed: 1000000/1000000  
latency average = 1.155 ms  
latency stddev = 1.814 ms  
tps = 85204.625725 (including connections establishing)  
tps = 85411.351807 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         1.158  select f();

確認數據已寫入

postgres=# select count(*) from buy;  
  count     
----------  
 11000000  
(1 row)  
  
postgres=# select * from buy limit 10;  
 pay_id |                           item_id                              
--------+--------------------------------------------------------------  
        | {6537,76804,33612,75580,8021}  
        | {72437,66015,2939,56128,7056}  
        | {40983,79581,15954,21039,6702,90279}  
        | {93626,8337,13416,69371,4366,75868}  
        | {84611,56893,25201,74038,59337,62045,59178}  
        | {97422,48801,69714,77056,17059,79714,21598}  
        | {42997,50834,57214,52866,83656,76342,5639,93416}  
        | {53543,24369,31552,28654,38516,63657,86564,11483}  
        | {58873,23162,23369,55091,32046,29907,31895,65658,5487}  
        | {39916,6641,85068,55870,27679,91770,46150,12290,48662,71350}  
(10 rows)

GIN索引

postgres=# create index idx_buy_item on buy using gin(item_id);

分裂函數

分裂的目的是將一筆訂單中的數組，分裂成若幹個組合。例如5個商品的訂單，拆分成4+3+2+1=10個2個商品的組合。

{6537,76804,33612,75580,8021}

拆分為如下組合

{6537,76804}  
  
{6537,33612}  
  
{6537,75580}  
  
{6537,8021}  
  
{76804,33612}  
  
{76804,75580}  
  
{76804,8021}  
  
{33612,75580}  
  
{33612,8021}  
  
{75580,8021}

創建一個函數來完成這樣的拆分工作

使用遞歸查詢可以滿足重新組合的目的

例子

WITH RECURSIVE   
t(i) AS (  
  SELECT * FROM unnest('{A,B,C}'::char[])  
),   
cte AS (  
     SELECT i AS combo, i, 1 AS ct   
     FROM t   
   UNION ALL   
     SELECT cte.combo || t.i, t.i, ct + 1   
     FROM cte, t   
     WHERE ct <= 3  -- 組合3+1=4次  
       AND position(t.i in cte.combo) = 0   -- 新加入的字符不在已有字符中  
)   
SELECT ARRAY(SELECT combo FROM cte ORDER BY ct, combo) AS result;  
  
                      result                         
---------------------------------------------------  
 {A,B,C,AB,AC,BA,BC,CA,CB,ABC,ACB,BAC,BCA,CAB,CBA}  
(1 row)

函數1，返回指定個數的組合

假設數組中沒有重複元素

create or replace function array_regroup(  
  i_arr int[],   -- 輸入數組  
  i_elems int    -- 打散成固定長度的組合  
) returns setof int[] as $$  
declare  
  v_arr_len int := array_length(i_arr, 1);  -- 輸入的數組長度  
begin  
  -- 保護  
  if i_elems > v_arr_len then  
    raise notice 'you cann''t return group len % more then %', i_elems, v_arr_len;  
    return;  
  elsif i_elems = v_arr_len then  
    return next i_arr;  
    return;  
  elsif i_elems = 1 then  
    return query select array(select i) from unnest(i_arr) t(i);  
    return;  
  end if;  
  
  return query  
  WITH RECURSIVE   
  t(i) AS (  
      select array(select i) from unnest(i_arr) t(i)  
  ),   
  cte AS (  
     SELECT i AS combo, i, 1 AS ct   
     FROM t   
   UNION ALL   
     SELECT array(select i from (select unnest(array_cat(cte.combo, t.i)) order by 1) t(i)), t.i, ct + 1   
     FROM cte, t   
     WHERE cte.ct <= i_elems-1  -- 組合若幹次  
       AND (not cte.combo @> t.i)   -- 新加入的值不在已組合的值中  
  )   
  SELECT combo FROM cte where array_length(combo,1)=i_elems group by combo;   
  
  return;  
end;  
$$ language plpgsql strict;

postgres=# select array_regroup(array[1,2,3],2);  
 array_regroup   
---------------  
 {2,3}  
 {1,2}  
 {1,3}  
(3 rows)

函數2，返回所有個數的組合

create or replace function array_regroup(  
  i_arr int[]   -- 輸入數組  
) returns setof int[] as $$  
declare  
  v_arr_len int := array_length(i_arr, 1);  -- 輸入的數組長度  
begin  
  
  return query  
  WITH RECURSIVE   
  t(i) AS (  
      select array(select i) from unnest(i_arr) t(i)  
  ),   
  cte AS (  
     SELECT i AS combo, i, 1 AS ct   
     FROM t   
   UNION ALL   
     SELECT array(select i from (select unnest(array_cat(cte.combo, t.i)) order by 1) t(i)), t.i, ct + 1   
     FROM cte, t   
     WHERE cte.ct <= v_arr_len-1  -- 組合若幹次  
       AND (not cte.combo @> t.i)   -- 新加入的值不在已組合的值中  
  )   
  SELECT combo FROM cte group by combo;   
  
  return;  
end;  
$$ language plpgsql strict;

postgres=# select array_regroup(array[1,2,3]);  
 array_regroup   
---------------  
 {2}  
 {2,3}  
 {1,2}  
 {1}  
 {1,2,3}  
 {3}  
 {1,3}  
(7 rows)

函數3，返回指定個數的組合，僅輸出包含了某些元素的組合（例如包含了麵包ID的數組）

create or replace function array_regroup(  
  i_arr int[],    -- 輸入數組  
  i_elems int,    -- 打散成固定長度的組合  
  i_arr_contain int[]  -- 包含了這些商品ID的數組  
) returns setof int[] as $$  
declare  
  v_arr_len int := array_length(i_arr, 1);  -- 輸入的數組長度  
begin  
  -- 保護  
  if i_elems > v_arr_len then  
    raise notice 'you cann''t return group len % more then %', i_elems, v_arr_len;  
    return;  
  elsif i_elems = v_arr_len then  
    return next i_arr;  
    return;  
  elsif i_elems = 1 then  
    return query select array(select i) from unnest(i_arr) t(i);  
    return;  
  end if;  
  
  return query  
  WITH RECURSIVE   
  t(i) AS (  
      select array(select i) from unnest(i_arr) t(i)  
  ),   
  cte AS (  
     SELECT i AS combo, i, 1 AS ct   
     FROM t   
   UNION ALL   
     SELECT array(select i from (select unnest(array_cat(cte.combo, t.i)) order by 1) t(i)), t.i, ct + 1   
     FROM cte, t   
     WHERE cte.ct <= i_elems-1  -- 組合若幹次  
       AND (not cte.combo @> t.i)   -- 新加入的值不在已組合的值中  
       AND (cte.combo @> i_arr_contain)  
  )   
  SELECT combo FROM cte where array_length(combo,1)=i_elems group by combo;   
  
  return;  
end;  
$$ language plpgsql strict;

postgres=# select array_regroup(array[1,2,3,4,5],2,array[1]);  
 array_regroup   
---------------  
 {1,2}  
 {1,3}  
 {1,4}  
 {1,5}  
(4 rows)  
  
Time: 1.150 ms

求單品的最佳一級組合

例如，找出麵包的1個最佳搭檔。

假設麵包的商品ID=6537

postgres=# select item_id from buy where item_id @> array[6537];  
......  
 {60573,17248,6537,77857,43349,66208,13656}  
 {97564,50031,79924,24255,6537,21174,39117}  
 {24026,78667,99115,87856,64782,8344,73169,41478,63091,29609,6537,71982,75382}  
 {53094,97465,26156,54181,6537}  
(1101 rows)  
Time: 5.791 ms  
  
postgres=# explain select item_id from buy where item_id @> array[6537];  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Bitmap Heap Scan on buy  (cost=457.45..51909.51 rows=55000 width=60)  
   Recheck Cond: (item_id @> '{6537}'::integer[])  
   ->  Bitmap Index Scan on idx_buy_item  (cost=0.00..443.70 rows=55000 width=0)  
         Index Cond: (item_id @> '{6537}'::integer[])  
(4 rows)

組合，尋找出現次數最多的組合。

postgres=# select count(*), array_regroup(item_id,2,array[6537]) from buy where item_id @> array[6537] group by 2 order by 1 desc;  
 count | array_regroup   
-------+---------------  
     3 | {6537,55286}  
     3 | {6537,48661}  
     3 | {6537,78337}  
     3 | {6537,72623}  
     3 | {6537,81442}  
     3 | {6537,66414}  
     3 | {6537,35346}  
     3 | {6537,79565}  
     3 | {3949,6537}  
......  
  
Time: 286.859 ms

求單品的最佳二級組合

例如，找出麵包的兩個最佳搭檔。

postgres=# select count(*), array_regroup(item_id,3,array[6537]) from buy where item_id @> array[6537] group by 2 order by 1 desc;  
 count |   array_regroup      
-------+--------------------  
     1 | {32,999,6537}  
     1 | {6537,49957,91533}  
     1 | {6537,49957,88377}  
     1 | {6537,49957,57887}  
     1 | {6537,49957,55192}  
     1 | {6537,49952,95266}  
     1 | {6537,49952,56916}  
     1 | {6537,49945,60492}  
     1 | {6537,49940,92888}  
......  
  
Time: 1055.414 ms

統計全網數據的最佳一級組合

可能需要很久

select count(*), array_regroup(item_id,2) from buy group by 2 order by 1 desc limit 10;

統計全網數據的最佳N級組合

可能需要很久

select count(*), array_regroup(item_id, n) from buy group by 2 order by 1 desc limit 10;

小結

1. 這個案例並沒有什麼高超的技術含量，僅僅是將數組按推薦級數進行分裂，統計出現的次數。

用到的數據庫特性包括：

1.1. 數組類型的支持

1.2. plpgsql服務端編程

1.3. 數組元素的索引檢索(包含某個元素)

1.4. MPP, 分布式數據庫架構，提升運算速度。可以參考阿裏雲的HybridDB for PostgreSQL產品。

2. 注意, 本文統計方法限製

本文涉及的統計方法隻能用於計算直接關聯的商品（表現為在同一個訂單中的數據）的最佳組合。

參考

《金融風控、公安刑偵、社會關係、人脈分析等需求分析與數據庫實現 - PostgreSQL圖數據庫場景應用》

3. 阿裏雲HybridDB for PostgreSQL提供了MPP的能力，可以水平擴展，非常適合OLAP場景。例如本案涉及的大量group by的操作，可以得到很好的性能提升。

4. PostgreSQL 9.6開始加入了基於CPU的多核並行計算功能，對於OLAP場景也有非常大的性能提升，例如本案涉及的大量group by的操作，可以得到很好的性能提升。

參考

https://github.com/DataSystemsLab/recdb-postgresql

https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html

《推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹》

《分析加速引擎黑科技 - LLVM、列存、多核並行、算子複用大聯姻 - 一起來開啟PostgreSQL的百寶箱》

《PostgreSQL 向量化執行插件(瓦片式實現) 10x提速OLAP》

《ApsaraDB的左右互搏(PgSQL+HybridDB+OSS) - 解決OLTP+OLAP混合需求》

《金融風控、公安刑偵、社會關係、人脈分析等需求分析與數據庫實現 - PostgreSQL圖數據庫場景應用》

最後更新：2017-04-10 20:02:47

啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合

標簽

背景

本文統計方法限製

場景虛構

造數據

GIN索引

分裂函數

函數1，返回指定個數的組合

函數2，返回所有個數的組合

函數3，返回指定個數的組合，僅輸出包含了某些元素的組合（例如包含了麵包ID的數組）

求單品的最佳一級組合

求單品的最佳二級組合

統計全網數據的最佳一級組合

統計全網數據的最佳N級組合

小結

參考

上一篇：用java進行麵向對象編程，麵向對象是什麼意思

下一篇：推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹

相關內容

熱門內容

最新內容

啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合

標簽

背景

本文統計方法限製

場景虛構

造數據

GIN索引

分裂函數

函數1，返回指定個數的組合

函數2，返回所有個數的組合

函數3，返回指定個數的組合，僅輸出包含了某些元素的組合（例如包含了麵包ID的數組）

求單品的最佳一級組合

求單品的最佳二級組合

統計全網數據的最佳一級組合

統計全網數據的最佳N級組合

小結

參考

上一篇： 用java進行麵向對象編程，麵向對象是什麼意思

下一篇： 推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹

相關內容

熱門內容

最新內容

上一篇：用java進行麵向對象編程，麵向對象是什麼意思

下一篇：推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹