推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹

標簽

PostgreSQL , RecDB , 推薦係統 , 圖式搜索

背景

中華文化源遠流長，從古至今有很多有趣的學問。比如看風水、看相，在西方文化中有類似的比如星座學說。

這些和推薦係統有什麼關係呢？

個人感覺星座學說這些學問有一定的理論基礎，更多的則是也是經驗的總結。

推薦係統實際上和星座學說類似，有一定的算法基礎，算法則可能是來自經驗的總結。

在現實場景中，可以找到很多類似的例子，這些例子是一些數據，根據推薦算法，可以找到用戶可能感興趣的東西。

1. 比如豆瓣有幾千萬用戶，用戶在豆瓣上可以對影片打分和點評，就形成了一批這樣的數據：

用戶ID，影片ID，打分  
  
1, '終結者', 90  
902, '笑傲江湖', 78  
......  
  
我暫且稱之為打分表, 或者rating數據

2. 電商應用，也有類似的rating數據，比如這樣的數據結構：

用戶ID，購買的商品ID，購買數量  
  
或者  
  
用戶ID，瀏覽的店鋪ID，瀏覽的次數  
  
或者  
  
用戶ID，瀏覽的商品ID，瀏覽次數

3. 其他應用，使用應用程序的feed，可以找到用戶與用戶，用戶與物體的親密度(rating)。

有了用戶與用戶，用戶與物體的親密度(rating)數據之後，就可以根據推薦算法，給用戶推薦其他用戶或物體。

接下來容重介紹一款繼承了多種推薦算法的數據庫RecDB。

RecDB(推薦數據庫)

RecDB是基於PostgreSQL打造的一款專業的推薦數據庫，目前繼承了以下推薦算法。

Currently, the available recommendation algorithms that could be passed to the USING clause are the following:

ItemCosCF Item-Item Collaborative Filtering using Cosine Similarity measure.
ItemPearCF Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.
UserCosCF User-User Collaborative Filtering using Cosine Similarity measure.
UserPearCF User-User Collaborative Filtering using Cosine Similarity measure.
SVD Simon Funk Singular Value Decomposition.

Applications powered by RecDB can produce online and flexible personalized recommendations to end-users.

RecDB的用法非常簡單，就如同使用PostgreSQL一樣，對用戶非常友好，同時擴展了一些推薦相關的SQL語法。

如果你還不會用PostgreSQL，可以來看看這個

《2011年功力的德哥教你2天擼通PostgreSQL - 入門、開發、原理、管理、調優》

RecDB安裝

1. 編譯安裝recdb

git clone https://github.com/Sarwat/recdb-postgresql.git  
  
cd recdb-postgresql/PostgreSQL  
  
./configure --prefix=/home/digoal/recdb  
  
make world -j 32  
  
make install-world  
  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=9999  
export PGDATA=/home/digoal/pgdata/pg_root9999  
export LANG=en_US.utf8  
export PGHOME=/home/digoal/recdb  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export LD_RUN_PATH=$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=127.0.0.1  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi

2. 初始化數據庫集群

initdb -D $PGDATA -E UTF8 --locale=C -U postgres

3. 配置

cd $PGDATA  
  
vi postgresql.conf  
listen_addresses = '0.0.0.0'  
port = 9999   
max_connections = 100  
unix_socket_directory = '.'  
shared_buffers = 32GB  
maintenance_work_mem = 1GB  
vacuum_cost_delay = 0ms  
bgwriter_delay = 10ms  
bgwriter_lru_maxpages = 1000  
bgwriter_lru_multiplier = 5.0  
wal_level = minimal  
synchronous_commit = off  
wal_buffers = 128MB  
wal_writer_delay = 10ms  
checkpoint_segments = 2048    
checkpoint_timeout = 35min  
checkpoint_completion_target = 0.1  
random_page_cost = 1.2  
effective_cache_size = 128GB  
log_destination = 'csvlog'  
logging_collector = on  
log_directory = 'pg_log'  
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'  
log_truncate_on_rotation = on  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose  
log_timezone = 'PRC'  
log_autovacuum_min_duration = 0  
autovacuum_max_workers = 8  
autovacuum_naptime = 10s  
autovacuum_vacuum_scale_factor = 0.05  
autovacuum_analyze_scale_factor = 0.1  
autovacuum_freeze_max_age = 1500000000  
autovacuum_vacuum_cost_delay = 0  
vacuum_freeze_table_age = 1300000000  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'  
lc_monetary = 'C'  
lc_numeric = 'C'  
lc_time = 'C'  
default_text_search_config = 'pg_catalog.english'  
  
  
vi pg_hba.conf  
host all all 0.0.0.0/0 md5

4. 啟動數據庫集群

pg_ctl start

5. 測試數據庫是否可用

psql  
psql (9.2.0)  
Type "help" for help.  
  
postgres=# \dt  
No relations found.  
postgres=# \l  
                             List of databases  
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges     
-----------+----------+----------+---------+-------+-----------------------  
 postgres  | postgres | UTF8     | C       | C     |   
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
(3 rows)

RecDB 使用例子

1. 創建測試表，三個字段，分別為用戶ID，電影ID，評分。

postgres=# create table douban_rating(uid int, movie_id int, access_rating real);  
CREATE TABLE

2. 插入100萬評分數據，1000萬用戶ID，點評1000部電影

postgres=# insert into douban_rating select random()*10000000, random()*1000, random() from generate_series(1,1000000);  
INSERT 0 1000000

3. 創建RECOMMENDER，需要指出推薦算法，字段。

推薦算法如下

ItemCosCF Item-Item Collaborative Filtering using Cosine Similarity measure.
ItemPearCF Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.
UserCosCF User-User Collaborative Filtering using Cosine Similarity measure.
UserPearCF User-User Collaborative Filtering using Cosine Similarity measure.
SVD Simon Funk Singular Value Decomposition.

postgres=# set maintenance_work_mem ='32GB';  
SET  
postgres=# set work_mem ='32GB';  
SET  
postgres=# \timing  
Timing is on.  
  
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating  
USERS FROM uid  
ITEMS FROM movie_id  
EVENTS FROM access_rating  
USING ItemCosCF;  
  
NOTICE:  CREATE TABLE will create implicit sequence "movierecindex_systemid_seq" for serial column "movierecindex.systemid"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movierecindex_pkey" for table "movierecindex"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movierecview1491804367109912_pkey" for table "movierecview1491804367109912"  
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "movierecmodel1491804367109912_pkey" for table "movierecmodel1491804367109912"  
CREATE RECOMMENDER

postgres=# \dt+  
                                  List of relations  
 Schema |             Name              | Type  |  Owner   |    Size    | Description   
--------+-------------------------------+-------+----------+------------+-------------  
 public | douban_rating                 | table | postgres | 42 MB      |   
 public | movierecindex                 | table | postgres | 16 kB      |   
 public | movierecmodel1491804035307409 | table | postgres | 2136 kB    |   
 public | movierecview1491804035307409  | table | postgres | 8192 bytes |   
 public | recdbproperties               | table | postgres | 8192 bytes |   
 public | recmodelscatalogue            | table | postgres | 16 kB      |   
(6 rows)

查看數據

postgres=# select * from douban_rating limit 10;  
   uid   | movie_id | access_rating   
---------+----------+---------------  
  359604 |     4798 |      0.796408  
 7749197 |     1764 |      0.194022  
 9288272 |      947 |      0.696304  
 6032232 |     1738 |      0.745247  
 6451861 |     6278 |      0.416638  
 3290076 |      510 |      0.571497  
  334635 |     4904 |      0.552451  
 2313039 |     3906 |      0.963749  
 2535368 |     6377 |      0.297736  
 2939719 |     7603 |      0.624071  
(10 rows)

查看某個用戶的點評數據

postgres=# select * from douban_rating where uid=359604;  
  uid   | movie_id | access_rating   
--------+----------+---------------  
 359604 |     4798 |      0.796408  
(1 row)

查看具有共性的用戶，點評了哪些電影

postgres=# select distinct movie_id from douban_rating where uid in (select uid from douban_rating where movie_id=4798);  
 movie_id   
----------  
     2667  
     4798  
     1686  
     9008  
     3994  
     1156  
     1679  
     4103  
     1090  
(9 rows)

4. 使用推薦算法，給uid=359604的用戶，推薦其他電影，返回的是具有共性的電影，分值從高到低排序。

postgres=# SELECT * FROM douban_rating R  
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 359604  
ORDER BY R.access_rating desc  
LIMIT 10;  
  uid   | movie_id | access_rating   
--------+----------+---------------  
 359604 |     2667 |      0.796408  
 359604 |     9008 |      0.796408  
 359604 |     1090 |      0.796408  
 359604 |     1679 |      0.796408  
 359604 |     1686 |      0.796408  
 359604 |     4103 |      0.796408  
 359604 |     1156 |      0.796408  
 359604 |     3994 |      0.796408  
 359604 |        6 |             0  
 359604 |        5 |             0  
(10 rows)

執行計劃如下

postgres=# explain (analyze,verbose,timing,costs,buffers) SELECT * FROM douban_rating R  
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 359604  
ORDER BY R.access_rating desc  
LIMIT 10;  
                                                                    QUERY PLAN                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.091..3337.093 rows=10 loops=1)  
   Output: uid, movie_id, access_rating  
   Buffers: shared hit=61840, temp read=9993 written=9993  
   ->  Sort  (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.090..3337.092 rows=10 loops=1)  
         Output: uid, movie_id, access_rating  
         Sort Key: r.access_rating  
         Sort Method: top-N heapsort  Memory: 25kB  
         Buffers: shared hit=61840, temp read=9993 written=9993  
         ->  Result  (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.739..3332.023 rows=10001 loops=1)  
               Output: uid, movie_id, access_rating  
               Buffers: shared hit=61840, temp read=9993 written=9993  
               ->  Recommend on public.douban_rating r  (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.738..3329.177 rows=10001 loops=1)  
                     Output: uid, movie_id, access_rating  
                     Filter: (r.uid = 359604)  
                     Buffers: shared hit=61840, temp read=9993 written=9993  
 Total runtime: 3337.116 ms  
(16 rows)

5. 由於共性用戶較少，重新造一部分數據，讓共性更多，1萬用戶點評1000部電影。

postgres=# drop RECOMMENDER MovieRec;  
DROP RECOMMENDER  
postgres=# truncate douban_rating ;  
TRUNCATE TABLE  
postgres=# insert into douban_rating select random()*10000, random()*1000, random() from generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating  
USERS FROM uid  
ITEMS FROM movie_id  
EVENTS FROM access_rating  
USING ItemCosCF;

查詢原始數據如下

postgres=# select * from douban_rating limit 10;  
 uid  | movie_id | access_rating   
------+----------+---------------  
  128 |      848 |      0.796747  
 4953 |      401 |      0.832318  
 2766 |      874 |       0.61931  
 5572 |      430 |      0.550044  
    6 |      709 |      0.798314  
 1896 |      237 |      0.559974  
 4917 |      614 |      0.517259  
 6697 |      886 |      0.804338  
 2232 |      534 |      0.873135  
 4574 |      557 |       0.38828  
(10 rows)  
  
  
postgres=# select * from douban_rating where uid=128;  
 uid | movie_id | access_rating   
-----+----------+---------------  
 128 |      848 |      0.796747  
 128 |      755 |      0.139934  
 128 |       79 |      0.633511  
 128 |      979 |      0.145586  
 128 |      120 |      0.153884  
 128 |      839 |     0.0865545  
 ......

7. 給用戶128推薦電影，現在的結果有了一定的可參考性。

postgres=# SELECT * FROM MovieRec R                                                 
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 128  
ORDER BY R.access_rating desc  
LIMIT 10;  
 uid | movie_id | access_rating   
-----+----------+---------------  
 128 |      422 |      0.514567  
 128 |      127 |      0.514059  
 128 |      495 |      0.513637  
 128 |      974 |      0.513447  
 128 |      487 |      0.512524  
 128 |       64 |      0.512519  
 128 |      868 |      0.512367  
 128 |      132 |      0.512323  
 128 |        0 |       0.51225  
 128 |       54 |      0.512082  
(10 rows)

8. recDB支持複雜的查詢，例如JOIN.

SELECT * FROM ml_ratings R, Movies M  
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF  
WHERE R.userid = 1 AND M.movieid = R.itemid AND M.genre LIKE '%Comedy%'  
ORDER BY R.ratingval  
LIMIT 10

小結

1. 未來RecDB可能會整合到PG內核，也可能作為一款PG的插件（可能性較大），Pipelinedb(流式計算數據庫）也在做插件化的改造。

目前基於PG改造或擴展的產品非常的多，包括

1. 流計算數據庫產品 pipelineDB

2. 推薦數據庫產品 recDB

3. 時序數據庫 timescaleDB

4. 分布式數據庫插件 citus

5. 列存儲插件 IMCS, cstore等

6. 麵向OLAP的codegen數據庫 pg_LLVM

7. 向量計算插件 vops

。。。

不同的場景，可以找到適合對應場景的插件。

2. recDB實現的推薦查詢，與圖數據也有一定的類似性，PostgreSQL在圖式搜索方麵的應用可以參考如下

《金融風控、公安刑偵、社會關係、人脈分析等需求分析與數據庫實現 - PostgreSQL圖數據庫場景應用》

參考

https://github.com/DataSystemsLab/recdb-postgresql

https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html

<5分鍾用PostgreSQL實現推薦係統>

最後更新：2017-04-10 20:02:47

推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹

標簽

背景

RecDB(推薦數據庫)

RecDB安裝

RecDB 使用例子

更多例子

小結

參考

上一篇：啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合

下一篇：時序數據庫分析 - TimescaleDB時序數據庫介紹

相關內容

熱門內容

最新內容

推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹

標簽

背景

RecDB(推薦數據庫)

RecDB安裝

RecDB 使用例子

更多例子

小結

參考

上一篇： 啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合

下一篇： 時序數據庫分析 - TimescaleDB時序數據庫介紹

相關內容

熱門內容

最新內容

上一篇：啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合

下一篇：時序數據庫分析 - TimescaleDB時序數據庫介紹