95
技術社區[雲棲]
推薦係統分析 - 推薦算法, RecDB推薦數據庫介紹
標簽
PostgreSQL , RecDB , 推薦係統 , 圖式搜索
背景
中華文化源遠流長,從古至今有很多有趣的學問。比如看風水、看相,在西方文化中有類似的比如星座學說。
這些和推薦係統有什麼關係呢?
個人感覺星座學說這些學問有一定的理論基礎,更多的則是也是經驗的總結。
推薦係統實際上和星座學說類似,有一定的算法基礎,算法則可能是來自經驗的總結。
在現實場景中,可以找到很多類似的例子,這些例子是一些數據,根據推薦算法,可以找到用戶可能感興趣的東西。
1. 比如豆瓣有幾千萬用戶,用戶在豆瓣上可以對影片打分和點評,就形成了一批這樣的數據:
用戶ID,影片ID,打分
1, '終結者', 90
902, '笑傲江湖', 78
......
我暫且稱之為打分表, 或者rating數據
2. 電商應用,也有類似的rating數據,比如這樣的數據結構:
用戶ID,購買的商品ID,購買數量
或者
用戶ID,瀏覽的店鋪ID,瀏覽的次數
或者
用戶ID,瀏覽的商品ID,瀏覽次數
3. 其他應用, 使用應用程序的feed,可以找到用戶與用戶,用戶與物體的親密度(rating)。
有了用戶與用戶,用戶與物體的親密度(rating)數據之後,就可以根據推薦算法,給用戶推薦其他用戶或物體。
推薦算法相關的文章可以參考
https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html
接下來容重介紹一款繼承了多種推薦算法的數據庫RecDB。
RecDB(推薦數據庫)
RecDB是基於PostgreSQL打造的一款專業的推薦數據庫,目前繼承了以下推薦算法。
Currently, the available recommendation algorithms that could be passed to the USING clause are the following:
-
ItemCosCF
Item-Item Collaborative Filtering using Cosine Similarity measure. -
ItemPearCF
Item-Item Collaborative Filtering using Pearson Correlation Similarity measure. -
UserCosCF
User-User Collaborative Filtering using Cosine Similarity measure. -
UserPearCF
User-User Collaborative Filtering using Cosine Similarity measure. -
SVD
Simon Funk Singular Value Decomposition.
Applications powered by RecDB can produce online and flexible personalized recommendations to end-users.
RecDB的用法非常簡單,就如同使用PostgreSQL一樣,對用戶非常友好,同時擴展了一些推薦相關的SQL語法。
如果你還不會用PostgreSQL,可以來看看這個
《2011年功力的德哥教你2天擼通PostgreSQL - 入門、開發、原理、管理、調優》
RecDB安裝
1. 編譯安裝recdb
git clone https://github.com/Sarwat/recdb-postgresql.git
cd recdb-postgresql/PostgreSQL
./configure --prefix=/home/digoal/recdb
make world -j 32
make install-world
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=9999
export PGDATA=/home/digoal/pgdata/pg_root9999
export LANG=en_US.utf8
export PGHOME=/home/digoal/recdb
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=127.0.0.1
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
2. 初始化數據庫集群
initdb -D $PGDATA -E UTF8 --locale=C -U postgres
3. 配置
cd $PGDATA
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 9999
max_connections = 100
unix_socket_directory = '.'
shared_buffers = 32GB
maintenance_work_mem = 1GB
vacuum_cost_delay = 0ms
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 5.0
wal_level = minimal
synchronous_commit = off
wal_buffers = 128MB
wal_writer_delay = 10ms
checkpoint_segments = 2048
checkpoint_timeout = 35min
checkpoint_completion_target = 0.1
random_page_cost = 1.2
effective_cache_size = 128GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.1
autovacuum_freeze_max_age = 1500000000
autovacuum_vacuum_cost_delay = 0
vacuum_freeze_table_age = 1300000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
vi pg_hba.conf
host all all 0.0.0.0/0 md5
4. 啟動數據庫集群
pg_ctl start
5. 測試數據庫是否可用
psql
psql (9.2.0)
Type "help" for help.
postgres=# \dt
No relations found.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+-------+-----------------------
postgres | postgres | UTF8 | C | C |
template0 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
(3 rows)
RecDB 使用例子
1. 創建測試表,三個字段,分別為用戶ID,電影ID,評分。
postgres=# create table douban_rating(uid int, movie_id int, access_rating real);
CREATE TABLE
2. 插入100萬評分數據,1000萬用戶ID,點評1000部電影
postgres=# insert into douban_rating select random()*10000000, random()*1000, random() from generate_series(1,1000000);
INSERT 0 1000000
3. 創建RECOMMENDER,需要指出推薦算法,字段。
推薦算法如下
-
ItemCosCF
Item-Item Collaborative Filtering using Cosine Similarity measure. -
ItemPearCF
Item-Item Collaborative Filtering using Pearson Correlation Similarity measure. -
UserCosCF
User-User Collaborative Filtering using Cosine Similarity measure. -
UserPearCF
User-User Collaborative Filtering using Cosine Similarity measure. -
SVD
Simon Funk Singular Value Decomposition.
postgres=# set maintenance_work_mem ='32GB';
SET
postgres=# set work_mem ='32GB';
SET
postgres=# \timing
Timing is on.
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating
USERS FROM uid
ITEMS FROM movie_id
EVENTS FROM access_rating
USING ItemCosCF;
NOTICE: CREATE TABLE will create implicit sequence "movierecindex_systemid_seq" for serial column "movierecindex.systemid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "movierecindex_pkey" for table "movierecindex"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "movierecview1491804367109912_pkey" for table "movierecview1491804367109912"
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "movierecmodel1491804367109912_pkey" for table "movierecmodel1491804367109912"
CREATE RECOMMENDER
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------------------------------+-------+----------+------------+-------------
public | douban_rating | table | postgres | 42 MB |
public | movierecindex | table | postgres | 16 kB |
public | movierecmodel1491804035307409 | table | postgres | 2136 kB |
public | movierecview1491804035307409 | table | postgres | 8192 bytes |
public | recdbproperties | table | postgres | 8192 bytes |
public | recmodelscatalogue | table | postgres | 16 kB |
(6 rows)
查看數據
postgres=# select * from douban_rating limit 10;
uid | movie_id | access_rating
---------+----------+---------------
359604 | 4798 | 0.796408
7749197 | 1764 | 0.194022
9288272 | 947 | 0.696304
6032232 | 1738 | 0.745247
6451861 | 6278 | 0.416638
3290076 | 510 | 0.571497
334635 | 4904 | 0.552451
2313039 | 3906 | 0.963749
2535368 | 6377 | 0.297736
2939719 | 7603 | 0.624071
(10 rows)
查看某個用戶的點評數據
postgres=# select * from douban_rating where uid=359604;
uid | movie_id | access_rating
--------+----------+---------------
359604 | 4798 | 0.796408
(1 row)
查看具有共性的用戶,點評了哪些電影
postgres=# select distinct movie_id from douban_rating where uid in (select uid from douban_rating where movie_id=4798);
movie_id
----------
2667
4798
1686
9008
3994
1156
1679
4103
1090
(9 rows)
4. 使用推薦算法,給uid=359604的用戶,推薦其他電影,返回的是具有共性的電影,分值從高到低排序。
postgres=# SELECT * FROM douban_rating R
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF
WHERE R.uid = 359604
ORDER BY R.access_rating desc
LIMIT 10;
uid | movie_id | access_rating
--------+----------+---------------
359604 | 2667 | 0.796408
359604 | 9008 | 0.796408
359604 | 1090 | 0.796408
359604 | 1679 | 0.796408
359604 | 1686 | 0.796408
359604 | 4103 | 0.796408
359604 | 1156 | 0.796408
359604 | 3994 | 0.796408
359604 | 6 | 0
359604 | 5 | 0
(10 rows)
執行計劃如下
postgres=# explain (analyze,verbose,timing,costs,buffers) SELECT * FROM douban_rating R
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF
WHERE R.uid = 359604
ORDER BY R.access_rating desc
LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.091..3337.093 rows=10 loops=1)
Output: uid, movie_id, access_rating
Buffers: shared hit=61840, temp read=9993 written=9993
-> Sort (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.090..3337.092 rows=10 loops=1)
Output: uid, movie_id, access_rating
Sort Key: r.access_rating
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=61840, temp read=9993 written=9993
-> Result (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.739..3332.023 rows=10001 loops=1)
Output: uid, movie_id, access_rating
Buffers: shared hit=61840, temp read=9993 written=9993
-> Recommend on public.douban_rating r (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.738..3329.177 rows=10001 loops=1)
Output: uid, movie_id, access_rating
Filter: (r.uid = 359604)
Buffers: shared hit=61840, temp read=9993 written=9993
Total runtime: 3337.116 ms
(16 rows)
5. 由於共性用戶較少,重新造一部分數據,讓共性更多,1萬用戶點評1000部電影。
postgres=# drop RECOMMENDER MovieRec;
DROP RECOMMENDER
postgres=# truncate douban_rating ;
TRUNCATE TABLE
postgres=# insert into douban_rating select random()*10000, random()*1000, random() from generate_series(1,1000000);
INSERT 0 1000000
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating
USERS FROM uid
ITEMS FROM movie_id
EVENTS FROM access_rating
USING ItemCosCF;
查詢原始數據如下
postgres=# select * from douban_rating limit 10;
uid | movie_id | access_rating
------+----------+---------------
128 | 848 | 0.796747
4953 | 401 | 0.832318
2766 | 874 | 0.61931
5572 | 430 | 0.550044
6 | 709 | 0.798314
1896 | 237 | 0.559974
4917 | 614 | 0.517259
6697 | 886 | 0.804338
2232 | 534 | 0.873135
4574 | 557 | 0.38828
(10 rows)
postgres=# select * from douban_rating where uid=128;
uid | movie_id | access_rating
-----+----------+---------------
128 | 848 | 0.796747
128 | 755 | 0.139934
128 | 79 | 0.633511
128 | 979 | 0.145586
128 | 120 | 0.153884
128 | 839 | 0.0865545
......
7. 給用戶128推薦電影,現在的結果有了一定的可參考性。
postgres=# SELECT * FROM MovieRec R
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF
WHERE R.uid = 128
ORDER BY R.access_rating desc
LIMIT 10;
uid | movie_id | access_rating
-----+----------+---------------
128 | 422 | 0.514567
128 | 127 | 0.514059
128 | 495 | 0.513637
128 | 974 | 0.513447
128 | 487 | 0.512524
128 | 64 | 0.512519
128 | 868 | 0.512367
128 | 132 | 0.512323
128 | 0 | 0.51225
128 | 54 | 0.512082
(10 rows)
8. recDB支持複雜的查詢,例如JOIN.
SELECT * FROM ml_ratings R, Movies M
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF
WHERE R.userid = 1 AND M.movieid = R.itemid AND M.genre LIKE '%Comedy%'
ORDER BY R.ratingval
LIMIT 10
更多例子
https://github.com/DataSystemsLab/recdb-postgresql/tree/master/examples
小結
1. 未來RecDB可能會整合到PG內核,也可能作為一款PG的插件(可能性較大),Pipelinedb(流式計算數據庫)也在做插件化的改造。
目前基於PG改造或擴展的產品非常的多,包括
1. 流計算數據庫產品 pipelineDB
2. 推薦數據庫產品 recDB
3. 時序數據庫 timescaleDB
4. 分布式數據庫插件 citus
5. 列存儲插件 IMCS, cstore等
6. 麵向OLAP的codegen數據庫 pg_LLVM
7. 向量計算插件 vops
。。。
不同的場景,可以找到適合對應場景的插件。
2. recDB實現的推薦查詢,與圖數據也有一定的類似性,PostgreSQL在圖式搜索方麵的應用可以參考如下
《金融風控、公安刑偵、社會關係、人脈分析等需求分析與數據庫實現 - PostgreSQL圖數據庫場景應用》
參考
https://github.com/DataSystemsLab/recdb-postgresql
https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html
最後更新:2017-04-10 20:02:47
上一篇:
啤酒和紙尿褲最搭? - 用HybridDB/PostgreSQL查詢商品營銷最佳組合
下一篇:
時序數據庫分析 - TimescaleDB時序數據庫介紹
百度押注地圖謀求平台化:來自Offline的挑戰
Alibaba Single's Day Festival – An Evolution of Alibaba Cloud's Underlying Infrastructure
融入產業生態的靶向孵化
計算機常識--Windows篇
xmemcached發布1.3.6
智能家庭本周鋒聞:簡易VR眼鏡來襲,大佬玩起虛擬現實
在iOS上實現瀑布流界麵
《HttpClient官方文檔》2.1 連接持久性-2.2 HTTP連接路由
windows中修改catalina.sh上傳到linux執行報錯This file is needed to run this program
ComBox控件下拉框選項的添加