537 阿裏雲技術社區[雲棲]

PostgrSQL 遞歸SQL的幾個應用 - 極客與正常人的思維

標簽

PostgreSQL , with recursive , 遞歸查詢 , 求差 , 求唯一值 , 求稀疏列 , 求時序數據最新值

背景

遞歸SQL的幾個應用

遞歸查詢，在幾個場景的應用，有十全大補丸的功效。

一、求唯一值

場景

某張表，數據量在億級別，求某稀疏列的唯一值。

create table sex (sex char(1), otherinfo text);    
create index idx_sex_1 on sex(sex);    
insert into sex select 'm', generate_series(1,50000000)||'this is test';    
insert into sex select 'w', generate_series(1,50000000)||'this is test';

正常人的做法 - 蝸牛

select distinct col from table ;

蝸牛般的速度。

極客的做法 - 0.17毫秒

with recursive skip as (    
  (    
    select min(t.sex) as sex from sex t where t.sex is not null    
  )    
  union all    
  (    
    select (select min(t.sex) as sex from sex t where t.sex > s.sex and t.sex is not null)     
      from skip s where s.sex is not null   
  )    
)     
select * from skip where sex is not null;   
  
Total runtime: 0.173 ms

飛一般的感覺。

二、求差

場景

一張小表A，裏麵存儲了一些ID，大約幾百個。

（比如說巡邏車輛ID，環衛車輛的ID，公交車，微公交的ID）。

另外有一張日誌表B，每條記錄中的ID是來自前麵那張小表的，但不是每個ID都出現在這張日誌表中，比如說一天可能隻有幾十個ID會出現在這個日誌表的當天的數據中。

（比如車輛的行車軌跡數據，每秒上報軌跡，數據量就非常龐大）。

怎麼快速的找出今天沒有出現的ID？

（哪些巡邏車輛沒有出現在這個片區，是不是偷懶了？哪些環衛車輛沒有出行，哪些公交或微公交沒有出行）？

建表

create table a(id int primary key, info text);  
  
create table b(id int primary key, aid int, crt_time timestamp);  
create index b_aid on b(aid);

插入測試數據

-- a表插入1000條  
insert into a select generate_series(1,1000), md5(random()::text);  
  
-- b表插入500萬條，隻包含aid的500個id。  
insert into b select generate_series(1,5000000), generate_series(1,500), clock_timestamp();

正常人的做法 - 2.3秒

select * from a where id not in (select aid from b);   
618秒  
  
select a.id from a left join b on (a.id=b.aid) where b.* is null;  
2.3秒

極客的做法 - 11毫秒

select * from a where id not in   
(  
with recursive skip as (    
  (    
    select min(aid) aid from b where aid is not null    
  )    
  union all    
  (    
    select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)     
      from skip s where s.aid is not null    
  )    
)     
select aid from skip where aid is not null  
);  
  
11毫秒

飛一般的感覺。

三、求時序數據最新值

場景

有很多傳感器，不斷的在上報數據，用戶需要查詢當前最新的，每個傳感器上報的值。

這種需求，可以使用窗口查詢，但是如何加速，如何快速的取出批量數據？

創建測試表如下，

create unlogged table sort_test(  
  id serial8 primary key,  -- 主鍵  
  c2 int,  -- 傳感器ID  
  c3 int  -- 傳感器值  
);

寫入1000萬傳感器測試數據，10萬個傳感器。

postgres=# insert into sort_test (c2,c3) select random()*100000, random()*100 from generate_series(1,10000000);  
INSERT 0 10000000

正常人的做法 - 44秒

select id,c2,c3 from (select id,c2,c3,row_number() over(partition by c2 order by id desc) rn from sort_test) t where rn=1;  
  
44秒

極客的做法 - 4秒

postgres=# create index sort_test_1 on sort_test(c2,id desc);   
CREATE INDEX  
  
with recursive skip as (    
  (    
    select (c2,c3)::r as r from sort_test where id in (select id from sort_test where c2 is not null order by c2,id desc limit 1)   
  )    
  union all    
  (    
    select (  
      select (c2,c3)::r as r from sort_test where id in (select id from sort_test t where t.c2>(s.r).c2 and t.c2 is not null order by c2,id desc limit 1)   
    ) from skip s where (s.r).c2 is not null  
  )      
)     
select (t.r).c2, (t.r).c3 from skip t where t.* is not null;   
  
4.2秒

如果數據需要處理或實時展示，流式返回

postgres=# begin;  
BEGIN  
Time: 0.079 ms  
postgres=# declare cur cursor for with recursive skip as (    
  (    
    select (c2,c3)::r as r from sort_test where id in (select id from sort_test where c2 is not null order by c2,id desc limit 1)   
  )    
  union all    
  (    
    select (  
      select (c2,c3)::r as r from sort_test where id in (select id from sort_test t where t.c2>(s.r).c2 and t.c2 is not null order by c2,id desc limit 1)   
    ) from skip s where (s.r).c2 is not null  
  )      
)     
select (t.r).c2, (t.r).c3 from skip t where t.* is not null;   
DECLARE CURSOR  
Time: 1.240 ms  
postgres=# fetch 100 from cur;  
    r       
----------  
 (0,93)  
 (1,52)  
 (2,65)  
.....  
  (97,78)  
 (98,44)  
 (99,99)  
(100 rows)  
  
Time: 4.314 ms

飛一般的感覺。