PostgreSQL DISTINCT 和 DISTINCT ON 語法的使用

標簽

PostgreSQL , distinct , distinct on , window 函數

背景

本文介紹一下distinct的幾種用法。

https://www.postgresql.org/docs/10/static/queries-select-lists.html

1、返回唯一值

After the select list has been processed, the result table can optionally be subject to the elimination of duplicate rows.   
  
The DISTINCT key word is written directly after SELECT to specify this:  
  
SELECT DISTINCT select_list ...  
  
(Instead of DISTINCT the key word ALL can be used to specify the default behavior of retaining all rows.)  
  
Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

2、返回指定列唯一的任意行。也可以使用窗口來實現類似功能，但是性能沒有distinct on好，因為它是任意的。

Alternatively, an arbitrary expression can determine what rows are to be considered distinct:  
  
SELECT DISTINCT ON (expression [, expression ...]) select_list ...  
  
Here expression is an arbitrary value expression that is evaluated for all rows.   
  
A set of rows for which all the expressions are equal are considered duplicates,   
and only the first row of the set is kept in the output.   
  
Note that the “first row” of a set is unpredictable unless the query is sorted   
on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter.   
  
(DISTINCT ON processing occurs after ORDER BY sorting.)  
  
The DISTINCT ON clause is not part of the SQL standard and is sometimes considered bad   
style because of the potentially indeterminate nature of its results.   
  
With judicious use of GROUP BY and subqueries in FROM,   
this construct can be avoided, but it is often the most convenient alternative.

3、返回唯一值個數

select count(distinct (表達式1, ....)) from tbl;  

或

select count(*) from (select 表達式,.... from tbl group by 表達式,....) t;

例子

1、返回所有記錄。

select ALL id, c1 from test;  
  
或  
  
select id, c1 from test;

2、返回 id,c1 唯一值。(這裏NULL視為相等)。

select DISTINCT id, c1 from test;   
  
或  
  
select id, c1 from test group by id, c1;

3、返回c3唯一的任意行。

select distinct on (c3) c2,c3 from tbl;

postgres=# explain (analyze,verbose,timing,costs,buffers) select distinct on (c3) c2,c3 from tbl;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Unique  (cost=115063.84..120063.84 rows=11 width=8) (actual time=1865.317..2279.840 rows=11 loops=1)  
   Output: c2, c3  
   Buffers: shared hit=5406  
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=8) (actual time=1865.312..2068.536 rows=1000000 loops=1)  
         Output: c2, c3  
         Sort Key: tbl.c3  
         Sort Method: quicksort  Memory: 71452kB  
         Buffers: shared hit=5406  
         ->  Seq Scan on public.tbl  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.017..264.041 rows=1000000 loops=1)  
               Output: c2, c3  
               Buffers: shared hit=5406  
 Planning time: 0.070 ms  
 Execution time: 2291.536 ms  
(13 rows)

postgres=# select distinct on (c3) c2,c3 from tbl;  -- c3 唯一, 但是可能返回任意行  
 c2  | c3   
-----+----  
 100 |  0  
  23 |  1  
  63 |  2  
  47 |  3  
  34 |  4  
  17 |  5  
  60 |  6  
   8 |  7  
  70 |  8  
  73 |  9  
  94 | 10  
(11 rows)

使用窗口函數可以達到類似效果，但是可以確定返回哪行，因此也更慢一些：

select * from (select row_number() over (partition by c3) as rn, * from tbl) t where rn=1;  
  
postgres=# select * from (select row_number() over (partition by c3) as rn, * from tbl) t where rn=1;  
 rn | c1  | c2  | c3   
----+-----+-----+----  
  1 | 420 | 100 |  0  
  1 | 721 |  23 |  1  
  1 |  80 |  63 |  2  
  1 | 322 |  47 |  3  
  1 | 457 |  34 |  4  
  1 | 386 |  17 |  5  
  1 | 491 |  60 |  6  
  1 | 260 |   8 |  7  
  1 |  41 |  70 |  8  
  1 |  56 |  73 |  9  
  1 | 154 |  94 | 10  
(11 rows)

postgres=# explain analyze select * from (select row_number() over (partition by c3) as rn, * from tbl) t where rn=1;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Subquery Scan on t  (cost=115063.84..145063.84 rows=5000 width=20) (actual time=1855.132..2860.276 rows=11 loops=1)  
   Filter: (t.rn = 1)  
   Rows Removed by Filter: 999989  
   ->  WindowAgg  (cost=115063.84..132563.84 rows=1000000 width=20) (actual time=1855.129..2739.190 rows=1000000 loops=1)  
         ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12) (actual time=1855.115..2028.946 rows=1000000 loops=1)  
               Sort Key: tbl.c3  
               Sort Method: quicksort  Memory: 71452kB  
               ->  Seq Scan on tbl  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.015..251.021 rows=1000000 loops=1)  
 Planning time: 0.115 ms  
 Execution time: 2871.551 ms  
(10 rows)

4、返回有多少個唯一值

select count(distinct (表達式,....)) from tbl;

postgres=# select count(distinct c3) from tbl;  
 count   
-------  
    11  
(1 row)  
  
postgres=# select count(distinct (c3,c2)) from tbl;  
 count   
-------  
  1111  
(1 row)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(distinct (c3,c2)) from tbl;;  
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=17906.00..17906.01 rows=1 width=8) (actual time=6905.660..6905.661 rows=1 loops=1)  
   Output: count(DISTINCT ROW(c3, c2))  
   Buffers: shared hit=5406  
   ->  Seq Scan on public.tbl  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.017..156.436 rows=1000000 loops=1)  
         Output: c1, c2, c3  
         Buffers: shared hit=5406  
 Planning time: 0.062 ms  
 Execution time: 6905.727 ms  
(8 rows)

或使用group by的方法

select count(*) from (select 表達式,.... from tbl group by 表達式,....) t;

postgres=# select count(*) from (select c2,c3 from tbl group by c2,c3) t;  
 count   
-------  
  1111  
(1 row)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from (select c2,c3 from tbl group by c2,c3) t;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=20431.00..20431.01 rows=1 width=8) (actual time=674.609..674.609 rows=1 loops=1)  
   Output: count(*)  
   Buffers: shared hit=5406  
   ->  HashAggregate  (cost=20406.00..20417.11 rows=1111 width=8) (actual time=674.093..674.409 rows=1111 loops=1)  
         Output: tbl.c2, tbl.c3  
         Group Key: tbl.c2, tbl.c3  
         Buffers: shared hit=5406  
         ->  Seq Scan on public.tbl  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.014..143.904 rows=1000000 loops=1)  
               Output: tbl.c1, tbl.c2, tbl.c3  
               Buffers: shared hit=5406  
 Planning time: 0.120 ms  
 Execution time: 674.684 ms  
(12 rows)

最後更新：2017-10-28 23:34:30

PostgreSQL DISTINCT 和 DISTINCT ON 語法的使用

標簽

背景

例子

上一篇： Greenplum 點查詢的優化(分布鍵)

下一篇： Greenplum 模煳查詢實踐

相關內容

熱門內容

最新內容

PostgreSQL DISTINCT 和 DISTINCT ON 語法的使用

標簽

背景

例子

上一篇： Greenplum 點查詢的優化(分布鍵)

下一篇： Greenplum 模煳查詢 實踐

相關內容

熱門內容

最新內容

下一篇： Greenplum 模煳查詢實踐