580
技術社區[雲棲]
PostgreSQL 聚合函數講解 - 6 分組排序聚合
分組排序聚合的例子.
Table 9-51. Ordered-Set Aggregate Functions
Function | Direct Argument Type(s) | Aggregated Argument Type(s) | Return Type | Description |
---|---|---|---|---|
mode() WITHIN GROUP (ORDER BYsort_expression) |
any sortable type | same as sort expression | returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results) | |
percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression) |
double precision | double precisionor interval | same as sort expression | continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed |
percentile_cont(fractions) WITHIN GROUP (ORDER BY sort_expression) |
double precision[] | double precisionor interval | array of sort expression's type | multiple continuous percentile: returns an array of results matching the shape of the fractionsparameter, with each non-null element replaced by the value corresponding to that percentile |
percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression) |
double precision | any sortable type | same as sort expression | discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction |
percentile_disc(fractions) WITHIN GROUP (ORDER BY sort_expression) |
double precision[] | any sortable type | array of sort expression's type | multiple discrete percentile: returns an array of results matching the shape of the fractionsparameter, with each non-null element replaced by the input value corresponding to that percentile |
All the aggregates listed in Table 9-51 ignore null values in their sorted input. For those that take a fraction parameter, the fraction value must be between 0 and 1; an error is thrown if not. However, a null fraction value simply produces a null result.
mode比較好理解, 就是取分組中出現頻率最高的值或表達式, 如果最高頻率的值有多個, 則隨機取一個.
postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test values (1,'test1');
INSERT 0 1
postgres=# insert into test values (1,'test1');
INSERT 0 1
postgres=# insert into test values (1,'test2');
INSERT 0 1
postgres=# insert into test values (1,'test3');
INSERT 0 1
postgres=# insert into test values (2,'test1');
INSERT 0 1
postgres=# insert into test values (2,'test1');
INSERT 0 1
postgres=# insert into test values (2,'test1');
INSERT 0 1
postgres=# insert into test values (3,'test4');
INSERT 0 1
postgres=# insert into test values (3,'test4');
INSERT 0 1
postgres=# insert into test values (3,'test4');
INSERT 0 1
postgres=# insert into test values (3,'test4');
INSERT 0 1
postgres=# insert into test values (3,'test4');
INSERT 0 1
postgres=# select * from test;
id | info
----+-------
1 | test1
1 | test1
1 | test2
1 | test3
2 | test1
2 | test1
2 | test1
3 | test4
3 | test4
3 | test4
3 | test4
3 | test4
(12 rows)
取出所有數據中, 出現頻率最高的info, 有可能是test1也有可能是test4, 因為他們的出現頻率一致.
mode的返回結果數據類型和order by後麵的表達式一致.
postgres=# select mode() within group (order by info) from test;
mode
-------
test1
(1 row)
如果按INFO來分組的話, 取出出現頻率最高的info, 實際上這個操作是沒有任何意義的, 返回值就是所有記錄的info的唯一值.
postgres=# select mode() within group (order by info) from test group by info;
mode
-------
test1
test2
test3
test4
(4 rows)
按id來分組, 取出組內出現頻率最高的info值, 這個是有意義的.
postgres=# select mode() within group (order by info) from test group by id;
mode
-------
test1
test1
test4
(3 rows)
id=1 , 出現頻率最高的info是test1. 出現2次.
如下 :
postgres=# select id,info,count(*) from test group by id,info;
id | info | count
----+-------+-------
1 | test1 | 2
1 | test3 | 1
3 | test4 | 5
1 | test2 | 1
2 | test1 | 3
(5 rows)
如果要返回mode()並返回頻率次數. 可以使用row_number()窗口來實現. 如下.
postgres=# select id,info,cnt from (select id,info,cnt,row_number() over(partition by id order by cnt desc) as rn from (select id,info,count(*) cnt from test group by id,info) t) t where t.rn=1;
id | info | cnt
----+-------+-----
1 | test1 | 2
2 | test1 | 3
3 | test4 | 5
(3 rows)
其他, mode的返回結果數據類型和order by後麵的表達式一致.
postgres=# select mode() within group (order by id) from test;
mode
------
3
(1 row)
postgres=# select mode() within group (order by id+1) from test;
mode
------
4
(1 row)
另外還有4個函數是和數據分布有關的, 需要指定從0到1的分布位置. 返回排序後, 在指定分布位置的值或表達式的值.
src/backend/utils/adt/orderedsetaggs.c
if (percentile < 0 || percentile > 1 || isnan(percentile))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("percentile value %g is not between 0 and 1",
percentile)));
同時還需要注意區分連續分布和離散分布.
postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test values (1,'test1');
INSERT 0 1
postgres=# insert into test values (2,'test2');
INSERT 0 1
postgres=# insert into test values (3,'test2');
INSERT 0 1
postgres=# insert into test values (4,'test2');
INSERT 0 1
postgres=# insert into test values (5,'test2');
INSERT 0 1
postgres=# insert into test values (6,'test2');
INSERT 0 1
postgres=# insert into test values (7,'test2');
INSERT 0 1
postgres=# insert into test values (8,'test3');
INSERT 0 1
postgres=# insert into test values (100,'test3');
INSERT 0 1
postgres=# insert into test values (1000,'test4');
INSERT 0 1
postgres=# select * from test;
id | info
------+-------
1 | test1
2 | test2
3 | test2
4 | test2
5 | test2
6 | test2
7 | test2
8 | test3
100 | test3
1000 | test4
(10 rows)
取連續分布的中位數可以用percentile_cont(0.5)來獲得.
postgres=# select percentile_cont(0.5) within group (order by id) from test;
percentile_cont
-----------------
5.5
(1 row)
這個5.5是怎麼計算來的呢? 參考本文末尾 :
If (CRN = FRN = RN) then the result is
(value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
解釋 :
N = 當前分組的行數 = 10
RN = (1+傳入參數*(N-1)) = (1+0.5*(10-1)) = 5.5
CRN = ceiling(RN) = 6
FRN = floor(RN) = 5
value of expression for row at FRN : 當前分組內第FRN行的值 = 5
value of expression for row at CRN : 當前分組內第CRN行的值 = 6
所以最終中位數值 :
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN) =
(6-5.5)*(5) + (5.5 - 5)*(6) = 5.5;
使用info分組 :
postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;
percentile_cont | info
-----------------+-------
1 | test1
4.5 | test2
54 | test3
1000 | test4
(4 rows)
驗證這個值4.5 | test2 :
2 | test2
3 | test2
4 | test2
5 | test2
6 | test2
7 | test2
N = 當前分組的行數 = 6
RN = (1+傳入參數*(N-1)) = (1+0.5*(6-1)) = 3.5
CRN = ceiling(RN) = 4
FRN = floor(RN) = 3
value of expression for row at FRN : 當前分組內第FRN行的值 = 4
value of expression for row at CRN : 當前分組內第CRN行的值 = 5
所以最終中位數值 :
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN) =
(4-3.5)*(4) + (3.5 - 3)*(5) = 4.5;
當輸入參數為數組時, 返回值也是數組, 如下 :
postgres=# select percentile_cont(array[0.5, 1]) within group (order by id) from test;
percentile_cont
-----------------
{5.5,1000}
(1 row)
接下來看一下稀疏分布 :
返回行號大於等於指定百分比的值或表達式值.
例如 :
postgres=# select id from test;
id
------
1
2
3
4
5
6
7
8
100
1000
(10 rows)
當前組一共10行, 取位置在0.5的.即行號>=0.5*10的第一行的值或表達式的值.
postgres=# select percentile_disc(0.5) within group (order by id) from test;
percentile_disc
-----------------
5
(1 row)
postgres=# select percentile_disc(0.5) within group (order by id^2) from test;
percentile_disc
-----------------
25
(1 row)
輸入0.11, 表示行號返回>=1.1的第一行的值.
postgres=# select percentile_disc(0.11) within group (order by id) from test;
percentile_disc
-----------------
2
(1 row)
再看個例子
postgres=# select id,info,count(*) over (partition by info) from test;
id | info | count
------+-------+-------
1 | test1 | 1
2 | test2 | 6
3 | test2 | 6
4 | test2 | 6
5 | test2 | 6
6 | test2 | 6
7 | test2 | 6
8 | test3 | 2
100 | test3 | 2
1000 | test4 | 1
(10 rows)
取分組的數據, 主要看test2 這個組一共有6行, 0.3*6=1.8, 所以它需要取第二行的數據.
postgres=# select info,percentile_disc(0.3) within group (order by id) from test group by info;
info | percentile_disc
-------+-----------------
test1 | 1
test2 | 3
test3 | 8
test4 | 1000
(4 rows)
[注意]
最終計算的是表達式的分布數, 而不是計算列值的分布數後再計算表達式.
驗證如下 :
或者你可以看代碼 :
postgres=# select percentile_cont(0.5) within group (order by id^2),info from test group by info;
percentile_cont | info
-----------------+-------
1 | test1
20.5 | test2
5032 | test3
1000000 | test4
(4 rows)
postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;
percentile_cont | info
-----------------+-------
1 | test1
4.5 | test2
54 | test3
1000 | test4
(4 rows)
postgres=# select 4.5^2;
?column?
---------------------
20.2500000000000000
(1 row)
postgres=# select 54^2;
?column?
----------
2916
(1 row)
5. src/backend/utils/adt/orderedsetaggs.c
6. 算法 :
PERCENTILE_CONT函數解釋 :
The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, you can compute the row number you are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+(P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
The final result will be:
If (CRN = FRN = RN) then the result is
(value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
PERCENTILE_DISC函數解釋 :
The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value. This expression must be constant within each aggregate group. The ORDER BY clause takes a single expression that can be of any type that can be sorted.
For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.
MEDIAN(中位數)詳解, Oracle有單獨的計算中位數的函數, 實際上就是PERCENTILE_CONT(0.5) :
MEDIAN is an inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns the middle value or an interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.
This function takes as arguments any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. If you specify only expr, then the function returns the same data type as the numeric data type of the argument. If you specify the OVER clause, then Oracle Database determines the argument with the highest numeric precedence, implicitly converts the remaining arguments to that data type, and returns that data type.
The result of MEDIAN is computed by first ordering the rows. Using N as the number of rows in the group, Oracle calculates the row number (RN) of interest with the formula RN = (1 + (0.5*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
The final result will be:
if (CRN = FRN = RN) then
(value of expression from row at RN)
else
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
最後更新:2017-04-01 13:38:50