06 July, 2015

Analytical Functions - Introduction

1. What we can do with analytical functions can be done through simple joins and subqueries, but aggregate functions are always faster or atleast equal to native SQL queries.

2. The “group by” works similar to the aggregate function, but the “non-group by” columns are not allowed in the query. In aggregate func, this restriction is omitted and we can see the non-group by data also and the grouped value is repeated for all the results.

select student_class, count(*) cnt from students
where class in ('6A','6B')
group by class

select student_class, house, count(*) over (partition by student_class) cnt from students
where class in ('CF','CHILD')

3. The Analytical functions are computed after all the where, join, group by and having clauses are complete and before the order by clause. So, the analytical functions can appear only in the select clause and main order by clause of the query.

4. If the condition inside the OVER() clause is left empty, the analytical functions works on the entire resultset. We can use any non-analytical function in the OVER() clause.
select status, ext_batch_id, count(*) over () cnt from gefi_batch_header
where status in ('CF','CHILD')

5. The functions SUM, COUNT, AVG, MIN, MAX are the common analytic functions the result of which does not depend on the order of the records.
Functions like LEAD, LAG, RANK, DENSE_RANK, ROW_NUMBER, FIRST, FIRST VALUE, LAST, LAST VALUE depends on order of records. In the next example we will see how to specify that. We can order the records using the ORDER BY clause inside the OVER() clause.

Now , we will explore each of the above mentioned analytical functions in detail:

ROW_NUMBER, RANK and DENSE_RANK

The order by clause of the row_number() function, does not need to be in the select part of the query. This is not possible in the normal order by clause used in the query.
select status, ext_batch_id, row_number() over (partition by status order by batch_id) cnt from gefi_batch_header
where status in ('CF','CHILD')

You can even omit the partitioning clause and still use the order by in the OVER() clause.
select status, ext_batch_id, row_number() over (order by batch_id) cnt from gefi_batch_header
where status in ('CF','CHILD')

RANK and DENSE_RANK, both provide a rank to the resultset rows, but there is a slight difference in the 2 approaches:
SELECT empno, deptno, sal,
RANK() OVER (PARTITION BY deptno
ORDER BY sal DESC NULLS LAST) RANK,
DENSE_RANK() OVER (PARTITION BY
deptno ORDER BY sal DESC NULLS
LAST) DENSE_RANK
FROM emp
WHERE deptno IN (10, 20)
ORDER BY 2, RANK;
 
EMPNO  DEPTNO   SAL  RANK DENSE_RANK
------ ------- ----- ----- ----------
  7839      10  5000     1          1
  7782      10  2450     2          2
  7934      10  1300     3          3
  7788      20  3000     1          1
  7902      20  3000     1          1
  7566      20  2975     3          2
  7876      20  1100     4          3
  7369      20   800     5          4
 
8 rows selected.


As you can see from the query and its result, the RANK skips the rank number, if it finds 2 results with the same rank level. The next in order is assigned the next possible rank. But in case if DENSE_RANK, the values are not skipped and hence it is dense[may be].