Window Functions
Window functions allow you to perform calculations across a set of rows that are related to the current row.
For example; we have a table called orders
that contains the following columns: order_id
, customer_id
, order_date
, and order_amount
. You want to calculate the running total of sales for each customer, ordered by their order dates. You can use the SUM
window function to do this:
SELECT order_id, customer_id, order_date, order_amount,
SUM(order_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
In this example, the SUM
function is applied to the order_amount
column and is partitioned by the customer_id
column. This means that the running total will be calculated separately for each customer.
The ORDER BY
clause specifies that the order dates should be used to determine the order of the rows within each partition. This means that the running total will be calculated in the order of the customer’s orders.
The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
clause specifies that the window frame for the calculation should include all rows from the beginning of the partition up to and including the current row. This means that the running total will be calculated from the first order for each customer up to and including the current order.
The result of the query will be a table that contains the same columns as the orders
table, plus an additional column called running_total
, which contains the running total of sales for each customer, ordered by their order dates.
By using Window Functions in SQL, you can perform complex calculations on your data and gain deeper insights into your business. This example demonstrates how to calculate running totals for each customer, but you can use Window Functions to perform many other types of calculations, such as calculating moving averages, ranking data, and more.
Common Table Expressions (CTEs)
CTEs allow you to define a temporary result set that can be used in subsequent SQL statements.
Letโs say we have a table called employees
that contains the following columns: employee_id
, employee_name
, department_id
, and salary
. You want to calculate the average salary for each department, and then find the employees whose salary is higher than their department’s average salary. To do this, you can use a CTE to first calculate the average salary for each department, and then use that CTE in a second query to find the employees whose salary is higher than their department’s average salary.
WITH department_avg_salary AS (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
)
SELECT employee_id, employee_name, salary, department_avg_salary.avg_salary
FROM employees
INNER JOIN department_avg_salary ON employees.department_id = department_avg_salary.department_id
WHERE salary > department_avg_salary.avg_salary;
In this example, the first query defines a CTE called department_avg_salary
. This CTE calculates the average salary for each department, using the AVG
function and a GROUP BY
clause that groups the employees by their department.
The second query then uses the department_avg_salary
CTE as if it were a table, joining it with the employees
table on the department_id
column. The result is filtered by the WHERE
clause to only include employees whose salary is higher than their department’s average salary.
The advantage of using a CTE, in this case, is that it allows you to break down the problem into two steps: First, calculating the average salary for each department, and then selecting the employees whose salary is higher than their departmentโs average salary. By separating the calculation into two steps, the query is easier to read and maintain.
CTEs can be used in many other situations as well, such as for recursive queries, complex joins, and more. By using CTEs, you can make your SQL queries more readable and easier to understand.
Aggregate Functions
Aggregate functions are functions that perform a calculation on a set of values and return a single value as a result. These functions allow us to perform calculations across multiple rows or columns of a table and can be used to summarize data in a meaningful way. The most common aggregate functions in SQL are SUM, AVG, MIN, MAX, and COUNT.
For example, there is a table called sales
that contains the following columns: sale_id
, product_id
, sale_date
, sale_amount
, and region
. You want to calculate the total sales and average sales per product, as well as the top-selling product in each region. To do this, you can use Aggregate Functions to group the sales by product and region and calculate the total and average sales, as well as find the top-selling product in each region.
SELECT
product_id,
AVG(sale_amount) AS avg_sale_amount,
SUM(sale_amount) AS total_sale_amount,
region,
RANK() OVER (PARTITION BY region ORDER BY SUM(sale_amount) DESC) AS rank
FROM sales
GROUP BY product_id, region;
In this example, the query has got three Aggregate Functions: AVG
, SUM
, and RANK
.
The AVG
function calculates the average sale amount per product and region, while the SUM
function calculates the total sale amount per product and region. The GROUP BY
clause groups the sales by product and region.
The RANK
function is used to find the top-selling product in each region. The OVER
clause specifies the ranking should be performed over each region separately, while the PARTITION BY
clause specifies the column to partition the data on (in this case, region
). The ORDER BY
clause specifies that the ranking should be based on the sum of the sale amount for each product in each region, in descending order.
The result of the query includes the product_id
, region
, total_sale_amount
, avg_sale_amount
, and rank
columns. The rank
column indicates the ranking of each product in each region based on total sale amount, with the top-selling product in each region having a rank of 1.
The advantage of using Aggregate Functions, in this case, is that they allow you to group and summarize the data, and calculate useful metrics such as total and average sale amounts. The RANK
function also allows you to find the top-selling product in each region, which can be useful for identifying trends and opportunities for improvement.
Aggregate Functions can be used in many other situations as well, such as calculating minimum and maximum values, counting records, and more. By using Aggregate Functions, you can make your SQL queries more powerful and flexible.
Pivot Tables
Pivot tables are tables that summarize and aggregate data from a larger table in a way that makes it easier to analyze. It allows us to transform data from rows to columns, and display the data in a more meaningful way.
Pivot tables in SQL are created using the PIVOT operator, which is used to aggregate data based on a specific column and display the results in a tabular format.
For example:
SELECT
customer_id,
[1] AS Product1,
[2] AS Product2,
[3] AS Product3,
[4] AS Product4,
[5] AS Product5
FROM (
SELECT
customer_id,
product_id,
order_quantity
FROM orders
) p
PIVOT (
SUM(order_quantity)
FOR product_id IN ([1], [2], [3], [4], [5])
) AS pvt;
In above example, the query uses the PIVOT
operator to pivot the data by product ID, with columns for each product and rows for each customer. The SUM
function is used to calculate the total quantity of each product ordered by each customer.
The subquery p
is used to extract the necessary columns from the orders
table. The PIVOT
operator is then applied to the subquery, with the SUM
function used to calculate the total quantity of each product ordered by each customer. The FOR
clause specifies the pivot column (in this case, product_id
), and the IN
clause specifies the values to pivot on (in this case, [1], [2], [3], [4], [5]
).
The result of the query is a pivot table that shows the total quantity of each product ordered by each customer, with columns for each product and rows for each customer.
Pivot tables can be used in many other situations as well, such as summarizing sales data, analyzing survey results, and more. By using pivot tables, you can make your SQL queries more powerful and flexible.
Subqueries
Subqueries in SQL are nested queries used to retrieve data from one or more tables, and the results of the subquery are used in the main query. They can be used to filter, sort, and group data, and can be classified as single-row or multiple-row subqueries. Subqueries are enclosed in parentheses and can be used in various parts of a SQL statement such as the SELECT, FROM, WHERE, and HAVING clauses.
For example; there are two tables called employees
and salaries
. The employees
table contains the following columns: employee_id
, first_name
, last_name
, and department_id
. The salaries
table contains the following columns: employee_id
, salary
, and salary_date
. You want to find the names of the employees who have the highest salary in each department. To do this, you can use a subquery to find the maximum salary for each department, and then join the result with the employees
and salaries
tables to get the names of the employees who have that salary.
SELECT
e.first_name,
e.last_name,
e.department_id,
s.salary
FROM
employees e
INNER JOIN salaries s ON e.employee_id = s.employee_id
INNER JOIN (
SELECT
department_id,
MAX(salary) AS max_salary
FROM
salaries
GROUP BY
department_id
) m ON s.department_id = m.department_id AND s.salary = m.max_salary;
The below query uses a subquery to find the maximum salary for each department. The subquery is executed first and returns a result set that contains the maximum salary for each department. The main query then joins the employees
and salaries
tables with the result of the subquery, to get the names of the employees who have the highest salary in each department.
The INNER JOIN
clause is used to join the employees
and salaries
tables, using the employee_id
column as the join key. The subquery is joined to the main query using the department_id
column, and the salary
column is used to match the maximum salary for each department.
The result of the query is a table that shows the names of the employees who have the highest salary in each department, along with their department ID and salary.
Cross Joins
Cross Joins are a type of join operation that returns the Cartesian product of two or more tables. They do not use a join condition but instead combine every row from one table with every row from the other table. The resulting table contains all possible combinations of rows from both tables. Cross joins can be useful in certain situations, such as when generating test data or when performing a calculation that requires all possible combinations of values from multiple tables. However, they can also be computationally expensive and can generate very large result sets.
In below example; there are 2 tables called customers
and orders
. The customers
table contains the following columns: customer_id
, customer_name
, and city
. The orders
table contains the following columns: order_id
, customer_id
, and order_date
. You want to find the total number of orders placed by each customer in each city. To do this, you can use a cross join to generate a result set that combines every customer with every city, and then join the result with the orders
table to get the number of orders for each combination.
SELECT
c.customer_id,
c.customer_name,
c.city,
COUNT(o.order_id) AS order_count
FROM
customers c
CROSS JOIN (
SELECT DISTINCT
city
FROM
customers
) cities
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE
c.city = cities.city
GROUP BY
c.customer_id,
c.customer_name,
c.city;
In below example, the query uses a cross join to generate a result set that combines every customer with every city. The cross join is executed first, and returns a result set that contains every combination of customer and city. The main query then joins the result of the cross join with the orders
table, using a left join to ensure that all customers are included in the result even if they have not placed any orders.
The WHERE
clause is used to filter the result to only include rows where the customer’s city matches the city in the cross join. This ensures that the result only shows the number of orders for each customer in their respective city.
The GROUP BY
clause is used to group the result by customer ID, customer name, and city. The COUNT()
function is used to count the number of orders for each customer in each city.
The result of the query is a table that shows the total number of orders placed by each customer in each city.
Temporary Tables
Temporary tables in SQL are tables that are created and used for a specific purpose during the execution of a SQL statement or transaction. They are stored in memory or on disk, and are automatically dropped when the session that created them ends or when they are no longer needed. Temporary tables are often used to store intermediate results, or to break down complex queries into smaller, more manageable parts.
They can be created using the CREATE TEMPORARY TABLE statement and can be manipulated like regular tables using SQL commands such as SELECT, INSERT, UPDATE, and DELETE. Temporary tables can be very useful for optimizing complex queries and improving performance, as they can help reduce the amount of data that needs to be processed at any given time.
Letโs say I have a table called sales
that contains the following columns: date
, product
, category
, and sales_amount
. You want to create a report that shows the total sales for each category for each month over the past year. To do this, you can use a temporary table to create a summary of the sales data for each month, and then join the temporary table with the sales
table to get the total sales for each category.
First, you can create the temporary table using a CREATE TEMPORARY TABLE
statement:
CREATE TEMPORARY TABLE monthly_sales_summary (
month DATE,
category VARCHAR(50),
total_sales DECIMAL(10,2)
);
This statement creates a temporary table called monthly_sales_summary
with three columns: month
, category
, and total_sales
. The month
column is of type DATE
, the category
column is of type VARCHAR(50)
, and the total_sales
column is of type DECIMAL(10,2)
.
Next, I populate the temporary table with the summary data using an INSERT INTO
statement:
INSERT INTO monthly_sales_summary (month, category, total_sales)
SELECT
DATE_TRUNC('month', date) AS month,
category,
SUM(sales_amount) AS total_sales
FROM
sales
WHERE
date >= DATE_TRUNC('year', CURRENT_DATE) -- sales from the past year
GROUP BY
DATE_TRUNC('month', date),
category;
This statement uses the DATE_TRUNC
function to truncate the date
column to the month level, grouping the sales data by month and category. The result of this query is inserted into the monthly_sales_summary
table, which now contains a summary of the sales data for each month.
Finally, I can join the temporary table with the sales
table to get the total sales for each category:
SELECT s.category, mss.month, mss.total_sales
FROM sales s JOIN monthly_sales_summary mss ON s.category = mss.category AND DATE_TRUNC('month', s.date)=mss.month
WHERE s.date >= DATE_TRUNC('year', CURRENT_DATE) -- sales from the past year
ORDER BY s.category, mss.month;
This statement joins the sales
table with the monthly_sales_summary
table on the category
and month
columns, and selects the category
, month
, and total_sales
columns from the temporary table. The WHERE
clause is used to filter the result to only include sales data from the past year, and the ORDER BY
clause is used to sort the result by category and month.
The result of the query is a table that shows the total sales for each category for each month over the past year.
Materialized Views
Materialized views in SQL are precomputed result sets that are stored as physical tables. They are created and maintained based on a SQL query and are used to improve the performance of frequently executed queries. Materialized views can be refreshed on a schedule or on demand to ensure that the data is up to date. When a query is executed against a materialized view, the result set is retrieved from the physical table, rather than being computed from the original tables.
This can result in significant performance improvements, especially for complex queries that involve joins or aggregation functions. Materialized views are commonly used in data warehousing and business intelligence applications, where they can help speed up reports and dashboards.
For example; there is a large table called sales
that contains the following columns: date
, product
, category
, and sales_amount
. You want to create a report that shows the total sales for each category for each month over the past year. However, running this query directly on the sales
table would be slow, as it contains millions of rows. To speed up the query, you can create a materialized view that summarizes the sales data by month and category.
To create a materialized view, you can use the CREATE MATERIALIZED VIEW
statement, like this:
CREATE MATERIALIZED VIEW monthly_sales_summary AS
SELECT
DATE_TRUNC('month', date) AS month,
category,
SUM(sales_amount) AS total_sales
FROM
sales
WHERE
date >= DATE_TRUNC('year', CURRENT_DATE) -- sales from the past year
GROUP BY
DATE_TRUNC('month', date),
category;
This statement creates a materialized view called monthly_sales_summary
, which contains a summary of the sales data for each month and category. The SELECT
statement is the same as the one used to create the temporary table in the previous example, but instead of using a temporary table, the result is stored in the materialized view.
Materialized views are similar to tables in that they store data on disk, but they are updated automatically when the underlying data changes. You can refresh the materialized view manually using the REFRESH MATERIALIZED VIEW
statement, or you can set up a scheduled refresh using a cron job or another scheduling tool.
Once the materialized view has been created, you can query it like any other table:
SELECT category, month, total_sales FROM monthly_sales_summary ORDER BY category, month;
This statement selects the category
, month
, and total_sales
columns from the monthly_sales_summary
materialized view, and sorts the result by category and month.
The advantage of using a materialized view, in this case, is that it allows you to precompute and store the summary data, reducing the amount of time it takes to run the query. Materialized views are particularly useful for reports that are run frequently and require complex calculations on large datasets. However, they do have some limitations, such as the fact that they can take up a significant amount of disk space, and may not be updated immediately when the underlying data changes.
Final Thoughts
Mastering advanced SQL techniques such as window functions, CTEs, aggregate functions, pivot tables, subqueries, cross joins, temporary tables, and materialized views can help you tackle complex data analysis tasks more effectively.
Remember to stay up-to-date with the latest developments and use these techniques judiciously based on your specific needs and constraints. By developing your SQL skills over time, you can unlock new insights and stay ahead of the curve in a data-driven world.
credit: https://medium.com/@atakankorez/advanced-sql-techniques-you-should-know-66f7d3ac638f