SQL window functions in data science interviews conducted by Airbnb, Netflix, Twitter, and Uber

May 17, 2022

Window functions are a group of functions that perform calculations across a set of rows related to your current row. They are considered advanced SQL and are often asked about during data science interviews. They are also used a lot at work to solve many different kinds of problems. Let’s summarize the 4 different types of window functions and explain why and when you would use them.

4 types of window functions
1. Regular aggregate functions
o These are aggregates such as AVG, MIN/MAX, COUNT, SUM
o You will want to use them to aggregate your data and group it by another column such as month or year
2. Ranking functions
o ROW_NUMBER, RANK, DENSE_RANK
o These are functions that help you rank your data. You can rank your entire dataset or rank within groups, such as by month or country.
o Extremely useful for generating ranking indexes within groups
3. Generating statistics
o NTILE is great if you need to output simple statistics such as percentiles, quartiles, or medians
o You can use this for your entire dataset or per group
4. Handling time series data
o A very common use of window functions, especially if you need to calculate trends such as a month-over-month moving average or a growth metric.
o LAG and LEAD are the two functions that allow you to do this.

1. Regular aggregate functions

Regular aggregate functions are functions like average, count, sum, and min/max that are applied to columns. You use them as window functions when you want to apply an aggregation to different groups in the dataset, such as by month.

This is similar to the kind of calculation you can do with an aggregate function in the SELECT clause, but unlike regular aggregate functions, window functions don’t collapse multiple rows into a single output row. The rows are grouped for the calculation but retain their own identities.
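
To make that difference concrete, here is a minimal sketch that runs both styles of query side by side. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later (the first release with window functions); the trips table and its three rows are made up for illustration and are not the interview dataset.

```python
import sqlite3  # assumes the bundled SQLite is 3.25+ (window function support)

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per trip, with its month and a distance-per-dollar value.
conn.execute("CREATE TABLE trips (request_month TEXT, dist_to_cost REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2020-01", 2.0), ("2020-01", 4.0), ("2020-02", 6.0)],
)

# GROUP BY collapses each month into a single output row.
grouped = conn.execute(
    """
    SELECT request_month, AVG(dist_to_cost)
    FROM trips
    GROUP BY request_month
    ORDER BY request_month
    """
).fetchall()

# The window version keeps every input row and attaches the month average to it.
windowed = conn.execute(
    """
    SELECT request_month,
           dist_to_cost,
           AVG(dist_to_cost) OVER (PARTITION BY request_month) AS month_avg
    FROM trips
    ORDER BY request_month, dist_to_cost
    """
).fetchall()

print(grouped)   # 2 rows: one per month
print(windowed)  # 3 rows: one per trip, each carrying its month's average
```

The grouped query returns one row per month, while the windowed query returns all three trips, so each trip can be compared directly against its own month's average.
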
AVG() Example:
Let’s take a look at an example of an AVG() window function used to answer a data analysis question. You can see the question and write the code at the following link:
platform.stratascratch.com/coding-question?id=10302&python=

This is a perfect example of using a window function and applying AVG() to a month group. Here we are trying to calculate the average distance per dollar by month. This is hard to do in SQL without a window function. Here we have applied the AVG() window function to the third column, where we find the average value for each month-year in the dataset. We can use this metric to calculate the difference between the month average and the date average for each request date in the table.

The code to implement the window function would look like this:

SELECT a.request_date,
       a.dist_to_cost,
       -- month average attached to every row; rows are not collapsed
       AVG(a.dist_to_cost) OVER(PARTITION BY a.request_month) AS avg_dist_to_cost
FROM
  (SELECT *,
          to_char(request_date::date, 'YYYY-MM') AS request_month,
          (distance_to_trip/monetary_cost) AS dist_to_cost
   FROM uber_request_logs) a
ORDER BY request_date

2. Ranking functions
Ranking functions are an important utility for a data scientist. You are always ranking and indexing your data to better understand which rows are the best in your dataset. SQL window functions give you 3 ranking utilities, RANK(), DENSE_RANK(), and ROW_NUMBER(), to choose from depending on your exact use case. These functions will help you list your data in order and within groups, depending on what you want.
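
The three ranking functions only differ in how they treat ties, which is easiest to see on a tiny example. The following is a sketch, assuming Python's built-in sqlite3 module with SQLite 3.25 or later; the employee table and its salaries are hypothetical.

```python
import sqlite3  # assumes the bundled SQLite is 3.25+ (window function support)

conn = sqlite3.connect(":memory:")
# Hypothetical table with a deliberate tie at salary 200.
conn.execute("CREATE TABLE employee (salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?)", [(300,), (200,), (200,), (100,)])

rows = conn.execute(
    """
    SELECT salary,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- ties share a rank, next rank is skipped
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk,  -- ties share a rank, no gap afterwards
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num     -- every row gets a distinct number
    FROM employee
    ORDER BY row_num
    """
).fetchall()

for row in rows:
    print(row)
# (300, 1, 1, 1)
# (200, 2, 2, 2)
# (200, 2, 2, 3)
# (100, 4, 3, 4)
```

RANK() jumps from 2 to 4 after the tie, DENSE_RANK() continues with 3, and ROW_NUMBER() breaks the tie arbitrarily, so pick the one that matches how you want tied rows counted.
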
RANK() Example:
Let’s take a look at a ranking window function example to see how we can rank data within groups using SQL window functions. Follow along interactively with this link: platform.stratascratch.com/coding-question?id=9898&python=

Here we want to find the highest salaries by department. We can’t just find the top 3 salaries without a window function, because that would only give us the top 3 salaries across all departments, so we need to rank salaries within each department individually. This is done with RANK() partitioned by department. From there, it’s very easy to filter for the top 3 in every department.

Here is the code to generate this table. You can copy and paste into the SQL editor at the link above and see the same result.

SELECT department,
       salary,
       RANK() OVER (PARTITION BY a.department
                    ORDER BY a.salary DESC) AS rank_id
FROM
  (SELECT department, salary
   FROM twitter_employee
   GROUP BY department, salary
   ORDER BY department, salary) a
ORDER BY department,
         salary DESC
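
The query above ranks every salary; keeping only the top 3 per department needs one more step, because a window function's result cannot be filtered in the WHERE clause of the same SELECT. One way to do it, sketched here with Python's built-in sqlite3 module (SQLite 3.25 or later assumed) and a hypothetical employee table, is to wrap the ranking in a subquery and filter on rank_id outside it.

```python
import sqlite3  # assumes the bundled SQLite is 3.25+ (window function support)

conn = sqlite3.connect(":memory:")
# Hypothetical data: four distinct salaries in eng, two in sales.
conn.execute("CREATE TABLE employee (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("eng", 500), ("eng", 400), ("eng", 300), ("eng", 200),
     ("sales", 250), ("sales", 150)],
)

top3 = conn.execute(
    """
    SELECT department, salary, rank_id
    FROM
      (SELECT department,
              salary,
              RANK() OVER (PARTITION BY department
                           ORDER BY salary DESC) AS rank_id
       FROM employee) ranked
    WHERE rank_id <= 3          -- filter on the rank in the outer query
    ORDER BY department, salary DESC
    """
).fetchall()

for row in top3:
    print(row)
```

The eng salary of 200 is dropped because it ranks fourth in its department, while sales keeps both of its rows because it has fewer than three.
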

3. NTILE
NTILE is a very useful function for those in data analytics, business analytics, and data science. Often when dealing with statistical data, you need to create robust statistics like quartiles, quintiles, medians, and deciles in your daily work, and NTILE makes it easy to generate these results.

NTILE takes one argument, the number of bins you want to split your data into, and then assigns each row to one of that many bins. You set how the data is sorted, and how it is partitioned if you want additional groupings.
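
As a small illustration of the bin argument and the optional partition, the sketch below splits each state's rows into quartiles with NTILE(4) and keeps the top bin. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later; the claims table and its scores are invented for the example.

```python
import sqlite3  # assumes the bundled SQLite is 3.25+ (window function support)

conn = sqlite3.connect(":memory:")
# Hypothetical table: four scored claims in each of two states.
conn.execute("CREATE TABLE claims (state TEXT, fraud_score REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [("CA", 0.9), ("CA", 0.7), ("CA", 0.5), ("CA", 0.3),
     ("NY", 0.8), ("NY", 0.6), ("NY", 0.4), ("NY", 0.2)],
)

top_quartile = conn.execute(
    """
    SELECT state, fraud_score
    FROM
      (SELECT state,
              fraud_score,
              -- 4 bins per state; bin 1 holds the highest scores
              NTILE(4) OVER (PARTITION BY state
                             ORDER BY fraud_score DESC) AS quartile
       FROM claims) binned
    WHERE quartile = 1
    ORDER BY state
    """
).fetchall()

print(top_quartile)  # one row per state: the highest-scoring claim in each
```

Changing the argument changes the granularity: NTILE(100) gives percentiles, NTILE(10) deciles, and NTILE(2) splits each group at the median.
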

NTILE(100) Example
In this example, we will learn how to use NTILE to categorize our data into percentiles. You can follow it interactively at the link here: platform.stratascratch.com/coding-question?id=10303&python=

What you’re trying to do here is identify the top 5 percent of claims based on the score an algorithm generates. But you can’t just find the top 5% with an ORDER BY, because you want to find the top 5% by state. So one way to do this is to use the NTILE() ranking function and PARTITION BY state. You can then apply a filter in the WHERE clause to get the top 5%.

Here is the code to display the entire table above. You can copy and paste it into the link above.

SELECT policy_number,
       state,
       cost_claim,
       fraud_score,
       percentile
FROM
  (SELECT *,
          NTILE(100) OVER(PARTITION BY state
                          ORDER BY fraud_score DESC) AS percentile
   FROM fraud_score) a
WHERE percentile <= 5

4. Handling time series data
LAG and LEAD are two window functions that are useful for handling time series data. The only difference between LAG and LEAD is whether you want to get data from previous or next rows, almost like sampling previous or future data.

You can use LAG and LEAD to calculate month-over-month growth or moving averages. As a data scientist or business analyst, you are always dealing with time series data and creating those time metrics.
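
Here is a minimal sketch of both functions on a three-row series, again assuming Python's built-in sqlite3 module with SQLite 3.25 or later. The host_counts table and its numbers are hypothetical; the growth column simply compares each row with the value LAG pulls from the previous row.

```python
import sqlite3  # assumes the bundled SQLite is 3.25+ (window function support)

conn = sqlite3.connect(":memory:")
# Hypothetical yearly counts.
conn.execute("CREATE TABLE host_counts (year INTEGER, hosts INTEGER)")
conn.executemany(
    "INSERT INTO host_counts VALUES (?, ?)",
    [(2019, 100), (2020, 150), (2021, 300)],
)

rows = conn.execute(
    """
    SELECT year,
           hosts,
           LAG(hosts, 1)  OVER (ORDER BY year) AS prev_hosts,  -- value from the previous row
           LEAD(hosts, 1) OVER (ORDER BY year) AS next_hosts,  -- value from the next row
           100.0 * (hosts - LAG(hosts, 1) OVER (ORDER BY year))
                 / LAG(hosts, 1) OVER (ORDER BY year) AS growth_pct
    FROM host_counts
    ORDER BY year
    """
).fetchall()

for row in rows:
    print(row)
# (2019, 100, None, 150, None)
# (2020, 150, 100, 300, 50.0)
# (2021, 300, 150, None, 100.0)
```

The first row has no previous value and the last row has no next value, so LAG and LEAD return NULL there (None in Python), and any metric built on them is NULL as well.
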

LAG() Example:
In this example, we want to find the percentage growth year over year, which is a very common question that data scientists and business analysts answer on a daily basis. The problem statement, data, and SQL editor are at the following link if you want to try coding the solution yourself: platform.stratascratch.com/coding-question?id=9637&python=

The tricky thing about this problem is the way the data is set up: you need to use the value from the previous row in your metric. But SQL isn’t built to do that. SQL is built to compute whatever you want, as long as the values are in the same row. So we can use the LAG() or LEAD() window function, which takes the value from the previous or next row and puts it in the current row, which is what this question is asking.

Here is the code to display the entire table above. You can copy and paste the code into the SQL editor at the link above:

SELECT year,
       current_year_host,
       prev_year_host,
       round(((current_year_host - prev_year_host)/(cast(prev_year_host AS numeric)))*100) estimated_growth
FROM
  (SELECT year,
          current_year_host,
          LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
   FROM
     (SELECT extract(year
                     FROM host_since::date) AS year,
             count(id) current_year_host
      FROM airbnb_search_details
      WHERE host_since IS NOT NULL
      GROUP BY extract(year
                       FROM host_since::date)
      ORDER BY year) t1) t2

admin
