
Image by Author | Canva
The Pandas library has one of the fastest-growing communities. This popularity has opened the door for alternatives, like Polars. In this article, we will explore one such alternative, DuckDB.
DuckDB is an SQL database that you can run right in your notebook. No setup is required, and no servers are needed. It is easy to install and can work with Pandas in parallel.
Unlike other SQL databases, you don't have to configure a server. It just works with your notebook after installation. That means no local setup headaches; you're writing code immediately. DuckDB handles filtering, joins, and aggregations with clean SQL syntax, compared to Pandas, and performs significantly better on large datasets.
So enough with the words, let's get started!
Data Project – Uber Business Modeling
We will use DuckDB with Jupyter Notebook, combining it with Python for data analysis. To make things more exciting, we will work on a real-life data project. Let's get started!
Here is the link to the data project we'll be using in this article. It is a data project from Uber called Partner's Business Modeling.
Uber used this data project in its recruitment process for data science positions, and you will be asked to analyze the data for two different scenarios.
- Scenario 1: Compare the cost of two bonus programs designed to get more drivers online during a busy day.
- Scenario 2: Calculate and compare the annual net income of a traditional taxi driver vs one who partners with Uber and buys a car.
Loading Dataset
Let's load the dataframe first. We need this step because we will register this dataset with DuckDB in the following sections.
import pandas as pd
df = pd.read_csv("dataset_2.csv")
Exploring the Dataset
Here are the first few rows:
Let's see all the columns.
Here is the output.
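If you want to follow along without the original CSV, here is a minimal sketch of this exploration step. The few rows below are hypothetical, shaped like the project's columns; the real values come from dataset_2.csv.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from dataset_2.csv
df = pd.DataFrame({
    "Trips Completed": [2, 15],
    "Accept Rate": ["100%", "90%"],
    "Supply Hours": [3, 11],
    "Rating": [4.9, 4.6],
})

print(df.head())            # first few rows
print(df.columns.tolist())  # all column names
```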
Connect DuckDB and Register the DataFrame
Great, it's a really simple dataset, but how do we connect DuckDB with it?
First, if you have not installed it yet, install DuckDB.
Connecting with DuckDB is easy. Also, if you want to read the documentation, check it out here.
Now, here is the code to make a connection and register the dataframe.
import duckdb
con = duckdb.connect()
con.register("data", df)  # exposes df to SQL as the "data" table
Great, let's start exploring seven queries that can save you hours of Pandas work!
1. Multi-Criteria Filtering for Complex Eligibility Rules
One of the most important advantages of SQL is how naturally it handles filtering, especially multi-condition filtering.
Implementation of Multi-Criteria Filtering in DuckDB vs Pandas
DuckDB lets you apply multiple filters using SQL's WHERE clause and AND logic, which scales well as the number of filters grows.
SELECT
*
FROM data
WHERE condition_1
AND condition_2
AND condition_3
AND condition_4
Now let's see how we would write the same logic in Pandas. In Pandas, the same logic is expressed using chained boolean masks wrapped in parentheses, which can get verbose with many conditions.
filtered_df = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"]) &
(df["condition_4"])
]
Both approaches are equally readable and applicable for basic use. DuckDB feels more natural and cleaner as the logic gets more complex.
Multi-Criteria Filtering for the Uber Data Project
In this case, we want to find drivers who qualify for a specific Uber bonus program.
According to the rules, the drivers must:
- Be online for at least 8 hours
- Complete at least 10 trips
- Accept at least 90% of ride requests
- Have a rating of 4.7 or above
Now all we have to do is write a query that applies all these filters. Here is the code.
SELECT
COUNT(*) AS qualified_drivers,
COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
AND "Trips Completed" >= 10
AND Rating >= 4.7
But to execute this code with Python, we need to wrap it in con.execute(""" """) and call fetchdf(), as shown below:
con.execute("""
SELECT
COUNT(*) AS qualified_drivers,
COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
AND "Trips Completed" >= 10
AND Rating >= 4.7
""").fetchdf()
We will do this throughout the article. Now that you know how to run it in a Jupyter notebook, we'll show only the SQL code from now on, and you'll know how to convert it to the Pythonic version.
Good. Now, remember that the data project wants us to calculate the total payout for Option 1.
We have counted the qualifying drivers, but we should multiply this count by $50, because the payout is $50 for each driver, so we do it with COUNT(*) * 50.
Here is the output.
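For comparison, here is a sketch of the same Option 1 filter in Pandas, run against a few made-up rows. The column names follow the query above; the data itself is illustrative.

```python
import pandas as pd

# Made-up sample rows; column names mirror the Uber project's dataset
df = pd.DataFrame({
    "Supply Hours": [8, 10, 6],
    "Accept Rate": ["95%", "85%", "100%"],
    "Trips Completed": [10, 20, 15],
    "Rating": [4.7, 4.9, 4.8],
})

# Strip the '%' sign so the acceptance rate can be compared numerically
accept_rate = df["Accept Rate"].str.rstrip("%").astype(float)

qualified = df[
    (df["Supply Hours"] >= 8)
    & (accept_rate >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
]
total_payout = len(qualified) * 50  # $50 per qualifying driver
```

Only the first sample row passes all four conditions here, so the payout is $50.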
2. Quick Aggregation to Estimate Business Incentives
SQL is great for quick aggregation, especially when you need to summarize data across rows.
Implementation of Aggregation in DuckDB vs Pandas
DuckDB lets you aggregate values across rows using SQL functions like SUM and COUNT in a single compact block.
SELECT
COUNT(*) AS num_rows,
SUM(column_name) AS total_value
FROM data
WHERE some_condition
In Pandas, you first need to filter the dataframe, then separately count and sum using chained methods.
filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()
DuckDB is more concise and easier to read, and doesn't require managing intermediate variables.
Aggregation in the Uber Data Project
Good, let's move on to the second bonus scheme, Option 2. According to the project description, drivers will receive $4 per trip if they:
- Complete at least 12 trips.
- Have a rating of 4.7 or better.
This time, instead of just counting the drivers, we need to add up the number of trips they completed, because the bonus is paid per trip, not per person.
SELECT
COUNT(*) AS qualified_drivers,
SUM("Trips Completed") * 4 AS total_payout
FROM data
WHERE "Trips Completed" >= 12
AND Rating >= 4.7
The count here tells us how many drivers qualify. However, to calculate the total payout, we sum their trips and multiply by $4, as required by Option 2.
Here is the output.
With DuckDB, we don't have to loop through the rows or build custom aggregations. The SUM function takes care of everything we need.
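In Pandas, the same per-trip payout would look roughly like this. The sample rows are made up for illustration, not taken from the real dataset.

```python
import pandas as pd

# Illustrative sample rows
df = pd.DataFrame({
    "Trips Completed": [12, 20, 8],
    "Rating": [4.7, 4.6, 4.9],
})

# Option 2: at least 12 trips and a rating of 4.7 or better
qualified = df[(df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)]
total_payout = qualified["Trips Completed"].sum() * 4  # $4 per completed trip
```

Here only the first row qualifies (the second fails on rating, the third on trips), giving 12 trips and a $48 payout.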
3. Detect Overlaps and Differences Using Boolean Logic
In SQL, you can easily combine conditions using Boolean logic, such as AND, OR, and NOT.
Implementation of Boolean Logic in DuckDB vs Pandas
DuckDB supports Boolean logic natively in the WHERE clause using AND, OR, and NOT.
SELECT *
FROM data
WHERE condition_a
AND condition_b
AND NOT (condition_c)
Pandas requires combining logical operators with masks and parentheses, along with "~" for negation.
filtered = df[
(df["condition_a"]) &
(df["condition_b"]) &
~(df["condition_c"])
]
While both are functional, DuckDB is easier to reason about when the logic involves exclusions or nested conditions.
Boolean Logic for the Uber Data Project
Now we have calculated Option 1 and Option 2; what comes next? It's time to do the comparison. Remember our next question.
This is where we can use Boolean logic. We'll use a combination of AND and NOT.
SELECT COUNT(*) AS only_option1
FROM data
WHERE "Supply Hours" >= 8
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
AND "Trips Completed" >= 10
AND Rating >= 4.7
AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)
Here is the output.
Let's break it down:
- The first four conditions are there for Option 1.
- The NOT(..) part excludes drivers who also qualify for Option 2.
It's pretty straightforward, right?
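The same exclusion logic in Pandas would look roughly like this, using `~` for NOT. The sample rows are made up to show each case.

```python
import pandas as pd

# Made-up rows: one Option-1-only driver, one who also hits Option 2, one who fails
df = pd.DataFrame({
    "Supply Hours": [9, 9, 5],
    "Accept Rate": ["95%", "95%", "95%"],
    "Trips Completed": [10, 14, 20],
    "Rating": [4.8, 4.8, 4.8],
})

accept_rate = df["Accept Rate"].str.rstrip("%").astype(float)

option1 = (
    (df["Supply Hours"] >= 8)
    & (accept_rate >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
)
option2 = (df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)

# Drivers who qualify for Option 1 but NOT Option 2
only_option1 = df[option1 & ~option2]
```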
4. Quick Cohort Sizing with Conditional Filters
Sometimes, you want to understand how big a specific group or cohort is within your data.
Implementation of Conditional Filters in DuckDB vs Pandas
DuckDB handles cohort filtering and percentage calculation with one SQL query, even including subqueries.
SELECT
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE condition_1
AND condition_2
AND condition_3
Pandas requires filtering, counting, and manual division to calculate percentages.
filtered = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)
DuckDB here is cleaner and faster. It minimizes the number of steps and avoids repeated code.
Cohort Sizing for the Uber Data Project
Now we are at the last question of Scenario 1. In this question, Uber wants us to find the drivers who missed certain performance targets, like trips completed and acceptance rate, yet still had high ratings. Specifically, the drivers who:
- Completed fewer than 10 trips
- Had an acceptance rate lower than 90%
- Had a rating higher than 4.7
Now, these are three separate conditions, and we want to calculate the percentage of drivers satisfying all of them. Let's see the query.
SELECT
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE "Trips Completed" < 10
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
AND Rating > 4.7
Here is the output.
Here, we filtered the rows where all three conditions were satisfied, counted them, and divided by the total number of drivers to get a percentage.
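The equivalent Pandas computation, on a few hypothetical rows, would be:

```python
import pandas as pd

# Hypothetical rows: only the first one satisfies all three conditions
df = pd.DataFrame({
    "Trips Completed": [5, 5, 12, 5],
    "Accept Rate": ["80%", "95%", "80%", "80%"],
    "Rating": [4.8, 4.8, 4.8, 4.5],
})

accept_rate = df["Accept Rate"].str.rstrip("%").astype(float)

cohort = df[
    (df["Trips Completed"] < 10)
    & (accept_rate < 90)
    & (df["Rating"] > 4.7)
]
percentage = round(100.0 * len(cohort) / len(df), 2)  # cohort share of all drivers
```

With 1 of 4 sample rows in the cohort, this comes out to 25.0%.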
5. Basic Arithmetic Queries for Revenue Modeling
Now, let's say you want to do some basic math. You can write expressions directly into your SELECT statement.
Implementation of Arithmetic in DuckDB vs Pandas
DuckDB allows arithmetic to be written directly in the SELECT clause, like a calculator.
SELECT
daily_income * work_days * weeks_per_year AS annual_revenue,
weekly_cost * weeks_per_year AS total_cost,
(daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM data
Pandas requires multiple intermediate calculations in separate variables for the same result.
daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500
annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost
DuckDB condenses the math logic into a readable SQL block, while Pandas gets a bit cluttered with variable assignments.
Basic Arithmetic in the Uber Data Project
In Scenario 2, Uber asked us to calculate how much money (after expenses) the driver makes per year without partnering with Uber. The expenses include gas, rent, and insurance.
Now let's calculate the annual revenue and subtract the expenses from it.
SELECT
200 * 6 * (52 - 3) AS annual_revenue,
200 * (52 - 3) AS gas_expense,
500 * (52 - 3) AS rent_expense,
400 * 12 AS insurance_expense,
(200 * 6 * (52 - 3))
- (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income
Here is the output.
With DuckDB, you can write this like a simple SQL math block. You don't need Pandas DataFrames or manual looping!
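As a sanity check, here is the same Scenario 2 arithmetic in plain Python, with the numbers taken directly from the query above:

```python
# Same numbers as the SQL query above
weeks_worked = 52 - 3                      # 49 working weeks per year
annual_revenue = 200 * 6 * weeks_worked    # daily fare * days per week * weeks
gas_expense = 200 * weeks_worked
rent_expense = 500 * weeks_worked
insurance_expense = 400 * 12
net_income = annual_revenue - (gas_expense + rent_expense + insurance_expense)
```

This gives an annual revenue of $58,800 and a net income of $19,700, matching the query's result.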
6. Conditional Calculations for Dynamic Expense Planning
What if your cost structure changes based on certain conditions?
Implementation of Conditional Calculations in DuckDB vs Pandas
DuckDB lets you apply conditional logic using arithmetic adjustments inside your query.
SELECT
original_cost * 1.05 AS increased_cost,
original_cost * 0.8 AS discounted_cost,
0 AS removed_cost,
(original_cost * 1.05 + original_cost * 0.8) AS total_new_cost
Pandas uses the same logic with multiple lines of math and manual updates to variables.
weeks_worked = 49
gas = 200
insurance = 400
gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense
DuckDB turns what would be multi-step logic in Pandas into a single SQL expression.
Conditional Calculations in the Uber Data Project
In this scenario, we model what happens if the driver partners with Uber and buys a car. The expenses change as follows:
- Gas cost increases by 5%
- Insurance decreases by 20%
- No more rent expense
con.execute("""
SELECT
200 * 1.05 * 49 AS gas_expense,
400 * 0.8 * 12 AS insurance_expense,
0 AS rent_expense,
(200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()
Here is the output.
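Again as a quick check, the adjusted expenses in plain Python, using the same figures as the query:

```python
weeks_worked = 49
gas_expense = 200 * 1.05 * weeks_worked    # gas up 5%
insurance_expense = 400 * 0.8 * 12         # insurance down 20%
rent_expense = 0                           # no more rent
total_expense = gas_expense + insurance_expense + rent_expense
```

That comes to $10,290 in gas and $3,840 in insurance, or $14,130 in total.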
7. Goal-Driven Math for Revenue Targeting
Sometimes, your analysis is driven by a business goal, like hitting a revenue target or covering a one-time cost.
Implementation of Goal-Driven Math in DuckDB vs Pandas
DuckDB handles multi-step logic using CTEs, which make the query modular and easy to read.
WITH vars AS (
SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
SELECT
base_income - (cost_1 + cost_2) AS current_profit,
cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
FROM vars
),
final AS (
SELECT
current_profit + new_total_expense AS required_revenue,
required_revenue / 49 AS required_weekly_income
FROM calc
)
SELECT required_weekly_income FROM final
Pandas requires nesting calculations and reusing earlier variables to avoid duplication.
weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost
# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000
required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks
DuckDB lets you build a logic pipeline step by step, without cluttering your notebook with scattered code.
Goal-Driven Math in the Uber Data Project
Now that we have modeled the new costs, let's answer the final business question:
How much more does the driver need to earn per week to do both of the following?
- Pay off a $40,000 car within a year
- Maintain the same yearly net income
Now let's write the code representing this logic.
WITH vars AS (
SELECT
52 AS total_weeks_per_year,
3 AS weeks_off,
6 AS days_per_week,
200 AS fare_per_day,
400 AS monthly_insurance,
200 AS gas_per_week,
500 AS vehicle_rent,
40000 AS car_cost
),
base AS (
SELECT
total_weeks_per_year,
weeks_off,
days_per_week,
fare_per_day,
monthly_insurance,
gas_per_week,
vehicle_rent,
car_cost,
total_weeks_per_year - weeks_off AS weeks_worked,
(fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
(gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
(vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
(monthly_insurance * 12) AS original_insurance
FROM vars
),
compare AS (
SELECT *,
(original_gas + original_rent + original_insurance) AS original_total_expense,
(original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
FROM base
),
new_costs AS (
SELECT *,
gas_per_week * 1.05 * weeks_worked AS new_gas,
monthly_insurance * 0.8 * 12 AS new_insurance
FROM compare
),
final AS (
SELECT *,
new_gas + new_insurance + car_cost AS new_total_expense,
original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
required_revenue / weeks_worked AS required_weekly_revenue,
original_annual_revenue / weeks_worked AS original_weekly_revenue
FROM new_costs
)
SELECT
ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM final
Here is the output.
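To double-check the pipeline, here is the same computation in plain Python, following the query's steps with the same constants:

```python
# Mirrors the query's CTE pipeline step by step
weeks_worked = 52 - 3
original_annual_revenue = 200 * 6 * weeks_worked
original_total_expense = 200 * weeks_worked + 500 * weeks_worked + 400 * 12
original_net_income = original_annual_revenue - original_total_expense

new_gas = 200 * 1.05 * weeks_worked        # gas up 5%
new_insurance = 400 * 0.8 * 12             # insurance down 20%
car_cost = 40000

# Revenue needed to keep the same net income AND pay off the car in a year
required_revenue = original_net_income + new_gas + new_insurance + car_cost
required_weekly_revenue = round(required_revenue / weeks_worked, 2)
weekly_uplift = round(
    required_revenue / weeks_worked - original_annual_revenue / weeks_worked, 2
)
```

Under these assumptions, the driver needs roughly $1,506.73 per week, about $306.73 more than the original $1,200.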
Final Thoughts
In this article, we explored how to connect DuckDB and analyze data. Instead of long chains of Pandas functions, we used SQL queries. We also did this using a real-life data project that Uber asked in its data scientist recruitment process.
For data scientists working on analysis-heavy tasks, DuckDB is a lightweight but powerful alternative to Pandas. Try using it in your next project, especially when SQL logic fits the problem better.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.