
Image by Author | Canva
The Pandas library has one of the fastest-growing communities. This popularity has opened the door for alternatives, like Polars. In this article, we will explore one such alternative, DuckDB.
DuckDB is an SQL database that you can run right in your notebook. No setup is required, and no servers are needed. It is easy to install and can work with Pandas in parallel.
Unlike other SQL databases, you don't have to configure a server. It just works with your notebook after installation. That means no local setup headaches; you're writing code immediately. DuckDB handles filtering, joins, and aggregations with clean SQL syntax, compared to Pandas, and performs significantly better on large datasets.
So enough with the words, let's get started!
Data Project – Uber Business Modeling
We will use DuckDB with Jupyter Notebook, combining it with Python for data analysis. To make things more exciting, we will work on a real-life data project. Let's get started!
Here is the link to the data project we'll be using in this article. It is a data project from Uber called Partner's Business Modeling.
Uber used this data project in its recruitment process for data science positions, and you will be asked to analyze the data for two different scenarios.
- Scenario 1: Compare the cost of two bonus programs designed to get more drivers online during a busy day.
- Scenario 2: Calculate and compare the annual net income of a traditional taxi driver vs one who partners with Uber and buys a car.
Loading Dataset
Let's load the dataframe first. We need this step because we will register this dataset with DuckDB in the following sections.
import pandas as pd
df = pd.read_csv("dataset_2.csv")
Exploring the Dataset
Here are the first few rows:
Let's see all the columns.
Here is the output.
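If you want to follow along without the original CSV, here is a minimal sketch of this exploration step. The few rows below are hypothetical, shaped like the project's columns; the real values come from dataset_2.csv.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from dataset_2.csv
df = pd.DataFrame({
    "Trips Completed": [2, 15],
    "Accept Rate": ["100%", "90%"],
    "Supply Hours": [3, 11],
    "Rating": [4.9, 4.6],
})

print(df.head())            # first few rows
print(df.columns.tolist())  # all column names
```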
Connect DuckDB and Register the DataFrame
Great, it's a really simple dataset, but how do we connect DuckDB with it?
First, if you have not installed it yet, install DuckDB.
Connecting with DuckDB is easy. Also, if you want to read the documentation, check it out here.
Now, here is the code to make a connection and register the dataframe.
import duckdb
con = duckdb.connect()
con.register("data", df)  # exposes df to SQL as the "data" table
Great, let's start exploring seven queries that can save you hours of Pandas work!
1. Multi-Criteria Filtering for Complex Eligibility Rules
One of the most important advantages of SQL is how naturally it handles filtering, especially multi-condition filtering.
Implementation of Multi-Criteria Filtering in DuckDB vs Pandas
DuckDB lets you apply multiple filters using SQL's WHERE clause and AND logic, which scales well as the number of filters grows.
SELECT
*
FROM data
WHERE condition_1
AND condition_2
AND condition_3
AND condition_4
Now let's see how we would write the same logic in Pandas. In Pandas, the same logic is expressed using chained boolean masks wrapped in parentheses, which can get verbose with many conditions.
filtered_df = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"]) &
(df["condition_4"])
]
Both approaches are equally readable and applicable for basic use. DuckDB feels more natural and cleaner as the logic gets more complex.
Multi-Criteria Filtering for the Uber Data Project
In this case, we want to find drivers who qualify for a specific Uber bonus program.
According to the rules, the drivers must:
- Be online for at least 8 hours
- Complete at least 10 trips
- Accept at least 90% of ride requests
- Have a rating of 4.7 or above
Now all we have to do is write a query that applies all these filters. Here is the code.
SELECT
COUNT(*) AS qualified_drivers,
COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
AND "Trips Completed" >= 10
AND Rating >= 4.7
But to execute this code with Python, we need to wrap it in con.execute(""" """) and call fetchdf(), as shown below:
con.execute("""
SELECT
COUNT(*) AS qualified_drivers,
COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
AND "Trips Completed" >= 10
AND Rating >= 4.7
""").fetchdf()
We will do this throughout the article. Now that you know how to run it in a Jupyter notebook, we'll show only the SQL code from now on, and you'll know how to convert it to the Pythonic version.
Good. Now, remember that the data project wants us to calculate the total payout for Option 1.
We have counted the qualifying drivers, but we should multiply this count by $50, because the payout is $50 for each driver, so we do it with COUNT(*) * 50.
Here is the output.
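For comparison, here is a sketch of the same Option 1 filter in Pandas, run against a few made-up rows. The column names follow the query above; the data itself is illustrative.

```python
import pandas as pd

# Made-up sample rows; column names mirror the Uber project's dataset
df = pd.DataFrame({
    "Supply Hours": [8, 10, 6],
    "Accept Rate": ["95%", "85%", "100%"],
    "Trips Completed": [10, 20, 15],
    "Rating": [4.7, 4.9, 4.8],
})

# Strip the '%' sign so the acceptance rate can be compared numerically
accept_rate = df["Accept Rate"].str.rstrip("%").astype(float)

qualified = df[
    (df["Supply Hours"] >= 8)
    & (accept_rate >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
]
total_payout = len(qualified) * 50  # $50 per qualifying driver
```

Only the first sample row passes all four conditions here, so the payout is $50.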
2. Quick Aggregation to Estimate Business Incentives
SQL is great for quick aggregation, especially when you need to summarize data across rows.
Implementation of Aggregation in DuckDB vs Pandas
DuckDB lets you aggregate values across rows using SQL functions like SUM and COUNT in a single compact block.
SELECT
COUNT(*) AS num_rows,
SUM(column_name) AS total_value
FROM data
WHERE some_condition
In Pandas, you first need to filter the dataframe, then separately count and sum using chained methods.
filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()
DuckDB is more concise and easier to read, and doesn't require managing intermediate variables.
Aggregation in the Uber Data Project
Good, let's move on to the second bonus scheme, Option 2. According to the project description, drivers will receive $4 per trip if they:
- Complete at least 12 trips.
- Have a rating of 4.7 or better.
This time, instead of just counting the drivers, we need to add up the number of trips they completed, because the bonus is paid per trip, not per person.
SELECT
COUNT(*) AS qualified_drivers,
SUM("Trips Completed") * 4 AS total_payout
FROM data
WHERE "Trips Completed" >= 12
AND Rating >= 4.7
The count here tells us how many drivers qualify. However, to calculate the total payout, we sum their trips and multiply by $4, as required by Option 2.
Here is the output.
With DuckDB, we don't have to loop through the rows or build custom aggregations. The SUM function takes care of everything we need.
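In Pandas, the same per-trip payout would look roughly like this. The sample rows are made up for illustration, not taken from the real dataset.

```python
import pandas as pd

# Illustrative sample rows
df = pd.DataFrame({
    "Trips Completed": [12, 20, 8],
    "Rating": [4.7, 4.6, 4.9],
})

# Option 2: at least 12 trips and a rating of 4.7 or better
qualified = df[(df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)]
total_payout = qualified["Trips Completed"].sum() * 4  # $4 per completed trip
```

Here only the first row qualifies (the second fails on rating, the third on trips), giving 12 trips and a $48 payout.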
3. Detect Overlaps and Differences Using Boolean Logic
In SQL, you can easily combine conditions using Boolean logic, such as AND, OR, and NOT.
Implementation of Boolean Logic in DuckDB vs Pandas
DuckDB supports Boolean logic natively in the WHERE clause using AND, OR, and NOT.
SELECT *
FROM data
WHERE condition_a
AND condition_b
AND NOT (condition_c)
Pandas requires combining logical operators with masks and parentheses, along with "~" for negation.
filtered = df[
(df["condition_a"]) &
(df["condition_b"]) &
~(df["condition_c"])
]
While both are functional, DuckDB is easier to reason about when the logic involves exclusions or nested conditions.
Boolean Logic for the Uber Data Project
Now we have calculated Option 1 and Option 2; what comes next? It's time to do the comparison. Remember our next question.
This is where we can use Boolean logic. We'll use a combination of AND and NOT.
SELECT COUNT(*) AS only_option1
FROM data
WHERE "Supply Hours" >= 8
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
AND "Trips Completed" >= 10
AND Rating >= 4.7
AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)
Here is the output.
Let's break it down:
- The first four conditions are there for Option 1.
- The NOT(..) part excludes drivers who also qualify for Option 2.
It's pretty straightforward, right?
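The same exclusion logic in Pandas would look roughly like this, using `~` for NOT. The sample rows are made up to show each case.

```python
import pandas as pd

# Made-up rows: one Option-1-only driver, one who also hits Option 2, one who fails
df = pd.DataFrame({
    "Supply Hours": [9, 9, 5],
    "Accept Rate": ["95%", "95%", "95%"],
    "Trips Completed": [10, 14, 20],
    "Rating": [4.8, 4.8, 4.8],
})

accept_rate = df["Accept Rate"].str.rstrip("%").astype(float)

option1 = (
    (df["Supply Hours"] >= 8)
    & (accept_rate >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
)
option2 = (df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)

# Drivers who qualify for Option 1 but NOT Option 2
only_option1 = df[option1 & ~option2]
```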
4. Quick Cohort Sizing with Conditional Filters
Sometimes, you want to understand how big a specific group or cohort is within your data.
Implementation of Conditional Filters in DuckDB vs Pandas
DuckDB handles cohort filtering and percentage calculation with one SQL query, even including subqueries.
SELECT
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE condition_1
AND condition_2
AND condition_3
Pandas requires filtering, counting, and manual division to calculate percentages.
filtered = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)
DuckDB here is cleaner and faster. It minimizes the number of steps and avoids repeated code.
Cohort Sizing for the Uber Data Project
Now we are at the last question of Scenario 1. In this question, Uber wants us to find the drivers who missed certain performance targets, like trips completed and acceptance rate, yet still had high ratings. Specifically, the drivers who:
- Completed fewer than 10 trips
- Had an acceptance rate lower than 90%
- Had a rating higher than 4.7
Now, these are three separate conditions, and we want to calculate the percentage of drivers satisfying all of them. Let's see the query.
SELECT
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE "Trips Completed" < 10
AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
AND Rating > 4.7
Here is the output.
Here, we filtered the rows where all three conditions were satisfied, counted them, and divided by the total number of drivers to get a percentage.
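The equivalent Pandas computation, on a few hypothetical rows, would be:

```python
import pandas as pd

# Hypothetical rows: only the first one satisfies all three conditions
df = pd.DataFrame({
    "Trips Completed": [5, 5, 12, 5],
    "Accept Rate": ["80%", "95%", "80%", "80%"],
    "Rating": [4.8, 4.8, 4.8, 4.5],
})

accept_rate = df["Accept Rate"].str.rstrip("%").astype(float)

cohort = df[
    (df["Trips Completed"] < 10)
    & (accept_rate < 90)
    & (df["Rating"] > 4.7)
]
percentage = round(100.0 * len(cohort) / len(df), 2)  # cohort share of all drivers
```

With 1 of 4 sample rows in the cohort, this comes out to 25.0%.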
5. Basic Arithmetic Queries for Revenue Modeling
Now, let's say you want to do some basic math. You can write expressions directly into your SELECT statement.
Implementation of Arithmetic in DuckDB vs Pandas
DuckDB allows arithmetic to be written directly in the SELECT clause, like a calculator.
SELECT
daily_income * work_days * weeks_per_year AS annual_revenue,
weekly_cost * weeks_per_year AS total_cost,
(daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM data
Pandas requires multiple intermediate calculations in separate variables for the same result.
daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500
annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost
DuckDB condenses the math logic into a readable SQL block, while Pandas gets a bit cluttered with variable assignments.
Basic Arithmetic in the Uber Data Project
In Scenario 2, Uber asked us to calculate how much money (after expenses) the driver makes per year without partnering with Uber. The expenses include gas, rent, and insurance.
Now let's calculate the annual revenue and subtract the expenses from it.
SELECT
200 * 6 * (52 - 3) AS annual_revenue,
200 * (52 - 3) AS gas_expense,
500 * (52 - 3) AS rent_expense,
400 * 12 AS insurance_expense,
(200 * 6 * (52 - 3))
- (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income
Here is the output.
With DuckDB, you can write this like a simple SQL math block. You don't need Pandas DataFrames or manual looping!
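As a sanity check, here is the same Scenario 2 arithmetic in plain Python, with the numbers taken directly from the query above:

```python
# Same numbers as the SQL query above
weeks_worked = 52 - 3                      # 49 working weeks per year
annual_revenue = 200 * 6 * weeks_worked    # daily fare * days per week * weeks
gas_expense = 200 * weeks_worked
rent_expense = 500 * weeks_worked
insurance_expense = 400 * 12
net_income = annual_revenue - (gas_expense + rent_expense + insurance_expense)
```

This gives an annual revenue of $58,800 and a net income of $19,700, matching the query's result.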
6. Conditional Calculations for Dynamic Expense Planning
What if your cost structure changes based on certain conditions?
Implementation of Conditional Calculations in DuckDB vs Pandas
DuckDB lets you apply conditional logic using arithmetic adjustments inside your query.
SELECT
original_cost * 1.05 AS increased_cost,
original_cost * 0.8 AS discounted_cost,
0 AS removed_cost,
(original_cost * 1.05 + original_cost * 0.8) AS total_new_cost
Pandas uses the same logic with multiple lines of math and manual updates to variables.
weeks_worked = 49
gas = 200
insurance = 400
gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense
DuckDB turns what would be multi-step logic in Pandas into a single SQL expression.
Conditional Calculations in the Uber Data Project
In this scenario, we model what happens if the driver partners with Uber and buys a car. The expenses change as follows:
- Gas cost increases by 5%
- Insurance decreases by 20%
- No more rent expense
con.execute("""
SELECT
200 * 1.05 * 49 AS gas_expense,
400 * 0.8 * 12 AS insurance_expense,
0 AS rent_expense,
(200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()
Here is the output.
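Again as a quick check, the adjusted expenses in plain Python, using the same figures as the query:

```python
weeks_worked = 49
gas_expense = 200 * 1.05 * weeks_worked    # gas up 5%
insurance_expense = 400 * 0.8 * 12         # insurance down 20%
rent_expense = 0                           # no more rent
total_expense = gas_expense + insurance_expense + rent_expense
```

That comes to $10,290 in gas and $3,840 in insurance, or $14,130 in total.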
7. Goal-Driven Math for Revenue Targeting
Sometimes, your analysis is driven by a business goal, like hitting a revenue target or covering a one-time cost.
Implementation of Goal-Driven Math in DuckDB vs Pandas
DuckDB handles multi-step logic using CTEs, which make the query modular and easy to read.
WITH vars AS (
SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
SELECT
base_income - (cost_1 + cost_2) AS current_profit,
cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
FROM vars
),
final AS (
SELECT
current_profit + new_total_expense AS required_revenue,
required_revenue / 49 AS required_weekly_income
FROM calc
)
SELECT required_weekly_income FROM final
Pandas requires nesting calculations and reusing earlier variables to avoid duplication.
weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost
# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000
required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks
DuckDB lets you build a logic pipeline step by step, without cluttering your notebook with scattered code.
Goal-Driven Math in the Uber Data Project
Now that we have modeled the new costs, let's answer the final business question:
How much more does the driver need to earn per week to do both of the following?
- Pay off a $40,000 car within a year
- Maintain the same yearly net income
Now let's write the code representing this logic.
WITH vars AS (
SELECT
52 AS total_weeks_per_year,
3 AS weeks_off,
6 AS days_per_week,
200 AS fare_per_day,
400 AS monthly_insurance,
200 AS gas_per_week,
500 AS vehicle_rent,
40000 AS car_cost
),
base AS (
SELECT
total_weeks_per_year,
weeks_off,
days_per_week,
fare_per_day,
monthly_insurance,
gas_per_week,
vehicle_rent,
car_cost,
total_weeks_per_year - weeks_off AS weeks_worked,
(fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
(gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
(vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
(monthly_insurance * 12) AS original_insurance
FROM vars
),
compare AS (
SELECT *,
(original_gas + original_rent + original_insurance) AS original_total_expense,
(original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
FROM base
),
new_costs AS (
SELECT *,
gas_per_week * 1.05 * weeks_worked AS new_gas,
monthly_insurance * 0.8 * 12 AS new_insurance
FROM compare
),
final AS (
SELECT *,
new_gas + new_insurance + car_cost AS new_total_expense,
original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
required_revenue / weeks_worked AS required_weekly_revenue,
original_annual_revenue / weeks_worked AS original_weekly_revenue
FROM new_costs
)
SELECT
ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM final
Here is the output.
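To double-check the pipeline, here is the same computation in plain Python, following the query's steps with the same constants:

```python
# Mirrors the query's CTE pipeline step by step
weeks_worked = 52 - 3
original_annual_revenue = 200 * 6 * weeks_worked
original_total_expense = 200 * weeks_worked + 500 * weeks_worked + 400 * 12
original_net_income = original_annual_revenue - original_total_expense

new_gas = 200 * 1.05 * weeks_worked        # gas up 5%
new_insurance = 400 * 0.8 * 12             # insurance down 20%
car_cost = 40000

# Revenue needed to keep the same net income AND pay off the car in a year
required_revenue = original_net_income + new_gas + new_insurance + car_cost
required_weekly_revenue = round(required_revenue / weeks_worked, 2)
weekly_uplift = round(
    required_revenue / weeks_worked - original_annual_revenue / weeks_worked, 2
)
```

Under these assumptions, the driver needs roughly $1,506.73 per week, about $306.73 more than the original $1,200.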
Final Thoughts
In this article, we explored how to connect DuckDB and analyze data. Instead of long chains of Pandas functions, we used SQL queries. We also did this using a real-life data project that Uber asked in its data scientist recruitment process.
For data scientists working on analysis-heavy tasks, DuckDB is a lightweight but powerful alternative to Pandas. Try using it in your next project, especially when SQL logic fits the problem better.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.