The Fundamentals of Debugging Python Issues

By admin2010

July 22, 2025


Image by Author | Canva

Ever run a Python script and instantly wished you hadn’t pressed Enter?

Debugging in data science isn’t just an act; it’s a survival skill, particularly when dealing with messy datasets or building prediction models that real people rely on.

In this article, we’ll explore the basics of debugging, specifically in your data science workflows, using a real-life dataset from a DoorDash delivery project, and most importantly, how to debug like a pro.

DoorDash Delivery Duration Prediction: What Are We Dealing With?


In this data project, DoorDash asked its data science candidates to predict the delivery duration. Let’s first look at the dataset info. Here is the code:
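A minimal sketch of that first look, assuming the CSV has already been loaded into a DataFrame called historical_data. The three-row frame below is a made-up stand-in so the snippet runs on its own; it is not the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from historical_data.csv
historical_data = pd.DataFrame(
    {
        "market_id": [1.0, 2.0, None],
        "created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25", "2015-01-22 20:39:28"],
        "actual_delivery_time": ["2015-02-06 23:27:16", "2015-02-10 22:56:29", None],
        "subtotal": [3441, 1900, 1900],
    }
)

# info() reports the row count, each column's dtype, and its non-null count.
# Writing to a buffer keeps the report as a string we can also inspect in code.
buffer = io.StringIO()
historical_data.info(buf=buffer)
info_report = buffer.getvalue()
print(info_report)
```

The non-null counts are usually the first hint that a column has gaps, and the dtype column shows whether timestamps were read as plain text.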

Here is the output:

Debugging Python Problem in Predicting Delivery Duration

It seems that they didn’t provide the delivery duration, so you have to calculate it yourself. It’s simple, but no worries if you’re a beginner. Let’s see how it can be calculated.

import pandas as pd
from datetime import datetime

# Assuming historical_data is your DataFrame
historical_data["created_at"] = pd.to_datetime(historical_data['created_at'])
historical_data["actual_delivery_time"] = pd.to_datetime(historical_data['actual_delivery_time'])
historical_data["actual_total_delivery_duration"] = (historical_data["actual_delivery_time"] - historical_data["created_at"]).dt.total_seconds()
historical_data.head()

Here is the head of the output; you can see the actual_total_delivery_duration column.

Output of Debugging Python Problem of Delivery Duration Prediction

Nice, now we can start! But before that, here is the data dictionary for this dataset.

Columns in `historical_data.csv`

Time features:

market_id: A city/region in which DoorDash operates, e.g., Los Angeles, given in the data as an ID.
created_at: Timestamp in UTC when the order was submitted by the consumer to DoorDash. (Note: this timestamp is in UTC, but in case you need it, the actual timezone of the region was US/Pacific).
actual_delivery_time: Timestamp in UTC when the order was delivered to the consumer.
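If you need local time (for example, to derive an hour-of-day feature), the UTC note above matters. Here is a minimal sketch, assuming the timestamps are stored as naive strings; the sample values and the created_hour_local column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; in the project these values come from historical_data.csv
sample = pd.DataFrame({"created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25"]})

# Parse the strings as naive timestamps, then declare that they are in UTC
created_utc = pd.to_datetime(sample["created_at"]).dt.tz_localize("UTC")

# Convert to the region's local timezone before extracting time-of-day features
created_local = created_utc.dt.tz_convert("US/Pacific")
sample["created_hour_local"] = created_local.dt.hour
print(sample)
```

Skipping the conversion would shift every hour-of-day feature by the UTC offset, which is exactly the kind of bug that never raises an error.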

Store features:

store_id: An ID representing the restaurant the order was submitted for.
store_primary_category: Cuisine category of the restaurant, e.g., Italian, Asian.
order_protocol: A store can receive orders from DoorDash through many modes. This field represents an ID denoting the protocol.

Order features:

total_items: Total number of items in the order.
subtotal: Total value of the order submitted (in cents).
num_distinct_items: Number of distinct items included in the order.
min_item_price: Price of the item with the least cost in the order (in cents).
max_item_price: Price of the item with the highest cost in the order (in cents).

Market features:

DoorDash being a marketplace, we have information on the state of the marketplace when the order is placed, which can be used to estimate delivery time. The following features are values at the time of created_at (order submission time):

total_onshift_dashers: Number of available dashers who are within 10 miles of the store at the time of order creation.
total_busy_dashers: Subset of the above total_onshift_dashers who are currently working on an order.
total_outstanding_orders: Number of orders within 10 miles of this order that are currently being processed.

Predictions from other models:

We have predictions from other models for various stages of the delivery process that we can use:

estimated_order_place_duration: Estimated time for the restaurant to receive the order from DoorDash (in seconds).
estimated_store_to_consumer_driving_duration: Estimated travel time between the store and consumer (in seconds).

Great, so let’s get started!

Common Python Errors in Data Science Projects


In this section, we’ll uncover common errors in a data science project, starting with reading the dataset and going through to the most important part: modeling.

Reading the Dataset: `FileNotFoundError`, Dtype Warning, and Fixes

Case 1: File Not Found - Classic

In data science, your first bug often greets you at read_csv. And not with a hello. Let’s debug that exact moment together, line by line. Here is the code:

import os
import pandas as pd

try:
    df = pd.read_csv('Strata Questions/historical_data.csv')
    df.head(3)
except FileNotFoundError:
    # Show where Python is looking before re-raising the original error
    print("File not found. Here's where Python is looking:")
    print("Working directory:", os.getcwd())
    print("Available files:", os.listdir())
    raise

Here is the output.

Debugging Python Errors in Data Science Projects

You don’t just raise an error; you interrogate it. This shows where the code thinks it is and what it sees around it. If your file’s not on the list, now you know. No guessing. Just facts.

Replace the path with the full one, and voilà!

Debugging Python Errors in File Not Found
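One way to avoid depending on the working directory is to build the path explicitly. The sketch below uses pathlib; resolve_data_path and its base_dir parameter are hypothetical names introduced for illustration, not part of the original project.

```python
from pathlib import Path


def resolve_data_path(relative_path, base_dir=None):
    """Return an absolute path to a data file, or raise with the full path shown."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = (base / relative_path).resolve()
    if not candidate.is_file():
        # Including the absolute path makes the failure self-explanatory
        raise FileNotFoundError(f"Expected data file at {candidate}")
    return candidate


# Usage: the error message now names the exact location that was checked
try:
    csv_path = resolve_data_path("Strata Questions/historical_data.csv")
    print("Found:", csv_path)
except FileNotFoundError as err:
    print(err)
```

The returned path can be passed straight to pd.read_csv, and the same helper works from a notebook or a script as long as base_dir points at the project folder.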

Case 2: Dtype Misinterpretation - Python’s Quietly Wrong Guess

You load the dataset, but something’s off. The bug hides inside your types.

# Assuming df is your loaded DataFrame
try:
    print("Column Types:\n", df.dtypes)
except Exception as e:
    print("Error reading dtypes:", e)

Here is the output.

Debugging Python Errors in Dtype Misinterpretation
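When the inferred types are wrong, you can state them at load time instead of repairing them afterwards. Here is a minimal sketch on a made-up three-row extract; the "unknown" placeholder and the choice of nullable Int64 for the ID columns are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical extract: one blank market_id and one stray text value in subtotal
csv_text = "market_id,store_id,subtotal\n1,101,3441\n,102,1900\n3,103,unknown\n"

# Default inference: the blank turns market_id into float,
# and the text value stops subtotal from being numeric at all
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Explicit choices: nullable integer IDs, and "unknown" treated as missing
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"market_id": "Int64", "store_id": "Int64"},
    na_values=["unknown"],
)
print(typed.dtypes)
```

Comparing the two dtype printouts side by side shows which columns pandas guessed wrong and why.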

Case 3: Date Parsing - The Silent Saboteur

We discovered that we should calculate the delivery duration first, and we did it with this method.

try:
    # This code was shown earlier to calculate the delivery duration
    df["created_at"] = pd.to_datetime(df['created_at'])
    df["actual_delivery_time"] = pd.to_datetime(df['actual_delivery_time'])
    df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()
    print("Successfully calculated delivery duration and checked dtypes.")
    print("Relevant dtypes:\n", df[['created_at', 'actual_delivery_time', 'actual_total_delivery_duration']].dtypes)
except Exception as e:
    print("Error during date processing:", e)

Here is the output.

Debugging Python Errors in Data Parsing

Nice and professional! Now we avoid those red errors, which will lift our mood. I know seeing them can dampen your motivation.

Handling Missing Data: `KeyErrors`, `NaNs`, and Logical Pitfalls

Some bugs don’t crash your code. They just give you the wrong results, silently, until you wonder why your model is trash.

This section digs into missing data: not just how to clean it, but how to debug it properly.

Case 1: KeyError - You Thought That Column Existed

Here is our code.

try:
    print(df['store_rating'])
except KeyError as e:
    print("Column not found:", e)
    print("Here are the available columns:\n", df.columns.tolist())

Here is the output.

KeyError in Debugging Python problems

The code didn’t break because of logic; it broke because of an assumption. That’s exactly where debugging lives. Always list your columns before accessing them blindly.
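That habit can be wrapped in a small guard that runs before any column access. The helper name require_columns below is hypothetical, and the sketch assumes you know up front which columns a step needs.

```python
import pandas as pd


def require_columns(df, required):
    """Return df restricted to the required columns, or fail with a helpful message."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        # Report both what was expected and what is actually there
        raise KeyError(f"Missing columns: {missing}. Available: {df.columns.tolist()}")
    return df[list(required)]


# Usage on a tiny made-up frame
orders = pd.DataFrame({"subtotal": [3441, 1900], "total_items": [4, 1]})
print(require_columns(orders, ["subtotal", "total_items"]))
```

Failing early with the list of available columns saves a round trip to the notebook when a column was renamed or never existed.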

Case 2: NaN Count - Missing Values You Didn’t Expect

You assume everything’s clean. But real-world data always hides gaps. Let’s check for them.

try:
    null_counts = df.isnull().sum()
    print("Nulls per column:\n", null_counts[null_counts > 0])
except Exception as e:
    print("Failed to inspect nulls:", e)

Here is the output.

NaN Count in Debugging Python problems

This exposes the silent troublemakers. Maybe store_primary_category is missing in thousands of rows. Maybe timestamps failed conversion and are now NaT.

You wouldn’t have known unless you checked. Debugging means confirming every assumption.

Case 3: Logical Pitfalls - Missing Data That Isn’t Actually Missing

Let’s say you try to filter orders where the subtotal is greater than 1,000,000, expecting a lot of rows. But this gives you zero:

try:
    filtered = df[df['subtotal'] > 1000000]
    print("Rows with subtotal > 1,000,000:", filtered.shape[0])
except Exception as e:
    print("Filtering error:", e)

That’s not a code error; it’s a logic error. You expected high-value orders, but maybe none exist above that threshold. Debug it with a range check:

print("Subtotal range:", df['subtotal'].min(), "to", df['subtotal'].max())

Here is the output.

Logical Pitfalls in Debugging Python Problems

Case 4: `isna().sum()` of Zero Doesn’t Mean It’s Clean

Even if isna().sum() shows zero, there might be dirty data, like whitespace or 'None' as a string. Run a more aggressive check:

try:
    fake_nulls = df[df['store_primary_category'].isin(['', ' ', 'None', None])]
    print("Rows with fake missing categories:", fake_nulls.shape[0])
except Exception as e:
    print("Fake missing value check failed:", e)

This catches hidden trash that isnull() misses.

Handling Missing Data in Debugging Python Problems
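Once such placeholders are found, one option is to convert them into real missing values so that isna() tells the truth. Here is a minimal sketch; normalize_missing and its placeholder list are assumptions for illustration, and you should adjust the list to whatever your data actually contains.

```python
import pandas as pd


def normalize_missing(series, placeholders=("", "none", "null", "nan", "n/a")):
    """Strip whitespace and turn placeholder strings into real missing values."""
    cleaned = series.astype("string").str.strip()
    # Compare case-insensitively so 'None', 'NONE', and 'none' are all caught
    is_placeholder = cleaned.str.lower().isin(list(placeholders))
    return cleaned.mask(is_placeholder, pd.NA)


# Usage on a made-up category column
categories = pd.Series(["italian", " ", "None", None, "", "mexican"])
cleaned_categories = normalize_missing(categories)
print(cleaned_categories.isna().sum(), "missing after normalization")
```

After this step the ordinary null count covers both real and disguised gaps, so later checks do not need a special case.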

Feature Engineering Glitches: `TypeErrors`, Date Parsing, and More

Feature engineering seems fun at first, until your new column breaks every model or throws a TypeError mid-pipeline. Here’s how to debug that phase like someone who’s been burned before.

Case 1: You Think You Can Divide, But You Can’t

Let’s create a new feature. If an error occurs, our try-except block will catch it.

try:
    df['value_per_item'] = df['subtotal'] / df['total_items']
    print("value_per_item created successfully")
except Exception as e:
    print("Error occurred:", e)

Here is the output.

Feature Engineering Glitches in Debugging Python Problems

No errors? Good. But let’s look closer.

print(df[['subtotal', 'total_items', 'value_per_item']].sample(3))

Here is the output.

Feature Engineering Glitches in Debugging Python Problems
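A random sample can easily miss the rows that matter. With numeric columns, pandas does not raise on division by zero; it returns inf (or NaN for 0/0), so a targeted check is more reliable. The sketch below uses a made-up four-row frame to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical orders, including zero-item rows that a sample might not show
orders = pd.DataFrame({"subtotal": [3441, 1900, 0, 2500], "total_items": [4, 0, 0, 1]})
orders["value_per_item"] = orders["subtotal"] / orders["total_items"]

# Rows where the result is inf or NaN: no exception was raised for these
bad_rows = orders[~np.isfinite(orders["value_per_item"])]
print(bad_rows)

# One possible fix: treat a zero denominator as missing before dividing
safe_value_per_item = orders["subtotal"] / orders["total_items"].replace(0, np.nan)
print(safe_value_per_item)
```

Whether a zero-item order should become NaN or be dropped is a modeling decision; the point is to make it deliberately instead of letting inf flow into training.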

Case 2: Date Parsing Gone Wrong

Now, changing your dtype is important, but what if you think everything was done correctly, yet problems persist?

# This is the standard way, but it can fail silently on mixed types
df["created_at"] = pd.to_datetime(df["created_at"])
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])

You might think it’s okay, but if your column has mixed types, it could fail silently or break your pipeline. That’s why, instead of directly making transformations, it’s better to use a robust function.

import pandas as pd

def parse_date_debug(df, col):
    try:
        parsed = pd.to_datetime(df[col])
        print(f"[SUCCESS] '{col}' parsed successfully.")
        return parsed
    except Exception as e:
        print(f"[ERROR] Failed to parse '{col}':", e)
        # Find non-date-like values to debug
        non_datetimes = df[pd.to_datetime(df[col], errors="coerce").isna()][col].unique()
        print("Sample values causing issue:", non_datetimes[:5])
        raise

df["created_at"] = parse_date_debug(df, "created_at")
df["actual_delivery_time"] = parse_date_debug(df, "actual_delivery_time")

Here is the output.

Wrong Date Parsing in Debugging Python Problems

This helps you trace faulty rows when datetime parsing crashes.

Case 3: Naive Division That Could Mislead

The division below won’t throw an error in our DataFrame because the columns are already numeric. But here’s the issue: some datasets sneak in object types, even when they look like numbers. That leads to:

Misleading ratios
Wrong model behavior
No warnings

df["busy_dashers_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]

Let’s validate types before computing, even if the operation won’t throw an error.

import numpy as np

def create_ratio_debug(df, num_col, denom_col, new_col):
    num_type = df[num_col].dtype
    denom_type = df[denom_col].dtype

    # Refuse to compute the ratio unless both columns are numeric
    if not np.issubdtype(num_type, np.number) or not np.issubdtype(denom_type, np.number):
        print(f"[TYPE WARNING] '{num_col}' or '{denom_col}' is not numeric.")
        print(f"{num_col}: {num_type}, {denom_col}: {denom_type}")
        df[new_col] = np.nan
        return df

    # Warn about zero denominators, which produce inf or NaN instead of an error
    if (df[denom_col] == 0).any():
        print(f"[DIVISION WARNING] '{denom_col}' contains zeros.")

    df[new_col] = df[num_col] / df[denom_col]
    return df

df = create_ratio_debug(df, "total_busy_dashers", "total_onshift_dashers", "busy_dashers_ratio")

Here is the output.

Naive Division Misleading in Debugging Python Problems

This gives visibility into potential division-by-zero issues and prevents silent bugs.

Modeling Errors: Shape Mismatch and Evaluation Confusion

Case 1: NaN Values in Features Cause Model to Crash

Let’s say we want to build a linear regression model. LinearRegression() doesn’t support NaN values natively. If any row in X has a missing value, the model refuses to train.

Here is the code, which deliberately creates a shape mismatch to trigger an error:

from sklearn.linear_model import LinearRegression

X_train = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]].iloc[:-10]
y_train = df["actual_total_delivery_duration"].iloc[:-5]
model = LinearRegression()
model.fit(X_train, y_train)

Here is the output.

Modeling Mistakes in Debugging Python Problems

Let’s debug this issue. First, we check for NaNs.

print(X_train.isna().sum())

Here is the output.

Debugging Python Problems in NaN Values

Nice, let’s check the other variable too.

print(y_train.isna().sum())

Here is the output.

Debugging Python Problems in NaN Values

The mismatch and NaN values must be resolved. Here is the code to fix it.

from sklearn.linear_model import LinearRegression

# Re-align X and y to have the same length
X = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]]
y = df["actual_total_delivery_duration"]

# Step 1: Drop rows with NaN in features (X)
valid_X = X.dropna()

# Step 2: Align y to match the remaining indices of X
y_aligned = y.loc[valid_X.index]

# Step 3: Find indices where y is not NaN
valid_idx = y_aligned.dropna().index

# Step 4: Create final clean datasets
X_clean = valid_X.loc[valid_idx]
y_clean = y_aligned.loc[valid_idx]

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")

And voilà! Right here is the output.

Dataset of Debugging Python Problems
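The four alignment steps can also be collapsed into a single call, since dropna(subset=...) removes a row if any listed column is missing and keeps features and target on the same index. This is a sketch of an alternative, not the method used above; drop_incomplete_rows is a hypothetical helper name.

```python
import pandas as pd


def drop_incomplete_rows(df, feature_cols, target_col):
    """Return (X, y) restricted to rows with no missing feature or target values."""
    needed = list(feature_cols) + [target_col]
    complete = df.dropna(subset=needed)
    X = complete[list(feature_cols)]
    y = complete[target_col]
    # Both come from the same filtered frame, so lengths and indices must match
    assert len(X) == len(y)
    return X, y


# Usage on a made-up frame with one gap in a feature and one in the target
toy = pd.DataFrame(
    {
        "estimated_order_place_duration": [446.0, None, 251.0, 446.0],
        "estimated_store_to_consumer_driving_duration": [861.0, 690.0, 289.0, 650.0],
        "actual_total_delivery_duration": [3779.0, 4024.0, None, 2390.0],
    }
)
X_toy, y_toy = drop_incomplete_rows(
    toy,
    ["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"],
    "actual_total_delivery_duration",
)
print(X_toy.shape, y_toy.shape)
```

Because X and y are sliced from one filtered DataFrame, a shape mismatch like the one triggered earlier cannot occur.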

Case 2: Object Columns (Dates) Crash the Model

Let’s say you try to train a model using a timestamp like actual_delivery_time.

But, oh no, it’s still an object or datetime type, and you accidentally mix it with numeric columns. Linear regression doesn’t like that one bit.

from sklearn.linear_model import LinearRegression

X = df[["actual_delivery_time", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]

model = LinearRegression()
model.fit(X, y)

Here is the error:

Debugging Python in Object Columns

You’re combining two incompatible data types in the X matrix:

One column (actual_delivery_time) is datetime64.
The other (estimated_order_place_duration) is int64.

Scikit-learn expects every feature to be numeric so that it can build a single numeric array. It can’t combine a datetime column with an integer column. Let’s solve it by converting the datetime column to a numeric representation (Unix timestamp).

# Ensure datetime columns are parsed correctly, coercing errors to NaT
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

# Recalculate duration in case of new NaNs
df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()

# Convert datetime to a numeric feature (Unix timestamp in seconds).
# Subtracting the epoch keeps NaT as NaN so dropna() can remove it later;
# astype("int64") does not keep NaT as a missing value.
df["delivery_time_timestamp"] = (df["actual_delivery_time"] - pd.Timestamp("1970-01-01")).dt.total_seconds()

Nice. Now that the dtypes are numeric, let’s apply the ML model.

from sklearn.linear_model import LinearRegression

# Use the new numeric timestamp feature
X = df[["delivery_time_timestamp", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]

# Drop any remaining NaNs from our feature set and target
X_clean = X.dropna()
y_clean = y.loc[X_clean.index].dropna()
X_clean = X_clean.loc[y_clean.index]

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")

Here is the output.

Debugging Python in Object Columns
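A small pre-fit check can flag such columns before they ever reach the model. Here is a minimal sketch; find_non_numeric_columns is a hypothetical helper, and the sample frame is made up.

```python
import pandas as pd


def find_non_numeric_columns(features):
    """Return the names of feature columns that pandas does not treat as numeric."""
    return [
        col for col in features.columns
        if not pd.api.types.is_numeric_dtype(features[col])
    ]


# Usage: a datetime column and a text column sit next to a valid numeric feature
candidate_features = pd.DataFrame(
    {
        "actual_delivery_time": pd.to_datetime(["2015-02-06 23:27:16", "2015-02-10 22:56:29"]),
        "estimated_order_place_duration": [446, 446],
        "store_primary_category": ["american", "mexican"],
    }
)
offenders = find_non_numeric_columns(candidate_features)
print("Convert or drop before fitting:", offenders)
```

Running this right before fit() turns a long scikit-learn traceback into a one-line list of columns to convert or encode.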

Nice job!

Final Thoughts: Debug Smarter, Not Harder

Model crashes don’t always stem from complex bugs. Sometimes it’s just a stray NaN or an unconverted date column sneaking into your data pipeline.

Rather than wrestling with cryptic stack traces or tossing try-except blocks like darts in the dark, dig into your DataFrame early. Peek at .info(), check .isna().sum(), and don’t shy away from .dtypes. These simple steps unveil hidden landmines before you even hit fit().
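Those three checks can be bundled into one table that you print right after loading. The sketch below is one possible shape for it; quick_health_report and its column names are assumptions for illustration.

```python
import pandas as pd


def quick_health_report(df):
    """Summarize dtype, missing count, and missing percentage for every column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
        }
    )


# Usage on a made-up frame with a gap in subtotal
snapshot = pd.DataFrame(
    {
        "created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25", "2015-01-22 20:39:28", "2015-02-03 21:21:45"],
        "subtotal": [3441.0, 1900.0, None, 6900.0],
        "total_items": [4, 1, 1, 6],
    }
)
report = quick_health_report(snapshot)
print(report)
```

A timestamp column that shows up with a text dtype, or a feature with a high missing percentage, is visible here before any model code runs.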

I’ve shown you that even one missed object type or a sneaky missing value can sabotage a model. But with a sharper eye, cleaner prep, and intentional feature extraction, you’ll shift from debugging reactively to building intelligently.

Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
