On this tutorial, we discover how we use Daft as a high-performance, Python-native information engine to construct an end-to-end analytical pipeline. We begin by loading a real-world MNIST dataset, then progressively rework it utilizing UDFs, function engineering, aggregations, joins, and lazy execution. Additionally, we exhibit methods to seamlessly mix structured information processing, numerical computation, and machine studying. By the tip, we aren’t simply manipulating information, we’re constructing an entire model-ready pipeline powered by Daft’s scalable execution engine.
!pip -q set up daft pyarrow pandas numpy scikit-learn
import os
os.environ["DO_NOT_TRACK"] = "true"
import numpy as np
import pandas as pd
import daft
from daft import col
print("Daft model:", getattr(daft, "__version__", "unknown"))
URL = "https://github.com/Eventual-Inc/mnist-json/uncooked/grasp/mnist_handwritten_test.json.gz"
df = daft.read_json(URL)
print("nSchema (sampled):")
print(df.schema())
print("nPeek:")
df.present(5)
We set up Daft and its supporting libraries instantly in Google Colab to make sure a clear, reproducible atmosphere. We configure elective settings and confirm the put in model to substantiate every part is working accurately. By doing this, we set up a steady basis for constructing our end-to-end information pipeline.
def to_28x28(pixels):
arr = np.array(pixels, dtype=np.float32)
if arr.dimension != 784:
return None
return arr.reshape(28, 28)
df2 = (
df
.with_column(
"img_28x28",
col("picture").apply(to_28x28, return_dtype=daft.DataType.python())
)
.with_column(
"pixel_mean",
col("img_28x28").apply(lambda x: float(np.imply(x)) if x will not be None else None,
return_dtype=daft.DataType.float32())
)
.with_column(
"pixel_std",
col("img_28x28").apply(lambda x: float(np.std(x)) if x will not be None else None,
return_dtype=daft.DataType.float32())
)
)
print("nAfter reshaping + easy options:")
df2.choose("label", "pixel_mean", "pixel_std").present(5)
We load a real-world MNIST JSON dataset instantly from a distant URL utilizing Daft’s native reader. We examine the schema and preview the info to know its construction and column sorts. It permits us to validate the dataset earlier than making use of transformations and have engineering.
@daft.udf(return_dtype=daft.DataType.checklist(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
out = []
for img in images_28x28.to_pylist():
if img is None:
out.append(None)
proceed
img = np.asarray(img, dtype=np.float32)
row_sums = img.sum(axis=1) / 255.0
col_sums = img.sum(axis=0) / 255.0
whole = img.sum() + 1e-6
ys, xs = np.indices(img.form)
cy = float((ys * img).sum() / whole) / 28.0
cx = float((xs * img).sum() / whole) / 28.0
vec = np.concatenate([row_sums, col_sums, np.array([cy, cx, img.mean()/255.0, img.std()/255.0], dtype=np.float32)])
out.append(vec.astype(np.float32).tolist())
return out
df3 = df2.with_column("options", featurize(col("img_28x28")))
print("nFeature column created (checklist[float]):")
df3.choose("label", "options").present(2)
We reshape the uncooked pixel arrays into structured 28×28 pictures utilizing a row-wise UDF. We compute statistical options, such because the imply and customary deviation, to complement the dataset. By making use of these transformations, we convert uncooked picture information into structured and model-friendly representations.
label_stats = (
df3.groupby("label")
.agg(
col("label").depend().alias("n"),
col("pixel_mean").imply().alias("mean_pixel_mean"),
col("pixel_std").imply().alias("mean_pixel_std"),
)
.kind("label")
)
print("nLabel distribution + abstract stats:")
label_stats.present(10)
df4 = df3.be part of(label_stats, on="label", how="left")
print("nJoined label stats again onto every row:")
df4.choose("label", "n", "mean_pixel_mean", "mean_pixel_std").present(5)
We implement a batch UDF to extract richer function vectors from the reshaped pictures. We carry out group-by aggregations and be part of abstract statistics again to the dataset for contextual enrichment. This demonstrates how we mix scalable computation with superior analytics inside Daft.
small = df4.choose("label", "options").acquire().to_pandas()
small = small.dropna(subset=["label", "features"]).reset_index(drop=True)
X = np.vstack(small["features"].apply(np.array).values).astype(np.float32)
y = small["label"].astype(int).values
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
clf = LogisticRegression(max_iter=1000, n_jobs=None)
clf.match(X_train, y_train)
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print("nBaseline accuracy (feature-engineered LogisticRegression):", spherical(acc, 4))
print("nClassification report:")
print(classification_report(y_test, pred, digits=4))
out_df = df4.choose("label", "options", "pixel_mean", "pixel_std", "n")
out_path = "/content material/daft_mnist_features.parquet"
out_df.write_parquet(out_path)
print("nWrote parquet to:", out_path)
df_back = daft.read_parquet(out_path)
print("nRead-back examine:")
df_back.present(3)
We materialize chosen columns into pandas and prepare a baseline Logistic Regression mannequin. We consider efficiency to validate the usefulness of our engineered options. Additionally, we persist the processed dataset to Parquet format, finishing our end-to-end pipeline from uncooked information ingestion to production-ready storage.
On this tutorial, we constructed a production-style information workflow utilizing Daft, shifting from uncooked JSON ingestion to function engineering, aggregation, mannequin coaching, and Parquet persistence. We demonstrated methods to combine superior UDF logic, carry out environment friendly groupby and be part of operations, and materialize outcomes for downstream machine studying, all inside a clear, scalable framework. Via this course of, we noticed how Daft allows us to deal with complicated transformations whereas remaining Pythonic and environment friendly. We completed with a reusable, end-to-end pipeline that showcases how we are able to mix fashionable information engineering and machine studying workflows in a unified atmosphere.
Take a look at the Full Codes right here. Additionally, be at liberty to comply with us on Twitter and don’t neglect to hitch our 120k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you may be part of us on telegram as nicely.

