
# Introduction
Python and data projects have a dependency problem. Between Python versions, virtual environments, system-level packages, and operating system differences, getting someone else's code to run on your machine can sometimes take longer than understanding the code itself.
Docker solves this by packaging your code and its entire environment (Python version, dependencies, system libraries) into a single artifact called the image. From the image you can start containers that run identically on your laptop, your teammate's machine, and a cloud server. You stop debugging environments and start shipping work.
In this article, you'll learn Docker through practical examples with a focus on data projects: containerizing a script, serving a machine learning model with FastAPI, wiring up a multi-service pipeline with Docker Compose, and scheduling a job with a cron container.
# Prerequisites
Before working through the examples, you'll need:
- Docker and Docker Compose installed on your operating system. Follow the official installation guide for your platform.
- Familiarity with the command line and Python.
- Familiarity with writing a Dockerfile, building an image, and running a container from that image.
You don't need deep Docker knowledge to follow along. Each example explains what's happening as it goes.
# Containerizing a Python Script with Pinned Dependencies
Let's start with the most common use case: you have a Python script and a requirements.txt, and you want it to run reliably anywhere.
We'll build a data cleaning script that reads a raw sales CSV file, removes duplicates, fills in missing values, and writes a cleaned version to disk.
// Structuring the Project
The project is organized as follows:
data-cleaner/
├── Dockerfile
├── requirements.txt
├── clean_data.py
└── data/
    └── raw_sales.csv
// Writing the Script
Here's the data cleaning script that uses Pandas to do the heavy lifting:
# clean_data.py
import pandas as pd
import os
INPUT_PATH = "data/raw_sales.csv"
OUTPUT_PATH = "data/cleaned_sales.csv"
print("Reading data...")
df = pd.read_csv(INPUT_PATH)
print(f"Rows before cleaning: {len(df)}")
# Drop duplicate rows
df = df.drop_duplicates()
# Fill missing numeric values with column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())
# Fill missing text values with 'Unknown'
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna('Unknown')
print(f"Rows after cleaning: {len(df)}")
df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned file saved to {OUTPUT_PATH}")
// Pinning Dependencies
Pinning exact versions is important. Without it, pip install pandas might install different versions on different machines. Pinned versions ensure everyone installs the same releases of these packages and gets the same behavior. You can define the exact versions in the requirements.txt file like so:
pandas==2.2.0
openpyxl==3.1.2
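One way to confirm that an environment actually matches the pins is to compare the requirements file against what is installed. The sketch below is a minimal illustration and not part of the original project: the helper names `parse_pins` and `find_mismatches` are hypothetical, and it only understands simple `name==version` lines.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for simple 'name==version' lines."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pins with installed packages; return {package: (wanted, found)}."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


pins = parse_pins("pandas==2.2.0\nopenpyxl==3.1.2\n")
print(pins)  # {'pandas': '2.2.0', 'openpyxl': '3.1.2'}
```

Calling `find_mismatches(pins)` inside the container should return an empty dictionary if the image was built from the same requirements file.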
// Defining the Dockerfile
This Dockerfile builds a minimal, cache-friendly image for the cleaning script:
# Use a slim Python 3.11 base image
FROM python:3.11-slim
# Set the working directory inside the container
WORKDIR /app
# Copy and install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the script into the container
COPY clean_data.py .
# Default command to run when the container starts
CMD ["python", "clean_data.py"]
There are a few things worth explaining here. We use python:3.11-slim instead of the full Python image because it's significantly smaller and strips out packages you don't need.
We copy requirements.txt before copying the rest of the code, and that's intentional. Docker builds images in layers and caches each one. If you only change clean_data.py, Docker won't reinstall all your dependencies on the next build. It reuses the cached pip layer and jumps straight to copying your updated script. That small ordering decision can save you minutes of rebuild time.
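To see why the ordering matters, it can help to model the cache as a chain of keys, where each layer's key depends on its instruction, the content it copies, and the key of the layer before it. The following is a simplified toy model, not Docker's actual cache implementation; `layer_keys`, `reused_layers`, and the file contents are made up for illustration.

```python
import hashlib


def layer_keys(steps):
    """steps: list of (instruction, content) pairs, in Dockerfile order.

    Each key hashes the parent key, the instruction, and the copied content,
    so a change in one layer also changes every key after it.
    """
    keys = []
    parent = ""
    for instruction, content in steps:
        digest = hashlib.sha256(
            (parent + "\n" + instruction + "\n" + content).encode()
        ).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def reused_layers(old_steps, new_steps):
    """Count leading layers whose keys are identical in both builds."""
    count = 0
    for old, new in zip(layer_keys(old_steps), layer_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count


first_build = [
    ("COPY requirements.txt .", "pandas==2.2.0\nopenpyxl==3.1.2\n"),
    ("RUN pip install --no-cache-dir -r requirements.txt", ""),
    ("COPY clean_data.py .", "print('v1')"),
]
# Only the script changes between builds.
second_build = first_build[:2] + [("COPY clean_data.py .", "print('v2')")]
print(reused_layers(first_build, second_build))  # 2: the pip layer is reused

# If the script were copied first, the same edit would invalidate everything.
script_first_old = [first_build[2], first_build[0], first_build[1]]
script_first_new = [second_build[2], second_build[0], second_build[1]]
print(reused_layers(script_first_old, script_first_new))  # 0
```

In this model, putting the frequently edited file last keeps the expensive install step on the reusable side of the chain.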
// Building and Running
Build the image, then run the container with your local data folder mounted:
# Build the image and tag it
docker build -t data-cleaner .
# Run it, mounting your local data/ folder into the container
docker run --rm -v $(pwd)/data:/app/data data-cleaner
The -v $(pwd)/data:/app/data flag mounts your local data/ folder into the container at /app/data. This is how the script reads your CSV and how the cleaned output gets written back to your machine. Nothing is baked into the image and the data stays on your filesystem.
The --rm flag automatically removes the container after it finishes. Since this is a one-off script, there's no reason to keep a stopped container lying around.
# Serving a Machine Learning Model with FastAPI
You've trained a model and you want to make it available over HTTP so other services can send data and get predictions back. FastAPI works great for this: it's fast, lightweight, and handles input validation with Pydantic.
// Structuring the Project
The project separates the model artifact from the application code:
ml-api/
├── Dockerfile
├── requirements.txt
├── app.py
└── model.pkl
// Writing the App
The following app loads the model once at startup and exposes a /predict endpoint:
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle
import numpy as np
app = FastAPI(title="Sales Forecast API")
# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int
class PredictResponse(BaseModel):
    region: str
    predicted_revenue: float
@app.get("/health")
def health():
    return {"status": "ok"}
@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    try:
        features = [[
            request.month,
            request.marketing_spend,
            request.units_in_stock
        ]]
        prediction = model.predict(features)
        return PredictResponse(
            region=request.region,
            predicted_revenue=round(float(prediction[0]), 2)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
The PredictRequest class does the input validation for you. If someone sends a request with a missing field or a string where a number is expected, FastAPI rejects it with a clear error message before your model code even runs. The model is loaded once at startup, not on every request, which keeps response times fast.
The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is actually up and ready.
// Defining the Dockerfile
This Dockerfile bakes the model directly into the image so the container is fully self-contained:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the model and the app together
COPY model.pkl .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
The model.pkl is baked into the image at build time. This means the container is fully self-contained, and you don't need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without this, you won't be able to reach the API from outside the container.
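The difference between the two bind addresses can be seen with plain sockets. This is a small standard-library sketch, independent of Uvicorn; the helper name `bound_address` is made up, and port 0 simply asks the operating system for any free port.

```python
import socket


def bound_address(host):
    """Bind a TCP socket to host on a free port and return the bound IP."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[0]


# Loopback only: reachable just from inside the same network namespace.
print(bound_address("127.0.0.1"))  # 127.0.0.1
# All interfaces: also reachable through a published container port.
print(bound_address("0.0.0.0"))  # 0.0.0.0
```

Inside a container, 127.0.0.1 refers to the container's own loopback interface, which is why a server bound only to it is not reachable through a published port.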
// Building and Running
Build the image and start the API server:
docker build -t ml-api .
docker run --rm -p 8000:8000 ml-api
Test it with curl:
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'
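The same request can be sent from Python. The sketch below uses only the standard library and assumes the API container is listening on localhost:8000; `build_predict_request` is a hypothetical helper name, and the actual network call is left commented out so the snippet runs without the server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # assumes the 8000:8000 port mapping


def build_predict_request(payload, url=API_URL):
    """Build a JSON POST request matching the PredictRequest fields."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "region": "North",
    "month": 3,
    "marketing_spend": 5000.0,
    "units_in_stock": 320,
}
request = build_predict_request(payload)

# Sending requires the API container to be running:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```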
# Building a Multi-Service Pipeline with Docker Compose
Real data projects rarely involve just one process. You might need a database, a script that loads data into it, and a dashboard that reads from it, all running together.
Docker Compose lets you define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.
// Structuring the Project
The pipeline splits each service into its own subdirectory:
pipeline/
├── docker-compose.yml
├── loader/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── load_data.py
└── dashboard/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
// Defining the Compose File
This Compose file declares all three services and wires them together with health checks and shared URL environment variables:
# docker-compose.yml
version: "3.9"
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: analytics
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
      interval: 5s
      retries: 5
  loader:
    build: ./loader
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics
  dashboard:
    build: ./dashboard
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8501:8501"
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics
volumes:
  pgdata:
// Writing the Loader Script
This script waits briefly for the database, then loads a CSV into the sales table using SQLAlchemy:
# loader/load_data.py
import pandas as pd
from sqlalchemy import create_engine
import os
import time
DATABASE_URL = os.environ["DATABASE_URL"]
# Give the DB a moment to be fully ready
time.sleep(3)
engine = create_engine(DATABASE_URL)
df = pd.read_csv("sales_data.csv")
df.to_sql("sales", engine, if_exists="replace", index=False)
print(f"Loaded {len(df)} rows into the sales table.")
Let's take a closer look at the Compose file. Each service runs in its own container, but they're all on the same Docker-managed network, so they can reach each other using the service name as a hostname. The loader connects to db:5432, not localhost, because db is the service name, and Docker handles the DNS resolution automatically.
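Parsing the connection string makes the role of the service name explicit. This sketch uses only the standard library and the DATABASE_URL value from the Compose file.

```python
from urllib.parse import urlsplit

DATABASE_URL = "postgresql://admin:secret@db:5432/analytics"

parts = urlsplit(DATABASE_URL)
print(parts.hostname)  # db (the Compose service name, resolved by Docker's DNS)
print(parts.port)      # 5432
print(parts.username)  # admin
print(parts.path.lstrip("/"))  # analytics
```

The hostname only resolves from containers on the same Compose network; from the host machine, the name db would not be known.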
The healthcheck on the PostgreSQL service is important. depends_on alone only waits for the container to start, not for PostgreSQL to be ready to accept connections. The healthcheck uses pg_isready to confirm the database is actually up before the loader tries to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline won't wipe your data.
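Inside the loader, a short retry loop is a sturdier alternative to a fixed time.sleep(3). The sketch below is one possible approach rather than the article's loader: `wait_for_port` is a hypothetical helper that retries a TCP connection, and the `connect` parameter exists only so the logic can be exercised without a real database.

```python
import socket
import time


def wait_for_port(host, port, attempts=10, delay=1.0, connect=None):
    """Return True once a TCP connection to host:port succeeds, else False."""
    if connect is None:
        connect = socket.create_connection
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=2)
        except OSError:
            if attempt < attempts:
                time.sleep(delay)  # wait before the next try
            continue
        conn.close()
        return True
    return False
```

A call such as `wait_for_port("db", 5432)` could run before `create_engine`. An open port is a weaker signal than pg_isready, so this complements the Compose healthcheck and does not replace it.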
// Starting Everything
Bring up all services with a single command:
docker compose up --build
To stop everything, run:
docker compose down
# Scheduling Jobs with a Cron Container
Sometimes you need a script to run on a schedule. Maybe it fetches data from an API every hour and writes it to a database or a file. You don't want to set up a full orchestration system like Airflow for something this simple. A cron container does the job cleanly.
// Structuring the Project
The project includes a crontab file alongside the script and Dockerfile:
data-fetcher/
├── Dockerfile
├── requirements.txt
├── fetch_data.py
└── crontab
// Writing the Fetch Script
This script uses Requests to hit an API endpoint and saves the results as a timestamped CSV:
# fetch_data.py
import requests
import pandas as pd
from datetime import datetime
import os
API_URL = "https://api.example.com/sales/latest"
OUTPUT_DIR = "/app/output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"[{datetime.now()}] Fetching data...")
response = requests.get(API_URL, timeout=10)
response.raise_for_status()
data = response.json()
df = pd.DataFrame(data["records"])
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
df.to_csv(output_path, index=False)
print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
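The file naming can be checked in isolation. The sketch below pulls the timestamp logic into a hypothetical helper, `build_output_path`, using the year-month-day_hour-minute format from the script; the example date is arbitrary.

```python
from datetime import datetime

OUTPUT_DIR = "/app/output"


def build_output_path(now, output_dir=OUTPUT_DIR):
    """Return the CSV path for a given run time."""
    timestamp = now.strftime("%Y%m%d_%H%M")
    return f"{output_dir}/sales_{timestamp}.csv"


print(build_output_path(datetime(2024, 3, 5, 14, 7)))
# /app/output/sales_20240305_1407.csv
```

Because the name has minute resolution, two runs within the same minute would write to the same file. With an hourly schedule that is not a concern.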
// Defining the Crontab
The crontab schedules the script to run every hour and redirects all output to a log file:
# Run every hour, on the hour
# (absolute interpreter path, because cron jobs do not inherit the image's PATH)
0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
The >> /var/log/fetch.log 2>&1 part redirects both standard output and error output to a log file. This is how you inspect what happened after the fact.
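To make the schedule expression concrete, the sketch below computes when an hourly job with minute field 0 would next fire after a given moment. It is a hand-rolled illustration for this one expression, not a general cron parser; `next_hourly_run` is a hypothetical name, and time zones and daylight saving changes are ignored.

```python
from datetime import datetime, timedelta


def next_hourly_run(now):
    """Next run strictly after 'now' for '0 * * * *': minute 0 of the next hour."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


print(next_hourly_run(datetime(2024, 3, 5, 14, 7)))   # 2024-03-05 15:00:00
print(next_hourly_run(datetime(2024, 3, 5, 23, 59)))  # 2024-03-06 00:00:00
```

A container started at 14:07 therefore produces its first CSV at 15:00, so an empty output folder right after startup is expected.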
// Defining the Dockerfile
This Dockerfile installs cron, registers the schedule, and keeps it running in the foreground:
FROM python:3.11-slim
# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fetch_data.py .
COPY crontab /etc/cron.d/fetch-job
# Set correct permissions and register the crontab
RUN chmod 0644 /etc/cron.d/fetch-job && crontab /etc/cron.d/fetch-job
# cron -f runs cron in the foreground, which is required for Docker
CMD ["cron", "-f"]
The cron -f flag is important here. Docker keeps a container alive as long as its main process is running. If cron ran in the background (its default), the main process would exit immediately and Docker would stop the container. The -f flag keeps cron running in the foreground so the container stays alive.
// Building and Running
Build the image and start the container in detached mode:
docker build -t data-fetcher .
docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher
Check the logs any time:
docker exec fetcher cat /var/log/fetch.log
The output folder is mounted from your local machine, so the CSV files land on your filesystem even though the script runs inside the container.
# Wrapping Up
I hope you found this Docker article helpful. Docker doesn't have to be complicated. Start with the first example, swap in your own script and dependencies, and get comfortable with the build-run cycle. Once you've done that, the other patterns follow naturally. Docker is a good fit when:
- You need reproducible environments across machines or team members
- You're sharing scripts or models that have specific dependency requirements
- You're building multi-service systems that need to run together reliably
- You want to deploy anywhere without setup friction
That said, you don't always need to use Docker for all of your Python work. It's probably overkill when:
- You're doing quick, exploratory analysis just for yourself
- Your script has no external dependencies beyond the standard library
- You're early in a project and your requirements are changing rapidly
If you're interested in going further, check out 5 Simple Steps to Mastering Docker for Data Science.
Happy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
