How one can Crawl an Whole Documentation Web site with Olostep

By admin2010

April 20, 2026

79

How one can Crawl an Whole Documentation Web site with Olostep

Picture by Creator

# Introduction

Internet crawling is the method of robotically visiting net pages, following hyperlinks, and accumulating content material from a web site in a structured means. It’s generally used to assemble massive quantities of data from documentation websites, articles, information bases, and different net sources.

Crawling a whole web site after which changing that content material right into a format that an AI agent can really use shouldn’t be so simple as it sounds. Documentation websites usually include nested pages, repeated navigation hyperlinks, boilerplate content material, and inconsistent web page constructions. On prime of that, the extracted content material must be cleaned, organized, and saved in a means that’s helpful for downstream AI workflows resembling retrieval, question-answering, or agent-based methods.

On this information, we’ll be taught why to make use of Olostep as a substitute of Scrapy or Selenium, arrange every little thing wanted for the net crawling venture, write a easy crawling script to scrape a documentation web site, and at last create a frontend utilizing Gradio in order that anybody can present a hyperlink and different arguments to crawl web site pages.

# Selecting Olostep Over Scrapy or Selenium

Scrapy is highly effective, however it’s constructed as a full scraping framework. That’s helpful if you need deep management, but it surely additionally means extra setup and extra engineering work.

Selenium is best recognized for browser automation. It’s helpful for interacting with JavaScript-heavy pages, however it’s not actually designed as a documentation crawling workflow by itself.

With Olostep, the pitch is much more direct: search, crawl, scrape, and construction net information by one utility programming interface (API), with help for LLM-friendly outputs like Markdown, textual content, HTML, and structured JSON. Which means you don’t have to manually sew collectively items for discovery, extraction, formatting, and downstream AI use in the identical means.

For documentation websites, that can provide you a a lot sooner path from URL to usable content material since you are spending much less time constructing the crawling stack your self and extra time working with the content material you really need.

# Putting in the Packages and Setting the API Key

First, set up the Python packages used on this venture. The official Olostep software program growth equipment (SDK) requires Python 3.11 or later.

pip set up olostep python-dotenv tqdm

These packages deal with the principle elements of the workflow:

olostep connects your script to the Olostep API
python-dotenv hundreds your API key from a .env file
tqdm provides a progress bar so you’ll be able to monitor saved pages

Subsequent, create a free Olostep account, open the dashboard, and generate an API key from the API keys web page. Olostep’s official docs and integrations level customers to the dashboard for API key setup.

Then create a .env file in your venture folder:

OLOSTEP_API_KEY=your_real_api_key_here

This retains your API key separate out of your Python code, which is a cleaner and safer method to handle credentials.

# Creating the Crawler Script

On this a part of the venture, we’ll construct the Python script that crawls a documentation web site, extracts every web page in Markdown format, cleans the content material, and saves it regionally as particular person recordsdata. We’ll create the venture folder, add a Python file, after which write the code step-by-step so it’s straightforward to observe and take a look at.

First, create a venture folder on your crawler. Inside that folder, create a brand new Python file named crawl_docs_with_olostep.py.

Now we’ll add the code to this file one part at a time. This makes it simpler to grasp what every a part of the script does and the way the total crawler works collectively.

// Defining the Crawl Settings

Begin by importing the required libraries. Then outline the principle crawl settings, such because the beginning URL, crawl depth, web page restrict, embody and exclude guidelines, and the output folder the place the Markdown recordsdata might be saved. These values management how a lot of the documentation website will get crawled and the place the outcomes are saved.

import os
import re
from pathlib import Path
from urllib.parse import urlparse

from dotenv import load_dotenv
from tqdm import tqdm
from olostep import Olostep

START_URL = "https://docs.olostep.com/"
MAX_PAGES = 10
MAX_DEPTH = 1

INCLUDE_URLS = [
    "/**"
]

EXCLUDE_URLS = []

OUTPUT_DIR = Path("olostep_docs_output")

// Making a Helper Operate to Generate Protected File Names

Every crawled web page must be saved as its personal Markdown file. To try this, we’d like a helper operate that converts a URL right into a clear and filesystem-safe file title. This avoids issues with slashes, symbols, and different characters that don’t work properly in file names.

def slugify_url(url: str) -> str:
    parsed = urlparse(url)
    path = parsed.path.strip("https://www.kdnuggets.com/")

    if not path:
        path = "index"

    filename = re.sub(r"[^a-zA-Z0-9/_-]+", "-", path)
    filename = filename.substitute("https://www.kdnuggets.com/", "__").strip("-_")

    return f"{filename or 'web page'}.md"

// Making a Helper Operate to Save Markdown Recordsdata

Subsequent, add helper capabilities to course of the extracted content material earlier than saving it.

The primary operate cleans the Markdown by eradicating further interface textual content, repeated clean strains, and undesirable web page components resembling suggestions prompts. This helps preserve the saved recordsdata centered on the precise documentation content material.

def clean_markdown(markdown: str) -> str:
    textual content = markdown.substitute("rn", "n").strip()
    textual content = re.sub(r"[s*u200b?s*](#.*?)", "", textual content, flags=re.DOTALL)

    strains = [line.rstrip() for line in text.splitlines()]

    start_index = 0
    for index in vary(len(strains) - 1):
        title = strains[index].strip()
        underline = strains[index + 1].strip()
        if title and underline and set(underline) == {"="}:
            start_index = index
            break
    else:
        for index, line in enumerate(strains):
            if line.lstrip().startswith("# "):
                start_index = index
                break

    strains = strains[start_index:]

    for index, line in enumerate(strains):
        if line.strip() == "Was this web page useful?":
            strains = strains[:index]
            break

    cleaned_lines: checklist[str] = []
    for line in strains:
        stripped = line.strip()
        if stripped in {"Copy web page", "YesNo", "⌘I"}:
            proceed
        if not stripped and cleaned_lines and never cleaned_lines[-1]:
            proceed
        cleaned_lines.append(line)

    return "n".be part of(cleaned_lines).strip()

The second operate saves the cleaned Markdown into the output folder and provides the supply URL on the prime of the file. There’s additionally a small helper operate to clear previous Markdown recordsdata earlier than saving a brand new crawl outcome.

def save_markdown(output_dir: Path, url: str, markdown: str) -> None:
    output_dir.mkdir(dad and mom=True, exist_ok=True)
    filepath = output_dir / slugify_url(url)

    content material = f"""---
source_url: {url}
---

{markdown}
"""
    filepath.write_text(content material, encoding="utf-8")

There’s additionally a small helper operate to clear previous Markdown recordsdata earlier than saving a brand new crawl outcome.

def clear_output_dir(output_dir: Path) -> None:
    if not output_dir.exists():
        return

    for filepath in output_dir.glob("*.md"):
        filepath.unlink()

// Creating the Principal Crawler Logic

That is the principle a part of the script. It hundreds the API key from the .env file, creates the Olostep consumer, begins the crawl, waits for it to complete, retrieves every crawled web page as Markdown, cleans the content material, and saves it regionally.

This part ties every little thing collectively and turns the person helper capabilities right into a working documentation crawler.

def foremost() -> None:
    load_dotenv()
    api_key = os.getenv("OLOSTEP_API_KEY")

    if not api_key:
        increase RuntimeError("Lacking OLOSTEP_API_KEY in your .env file.")

    consumer = Olostep(api_key=api_key)

    crawl = consumer.crawls.create(
        start_url=START_URL,
        max_pages=MAX_PAGES,
        max_depth=MAX_DEPTH,
        include_urls=INCLUDE_URLS,
        exclude_urls=EXCLUDE_URLS,
        include_external=False,
        include_subdomain=False,
        follow_robots_txt=True,
    )

    print(f"Began crawl: {crawl.id}")
    crawl.wait_till_done(check_every_n_secs=5)

    pages = checklist(crawl.pages())
    clear_output_dir(OUTPUT_DIR)

    for web page in tqdm(pages, desc="Saving pages"):
        strive:
            content material = web page.retrieve(["markdown"])
            markdown = getattr(content material, "markdown_content", None)

            if markdown:
                save_markdown(OUTPUT_DIR, web page.url, clean_markdown(markdown))
        besides Exception as exc:
            print(f"Didn't retrieve {web page.url}: {exc}")

    print(f"Executed. Recordsdata saved in: {OUTPUT_DIR.resolve()}")


if __name__ == "__main__":
    foremost()

Word: The total script is offered right here: kingabzpro/web-crawl-olostep, an online crawler and starter net app constructed with Olostep.

// Testing the Internet Crawling Script

As soon as the script is full, run it out of your terminal:

python crawl_docs_with_olostep.py

Because the script runs, you will notice the crawler course of the pages and save them one after the other as Markdown recordsdata in your output folder.

After the crawl finishes, open the saved recordsdata to test the extracted content material. You need to see clear, readable Markdown variations of the documentation pages.

At that time, your documentation content material is able to use in AI workflows resembling search, retrieval, or agent-based methods.

# Creating the Olostep Internet Crawling Internet Software

On this a part of the venture, we’ll construct a easy net utility on prime of the crawler script. As a substitute of modifying the Python file each time, this utility provides you a neater method to enter a documentation URL, select crawl settings, run the crawl, and preview the saved Markdown recordsdata in a single place.

The frontend code for this utility is offered in app.py within the repository: web-crawl-olostep/app.py.

This utility does a couple of helpful issues:

Allows you to enter a beginning URL for the crawl
Allows you to set the utmost variety of pages to crawl
Allows you to management crawl depth
Allows you to add embody and exclude URL patterns
Runs the backend crawler immediately from the interface
Saves the crawled pages right into a folder based mostly on the URL
Reveals all saved Markdown recordsdata in a dropdown
Previews every Markdown file immediately inside the appliance
Allows you to clear earlier crawl outcomes with one button

To begin the appliance, run:

After that, Gradio will begin an area net server and supply a hyperlink like this:

* Working on native URL: http://127.0.0.1:7860
* To create a public hyperlink, set `share=True` in `launch()`.

As soon as the appliance is operating, open the native URL in your browser. In our instance, we gave the appliance the Claude Code documentation URL and requested it to crawl 50 pages with a depth of 5.

Gradio Interface for Documentation Crawling

While you click on Run Crawl, the appliance passes your settings to the backend crawler and begins the crawl. Within the terminal, you’ll be able to watch the progress as pages are crawled and saved one after the other.

After the crawl finishes, the output folder will include the saved Markdown recordsdata. On this instance, you’d see that fifty recordsdata had been added.

The dropdown within the utility is then up to date robotically, so you’ll be able to open any saved file and preview it immediately within the net interface as correctly formatted Markdown.

This makes the crawler a lot simpler to make use of. As a substitute of fixing values in code each time, you’ll be able to take a look at totally different documentation websites and crawl settings by a easy interface. That additionally makes the venture simpler to share with different individuals who could not need to work immediately in Python.

# Ultimate Takeaway

Internet crawling shouldn’t be solely about accumulating pages from a web site. The actual problem is popping that content material into clear, structured recordsdata that an AI system can really use. On this venture, we used a easy Python script and a Gradio utility to make that course of a lot simpler.

Simply as importantly, the workflow is quick sufficient for actual use. In our instance, crawling 50 pages with a depth of 5 took solely round 50 seconds, which reveals you could put together documentation information shortly with out constructing a heavy pipeline.

This setup also can transcend a one-time crawl. You possibly can schedule it to run day by day with cron or Activity Scheduler, and even replace solely the pages which have modified. That retains your documentation contemporary whereas utilizing solely a small variety of credit.

For groups that want this type of workflow to make enterprise sense, Olostep is constructed with that in thoughts. It’s considerably extra inexpensive than constructing or sustaining an inner crawling resolution, and at the least 50% cheaper than comparable options available on the market.

As your utilization grows, the fee per request continues to lower, which makes it a sensible alternative for bigger documentation pipelines. That mixture of reliability, scalability, and powerful unit economics is why a number of the fastest-growing AI-native startups depend on Olostep to energy their information infrastructure.

Abid Ali Awan (@1abidaliawan) is a licensed information scientist skilled who loves constructing machine studying fashions. At the moment, he’s specializing in content material creation and writing technical blogs on machine studying and information science applied sciences. Abid holds a Grasp’s diploma in know-how administration and a bachelor’s diploma in telecommunication engineering. His imaginative and prescient is to construct an AI product utilizing a graph neural community for college kids fighting psychological sickness.

How one can Crawl an Whole Documentation Web site with Olostep

# Introduction

# Selecting Olostep Over Scrapy or Selenium

# Putting in the Packages and Setting the API Key

# Creating the Crawler Script

// Defining the Crawl Settings

// Making a Helper Operate to Generate Protected File Names

// Making a Helper Operate to Save Markdown Recordsdata

// Creating the Principal Crawler Logic

// Testing the Internet Crawling Script

# Creating the Olostep Internet Crawling Internet Software

# Ultimate Takeaway

Run the Mythos Enhanced Coding Mannequin Domestically with llama.cpp and Pi

The Obtain: Chinese language AI divides the White Home, and a document copyright payout

NVIDIA Releases Cosmos 3 Edge: A 4B-Parameter Open World Mannequin That Causes and Generates Robotic Actions On-System

LEAVE A REPLY Cancel reply

Most Popular

Crypto Readability Act nonetheless at mercy of ethics part as Democrats balk at Trump deal

Run the Mythos Enhanced Coding Mannequin Domestically with llama.cpp and Pi

CSPR is out there for buying and selling!

SN62 is accessible for buying and selling!

Recent Comments

ABOUT US

POPULAR POSTS

Crypto Readability Act nonetheless at mercy of ethics part as Democrats balk at Trump deal

Run the Mythos Enhanced Coding Mannequin Domestically with llama.cpp and Pi

CSPR is out there for buying and selling!

POPULAR CATEGORY