Daily Python Projects

Daily Python Projects

Web Scraping with BeautifulSoup: Day 2 - Multi-Page Scraping with Pagination

Today we scrape 1,000 books from 50 pages and pull rich information for each book.

Ardit Sulce's avatar
Ardit Sulce
Jun 11, 2026
∙ Paid

Projects in this week’s series:

This week, we build a Web Scraping with BeautifulSoup suite that extracts a complete product catalog from a real website, scales up to thousands of items across many pages, and turns the scraped data into a searchable mini-database.

  • Day 1: Scrape a Single Page

  • Day 2: Multi-Page Scraping with Pagination (Today)

  • Day 3: Search & Analyze the Scraped Data

View All Projects This Week

Today’s Project

Yesterday we scraped 20 books from one page. Today we scrape 1,000 books from 50 pages — the full catalog of books.toscrape.com. Then we go deeper: for each book, we also visit its detail page to pull richer information you can’t get from the listing — the exact UPC, real stock count, full description, and category.

This is what a real scraper looks like: it follows links, scales across pages, and handles failures politely. By the end of today, you’ll have a single CSV holding every book on the site, ready for Day 3’s search tool.

Project Task

Build a multi-page book scraper that:

  • Scrapes every book across all 50 catalog pages

  • Detects the “next page” link to follow pagination automatically

  • Visits each book’s detail page to pull richer fields (UPC, stock count, description, category)

  • Adds a polite delay between requests

  • Shows clear progress (page X of Y, book X of Y)

  • Continues running if one page or detail page fails

  • Saves everything to a single CSV

  • Reports a final summary with counts and totals

This project gives you hands-on practice with pagination, two-level scraping (listing → detail page), polite rate limiting, robust error handling, progress reporting, and combining data from multiple HTTP requests — the techniques behind every production-grade scraper.

Expected Output

Running the full-catalog scraper:

python scrape_all_books.py

Console Output:

The script will run and visit different pages (50 in total) of the books.toscrape.com website and print out info for each book in the terminal:

Generated all_books.csv:

In addition to being printed in the terminal, the data will also be saved in a CSV file. Here is a snapshot of the file:

We will have 1,000 rows, 9 columns, every book on the site. Ready for tomorrow’s search tool.

Setup Instructions

Install Required Packages:

pip install requests beautifulsoup4

(Same as Day 1 — no new dependencies.)

Run it:

python scrape_all_books.py

The full run takes roughly 5-10 minutes depending on your internet — that’s 50 listing pages + 1,000 detail pages = 1,050 requests with a 0.2 second delay between each. The script prints progress so you’ll know it’s working.

Tip: if you just want to see it work without waiting, set MAX_PAGES = 3 near the top of the script for a quick three-page test.

Understanding Pagination

The first scaling problem is finding all the pages. We could hardcode page-1.html through page-50.html, but that breaks the moment the site adds page 51. The robust way is to let the page tell us where to go next.

Look at the bottom of any catalog page and you’ll find:

<li class="next">
  <a href="page-2.html">next</a>
</li>

So our strategy is: scrape the current page, look for <li class="next">, follow it. When that link doesn’t exist, we’re done.

def find_next_url(soup, current_url):
    """Return the absolute URL of the next page, or None if this is the last."""
    next_li = soup.find("li", class_="next")
    if next_li is None:
        return None  # we've reached the end of the catalog
    next_href = next_li.find("a")["href"]
    return urljoin(current_url, next_href)

That urljoin(current_url, ...) is doing real work: pages 2 through 50 use relative links like page-3.html, and urljoin resolves each one against the current page’s URL. The result is always a correct absolute URL.

Understanding the Main Pagination Loop

With find_next_url in hand, the main scrape becomes a small while loop:

url = BASE_URL
page_number = 0
all_books = []

while url:
    page_number += 1
    print(f"Page {page_number}:  {url}")

    html = fetch_page(url)
    if html is None:
        break  # can't fetch this page, stop the run

    soup = BeautifulSoup(html, "html.parser")
    books = scrape_listing_page(soup)
    all_books.extend(books)

    url = find_next_url(soup, url)   # None when we've reached the end
    time.sleep(REQUEST_DELAY)

The whole loop reads in plain English: fetch this page, extract its books, find the next URL, sleep, repeat. When find_next_url returns None, the loop ends naturally. No page counting, no hardcoded limit, no fragility.

Understanding Polite Scraping

Hitting a server with 1,050 requests as fast as Python can send them is rude — and a great way to get rate-limited or banned. The fix is a small delay between requests:

import time

REQUEST_DELAY = 0.2   # seconds between requests

# ...inside the loop, after each request:
time.sleep(REQUEST_DELAY)

0.2 seconds is invisible to you but a huge break for the server — five requests per second is gentle for any well-running site. The fact that books.toscrape.com explicitly invites scraping doesn’t change the principle: every scraper you write should pause between requests. The habit matters more than this particular site’s tolerance.

Understanding Two-Level Scraping

A listing page tells you a book exists. The detail page tells you everything about it — the UPC, the precise stock count, the full description, the category. To get those, we visit each book’s detail URL.

The pattern is two-level: scrape the listing, then for each book in that listing, fetch and scrape its detail page.

def enrich_with_details(book):
    """Fetch the book's detail page and add the extra fields to its dict."""
    html = fetch_page(book["url"])
    if html is None:
        # detail fetch failed; keep the listing data, add empty extras
        book.update(stock_count=None, category=None, upc=None, description=None)
        return book

    soup = BeautifulSoup(html, "html.parser")
    book["upc"] = extract_upc(soup)
    book["stock_count"] = extract_stock_count(soup)
    book["category"] = extract_category(soup)
    book["description"] = extract_description(soup)
    return book

The function takes a book dict (from the listing scrape) and returns the same dict with the extra fields added. This composes cleanly — listing data and detail data end up in one place.

Understanding find_next_sibling

find_next_sibling("p") is worth a closer look — it’s the move for “find me the next paragraph after this element.” That’s how the description is structured on the site: a header div followed by the description paragraph at the same level.

header = soup.find("div", id="product_description")
description_p = header.find_next_sibling("p")

This pattern shows up constantly in scraping when content is logically related but lives in separate tags. Sibling navigation lets you say “the thing right after this one,” without having to know its position in the document. You’ll reach for it again and again.

Understanding Saving the Combined Data

The output schema is the union of listing data and detail data — nine columns total:

FIELDNAMES = [
    "title", "price", "rating", "availability",
    "stock_count", "category", "upc", "description", "url",
]

with open("all_books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(all_books)

Because every book is a dict with the same keys, DictWriter aligns the columns correctly even when a few rows have empty description fields. The result is a single, clean CSV — 1,001 lines including the header — that becomes the input for Day 3.

Coming Tomorrow

Tomorrow we put the dataset to work. The Search & Analyze tool takes all_books.csv and turns it into a query interface: search by title or keyword, filter by category, price range, and rating, and generate summary statistics — average prices per category, distribution of ratings, top-N by any field. Your scraped data becomes a small but real, useful database.

View Code Evolution

Compare today’s full-catalog scraper with yesterday’s single-page version and see how a small pagination loop and a detail-page step scale 20 books into 1,000 — without rewriting the extraction logic.

Keep reading with a 7-day free trial

Subscribe to Daily Python Projects to keep reading this post and get 7 days of free access to the full post archives.

Already a paid subscriber? Sign in
© 2026 Ardit Sulce · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture