Web Scraping with Python: Your First Scraper Using Beautiful Soup

Posted November 19, 2025 by Karol Polakowski

Web scraping is an essential skill for developers who need to collect structured data from the web. This article walks you through creating a small, robust scraper in Python using requests and Beautiful Soup, explains how the pieces fit together, and highlights ethical and technical best practices.

Prerequisites

You should be comfortable with basic Python (functions, lists, dicts) and have Python 3.8+ installed. Familiarity with HTML (tags, attributes) and CSS selectors is helpful but not required.

Installing dependencies

Install the minimal packages you’ll need:

pip install requests beautifulsoup4

Optionally, install lxml for faster parsing:

pip install lxml

Your first scraper

The following example fetches a page, parses it with Beautiful Soup, and extracts titles, links, and summaries from article-like elements, then writes them to a CSV file. It demonstrates setting a User-Agent header, using a timeout, and basic error handling.

import requests
from bs4 import BeautifulSoup
import csv

URL = 'https://example.com/articles'
HEADERS = {'User-Agent': 'MyScraper/1.0 (+https://example.com/contact)'}

def fetch(url):
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as e:
        print('Request failed:', e)
        return None

def parse(html):
    soup = BeautifulSoup(html, 'lxml')  # use 'html.parser' instead if lxml is not installed
    results = []
    for card in soup.select('article.post'):
        title_el = card.select_one('h2 a')
        if not title_el:
            continue
        title = title_el.get_text(strip=True)
        link = title_el.get('href')
        summary_el = card.select_one('p.summary') or card.select_one('p')
        summary = summary_el.get_text(strip=True) if summary_el else ''
        results.append({'title': title, 'link': link, 'summary': summary})
    return results

if __name__ == '__main__':
    html = fetch(URL)
    if not html:
        raise SystemExit('Failed to fetch page')
    items = parse(html)
    with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link', 'summary'])
        writer.writeheader()
        writer.writerows(items)
    print(f'Saved {len(items)} items')

How it works

Fetching the page

The requests.get call retrieves the HTML. Use a realistic User-Agent and a timeout, handle network exceptions, and call resp.raise_for_status() so that non-2xx status codes are raised as errors.

Parsing with Beautiful Soup

Beautiful Soup converts HTML into a parse tree. You can select elements using .select() (CSS selectors) or helper methods like .find_all(), .find(), and .select_one(). Using lxml as the parser improves speed and robustness.
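
For example, on a page laid out like the one above, a CSS selector and an explicit find_all/find traversal can extract the same titles; the article.post and h2 structure is just the assumption carried over from the earlier script:

soup = BeautifulSoup(html, 'lxml')

# CSS selector: every <a> inside an <h2> inside <article class="post">
titles_css = [a.get_text(strip=True) for a in soup.select('article.post h2 a')]

# Equivalent traversal with find_all / find
titles_find = []
for card in soup.find_all('article', class_='post'):
    heading = card.find('h2')
    if heading and heading.a:
        titles_find.append(heading.a.get_text(strip=True))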

Extracting data

After locating elements, use .get_text(strip=True) for text and .get('href') for attributes. Normalize or resolve relative URLs with urllib.parse.urljoin when needed.
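
A minimal sketch of resolving a possibly relative href, reusing the URL constant from the script above:

from urllib.parse import urljoin

link = title_el.get('href')            # may be relative, e.g. '/articles/42'
absolute_link = urljoin(URL, link)     # -> 'https://example.com/articles/42'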

Best practices and etiquette

  • Respect robots.txt and site terms of service. Programmatic scraping can be legally sensitive.
  • Rate limit your requests (time.sleep between requests) and avoid high concurrency on small sites.
  • Cache responses when possible to reduce load.
  • Set a meaningful User-Agent and provide contact info if appropriate.
  • Use timeouts and retry/backoff strategies for transient errors (see the sketch after this list).
  • Avoid scraping private or paywalled content without explicit permission.
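
One common way to get retries with exponential backoff is to mount an HTTPAdapter configured with urllib3's Retry onto a requests Session. The retry counts and status codes below are reasonable defaults rather than requirements, and HEADERS and URL refer to the constants from the script above:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,                                    # at most three retries per request
    backoff_factor=1,                           # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],
)
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

resp = session.get(URL, headers=HEADERS, timeout=10)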

Handling pagination and sessions

For paginated content, iterate through pages (next links or query params). Use requests.Session() to persist cookies and reduce overhead if the site uses session-based navigation.
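
A minimal pagination sketch building on the fetch-and-parse functions above; the ?page= query parameter and the page count are assumptions about the target site, not something every site supports:

import time
import requests

session = requests.Session()
session.headers.update(HEADERS)                # reuse the User-Agent from the script above

all_items = []
for page in range(1, 6):                       # pages 1-5; adjust to the real page count
    resp = session.get(URL, params={'page': page}, timeout=10)
    if resp.status_code != 200:
        break
    all_items.extend(parse(resp.text))
    time.sleep(1)                              # polite pause between requests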

When pages are dynamic (JavaScript)

If the content is rendered client-side, consider these options:

  • Reverse-engineer the API calls made by the page (best when available).
  • Use a headless browser (Playwright, Selenium) or requests-html for JS execution.
  • Check the browser's network tab to find JSON endpoints; scraping JSON is usually more robust than parsing HTML (see the sketch below).
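
For instance, if the network tab reveals that the page loads its data from a JSON endpoint, you can often call that endpoint directly with requests. The URL and field names below are purely hypothetical:

resp = requests.get(
    'https://example.com/api/articles?page=1',  # hypothetical endpoint spotted in the network tab
    headers=HEADERS,
    timeout=10,
)
resp.raise_for_status()
data = resp.json()
for item in data.get('results', []):            # field names depend entirely on the real API
    print(item.get('title'), item.get('url'))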

Error handling and robustness

  • Validate selectors against multiple pages (sites change layout).
  • Log failures and sample HTML for debugging.
  • Add unit tests for parsing functions using saved HTML fixtures (see the example after this list).
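
A minimal sketch of such a test, assuming a saved copy of the page lives at tests/fixtures/articles.html and that parse() is importable from a module named scraper:

from pathlib import Path

from scraper import parse                      # hypothetical module containing parse()

def test_parse_extracts_titles_and_links():
    html = Path('tests/fixtures/articles.html').read_text(encoding='utf-8')
    items = parse(html)
    assert items, 'expected at least one parsed article'
    for item in items:
        assert item['title']
        assert item['link']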

Storing and using scraped data

Save results to CSV, JSON, or a database depending on volume. Respect data privacy and usage policies when storing or sharing scraped data.
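
For example, writing the same items list produced by parse() to JSON instead of CSV only needs the standard library:

import json

with open('articles.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=2)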

Advanced topics (next steps)

  • Parallel scraping with care: use async requests (aiohttp) or controlled thread pools, and always obey site limits (a sketch follows this list).
  • Proxies and IP rotation: for large-scale scraping only, and ensure you comply with legal and ethical rules.
  • Structured extraction: use tools like extruct or specialized libraries when extracting microdata, JSON-LD, or RDFa.
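
As a sketch of throttled asynchronous fetching with aiohttp, where the concurrency limit of 2 and the page URLs are illustrative only and HEADERS and URL come from the script above:

import asyncio
import aiohttp

async def fetch_all(urls, limit=2):
    sem = asyncio.Semaphore(limit)             # cap the number of concurrent requests
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        async def fetch_one(url):
            async with sem:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    resp.raise_for_status()
                    return await resp.text()
        return await asyncio.gather(*(fetch_one(u) for u in urls))

pages = asyncio.run(fetch_all([f'{URL}?page={n}' for n in range(1, 4)]))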

Conclusion

Beautiful Soup with requests is an efficient, maintainable starting point for many scraping tasks. Start small, respect the target site, and make incremental improvements: handle errors, add retries, and consider API-based or headless-browser approaches only when necessary.

Happy scraping — and be kind to the servers you depend on!