BeautifulSoup (bs4)

BeautifulSoup (bs4) is a Python library that parses HTML and XML documents into navigable parse trees for web scraping and data extraction.

What BeautifulSoup Does and When to Use It

BeautifulSoup (bs4) transforms raw HTML and XML markup into a tree of Python objects. The library provides methods to search, navigate, and modify that tree. BeautifulSoup handles malformed markup gracefully, which makes it reliable for scraping real-world web pages that contain broken or non-standard HTML.
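As a small illustration of that tolerance, the sketch below feeds BeautifulSoup a fragment with unclosed tags. The markup and variable names are made up for this example. It uses the built-in html.parser so it runs without extra installs; other parsers may arrange the repaired tree differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <li> is closed, and </ul> is missing.
broken_html = "<ul><li>First<li>Second"

soup = BeautifulSoup(broken_html, "html.parser")

# Both list items are still found, even though the source never closed them.
items = soup.find_all("li")
print(len(items))

# Serializing the tree writes out the closing tags the source omitted.
repaired = str(soup)
print(repaired)
```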

BeautifulSoup does not fetch web pages on its own. It requires a separate HTTP library such as the Python requests library to download page content. BeautifulSoup focuses exclusively on parsing and extracting data from the downloaded HTML or XML. The official documentation lives at crummy.com/software/BeautifulSoup/bs4/doc.

BeautifulSoup is not a full web scraping framework. For large-scale crawling with built-in concurrency, request scheduling, and data pipelines, use Scrapy instead. For pages that require JavaScript rendering, combine BeautifulSoup with Selenium or Playwright. BeautifulSoup works best for small-to-medium scraping tasks, one-off data extraction, and projects where precise HTML parsing control matters.

How to Install BeautifulSoup

Install BeautifulSoup (bs4) with pip. The package name on PyPI is beautifulsoup4, not bs4.

pip install beautifulsoup4

Install the Python requests library to fetch web pages before parsing them with BeautifulSoup.

pip install requests

Install the lxml parser for faster HTML parsing performance with BeautifulSoup.

pip install lxml

Verify the installation by importing bs4 and printing the version number.

import bs4
print(bs4.__version__)

Core Concepts of BeautifulSoup

BeautifulSoup Object Types: Tag, NavigableString, and BeautifulSoup

BeautifulSoup converts an HTML document into four Python object types. A Tag object represents an HTML or XML element such as <div> or <a>. A NavigableString object holds the text content inside a tag. The BeautifulSoup object represents the entire parsed document and behaves like a top-level Tag. A Comment object is a special NavigableString for HTML comments.

from bs4 import BeautifulSoup

html = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "lxml")

print(type(soup.title))         # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
print(type(soup))               # <class 'bs4.BeautifulSoup'>
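The fourth type, Comment, can be seen with a short sketch like the one below. The sample markup is invented for illustration, and html.parser is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup containing one text node and one HTML comment.
soup = BeautifulSoup("<p>Hello<!-- internal note --></p>", "html.parser")

# .contents lists the direct children of <p>: the text node, then the comment.
text_node, comment_node = soup.p.contents

print(type(comment_node).__name__)                  # the class name of the comment node
print(isinstance(comment_node, NavigableString))    # Comment subclasses NavigableString

# get_text() leaves comments out of the extracted text.
visible_text = soup.p.get_text()
print(visible_text)
```

Checking isinstance(node, Comment) is one way to filter comments out when iterating over text nodes.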

BeautifulSoup Search Methods: find() and find_all()

BeautifulSoup provides find() and find_all() as the primary methods for locating elements in the parse tree. The find() method returns the first matching Tag object or None if no match exists. The find_all() method returns a ResultSet (a list) of all matching elements.

first_link = soup.find("a")
all_links = soup.find_all("a")

Both methods accept filters by tag name, CSS class (class_ parameter), element ID, attributes, and custom functions. The find_all() method also accepts a limit parameter to cap the number of results.
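The filter types listed above might be combined as in the following sketch. The HTML fragment, the data-role attribute, and the is_relative_link helper are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation markup used only for this example.
html = """
<div id="nav">
  <a class="ext" href="https://example.com">Home</a>
  <a class="int" href="/about" data-role="menu">About</a>
  <a class="int" href="/contact">Contact</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by CSS class; class_ is used because class is a reserved word in Python.
internal_links = soup.find_all("a", class_="int")

# Filter by element ID.
nav = soup.find(id="nav")

# Filter by an arbitrary attribute through the attrs dictionary.
menu_links = soup.find_all("a", attrs={"data-role": "menu"})


def is_relative_link(tag):
    """Custom filter (hypothetical helper): match <a> tags whose href starts with '/'."""
    return tag.name == "a" and tag.get("href", "").startswith("/")


relative_links = soup.find_all(is_relative_link)

# Cap the number of results with limit.
first_only = soup.find_all("a", limit=1)

print(len(internal_links), nav.name, len(relative_links), len(first_only))
```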

BeautifulSoup CSS Selectors: select() and select_one()

BeautifulSoup supports CSS selector queries through the select() and select_one() methods. These methods use the SoupSieve library internally. The select() method returns a list of all matching elements. The select_one() method returns the first match or None.

products = soup.select("div.product-list > article.item")
header = soup.select_one("#main-header")

CSS selectors support class selectors (.classname), ID selectors (#id), attribute selectors (a[href]), descendant selectors (div p), child selectors (ul > li), and pseudo-classes (:first-child, :nth-child()).
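A brief sketch of each selector type follows. The menu markup and variable names are assumptions made for this example, not part of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical menu markup used only for this example.
html = """
<ul id="menu">
  <li class="item">Home</li>
  <li class="item active">Blog</li>
  <li class="item"><a href="/contact">Contact</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

menu = soup.select_one("#menu")                    # ID selector
items = soup.select("li.item")                     # class selector
links = soup.select("a[href]")                     # attribute selector
nested_links = soup.select("ul a")                 # descendant selector
direct_items = soup.select("ul > li")              # child selector
first_item = soup.select_one("li:first-child")     # pseudo-class
second_item = soup.select_one("li:nth-child(2)")   # pseudo-class with an argument
missing = soup.select_one("#does-not-exist")       # None when nothing matches

print(menu.name, len(items), first_item.get_text(), second_item.get_text())
```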

BeautifulSoup HTML Parsers: lxml, html.parser, and html5lib

BeautifulSoup delegates the actual parsing to an external parser. The lxml parser is the fastest option and handles most malformed HTML well. Python's built-in html.parser requires no additional installation but runs slower than lxml, which becomes noticeable on large documents. The html5lib parser parses markup the way a web browser does and handles the most severely broken HTML, but it is the slowest of the three.

soup = BeautifulSoup(html, "lxml")         # Fastest, requires pip install lxml
soup = BeautifulSoup(html, "html.parser")  # Built-in, no install needed
soup = BeautifulSoup(html, "html5lib")     # Most lenient, requires pip install html5lib
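Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when the preferred one is missing. The helper below is one possible sketch; make_soup and its preferred argument are hypothetical names, and it relies on the FeatureNotFound exception that bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each parser in order and return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The requested parser is not installed; try the next one.
            continue
    raise RuntimeError("None of the requested parsers is installed")


soup, parser_used = make_soup("<p>Hi</p>")
print(parser_used)
print(soup.p.get_text())
```

Different parsers can build different trees from the same broken markup, so a fallback like this is best suited to cases where the extraction logic does not depend on how errors are repaired.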

BeautifulSoup Tree Navigation: Parents, Children, and Siblings

BeautifulSoup provides attributes to traverse the parse tree relative to any element. The .parent attribute returns the enclosing tag. The .children iterator yields direct child nodes. The .descendants iterator yields all nested nodes recursively. The .next_sibling and .previous_sibling attributes move between elements at the same tree level.

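element = soup.find("span", class_="price")  # None if the page has no such tag
parent_tag = element.parent                  # the enclosing tag
child_list = list(element.children)          # direct child nodes, including text nodes
next_node = element.next_sibling             # may be a text node rather than a tag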
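The navigation attributes are easier to follow on a concrete tree. The sketch below uses a made-up product card written on one line, so there are no whitespace text nodes between the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical product card; written without whitespace between tags so that
# siblings are tags rather than whitespace-only text nodes.
html = (
    '<div class="card"><h2>Widget</h2>'
    '<span class="price">$9.99</span>'
    '<span class="stock">In stock</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
price = soup.find("span", class_="price")

parent_name = price.parent.name                                # enclosing tag
child_names = [child.name for child in price.parent.children]  # direct children only
descendant_count = len(list(price.parent.descendants))         # tags and text nodes, recursively
previous_name = price.previous_sibling.name                    # sibling before <span class="price">
next_text = price.next_sibling.get_text()                      # sibling after it

# find_next_sibling() skips text nodes and returns the next matching tag.
next_span = price.find_next_sibling("span")

print(parent_name, child_names, descendant_count, previous_name, next_text)
```

On pretty-printed HTML, .next_sibling often returns a whitespace text node, which is why find_next_sibling() and find_previous_sibling() tend to be the safer choice when a tag is expected.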

Common Tasks with BeautifulSoup

How to Extract Text from HTML Elements with BeautifulSoup

BeautifulSoup extracts text from elements using .get_text() or the .text property. The strip=True parameter removes leading and trailing whitespace from each text fragment and skips fragments that contain only whitespace. The separator parameter inserts a string between text fragments from nested elements.

paragraph = soup.find("p")
clean_text = paragraph.get_text(strip=True)
all_text = soup.get_text(separator=" | ")
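The interaction between strip and separator is easiest to see on nested markup. The fragment below is invented for this example; note how strip=True on its own joins adjacent fragments without a space.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with nested tags and stray whitespace.
html = "<div><p>  Hello <b>world</b>  </p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")
first_paragraph = soup.find("p")

# strip=True trims every fragment, so "Hello" and "world" end up adjacent.
stripped = first_paragraph.get_text(strip=True)

# Adding a separator restores the gap between fragments.
spaced = first_paragraph.get_text(separator=" ", strip=True)

# The same options work on the whole document.
joined = soup.get_text(separator=" | ", strip=True)

print(repr(stripped))
print(repr(spaced))
print(repr(joined))
```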

How to Extract Attribute Values from HTML Tags with BeautifulSoup

BeautifulSoup reads tag attributes using dictionary-style access or the .get() method. The .get() method returns None instead of raising a KeyError when the attribute does not exist. The .attrs property returns all attributes as a dictionary.

link = soup.find("a")
url = link.get("href")
all_attrs = link.attrs
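The difference between the two access styles shows up when an attribute is missing. The sketch below uses a made-up link; it also shows that multi-valued attributes such as class come back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical link used only for this example.
soup = BeautifulSoup('<a href="/docs" class="btn primary">Docs</a>', "html.parser")
link = soup.find("a")

href = link["href"]                       # dictionary-style access
title = link.get("title")                 # missing attribute: returns None
fallback = link.get("title", "untitled")  # missing attribute with a default value
classes = link["class"]                   # class is multi-valued, so this is a list

# Dictionary-style access raises KeyError for a missing attribute.
try:
    link["title"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(href, title, fallback, classes, raised_key_error)
```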

How to Parse Multiple Pages with BeautifulSoup and the Requests Library

BeautifulSoup parses one page at a time. Loop through a list of URLs, fetch each page with the Python requests library, and parse the response with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "lxml")
    title = soup.find("title")
    if title:
        print(title.get_text())
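For anything beyond a couple of pages, it can help to separate fetching from parsing so the parsing logic is testable without network access. The following is a sketch under that assumption; extract_title, fetch_with_requests, scrape_titles, and the delay_seconds parameter are hypothetical names, not part of BeautifulSoup or requests.

```python
import time

from bs4 import BeautifulSoup


def extract_title(html):
    """Parse one HTML document and return its <title> text, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    return title.get_text(strip=True) if title else None


def fetch_with_requests(url):
    """Download a page with requests; imported here so parsing works offline."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_titles(urls, fetch=fetch_with_requests, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests, and map URL to title."""
    titles = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause so the target server is not flooded
        titles[url] = extract_title(fetch(url))
    return titles
```

Calling scrape_titles(["https://example.com/page1", "https://example.com/page2"]) would fetch both pages over the network, while passing a stand-in fetch function lets the parsing be exercised against saved HTML.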

Scrapy is a full web scraping framework that provides built-in concurrency, link following, and data pipelines. Use Scrapy instead of BeautifulSoup for large-scale crawling projects.

Selenium and Playwright are browser automation tools that render JavaScript. Combine either tool with BeautifulSoup when the target page loads content dynamically through JavaScript.

The Python requests library handles HTTP communication. BeautifulSoup does not depend on requests, but it needs requests (or a similar HTTP client) to fetch web pages before it can parse them.

lxml is both a standalone XML/HTML processing library and the recommended parser backend for BeautifulSoup. Installing lxml gives BeautifulSoup its fastest parsing speed.