BeautifulSoup (bs4)
- What BeautifulSoup Does and When to Use It
- How to Install BeautifulSoup
- Core Concepts of BeautifulSoup
  - BeautifulSoup Object Types: Tag, NavigableString, and BeautifulSoup
  - BeautifulSoup Search Methods: find() and find_all()
  - BeautifulSoup CSS Selectors: select() and select_one()
  - BeautifulSoup HTML Parsers: lxml, html.parser, and html5lib
  - BeautifulSoup Tree Navigation: Parents, Children, and Siblings
- Common Tasks with BeautifulSoup
  - How to Extract Text from HTML Elements with BeautifulSoup
  - How to Extract Attribute Values from HTML Tags with BeautifulSoup
  - How to Parse Multiple Pages with BeautifulSoup and the Requests Library
- Related Tools and Guides
BeautifulSoup (bs4) is a Python library that parses HTML and XML documents into navigable parse trees for web scraping and data extraction.
What BeautifulSoup Does and When to Use It
BeautifulSoup (bs4) transforms raw HTML and XML markup into a tree of Python objects. The library provides methods to search, navigate, and modify that tree. BeautifulSoup handles malformed markup gracefully, which makes it reliable for scraping real-world web pages that contain broken or non-standard HTML.
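The tolerance for malformed markup described above can be illustrated with a minimal sketch. The HTML fragment is invented for this example, and the built-in html.parser is used so that nothing beyond beautifulsoup4 needs to be installed; other parsers may repair the same markup slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical broken markup: <b> is never closed and the final <a> has no end tag.
broken_html = "<div><p>Price: <b>19.99</p></div><a href='/next'>Next"

soup = BeautifulSoup(broken_html, "html.parser")

# The tree is still searchable despite the missing closing tags.
price = soup.find("b").get_text()
paragraph = soup.find("p").get_text()
next_href = soup.find("a")["href"]

print(price)      # 19.99
print(paragraph)  # Price: 19.99
print(next_href)  # /next
```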
BeautifulSoup does not fetch web pages on its own. It requires a separate HTTP library such as the Python requests library to download page content. BeautifulSoup focuses exclusively on parsing and extracting data from the downloaded HTML or XML. The official documentation lives at crummy.com/software/BeautifulSoup/bs4/doc.
BeautifulSoup is not a full web scraping framework. For large-scale crawling with built-in concurrency, request scheduling, and data pipelines, use Scrapy instead. For pages that require JavaScript rendering, combine BeautifulSoup with Selenium or Playwright. BeautifulSoup works best for small-to-medium scraping tasks, one-off data extraction, and projects where precise HTML parsing control matters.
How to Install BeautifulSoup
Install BeautifulSoup (bs4) with pip. The package name on PyPI is beautifulsoup4, not bs4; bs4 is the name of the module you import.
pip install beautifulsoup4
Install the Python requests library to fetch web pages before parsing them with BeautifulSoup.
pip install requests
Install the lxml parser for faster HTML parsing performance with BeautifulSoup.
pip install lxml
Verify the installation by importing bs4 and printing the version number.
import bs4
print(bs4.__version__)
Core Concepts of BeautifulSoup
BeautifulSoup Object Types: Tag, NavigableString, and BeautifulSoup
BeautifulSoup converts an HTML document into four Python object types. A Tag object represents an HTML or XML element such as <div> or <a>. A NavigableString object holds the text content inside a tag. The BeautifulSoup object represents the entire parsed document and behaves like a top-level Tag. A Comment object is a special NavigableString for HTML comments.
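As a small sketch of how these four types relate, the snippet below checks each one with isinstance(). The one-line HTML sample is made up for illustration, and html.parser is chosen only because it needs no extra installation.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample: one paragraph holding a text node and an HTML comment.
html = "<p>Hello<!-- internal note --></p>"
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.p
text_node, comment_node = paragraph.contents  # the two direct children of <p>

print(isinstance(paragraph, Tag))                   # True
print(isinstance(text_node, NavigableString))       # True
print(isinstance(comment_node, Comment))            # True
print(isinstance(comment_node, NavigableString))    # True: Comment subclasses NavigableString
print(isinstance(soup, Tag))                        # True: the document object behaves like a Tag
```

Because a Comment is also a NavigableString, code that collects text nodes by type may want to test for Comment first if comments should be excluded.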
from bs4 import BeautifulSoup
html = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "lxml")
print(type(soup.title)) # <class 'bs4.element.Tag'>
print(type(soup.title.string)) # <class 'bs4.element.NavigableString'>
print(type(soup)) # <class 'bs4.BeautifulSoup'>
BeautifulSoup Search Methods: find() and find_all()
BeautifulSoup provides find() and find_all() as the primary methods for locating elements in the parse tree. The find() method returns the first matching Tag object or None if no match exists. The find_all() method returns a ResultSet (a list subclass) of all matching elements.
first_link = soup.find("a")
all_links = soup.find_all("a")
Both methods accept filters by tag name, CSS class (class_ parameter), element ID, attributes, and custom functions. The find_all() method also accepts a limit parameter to cap the number of results.
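The filter types listed above can be sketched as follows. The HTML, the class names, the data-kind attribute, and the is_relative_link helper are all invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with a few differently marked-up links.
html = """
<div id="nav"><a class="internal" href="/home">Home</a></div>
<div id="content">
  <a class="external" href="https://example.com">Example</a>
  <a class="external" href="https://example.org" data-kind="ref">Org</a>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("a", class_="external")           # filter by CSS class
by_id = soup.find(id="content")                            # filter by element ID
by_attrs = soup.find_all("a", attrs={"data-kind": "ref"})  # filter by arbitrary attribute


def is_relative_link(tag):
    """Custom filter: match <a> tags whose href starts with a slash."""
    return tag.name == "a" and tag.get("href", "").startswith("/")


by_function = soup.find_all(is_relative_link)
limited = soup.find_all("a", limit=2)  # stop after the first two matches
missing = soup.find("table")           # no <table> in the document

print(len(by_class), len(by_attrs), len(by_function), len(limited))  # 2 1 2 2
print(by_id.name)  # div
print(missing)     # None
```

The attrs dictionary is used for data-kind because a hyphenated attribute name cannot be written as a Python keyword argument.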
BeautifulSoup CSS Selectors: select() and select_one()
BeautifulSoup supports CSS selector queries through the select() and select_one() methods. These methods use the Soup Sieve library internally. The select() method returns a list of all matching elements. The select_one() method returns the first match or None.
products = soup.select("div.product-list > article.item")
header = soup.select_one("#main-header")
CSS selectors support class selectors (.classname), ID selectors (#id), attribute selectors (a[href]), descendant selectors (div p), child selectors (ul > li), and pseudo-classes (:first-child, :nth-child()).
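A short sketch of these selector forms, run against an invented product listing, might look like the following. The markup and class names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical product listing used only to exercise the selectors.
html = """
<div id="main-header"><h1>Shop</h1></div>
<div class="product-list">
  <article class="item"><a href="/p/1">First</a></article>
  <article class="item"><a>Second</a></article>
  <article class="item sold-out"><a href="/p/3">Third</a></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.select("div.product-list > article.item")  # child + class selectors
header = soup.select_one("#main-header")                # ID selector
linked = soup.select("article a[href]")                 # descendant + attribute selectors
first = soup.select_one("article:first-child")          # pseudo-class
second = soup.select_one("article:nth-child(2)")        # pseudo-class with argument
footer = soup.select_one("#footer")                     # no such element

print(len(items))                # 3
print(header.h1.get_text())      # Shop
print(len(linked))               # 2
print(first.get_text(), second.get_text())  # First Second
print(footer)                    # None
```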
BeautifulSoup HTML Parsers: lxml, html.parser, and html5lib
BeautifulSoup delegates the actual parsing to an external parser. The lxml parser is the fastest option and handles most malformed HTML well. Python's built-in html.parser requires no additional installation but runs slower on large documents. The html5lib parser mimics browser-level parsing and handles the most severely broken HTML, but it is the slowest of the three.
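One way to act on this trade-off is to prefer lxml when it is installed and fall back to the built-in parser otherwise. The sketch below does that by catching bs4's FeatureNotFound exception; the make_soup helper and its preference order are assumptions for illustration, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup together with the parser name used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Unclosed paragraph <b>with bold text")
print(parser_used)      # depends on what is installed
print(soup.get_text())
```

Different parsers can build slightly different trees from the same broken markup, so a project that depends on exact tree structure may prefer to pin one parser instead of falling back.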
soup = BeautifulSoup(html, "lxml") # Fastest, requires pip install lxml
soup = BeautifulSoup(html, "html.parser") # Built-in, no install needed
soup = BeautifulSoup(html, "html5lib") # Most lenient, requires pip install html5lib
BeautifulSoup Tree Navigation: Parents, Children, and Siblings
BeautifulSoup provides attributes to traverse the parse tree relative to any element. The .parent attribute returns the enclosing tag. The .children iterator yields direct child nodes. The .descendants iterator yields all nested nodes recursively. The .next_sibling and .previous_sibling attributes move between elements at the same tree level.
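The navigation attributes above can be tried on a small invented fragment. The second half of the sketch shows a common surprise: when the source contains whitespace between tags, .next_sibling can return that whitespace as a text node, and find_next_sibling() is one way to skip to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup with no whitespace between the <span> elements.
html = (
    '<div class="card"><span class="name">Widget</span>'
    '<span class="price">9.50 <em>USD</em></span>'
    '<span class="stock">In stock</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
price = soup.find("span", class_="price")

print(price.parent.name)                   # div
print(len(list(price.children)))           # 2: the text "9.50 " and the <em> tag
print(len(list(price.descendants)))        # 3: "9.50 ", <em>, and the text "USD"
print(price.previous_sibling.get_text())   # Widget
print(price.next_sibling.get_text())       # In stock

# With newlines between tags, the immediate sibling is a whitespace string.
listing = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
first_item = listing.li
print(repr(first_item.next_sibling))                   # '\n'
print(first_item.find_next_sibling("li").get_text())   # b
```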
element = soup.find("span", class_="price")
parent_tag = element.parent
child_list = list(element.children)
next_tag = element.next_sibling # may be a text node (such as whitespace) rather than a tag
Common Tasks with BeautifulSoup
How to Extract Text from HTML Elements with BeautifulSoup
BeautifulSoup extracts text from elements using .get_text() or the .text property. The strip=True parameter removes leading and trailing whitespace from each text fragment and drops fragments that are empty after stripping. The separator parameter inserts a string between text from nested elements.
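The interaction between strip and separator is easiest to see on a concrete fragment. The markup below is invented for this sketch; note how strip=True on its own joins the fragments with nothing in between.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace and a nested <b> element.
html = "<div>  <h2> Title </h2>\n  <p>First <b>bold</b> words.</p>  </div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

stripped = div.get_text(strip=True)
spaced = div.get_text(separator=" ", strip=True)
piped = div.get_text(separator=" | ", strip=True)
fragments = list(div.stripped_strings)

print(stripped)   # TitleFirstboldwords.
print(spaced)     # Title First bold words.
print(piped)      # Title | First | bold | words.
print(fragments)  # ['Title', 'First', 'bold', 'words.']
```

Passing a separator together with strip=True is usually what keeps words from adjacent elements from running together.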
paragraph = soup.find("p")
clean_text = paragraph.get_text(strip=True)
all_text = soup.get_text(separator=" | ")
How to Extract Attribute Values from HTML Tags with BeautifulSoup
BeautifulSoup reads tag attributes using dictionary-style access or the .get() method. The .get() method returns None instead of raising a KeyError when the attribute does not exist. The .attrs property returns all attributes as a dictionary.
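A minimal sketch of the three access styles, using an invented link element, follows. It also shows that a multi-valued attribute such as class comes back as a list when the document is parsed as HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical link with several attributes, including a two-value class.
html = '<a id="docs" class="btn primary" href="/docs">Docs</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]                      # dictionary-style access
title = link.get("title")                # missing attribute: None, no exception
fallback = link.get("title", "untitled") # optional default value
classes = link["class"]                  # multi-valued attribute: list of strings

try:
    link["title"]
    raised = False
except KeyError:
    raised = True                        # dictionary-style access raises on a missing key

print(href)        # /docs
print(title)       # None
print(fallback)    # untitled
print(classes)     # ['btn', 'primary']
print(raised)      # True
print(link.attrs)  # {'id': 'docs', 'class': ['btn', 'primary'], 'href': '/docs'}
```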
link = soup.find("a")
url = link.get("href")
all_attrs = link.attrs
How to Parse Multiple Pages with BeautifulSoup and the Requests Library
BeautifulSoup parses one page at a time. Loop through a list of URLs, fetch each page with the Python requests library, and parse the response with BeautifulSoup.
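The loop described above can also be arranged so that parsing is separated from fetching, which makes it possible to skip failed downloads and to try the parsing logic without network access. This is only a sketch under stated assumptions: extract_title, fetch_with_requests, scrape_titles, and the in-memory fake_pages dictionary are hypothetical names introduced for this example.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the <title> element, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    return title.get_text(strip=True) if title else None


def fetch_with_requests(url):
    """Download one page with requests (imported lazily, only when really fetching)."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return response.text


def scrape_titles(urls, fetch=fetch_with_requests):
    """Map each URL to its page title; pages that fail to download map to None."""
    titles = {}
    for url in urls:
        try:
            titles[url] = extract_title(fetch(url))
        except OSError as exc:  # requests' exceptions derive from OSError
            print(f"skipping {url}: {exc}")
            titles[url] = None
    return titles


# Offline demonstration: a dictionary stands in for the network.
fake_pages = {
    "https://example.com/page1": "<html><head><title> One </title></head></html>",
    "https://example.com/page2": "<html><body>No title here</body></html>",
}


def fake_fetch(url):
    if url not in fake_pages:
        raise OSError("simulated connection failure")
    return fake_pages[url]


result = scrape_titles(
    ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"],
    fetch=fake_fetch,
)
print(result)
```

Calling scrape_titles(urls) without the fetch argument uses requests for real downloads.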
import requests
from bs4 import BeautifulSoup
urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
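    # Fetch the page, then hand the HTML text to BeautifulSoup.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")
    title = soup.find("title")
    # find() returns None when the page has no <title> element.
    if title:
        print(title.get_text())
Related Tools and Guides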
Scrapy is a full web scraping framework that provides built-in concurrency, link following, and data pipelines. Use Scrapy instead of BeautifulSoup for large-scale crawling projects.
Selenium and Playwright are browser automation tools that render JavaScript. Combine either tool with BeautifulSoup when the target page loads content dynamically through JavaScript.
The Python requests library handles HTTP communication. BeautifulSoup does not install or import requests itself, but a scraping script needs requests (or a similar HTTP client) to fetch web pages before BeautifulSoup can parse them.
lxml is both a standalone XML/HTML processing library and the recommended parser backend for BeautifulSoup. Installing lxml gives BeautifulSoup its fastest parsing speed.