Introduction
The web is a vast source of information, but it is not always easy to access and use for natural language applications.
In this blog post, we will show you how to crawl and scrape the target URL, extract and clean the content, and store it in Azure Blob Storage. We will use Python as the programming language, and some popular libraries such as requests, asyncio, BeautifulSoup, and lxml.
By following this blog post, you will learn how to:
- Make asynchronous HTTP requests to get the HTML content of a web page.
- Use different libraries to parse and extract the content from the HTML.
- Compare the advantages and disadvantages of BeautifulSoup and lxml.
- Clean and normalize the extracted content.
- Store the content in Azure Blob Storage using the Azure Storage Python library.
Crawl and Scrape the Target URL
Scraping is a method of extracting information from HTML content, but to do it we must first know the structure of the page we want to extract from. The first step when scraping a web page is to fetch its HTML through an HTTP request so that it can be processed. The most widely used Python library for HTTP requests is requests; note that it is a third-party package, not part of the standard library.
import requests
The main limitation of this library is that it is synchronous and doesn’t support asynchronous requests directly. To work around this, we use asyncio, the standard-library module that provides tasks and async/await.
import asyncio
Now we can use both to make an async request to get the HTML:
async def getHTML(url: str):
    # Run the blocking requests.get call in the default thread pool executor
    loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
    try:
        future = loop.run_in_executor(None, requests.get, url)
        return await future
    # Handle exceptions related to the requests module
    except requests.exceptions.RequestException as e:
        print("Request failed:", e)
    # Handle all other exceptions
    except Exception as e:
        print("An error occurred:", e)
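The same executor pattern extends to several URLs at once. The following is a minimal sketch, not the project's actual code: the helper name fetch_all and its fetch parameter are assumptions introduced here so that any blocking callable (such as requests.get) can be passed in.

```python
import asyncio
from typing import Callable, Iterable, List, Optional


async def fetch_all(
    urls: Iterable[str], fetch: Callable[[str], object]
) -> List[Optional[object]]:
    """Run a blocking fetch callable for every URL concurrently in worker threads."""
    loop = asyncio.get_running_loop()

    async def fetch_one(url: str) -> Optional[object]:
        try:
            # run_in_executor moves the blocking call off the event loop thread
            return await loop.run_in_executor(None, fetch, url)
        except Exception as exc:
            # A failed URL yields None so the other downloads still complete
            print(f"Failed to fetch {url}: {exc}")
            return None

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

With the requests library used in this post, a call could look like asyncio.run(fetch_all(urls, requests.get)), where urls is your own list of pages. Returning None for a failed URL is a design choice of this sketch; you may prefer to collect the exceptions instead.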
Libraries to Extract the Content
Once we have the HTML content, we need to process it with a parser. Several libraries exist for this; the most widely used are BeautifulSoup and lxml. This project uses BeautifulSoup, but there is also a second class built with lxml for experimental purposes.
Extract the Content using BeautifulSoup
First, import the corresponding library:
from bs4 import BeautifulSoup
With the HTML that the request returned, you must build an object that will be used to process the HTML.
soup = BeautifulSoup(response.content, "html.parser", from_encoding="iso-8859-1")
To get the information, the most used functions are:
- find: .find() returns the first element that matches your query criteria, or None if nothing matches.
- find_all: .find_all() returns a list of all matching elements, which you can then process individually.
- select_one: .select_one() returns the first element that matches a CSS selector.
For example, if you want to get the title of the web page, you can use:
title = soup.find("title").text
Or, if you want to get all the links in the web page, you can use:
# href=True skips <a> tags that have no href attribute
links = soup.find_all("a", href=True)
for link in links:
    print(link["href"])
Or, if you want to get the first paragraph with the class intro, you can use:
intro = soup.select_one("p.intro").text
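To see the three functions side by side, here is a small self-contained sketch that parses an inline HTML string instead of a downloaded page. The sample markup and variable names are made up for illustration; the None checks are an addition, since .find() and .select_one() return None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only for this illustration
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Welcome to the sample.</p>
    <p>Second paragraph.</p>
    <a href="/docs">Docs</a>
    <a name="anchor-without-href">No link</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find: first matching element, guarded against a missing <title>
title_tag = soup.find("title")
title = title_tag.get_text(strip=True) if title_tag else ""

# find_all: every <a> that actually carries an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# select_one: first element matching a CSS selector
intro_tag = soup.select_one("p.intro")
intro = intro_tag.get_text(strip=True) if intro_tag else ""

print(title, hrefs, intro)
```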
Extract the Content using lxml
First, import the corresponding library:
from lxml import html
With the HTML that the request returned, you must build an object that will be used to process the HTML.
parsed_content = html.fromstring(response.content)
To get the information, the function to use is .xpath(), which takes an XPath string as its parameter. XPath is a syntax for addressing parts of an XML document; you can use XPath expressions to select nodes or node-sets. For queries like these, .xpath() returns a list of matches, so index the result when you want a single value.
For example, if you want to get the title of the web page, you can use:
title = parsed_content.xpath("//title/text()")[0]
Difference between BeautifulSoup and lxml
BeautifulSoup is recommended for scenarios where flexible searching is necessary, for example matching elements that carry two CSS classes in any order. lxml, by contrast, searches with XPath, which is stricter and less flexible for this kind of query.
However, lxml has some advantages over BeautifulSoup, such as:
- It is faster and more memory efficient.
- It supports XML namespaces and validation.
- It has better support for XPath and XSLT.
Therefore, the choice of the library depends on your needs and preferences. You can try both and see which one works better for you.
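The two-class search mentioned above can be sketched in both libraries. This is an illustration under assumptions: the sample markup, the class names, and the has_class helper are invented here, and the XPath idiom shown is one common way to test for a class token, not the only one.

```python
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical markup: only the first two paragraphs carry both classes
SAMPLE = """
<div>
  <p class="note featured">first</p>
  <p class="featured note">second</p>
  <p class="note">third</p>
  <p class="footnote featured">fourth</p>
</div>
"""

# BeautifulSoup: a CSS selector matches both classes regardless of their order
soup = BeautifulSoup(SAMPLE, "html.parser")
bs_matches = [p.get_text() for p in soup.select("p.note.featured")]


def has_class(name: str) -> str:
    """Build an XPath test for a whole class token (so 'note' does not match 'footnote')."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


# lxml: the same search has to be spelled out explicitly in XPath
tree = html.fromstring(SAMPLE)
query = f"//p[{has_class('note')} and {has_class('featured')}]"
lxml_matches = [p.text_content() for p in tree.xpath(query)]

print(bs_matches, lxml_matches)
```

Both approaches select the same two paragraphs here; the difference is in how much of the matching logic you have to write yourself.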
After extracting the content from the HTML, you may need to clean and normalize it before storing it in Azure Blob Storage.
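This post does not prescribe specific cleaning steps, so the following is only a sketch of one common approach using the standard library. The function name clean_text and the chosen steps (entity decoding, NFKC normalization, control-character removal, whitespace collapsing) are assumptions for illustration; adjust them to your content.

```python
import re
import unicodedata
from html import unescape


def clean_text(raw: str) -> str:
    """Normalize text extracted from HTML into a single clean line."""
    # Decode leftover HTML entities such as &amp; or &nbsp;
    text = unescape(raw)
    # NFKC folds compatibility characters (e.g. non-breaking space) into standard forms
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (category Cc) but keep whitespace for the next step
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse every run of whitespace into one space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("  Caf\u00e9&nbsp;&amp;\n\tBar\x00  "))
```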
Store the Extracted Content in Azure Blob Storage
The final step is to store the extracted content in Azure Blob Storage, which is a cloud service that provides scalable and secure storage for any type of data. Azure Blob Storage allows you to access and manage your data from anywhere, using any platform or device.
To use Azure Blob Storage, you need to have an Azure account and a storage account. You also need to install the Azure Storage SDK for Python, which provides a simple way to interact with Azure Blob Storage using Python.
To install the Azure Storage SDK for Python, you can use the following command:
pip install azure-storage-blob
To use the Azure Storage SDK for Python, you need to import the BlobServiceClient class and create a connection object that represents the storage account. You also need to get the connection string and the container name from the Azure portal. You can store these values in a .env file and load them using the dotenv module (installed with the python-dotenv package).
For example, if you want to create a connection object and a container client, you can use:
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv
import os

# Load the environment variables
load_dotenv()

# Get the connection string and the container name
AZURE_BLOB_CONNECTION_STRING: str = os.getenv("AZURE_BLOB_CONNECTION_STRING")
AZURE_PAGE_CONTAINER = os.getenv("AZURE_PAGE_CONTAINER")

# Create a connection object
blobServiceClient = BlobServiceClient.from_connection_string(AZURE_BLOB_CONNECTION_STRING)

# Create a container client
container_client = blobServiceClient.get_container_client(AZURE_PAGE_CONTAINER)
Then, you can upload the extracted content to Azure Blob Storage as a JSON document using the upload_blob method. You need to create a blob client that represents the blob that you want to upload and provide the data as a JSON string. You also need to generate a unique file name for the blob, which can be based on the current date and time.
For example, if you want to upload the content from the previous steps, you can use:
import json
from datetime import datetime

# Create a document with the extracted content
# (title, summary, and texts are the values extracted in the previous steps)
document = {
    "title": title,
    "summary": summary,
    "texts": texts
}

# Convert the document to a JSON string
json_document = json.dumps(document)

# Create a blob client with a timestamp-based file name
dt = datetime.now()
fileName = dt.strftime("%Y%m%d_%H%M%S%f") + ".json"
blob = blobServiceClient.get_blob_client(container=AZURE_PAGE_CONTAINER, blob=fileName)

# Upload the content
blob.upload_blob(json_document)
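The naming and serialization steps can be isolated into small helpers that are easy to check without touching Azure. This is a sketch under assumptions: the helper names build_blob_name and to_json_bytes are hypothetical, and using UTC plus explicit UTF-8 encoding are choices made here, not something the upload code above requires.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_blob_name(now: Optional[datetime] = None) -> str:
    """Return a timestamp-based file name, e.g. 20240102_030405000678.json."""
    # Assumption: UTC keeps names comparable across machines in different time zones
    moment = now or datetime.now(timezone.utc)
    return moment.strftime("%Y%m%d_%H%M%S%f") + ".json"


def to_json_bytes(document: dict) -> bytes:
    """Serialize a document to UTF-8 encoded JSON."""
    # ensure_ascii=False keeps non-ASCII characters readable in the stored file
    return json.dumps(document, ensure_ascii=False).encode("utf-8")


name = build_blob_name(datetime(2024, 1, 2, 3, 4, 5, 678))
payload = to_json_bytes({"title": "Café", "texts": ["one", "two"]})
print(name, payload)
```

The resulting name and bytes can then be passed to get_blob_client and upload_blob in the same way as fileName and json_document above. Timestamp-based names are unique only down to the microsecond, so two uploads from parallel workers could in principle collide.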
You can also download the content from Azure Blob Storage as a JSON document using the download_blob method. As with uploading, you need to create a blob client that represents the blob you want to download, passing its file name as a parameter. You can then read the data as a JSON string and parse it into a Python object.
For example, if you want to download the content with a given file name, you can use:
# Create a blob client
blob = blobServiceClient.get_blob_client(container=AZURE_PAGE_CONTAINER, blob=fileName)

# Download the content
data = blob.download_blob().readall()
document = json.loads(data)
print(document)
What You Can Achieve
By following this blog post, you will gain the skills to crawl, scrape, and extract content from websites efficiently and store web content securely in Azure Blob Storage. The code provided uses both BeautifulSoup and lxml, giving you a working understanding of the two widely used libraries. The asynchronous approach keeps the program from blocking while pages download, which helps with large-scale web scraping tasks.
Conclusion
Web scraping is not only about data extraction but also about making that data usable. In this blog post, we’ve explored the intricacies of crawling, scraping, and storing web content. Stay tuned for the next part, where we step into utilizing Azure Blob Data and storing it in ACS along with vectors.