Building Our Very First Web Scraper in 15 Minutes

27.07.2017

Episode #4 of the course Build your own web scraping tool by Hartley Brody

 

Now that we have the concepts in hand and we’ve seen how to use our browser’s developer tools to inspect the HTTP requests and HTML responses, we’re ready to get started with building our first web scraper.

I usually use Python, since it’s very simple to write and has some great libraries for web scraping. If you’re already comfortable with another language, feel free to use something like PHP, Ruby, Java, or any other language. Remember, all we need is the ability to make HTTP requests and parse HTML responses, so it’s hard to go wrong with your technology choice.

To build our first web scraper, we’ll need to start with a simple program that makes an HTTP request to the page we’re scraping. Here’s some sample Python code that accomplishes that.

import requests

r = requests.get("https://scrapethissite.com/pages/")
print(r.status_code == requests.codes.ok)

You’ll see that—to make the HTTP request—we’re not only using the URL of the page, we’re also telling the Python requests library to make a GET request to this page, since that’s the type of request we saw when we were inspecting it earlier in our browser’s developer tools.
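In a real scraper, it’s worth handling the case where the request fails rather than just printing the comparison. A minimal sketch of that, using the requests library’s built-in status-code helpers:

```python
import requests

r = requests.get("https://scrapethissite.com/pages/")

# requests.codes.ok is just the integer 200, so this comparison
# tells us whether the server returned a successful response.
if r.status_code == requests.codes.ok:
    print("Request succeeded")
else:
    # raise_for_status() raises an HTTPError for 4xx/5xx responses,
    # which stops the scraper instead of parsing an error page.
    r.raise_for_status()
```

This way the program fails loudly if the page is missing or the server rejects us, instead of silently trying to parse an error page.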

Now that we’ve built a simple program that makes a request, let’s take a look at how to handle the response.

Parsing and structuring HTML responses is a surprisingly difficult task, even for simple websites. Instead of doing it all ourselves, it’s much better to use a free, pre-written library to do the heavy lifting for us.

In Python, the most popular library to use for this task is Beautiful Soup. Once we give it the HTML of the page we got back from the server, it will make it easy for us to find the HTML patterns we saw when we inspected the page earlier.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://scrapethissite.com/pages/")
print(r.status_code == requests.codes.ok)

soup = BeautifulSoup(r.text, "html.parser")
print(soup.find("h3", "page-title").text)

There we go! Once we run that program, we’ll see that it makes the HTTP request, parses the HTML response, and then finds the exact piece of data we were looking for on the page.
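Note that find() only returns the first matching element. If the pattern we spotted in the developer tools matches several elements, Beautiful Soup’s find_all() returns all of them. Here’s a small sketch using a hand-written HTML snippet standing in for a server response (the markup below is hypothetical, not the real page):

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet standing in for a server response.
html = """
<div>
  <h3 class="page-title">Countries of the World</h3>
  <h3 class="page-title">Hockey Teams</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match,
# so we can loop over all elements with the "page-title" class.
for h3 in soup.find_all("h3", "page-title"):
    print(h3.text)
```

The same pattern works on the live response — just swap the hard-coded snippet for r.text.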

Not too complex, was it?

This was just a simple example, where all the data we were extracting was on one page. In a more realistic scenario, your web scraper will need to visit many different pages to find all the data you want to extract.
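As a preview, the basic multi-page pattern is usually: fetch an index page, collect the links you care about, then visit each one. A rough sketch (the link filter and base URL here are assumptions for illustration, not the site’s actual structure):

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://scrapethissite.com"  # assumed base for relative links

# Step 1: fetch the index page and collect the links we want to follow.
r = requests.get(BASE_URL + "/pages/")
soup = BeautifulSoup(r.text, "html.parser")
paths = [
    a["href"]
    for a in soup.find_all("a")
    if a.get("href", "").startswith("/pages/")  # assumed link pattern
]

# Step 2: visit each linked page and pull out a piece of data from it.
for path in paths:
    page = requests.get(BASE_URL + path)
    page_soup = BeautifulSoup(page.text, "html.parser")
    title = page_soup.find("h3")
    if title is not None:
        print(title.text.strip())
```

Real scrapers also tend to add delays between requests and de-duplicate links, but the fetch-collect-visit loop above is the core of it.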

We’ll take a look at a few patterns for that in the next lesson.

 

Recommended book

Python Web Scraping by Katharine Jarmul, Richard Lawson

 
