Scraping Common Interface Elements Like Forms and Pagination
In our previous lessons, we built a very simple web scraper in just a few lines of Python code. It made an HTTP request to a single URL, then searched the HTML response for the pattern we needed in order to print the page’s title.
In the real world, web scraping programs are usually a bit more complex than that. Your program will often need to make many requests to load all the pages you need to view all the data you’re hoping to scrape.
Spend some time clicking through several of the pages you’re hoping to extract data from, and make note of how the URL at the top of your browser changes on each page. In order to build a scraper to traverse all these pages, you’ll need to find the patterns in the HTTP requests that will load the data you’re looking for.
While every site has different patterns, there are some common user interface elements you’ll see.
Scraping Content behind Forms
Maybe the site you’re scraping exposes its data via search. Try typing different search queries, and pay attention to where your search terms end up in the URL. Usually you’ll see that there’s a query parameter that gets added to the URL like q=your+search+terms.
Even if the form you’re using to browse your target data isn’t a “search form,” maybe there’s a drop-down menu or some checkboxes to select. You should still try entering different values into the form and noting what URL you’re taken to when you submit it.
Eventually, you should discover a pattern in the query parameters and the values they need to take. At this point, you’ll have an idea in your head about all the different HTTP requests you’d need to make to the site in order to view all the data behind this form.
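Once you’ve found that pattern, you can reproduce those form submissions in code by building the URLs yourself. Here’s a minimal sketch using Python’s standard library; the q and page parameter names (and the example.com URL) are hypothetical stand-ins for whatever you actually observed in your browser’s address bar:

```python
from urllib.parse import urlencode

def build_search_url(base_url, terms, page=1):
    # "q" and "page" are hypothetical parameter names; substitute the
    # ones you saw in your target site's URLs after submitting the form.
    query = urlencode({"q": terms, "page": page})
    return f"{base_url}?{query}"

print(build_search_url("https://example.com/search", "vintage guitars"))
# https://example.com/search?q=vintage+guitars&page=1
```

Looping this function over all the search terms or form values you care about gives you the full list of URLs your scraper needs to request.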
If you find that submitting the form doesn’t take you to a page with some query parameters in the URL, it’s possible that the form is making a POST request instead of a GET request. Use the developer tools we discussed in an earlier lesson to examine the HTTP request that the form is making when you submit it.
You might find that the form data is being sent to the website “behind the scenes” as part of the POST body of your HTTP request. This is common for logins, checkouts, or other forms where sensitive data is being entered that shouldn’t end up in the URL.
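Replicating a POST form in code just means sending those same fields in the request body instead of the URL. A sketch with the standard library, where the endpoint and field names are hypothetical and should be copied from what the Network tab of your developer tools shows when you submit the real form:

```python
from urllib.parse import urlencode
import urllib.request

# Hypothetical field names -- copy the real ones from the Network tab
# of your browser's developer tools when you submit the form.
form_fields = {"category": "books", "in_stock": "on"}
body = urlencode(form_fields).encode()  # b"category=books&in_stock=on"

request = urllib.request.Request(
    "https://example.com/search",  # hypothetical endpoint
    data=body,
    method="POST",
)
# urllib.request.urlopen(request) would actually send it and return
# the response for you to parse, just like a GET request's response.
print(request.get_method())
```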
Paging through Results
At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination interface that keeps you from seeing all the results in one response.
Try clicking to “page 2” of the results, again watching to see how the page’s URL changes. Usually, you’ll see some sort of offset= query parameter added to the URL, holding either the page number or the number of items to skip past. Try changing this to a really high number, and see what response you get when you “fall off the end” of the data.
With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.
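That loop can be sketched like this. The offset= and limit= parameter names (and the URL) are hypothetical, and scrape_items stands in for whatever parsing code you’ve written; swap in what your target site actually uses:

```python
from urllib.parse import urlencode

def page_urls(base_url, page_size=20, max_pages=100):
    """Yield paginated URLs with an increasing offset.

    "offset" and "limit" are hypothetical parameter names; match them
    to whatever your target site actually uses. max_pages is a safety
    cap so a bug can't loop forever.
    """
    for page in range(max_pages):
        params = urlencode({"offset": page * page_size, "limit": page_size})
        yield f"{base_url}?{params}"

# In a real scraper you'd fetch each URL and stop when a page comes
# back empty -- the "fell off the end" condition:
#
#   for url in page_urls("https://example.com/items"):
#       items = scrape_items(url)   # hypothetical parsing helper
#       if not items:
#           break
```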
The other thing worth trying is changing the “Display X Per Page” setting, which most pagination interfaces now offer. Again, look for a new query parameter appended to the URL indicating how many items appear on each page.
Try setting this to an arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot, since it can cut down on the number of HTTP requests you’ll need to make in order to get all the data you need.
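A quick way to detect that server-side cap: ask for far more items than you expect and compare what you requested against what actually came back. The per_page parameter name and scrape_items helper here are hypothetical:

```python
from urllib.parse import urlencode

# "per_page" is a hypothetical parameter name; use the one your
# target site appends to the URL when you change the page size.
requested = 10000
url = "https://example.com/items?" + urlencode({"per_page": requested})

# items = scrape_items(url)          # hypothetical parsing helper
# if len(items) < requested:
#     ...the server enforced a lower cap; fall back to paging
print(url)
```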
At this point, you should have a sense of how to find the patterns of HTTP requests you need to make in order to scrape websites that use common form and pagination elements. In our next email, we’ll talk about a term that strikes fear into the hearts of many professional web scrapers, and see how it actually makes things easier if we remember our two web scraping fundamentals.