Solutions to Other Web Scraping “Gotchas” You May Encounter
Episode #8 of the course Build your own web scraping tool by Hartley Brody
By now, you should have most of the skills you need to tackle nine out of ten web scraping challenges. In this lesson, I want to run through two less common issues that are still frequent enough to deserve a mention: logins and hidden form values.
Scraping Sites That Require Logins
HTTP is a stateless protocol, meaning there’s no built-in way to tie two separate HTTP requests together as coming from the same user. This can be good for you as a web scraper, since nothing in the protocol itself links your requests to one another.
But for many reasons, sometimes a site will require you to send along a bit of information about who you are in order to get access to the proper response HTML. The most common example of this is needing to log in to a site in order to access protected pages.
Most sites identify you across multiple requests by setting cookies. Basically, once you successfully fill out a login form, the site sets a cookie in your browser that identifies who you are.
When you make subsequent requests to that same website, your browser automatically appends the cookie value to those requests, so the website knows you’re the same person who just logged in earlier.
When you’re building your web scraper for a site that requires logins, you’ll have to make sure you keep track of the cookie values set on initial login requests, and then send those same cookies along to subsequent requests.
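To make that bookkeeping concrete, here’s a minimal sketch of doing it by hand with Python’s standard library. The function name and the cookie values are made up for illustration; the idea is simply to read the Set-Cookie values from the login response and turn them into the Cookie header you send on later requests.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build a Cookie request header from Set-Cookie response header values."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        # Parses strings like "name=value; Path=/; HttpOnly"
        jar.load(value)
    # Only the name=value pairs go back to the server;
    # attributes such as Path or HttpOnly are dropped.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical Set-Cookie value, as a site might send after a successful login
header = cookie_header_from_set_cookie(["sessionid=abc123; Path=/; HttpOnly"])
print(header)
```

You would then send that string as the Cookie header on every subsequent request to the same site.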
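Most HTTP libraries have a “Session” object that handles this for you automatically. Here’s how it looks with the Python requests library:
import requests
# create a session
session = requests.Session()
# make a login POST request, using the session
# (placeholder URL and field names; use the ones from the site's real login form)
session.post("http://example.com/login", data={"username": "me", "password": "secret"})
# subsequent session requests will automatically handle cookies
r = session.get("http://example.com/protected_page")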
Without the correct cookies, a request to the protected URL will likely be redirected to a login form or answered with an error response. Keep in mind that logging in with your web scraper will obviously identify you to the site you’re scraping. This may or may not be a big deal, depending on your use case.
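If you want your scraper to notice when its login has failed or expired, a rough check like the one below can help. It’s only a sketch: the "/login" path and the choice of status codes are assumptions, and every site behaves a little differently.

```python
from urllib.parse import urlparse

# Assumption: the site's login form lives under this path; adjust per site.
LOGIN_PATH = "/login"


def looks_logged_out(status_code, final_url, login_path=LOGIN_PATH):
    """Guess whether a response means "you are not logged in"."""
    # An explicit "unauthorized" or "forbidden" error response
    if status_code in (401, 403):
        return True
    # Otherwise, check whether we were bounced to the login form
    return urlparse(final_url).path.startswith(login_path)


print(looks_logged_out(200, "http://example.com/login?next=/protected_page"))
print(looks_logged_out(200, "http://example.com/protected_page"))
```

With the requests library, you could pass in r.status_code and r.url, since r.url holds the final URL after any redirects were followed.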
Hidden Form Values
In our earlier lesson on forms, you may have found that there were seemingly “random” values sent to the server when you submitted the form. Maybe there were values that changed each time the page loaded, or didn’t seem to have a corresponding form input field.
Many sites use hidden form fields to discourage automated access to their website or to cut down on security threats like CSRF attacks. The basic idea is this: when the page that the form appears on is loaded, the server generates a random value and “hides” it in the form.
Then, when the user fills out the rest of the fields and clicks “submit,” that hidden form value is sent along with the request and the server can compare it to the earlier random value it generated to make sure the form submission actually came from someone who loaded the correct form page.
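To see why this works, it can help to picture the server’s side of the exchange. The sketch below is not how any particular site implements it; the in-memory store, the function names, and the "csrf_token" field name are all hypothetical.

```python
import hmac
import secrets

# Hypothetical in-memory store: session id -> token issued for that session
issued_tokens = {}


def render_form(session_id):
    """Generate a random token, remember it, and hide it in the form."""
    token = secrets.token_hex(16)
    issued_tokens[session_id] = token
    return (
        '<form method="post">'
        f'<input type="hidden" name="csrf_token" value="{token}">'
        "</form>"
    )


def is_valid_submission(session_id, submitted_token):
    """Accept the form only if it carries the token issued earlier."""
    expected = issued_tokens.get(session_id)
    if expected is None:
        return False
    # Constant-time comparison of the two values
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```

A submission that never loaded the form page has no way of knowing the token, so it fails the check.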
This may seem like something that “defeats” your web scraping program, but again, there’s always a simple, clever solution. You simply need to load the page that hosts the form, then parse that page, looking for the hidden form value set by the server. Then grab that value and be sure to send it along when you’re making the HTTP request to actually “submit” the form.
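Here’s a minimal sketch of the parsing step using only Python’s standard library. The sample HTML, the "csrf_token" field name, and the login values are invented for illustration; on a real site you would feed in the HTML of the actual form page.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden_fields


# Hypothetical form markup standing in for a page you fetched
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="d41d8cd98f">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

# Start from the hidden values, then add the fields you fill in yourself
payload = extract_hidden_fields(sample_html)
payload.update({"username": "me", "password": "secret"})
print(payload)
```

If you fetch the form page and submit the payload through the same session object, any cookies the server set alongside the hidden value are sent back with it.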
In our next lesson, we’ll look at a few common bot detection schemes that major websites use and how you can get around them.