Defeating Common Bot-Detection Schemes

27.07.2017

Episode #9 of the course Build your own web scraping tool by Hartley Brody

 

If you’re simply scraping a small-time website that’s fairly unsophisticated, then you’ll likely never run into any major “anti-scraping” hurdles. However, if you’re trying to pull data from a major website with a large, dedicated tech team (think Amazon, Google, Yelp, Facebook, etc.), then you’ll likely start running into situations where your bot gets blocked and, potentially, your computer’s IP address gets blacklisted—you will no longer be able to access the site in your own browser!

 

Being a Good Web Scraping Citizen

It’s always important to be polite to the sites you’re scraping. Your scraper is likely accessing their site in ways they never intended, and which they might even explicitly forbid in their terms of service. If you just hammer a bunch of requests at it, you might overwhelm the site and knock its servers offline.

A general rule of thumb is that you shouldn’t be making more than one request per second to the target site you’re scraping. Some people even say that you should time how long your requests take to come back, and then wait 10 times longer before making the next one. So, if the site takes three seconds to respond to your requests, you should wait 30 seconds before making the next one. This allows the web server to be free to process the requests coming in from other users and ensures you’re not monopolizing the server’s limited (and expensive!) resources.
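Here’s a rough sketch of what that “wait 10 times longer” rule looks like in Python, using the requests library. The URLs are just placeholders for whatever pages you’re actually scraping:

```python
import time

import requests

# Placeholder URLs; swap in the pages you actually want to scrape.
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

for url in urls:
    response = requests.get(url)

    # ... parse response.text here ...

    # requests records how long the server took to answer. Sleep ten
    # times that long, and at least one second, before the next request.
    delay = response.elapsed.total_seconds() * 10
    time.sleep(max(delay, 1.0))
```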

Another common tactic people mention is to try to time your scraper so it visits the target website during “off hours,” that is, during the middle of the night, when there’s unlikely to be much human traffic coming in from your area of the world. If the site is large and geographically distributed, though, this might prove tricky, since it’s always peak hours somewhere in the world.

Even if you can dial back the requests so as not to overwhelm their resources, the sites may forbid you from scraping content at certain URLs. Always be sure to check the site’s robots.txt file to see if there are any disallow rules that will impact the pages you were hoping to scrape.
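Python’s standard library ships a robots.txt parser, so this check only takes a few lines. Here’s a quick sketch; the site and user-agent string below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical target site; point this at whatever site you're scraping.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# can_fetch() applies the site's disallow rules to your user agent.
if robots.can_fetch("my-scraper", "https://example.com/some/page"):
    print("OK to scrape this URL")
else:
    print("robots.txt disallows this URL, so skip it")
```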

 

Turning Up the Heat

All that being said, there are a few cases where, for business or legal reasons, you might not be worried about turning up the heat on the target site you’re hoping to scrape. In that case, you’ll want to check out a proxy IP service; I use one for my own scraping projects.

The basic idea with a proxy service is that you route the HTTP requests from your web scraping program through many other servers before they reach the target website. That way, the target website doesn’t see a bunch of requests all coming from one computer on the network. Instead, your requests appear to be distributed across many different IP addresses, each of which only makes a small number of requests.

This allows you to stay “below the radar” and throw a bunch of requests at the target website without making it obvious that they’re all coming from one web scraping program.
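As a rough illustration, here’s one way you might rotate requests through a pool of proxies with the requests library. The proxy URLs below are made up; a real proxy service will give you its own list of endpoints, often with credentials baked into each URL:

```python
import itertools

import requests

# Hypothetical proxy endpoints from your proxy service.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)


def fetch(url):
    # Each request goes out through the next proxy in the pool, so the
    # target site sees the traffic spread across many IP addresses.
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy})


response = fetch("https://example.com/some/page")
print(response.status_code)
```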

 

Legal Implications of Web Scraping

A common question I get from people is whether web scraping is illegal. While I’m certainly not a lawyer and can’t provide you with an authoritative legal answer, I tend to think of using web scraping like using a hammer.

There are many ways you can use a hammer that are perfectly legitimate: building a house or fixing your garage, for example. But if you use a hammer to break into someone’s car, that’s obviously not going to be the most legal use of that tool.

Web scraping is the same way. There are plenty of uses of web scraping that are perfectly legitimate: Google visits and “scrapes” every page on the internet in order to add them to its search index, and no one really seems to mind that. The travel search portal you use to find cheap flights goes and scrapes a bunch of hotel and airline websites to help find you the best deals, which benefits businesses and consumers.

Whether it’s legal depends entirely on your use case and how you’re planning to use the scraped data. Again, consult a lawyer in your area if you’re concerned that your project might encounter some legal snags.

 

Recommended book

Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL by Michael Schrenk

 
