Tips for Building Your Very Own Web Scrapers

27.07.2017

Episode #10 of the course Build your own web scraping tool by Hartley Brody

 

For many people, building their own custom web scraper might be the very first time they’ve ever written code. While I tried to give you the high-level techniques to scrape any website, there are all sorts of common coding problems you might run into when you first get started.

That’s why I want to leave you with a quick list of suggestions for when you get stuck writing code. Here are my top tips for handling (and avoiding!) errors in your programs:

Print things out frequently. On every single line of code, you should have a sense of what all of the variables’ values are. If you’re not sure, print them out.

Start with code that already works. When in doubt, begin with someone else’s code that you have seen run successfully. If you’re a beginner, you’re still more of a Hacker than an Engineer, so it’s better to take an existing structure and tweak it to meet your needs.

Run your code every time you make a small change. Every time you run your code, you’re getting feedback on your work. Is it getting closer to what you want, or is the change you just made causing it to fail?

Read the error message. It’s really easy to throw your hands up, say “My code has an error,” and feel lost when you see a stack trace. But in my experience, about two-thirds of the error messages you see are fairly accurate and descriptive, especially with Python.

Google the error message. If you can’t figure out what your error message is trying to tell you, your best bet is to copy and paste the last line of the stack trace into Google. Chances are, you’ll get a few stackoverflow.com results where people have asked similar questions and gotten explanations and answers.

Guess and check. If you’re not 100% sure how to fix something, be open to trying two or three things to see what happens. You should be running your code often (see earlier tip), so you’ll get feedback quickly. Does this fix my error? No? Okay, let’s go back and try something else.
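To make a couple of these tips concrete, here is a minimal Python sketch of printing values as you go and pulling out the last line of the stack trace so you can read it or search for it. The helper names (parse_price, last_traceback_line) and the sample strings are made up for illustration; they are not from the course code, and your own scraper will look different.

```python
import traceback


def parse_price(raw_text):
    """Hypothetical helper: turn scraped text like ' $1,299.00 ' into a float."""
    # Print things out frequently: repr() shows hidden whitespace too.
    print("raw_text:", repr(raw_text))
    cleaned = raw_text.strip().replace("$", "").replace(",", "")
    print("cleaned:", repr(cleaned))
    return float(cleaned)


def last_traceback_line(exc):
    """Return the final line of the error report, the part worth googling."""
    lines = traceback.format_exception_only(type(exc), exc)
    return lines[-1].strip()


# Made-up sample values standing in for text pulled from a page.
samples = [" $1,299.00 ", "Sold out"]
results = []
for sample in samples:
    try:
        results.append(parse_price(sample))
    except ValueError as exc:
        # Read the error message before changing anything.
        print("error:", last_traceback_line(exc))
        results.append(None)

print("results:", results)
```

Running this after every small change shows you exactly which value broke the conversion, and the printed error line is the text you would paste into a search engine.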

The best way to learn a new tech skill is to apply it to a problem you’re actively looking to solve. That’s why I always recommend that people start learning web scraping using a target site that they’re actually hoping to collect data from.

At this point, you should have the tools you need to start designing and building your own web scrapers. You should know how to look for patterns in the HTTP requests and HTML responses to find the data you’re looking for, and you should have a bit of practice seeing those concepts in action.

By now, you should be seeing the web through the eyes of a web scraper—there’s data everywhere! Now you know the secrets to unlocking it and collecting it for your own purposes.

 


If you don’t have a specific site in mind but still want to get started, check out Scrape This Site. It’s designed to be a beginner-friendly sandbox for practicing building web scrapers.



 

Recommended reading

10 Debugging Tips for Beginners: How to Troubleshoot and Fix Your Code without Pulling Your Hair Out

Web Scraping Reference: A Simple Cheat Sheet for Web Scraping with Python

How to Scrape Amazon.com: 19 Lessons I Learned While Crawling 1MM+ Product Listings

 

Recommended book

The Ultimate Guide to Web Scraping by Hartley Brody

 
