Getting Started with Web Scraping

27.07.2017

Episode #1 of the course Build your own web scraping tool by Hartley Brody

Pulling data from websites can be a difficult, frustrating experience. The information you want is staring you in the face, but without the proper skills, it can take hours for you to manually click through a site and copy/paste the data into a spreadsheet.

Fortunately, there’s a better way.

Web scraping is the art of pulling structured data from websites. At its core, becoming a proficient web scraper is not about having experience with a particular tool or programming language. It's about finding patterns on the target site you want to scrape.

That means you don’t need to have a strong technical background in order to start building web scrapers. If you can click through a website and find a consistent pattern in how it returns the data you want, you’re well on your way to building your very own web scraper.

In this email series, we’ll go over the fundamental concepts and talk a bit about what happens behind the scenes as you’re clicking around a website. We’ll look at how you can see these concepts in action with free tools provided by your browser. Then we’ll look at some common elements you’ll find on most websites and the techniques for pulling data from them.

By the end of the course, you’ll have a firm grasp of the concepts—as well as a bunch of specific techniques—and be able to start building your own web scraping programs.

It’s a good idea to have a target website in mind as you’re learning web scraping. It will allow you to see the concepts “in the wild” and get some hands-on experience as you’re going through the lessons.

There are many ways you can put web scraping to work to help your business or project. A few examples of sites that might be valuable to scrape include:

• Online marketplaces

• Business directory websites

• Competitors’ websites

• Job boards

• Online auction sites

• Stock ticker information

• Real estate websites and rental sites

• Booking and travel websites

• Sport scores and fantasy sports websites

If you don’t have a particular site you’d like to scrape yet, you might want to check out “Scrape This Site.” There’s an entire sandbox section of real web pages that are specifically designed to be simple and easy for anyone to scrape.

Your homework for this lesson is to simply start browsing your target website, looking for patterns on the site. Watch the page URLs and how they change as you click around, and start looking for patterns in where the data is shown on the page.
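
If you would like to try the URL part of this homework in code, here is a minimal sketch in Python. It assumes you have copied two URLs from your browser's address bar while clicking through a site. The `example.com` addresses and the helper names `describe_url` and `changed_query_params` are made up for illustration and are not part of the course. It simply splits each URL into its parts and reports which query parameters differ between the two pages.

```python
from urllib.parse import urlsplit, parse_qs


def describe_url(url):
    """Split a URL into the parts that tend to change from page to page."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        # Drop empty strings produced by leading/trailing slashes.
        "path_segments": [s for s in parts.path.split("/") if s],
        # parse_qs returns a list per key; keep the first value for readability.
        "query": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }


def changed_query_params(url_a, url_b):
    """Return the names of query parameters whose values differ."""
    a = describe_url(url_a)["query"]
    b = describe_url(url_b)["query"]
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Hypothetical URLs, standing in for two pages of your own target site.
page_1 = "https://example.com/listings/?category=books&page=1"
page_2 = "https://example.com/listings/?category=books&page=2"

print(describe_url(page_1))
print(changed_query_params(page_1, page_2))
```

If only one parameter changes as you move from page to page (here, `page`), that is a strong hint about how the site organizes its results. Many real sites put the changing value in the path instead of the query string, so it is worth comparing `path_segments` as well.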

In our next lesson, we’ll discuss the two most important fundamentals of web scraping that apply to every single website on the internet.

Recommended book

Web Scraping with Python: Collecting Data from the Modern Web by Ryan Mitchell

 
