Scraping Content inside an iFrame

27.07.2017

Episode #7 of the course Build your own web scraping tool by Hartley Brody

 

At this point in the course, you’ve learned the two fundamentals of web scraping, built a basic web scraper yourself, and started learning how to scrape data from sites that use forms, pagination, and JavaScript to load their data.

In this lesson, I want to cover another common roadblock you might run into: sites that load content inside iFrames.

As with JavaScript-heavy sites, many people get confused and throw up their hands: “Well, the data I want to scrape isn’t in the page’s response HTML, so how can I possibly scrape it?”

The answer, just like in the last lesson, is to remember the fundamentals. They’re always your guiding light when you get stuck. Web scraping is about making the right HTTP requests in order to get the web server to return the data you’re hoping to extract.

In the case of iFrames, the parent page is actually embedding another page inside itself. If the data you want is inside the iFrame, all you have to do is find the URL of the page that’s loaded there.

Here’s a quick example.

Let’s say I’ve made a request to the parent page (i.e. the URL at the top of my browser window), and I see a bit of HTML in the response that looks like:

<div class="frame-wrapper">
  <iframe src="http://example.com/foo/"></iframe>
</div>

Rather than wondering how I get at the data that’s inside the iFrame from the parent page, I go back to my fundamentals. What’s the HTTP request I need to make that will actually return the data I’m looking for?

The answer is a simple request to http://example.com/foo/. If you were to open your developer tools and go to the network tab when you load the parent page, you should see an HTTP request made to this embedded sub page and find that the data you’re looking for is contained in the response to a request for that URL.
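To make that concrete, here is a minimal sketch in Python, using only the standard library. It pulls the src of every iframe out of the parent page's HTML, resolves it against the parent URL (in case the src is relative), and leaves you with URLs you can request directly. The names IframeSrcParser, iframe_urls, fetch, and PARENT_HTML are my own for illustration, and the parent URL is assumed to be http://example.com/.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(parent_html, parent_url):
    """Return an absolute URL for each iframe embedded in parent_html."""
    parser = IframeSrcParser()
    parser.feed(parent_html)
    # The src may be relative, so resolve it against the parent page's URL.
    return [urljoin(parent_url, src) for src in parser.sources]


def fetch(url):
    """Make a plain GET request and return the decoded response body."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# The snippet from the parent page's response shown above.
PARENT_HTML = """
<div class="frame-wrapper">
  <iframe src="http://example.com/foo/"></iframe>
</div>
"""

# Assumed parent URL, used only to resolve relative src values.
urls = iframe_urls(PARENT_HTML, "http://example.com/")
```

From there, calling fetch(urls[0]) makes the same request the browser makes for the embedded page, and you can parse that response just as you would any other page.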

Hopefully by now, you’ve really gotten a sense of the fundamentals. Any time I get emails from people along the lines of, “I see the data on the page in my browser, but I can’t seem to find it in the HTML response in my web scraping program,” the answer is almost always something to the effect of, “You’re not making the same HTTP request in your web scraping program that your browser is making when it loads the information on the page.”
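One way the two requests can differ is in their headers. As a rough sketch, and assuming the embedded page cares about where the request came from (many don't), you could copy a couple of headers from the request you see in your browser's network tab. The function name build_iframe_request and the User-Agent string below are placeholders of my own, not values any particular site requires.

```python
from urllib.request import Request, urlopen


def build_iframe_request(iframe_url, parent_url, user_agent):
    """Build a GET request for the iframe page with browser-like headers."""
    headers = {
        # Copy the real value from the network tab; this is only a placeholder.
        "User-Agent": user_agent,
        # A browser loading an iframe normally reports the parent page here.
        "Referer": parent_url,
    }
    return Request(iframe_url, headers=headers)


request = build_iframe_request(
    "http://example.com/foo/",
    "http://example.com/",
    "Mozilla/5.0 (placeholder user agent)",
)

# To actually send it (this makes a network call):
# with urlopen(request, timeout=10) as response:
#     html = response.read().decode("utf-8", errors="replace")
```

If the response still doesn't match what your browser shows, compare the remaining headers and cookies in the network tab against what your program sends.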

Any content that can be viewed on a webpage can be scraped. Period.

Some sites make it a bit trickier than others, but as long as you’re making the right HTTP request and parsing the response, you should be able to see the exact same information in your web scraping program that you see on the screen in your browser.

Coming up next, we’ll talk about a few more common “gotchas” you might run into, as well as a few techniques for getting around common bot-detection schemes that bigger websites often put in place.

 

Recommended book

Web Scraping with Python by Richard Lawson

 
