Scraping Javascript-Heavy Websites
Episode #6 of the course Build your own web scraping tool by Hartley Brody
The next topic I wanted to introduce in the email series is something that has sent even the most seasoned web scrapers running for the hills: Javascript. I’ve talked to dozens of web developers who insist that scraping a Javascript-heavy website requires all sorts of complex tools and libraries . . .
. . . that is, until I show them the one simple trick for getting the data they’re looking for right away.
Adding Content to the Page That Isn’t in the Initial HTML Response
Many modern sites use Javascript to load content asynchronously (i.e. after the initial page load). You may have heard the term “AJAX” used to describe this technique of fetching more content from the server after the initial request.
You’ll need to think about Javascript any time you see the page change without doing a full page reload. Think of pages like infinite scrolling websites that add more content to the page as you move down or websites that show “waiting” spinners and then stick more content on the page without taking you to a new page. These sites are likely using Javascript to make AJAX requests behind the scenes to load more content from the server and are then adding it to the page after it has initially loaded.
For sites that use Javascript to load their content with AJAX, the data you see on the web page in your browser might not actually be in the original HTTP response you get when you make a request to the parent page’s URL. In an earlier lesson, you might have found that the content you saw on the page didn’t seem to be in the HTML tag soup in the “view source” window.
This likely means the content you see in your browser was sent back from the server in some other, subsequent HTTP request, not the initial HTTP request to the URL you see at the top of your browser. Remember how we said some sites take hundreds of HTTP requests to load a given page? The data you’re seeing on the page might have come in the response to one of those many subsequent requests.
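One quick way to confirm this is to request the page's URL yourself and check whether some text you can see in the browser appears in the raw response. The sketch below uses only Python's standard library; the function names, the URL, and the search text are placeholders chosen for illustration, not anything from a real site.

```python
from urllib.request import Request, urlopen


def fetch_initial_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of the initial HTTP response. No Javascript is run."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (course-example)"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_html(html: str, visible_text: str) -> bool:
    """Case-insensitive check for text you saw rendered in the browser."""
    return visible_text.lower() in html.lower()


# Usage (placeholder URL and text, substitute your own):
# html = fetch_initial_html("https://example.com/some-page")
# print(appears_in_html(html, "a headline you saw on the page"))
```

If the check comes back False for content you can clearly see in the browser, that content most likely arrived in one of the later requests.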
The Trick to Scraping Javascript-Heavy Sites
Some developers mistakenly think that you have to load the initial page and then also download and run all the Javascript on the page in order to get the content you want to show up. Doing that means driving a full browser engine from your scraper, which is heavy-handed, slow, and cumbersome to run at any significant scale.
But if you think back to our core fundamentals, you’ll remember how it just takes a simple HTTP request to the right URL to load the data you’re looking for. When the site is using Javascript to load content from subsequent HTTP requests, all we have to do is find the URLs of those “behind the scenes” requests.
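Once you have found such a URL, requesting it directly might look something like the sketch below. The endpoint, the query parameters, and the helper names are all hypothetical. The extra headers are an assumption too: some sites expect them on AJAX requests, so the safest approach is to copy whatever your browser actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_ajax_request(endpoint: str, params: dict) -> Request:
    """Build a GET request for a data endpoint found in the network tab."""
    url = f"{endpoint}?{urlencode(params)}" if params else endpoint
    headers = {
        # Assumed headers: mirror what the browser sent for the real request.
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, headers=headers)


def fetch_body(request: Request, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Usage (hypothetical endpoint and parameters):
# request = build_ajax_request("https://example.com/api/items", {"page": 2})
# print(fetch_body(request)[:200])
```

Endpoints behind infinite scrolling often take a page or offset parameter like this one, so loading "more" content may be as simple as changing a number in the URL.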
Using the network tab in your developer tools, load the page and then scroll around or do whatever the site requires in order to show the data you want to scrape. Now look through the list of HTTP requests in the network tab. You can usually filter the list to show only the “XHR” requests (i.e. AJAX requests; some browsers label this filter “Fetch/XHR”), which narrows it down to the requests triggered by Javascript.
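If the list is long, most browsers' network tabs can also export the recorded requests as a HAR file (a JSON log), which you can search with a short script. This is a rough sketch: it assumes the export includes the response bodies, which depends on the browser and its settings, and that the file follows the usual HAR layout of log, entries, response, content. The file name and search text are placeholders.

```python
import base64
import json


def load_har(path: str) -> dict:
    """Read a HAR file exported from the browser's network tab."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_requests_containing(har: dict, needle: str) -> list[str]:
    """Return URLs of recorded requests whose response body contains `needle`."""
    matches = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        body = content.get("text") or ""
        # HAR marks bodies it had to encode with "encoding": "base64".
        if content.get("encoding") == "base64":
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        if needle in body:
            matches.append(entry["request"]["url"])
    return matches


# Usage (placeholder file name and text):
# har = load_har("network-log.har")
# print(find_requests_containing(har, "text you saw on the page"))
```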
It should only take a few minutes of looking through these requests to find the one that returned the data you were hoping to scrape. You might find that these sorts of requests are even easier to scrape! Rather than returning an HTML response, many AJAX endpoints return the data in a format like JSON or XML, both of which are easy to parse using common tools.
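To give a sense of how little parsing is involved, here is a small sketch using Python's built-in json and xml.etree.ElementTree modules. The payload shapes (an "items" list with "title" fields, and "title" elements in the XML) are invented for illustration; a real endpoint will have its own structure, which you can read straight off the response in the network tab.

```python
import json
import xml.etree.ElementTree as ET


def titles_from_json(body: str) -> list[str]:
    """Pull every item title out of a JSON response body."""
    data = json.loads(body)
    return [item["title"] for item in data["items"]]


def titles_from_xml(body: str) -> list[str]:
    """Pull the text of every <title> element out of an XML response body."""
    root = ET.fromstring(body)
    return [element.text or "" for element in root.iter("title")]


# Hypothetical payloads, shaped like what an AJAX endpoint might return.
json_body = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'
xml_body = "<items><item><title>First post</title></item><item><title>Second post</title></item></items>"

print(titles_from_json(json_body))
print(titles_from_xml(xml_body))
```

Compare that with picking the same titles out of a full page of HTML, where you first have to work out which tags and classes wrap the data.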
Armed with this new knowledge, you might find that scraping Javascript-heavy sites is even easier than scraping normal HTML websites, once you know the correct URLs that load the data you’re actually looking for.
Recommended book
JavaScript: The Definitive Guide: Activate Your Web Pages by David Flanagan