Web Scraping With Node.js: How To Scrape Websites/Apps With Puppeteer & Node.js
I feel as though most people immediately think Python whenever web scraping is brought up in conversation. That’s all well and good if you know how to write it, but what happens if you don’t?
Luckily for you, web scraping is entirely achievable with our good old buddy JavaScript. And if you're thinking I mean writing scrapers directly in the browser, no no no my friend, I'm talking about writing them with Node.js.
But how, you may ask? With the Puppeteer package, of course!
Be sure to check out my two-part YouTube series on this topic. The video is included at the end of the article!
What Is Puppeteer?
Instead of creating my own introduction, let’s see what the Puppeteer library has to say about itself.
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Most things that you can do manually in the browser can be done using Puppeteer!
Puppeteer requires zero setup and comes bundled with the Chromium version it works best with, making it very easy to start with. Also, the library is maintained by the Google Chrome development team, which provides some peace of mind that it will continue to be worked on in the future.
What Can It Do?
Web scraping is just scratching the surface of the functionality that Puppeteer provides. For example, here are just some of the things you can do with it:
- Web Scraping
- Take Screenshots
- Generate PDFs from pages
- Testing
  - Automate form submission, user interface testing, keyboard input, etc.
- Emulate Devices (see the quick sketch below)
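Speaking of that last bullet: device emulation can be as little as setting a viewport and a user agent string before navigating. This is only a quick sketch; the mobile-ish values below are purely illustrative and not taken from anything shown later in the article.

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Roughly emulate a small touch device. These dimensions and user
  // agent string are illustrative placeholders.
  await page.setViewport({ width: 375, height: 812, isMobile: true, hasTouch: true });
  await page.setUserAgent(
    'Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X) AppleWebKit/605.1.15'
  );

  await page.goto('https://example.com');
  await page.screenshot({ path: 'mobile.png' });

  await browser.close();
})();
```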
You might be wondering whether any of the above features require external dependencies in order to work properly. The answer is no: if you have Node.js installed and have added Puppeteer to your project via a package manager, you are good to go.
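For reference, adding Puppeteer with npm is a one-liner, and the bundled Chromium is downloaded automatically as part of the install:

```bash
npm install puppeteer
```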
Apparently you can also use Puppeteer with Firefox if you so choose. I have not used it myself, but it is worth checking out if you are interested.
How Do I Use It To Scrape A Website/App?
Let’s take a look at some code to see how easy it is to start scraping data from any website or application in a Node.js environment.
All that's required is that you require Puppeteer in your script and create a new browser instance with the provided launch method. From there, you can see some of the features I described in the previous section in action.
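A minimal sketch of that kind of script might look something like this. The URL and the `#sidebar a` selector are placeholders standing in for the hypothetical site, not values from the article:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Create a new browser instance via the launch method.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Take a screenshot of the page.
  await page.screenshot({ path: 'page.png' });

  // Generate a PDF from the page.
  await page.pdf({ path: 'page.pdf' });

  // Select all of the <a> elements in the (hypothetical) sidebar.
  // The selection itself happens inside the browser context; the next
  // section covers getting the extracted data back out to Node.js.
  const linkCount = await page.evaluate(() => {
    return document.querySelectorAll('#sidebar a').length;
  });
  console.log(`Found ${linkCount} sidebar links`);

  await browser.close();
})();
```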
Like I said though, this is just scratching the surface of this library.
Accessing The Extracted Data Outside Of The Browser Context
In the code snippet above, I wrote a simple example of selecting all of the <a> elements in the hypothetical website's sidebar. Let's imagine we want to extract the text from all of those links and write it to a JSON file on our local file system.
To do so, we need to refactor the code a bit to assign the extracted data to a variable that can be accessed in the Node.js environment. I’ve removed the screenshot and PDF code, as it’s not really necessary for this example.
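A sketch of that refactor, again using a placeholder URL and selector, could look like this:

```javascript
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract the link text inside the browser context. The NodeList from
  // querySelectorAll is converted to an Array so that the value returned
  // to Node.js is a plain, serializable structure.
  const linkText = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('#sidebar a')).map(
      (link) => link.textContent
    );
  });

  // Write the extracted data to a JSON file with the fs module.
  fs.writeFileSync('links.json', JSON.stringify(linkText, null, 2));

  await browser.close();
})();
```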
Notice how I grab all of the links like before, and then convert the result to an array. When you use the document.querySelectorAll method in the browser, you have to remember that it returns a browser-specific data structure called a NodeList.
Node.js, which runs outside of a browser environment, has no idea what a NodeList is (sad face). By converting it to an array beforehand, we ensure that the values extracted from the browser context arrive in a format Node recognizes.
After that, it's as simple as using the fs (file system) module to write the array to a JSON file.
Learn More Web Scraping Techniques With Puppeteer
That's a wrap for our tutorial, ladies and gentlemen! I truly appreciate you checking out the article, and I hope you learned something beneficial from it.
If this article has intrigued you, I would encourage you to check out my two-part YouTube series on Web Scraping With Node.js & Puppeteer.
In this series, we extract team standings from an esports website and write the data to our local file system. After that, we set up automation via cron-style scheduling and program an email notification system to report errors and other data to a Gmail account.
You can check it out below!