
Web Scraping Basics with Node.js and Cheerio

Very clear tutorial on basic web scraping – these are my notes:

Collecting the data

1 – Create folder to put your scraping file

mkdir nameOfFolder

2 – Create a package.json

npm init -y

3 – Install Cheerio & Request

npm i cheerio request // npm install = npm i, so this command will install cheerio and request

Cheerio makes it easy to scrape with jQuery-style selectors.
Request is a lightweight HTTP module for making requests.

4 – Create your scraper file

touch scraper.js

5 – Bring request and cheerio into scraper.js

const request = require('request')
const cheerio = require('cheerio')

6 – Request a main url

In this example I just use this blog that has a list of blog posts that we can scrape.

request('https://www.codingsavedmylife.com/', (error, response, html) => {
  if (!error && response.statusCode == 200) {
    console.log(html)
  }
})

// Go to the url, if there is no error and request is successful with 200, show me the html

7 – Execute the script

node scraper.js // In the terminal, you will get the whole html of the page

8 – Load the HTML on the fly with cheerio.load

request('https://www.codingsavedmylife.com/', (error, response, html) => {
  if (!error && response.statusCode == 200) {
    const $ = cheerio.load(html)
  }
})

Here the variable is $ (the jQuery sign), so we can use jQuery-style selectors to get elements from the HTML, just as if we were in the DOM.
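For example, a tiny self-contained sketch (the HTML snippet here is made up, just to show the idea):

const cheerio = require('cheerio')

// Load any HTML string and query it like the DOM
const $ = cheerio.load('<div class="entry-header"><h2>My Post</h2></div>')

console.log($('.entry-header h2').text()) // "My Post"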

9 – Start selecting the elements you want to collect

const postTitle = $('.entry-header')
console.log(postTitle.html())
console.log(postTitle.text())

This will log the HTML and the text inside the .entry-header class.

10 – find()

const output = postTitle.find('h2').text()
console.log(output)

This will find all the h2 titles and show the text inside each h2

11 – next() / parent()

const output = postTitle.find('h2').next().text()
console.log(output)

The next sibling element after the h2 inside .entry-header is “.entry-meta”. The above code will show the metadata (author, category) for each post.

const output = postTitle.find('h2').parent().text()
console.log(output)

This will show the text of the parent, which is basically .entry-header itself, so the console shows the same output as step 9’s .text() call.

12 – Looping through a menu example

Get the id of each element of the menu:

Find ids that start with a given prefix:
$('li[id^="menu-item-"]') // the ^= selector matches ids that start with "menu-item-"

Loop through each element to collect the menu title

$('li[id^="menu-item-"]').each((i, el) => {
  const item = $(el).text()
  const link = $(el).find('a').attr('href')

  console.log(item, link)
})

// item is obvious. For link, you first have to find the 'a' element before you can collect the href.

13 – Looping through posts to get titles and links

$('.entry-header').each((i, el) => {
  const title = $(el).find('h2').text()
  const link = $(el).find('a').attr('href')

  console.log(title, link)
})

// this will print the title and link for each post

Saving the data in a CSV file

1 – Add fs to your file.

fs is Node’s built-in file system module. You don’t need to install it, just require it at the top of the file.

const fs = require('fs')

2 – Tell node where we will save this data

const writeStream = fs.createWriteStream('data.csv')

3 – Create the headers of your CSV (column titles for spreadsheet)

writeStream.write('Title,Link\n')

4 – Create a row for each post

Inside the .each() loop from step 13:
writeStream.write(`${title},${link}\n`)

5 – Add a message so you know when the scraping is finished

console.log('Scraping Is Finito...!')

That’s it! When you run your scraping file (node nameOfFile), the data will automatically be saved in your data.csv
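Putting everything together, a minimal sketch of the full scraper.js could look like this (same URL and selectors as above; note that a title containing a comma would need quoting in a real CSV):

const request = require('request')
const cheerio = require('cheerio')
const fs = require('fs')

const writeStream = fs.createWriteStream('data.csv')

// Column headers first
writeStream.write('Title,Link\n')

request('https://www.codingsavedmylife.com/', (error, response, html) => {
  if (!error && response.statusCode == 200) {
    const $ = cheerio.load(html)

    // One CSV row per blog post
    $('.entry-header').each((i, el) => {
      const title = $(el).find('h2').text()
      const link = $(el).find('a').attr('href')

      writeStream.write(`${title},${link}\n`)
    })

    console.log('Scraping Is Finito...!')
  }
})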
