How to Scrape Websites Like Amazon Using Node.js

Sourav Paul
The Startup
Published in
7 min readOct 3, 2020


Web scraping with Node.js

Web scraping (also called web crawling) is the technique of extracting data from websites. This data can then be stored in a database or any other storage system for analysis or other uses. While extracting data from websites can be done manually, web scraping usually refers to an automated process.

Web scraping is used by most bots for data extraction. There are various methodologies and tools you can use for web scraping; in this tutorial I will focus on a technique that involves DOM-parsing a webpage using Node.js and its packages to perform quick and effective web scraping on a site like Amazon. Let’s dive in.

Prerequisites

There are multiple Node.js modules for scraping websites. In this post I will use Puppeteer to extract data from Amazon.in.
With that in mind, this post assumes that readers know the following:

  • Understanding of JavaScript and ES6/ES7 syntax
  • Familiarity with HTML and DOM parsing
  • Functional programming concepts

Let’s get started.

Setup

Before you begin, ensure that you have Node and npm or yarn installed on your machine. Since I will use ES6/7 syntax in this tutorial, it is recommended that you use the following versions of Node and npm for complete ES6/7 support: Node 8.9.0 or higher and npm 5.2.0 or higher.

Step 1- Create the Application Directory

# Create a new directory
mkdir amazon-scraping

# cd into the new directory
cd amazon-scraping
# Create the scraper module (on Windows: type nul > scrapper.js)
touch scrapper.js

# Initiate a new package and install app dependencies
npm init -y
npm i puppeteer --save

Step 2- Import Puppeteer and add the following content to scrapper.js

const puppeteer = require('puppeteer');

(async () => {
  let resultObj = {};
  let returnedResponse;
  let browser;
  try {
    browser = await puppeteer.launch({
      headless: false,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-infobars',
        '--disable-features=site-per-process',
        '--window-position=0,0',
        '--disable-extensions',
        '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
      ]
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1366, height: 800 });
    await page.goto('https://www.amazon.in/s?k=keyboard&tag=amdot-21&ref=nb_sb_noss', { waitUntil: 'load', timeout: 30000 });
    await page.waitForSelector('#search > div.s-desktop-width-max');
  } catch (e) {
    console.log('Amazon scrape error -> ', e);
    await browser.close();
  }
})();

Save and run it with the following command:

node scrapper.js

On running, the code snippet above opens a Chromium window and navigates to the provided URL.

Chromium view

Step 3- Creating DOM parser functionality to extract data

Puppeteer’s evaluate method lets you run CSS selectors against the HTML document to pick out tags and extract their contents.

await page.waitForSelector('#search > div.s-desktop-width-max')
returnedResponse = await page.evaluate(() => {
  // Amazon serves the result grid under one of two layouts; try both.
  // Each layout differs in its container selector, in how many leading
  // tiles to skip, and in its offer-badge selector.
  const layouts = [
    {
      selector: '#search > div.s-desktop-width-max.s-desktop-content.sg-row > div.sg-col-20-of-24.sg-col-28-of-32.sg-col-16-of-20.sg-col.sg-col-32-of-36.sg-col-8-of-12.sg-col-12-of-16.sg-col-24-of-28 > div > span:nth-child(4) > div.s-main-slot.s-result-list.s-search-results.sg-row > div',
      start: 3,
      offerSelector: 'div > span > div > div > div.a-section.a-spacing-micro.s-grid-status-badge-container > a .a-badge .a-badge-text'
    },
    {
      selector: '#search > div.s-desktop-width-max.s-opposite-dir > div > div.sg-col-20-of-24.s-matching-dir.sg-col-28-of-32.sg-col-16-of-20.sg-col.sg-col-32-of-36.sg-col-8-of-12.sg-col-12-of-16.sg-col-24-of-28 > div > span:nth-child(4) > div.s-main-slot.s-result-list.s-search-results.sg-row > div',
      start: 2,
      offerSelector: 'div.a-section div.a-section span'
    }
  ];

  // Return the innerText of a child matching sel, or '' if absent
  const getText = (el, sel) => (el.querySelector(sel) ? el.querySelector(sel).innerText : '');

  // Pull the fields we need out of a single result card
  const extractData = (el, offerSelector) => ({
    ProductName: el.querySelector('div > span > div > div > div h2 > a > span').innerText,
    productURL: el.querySelector('div > span > div > div > div h2 > a').href,
    productImg: el.querySelector('div > span > div > div span > a > div > img').src,
    price: el.querySelector('div > span > div > div span.a-price-whole')
      ? el.querySelector('div > span > div > div span.a-price-whole').innerText.trim().replace(/,/, '')
      : '0',
    strike: el.querySelector('div > span > div > div span.a-price.a-text-price .a-offscreen')
      ? el.querySelector('div > span > div > div span.a-price.a-text-price .a-offscreen').innerText.trim().substr(1, 9).replace(/,/, '')
      : '0',
    rating: getText(el, 'div > span > div > div a > i'),
    offer: getText(el, offerSelector)
  });

  for (const layout of layouts) {
    const nodes = document.querySelectorAll(layout.selector);
    if (nodes.length === 0) continue;
    // Skip the banner/ad tiles at the top and bottom of the grid
    const elementArray = [];
    for (let i = layout.start; i < nodes.length - 4; i++) {
      elementArray.push(nodes[i]);
    }
    // Give the page 4 seconds to finish lazy-loading before reading values
    return new Promise((resolve) => {
      setTimeout(() => {
        resolve(elementArray.map((el) => extractData(el, layout.offerSelector)));
      }, 4000);
    });
  }
})
resultObj.product = returnedResponse
console.log(resultObj.product)
await browser.close();

Run the code by executing the following command in your command shell:

node scrapper.js

Let’s go through the code and understand what it does.

Puppeteer provides a method called evaluate which runs a function in the page context, and that is where the DOM parsing and data extraction happen. If the function passed to page.evaluate returns a Promise, page.evaluate waits for the promise to resolve and returns its value. A setTimeout of 4 seconds is added because the site takes some time to load before data can be extracted. Once all the data has been collected into an array, the promise resolves and returns the data in JSON format.
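This promise-based flow can be sketched outside Puppeteer; here collectAfterDelay is a hypothetical stand-in for the function passed to page.evaluate:

```javascript
// Sketch of the pattern the scraper relies on: if the function passed to
// page.evaluate returns a Promise, the awaited result is its resolved value.
// collectAfterDelay is a hypothetical stand-in for that function.
function collectAfterDelay(names, delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => {
      const dataArray = names.map((name) => ({ ProductName: name }));
      resolve(dataArray); // resolve once, after the whole array is built
    }, delayMs);
  });
}

(async () => {
  const products = await collectAfterDelay(['keyboard A', 'keyboard B'], 100);
  console.log(products.length); // 2
})();
```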
The code above needs only minor adjustments to save the scraped content into a file or a database. Amazon tries to prevent excessive scraping and imposes CAPTCHAs as an anti-scraping measure. To keep your script up and running, you can do the following:

  • Go Asynchronous
  • Rotating the IP address
  • Rotating the user agent
  • Retrying failed requests
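Two of these measures can be sketched in plain Node.js. withRetry and randomUserAgent below are illustrative helpers, not part of Puppeteer's API; a chosen user agent would be applied with page.setUserAgent:

```javascript
// Illustrative helper: re-run an async task with exponential backoff
// between attempts, so transient failures (timeouts, CAPTCHAs) get retried.
async function withRetry(task, attempts = 3, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt === attempts) throw err; // out of retries, give up
      // back off before the next try: 1s, 2s, 4s, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}

// Rotate the user agent: pick one at random for each new page
const USER_AGENTS = [
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
];
const randomUserAgent = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
```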

Using a Puppeteer cluster will enable you to scrape Amazon product information asynchronously and drastically increase speed. However, keep the number of concurrent requests limited to a level that will not harm the web server of the site you are scraping.
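The concurrency limit can be sketched in plain Node.js; runLimited here is a hypothetical helper mirroring what a cluster's max-concurrency setting does, with each task standing in for one page scrape:

```javascript
// Hypothetical helper: run async tasks with at most `limit` in flight at
// once, mirroring a scraping cluster's max-concurrency setting.
async function runLimited(tasks, limit) {
  const results = [];
  let next = 0;
  // Each worker pulls the next unclaimed task until none remain
  const worker = async () => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}
```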

Output

[{ 
ProductName: 'Lenovo USB Keyboard K4802, Black',
productURL:'https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0188843L5R2998TRU18&url=%2FLenovo-USB-Keyboard-K4802-Black%2Fdp%2FB07S96YK34%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26keywords%3Dkeyboard%26qid%3D1601724032%26sr%3D8-2-spons%26tag%3Damdot-21%26psc%3D1&qualifier=1601724032&id=7929480531532188&widgetName=sp_atf&tag=amdot-21&language=en_IN',
productImg:'https://m.media-amazon.com/images/I/81dLtYBK5NL._AC_UY218_.jpg',
price: '649',
strike: '1192',
rating: '4.2 out of 5 stars',
offer: ''
},
{
ProductName:
'Zebronics ZEB-KM2100 Multimedia USB Keyboard Comes with 114 Keys Including 12 Dedicated Multimedia Keys & with Rupee Key',
productURL:'https://www.amazon.in/Zebronics-Km2100-Multimedia-USB-Keyboard/dp/B077T3BG5L/ref=sr_1_3?dchild=1&keywords=keyboard&qid=1601724032&sr=8-3&tag=amdot-21&tag=amdot-21&language=en_IN',
productImg:'https://m.media-amazon.com/images/I/81shebPwe0L._AC_UY218_.jpg',
price: '345',
strike: '399',
rating: '3.5 out of 5 stars',
offer: 'Best seller'
},
{
ProductName: 'HP 100 Wired USB Keyboard',
productURL:'https://www.amazon.in/HP-100-Wired-USB-Keyboard/dp/B07L4VCBLM/ref=sr_1_4?dchild=1&keywords=keyboard&qid=1601724032&sr=8-4&tag=amdot-21&tag=amdot-21&language=en_IN',
productImg:'https://m.media-amazon.com/images/I/81wRXdAOmkL._AC_UY218_.jpg',
price: '674',
strike: '799',
rating: '4.2 out of 5 stars',
offer: '' },
{
ProductName: 'Dell KB216 Wired Multimedia USB Keyboard',
productURL:'https://www.amazon.in/Dell-KB216-Wired-Multimedia-Keyboard/dp/B00ZYLMQH0/ref=sr_1_5?dchild=1&keywords=keyboard&qid=1601724032&sr=8-5&tag=amdot-21&tag=amdot-21&language=en_IN',
productImg:'https://m.media-amazon.com/images/I/811YM2Go9GL._AC_UY218_.jpg',
price: '699',
strike: '999',
rating: '4.3 out of 5 stars',
offer: '' },
{
ProductName:'Zebronics ZEB-KM2100 Multimedia USB Keyboard Comes with 114 Keys Including 12 Dedicated Multimedia Keys & with Rupee Key',
productURL:'https://www.amazon.in/Zebronics-Km2100-Multimedia-USB-Keyboard/dp/B077T3BG5L/ref=sxin_9?ascsubtag=amzn1.osa.518100b6-4486-4ba1-a7af-ce91569824e4.A21TJRUUN4KGV.en_IN&creativeASIN=B077T3BG5L&cv_ct_cx=keyboard&cv_ct_id=amzn1.osa.518100b6-4486-4ba1-a7af-ce91569824e4.A21TJRUUN4KGV.en_IN&cv_ct_pg=search&cv_ct_wn=osp-single-source-gl-ranking&dchild=1&keywords=keyboard&linkCode=oas&pd_rd_i=B077T3BG5L&pd_rd_r=0909bfbf-6174-46fc-8600-deabc31c0449&pd_rd_w=xHrET&pd_rd_wg=SFDS5&pf_rd_p=e0ec3157-32a0-4197-a7f6-49f9023b486e&pf_rd_r=8E42S178JQ07JWE5EP7D&qid=1601724032&sr=1-1-5b72de9d-29e4-4d53-b588-61ea05f598f4&tag=amdot-21&tag=amdot-21&language=en_IN',
productImg:'https://m.media-amazon.com/images/I/81shebPwe0L._AC_UL320_.jpg',
price: '345',
strike: '399',
rating: '3.5 out of 5 stars',
offer: 'Recommended article\nRecommended article' }
]

Use Cases

  • Price Monitoring: E-commerce is a very competitive industry, making a smart and dynamic pricing strategy indispensable. Monitoring Amazon prices enables you to adapt and optimize your pricing automatically.
  • More Information: Amazon does provide a product API. However, product pages contain a lot more information than can be obtained via the API.
  • Review Information: Scraping reviews from Amazon enables you to analyze customer satisfaction with specific products.
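For example, the price and strike fields scraped above are already enough for simple price monitoring, such as computing each product's discount (figures taken from the sample output):

```javascript
// Compute the discount percentage from the scraped price/strike strings
const products = [
  { ProductName: 'Lenovo USB Keyboard K4802, Black', price: '649', strike: '1192' },
  { ProductName: 'HP 100 Wired USB Keyboard', price: '674', strike: '799' }
];

const withDiscount = products.map((p) => ({
  ...p,
  discountPct: Math.round((1 - Number(p.price) / Number(p.strike)) * 100)
}));

console.log(withDiscount[0].discountPct); // 46
```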

This post is for learning purposes only, demonstrating different techniques for developing web scrapers. The author does not take responsibility for how the code snippets are used.

Link to GitHub- https://github.com/paul41/mediumpost.git

If you find this article helpful do share with your friends and followers and check out my other posts.

You can follow me on twitter @Syper78897264

Happy Coding!



Software engineer, self-learner & an adventurous traveller. I enjoy writing about tech, places & self-improvement.