How to Scrape any Website and Extract MetaTags Using JavaScript

Scraper API provides a proxy service designed for web scraping. With over 20 million residential IPs across 12 countries, plus software that handles JavaScript rendering and CAPTCHA solving, you can complete large scraping jobs quickly without worrying about being blocked by the target servers. We can use ScraperAPI to fetch any website and then extract metatags such as the title, description, keywords, and Open Graph image links, without dealing with IP blocks or CAPTCHAs ourselves.

Implementation is extremely simple, and ScraperAPI offers unlimited bandwidth. Proxies are automatically rotated, but users can choose to maintain sessions if required. All you need to do is call the API with the URL that you want to scrape, and it will return the raw HTML. With Scraper API, you just focus on parsing the data, and they’ll handle the rest. Once the HTML is fetched, we will use the metascraper library to extract metatags from it using Open Graph, JSON-LD, regular HTML metatags, and a series of fallbacks.

The steps we follow to extract metatags:

  • Use ScraperAPI to scrape a website.
  • Use metascraper library to extract metatags.

Having built many web scrapers, we repeatedly went through the tiresome process of finding proxies, setting up headless browsers, and handling CAPTCHAs. That's why we decided to start Scraper API, it handles all of this for you so you can scrape any page with a simple API call!

- ScraperAPI Story

According to their own figures, they handle 5 billion API requests per month for over 1,500 businesses and developers around the world.

Implementation

When you sign up for Scraper API you are given an access key. All you need to do is call the API with your key and the URL that you want to scrape, and you will receive the raw HTML of the page as a result. It’s as simple as:

curl "https://api.scraperapi.com?api_key=XYZ&url=https://metascraper.js.org"

On the back end, when Scraper API receives your request, their service accesses the URL via one of their proxy servers, gets the data, and then sends it back to you.
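One detail the curl call above glosses over: if the page you want to scrape has its own query string, it needs to be percent-encoded, otherwise its parameters get mixed up with the API's. Here is a minimal sketch of building the request URL in Node.js without the SDK. The helper names buildScraperApiUrl and fetchHtml are our own (hypothetical), and fetchHtml assumes Node.js 18+ where fetch is available globally.

```javascript
// Endpoint taken from the curl example; the helper names are hypothetical.
const SCRAPER_API_ENDPOINT = "https://api.scraperapi.com";

// Build the request URL from the API key, the page to scrape, and any
// additional query parameters the caller wants to pass along.
function buildScraperApiUrl(apiKey, targetUrl, extraParams = {}) {
  // URLSearchParams percent-encodes every value, so a target URL that
  // carries its own "?q=a&page=2" is not mistaken for API parameters.
  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    ...extraParams,
  });
  return `${SCRAPER_API_ENDPOINT}?${params.toString()}`;
}

// Fetch the raw HTML of a page. Assumes Node.js 18+ (global fetch).
async function fetchHtml(apiKey, targetUrl) {
  const res = await fetch(buildScraperApiUrl(apiKey, targetUrl));
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return res.text();
}
```

Calling buildScraperApiUrl("XYZ", "https://metascraper.js.org") gives the same request as the curl call, only with the target URL encoded.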

Parse Website

Scraper API exposes a single API endpoint. Simply send a GET request to https://api.scraperapi.com with two query string parameters: api_key, which contains your API key, and url, which contains the URL you would like to scrape.

/* Node.js */
const scraperapiClient = require("scraperapi-sdk")("XYZ"); // replace XYZ with your API key

// In a CommonJS script, await is only valid inside an async function
(async () => {
  const response = await scraperapiClient.get("https://metascraper.js.org");
  console.log(response); // raw HTML of the page
})();

Result

<!DOCTYPE html>
<html lang="en">

<head>
    <!-- Basic -->
    <meta charset="utf-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">

    <!-- Search Engine -->
    <meta name="description" content="easily scrape metadata from an article on the web.">
    <meta name="image" content="https://metascraper.js.org/static/logo-banner.png">
    <link rel="canonical" href="https://metascraper.js.org" />
    <title>metascraper, easily scrape metadata from an article on the web.</title>
    <meta name="viewport"
        content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">

    <!-- Schema.org for Google -->
    <meta itemprop="name" content="metascraper, easily scrape metadata from an article on the web.">
    <meta itemprop="description" content="easily scrape metadata from an article on the web.">
    <meta itemprop="image" content="https://metascraper.js.org/static/logo-banner.png">

    <!-- Twitter -->
    <meta name="twitter:card" content="summary_large_image">
    <meta name="twitter:title" content="metascraper, easily scrape metadata from an article on the web.">
    <meta name="twitter:description" content="easily scrape metadata from an article on the web.">
    <meta name="twitter:image" content="https://metascraper.js.org/static/logo-banner.png">
    <meta name="twitter:label1" value="Installation" />
    <meta name="twitter:data1" value="npm install metascraper --save" />

    <!-- Open Graph general (Facebook, Pinterest & Google+) -->
    <meta property="og:title" content="metascraper, easily scrape metadata from an article on the web.">
    <meta property="og:description" content="easily scrape metadata from an article on the web.">
    <meta property="og:image" content="https://metascraper.js.org/static/logo-banner.png">
    <meta property="og:logo" content="https://metascraper.js.org/static/logo.png">
    <meta property="og:url" content="https://metascraper.js.org">
    <meta property="og:type" content="website">

    <!-- Favicon -->
    <link rel="icon" type="image/png" href="/static/favicon-32x32.png" sizes="32x32" />
    <link rel="icon" type="image/png" href="/static/favicon-16x16.png" sizes="16x16" />
    <link rel="shortcut icon" href="/static/favicon.ico">

    <!-- Stylesheet -->
    <link href="https://fonts.googleapis.com/css?family=Bitter|Source+Sans+Pro" rel="stylesheet">
    <link rel="stylesheet" href="/static/style.min.css">
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codecopy/umd/codecopy.min.css">

</head>

<body>
    <div id="app"></div>
</body>
<script src="/static/main.min.js"></script>
<script src="//unpkg.com/docsify/lib/docsify.min.js"></script>
<script src="//unpkg.com/docsify/lib/plugins/ga.min.js"></script>
<script src="//unpkg.com/docsify/lib/plugins/external-script.min.js"></script>
<script src="//unpkg.com/prismjs/components/prism-bash.min.js"></script>
<script src="//unpkg.com/prismjs/components/prism-jsx.min.js"></script>
<script src="//cdn.jsdelivr.net/npm/codecopy/umd/codecopy.min.js"></script>

</html>

Metatags Extraction

metascraper is a library that makes it easy to scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks. It follows a few principles:

  • Have a high accuracy for online articles by default.
  • Make it simple to add new rules or override existing ones.
  • Don’t restrict rules to CSS selectors or text accessors.
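To get a feel for what "a series of fallbacks" means, here is a small dependency-free sketch that looks for a title in og:title first, then twitter:title, then the plain title tag. This is only an illustration of the idea and not how metascraper is implemented. The helper names metaContent and extractTitle are hypothetical, and the regular expressions are deliberately naive (a real HTML parser is more robust).

```javascript
// Hypothetical helper: return the content of the first <meta> tag whose
// property, name, or itemprop attribute equals `key`, or null if none does.
function metaContent(html, key) {
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    // The backreference \1 makes the closing quote match the opening one.
    const id = /\b(?:property|name|itemprop)=(["'])(.*?)\1/i.exec(tag);
    const content = /\bcontent=(["'])(.*?)\1/i.exec(tag);
    if (id && content && id[2].toLowerCase() === key) {
      return content[2];
    }
  }
  return null;
}

// Try the most specific source first and fall back step by step.
function extractTitle(html) {
  const titleTag = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return (
    metaContent(html, "og:title") ||
    metaContent(html, "twitter:title") ||
    (titleTag ? titleTag[1].trim() : null)
  );
}
```

Run against the HTML shown in the Result above, extractTitle would pick the og:title value, since that is the first source it checks.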

Installation and Code

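npm install --save scraperapi-sdk metascraper metascraper-author metascraper-date metascraper-description metascraper-image metascraper-logo metascraper-clearbit metascraper-publisher metascraper-title metascraper-url

// Each rule package contributes one metadata field to the result
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const targetUrl = "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance";
const scraperapiClient = require("scraperapi-sdk")("XYZ");
;(async () => {
  // get() is assumed to resolve to the raw HTML string of the page
  const html = await scraperapiClient.get(targetUrl);
  // metascraper expects the markup as `html` and the page address as `url`
  const metadata = await metascraper({ html, url: targetUrl })
  console.log(metadata)
})()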

Result

{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}

You have now successfully scraped a website and extracted the following metatags:

  • Author Name
  • Date
  • Description
  • Open Graph Image
  • Publisher
  • Title
  • Url
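Not every page exposes every one of these tags, so it can be handy to check which fields came back empty before storing the result. A minimal sketch, assuming the metadata object has the shape shown in the Result above; the names EXPECTED_FIELDS and missingFields are hypothetical.

```javascript
// Field names taken from the Result object above.
const EXPECTED_FIELDS = [
  "author",
  "date",
  "description",
  "image",
  "publisher",
  "title",
  "url",
];

// Hypothetical helper: list the expected fields that are absent or empty.
function missingFields(metadata, expected = EXPECTED_FIELDS) {
  return expected.filter((field) => {
    const value = metadata[field];
    return value === undefined || value === null || value === "";
  });
}
```

An empty array means every expected field was found; anything else tells you which tags the page did not provide.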

ScraperAPI and metascraper have made our life super easy. You can use them to extract metatags from any website with ease. I used this process to extract the metatags of Hacker News articles.

Geographic Location

To ensure your requests come from the United States, use the country_code parameter (e.g. country_code=us):

curl "https://api.scraperapi.com?api_key=XYZ&url=https://metascraper.js.org&country_code=us"

Scraper API Account Information

When you log into your Scraper API account, you will be presented with a dashboard that shows how many requests you have used, how many requests you have left for the month, and the number of failed requests (which do not count towards your request limit).

If you would like to monitor your account usage and limits programmatically (how many concurrent requests you’re using, how many requests you’ve made, etc.), you can use the /account endpoint, which returns JSON.

curl "https://api.scraperapi.com/account?api_key=XYZ"

Result

{
  "concurrentRequests": 553,
  "requestCount": 6655888,
  "failedRequestCount": 1118,
  "requestLimit": 10000000,
  "concurrencyLimit": 1000
}
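The account JSON is easy to turn into a quick usage summary, for example to pause a job before the monthly limit is reached. A minimal sketch, assuming the response has exactly the fields shown in the JSON above; summarizeUsage is a hypothetical helper name.

```javascript
// Hypothetical helper: derive remaining quota from the /account JSON.
function summarizeUsage(account) {
  const { requestCount, requestLimit, concurrentRequests, concurrencyLimit } = account;
  return {
    // Requests still available this month (never negative)
    remainingRequests: Math.max(0, requestLimit - requestCount),
    // Share of the monthly limit already used, rounded to one decimal
    usedPercent: Number(((requestCount / requestLimit) * 100).toFixed(1)),
    // Concurrent request slots that are currently free
    freeConcurrency: Math.max(0, concurrencyLimit - concurrentRequests),
  };
}
```

For the sample response above, this gives 3344112 remaining requests, 66.6 percent used, and 447 free concurrent slots.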

Ending Note

Scraper API is the best proxy API service for web scraping on the market today, packed with features at an affordable price:

  • Over 20 million residential IPs in the pool
  • Simple dashboard to manage usage and billing
  • Geo-targeting: target 12+ countries around the world
  • Free plan with 1000 requests & all features
  • Seven-day, no questions asked refund policy
  • 24/7 support and great customer service
  • Rotating and sticky IP sessions
  • Easy setup
  • Able to render JavaScript pages
  • Custom browser headers
  • Premium proxy pools
  • Auto-extraction of data from popular sites

It’s easy to integrate and can be used for scraping projects of all sizes. If you have any serious scraping projects, then Scraper API is worth looking into. Even if you’re a casual user, you may benefit from using the free plan.

To read more on topics like this, follow the BoxPiper blog.