How to write a scraper/parser using Postman?

This blog shows how we can use a Postman test script to write a scraper/parser using cheerio. When the collection runs through a Postman monitor, the test script executes on Postman's cloud rather than on your machine, which means you don't need to use your own system or a separate server to run the parser.

The TL;DR answer:

We can use the Postman test script to write a scraper using cheerio. When run through a Postman monitor, test scripts execute on Postman's cloud and not on your machine. In the following snippet, I have a small scraper that gets all the marathon events listed on the website https://indiarunning.com/mumbai.html and then logs the result to the console.

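// cheerio is provided by the Postman sandbox; load the raw HTML of the response.
const $ = cheerio.load(pm.response.text());

const raceList = []; // one entry per table row
let rowData = [];    // cells of the row currently being collected

// Walk every <td> inside <table class="hovertable">.
$('table.hovertable tbody td').each(function (index) {
    const textData = $(this).text().trim();
    rowData.push(textData);
    // Every 5th cell closes a row.
    if ((index + 1) % 5 === 0) {
        raceList.push(rowData);
        rowData = [];
    }
});
console.log(raceList);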

The detailed answer:

Most of us use Postman to test API requests and responses and to share API contracts. But Postman can do a lot more that we rarely explore.

Postman has a unique onboarding culture that every new joiner goes through, irrespective of team. During your first few days you are not given any task; rather, you are expected to explore the whole Postman product and try to build something with it. When I joined Postman as a Software Developer, I went through the same process. I used Postman and a few other tools, like Slack and Airtable, to create a Slack command that gets you a list of all the marathon events happening around India. The website I was using to fetch the details didn't have an API, so the first challenge I faced was how to parse the HTML content and convert it into JSON data.

After exploring Postman and its features, I realized that I could use the Postman test script to write a parser that parses the HTML content of a page and converts it into my required format. Postman test scripts run in a sandboxed JavaScript runtime wherever the collection is executed; when a monitor runs the collection, that is Postman's cloud rather than your own system.

Test script examples

Once you start exploring the test script documentation, you realize how powerful it is. Although people mostly use it to write integration test suites, you can be creative and use it for your own requirements, since it lets you write custom JS code and access the libraries that are available in that environment. One such library is cheerio; you can use the cheerio APIs described in the cheerio documentation.

cheeriojs/cheerio
Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

Let me take you step by step through how you can use cheerio inside the Postman test script. The first thing we need is the response to an HTTP GET/POST request for the URL. Throughout this blog, we shall refer to a simple website that lists the running events happening in the city of Mumbai.

Running Races in Mumbai
Running Races in Mumbai - India’s #1 site for running races information

This website serves only HTML pages. Our end goal is to parse the page content and return it in JSON format.

Our first step is to retrieve the HTML response by sending a GET request to the URL. You can do so by creating a new request in the Postman web app or desktop app.

Simple HTTP response.

Now that we have the HTTP response, we need to access it and do some data processing. To access the response you need to use the pm object. There is a lot one can do with pm, from passing data between requests in the same collection and setting variable values to accessing cookies and much more. Here we are interested in accessing the response. You can read the response body of a request with pm.response.text(), which gives us the raw HTML returned by the server for our GET request. Once we have the response, we need to load it into cheerio.
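Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch and not part of the original script: `htmlFromResponse` is a hypothetical helper name, and it assumes the page answers with status 200. It relies only on `pm.response.code` and `pm.response.text()`; the `typeof pm` guard simply lets the snippet load outside Postman as well.

```javascript
// Hypothetical helper: return the response body only when the status is 200.
function htmlFromResponse(response) {
    if (response.code !== 200) {
        throw new Error('Unexpected status code: ' + response.code);
    }
    return response.text();
}

// Inside a Postman test script, pm is defined by the sandbox.
if (typeof pm !== 'undefined') {
    const html = htmlFromResponse(pm.response);
    console.log('Received ' + html.length + ' characters of HTML');
}
```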

Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript.

The following line loads the whole HTML content into memory.

const $ = cheerio.load(pm.response.text());

Once the response is loaded, we can check the cheerio documentation for the functionality that fits our requirements. In our case, we need the list of elements inside the <tbody> of the <table class="hovertable"> tag.

Rendered HTML UI in browser
Raw HTML code used to render the page in the browser.

$('table.hovertable tbody td').each(function(index, data)

The above line gives us a list of all the relevant <td> tags. A better approach would have been to find all the <tr> tags and then parse the <td> tags inside each of them, but I was short on time and didn't want to spend much of it on this parser. So I took a shortcut: I selected all the relevant <td> tags and kept a count of them so that I know when a particular row ends.
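The row-by-row alternative mentioned above could look roughly like the sketch below. It is not the code used in this post: `parseRaceRows` is a hypothetical name, and the selector assumes the table carries the `hovertable` class and an explicit `<tbody>`, as described earlier. Because each `<tr>` is handled on its own, there is no need to count cells.

```javascript
// Hypothetical helper: takes a loaded cheerio instance and
// returns one array of cell texts per <tr>.
function parseRaceRows($) {
    const rows = [];
    $('table.hovertable tbody tr').each(function (rowIndex, row) {
        // Collect the trimmed text of every <td> in this row.
        const cells = $(row)
            .find('td')
            .map(function (cellIndex, cell) {
                return $(cell).text().trim();
            })
            .get();
        // Skip rows without <td> cells (for example, header rows).
        if (cells.length > 0) {
            rows.push(cells);
        }
    });
    return rows;
}

// Inside a Postman test script, pm and cheerio are provided by the sandbox.
if (typeof pm !== 'undefined' && typeof cheerio !== 'undefined') {
    console.log(parseRaceRows(cheerio.load(pm.response.text())));
}
```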

We first get the plain text from the HTML tag. Inside the callback, $(this) gives us the current tag. For example, if the loop is at the first <td> tag, say <td>03-Jan-2021</td>, then we need only 03-Jan-2021, which can be retrieved with the code below.

const textData = $(this).text().trim();

After this, all we need to do is add the textData value to our array so that we can use it as needed. I have written the parser so that it produces a list of lists; you can reshape it as per your requirement.
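To get from the list of lists to JSON, each row can be mapped onto named fields. The sketch below is only an illustration: `rowsToObjects` is a hypothetical helper, and apart from the date in the first column, the column names are placeholders because the actual headings depend on the page. The sample row is made-up data.

```javascript
// Hypothetical helper: pair each cell in a row with the key at the same position.
function rowsToObjects(rows, keys) {
    return rows.map(function (row) {
        const record = {};
        keys.forEach(function (key, position) {
            record[key] = row[position];
        });
        return record;
    });
}

// Placeholder column names; replace them with the real table headings.
const columnKeys = ['date', 'column2', 'column3', 'column4', 'column5'];

// Made-up sample row in the same shape the parser produces.
const sampleRows = [['03-Jan-2021', 'b', 'c', 'd', 'e']];

const races = rowsToObjects(sampleRows, columnKeys);
console.log(JSON.stringify(races, null, 2));
```

In the test script itself, you would pass raceList in place of sampleRows.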

Console result from the parser which we have written.

Here is the full working code, which you can paste inside the Tests tab; check the console for output similar to the example above:

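// cheerio is provided by the Postman sandbox; load the raw HTML of the response.
const $ = cheerio.load(pm.response.text());

const raceList = []; // one entry per table row
let rowData = [];    // cells of the row currently being collected

// Walk every <td> inside <table class="hovertable">.
$('table.hovertable tbody td').each(function (index) {
    const textData = $(this).text().trim();
    rowData.push(textData);
    // Every 5th cell closes a row.
    if ((index + 1) % 5 === 0) {
        raceList.push(rowData);
        rowData = [];
    }
});
console.log(raceList);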

The beauty of this setup is that a Postman monitor can run it on Postman's cloud, so you don't need to deploy any additional resource to keep it running. You can use a Postman monitor to schedule the collection and let it execute as per the schedule.