Using grep, curl, and tail to scrape data from a Web page
Posted on: Sunday, Feb 04, 2018
Another post on this blog provides a Bash script that automates the installation of the most recent version of Firefox Developer Edition (FFDE). The original version of that script required the manual input of FFDE's latest version number. Looking up that number was a hassle, to say the least, and it added friction to a process that should be simple and fast.
Rather than require you to look up the most recent version number and then provide that value as an argument to the Bash script, the script now uses three of the really handy utilities that lurk within Linux.
curl, grep, and tail work together to fetch the most recent version number from the FFDE downloads page. This post goes into the detail of how that script uses these Linux utilities to get the latest FFDE version number. With these in place, running that script is quite simple now.
You can read more about each utility:
curl - transfer the contents of a URL
grep - find lines matching a pattern
tail - output the last part of files
Scraping data from a Web page
Mozilla provides a "releases" download page that lists the available versions of FFDE. The most recent version number is the last one in the list. Visit the releases page to see it; there isn't much to it beyond a list of version numbers.
Follow each of these steps by copying the command shown and pasting it into a terminal session to run it.
In an open terminal session, pull down the FFDE releases page's HTML with curl. The script needs that HTML in a text file, so it uses curl's -o flag to specify an output file:
curl -o releases.txt https://download-installer.cdn.mozilla.net/pub/devedition/releases/
With the releases.txt file available, we'll run grep against that file to extract the version numbers from it. To do so, grep uses a simple regular expression that matches an FFDE version number (59.0b6, for example), where [0-9] specifies a single numeric digit, \. looks for a single period (unescaped, the . means any character to regex), and [a-z] specifies a lowercase letter from a to z. The -o flag tells grep to print only the matching text rather than the whole line:
grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt
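To see what the pattern does and does not match, here is a small sketch that runs it against two made-up lines shaped like links on a releases page. The sample markup is an assumption, not copied from Mozilla's page. The sketch also tries a looser extended-regex variant, which is not part of the original script, to show how a two-digit beta number such as 59.0b10 would be handled.

```shell
# Sample lines are made up for illustration; the real page's markup may differ.
sample='<a href="/pub/devedition/releases/59.0b6/">59.0b6/</a>
<a href="/pub/devedition/releases/59.0b10/">59.0b10/</a>'

# The post's pattern: two digits, a period, one digit, one letter, one digit.
strict=$(printf '%s\n' "$sample" | grep -o '[0-9][0-9]\.[0-9][a-z][0-9]')

# A looser variant using extended regex (-E), where + means "one or more".
loose=$(printf '%s\n' "$sample" | grep -oE '[0-9]+\.[0-9]+[a-z][0-9]+')

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
```

With the post's pattern, 59.0b10 is cut short to 59.0b1 because the pattern allows only one digit after the letter. If version numbers of that shape ever show up on the page, the looser variant may be a better fit.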
That command prints every version number on the page (each one appears twice), but we only need the last number in the list, which is the most recent version. To get it, grep pipes its output into the tail utility, where -1 asks for only the last line:
grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt | tail -1
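Note that tail picks the last line by position, so this approach relies on the page listing the newest version last. As a hedged alternative (not what the original script does), GNU sort's -V option orders lines as version numbers before tail picks one. The version list below is made up and deliberately out of order to show the difference.

```shell
# Made-up version list, deliberately not in release order.
versions='59.0b10
59.0b10
59.0b6
59.0b6
58.0b9
58.0b9'

# tail alone returns whatever line happens to come last.
by_position=$(printf '%s\n' "$versions" | tail -n 1)

# sort -V (GNU coreutils) sorts as version numbers, so the newest ends up last.
by_version=$(printf '%s\n' "$versions" | sort -V | tail -n 1)

echo "last by position: $by_position"
echo "last by version:  $by_version"
```

Here tail -n 1 is the current spelling of tail -1; both print only the final line.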
The last bit of this step is to get the most recent version number into a Bash variable for use in a Bash script. This is done with Bash's command substitution operator, $(...):
VERSION=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt | tail -1) && echo $VERSION
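If the page layout changes and the pattern matches nothing, the substitution quietly leaves VERSION empty. A minimal sketch of guarding against that follows. The helper name latest_version and the sample file are hypothetical and are not part of the original script.

```shell
# latest_version is a hypothetical helper name used only for this sketch.
latest_version() {
  # $1: path to a file holding the releases page HTML
  found=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' "$1" | tail -n 1)
  # An empty result means the pattern matched nothing, so report failure.
  [ -n "$found" ] || return 1
  printf '%s\n' "$found"
}

# Made-up sample content standing in for the downloaded page.
printf '%s\n' '<a href="59.0b6/">59.0b6/</a>' > sample-releases.txt

if VERSION=$(latest_version sample-releases.txt); then
  echo "Latest version: $VERSION"
else
  echo "No version number found" >&2
fi

rm sample-releases.txt
```

Because the assignment sits inside the if condition, the script can branch on whether a version number was actually found before trying to use it.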
While that was a long explanation, it distills down to three lines (including a line to delete the releases.txt file):
curl -o releases.txt https://download-installer.cdn.mozilla.net/pub/devedition/releases/
VERSION=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt | tail -1)
rm releases.txt
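As a variation (an assumption on my part, not how the original script is written), the temporary file can be skipped by piping curl's output straight into grep. The function name fetch_latest_version is hypothetical.

```shell
# fetch_latest_version is a hypothetical helper name used only for this sketch.
fetch_latest_version() {
  # $1: URL of the releases page
  # -f: fail on HTTP errors; -sS: hide the progress meter but still show errors
  curl -fsS "$1" |
    grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' |
    tail -n 1
}

# Example call (needs network access, so it is left commented out):
# VERSION=$(fetch_latest_version https://download-installer.cdn.mozilla.net/pub/devedition/releases/)
```

Without -o, curl writes the page to standard output, which is what lets the pipe work. One trade-off is that a failed download simply yields an empty result, so checking that VERSION is non-empty is still worthwhile.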
The general technique here of pulling down a page with curl and then parsing it with grep and tail (and whatever other Linux utilities you need to use) is very handy. Please let me know in the comments what tasks you're using Linux utilities for.