2023-04-03

Processing JSON Data to CSV with JQ

The tool jq is a popular and flexible tool for working with JSON data. Haiku Depot Server is able to provide a large data payload containing all the package data. We’ll set an environment variable with the URL to this package data;

export HDS_URL="https://depot.haiku-os.org/__pkg/all-haikuports_x86_64-en.json.gz"

To look at the data, one can use curl together with jq;

curl -L "${HDS_URL}" | gzip -d | jq . | less

The data is more complex and voluminous, but the interesting parts of the data for this article are shown in the structure below;

{
  "items": [
    {
	  "name": "burnitnow",
	  "derivedRating": 3.5
	},
    {
	  "name": "sanity",
	  "derivedRating": 3.6600000858306885
	},
    {
	  "name": "gish",
	  "derivedRating": 0
	},
    {
	  "name": "zziplib"
	}
  ]
}

The challenge is to use jq to produce a CSV output that contains all packages that have a value for derivedRating ordered by the highest rating first and then the name of the package. Here is how this is possible.

curl -L "${HDS_URL}" \
 | gzip -d \
 | jq -r '["package-name", "rating"], (.items|sort_by(-(.derivedRating//0),.name)|.[]|select(.derivedRating)|[.name, .derivedRating]) | @csv'

Using the test data above the output is;

"package-name","rating"
"sanity",3.6600000858306885
"burnitnow",3.5
"gish",0

The jq expression is composed of a pair of headers coupled with an expression to extract the data rows. The rows are then “piped” into @csv to produce the CSV. Abridged this expression has this form;

[ ...headers... ], ( ... ) | @csv

Broken up the jq expression extracting the data rows is;

.items - Extracts the array under items in the JSON data.
sort_by(-(.derivedRating//0),.name) - Sorts the rows of data by the derivedRating value, or should there be none, 0. A second sorting is on the name.
.[] - I think this means to then process each item in the array individually.
select(.derivedRating) - Include only those tuples with a value for derivedRating.
[.name, .derivedRating] - Map each tuple to an array with the name and the derivedRating value.

The need to summarise and work with JSON data comes up quite frequently and jq is a useful tool to be able to work with raw data in this format.