@ThomasG77
Last active April 30, 2024 15:17
Recipe to get paginated JSON using command-line tools, e.g. curl, jq, bc, cat

curl + jq (with the slurp option) + a bash loop do the job

An example related to a question on Twitter: https://twitter.com/drewdaraabrams/status/1359933543619547137

Try curl https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN

Among the results, look at:

"total_results": 161,
"total_pages": 17,
"per_page": 10,
"page": 1,

Then, try an option to get everything in one call: curl https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=170

"total_results": 161,
"total_pages": 2,
"per_page": 100,
"page": 1,

We deduce the API caps per_page at 100
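
A quick way to confirm the cap is to extract only that field:

curl -s "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=170" | jq .per_page
# prints 100: the API silently lowers any larger requested page size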

Let's compromise and get 4 pages of 50 elements each. We could do the same with a page size of 100, but that would make only 2 HTTP calls; for the demo, we lower it to make 4 HTTP calls.
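
The number of calls is the ceiling of total_results divided by per_page; the bc trick used in the script below computes it, e.g. for 161 results at 50 per page:

echo "if ( 161%50 ) 161/50+1 else 161/50" | bc
# 161/50 truncates to 3; the remainder (11) forces an extra page, so bc prints 4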

Get all pages manually, just to see the pattern of the URL calls
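
With per_page=50 and 161 total results, that means four calls (quote the URL: unquoted, the shell would treat & as "run in background"):

curl "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=50&page=1"
curl "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=50&page=2"
curl "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=50&page=3"
curl "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=50&page=4"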

Now, change the recipe to automate it

pageoffset=50
# Get page 1 first, to know how many pages there are to fetch
result1=$(curl -s "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=${pageoffset}&page=1")

# Derive the page count, hence the number of API calls, and the variables used when looping through each page
tot_result=$(echo "$result1" | jq -r .total_results)
tot_page=$(echo "$result1" | jq -r .total_pages) # total_pages is available here; you may need to calculate it if the API only returns total_results
calculated_tot_page=$(echo "if ( $tot_result%$pageoffset ) $tot_result/$pageoffset+1 else $tot_result/$pageoffset" | bc)

# Fetch each page and save it as a separate file
for ((i=1;i<=tot_page;i++)); do
    curl -s "https://entreprise.data.gouv.fr/api/sirene/v1/full_text/MONTPELLIERAIN?per_page=${pageoffset}&page=${i}" | jq --slurp '.[0].etablissement[]' >| "/tmp/content${i}.json"
    sleep 0.3 # 0.3s delay between calls to avoid being kicked due to rate limitations
done

# Merge your JSONs, "et voilà !" as we say in French
cat /tmp/content*.json | jq -s . >| /tmp/out.json
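
To sanity-check the merge, count the elements of the final array; it should match total_results:

jq length /tmp/out.json
# expected output: 161, the total_results value reported by the API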

mochadwi commented Oct 24, 2023

Thanks @ThomasG77, I'm using this approach to back up my private repository; it's a quick and easy way to export data for now.
