@mjcollin
Last active April 10, 2018 18:09
# All records in iDigBio in any of the three orders Neuroptera, Megaloptera, or Raphidioptera
http://search.idigbio.org/v2/search/records/?rq={"order":["neuroptera","megaloptera","raphidioptera"]}
# Counts of iDigBio records in each of the orders Neuroptera, Megaloptera, or Raphidioptera
http://search.idigbio.org/v2/summary/top/records/?rq={"order":["neuroptera","megaloptera","raphidioptera"]}&top_fields="order"
# Records that contain "centroid" or "county" in the georeference protocol field (data.dwc:georeferenceProtocol)
https://search.idigbio.org/v2/search/records/?rq={"data.dwc:georeferenceProtocol":["county","centroid"]}
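
The URLs above embed raw JSON in the query string, which browsers tolerate but many HTTP clients do not. A minimal Python sketch of building the same URLs with percent-encoding follows. The helper name `build_url` is hypothetical, and JSON-encoding every extra parameter (so that `top_fields` becomes `"order"` with its quotes, as in the URLs above) is an assumption about what the API accepts.

```python
import json
from urllib.parse import urlencode

BASE = "https://search.idigbio.org/v2"


def build_url(endpoint, rq, **params):
    """Return a percent-encoded iDigBio URL (hypothetical helper).

    rq is the record query as a Python dict. Extra keyword arguments are
    JSON-encoded too, so top_fields="order" is sent as "order" with quotes,
    matching the hand-written URLs in this gist.
    """
    query = {"rq": json.dumps(rq, separators=(",", ":"))}
    for name, value in params.items():
        query[name] = json.dumps(value)
    return f"{BASE}/{endpoint}/?{urlencode(query)}"


orders_rq = {"order": ["neuroptera", "megaloptera", "raphidioptera"]}
search_url = build_url("search/records", orders_rq)
summary_url = build_url("summary/top/records", orders_rq, top_fields="order")
```

Pasting `search_url` into a browser, or passing it to any HTTP client, should be equivalent to the first URL in this file.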
@debpaul

debpaul commented Sep 15, 2017

Thanks @mjcollin I'll have a look at the data now.

@debpaul

debpaul commented Apr 4, 2018

Hey @mjcollin, now I want to get the metadata (essentially the citation attribution file in a DwC A that we provide) w/o downloading a million records.
Search: Family = Andrenidae, Apidae, Colletidae, Halictidae, Megachilidae, Melittidae, Stenotritidae

So count
http://search.idigbio.org/v2/summary/top/records/?rq={"family":["andrenidae", "apidae", "colletidae", "halictidae", "megachilidae", "melittidae", "stenotritidae"]}&top_fields="order"

returns
{"order":{"hymenoptera":{"itemCount":1939034},"hemiptera":{"itemCount":1}},"itemCount":1947675}
which matches (thank goodness) item count 1947675 when I search the public UI by these 7 families. as in

http://search.idigbio.org/v2/search/records/?rq={"family":["andrenidae", "apidae", "colletidae", "halictidae", "megachilidae", "melittidae", "stenotritidae"]}

How to switch query to summary metadata for the recordsets involved in the above ?
You said I could get the recordset UUIDs. Hm.
I need to know the recordsets (and then collections) and number of records per recordset.

THEN, I want to try and REDO this limiting results to show only US Specimens (which may prove quite difficult w/o a sophisticated bounding box for the USA)? OR will I be able to take advantage of indexed data? OR do I have to construct a search with all the USA variations across these recordsets?

Thanks for guidance! D

@mjcollin
Author

mjcollin commented Apr 4, 2018

Step 1, change the summary field to recordset and set the limit on the number of results large enough that you will see all of them: http://search.idigbio.org/v2/summary/top/records/?rq={"family":["andrenidae", "apidae", "colletidae", "halictidae", "megachilidae", "melittidae", "stenotritidae"]}&top_fields="recordset"&count=1000

Now you have a list of the ~100 recordsets and, for each one, a count of the records in those families.
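
A minimal sketch of turning a summary response into a per-bucket table, assuming the recordset summary has the same shape as the `order` summary quoted earlier in this thread (one object per bucket, each with an `itemCount`, plus a top-level `itemCount`). The function name `bucket_counts` is hypothetical, and the sample is the `order` response from above, not real recordset output.

```python
import json


def bucket_counts(summary, field):
    """Return (rows, remainder) for one summarized field.

    rows is a list of (key, count) pairs sorted by count, largest first.
    remainder is the number of matching records not covered by any listed
    bucket, for example records with no value for the field or buckets
    cut off by the count limit.
    """
    buckets = summary.get(field, {})
    rows = sorted(
        ((key, info["itemCount"]) for key, info in buckets.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    remainder = summary["itemCount"] - sum(count for _, count in rows)
    return rows, remainder


# Sample copied from the "order" summary quoted earlier in the thread.
sample = json.loads(
    '{"order":{"hymenoptera":{"itemCount":1939034},'
    '"hemiptera":{"itemCount":1}},"itemCount":1947675}'
)
rows, remainder = bucket_counts(sample, "order")
```

For the recordset summary, call `bucket_counts(response, "recordset")`; the keys should then be recordset UUIDs. A non-zero remainder is a hint that `count` was too small or that some records have no value for the field.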

There is no recordset -> collection mapping (well, Joanna might have one but it's manually maintained based on her interpretations) so you now need to read the metadata for each recordset and decide what entity from that metadata you want to credit.

You will have to decide what geographic criteria best meet your research purpose. Some options:

  1. Use a bounding box - records with no lat lon assigned will be left out but it's exactly right for distribution modeling (don't forget about US possessions and make a decision on international waters boundaries :) )
  2. Add country = us to your record query - will leave fewer out but you're heavily dependent on the data cleaning to correctly get things marked as "us" instead of all the variations
  3. Add data.dwc:country = (stuff) to your query and do your own data cleaning. (This is the approach I used for "korea" as that was a very ambiguous term.)
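
The three options can be sketched as record queries like this. Several things here are assumptions and not confirmed in this thread: the `geo_bounding_box` type with its `top_left` and `bottom_right` keys, the rough corner coordinates (contiguous US only), and the list of raw country spellings, which is only illustrative. The value `"us"` is taken as written in option 2; the indexed value may be spelled differently.

```python
import json

BEE_FAMILIES = [
    "andrenidae", "apidae", "colletidae", "halictidae",
    "megachilidae", "melittidae", "stenotritidae",
]


def with_filter(base_rq, extra):
    """Return a copy of base_rq with the extra query terms added."""
    merged = dict(base_rq)
    merged.update(extra)
    return merged


base = {"family": BEE_FAMILIES}

# Option 1: bounding box (ASSUMED query syntax, placeholder coordinates).
rq_bbox = with_filter(base, {
    "geopoint": {
        "type": "geo_bounding_box",
        "top_left": {"lat": 49.5, "lon": -125.0},
        "bottom_right": {"lat": 24.0, "lon": -66.5},
    }
})

# Option 2: rely on the cleaned, indexed country field ("us" as written above).
rq_indexed = with_filter(base, {"country": "us"})

# Option 3: match raw values yourself (ILLUSTRATIVE spellings only).
rq_raw = with_filter(base, {"data.dwc:country": ["usa", "u.s.a.", "united states"]})

for name, rq in [("bbox", rq_bbox), ("indexed", rq_indexed), ("raw", rq_raw)]:
    print(name, json.dumps(rq))
```

Each printed JSON string can be used as the `rq=` value in the summary URL from step 1, so the per-recordset counts can be compared across the three definitions of "US".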

@debpaul

debpaul commented Apr 4, 2018

Thanks! Yay.
Next (beyond limiting to US), it's been suggested that I have a look at how many of these (US) records DO NOT have a map point.
Can we do the opposite of "must have geopoint"? where !geopoint exists? Additionally, I'd like to know where the geopoint information has been withheld (and so the dwc:informationWithheld field might prove useful here).

@debpaul

debpaul commented Apr 10, 2018

http://search.idigbio.org/v2/search/records/?rq={%22geopoint%22:{%22type%22:%22missing%22}}
Okay, so I need to do this with the 7 bee families, limited to the US (bounding box?), once where geopoint exists and again where it is missing.
I will need help. The bounding box feature is not working in the portal (at least not for me); I get a "NaN" error returned.
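
A minimal sketch of the two queries, combining the 7 families with the `{"type":"missing"}` geopoint filter from the URL above. The `{"type":"exists"}` counterpart and the `"country": "us"` term are assumptions here, and the function names are hypothetical. One thing to note: a bounding box and "geopoint missing" both constrain the same `geopoint` field, and a record with no coordinates cannot fall inside a box, so the missing-geopoint count probably needs a country-based US filter instead.

```python
import json
from urllib.parse import quote

BEE_FAMILIES = [
    "andrenidae", "apidae", "colletidae", "halictidae",
    "megachilidae", "melittidae", "stenotritidae",
]


def geopoint_queries(families, extra=None):
    """Build paired record queries: one with a geopoint, one without.

    extra holds additional non-geopoint terms, such as a country filter.
    """
    base = {"family": list(families)}
    if extra:
        base.update(extra)
    return {
        "with_geopoint": {**base, "geopoint": {"type": "exists"}},  # ASSUMED type
        "without_geopoint": {**base, "geopoint": {"type": "missing"}},
    }


def search_url(rq):
    """Percent-encode a record query into a search URL."""
    return "https://search.idigbio.org/v2/search/records/?rq=" + quote(json.dumps(rq))


# "us" is the value suggested earlier in the thread; the indexed spelling may differ.
queries = geopoint_queries(BEE_FAMILIES, extra={"country": "us"})
urls = {name: search_url(rq) for name, rq in queries.items()}
```

The top-level `itemCount` of each response gives the two totals; swapping the endpoint for `summary/top/records` with `top_fields="recordset"` would break them down per recordset.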
