# All records in iDigBio in any of the three orders Neuroptera, Megaloptera, or Raphidioptera
http://search.idigbio.org/v2/search/records/?rq={"order":["neuroptera","megaloptera","raphidioptera"]}
# Counts of iDigBio records in each of the orders Neuroptera, Megaloptera, or Raphidioptera
http://search.idigbio.org/v2/summary/top/records/?rq={"order":["neuroptera","megaloptera","raphidioptera"]}&top_fields="order"
# Records that contain "centroid" or "county" in the raw georeferenceProtocol field
https://search.idigbio.org/v2/search/records/?rq={"data.dwc:georeferenceProtocol":["county","centroid"]}
Hey @mjcollin, now I want to get the metadata (essentially the citation attribution file in a DwC-A that we provide) w/o downloading a million records.
Search: Family = Andrenidae, Apidae, Colletidae, Halictidae, Megachilidae, Melittidae, Stenotritidae
So count
http://search.idigbio.org/v2/summary/top/records/?rq={"family":["andrenidae", "apidae", "colletidae", "halictidae", "megachilidae", "melittidae", "stenotritidae"]}&top_fields="order"
returns
{"order":{"hymenoptera":{"itemCount":1939034},"hemiptera":{"itemCount":1}},"itemCount":1947675}
which matches (thank goodness) the item count of 1947675 that I get when I search the public UI by these 7 families, as in
http://search.idigbio.org/v2/search/records/?rq={"family":["andrenidae", "apidae", "colletidae", "halictidae", "megachilidae", "melittidae", "stenotritidae"]}
How do I switch the query to get summary metadata for the recordsets involved in the above?
You said I could get the recordset UUIDs. Hm.
I need to know the recordsets (and then collections) and number of records per recordset.
THEN, I want to try to REDO this, limiting results to show only US specimens (which may prove quite difficult w/o a sophisticated bounding box for the USA). Will I be able to take advantage of indexed data? OR do I have to construct a search with all the USA variations across these recordsets?
Thanks for guidance! D
Step 1, change the summary field to recordset and set the limit on the number of results large enough that you will see all of them: http://search.idigbio.org/v2/summary/top/records/?rq={"family":["andrenidae", "apidae", "colletidae", "halictidae", "megachilidae", "melittidae", "stenotritidae"]}&top_fields="recordset"&count=1000
Now you have a list of the ~100 recordsets and a count of the number of records in those families in each recordset.
There is no recordset -> collection mapping (well, Joanna might have one but it's manually maintained based on her interpretations) so you now need to read the metadata for each recordset and decide what entity from that metadata you want to credit.
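For anyone scripting Step 1 rather than pasting URLs, here is a minimal Python sketch. It assumes the summary response for `top_fields="recordset"` has the same shape as the `order` response quoted earlier in this thread (a dict keyed by the field name, with an `itemCount` per bucket). The helper names `summary_top_url` and `bucket_counts` are made up for illustration and are not part of any iDigBio client.

```python
import json
from urllib.parse import urlencode

BASE = "https://search.idigbio.org/v2"

BEE_FAMILIES = ["andrenidae", "apidae", "colletidae", "halictidae",
                "megachilidae", "melittidae", "stenotritidae"]


def summary_top_url(rq, top_field, count=1000):
    """Build a summary/top URL like the one in Step 1 (hypothetical helper)."""
    params = {
        # Compact separators keep spaces out of the encoded JSON.
        "rq": json.dumps(rq, separators=(",", ":")),
        # The thread passes top_fields as a quoted JSON string, e.g. "recordset".
        "top_fields": json.dumps(top_field),
        "count": count,
    }
    return f"{BASE}/summary/top/records/?{urlencode(params)}"


def bucket_counts(response, top_field):
    """Return {bucket: itemCount}, largest first, from a summary/top response."""
    buckets = response.get(top_field, {})
    counts = {key: value["itemCount"] for key, value in buckets.items()}
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


# The "order" response quoted earlier in this thread, used here as sample input.
sample = {"order": {"hymenoptera": {"itemCount": 1939034},
                    "hemiptera": {"itemCount": 1}},
          "itemCount": 1947675}

url = summary_top_url({"family": BEE_FAMILIES}, "recordset")
order_counts = bucket_counts(sample, "order")
# Records in the total that do not appear in any listed bucket.
unbucketed = sample["itemCount"] - sum(order_counts.values())
```

Fetching `url` and passing the decoded JSON to `bucket_counts(response, "recordset")` should give one count per recordset UUID, which is the list you then read metadata for by hand.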
You will have to decide what geographic criteria best meet your research purpose. Some options:
- Use a bounding box - records with no lat lon assigned will be left out but it's exactly right for distribution modeling (don't forget about US possessions and make a decision on international waters boundaries :) )
- Add country = us to your record query - will leave fewer out but you're heavily dependent on the data cleaning to correctly get things marked as "us" instead of all the variations
- Add data.dwc:country = (stuff) to your query and do your own data cleaning. (This is the approach I used for "korea" as that was a very ambiguous term.)
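The three options above can be sketched as three variations on the same `rq` dict. This is a sketch under assumptions, not a tested recipe: the `geo_bounding_box` shape is my understanding of the iDigBio query format, the box coordinates are rough placeholders for the contiguous US only (no Alaska, Hawaii, or possessions), the value `"us"` is copied from the comment above and should be checked against what the index actually stores, and the raw-country variants are invented examples.

```python
import json
from urllib.parse import urlencode

SEARCH_URL = "https://search.idigbio.org/v2/search/records/"

BASE_RQ = {"family": ["andrenidae", "apidae", "colletidae", "halictidae",
                      "megachilidae", "melittidae", "stenotritidae"]}


def with_bounding_box(rq, north, west, south, east):
    """Option 1: keep only records whose geopoint falls inside a box."""
    box = {"type": "geo_bounding_box",  # assumed query-format keyword
           "top_left": {"lat": north, "lon": west},
           "bottom_right": {"lat": south, "lon": east}}
    return {**rq, "geopoint": box}


def with_indexed_country(rq, country):
    """Option 2: rely on the cleaned, indexed country field."""
    return {**rq, "country": country}


def with_raw_country(rq, variants):
    """Option 3: match raw data.dwc:country strings and clean them yourself."""
    return {**rq, "data.dwc:country": list(variants)}


def search_url(rq):
    """Encode an rq dict into a search URL."""
    return SEARCH_URL + "?" + urlencode({"rq": json.dumps(rq, separators=(",", ":"))})


# Placeholder box: roughly the contiguous US, for illustration only.
box_rq = with_bounding_box(BASE_RQ, north=49.5, west=-125.0, south=24.5, east=-66.5)
country_rq = with_indexed_country(BASE_RQ, "us")
# Hypothetical spellings; the real list has to come from inspecting the data.
raw_rq = with_raw_country(BASE_RQ, ["USA", "U.S.A.", "United States"])

urls = [search_url(rq) for rq in (box_rq, country_rq, raw_rq)]
```

Each helper returns a new dict and leaves `BASE_RQ` alone, so the same family filter can be reused for all three comparisons.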
Thanks! Yay.
Next (beyond limiting to US), it's been suggested that I have a look at how many of these (US) records DO NOT have a map point.
Can we do the opposite of "must have geopoint"? where !geopoint exists? Additionally, I'd like to know where the geopoint information has been withheld (and so the dwc:informationWithheld field might prove useful here).
http://search.idigbio.org/v2/search/records/?rq={%22geopoint%22:{%22type%22:%22missing%22}}
Okay, so I need to do this with 7 bee families, US (bounding box?), where geopoint exists, and again where it is missing.
I will need help. The bounding box feature is not working in the portal (at least not for me). I get a "NaN" error returned.
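One way to set up the two counts outside the portal is sketched below in Python. A bounding box can only match records that already have a geopoint, so it cannot be combined with the "missing" case. This sketch therefore uses the indexed `country` filter for both runs. The `"missing"` type comes from the URL above. The `"exists"` type is my assumption of its counterpart, and `"us"` is the value suggested earlier in the thread, not something I have verified against the index.

```python
import json
from urllib.parse import urlencode

SUMMARY_URL = "https://search.idigbio.org/v2/summary/top/records/"

BEE_FAMILIES = ["andrenidae", "apidae", "colletidae", "halictidae",
                "megachilidae", "melittidae", "stenotritidae"]


def bee_query(country, has_geopoint):
    """Build an rq dict: bee families, one country value, geopoint present or absent."""
    return {
        "family": BEE_FAMILIES,
        "country": country,
        # "missing" is from the thread; "exists" is assumed to be its opposite.
        "geopoint": {"type": "exists" if has_geopoint else "missing"},
    }


def per_recordset_url(rq, count=1000):
    """Summary URL that counts matching records per recordset."""
    params = {
        "rq": json.dumps(rq, separators=(",", ":")),
        "top_fields": json.dumps("recordset"),
        "count": count,
    }
    return SUMMARY_URL + "?" + urlencode(params)


mapped_url = per_recordset_url(bee_query("us", has_geopoint=True))
unmapped_url = per_recordset_url(bee_query("us", has_geopoint=False))
```

Comparing the per-recordset counts from `mapped_url` and `unmapped_url` would show which recordsets contribute most of the records without a map point.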
Thanks @mjcollin I'll have a look at the data now.