@jedsundwall
jedsundwall / cng-meaning-survey.md
Last active November 9, 2023 04:36
Community responses to the question "What does the term “cloud-native geospatial” mean to you?"
  • FAIR data principles and distributed computing
  • data is stored in the cloud and can be directly read or queried with http read requests
  • Exploiting large geospatial datasets in the cloud in an optimized way by transmitting as few bytes as possible.
  • efficient with cloud storage
  • Ability to scale up/out geospatial analyses to cloud scale more easily
  • Big Data, tiled processing, STAC, portable/scalable workflows, COG
  • Technologies that are designed to work well in the cloud.
  • Less configuration
  • The data (and analytics - which is not yet achieved) moves from Desktop computers to clouds (plural), where they can be accessed using cloud services by expert but also non-expert users.
  • To work on cloud without lift and shift (I.e. spinning up a VM on cloud)

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any instance on Amazon EC2, both via S3 and HTTP.

As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.

  • [ARC] Archived Crawl #1 - s3://commoncrawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://commoncrawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://commoncrawl/parse-output/ - crawl data from 2012
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
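A minimal sketch of how the locations listed above could be split into a bucket name and key prefix before handing them to an S3 client. The helper name `split_s3_uri` and the `CRAWL_LOCATIONS` list are hypothetical names introduced for illustration; the sketch makes no network calls and assumes nothing about how the data is laid out beneath each prefix.

```python
from urllib.parse import urlparse

# Locations copied from the list above.
CRAWL_LOCATIONS = [
    "s3://commoncrawl/crawl-001/",
    "s3://commoncrawl/crawl-002/",
    "s3://commoncrawl/parse-output/",
    "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/",
]


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the S3 key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    for location in CRAWL_LOCATIONS:
        bucket, prefix = split_s3_uri(location)
        print(bucket, prefix)
```

The resulting bucket and prefix can then be passed to whichever S3 client or listing tool you prefer.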
jedsundwall / pds-cf.json
Created August 19, 2016 19:34
CloudFormation template to create AWS Public Dataset
{
"AWSTemplateFormatVersion": "2010-09-09",
"Description": "This template creates the AWS infrastructure to publish a public data set on S3. It creates an S3 bucket for the dataset, an S3 bucket for access logs, and a policy that allows the Amazon Public Data Set program to read the logs and the public to read the dataset.",
"Outputs": {},
"Parameters": {
"DataSetName": {
"AllowedPattern": "[a-z0-9\\.\\-_]*",
"ConstraintDescription": "may only contain lowercase letters, numbers, and ., -, or _ characters",
"Description": "The name of the dataset's S3 bucket. This will be used to create the dataset and log S3 bucket.",
"MaxLength": "250",
{
"Version": "2012-10-17",
"Id": "BUCKET_NAME-pds-policy",
"Statement": [
{
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:List*",
"s3:Get*"
jedsundwall / gfs_and_hrrr_on_aws.md
Last active April 16, 2019 19:59
NOAA GFS and HRRR Model data on AWS
jedsundwall / gist:7b5ea0a33cc3ca0b9764f7090a59858a
Last active October 29, 2019 14:14
Setting up a Public AWS SNS Topic

How to create a publicly-accessible SNS topic that sends messages when objects are added to a public Amazon S3 bucket.

1. Create something within AWS that triggers notifications.

In this case, that's an S3 bucket that is continually updated by the addition of new sensor data. For the purposes of this tutorial, we’ll use s3://noaa-nexrad-level2 – one of our NEXRAD on AWS buckets – as an example.

2. Create an SNS topic and appropriate policy.

The SNS topic should be in the same region as the bucket. It will need to have a policy that allows our S3 bucket to publish to it, and anyone to subscribe to it using Lambda or SQS.
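The policy described in step 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the policy actually attached to the NEXRAD topic: the function name `build_topic_policy`, the statement IDs, and the example topic ARN (including its account number and region) are hypothetical. The sketch only builds the JSON document; attaching it to a topic is left to the console or an AWS SDK.

```python
import json


def build_topic_policy(topic_arn, bucket_name):
    """Build an SNS topic policy: the bucket may publish, anyone may subscribe via SQS or Lambda."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Let S3 publish, but only on behalf of the named bucket.
                "Sid": "AllowBucketToPublish",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "sns:Publish",
                "Resource": topic_arn,
                "Condition": {
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket_name}"}
                },
            },
            {
                # Let any AWS account subscribe, restricted to SQS and Lambda endpoints.
                "Sid": "AllowPublicSubscribe",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": ["sns:Subscribe", "sns:Receive"],
                "Resource": topic_arn,
                "Condition": {
                    "StringEquals": {"sns:Protocol": ["sqs", "lambda"]}
                },
            },
        ],
    }


if __name__ == "__main__":
    # The ARN below is a placeholder, not a real topic.
    policy = build_topic_policy(
        "arn:aws:sns:us-east-1:123456789012:example-topic",
        "noaa-nexrad-level2",
    )
    print(json.dumps(policy, indent=2))
```

Restricting `sns:Protocol` to `sqs` and `lambda` keeps the topic open to subscribers while ruling out email, SMS, and HTTP endpoints.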

jedsundwall / mkdirs.sh
Created February 8, 2016 18:23
If you have a "urls.txt" file with a list of directory paths (one per row), this script will go through it and create directories for you. Just put it and the urls.txt into the directory where you'd like to create the paths, navigate there in the terminal and run "sh mkdirs.sh".
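# Read urls.txt line by line and create each path, including any missing parents.
while IFS= read -r p; do
  # Quote the path so names containing spaces are not split into several directories.
  mkdir -p "$p"
done < urls.txt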
<markdown>
Post content
</markdown>
jedsundwall / Best Of Music You Can Dance To In 2012 — Mix by Joakim Tracklist.md
Last active January 1, 2016 21:09
The track list to an amazing 2 hour mix of music from 2012 (and thereabouts) by Joakim Bouaziz. Formerly hosted on Soundcloud, it was taken down, likely due to copyright infringement. Tragic. If you have a copy of this mix, please email me: jedidiah at gmail.

Best Of Music You Can Dance To In 2012

Mix by Joakim Bouaziz

Tracklist

  • Jai Paul – Jasmine
  • Surahn – Take Your Time
  • Tanner Ross – Straight To Mexico
  • Kindness – House
  • Zombie Zombie – The Wisdom of Stones
jedsundwall / usagov-taxonomy.json
Created August 21, 2013 20:46
Taxonomy used by USA.gov to categorize U.S. government information.
[{"tag_id":"agriculture","tag_text":"agriculture"},{"tag_id":"airtravel","tag_text":"air travel"},{"tag_id":"arts","tag_text":"arts"},{"tag_id":"banking","tag_text":"banking"},{"tag_id":"benefits","tag_text":"benefits"},{"tag_id":"betterbusinessbureaus","tag_text":"better business bureaus"},{"tag_id":"biology","tag_text":"biology"},{"tag_id":"business","tag_text":"business"},{"tag_id":"businessdevelopment","tag_text":"business development"},{"tag_id":"career","tag_text":"career"},{"tag_id":"cars","tag_text":"cars"},{"tag_id":"challenges","tag_text":"challenges"},{"tag_id":"charities","tag_text":"charities"},{"tag_id":"childcare","tag_text":"child care"},{"tag_id":"children","tag_text":"children"},{"tag_id":"citizenship","tag_text":"citizenship"},{"tag_id":"college","tag_text":"college"},{"tag_id":"commerce","tag_text":"commerce"},{"tag_id":"community","tag_text":"community"},{"tag_id":"communitydevelopment","tag_text":"community development"},{"tag_id":"complaints","tag_text":"complaints"},{"tag_id":"conserva