@hubgit
Last active September 2, 2022 07:05
Processing the Crossref Public Data File

First, download the data files using a BitTorrent client:

aria2c https://academictorrents.com/download/4dcfdf804775f2d92b7a030305fa0350ebef6f3e.torrent

Next, convert the data files to a single newline-delimited JSON file:

deno run --allow-read --allow-write process.ts

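The script writes one JSON object per line (newline-delimited JSON). A minimal sketch of that framing, using hypothetical sample records in place of real Crossref works:

```typescript
// Hypothetical sample records standing in for real Crossref works.
const items = [{ DOI: "10.1234/a" }, { DOI: "10.1234/b" }];

// Each item becomes one line: JSON.stringify(item) followed by "\n".
const ndjson = items.map((item) => JSON.stringify(item) + "\n").join("");

console.log(ndjson.trimEnd());
```

Because every line is a complete JSON document, the output can be processed a line at a time without loading the whole file into memory.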
Finally, copy the output file to a Google Cloud Storage bucket:

gcloud alpha storage cp \
  --cache-control=no-transform \
  --content-encoding=gzip \
  --content-type=application/x-ndjson \
  crossref-data.ndjson.gz \
  gs://crossref-data/2022/crossref-data.ndjson.gz
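Before (or after) uploading, it's worth sanity-checking that the file is valid gzip and that it contains one JSON record per line. A small sketch, using a generated sample file in place of the full crossref-data.ndjson.gz:

```shell
# Build a tiny sample NDJSON file (stand-in for crossref-data.ndjson.gz)
printf '%s\n' '{"DOI":"10.1234/a"}' '{"DOI":"10.1234/b"}' | gzip > sample.ndjson.gz

# Verify the gzip framing and inspect the first record
gzip -t sample.ndjson.gz && echo "gzip OK"
zcat sample.ndjson.gz | head -n 1

# Count records (one per line)
zcat sample.ndjson.gz | wc -l
```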
// process.ts

import { gunzip } from 'https://deno.land/x/compress@v0.4.4/mod.ts'

// create a stream for writing to a gzipped, newline-delimited JSON output file
const output = await Deno.create('crossref-data.ndjson.gz')

const stream = new TransformStream()

const piped = stream.readable
  .pipeThrough(
    new TransformStream({
      // serialize each item as one line of newline-delimited JSON
      transform(chunk, controller) {
        controller.enqueue(JSON.stringify(chunk))
        controller.enqueue('\n')
      },
    })
  )
  .pipeThrough(new TextEncoderStream())
  .pipeThrough(new CompressionStream('gzip'))
  .pipeTo(output.writable)

const writer = stream.writable.getWriter()

// read a naturally sorted list of gzipped JSON files from the data dump directory
const dir = 'April 2022 Public Data File from Crossref'

const naturalSort = new Intl.Collator(undefined, {
  numeric: true,
  sensitivity: 'base',
})

const files = Array.from(Deno.readDirSync(dir))
  .filter((entry) => entry.isFile)
  .map((entry) => entry.name)
  .filter((name) => name.endsWith('.json.gz'))
  .sort((a, b) => naturalSort.compare(a, b))
  .map((name) => `${dir}/${name}`)

// read the JSON array from each file, and write each item to the output stream
console.log(`Reading ${files.length} files…`)

for (const path of files) {
  console.log(path)
  const bytes = await Deno.readFile(path)
  const json = new TextDecoder().decode(gunzip(bytes))
  const data = JSON.parse(json)

  for (const item of data.items) {
    await writer.write(item)
  }
}

await writer.close()

// wait for the pipeline to finish flushing the compressed output file
await piped
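The `Intl.Collator` with `numeric: true` gives the file list a natural sort, so that e.g. `2.json.gz` comes before `10.json.gz`; a plain lexicographic sort would put `10` first. A quick illustration with made-up file names:

```typescript
const naturalSort = new Intl.Collator(undefined, {
  numeric: true,
  sensitivity: "base",
});

const names = ["10.json.gz", "2.json.gz", "1.json.gz"];

// Lexicographic order compares "1" < "2" character by character, so "10" sorts before "2".
console.log([...names].sort()); // → [ "1.json.gz", "10.json.gz", "2.json.gz" ]

// Natural order compares the embedded numbers as numbers.
console.log([...names].sort((a, b) => naturalSort.compare(a, b))); // → [ "1.json.gz", "2.json.gz", "10.json.gz" ]
```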