@hubgit
hubgit / README.md
Last active October 2, 2024 09:21
Remove metadata from a PDF file, using exiftool and qpdf. Note that embedded objects may still contain metadata.

Anonymising PDFs

PDF metadata

Metadata in PDF files can be stored in at least two places:

  • the Info Dictionary, a limited set of key/value pairs
  • XMP packets, which contain RDF statements expressed as XML
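
As a rough illustration of the two locations listed above, the following Python sketch scans the raw bytes of a PDF for a trailer `/Info` reference and for an XMP packet header. It is a heuristic under stated assumptions, not a parser: the function name `find_pdf_metadata` and the sample bytes are hypothetical, and a plain byte scan will miss metadata held in compressed object streams or encrypted files.

```python
import re


def find_pdf_metadata(data: bytes) -> dict:
    """Report which of the two metadata locations appear in raw PDF bytes.

    Heuristic only: compressed or encrypted content is not inspected.
    """
    # The trailer (or cross-reference stream dictionary) points to the
    # Info Dictionary with an indirect reference such as "/Info 1 0 R".
    has_info = re.search(rb"/Info\s+\d+\s+\d+\s+R", data) is not None
    # An XMP packet opens with an "<?xpacket begin=" processing instruction.
    has_xmp = b"<?xpacket begin=" in data
    return {"info_dictionary": has_info, "xmp_packet": has_xmp}


# Hypothetical, hand-written fragment with an Info Dictionary and no XMP packet.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Title (Draft) /Author (A. Person) >>\nendobj\n"
    b"trailer\n<< /Info 1 0 R >>\n%%EOF\n"
)

print(find_pdf_metadata(sample))
```

Running a check like this before and after stripping metadata gives a quick indication of whether either location is still present, although it says nothing about metadata inside embedded objects.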

PDF files

hubgit / index.html
Last active August 21, 2024 10:45
Render the text of a PDF with PDF.js
<!doctype html>
<meta charset="utf-8">
<title>Render the text of a PDF with PDF.js</title>
<style>
.page-container {
box-shadow: 0 1px 3px #444;
position: relative;
font-size: 1px;
line-height: 1;
hubgit / cache-proxy.php
Last active August 8, 2024 17:50
PHP caching proxy
<?php
if ($_SERVER['REQUEST_METHOD'] == 'OPTIONS') {
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: GET, OPTIONS');
header('Access-Control-Allow-Headers: accept, x-requested-with, content-type');
exit();
}
$url = $_GET['url'];
hubgit / html-purifier-strict.php
Created August 18, 2010 15:21
strict purification of HTML using HTMLPurifier
<?php
$url = 'http://en.wikipedia.org/wiki/1,1,1-Trichloroethane'; // example
$config = HTMLPurifier_Config::createDefault();
$config->set('URI.Base', $url); // set the base URL (overrides a <base> element in the HTML head?)
$config->set('URI.MakeAbsolute', true); // make all URLs absolute using the base URL set above
$config->set('AutoFormat.RemoveEmpty', true); // remove empty elements
$config->set('HTML.Doctype', 'XHTML 1.0 Strict'); // valid XML output (?)
$config->set('HTML.AllowedElements', array('p', 'div', 'a', 'br', 'table', 'thead', 'tbody', 'tr', 'th', 'td', 'ul', 'ol', 'li', 'b', 'i'));
hubgit / list-files-in-folder.js
Created September 20, 2012 11:20
List all files in a folder (Google Apps Script)
function listFilesInFolder() {
var folder = DocsList.getFolder("Maudesley Debates");
var contents = folder.getFiles();
var file;
var data;
var sheet = SpreadsheetApp.getActiveSheet();
sheet.clear();
hubgit / SelectField.tsx
Last active July 16, 2024 07:40
Use react-select with Formik
import { FieldProps } from 'formik'
import React from 'react'
import Select, { Option, ReactSelectProps } from 'react-select'
export const SelectField: React.SFC<ReactSelectProps & FieldProps> = ({
options,
field,
form,
}) => (
<Select
[50, 100, 3].toSorted(Intl.Collator('en', { numeric: true }).compare)
hubgit / json-ld.js
Created June 16, 2020 09:16
Fetch, extract, parse, expand, frame and compact JSON-LD
const { JSDOM } = require('jsdom')
const { compact, expand, frame } = require('jsonld')
const url = 'https://www.bbc.co.uk/schedules/p00fzl6p/2020/06/14'
// fetch and parse HTML
const { window: { document } } = await JSDOM.fromURL(url)
// select the script elements containing JSON-LD
const elements = document.querySelectorAll('script[type="application/ld+json"]')
hubgit / pdf-annotations.js
Created July 28, 2015 08:13
Display a PDF and extract annotations
/*global PDFJS:false, console:false, Promise:false */
document.addEventListener('WebComponentsReady', function() {
'use strict';
//PDFJS.workerSrc = '';
PDFJS.disableWorker = true;
PDFJS.disableRange = true;
PDFJS.openExternalLinksInNewWindow = true;
hubgit / fetch-imp-audio.ts
Last active April 1, 2024 22:44
Fetch Independent Music Podcast audio files: `deno run fetch-imp-audio.ts`
import RSSParser from 'npm:rss-parser'
await Deno.mkdir('audio', { recursive: true })
const feedURL = 'https://anchor.fm/s/1252b450/podcast/rss'
const feed = await new RSSParser().parseURL(feedURL)
for (const item of feed.items) {
const { url } = item.enclosure
console.log(url)