@Ndpnt
Last active November 14, 2022 14:07
[POC] Open Terms Archive versions history cleaning script

⚠️ This script is an experimental proof of concept for versions history cleaning

Over the life of an instance, unsatisfactory versions of documents might be extracted from snapshots. For example, they might contain changes unrelated to terms, be empty documents, or switch language. Such unsatisfactory versions decrease the value of the dataset: for example, it becomes impossible to measure the actual number of changes.

Reviewing and cleaning the dataset entails correcting the history of declarations, identifying some snapshots to skip, and extracting new versions from the snapshots based on this information. In the end, the whole versions history will be rewritten and overwritten. The declarations will be completed. All the original snapshots are left unchanged and the previous state of the versions is still available, allowing auditability.

This script recreates a history of versions from existing snapshots and declarations, based on the current configuration.

It lets you review generated versions and correct service declarations if needed. It also lets you skip unexploitable snapshots (empty content, botwall, loginwall, cookiewall, server error, …) and unwanted snapshots: when the content flickers between two alternatives (switch between mobile and desktop pages, switch between languages, …), the snapshots of one alternative can be skipped.
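
As an illustration, the skipping rules declared near the top of the script could be filled in as follows. The values reuse the examples given in the script's own comments; `rulesFor` is a hypothetical helper, added here only to show how rules are looked up per service and document type.

```javascript
// Skip snapshots whose `h1 span` element contains exactly this text
const contentSkippingRules = {
  Instagram: { 'Terms of Service': { 'h1 span': 'Terms of Use' } },
};

// Skip snapshots that contain an element matching one of these selectors
const selectorSkippingRules = {
  Facebook: { 'Privacy Policy': ['body.touch'] },
};

// Skip snapshots that lack an element matching one of these selectors
const missingRequiredSelectorSkippingRules = {
  Facebook: { 'Terms of Service': ['[href="https://www.facebook.com/legal/terms/eecc/flyout"]'] },
};

// Hypothetical helper: return the rules for one document, or undefined when none is declared
function rulesFor(rules, serviceId, documentType) {
  return rules[serviceId] && rules[serviceId][documentType];
}

console.log(rulesFor(selectorSkippingRules, 'Facebook', 'Privacy Policy'));
```

A document with no declared rule yields `undefined`, in which case the snapshot is never skipped on the basis of these rules.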

Process

The script follows this process:

  • Iterate on every snapshot
  • Extract version
    • This will automatically erase refilters. Indeed, refilters are only historical artifacts: they correct a version that should not have been recorded as it was in the first place.
  • If the version cannot be generated:
    • If the snapshot is unexploitable, skip it. A snapshot is unexploitable if it does not contain the tracked document. We have encountered so far:
      • Empty content
      • Botwall
      • Loginwall
      • Cookiewall
      • Server error
      • Exception: if the provider is in a certain manner unable to provide the document to its expected audience, and not only to Open Terms Archive, this should be tracked (e.g. undergoing maintenance)
    • If the snapshot is exploitable, correct declaration. Potential reasons are:
      • Some selector is wrong. Usually, that means the history date for applying that selector is wrong (otherwise the declaration was wrong from the beginning). Take the fetchDate of the last snapshot that does not fail to generate a version as validUntil
  • If the generated version markup differs significantly, remove changes that do not reflect a change in the document content itself.
    • We have encountered so far:
      • Switching list styles (ordered to unordered list)
      • Switching between mobile and desktop pages
      • Switching between geographic region-optimised layouts
      • Switching between languages (934bddb9cdf40e7c53b5c43d0db3dc393e2a2eb4)
      • Switching between different browser-optimised layouts
      • Note: these should happen less and less as:
        • The Core is optimised to minimise such changes (single user agent)
        • Deployment is optimised to minimise such changes (single well-known IP)
        • Operations are optimised to minimise such changes (single process instead of parallel, decreasing the number of requests)
    • Known tactics, by order of preference:
      1. Declare both layouts in the same declaration
        • By using mutually exclusive selectors where each is applicable only in one case, yet the combination covers all cases
      2. Unify markup with filters (e.g. unwrap final destination URL of a link from a query parameter, replace some tags by others…)
      3. Skip the snapshot entirely (e.g. alternating between mobile and desktop pages). Choosing which ones to skip in the alternative is done with the following constraints:
        1. Maximise version quality (more markup, better readability)
        2. Maximise frequency (at least one version a day)
        3. Minimise changes to declaration
        4. Minimise declaration complexity
  • Review versions and apply some sanity checks
    • Add filters
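
The rule for choosing `validUntil` when a selector stops working can be sketched as below. This is a minimal sketch under assumptions: snapshots are plain objects with an `id` and a `fetchDate`, and `generatesVersion` is a hypothetical predicate telling whether a version can be extracted from a snapshot with the declaration under review. "Last snapshot that does not fail" is read here as the last one before the first failure.

```javascript
// Return the ISO date to use as `validUntil`, or null if even the oldest snapshot fails.
// Assumed snapshot shape: { id: string, fetchDate: Date }
function findValidUntil(snapshots, generatesVersion) {
  const chronological = [...snapshots].sort((a, b) => a.fetchDate - b.fetchDate);
  let lastValid = null;

  for (const snapshot of chronological) {
    if (!generatesVersion(snapshot)) {
      break; // First failure: the declaration stops being valid before this snapshot
    }

    lastValid = snapshot;
  }

  return lastValid ? lastValid.fetchDate.toISOString() : null;
}

const snapshots = [
  { id: 'c', fetchDate: new Date('2021-03-01T00:00:00Z') },
  { id: 'a', fetchDate: new Date('2021-01-01T00:00:00Z') },
  { id: 'b', fetchDate: new Date('2021-02-01T00:00:00Z') },
];

// Pretend that the selector stopped matching from snapshot "c" onwards
console.log(findValidUntil(snapshots, snapshot => snapshot.id !== 'c'));
```

The returned date would then go into the history of the declaration, and the current declaration would be updated to match the newer page.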

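The "unify markup with filters" tactic, in the case of unwrapping the final destination of a link from a query parameter, might look like the sketch below. The parameter name `u` and the `service.example` URLs are assumptions chosen for illustration; the filter takes the DOM document and modifies it in place, like the filter of the script's generic page declaration.

```javascript
// Return the destination wrapped in a redirect link, or the absolute link itself when nothing is wrapped.
// `paramName` is an assumption: the query parameter that carries the destination varies between services.
function unwrapDestination(href, base, paramName = 'u') {
  const url = new URL(href, base);
  const destination = url.searchParams.get(paramName); // Already percent-decoded

  return destination || url.toString();
}

// Filter: rewrite every link of the document so that it points to its final destination
function unwrapLinks(document) {
  document.querySelectorAll('a[href]').forEach(link => {
    link.setAttribute('href', unwrapDestination(link.getAttribute('href'), document.location));
  });
}

console.log(unwrapDestination('https://l.service.example/l.php?u=https%3A%2F%2Fservice.example%2Fterms&h=abc', 'https://service.example/'));
```

With such a filter, two snapshots that differ only in the tracking wrapper around their links produce the same version.
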
Script usage instructions

Prerequisites

  • Clone the OpenTermsArchive project.
  • Clone all three repositories of the instance associated with the document: declarations, snapshots, and versions.
  • Get the latest version of this experimental script. As it depends on the internal code of Open Terms Archive, it must be copied to a folder in the scripts directory of the Open Terms Archive project, for example scripts/cleanup/index.js.
  • Install missing dependencies.
git clone git@github.com:ambanum/OpenTermsArchive.git
cd OpenTermsArchive
mkdir -p ./scripts/cleanup
wget https://gist.githubusercontent.com/Ndpnt/1426623350e25310e61b14a868bb5ee8/raw/03ffbf1f8b7ea221007dbb79fd8aec80d594814f/index.js -O ./scripts/cleanup/index.js
npm i colors commander inquirer

Edit the configuration

Create a new file config/development.json with the following contents, replacing ${YOUR_INSTANCE} with the name of your instance:

{
  "services": {
    "declarationsPath": "../${YOUR_INSTANCE}-declarations/declarations"
  },
  "recorder": {
    "versions": {
      "storage": {
        "type": "git",
        "git": {
          "path": "../${YOUR_INSTANCE}-versions",
          "publish": false,
          "snapshotIdentiferTemplate": "https://github.com/OpenTermsArchive/france-snapshots/commit/%SNAPSHOT_ID",
          "author": {
            "name": "Open Terms Archive Bot",
            "email": "bot@opentermsarchive.org"
          }
        }
      }
    },
    "snapshots": {
      "storage": {
        "type": "git",
        "git": {
          "path": "../${YOUR_INSTANCE}-snapshots",
          "publish": false,
          "repository": "git@github.com:OpenTermsArchive/${YOUR_INSTANCE}-snapshots.git",
          "author": {
            "name": "Open Terms Archive Bot",
            "email": "bot@opentermsarchive.org"
          }
        }
      }
    }
  }
}
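
Since JSON does not interpolate `${YOUR_INSTANCE}`, the placeholder has to be replaced by hand or by a small generator such as the sketch below. `buildCleanupConfig` is a hypothetical helper, not part of Open Terms Archive, and deriving the snapshot identifier template from the instance name is an assumption (the configuration above hard-codes the `france` instance).

```javascript
// Sketch only: build the configuration shown above for a given instance name
function buildCleanupConfig(instance) {
  const author = { name: 'Open Terms Archive Bot', email: 'bot@opentermsarchive.org' };

  return {
    services: { declarationsPath: `../${instance}-declarations/declarations` },
    recorder: {
      versions: {
        storage: {
          type: 'git',
          git: {
            path: `../${instance}-versions`,
            publish: false,
            // Assumption: the template points at the snapshots repository of the same instance.
            // The key is spelled as in the configuration above.
            snapshotIdentiferTemplate: `https://github.com/OpenTermsArchive/${instance}-snapshots/commit/%SNAPSHOT_ID`,
            author,
          },
        },
      },
      snapshots: {
        storage: {
          type: 'git',
          git: {
            path: `../${instance}-snapshots`,
            publish: false,
            repository: `git@github.com:OpenTermsArchive/${instance}-snapshots.git`,
            author,
          },
        },
      },
    },
  };
}

// Print the JSON to paste into the configuration file
console.log(JSON.stringify(buildCleanupConfig('france'), null, 2));
```

Keeping `publish` set to `false` in both storages means nothing is pushed while the history is being reviewed.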

Run the script

History cleaning is much easier to do for one document type of one service at a time. It is therefore recommended to choose which document (service ID and document type) you want to clean the history of, and to move on to other services and document types once this one is done.

node ./scripts/cleanup/index.js --interactive --serviceId $SERVICE_ID_YOU_WANT_TO_WORK_ON --documentType "$DOCUMENT_TYPE_YOU_WANT_TO_WORK_ON"

For example:

node ./scripts/cleanup/index.js --interactive --serviceId Aigle --documentType "General Conditions of Sale"

To exit the script, press Ctrl-C.

Only when all declarations are fixed and all unwanted snapshots are marked to be skipped should the whole history be regenerated, by not specifying a service ID or document type and not enabling interactive mode.

node ./scripts/cleanup/index.js

import fsApi from 'fs';
import fs from 'fs/promises';
import os from 'node:os';
import path from 'path';
import { fileURLToPath } from 'url';

import colors from 'colors';
import { program } from 'commander';
import config from 'config';
import inquirer from 'inquirer';
import jsdom from 'jsdom';

import { InaccessibleContentError } from '../../src/archivist/errors.js';
import filter from '../../src/archivist/filter/exports.js';
import Record from '../../src/archivist/recorder/record.js';
import RepositoryFactory from '../../src/archivist/recorder/repositories/factory.js';
import * as services from '../../src/archivist/services/index.js';

const { JSDOM } = jsdom;

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const ROOT_OUTPUT = path.resolve(__dirname, 'output');
const SKIPPED_OUTPUT = path.join(ROOT_OUTPUT, 'skipped');
const TO_CHECK_OUTPUT = path.join(ROOT_OUTPUT, 'to-check');

program
  .name('regenerate')
  .description('Cleanup services declarations and regenerate versions history')
  .version('0.0.1');

program
  .option('-s, --serviceId [serviceId]', 'service ID of service to handle')
  .option('-d, --documentType [documentType]', 'document type to handle')
  .option('-i, --interactive', 'Enable interactive mode to validate each version and choose if snapshot should be skipped');

program.parse(process.argv);

const options = program.opts();

const contentSkippingRules = {
  /**
   * Example: For Instagram Terms of Service documents, skip all snapshots which contain "Terms of Use" text in `h1 span`
   * Instagram: {
   *   'Terms of Service': { 'h1 span': 'Terms of Use' },
   * },
   */
};

const selectorSkippingRules = {
  /**
   * Example: For Facebook Privacy Policy documents, skip all snapshots which have an element matching `body.touch` selector in DOM
   * Facebook: {
   *   'Privacy Policy': ['body.touch'],
   * },
   */
};

const missingRequiredSelectorSkippingRules = {
  /**
   * Example: For Facebook Terms of Service documents, skip all snapshots which do not have an element matching `[href="https://www.facebook.com/legal/terms/eecc/flyout"]` selector in DOM
   * Facebook: {
   *   'Terms of Service': ['[href="https://www.facebook.com/legal/terms/eecc/flyout"]'],
   * },
   */
};

const snapshotsIdsWithContentToSkip = [
  /**
   * Example: For all documents, skip all snapshots which have a markdown converted content matching the content of the snapshot with id `2ac6866668843e95d3244ef951aec80ac2a04d81`
   * '2ac6866668843e95d3244ef951aec80ac2a04d81'
   */
];

const renamingRules = {
  /**
   * Example: For all services, rename document type `Community Guidelines - Deceased Users` into `Deceased Users`.
   * 'Community Guidelines - Deceased Users': 'Deceased Users',
   */
};

// Declaration used to convert a whole page to markdown, in order to compare snapshots contents regardless of the service declaration
const genericPageDeclaration = {
  location: 'http://service.example',
  contentSelectors: 'html',
  filters: [document => {
    // Strip query strings from links, as they often differ between two fetches of the same page.
    // Only links with an `href` attribute are handled, otherwise `new URL(null, …)` would resolve to a bogus "null" path.
    document.querySelectorAll('a[href]').forEach(el => {
      const url = new URL(el.getAttribute('href'), document.location);

      url.search = '';
      el.setAttribute('href', url.toString());
    });
  }],
};

let servicesDeclarations = await services.loadWithHistory();

await initializeFolders(servicesDeclarations);

const { versionsRepository, snapshotsRepository } = await initializeRepositories();
const contentsToSkip = await initializeSnapshotContentToSkip(snapshotsIdsWithContentToSkip, snapshotsRepository);

info('Number of snapshots in the repository', await snapshotsRepository.count());

const serviceId = options.serviceId || '*';
const documentType = options.documentType || '*';

if (serviceId != '*' || documentType != '*') {
  // '*' is a wildcard: only compare the criteria that were actually specified
  const matchingSnapshots = (await snapshotsRepository.findAll()).filter(s => (serviceId == '*' || s.serviceId == serviceId) && (documentType == '*' || s.documentType == documentType));

  info('Number of snapshots for the specified service and document type', matchingSnapshots.length);
}

if (options.interactive) {
  info('Interactive mode enabled');
}

console.log('options', options);

let index = 1;

console.time('Total time');

for await (const snapshot of snapshotsRepository.iterate([`${serviceId}/${documentType}.*`])) {
  applyDocumentTypeRenaming(renamingRules, snapshot); // Modifies snapshot in place
  await handleSnapshot(snapshot, options, index);
  index++;
}

console.timeEnd('Total time');

await cleanupEmptyDirectories();

async function initializeFolders(servicesDeclarations) {
  return Promise.all([ TO_CHECK_OUTPUT, SKIPPED_OUTPUT ].map(async folder =>
    Promise.all(Object.entries(servicesDeclarations).map(([ key, value ]) =>
      Promise.all(Object.keys(value.documents).map(documentName => {
        const folderPath = path.join(folder, key, documentName);

        if (fsApi.existsSync(folderPath)) {
          return;
        }

        return fs.mkdir(folderPath, { recursive: true });
      }))))));
}

async function initializeRepositories() {
  const snapshotsRepository = RepositoryFactory.create(config.recorder.snapshots.storage);
  const sourceVersionsRepository = RepositoryFactory.create(config.recorder.versions.storage);
  const targetRepositoryConfig = config.util.cloneDeep(config.recorder.versions.storage);

  targetRepositoryConfig.git.path = path.join(ROOT_OUTPUT, 'resulting-versions');
  const targetVersionsRepository = RepositoryFactory.create(targetRepositoryConfig);

  await Promise.all([
    sourceVersionsRepository.initialize(),
    targetVersionsRepository.initialize().then(() => targetVersionsRepository.removeAll()),
    snapshotsRepository.initialize(),
  ]);

  await copyReadme(sourceVersionsRepository, targetVersionsRepository);

  return {
    versionsRepository: targetVersionsRepository,
    snapshotsRepository,
  };
}

async function copyReadme(sourceRepository, targetRepository) {
  const sourceRepositoryReadmePath = `${sourceRepository.path}/README.md`;
  const targetRepositoryReadmePath = `${targetRepository.path}/README.md`;

  const [firstReadmeCommit] = await sourceRepository.git.log(['README.md']);

  if (!firstReadmeCommit) {
    console.warn(`No commit found for README in ${sourceRepository.path}`);

    return;
  }

  await fs.copyFile(sourceRepositoryReadmePath, targetRepositoryReadmePath);
  await targetRepository.git.add(targetRepositoryReadmePath);
  await targetRepository.git.commit({
    filePath: targetRepositoryReadmePath,
    message: firstReadmeCommit.message,
    date: firstReadmeCommit.date,
  });
}

async function initializeSnapshotContentToSkip(snapshotsIds, repository) {
  return Promise.all(snapshotsIds.map(async snapshotsId => {
    const { content, mimeType } = await repository.findById(snapshotsId);

    return filter({ pageDeclaration: genericPageDeclaration, content, mimeType });
  }));
}

function info(...args) {
  console.log(colors.grey(...args));
}

function applyDocumentTypeRenaming(rules, snapshot) {
  snapshot.documentType = rules[snapshot.documentType] || snapshot.documentType;
}

async function handleSnapshot(snapshot, options, index) {
  const { serviceId, documentType } = snapshot;
  const { validUntil, pages: [pageDeclaration] } = servicesDeclarations[serviceId].getDocumentDeclaration(documentType, snapshot.fetchDate);

  info(`${index}`.padStart(5, ' '), serviceId, '-', documentType, ' ', 'Snapshot', snapshot.id, 'fetched at', snapshot.fetchDate.toISOString(), 'with declaration valid until', validUntil || 'now');

  const { shouldSkip, reason } = checkIfSnapshotShouldBeSkipped(snapshot, pageDeclaration);

  if (shouldSkip) {
    console.log(` ↳ Skip: ${reason}`);
    // Keep a copy of the skipped snapshot for later review
    await fs.writeFile(path.join(SKIPPED_OUTPUT, serviceId, documentType, generateFileName(snapshot)), snapshot.content);

    return;
  }

  try {
    const version = await filter({
      pageDeclaration,
      content: snapshot.content,
      mimeType: snapshot.mimeType,
    });

    const record = new Record({
      content: version,
      serviceId,
      documentType,
      snapshotId: snapshot.id,
      fetchDate: snapshot.fetchDate,
      mimeType: 'text/markdown',
      snapshotIds: [snapshot.id],
    });

    const tmpFilePath = path.join(os.tmpdir(), 'regenerated-version.md');

    await fs.writeFile(tmpFilePath, version);

    // Compare the regenerated version with the latest recorded one; if none exists yet, record it as the first version
    const diffString = await versionsRepository.git.diff([ '--word-diff=color', `${serviceId}/${documentType}.md`, tmpFilePath ]).catch(async error => {
      if (!error.message.includes('Could not access')) {
        throw error;
      }

      const { id } = await versionsRepository.save(record);

      console.log(` ↳ Generated first version: ${id}`);
    });

    if (!diffString) {
      return;
    }

    console.log(diffString);

    // Keep a copy of the snapshot that led to a changed version for later review
    await fs.writeFile(path.join(TO_CHECK_OUTPUT, serviceId, documentType, generateFileName(snapshot)), snapshot.content);

    if (options.interactive) {
      const { validVersion } = await inquirer.prompt([{ message: 'Is this version valid?', type: 'list', choices: [ 'Yes, keep it!', 'No, I updated the declaration, let\'s retry' ], name: 'validVersion' }]);

      if (validVersion == 'No, I updated the declaration, let\'s retry') {
        console.log('Reloading declarations…');
        servicesDeclarations = await services.loadWithHistory();

        return handleSnapshot(snapshot, options, index);
      }
    }

    const { id } = await versionsRepository.save(record);

    console.log(` ↳ Generated new version: ${id}`);
  } catch (error) {
    if (!(error instanceof InaccessibleContentError)) {
      throw error;
    }

    const filteredSnapshotContent = await filter({ pageDeclaration: genericPageDeclaration, content: snapshot.content, mimeType: snapshot.mimeType });

    if (contentsToSkip.find(contentToSkip => contentToSkip == filteredSnapshotContent)) {
      console.log(` ↳ Skip ${snapshot.id} as its content matches a content to skip`);

      return;
    }

    console.log(' ↳ An error occurred while filtering:', error.message);

    const line = colors.grey(colors.underline(`${' '.repeat(process.stdout.columns)}`));

    console.log(`\n${line}\n${colors.cyan(filteredSnapshotContent)}\n${line}\n`);

    const { skip } = await inquirer.prompt([{ message: 'Should this snapshot be skipped?', type: 'list', name: 'skip', choices: [ 'Yes, skip it!', 'No, I updated the declaration, let\'s retry' ] }]);

    if (skip == 'Yes, skip it!') {
      contentsToSkip.push(filteredSnapshotContent);
      console.log(`Do not forget to append "${snapshot.id}" in "snapshotsIdsWithContentToSkip" array`);
    } else {
      console.log('Reloading declarations…');
      servicesDeclarations = await services.loadWithHistory();

      return handleSnapshot(snapshot, options, index);
    }
  }
}

function generateFileName(snapshot) {
  return `${snapshot.fetchDate.toISOString().replace(/\.\d{3}/, '').replace(/:|\./g, '-')}-${snapshot.id}.html`;
}

function checkIfSnapshotShouldBeSkipped(snapshot, pageDeclaration) {
  const { serviceId, documentType } = snapshot;

  const contentsToSkip = contentSkippingRules[serviceId] && contentSkippingRules[serviceId][documentType];
  const selectorsToSkip = selectorSkippingRules[serviceId] && selectorSkippingRules[serviceId][documentType];
  const missingRequiredSelectors = missingRequiredSelectorSkippingRules[serviceId] && missingRequiredSelectorSkippingRules[serviceId][documentType];

  // Avoid parsing the DOM when no rule applies to this document
  if (!(contentsToSkip || selectorsToSkip || missingRequiredSelectors)) {
    return { shouldSkip: false };
  }

  const { window: { document: webPageDOM } } = new JSDOM(snapshot.content, { url: pageDeclaration.location, virtualConsole: new jsdom.VirtualConsole() });

  const selectorToSkip = selectorsToSkip && selectorsToSkip.find(selector => webPageDOM.querySelectorAll(selector).length);
  const missingRequiredSelector = missingRequiredSelectors && missingRequiredSelectors.find(selector => !webPageDOM.querySelectorAll(selector).length);
  const contentToSkip = contentsToSkip && Object.entries(contentsToSkip).find(([ key, value ]) => webPageDOM.querySelector(key)?.innerHTML == value);

  if (!(selectorToSkip || missingRequiredSelector || contentToSkip)) {
    return { shouldSkip: false };
  }

  let reason;

  if (selectorToSkip) {
    reason = `its content matches a selector to skip: "${selectorToSkip}"`;
  }

  if (missingRequiredSelector) {
    reason = `its content does not match a required selector: "${missingRequiredSelector}"`;
  }

  if (contentToSkip) {
    // `contentToSkip` is a [ selector, text ] pair
    reason = `its content matches a content to skip: "${contentToSkip[1]}" in "${contentToSkip[0]}"`;
  }

  return {
    shouldSkip: true,
    reason,
  };
}

async function cleanupEmptyDirectories() {
  /* eslint-disable no-await-in-loop */
  return Promise.all([ TO_CHECK_OUTPUT, SKIPPED_OUTPUT ].map(async folder => {
    const servicesDirectories = (await fs.readdir(folder, { withFileTypes: true })).filter(dirent => dirent.isDirectory()).map(dirent => dirent.name);

    for (const servicesDirectory of servicesDirectories) {
      const documentTypeDirectories = (await fs.readdir(path.join(folder, servicesDirectory), { withFileTypes: true })).filter(dirent => dirent.isDirectory()).map(dirent => dirent.name);

      for (const documentTypeDirectory of documentTypeDirectories) {
        const files = await fs.readdir(path.join(folder, servicesDirectory, documentTypeDirectory));

        if (!files.length) {
          await fs.rmdir(path.join(folder, servicesDirectory, documentTypeDirectory));
        }
      }

      const cleanedDocumentTypeDirectories = (await fs.readdir(path.join(folder, servicesDirectory), { withFileTypes: true })).filter(dirent => dirent.isDirectory()).map(dirent => dirent.name);

      if (!cleanedDocumentTypeDirectories.length) {
        await fs.rmdir(path.join(folder, servicesDirectory));
      }
    }
  }));
  /* eslint-enable no-await-in-loop */
}