Example of a Page rendered within an application layout customElement and rendered client-side from raw markdown hosted on a GitHub Gist
locale title canonical date preamble coverImage categories tags keywords
en-CA
Converting a dynamic site into static HTML documents
2015-05-20T13:44:11-04:00
disable text
true
src alt text
~/assets/content/blog/2015/05/webat25-org-screen-capture.png
Web 25th anniversary web site screenshot
In March 2014, the W3C and the Web Foundation celebrated the World Wide Web's 25th anniversary. As a W3C Team member, I was asked to help the systems team host the event's web site. After the event, I was asked to convert the web site into static HTML documents so the systems team wouldn't have to maintain the CMS it was using.
Projects
Linux
operations
procedure
favourites
webplatform
curl
wget
static site
convert from cms

Converting a dynamic site into static HTML documents


<script src="https://renoirb.com/esm-modules/value-boolean-element.mjs?registerElement=my-value-boolean" type="module"></script>

Twice now, I've been asked to take a website that was running on a CMS and make it static.

This is a useful practice if you want to keep the site content for posterity without having to maintain the underlying CMS. It also makes migrations easier: a site you know you won't add content to anymore becomes simply a bunch of HTML files in a folder.

My end goal was to make an EXACT copy of what the site looks like when generated by the CMS, BUT stored as simple HTML files. When I say EXACT, I mean it, down to keeping every document at its original location. Each link in an HTML document had to keep its original value, AND a file had to exist where the web server can find it. For example, if a link points to /foo, the link in the page remains as-is even though the content is now a static file at /foo.html, and the web server serves /foo.html for it.
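To make the /foo example concrete, here is a minimal TypeScript sketch of the lookup a web server has to perform for such a site. The function name `candidateFiles` and the lookup order are assumptions made for illustration; they are not the configuration of any particular web server.

```typescript
// Hypothetical helper: given a request path, list the static files a web
// server could try, in order. The order is an assumption for illustration.
function candidateFiles(requestPath: string): string[] {
  // Drop any query string or fragment; only the path matters for file lookup
  const path = requestPath.split(/[?#]/)[0]
  if (path === '' || path === '/') {
    return ['/index.html']
  }
  // A path that already ends with a file extension is served as-is
  if (/\.[A-Za-z0-9]+$/.test(path)) {
    return [path]
  }
  const trimmed = path.replace(/\/+$/, '')
  // /foo and /foo/ both try /foo.html first, then /foo/index.html
  return [`${trimmed}.html`, `${trimmed}/index.html`]
}

console.log(candidateFiles('/foo'))
```

The link in the page stays /foo; only the server-side lookup knows about the .html suffix.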

Here are the steps I took to achieve just that. Your mileage may vary; I've done these steps and they worked for me.

I've done this procedure a few times with WordPress blogs, and with webat25.org, a website that was running on ExpressionEngine and is now hosted at w3.org/webat25/.

Steps

1. Browse and get all pages you think could be lost in scraping

We want a simple file listing one web page per line, with its full address. This helps the crawler not to miss pages.

  • Open your web browser's developer tools, go to the Network inspector, and keep it open with "preserve log" enabled.
  • Once you've browsed the site a bit, list all documents in the network inspector, then export them using the "Save as HAR" feature.
  • Extract the URLs from the HAR file using underscore-cli

npm install -g underscore-cli
cat site.har | underscore select '.entries .request .url' > workfile.txt

  • Remove the first and last lines (it's a JSON array and we want one document per line)
  • Remove the hostname from each line so that each one starts with /path; in vim you can do :%s/http:\/\/www\.example\.org//g
  • Remove the leading " and the trailing ", from each line; in vim you can do :%s/^\s*"// and then :%s/",$//
  • On the last line, make sure the closing " is removed too, because the previous regex misses it (that line has no trailing comma)
  • Remove duplicate lines; in vim you can do :sort u
  • Save this file as list.txt for the next step.
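The manual cleanup above can also be scripted. Here is a minimal TypeScript sketch, assuming the standard HAR layout where URLs live under `log.entries[].request.url`. The helper name `extractPaths` and the inline sample data are hypothetical; with a real export you would pass the result of `JSON.parse` on the contents of site.har instead.

```typescript
// Minimal shape of the parts of a HAR file used here
interface HarFile {
  log: { entries: Array<{ request: { url: string } }> }
}

// Hypothetical helper: return the unique, sorted paths requested on `origin`
function extractPaths(har: HarFile, origin: string): string[] {
  const paths = new Set<string>()
  for (const entry of har.log.entries) {
    const url = entry.request.url
    // Ignore requests made to other hosts (CDNs, analytics, and so on)
    if (!url.startsWith(origin + '/')) continue
    // Strip the hostname so that each line starts with /path
    paths.add(url.slice(origin.length))
  }
  // Same effect as vim's `:sort u`: sorted, without duplicates
  return Array.from(paths).sort()
}

// Inline sample standing in for the parsed contents of a real site.har
const sampleHar: HarFile = {
  log: {
    entries: [
      { request: { url: 'http://www.example.org/about' } },
      { request: { url: 'http://www.example.org/' } },
      { request: { url: 'http://www.example.org/about' } },
      { request: { url: 'http://cdn.example.net/lib.js' } },
    ],
  },
}

console.log(extractPaths(sampleHar, 'http://www.example.org').join('\n'))
```

Writing the joined output to list.txt gives the same one-entry-per-line file as the vim steps.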

2. Let's scrape it all

We'll do two scrapes. The first one gets all the assets it can; then we'll go again with different options.

The following are the commands I ran on the last successful attempt to replicate the site I was working on. I'm not claiming this method is the most efficient technique. Please feel free to improve the document as you see fit.

First, a quick TL;DR of the wget options

  • -m is the same as --mirror
  • -k is the same as --convert-links
  • -K is the same as --backup-converted, which creates .orig files
  • -E is the same as --adjust-extension, which appends .html to HTML documents saved without that extension
  • -p is the same as --page-requisites, which makes wget download ALL the files a page needs to display (images, CSS, and so on)
  • -nc ensures we don't download the same file twice and end up with duplicates (e.g. file.html AND file.1.html)
  • --cut-dirs would prevent creating directories and mix things around; do not use it.

Notice that we're sending headers as if we were a web browser. It's up to you.

export UA='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36'
export ACCEPTL='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2'
export ACCEPTT='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
wget -i list.txt -nc --random-wait \
     --mirror \
     -e robots=off \
     --no-cache \
     -k -E --page-requisites \
     --user-agent="$UA" \
     --header="$ACCEPTT" \
     http://www.example.org/

Then, another pass

wget -i list.txt --mirror \
     -e robots=off \
     -k -K -E --no-cache --no-parent \
     --user-agent="$UA" \
     --header="$ACCEPTL" \
     --header="$ACCEPTT" \
     http://www.example.org/

3. Do some cleanup on the fetched files

Here are a few commands I ran to clean up the files a bit

  • Strip carriage returns from every .orig file. They're the ones we'll use in the end, after all

    find . -type f -regextype posix-egrep -regex '.*\.orig$' -exec sed -i 's/\r//' {} \;
  • Rename the .orig files to .html

    find . -name '*.orig' | sed -e "p;s/\.orig$/.html/" | xargs -n2 mv
    
    find . -type f -name '*\.html\.html' | sed -e "p;s/\.html//" | xargs -n2 mv
  • Many folders might contain only an index.html file. Let's turn each of them into a file without a directory (the site root's own index.html is left alone)

    find . -mindepth 2 -type f -name 'index.html' | sed -e "p;s/\/index\.html/.html/" | xargs -n2 mv
  • Remove files that have a .1 in their name (adjust the pattern for other numbers); they are most likely duplicates anyway

    find . -type f -name '*\.1\.*' -exec rm -rf {} \;
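Before renaming anything on disk, it can help to preview the result. The following TypeScript sketch mirrors the renames above on a single relative path. The helper name `cleanedName` is hypothetical, and the duplicate check is deliberately broader than the shell pattern (any number, not only .1), which is an assumption on my part.

```typescript
// Hypothetical helper: compute the final name for one fetched file,
// or null when the file looks like a numbered duplicate to delete.
function cleanedName(file: string): string | null {
  // Numbered copies such as page.1.html are treated as duplicates
  if (/\.\d+\./.test(file)) return null
  let name = file
    .replace(/\.orig$/, '.html') // page.html.orig -> page.html.html
    .replace(/\.html\.html$/, '.html') // page.html.html -> page.html
  // about/index.html -> about.html, but keep the site root's index.html
  if (name !== './index.html') {
    name = name.replace(/\/index\.html$/, '.html')
  }
  return name
}

for (const file of ['./about/index.html.orig', './page.1.html', './index.html']) {
  console.log(file, '->', cleanedName(file))
}
```

Running this over the output of `find . -type f` gives a dry run of the cleanup before any `mv` or `rm` is executed.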
locale title canonical date categories tags keywords
en-CA
Example of a Page rendered within an application layout customElement and rendered client-side from raw markdown hosted on a GitHub Gist
2023-02-19T13:44:11-04:00
Projects
web-components
open-web-platform
experiments
managing-code-release
RushJS
ESM

Experimenting with client-side and ESM

This file comes from a Gist. The page loads the raw markdown and inserts it into an HTML custom element loaded over HTTP from https://renoirb.com/esm-modules/app-layout-element.mjs

That "app layout" dresses up a site serving raw markdown, and some of the data is updated using the Context API (which is now much simpler; see the protocol spec) to avoid daisy-chaining props.
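To illustrate what avoiding daisy-chained props means, here is a DOM-free TypeScript model of the idea: a request travels up a tree until some ancestor that provides the named context answers it. This is not the DOM event API nor the protocol's actual implementation; the `TreeNode` class and its `provide` and `request` methods are hypothetical names used only for this sketch.

```typescript
type Callback<T> = (value: T) => void

class TreeNode {
  private provided = new Map<string, unknown>()
  constructor(readonly parent?: TreeNode) {}

  // Register a value this node can hand out for a context name
  provide<T>(name: string, value: T): void {
    this.provided.set(name, value)
  }

  // Ask this node and its ancestors for a context; true if one answered
  request<T>(name: string, callback: Callback<T>): boolean {
    for (let node: TreeNode | undefined = this; node; node = node.parent) {
      if (node.provided.has(name)) {
        callback(node.provided.get(name) as T)
        return true
      }
    }
    return false
  }
}

// root provides the value; layout sits in between and knows nothing about it
const root = new TreeNode()
const layout = new TreeNode(root)
const leaf = new TreeNode(layout)
root.provide('currency-amount', { amount: 12, currency: 'CAD' })

const seen: number[] = []
const answered = leaf.request<{ amount: number; currency: string }>('currency-amount', (value) => {
  seen.push(value.amount)
})
console.log(answered, seen)
```

The intermediate `layout` node never has to accept and forward the value, which is the point of requesting a context instead of passing props down level by level.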

This is messy, sorry

This Gist text is messy at the moment, I'll fix it soon.

Context API WebComponent protocol

The following are experiments made in 2021, when the Context API was still being elaborated and I was trying a few things:

Context API WebComponent state communication protocol prototype

Started from definitions

// https://github.com/webcomponents/community-protocols/blob/main/proposals/context.md#definitions
import type { LitElement } from 'lit-element'

/**
 * A Context object defines an optional initial value for a Context, as well as a name identifier for debugging purposes.
 */
export type Context<T> = {
  readonly name: string
  readonly initialValue?: T
}

/**
 * An unknown context type
 */
export type UnknownContext = Context<unknown>

/**
 * A helper type which can extract a Context value type from a Context type
 */
export type ContextType<T extends UnknownContext> = T extends Context<infer Y> ? Y : never

/**
 * A function which creates a Context value object
 */
export function createContext<T>(name: string, initialValue?: T): Readonly<Context<T>> {
  return {
    name,
    initialValue,
  }
}

/**
 * A callback which is provided by a context requester and is called with the value satisfying the request.
 * This callback can be called multiple times by context providers as the requested value is changed.
 */
export type ContextCallback<ValueType> = (value: ValueType, dispose?: () => void) => void

// eslint-disable-next-line @typescript-eslint/naming-convention
export interface IContextEvent<T extends UnknownContext> {
  /**
   * The name of the context that is requested
   *
   * renoirb: Instead of using name context, going to use DOM CustomEvent's detail property
   */
  readonly context: T
  /**
   * A boolean indicating if the context should be provided more than once.
   */
  readonly multiple?: boolean
  /**
   * A callback which a provider of this named callback should invoke.
   */
  readonly callback: ContextCallback<ContextType<T>>
}

/**
 * An event fired by a context requester to signal it desires a named context.
 *
 * A provider should inspect the `context` property of the event to determine if it has a value that can
 * satisfy the request, calling the `callback` with the requested value if so.
 *
 * If the requested context event contains a truthy `multiple` value, then a provider can call the callback
 * multiple times if the value is changed, if this is the case the provider should pass a `dispose`
 * method to the callback which requesters can invoke to indicate they no longer wish to receive these updates.
 */
export class ContextEvent<T extends UnknownContext> extends CustomEvent<T> implements IContextEvent<T> {
  public constructor(
    public readonly context: T,
    public readonly callback: ContextCallback<ContextType<T>>,
    public readonly multiple?: boolean,
  ) {
    super('context-request', { bubbles: true, composed: true })
  }

  get detail(): T {
    return this.context
  }
}

// --------------------- Added by Renoir ---------------------
export interface UpdatableHonk<T extends UnknownContext> extends IContextEvent<T> {
  readonly target: EventTarget
}

export class StatefulContextManager {
  contexts = new Map<string, Set<UpdatableHonk<Context<unknown>>>>()
  // packages/web-components/fast-foundation/src/utilities/match-media-stylesheet-behavior.ts
  private listenerMap = new WeakMap<EventTarget, ContextCallback<unknown>>()
  // One stable handler reference, so removeEventListener can find what addEventListener registered
  private readonly onContextRequest = (event: Event): void =>
    this.keepTrackContextRequest(event as ContextEvent<UnknownContext>)

  constructor() {
    console.log('StatefulContextManager ctor')
  }

  respondFor(name: string, data: unknown) {
    console.log('StatefulContextManager respondFor 1/2', { name, data })
    const contexts = this.contexts
    if (contexts) {
      if (contexts.has(name) === false) {
        throw new Error(`StatefulContextManager respondFor: There is no context on the name ${name}`)
      }
      const entries = contexts.get(name)
      for (const { context, target, ...rest } of entries) {
        const callback = this.listenerMap.has(target) ? this.listenerMap.get(target) : void 0
        console.log('StatefulContextManager respondFor 2/2', { name, data, context, callback, ...rest })
        const payload = data ?? context.initialValue
        // Skip targets without a registered callback instead of calling undefined
        if (callback) {
          callback(payload)
        }
      }
    }
  }

  protected keepTrackContextRequest<T extends UnknownContext>(event: ContextEvent<T>) {
    const { context } = event
    const { name } = context
    const contexts = this.contexts
    console.log('StatefulContextManager keepTrackContextRequest 1/2', { name, context, event, contexts })
    const callback: ContextCallback<ContextType<T>> = (value, dispose) => {
      console.log('StatefulContextManager keepTrackContextRequest callback')
      event.callback(value, dispose)
      ;(event.target as LitElement).requestUpdate()
    }
    const honk: UpdatableHonk<T> = {
      multiple: false,
      ...event,
      callback,
      context,
      target: event.target,
    }
    this.listenerMap.set(event.target, callback)
    if (contexts) {
      if (contexts.has(name)) {
        contexts.get(name).add(honk)
      } else {
        contexts.set(name, new Set([honk]))
      }
    }
    event.stopPropagation()
    console.log('StatefulContextManager keepTrackContextRequest 2/2', { name, context, event, contexts })
  }

  addEventListenerTo(host: HTMLElement) {
    host.addEventListener('context-request', this.onContextRequest)
  }

  removeEventListenerTo(host: HTMLElement) {
    // The original called addEventListener here, with a freshly bound function that could never be removed
    host.removeEventListener('context-request', this.onContextRequest)
  }
}
// --------------------- Added by Renoir ---------------------

declare global {
  interface HTMLElementEventMap {
    /**
     * A 'context-request' event can be emitted by any element which desires
     * a context value to be injected by an external provider.
     */
    readonly 'context-request': ContextEvent<UnknownContext>
  }
}

// Component module; it imports './parts' and './model'
import { LitElement, property, customElement } from 'lit-element'
import { renderAmountAndCurrency } from './parts'
import type { CurrencyAmountCallback } from './model'
import { ContextEvent, currencyAmountContext, CurrencyAmountValue } from './model'

export const localName = 'currency-amount' as const

@customElement(localName)
export class CurrencyAmountComponent extends LitElement {
  static localName = localName

  @property({ type: String })
  readonly for: string = ''

  private internal: CurrencyAmountValue = Object.create(null)

  @property({ type: Object })
  get state(): CurrencyAmountValue {
    return { ...this.internal } as CurrencyAmountValue
  }
  set state(value: CurrencyAmountValue) {
    this.internal = {
      ...this.internal,
      ...value,
    } as CurrencyAmountValue
  }

  private contextCallback: CurrencyAmountCallback = (value, dispose) => {
    // if we were given a disposer, this provider is likely to send us updates
    if (dispose) {
      // dispose immediately if we only want it once
      dispose()
    }
    this.state = value
  }

  connectedCallback() {
    // LitElement starts its update cycle in its own connectedCallback, so call it first
    super.connectedCallback()
    this.dispatchEvent(new ContextEvent(currencyAmountContext, this.contextCallback))
  }

  render() {
    return renderAmountAndCurrency({
      amount: this.state.amount,
      currency: this.state.currency,
      locale: this.state.locale,
    })
  }
}

declare global {
  interface HTMLElementTagNameMap {
    readonly [localName]: CurrencyAmountComponent
  }
}

// Module imported by the component as './model'
import type { ContextCallback } from './context-api'
import { createContext } from './context-api'
export { ContextEvent } from './context-api'

export const NAMESPACE = 'currency-amount' as const

export interface CurrencyAmountValue {
  readonly amount: number | 0
  readonly locale?: 'fr-CA' | 'en-CA' | 'en-US'
  readonly currency?: 'USD' | 'CAD' | 'EUR'
}

export interface CurrencyAmount extends CurrencyAmountValue {
  readonly name: typeof NAMESPACE
}

export interface ZeroCurrencyAmountValue extends CurrencyAmountValue {
  readonly amount: 0
}

export interface NonZeroCurrencyAmountValue extends CurrencyAmountValue {
  readonly locale: 'fr-CA' | 'en-CA' | 'en-US'
  readonly currency: 'USD' | 'CAD' | 'EUR'
}

const isCurrencyAmountPrivate = (ctx: unknown): ctx is CurrencyAmount =>
  // `typeof null` is also 'object', so rule null out before using the `in` operator
  typeof ctx === 'object' &&
  ctx !== null &&
  'amount' in ctx &&
  Number.isNaN(Number.parseInt(Reflect.get(ctx, 'amount'))) === false

export const isZeroCurrencyAmount = (ctx: unknown): ctx is ZeroCurrencyAmountValue =>
  isCurrencyAmountPrivate(ctx) && Reflect.get(ctx, 'amount') === 0

export const isCurrencyAmount = (ctx: unknown): ctx is NonZeroCurrencyAmountValue =>
  isCurrencyAmountPrivate(ctx) &&
  Reflect.get(ctx, 'amount') !== 0 &&
  Reflect.has(ctx, 'locale') === true &&
  Reflect.has(ctx, 'currency') === true

// get a context from somewhere (this could be in any module)
export const currencyAmountContext = createContext<CurrencyAmountValue>(NAMESPACE, { amount: 0 })

export type CurrencyAmountContextEvent = typeof currencyAmountContext

export type CurrencyAmountCallback = ContextCallback<CurrencyAmount>

export const intlNumberFormat = (locale: NonZeroCurrencyAmountValue['locale']) =>
  new Intl.NumberFormat(locale, { style: 'currency', currency: 'EUR' })

export const DEFAULT_FORMAT_LOCALE: NonZeroCurrencyAmountValue['locale'] = 'en-US' as const

// Module imported by the component as './parts'
import type { CurrencyAmountValue } from './model'
import { intlNumberFormat, isCurrencyAmount, DEFAULT_FORMAT_LOCALE } from './model'
import { html } from 'lit-element'
import { guard } from 'lit-html/directives/guard'

// https://schema.org/MonetaryAmount
export const renderAmountAndCurrency = ({ amount, currency, locale }: CurrencyAmountValue) =>
  guard(
    [amount, currency, locale],
    () => html`<div>
      ${isCurrencyAmount({ amount, currency, locale })
        ? html`<span data-currency="${currency}" lang="${locale.split('-')[0]}"
            >${intlNumberFormat(locale).format(amount)} 1</span
          >`
        : html`<span>${intlNumberFormat(DEFAULT_FORMAT_LOCALE).format(amount)} 2</span>`}
    </div>`,
  )