@jed
Last active April 30, 2024 01:33
t.co resolver for twitter archives
// this script replaces all t.co links in the data/tweets.js file of an unzipped twitter archive with their resolved urls.
// it replaces all text inline, so be sure to make a backup of the file before running.
// usage: deno run -A resolve_tco.js {path to data/tweets.js}
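let file = Deno.args[0]
let text = await Deno.readTextFile(file)
let matches = text.match(/"https:\/\/t\.co\/\w+"/g)
// match() returns null when nothing is found; new Set(null) is simply empty
let unique = [...new Set(matches)]
console.log('%s urls found.', unique.length)
// resolve each distinct link once; replaceAll below covers every occurrence
for (let match of unique) {
  console.log('resolving %s...', match)
  let url = match.slice(1, -1)
  // fetch follows the redirects; res.url is the final url
  let res = await fetch(url, {method: 'HEAD'})
  if (!res.ok) throw new Error(`A ${res.status} error occurred, please run again.`)
  console.log('resolved: "%s".', res.url)
  // the replacer function keeps any "$" in the url from being read as a replacement pattern
  text = text.replaceAll(match, () => `"${res.url}"`)
  // write after every link so an aborted run keeps its progress
  await Deno.writeTextFile(file, text)
  // pause for a second between requests
  await new Promise(cb => setTimeout(cb, 1000))
}
console.log('done.')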
@jzaefferer

Thanks for sharing! I quickly ran into some URLs that were down (503s, failed tcp connects) and modded the script to deal with that:

// this script replaces all t.co links in the data/tweets.js file of an unzipped twitter archive with their resolved urls.
// it replaces all text inline, so be sure to make a backup of the file before running.
// usage: deno run -A resolve_tco.js {path to data/tweets.js}

let file = Deno.args[0];
let text = await Deno.readTextFile(file);
let matches = text.match(/"https:\/\/t\.co\/\w+"/g);
let unique = [...new Set(matches)];
console.log("%s urls found.", unique.length);
if (unique.length)
  for (let match of unique) {
    console.log("resolving %s...", match);
    let url = match.slice(1, -1);
    let res;
    try {
      res = await fetch(url, {
        method: "HEAD",
        signal: AbortSignal.timeout(5000),
      });
    } catch (error) {
      console.dir(error);
      // likely a failed tcp connect or the 5 second timeout; skip this url and leave it unresolved
      // https://deno.land/api@v1.28.1?s=fetch claims fetch will always resolve, that seems to be wrong
      continue;
    }
    // never mind whether the url is still up, as long as it can be resolved
    if (!res.url) {
      console.dir(res);
      throw new Error(`A ${res.statusText} error occurred, please run again.`);
    }
    console.log('resolved: "%s".', res.url);
    // each distinct link is resolved once, so replace every occurrence;
    // the replacer function keeps "$" in the url from being treated as a pattern
    text = text.replaceAll(match, () => `"${res.url}"`);
    await Deno.writeTextFile(file, text);
    await new Promise((cb) => setTimeout(cb, 1000));
  }
console.log("done.");

For the failing connects I didn't bother trying to parse the error to retrieve the resolved url. That didn't seem worth the effort.
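
One way to recover the target even when the destination host is down might be to not follow the redirect at all, and read the `Location` header of the t.co response instead of parsing the error. A minimal sketch, assuming t.co answers with a 3xx status and a `Location` header, and that the runtime hands back the real 3xx response under `redirect: "manual"` (to my knowledge Deno does; browsers return an opaque response instead). The names `locationOf` and `resolveOnce` are made up for illustration, and this follows only the first hop, so a chain of shorteners would not be fully resolved.

```javascript
// Hypothetical helper: extract the redirect target from a 3xx response.
// `res` only needs `status` and `headers.get()`, like a fetch Response.
function locationOf(res, requestUrl) {
  if (res.status < 300 || res.status >= 400) return null
  let location = res.headers.get('location')
  if (!location) return null
  // Location may be relative, so resolve it against the requested url
  return new URL(location, requestUrl).href
}

// Hypothetical wrapper: ask for the redirect without following it.
// Assumes the runtime returns the real 3xx response for redirect: 'manual'.
async function resolveOnce(url) {
  let res = await fetch(url, {
    method: 'HEAD',
    redirect: 'manual',
    signal: AbortSignal.timeout(5000),
  })
  return locationOf(res, url)
}
```

In the loop, `res.url` would then become the result of `await resolveOnce(url)`, with a `null` result treated like any other failed lookup.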

I'm still wondering why tcp connect and dns errors seem to reject the fetch promise, when the docs claim it will always resolve...
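
For what it's worth, the Fetch standard has `fetch()` reject with a `TypeError` on network-level failures (DNS lookups, refused or failed connections) and resolve for any HTTP response, including a 404 or 503. An aborted request rejects with the signal's abort reason, which for `AbortSignal.timeout()` is a `DOMException` named `TimeoutError`. Here is a small sketch of telling those cases apart in the catch block; `classifyFetchError` is a made-up name and the labels it returns are arbitrary.

```javascript
// Hypothetical helper for the catch block: label why fetch() rejected.
function classifyFetchError(error) {
  // AbortSignal.timeout() aborts with a DOMException named "TimeoutError"
  if (error?.name === 'TimeoutError') return 'timeout'
  // a plain AbortController.abort() defaults to "AbortError"
  if (error?.name === 'AbortError') return 'aborted'
  // the Fetch standard reports network-level failures as TypeError
  if (error instanceof TypeError) return 'network'
  return 'unknown'
}
```

In the script above, `console.dir(error)` could then be shortened to something like `console.log("skipping %s (%s)", url, classifyFetchError(error))`.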
