@starbuck93
Last active February 15, 2023 17:12
Regex sed command to remove a bunch of junk from new Google Sites takeout data

The purpose of these sed commands is to strip the extra HTML from my Google Takeout export of a "new" Google Site so I can eventually import it into Outline Wiki.

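Each command first reads the whole file into sed's pattern space with the loop :a;N;$!ba, so the substitution can match across line breaks. The first command matches everything from the beginning of the file (^.*) through the last </header> and replaces it with nothing. The second matches from the first <script to the end of the file (<script.*$) and deletes that as well.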

Once I import the files into Outline Wiki, I'll re-link any pictures and rebuild the sidebar structure from Google Sites. Removing all the extra HTML first makes this much faster.

gsed -i ':a;N;$!ba;s:^.*</header>::' *.html
gsed -i ':a;N;$!ba;s:<script.*$::' *.html
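
A minimal sketch for trying the same substitutions on a throwaway file before running them in place on a real export. The sample HTML, the workdir and SED variables, and the gsed-or-sed fallback are assumptions for illustration, not part of the original commands. gsed is the name GNU sed gets when installed through Homebrew on macOS; on most Linux systems plain sed is already GNU sed. Using -i.bak instead of -i keeps a backup copy next to the edited file.

```shell
# Use gsed if present (Homebrew on macOS), otherwise fall back to sed.
# Assumption: whichever one is found is GNU sed.
SED="$(command -v gsed || command -v sed)"

# Hypothetical sample page standing in for one file of a Takeout export.
workdir="$(mktemp -d)"
cat > "$workdir/sample.html" <<'EOF'
<html><head><title>Demo</title></head>
<body><header>
<nav>site menu</nav>
</header>
<h1>Page title</h1>
<p>Body text.</p>
<script>var junk = 1;</script>
</body></html>
EOF

# Read the whole file into the pattern space once, then apply both
# substitutions. -i.bak writes the original to sample.html.bak.
"$SED" -i.bak \
  -e ':a;N;$!ba' \
  -e 's:^.*</header>::' \
  -e 's:<script.*$::' \
  "$workdir/sample.html"

# Show what is left: the heading and paragraph, without header or scripts.
cat "$workdir/sample.html"
```

Combining both substitutions in one invocation means only one backup is written per file. Note that the N loop needs at least two lines of input: with GNU sed, a file that is a single line is printed unchanged because N exits before the substitutions run.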