@mookieblaylock
Created March 2, 2020 19:42
garbled_domains
2019-02-13,domains,https://uk.godaddy.com/domains,0.0,1.0,0.0,154.0,https://uk.godaddy.com/,worldwide,desktop,
@NathanHowell

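these are valid UTF-8 characters \u2229\u2557\u2510 (∩╗┐), not a BOM. they are what the BOM bytes EF BB BF turn into when read as CP437 and re-encoded as UTF-8, so it looks like a bug upstream. the easiest way to discard them is to use iconv -c with us-ascii as the target: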

$ iconv -c -f utf-8 -t us-ascii < row_bom.txt
2019-02-13,domains,https://uk.godaddy.com/domains,0.0,1.0,0.0,154.0,https://uk.godaddy.com/,worldwide,desktop,%
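
if you'd rather not drop every non-ASCII character in the file, a narrower option is to strip only the garbled prefix. a minimal Python sketch, assuming the three characters only ever show up at the start of a line; the function names are made up for illustration, and `ascii_only` is only a rough stand-in for the iconv call above:

```python
# Bytes EF BB BF (the UTF-8 BOM) decoded as CP437 give these three characters.
GARBLED_BOM = "\u2229\u2557\u2510"
REAL_BOM = "\ufeff"


def strip_leading_bom(line: str) -> str:
    """Remove a real or mis-decoded BOM from the start of a line, if present."""
    for prefix in (GARBLED_BOM, REAL_BOM):
        if line.startswith(prefix):
            return line[len(prefix):]
    return line


def ascii_only(line: str) -> str:
    """Drop every non-ASCII character, roughly what `iconv -c -t us-ascii` does."""
    return line.encode("ascii", errors="ignore").decode("ascii")
```

`strip_leading_bom` leaves Unicode in other columns alone, `ascii_only` does not.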

if other columns do contain Unicode, you'd probably want to strip the non-ASCII chars from just the affected column (assuming Unicode domains are punycoded), either in the source query or in an ETL step. if there is a mix of punycoded and UTF-8 domain names, I'd write a Python script to normalize them.
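
a rough sketch of what that normalization script could look like, assuming the values are full URLs with a scheme and that you want everything in punycode form. `to_punycode_url` is a hypothetical helper name. it uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules, and it drops any `user:password@` part of the URL, so treat it as a starting point rather than a finished tool:

```python
from urllib.parse import urlsplit, urlunsplit


def to_punycode_url(url: str) -> str:
    """Return the URL with its hostname converted to ASCII (punycode) form."""
    parts = urlsplit(url)
    host = parts.hostname  # lowercased; None when the URL has no netloc
    if host is None:
        return url
    # ASCII labels (including existing xn-- labels) pass through unchanged.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{ascii_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

already-punycoded and plain ASCII hosts come out unchanged, so it should be safe to run over a mixed column.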