Skip to content

Instantly share code, notes, and snippets.

@JeffRDay
Last active May 10, 2024 09:17
Show Gist options
  • Save JeffRDay/b38cf432c2a240f6adeec078c8d4e085 to your computer and use it in GitHub Desktop.
Save JeffRDay/b38cf432c2a240f6adeec078c8d4e085 to your computer and use it in GitHub Desktop.
Find Non-UTF8 Encoded Characters using Visual Studio Code (VS Code)

I ran into a problem with python where a file I wanted to read in and parse contained unexpected non-UTF-8 encoded characters. I am certain there are many ways to solve this problem, but capturing my quick and dirty appraoch below for posterity.

  1. Open the file and the Open Find
  2. In find, copy/paste the regex below:
^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$
  1. VS Code will highlight the lines MATCHING UTF encoded characters. So, you just have to skim the file looking for lines without highlighting.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment