Skip to content

Instantly share code, notes, and snippets.

View JeffRDay's full-sized avatar

Jeff Day JeffRDay

View GitHub Profile
@JeffRDay
JeffRDay / Note.md
Last active May 10, 2024 09:17
Find Non-UTF8 Encoded Characters using Visual Studio Code (VS Code)

I ran into a problem with python where a file I wanted to read in and parse contained unexpected non-UTF-8 encoded characters. I am certain there are many ways to solve this problem, but capturing my quick and dirty appraoch below for posterity.

  1. Open the file and the Open Find
  2. In find, copy/paste the regex below:
^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$
  1. VS Code will highlight the lines MATCHING UTF encoded characters. So, you just have to skim the file looking for lines without highlighting.