poychang/Note.md

## Note.md

I ran into a problem with Python where a file I wanted to read in and parse contained unexpected non-UTF-8 encoded characters. I am certain there are many ways to solve this problem, but I am capturing my quick and dirty approach below for posterity.

1. Open the file in VS Code, then open Find.
2. In Find, turn on regular expression mode and copy/paste the regex below:

^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$

VS Code will highlight the lines that MATCH, meaning lines made up entirely of valid UTF-8 encoded characters. So you just have to skim the file looking for lines without highlighting; those are the ones containing the offending bytes.
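Since the original problem came up in Python, the same check can also be scripted instead of skimmed by eye. The following is only a minimal sketch, not part of the original approach: it compiles the regex above as a bytes pattern (so each `\xNN` means one raw byte) and reports the 1-based numbers of lines that do NOT match. The names `UTF8_LINE`, `find_non_utf8_lines`, and `find_non_utf8_lines_in_file` are made up for illustration, and it assumes lines are separated by `\n` and that the file fits in memory.

```python
import re

# Same pattern as the regex above, written as a bytes regex so that \xNN
# refers to a raw byte. fullmatch() below takes the place of the ^...$ anchors.
UTF8_LINE = re.compile(
    rb"(?:[\x00-\x7F]"                  # ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"         # 2-byte sequences
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"     # 3-byte, excluding overlong forms
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"      # 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"     # 3-byte, excluding surrogates
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"      # 3-byte
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"  # 4-byte, excluding overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"      # 4-byte
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})*"  # 4-byte, up to U+10FFFF
)


def find_non_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad_lines = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if UTF8_LINE.fullmatch(line) is None:
            bad_lines.append(number)
    return bad_lines


def find_non_utf8_lines_in_file(path: str) -> list[int]:
    """Hypothetical convenience wrapper: read the file as raw bytes and check it."""
    with open(path, "rb") as handle:
        return find_non_utf8_lines(handle.read())
```

Calling `find_non_utf8_lines_in_file("data.txt")` (with a placeholder file name) would give the line numbers to inspect. Reading in binary mode matters here: opening the file in text mode would raise `UnicodeDecodeError` or silently substitute characters before the check ever runs.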