
@webserveis
Created March 20, 2019 16:17
Jsoup snippets

JSOUP PARSERS

CHARACTER ENCODING

Charset detection

Dependency: implementation 'com.ibm.icu:icu4j-charset:63.1'

Usage:

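// setText(InputStream) needs a stream that supports mark/reset (e.g. BufferedInputStream); detect() may return null.
CharsetMatch charsetMatch = new CharsetDetector().setText(bodyStream).detect();
Log.d(TAG, "crawl: charsetMatch=" + (charsetMatch != null ? charsetMatch.getName() : "unknown"));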

Dependency: juniversalchardet, a Java port of Mozilla's universal charset detector. Pages that declare <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' /> are detected as Western European (Windows-1252).

implementation 'com.github.albfernandez:juniversalchardet:2.3.0'

Usage:

// detectCharset returns null when no encoding could be determined.
String encoding = UniversalDetector.detectCharset(bodyStream);
Log.d(TAG, "crawl: UniversalDetector.detectCharset=" + encoding);
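The Windows-1252 result noted above for pages that declare iso-8859-1 is usually harmless, because the two encodings agree on most bytes and differ only in the 0x80 to 0x9F range. The following is a minimal sketch using only the JDK; the class name `Latin1VsCp1252` and the helper `decode` are illustrative and not part of either detection library.

```java
import java.nio.charset.Charset;

class Latin1VsCp1252 {
    // Decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x80 is the euro sign in Windows-1252 but a C1 control code in ISO-8859-1;
        // 0xE9 is e-acute in both encodings.
        byte[] bytes = {(byte) 0x80, (byte) 0xE9};
        for (String name : new String[] {"ISO-8859-1", "windows-1252"}) {
            String s = decode(bytes, name);
            // Print code points rather than glyphs so the console encoding does not matter.
            System.out.printf("%s -> U+%04X U+%04X%n", name, (int) s.charAt(0), (int) s.charAt(1));
        }
    }
}
```

Decoding as Windows-1252 therefore tends to be the safer choice for such pages, since it keeps characters like the euro sign or curly quotes that would otherwise turn into control codes.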

Parsing with the detected charset

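// Execute once and reuse the stream; a second execute() would issue a new request.
Connection.Response response = connectionJsoup.execute();
BufferedInputStream bodyStream = response.bodyStream();
Document document = Jsoup.parse(bodyStream, charsetMatch.getName(), url); // charsetMatch comes from the detection step above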
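One thing to watch when detecting and then parsing from the same stream: a detector that reads directly from the stream may consume bytes the parser still needs. A possible workaround, sketched below with only the JDK, is to peek at a bounded sample using mark/reset and hand that sample to the detector. The class `StreamPeek` and the method `peek` are hypothetical names, not part of Jsoup, ICU4J, or juniversalchardet.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StreamPeek {
    // Read up to `limit` bytes for detection, then rewind so the whole stream can still be parsed.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        int n;
        while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
            total += n;
        }
        in.reset(); // back to the marked position
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body; a real one would come from the HTTP response.
        byte[] body = "<html><body>hola</body></html>".getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(body));
        byte[] sample = peek(in, 8);
        System.out.println(new String(sample, StandardCharsets.ISO_8859_1));
        System.out.println((char) in.read()); // the stream starts from the beginning again
    }
}
```

The sample can then be passed to a byte-array entry point such as ICU4J's `CharsetDetector.setText(byte[])`, leaving `bodyStream` untouched for `Jsoup.parse`. The sample size is a trade-off: a few kilobytes is an assumption that often suffices, but pages whose non-ASCII text appears late may need more.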