@shuiRong
Created June 27, 2023 09:54
Crawl a page and decode its HTML using the charset the page declares
import { JSDOM } from 'jsdom';
import got from 'got';
import iconv from 'iconv-lite';
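// Target page; the same value is passed to JSDOM below as the document URL
const url = '........';
const response = got({
  url,
  // This Content-Type is a request header; it does not change how the response is encoded
  headers: {
    'Content-Type': 'text/html; charset=utf-8'
  },
  // got rejects on 4xx/5xx by default; disable that so the status check below can run
  throwHttpErrors: false,
});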
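// If the request failed
const { statusCode: status_code, body } = await response;
if (!status_code || status_code >= 400) {
  // Do something. A bare `return` is a syntax error at module top level, so throw instead
  throw new Error(`Request failed with status ${status_code}`);
}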
let doc = new JSDOM(body, { url });
const { document } = doc.window;
// Prefer <meta charset>, then fall back to the charset in <meta http-equiv="Content-Type">
const charset = (
  document.querySelector('meta[charset]')?.getAttribute('charset') ||
  document.querySelector('meta[http-equiv="Content-Type"]')?.getAttribute('content')?.split('charset=')[1]
)?.trim().toLowerCase();
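// If the detected page encoding is not utf-8, try converting it with iconv first
let decodedBody = body;
if (charset !== 'utf-8') {
  const buffer = await response.buffer();
  // iconv.decode is synchronous and throws on unknown encodings,
  // so fall back to utf-8 when the charset is missing or unsupported
  const encoding = charset && iconv.encodingExists(charset) ? charset : 'utf-8';
  decodedBody = iconv.decode(buffer, encoding);
  doc = new JSDOM(decodedBody, { url });
}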
console.log('decodedBody', decodedBody)
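
The snippet above parses the page twice when the charset is not utf-8: once to find the charset and once after decoding. As a rough alternative, the charset can be sniffed from the raw bytes before any DOM is built. The sketch below is only an illustration under a few assumptions: `sniffCharset` and `decodeHtml` are hypothetical helper names (not part of jsdom, got, or iconv-lite), only the first 1024 bytes are scanned, and decoding uses Node's built-in `TextDecoder`, which supports legacy encodings such as gbk only when Node is built with full ICU.

```javascript
// Hypothetical helper: look for a charset declared in the first bytes of an HTML document.
// The markup that declares a charset is ASCII-compatible, so latin1 is safe for scanning.
function sniffCharset(buffer) {
  const head = Buffer.from(buffer).subarray(0, 1024).toString('latin1');
  // Matches both <meta charset="..."> and <meta http-equiv="Content-Type" content="...; charset=...">
  const match = head.match(/<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)/i);
  return match ? match[1].toLowerCase() : undefined;
}

// Hypothetical helper: decode the bytes with the sniffed charset, or with the fallback
// when no charset is declared or the runtime does not support the declared one.
function decodeHtml(buffer, fallback = 'utf-8') {
  const label = sniffCharset(buffer) || fallback;
  try {
    return new TextDecoder(label).decode(buffer);
  } catch {
    // TextDecoder throws a RangeError for labels it does not support
    return new TextDecoder(fallback).decode(buffer);
  }
}

// Usage: decode a small in-memory page
const page = Buffer.from('<meta charset="utf-8"><p>hello</p>', 'utf8');
console.log(sniffCharset(page), decodeHtml(page));
```

To use this with the code above, the request would need to return raw bytes, for example by passing `responseType: 'buffer'` to got, and the decoded string could then be handed to a single `new JSDOM(...)` call. `iconv.decode` can be substituted for `TextDecoder` where broader encoding coverage is needed.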