Darkenetor/00_TumblrPonyList_README.md Secret

## 00_TumblrPonyList_README.md

      
New Pony Tumblrs list


Priority list: https://gist.github.com/Darkenetor/10e03a4ebe42b9fec6c723ea3e8d75b5
Additional list with a high-ish count of false positives (non-pony blogs): https://gist.github.com/Darkenetor/4620ad4a18ebac7394a48d07f30e40fe

Contains every list posted so far; ping @Darkenetor#4056 with new links and I'll update it.
Every commit only appends to the end, so it's safe to fetch just the last rows if you're going without ignorelists for some reason. Otherwise, don't forget to randomize them.
The first three commits are derpibooru, derpibooru_domains and Rome's last zip sent here, but they're slightly better deduplicated, so use the commit line count instead of the length of the files you already have.
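A minimal sketch of the "grab only the last rows, then randomize" idea, assuming the list is plain text with one blog per line and that you know how many lines you already processed. `newRows` and `shuffle` are hypothetical helper names, not part of the scripts in this gist.

```javascript
"use strict"

// Hypothetical helper: returns only the rows after the first `knownCount` non-empty lines.
// Relies on the lists being append-only.
const newRows = (listText, knownCount) =>
	listText
		.split('\n')
		.map( l => l.trim() )
		.filter( l => l !== '' )
		.slice( knownCount )

// Fisher-Yates shuffle on a copy; `random` is injectable so the order can be made repeatable.
const shuffle = (rows, random = Math.random) => {
	const out = [...rows]
	for ( let i = out.length - 1; i > 0; i-- ) {
		const j = Math.floor( random() * (i + 1) )
		;[out[i], out[j]] = [out[j], out[i]]
	}
	return out
}

// Inline sample data standing in for a downloaded list.
const list = 'a.tumblr.com\nb.tumblr.com\nc.tumblr.com\nd.tumblr.com\n'
console.log( shuffle( newRows(list, 2) ) )	// c and d, in random order
```

Blank lines are dropped before counting here, so the count may differ from a raw line count if the file contains empty rows.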
Clone the gists' repos and check commit messages for info on the sources. Currently data is from:

Derpibooru: official dump from source_urls, descriptions from Twi-Hard's archive: https://derpibooru.org/tumblr_domains.txt https://derpibooru.org/tumblrs.txt
Fimfiction: Sir Inrix's search index for Fimfarchive, Google searches for links outside stories

Google: site:https://www.fimfiction.net/user/ "tumblr"
Google: site:https://www.fimfiction.net/user/ inurl:/about "tumblr:"
Google: site:https://www.fimfiction.net/user/ inurl:/about "patreon:"
[...$$('.srg .r a[onmousedown]:not([class])')].map( a => a.href ).join('\n')
while read -r l; do wget -e robots=off -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' "$l"; done < ../fimfic.txt


Twi-Hard's dump of Tumblrpony.wikia.com and MLPFanart.wikia.com

wget --mirror -e robots=off --accept-regex '(\.(html|php)|(\/|^)[^.?]*)$' --reject-regex '(\.\w+/\w+-\w+/wiki/|(Special|MediaWiki|Help|User|User_\w*|Template|Blog|File|Forum|Talk|\w+_Wiki):)' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'tumblrpony.wikia.com'
wget --mirror -e robots=off --accept-regex '(\.(html|php)|(\/|^)[^.?]*)$' --reject-regex '(\.\w+/\w+-\w+/wiki/|(Special|MediaWiki|Help|User|User_\w*|Template|Blog|File|Forum|Talk|\w+_Wiki):)' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'mlpfanart.fandom.com'


Small TVTropes excerpt

wget --mirror -e robots=off --accept-regex '(\.(html)|(\/|^)[^.]*)(\?.*)?$' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'https://tvtropes.org/pmwiki/pmwiki.php/Blog/AskAPony' -l 1 # extra folders manually removed


Redgetrek's Tumblr: http://redgetrek.tumblr.com/post/10693037093/top-original-art-pony-tumblrs

[...$$('.bodytype li')].map( e => e.innerText.trim().replace(/\s.*/, '') + '.tumblr.com' ).filter( e => /[a-z]/.test(e[0]) ).join('\n')


Up-to-date EquestriaDaily.com and Horse-News.org dumps

wget --mirror -e robots=off --wait 0.25 -A html -X 'search/label' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'https://www.equestriadaily.com/?m=1'
wget --mirror -e robots=off --wait 0.25 -A html -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'https://www.horse-news.org/'


13k YT channels from EqD spotlights and from Twi-Hard's Google searches for playlists (about half of the latter are equitation races and Elsagate, but those don't have linked Tumblrs; this source makes up most of the low accuracy list): https://gist.github.com/Darkenetor/e058db7d16a006daf665504fc77aae29

[...new Set( Array.from($$('.pl-video-title a[href*="/channel/"], .pl-video-title a[href*="/user/"]')).map( e => e.href ) )].sort().join('\n')


Patreon: profiles gathered from above sources

Scripts are included below for future reference; ping me if you see issues there.
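For a rough picture of what the scripts do with scraped links, here is a much simplified sketch of collapsing a few common Tumblr URL shapes into `name.tumblr.com` and deduplicating the result. `toBlogHost`, `dedupeBlogs` and the sample URLs are made up for illustration; the real scripts use far broader regexes and also handle custom domains, which this sketch ignores.

```javascript
"use strict"

// Path segments that are site pages rather than blog names (illustrative, not exhaustive).
const reserved = ['post', 'image', 'blog', 'follow']
// Subdomains that serve assets rather than blogs.
const assetHosts = ['media', 'static', 'data']

// Hypothetical normalizer: returns "name.tumblr.com", or null if the URL is not recognised.
const toBlogHost = raw => {
	const url = raw
		.trim()
		.toLowerCase()
		.replace( /^(?:[a-z]+:)?\/\//, '' )	// drop the scheme, if any
		.replace( /^www\./, '' )

	// name.tumblr.com/anything -> name.tumblr.com
	const sub = url.match( /^([\w-]+)\.tumblr\.com(?:[/?#]|$)/ )
	if ( sub )
		return assetHosts.includes( sub[1] ) ? null : `${sub[1]}.tumblr.com`

	// tumblr.com/blog/name or tumblr.com/name -> name.tumblr.com
	const path = url.match( /^tumblr\.com\/(?:blog\/)?([\w-]+)/ )
	if ( path && !reserved.includes( path[1] ) )
		return `${path[1]}.tumblr.com`

	return null
}

const dedupeBlogs = urls =>
	[...new Set( urls.map( toBlogHost ).filter( Boolean ) )].sort()

console.log( dedupeBlogs([
	'http://Example-Pony.tumblr.com/post/123/some-title',
	'https://www.tumblr.com/blog/example-pony',
	'tumblr.com/another-pony',
	'https://media.tumblr.com/tumblr_abc.png',
	'https://example.com/not-tumblr',
]) )
// -> [ 'another-pony.tumblr.com', 'example-pony.tumblr.com' ]
```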
Priority shortcut lists


Custom domains: ArchiveTeam/tumblr-grab#13

https://gist.githubusercontent.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/tumblr_domains.txt
High accuracy Tumblrs outside of the Derpibooru list

https://gist.githubusercontent.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/afterderpi.txt
Low accuracy Tumblrs outside of the Derpibooru list

https://gist.githubusercontent.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/afterderpi_lowpony.txt
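If you want a single work queue out of these three lists, one possible approach is sketched below. It assumes the order above (custom domains, then high accuracy, then low accuracy) is the order you want to process them in, and that you already have the list contents as text. `buildQueue` and the sample entries are hypothetical.

```javascript
"use strict"

// Hypothetical helper: walks the lists in the given order and keeps the first occurrence
// of each entry, skipping anything already present in `ignore`.
const buildQueue = (lists, ignore = []) => {
	const seen = new Set( ignore.map( e => e.trim().toLowerCase() ) )
	const queue = []
	for ( const text of lists )
		for ( const line of text.split('\n') ) {
			const entry = line.trim().toLowerCase()
			if ( entry === '' || seen.has( entry ) )
				continue
			seen.add( entry )
			queue.push( entry )
		}
	return queue
}

// Inline sample data standing in for the three downloaded files.
const domains = 'askexample.com\nponyblog.example.net'
const afterderpi = 'alpha.tumblr.com\nbeta.tumblr.com'
const lowpony = 'beta.tumblr.com\ngamma.tumblr.com'
const ignore = ['alpha.tumblr.com']	// e.g. blogs you already grabbed

console.log( buildQueue( [domains, afterderpi, lowpony], ignore ) )
// -> [ 'askexample.com', 'ponyblog.example.net', 'beta.tumblr.com', 'gamma.tumblr.com' ]
```

Order inside each list is preserved here; shuffle each list first if you need randomized rows.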

Misc

Probably not up to date.
twkr's Stage1 status checker: https://pastebin.com/kDa8ij6j

https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage1_domains.txt
https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage1_afterderpi.txt
https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage1_ad_lowpony.txt

twkr's Stage2 status checker: https://pastebin.com/ameY8a6m

https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage2_alive_domains.txt
https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage2_alive_afterderpi.txt
https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage2_alive_ad_lowpony.txt


## 0_prepare.js
"use strict"

const hosts = [
	['bandcamp', [/[^/\s.]+\.bandcamp.com/gi]],
	['gdocs', [/\bdocs\.google\.com\/.+/gi]],
	['gdrive', [/\bdrive\.google\.com\/.+/gi]],
	['mediafire', [/.*\bmediafire.com\b.*/gi]],
	['mega', [/\bmega(\.co)?\.nz\/.+/gi]],
	['soundcloud', [/\b(?<!api\.)soundcloud\.com\/[^/\s]+/gi]],
	['tumblr', [/(((?<=[^\w-])|^)(?!(media|static|data|www)\.)[\w-]+\.|\b)tumblr\.com(\/((?!post|blog|follow)[\w-]+|(blog|follow)\/[\w-]+))?\b/gi]],
	['wordpress', [/[\w-]+\.wordpress\.com\b/gi]],
	['blogger', [/[\w-]+\.blogger\.com\b/gi]],
	['patreon', [/\bpatreon\.com\/[\w-]+\b/gi]],
	['twitter', [/\btwitter\.com\/[\w-]+\b/gi]],
	['goo.gl', [/\bgoo\.gl\/[\w]+\b/gi]],
	['discord', [/\bdiscord\.gg\/[\w]+\b/gi]],
	['bit.ly', [/\bbit\.ly\/[\w]+\b/gi]],
]


const fs = require('fs')
const readline = require('readline')
const path_module = require('path')
const IO = {
	getDataSync : srcPath =>
			fs.readFileSync( srcPath, 'utf8', err => {
				if ( err ) throw Error( err )
			}),
	writeDataSync : (dstPath, d, cb = ()=>{}) =>
		fs.writeFileSync( dstPath, d, err => {
			if ( err ) throw Error( err )
			return cb(d)
		}),
	readdirTreeSync : function (dirPath, startDir_ = __dirname, files_ = []) {
		if (! path_module.isAbsolute(dirPath) )
			dirPath = path_module.join(__dirname, dirPath)
		return this._readdirTreeSync(dirPath, startDir_, files_)
			.map( e => path_module.relative(dirPath, e) )
	},
	// http://stackoverflow.com/a/20525865
	_readdirTreeSync : function _s (dirPath, startDir_, files_) {
		let files = fs.readdirSync(dirPath)
		for (let i in files) {
			const name = path_module.join( dirPath, path_module.relative(startDir_, files[i]) )

			if ( fs.statSync(name).isDirectory() )
				_s(name, startDir_, files_)
			else
				files_.push(name)
		}
		return files_
	}
}
Object.defineProperties( Object.prototype, {
	_pipe : { value :
		function (f) { return (
			f( this )
		)}
	},
	_fork : { value :
		function (f) { return (
			f( /*JSON.parse(JSON.stringify(*/ this /*))*/ ),	//TODO fix deep copy for arrays
			this
		)}
	},
})
Object.getOwnPropertyNames( Array.prototype ).forEach( m =>
	(! Object.prototype[m] ) ? (
		Object.defineProperty( Object.prototype, m, {
			value : Array.prototype[m],
		})
	):null
)
const flatMap = f => d =>
	[].concat( ...( d.map(f) ) )
const naturalSort = function _s (a, b) {
	if ( a === b )
		return 0
	_s.intlSort = _s.intlSort || ( new Intl.Collator() ).compare
	const [a2, b2] = [a, b].map( s => s
		.replace( /[^\wÀ-ſ\s]+/g, '' )
		.toLowerCase()
	)
	if ( a2 !== b2 )
		return [a2, b2].sort( _s.intlSort )[0] === a2 ? -1 : 1
	return [a, b].sort( _s.intlSort )[0] === a ? -1 : 1
}

const invalidURLchar = /[^\w\-.~:/#@$&*+=%',;!?]/g
const schemeURLs = [
		/(?:(?<=[/.,:;!?'%*\-\s])|^)www(?:\.\w{2,})+\.[a-zA-Z]{2,}/i,
		/https?:\/\/\w{2,}/i,
		/(?:\w{2,}\.)+[a-zA-Z]{2,}(?:[/.,:;!?'%*\-\s]|$)/i,
	]
const reSchemeURL = `(?:${ schemeURLs.map( e => e.source).join('|') })`
const reSplit = /(([\s=<]\\?"|"[>\s])|[\s\^\\<>]|[^!-~])+/
const URLClean = url => url
		.replace( new RegExp( `.*?(${reSchemeURL}.*)`, 'i'), '$1' )
		.replace( new RegExp( `(.*?)(?:${invalidURLchar.source}|#(?!!)).*`, 'i'), '$1' )
		.replace( new RegExp( `.*${invalidURLchar.source}(.*)`, 'ig'), '$1' )
		.replace( /(.+?)(?:[.,:;!?'%*\-\s]+|_{2,})$/, '$1' )
		.trim()
		.replace( /(?:^|=)https?(?::|%3a)(?:\/|%2f){2}(.*)/i, '$1' )	//FIXME fix explicit redirects and clean up rest of the line
const URLFilter = e =>
	!!(false
		|| /\S/.test(e)
		&& !~e.indexOf('..')
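		&& !( new RegExp( invalidURLchar.source ) ).test(e)	// fresh non-global copy: test() on a /g regex keeps lastIndex between calls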
	)

let iTMP = true


;( () => {
	IO.readdirTreeSync( './data/huge' )
		.reduce( (acc, f, i) => {
			// if ( i < -1 ) return acc	// DEBUG
			if ( i % 500 === 0 ) {
				console.log( i )
				IO.writeDataSync( `./data/descriptions/__huge_TMP_${+(iTMP = !iTMP)}.txt`, [...acc].join('\n\n') )	// alternate between two checkpoint files
				// IO.writeDataSync( `./data/descriptions/__huge_TMP_${i}.txt`, acc.join('\n\n') )
			}

			IO.getDataSync( './data/huge/' + f )
				.split('\n')
				.filter( e => !!(false
					|| !e.includes( '<script' )
					&& !e.includes( 'window["ytInitialData"]' )
					// && ( new RegExp( `[^-](${hosts.map( e => e[0].replace(/\./g, '.') ).join('|')})[.:\s]`, 'i') ).test(e)
					&& /[^-](tumblr|patreon)([:\s]|\.com)/i.test(e)
				))
				.forEach( e => acc.add(e) )
			return acc
		}, new Set() )
		._pipe( e => [...e] )
		.join('\n\n')
		._fork( d => IO.writeDataSync( './data/descriptions/__huge.txt', d ) )
})()

## 1_extract.js
"use strict"

const hosts = [
	['bandcamp', [/[^/\s.]+\.bandcamp.com/gi]],
	['gdocs', [/\bdocs\.google\.com\/.+/gi]],
	['gdrive', [/\bdrive\.google\.com\/.+/gi]],
	['mediafire', [/.*\bmediafire.com\b.*/gi]],
	['mega', [/\bmega(\.co)?\.nz\/.+/gi]],
	['soundcloud', [/\b(?<!api\.)soundcloud\.com\/[^/\s]+/gi]],
	['tumblr', [/(((?<=[^\w-])|^)(?!(media|static|data|www)\.)[\w-]+\.|\b)tumblr\.com(\/((?!post|blog|follow)[\w-]+|(blog|follow)\/[\w-]+))?\b/gi]],
	['wordpress', [/[\w-]+\.wordpress\.com\b/gi]],
	['blogger', [/[\w-]+\.blogger\.com\b/gi]],
	['patreon', [/\bpatreon\.com\/[\w-]+\b/gi]],
	['twitter', [/\btwitter\.com\/[\w-]+\b/gi]],
	['goo.gl', [/\bgoo\.gl\/[\w]+\b/gi]],
	['discord', [/\bdiscord\.gg\/[\w]+\b/gi]],
	['bit.ly', [/\bbit\.ly\/[\w]+\b/gi]],
]


const fs = require('fs')
const readline = require('readline')
const path_module = require('path')
const IO = {
	getDataSync : srcPath =>
			fs.readFileSync( srcPath, 'utf8', err => {
				if ( err ) throw Error( err )
			}),
	writeDataSync : (dstPath, d, cb = ()=>{}) =>
		fs.writeFileSync( dstPath, d, err => {
			if ( err ) throw Error( err )
			return cb(d)
		}),
	readdirTreeSync : function (dirPath, startDir_ = __dirname, files_ = []) {
		if (! path_module.isAbsolute(dirPath) )
			dirPath = path_module.join(__dirname, dirPath)
		return this._readdirTreeSync(dirPath, startDir_, files_)
			.map( e => path_module.relative(dirPath, e) )
	},
	// http://stackoverflow.com/a/20525865
	_readdirTreeSync : function _s (dirPath, startDir_, files_) {
		let files = fs.readdirSync(dirPath)
		for (let i in files) {
			const name = path_module.join( dirPath, path_module.relative(startDir_, files[i]) )

			if ( fs.statSync(name).isDirectory() )
				_s(name, startDir_, files_)
			else
				files_.push(name)
		}
		return files_
	}
}
Object.defineProperties( Object.prototype, {
	_pipe : { value :
		function (f) { return (
			f( this )
		)}
	},
	_fork : { value :
		function (f) { return (
			f( /*JSON.parse(JSON.stringify(*/ this /*))*/ ),	//TODO fix deep copy for arrays
			this
		)}
	},
})
Object.getOwnPropertyNames( Array.prototype ).forEach( m =>
	(! Object.prototype[m] ) ? (
		Object.defineProperty( Object.prototype, m, {
			value : Array.prototype[m],
		})
	):null
)
const flatMap = f => d =>
	[].concat( ...( d.map(f) ) )
const naturalSort = function _s (a, b) {
	if ( a === b )
		return 0
	_s.intlSort = _s.intlSort || ( new Intl.Collator() ).compare
	const [a2, b2] = [a, b].map( s => s
		.replace( /[^\wÀ-ſ\s]+/g, '' )
		.toLowerCase()
	)
	if ( a2 !== b2 )
		return [a2, b2].sort( _s.intlSort )[0] === a2 ? -1 : 1
	return [a, b].sort( _s.intlSort )[0] === a ? -1 : 1
}

const invalidURLchar = /[^\w\-.~:/#@$&*+=%',;!?]/g
const schemeURLs = [
		/(?:(?<=[/.,:;!?'%*\-\s])|^)www(?:\.\w{2,})+\.[a-zA-Z]{2,}/i,
		/https?:\/\/\w{2,}/i,
		/(?:\w{2,}\.)+[a-zA-Z]{2,}(?:[/.,:;!?'%*\-\s]|$)/i,
	]
const reSchemeURL = `(?:${ schemeURLs.map( e => e.source).join('|') })`
const reSplit = /(([\s=<]\\?"|"[>\s])|[\s\^\\<>]|[^!-~])+/
const URLClean = url => url
		.replace( new RegExp( `.*?(${reSchemeURL}.*)`, 'i'), '$1' )
		.replace( new RegExp( `(.*?)(?:${invalidURLchar.source}|#(?!!)).*`, 'i'), '$1' )
		.replace( new RegExp( `.*${invalidURLchar.source}(.*)`, 'ig'), '$1' )
		.replace( /(.+?)(?:[.,:;!?'%*\-\s]+|_{2,})$/, '$1' )
		.trim()
		.replace( /(?:^|=)https?(?::|%3a)(?:\/|%2f){2}(.*)/i, '$1' )	//FIXME fix explicit redirects and clean up rest of the line
const URLFilter = e =>
	!!(false
		|| /\S/.test(e)
		&& !~e.indexOf('..')
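		&& !( new RegExp( invalidURLchar.source ) ).test(e)	// fresh non-global copy: test() on a /g regex keeps lastIndex between calls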
	)


;( () => {
	try {
		return IO.getDataSync( './data/_descriptions.txt' )
	} catch (e) {
		return IO.readdirTreeSync( './data/descriptions' )
			// .filter( e => /\.description$/i.test(e) )
			.map( f => IO.getDataSync( './data/descriptions/' + f ) )
			.join('\n\n')
			// ._fork( d => IO.writeDataSync( './data/_descriptions.txt', d ) )
	}
})()
	._fork( d => d
		.split('\n\n')
		.filter( s => !!(false
			|| /\btumblr\b/.test(s)
			|| ( /\bblog\b/.test(s) && !/\b(wordpress|blogger)\b/.test(s) )
		))
		._pipe( d => flatMap( e => e.split('\n') )(d) )
		.filter( e => ( e.match( new RegExp( reSchemeURL, 'gi') ) ||[]).length === 1 )
		.join('\n')
		.split( reSplit )
		._pipe( d => [...new Set(d)] )
		.filter( e => e)
		.filter( s => schemeURLs.some( r => r.test(s) ) )
		.map( URLClean )
		._pipe( d => [...new Set(d)] )
		.sort()
		.filter( URLFilter )
		.filter( e => !hosts.some( h => e.search( h[1][0] ) !== -1 ) )	//TODO support fallback regexes; FIXME ???; search() avoids the stateful lastIndex of test() on /g regexes
		._fork( d => IO.writeDataSync( './data/hosts/tumblr2.txt', d.join('\n') ) )
	)
	// .split( /\s+/ )
	// .split( /([\s\^\\]|[^!-~])+/ )
	.split( reSplit )
	._pipe( d => [...new Set(d)] )
	.sort()
	.filter( s => schemeURLs.some( r => r.test(s) ) )
	// ._fork( d => IO.writeDataSync( './data/_filtered.txt', d.join('\n') ) )
	.map( URLClean )
	._pipe( d => [...new Set(d)] )
	.sort()
	.filter( URLFilter )
	.join('\n')
	// ._fork( d => IO.writeDataSync( './data/corrected.txt', d ) )
	._fork( d =>
		hosts.forEach( host =>
			d
				// ._fork( d => console.log( host ) )
				._pipe( s => ( s.match( host[1][0] ) ||[]) )	//TODO support fallback regexes
				._pipe( d => [...new Set(d)] )
				.sort()
				.join('\n')
				._fork( d => IO.writeDataSync( `./data/hosts/${host[0]}.txt`, d ) )
		)
	)

## 2_deduplicate_tumblrs.js
"use strict"

const hosts = [
	['bandcamp', [/[^/\s.]+\.bandcamp.com/gi]],
	['gdocs', [/\bdocs\.google\.com\/.+/gi]],
	['gdrive', [/\bdrive\.google\.com\/.+/gi]],
	['mediafire', [/.*\bmediafire.com\b.*/gi]],
	['mega', [/\bmega(\.co)?\.nz\/.+/gi]],
	['soundcloud', [/\b(?<!api\.)soundcloud\.com\/[^/\s]+/gi]],
	['tumblr', [/(((?<=[^\w-])|^)(?!(media|static|data|www)\.)[\w-]+\.|\b)tumblr\.com(\/((?!post|blog|follow)[\w-]+|(blog|follow)\/[\w-]+))?\b/gi]],
	['wordpress', [/[\w-]+\.wordpress\.com\b/gi]],
	['blogger', [/[\w-]+\.blogger\.com\b/gi]],
	['patreon', [/\bpatreon\.com\/[\w-]+\b/gi]],
	['twitter', [/\btwitter\.com\/[\w-]+\b/gi]],
	['goo.gl', [/\bgoo\.gl\/[\w]+\b/gi]],
	['discord', [/\bdiscord\.gg\/[\w]+\b/gi]],
	['bit.ly', [/\bbit\.ly\/[\w]+\b/gi]],
]


const fs = require('fs')
const readline = require('readline')
const path_module = require('path')
const IO = {
	getDataSync : srcPath =>
			fs.readFileSync( srcPath, 'utf8', err => {
				if ( err ) throw Error( err )
			}),
	writeDataSync : (dstPath, d, cb = ()=>{}) =>
		fs.writeFileSync( dstPath, d, err => {
			if ( err ) throw Error( err )
			return cb(d)
		}),
	readdirTreeSync : function (dirPath, startDir_ = __dirname, files_ = []) {
		if (! path_module.isAbsolute(dirPath) )
			dirPath = path_module.join(__dirname, dirPath)
		return this._readdirTreeSync(dirPath, startDir_, files_)
			.map( e => path_module.relative(dirPath, e) )
	},
	// http://stackoverflow.com/a/20525865
	_readdirTreeSync : function _s (dirPath, startDir_, files_) {
		let files = fs.readdirSync(dirPath)
		for (let i in files) {
			const name = path_module.join( dirPath, path_module.relative(startDir_, files[i]) )

			if ( fs.statSync(name).isDirectory() )
				_s(name, startDir_, files_)
			else
				files_.push(name)
		}
		return files_
	}
}
Object.defineProperties( Object.prototype, {
	_pipe : { value :
		function (f) { return (
			f( this )
		)}
	},
	_fork : { value :
		function (f) { return (
			f( /*JSON.parse(JSON.stringify(*/ this /*))*/ ),	//TODO fix deep copy for arrays
			this
		)}
	},
})
Object.getOwnPropertyNames( Array.prototype ).forEach( m =>
	(! Object.prototype[m] ) ? (
		Object.defineProperty( Object.prototype, m, {
			value : Array.prototype[m],
		})
	):null
)
const flatMap = f => d =>
	[].concat( ...( d.map(f) ) )
const naturalSort = function _s (a, b) {
	if ( a === b )
		return 0
	_s.intlSort = _s.intlSort || ( new Intl.Collator() ).compare
	const [a2, b2] = [a, b].map( s => s
		.replace( /[^\wÀ-ſ\s]+/g, '' )
		.toLowerCase()
	)
	if ( a2 !== b2 )
		return [a2, b2].sort( _s.intlSort )[0] === a2 ? -1 : 1
	return [a, b].sort( _s.intlSort )[0] === a ? -1 : 1
}

const invalidURLchar = /[^\w\-.~:/#@$&*+=%',;!?]/g
const schemeURLs = [
		/(?:(?<=[/.,:;!?'%*\-\s])|^)www(?:\.\w{2,})+\.[a-zA-Z]{2,}/i,
		/https?:\/\/\w{2,}/i,
		/(?:\w{2,}\.)+[a-zA-Z]{2,}(?:[/.,:;!?'%*\-\s]|$)/i,
	]
const reSchemeURL = `(?:${ schemeURLs.map( e => e.source).join('|') })`
const reSplit = /(([\s=<]\\?"|"[>\s])|[\s\^\\<>]|[^!-~])+/
const URLClean = url => url
		.replace( new RegExp( `.*?(${reSchemeURL}.*)`, 'i'), '$1' )
		.replace( new RegExp( `(.*?)(?:${invalidURLchar.source}|#(?!!)).*`, 'i'), '$1' )
		.replace( new RegExp( `.*${invalidURLchar.source}(.*)`, 'ig'), '$1' )
		.replace( /(.+?)(?:[.,:;!?'%*\-\s]+|_{2,})$/, '$1' )
		.trim()
		.replace( /(?:^|=)https?(?::|%3a)(?:\/|%2f){2}(.*)/i, '$1' )	//FIXME fix explicit redirects and clean up rest of the line
const URLFilter = e =>
	!!(false
		|| /\S/.test(e)
		&& !~e.indexOf('..')
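		&& !( new RegExp( invalidURLchar.source ) ).test(e)	// fresh non-global copy: test() on a /g regex keeps lastIndex between calls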
	)


IO.readdirTreeSync( './data/hosts_tumblr' )
	.filter( e => /tumblr[^/]*$/i.test(e) )
	.map( f => IO.getDataSync( './data/hosts_tumblr/' + f )
		.split('\n')
		.filter( e => ! /^(data|static|media)\.tumblr\.com/i.test(e) )
		.filter( e => ! /tumblr\.com\/\w{32}$/i.test(e) )
		.filter( e => ! /tumblr\.com\/tumblr_\w{19}_/i.test(e) )
		.map( e => e
			.toLowerCase()
			.replace( /^(.*\/\/)?(www\.)?/i, '' )
			.replace( /(?<=.\.tumblr\.com).+$/i, '' )
			.replace( /(?<=\btumblr\.com)\/(post|image)\/.*/i, '' )
			.replace( /^(tumblr\.com)\/(?:blog\/)?([^/]+).*?$/i, '$2.$1' )
			._pipe( e => {
				try {
					return decodeURI(e).trim()
				} catch (err) {
					// Malformed escape sequence: keep the undecoded string instead of returning the error object
					return e
				}
			}).replace( /^%?2f|%2f$/g, '' )
			.replace( /(?<=.\.tumblr\.com).+$/i, '' )
		)
		.filter( e => !!(false
			|| ! /\s/.test(e)
			&& schemeURLs.some( r => r.test(e) )
			&& ! /^(.*\/\/)?(www\.)?([^/]+\.((png|jpg|exe|jpeg|gif|mp4|webp|webm)|wikia\.com|deviantart\.com|archive\.org|googleusercontent\.com|bandcamp\.com|instagram\.com|postimage\.org|amazonaws\.com|photobucket\.com|patreon\.com|blogspot\.com|deviantart\.net|dropboxusercontent\.com|dropbox\.com)|yuki\.la|ytimg\.com|youtube\.com|youtu\.be|yoursiblings\.org|yahoo\.com|whatisabrony\.com|wetheeconomy\.com|welovefine\.com|weknowmemes\.com|archive\.org|wattpad\.com|watchmojo\.com|vocaroo\.com|vine\.co|vimeo\.com|variety\.com|twitter\.com|twitch\.tv|tvtropes\.org|tvguide\.com|tv\.com|tumview\.com|tumbnation\.com|tumbex\.com|tinyurl\.com|strawpoll\.me|steamcommunity\.com|instagram\.com|pmwiki\.php|deviantart\.net|dropbox\.com|deviantart\.com|puu\.sh|pony\.fm)(\/|$)/i.test(e)	//TODO fuck if I can be bothered...
		))
		._pipe( e => [...new Set(e)] )
		.sort()
		.join('\n')
		._fork( d => IO.writeDataSync( `./data/TUMBLR/${f.match(/[^/.]+/)}_${f.match(/\d+(?=\.txt$)/)||''}.txt`, d ) )
	).join('\n')
	._fork( d => IO.writeDataSync( `./data/_TUMBLRs.txt`, d ) )
	.split('\n')
	.filter( e => ! /^[^/]+\.tumblr\.com$/.test(e) )
	._fork( d => IO.writeDataSync( `./data/_TUMBLRs_external_RAW.txt`, d.join('\n') ) )
	.map( e => e.replace( /\/.*$/gi, '' ) )
	._pipe( arr => arr.filter((key, idx) => arr.lastIndexOf(key) === idx).sort((a, b) => a < b ? -1 : 1) )	// https://stackoverflow.com/a/35642876
	.reverse()
	.join('\n')
	._fork( d => IO.writeDataSync( `./data/_TUMBLRs_external.txt`, d ) )