- Priority: https://gist.github.com/Darkenetor/10e03a4ebe42b9fec6c723ea3e8d75b5
- Additional high-ish count of false positives (non-pony blogs): https://gist.github.com/Darkenetor/4620ad4a18ebac7394a48d07f30e40fe
Contains every list posted so far. Ping @Darkenetor#4056 for new links and I'll update it.
Every commit only appends to the end, so it's safe to grab just the last rows if you're going without ignorelists for some reason. Otherwise, don't forget to randomize them.
The first three commits are derpibooru, derpibooru_domains and Rome's last zip sent here, but they're slightly better deduplicated so use the commit line count instead of the length of files you already have.
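Since commits only append, picking up where you left off can be sketched roughly like this. It's a minimal sketch, not something from the gists: `new_rows` is a made-up helper name, the file name and count in the usage comment are placeholders, and `shuf` assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: print only the rows that come after the first N lines
# of a list. N is the line count of the last commit you already processed.
new_rows() {
    file=$1
    have=$2
    # tail -n +K starts output at line K, so this skips the first $have lines.
    tail -n "+$((have + 1))" "$file"
}

# Usage sketch (placeholder names):
#   git show <old-commit>:afterderpi.txt | wc -l      # line count at that commit
#   new_rows afterderpi.txt 1200 | shuf > todo.txt    # new rows, randomized
```

Taking the count from the commit rather than from your local copy matters for the first three commits, since those were deduplicated and may be shorter than the files you already have.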
Clone the gists' repos and check commit messages for info on the sources. Currently data is from:
- Derpibooru: official dump from source_urls, descriptions from Twi-Hard's archive: https://derpibooru.org/tumblr_domains.txt https://derpibooru.org/tumblrs.txt
- Fimfiction: Sir Inrix's search index for Fimfarchive, Google searches for links outside stories
Google: site:https://www.fimfiction.net/user/ "tumblr"
Google: site:https://www.fimfiction.net/user/ inurl:/about "tumblr:"
Google: site:https://www.fimfiction.net/user/ inurl:/about "patreon:"
[...$$('.srg .r a[onmousedown]:not([class])')].map( a => a.href ).join('\n')
while IFS= read -r l; do wget -e robots=off -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' "$l"; done < ../fimfic.txt
- Twi-Hard's dump of Tumblrpony.wikia.com and MLPFanart.wikia.com
wget --mirror -e robots=off --accept-regex '(\.(html|php)|(\/|^)[^.?]*)$' --reject-regex '(\.\w+/\w+-\w+/wiki/|(Special|MediaWiki|Help|User|User_\w*|Template|Blog|File|Forum|Talk|\w+_Wiki):)' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'tumblrpony.wikia.com'
wget --mirror -e robots=off --accept-regex '(\.(html|php)|(\/|^)[^.?]*)$' --reject-regex '(\.\w+/\w+-\w+/wiki/|(Special|MediaWiki|Help|User|User_\w*|Template|Blog|File|Forum|Talk|\w+_Wiki):)' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'mlpfanart.fandom.com'
- Small TVTropes excerpt
wget --mirror -e robots=off --accept-regex '(\.(html)|(\/|^)[^.]*)(\?.*)?$' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'https://tvtropes.org/pmwiki/pmwiki.php/Blog/AskAPony' -l 1 # extra folders manually removed
- Redgetrek's Tumblr: http://redgetrek.tumblr.com/post/10693037093/top-original-art-pony-tumblrs
[...$$('.bodytype li')].map( e => e.innerText.trim().replace(/\s.*/, '') + '.tumblr.com' ).filter( e => /[a-z]/.test(e[0]) ).join('\n')
- Up to date EquestriaDaily.com and Horse-News.org dumps
wget --mirror -e robots=off --wait 0.25 -A html -X 'search/label' -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'https://www.equestriadaily.com/?m=1'
wget --mirror -e robots=off --wait 0.25 -A html -U 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' 'https://www.horse-news.org/'
- 13k YT channels from Twi-Hard's Google searches for playlists (half of them are equestrian races and Elsagate, but those don't have linked Tumblrs; this is what makes up most of the low accuracy list) and EqD spotlights: https://gist.github.com/Darkenetor/e058db7d16a006daf665504fc77aae29
[...new Set( Array.from($$('.pl-video-title a[href*="/channel/"], .pl-video-title a[href*="/user/"]')).map( e => e.href ) )].sort().join('\n')
- Patreon: profiles gathered from above sources
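The dumps above still need their Tumblr hostnames pulled out of the downloaded HTML. A rough sketch of one way to do that follows. It's not the script actually used for these lists: `extract_tumblrs` is a hypothetical name, and the excluded hosts are an assumed, non-exhaustive set of Tumblr's own service hostnames.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin, print unique *.tumblr.com hostnames.
extract_tumblrs() {
    # -o prints only the matched part, -i ignores case, -E uses extended regex.
    grep -oiE '[a-z0-9][a-z0-9-]*\.tumblr\.com' \
        | tr '[:upper:]' '[:lower:]' \
        | grep -vE '^(www|media|assets|static)\.tumblr\.com$' \
        | sort -u
}

# Usage sketch (placeholder directory name):
#   find www.equestriadaily.com -name '*.html' -exec cat {} + | extract_tumblrs
```

This only catches `name.tumblr.com` links, so blogs on custom domains need separate handling.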
Scripts below for future reference, ping me if you see issues there.
- Custom domains: ArchiveTeam/tumblr-grab#13
https://gist.githubusercontent.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/tumblr_domains.txt
- High accuracy Tumblrs outside of the Derpibooru list
https://gist.githubusercontent.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/afterderpi.txt
- Low accuracy Tumblrs outside of the Derpibooru list
https://gist.githubusercontent.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/afterderpi_lowpony.txt
Probably not up to date.
twkr's Stage1 status checker: https://pastebin.com/kDa8ij6j
- https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage1_domains.txt
- https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage1_afterderpi.txt
- https://gist.github.com/Darkenetor/23b25ab13dea1f6606577933b96d1802/raw/stage1_ad_lowpony.txt
twkr's Stage2 status checker: https://pastebin.com/ameY8a6m