@alexanderfefelov
Created March 5, 2015 06:30
Collects web pages from the IP addresses mentioned in a file. Uses Web-Harvest (http://web-harvest.sourceforge.net)
#!/bin/bash
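# Extract every dotted-quad (IPv4-like) string from file.txt.
mkdir -p ips
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' file.txt > ips/all
# Deduplicate the addresses.
sort -u ips/all > ips/uniq
# Wrap each address in CDATA-escaped <ip> tags so it can be embedded in the Web-Harvest config.
awk '{ print "<![CDATA[<ip>]]>" $0 "<![CDATA[</ip>]]>" }' ips/uniq > ips/cdata.inc
INC=$(cat ips/cdata.inc)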
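# Generate the Web-Harvest config. The heredoc is unquoted, so the shell expands $INC,
# while the escaped \${...} placeholders are passed through for Web-Harvest to evaluate.
cat <<EOF > web-harvest.xml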
<?xml version="1.0" encoding="UTF-8"?>
<config>
  <var-def name="ips">
    <![CDATA[<ips>]]>
$INC
    <![CDATA[</ips>]]>
  </var-def>
  <loop item="i">
    <list>
      <xpath expression="//ip/text()">
        <var name="ips"/>
      </xpath>
    </list>
    <body>
      <file action="write" path="pages/\${i}.txt">
        <empty>
          <var-def name="content">
            <try>
              <body>
                <http url="http://\${i}/"/>
              </body>
              <catch>
                *** Exception ***
              </catch>
            </try>
          </var-def>
        </empty>
        <template>
          \${content}
        </template>
      </file>
    </body>
  </loop>
</config>
EOF
# Start from an empty output directory, then run Web-Harvest with the generated config.
mkdir -p pages
rm -f pages/*
java -jar webharvest_all_2.jar config=web-harvest.xml
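
The grep pattern in the script accepts any dotted quad, including out-of-range values such as 999.1.1.1, and each of those would then be requested over HTTP. A minimal sketch of a stricter extraction step follows, assuming the same free-form input text. The function name `extract_valid_ips` is hypothetical and not part of the original gist.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original gist): read text on stdin and
# print the unique IPv4-like strings whose four octets are all in the range 0-255.
extract_valid_ips() {
  grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255' \
    | sort -u
}
```

It could stand in for the grep and sort steps, for example `extract_valid_ips < file.txt > ips/uniq`. Like the original pattern, it still picks dotted quads out of longer digit runs, so it is a filter on octet ranges rather than a full address validator.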