@gvlx
Forked from sagrawal31/DownloadURLs.groovy
Last active May 10, 2022 16:26
A simple Groovy script to scrape all URLs from a given string and download the content from those URLs
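import java.util.regex.Matcher
import java.util.regex.Pattern

// Scheme, then "://", then a run of URL characters that must not end in punctuation such as "." or ","
Pattern urlPattern = Pattern.compile("(https?|ftps?|file)://([-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.CASE_INSENSITIVE)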
String urlString = """This is a big string with lots of Image URL like: http://i.istockimg.com/file_thumbview_approve/69656987/3/stock-illustration-69656987-vector-of-flat-icon-life-buoy.jpg and
http://i.istockimg.com/file_thumbview_approve/69943823/3/stock-illustration-69943823-beach-ball.jpg few others below
http://i.istockimg.com/file_thumbview_approve/40877104/3/stock-photo-40877104-pollen-floating-on-water.jpg
http://i.istockimg.com/file_thumbview_approve/68944343/3/stock-illustration-68944343-ship-boat-flat-icon-with-long-shadow.jpg
abcdef
www.whatever.com
https://github.com/geongeorge/i-hate-regex
https://www.facebook.com/
https://www.google.com/
https://xkcd.com/2293/
https://this-shouldn't.match@example.com
http://www.example.com/
ftp://this.new.server.com/
"""
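Matcher matcher = urlPattern.matcher(urlString)
while (matcher.find()) {
    String address = matcher.group()
    println("Got URL: " + address)
    // Name the local file after the last "/"-separated token of the URL
    File target = new File("./" + address.tokenize("/").last())
    try {
        // withInputStream/withOutputStream close both streams, even on failure
        new URL(address).withInputStream { input ->
            target.withOutputStream { out -> out << input }
        }
    } catch (IOException e) {
        // Skip URLs that cannot be fetched instead of aborting the whole run
        println("Failed to download " + address + ": " + e.message)
    }
}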
// References:
// 1. http://stackoverflow.com/questions/5713558/detect-and-extract-url-from-a-string
// 2. http://stackoverflow.com/questions/4674995/groovy-download-image-from-url
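The script names each downloaded file after the last "/"-separated token of the URL, so a query string ends up in the file name. A minimal sketch of one alternative, assuming it is acceptable to name files after the URL path only; `fileNameFor` is a hypothetical helper and not part of the original gist:

```groovy
// Hypothetical helper (not in the original script): pick a local file name for a URL.
String fileNameFor(String address) {
    URL url = new URL(address)
    // getPath() excludes the query string and the fragment
    List<String> segments = url.path.tokenize('/')
    // Fall back to the host name when the URL has no path segments
    return segments ? segments.last() : url.host
}

println fileNameFor('http://www.example.com/images/beach-ball.jpg?size=3')
println fileNameFor('https://www.google.com/')
```

For the two sample calls this prints `beach-ball.jpg` and `www.google.com`. Two URLs that share a last path segment would still overwrite each other, so this only addresses the query-string case.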
gibello commented May 10, 2022

As for the URL extraction itself, I did it in two steps: the first is just a find/grep, the second pipes several "sed" commands into sort/uniq. The final list ends up in "/tmp/url2.txt":

rm -f /tmp/url1.txt /tmp/url2.txt
find content/ -name "*.md" -exec grep -h "\((http\S*)\)" {} + >> /tmp/url1.txt

sed -r 's/.*(\((.*)\)).*/\2/' /tmp/url1.txt | grep http |sed -r 's/\)//g' |sed -r 's/\"//g' |sed -r 's/\/$//g' | sort | uniq > /tmp/url2.txt

I don't know how you would add this to a Groovy script, but as long as regexps are properly supported, it should certainly be possible.
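For what it is worth, a rough Groovy sketch of the same idea follows. It is not an exact port of the sed pipeline: it collects every `(http...)` link target in the text rather than only the last parenthesised group on each line. `extractLinkTargets` is a hypothetical name, and the `content` directory is taken from the shell commands above.

```groovy
import groovy.io.FileType
import java.util.regex.Matcher

// Hypothetical helper: unique, sorted http(s) link targets found in Markdown text.
SortedSet<String> extractLinkTargets(String markdown) {
    SortedSet<String> urls = new TreeSet<String>()
    // "(" followed by an http(s) URL that runs up to whitespace, a quote, or ")"
    Matcher m = markdown =~ '\\((https?://[^\\s)"]+)'
    while (m.find()) {
        // Drop one trailing slash, like the last sed step
        urls << m.group(1).replaceAll('/$', '')
    }
    return urls
}

// Walk content/ and gather the link targets of all *.md files (skipped if the directory is absent)
SortedSet<String> all = new TreeSet<String>()
File dir = new File('content')
if (dir.isDirectory()) {
    dir.eachFileRecurse(FileType.FILES) { File f ->
        if (f.name.endsWith('.md')) {
            all.addAll(extractLinkTargets(f.text))
        }
    }
}
all.each { println it }
```

The `TreeSet` plays the role of `sort | uniq`, so no temporary files are needed.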