@gvlx
Forked from sagrawal31/DownloadURLs.groovy
Last active May 10, 2022 16:26
A simple Groovy script to scrape all URLs from a given string and download the content from those URLs
import java.util.regex.Matcher
import java.util.regex.Pattern
Pattern urlPattern = Pattern.compile("(https?|ftps?|file)://([-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",Pattern.CASE_INSENSITIVE);
String urlString = """This is a big string with lots of Image URL like: http://i.istockimg.com/file_thumbview_approve/69656987/3/stock-illustration-69656987-vector-of-flat-icon-life-buoy.jpg and
http://i.istockimg.com/file_thumbview_approve/69943823/3/stock-illustration-69943823-beach-ball.jpg few others below
http://i.istockimg.com/file_thumbview_approve/40877104/3/stock-photo-40877104-pollen-floating-on-water.jpg
http://i.istockimg.com/file_thumbview_approve/68944343/3/stock-illustration-68944343-ship-boat-flat-icon-with-long-shadow.jpg
abcdef
www.whatever.com
https://github.com/geongeorge/i-hate-regex
https://www.facebook.com/
https://www.google.com/
https://xkcd.com/2293/
https://this-shouldn't.match@example.com
http://www.example.com/
ftp://this.new.server.com/
"""
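Matcher matcher = urlPattern.matcher(urlString)
while (matcher.find()) {
    String address = matcher.group()
    println("Got URL: " + address)
    // Name the local file after the last path segment of the URL
    new File("./" + address.tokenize("/").last()).withOutputStream { out ->
        // withInputStream closes the remote stream once the copy finishes
        new URL(address).withInputStream { input -> out << input }
    }
}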
// References:
// 1. http://stackoverflow.com/questions/5713558/detect-and-extract-url-from-a-string
// 2. http://stackoverflow.com/questions/4674995/groovy-download-image-from-url
@gibello

gibello commented May 10, 2022

Here's the list of URLs; you can test your script on it. Note that some of them fail for me with 403 (Forbidden). Specifying a User-Agent that looks like a real browser solved part of the problem, but it's not enough for a few of them (?)

http://oss-watch.ac.uk/files/procurement.odp
http://oss-watch.ac.uk/resources/ssmm
https://a16z.com/2019/10/04/commercializing-open-source
https://alambic.io
https://anchore.com/blog/5-open-source-procurement-best-practices
https://blog.kenjo.io/what-is-a-competency-matrix
https://blogs.vmware.com/opensource/2020/12/01/why-companies-contribute-to-open-source
https://certification.openchainproject.org
https://chaoss.community
https://chaoss.github.io/grimoirelab
https://clearcode.cc/blog/why-developers-contribute-open-source-software
https://clearlydefined.io
https://dev.to/datreeio/top-10-github-best-practices-3kl2
https://digital.com/creating-an-llc/open-source-business
https://dirkriehle.com/publications/2019-selected/the-innovations-of-open-source
https://docs.github.com/en/github/administering-a-repository/about-securing-your-repository
https://docs.github.com/en/github/managing-security-vulnerabilities/about-alerts-for-vulnerable-dependencies
https://ec.europa.eu/info/departments/informatics/open-source-software-strategy_en#opensourcesoftwarestrategy
https://ec.europa.eu/info/sites/default/files/en_ec_open_source_strategy_2020-2023.pdf
https://eclipse.github.io/steady
https://en.wikipedia.org/wiki/Heartbleed
https://fsfe.org
https://github.com/borisbaldassari/alambic/community
https://github.com/fossas/fossa-cli
https://github.com/oss-review-toolkit/ort
https://gitlab.ow2.org/ggi/ggi
https://gitlab.ow2.org/ggi/ggi-castalia/-/boards/449
https://gitlab.ow2.org/ggi/ggi-castalia/-/issues/28
https://gitlab.ow2.org/ggi/ggi/-/tree/main/resources
https://inform.tmforum.org/features-and-analysis/2017/05/upstream-first-building-products-open-source-software
https://joinup.ec.europa.eu/collection/open-source-observatory-osor
https://mail.ow2.org/wws/info/ossgovernance
https://managementisajourney.com/management-toolbox-better-decision-making-with-a-skills-inventory
https://maximilianmichels.com/2021/upstream-first
https://nythesis.com/open-courses/free-and-open-source-software
https://opencollective.com
https://opengovernance.dev
https://openpracticelibrary.com/practice/code-review
https://opensource.guide
https://ospo.zone
https://oss-compliance-tooling.org/Tooling-Landscape/OSS-Based-licence-Compliance-Tools
https://oss-review-toolkit.org
https://osv.dev
https://outreach.eclipse.foundation/hubfs/EuropeanOpenSourceWhitePaper-June2021.pdf
https://owasp.org/www-community/Vulnerability_Scanning_Tools
https://owasp.org/www-project-dependency-check
https://projects.eclipse.org/projects/technology.sw360
https://reflectoring.io/upstream-downstream
https://resources.whitesourcesoftware.com/blog-whitesource/3-reasons-why-open-source-is-safer-than-commercial-software
https://reuse.software
https://sap.github.io/project-kb
https://scancode-toolkit.readthedocs.io
https://scancode-toolkit.readthedocs.io/en/latest/cli-reference/scan-options-pre.html#classify
https://sfconservancy.org
https://sourceforge.net/blog/5-open-source-skills-game-resume
https://sourceforge.net/blog/support-open-source-projects-now
https://superuser.openstack.org/articles/cern-openstack-update
https://superuser.openstack.org/articles/the-role-of-open-source-in-digital-sovereignty-openinfra-live-recap
https://sustainoss.org
https://tidelift.com
https://timreview.ca/article/512
https://todogroup.org/guides
https://www.chromium.org/chromium-os/chromiumos-design-docs/upstream-first
https://www.cloudbees.com/blog/why-your-employees-should-be-contributing-to-open-source
https://www.computer.org/csdl/magazine/co/2020/10/09206429/1npxG2VFQSk
https://www.cvedetails.com
https://www.eclipse.org/sw360
https://www.fossology.org
https://www.ibrahimatlinux.com/wp-content/uploads/2022/01/recommended-oss-compliance-practices.pdf
https://www.infoworld.com/article/2612259/7-ways-your-company-can-support-open-source.html
https://www.itjungle.com/2021/02/15/weighing-the-hidden-costs-of-open-source
https://www.lfenergy.org/wp-content/uploads/sites/67/2019/07/Open-Source-Strategy-V1.0.pdf
https://www.linuxfoundation.org/tools/participating-in-open-source-communities
https://www.linuxfoundation.org/wp-content/uploads/lfcorp/files/lf_foss_compliance_fossology.pdf
https://www.openchainproject.org
https://www.openlogic.com/blog/top-5-benefits-open-source-software
https://www.opensourcerers.org/2021/08/16/a-primer-on-digital-sovereignty-open-source
https://www.ow2.org/view/MRL/Full_List_of_Best_Practices
https://www.ow2.org/view/MRL/Overview
https://www.ow2.org/view/OSS_Governance
https://www.perforce.com/blog/qac/9-best-practices-for-code-review
https://www.raconteur.net/technology/cloud/open-source-technology
https://www.redhat.com/en/blog/events-life-force-open-source
https://www.redhat.com/en/blog/what-enterprise-open-source
https://www.redhat.com/en/blog/what-open-source-upstream
https://www.techrepublic.com/article/4-innovations-we-owe-to-open-source
https://www.threefivetwo.com/blog/can-open-source-innovation-work-in-the-enterprise
https://www.unicef.org/innovation/stories/open-source-digital-sovereignty
https://www.un.org/ruleoflaw/files/Governance%20Indicators_A%20Users%20Guide.pdf
https://www.webiny.com/blog/what-is-commercial-open-source
http://www.catb.org/~esr/writings/cathedral-bazaar
http://www.managersresourcehandbook.com/download/Skills-Matrix-Template.pdf
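
Regarding the 403 responses mentioned above, here is a minimal Groovy sketch of sending a browser-like User-Agent with each request. The helper names `openWithUserAgent` and `download`, the timeout values, and the example header string are all assumptions made for illustration. As noted, some servers may still answer 403 regardless of the header.

```groovy
// Hypothetical helper: open a connection that sends the given User-Agent header.
// Nothing is sent over the network until the connection is actually used.
URLConnection openWithUserAgent(String address, String userAgent) {
    URLConnection connection = new URL(address).openConnection()
    connection.setRequestProperty('User-Agent', userAgent)
    connection.connectTimeout = 10000   // milliseconds (assumed value)
    connection.readTimeout = 10000      // milliseconds (assumed value)
    return connection
}

// Hypothetical helper: download one URL to a local file.
// Returns false instead of throwing when the server answers with an HTTP error such as 403.
boolean download(String address, File target, String userAgent) {
    URLConnection connection = openWithUserAgent(address, userAgent)
    if (connection instanceof HttpURLConnection && connection.responseCode >= 400) {
        println("Skipping ${address}: HTTP ${connection.responseCode}")
        return false
    }
    // withStream closes the input stream once the copy finishes
    connection.inputStream.withStream { input ->
        target.withOutputStream { out -> out << input }
    }
    return true
}

// Example call (performs a real network request, so it is left commented out):
// download('https://www.openchainproject.org', new File('openchainproject.html'),
//          'Mozilla/5.0 (X11; Linux x86_64)')
```

Returning a boolean rather than throwing lets a loop over the whole list keep going and report which URLs were refused.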

@gibello

gibello commented May 10, 2022

And concerning the URL extraction itself, I did it in 2 steps (the 1st is just a find/grep, the 2nd is a pipe of several "sed" with a sort/uniq), as follows - with final list in "/tmp/url2.txt":

rm -f /tmp/url1.txt /tmp/url2.txt
find content/ -name "*.md" -exec grep -h "\((http\S*)\)" {} + > /tmp/url1.txt

sed -r 's/.*(\((.*)\)).*/\2/' /tmp/url1.txt | grep http |sed -r 's/\)//g' |sed -r 's/\"//g' |sed -r 's/\/$//g' | sort | uniq > /tmp/url2.txt

Don't know how you can add this to a Groovy script, but if regexps are properly supported, it is certainly possible?
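
On the Groovy question, one possible port is sketched below. The `mdLink` pattern and the `extractUrls` helper are hypothetical names, and the regex only approximates the grep/sed pipeline: it captures every Markdown link target that starts with http, stops at whitespace, a double quote, or the closing parenthesis, and drops one trailing slash. Unlike the greedy sed expression, it keeps all links found on a line rather than only the last one.

```groovy
import groovy.io.FileType
import java.util.regex.Matcher
import java.util.regex.Pattern

// Assumed pattern: "(" followed by an http(s) URL, ending before whitespace, '"' or ")".
Pattern mdLink = Pattern.compile('\\((https?://[^\\s)"]+)')

// Hypothetical helper: collect the unique, sorted URLs found in one chunk of Markdown text.
SortedSet<String> extractUrls(String text, Pattern pattern) {
    SortedSet<String> urls = new TreeSet<String>()
    Matcher matcher = pattern.matcher(text)
    while (matcher.find()) {
        // Drop a single trailing slash, like the sed 's/\/$//g' step
        urls << matcher.group(1).replaceAll('/$', '')
    }
    return urls
}

// Walk content/ for *.md files and write the merged list to /tmp/url2.txt,
// mirroring the find/grep and sort/uniq steps. Skipped if content/ does not exist.
File contentDir = new File('content')
if (contentDir.isDirectory()) {
    SortedSet<String> all = new TreeSet<String>()
    contentDir.eachFileRecurse(FileType.FILES) { file ->
        if (file.name.endsWith('.md')) {
            all.addAll(extractUrls(file.text, mdLink))
        }
    }
    new File('/tmp/url2.txt').text = all.join('\n') + '\n'
}
```

The resulting set could then be passed straight to the download loop instead of going through a temporary file.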
