@maphysics
Last active May 20, 2017 18:01
WIP - Nutch 2 on Windows 10

Ok so I've been trying to get Nutch running on my Windows 10 machine for a while now.
Online resources have been lacking in recent tutorials, so these are my notes on the matter.

So the official tutorials are at https://wiki.apache.org/nutch/

And I downloaded the binaries from http://www-eu.apache.org/dist/nutch/2.3.1/

I grabbed the zip file as I'm on Windows.
Aside: I would prefer to be on a Linux system. At the moment, I just have my gaming machine, which is Windows.
Yes, I could dual boot it or run a VM, but I'm trying to do this the "easiest" way possible.

Ok, so I normally like to dev in IntelliJ. A lot of people are Eclipse folks, I get that.
I started using an IDE with Ruby for automated UI tests, and RubyMine is really the top for Ruby.

The first Nutch tutorial I'm working through is https://wiki.apache.org/nutch/Nutch2Tutorial

My thoughts and issues as I read are below:

  • Who makes a nested tutorial? Like in order to use version 2, I need to know how to use version 1?
    Come on, let's actually do a tutorial right.

  • To be honest, the first time I heard of Gora was the first time I tried this tutorial. http://gora.apache.org/ "The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop™ MapReduce support. Gora uses the Apache Software License v2.0. Gora graduated from the Apache Incubator in January 2012 to become a top-level Apache project. You can find the Gora DOAP here."

  • Also, Gora is tied to specific versions of the Hadoop tools it works with, as pointed out in the Nutch tutorial, and that's easy to miss.

  • Ok, the version of HBase they link to is a gzipped tar; luckily I have 7-Zip installed on my machine.
    If you need it, you can get it here: http://www.7-zip.org/download.html

  • Windows has a package manager, Chocolatey. I plan to see if any of this is in choco; I haven't yet.
    Side note: choco is what my niece and nephew call poo, so that's entertaining.

  • Oh right, I haven't set my $NUTCH_HOME yet. On Windows, that's a system variable, and you need to be an admin to set it, at least through the UI. If, like me, you've worked at places where you didn't have admin rights, too bad.
    JAVA_HOME is normally what one ends up doing this for, so it's no surprise there's a good-looking set of directions for it. Now where did I unpack nutch... https://www.java.com/en/download/help/path.xml

  • Ahh, Windows is being "helpful" and downloaded it to my OneDrive, and I unpacked it there without thinking. So I'm going to move it somewhere 'nicer.' Let's just put it at the C:\ drive level, and by "it" I mean the extracted folder, not the zip file: the one called 'apache-nutch-2.3.1', or similar if you have a different version of 2. Then, from the tutorial, it looks like $NUTCH_HOME=C:\apache-nutch-2.3.1
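Since I'm going to be living in Git Bash anyway, the lazy per-session alternative to the system variable is just exporting it. A sketch, assuming the zip landed at C:\apache-nutch-2.3.1 as above (Git Bash spells that path /c/apache-nutch-2.3.1):

```shell
# Session-local alternative to the Windows system variable;
# assumes the extracted folder is C:\apache-nutch-2.3.1 as above
export NUTCH_HOME=/c/apache-nutch-2.3.1
echo "$NUTCH_HOME"
```

This only lives as long as the Git Bash window, which is exactly why the system variable route exists.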

  • "Download and configure HBase 0.98.8-hadoop2", that's useful! Did I mention tar.gz files are annoying on Windows, even with 7-Zip? No? Well... 7-Zip unpacks the gzip and puts it in a folder. The tarball is inside that folder, and you have to extract it again. Then you get to clean up your working directory, which is now polluted with all these extra files and folders. It's also slow.

  • Ok, the version of HBase they want you to use is OLD. However, the internet is a minefield of outdated information. JavaTPoint has a configuration guide I'm going to use; I'm going with standalone mode. https://www.javatpoint.com/hbase-installation

  • Ahh, Java, my old friend, the time has come to meet again. HBase wants Java, and I'm going to give it to it. The configuration guide wants Java 7. Well, ok, I like 8, it's got some great things in it, but a lot of the world is written in older versions and I can work with that. If you want to use choco: choco install jdk7 Otherwise, you've got to hit the archives: http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html

  • I already had 8 so I need to update my JAVA_HOME to point to 7.

  • Time to fix hbase-site.xml to mimic the JavaTPoint guide. I like to edit files that are outside a project with Notepad++. Use what you like, as long as it's plain text, duh. https://notepad-plus-plus.org/

  • Right, JavaTPoint assumes I'm on Linux, and all the file locations are the wrong slash for Windows. Why can't we all just get along? Here's my adjustment, which requires making a few directories:

```xml
<!-- Here you have to set the path where you want HBase to store its files. -->
<property>
  <name>hbase.rootdir</name>
  <value>file:C:\Users\[username]\HBase\HFiles</value>
</property>

<!-- Here you have to set the path where you want HBase to store its built-in ZooKeeper files. -->
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>C:\Users\[username]\zookeeper</value>
</property>
```

  • Next is running HBase. The start script is a shell script, and I stay away from bat scripts as a general rule, so I'm going to use Git Bash to run it. You can get it as part of Git: choco install git

  • That went well:

```
$ ./start-hbase.sh
Error: Could not find or load main class org.apache.hadoop.hbase.util.HBaseConfTool
Error: Could not find or load main class org.apache.hadoop.hbase.zookeeper.ZKServerTool
starting master, logging to /c/hbase-0.98.8-hadoop2/bin/../logs/hbase--master-sj.out
Error: Could not find or load main class org.apache.hadoop.hbase.master.HMaster
localhost: ssh: connect to host localhost port 22: Connection refused
```

  • Right, I forgot to update hbase-env.sh to have the correct JAVA_HOME dir. Obviously, it can't use the system variable I made, that would be too easy. And a new error:

```
$ ./start-hbase.sh
/c/hbase-0.98.8-hadoop2/bin/../conf/hbase-env.sh: line 29: export: `FilesJavajdk1.7.0_79': not a valid identifier
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:Program/bin/java: No such file or directory
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:Program/bin/java: No such file or directory
starting master, logging to /c/hbase-0.98.8-hadoop2/bin/../logs/hbase--master-slimjim.out
/c/hbase-0.98.8-hadoop2/bin/../bin/hbase: line 389: C:Program/bin/java: No such file or directory
localhost: ssh: connect to host localhost port 22: Connection refused
```

  • Right, bash HATES spaces in directory names. I'm moving my Java install from Program Files up a directory, updating JAVA_HOME locally, and updating it in hbase-env.sh.
    Don't forget to close and reopen the console/Git Bash/PowerShell you are using in order to pick up the new JAVA_HOME. Sweet! I got the same error! Because I forgot to save the hbase-env.sh file.

  • Saving, and I get a new shiny error... ooooo:

```
$ ./start-hbase.sh
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:javajdk1.7.0_79/bin/java: No such file or directory
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:javajdk1.7.0_79/bin/java: No such file or directory
starting master, logging to /c/hbase-0.98.8-hadoop2/bin/../logs/hbase--master-slimjim.out
/c/hbase-0.98.8-hadoop2/bin/../bin/hbase: line 389: C:javajdk1.7.0_79/bin/java: No such file or directory
localhost: ssh: connect to host localhost port 22: Connection refused
```

I didn't switch the slashes around and change C:\ to /c/, oops.
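For anyone keeping score, the JAVA_HOME line in conf/hbase-env.sh that bash will actually accept looks roughly like this. A sketch: /c/java/jdk1.7.0_79 is where my install ended up after the move, yours may differ.

```shell
# hbase-env.sh, Git Bash flavor: no spaces in the path, forward slashes,
# and C:\ rewritten as /c/ (the jdk1.7.0_79 location is my machine's setup)
export JAVA_HOME=/c/java/jdk1.7.0_79
echo "$JAVA_HOME"
```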

  • Boom! I've got HBase running standalone on Windows 10. That was the whole point, right? We weren't, like, trying to do something else with that last HOUR+ of time? Oh right, Nutch, duck it! Where was I on that tutorial? That was step ducking two!

  • Ok, step 3, configure Gora. That should be easier than HBase, based on the number of lines it takes up. In $NUTCH_HOME/conf/nutch-site.xml, make the following change, plus all the others from the tutorial for the thing we are replacing or haven't used before, the old version of Nutch?? Gah, ducking A rabbit, nested tutorials again!

```xml
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
```

That's line 116 in case you wanted some helpful information.

  • Next we get to add a "missing jar", da duck? Oh right, because Gora is tied to a specific version of HBase, so we have to put that version in. In theory, too, it works with a lot of other backends. Having the user provide the jar for the backend we are using means a smaller dependency footprint. Of course, that's just my guess, as it's not actually explained in the tutorial.

  • Wait! The Gora you want me to use, because it is linked to HBase or whatever, is broken? The new version is fixed but you still want me to use the janky version? Thanks, Nutch!

  • That looks like a configuration line coming next. But where does it go? In case you were wondering, the link above it explains nothing. That config line looks like the same syntax as the ivy one, so I'll put it there.
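For what it's worth, the line in question goes in $NUTCH_HOME/ivy/ivy.xml and looks roughly like this. My best guess at the shape; the rev has to match whatever gora-hbase version your Nutch release ships with, so double-check it against the tutorial:

```xml
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1"
            conf="*->default" />
```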
  • Ok, another config file. Would it be too much to wrap these up into, say, one config file? In $NUTCH_HOME/conf/gora.properties:

```properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```

  • When I was looking for where to put the gora.datastore.default property, I found this:

```properties
#########################
# HBaseStore properties #
#########################

# HBase requires that the Configuration has a valid "hbase.zookeeper.quorum"
# property. It should be included within hbase-site.xml on the classpath. When
# this property is omitted, it expects Zookeeper to run on localhost:2181.

# To greatly improve scan performance, increase the hbase-site Configuration
# property "hbase.client.scanner.caching". This sets the number of rows to grab
# per request.

# HBase autoflushing. Enabling autoflush decreases write performance.
# Available since Gora 0.2. Defaults to disabled.
#hbase.client.autoflush.default=false

# HBase client cache that improves the scan in HBase (default 0)
#gora.datastore.scanner.caching=1000
```

Might be helpful in the future, since I don't have a zookeeper running.

  • Oh well, making note of it and moving on. BTW, the property in gora.properties I want to change is on line 19.

  • "N.B. It's probably worth checking and setting all your usual configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before progressing." Well, I don't have any, so I'm progressing.

  • Now I am ready to compile? I'm using a binary distro so that seems odd, but oh well.

  • I need to install ant. Which BTW wants java 8. I'm sure that will work out fine. choco install ant

  • OK, ant runtime, ERROR!

```
$ ant runtime
java.lang.UnsupportedClassVersionError: org/apache/tools/ant/launch/Launcher : Unsupported major.minor version 52.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Exception in thread "main"
```

  • Yeah, that's totally the Java version biting me in the tushy. My $JAVA_HOME got bumped to Java 8. I rolled it back; let's try again. Nope! No good. "The class file version for Java SE 8 is 52.0 as per the JVM Specification. Version 52.0 class files produced by a Java SE 8 compiler cannot be used in earlier releases of Java SE." http://www.oracle.com/technetwork/java/javase/8-compatibility-guide-2156366.html So I'm going to roll forward again to Java 8 and update the JAVA_HOME in my configs. Yep, totally got me to the next error.

  • That error is...

```
$ ant runtime
Buildfile: C:\apache-nutch-2.3.1\build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:
    [mkdir] Created dir: C:\apache-nutch-2.3.1\build
    [mkdir] Created dir: C:\apache-nutch-2.3.1\build\classes
    [mkdir] Created dir: C:\apache-nutch-2.3.1\build\release
    [mkdir] Created dir: C:\apache-nutch-2.3.1\build\test
    [mkdir] Created dir: C:\apache-nutch-2.3.1\build\test\classes

clean-lib:

resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\apache-nutch-2.3.1\ivy\ivysettings.xml
```


Ok, I had to take a couple weeks off; since then I've had to update Windows and my laptop's firmware. Let's see if I can sort this shit out now.

So what is the ant equivalent of .m2?

Ok, so Ant can run over Maven? http://maven.apache.org/ant-tasks/usage.html So let's put it in the .m2 repo. (Judging by the [ivy:resolve] lines above, though, this build actually uses Ivy, whose cache lives in ~/.ivy2, not .m2.)

So I haven't installed that jar yet... I started hbase again, tried ant runtime, and got this:

```
$ ant runtime
Buildfile: C:\apache-nutch-2.3.1\build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:

clean-lib:
   [delete] Deleting directory C:\apache-nutch-2.3.1\build\lib

resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\apache-nutch-2.3.1\ivy\ivysettings.xml
```

A different error! I may need to install ant... choco install ant, choco upgrade ant... hrmm, ant is already installed. So while I force Windows to install it again, let's look at this magical build.xml file. "..." Looks like the final version will be in a release subfolder next to build.xml. Actually, we have a folder called runtime, and in there is a local folder with a bunch of jars in it, including a bunch of gora jars... And huzzah, hbase-common-0.98.8-hadoop2.jar is in there. Honestly, I have no clue if this is working or not, but it seems to have what it wants for now...

  1. Make a list of URLs to crawl. WOOOOOTTTTT!!!!! We're finally talking about crawling the web. That's why I'm here; you're obviously here for the entertainment value. So this is tutorial inception again and we are told to go back to the tutorial for Nutch 1. Oh no, I'm wrong, we need to follow the hbase tutorial to make sure it's set up correctly, ballz. "Start HBase. Use the bin/start-hbase.sh command to start HBase. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running." My jps only returns jps. FYI, jps is http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html
  • Update conf/hbase-site.xml; it needs to point at where the data is going to be written. The JavaTPoint tutorial has the directories for a nice, friendly Linux or Mac environment. Let's fix that: file:/c/Users/{username}/HBase/HFiles for hbase.rootdir and /c/Users/{username}/zookeeper for hbase.zookeeper.property.dataDir. Remember to save the file ;) Now jps shows HMaster. Awesome, that means it's running in local mode. Yeah, we could, and in production would, want a distributed HBase. Be real though, it would already be running. Or just yell at us DevOps-like people to get one up for you!
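Spelled out, the two properties in conf/hbase-site.xml end up something like this ({username} is a placeholder for your own account, obviously):

```xml
<property>
  <name>hbase.rootdir</name>
  <value>file:/c/Users/{username}/HBase/HFiles</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/c/Users/{username}/zookeeper</value>
</property>
```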
  1. Finally writing the ducking URL list. Back up the tutorial hole we go. https://wiki.apache.org/nutch/NutchTutorial#Create_a_URL_seed_list "Create a URL seed list: A URL seed list includes a list of websites, one per line, which Nutch will look to crawl. The file conf/regex-urlfilter.txt will provide Regular Expressions that allow Nutch to filter and narrow the types of web resources to crawl and download."
  • so that file is this:

```
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
```

I'm leaving this the same. Like all sane, modern devs, I prefer to let others handle my regex. There are people who enjoy this: https://xkcd.com/1313/ https://www.oreilly.com/learning/regex-golf-with-peter-norvig Dude, that guy does videos for Coursera! Also, this explains why I don't work at Google. I didn't even finish the video. Or the course...

  • Anyways, the Nutch 1 tutorial says how to make a list beyond regex. (The regex file is basically how Nutch determines whether it's going to follow a URL or not.) "mkdir -p urls; cd urls; touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl)." REALLY!!! Was '-p' really necessary? That's like some wank being like "I'm going to show off. I do this every time, just to be safe." It's not safe, that's how you make dirs where you don't want them, dip shat.

  • Ok, I'll bite and use "http://nutch.apache.org/" as my seed URL. Let's try this. Now go back to the Nutch 2 tutorial https://wiki.apache.org/nutch/Nutch2Tutorial and let's try injecting. That sounds fun and painless... from your runtime/local/bin folder do "nutch inject /someseedDir" and "nutch readdb". someseedDir? Damn, that is some seed Dir you got there. Must come from a hell of a mkdir -p.
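The actual seed setup boils down to this (a sketch, using the URL I'm biting on above):

```shell
# create the seed list Nutch will inject; one URL per line
mkdir -p urls             # fine, fine, -p, so re-running doesn't explode
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt
```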

So I'm guessing this isn't the right way to call this...

```
$ ./nutch inject seed.txt
[Fatal Error] nutch-site.xml:23:7: The string "--" is not permitted within comments.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:/C:/apache-nutch-2.3.1/runtime/local/conf/nutch-site.xml; lineNumber: 23; columnNumber: 7; The string "--" is not permitted within comments.
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2348)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2205)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2112)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:989)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:961)
        at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1299)
        at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:319)
        at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:479)
        at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
        at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: org.xml.sax.SAXParseException; systemId: file:/C:/apache-nutch-2.3.1/runtime/local/conf/nutch-site.xml; lineNumber: 23; columnNumber: 7; The string "--" is not permitted within comments.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
        at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2183)
        at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2171)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2242)
        ... 11 more
```

Looks like nutch dislikes the sass in my nutch-site.xml file. I'm pulling back my dashes. Honestly, who doesn't use YAML these days? XML is so old school. I'm going to be lazy and change it in the runtime file and not in the original conf file; I'm unconvinced ant will work again.
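The underlying rule, straight from the XML spec: a double hyphen can never appear inside a comment body, because `--` is reserved for the `-->` terminator. So:

```xml
<!-- OK: sass with single dashes - like this -->
<!-- NOT OK, the parser will reject it: sass -- more sass -->
```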

  • Let's run "./nutch inject urls" with the directory instead of the file, since nutch is looking for a directory. Also, in case you didn't catch it, the commands we are running go through shell scripts.

  • It's just sitting there. I went through a whole Diply article and it still hasn't run. This isn't the crawling; this is just preparing to crawl. This can't be right. Well, that could be because my HMaster shut down. Would have been nice to be notified. There are a few errors in the hbase logs. The first is:

```
2017-04-25 15:08:45,449 ERROR [main] util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
```

The first comment on this question has links to download them; I'm going with the older hadoop because this hbase is older. http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path I'm moving all the files in the zip to the bin folder for my hbase. Now I'm going to tail -f the log file and wait for the puppy to crash. Time to find another Diply to read. Oh, instead I looked up Katie+8. Go take your hate away. Those kids are big since I last watched. It looks like HMaster is stable now. Let's inject again. An error is PROGRESS!!!!

```
$ ./nutch inject urls
InjectorJob: starting at 2017-05-09 21:47:49
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:774)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:646)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:434)
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:115)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
```

Looks like I need the source code to debug this one. To the GitHub!!!! https://github.com/apache/nutch Remember to use tag release-2.3.1:

```
git fetch --all --tags --prune
git checkout tags/release-2.3.1 -b master-release-2.3.1
```

Now I'm using IntelliJ, shush, you Eclipse-loving wannabe players. Ok, I need to update my IntelliJ because apparently I haven't done that in a couple years. Then I'm going to bed. Probably gonna be a few days before I get back to this.
