OK, so I've been trying to get Nutch running on my Windows 10 machine for a while now.
Recent tutorials are hard to find online, so these are my notes on the matter.
The official tutorials are at https://wiki.apache.org/nutch/
And I downloaded the binaries from http://www-eu.apache.org/dist/nutch/2.3.1/
I grabbed the zip file as I'm on Windows.
Aside: I would prefer to be on a Linux system. At the moment, I just have my gaming machine, which is Windows.
Yes, I could set up a dual boot or run a VM, but I'm trying to do this the "easiest" way possible.
OK, so I normally like to dev in IntelliJ. A lot of people are Eclipse folks, I get that.
I started using an IDE with Ruby for automated UI tests, and RubyMine is really the top for Ruby.
The first Nutch tutorial I'm working through is https://wiki.apache.org/nutch/Nutch2Tutorial
My thoughts and issues as I read are below:
-
Who makes a nested tutorial? Like in order to use version 2, I need to know how to use version 1?
Come on, let's actually do a tutorial right.
-
To be honest, the first time I heard of Gora was the first time I tried this tutorial. http://gora.apache.org/
"The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop™ MapReduce support. Gora uses the Apache Software License v2.0. Gora graduated from the Apache Incubator in January 2012 to become a top-level Apache project."
-
Also, Gora is tied to specific versions of the Hadoop tools it works with. The Nutch tutorial points this out, but it's easy to miss.
-
OK, the version of HBase they link to is a gzipped tarball (tar.gz). Luckily I have 7-Zip installed on my machine.
If you need it, you can get it here: http://www.7-zip.org/download.html
-
Windows has a package manager, Chocolatey. I plan to see if any of this is in choco, but I haven't yet.
Side note: choco is what my niece and nephew call poo, so that's entertaining.
-
Oh right, I haven't set my $NUTCH_HOME yet. On Windows, that's a system variable and you need to be an admin to set it, at least through the UI. If, like me, you've worked at places where you didn't have admin rights, too bad.
-
JAVA_HOME is normally what one ends up doing this for, so it's no surprise they have a good-looking set of directions for it: https://www.java.com/en/download/help/path.xml Now where did I unpack Nutch...
-
Ahh, Windows is being "helpful": it downloaded the zip to my OneDrive and I unpacked it there without thinking. So I'm going to move it somewhere 'nicer.' Let's just put it at the top level of the C:\ drive, and by "it" I mean the extracted folder, not the zip file. That's the one called 'apache-nutch-2.3.1', or similar if you have a different version of 2. Then, going by the tutorial, $NUTCH_HOME=C:\apache-nutch-2.3.1
-
"Download and configure HBase 0.98.8-hadoop2" that's useful! Did I mention tar.gz files are annoying on windows even with 7-Zip? No, well... 7-zip unpacks the gzip and puts it in a folder. The tarball is inside that folder and you have to extract it again. Then you get to clean up your working directory that is now polluted with all these extra files and folders. It's also slow.
-
Ok the version of hbase they want you to use is OLD. However, the internet is a mine of outdated information. JavaTPoint has a configuration guide I'm going to use. I'm going with standalone mode. https://www.javatpoint.com/hbase-installation
-
Ahh, Java my old friend, the time has come to meet again. HBase wants Java and I'm going to give it to them. The configuration guide wants Java 7. Well, OK, I like 8, it's got some great things in it, but a lot of the world is written in older versions and I can work with that. If you want to use choco: choco install jdk7. Otherwise, you've got to hit the archives: http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html
-
I already had 8 so I need to update my JAVA_HOME to point to 7.
-
Time to fix hbase-site.xml to mimic the JavaTPoint guide. I like to edit files that are outside a project with Notepad++. Use what you like as long as it's plain text, duh. https://notepad-plus-plus.org/
-
Right, JavaTPoint assumes I'm on Linux, so all the file locations use the wrong slash for Windows. Why can't we all just get along? Here's my adjustment, which requires making a few directories:
//Here you have to set the path where you want HBase to store its built in zookeeper files.
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>C:\Users\[username]\zookeeper</value>
</property>
-
Next is running HBase. The start script is a shell script, and I stay away from bat scripts as a general rule, so I'm going to use Git Bash to run it. You can get it as part of Git: choco install git
-
That went well:
$ ./start-hbase.sh
Error: Could not find or load main class org.apache.hadoop.hbase.util.HBaseConfTool
Error: Could not find or load main class org.apache.hadoop.hbase.zookeeper.ZKServerTool
starting master, logging to /c/hbase-0.98.8-hadoop2/bin/../logs/hbase--master-sj.out
Error: Could not find or load main class org.apache.hadoop.hbase.master.HMaster
localhost: ssh: connect to host localhost port 22: Connection refused
-
Right, I forgot to update hbase-env.sh to have the correct JAVA_HOME dir. Obviously, it can't use the system variable I made, that would be too easy. And a new error:
$ ./start-hbase.sh
/c/hbase-0.98.8-hadoop2/bin/../conf/hbase-env.sh: line 29: export: `FilesJavajdk1.7.0_79': not a valid identifier
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:Program/bin/java: No such file or directory
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:Program/bin/java: No such file or directory
starting master, logging to /c/hbase-0.98.8-hadoop2/bin/../logs/hbase--master-slimjim.out
/c/hbase-0.98.8-hadoop2/bin/../bin/hbase: line 389: C:Program/bin/java: No such file or directory
localhost: ssh: connect to host localhost port 22: Connection refused
-
Right, bash HATES spaces in directory names. I'm moving my Java install out of Program Files to a directory up, then updating JAVA_HOME locally and in hbase-env.sh.
Don't forget to close and reopen the console/Git Bash/PowerShell you are using in order to pick up the new JAVA_HOME. Sweet! I got the same error! 'Cause I forgot to save the hbase-env.sh file.
-
Saving, and I get a new shiny error... ooooo...
$ ./start-hbase.sh
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:javajdk1.7.0_79/bin/java: No such file or directory
/c/hbase-0.98.8-hadoop2/bin/hbase: line 389: C:javajdk1.7.0_79/bin/java: No such file or directory
starting master, logging to /c/hbase-0.98.8-hadoop2/bin/../logs/hbase--master-slimjim.out
/c/hbase-0.98.8-hadoop2/bin/../bin/hbase: line 389: C:javajdk1.7.0_79/bin/java: No such file or directory
localhost: ssh: connect to host localhost port 22: Connection refused
I didn't switch the slashes around and change C:\ to /c/, oops.
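The mangled C:javajdk1.7.0_79 in that output is bash eating the backslashes as escape characters. Here is a rough sketch of the conversion Git Bash wants, written as a small function. The name to_bash_path is made up for illustration, and it assumes a plain drive-letter path (no UNC paths, no relative paths).

```shell
# to_bash_path WINDOWS_PATH
# Turn C:\java\jdk1.7.0_79 into /c/java/jdk1.7.0_79.
# (to_bash_path is a hypothetical helper name, not part of Git Bash.)
to_bash_path() {
  win="$1"
  # First character is the drive letter; Git Bash mounts drives in lowercase.
  drive=$(printf '%s' "$win" | cut -c1 | tr 'A-Z' 'a-z')
  # Skip the "C:" prefix and flip every backslash to a forward slash.
  rest=$(printf '%s' "$win" | cut -c3- | tr '\\' '/')
  printf '/%s%s\n' "$drive" "$rest"
}

# Example (single quotes keep bash from eating the backslashes):
#   to_bash_path 'C:\java\jdk1.7.0_79'
```

The result is what goes on the JAVA_HOME line in hbase-env.sh. Git Bash may also ship a cygpath tool that does the same job, but I haven't relied on it here.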
-
Boom! I've got hbase running stand alone on Windows 10. That was the whole point right? We weren't like trying to do something else with that last HOUR+ of time? Oh right Nutch, duck it! Where was I on that tutorial? That was step ducking, two!
-
OK, step 3, configure Gora. That should be easier than HBase, based on the number of lines it takes up. In $NUTCH_HOME/conf/nutch-site.xml make the following change, plus all the others from the tutorial for the thing we are replacing or haven't used before, the old version of Nutch?? Gah, ducking A rabbit, nested tutorials again!
-
OK, Ivy config. I'm new to Ivy, and if you are too (you probably are), here's a link I haven't read yet: https://en.wikipedia.org/wiki/Apache_Ivy I'm used to Maven over Ant. The file is $NUTCH_HOME/ivy/ivy.xml
That's line 116, in case you wanted some helpful information.
-
Next we get to add a "missing jar", da duck? Oh right, because Gora is tied to a specific version of HBase, so we have to put that version in. In theory, too, it works with a lot of other backends. Having us provide the jar for the backend we are using means a smaller dependency footprint. Of course, that's just my guess, as it's not actually explained in the tutorial.
-
Wait! The Gora you want me to use, because it is linked to HBase or whatever, is broken? The new version is fixed but you still want me to use the janky version? Thanks, Nutch!
-
That looks like a configuration line coming next.
-
OK, another config file. Would it be too much to wrap those up into, say, one config file? In $NUTCH_HOME/conf/gora.properties:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
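If you'd rather script that edit than hunt for the line by hand, here is a minimal sketch. The function name set_gora_default is invented for this example, and it assumes the file uses plain key=value lines. It drops any active gora.datastore.default line and appends the HBaseStore one, so commented-out alternatives are left alone.

```shell
# set_gora_default FILE
# Ensure FILE has exactly one active gora.datastore.default line,
# pointing at the HBase store. (Hypothetical helper, not a Nutch tool.)
set_gora_default() {
  file="$1"
  line='gora.datastore.default=org.apache.gora.hbase.store.HBaseStore'
  scratch="$file.tmp"
  # Keep every line except an existing uncommented setting.
  # grep exits non-zero when nothing is left to print, hence the "|| true".
  grep -v '^gora\.datastore\.default=' "$file" > "$scratch" || true
  printf '%s\n' "$line" >> "$scratch"
  mv "$scratch" "$file"
}

# Example:
#   set_gora_default "$NUTCH_HOME/conf/gora.properties"
```

Note that the setting ends up at the bottom of the file rather than on its original line, which should not matter for a properties file.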
-
When I was looking for where to put the gora.datastore.default property, I found this: #########################
#########################
Might be helpful in the future, since I don't have a zookeeper running.
-
Oh well, making note of it and moving on. BTW, the property I want to change in gora.properties is on line 19.
-
"N.B. It's probably worth checking and setting all your usual configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before progressing." Well, I don't have any, so I'm progressing.
-
Now I am ready to compile? I'm using a binary distro so that seems odd, but oh well.
-
I need to install ant. Which BTW wants java 8. I'm sure that will work out fine. choco install ant
-
OK, ant runtime, ERROR!
$ ant runtime
java.lang.UnsupportedClassVersionError: org/apache/tools/ant/launch/Launcher : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Exception in thread "main"
-
Yeah, that's totally the Java version biting me in the tushy. My $JAVA_HOME got bumped to Java 8. I rolled it back, let's try again. Nope! No good. Reading the error again, it's the other way around: the Ant I installed is compiled for Java 8, so Java 7 can't even load it. "The class file version for Java SE 8 is 52.0 as per the JVM Specification. Version 52.0 class files produced by a Java SE 8 compiler cannot be used in earlier releases of Java SE." http://www.oracle.com/technetwork/java/javase/8-compatibility-guide-2156366.html So I'm going to roll forward to Java 8 again and update the JAVA_HOME in my configs. Yep, that totally got me to the next error.
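For anyone else staring at "Unsupported major.minor version", the number is stored in the class file itself: after the 4-byte magic number and the 2-byte minor version come 2 bytes of major version, and 52 means Java 8 while 51 means Java 7. A small sketch for checking a class file by hand follows. Both function names are made up, and it assumes od is available, which I'd expect in Git Bash but haven't verified.

```shell
# class_major FILE
# Print the major version from a .class file: skip 6 bytes
# (4 magic + 2 minor), then read 2 bytes, big-endian.
class_major() {
  od -An -tu1 -j6 -N2 "$1" | {
    read -r hi lo
    echo $(( $hi * 256 + $lo ))
  }
}

# java_for_major MAJOR
# Map a class file major version to the Java release that produces it
# (45 is Java 1.1, so 51 is Java 7 and 52 is Java 8).
java_for_major() {
  echo $(( $1 - 44 ))
}

# Example (the class has to be pulled out of a jar first):
#   java_for_major "$(class_major Launcher.class)"
```

If the number printed is higher than the Java that JAVA_HOME points at, you get exactly the error above.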
-
That error is ....
$ ant runtime
Buildfile: C:\apache-nutch-2.3.1\build.xml
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
[mkdir] Created dir: C:\apache-nutch-2.3.1\build
[mkdir] Created dir: C:\apache-nutch-2.3.1\build\classes
[mkdir] Created dir: C:\apache-nutch-2.3.1\build\release
[mkdir] Created dir: C:\apache-nutch-2.3.1\build\test
[mkdir] Created dir: C:\apache-nutch-2.3.1\build\test\classes
clean-lib:
resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\apache-nutch-2.3.1\ivy\ivysettings.xml
- Remember that missing jar? That's why we are re-building. We can just grab from maven. http://www.mail-archive.com/user@nutch.apache.org/msg14220.html That took me about an hour to find mind you. Let's see if it works. I'm going to grab that jar and drop it in my ... I don't see where to save it...
OK, I had to take a couple weeks off. Since then I've had to update Windows and my laptop's firmware. Let's see if I can sort this shit out now.
So what is the Ant equivalent of .m2?
OK, so Ant runs over Maven? http://maven.apache.org/ant-tasks/usage.html So let's put it in the .m2 repo.
So I haven't installed that jar yet... I started hbase again and tried ant runtime and this:
$ ant runtime
Buildfile: C:\apache-nutch-2.3.1\build.xml
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
clean-lib:
[delete] Deleting directory C:\apache-nutch-2.3.1\build\lib
resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\apache-nutch-2.3.1\ivy\ivysettings.xml
A different error, I may need to install Ant...
choco install ant
choco upgrade ant
Hrmm, Ant is already installed... so while I force Windows to install it again, let's look at this magical build.xml file. "... ..." Looks like the final version will end up in the release subfolder of the folder build.xml is in. Actually, we have a folder called runtime, and in there is a local folder with a bunch of jars in it, including a bunch of Gora jars... And huzzah, hbase-common-0.98.8-hadoop2.jar is in there. Honestly, I have no clue if this is working or not, but it seems to have what it wants for now...
- Make a list of URLs to crawl. WOOOOOTTTTT!!!!! We're finally talking about crawling the web. That's why I'm here. You're obviously here for the entertainment value. So this is tutorial inception again, and we are told to go back to the tutorial for Nutch 1. Oh no, I'm wrong: we need to follow the HBase tutorial to make sure it's set up correctly, ballz. "Start HBase. Use the bin/start-hbase.sh command to start HBase. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running." My jps only returns jps itself. FYI, jps is documented at http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html
- Update conf/hbase-site.xml; it needs to point at where the data is going to be written. The JavaTPoint tutorial has the directories for a nice, friendly Linux or Mac environment. Let's fix that: file:/c/Users/{username}/HBase/HFiles for hbase.rootdir and /c/Users/{username}/zookeeper for hbase.zookeeper.property.dataDir. Remember to save the file ;) Now jps shows HMaster. Awesome, that means it's running in local mode. Yeah, we could run a distributed HBase, and in production we'd want one. Be real though, in production it would already be running. Or just yell at us DevOps-like people to get one up for you!
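Put together, the two settings can be written out by a small script instead of by hand. This is only a sketch: write_hbase_site is a made-up name, the username and target directory are placeholders you pass in, and the surrounding configuration/property/name/value layout is the usual Hadoop-style config format rather than something copied from my actual file.

```shell
# write_hbase_site USERNAME CONF_DIR
# Write a standalone-mode hbase-site.xml into CONF_DIR with the two
# paths mentioned above. (Hypothetical helper; overwrites the file.)
write_hbase_site() {
  user="$1"
  conf_dir="$2"
  cat > "$conf_dir/hbase-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:/c/Users/$user/HBase/HFiles</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/c/Users/$user/zookeeper</value>
  </property>
</configuration>
EOF
}

# Example (path assumes HBase was unpacked at the top of C:):
#   write_hbase_site myname /c/hbase-0.98.8-hadoop2/conf
```

Back up the existing hbase-site.xml first if it has anything in it you care about, since this replaces the whole file.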
- Finally writing the ducking URL list. Back up the tutorial hole we go. https://wiki.apache.org/nutch/NutchTutorial#Create_a_URL_seed_list "Create a URL seed list A URL seed list includes a list of websites, one-per-line, which nutch will look to crawl The file conf/regex-urlfilter.txt will provide Regular Expressions that allow nutch to filter and narrow the types of web resources to crawl and download"
- so that file is this
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
I'm leaving this the same. Like all sane, modern devs I prefer to let others handle my regex. There are people who enjoy this https://xkcd.com/1313/ https://www.oreilly.com/learning/regex-golf-with-peter-norvig dude that guy does videos for coursera! Also this explains why I don't work at Google. I didn't even finish the video. Or the course...
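For the curious, the file reads top to bottom: a line starting with - rejects a matching URL, a line starting with + accepts it, and as far as I understand the first rule that matches wins. Below is a rough imitation of three of the rules plus the final catch-all, using grep. The function name url_allowed is made up, the extension list is shortened, the backreference rule is left out, and grep's regex flavor is not identical to Java's, so treat it as an illustration only.

```shell
# url_allowed URL
# Succeeds (exit 0) if the URL would pass this simplified rule set,
# fails (exit 1) if one of the "-" rules matches first.
url_allowed() {
  url="$1"
  # -^(file|ftp|mailto):   skip non-web schemes
  if printf '%s\n' "$url" | grep -Eq '^(file|ftp|mailto):'; then return 1; fi
  # -\.(gif|...)$          skip images, archives and so on (list shortened)
  if printf '%s\n' "$url" | grep -Eq '\.(gif|jpg|png|css|js|zip|exe)$'; then return 1; fi
  # -[?*!@=]               skip URLs that look like queries
  if printf '%s\n' "$url" | grep -Eq '[?*!@=]'; then return 1; fi
  # +.                     accept anything else
  return 0
}

# Example:
#   url_allowed 'http://nutch.apache.org/' && echo crawl it
```

So with the default file, a plain page URL gets followed, while a link to a .png or a URL with a query string does not.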
-
Anyways, the Nutch 1 tutorial says how to make the URL list, beyond the regex. (The regex file is basically how Nutch determines whether it's going to follow a URL or not.)
"mkdir -p urls
cd urls
touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl)."
REALLY!!! Was '-p' really necessary? That's like some wank being like "I'm going to show off. I do this every time just to be safe." It's not safe, that's how you make dirs where you don't want them, dip shat.
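The quoted steps boil down to "make a folder, put a text file in it with one URL per line". Here is a minimal sketch that does that in one go, without -p. The function name make_seed_list is invented, and it overwrites any existing seed.txt in the folder.

```shell
# make_seed_list DIR URL [URL...]
# Create DIR if needed (one level only, no -p) and write each URL
# on its own line into DIR/seed.txt. (Hypothetical helper.)
make_seed_list() {
  dir="$1"
  shift
  [ -d "$dir" ] || mkdir "$dir"
  # Start from an empty file, then append one URL per line.
  : > "$dir/seed.txt"
  for url in "$@"; do
    printf '%s\n' "$url" >> "$dir/seed.txt"
  done
}

# Example:
#   make_seed_list urls http://nutch.apache.org/
```

Because it uses plain mkdir, it fails loudly if the parent folder doesn't exist instead of quietly building a whole path somewhere unexpected.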
-
OK, I'll bite and use "http://nutch.apache.org/" as my seed URL. Let's try this. Now go back to the Nutch 2 tutorial https://wiki.apache.org/nutch/Nutch2Tutorial Let's try injecting. That sounds fun and painless... From your runtime/local/bin folder do " nutch inject /someseedDir nutch readdb". someseedDir? Damn, that is some seed dir you got there. Must come from a hell of a mkdir -p.
So I'm guessing this isn't the right way to call this...
$ ./nutch inject seed.txt
[Fatal Error] nutch-site.xml:23:7: The string "--" is not permitted within comments.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:/C:/apache-nutch-2.3.1/runtime/local/conf/nutch-site.xml; lineNumber: 23; columnNumber: 7; The string "--" is not permitted within comments.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2348)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2205)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2112)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:989)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:961)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1299)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:319)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:479)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: org.xml.sax.SAXParseException; systemId: file:/C:/apache-nutch-2.3.1/runtime/local/conf/nutch-site.xml; lineNumber: 23; columnNumber: 7; The string "--" is not permitted within comments.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2183)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2171)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2242)
... 11 more
Looks like Nutch dislikes the sass in my nutch-site.xml file: XML doesn't allow "--" inside a comment. I'm pulling back my dashes. Honestly, who doesn't use YAML these days? XML is so old school. I'm going to be lazy and change it in the runtime copy of the file and not in the original conf file. I'm unconvinced ant will work again.
-
Let's run "./nutch inject urls" instead of pointing at the file, since Nutch is looking for a directory. Also, in case you didn't catch it: the commands we are running go through shell scripts.
-
It's just sitting there. I went through a whole diply article and it still hasn't run. This isn't the crawling. This is just preparing to crawl. This can't be right. Well, that could be because my HMaster shut down. Would have been nice to be notified. There are a few errors in the HBase logs. The first is:
2017-04-25 15:08:45,449 ERROR [main] util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
The first comment on this has links to download them. I'm going with the older Hadoop because this HBase is older. http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
I'm moving all the files in the zip to the bin folder for my HBase. Now I'm going to tail -f the log file and wait for the puppy to crash. Time to find another diply to read. Oh, instead I looked up Katie+8. Go take your hate away. Those kids are big since I last watched. It looks like HMaster is stable now. Let's inject again. An error is PROGRESS!!!!
$ ./nutch inject urls
InjectorJob: starting at 2017-05-09 21:47:49
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:774)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:646)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:434)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:115)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Looks like I need the source code to debug this one. To the GitHub!!!! https://github.com/apache/nutch Remember to use tag: release-2.3.1
git fetch --all --tags --prune
git checkout tags/release-2.3.1 -b master-release-2.3.1
Now I'm using IntelliJ, shush you Eclipse-loving, wannabe players.
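One note on the winutils error from the HBase log: the "null" in null\bin\winutils.exe is, as far as I can tell, Hadoop's home directory setting (the HADOOP_HOME environment variable or the hadoop.home.dir Java property) not being set at all. The NullPointerException from Nutch goes through the same org.apache.hadoop.util.Shell class, so it may well be the same problem showing up in a second process, though I haven't confirmed that. A quick sanity check, sketched as a function with a made-up name (check_winutils):

```shell
# check_winutils [HADOOP_HOME]
# Report whether bin/winutils.exe exists under the given directory,
# defaulting to the HADOOP_HOME environment variable.
# (Hypothetical helper, not part of Hadoop.)
check_winutils() {
  home="${1:-${HADOOP_HOME:-}}"
  if [ -z "$home" ]; then
    echo "HADOOP_HOME is not set"
    return 1
  fi
  if [ -f "$home/bin/winutils.exe" ]; then
    echo "found $home/bin/winutils.exe"
  else
    echo "missing $home/bin/winutils.exe"
    return 1
  fi
}

# Example (the directory is whatever folder holds your bin/winutils.exe):
#   check_winutils /c/hbase-0.98.8-hadoop2
```

If it reports that HADOOP_HOME is not set, that would line up with the "null" in the log message.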
OK, I need to update my IntelliJ because, apparently, I haven't done that in a couple of years. Then I'm going to bed. Probably gonna be a few days before I get back to this.