Setup Hadoop on Windows 10

  1. Download Hadoop 3.2.2 from the Apache Archives and extract it to C:/hadoop/.
  2. Download any version of JDK 8. The older, the better.
  3. Any other version of Hadoop can be used if someone has already built winutils for that version. You can also build winutils yourself using this tutorial, which is easier if you know how to use UNIX commands on Windows (Cygwin can be used for this). So, download the winutils of Hadoop 3.2.2 by pasting the link of the GitHub folder into this website. Paste the downloaded folder into C:\hadoop\hadoop-3.2.2. You can also download the winutils of Hadoop 3.2.2 from my drive.
  4. Finally, download the attached 'ps1' file (config.ps1, reproduced below) and run it. It will do everything for you. If the script errors out, you can do the steps manually (a verification sketch follows this list):
    1. Create the required folders for the namenode and datanode: C:\hadoop\hadoop-3.2.2\data\dfs\data and C:\hadoop\hadoop-3.2.2\data\dfs\namespace_logs
    2. Add the required configuration to C:\hadoop\hadoop-3.2.2\etc\hadoop\[core,hdfs,mapred,yarn]-site.xml. Download the configuration files from here and paste them into C:\hadoop\hadoop-3.2.2\, or copy the text from the files below.
    3. Create local user environment variables (Windows Key+R, then run: C:\Windows\system32\rundll32.exe sysdm.cpl,EditEnvironmentVariables):
      1. HADOOP_HOME: C:\hadoop\hadoop-3.2.2 -> SAVE
      2. HADOOP_CP: <open a terminal, execute hadoop classpath, and paste the output>
      3. HDFS_LOC: hdfs://localhost:19000
    4. Add %HADOOP_HOME%\bin and %HADOOP_HOME%\sbin to the local user Path.
    5. Change the JAVA_HOME variable in C:\hadoop\hadoop-3.2.2\etc\hadoop\hadoop-env.cmd to C:\Progra~1\Java\jdk1.8.0_<your-version>. Check the Windows 8.3 naming rules if you're interested in why the short form is needed.
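
Once the variables are saved, open a NEW terminal so the updated Path takes effect, and verify the install. This is a minimal check, assuming the layout above; note that the NameNode must be formatted once before the very first start:

   hadoop version          # should report Hadoop 3.2.2
   hdfs namenode -format   # one-time format, required before the first start-dfs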

Run Hadoop with Python

Open an admin terminal on a new desktop (Windows Key+Ctrl+D) and execute start-dfs and start-yarn. These scripts are located in the sbin directory. To know if your HDFS is running, you can open the NameNode web UI (http://localhost:9870 by default in Hadoop 3.x) or check the running daemons as sketched below:
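
A quick way to list the running daemons is jps, which ships with the JDK. On a healthy single-node setup it should show something like the following (PIDs will differ):

   jps
   # 12345 NameNode
   # 12346 DataNode
   # 12347 ResourceManager
   # 12348 NodeManager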

(NOTE: If you didn't run config.ps1 and followed the configuration steps manually, download this folder and paste it into C:/hadoop/.)

  1. Open an admin terminal in C:/hadoop
  2. Add data to HDFS by executing:
    1. hdfs dfs -mkdir /test
    2. hdfs dfs -copyFromLocal "C:\hadoop\test.txt" /test
  3. Make the scripts readable and executable: chmod 777 mapper.py reducer.py (use a UNIX-like shell such as Cygwin for this; prefix the commands with ! only if you run them from a notebook cell instead of a terminal)
  4. Run the streaming job by executing (assuming your terminal is opened in C:/hadoop):
   hadoop jar C:\hadoop\hadoop-3.2.2\share\hadoop\tools\lib\hadoop-streaming-3.2.2.jar \
  -input /test/test.txt \
  -output /test/output \
  -mapper "<path to python> <path to mapper.py>" \
  -reducer "<path to python> <path to reducer.py>" \
  -file "<path to mapper.py>" \
  -file "<path to mapper.py>"

Here, <path to python> can be obtained by executing where python (or which python in a UNIX-like shell). If there are spaces inside your path, remember to use the short path complying with the Windows 8.3 naming rules: the streaming job can't handle spaces in these arguments and will fail otherwise. To obtain a short path, check this link or use the snippet below. If the Python path is incorrect or the streaming job is unable to read it correctly, the most probable error is a java.io.IOException complaining about HADOOP_HOME.
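
A minimal PowerShell sketch for obtaining the 8.3 short path (the C:\Program Files\Python39 path is a made-up example; substitute your own):

   $fso = New-Object -ComObject Scripting.FileSystemObject
   $fso.GetFile("C:\Program Files\Python39\python.exe").ShortPath
   # e.g. C:\PROGRA~1\PYTHON39\PYTHON.EXE (the actual short name may differ)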

Extras

  • To run the streaming job again, first delete the previous output folder (the job fails if the output path already exists): hadoop fs -rm -R /test/output
  • To check the content of an HDFS output file, execute: hdfs dfs -cat /test/output/part-00000
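
Putting those together, a typical re-run cycle looks like this (assuming the paths used throughout this guide):

   hdfs dfs -rm -R /test/output            # clear the previous output
   # ... re-run the hadoop jar streaming command from above ...
   hdfs dfs -ls /test/output               # a _SUCCESS marker plus part files
   hdfs dfs -cat /test/output/part-00000   # inspect the word counts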

config.ps1

# Create directories for the namenode and datanode
$baseDir = "C:/hadoop/hadoop-3.2.2/data"
New-Item -ItemType Directory -Path "$baseDir" -Force
New-Item -ItemType Directory -Path "$baseDir/dfs/data" -Force
New-Item -ItemType Directory -Path "$baseDir/dfs/namespace_logs" -Force
# Fix JAVA_HOME: locate the JDK 8 folder and convert it to its 8.3 short path
Set-Location "C:\Progra~1\Java\"
$javaFolder = Get-ChildItem | Where-Object { $_.Name -like "jdk1.8.0_*" } | Select-Object -First 1
$javaPath = $javaFolder.FullName
$fso = New-Object -ComObject Scripting.FileSystemObject
$javaPath = $fso.GetFolder($javaPath).ShortPath
$hadoopEnvPath = "C:\hadoop\hadoop-3.2.2\etc\hadoop\hadoop-env.cmd"
(Get-Content $hadoopEnvPath) | ForEach-Object {
    if ($_ -like 'set JAVA_HOME=*') {
        $_ -replace 'set JAVA_HOME=.*', "set JAVA_HOME=$javaPath"
    }
    else {
        $_
    }
} | Set-Content $hadoopEnvPath
# Create environment variables. HADOOP_HOME is also set for the current
# session, because SetEnvironmentVariable(..., "User") does not affect the
# running process and $env:HADOOP_HOME is needed just below.
$env:HADOOP_HOME = "C:\hadoop\hadoop-3.2.2"
[Environment]::SetEnvironmentVariable("HADOOP_HOME", $env:HADOOP_HOME, "User")
$hadoopClasspath = & "$env:HADOOP_HOME\bin\hadoop.cmd" classpath
[Environment]::SetEnvironmentVariable("HADOOP_CP", $hadoopClasspath, "User")
[Environment]::SetEnvironmentVariable("HDFS_LOC", "hdfs://localhost:19000", "User")
$path = [Environment]::GetEnvironmentVariable("Path", "User")
$newPaths = "$env:HADOOP_HOME\bin;$env:HADOOP_HOME\sbin"
[Environment]::SetEnvironmentVariable("Path", "$newPaths;$path", "User")
# 1. core-site.xml
$path = "C:\hadoop\hadoop-3.2.2\etc\hadoop\core-site.xml"
$xml = [xml](Get-Content $path)
$property = $xml.CreateElement("property")
$propertyname = $xml.CreateElement("name")
$propertyvalue = $xml.CreateElement("value")
$propertyname.InnerText = "fs.default.name"
$propertyvalue.InnerText = "hdfs://0.0.0.0:19000"
$property.AppendChild($propertyname)
$property.AppendChild($propertyvalue)
$xml.configuration.AppendChild($property)
$xml.Save($path)
# 2. hdfs-site.xml
$path = "C:\hadoop\hadoop-3.2.2\etc\hadoop\hdfs-site.xml"
$xml = [xml](Get-Content $path)
$replicationProperty = $xml.CreateElement("property")
$replicationPropertyName = $xml.CreateElement("name")
$replicationPropertyValue = $xml.CreateElement("value")
$replicationPropertyName.InnerText = "dfs.replication"
$replicationPropertyValue.InnerText = "1"
$replicationProperty.AppendChild($replicationPropertyName)
$replicationProperty.AppendChild($replicationPropertyValue)
$xml.configuration.AppendChild($replicationProperty)
$namenodeDirProperty = $xml.CreateElement("property")
$namenodeDirPropertyName = $xml.CreateElement("name")
$namenodeDirPropertyValue = $xml.CreateElement("value")
$namenodeDirPropertyName.InnerText = "dfs.namenode.name.dir"
$namenodeDirPropertyValue.InnerText = "file:///$baseDir/dfs/namespace_logs"
$namenodeDirProperty.AppendChild($namenodeDirPropertyName)
$namenodeDirProperty.AppendChild($namenodeDirPropertyValue)
$xml.configuration.AppendChild($namenodeDirProperty)
$datanodeDirProperty = $xml.CreateElement("property")
$datanodeDirPropertyName = $xml.CreateElement("name")
$datanodeDirPropertyValue = $xml.CreateElement("value")
$datanodeDirPropertyName.InnerText = "dfs.datanode.data.dir"
$datanodeDirPropertyValue.InnerText = "file:///$baseDir/dfs/data"
$datanodeDirProperty.AppendChild($datanodeDirPropertyName)
$datanodeDirProperty.AppendChild($datanodeDirPropertyValue)
$xml.configuration.AppendChild($datanodeDirProperty)
$xml.Save($path)
# 3. mapred-site.xml
$path = "C:\hadoop\hadoop-3.2.2\etc\hadoop\mapred-site.xml"
$xml = [xml](Get-Content $path)
$frameworkNameProperty = $xml.CreateElement("property")
$frameworkNamePropertyName = $xml.CreateElement("name")
$frameworkNamePropertyValue = $xml.CreateElement("value")
$frameworkNamePropertyName.InnerText = "mapreduce.framework.name"
$frameworkNamePropertyValue.InnerText = "yarn"
$frameworkNameProperty.AppendChild($frameworkNamePropertyName)
$frameworkNameProperty.AppendChild($frameworkNamePropertyValue)
$xml.configuration.AppendChild($frameworkNameProperty)
$classpathProperty = $xml.CreateElement("property")
$classpathPropertyName = $xml.CreateElement("name")
$classpathPropertyValue = $xml.CreateElement("value")
$classpathPropertyName.InnerText = "mapreduce.application.classpath"
$classpathPropertyValue.InnerText = "%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*"
$classpathProperty.AppendChild($classpathPropertyName)
$classpathProperty.AppendChild($classpathPropertyValue)
$xml.configuration.AppendChild($classpathProperty)
$xml.Save($path)
# 4. yarn-site.xml
$path = "C:\hadoop\hadoop-3.2.2\etc\hadoop\yarn-site.xml"
$xml = [xml](Get-Content $path)
$resourceManagerProperty = $xml.CreateElement("property")
$resourceManagerPropertyName = $xml.CreateElement("name")
$resourceManagerPropertyValue = $xml.CreateElement("value")
$resourceManagerPropertyName.InnerText = "yarn.resourcemanager.hostname"
$resourceManagerPropertyValue.InnerText = "localhost"
$resourceManagerProperty.AppendChild($resourceManagerPropertyName)
$resourceManagerProperty.AppendChild($resourceManagerPropertyValue)
$xml.configuration.AppendChild($resourceManagerProperty)
$auxServicesProperty = $xml.CreateElement("property")
$auxServicesPropertyName = $xml.CreateElement("name")
$auxServicesPropertyValue = $xml.CreateElement("value")
$auxServicesPropertyName.InnerText = "yarn.nodemanager.aux-services"
$auxServicesPropertyValue.InnerText = "mapreduce_shuffle"
$auxServicesProperty.AppendChild($auxServicesPropertyName)
$auxServicesProperty.AppendChild($auxServicesPropertyValue)
$xml.configuration.AppendChild($auxServicesProperty)
$envWhitelistProperty = $xml.CreateElement("property")
$envWhitelistPropertyName = $xml.CreateElement("name")
$envWhitelistPropertyValue = $xml.CreateElement("value")
$envWhitelistPropertyName.InnerText = "yarn.nodemanager.env-whitelist"
$envWhitelistPropertyValue.InnerText = "JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME"
$envWhitelistProperty.AppendChild($envWhitelistPropertyName)
$envWhitelistProperty.AppendChild($envWhitelistPropertyValue)
$xml.configuration.AppendChild($envWhitelistProperty)
$xml.Save($path)
# Create Files
$testFilePath = "C:\hadoop\test.txt"
$mapperFilePath = "C:\hadoop\mapper.py"
$reducerFilePath = "C:\hadoop\reducer.py"
# 1. Create test.txt file
Set-Content -Path $testFilePath -Value @"
this is the test file. if you are seeing it in the HDFS test folder, it means the file uploaded successfully. if you are seeing it in the output folder, it means the streaming job ran successfully
"@ -Force
# 2. Create mapper.py file (the indentation matters: this is Python source)
Set-Content -Path $mapperFilePath -Value @"
import sys

# Emit '<word><TAB>1' for every word read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
"@ -Force
# 3. Create reducer.py file
Set-Content -Path $reducerFilePath -Value @"
import sys

# Sum the counts per word; the shuffle phase delivers input sorted by key
current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the final word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
"@ -Force

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- <value>file:///DIRECTORY 1 HERE</value> -->
    <value>file:///C:/hadoop/hadoop-3.2.2/data/dfs/namespace_logs</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <!-- <value>file:///DIRECTORY 2 HERE</value> -->
    <value>file:///C:/hadoop/hadoop-3.2.2/data/dfs/data</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>