@lassebenni
Created April 29, 2020 19:46
About hadoop

TL;DR: Hadoop is a framework for distributed storage and distributed processing of very large data sets on a cluster.

The distributed storage part means that the data is stored in pieces (blocks) on multiple computers (nodes) in a resilient way, so that in case of hardware failure on one of the nodes, the data stays available. This storage system is called HDFS (Hadoop Distributed File System) and acts like one single storage device (i.e. a hard disk) even though it is composed of many, many different nodes that each store a part of the entire data. This is very cost-efficient, as one can keep adding hardware to keep up with increasing amounts of data (horizontal scaling).
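To make this concrete, here is a small sketch of how HDFS exposes blocks and replication, assuming a running cluster and a file at the hypothetical path /data/example.csv:

# Show how a file is split into blocks and where the replicas live
# (/data/example.csv is a hypothetical path used for illustration)
hdfs fsck /data/example.csv -files -blocks -locations

# The replication factor (number of copies of each block) is configurable,
# e.g. request 3 copies of each block of this file:
hdfs dfs -setrep 3 /data/example.csv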

The next part is distributed processing, which means that a task (varying from simple to complex) can be split into equal parts and handed to nodes (which use their own memory/CPU power) to process in parallel. That way, just like with distributed storage, many computers use their resources to act as a single "supercomputer" when handling a task for the system. The results are then returned as if the task had been run on one single machine. Again, this is more cost-efficient than one single expensive machine.
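As a sketch of what such a distributed task looks like in practice: Hadoop ships with an examples jar containing a word-count MapReduce job, which splits the input across the nodes, counts words in parallel and merges the results. The jar location and the input/output paths below are illustrative and differ per installation and version:

# Run the bundled word-count example on a file already stored in HDFS
# (jar path and data paths are illustrative; they vary per distribution)
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    wordcount /data/books.txt /data/wordcount-output

# Inspect the merged result written by the reducers
hadoop fs -cat /data/wordcount-output/part-r-00000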

"Very large datasets" is a somewhat subjective term. Some say that "very large" means petabytes or more; others define it as data that cannot be handled well by traditional (relational) database management systems. It is hard to tell exactly when "regular data" grows into "big data". Usually Big Data is used as a catch-all term for data that is not handled well in a traditional relational database and should instead be stored and operated on in a NoSQL database. Streaming (continuous) data is often given as a good example. Whether the term applies varies from case to case and should be assessed by weighing the traditional options against "Big Data solutions(tm)".

Finally, a cluster means a group of nodes (machines) working together as a unit.

All of this combined gives us a relatively cheap way to handle large amounts of data: the machines pool their resources to act as one large machine (cluster), tasks are divided and run fast (in parallel), and all the while we ensure data is not lost in case of failure (resilient, distributed storage).


For additional information about how Hadoop came to be, here is an article on Medium: The History of Hadoop by Marko Bonaci.

The term Hadoop 2.0 refers to the uncoupling of MapReduce from the 'Hadoop Ecosystem'. Before Apache Hadoop version 2.0, MapReduce was the all-encompassing manager of the cluster: it assigned resources, delegated jobs, processed data and communicated with client APIs. On top of that, MapReduce was designed to handle batches of data, not data streams. In an attempt to solve these problems and separate MapReduce from HDFS, the code that handled the distribution of tasks was gathered and split off from MapReduce and codenamed YARN (Yet Another Resource Negotiator) in 2012.

In order to generalize processing capability, the resource management, workflow management and fault-tolerance components were removed from MapReduce, a user-facing framework, and transferred into YARN, effectively decoupling cluster operations from the data pipeline.

~ The History of Hadoop by Marko Bonaci


This meant that other developers could build on the now decoupled Hadoop, creating their own APIs that did not have to fit the tight MapReduce mold. Many, many APIs have since been built on this stack.

To play around with Hadoop, there are different options available to us. The easiest way to get it up and running quickly is to use a distribution. This is basically a pre-packaged version of the "Hadoop Ecosystem" with batteries included (configurations pre-set, applications pre-installed). Otherwise you would have to install Java, install Hadoop and set various environment variables: all things you are probably not very concerned with when first getting to know Hadoop and simply wanting to play around in it.

Three main distribution vendors exist: Cloudera, HortonWorks and MapR. Of these, Cloudera is the oldest, HortonWorks distributes a 100% open source distribution of Apache Hadoop, and MapR has created its own version of Apache Hadoop's MapReduce component.

HortonWorks offers a 'sandbox' to learn Hadoop and its derivatives, and we will be using that to learn the environment. To keep this consistent we will use Docker and a Docker image of the HortonWorks Sandbox Standalone HDP distribution. This way we can switch distributions quickly in the future and do not have to install the Hadoop Ecosystem on our own machine, but use a virtual image instead.

If you do not have Docker installed, you can download and install it from the Docker CE download page (Windows/Mac/Linux versions). For more information, see the documentation.
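Once installed, you can quickly verify that Docker is working before continuing, for example:

# Check that the Docker client is installed and the daemon is running
docker --version
docker info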

We will mainly be following the installation guide provided by HortonWorks. The steps below are covered there, along with more.

Step 1 ~ Download the distro: Download the HortonWorks Hadoop Sandbox called HDP (HortonWorks Data Platform), not to be confused with HDF (HortonWorks DataFlow, a different distribution for streaming data)! I am using Windows, so I downloaded the Docker Windows image/configuration file. This results in a ZIP file (in my case called start-sandbox-hdp-standalone_2-6-4.ps1.zip) containing a .ps1 PowerShell script. Unzip this ZIP file. For Linux/Mac, the downloaded ZIP file contains an .sh script, the equivalent of the Windows version.

Step 2 ~ Install Docker and run the distro: For this step, Docker CE needs to be installed and running (see the download link and documentation mentioned above). For users of the Docker GUI on Windows and Mac, HortonWorks recommends increasing the amount of RAM assigned to Docker to a minimum of 8 GB. Assuming you have done that, you can now run the script that you downloaded in step 1.
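You can check how much memory the Docker daemon actually has available with, for instance:

# Prints the total memory available to Docker, in bytes
# (should correspond to at least ~8 GB for the sandbox)
docker info --format '{{.MemTotal}}'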

Change directory to the location of the script (sh or ps1 file).

For Mac/Linux, just run:

sh start-sandbox-hdp-standalone_{version}.sh

Replace {version} with the version number of the specific distro, contained in the filename.

For Windows users:

You have to run the PowerShell script with some preceding options that handle execution authorization.

powershell -ExecutionPolicy ByPass -File start-sandbox-hdp-standalone_{version}.ps1

This process starts the Docker containers, loads config files and gets all the applications within the sandbox environment (HDFS, YARN, AMBARI) up and running. It could take a while depending on your hardware specifications and the amount of resources assigned to Docker. When finished, the output should display:

Started Hortonworks HDP container

Step 3 ~ Connect to Hadoop: Now that the Hadoop Sandbox is up and running, we should be able to connect to it. This and the following steps are described in the following HortonWorks guide as well. The Docker container that has been spun up is forwarding ports to localhost. We can connect to the Sandbox with SSH on localhost port 2222 (alternatively you can map the container's IP address to a desired hostname, as the guide specifies). On Windows you can use PuTTY or a different SSH client (I am using CMDer, which has SSH capabilities built in).
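Before connecting, you can optionally confirm from the host that the sandbox container is up. The container name below is how the HDP sandbox typically registers itself, but it may differ per version:

# List running containers matching the sandbox name, with their forwarded ports
docker ps --filter "name=sandbox-hdp"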

ssh root@localhost -p 2222

This prompts for the root password, which is hadoop by default. It will then prompt you to change this password; remember/write down the new one!

Another way to connect to the HDP Sandbox is the web terminal: browsing to localhost:4200 should display a terminal that can be used to log in just like over SSH.

[root@sandbox-hdp ~]#

We should now be inside the Sandbox CLI environment, ready to run commands! We could list the folders in the '/' dir of the Hadoop filesystem, for instance.

[root@sandbox-hdp /]# hadoop fs -ls /
Found 12 items
drwxrwxrwx   - yarn   hadoop  0 2018-04-10 12:53 /app-logs
drwxr-xr-x   - hdfs   hdfs    0 2018-02-01 10:32 /apps
drwxr-xr-x   - yarn   hadoop  0 2018-02-01 10:24 /ats
drwxr-xr-x   - hdfs   hdfs    0 2018-02-01 10:39 /demo
drwxr-xr-x   - hdfs   hdfs    0 2018-02-01 10:24 /hdp
drwx------   - livy   hdfs    0 2018-02-01 10:27 /livy2-recovery
drwxr-xr-x   - mapred hdfs    0 2018-02-01 10:24 /mapred
drwxrwxrwx   - mapred hadoop  0 2018-02-01 10:25 /mr-history
drwxr-xr-x   - hdfs   hdfs    0 2018-02-01 10:24 /ranger
drwxrwxrwx   - spark  hadoop  0 2018-04-12 11:16 /spark2-history
drwxrwxrwx   - hdfs   hdfs    0 2018-02-01 10:47 /tmp
drwxr-xr-x   - hdfs   hdfs    0 2018-04-10 08:58 /user

Congratulations, you successfully spun up Hadoop as a single-node cluster and executed your first command!
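From here you can try a few more basic filesystem operations; a small sketch (the paths are just examples):

# Create a directory in HDFS and upload a local file into it
hadoop fs -mkdir -p /user/root/demo
echo "hello hadoop" > /tmp/hello.txt
hadoop fs -put /tmp/hello.txt /user/root/demo/

# List the directory and read the file back out of HDFS
hadoop fs -ls /user/root/demo
hadoop fs -cat /user/root/demo/hello.txt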

There are also other users/roles we could switch to, predefined by Hortonworks. Browse the different users to get a feel for common roles and rights for different Hadoop users.
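Inside the sandbox you can switch to one of these users from the root shell, for example (hdfs is the HDFS superuser on the sandbox; the other predefined users may vary per HDP version):

# Switch to the hdfs user and look at its view of the filesystem
su - hdfs
hdfs dfs -ls /user
exit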

Step 4 ~ Login to Ambari: HortonWorks bundles its HDP distribution with Ambari to manage the cluster. Ambari is a tool that manages the different nodes and supplies users with a web interface to run commands, handle files and monitor usage of the Hadoop cluster.

A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop. Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations.

~ https://hortonworks.com/apache/ambari/

We can use Ambari for a lot of operations that are harder to visualize on the CLI. In the HDP distro it is hosted on port 8080, and a tutorial section of the website is hosted on port 8888. Navigating to http://localhost:8888 in your browser takes you to this tutorial/friendly setup page. Here you can either follow a tutorial (enable pop-ups) or view the other applications that are hosted by the HDP container.

Ambari welcome page

Click the "Launch Dashboard" button on the left to go to the Ambari dashboard.

This should take you to the login screen of Ambari. We now need to create the password for the admin user. To do that, switch back to the terminal that connects us to our Hadoop Sandbox; if you closed it, reconnect by typing ssh root@localhost -p 2222 and providing the password you created previously. Now, logged in as the root user, execute the following command.

ambari-admin-password-reset

This prompts you to create a new password and to confirm it. The webserver will then restart, after which it will start listening on port 8080 again. When the prompt confirms this, you can go back to the login screen and log in with user admin and the password you just created. This should take you to the Ambari Dashboard.
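As the quote above mentions, Ambari also exposes a REST API on the same port, which is handy for automation. A minimal sketch, using the admin password you just set:

# Ask Ambari which clusters it manages (replace <password> with your admin password)
curl -u admin:<password> http://localhost:8080/api/v1/clusters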
