Hadoop legacy

In the previous blog post I explained the basic concepts of data lakes, defined some of the core problems that can occur in them, and gave some hints on how to avoid those problems. Most of these pitfalls are caused by the inherent traits of data lakes. Unfortunately, current Hadoop distributions can't resolve them entirely; additionally, the Hadoop architecture itself is not the best fit for such solutions, and Hadoop has a reputation for being extremely complex.

This time I would like to focus on the problems that the Hadoop environment causes in data lake implementations. Can they be fixed?

Architecture

Hadoop has an irreparably fractured ecosystem: every responsibility is only partially fulfilled by one of dozens of services. For example, for security there are Sentry and Ranger; for processing management there are NiFi and Falcon. The solution may be to select an existing distribution such as HDP, CDH or MEP. Each one is different, because the companies behind Hadoop each have their own unique vision. Selecting the right distribution can be a real challenge, especially as each ships different Hadoop components (for example Cloudera’s Navigator in CDH) and configuration managers like Ambari. Everyone wants to do it their own way and there is little cooperation between them. Even Facebook, which probably runs the biggest Hadoop deployment in the world, forked Hadoop seven years ago and has since kept its fork closed source [1].

Hadoop has been on the market for a long time, yet some problems have still not been solved, because the Hadoop architecture, and HDFS in particular, does not allow a good implementation of data lakes. For example, HDFS is a good fit for batch jobs because it does not compromise on data locality. Right now, however, streaming is equally important, and in the future it will be more important than batch processing.

There are three major categories of architectural problems in the Hadoop environment: security, versioning and lineage (metadata). First I will point out the most important problems, and then give an overview of existing alternatives.

Security

Hadoop clusters are used by many people with different backgrounds. Without a tool like Ranger, you can’t create access policies and you don't have any audit logs in the Hadoop cluster. Let’s look at the following example: you run a Spark job which reads from and writes to HDFS. The job runs as the user who triggered it, but by default that user's identity is not propagated through the rest of the stack. This breaks the scenario I actually want: I am data scientist X, I want to run a job under my identity X, and I want the resources to be accessed with that same identity. As you can see, to provide consistent security over the cluster you have to integrate all services with a tool that provides fine-grained access control across cluster components.
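To make the identity problem concrete, here is a minimal sketch (in Scala, using Hadoop's standard UserGroupInformation API) of what explicit impersonation looks like when a service acts on behalf of an end user. The user name, HDFS path and the proxy-user whitelist are assumptions for illustration; impersonation only works if `core-site.xml` allows the service account to proxy other users.

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml from the classpath

    // The service logs in with its own technical credentials...
    val serviceUser = UserGroupInformation.getLoginUser

    // ...and then impersonates the end user. This only succeeds if core-site.xml
    // whitelists the service account via hadoop.proxyuser.<service>.hosts/groups.
    val dataScientist = UserGroupInformation.createProxyUser("data-scientist-x", serviceUser)

    val files = dataScientist.doAs(new PrivilegedExceptionAction[Array[Path]] {
      override def run(): Array[Path] = {
        val fs = FileSystem.get(conf)
        // HDFS permission checks (and Ranger policies, if installed) now see "data-scientist-x"
        fs.listStatus(new Path("/data/lake/raw")).map(_.getPath)
      }
    })

    files.foreach(println)
  }
}
```

Every service in the chain has to do something like this explicitly; nothing in the stack does it for you, which is exactly the gap described above.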

Versioning

In a data lake you run a lot of data ingestion and wrangling pipelines that are developed by more than one developer. If you do not trace changes in a pipeline, then bug hunting becomes extremely hard, because it is not possible to isolate a particular pipeline run and read its logs. In Hadoop, data versioning is not possible without an additional metadata layer (tools like Hive). If you can't connect data to the pipeline version that produced it, then working out how the current pipeline behaves is extremely hard. Of course, you can give each pipeline deployment a unique name and then check out a particular run, but this is not the most efficient way to deal with errors in production. Even if you can find bugs and fix them, you still need to revert the corrupted data; this isn’t possible without metadata.
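Below is a minimal sketch of the "unique name per deployment" workaround mentioned above: every run writes to an HDFS path that encodes the pipeline version (for example a git commit) and a run id, plus a small manifest file. The paths, the manifest format and the version source are illustrative assumptions, not part of any Hadoop tool.

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object VersionedRun {
  // Encode pipeline name, code version and run id into the output location.
  def outputPathFor(base: String, pipeline: String, version: String, runId: String): Path =
    new Path(s"$base/$pipeline/version=$version/run=$runId")

  // Write a tiny JSON manifest next to the data so a run can be traced back to its code.
  def writeManifest(fs: FileSystem, out: Path, pipeline: String, version: String, runId: String): Unit = {
    val manifest =
      s"""{"pipeline":"$pipeline","version":"$version","runId":"$runId","createdAt":"${java.time.Instant.now}"}"""
    val stream = fs.create(new Path(out, "_MANIFEST.json"))
    try stream.write(manifest.getBytes(StandardCharsets.UTF_8))
    finally stream.close()
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val out = outputPathFor("/data/lake/curated", "clickstream-sessionize", "g1a2b3c", "2017-09-04T10-00")
    fs.mkdirs(out)
    writeManifest(fs, out, "clickstream-sessionize", "g1a2b3c", "2017-09-04T10-00")
    // Reverting a bad run is then just a matter of dropping (or ignoring) this one directory.
  }
}
```

This is exactly the kind of convention-based metadata layer you end up maintaining by hand, which is why it does not scale well as a versioning strategy.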

Lineage

There is no lineage in Hadoop by default. Even with Atlas, you need to integrate it with the other applications, like Spark, yourself. Right now, Atlas provides limited integration with Hadoop applications; even Spark, which is growing very fast, is not supported out of the box. Atlas could solve the problem, but it would need to provide integration for most Hadoop cluster components. For integration, Atlas exposes REST and Kafka APIs. Another post, [Navigation in the data lake using Atlas](link to Bartek Tomala blogpost) by @btomal, describes how to integrate Spark via the REST API.
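As a rough illustration of what that integration means in practice, here is a hedged sketch of registering a custom entity with Atlas over its REST API from a JVM application. The Atlas host, the default admin credentials, the `spark_job` type name and its attributes are assumptions; the type would have to exist in your Atlas type system before entities of it can be created.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64

object AtlasRegisterEntity {
  def main(args: Array[String]): Unit = {
    // Minimal entity payload for Atlas's v2 entity endpoint (type and attributes are hypothetical).
    val payload =
      """{
        |  "entity": {
        |    "typeName": "spark_job",
        |    "attributes": {
        |      "qualifiedName": "clickstream-sessionize@lake",
        |      "name": "clickstream-sessionize"
        |    }
        |  }
        |}""".stripMargin

    val conn = new URL("http://atlas.example.com:21000/api/atlas/v2/entity")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8))
    conn.setRequestProperty("Authorization", s"Basic $auth")

    val out = conn.getOutputStream
    try out.write(payload.getBytes(StandardCharsets.UTF_8))
    finally out.close()

    println(s"Atlas responded with HTTP ${conn.getResponseCode}")
  }
}
```

The point is that every application which Atlas does not cover has to push its own lineage events like this, which is a lot of glue code to write and maintain.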

Operations

Creating a Hadoop cluster is not easy, and maintaining one is not a simple task either. Implementing security is also a lot of work. Kerberos is not the latest technology and its concepts are certainly not easy to understand. Additionally, it must be well integrated with other technologies like LDAP and Active Directory. A Kerberos deployment can be highly available and durable, but you need specialists for that.
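For a sense of what Kerberizing client code involves, here is a minimal sketch of an application logging in from a keytab before touching HDFS, using Hadoop's UserGroupInformation API. The principal, realm and keytab path are assumptions, and the cluster itself must already be configured for Kerberos authentication.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object KerberizedClient {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // The cluster must be set up with Kerberos; this only tells the client to expect it.
    conf.set("hadoop.security.authentication", "kerberos")

    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab(
      "etl-service@EXAMPLE.COM",          // hypothetical principal
      "/etc/security/keytabs/etl.keytab"  // hypothetical keytab location
    )

    val fs = FileSystem.get(conf)
    println(fs.exists(new Path("/data/lake/raw")))
  }
}
```

And this is only the client side; generating principals and keytabs, wiring the KDC into LDAP or Active Directory, and keeping it all highly available is where the specialists come in.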

Ambari (or Cloudera Manager, depending on the distribution) comes to help here and manages the services in the cluster.

Management

If you can’t provide a huge amount of unstructured data, then a data lake will only be a buzzword for your company. Finding a use case is one of the hardest parts of the implementation. A data lake is a long-term integration project that requires a lot of resources, and in some cases it may be an unprofitable investment. You need to invest in specialists, as it is not always possible to retrain your existing staff. The learning curve for Hadoop clusters is steep, and the learning curve for security in Hadoop clusters is even steeper. If you create a data lake, you need to scale staff and hardware with your data very quickly, otherwise the data lake will prove to be a failure and you will lose money.

There is a huge difference between a DIY cluster and a production environment. You need to design an architecture that stays available under the influx of new data. The security around the cluster should be bulletproof. All roles and access controls should be defined, but be aware that this takes a lot of time and resources to build.

Make sure you understand the hardware and networking requirements, as these are the most important components of a data lake.

Amendments?

There are many organizations that are making money from, and improving, the Hadoop ecosystem. For example, Hadoop has a single NameNode where the HDFS metadata is stored. Unfortunately, as there is only one of them, the NameNode is a single point of failure for the entire environment. Right now Hadoop can run two separate NameNodes, where one is active and the other stays on standby. Unfortunately, such changes are fragmentary and there is no collaboration between the Hadoop stack providers. They focus on the products around HDFS rather than on the core.
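To show what the NameNode HA setup looks like from the client side, here is a hedged sketch: instead of pointing at a single NameNode host, the client addresses a logical nameservice backed by an active/standby pair. Normally these settings live in `hdfs-site.xml`; they are set programmatically here only to make the moving parts visible, and the host names and the "lake" nameservice are illustrative assumptions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsHaClient {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // The logical nameservice replaces a hard-coded NameNode address.
    conf.set("fs.defaultFS", "hdfs://lake")
    conf.set("dfs.nameservices", "lake")
    conf.set("dfs.ha.namenodes.lake", "nn1,nn2")
    conf.set("dfs.namenode.rpc-address.lake.nn1", "namenode-1.example.com:8020")
    conf.set("dfs.namenode.rpc-address.lake.nn2", "namenode-2.example.com:8020")
    conf.set(
      "dfs.client.failover.proxy.provider.lake",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    )

    // The client talks to "hdfs://lake" and fails over transparently if the active NameNode dies.
    val fs = FileSystem.get(conf)
    println(fs.exists(new Path("/")))
  }
}
```

Failover for the NameNode is solved this way, but it is an add-on around HDFS rather than a rethink of the core, which is the pattern this section complains about.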

Alternatives

There are probably lots of data lake implementations; rather than trying to cover them all, I will focus on the ones that I have tested. HDP is a Hadoop distribution with all the problems described in this blog post. I also have extensive experience with Kylo, which is described below.

Kylo

Kylo is data lake management software produced by Teradata. It was open-sourced under Think Big (a Teradata company) and is built on top of Apache Hadoop, Spark and NiFi. Its security module can be integrated with Ranger and Sentry, and Navigator or Ambari are available for cluster management. Kylo is responsible for:

  • Data ingestion - mix of streaming and batch processing via NiFi,
  • Data preparation - helps to cleanse data in order to improve its quality and to support data governance,
  • Discovery - analysts and data scientists can search and find what data is available to them.

Kylo provides a consistent data access layer for data infrastructure staff and data scientists. You can provide metadata for entities and define policies for components. It can automatically build lineage for data from NiFi metadata; additionally, there is monitoring for feeds and jobs running in the system. Kylo provides statistics for data transformations, which can help evaluate the quality of data. Data scientists can review data with the Visual Query tool; there is also a tool to monitor services in the cluster.

Problems

The Kylo UI, as a booster for NiFi templates, looks like a fine solution for data lakes, as it supplies additional information about the data in the pipelines. The downside is that you are limited to Kylo, and there is no easy way to integrate it with other tools. Such a lack of flexibility is never good, especially in systems like data lakes, because someday you might find that a component is missing. There are some other problems with Kylo:

  • Integration with NiFi is inconsistent: if someone with access to NiFi changes a feed, Kylo might not notice.
  • Defining templates in NiFi and re-importing them is a pain in the butt, as sometimes old feeds hang after an import.
  • There is little information about errors/warnings for flows in the Kylo UI. NiFi errors are sometimes not propagated to Kylo, so you need to switch to NiFi and check all the processors in the flow.
  • There is service monitoring, but it is not possible to manage services from the UI; you need to use Ambari.
  • It is not possible to define fine-grained access policies for data directly from Kylo. You need to use Ranger, and even that won’t allow you to set policy rules for entities from the Kylo UI.
  • A custom flow does not integrate with the default templates without additional work, because it can't provide metadata for Kylo (no data lineage); you need to integrate the custom NiFi flow with the existing core templates.
  • If you are using a Spark job in a NiFi processor, then you don't have lineage from inside the job.
  • There is no inheritance of identity. If you log in to the Kylo UI and run a feed, the feed will be run with NiFi’s identity, not yours.

Pachyderm

It's very hard to manage cooperation between the data infrastructure and data science teams in a Hadoop cluster. In the Hadoop ecosystem, the main players are trying to solve this problem by adding metadata everywhere. This looks reasonable, and it is not a problem when we use metadata to keep business information about unstructured data; however, when we use metadata for data governance and lineage, it starts to look suspicious. Pachyderm was created to give large organizations a better way to manage data infrastructure. Pachyderm uses Docker and Kubernetes, both of which are a good fit for solving big data problems. I do not want to say too much about Pachyderm here, because the last post of this series will be dedicated to it.

Conclusion

There are a few problems with current data lake implementations in the Hadoop ecosystem. Some are caused by the fragmentation of a large system, while others are due to the lack of a common vision among the main distributors. These problems, however, can be fixed in some way. Unfortunately, some of the problems are caused by the HDFS architecture, which cannot be changed easily.

In the next blog post, I will try to explain the core concepts of Pachyderm and we will test it on a Raspberry Pi cluster. There will be an Ansible playbook to spin up the cluster quickly, and we will see the main differences between Hadoop and Pachyderm in a real example.

[1] “Let’s build a modern Hadoop”, Feb 10, 2015, https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f
