haproxy frontend loadbalancer for hadoop httpfs


The httpfs gateway is a proxy that allows users to read and write data using webhdfs. The benefit is that httpfs acts as a choke point by preventing direct access to the namenode and datanode ports. While looking into load issues on a single httpfs gateway, I started wondering: why can't we bring up multiple httpfs gateways and put a layer 7 load balancer in front?


To experiment, I downloaded an Apache Hadoop 2.7.1 binary package and followed the single-node, pseudo-distributed setup instructions to get a cluster running on my Mac. My next step was to add hadoop proxy user settings for the httpfs gateway to core-site.xml. For example:
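A minimal sketch of such a proxy user entry, assuming the standard hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups properties with wildcard values (the exact values are an assumption, chosen to match the description that follows):

```xml
<!-- core-site.xml: proxy user settings for the account running httpfs -->
<!-- assumption: wildcard values, suitable only for a local test setup -->
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
```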


which states that the user 'root' is allowed to proxy as other users from any host. This is a horrible idea in practice, but I was okay running it locally on my laptop for the time I spent looking into this. If this were actually being deployed, I would create a dedicated account to run the httpfs service and swap 'root' for that account name. I ran the httpfs process as root instead of my own user account because hdfs is smart enough to prevent the proxy user account itself from running commands through the gateway. When httpfs was running as my local account, I would receive error messages stating "userx is not allowed to impersonate userx" (where userx is my local account).

Installing haproxy was very easy, as I was able to run 'brew install haproxy'. Writing a config for haproxy was a different matter and is where I spent most of my time. This is the haproxy config I used for testing. Once again, verify the haproxy settings before deploying anything like this to production.

localhost:haproxy_webhdfs $ cat haproxy.conf
global
   # account haproxy drops privileges to after startup
   user opsmekanix
   group omx

defaults
   mode    http
   option  forwardfor
   http-reuse always
   # timeouts are in milliseconds
   timeout connect 5000
   timeout client  50000
   timeout server  50000

frontend yourservername  
   bind *:1337
   default_backend httpfs

backend httpfs 
   server httpfs localhost:14000

listen statistics  
   bind *:8778
   mode http
   stats enable
   stats show-desc @opsmekanix HAProxy Status
   stats uri /

This allows the following logical connections to occur. Commands like hadoop distcp ./etc/hadoop webhdfs://localhost:1337/user/foo/distcp_test will be successful, and testing will show that the TCP traffic moves through the haproxy process. Notice that the localhost port number is not 14000 (httpfs) or 50070 (webhdfs), but 1337, the port haproxy is configured to listen on.

--------        ---------       --------     ---------
|client|  ->    |haproxy|   ->  |httpfs|  -> |namenode| 
--------  <-    ---------   <-  --------  <- ---------
                                   | ^
                                   v |


I'm actually a little excited that this works, as it means the layer 7 proxy could be scaled up to front multiple httpfs gateways if needed. This was tested with hdfs in "simple" authentication mode. I believe it should also work with a cluster secured by Kerberos, as the authentication bits are passed back and forth in the HTTP headers webhdfs uses during the request. More work would be needed to confirm this is the case.
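The scale-out idea could be sketched as a backend with more than one server line. This is an untested sketch, not something I ran: the host names gateway1.example.com and gateway2.example.com are hypothetical, and round-robin balancing with basic health checks is just one reasonable starting point.

```
backend httpfs
   # spread requests across the gateways in turn
   balance roundrobin
   # hypothetical hosts, each running its own httpfs gateway on port 14000
   # "check" enables a basic health check so a dead gateway is skipped
   server httpfs1 gateway1.example.com:14000 check
   server httpfs2 gateway2.example.com:14000 check
```

The frontend section would stay the same, since it only names the backend.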

Also, one might ask: why bother with the httpfs service if a layer 7 proxy can do something similar? Remember that during a webhdfs connection, the namenode returns a 307 response to the webhdfs client, redirecting it to a datanode. The Location header in that response would need to be dynamically rewritten so the redirect goes through the layer 7 proxy, and the proxy would need to be aware of all datanodes in the cluster. Essentially, one would need to configure a transparent forward proxy for the webhdfs clients. I think replacing httpfs with a layer 7 proxy is possible, but the implementation would be simpler using features provided by the Linux kernel, which are missing on OS X.
