The httpfs gateway is a proxy that allows users to read and write HDFS data using webhdfs. The benefit is that httpfs acts as a choke point, preventing direct access to the namenode and datanode ports. While looking into load issues on a single httpfs gateway, I started wondering: why can't we bring up multiple httpfs gateways and put a layer 7 load balancer in front of them?
To experiment, I downloaded an Apache Hadoop 2.7.1 binary package and followed the single-node pseudo-distributed instructions to get a cluster running on my Mac. My next step was to add Hadoop proxy settings for the httpfs gateway to core-site.xml. For example:
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
which states that the user 'root' is allowed to proxy as other users from any host. This is a horrible idea in practice, but I was okay running it locally on my laptop for the time I spent looking into this. If this were actually being deployed, create a dedicated account to run the httpfs service and swap 'root' for that account name. The reason I ran the httpfs process as root instead of my own user account is that hdfs is smart enough to prevent a user from impersonating itself. When httpfs was running as my local account, I would receive error messages stating "userx is not allowed to impersonate userx" (where userx is my local account).
Installing haproxy was very easy, as I was able to run 'brew install haproxy'. Writing a config for haproxy was a different matter and is where I spent most of my time. This is the haproxy config I used for testing. Once again, I'd verify the haproxy settings before deploying this to production.
localhost:haproxy_webhdfs $ cat haproxy.conf
global
    user opsmekanix
    group omx

defaults
    mode http
    option forwardfor
    http-reuse always
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend yourservername
    bind *:1337
    default_backend httpfs

backend httpfs
    server httpfs localhost:14000

listen statistics
    bind *:8778
    mode http
    stats enable
    stats show-desc @opsmekanix HAProxy Status
    stats uri /
This allows the following logical connections to occur. Commands like hadoop distcp ./etc/hadoop webhdfs://localhost:1337/user/foo/distcp_test will be successful, and testing shows that TCP traffic moves through the haproxy process. Notice the localhost port number is not 14000 (httpfs) or 50070 (webhdfs), but 1337, the port haproxy is configured to listen on.
--------    ---------    --------    ----------
|client| -> |haproxy| -> |httpfs| -> |namenode|
-------- <- --------- <- -------- <- ----------
                           |  ^
                           v  |
                        ----------
                        |datanode|
                        ----------
I'm actually a little excited that this works, as the layer 7 proxy could be scaled up to support multiple httpfs gateways if needed. This was tested with hdfs in "simple" authentication mode. I believe it should also work with a cluster secured by Kerberos, as the authentication bits are passed back and forth in the HTTP headers webhdfs uses during the request. More work would be needed to confirm this is the case.
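To sketch what scaling out might look like, the backend section of the haproxy config could list several httpfs gateways with health checks. This is just a sketch: the hostnames (httpfs1 through httpfs3) are hypothetical, and the health-check URI assumes the proxyuser setup described above.

```
backend httpfs
    balance roundrobin
    # mark a gateway down if webhdfs stops answering
    option httpchk GET /webhdfs/v1/?op=GETHOMEDIRECTORY&user.name=root
    server httpfs1 httpfs1.example.com:14000 check
    server httpfs2 httpfs2.example.com:14000 check
    server httpfs3 httpfs3.example.com:14000 check
```

With 'check' enabled, the statistics page on port 8778 would also show per-gateway health.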
Also, one might ask: why bother with the httpfs service if a layer 7 proxy can do something similar? Remember that during a webhdfs connection, the namenode returns a 307 response redirecting the client to a datanode. The Location header in that response would need to be dynamically rewritten so the redirect points at the layer 7 proxy, and the proxy would need to be aware of all datanodes in the cluster. Essentially one would need to configure a transparent forward proxy for the webhdfs clients. I think replacing httpfs with a layer 7 proxy is possible, but the implementation would be simpler using features provided by the Linux kernel, which are missing on OS X.
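To make the redirect problem concrete, here is a minimal Python sketch of the kind of Location-header rewrite such a proxy would have to perform. The hostnames, ports, and the redirect URL itself are made up for illustration; a real implementation would also have to remember which datanode the rewritten request should be forwarded to.

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_location(location, proxy_host, proxy_port):
    """Point a datanode redirect back at the layer 7 proxy.

    The namenode's 307 Location header names a specific datanode;
    a proxy replacing httpfs would have to swap that host:port for
    its own before passing the response back to the client.
    """
    scheme, _, path, query, fragment = urlsplit(location)
    return urlunsplit((scheme, "%s:%d" % (proxy_host, proxy_port),
                       path, query, fragment))

# A redirect the namenode might return for an OPEN request:
redirect = ("http://dn1.example.com:50075/webhdfs/v1/user/foo/part-0"
            "?op=OPEN&namenoderpcaddress=nn.example.com:8020&offset=0")
print(rewrite_location(redirect, "proxy.example.com", 1337))
```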