# Virtustream + RackHD use cases
Gabi, Rod, and I had a chance to sit down and go over a broad swath of feedback
related to their use of RackHD in deploying sites. To provide some background,
they've deployed 8 preliminary sites with RackHD, about 50 nodes each, with a
target of orchestrating a typical deployment of 200-250 physical compute servers.
Their configuration has a single RackHD instance per site, and her team (whose members
Jamie, John & David you'll see on the Slack channel) is focused entirely on bare-metal
verification and provisioning, and on standing up these sites as efficiently as possible.
Their configuration is more static, starting with a cutsheet from their networking
group that describes all the hardware, IP addresses, and network configuration values
the machines should have. Today, their version of this detail is a large JSON file.
They don't/can't leverage the ARP poller mechanisms, instead managing a lookup in code
driven from this static cutsheet.
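As a rough illustration only (the field names below are hypothetical, not their actual schema), a per-node entry in that cutsheet would carry the serial number, MAC address, and the static network values that their lookup code and RackHD need:

```python
import json

# Hypothetical cutsheet entry; field names are illustrative, not Virtustream's schema.
cutsheet_entry = {
    "serial": "ABC1234567",
    "mac": "00:11:22:33:44:55",
    "hostname": "compute-r01-u12",
    "bmc_ip": "10.0.10.12",
    "host_ip": "10.0.20.12",
    "netmask": "255.255.255.0",
    "gateway": "10.0.20.1",
}

# The full cutsheet is a large JSON file; the lookup they manage in code starts
# from something like this (file name assumed):
with open("cutsheet.json") as f:
    cutsheet = json.load(f)  # e.g. a list of entries shaped like the one above
```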
On the success side: with their prior (internal) tools, a site deployment would
take up to a week. The first round with RackHD cut this to 3 to 3 1/2 days, and
the most recent deployment they did (entirely remotely) started a little after
noon and completed roughly 4 hours later.
The code they're using is from the master branch around the end of April.
When they deploy today, they run a discovery that runs some tests and vets
various details of the (primarily, but not entirely, homogeneous) machinery, and
then leverage SKU definitions to trigger an OS install, which loads Ubuntu 14.04
onto the relevant nodes with static configurations applied. They leverage SaltStack
as the software configuration mechanism for all software installs and provisioning
after the bare-metal install.
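For reference, the usual RackHD pattern for this kind of install (hedged: the graph name, option keys, and API path below are assumptions and vary by version) is to post an OS-install workflow against a discovered node, either directly or via the SKU definition. A minimal sketch of the direct form:

```python
import requests

RACKHD = "http://rackhd.example.com:8080"    # hypothetical RackHD endpoint
node_id = "5770a35c2b0a7a6a2d8d62a1"         # hypothetical node identifier

# Graph name, option keys, and API path are assumptions; check your RackHD version.
payload = {
    "name": "Graph.InstallUbuntu",
    "options": {
        "defaults": {
            "hostname": "compute-r01-u12",
            # static network settings pulled from the cutsheet entry for this node
            "networkDevices": [{
                "device": "eth0",
                "ipv4": {"ipAddr": "10.0.20.12", "netmask": "255.255.255.0",
                         "gateway": "10.0.20.1"},
            }],
        }
    },
}
resp = requests.post("{}/api/1.1/nodes/{}/workflows".format(RACKHD, node_id),
                     json=payload)
resp.raise_for_status()
```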
# Upcoming Goals
Coming up, their goals include leveraging this same system to "pave and level,
and reprovision" to switch between OSes, going back and forth from Ubuntu to
ESXi. Additionally, they have a more heterogeneous vendor environment in some
sites, and want to pull those into this same setup: Cisco UCS servers, Quanta
systems with Voyager DAEs (EMC ECS), Supermicro, and some Dell FX2 gear are all on
that list. They're also likely to expand the OS installations they're doing:
they're actively working on SUSE-based installs, will be leveraging ESXi installs,
and will likely vary the Ubuntu install to cover not only 14.04 but also the newly
released LTS version, 16.04. RHEL is also likely a little further down the road.
# Challenges
The code to date has been very successful for them, but not without its issues,
which I'll outline:
## Static IP address environment & 'missing' hardware
The default setup with RackHD assumes that nodes all come up with DHCP and could
easily stay on DHCP continuously from there (the compute nodes all being
connected to RackHD via a "build" network). The Virtustream team doesn't have a
separate network; they control DHCP on their direct network and use RackHD there
directly, also leveraging DHCP relays in their configuration to extend this
control space to their nodes. The cutsheet they have lists serial numbers and
MAC addresses for all the nodes, but with the nodes all starting with DHCP from
the very beginning, it's quite a challenge to identify nodes that didn't discover
and provision.
The Virtustream team created their own front-end to the RackHD APIs that lets them
compare what has been discovered and come online against their cutsheet of
hardware details. Today, when they power on the entire environment to start this
process, they see roughly a 75% immediate success rate, but the remaining 25% can
hang or have problems (perhaps incorrectly configured, etc.), and just identifying
that those nodes are missing and not online is one of the more immediate issues.
With all of these site deployments, "remote hands" (i.e., datacenter contract help)
do the rack, stack, and cabling of the hardware, and Gabi's team is typically
entirely remote from this environment.
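A minimal sketch of the kind of comparison their front-end does, assuming the hypothetical cutsheet shape from earlier and the stock node-listing API (the path, and the idea that a discovered compute node's `identifiers` field carries its MAC addresses, are assumptions that vary by version):

```python
import json
import requests

RACKHD = "http://rackhd.example.com:8080"    # hypothetical endpoint

with open("cutsheet.json") as f:
    cutsheet = json.load(f)
expected = {entry["mac"].lower(): entry for entry in cutsheet}

# Collect MAC-style identifiers from discovered nodes; path/field are assumptions.
discovered = set()
for node in requests.get("{}/api/1.1/nodes".format(RACKHD)).json():
    for ident in node.get("identifiers", []):
        discovered.add(ident.lower())

# Anything in the cutsheet that never showed up in RackHD is "missing".
for mac, entry in expected.items():
    if mac not in discovered:
        print("not discovered: serial={} mac={}".format(entry["serial"], mac))
```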
## Getting errors to announce themselves
Related to this issue of some nodes just not showing up as expected, Gabi
outlined three problems they'd seen with some frequency that were hard to track
down, and in general highlighted that our logging output wasn't very amenable to
finding them (or perhaps that they didn't know a good way to use it to solve this
problem):
1. Some nodes would hang or time out when just doing the initial attempt to PXE
boot, the console showing "PXE......." without end, never triggering the boot
and installation process.
2. In their workflows, an API query would indicate that a workflow was indeed
assigned to a node, but the relevant images (profiles, etc.) wouldn't be offered up.
This was apparently intermittent, and we don't have a specific reproduction case
to show it.
3. Sometimes a task would just hang, including during the OS install, and there was
no obvious way to see that this had happened. I don't know if they're leveraging
the workflow timeout capabilities today, but Gabi explicitly asked if there was
a way to add a timeout to a specific task so that when it "took too long", they
could error out the task and/or workflow and see it as an immediate failure (see
the sketch after this list).
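I'm not sure what per-task timeout support exists in the build they're on, but a client-side watchdog along these lines would at least make the "took too long" case visible; the active-workflow path and the cancel-via-DELETE behavior are assumptions to verify against the API version in use:

```python
import time
import requests

RACKHD = "http://rackhd.example.com:8080"    # hypothetical endpoint
node_id = "5770a35c2b0a7a6a2d8d62a1"         # hypothetical node identifier
TIMEOUT = 60 * 60                            # flag anything active longer than an hour

active_url = "{}/api/1.1/nodes/{}/workflows/active".format(RACKHD, node_id)
started = None
while True:
    resp = requests.get(active_url)
    if resp.status_code != 200 or not resp.json():
        break                                # nothing active: finished or never started
    started = started or time.time()
    if time.time() - started > TIMEOUT:
        print("workflow on {} exceeded {}s; cancelling".format(node_id, TIMEOUT))
        requests.delete(active_url)          # cancel mechanism is an assumption
        break
    time.sleep(30)
```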
## Logging difficult to parse
Because logging just goes to standard out and upstart is handling the output,
Gabi's team has sometimes found the logs difficult to locate, and more
specifically difficult to tail and parse into something understandable on a
per-node basis, where they want to see what's happened and what's been done with
a node. They've developed some scripts to pull out some of this detail, but in
general they highlighted that taking the STDOUT logging and understanding the log
format well enough to get useful information about nodes and failures is something
they've been struggling with.
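Until that improves, a crude per-node filter over the captured output is roughly what their scripts amount to; a sketch, assuming (and it is only an assumption about the log format) that relevant lines mention the node id or MAC verbatim:

```python
import sys

# Usage: tail -f /var/log/upstart/on-http.log | python node_grep.py <node-id-or-mac>
# The log path and the assumption that a node's id/MAC appears verbatim in
# relevant lines are guesses, not a documented contract.
needle = sys.argv[1].lower()
for line in sys.stdin:
    if needle in line.lower():
        sys.stdout.write(line)
```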
## LDAP Authentication
The code that Gabi's team is using doesn't have authorization or authentication
enabled at all. Their need is to bind it to LDAP authentication. That's not
something we have in RackHD today, but we did build our authentication on a
plugin-based mechanism, so making it more explicitly pluggable to something
backed by LDAP (or Active Directory) should be reasonably achievable. I briefly
highlighted that
https://github.com/RackHD/RackHD/wiki/proposal-authentication-on-by-default is
on our roadmap and coming shortly, with some additional discussion already
happening on the Google Groups mailing list at
https://groups.google.com/forum/#!msg/rackhd/pW_CDrQlA0U/1Joen0OHBQAJ and
https://groups.google.com/forum/#!topic/rackhd/pbTgDlMEH1Q, and that more would
be coming.
One of the specific asks from Gabi's team was some means of leveraging LDAP
groups to provide permissions on groups of nodes, so that they could expose the
ability to power nodes off/on or install an OS to specific individuals they
enable, while maintaining a "can do anything" mode for their internal staff.
Gabi's team also investigated CLI usage: they briefly looked at the Ruby CLI that
the RackHD-BOSH-CPI team developed against RackHD, took a swipe at an internally
developed Python-based CLI, and wondered if RackHD would be providing a supported
CLI (in whatever language) for interacting via bash scripts and/or other CLI
efforts, leveraging authentication and authorization.
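As a strawman for what a thin, scriptable CLI could look like (the endpoints are assumptions, graph names are passed through verbatim, and there's no auth here because the build they're on has none enabled):

```python
#!/usr/bin/env python
"""Strawman RackHD CLI sketch: list nodes and post a named workflow."""
import argparse
import json
import requests

def main():
    parser = argparse.ArgumentParser(description="thin RackHD CLI sketch")
    parser.add_argument("--url", default="http://rackhd.example.com:8080")  # hypothetical
    sub = parser.add_subparsers(dest="cmd")
    sub.add_parser("nodes")
    wf = sub.add_parser("workflow")
    wf.add_argument("node_id")
    wf.add_argument("graph")   # e.g. an OS-install or power graph name (assumption)
    args = parser.parse_args()

    if args.cmd == "nodes":
        for node in requests.get("{}/api/1.1/nodes".format(args.url)).json():
            print(node.get("id"), node.get("name"))
    elif args.cmd == "workflow":
        resp = requests.post(
            "{}/api/1.1/nodes/{}/workflows".format(args.url, args.node_id),
            json={"name": args.graph})
        print(json.dumps(resp.json(), indent=2))
    else:
        parser.print_help()

if __name__ == "__main__":
    main()
```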
## Pollers/on-http hang issue at ~30 to 50 nodes of scale
Gabi's team has disabled the automatic creation of pollers, noting that when it
was enabled, they would get somewhere between 30 and 50 nodes into this
"discovery + provisioning" process they have set up, and somewhere in that
process the on-http process would simply stop responding. Disabling the pollers
makes this better, but they still intermittently see the issue, and have never
had any success diagnosing it or understanding what's causing the hang. They
haven't formally opened a bug for this, but it's been plaguing their scale
efforts; their current workaround has been to monitor the on-http process
and simply restart it when it ceases to respond.
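That workaround is essentially a liveness probe plus a restart; a sketch of it, where the health-check URL is a guess at a cheap endpoint and the upstart job name is assumed to be on-http:

```python
import subprocess
import time
import requests

RACKHD = "http://localhost:8080"     # on-http on the same host
CHECK = RACKHD + "/api/1.1/nodes"    # any cheap endpoint works; this one is a guess

while True:
    try:
        requests.get(CHECK, timeout=10).raise_for_status()
    except requests.RequestException:
        # on-http stopped responding: restart the upstart job (job name assumed)
        subprocess.call(["service", "on-http", "restart"])
        time.sleep(60)               # give it time to come back up
    time.sleep(30)
```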
## Build quality and regressions on master
Gabi's team has been using RackHD from master from around May 1st for a while,
and opened a couple of bugs about changes that break their usage of the system
with the transition from the DHCP lease poller to the ARP-based poller:
https://github.com/rackHD/rackhd/issues/225 and (just today)
https://github.com/rackHD/rackhd/issues/278 reflect these issues.
More generally, we had some issues where the Debian packages that we built
weren't getting updated on commits to master (they're using those packages
directly), and there's nothing vetting their specific configuration and needs
within the environment. Gabi said they were working towards setting up an
end-to-end integration test to help identify these issues, and I mentioned our
own recent efforts, making slow but steady progress, to drive better quality
gates for PRs prior to merge with continuous integration testing.
I also mentioned the efforts that Peter's team has been investigating (not yet
shared on the mailing list) related to improving this test effort, specifically
running longer-term regressions leveraging the InfraSIM project with an
ESXi/VMware-based virtual environment for our testing.
Related to this overall quality theme, Gabi knew that our testing didn't
use packages, and wanted to understand what a guaranteed path was to build and
install from source so they could leverage an "install and test" setup for
themselves. I mentioned the https://github.com/RackHD/on-build-config
repository, but we don't currently have any clear documentation on how to set up
and run these tests and this environment for external teams wanting to do the
same.