# Virtustream + RackHD use cases
Gabi, Rod, and I had a chance to sit down and go over a broad swath of feedback
related to their use of RackHD in deploying sites. To provide some background,
they've deployed 8 preliminary sites with RackHD, about 50 nodes each, with a
target of orchestrating a typical deployment of 200-250 physical compute servers.
Their configuration has a single RackHD instance per site, and her team (whose members
Jamie, John & David you'll see on the Slack channel) is focused entirely on bare-metal
verification and provisioning, and on standing up these sites as efficiently as possible.
Their configuration is more static, starting with a cutsheet from their networking
group that describes all the hardware, IP addresses, and network configuration values
the machines should have. Today, their version of this detail is a large JSON file.
They don't/can't leverage the ARP poller mechanisms, instead managing a lookup in code
driven from this static cutsheet.
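As a rough illustration only (the field names below are hypothetical, not their actual schema), a per-node entry in that cutsheet would carry the serial number, MAC address, and the static network values that their lookup code and RackHD need:

```python
import json

# Hypothetical cutsheet entry; field names are illustrative, not Virtustream's schema.
cutsheet_entry = {
    "serial": "ABC1234567",
    "mac": "00:11:22:33:44:55",
    "hostname": "compute-r01-u12",
    "bmc_ip": "10.0.10.12",
    "host_ip": "10.0.20.12",
    "netmask": "255.255.255.0",
    "gateway": "10.0.20.1",
}

# The full cutsheet is a large JSON file; the lookup they manage in code starts
# from something like this (file name assumed):
with open("cutsheet.json") as f:
    cutsheet = json.load(f)  # e.g. a list of entries shaped like the one above
```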
On the success side: with their prior (internal) tools, a site deployment would
take up to a week. The first round with RackHD cut this to 3 to 3 1/2 days, and
the most recent deployment they did (entirely remotely) started a little after
noon and completed roughly 4 hours later.
The code they're using is from the master branch around the end of April.
When they deploy today, they run a discovery that runs some tests and vets
various details of the (primarily, but not entirely, homogeneous) machinery, and
then leverage SKU definitions to trigger an OS install, which loads Ubuntu 14.04
onto the relevant nodes with static configurations applied. They leverage SaltStack
as the software configuration mechanism for all software installs and provisioning
after the bare-metal install.
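For reference, the usual RackHD pattern for this kind of install (hedged: the graph name, option keys, and API path below are assumptions and vary by version) is to post an OS-install workflow against a discovered node, either directly or via the SKU definition. A minimal sketch of the direct form:

```python
import requests

RACKHD = "http://rackhd.example.com:8080"    # hypothetical RackHD endpoint
node_id = "5770a35c2b0a7a6a2d8d62a1"         # hypothetical node identifier

# Graph name, option keys, and API path are assumptions; check your RackHD version.
payload = {
    "name": "Graph.InstallUbuntu",
    "options": {
        "defaults": {
            "hostname": "compute-r01-u12",
            # static network settings pulled from the cutsheet entry for this node
            "networkDevices": [{
                "device": "eth0",
                "ipv4": {"ipAddr": "10.0.20.12", "netmask": "255.255.255.0",
                         "gateway": "10.0.20.1"},
            }],
        }
    },
}
resp = requests.post("{}/api/1.1/nodes/{}/workflows".format(RACKHD, node_id),
                     json=payload)
resp.raise_for_status()
```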
# Upcoming Goals
Coming up, their goals include leveraging this same system to "pave and level,
and reprovision" to switch between OSes, going back and forth from Ubuntu to
ESXi. Additionally, they have a more heterogeneous vendor environment in some
sites, and want to pull those into this same setup: Cisco UCS servers, Quanta
systems with Voyager DAEs (EMC ECS), Supermicro, and some Dell FX2 gear are all on
that list. They're also likely to expand the OS installations they're doing:
they're actively working on SUSE-based installs, will be leveraging ESXi installs,
and will likely vary the Ubuntu install to cover not only 14.04 but also the newly
released LTS version, 16.04. RHEL is also likely a little further down the road.
# Challenges
The code to date has been very successful for them, but not without its issues,
which I'll outline:
## Static IP address environment & 'missing' hardware
The default setup with RackHD assumes that nodes all come up with DHCP and could
easily stay on DHCP continuously from there (the compute nodes all being
connected to RackHD via a "build" network). The Virtustream team doesn't have a
separate network; they control DHCP on their direct network and use RackHD there
directly, also leveraging DHCP relays in their configuration to extend this
control space to their nodes. The cutsheet they have lists serial numbers and
MAC addresses for all the nodes, but with the nodes all starting with DHCP from
the very beginning, it's quite a challenge to identify nodes that didn't discover
and provision.
The Virtustream team created their own front-end to the RackHD APIs that lets them
compare what has been discovered and come online against their cutsheet of
hardware details. Today, when they power on the entire environment to start this
process, they see roughly a 75% immediate success rate, but the remaining 25% can
hang or have problems (perhaps incorrectly configured, etc.), and just identifying
that those nodes are missing and not online is one of the more immediate issues.
With all of these site deployments, "remote hands" (i.e., datacenter contract help)
do the rack, stack, and cabling of the hardware, and Gabi's team is typically
entirely remote from this environment.
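A minimal sketch of the kind of comparison their front-end does, assuming the hypothetical cutsheet shape from earlier and the stock node-listing API (the path, and the idea that a discovered compute node's `identifiers` field carries its MAC addresses, are assumptions that vary by version):

```python
import json
import requests

RACKHD = "http://rackhd.example.com:8080"    # hypothetical endpoint

with open("cutsheet.json") as f:
    cutsheet = json.load(f)
expected = {entry["mac"].lower(): entry for entry in cutsheet}

# Collect MAC-style identifiers from discovered nodes; path/field are assumptions.
discovered = set()
for node in requests.get("{}/api/1.1/nodes".format(RACKHD)).json():
    for ident in node.get("identifiers", []):
        discovered.add(ident.lower())

# Anything in the cutsheet that never showed up in RackHD is "missing".
for mac, entry in expected.items():
    if mac not in discovered:
        print("not discovered: serial={} mac={}".format(entry["serial"], mac))
```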
## Getting errors to announce themselves
Related to this issue of some nodes just not showing up as expected, Gabi
outlined three problems they'd seen with some frequency that were hard to track
down, and in general highlighted that our logging output wasn't very amenable to
finding them (or perhaps that they didn't know a good way to use it to solve this
problem):
1. Some nodes would hang or time out when just doing the initial attempt to PXE
boot, the console showing "PXE......." without end, never triggering the boot
and installation process.
2. In their workflows, an API query would indicate that a workflow was indeed
assigned to a node, but the relevant images (profiles, etc.) wouldn't be offered up.
This was apparently intermittent, and we don't have a specific reproduction case
to show it.
3. Sometimes a task would just hang, including during the OS install, and there was
no obvious way to see that this had happened. I don't know if they're leveraging
the workflow timeout capabilities today, but Gabi explicitly asked if there was
a way to add a timeout to a specific task so that when it "took too long", they
could error out the task and/or workflow and see it as an immediate failure (see
the sketch after this list).
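I'm not sure what per-task timeout support exists in the build they're on, but a client-side watchdog along these lines would at least make the "took too long" case visible; the active-workflow path and the cancel-via-DELETE behavior are assumptions to verify against the API version in use:

```python
import time
import requests

RACKHD = "http://rackhd.example.com:8080"    # hypothetical endpoint
node_id = "5770a35c2b0a7a6a2d8d62a1"         # hypothetical node identifier
TIMEOUT = 60 * 60                            # flag anything active longer than an hour

active_url = "{}/api/1.1/nodes/{}/workflows/active".format(RACKHD, node_id)
started = None
while True:
    resp = requests.get(active_url)
    if resp.status_code != 200 or not resp.json():
        break                                # nothing active: finished or never started
    started = started or time.time()
    if time.time() - started > TIMEOUT:
        print("workflow on {} exceeded {}s; cancelling".format(node_id, TIMEOUT))
        requests.delete(active_url)          # cancel mechanism is an assumption
        break
    time.sleep(30)
```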
## Logging difficult to parse
Because logging just goes to standard out and upstart is handling the output,
Gabi's team has sometimes found the logs difficult to locate, and more
specifically difficult to tail and parse into something understandable on a
per-node basis, where they want to see what's happened and what's been done with
a node. They've developed some scripts to pull out some of this detail, but in
general they highlighted that taking the STDOUT logging and understanding the log
format well enough to get useful information about nodes and failures is something
they've been struggling with.
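Until that improves, a crude per-node filter over the captured output is roughly what their scripts amount to; a sketch, assuming (and it is only an assumption about the log format) that relevant lines mention the node id or MAC verbatim:

```python
import sys

# Usage: tail -f /var/log/upstart/on-http.log | python node_grep.py <node-id-or-mac>
# The log path and the assumption that a node's id/MAC appears verbatim in
# relevant lines are guesses, not a documented contract.
needle = sys.argv[1].lower()
for line in sys.stdin:
    if needle in line.lower():
        sys.stdout.write(line)
```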
## LDAP Authentication
The code that Gabi's team is using doesn't have authorization or authentication
enabled at all. Their need is to bind it to LDAP authentication. That's not
something we have in RackHD today, but we did build our authentication on a
plugin-based mechanism, so making it more explicitly pluggable to something
backed by LDAP (or Active Directory) should be reasonably achievable. I briefly
highlighted that
https://github.com/RackHD/RackHD/wiki/proposal-authentication-on-by-default is
on our roadmap and coming shortly, with some additional discussion already
happening on the Google Groups mailing list at
https://groups.google.com/forum/#!msg/rackhd/pW_CDrQlA0U/1Joen0OHBQAJ and
https://groups.google.com/forum/#!topic/rackhd/pbTgDlMEH1Q, and that more would
be coming.
One of the specific asks from Gabi's team was some means of leveraging LDAP
groups to provide permissions on groups of nodes, so that they could expose the
ability to power nodes off/on or install an OS to specific individuals they
enable, while maintaining a "can do anything" mode for their internal staff.
Gabi's team also investigated CLI usage: they briefly looked at the Ruby CLI that
the RackHD-BOSH-CPI team developed against RackHD, took a swipe at an internally
developed Python-based CLI, and wondered if RackHD would be providing a supported
CLI (in whatever language) for interacting via bash scripts and/or other CLI
efforts, leveraging authentication and authorization.
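As a strawman for what a thin, scriptable CLI could look like (the endpoints are assumptions, graph names are passed through verbatim, and there's no auth here because the build they're on has none enabled):

```python
#!/usr/bin/env python
"""Strawman RackHD CLI sketch: list nodes and post a named workflow."""
import argparse
import json
import requests

def main():
    parser = argparse.ArgumentParser(description="thin RackHD CLI sketch")
    parser.add_argument("--url", default="http://rackhd.example.com:8080")  # hypothetical
    sub = parser.add_subparsers(dest="cmd")
    sub.add_parser("nodes")
    wf = sub.add_parser("workflow")
    wf.add_argument("node_id")
    wf.add_argument("graph")   # e.g. an OS-install or power graph name (assumption)
    args = parser.parse_args()

    if args.cmd == "nodes":
        for node in requests.get("{}/api/1.1/nodes".format(args.url)).json():
            print(node.get("id"), node.get("name"))
    elif args.cmd == "workflow":
        resp = requests.post(
            "{}/api/1.1/nodes/{}/workflows".format(args.url, args.node_id),
            json={"name": args.graph})
        print(json.dumps(resp.json(), indent=2))
    else:
        parser.print_help()

if __name__ == "__main__":
    main()
```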
## Pollers/on-http hang issue at ~30 to 50 nodes of scale
Gabi's team has disabled the automatic creation of pollers, noting that when it
was enabled, they would get somewhere between 30 and 50 nodes into this
"discovery + provisioning" process they have set up, and somewhere in that
process the on-http process would simply stop responding. Disabling the pollers
makes this better, but they still intermittently see the issue, and have never
had any success diagnosing it or understanding what's causing the hang. They
haven't formally opened a bug for this, but it's been plaguing their scale
efforts; their current workaround has been to monitor the on-http process
and simply restart it when it ceases to respond.
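That workaround is essentially a liveness probe plus a restart; a sketch of it, where the health-check URL is a guess at a cheap endpoint and the upstart job name is assumed to be on-http:

```python
import subprocess
import time
import requests

RACKHD = "http://localhost:8080"     # on-http on the same host
CHECK = RACKHD + "/api/1.1/nodes"    # any cheap endpoint works; this one is a guess

while True:
    try:
        requests.get(CHECK, timeout=10).raise_for_status()
    except requests.RequestException:
        # on-http stopped responding: restart the upstart job (job name assumed)
        subprocess.call(["service", "on-http", "restart"])
        time.sleep(60)               # give it time to come back up
    time.sleep(30)
```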
## Build quality and regressions on master
Gabi's team has been using RackHD from master from around May 1st for a while,
and opened a couple of bugs about changes that break their usage of the system
with the transition from the DHCP lease poller to the ARP-based poller:
https://github.com/rackHD/rackhd/issues/225 and (just today)
https://github.com/rackHD/rackhd/issues/278 reflect these issues.
More generally, we had some issues where the Debian packages that we built
weren't getting updated on commits to master (they're using those packages
directly), and there's nothing vetting their specific configuration and needs
within the environment. Gabi said they were working towards setting up an
end-to-end integration test to help identify these issues, and I mentioned our
own recent efforts, making slow but steady progress, to drive better quality
gates for PRs prior to merge with continuous integration testing.
I also mentioned the efforts that Peter's team has been investigating (not yet
shared on the mailing list) related to improving this test effort, specifically
running longer-term regressions leveraging the InfraSIM project with an
ESXi/VMware-based virtual environment for our testing.
Related to this overall quality theme, Gabi knew that our testing didn't
use packages, and wanted to understand what a guaranteed path was to build and
install from source so they could leverage an "install and test" setup for
themselves. I mentioned the https://github.com/RackHD/on-build-config
repository, but we don't currently have any clear documentation on how to set up
and run these tests and this environment for external teams wanting to do the
same.