# Virtustream + RackHD use cases
Gabi, Rod, and I had a chance to sit down and go over a broad swath of feedback
related to their use of RackHD in deploying sites. For background, they've
deployed 8 preliminary sites with RackHD, about 50 nodes each, with a target of
orchestrating a typical deployment of 200-250 physical compute servers. Their
configuration has a single RackHD instance per site, and Gabi's team (of which
you'll see Jamie, John & David on the Slack channel) is focused entirely on
bare-metal verification and provisioning, and on standing up these sites as
efficiently as possible.

Their configuration is more static, starting with a cut sheet from their
networking group describing all the hardware, IP addresses, and network
configuration values that the machines should have. Today, their version of
this detail is a large JSON file. They don't/can't leverage the ARP poller
mechanisms, instead relying on a lookup that they manage with code from this
static cut sheet.
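To make that concrete, a cut-sheet entry presumably carries at least the
serial, MAC, and static network assignments for each box. Here is a minimal
sketch of loading and indexing such a file; the field names and schema are
assumptions for illustration, not their actual format:

```python
import json

# Hypothetical shape of one cut-sheet entry; the real schema of their
# JSON file wasn't shared, so these field names are placeholders.
EXAMPLE_ENTRY = {
    "serial": "ABC1234",
    "mac": "00:1e:67:aa:bb:cc",
    "ip": "10.2.3.17",
    "netmask": "255.255.255.0",
    "gateway": "10.2.3.1",
    "hostname": "compute-r01-u05",
}

def load_cutsheet(path):
    """Load a cut-sheet JSON file and index its entries by MAC address."""
    with open(path) as f:
        entries = json.load(f)
    return {entry["mac"].lower(): entry for entry in entries}
```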
For the success detail: with their prior (internal) tools, a site deployment
would take up to a week. The first round with RackHD cut this to 3 to 3.5 days,
and the most recent deployment they did, entirely remotely, started a little
after noon and completed roughly 4 hours later.
The code they're using is from the master branch around the end of April.
When they deploy today, they run a discovery that performs some tests and vets
various details of the machinery (primarily, but not entirely, homogeneous),
and then leverages SKU definitions to trigger an OS install, which loads
Ubuntu (14.04) onto the relevant nodes with static configurations applied.
They leverage SaltStack as the software configuration mechanism for all
software installs and provisioning after the bare-metal install.
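As an illustration of that flow, a SKU definition can be registered over the
REST API so that nodes matching its rules chain from discovery into an install
graph. In the sketch below, the endpoint, catalog path, match value, and graph
options are all assumptions to verify against the deployed version:

```python
import requests

RACKHD = "http://rackhd.example.com:8080"  # hypothetical endpoint

# A SKU definition: nodes whose catalog matches the rules get tagged
# with this SKU, and discovery then chains into the named graph.
# The catalog path, match value, and graph options are placeholders.
sku = {
    "name": "generic-compute",
    "rules": [
        {
            "path": "dmi.Base Board Information.Product Name",
            "contains": "PowerEdge",
        }
    ],
    "discoveryGraphName": "Graph.InstallUbuntu",
    "discoveryGraphOptions": {"defaults": {"version": "trusty"}},
}

resp = requests.post(RACKHD + "/api/1.1/skus", json=sku)
resp.raise_for_status()
print("created SKU:", resp.json().get("id"))
```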
# Upcoming Goals
Coming up, their goals include leveraging this same system to "pave and level,
and reprovision" to switch between OSes, going back and forth between Ubuntu
and ESXi. Additionally, they have a more heterogeneous vendor environment in
some sites, and want to pull those into this same setup: Cisco UCS servers,
Quanta servers with Voyager DAEs (EMC ECS), Supermicro, and some Dell FX2 gear
are all on that list. They're also likely to expand the OS installations
they're doing: they're actively working on SUSE-based installs, will be
leveraging ESXi installs, and will likely use a variation of their Ubuntu
install to cover not only 14.04 but also the newly released Ubuntu LTS version
(16.04). RHEL is also likely a little further down the road.
# Challenges
The code to date has been very successful for them, but not without its issues,
which I'll outline:
## Static IP address environment & 'missing' hardware
The default setup with RackHD assumes that nodes all come up with DHCP and
could easily remain on DHCP continuously from there (the compute nodes all
being connected to RackHD via a "build" network). The Virtustream team doesn't
have a separate network; they control DHCP on their direct network and use
RackHD directly. They also leverage DHCP relays in their configuration to
extend this control space to their nodes. Their cut sheet lists serial numbers
and MAC addresses for all the nodes, but with the nodes all starting with DHCP
from the very beginning, it is quite a challenge to identify nodes that didn't
discover and provision.
The Virtustream team created their own front end to the RackHD APIs that lets
them compare what has been discovered and come online against their cut sheet
with all the hardware details. Today, when they power on the entire environment
to start this process, they see a roughly 75% immediate success rate, but the
remaining 25% can hang or have problems (maybe incorrectly configured, etc.),
and just identifying that those nodes are missing and not online is one of the
more immediate issues. With all of these site deployments, "remote hands"
(i.e. datacenter contract help) do the rack, stack, and cabling of the
hardware, and Gabi's team is typically entirely remote from this environment.
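A minimal version of that comparison, reusing the `load_cutsheet` helper
sketched earlier, might query the node list and report cut-sheet MACs that
never appeared. The assumption that discovered nodes carry their MACs in an
`identifiers` list should be checked against the API version in use:

```python
import requests

RACKHD = "http://rackhd.example.com:8080"  # hypothetical endpoint

def missing_nodes(cutsheet):
    """Return cut-sheet entries whose MACs RackHD has never seen."""
    nodes = requests.get(RACKHD + "/api/1.1/nodes").json()
    seen = {
        mac.lower()
        for node in nodes
        for mac in node.get("identifiers", [])
    }
    return [entry for mac, entry in cutsheet.items() if mac not in seen]

for entry in missing_nodes(load_cutsheet("cutsheet.json")):
    print("not discovered:", entry["serial"], entry["mac"])
```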
## Getting errors to announce themselves
Related to this issue of some nodes just not showing up as expected, Gabi
outlined three issues they'd seen with some frequency that were hard to track
down, and in general highlighted that our logging output wasn't very amenable
to finding problems (or perhaps they didn't know a good way to approach this):
1. Some nodes would hang or time out on the initial attempt to PXE boot, the
console showing "PXE......." without end, never triggering the boot and
installation process.
2. In their workflows, an API query would indicate that a workflow was indeed
assigned to a node, but the relevant images (profiles, etc.) wouldn't be
offered up. This was apparently intermittent, and we don't have a specific
reproduction case to show it off.
3. Sometimes a task would just hang, including during an OS install, and there
was no obvious way to see that this had happened. I don't know if they're
leveraging the workflow timeout capabilities today, but Gabi explicitly asked
if there was a way to add a timeout to a specific task so that when it "took
too long", they could error out the task and/or workflow and see that as an
immediate failure (a client-side approximation is sketched after this list).
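Absent a per-task timeout, one hedged client-side approximation is a watchdog
that polls a node's active workflow and cancels it past a deadline. The
active-workflow GET/DELETE routes used here are assumptions based on the
1.1-era API and would need verification:

```python
import time
import requests

RACKHD = "http://rackhd.example.com:8080"  # hypothetical endpoint

def watch_active_workflow(node_id, timeout_s=3600, poll_s=30):
    """Cancel a node's active workflow if it outlives a deadline."""
    url = "{}/api/1.1/nodes/{}/workflows/active".format(RACKHD, node_id)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(url)
        # An empty response is taken to mean no workflow is active,
        # i.e. the workflow completed; this is an assumption to verify.
        if resp.status_code == 204 or not resp.text.strip():
            return True
        time.sleep(poll_s)
    requests.delete(url)  # cancel the presumed-hung workflow
    return False
```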
## Logging difficult to parse
Because logging goes to standard out, with upstart handling the output,
Gabi's team has sometimes found it difficult to find logs, and more
specifically difficult to tail and parse them into something understandable on
a per-node basis, where they want to see what's happened and what's been done
with a node. They've developed some scripts to extract some of this detail,
but in general they highlighted that taking the STDOUT logging and
understanding the log format well enough to get useful information about nodes
and failures is something they've been struggling with.
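As a rough illustration of the kind of script they've written, the stdout
stream can be filtered per node by matching on the node's ID (a 24-character
hex ObjectId). The upstart log path in the usage line is an assumption for a
stock package install:

```python
#!/usr/bin/env python
"""Filter RackHD stdout logs to lines mentioning a single node.

Usage (log path assumes a stock upstart package install):
    tail -f /var/log/upstart/on-http.log | python filter_node.py <node-id>
"""
import re
import sys

node_id = sys.argv[1]
pattern = re.compile(re.escape(node_id))

for line in sys.stdin:
    # The log line format varies by version; a substring match on the
    # node ID is crude but workable for a per-node view.
    if pattern.search(line):
        sys.stdout.write(line)
```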
## LDAP Authentication
The code that Gabi's team is using doesn't have authorization or
authentication enabled at all. Their need is to bind it to LDAP
authentication. That's not something we have in RackHD today, but we did build
our authentication on a plugin-based mechanism, so making it more explicitly
pluggable to something backed by LDAP (or Active Directory) should be
reasonably achievable. I briefly highlighted that
https://github.com/RackHD/RackHD/wiki/proposal-authentication-on-by-default is
on our roadmap and coming shortly, with some additional discussion already
happening on the Google Groups mailing list at
https://groups.google.com/forum/#!msg/rackhd/pW_CDrQlA0U/1Joen0OHBQAJ and
https://groups.google.com/forum/#!topic/rackhd/pbTgDlMEH1Q respectively, and
that more would be coming.
One of the specific asks that Gabi's team had was some means of leveraging
LDAP groups to provide permissions to groups of nodes, so that they could
expose the ability to power nodes off/on or install an OS to specific
individuals that they enable, while maintaining a "can do anything" mode for
their internal staff.
Gabi's team also investigated CLI usage: they briefly looked at the Ruby CLI
that the RackHD-BOSH-CPI team developed against RackHD, took a swipe at an
internally developed Python-based CLI, and wondered if RackHD would be
providing a supported CLI (in whatever language) for interacting via bash
scripts and/or CLI efforts, leveraging authentication and authorization.
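To make that ask concrete, here is a hedged sketch of the sort of minimal CLI
they were describing. The endpoint and the power graph names are placeholders,
and real usage would need the authentication discussed above once it lands:

```python
#!/usr/bin/env python
"""A minimal sketch of the kind of CLI they were asking for."""
import argparse
import requests

RACKHD = "http://rackhd.example.com:8080"  # hypothetical endpoint

def main():
    parser = argparse.ArgumentParser(prog="rackhd")
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("nodes", help="list discovered nodes")
    power = sub.add_parser("power", help="run a power workflow")
    power.add_argument("node_id")
    power.add_argument("state", choices=["on", "off"])
    args = parser.parse_args()

    if args.cmd == "nodes":
        for node in requests.get(RACKHD + "/api/1.1/nodes").json():
            print(node.get("id"), node.get("name", ""))
    else:
        # Hypothetical graph names; the real workflow names would need
        # to be confirmed against the deployed on-taskgraph version.
        graph = "Graph.PowerOn" if args.state == "on" else "Graph.PowerOff"
        requests.post(
            "{}/api/1.1/nodes/{}/workflows".format(RACKHD, args.node_id),
            json={"name": graph},
        ).raise_for_status()

if __name__ == "__main__":
    main()
```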
## Pollers/on-http hang issue at ~30 to 50 nodes of scale
Gabi's team has disabled the automatic creation of pollers, noting that when
it was enabled, they would get somewhere between 30 and 50 nodes into this
"discovery + provisioning" process they have set up, and at some point the
on-http process would simply stop responding. Disabling the pollers makes this
better, but they still see the issue intermittently, without any success at
diagnosing it or understanding what's causing the hang. They haven't formally
opened a bug for this, but it's been plaguing their scale support; their
current workaround has been to monitor the on-http process and simply restart
it when it ceases to respond (along the lines of the sketch below).
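Their workaround amounts to a watchdog along these lines; the probe route and
upstart job name here are assumptions for a stock package install:

```python
import subprocess
import time
import requests

ON_HTTP = "http://localhost:8080"  # on-http, probed from the RackHD host

# Probe a lightweight endpoint and bounce the upstart job when it stops
# answering; the probe route and job name are assumptions.
while True:
    try:
        requests.get(ON_HTTP + "/api/1.1/config", timeout=10)
    except requests.RequestException:
        subprocess.call(["service", "on-http", "restart"])
    time.sleep(60)
```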
## Build quality and regressions on master
Gabi's team has been using RackHD from master from around May 1st for a while,
and opened a couple of bugs about changes that broke their usage of the system
with the transition from the DHCP lease poller to the ARP-based poller;
https://github.com/rackHD/rackhd/issues/225 and (just today)
https://github.com/rackHD/rackhd/issues/278 reflect these issues.
More generally, we had some issues where the Debian packages that we build
weren't getting updated on commits to master (they're using those packages
directly), and there's nothing vetting their specific configuration and needs
within the environment. Gabi said they were working towards setting up an
end-to-end integration test to help identify these issues, and I mentioned our
own recent efforts, making slow but steady progress, to drive better quality
gates for PRs prior to merge with continuous integration testing.
I also mentioned the efforts that Peter's team has been investigating (not yet
shared on the mailing list) related to improving this test effort, and
specifically running longer-term regressions leveraging the InfraSIM project
with an ESXi/VMware-based virtual environment for our testing.

Related to this overall quality theme, Gabi knew that our testing didn't use
packages, and wanted to understand what a guaranteed path was to build and
install from source, so they could leverage an "install and test" setup for
themselves. I mentioned the https://github.com/RackHD/on-build-config
repository, but we don't currently have any clear documentation on how to set
up and run these tests and this environment for external teams wanting to do
the same.