@sam-github
Last active August 29, 2015 14:01
build/deploy tooling spike

Build/Deploy

TL;DR

  1. slc build: We need a tool that will build an app and its deps into a tarball or git branch. High priority.
  2. slc publish: We need a tool that will publish an app tarball or git branch to a deploy server (use git protocol first, other protocols later). Medium priority (you can do this with a few git commands if you use git).
  3. slc deploy, prepare, export: We need three small tools, one for each stage of what it takes to make an app run on a server, they can be run individually, or all together for a complete "heroku-like" deploy solution: (High priority)
    1. A network receiver of an app publish (using git protocol first, others later)
    2. A CLI to prep the received app for running (npm rebuild/install)
    3. A CLI to export the app's config to the system process manager (upstart, systemd, etc.)
  4. slc run: Needs signal handling and log aggregation, and perhaps a plugin system for features like a /health endpoint.
  5. node foreman: Contains useful code which should be refactored into modules.

References:

Terms:

  • URL: description of a node package's source or destination. Could be . or some other local fs path, a tarball, a checked out git repo, a git URL, an HTTP reference to a tarball, etc. Which specific URLs we support first for source and destination depend on use-case priorities.
  • node package: package.json and the direct source, deps present only as specs in the package.json
  • node archive: node package with dependencies built in preparation for deploy
  • preparer: a tool that takes a node archive, and prepares it to be runnable (mostly npm rebuild and post-install hooks)
  • exporter: a tool that generates configuration for a runnable application, and registers it with a process manager (such as upstart)
  • deployer: a tool that receives a node archive, and prepares and exports it

Building a node application archive

  • Input: node package URL
  • Output: node archive URL
  • Tool: slc build (and/or grunt rule and/or yeoman generator?)
  • REQ: 1-4,12
  • ESTIMATE: 5 days

Build steps are described in (a). Note the error in (a) claiming Heroku does this wrong; in fact, Heroku deals with archived applications intelligently and correctly, see (b).

(a) advocates committing archived deps to development branches. This is wrong: the archive may be committed to a "deployment only" git branch or packaged as a tarball, but it should not be committed to a development branch.

What this tool should do is mostly discussed in the second section of (c), but missing is a description of:

  • automation of npm interactions: npm install, strip of binaries, shrinkwrap, what npm scripts should be run (if any), etc. TBD
  • automation of custom build steps: front-end build output (minified src, compiled sass, etc.) needs to be archived along with the npm/package.json dependencies. This is often done with grunt and/or bower, or npm scripts. We should drive this in the build, or ensure that an 'slc build' run after the front-end build will archive its output, or perhaps implement the 'slc build' as a grunt plugin. TBD
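Assuming the archive form is a tarball, the heart of such a build is a short npm-and-tar sequence. The function name and arguments below are illustrative, not the real CLI:

```shell
# Hedged sketch of an "slc build" equivalent that produces a tarball archive.
build() {   # usage: build <package-dir> <archive.tgz>
  ( cd "$1" &&
    npm install --production &&   # vendor runtime deps into node_modules
    npm shrinkwrap                # pin the resolved dependency tree
  ) || return 1
  tar -C "$1" -czf "$2" .         # the resulting "node archive"
}
```

The custom front-end build steps discussed above would have to run before this, so their output lands in the directory being archived.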

Imaging a node application archive

  • Input: node archive URL
  • Output: virtual image

I'm not sure if this is a tool we should prioritize, but it's a tool I think we could write. We could take a node archive, and bind it into a docker image, or a vmware image, or ... Basically, generate output that can be fairly directly imported into a virtualizer.

Publishing a node application archive

  • Input: node archive or node package (+) URL
  • Output: deployer URL to publish to
  • Tool: slc publish, or ... Ops Dashboard (++)
  • REQ: 4,8
  • ESTIMATE: 2 days

PaaS deployers usually use git to accept applications; I'm not sure there are any other generic protocols for publishing to a remote deployer. Custom infrastructure in a company might use scp or sftp, wrapped in scripts or tools that trigger running the app after it's published. We could publish tarballs to an HTTP URL, assuming there are people out there who use HTTP to publish to a deployer; I'm not sure there are.

Common cases to prioritize are just these:

  • git (local/remote) to git remote: ex. github to openshift
  • git (local/remote) to tarball (local): ex. git remote to custom tooling
  • tarball (local/remote) to git remote: artifactory to git PaaS

This may seem trivial, but it's unnecessarily painful to take an artifactory HTTP URL to a tarball and publish it to an openshift git repo. Even pulling a remote git branch (archived) from github and pushing it to openshift would require a tedious series of git commands.

These processes are worth automating, IMO.
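For the tarball-to-git-remote case, the automation boils down to roughly this git sequence. All paths and the branch name below are stand-ins; a real slc publish would take source and destination URLs:

```shell
# Hedged sketch: publish a local tarball to a git "deploy" remote.
set -e
tmp=$(mktemp -d)
# stand-in for the deployer's receive repo (would be a remote URL)
git init --bare --quiet "$tmp/deploy.git"
# stand-in for a built archive
mkdir "$tmp/archive" && echo '{"name":"app"}' > "$tmp/archive/package.json"
tar -C "$tmp/archive" -czf "$tmp/app.tgz" .
# unpack into a throwaway work tree, commit, push
mkdir "$tmp/work" && tar -C "$tmp/work" -xzf "$tmp/app.tgz"
cd "$tmp/work"
git init -q && git add -A
git -c user.email=ci@example.com -c user.name=ci commit -qm "deploy"
git push -q "$tmp/deploy.git" HEAD:refs/heads/deploy
```

This is exactly the tedium worth hiding behind a single command.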

(+) Note that heroku cleverly supports both archived apps and unarchived app packages. Archives are superior for production and reproducibility, but raw packages are useful for PoC, dev, quick tests, etc. It's easy to support both in the deployer.

(++) I don't suggest we prioritize this, but it's worth noting that if we built this into a dashboard, you could theoretically provide the slc publish input and output URLs as part of an application's configuration, and run the slc publish in the Ops backend on user request. If we did lots more work, we could even provide Dashboard tools to provision destinations (spin up EC2 instances, etc.). I don't think we should do this... Ryan points out that there are lots of companies out there trying to provide high-level dashboards for PaaSes; we could partner with one or more of them, helping them make their dashboards work well with node application archives, and possibly helping them implement heroku-like deployers for node.

Deploying a node application archive

  • Input: (pushed) git, ?
  • Output: running application
  • Tool: slc deploy, slc prepare, slc export
  • REQ: 5-7,10
  • ESTIMATE:
    • receive: 2 days to repurpose substack/cicada
    • prepare: 1 day
    • export: 2-4 days, depending on how much configuration we go for, and system support (should be mostly pulling code out of node-foreman)

Involves 3 steps:

  1. receive: probably with git protocol
  2. prepare: see (b), basically checkout/untar, then npm rebuild/npm install
  3. export: configuring the process manager to run the app
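The prepare step (2) could be sketched like this, assuming a tarball archive (the function name is illustrative):

```shell
# Hedged sketch of the "prepare" stage for a tarball node archive.
prepare() {   # usage: prepare <archive.tgz> <dest-dir>
  mkdir -p "$2"
  tar -C "$2" -xzf "$1"      # unpack the node archive
  ( cd "$2" && npm rebuild ) # recompile native addons for this host
}
```

For a git-received archive, the untar line becomes a checkout; the npm rebuild stays the same.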

PaaS providers already run deployers, and it's unlikely we need to do anything there, certainly not for Heroku. If we find OpenShift does require overriding how it prepares apps for injection, we may want a stand-alone CLI implementing the prepare step, particularly the sequence and choices described in (b).

Existing corporate dev-ops infrastructure probably has receive and run tooling already set up, but as with PaaS providers, if they don't have node-specific experience, they may benefit from the same stand-alone "prepare" CLI.

For companies without existing infrastructure, they have a few options:

  • docker: use dokku: "docker-based heroku in 100 lines of bash", which implements a git receive, prepare, and run inside docker. Ryan speaks well of the code (though we haven't tried it), and says the reason it is so Heroku-like is that it reuses a lot of Heroku's open-source buildpack code.

  • vanilla linux: we should build a tool that combines the git receive implementation of substack/cicada, the prepare CLI (to be written), and the process manager configuration compiler from NodeFly/node-foreman to achieve a light-weight "heroku equivalent" for linux machines that we can recommend to customers who are looking for a deployment solution.

As a matter of good architecture, the preparation would be implemented as a node module, with an API and a CLI.

Running a node application in deployment

System process managers include sysV-init, upstart, systemd, launchd, and Windows services.

Process management starts with the system, and requires system configuration.

The system configuration can be hand-written, but is mostly tedious boiler-plate. Unfortunately, it is not entirely boiler-plate... there are choices to be made with regard to logging, daemonization, runners (node version, slc run, foreman?, etc.), pid-file handling/generation, ulimits (nfiles, memory, core, etc.), and so on.
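For example, the exporter might emit an upstart job along these lines; the paths, limits, and the use of slc run are illustrative choices, not a fixed format:

```
# /etc/init/my-app.conf -- illustrative upstart job an exporter might emit
description "my-app (generated)"
start on runlevel [2345]
stop on runlevel [016]
respawn
limit nofile 8192 8192
env NODE_ENV=production
setuid deploy
chdir /srv/my-app
exec /usr/local/bin/slc run .
```

Each stanza above corresponds to one of the choices listed: respawn (restart policy), limit (ulimits), env (configuration), exec (runner).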

There are various tools that will export a configuration. They generally lack equivalent support across the managers, and their configurability varies in usefulness.

Possibilities are:

  • ddollar/foreman: commonly used, partially abandoned, often cloned. The original exports to init.d or upstart.
  • NodeFly/node-foreman: a foreman clone in node (we own the copyright) that compiles a Procfile and some configuration into upstart or systemd configurations.
  • unitech/pm2: has a few varieties of startup scripts it can install.

I suggest we not leave this up to third-party utilities, and instead have a more definitive solution. In particular, we need to ensure our solution integrates well with a git-protocol deployer (see previous), and is sufficiently customizable that it allows the use of strong-supervisor, supports configuration of log aggregation, etc.

Since node-foreman currently contains some code to do this, we should enhance and refactor it to support our use-cases.

Ryan suggests node-foreman (npm package name is actually just foreman) should be re-written by extracting/re-writing the most useful features as modules and then re-composing them into a couple different tools:

  • strong-foreman (Procfile + .env + log tagger + nodemon)
  • slc export (Procfile + .env + service exporter)
  • ... and the log tagger should be added to strong-supervisor

** ESTIMATE: ... node-foreman refactor **

** TODO: agent reporting system limit "metrics" **

Note on logging

In-app logging has many options for a logger, we should recommend one of:

  • winston
  • bunyan
  • console
  • visionmedia/debug
  • ... any other candidates?

Logging (depending on logger) can be configured to multiple destinations:

  • stdout/err
  • syslog
  • file
  • fifo (allowing smart file redirection)

and some allow run-time reconfiguration of log levels.

Configuring the logger to write to stdout/err is the most flexible approach; it allows the logging destination to be configured at deploy time, but introduces a protocol question: who adds the timestamps, log levels, pids, and cluster worker IDs? And is the output allowed to be multi-line? Stack traces on error usually are.

We should extend supervisor to redirect the output of its workers (including error stack dumps) to a configurable destination, and make sure that it is flexible enough to accommodate reasonable variations in log output format.
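As an illustration of the protocol question, a supervisor-side line tagger might look like this. The function and tag format are hypothetical, not strong-supervisor's actual behavior:

```javascript
// Hedged sketch: tag each line of a worker's output with worker id and pid,
// so multi-line output (stack traces) stays attributable after aggregation.
function tagLines(chunk, workerId, pid) {
  return chunk
    .split('\n')
    .filter(function (line) { return line.length > 0; })
    .map(function (line) {
      return '[' + workerId + ':' + pid + '] ' + line;
    })
    .join('\n') + '\n';
}
```

A real implementation would be a stream transform on each worker's stdout/stderr, with buffering for chunks that split mid-line.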

** ESTIMATE: backlogged, supervisor logging ** ** ESTIMATE: 1 day, exporter implementing configurable logging strategy **

Note on supervisors and process managers

System process managers often contain features for daemonization, logging, restart on failure, start/stop, delivering signals or running scripts on request, process listing, log viewing, cpu/memory usage, etc.

A number of node supervisors have veered into system process manager territory in terms of features.

Why node process managers are not as useful as they appear

One thing I don't like about pm2 and similar tools is that they seem to be reimplementing system features. If they are doing it to build a dev-time tool, as node-foreman does, that makes some sense. But if they are targeting the tool for deployment time... it doesn't make so much sense.

If we did decide to implement process-management features, it would make sense only as a manager that attached to the Ops Dashboard, and that we used to get a toe-hold on machines, so that we could then control, deploy, and configure them, pushing apps dynamically, etc. A kind of super-clustering, super-supervising, super-deployer.

In the absence of those features, I don't see process managers as useful node deployment tools.

Why node supervisors are useful

Deployment-time decisions should be made at deploy time, not in development. This is best done with some kind of supervisor, that can optionally run an application with monitoring (like strong-agent), with clustering (like strong-cluster-control), with log aggregation (strong-supervisor implementation in progress), pidfile generation, etc.

Monitoring is not a system feature.

Clustering is a feature that could be done by the system, but node has the ability to use multiple cores with cluster; forcing that into the system is not very dynamic, and also forces the use of a reverse proxy/load balancer. Clustering and restart is a feature that belongs in a node supervisor.

Logging could be done by the system, but upstart doesn't do it, and log aggregation works better when it's cluster-worker/node aware.

FWIW, I regret adding daemonization and logging to strong-supervisor.

We need a blog post that more clearly articulates why strong-supervisor is a reasonable choice of tool, despite being apparently feature-poor compared to forever, pm2, etc.

We also need to add a few features that it really is missing:

  1. signal handling: TBD, requirements to be driven by systemd/upstart script
  2. health: TBD
  3. log aggregation: TBD (but in progress)

Exposing a /health HTTP route is necessary for integration with EC2, Heroku, and many in-house load-balancer setups. It can't be done well in a worker, because workers don't know the state of the cluster. It's a reasonable thing to add to strong-supervisor, though there are limits to how application-specific the health check can be. At least basic cluster status (number of workers, restarts, whether it's shutting down or starting up, etc.) can be published at a known route (optionally, of course), and possibly some kind of application health can be determined from whether the workers are listening, whether they respond to a message sent on the cluster bus, or whether they are sending a heartbeat status on the cluster bus.

I'd like to see some kind of "plugin" system for strong-supervisor... so each supervisor can call an arbitrary piece of code before forking the workers. Then we could provide plugins for health, strong-mq, strong-cluster-store, signal handling, etc. I don't want to build too many optional and opinionated features into it!

** ESTIMATE: signal handling ** ** ESTIMATE: health ** ** ESTIMATE: strong-supervisor plugin system **

Note on restart-in-place

strong-supervisor and other supervisors allow code to be updated underneath an app, and the workers to be gradually restarted. I have mixed feelings about this. For an app composed purely of resources loaded into memory, this could work; maybe it's even a good idea. But if your app has resources on disk that are loaded at run-time, such as a web app and/or its assets, it would be awful if new versions of pieces of the web app started to be served before the running node app had restarted!

What's our stance on this?

restart-in-place may mean that configuration of the process manager gets more complex, and might make zero-downtime harder. Perhaps it's a choice made by the app. Perhaps on-disk resources are loaded once at app startup when NODE_ENV is production, but not after?

My feeling on zero-downtime is that if you actually care, you should have multiple machines behind a load-balancer. Otherwise, any kind of operational activity (scheduled or unfortunate) other than a simple app update is still going to cause downtime.

Note on graceful close

REQ: 9

It's common for people to suggest that node apps should close gracefully by:

  1. Not accepting new connections
  2. Allowing current connections to "finish gracefully"

(1) is easy: see server.close().

(2) is not so easy, though the spanishdict blog post below seems to have the best suggestion: express middleware. It's hard to believe this isn't out there already. I've been asked several times (Dream11, etc.) how to do this, and have had to wave my hands. I'd like to be more of a subject-matter expert on this!

In particular, node http connections usually use keep-alive, so they will stay open as long as a client is using them. A client doing an API polling loop could keep the connection open until the app exits... I expect some of the approaches discussed below need wrapping up into a module, and we need to blog about them.

Random resources on this issue:

[1:07:59 PM] Sam Roberts: ben, how to gracefully close an http server? The best I can see is to keep an array of all open connections (manually, ugh), then when I want to close, call .setTimeout(0) on all the connections.
[1:08:25 PM] Sam Roberts: I take that back, probably need .setTimeout(1), because 0 disables a timeout
[1:09:20 PM] Ben Noordhuis: how graceful is graceful? you can call res.end() on all responses, then server.close()
[1:10:04 PM] Ben Noordhuis: but that will wait until all pending data has been flushed to the client. if there are stragglers (or malicious slowloris clients) that can take a while
[1:10:38 PM] Sam Roberts: graceful, as in, avoid client seeing the connection reset, if possible.
[1:11:41 PM] Sam Roberts: I don't see any way to set keep-alive to false, which would be the nicest way for clients that are actively using that http connection. and for clients that are not using them actively, they run the risk of a RST if we close our side. unavoidable.
[1:11:41 PM] Ben Noordhuis: right. res.end() or setTimeout(1) would do that
[1:12:20 PM] Ben Noordhuis: yes. just server.close() will make the server stop accepting new connections, then you can wait for the existing ones to die off naturally
[1:13:16 PM] Ben Noordhuis: you probably want to put a mechanism in place that responds with 503 to new requests on existing connections
[1:14:16 PM] Sam Roberts: this is a cluster scenario, so I want them to keep doing http with the server, but to make a new connection first.
[1:14:41 PM] Ben Noordhuis: is the scenario a rolling upgrade?
[1:16:03 PM] Sam Roberts: rolling restart, single restart, or a decrease in number of concurrent workers (but server hasn't errored, its still good, we just want it to go away, and not stay FOREVER because there is an active http client keeping a connection open)
[1:16:42 PM] Sam Roberts: would also happen with a loadbalancer in front of a number of node instances, where we want to take one instance down
[1:18:23 PM] Ben Noordhuis: right. then you want a mix of the above: server.close(), a reasonable timeout, etc.
[1:20:13 PM] Sam Roberts: http://blog.argteam.com/coding/hardening-node-js-for-production-part-3-zero-downtime-deployments-with-nginx/, these folks are using middleware to start sending 502. hm, not so different from your 503 suggestion.
[1:21:03 PM] Ben Noordhuis: 503 > 502 in more than one way
[1:23:25 PM] Ben Noordhuis: i think if you unset both Content-Length and Transfer-Encoding in the final response, node will auto-close the socket (because there is no way for the client to otherwise know when the end of the response is)
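The approach discussed above (server.close() plus persuading keep-alive clients off the connection with a 503) can be sketched as follows; the function names are illustrative, not an existing module's API:

```javascript
// Hedged sketch of graceful close: stop accepting new connections, answer
// in-flight requests on existing connections with 503 + "Connection: close",
// and put a deadline on stragglers.
var shuttingDown = false;

function handle(req, res, appHandler) {
  if (shuttingDown) {
    // force the client off this keep-alive connection; a client or
    // load balancer will reconnect elsewhere
    res.setHeader('Connection', 'close');
    res.statusCode = 503;
    return res.end('shutting down');
  }
  appHandler(req, res);
}

function shutdown(server, timeoutMs) {
  shuttingDown = true;
  server.close();                  // stop accepting new connections
  setTimeout(function () {
    process.exit(0);               // deadline for slow or idle clients
  }, timeoutMs).unref();
}
```

In express, handle() would be a piece of middleware installed first, per the blog post linked above.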

Note on when and how configuration is bound to an application

Configuration should not be present in a node application archive. The more configuration is bound in, the less choice there will be over deployment environment. Even deploying the same app to staging and then production could become impossible.

The tool flow is:

  1. slc build: creates an archive
  2. slc publish: publishes an archive
  3. slc deploy: prepares an archive, and exports to process manager

At step 2 or 3, configuration can be bound. Step 2 would involve re-writing the archive to contain configuration. Step 3 is the almost-last moment, and has the advantage that the runner could know what the configuration should be on the machine it is running on. But that may require the deployer to know things about the application, which introduces the problem of the deployer configuration needing to change as the application changes, which in turn raises the question of where the deployer gets its configuration... StrongOps has run into this; it's nasty.

We should support configuration being bound by the deployer (probably in the form of a .env file, possibly with a URL to a .env file), and while that can include application-specific configuration, we should recommend a better way.

Applications should use tools like nconf-redis to pull their configuration. In this model, the only application-specific configuration that the exporter needs to bind to the application is the URL of the configuration server and some identifying token. In particular, note how rolling restart from strongops (or ssh/CLI, through the supervisor or the process manager) allows natural restart and pickup of new configuration on change.
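A sketch of that split, with hypothetical names throughout (resolveConfig, CONFIG_URL, and CONFIG_TOKEN are illustrative, and the nconf-redis lines in the comment are indicative only, not verified API):

```javascript
// Hedged sketch: the deployer binds only a config-server URL and a token
// (e.g. via the .env file); the app pulls everything else at startup.
function resolveConfig(env) {
  if (!env.CONFIG_URL || !env.CONFIG_TOKEN) {
    throw new Error('deployer must bind CONFIG_URL and CONFIG_TOKEN');
  }
  return {
    configUrl: env.CONFIG_URL,  // e.g. redis://config.internal:6379
    token: env.CONFIG_TOKEN     // identifies this app to the config server
  };
}

// With nconf-redis the app would then do roughly (unverified sketch):
//   var nconf = require('nconf');
//   nconf.use('redis', { /* host/port from configUrl */ });
//   var dbHost = nconf.get('db:host');
```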

Other than that, all configuration should be application non-specific... where logs go, where pid file goes, which supervisor or node version to use, etc.

Use Cases

Jenkins, to Artifactory, to staging, then production

Build steps/jobs would be:

  • ci-build: checkout, slc build, upload to artifactory
  • ci-test: npm test, code metrics, etc.
  • ci-promote-to-staging: slc publish $ART_URL git@staging
  • ci-promote-to-production: slc publish ... git@production

This requires routine Jenkins job configuration to trigger build steps on appropriate criteria, such as git commits, passing of automated tests, manual promotion by QA or OPS, etc.

It also requires a staging and production environment to exist and be provisioned with a deployer ready to accept a publish (we will provide deployers for linux with redhat/systemd and ubuntu/upstart; git deployers for heroku, openshift, docker, etc. are already standard in the PaaS world).

StrongOps

Discussed at length by Ryan and Sam, we think we could deploy strongops if we moved it to nconf-redis for config, had output log aggregation, and something that we could git push the code to.

An interesting wrinkle is when you have multiple apps in a single repo that don't all run on the same machine, as we do. I can't recall what we decided about that... though I think it's a bit unusual.

Related projects (unreviewed)


bajtos commented May 20, 2014

Note on when and how configuration is bound to an application

Before we decide on which approach is best, can we clarify the requirements first? Here are a few items to start the discussion:

  1. Change tracking & version control. All configurations should be kept in an SCM solution. It should be possible to revert to an older config revision and review differences between configuration revisions.
  2. Whenever a deployment is performed, it must be clear what config version (staging/production) and revision (r20 or r30 in svn terms) was applied.
  3. The relationship between app code revisions and config revisions must be documented and tracked. When a new revision of the app introduces a new service as a dependency, all config versions (staging/production) must be updated. There should be a mechanism preventing deployment of new app revision with old config revision.
  4. Access control. Sensitive information like credentials must be kept private, accessible by a few designated people only. (E.g. credentials to production database servers should not live in the main dev repository.)
