@derpston
Last active August 10, 2018 17:14
Hosted Graphite are hiring an ops/automation engineer!
Us
--
Two co-founders, 11 engineers, and traffic to our service has more than doubled in the last year.
We run a hosted version of the popular Graphite open source metric and monitoring software, and we
have customers all over the world.
We need another backend engineer to help work on scaling, reliability and automation. We have over
500 systems to manage now, mostly physical hardware, and automation is more important than ever.
Instead of hiring a pure ops person, we want to hire someone capable of automating as much ops work
as possible. More automation = more sleep.
We're looking for a sysadmin/engineer who wants to be part of an early stage startup with all the
ups, downs, risks and benefits that go with it. This is not a comfortable corporate job, but then
there aren't any TPS reports or middle managers either...
You
--
Significant Linux system administration experience. You need to know how to use package managers
correctly and tools like tcpdump, lsof, mtr, rsync, iptables, ntp, strace, etc. when diagnosing
common application, system and network problems.
Some Puppet experience would be good - we've been using Puppet since server #1 and we're pretty
pleased with it so far.
An eye for performance is important - your contributions will be exercised by more than
fifty billion events per day. We always have to think about how something will scale and fail.
We don't really care about your level of formal education, mathematical skill and so on. We want
to see that you have relevant experience, that you like automating away repetitive work, that
you have good attention to detail, an aptitude for learning new skills and that you have empathy
for your team-mates and our customers.
The job and the challenges
--
While the frontend is three Django apps, we have more than ten different backend and internal
services, and many of them talk to each other. We'll need your help to scale them individually,
and to decide when to throw away and rebuild others. This is not your typical website and
database scaling problem, though we have those too!
While this role involves a lot of ops work, the biggest challenges come from how our traffic has
doubled every year for the last few years (and we expect this to continue) and how we need to
continue automating deployments of our services across our clusters of servers. We need someone
to help us identify weak points and to build auto-remediation tools for when things fail.
We have eight Riak clusters, which you'll need to learn to maintain. We use a lot of big Redis
instances. We're using Serf for distributed service discovery/cluster management and we're trying
to make our backend tolerate a failure without waking anybody up.
Being in the on-call rotation is part of the job. That usually means not being more than a few
minutes from an internet connection. Sometimes it means getting woken up by a phone at 4am. We
have weeks go by with zero incidents, and other weeks with several. On-call always sucks, so
we're interested in making it suck as little as possible.
Every on-call shift ends on a Friday morning with the rest of the day off, giving you a three-day
weekend that's not counted against your holiday allowance. We want relaxed, well-rested ops people.
Being on call does not mean watching graphs - nobody has time for that. We try to rely on our
alerting and we try to only alert for actionable things that are already broken, or will be
broken soon. As a monitoring company it's important that we constantly try to make sure our own
monitoring is up to scratch too.
Most of the team works out of the Dublin office, but we're flexible about working from home and one
of our co-founders is living in the US, so we're partially remote and we have to be good at
communicating. We use Slack, Google Docs, Trello, Workflowy and video chat tools like appear.in to
keep in touch.
Location and hours
--
While we're a partially remote team, our office is in Dublin, Ireland and we'd like you to be
there with most of the team. We have a bright, spacious office on Drury St in the city centre
with many good lunch and transport options nearby.
Our working hours are typically 1000-1800, but it varies by person.
Once you've settled in you'll have the opportunity to work from home regularly.
Compensation
--
A competitive salary. 25 days of paid holiday, one day off after every on-call shift, plus
the usual 9 public holidays.
Health insurance for you and your family.
Since you'll be on-call, we'll pay your phone bill. We also provide a company laptop,
typically a MacBook Air, but the brand/model is up for discussion.
How to apply
--
Tell jobs at hostedgraphite dot com why your skills, experience and personality make
you a good fit. If you want to submit a CV, make sure it's txt or pdf. We'd like to see
some of your code, but it's not essential.
No ninjas, rockstars or brogrammers, please; just nice, caring humans.
We don't work with recruitment agencies.
@mjhea0

mjhea0 commented Mar 31, 2016

Typo. This:

While we're a partially remote team, our office is in Dublin, Ireland and we'd like you to be
there with most of the team.

Should be:

While we're a partially remote team, our office is in Dublin, Ireland and we'd like you to be
there with most of the time.

@derpston
Author

derpston commented Apr 1, 2016

I've re-read it several times and I believe it is as intended, but thanks anyway. :)

@gjstoychev

still best job description in 2018 :)
