## ops_automation_engineer_job_spec.txt
Us
--
Two co-founders, 11 engineers, and traffic to our service has more than doubled in the last year.

We run a hosted version of the popular Graphite open source metric and monitoring software, and we
have customers all over the world.

We need another backend engineer to help work on scaling, reliability and automation. We have over
500 systems to manage now, mostly physical hardware, and automation is more important than ever.
Instead of hiring a pure ops person, we want to hire someone capable of automating as much ops work
as possible. More automation = more sleep.

We're looking for a sysadmin/engineer who wants to be part of an early-stage startup with all the
ups, downs, risks and benefits that go with it. This is not a comfortable corporate job, but then
there aren't any TPS reports or middle managers either...


You
--
Significant Linux system administration experience. You need to know how to use package managers
correctly and tools like tcpdump, lsof, mtr, rsync, iptables, ntp, strace, etc. when diagnosing
common application, system and network problems.

Some Puppet experience would be good - we've been using Puppet since server #1 and we're pretty
pleased with it so far.

An eye for performance is important - your contributions will be exercised by more than
fifty billion events per day. We always have to think about how something will scale and fail.

We don't really care about your level of formal education, mathematical skill and so on. We want
to see that you have relevant experience, that you like automating away repetitive work, that
you have good attention to detail, an aptitude for learning new skills and that you have empathy
for your team-mates and our customers.


The job and the challenges
--
While the frontend is three Django apps, we have more than ten different backend and internal
services, and many of them talk to each other. We'll need your help to scale them individually,
and to decide when to throw away and rebuild others. This is not your typical website and
database scaling problem, though we have those too!

While this role involves a lot of ops work, the biggest challenges come from how our traffic has
doubled every year for the last few years (and we expect this to continue), and how we need to
continue automating deployments of our services across our clusters of servers. We need someone
to help us identify weak points and to build auto-remediation tools for when things fail.

We have eight Riak clusters, which you'll need to learn to maintain. We use a lot of big Redis
instances. We're using Serf for distributed service discovery/cluster management and we're trying
to make our backend tolerate a failure without waking anybody up.

Being in the on-call rotation is part of the job. That usually means not being more than a few
minutes from an internet connection. Sometimes it means getting woken up by a phone at 4am. We
have weeks go by with zero incidents, and other weeks with several. On-call always sucks, so
we're interested in making it suck as little as possible.

Every on-call shift ends on a Friday morning with the rest of the day off, giving you a three-day
weekend that's not counted against your holiday allowance. We want relaxed, well-rested ops people.

Being on call does not mean watching graphs - nobody has time for that. We try to rely on our
alerting and we try to only alert for actionable things that are already broken, or will be
broken soon. As a monitoring company it's important that we constantly try to make sure our own
monitoring is up to scratch too.

Most of the team works out of the Dublin office, but we're flexible about working from home and one
of our co-founders is living in the US, so we're partially remote and we have to be good at
communicating. We use Slack, Google Docs, Trello, Workflowy and video chat tools like appear.in to
keep in touch.


Location and hours
--
While we're a partially remote team, our office is in Dublin, Ireland and we'd like you to be
there with most of the team. We have a bright, spacious office on Drury St in the city centre
with many good lunch and transport options nearby.

Our working hours are typically 1000-1800, but it varies by person.

Once you've settled in you'll have the opportunity to work from home regularly.


Compensation
--
A competitive salary. 25 days of paid holiday, one day off after every on-call shift plus
the usual 9 public holidays.

Health insurance for you and your family.

Since you'll be on-call, we'll pay your phone bill. We also provide a company laptop,
typically a MacBook Air, but the brand/model is up for discussion.


How to apply
--
Tell jobs at hostedgraphite dot com why your skills, experience and personality make
you a good fit. If you want to submit a CV, make sure it's txt or pdf. We'd like to see
some of your code, but it's not essential.

No ninjas, rockstars or brogrammers, please; just nice, caring humans.

We don't work with recruitment agencies.