@rampage644
Last active August 18, 2016 08:08
Shub talk

My Shub Talks

29/07 - Introduce workflow manager

Brief intro

First, I'd like to say hello to everyone and thank you for coming.

I joined SH not that long ago (~2 months) as a PS team member. I haven't had a chance to learn enough about how things work here yet, but let me begin.

While working on some PS projects I ran into many things that should be done only once but are not. I want to talk about one of them today.

Pain

There is a lot of duplicated code across different projects.

Even worse, there are duplicated ideas, methods, approaches and techniques.

Pain #2

In many projects there is custom logic for running spiders. Usually it's an entry script, scheduled with Dash, that takes some sort of input (config, args) and runs spiders.

This is possible thanks to the SC2.0 facility for executing arbitrary Python scripts and its simple cron-style scheduler. We can improve on this.

What we have

Some custom code that implements scheduling.

The DS team did a decent job with it. They provide snippets (some code + config: see BaseManager in ds-toolbox) to create a producer-consumer spider chain. Custom scripts (Managers) run within a single Dash project and control spider execution within another Dash project.

The question: why not "open source" it to the whole company, and even take it one step further?

My suggestion

WE NEED A STANDALONE, DEDICATED WORKFLOW MANAGER.

In my opinion, the whole company would greatly benefit from having one. Let's create a company-wide workflow manager installation so that anyone in any team can access and use it!

My suggestion #2

Do not write it from scratch (at least for now).

It doesn't really matter which one we install (google for Luigi, Oozie, Azkaban); there are many of them.

My suggestion is Airflow. I'm naming one to make the proposal concrete rather than just chit-chatting about intangible stuff.

Benefits

  • Code/infrastructure reusability -> don't reinvent the wheel
  • One place for all workflows -> central registry
  • Tons of nice features coming with it -> GUI, flexibility, etc. (in the case of Airflow)
  • Another nice feature we can sell to customers (a service for executing large, complex workflows that support the data retrieval process)

Use cases

  • Right now I see a couple of PS projects with really complex workflow logic: chaining/branching/dependencies. Such projects would benefit a lot.
  • More flexibility for monitoring: check not only the spider state but also look at the results.
  • The DS team could enhance their managers. Or simplify them. Or get rid of them. Given how much the team has done to unify its codebase, it would not be that difficult to switch to a new platform. That said, they usually don't need very complex logic, just the producer/consumer pattern (most of the time, and I could be mistaken here).
  • All ETLs within the company would be done right. A workflow manager is the heart of a company's ETL processes; it's designed for them.
  • Most simple spiders would also benefit: retry/monitor logic becomes easy to implement, as does any other extra functionality we get for free.
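
To make the retry point concrete, here is a minimal sketch of the kind of retry logic a workflow manager gives you for free. It is plain Python, not Airflow's API; `run_with_retries` and `flaky_spider_run` are hypothetical names invented for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], retries: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Run `task`; if it raises, try again up to `retries` more times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as error:  # a real manager would catch narrower errors
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"task failed after {retries + 1} attempts") from last_error


# Hypothetical spider run that fails twice, then succeeds.
attempts: List[int] = []


def flaky_spider_run() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "finished"


result = run_with_retries(flaky_spider_run, retries=3)
print(result, len(attempts))  # finished 3
```

Every project that writes its own entry script ends up with some variant of this loop, which is exactly the duplication a shared workflow manager would remove.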

More about airflow

An open-source, scalable workflow manager written in Python, now under the Apache Incubator. Originally developed by Airbnb. Gaining momentum pretty fast. It's pretty new, yet the early-adoption phase is already behind it. A workflow (== DAG, directed acyclic graph) is Python code: flexibility!

Consists of:

  1. Webserver (GUI + manual input)
  2. Scheduler (decides what to run when)
  3. Workers (in-process/local/Celery/Mesos executors)
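
The "workflow is a DAG, and the DAG is Python code" idea can be sketched without Airflow at all. The example below uses only the standard library (`graphlib`, Python 3.9+) to declare task dependencies and derive a valid execution order. It is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # stdlib, Python 3.9+
from typing import Dict, List, Set

# Each key is a task; the set lists the tasks it depends on.
# Task names are made up for illustration.
workflow: Dict[str, Set[str]] = {
    "crawl_listings": set(),
    "crawl_details": {"crawl_listings"},
    "crawl_reviews": {"crawl_listings"},
    "merge_results": {"crawl_details", "crawl_reviews"},
    "export_to_customer": {"merge_results"},
}


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag: Dict[str, Set[str]]) -> bool:
    """A workflow must be acyclic; the sorter raises CycleError otherwise."""
    try:
        TopologicalSorter(dag).prepare()
        return True
    except CycleError:
        return False


order = execution_order(workflow)
print(order)  # starts with crawl_listings, ends with export_to_customer
```

The scheduler's job ("decides what to run when") is essentially this ordering plus time triggers, while the workers execute the individual tasks.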

My own Proof-of-Concept

A local installation using docker-airflow with the DAGs folder mounted. Just open your browser and watch it run. Actually, my previous experience already counts as a PoC.

Samples

Let's get our hands a little dirty with some code.

  1. Producer-consumer chain
  2. LinkedIn Spark ETL
  3. HDFS tweets, from the Airflow example DAGs
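
The original sample code is not part of these notes, so as a stand-in here is a minimal sketch of what a producer-consumer chain means: one "spider" discovers URLs, and a manager hands them to consumer "spiders" in batches. Everything here is hypothetical (function names, the `example.com` URLs, the batch size); real spiders would be scheduled jobs, not in-process functions.

```python
from typing import Dict, Iterable, Iterator, List


def producer_spider(seed_pages: int) -> Iterator[str]:
    """Stand-in for a discovery spider: yields URLs to be crawled later."""
    for page in range(1, seed_pages + 1):
        yield f"https://example.com/item/{page}"


def batches(urls: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group the producer's output into fixed-size batches."""
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # do not lose the last, partially filled batch
        yield batch


def consumer_spider(urls: List[str]) -> List[Dict[str, str]]:
    """Stand-in for a scraping spider: turns each URL into an item."""
    return [{"url": url, "status": "scraped"} for url in urls]


def run_chain(seed_pages: int, batch_size: int) -> List[Dict[str, str]]:
    """The manager: one consumer run per batch of producer output."""
    items: List[Dict[str, str]] = []
    for batch in batches(producer_spider(seed_pages), batch_size):
        items.extend(consumer_spider(batch))
    return items


items = run_chain(seed_pages=5, batch_size=2)
print(len(items))  # 5
```

In a workflow manager the `run_chain` loop disappears: the producer and consumers become tasks, and the dependency between them is declared instead of hand-coded.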

Wrap up

The main point of my talk is not Airflow but a common workflow process. I used Airflow to be as concrete as possible (not just an idea and/or talk, but a proposal, a demonstration, a PoC).

Finally, I want to ask for feedback. Do we really need this, or am I mistaken? Maybe some work has already been done on it within some team or project.

Another thing I want to ask is that you spread the word. Shubtalks is a nice place to start, but we have more than 35-40 people. It would be nice to keep them in the loop :)

19/09 - My long-term productivity tips

Foundations

I won't go into the very fundamental topics that are worth researching on their own, such as:

  • Sleep
  • Physical activity/wellness
  • Personal burnout

I just want to share my regular routine that helps me stay on track and get the job done.

Pomodoro

The Pomodoro technique is the core of my work process. For those who are not familiar with it, it's just a smart way to split your time into intervals of work and rest.

Rest is a first-class citizen of productive work. It's one of those things we all know about but rarely pay due attention to.

How I use it: a 25-minute work interval followed by a 5-minute rest, so my pomodoro is 30 minutes long. After every four pomodoros you take a long break (I use 15 minutes).
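
The arithmetic of that split, as a small sketch. One assumption that the notes do not spell out: the long break replaces the short break after the fourth pomodoro rather than being added on top of it. The constant and function names are made up.

```python
from typing import List, Tuple

WORK_MINUTES = 25
SHORT_BREAK_MINUTES = 5
LONG_BREAK_MINUTES = 15
POMODOROS_PER_CYCLE = 4


def build_schedule(pomodoros: int) -> List[Tuple[str, int]]:
    """List (activity, minutes) pairs for the given number of pomodoros."""
    schedule: List[Tuple[str, int]] = []
    for number in range(1, pomodoros + 1):
        schedule.append(("work", WORK_MINUTES))
        # Assumption: every fourth break is the long one, replacing the short one.
        if number % POMODOROS_PER_CYCLE == 0:
            schedule.append(("long break", LONG_BREAK_MINUTES))
        else:
            schedule.append(("short break", SHORT_BREAK_MINUTES))
    return schedule


schedule = build_schedule(4)
total_minutes = sum(minutes for _, minutes in schedule)
print(total_minutes)  # 130: 4 * 25 work + 3 * 5 short breaks + 15 long break
```

So one full cycle takes a bit over two hours, of which 100 minutes are focused work.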

Key points here are:

  • Really take breaks! I use short breaks for stretching, having a glass of water, or moving around a bit. These small things contribute a lot to my physical condition. The idea is to get some activity that speeds up blood flow, which helps brain cells get extra fuel: nutrients, oxygen, whatever. That boosts your mental performance.
  • Ultra focus on one task per interval. No distractions (phone, social media, chats, nothing; use them either during breaks or in a dedicated pomodoro). For better focus I play ambient sounds in my headphones (more on this later).
  • A small trick is to leave for a break while you still don't want to. Your brain will keep working on the task in the background, generating insights while you consciously do other things (test that! this is really cool!)

Find out more in this awesome Lifehacker article. It's great as an entry point.

Standing desk

This is a very controversial point. With the Pomodoro technique you can vary the work/rest times to fit you personally. I used to use a 45-15 split because I felt not so productive working for only 25 minutes. But (!) everything changed once I switched to a height-adjustable desk.

Right now a 25-5 split fits much more naturally because, well, it's harder to stand than to sit. You tend to focus much harder for shorter periods of time (or maybe it's only me).

I'm not aware of research that conclusively shows standing is more or less beneficial than sitting, so I won't advocate for it. I switched because it felt like a cool experiment, and it stuck with me.

My main reason for using a standing desk is to force myself to be more active throughout the day. Again, it's harder to stand, so you take up small movements more readily. You're already standing: just go grab some water. Compare that to sitting: man, I'm too lazy to stand up and go do something, let's do it in an hour.


Music/sound/white-noise

This is great for spinning up your brain. There actually ARE studies that show how the right amount of background noise affects your ability to concentrate.

So I use it. There are services that generate such noises (my choice is nosili.com). There is also an intriguing service called brain.fm (AI-generated sounds/music for relaxing, productivity, etc.; let me know if we can get a paid account there). Music actually distracts me and is good only for routine tasks.

Parkinson's law

"Work expands so as to fill the time available for its completion."

Limit yourself. Life is a marathon, not a sprint. Your goal is to perform steadily for a long time. Unless you know what you're doing.
