Interview Engineers

Senior Data Engineer

Job Description

REQUIREMENTS FOR THE GIG:

  • Strong development experience using Python
  • Experience developing machine learning and natural language processing models and related applications
  • Experience with MPP databases such as Greenplum
  • Experience with AWS ecosystem of big data and analytics services – Redshift, Kinesis, AWS Lambda, Redshift Spectrum, Quicksight, Amazon EMR
  • Experience with architecting and designing solutions to replace on-premise solutions with cloud native technologies
  • Experience with Business Intelligence tools and dashboards using Tableau, QuickSight, Jaspersoft or similar
  • Managing AWS resources including EC2, RDS, Redshift, et cetera
  • Explore and learn the latest AWS technologies to provide new capabilities and increase efficiency
  • Help continually improve ongoing reporting and analysis processes, automating or simplifying self-service support for customers
  • Ability to mentor and work alongside junior team members and grow them into leaders
  • Familiarity with Linux
  • Experience with Hadoop or other map/reduce "big data" systems and services
  • Knowledge of Advanced SQL and scripting for automation

BASIC QUALIFICATIONS:

  • Bachelor’s degree in Computer Science, MIS, related technical field, or equivalent work experience.
  • 5 or more years of overall work experience in a related field, including 3 or more years in analytics, data engineering, or a related field
  • Proven experience in data modeling, ETL development, and data warehousing, or similar skills
  • Demonstrable skills and experience using SQL (e.g. Postgres, Oracle, SQL Server, Redshift)
  • Proven track record of successful communication of data infrastructure, data models, and data engineering solutions through written communication, including an ability to effectively communicate with both business and technical teams

Skill 1: Python Skill

Ask Q1 and one of Q2/Q3

Q1. Width of experience in Python

  • What do you usually do with Python in your work? Data engineering, DevOps, ML + 1 ~ 3
  • Do you have experience writing your own frameworks or libraries? + 1

Q2. Depth of experience in Python: Cache

Suppose you have a class called Order (an e-commerce order); it naturally has an attribute called order_id. I want you to implement a property method called create_at, the timestamp when the order was created. You have to query the database to get this information. How do you implement that?

  • mentioned that create_at won't change, so it can be cached + 1
  • mentioned caching the property with an initial value of None + 1 (or a similar solution)
  • mentioned getter and setter + 1
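
A minimal sketch of an answer that covers the first two points. The fetch_create_at_from_db helper is hypothetical, standing in for the real database query; the call counter just demonstrates that the query runs only once:

```python
calls = {"db": 0}

def fetch_create_at_from_db(order_id):
    # Hypothetical stand-in for the real database query.
    calls["db"] += 1
    return 1700000000.0

class Order:
    def __init__(self, order_id):
        self.order_id = order_id
        self._create_at = None  # cache slot; None means "not fetched yet"

    @property
    def create_at(self):
        # The creation timestamp never changes, so fetch it once and cache it.
        if self._create_at is None:
            self._create_at = fetch_create_at_from_db(self.order_id)
        return self._create_at
```

A strong candidate may also mention functools.cached_property (Python 3.8+), which implements the same fetch-once pattern as a decorator.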

Q3. Depth of experience in Python: Singleton Pattern

How do you implement a singleton class in Python?

  • explained what a singleton is + 1
  • mentioned a class attribute as the cache + 1
  • mentioned customizing the __new__ method or using a factory method + 1
  • mentioned raising an Exception in __init__ + 1
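
One common answer combining the class-attribute cache with a customized __new__, sketched here as a reference for the interviewer (other approaches, such as a module-level instance or a metaclass, also earn credit):

```python
class Singleton:
    _instance = None  # class attribute used as the instance cache

    def __new__(cls, *args, **kwargs):
        # Create the instance only on the first call; every later call
        # returns the same cached object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```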

Skill 2: General Machine Learning, Natural Language Processing

Ask Q1, and one of Q2/Q3

Q1. Data Preprocessing for Text Dataset

Given a dataset such as StackOverflow or Quora posts, where each document is a large paragraph of text (user posts and comments). We will use this dataset for analytics or machine learning, but we don't yet know exactly how. How do you process and store the data?

  • Mentioned one or more of the following use cases + 1
  1. extract important information like numbers and people's names
  2. extract general NLP features ready for use
  3. build an inverted index for searching
  • Understands the data engineering process for any of the above use cases + 1
  • Good understanding of text databases and query optimization, e.g. Elasticsearch + 1
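
For use case 3, a minimal sketch of what "build an inverted index" means, assuming naive whitespace-style tokenization (a real pipeline would use a proper NLP tokenizer and a search engine such as Elasticsearch):

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-word characters; deliberately simplistic.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_inverted_index(docs):
    # docs: {doc_id: text}. Returns {term: set of doc_ids containing it},
    # which makes "find all posts mentioning X" a single dictionary lookup.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index
```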

Q2. TF-IDF

Explain what TF-IDF is in NLP and why it is a good measure of term importance in a document.

If the candidate doesn't know TF-IDF, ask Q3.

  • explained TF + 1
  • explained IDF + 1
  • understands why IDF can reflect importance + 1
  • understands the disadvantage of TF-IDF (no positional information) + 1
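
For reference, one common formulation of the two factors (smoothing conventions vary between libraries, so treat the exact formula as an assumption, not the only correct answer):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    # TF: how often the term appears in this document, normalized by length.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # IDF: down-weight terms that appear in many documents of the corpus;
    # the +1 in the denominator is a common smoothing choice.
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf
```

A rare term in a document scores high; a term appearing in every document scores near zero.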

Q3. Document Similarity

How do you calculate the similarity between two documents?

  • mentioned that similarity has different use cases: a) legal matters, b) search engines, c) customer service, etc. + 1
  • mentioned text document feature engineering, term frequency, etc. + 1
  • mentioned K-means, cosine distance, LDA, or other document distance algorithms + 1
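
A minimal sketch of the cosine-distance answer, assuming raw term-frequency vectors (candidates might equally well use TF-IDF weights or embeddings):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    # Represent each document as a bag-of-words term-frequency vector,
    # then compute the cosine of the angle between the two vectors.
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```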

Skill 3: Databases and Data Warehouses

Q1. Data Management Essential

How do you track changes to a data record?

  • use a hash for the UUID, choose the partition key properly + 1
  • use a hash of the content body or a version_id to track changes + 1
  • use a last-update timestamp + 1
  • store the data history with the data, use the last_update timestamp as the version id, and display the latest copy at the application level + 1
  • other ideas + 1 ~ 2
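
A minimal sketch combining the content-hash and timestamp ideas above; the make_version and latest helpers are hypothetical names chosen for illustration:

```python
import hashlib

def make_version(record_body, last_update):
    # The version id is a hash of the record body: any change to the body
    # yields a new version id, while an unchanged body maps to the same one.
    body_hash = hashlib.sha256(record_body.encode("utf-8")).hexdigest()
    return {"version_id": body_hash, "last_update": last_update, "body": record_body}

def latest(history):
    # The application layer keeps the full history but displays only the
    # copy with the newest last_update timestamp.
    return max(history, key=lambda v: v["last_update"])
```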

Q2. OLAP

Tell me about an experience using an OLAP database, and your achievements in the project.

  • understands OLAP and column-oriented storage + 1
  • experience using OLAP as a developer + 1
  • experience integrating with other applications or external data processing systems + 1
  • understands the underlying implementation + 1
  • other ideas + 1 ~ 2

Skill 4: AWS Skill

Q1. General Level of AWS Experience

Tell me about your experience using AWS for Data Engineering

  • AWS S3 + 1
  • AWS Lambda + 2
  • AWS Kinesis + 2
  • AWS SQS + 1
  • AWS EMR + 2
  • AWS Redshift + 2

Q2. Validate problem-solving skills with AWS

How do you manage who can access what data? You can use any data storage service such as S3, RDS, or Redshift. + 1 ~ 3

  • IAM Policy
  • S3 Policy
  • Redshift custom view

How do you track who accessed which data, from where, and when? + 1 ~ 3

  • CloudTrail logs
  • database logs
  • S3 object-level logging

Communication, Presentation, and Team Player Skills

No questions; this skill is evaluated based on how the candidate answers the other questions.

  • Communication + 1 ~ 3
  • Presentation + 1 ~ 3