
@MacHu-GWU
Created December 6, 2019 16:36
Interview Engineers

Senior Data Engineer

REQUIREMENTS FOR THE GIG:

  • Strong development experience using Python
  • Experience developing machine learning and natural language processing models and related applications
  • Experience with MPP databases such as Greenplum
  • Experience with AWS ecosystem of big data and analytics services – Redshift, Kinesis, AWS Lambda, Redshift Spectrum, Quicksight, Amazon EMR
  • Experience with architecting and designing solutions to replace on-premise solutions with cloud native technologies
  • Experience with Business Intelligence tools and dashboards using Tableau, QuickSight, Jaspersoft or similar
  • Managing AWS resources including EC2, RDS, Redshift, et cetera
  • Explore and learn the latest AWS technologies to provide new capabilities and increase efficiency
  • Help continually improve ongoing reporting and analysis processes, automating or simplifying self-service support for customers
  • Ability to mentor and work alongside junior team members and grow them into leaders
  • Familiarity with Linux
  • Experience with Hadoop or other map/reduce "big data" systems and services
  • Knowledge of Advanced SQL and scripting for automation

BASIC QUALIFICATIONS:

  • Bachelor’s degree in Computer Science, MIS, related technical field, or equivalent work experience.
  • 5 or more years of overall work experience, including 3 or more years in analytics, data engineering, or a related field
  • Proven experience in data modeling, ETL development, and data warehousing, or similar skills
  • Demonstrable skills and experience using SQL (e.g. Postgres, Oracle, SQL Server, Redshift)
  • Proven track record of successful communication of data infrastructure, data models, and data engineering solutions through written communication, including an ability to effectively communicate with both business and technical teams

Ask Q1 and one of Q2/Q3

  • What do you usually do with Python in your work? Data engineering, DevOps, ML + 1 ~ 3
  • Do you have experience writing your own frameworks or libraries? + 1

You have a class called Order (an e-commerce order), which naturally has an attribute called order_id. Implement a property method called create_at that returns the timestamp when the order was created. You have to query the database to get this information; how do you implement that?

  • mentioned that create_at never changes, so it can be cached + 1
  • mentioned caching the property with an initial value of None + 1 (or a similar solution)
  • mentioned getter, setter + 1
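A minimal sketch of the expected cached-property pattern (the `Order` class body and the `_fetch_create_at_from_db` helper are hypothetical; a real implementation would run a SQL query keyed on `self.order_id`):

```python
from datetime import datetime

class Order:
    """E-commerce order; create_at is loaded lazily from the database."""

    def __init__(self, order_id):
        self.order_id = order_id
        self._create_at = None  # cache; the creation time never changes

    @property
    def create_at(self):
        # Hit the database only on first access, then reuse the cached value.
        if self._create_at is None:
            self._create_at = self._fetch_create_at_from_db()
        return self._create_at

    def _fetch_create_at_from_db(self):
        # Hypothetical stand-in for a real database query by self.order_id.
        return datetime(2019, 12, 6, 16, 36)
```

Python 3.8+ also ships `functools.cached_property`, which implements the same idea with less boilerplate.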

How do you implement singleton class in Python?

  • what is a singleton? + 1
  • mentioned a class-attribute cache + 1
  • mentioned customizing the __new__ method, or a factory method + 1
  • mentioned raising an Exception in __init__ + 1
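One possible answer combining the class-attribute cache with a customized `__new__` (the class name `Singleton` is illustrative):

```python
class Singleton:
    """Only one instance of this class can ever exist."""
    _instance = None  # class attribute caching the single instance

    def __new__(cls, *args, **kwargs):
        # Create the instance on first call; return the cached one afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

# Both calls return the same object:
a, b = Singleton(), Singleton()
assert a is b
```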

Ask Q1, and one of Q2/Q3

Given a dataset, say from StackOverflow or Quora, where each document is a large block of text such as a user post or comment. We will use this dataset for analytics or machine learning, but we don't yet know exactly how. How do you process and store the data?

  • Mentioned one or more of the following use cases + 1
  1. extracting important information like numbers and people's names
  2. extracting general NLP features ready for use
  3. building an inverted index for searching
  • Understands the data engineering process for any of the above use cases + 1
  • Good understanding of text databases and query optimization, e.g. Elasticsearch + 1
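As an illustration of use case 3, a toy inverted index over a made-up two-document corpus (whitespace tokenization is a simplification; real systems like Elasticsearch also handle stemming, stop words, and scoring):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "How do I sort a list",
    2: "Sort a DataFrame",
}
index = build_inverted_index(docs)
index["sort"]  # both documents contain "sort"
```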

Explain what TF-IDF is in NLP and why it is a good measure of term importance in a document.

If the candidate doesn't know TF-IDF, ask Q3

  • explained TF + 1
  • explained IDF + 1
  • understands why IDF can reflect importance + 1
  • understands the disadvantage of TF-IDF (no positional info) + 1
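A minimal sketch of the standard TF-IDF formulas over pre-tokenized documents (no smoothing; libraries such as scikit-learn use slightly different variants, and this version assumes the term appears in at least one document):

```python
import math

def tf(term, doc):
    # Term frequency: relative count of the term within one document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: terms in fewer documents score higher.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]
tf_idf("the", corpus[0], corpus)  # 0.0 -- "the" appears in every document
tf_idf("cat", corpus[0], corpus)  # > 0 -- "cat" is rarer, hence more important
```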

How do you calculate the similarity between two documents?

  • mentioned that similarity has different use cases: a) legal matters, b) search engines, c) customer service, etc. + 1
  • mentioned text document feature engineering, term frequency, etc. + 1
  • mentioned K-Means, cosine distance, LDA, or other document distance algorithms + 1
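For instance, cosine similarity over raw term-frequency vectors (a sketch; a real pipeline would typically weight the vectors with TF-IDF first):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two token lists, using term-frequency vectors."""
    a, b = Counter(doc_a), Counter(doc_b)
    # Counter returns 0 for missing terms, so the dot product is safe.
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)
```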

How do you track changes to a data record?

  • use a hash for the uuid and choose the partition key properly + 1
  • use a hash of the content body, or a version_id, to track changes + 1
  • use a last-update timestamp + 1
  • store the history alongside the data, use the last_update timestamp as the version id, and display the latest copy at the application level + 1
  • other ideas + 1 ~ 2
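A sketch of the content-hash idea: serialize the record body deterministically and hash it, so the hash serves as a version id that changes exactly when the body changes (the record fields here are made up):

```python
import hashlib
import json

def content_hash(record):
    """Deterministic hash of a record body, usable as a version id."""
    body = json.dumps(record, sort_keys=True)  # stable serialization
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

v1 = content_hash({"order_id": "A-1", "status": "placed"})
v2 = content_hash({"order_id": "A-1", "status": "shipped"})
v1 != v2  # the body changed, so the version id changes
```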

Tell me about an experience using an OLAP database, and your achievements in that project.

  • understands OLAP and column-oriented storage + 1
  • experience using OLAP as a developer + 1
  • experience integrating with other applications or external data processing systems + 1
  • understands the underlying implementation + 1
  • other ideas + 1 ~ 2

Tell me about your experience using AWS for Data Engineering

  • AWS S3 + 1
  • AWS Lambda + 2
  • AWS Kinesis + 2
  • AWS SQS + 1
  • AWS EMR + 2
  • AWS Redshift + 2

How do you manage who can access what data? You may use any data storage service, such as S3, RDS, or Redshift. + 1 ~ 3

  • IAM Policy
  • S3 Policy
  • Redshift custom view
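For example, a minimal IAM policy document granting read-only access to a single S3 prefix (the bucket name and prefix are made up):

```python
# Hypothetical read-only policy for one S3 bucket/prefix.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::analytics-bucket",
                "arn:aws:s3:::analytics-bucket/reports/*",
            ],
        }
    ],
}
```

Attached to an IAM user, group, or role, this allows listing the bucket and reading objects under `reports/`, and nothing else.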

How do you track who accesses which data, from where and when? + 1 ~ 3

  • CloudTrail logs
  • database logs
  • S3 object-level logging

No question; this is evaluated based on how the candidate answers the other questions.

  • Communication + 1 ~ 3
  • Presentation + 1 ~ 3