
@MacHu-GWU
Created December 6, 2019 16:36
Interview Engineers

Senior Data Engineer

REQUIREMENTS FOR THE GIG:

  • Strong development experience using Python
  • Experience developing machine learning and natural language processing models and related applications
  • Experience with MPP databases such as Greenplum
  • Experience with AWS ecosystem of big data and analytics services – Redshift, Kinesis, AWS Lambda, Redshift Spectrum, Quicksight, Amazon EMR
  • Experience with architecting and designing solutions to replace on-premise solutions with cloud native technologies
  • Experience with Business Intelligence tools and dashboards using Tableau, QuickSight, Jaspersoft or similar
  • Managing AWS resources including EC2, RDS, Redshift, et cetera
  • Explore and learn the latest AWS technologies to provide new capabilities and increase efficiency
  • Help continually improve ongoing reporting and analysis processes, automating or simplifying self-service support for customers
  • Ability to mentor and work alongside junior team members and grow them into leaders
  • Familiarity with Linux
  • Experience with Hadoop or other map/reduce "big data" systems and services
  • Knowledge of Advanced SQL and scripting for automation

BASIC QUALIFICATIONS:

  • Bachelor’s degree in Computer Science, MIS, related technical field, or equivalent work experience.
  • 5 or more years of overall work experience, including 3 or more years in analytics, data engineering, or a related field
  • Proven experience in data modeling, ETL development, and data warehousing, or similar skills
  • Demonstrable skills and experience using SQL (e.g. Postgres, Oracle, SQL Server, Redshift)
  • Proven track record of successful communication of data infrastructure, data models, and data engineering solutions through written communication, including an ability to effectively communicate with both business and technical teams

Ask Q1 and one of Q2/Q3

  • What do you usually do with Python in your work? Data engineering, DevOps, ML + 1 ~ 3
  • Do you have experience writing your own frameworks or libraries? + 1

You have a class called Order (an e-commerce order), which naturally has an attribute called order_id. Implement a property method called create_at that returns the timestamp when the order was created. You have to query the database to get this information; how do you implement that?

  • mentioned that create_at never changes, so it can be cached + 1
  • mentioned caching the property with an initial value of None + 1 (or a similar solution)
  • mentioned getter, setter + 1
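A minimal sketch of the expected cached-property pattern (the `Order` class body and the `_fetch_create_at_from_db` helper are hypothetical; a real implementation would run a SQL query keyed on `self.order_id`):

```python
from datetime import datetime

class Order:
    """E-commerce order; create_at is loaded lazily from the database."""

    def __init__(self, order_id):
        self.order_id = order_id
        self._create_at = None  # cache; the creation time never changes

    @property
    def create_at(self):
        # Hit the database only on first access, then reuse the cached value.
        if self._create_at is None:
            self._create_at = self._fetch_create_at_from_db()
        return self._create_at

    def _fetch_create_at_from_db(self):
        # Hypothetical stand-in for a real database query by self.order_id.
        return datetime(2019, 12, 6, 16, 36)
```

Python 3.8+ also ships `functools.cached_property`, which implements the same idea with less boilerplate.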

How do you implement singleton class in Python?

  • what is a singleton? + 1
  • mentioned a class-attribute cache + 1
  • mentioned customizing the __new__ method, or a factory method + 1
  • mentioned raising an Exception in __init__ + 1
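One possible answer combining the class-attribute cache with a customized `__new__` (the class name `Singleton` is illustrative):

```python
class Singleton:
    """Only one instance of this class can ever exist."""
    _instance = None  # class attribute caching the single instance

    def __new__(cls, *args, **kwargs):
        # Create the instance on first call; return the cached one afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

# Both calls return the same object:
a, b = Singleton(), Singleton()
assert a is b
```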

Ask Q1, and one of Q2/Q3

Given a dataset, say from StackOverflow or Quora, where each document is a large block of text such as a user post or comment. We will use this dataset for analytics or machine learning, but we don't yet know exactly how. How do you process and store the data?

  • Mentioned one or more of the following use cases + 1
  1. extracting important information like numbers and people's names
  2. extracting general NLP features ready for use
  3. building an inverted index for searching
  • Understands the data engineering process for any of the above use cases + 1
  • Good understanding of text databases and query optimization, e.g. Elasticsearch + 1
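As an illustration of use case 3, a toy inverted index over a made-up two-document corpus (whitespace tokenization is a simplification; real systems like Elasticsearch also handle stemming, stop words, and scoring):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "How do I sort a list",
    2: "Sort a DataFrame",
}
index = build_inverted_index(docs)
index["sort"]  # both documents contain "sort"
```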

Explain what TF-IDF is in NLP and why it is a good measure of term importance in a document.

If the candidate doesn't know TF-IDF, ask Q3

  • explained TF + 1
  • explained IDF + 1
  • understands why IDF can reflect importance + 1
  • understands the disadvantage of TF-IDF (no positional info) + 1
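A minimal sketch of the standard TF-IDF formulas over pre-tokenized documents (no smoothing; libraries such as scikit-learn use slightly different variants, and this version assumes the term appears in at least one document):

```python
import math

def tf(term, doc):
    # Term frequency: relative count of the term within one document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: terms in fewer documents score higher.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]
tf_idf("the", corpus[0], corpus)  # 0.0 -- "the" appears in every document
tf_idf("cat", corpus[0], corpus)  # > 0 -- "cat" is rarer, hence more important
```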

How do you calculate the similarity between two documents?

  • mentioned that similarity has different use cases: a) legal matters, b) search engines, c) customer service, etc. + 1
  • mentioned text document feature engineering, term frequency, etc. + 1
  • mentioned K-Means, cosine distance, LDA, or other document distance algorithms + 1
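For instance, cosine similarity over raw term-frequency vectors (a sketch; a real pipeline would typically weight the vectors with TF-IDF first):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two token lists, using term-frequency vectors."""
    a, b = Counter(doc_a), Counter(doc_b)
    # Counter returns 0 for missing terms, so the dot product is safe.
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)
```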

How do you track changes to a data record?

  • use a hash for the uuid and choose the partition key properly + 1
  • use a hash of the content body, or a version_id, to track changes + 1
  • use a last-update timestamp + 1
  • store the history alongside the data, use the last_update timestamp as the version id, and display the latest copy at the application level + 1
  • other ideas + 1 ~ 2
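A sketch of the content-hash idea: serialize the record body deterministically and hash it, so the hash serves as a version id that changes exactly when the body changes (the record fields here are made up):

```python
import hashlib
import json

def content_hash(record):
    """Deterministic hash of a record body, usable as a version id."""
    body = json.dumps(record, sort_keys=True)  # stable serialization
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

v1 = content_hash({"order_id": "A-1", "status": "placed"})
v2 = content_hash({"order_id": "A-1", "status": "shipped"})
v1 != v2  # the body changed, so the version id changes
```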

Tell me about an experience using an OLAP database, and your achievements in that project.

  • understands OLAP and column-oriented storage + 1
  • experience using OLAP as a developer + 1
  • experience integrating with other applications or external data processing systems + 1
  • understands the underlying implementation + 1
  • other ideas + 1 ~ 2

Tell me about your experience using AWS for Data Engineering

  • AWS S3 + 1
  • AWS Lambda + 2
  • AWS Kinesis + 2
  • AWS SQS + 1
  • AWS EMR + 2
  • AWS Redshift + 2

How do you manage who can access what data? You may use any data storage service, such as S3, RDS, or Redshift. + 1 ~ 3

  • IAM Policy
  • S3 Policy
  • Redshift custom view
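For example, a minimal IAM policy document granting read-only access to a single S3 prefix (the bucket name and prefix are made up):

```python
# Hypothetical read-only policy for one S3 bucket/prefix.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::analytics-bucket",
                "arn:aws:s3:::analytics-bucket/reports/*",
            ],
        }
    ],
}
```

Attached to an IAM user, group, or role, this allows listing the bucket and reading objects under `reports/`, and nothing else.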

How do you track who accesses which data, from where and when? + 1 ~ 3

  • CloudTrail logs
  • database logs
  • S3 object-level logging

No question; this is evaluated based on how the candidate answers the other questions.

  • Communication + 1 ~ 3
  • Presentation + 1 ~ 3