Skip to content

Instantly share code, notes, and snippets.

@nickmarden
Last active December 12, 2019 17:31
Show Gist options
  • Save nickmarden/dc769d148e0362f61ab14607a9906258 to your computer and use it in GitHub Desktop.
Save nickmarden/dc769d148e0362f61ab14607a9906258 to your computer and use it in GitHub Desktop.

Enterprise Data Warehouse Engineer Skills Test

The following is a series of questions related to skills that we commonly use in our enterprise data warehouse (EDW).

This is not a timed or monitored test, and you may use any print or online resources you would like to assist you in your work.

Please submit your answers in the form of text, markdown, or other lightweight data formats. Please indicate clearly which parts of your response correspond to which question. If your answers span multiple files, please attach them as a zip file.

1. HBase Denormalization with Consistency

Emails received from different users for an particular account are stored in an email table.

The attributes of the email table are:

userId
messageId
timestamp
email-message
attachmentId

Design an HBase schema for the below requirements such that it provides excellent random read performance when:

  • Retrieving all emails for a given user ID and date
  • Retrieving all attachments for a given email for a given user ID and date

Please include sample queries showing both of these types of access, and the row key design employed for each requirement.

2. Slowly-Changing Dimensions

Users post questions on the a legal forum, and the forum's attorneys answer those questions based on their practice area. All the questions and answers posted will be reviewed and assigned a status code.

Here are different status codes for reference:

0: Decision Pending

1: Question or Answer Accepted

2: Question or Answer Declined

A question or answer will initially have status code "0" and change to either "1" or "2". It is also possible that the status for a question or answer could shift between "1" and "2" at any point of time.

We want a report that would measure each attorney's efficiency in the following ways:

  • Number of Decision Pending/Approved questions answered by day.
  • Count of Decision Pending/Approved answers and count of Declined answers by day.
  • Count of approved questions with approved answers answered in a month.

Please outline the database objects you would create to measure this. You can draw a diagram, create DDL, or any other method you would like to communicate what you build. Please include SQL statements that we could use to answer the above questions.

3. Tableau

Tableau is our reporting layer and we have automated a lot of our standard reporting on Tableau Server. Please answer the following questions and focus your answers on Tableau.

  1. You are checking on an automated report on Tableau server set to run each business day. When you open the report, the filters are still present, but there are no results (no data, no viz). There are also no errors. What are some of the possible explanations for why there would not be data present in the report? Describe how you would check for each possible explanation you provide.

  2. We produce Tableau reports that connect to multiple servers. Assume that one of the servers went down, it’s unclear when it will be back online, and that this server is the most widely used data source. What would you do immediately to minimize the impact on business customers?

4. Scala: A Simple Calculator

In Scala, write a method called calculator that accepts three string parameters:

def calculator(operand1: String, operator: String, operand2: String): Unit

and demonstrate that it behaves as follows

  • Converts the operands to Ints;
  • Performs the desired mathematical operator (+, -, *, or /) on the two operands
  • Prints the result, or a generic error message

5. Query Optimization

Consider the following SQL query:

Select a. * , b. * , c. * , d. *
    from table_a a left join (select *, case when column_1b_sub > 0 then 1 else 0 end as column_1b_sub_value from table_b) b on a.column_1a = b.column_1b and a.column_2a = b.column_2b
    left join (select * from table_d) d    on a.column_1a = b.column_1b and a.column_2a = b.column_2b

Identify any syntactic problems with this query. With those problems fixed, how could you modify this query to make it more efficient?

6. Something's Slowing Down Production

Some new code hit production and it's performing horribly! The site is all broken.

The engineers have tracked the problem down to this SQL query:

-- Most popular user first names
SELECT first_name, COUNT(*) FROM v_user_locale_info GROUP BY first_name ORDER BY COUNT(*) DESC LIMIT 10

The definition of the v_user_locale_info view is:

SELECT u.first_name
       , u.last_name
       , a.street_address1
       , a.street_address2
       , a.city
       , a.postalcode
       , c.primary_language
       , l.ltr
FROM   users u
       , address a
       , country c
       , language l
WHERE  u.address_id = a.id
AND    a.country_code = c.code
AND    l.language_code = c.primary_language_code

What's wrong?

This query worked fine on the staging server. Why is it failing now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment