@wakproductions
Last active August 10, 2016 03:59
KSM Consulting SQL Exercises

SQL Exercise #1

At this link <omitted> is a tab-delimited text file containing a list of members of the 114th Congress. Use these names for the following exercise. Import the list of names into a database - Oracle, SQL Server, or HANA. (PostgreSQL was used instead; it is similar to the options given, which were not readily available to me.)

Create a list of all possible three-letter combinations (“trigrams”), from ‘AAA’ to ‘ZZZ’, in order, with an ID column. ‘AAA’ gets ID = 1.

Parse each last name into a set of trigrams; e.g., “Brandt” contains BRA, RAN, AND, and NDT. Determine the five most common trigrams in the Congressional roster. For each last name, sum the ID values of its trigrams and determine the last names with the five highest scores.

Deliverables:

  1. The list of most common trigrams (trigram and frequency).
  2. The list of high-scoring names (name and score).
  3. The type of database used.
  4. A description of the technique used for parsing names into trigrams, along with the code to achieve the above.

Data Load Process

  • Looking over the format of the file, there's a header row with the field names. The delimiters were hard to see in my text editor (Sublime), so to make sure, I opened a Ruby console and read a line to check for them. Sure enough, I was able to confirm that the data is tab-delimited, with each row terminated by a newline.
[1] pry(main)> file = File.join(Dir.pwd, 'ContactingCongress.db.txt')
=> "/Users/wkotzan/Development/sites/ksm_consulting_sql/ContactingCongress.db.txt"
[2] pry(main)> f = File.open(file, 'r')
=> #<File:/Users/wkotzan/Development/sites/ksm_consulting_sql/ContactingCongress.db.txt>
[3] pry(main)> f.readline
=> "Formal Name\tDistrict\tChamber\tFirst Name\tSurname\tParty\tDC Office\tDC Voice\tDistrict Voice\tElectronic Correspondence\tWeb\n"
  • I'm going to use PostgreSQL 9.3 because I develop on a Mac and currently have that platform configured on my system. PostgreSQL is an enterprise-scale relational database used by many organizations, including my previous employer Avant, which used it to manage a $3 billion loan portfolio. For the most part it works just like SQL Server or Oracle. Using pgAdmin, I created a new database named "ksm_consulting_sql" and made a table using the following schema:
CREATE TABLE congressmen (
  id serial NOT NULL,
  formal_name character varying(255),
  district character varying(255),
  chamber character varying(255),
  first_name character varying(255),
  surname character varying(255),
  party character varying(255),
  dc_office character varying(255),
  dc_voice character varying(255),
  district_voice character varying(255),
  electronic_correspondence character varying(255),
  web character varying(255)
);
ALTER TABLE ONLY congressmen ADD CONSTRAINT congressmen_pk PRIMARY KEY (id);
  • I chose varchar string types for the data fields because all of the data seems best represented by strings. Depending on the project requirements, we could add further constraints, indexes, or search capability. For example, we could use the citext data type if we wanted case-insensitive searches on formal_name, or validate dc_voice and district_voice against a phone-number regex to ensure numbers are entered in a consistent format. Those requirements weren't specified in the problem, however, so I kept it simple.
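For illustration, those optional constraints might look like the following (a sketch only, not part of the exercise; the citext type requires enabling its extension, and the phone format shown is a hypothetical choice):
CREATE EXTENSION IF NOT EXISTS citext;
-- case-insensitive searches on formal_name:
ALTER TABLE congressmen ALTER COLUMN formal_name TYPE citext;
-- enforce a hypothetical NNN-NNN-NNNN phone format:
ALTER TABLE congressmen ADD CONSTRAINT dc_voice_format
  CHECK (dc_voice ~ '^\d{3}-\d{3}-\d{4}$');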

  • I used the following command to import the data. As a sanity check, the reported row count of 540 indicates that the correct number of rows was loaded.

COPY congressmen (formal_name, district, chamber, first_name, surname, party, dc_office, dc_voice, district_voice, electronic_correspondence, web)
FROM '/Users/wkotzan/Development/sites/ksm_consulting_sql/ContactingCongress.db.txt'
WITH (FORMAT 'csv', DELIMITER E'\t', HEADER true);

--> Query returned successfully: 540 rows affected, 149 ms execution time.
  • As a second sanity check, I ran a select * from congressmen query to make sure all of the values landed in the right columns. Everything looked okay at a cursory glance.

Problem-Solving Approach

  • I considered several methods for building and parsing the trigrams. Most people coming from a database background would probably reach for the procedural scripting language provided by the database platform; Postgres has one called PL/pgSQL, which is similar to Oracle's PL/SQL. Since I come from a software application engineering background, I have a preference for external scripting languages. I believe this approach has some advantages:

    • Flexibility: the syntax of external scripting languages is better suited to ad hoc data manipulation and delivers a lot of power.
    • Interoperability: by avoiding database-specific extensions, you remove a dependency and make the ETL process more platform agnostic.
    • Maintainability: open source languages like Elixir and Ruby are widely used and have huge support communities.
    • Testability: ETL scripts based on such languages can easily be expanded and unit tested. We can run the script in a mock environment to see how it performs against different data sets before it gets anywhere near actual data.
  • My initial inclination was to use Ruby since I am very strong in that language, but instead I chose Elixir. My reasoning was:

    • Elixir is increasing in popularity as an alternative to Ruby because it has much better speed and concurrency handling.
    • I also considered Clojure and Golang, but Elixir is a popular alternative among Rubyists because it was developed by some of the same contributors as the Ruby core team. Many of the design ideas carry over, making it feel complementary to Ruby.
    • I had never written any Elixir code before this project, so I decided to use it as a learning opportunity: a fun challenge to see if I could learn enough in about 24 hours to pull off this data manipulation task.

Building the Universe of Trigrams

  • To generate the universe of trigrams, I created a module named TrigramGenerator with several utility functions for building the letter combinations. (I chose lowercase letters over uppercase for aesthetics and by convention in Elixir/Ruby systems.) You can generate the universe of trigrams as an in-memory list using the function TrigramGenerator#build from the iex console:
iex(1)> TrigramGenerator.build                    
["aaa", "aab", "aac", "aad", "aae", "aaf", "aag", "aah", "aai", "aaj", "aak",
 "aal", "aam", "aan", "aao", "aap", "aaq", "aar", "aas", "aat", "aau", "aav",
 "aaw", "aax", "aay", "aaz", "aba", "abb", "abc", "abd", "abe", "abf", "abg",
 "abh", "abi", "abj", "abk", "abl", "abm", "abn", "abo", "abp", "abq", "abr",
 "abs", "abt", "abu", "abv", "abw", "abx", ...]
  • To insert the list of trigrams into the database, I first created a table using the following SQL instruction:
-- I didn't use the serial type for id because I want my Elixir program to have control over the id numbers
CREATE TABLE trigrams (
  id integer NOT NULL,
  trigram character varying(3)
);
ALTER TABLE ONLY trigrams ADD CONSTRAINT trigrams_pk PRIMARY KEY (id);
  • I then created an Elixir model module named Trigram to handle the database interaction, and a module named PopulateTrigrams to handle the population routine. These make use of Ecto, Elixir's database library, which works similarly to LINQ in the .NET world. Here are the console commands used to perform the population. Note that for the operation to succeed, the trigrams table must first be empty (the one-line reset is shown after the sanity check below).
iex(1)> KsmConsultingSql.Repo.start_link
iex(2)> PopulateTrigrams.populate_trigram_universe

17:13:24.526 [debug] INSERT INTO "trigrams" ("id", "trigram") VALUES ($1, $2) [1, "aaa"] OK query=496.1ms queue=55.8ms

17:13:24.530 [debug] INSERT INTO "trigrams" ("id", "trigram") VALUES ($1, $2) [2, "aab"] OK query=1.0ms

...
  • It took about 20 seconds to insert all the rows into the database. I did a sanity check using the following SQL command in pgAdmin: there were 17576 rows, which equals 26 * 26 * 26, so the table had the correct number of trigrams, and they all appeared to be in order by id.
select * from trigrams order by trigram;
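As noted above, the population routine requires an empty trigrams table before a re-run; in pgAdmin that reset is a single statement:
TRUNCATE TABLE trigrams;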

Parsing and Associating the Trigrams within Each Last Name

  • Next I built the parsing functionality as a TrigramParser module, which accepts a string as input and returns a list of all the trigrams it contains. The parser keeps only letters, so spaces, apostrophes, and other non-alphabetic characters in a name are stripped before parsing. The parser is invoked via the #call function.
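For example, from the iex console (a hypothetical session; the output matches the example given in the problem statement):
iex(1)> TrigramParser.call("Brandt")
["bra", "ran", "and", "ndt"]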

  • Associating each trigram of a politician's surname with its id number in our trigram universe requires a many-to-many association table:

CREATE TABLE congressmen_trigrams (
  id serial NOT NULL,
  congressman_id integer,
  trigram_id integer
);
ALTER TABLE ONLY congressmen_trigrams ADD CONSTRAINT congressmen_trigrams_pk PRIMARY KEY (id);
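In a production schema I would also add foreign key constraints to protect referential integrity (optional here, since the exercise doesn't call for it):
ALTER TABLE congressmen_trigrams
  ADD CONSTRAINT congressmen_trigrams_congressman_fk FOREIGN KEY (congressman_id) REFERENCES congressmen (id),
  ADD CONSTRAINT congressmen_trigrams_trigram_fk FOREIGN KEY (trigram_id) REFERENCES trigrams (id);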
  • To populate this association table, I created another Elixir model to interface with that table, along with a corresponding module named PopulateCongressmenTrigrams. The population was performed using the following command in the iex console; it took about 15 seconds to run and populate the congressmen_trigrams table.
iex(1)> PopulateCongressmenTrigrams.call

10:14:40.621 [debug] SELECT c0."id", c0."surname" FROM "congressmen" AS c0 [] OK query=58.4ms

10:14:40.628 [debug] SELECT t0."id", t0."trigram" FROM "trigrams" AS t0 WHERE (t0."trigram" = $1) ["mur"] OK query=6.3ms

10:14:40.628 [debug] BEGIN [] OK query=0.3ms

10:14:40.629 [debug] INSERT INTO "congressmen_trigrams" ("congressman_id", "trigram_id") VALUES ($1, $2) RETURNING "id" [1, 8650] OK query=1.0ms

10:14:40.630 [debug] COMMIT [] OK query=0.5ms

10:14:40.635 [debug] SELECT t0."id", t0."trigram" FROM "trigrams" AS t0 WHERE (t0."trigram" = $1) ["urk"] OK query=5.0ms

...
  • To sanity check the data in congressmen_trigrams, I first ran a select * from congressmen_trigrams query and examined the row count: 2468, which given 540 names averages out to ~4.5 trigrams per surname. Given that the average surname length is ~6.5 characters (select avg(length(surname)) from congressmen), that number looks about right. I also spot-checked individual names using SQL statements like this:
select congressman_id, surname, trigram_id, trigram
from congressmen c 
inner join congressmen_trigrams ct on c.id=ct.congressman_id
inner join trigrams t on t.id=ct.trigram_id
where congressman_id = 15;

produces

congressman_id surname trigram_id trigram
15 "Crawford" 1795 "cra"
15 "Crawford" 11515 "raw"
15 "Crawford" 578 "awf"
15 "Crawford" 15017 "wfo"
15 "Crawford" 3762 "for"
15 "Crawford" 9910 "ord"

Looks ok! Also note that the sum of the above trigram_id numbers is 42577, which I will use as a sanity check on another query later.

Five Most Common Trigrams

select count(trigram_id), trigram
from congressmen_trigrams ct
inner join trigrams t on t.id=ct.trigram_id
group by trigram
order by count(trigram_id) desc
limit 5;
count trigram
26 "son"
18 "ter"
17 "ell"
14 "ers"
13 "man"
  • As a sanity check, I wanted to see how many results would pop up when searching for a particular set of records:
select formal_name, surname from congressmen where surname like '%ter%';

Sure enough, it returned 18 rows! (The counts match because no surname happens to contain 'ter' more than once; a name with a repeated trigram would make the trigram count exceed the LIKE row count.)

Highest Scoring Names

select congressman_id, surname, sum(trigram_id) as trigram_id_sum
from congressmen c 
inner join congressmen_trigrams ct on c.id=ct.congressman_id
inner join trigrams t on t.id=ct.trigram_id
group by congressman_id, surname
order by trigram_id_sum desc
limit 5;
congressman_id surname trigram_id_sum
130 "Wasserman Schultz" 113844
517 "McMorris Rodgers" 101396
233 "Ruppersberger" 93859
328 "Frelinghuysen" 93479
329 "Watson Coleman" 90117

Sanity check on the query math:

select congressman_id, surname, sum(trigram_id)
from congressmen c 
inner join congressmen_trigrams ct on c.id=ct.congressman_id
inner join trigrams t on t.id=ct.trigram_id
where congressman_id = 15
group by congressman_id, surname;
congressman_id surname sum
15 "Crawford" 42577

Technique for Parsing Names Into Trigrams

The process I used works like this:

  1. Convert each name into lower case.
  2. Remove all non-alphabetic characters and whitespace.
  3. Break up each name into a list of individual letters.
  4. Use Elixir's Enum.chunk function with a step of 1 to slide through the list of letters in overlapping sets of three, concatenating each set back into a string and thereby creating the list of trigrams for each name.
  5. Once I have the list of trigrams associated with each name, use Ecto's DSL to write the trigram associations to the database's join table, congressmen_trigrams.
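
Condensed into a single expression, steps 1-4 look roughly like this (a sketch equivalent to the TrigramParser module listed below; Enum.chunk/3 with a step of 1 produces the overlapping three-letter windows):

"Wasserman Schultz"
|> String.downcase
|> String.replace(~r/[^a-z]/, "")
|> String.graphemes
|> Enum.chunk(3, 1)
|> Enum.map(&Enum.join/1)
# => ["was", "ass", "sse", "ser", "erm", "rma", "man", "ans", "nsc", "sch", "chu", "hul", "ult", "ltz"]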

Considerations

For the DB interactions, I used Ecto's PostgreSQL adapter. Ecto currently has adapters for PostgreSQL, MySQL, SQL Server, SQLite3, and MongoDB. The same approach would work for Oracle and SAP, but it would require a different scripting language, since no Elixir adapters are available for them. Clojure is a good alternative candidate because it runs on the Java Virtual Machine and can therefore use JDBC to perform the database queries.

It took me just over 24 work hours to complete this task. I could have done it much quicker in Ruby, but I wanted to push beyond my comfort zone by trying Elixir, which I had never used before. It was a very educational exercise, as I got fairly familiar with Elixir's rudimentary data types and the Elixir way of thinking. I was surprised to find that Elixir feels like a somewhat lower-level language than Ruby. I'm not sure if it's my lack of familiarity with Elixir's design patterns for handling enumerations, but I feel like Ruby gives you more tools for parsing through lists and strings. That convenience of Ruby, however, comes at a cost in performance and memory usage.

Maybe this wasn't the approach you were looking for, but I think it at least demonstrates a progressive, entrepreneurial mindset around deployment models and emerging technologies.

Elixir Project Code (selected files)

# /lib/congressman.ex

defmodule Congressman do
  use Ecto.Schema

  schema "congressmen" do
    field :formal_name, :string
    field :surname, :string

    has_many :congressmen_trigrams, CongressmanTrigram, [foreign_key: :congressman_id]
    has_many :trigrams, through: [:congressmen_trigrams, :trigram] # references the belongs_to :trigram association on CongressmanTrigram
  end
end
# /lib/congressman_trigram.ex

defmodule CongressmanTrigram do
  use Ecto.Schema
  import Ecto.Changeset

  schema "congressmen_trigrams" do
    belongs_to :congressman, Congressman
    belongs_to :trigram, Trigram
  end

end
# /lib/populate_congressmen_trigrams.ex

defmodule PopulateCongressmenTrigrams do
  import Ecto.Query

  def call do
    list_surnames |> populate_trigrams
  end  

  defp associate_congressman_to_trigram(congressman_id, trigram) do
    Ecto.build_assoc(retrieve_trigram_by_text(trigram), :congressmen_trigrams, congressman_id: congressman_id)
    |> KsmConsultingSql.Repo.insert!
  end

  defp create_associations_for_surname({congressman_id, surname}) do
    TrigramParser.call(surname) |> Enum.each(fn(trigram) -> associate_congressman_to_trigram(congressman_id, trigram) end)    
  end

  defp list_surnames do
    from(c in Congressman, select: {c.id, c.surname}) |> KsmConsultingSql.Repo.all
  end

  defp populate_trigrams(surname_list) do
    surname_list |> Enum.each(&(&1 |> create_associations_for_surname))
  end

  # TODO consider raising an error if trigram cannot be found
  defp retrieve_trigram_by_text(trigram) do
    Trigram |> Trigram.find_by_text(trigram) |> List.first
  end

end
# /lib/populate_trigrams.ex

defmodule PopulateTrigrams do

  def populate_trigram_entry(params) do
    Trigram.changeset(%Trigram{}, params)
    |> KsmConsultingSql.Repo.insert!
    # consider displaying changeset errors here using IO.puts, if any; currently no means to display errors
  end

  def populate_trigram_universe do
    TrigramGenerator.build |> parameterize_trigram_list |> commit_trigrams
  end

  defp commit_trigrams(trigrams) when is_list(trigrams) do
    Enum.each(trigrams, &(populate_trigram_entry(&1)))
  end

  defp parameterize_trigram_list(trigrams) when is_list(trigrams) do
    List.foldl(trigrams, [], fn(trigram, acc) -> acc ++ [%{id: length(acc) + 1, trigram: trigram}] end)
  end

end
# /lib/trigram.ex

defmodule Trigram do
  use Ecto.Schema
  import Ecto.Query
  import Ecto.Changeset

  schema "trigrams" do
    field :trigram, :string

    has_many :congressmen_trigrams, CongressmanTrigram
    has_many :congressmen, through: [:congressmen_trigrams, :congressman]
  end

  def changeset(trigram, params \\ %{}) do
    cast(trigram, params, [:id, :trigram])
  end

  def find_by_text(query, text) do
    from(t in query, where: t.trigram == ^text) |> KsmConsultingSql.Repo.all
  end
end
# /lib/trigram_generator.ex

defmodule TrigramGenerator do
  
  def alphabet, do: Enum.map(?a..?z, & &1) |> to_string |> String.graphemes

  def build do
    alphabet |> join_alphabet |> join_alphabet
  end

  def join_alphabet(base \\ "")
  def join_alphabet(base) when is_binary(base), do: for letter <- alphabet, do: base <> letter
  def join_alphabet(base) when is_list(base) do
    base 
    |> Enum.map(fn(list_item) -> join_alphabet(list_item) end) 
    |> List.flatten
  end
end
# /lib/trigram_parser.ex

defmodule TrigramParser do
  
  def call(input) when is_binary(input) do
    input |> clean_up |> extract_trigrams
  end

  defp extract_trigrams(input) do
    input |> to_graphemes |> breakout_trigrams
  end

  defp breakout_trigrams(input) do
    input |> Enum.chunk(3, 1) |> concat_grapheme_sets
  end

  defp concat_grapheme_sets(input) do
    input |> List.foldl([], fn(set, acc) -> acc ++ [concat_grapheme_set(set)] end)
  end

  defp concat_grapheme_set(input) do
    input |> List.foldl("", fn(grapheme, acc) -> acc <> grapheme end)
  end

  defp clean_up(input), do: input |> lowercase |> strip_whitespace
  defp lowercase(input), do: String.downcase(input)
  defp strip_whitespace(input), do: String.replace(input, ~r/[^a-z]/, "", global: true) # strips all non-letter characters, not just whitespace
  defp to_graphemes(input), do: String.graphemes(input)
  
end
# /lib/trigram_generator_test.exs

defmodule TrigramGeneratorTest do
  use ExUnit.Case

  describe "TrigramGenerator.alphabet/0" do
    test "list letters" do
      assert TrigramGenerator.alphabet == [
          "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p",
          "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"
        ]
    end
  end

  describe "TrigramGenerator.iterate_letters/1" do
    test "no base given" do
      assert TrigramGenerator.join_alphabet() == TrigramGenerator.alphabet
    end

    test "concatenates with the given base" do
      assert TrigramGenerator.join_alphabet("a") == [
        "aa", "ab", "ac", "ad", "ae", "af", "ag", "ah", "ai", "aj", "ak", "al", "am",
        "an", "ao", "ap", "aq", "ar", "as", "at", "au", "av", "aw", "ax", "ay", "az"
      ]
    end

    test "takes a List" do
      assert TrigramGenerator.join_alphabet(["aa", "ab"]) == [
        "aaa", "aab", "aac", "aad", "aae", "aaf", "aag", "aah", "aai", "aaj", "aak",
        "aal", "aam", "aan", "aao", "aap", "aaq", "aar", "aas", "aat", "aau", "aav",
        "aaw", "aax", "aay", "aaz", "aba", "abb", "abc", "abd", "abe", "abf", "abg",
        "abh", "abi", "abj", "abk", "abl", "abm", "abn", "abo", "abp", "abq", "abr",
        "abs", "abt", "abu", "abv", "abw", "abx", "aby", "abz"
      ]
    end
  end

  describe "TrigramGenerator.build/0" do
    test "has the proper number of sorted items" do
      trigram = TrigramGenerator.build
      assert trigram |> length == 26 * 26 * 26
      assert trigram |> Enum.at(26) == "aba"
      assert trigram |> Enum.at(27) == "abb"
      assert trigram |> Enum.at(52) == "aca"
      assert trigram |> Enum.at(-2) == "zzy"
    end
  end

end
# /lib/trigram_parser_test.exs

defmodule TrigramParserTest do
  use ExUnit.Case

  describe "TrigramParser.call/1" do
    test "parses various inputs correctly" do
      assert TrigramParser.call("batman-vs sup'rman") == [
          "bat", "atm", "tma", "man", "anv", "nvs", "vss", "ssu", "sup", "upr", "prm", "rma", "man"
        ]
    end
  end

end

SQL Exercise #2

Client complains that ETL has been slowing down way more than expected for modest growth of the files.

What are your first questions for him about the ETL, and when you get on-site, what are the first three (or more) steps in your solution?

Most cases of poor performance in a data operation occur because the computer is performing a lot of work on a repetitive task. Often, efficiency can be improved by restructuring that process so that the computer does less work to accomplish the same thing. Here are some ideas I would investigate to troubleshoot the matter:

  • First, I need to know more about the ETL process and data involved.

    • Is it a script?
    • Is the data inserted by a database query? What does that query look like?
    • What kind of data is being loaded? E.g., is it coming from a flat file, or is it being piped in from an external source?
    • What related procedures or jobs are invoked in association with this ETL process?
    • How many rows does the table currently have?
    • Is the size of the input data growing? If so, let's try to benchmark the ETL process against different sizes of input data sets.
  • I won't settle on a solution until I have more information about the process, but here are some possible findings:

    • The load process contains an inefficient query. I've seen many cases like this, where a repetitive process becomes slow due to a poorly written join of related tables or to calculations that pull in many rows, such as an average. When the database is small, performance is acceptable, but as the overall size of the data grows, such queries place an increasingly heavy burden on processing power. This is especially true for complex joins of multiple tables. The fix often involves breaking the inefficient query into several processing stages: look for values that are expensive to calculate and isolate them into an intermediate table. Sometimes caching the related records for a join into a temporary, denormalized table significantly improves performance, because you avoid repeating the calculations for every single row added (a sketch follows after this list).
    • A particular set of records is causing trouble. Certain triggers within the data could be causing a background process or query to hang. By scrutinizing the type of data being loaded, we might be able to identify such cases.
    • The table itself might have a poor design. There may be fields hit in lookups that could be optimized with a B-tree index. The client might also be storing certain types of information inefficiently, such as large blobs of data that would be better kept in separate tables or outside the DB system as independent files.
  • My general approach would be to break apart the entire ETL process: replicate the issue in a development environment, then benchmark each routine within the process to identify the slowest components.
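
To make the staging-table idea from the first outcome above concrete, here is a sketch (all table and column names are hypothetical):

-- cache the expensive aggregate once, instead of recomputing it for every loaded row
CREATE TEMP TABLE customer_averages AS
SELECT customer_id, avg(amount) AS avg_amount
FROM payments
GROUP BY customer_id;

CREATE INDEX ON customer_averages (customer_id);

-- the load query then joins the small, precomputed table:
-- SELECT o.*, a.avg_amount
-- FROM incoming_orders o
-- JOIN customer_averages a USING (customer_id);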

Would my answer vary depending on the database type?

From my experience, most relational databases work similarly enough that most solutions for data and query quality follow the same general ideas. There could be a big difference, however, between a relational database and a NoSQL cluster running Hadoop. I'd also have to determine which advanced DBMS features the client might be using, which could point to a system-specific problem.

SQL Exercise #3

This is a very interesting problem and I have some ideas that could help. However, I think they would be best explained in a face-to-face conversation. Feel free to contact me if you would like to discuss.
