@atombender
Last active January 3, 2016 03:59
Some technical challenges.

Challenges

Geocoding

Consider a distributed (i.e., deployed on multiple machines) application designed to geocode data.

The app listens to a queue (e.g., Amazon SQS). Other apps put addresses to be geocoded into this queue. The app geocodes using the Google Maps API.

Because of Google's licensing rules, the application can only process 2500 items per day before being rate limited. Indeed, Google may at any time fail a request by returning an "over the limit" error.

  • How would you structure such an application?
  • What are some possible technical challenges and gotchas that would come up?
  • How might client apps (the ones that ask for addresses to be geocoded) get the results?
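One possible shape for such a worker, sketched as a minimal single-process loop (all class and function names here are my own illustrative assumptions, and plain in-memory deques stand in for SQS queues):

```python
from collections import deque


class OverLimitError(Exception):
    """Raised by the geocoding function when Google rejects the request."""


class GeocodeWorker:
    """Hypothetical worker: pulls addresses from an input queue, geocodes
    them, and publishes (address, result) pairs to a reply queue that
    client apps can consume. Tracks the daily quota locally."""

    def __init__(self, geocode_fn, in_queue, out_queue, daily_limit=2500):
        self.geocode_fn = geocode_fn   # wraps the Google Maps API call
        self.in_queue = in_queue       # stand-in for an SQS queue
        self.out_queue = out_queue     # stand-in for a reply queue
        self.daily_limit = daily_limit
        self.used_today = 0

    def run_once(self):
        """Process one message; return False when the queue is empty or
        the daily quota is exhausted (the caller should then back off)."""
        if self.used_today >= self.daily_limit or not self.in_queue:
            return False
        address = self.in_queue.popleft()
        try:
            result = self.geocode_fn(address)
        except OverLimitError:
            # Google cut us off early: requeue the item and stop.
            self.in_queue.appendleft(address)
            return False
        self.used_today += 1
        self.out_queue.append((address, result))
        return True
```

In a real deployment the quota count would have to be shared across machines (e.g., in a central store), since each worker only sees its own usage.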

Movie matching

We have two datasets A and B, each containing movies.

Each movie has a bunch of attributes: One or more titles (there may be several for different markets/languages), a release year, a list of directors, a list of countries, a list of genres, and so on.

However, the two datasets contain small differences:

  • Some movies have titles spelled or formatted differently. For example, one dataset may use accented characters where the other strips them, or move the definite article to the end ("sting, the").
  • Person names may be spelled or formatted differently. For example, the word order of Asian names may be inconsistent, a name may include or omit middle initials, etc.
  • Country names may be written differently.
  • Some entries simply lack data. A movie could have a title and year, but no directors or countries; or the wrong country; or a year that is off by 1-2 years (since movies are released to different markets at different times), etc.
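Differences like these usually call for a normalization pass before any comparison. A minimal sketch (the rules below are illustrative assumptions, not a prescribed method): lowercase, strip accents, and move a trailing definite article back to the front so both spellings compare equal.

```python
import unicodedata


def normalize_title(title):
    """Canonicalize a movie title for comparison purposes."""
    t = title.strip().lower()
    # Strip accents: decompose, then drop the combining marks.
    t = "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
    # Undo "sting, the" style article placement.
    for article in (", the", ", a", ", an"):
        if t.endswith(article):
            t = article[2:] + " " + t[: -len(article)]
            break
    return t
```

The same idea extends to person names (reordering "Last, First") and country names (mapping aliases to a canonical form).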

How would you match dataset A against dataset B so that you could output a list of A/B pairs that correspond to the same movie?

Also:

  • What solutions would be the simplest and most precise to implement?
  • What are potential challenges and gotchas?
  • Given lots of time and resources, what other avenues (technical, non-technical) could be pursued to improve the matching further?
  • Are there any clever ways to make this a bit more scalable?
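On the scalability point, one common trick is "blocking": only compare records that share a cheap key, so we avoid scoring every A×B pair. A sketch under assumptions of my own (each movie is a dict with "title" and "year"; fuzzy title similarity via the standard library's difflib):

```python
from difflib import SequenceMatcher


def title_score(a, b):
    """Fuzzy similarity between two titles, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_movies(dataset_a, dataset_b, min_score=0.85, year_slack=2):
    """Block candidates by release year (± slack, since release years
    differ across markets), then keep the best fuzzy-title match per
    A-record that clears a threshold."""
    by_year = {}
    for b in dataset_b:
        by_year.setdefault(b["year"], []).append(b)

    pairs = []
    for a in dataset_a:
        candidates = []
        for y in range(a["year"] - year_slack, a["year"] + year_slack + 1):
            candidates.extend(by_year.get(y, []))
        best = max(candidates,
                   key=lambda b: title_score(a["title"], b["title"]),
                   default=None)
        if best and title_score(a["title"], best["title"]) >= min_score:
            pairs.append((a, best))
    return pairs
```

A fuller solution would combine several weighted signals (directors, countries, genres) rather than relying on the title alone, and would treat the threshold as something to tune against a hand-labelled sample.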

Word program

Consider a program that, given a corpus of English text, can find sentences beginning with one or more given words, sorted by frequency. For example, if the user types "it", it would print the most frequent sentences starting with the word "it", e.g. "it was a dark and stormy night".

  • What data structures would be ideal to use?
  • Beyond a naive solution, how could we design a system that would work with arbitrarily large text corpora with potentially millions or billions of sentences, and still return results reasonably fast?
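As a baseline for the naive case, here is a small in-memory prefix index of my own devising (it assumes the corpus fits in RAM, which is exactly the assumption the second question asks us to drop): every word-prefix of a sentence maps to a frequency count of the sentences that start with it.

```python
from collections import defaultdict


class SentencePrefixIndex:
    """Map each word-prefix of every sentence to a count of the
    sentences starting with it; lookups sort by frequency."""

    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(int))

    def add_sentence(self, sentence):
        words = sentence.lower().split()
        for i in range(1, len(words) + 1):
            prefix = " ".join(words[:i])
            self.index[prefix][sentence] += 1

    def top(self, prefix, n=5):
        counts = self.index.get(prefix.lower())
        if not counts:
            return []
        return sorted(counts, key=counts.get, reverse=True)[:n]
```

A scalable variant would replace the dict with something like a disk-backed trie or a sorted, sharded key space, with per-prefix top-k lists precomputed offline so a query is a single lookup rather than a sort.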

Expensive calculation

Imagine a web site that displays to each visiting user some personalized (i.e., each visitor must get their own unique results) information that takes 5-10 seconds to calculate. For example, it could be a dating site that shows "possible dating partners" using a special matching algorithm that is inherently slow. Also, the underlying information changes fairly frequently. Let's say user happiness is more important than anything else; a slow web page is bad.

  • How could one design a solution?
  • The problem is exacerbated by the fact that the web site also has a bazillion users. Exactly why is this a problem, and how does it affect the solution?
  • Regardless of what the "special matching algorithm" actually is, could we optimize how it's used?
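One common answer, regardless of the algorithm itself, is to take the slow computation off the request path. A stale-while-revalidate sketch (class and method names are my own; a real system would use a shared cache and a job queue rather than in-process threads): serve the last computed result immediately, and recompute in the background once it goes stale.

```python
import threading
import time


class RefreshingCache:
    """Serve cached per-user results instantly; refresh stale entries
    in the background so the slow calculation never blocks a page
    view after the user's first visit."""

    def __init__(self, compute_fn, ttl=60.0):
        self.compute_fn = compute_fn   # the slow matching calculation
        self.ttl = ttl                 # seconds before an entry is stale
        self.entries = {}              # user_id -> (value, timestamp)
        self.lock = threading.Lock()

    def get(self, user_id):
        now = time.time()
        with self.lock:
            entry = self.entries.get(user_id)
        if entry is None:
            # First visit: compute synchronously (a real site might
            # show a placeholder and fill it in asynchronously).
            value = self.compute_fn(user_id)
            with self.lock:
                self.entries[user_id] = (value, now)
            return value
        value, ts = entry
        if now - ts > self.ttl:
            # Stale: return the old value, refresh off the request path.
            threading.Thread(target=self._refresh, args=(user_id,)).start()
        return value

    def _refresh(self, user_id):
        value = self.compute_fn(user_id)
        with self.lock:
            self.entries[user_id] = (value, time.time())
```

With a bazillion users, precomputing everything eagerly is off the table, which is why this lazy approach only spends the 5-10 seconds on users who actually show up.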