Questions
Once you've finished writing the code, you must answer the following questions and include them in your documentation:
• Was the question/problem clear? Did you feel like something was missing or not explained correctly?
- The problem was well defined, but a few details were left out. The request is to "create a piece of software that collects data from two services and indexes that data through a third service as quickly and efficiently as possible", but a little more context would've been helpful. What does the typical load for our system look like (in other words, what value should I pass for the -c flag to the `./challenge-linux input -c <LOAD>` command)? I'd also argue that the wording "as quickly and efficiently as possible" is ill-defined; specific metrics for what success would look like might have been better. For example: we expect the service to index 95% of all incoming requests in under 5 seconds, to have a memory profile that fits this particular use case, etc.
There were also some implementation details that weren't defined ahead of time; for example, the /users and /index endpoints return gzip-compressed responses, and that's not indicated in the problem definition (see the sketch below).
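For illustration, handling one of those compressed responses in Go looks roughly like this (a minimal sketch with a hypothetical endpoint and response shape, not the actual repo code):

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"net/http"
)

// fetchUser fetches a user payload and decompresses the gzip body by hand.
// The URL and the map-shaped response are stand-ins for the real service.
// (Note: net/http transparently decompresses when it negotiates gzip itself;
// here we assume the service compresses unconditionally, so we wrap the
// body ourselves.)
func fetchUser(id string) (map[string]interface{}, error) {
	resp, err := http.Get("http://localhost:8080/users/" + id)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	var user map[string]interface{}
	if err := json.NewDecoder(gz).Decode(&user); err != nil {
		return nil, err
	}
	return user, nil
}

func main() {
	user, err := fetchUser("42")
	fmt.Println(user, err)
}
```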
• How much time did you spend on each part: understanding, designing, coding, testing?
Understanding & design:
- After reading the assessment I wrote a small Python script to parse 1,000,000 input values and see if there were any interesting patterns in the data that could be useful, but the output of the script seemed to be completely random (~10 minutes).
- I then spent the next two days working on other projects and thinking over possible ways to approach the problem. I knew I'd take a concurrent approach, and since I hadn't used Go's concurrency primitives before I spent some time watching videos on Go concurrency patterns and came up with a single-threaded version of the core solution.
Coding:
- It took about 3 days of working on and off to get to an implementation I thought was respectable. The core solution itself was written fairly early on and took 2-3 hours - https://github.com/philangist/vimeo-indexer/commit/c7e6b9b34063f5fadc16ff61c961aee11d417787 - everything after that has just been iteration and refinement. I did underestimate the difficulty of implementing the concurrent approach in Go, however, and spent much of Sunday fighting deadlocks and race conditions and trying to better understand the language. I had to write out a dead-simple version of the core workflow (https://gist.github.com/philangist/d14bc3319e1eb4a9c9f983f6dd461fc0) and then aggressively simplify my core logic.
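That dead-simple workflow is essentially the classic fan-out pattern: one producer reads stdin while a fixed pool of workers drains a channel. A minimal sketch of the shape (the names and 4-worker count are hypothetical, not the repo code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"sync"
)

// process is a stand-in for the real work: fetch the user and video,
// then send the combined document to the index service.
func process(line string) {
	fields := strings.Split(line, ",")
	fmt.Println("indexed:", fields)
}

func main() {
	lines := make(chan string)
	var wg sync.WaitGroup

	// Fan out: a fixed pool of workers drains the channel concurrently.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range lines {
				process(line)
			}
		}()
	}

	// Single producer: read user,video ID pairs from stdin.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		lines <- scanner.Text()
	}
	close(lines) // lets the workers' range loops exit

	wg.Wait()
}
```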
Testing:
- I was manually testing for the first few days, but when I started to approach a stable solution I took about 5-6 hours over 2 nights to write tests:
https://github.com/philangist/vimeo-indexer/commit/1fca7206888ec6b51cb8170569533f2ba49d7447
https://github.com/philangist/vimeo-indexer/commit/c9a82c90035074d4bd1eca115fd79ec98b2311f9
It was a larger time investment than I initially expected, but I wanted to write a high-quality solution.
• Why did you choose the language to write your solution in?
- Because Go is fun to write in! It feels like a language that was made for exactly this type of project: building small, performant networking services with out-of-the-box concurrency and a great standard library.
• What would you have done differently if you had more time or resources?
- I would've fleshed out what the concurrent implementation would actually look like before I started working on it, and I would've also made a better effort to really understand Go's approach to concurrency.
• Are there any bottlenecks with your solution? If so, what are they and what can you do to fix them/minimize their impact?
- Yep. With an application like this you'd expect the performance hotspots to be primarily I/O and the overhead of managing concurrency, and performance profiling confirms that's the case.
The rate at which we can read the input stream from stdin is the first obvious bottleneck.
Sending data over the wire to/from the /users and /videos services is the next one.
Improvements:
- Reduce the amount of information we send back. The /users payload contains a lot of information that's seemingly not relevant to search queries; do we really need to index a user's last IP address or which language they speak?
- Consider a different encoding type. Decompressing from gzip is the largest memory hog in our application, and a cursory Google search suggests DEFLATE is about 40% faster than gzip (see the sketch after this list).
- Use an RPC protocol like Thrift or protobuf for communication among internal services. The payloads are typically much smaller, and encoding/decoding them is faster than with raw JSON requests.
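For a sense of what the encoding change would involve: gzip is a DEFLATE stream plus a header and CRC32 checksum, so (assuming the services could serve raw DEFLATE) the client-side change is small. A sketch:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

func main() {
	// Round-trip a payload through raw DEFLATE. gzip wraps this same
	// stream in a header plus a CRC32 checksum, framing that raw
	// DEFLATE skips entirely.
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.DefaultCompression)
	w.Write([]byte(`{"id": 42}`))
	w.Close()

	r := flate.NewReader(&buf)
	out, _ := io.ReadAll(r)
	r.Close()
	fmt.Println(string(out)) // {"id": 42}
}
```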
The last bottleneck is sending data to the /index service.
- There are also potential optimizations based on typical user behavior on the platform. Do users often upload multiple videos at once? Maybe it makes sense for the call to /index to be structured as:
{
  user: {user info}
  videos: []{collection of videos recently uploaded by user}
}
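Modeled in Go, that batched request body might look like this (field names are hypothetical, not the challenge's actual schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// IndexRequest batches one user with all of their recently uploaded
// videos, so a repeat uploader costs one /users fetch instead of N.
type IndexRequest struct {
	User   json.RawMessage   `json:"user"`
	Videos []json.RawMessage `json:"videos"`
}

func main() {
	req := IndexRequest{
		User: json.RawMessage(`{"id": 1}`),
		Videos: []json.RawMessage{
			json.RawMessage(`{"id": 10}`),
			json.RawMessage(`{"id": 11}`),
		},
	}
	b, _ := json.Marshal(req)
	fmt.Println(string(b))
}
```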
- Horizontally scaling the number of servers running the /users, /videos, and /index services would also help. I was starting to see dropped messages on my local machine when I jacked the number of threads up to several hundred, because the other backend services could not keep up with the workload.
- Optimizing memory usage. Escape analysis (we want as much data on the stack as possible; the stack gives us better cache hits due to locality of reference, no chasing pointers around the heap, and less memory fragmentation), pre-allocating and reusing fixed-size buffers/byte arrays (sketch below), taking advantage of byte alignment when defining structs, and using more performant libraries for JSON marshaling/unmarshaling and decompression.
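As one example, buffer reuse is a standard pattern with sync.Pool; a sketch (not profiled against this workload):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so each request doesn't allocate
// (and later garbage-collect) its own scratch space.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func handle(payload []byte) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // clear leftover contents from the previous request
	defer bufPool.Put(buf)

	buf.Write(payload)
	fmt.Println(buf.Len(), "bytes staged")
}

func main() {
	handle([]byte("hello"))
	handle([]byte("world"))
}
```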
- Send less data by questioning our assumptions: are there any queries on the index that could be served by a more conventional datastore? A query like "all videos from the Seattle region in the last month that are at least 5 minutes long, with the term 'foo' in their title or body" can have the range(video.date), exact(video.region), and gte(video.length, 5 minutes) filtering happen in a database, so our index only needs to be aware of the values of title and body.
- Cache the values of partially successful user,video pairs, so that if we attempt to fetch a video ID that fails but its user ID succeeds, we only have to fetch the video half when we retry the indexing later (sketch below).
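A sketch of that cache as a mutex-guarded map (the keys and payload shapes are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// partialCache remembers payloads for IDs that already succeeded, so a
// retry of a user,video pair only re-fetches the half that failed.
type partialCache struct {
	mu   sync.Mutex
	data map[string][]byte
}

func (c *partialCache) Put(id string, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[id] = payload
}

func (c *partialCache) Get(id string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	p, ok := c.data[id]
	return p, ok
}

func main() {
	cache := &partialCache{data: make(map[string][]byte)}
	cache.Put("user:1", []byte(`{"id": 1}`))
	if p, ok := cache.Get("user:1"); ok {
		fmt.Println("cache hit, skipping /users fetch:", string(p))
	}
}
```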
- Improve the error rate of the /users and /videos services (the biggest win), and of /index as well.
- From a UX perspective, since bottlenecks will always exist, we can be smart about prioritization so users are likely to find the results they want in a timely manner: encode type information for the behavior an event in the input stream matches, and prioritize based on that. E.g. events of type create(video) should have a higher weight than update(video.frame_rate); a sketch of that weighting follows.
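One way to express that weighting is a priority queue over event types via container/heap (the event kinds and weights below are made up for illustration):

```go
package main

import (
	"container/heap"
	"fmt"
)

// Hypothetical weights: creations matter more to searchers than
// metadata tweaks, so they should reach the index first.
var weights = map[string]int{
	"create_video":      10,
	"update_title":      5,
	"update_frame_rate": 1,
}

type event struct {
	kind string
	id   int
}

// eventQueue implements heap.Interface, ordered by descending weight.
type eventQueue []event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return weights[q[i].kind] > weights[q[j].kind] }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	n := len(old)
	e := old[n-1]
	*q = old[:n-1]
	return e
}

func main() {
	q := &eventQueue{{"update_frame_rate", 3}, {"create_video", 1}, {"update_title", 2}}
	heap.Init(q)
	for q.Len() > 0 {
		fmt.Println(heap.Pop(q).(event)) // create_video comes out first
	}
}
```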
• How would the system scale for a larger data set (1 billion+ or a never-ending stream) or to handle more complex queries or a higher volume of queries?
- Scaling for a larger data set:
The system starts to degrade in performance on larger data sets. Memory usage will likely be the biggest issue: GC pauses will be longer and more frequent. That's partially the nature of the problem (forwarding large amounts of data from one group of services to another) and partially due to implementation trade-offs made at a smaller scale. Naively retrying every failed request forever might work fine for a data set of several hundred thousand or a few million values, but it quickly becomes too expensive, and because we infinitely retry all failed operations there can be network congestion and increased average latency for "legitimate" requests.
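A cheap fix is bounding retries with exponential backoff instead of retrying forever; a sketch (the attempt limit and delays are arbitrary):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// withRetry tries op a bounded number of times, doubling the wait
// between attempts, instead of retrying failed requests forever.
func withRetry(op func() error, maxAttempts int) error {
	delay := 100 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		time.Sleep(delay)
		delay *= 2
	}
	return errors.New("giving up: all retries failed")
}

func main() {
	calls := 0
	err := withRetry(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	}, 5)
	fmt.Println(calls, err) // 3 <nil>
}
```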
- Scaling for more complex/higher volume of queries:
We're likely sending too much unneeded data to the data store anyway, so the size of indexed documents might become a problem as queries grow in volume and complexity; trimming documents down to just the fields the index actually needs (as described above) would keep both indexing and querying fast.
• Anything else you want to share about your solution or the problem :)
- This was a really fun project and I put a lot of effort into it. I hope that's reflected in the quality of the work.