
Try the following approaches:

  1. Simplify the problem: solve a simpler sub-problem first, then see whether the same technique applies or can be built up to the bigger problem. This usually implies DP.
  2. Start with a brute-force approach: get a solid but probably slow solution first, then optimize from there.
  3. When the solution seems complex, you don't have to come up with an algorithm for every piece. Instead, mock a piece with an empty stub function and implement it later (see the sketch after this list).
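
A minimal sketch of point 3 in Python. The problem (merging overlapping intervals) and the function names are illustrative, not from the notes above; the point is that merge_overlapping() can start life as an empty stub so the overall solution can be written and tested first.

```python
def solve(intervals):
    # Overall algorithm written first; the tricky piece is delegated
    # to merge_overlapping(), which started out as an empty stub.
    return merge_overlapping(sorted(intervals))

def merge_overlapping(sorted_intervals):
    # Initially just `return sorted_intervals` so solve() could be run
    # end to end; the real logic was filled in afterwards.
    merged = []
    for start, end in sorted_intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

print(solve([[8, 10], [1, 3], [2, 6], [15, 18]]))
# [[1, 6], [8, 10], [15, 18]]
```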

Test binary size and RAM usage correlate strongly with whether a test is flaky (assuming the test has already been made hermetic, i.e. it has no dependencies on external systems).

Compression is a good option when you have a lot of CPU to throw around and limited IO.

Data transferred between map and reduce servers is compressed. The idea is that because the servers aren't CPU bound, it makes sense to spend CPU on compression and decompression in order to save on bandwidth and I/O.
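
A minimal sketch of that trade-off in Python, using the standard-library zlib (the payload and compression level are illustrative):

```python
import zlib

# A payload with lots of redundancy, like log or shuffle data.
payload = b"GET /index.html 200 some-repeated-structure\n" * 10_000

# Spend CPU here to save bandwidth and I/O on the wire.
compressed = zlib.compress(payload, 6)
print(f"{len(payload)} bytes -> {len(compressed)} bytes")

# The receiving side pays CPU again to get the data back.
assert zlib.decompress(compressed) == payload
```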

Google File System has master servers and chunk servers (see the sketch after the bullets):

  • Master servers keep metadata about the various data files. Data are stored in the file system in 64 MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk servers that hold the data they need on disk.

  • Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
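
A minimal sketch of the read path in Python. The class and method names are hypothetical (GFS's real protocol is RPC-based); only the division of labor matches the description above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # files are stored as 64 MB chunks

class Master:
    """Keeps metadata only: maps (path, chunk index) -> replica locations."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def locate(self, path, offset):
        idx = offset // CHUNK_SIZE
        return idx, self.chunk_table[(path, idx)]

class ChunkServer:
    """Stores the actual chunk bytes (in memory here, on disk in reality)."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, path, idx):
        return self.chunks[(path, idx)]

# Client flow: metadata from the master, data directly from a replica.
server = ChunkServer({("/logs/a", 0): b"chunk bytes..."})
master = Master({("/logs/a", 0): [server, server, server]})  # 3 replicas

idx, replicas = master.locate("/logs/a", offset=0)
print(replicas[0].read("/logs/a", idx))
```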

One problem in distributed computing (e.g. MapReduce) is stragglers. A straggler is a computation running slower than the others, which holds everyone up. Stragglers may happen because of slow I/O (say, a bad controller) or a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one finishes, kill all the rest.
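
A minimal sketch of that pattern in Python (the MapReduce paper calls these redundant copies "backup tasks"); the simulated computation and timings are made up:

```python
import concurrent.futures
import random
import time

def computation(replica_id):
    # Simulate a task that is occasionally a straggler.
    time.sleep(random.choice([0.1, 0.1, 2.0]))
    return f"result from replica {replica_id}"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=3)
# Launch three redundant copies of the same computation.
futures = [pool.submit(computation, i) for i in range(3)]

# Take whichever copy finishes first...
done, pending = concurrent.futures.wait(
    futures, return_when=concurrent.futures.FIRST_COMPLETED
)
print(done.pop().result())

# ...and drop the stragglers. Python can't force-kill a running thread,
# so cancel() only stops not-yet-started work; a real system would send
# the straggler workers a kill signal.
for f in pending:
    f.cancel()
pool.shutdown(wait=False)
```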

Bad System Design Patterns

  1. You are afraid to make changes to the system because you know it is fragile. The root cause is dependencies and manual steps everywhere: to add a new data source or a new user, you have to remember to change configuration in several different services and update/insert records in databases.

  2. Having too many long-running jobs.

  3. Big chunks of data being moved around, sometimes repeatedly for the same data.

TODO - Add More

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

ETL

In my view, ETL is really two things. First, it is an extraction and data-cleanup process: essentially liberating data locked up in a variety of systems in the organization and removing any system-specific nonsense. Second, that data is restructured for data-warehousing queries (i.e. made to fit the type system of a relational DB, forced into a star or snowflake schema, perhaps broken up into a high-performance column format, etc.). Conflating these two things is a problem. The clean, integrated repository of data should also be available in real time for low-latency processing and for indexing in other real-time storage systems.

I think this has the added benefit of making data warehousing ETL much more organizationally scalable. The classic problem of the data warehouse team is that they are responsible for collecting and cleaning all the data generated by every other team in the organization.
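
A minimal sketch of that separation in Python. The record type, fields, and schema are all hypothetical (nothing here comes from the linked article's code); the point is that extraction/cleanup produces a clean, system-agnostic record that many consumers can use, while the warehouse-specific restructuring is a separate step owned by the warehouse pipeline.

```python
from dataclasses import dataclass

# Step 1, extraction and cleanup: liberate the data from the source
# system into a clean, canonical record usable by the warehouse, stream
# processors, and indexes alike.
@dataclass
class PageView:
    user_id: str
    url: str
    timestamp_ms: int

def extract(raw_log_line: str) -> PageView:
    # Strip source-specific noise; keep only the canonical fields.
    user_id, url, ts = raw_log_line.strip().split("\t")
    return PageView(user_id=user_id, url=url, timestamp_ms=int(ts))

# Step 2, warehouse loading: restructure the clean record for warehouse
# queries (here, a toy star-schema fact row).
def to_fact_row(view: PageView) -> dict:
    return {
        "user_key": view.user_id,
        "url_key": view.url,
        "date_key": view.timestamp_ms // 86_400_000,  # days since epoch
    }

print(to_fact_row(extract("u42\thttps://example.com\t1700000000000")))
```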

Some general rules of unit testing:

  • A testing unit should focus on one tiny bit of functionality and prove it correct.
  • Each test unit must be fully independent. Each test must be able to run alone, and also within the test suite, regardless of the order that they are called. The implication of this rule is that each test must be loaded with a fresh dataset and may have to do some cleanup afterwards. This is usually handled by setUp() and tearDown() methods.
  • Try hard to make tests that run fast. If one single test needs more than a few milliseconds to run, development will be slowed down or the tests will not be run as often as is desirable. In some cases, tests can’t be fast because they need a complex data structure to work on, and this data structure must be loaded every time the test runs. Keep these heavier tests in a separate test suite that is run by some scheduled task, and run all other tests as often as needed.
  • Learn your tools and learn how to run a single test or a test case. Then, when developing a function inside a module, run that function's tests frequently, ideally automatically when you save the code. A sketch illustrating these rules follows.
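
A minimal sketch of these rules with Python's unittest (the test class and tested behavior are made up; setUp()/tearDown() are the standard hooks mentioned above):

```python
import unittest

class TestShoppingList(unittest.TestCase):
    def setUp(self):
        # Fresh fixture for every test, so tests never depend on order.
        self.items = []

    def tearDown(self):
        # Cleanup after each test (trivial here; real tests might close
        # files or database connections).
        self.items.clear()

    def test_append_adds_one_item(self):
        # One tiny bit of functionality per test.
        self.items.append("milk")
        self.assertEqual(len(self.items), 1)

    def test_starts_empty(self):
        # Passes whether or not test_append_adds_one_item ran first.
        self.assertEqual(self.items, [])

if __name__ == "__main__":
    unittest.main()
```

To run a single test rather than the whole suite: python -m unittest test_shopping.TestShoppingList.test_starts_empty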