
Try the following approaches:

  1. Simplify the problem: solve a simpler sub-problem first, then see whether the same technique applies or can be built up to the bigger problem. This usually implies DP.
  2. Start with a brute-force approach: get a solid but probably slow solution first, then optimize from there.
  3. When the solution seems complex, you don't have to come up with an algorithm for every piece. Instead, mock a piece with an empty stub function and implement it later (see the sketch after this list).
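
A minimal sketch of point 3 in Python. The problem (merging overlapping intervals) and the function names are illustrative, not from the notes above; the point is that merge_overlapping() can start life as an empty stub so the overall solution can be written and tested first.

```python
def solve(intervals):
    # Overall algorithm written first; the tricky piece is delegated
    # to merge_overlapping(), which started out as an empty stub.
    return merge_overlapping(sorted(intervals))

def merge_overlapping(sorted_intervals):
    # Initially just `return sorted_intervals` so solve() could be run
    # end to end; the real logic was filled in afterwards.
    merged = []
    for start, end in sorted_intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

print(solve([[8, 10], [1, 3], [2, 6], [15, 18]]))
# [[1, 6], [8, 10], [15, 18]]
```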

Test binary size and RAM usage correlate strongly with whether a test is flaky (assuming the test has already been made hermetic, i.e. it has no dependencies on external systems).

Compression is a good option when you have a lot of CPU to throw around and limited IO.

Data transferred between map and reduce servers is compressed. The idea is that because the servers aren't CPU bound, it makes sense to spend CPU on compression and decompression in order to save on bandwidth and I/O.
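
A minimal sketch of that trade-off in Python, using the standard-library zlib (the payload and compression level are illustrative):

```python
import zlib

# A payload with lots of redundancy, like log or shuffle data.
payload = b"GET /index.html 200 some-repeated-structure\n" * 10_000

# Spend CPU here to save bandwidth and I/O on the wire.
compressed = zlib.compress(payload, 6)
print(f"{len(payload)} bytes -> {len(compressed)} bytes")

# The receiving side pays CPU again to get the data back.
assert zlib.decompress(compressed) == payload
```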

Google File System has master servers and chunk servers (see the sketch after the bullets):

  • Master servers keep metadata about the various data files. Data are stored in the file system in 64 MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk servers that hold the data they need on disk.

  • Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
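
A minimal sketch of the read path in Python. The class and method names are hypothetical (GFS's real protocol is RPC-based); only the division of labor matches the description above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # files are stored as 64 MB chunks

class Master:
    """Keeps metadata only: maps (path, chunk index) -> replica locations."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def locate(self, path, offset):
        idx = offset // CHUNK_SIZE
        return idx, self.chunk_table[(path, idx)]

class ChunkServer:
    """Stores the actual chunk bytes (in memory here, on disk in reality)."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, path, idx):
        return self.chunks[(path, idx)]

# Client flow: metadata from the master, data directly from a replica.
server = ChunkServer({("/logs/a", 0): b"chunk bytes..."})
master = Master({("/logs/a", 0): [server, server, server]})  # 3 replicas

idx, replicas = master.locate("/logs/a", offset=0)
print(replicas[0].read("/logs/a", idx))
```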

One problem in distributed computing (e.g. MapReduce) is stragglers. A straggler is a computation running slower than the others, which holds everyone up. Stragglers may happen because of slow I/O (say, a bad controller) or a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one finishes, kill all the rest.
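
A minimal sketch of that pattern in Python (the MapReduce paper calls these redundant copies "backup tasks"); the simulated computation and timings are made up:

```python
import concurrent.futures
import random
import time

def computation(replica_id):
    # Simulate a task that is occasionally a straggler.
    time.sleep(random.choice([0.1, 0.1, 2.0]))
    return f"result from replica {replica_id}"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=3)
# Launch three redundant copies of the same computation.
futures = [pool.submit(computation, i) for i in range(3)]

# Take whichever copy finishes first...
done, pending = concurrent.futures.wait(
    futures, return_when=concurrent.futures.FIRST_COMPLETED
)
print(done.pop().result())

# ...and drop the stragglers. Python can't force-kill a running thread,
# so cancel() only stops not-yet-started work; a real system would send
# the straggler workers a kill signal.
for f in pending:
    f.cancel()
pool.shutdown(wait=False)
```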

Bad System Design Patterns

  1. You are afraid to make changes to the system because you know it is fragile. The root cause is dependencies and manual steps everywhere: to add a new data source or a new user, you have to remember to change configuration in several different services and update/insert records in databases.

  2. Having too many long-running jobs.

  3. Big chunks of data being moved around, sometimes repeatedly for the same data.

TODO - Add More

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

ETL

In my view, ETL is really two things. First, it is an extraction and data-cleanup process: essentially liberating data locked up in a variety of systems in the organization and removing any system-specific nonsense. Second, that data is restructured for data-warehousing queries (i.e. made to fit the type system of a relational DB, forced into a star or snowflake schema, perhaps broken up into a high-performance column format, etc.). Conflating these two things is a problem. The clean, integrated repository of data should also be available in real time for low-latency processing and for indexing in other real-time storage systems.

I think this has the added benefit of making data warehousing ETL much more organizationally scalable. The classic problem of the data warehouse team is that they are responsible for collecting and cleaning all the data generated by every other team in the organization.
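
A minimal sketch of that separation in Python. The record type, fields, and schema are all hypothetical (nothing here comes from the linked article's code); the point is that extraction/cleanup produces a clean, system-agnostic record that many consumers can use, while the warehouse-specific restructuring is a separate step owned by the warehouse pipeline.

```python
from dataclasses import dataclass

# Step 1, extraction and cleanup: liberate the data from the source
# system into a clean, canonical record usable by the warehouse, stream
# processors, and indexes alike.
@dataclass
class PageView:
    user_id: str
    url: str
    timestamp_ms: int

def extract(raw_log_line: str) -> PageView:
    # Strip source-specific noise; keep only the canonical fields.
    user_id, url, ts = raw_log_line.strip().split("\t")
    return PageView(user_id=user_id, url=url, timestamp_ms=int(ts))

# Step 2, warehouse loading: restructure the clean record for warehouse
# queries (here, a toy star-schema fact row).
def to_fact_row(view: PageView) -> dict:
    return {
        "user_key": view.user_id,
        "url_key": view.url,
        "date_key": view.timestamp_ms // 86_400_000,  # days since epoch
    }

print(to_fact_row(extract("u42\thttps://example.com\t1700000000000")))
```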

Some general rules of unit testing:

  • A testing unit should focus on one tiny bit of functionality and prove it correct.
  • Each test unit must be fully independent. Each test must be able to run alone, and also within the test suite, regardless of the order that they are called. The implication of this rule is that each test must be loaded with a fresh dataset and may have to do some cleanup afterwards. This is usually handled by setUp() and tearDown() methods.
  • Try hard to make tests that run fast. If one single test needs more than a few milliseconds to run, development will be slowed down or the tests will not be run as often as is desirable. In some cases, tests can’t be fast because they need a complex data structure to work on, and this data structure must be loaded every time the test runs. Keep these heavier tests in a separate test suite that is run by some scheduled task, and run all other tests as often as needed.
  • Learn your tools and learn how to run a single test or a test case. Then, when developing a function inside a module, run that function's tests frequently, ideally automatically when you save the code. A sketch illustrating these rules follows.
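
A minimal sketch of these rules with Python's unittest (the test class and tested behavior are made up; setUp()/tearDown() are the standard hooks mentioned above):

```python
import unittest

class TestShoppingList(unittest.TestCase):
    def setUp(self):
        # Fresh fixture for every test, so tests never depend on order.
        self.items = []

    def tearDown(self):
        # Cleanup after each test (trivial here; real tests might close
        # files or database connections).
        self.items.clear()

    def test_append_adds_one_item(self):
        # One tiny bit of functionality per test.
        self.items.append("milk")
        self.assertEqual(len(self.items), 1)

    def test_starts_empty(self):
        # Passes whether or not test_append_adds_one_item ran first.
        self.assertEqual(self.items, [])

if __name__ == "__main__":
    unittest.main()
```

To run a single test rather than the whole suite: python -m unittest test_shopping.TestShoppingList.test_starts_empty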