Peijie Hu peijiehu

## cassandra_index.md

      
            
            
Created August 26, 2018 22:15

cassandra_index

When to Use An Index?

Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have for querying and maintaining the index. For example, suppose you had a playlists table with a billion songs and wanted to look up songs by the artist. Many songs will share the same column value for artist. The artist column is a good candidate for an index.

When Not to Use An Index?

Do not use an index in these situations:

- On high-cardinality columns, because you then query a huge volume of records for a small number of results
- In tables that use a counter column
- On a frequently updated or deleted column
- To look for a row in a large partition unless narrowly queried
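
The cardinality guidance above can be turned into a rough pre-check before adding an index. The sketch below is only illustrative: the function names and the `max_distinct_ratio` threshold are assumptions made for this example, not part of Cassandra, and a small sample of column values stands in for real table data.

```python
def distinct_ratio(values):
    """Return the share of distinct values in a sample of one column."""
    if not values:
        raise ValueError("need at least one sampled value")
    return len(set(values)) / len(values)


def looks_like_index_candidate(values, max_distinct_ratio=0.1):
    # Hypothetical rule of thumb: many rows per indexed value (low
    # cardinality) suits a built-in index; mostly unique values do not.
    return distinct_ratio(values) <= max_distinct_ratio


# Many songs share an artist; almost no songs share an ID.
artists = ["Artist A"] * 60 + ["Artist B"] * 40
song_ids = [f"song-{i}" for i in range(100)]

print(looks_like_index_candidate(artists))   # low cardinality
print(looks_like_index_candidate(song_ids))  # high cardinality
```

The threshold is a tuning knob for the sketch, not a documented limit; the point is only that the ratio of distinct values to rows is what drives the index overhead described above.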


## max_min_int.py
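# Take the sign from the first character; `ls` (the input) and `res` (the parsed magnitude) are defined elsewhere.
sign = -1 if ls[0] == '-' else 1
# Clamp the signed result to the 32-bit signed integer range.
max(-2**31, min(2**31-1, sign * res))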
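
For context, the two lines above look like the tail of a string-to-integer routine. A minimal self-contained sketch of such a routine follows, assuming `ls` is the stripped input string and `res` is the magnitude accumulated from its leading digits; the function names `parse_int32` and `clamp_to_int32` are hypothetical and not taken from the original snippet.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def clamp_to_int32(value):
    """Clamp value to the signed 32-bit range, as in the snippet above."""
    return max(INT_MIN, min(INT_MAX, value))


def parse_int32(text):
    """Parse an optional sign and leading digits, then clamp the result."""
    ls = text.strip()
    if not ls:
        return 0
    sign = -1 if ls[0] == '-' else 1
    digits = ls[1:] if ls[0] in '+-' else ls
    res = 0
    for ch in digits:
        if not ('0' <= ch <= '9'):
            break  # stop at the first non-digit character
        res = res * 10 + int(ch)
    return clamp_to_int32(sign * res)
```

Because Python integers do not overflow, the clamp can be applied once at the end instead of checking for overflow on every digit.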

## distributed-computing-concepts.md

      
            
            
Last active October 21, 2017 16:40

Curated Distributed Computing Concepts

Introduction


Distributed Computing and Parallel Computing
http://wla.berkeley.edu/~cs61a/fa11/lectures/communication.html
A comprehensive overview of the fundamentals and characteristics of distributed systems; the "8 Fallacies" are especially interesting, and it closes with design principles
http://www.hpcs.cs.tsukuba.ac.jp/~tatebe/lecture/h23/dsys/dsd-tutorial.html

RPC

https://en.wikipedia.org/wiki/Remote_procedure_call

  
## api-gateway.md

      
            
            
Created September 22, 2017 19:50

With the microservices pattern, a client may need data from several different microservices. If the client called each microservice directly, load times could grow, since the client would have to make a network request for each microservice called. Moreover, having the client call each microservice directly ties the client to that microservice: if the internal implementations of the microservices change (for example, if two microservices are combined sometime in the future) or if the location (host and port) of a microservice changes, then every client that makes use of those microservices must be updated.
The intent of the API Gateway pattern is to alleviate some of these issues. In the API Gateway pattern, an additional entity (the API Gateway) is placed between the client and the microservices. The job of the API Gateway is to aggregate the calls to the microservices. Rather than the client calling each microservice individually, the client calls the API Gateway a single time. The A
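
The aggregation role described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `user_service` and `order_service` are hypothetical in-process stand-ins for real microservices (which would normally be reached over the network), and the `ApiGateway` class is invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for two microservices.
def user_service(user_id):
    return {"id": user_id, "name": "example user"}


def order_service(user_id):
    return [{"order_id": 1, "user_id": user_id}]


class ApiGateway:
    """Single entry point that fans out to the backing services."""

    def __init__(self, routes):
        # Only the gateway knows which services exist and where they live.
        self._routes = dict(routes)

    def fetch(self, user_id):
        # Call every backing service concurrently and merge the results.
        with ThreadPoolExecutor(max_workers=len(self._routes)) as pool:
            futures = {
                name: pool.submit(service, user_id)
                for name, service in self._routes.items()
            }
            return {name: future.result() for name, future in futures.items()}


gateway = ApiGateway({"user": user_service, "orders": order_service})
response = gateway.fetch(7)  # one client call instead of two
```

If two services were later merged or moved, only the `routes` mapping inside the gateway would need to change; the client would keep calling `fetch`.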

  
## Jetty-vs-Netty.md

      
            
            
Last active September 21, 2017 00:31

Jetty is a lightweight servlet container that serves as both an HTTP server and an application server (similar to Tomcat, but lighter). It is easy to embed within a Java application, and there is also an easy-to-use Jetty client.
Netty is an asynchronous, event-driven network application framework. For example, you can write your own servlet container or HTTP client application with the help of Netty. It operates at the transport level and, unlike Jetty, is not limited to HTTP.
Non-blocking, asynchronous HTTP client APIs are well suited to large content downloads, to parallel processing of requests and responses, and to cases where performance and efficient use of threads and resources are key factors.
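
The non-blocking style described above can be illustrated outside the JVM as well. The sketch below uses Python's `asyncio` rather than Jetty or Netty, so it shows the general idea only: `fake_download` is a hypothetical coroutine in which `asyncio.sleep` stands in for non-blocking network I/O.

```python
import asyncio


async def fake_download(name, delay):
    # asyncio.sleep yields control to the event loop, the way a
    # non-blocking socket read would while waiting for data.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def download_all(names):
    # All downloads are in flight at once on a single thread.
    tasks = [fake_download(name, 0.01) for name in names]
    return await asyncio.gather(*tasks)


results = asyncio.run(download_all(["a", "b", "c"]))
print(results)
```

No extra threads are started here; the event loop interleaves the three waits, which is the resource argument made in the surrounding text.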
Threads are relatively expensive resources in an operating system. Each thread needs memory for the stack (which can be for example 2 MB in size). When you create thousands of threads, this is going to cost a lot of memory; also, operating systems have limits on the number of threads that can be created. So you don't want to start a new thread

  
## tomcat_initscript.sh
tomcat_pid() {
  # Print the PID of the process whose command line mentions $CATALINA_BASE.
  ps -fe | grep "$CATALINA_BASE" | grep -v grep | tr -s " " | cut -d" " -f2
}

start() {
  pid=$(tomcat_pid)
  if [ -n "$pid" ]
  then
    echo -e "\e[00;31mTomcat is already running (pid: $pid)\e[00m"
  else

## building_data_product.md

      
            
            
Created August 20, 2017 05:43

Takeaways from https://engineering.linkedin.com/blog/2017/08/scaling-contextual-conversation-suggestions-over-linkedins-graph

- Data jiujitsu: When building a data product, start small and verify that the users like the idea before investing too much to get a perfect product. This is reiterated in “Rules of Machine Learning: Best Practices for ML Engineering” by Martin Zinkevich.
- Hadoop joins: If your offline flow is taking a long time to converge, it might be that you are doing massive joins.
- Use hybrid: When building an online recommendation service, consider using a hybrid solution to precompute some parts of the computation in order to speed up your service. We have other successful systems at LinkedIn that follow a similar approach, including our Ads ML system.
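
The hybrid idea in the last takeaway can be sketched as an offline step that precomputes the expensive part and an online step that only applies a cheap adjustment. Everything below is a hypothetical illustration: the function names, the scores, and the additive `context_boost` are assumptions for the example and do not describe LinkedIn's actual system.

```python
def precompute_candidates(connections):
    """Offline step: rank each member's candidates by a static score."""
    return {
        member: sorted(scored, key=lambda pair: pair[1], reverse=True)
        for member, scored in connections.items()
    }


def suggest(precomputed, member, context_boost, k=2):
    """Online step: adjust precomputed scores with request-time context."""
    candidates = precomputed.get(member, [])
    rescored = [
        (name, score + context_boost.get(name, 0.0))
        for name, score in candidates
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in rescored[:k]]


connections = {"alice": [("bob", 0.2), ("carol", 0.5), ("dave", 0.4)]}
precomputed = precompute_candidates(connections)  # run ahead of time

print(suggest(precomputed, "alice", {}))            # static ranking only
print(suggest(precomputed, "alice", {"bob": 0.5}))  # context changes the order
```

The online call touches only one member's short candidate list, which is why moving the heavy ranking offline can speed up the service.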

  
## columnar_db.md

      
            
            
Created July 16, 2017 23:39

columnar db - definition and use cases

How do columnar databases work?

The defining concept of a column-store is that the values of a table are stored contiguously by column. Thus the classic supplier table from CJ Date's supplier and parts database:

| SNO | STATUS | CITY   | SNAME |
|-----|--------|--------|-------|
| S1  | 20     | London | Smith |
| S2  | 10     | Paris  | Jones |
| S3  | 30     | Paris  | Blake |
| S4  | 20     | London | Clark |
| S5  | 30     | Athens | Adams |

would be stored on disk or in memory something like:
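
The original layout is cut off at this point. As a rough illustration only (not a reconstruction of the missing text), the sketch below pivots the supplier rows into one list per column, which is the essence of storing values contiguously by column. The `column_store` dictionary and the `average_status` helper are assumptions made for this example.

```python
# Row-oriented view: one tuple per supplier.
rows = [
    ("S1", 20, "London", "Smith"),
    ("S2", 10, "Paris", "Jones"),
    ("S3", 30, "Paris", "Blake"),
    ("S4", 20, "London", "Clark"),
    ("S5", 30, "Athens", "Adams"),
]
columns = ("SNO", "STATUS", "CITY", "SNAME")

# Column-oriented view: all values of a column sit together.
column_store = {
    name: [row[index] for row in rows]
    for index, name in enumerate(columns)
}


def average_status(store):
    # An aggregate over one column reads only that column's values.
    status = store["STATUS"]
    return sum(status) / len(status)


print(column_store["CITY"])
print(average_status(column_store))
```

An aggregate such as the average status never touches the SNO, CITY, or SNAME values, which is the access pattern column stores are built for.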

  
## nosql.md

      
            
            
Last active July 16, 2017 23:09

When to use NoSQL (e.g. MongoDB)

Start by asking yourself the following questions:

- Is your data format loosely defined and likely to change, for example by adding fields that you cannot anticipate yet? Adding a field in an RDBMS may lock the database or affect performance.
- Is the data going to be big? A 5-10 GB table in MySQL may not work efficiently.
- Do you expect a high insert load?
- What is the nature of the data? For example, MongoDB has support for geo data that you want to query by location.

More on this:
DZone post on when to use MongoDB rather than MySQL

  
## data_profiling.md

      
            
            
Last active July 6, 2017 05:35

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.[1] The purpose of these statistics may be to:


- Find out whether existing data can be easily used for other purposes
- Improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category
- Assess data quality, including whether the data conforms to particular standards or patterns[2]
- Assess the risk involved in integrating data in new applications, including the challenges of joins
- Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
- Assess whether known metadata accurately describes the actual values in the source database
- Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late i
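
A few of the statistics listed above (null counts, value distributions, key candidates) can be collected with very little code. The sketch below is a minimal illustration for a single column held in memory; the `profile_column` function and the fields it returns are assumptions made for this example, not a standard profiling API.

```python
from collections import Counter


def profile_column(values):
    """Collect simple summary statistics for one column of values."""
    non_null = [value for value in values if value is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
        # A key candidate has no nulls and no repeated values.
        "is_key_candidate": bool(values)
        and len(non_null) == len(values)
        and len(counts) == len(values),
    }


city_profile = profile_column(["Paris", "London", "Paris", None])
id_profile = profile_column([1, 2, 3])
print(city_profile)
print(id_profile)
```

Running such a function over every column of a source table gives an early view of data quality before the data is integrated into a new application.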