Peijie Hu peijiehu

@peijiehu
peijiehu / cassandra_index.md
Created August 26, 2018 22:15
cassandra_index

When to Use An Index?

Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have for querying and maintaining the index. For example, suppose you had a playlists table with a billion songs and wanted to look up songs by the artist. Many songs will share the same column value for artist. The artist column is a good candidate for an index.

When Not to Use An Index?

Do not use an index in these situations:

  • On high-cardinality columns because you then query a huge volume of records for a small number of results
  • In tables that use a counter column
  • On a frequently updated or deleted column
  • To look for a row in a large partition unless narrowly queried
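
The cardinality trade-off described above can be illustrated with a toy in-memory model. This is only a sketch: it treats an index as a plain dictionary from column value to row ids, which is not how Cassandra stores its indexes, and the names `build_index`, `rows_per_entry`, and `songs` are made up for the example.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each distinct value of `column` to the ids of the rows that hold it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        index[row[column]].append(row_id)
    return index


def rows_per_entry(index):
    """Average number of rows found per index entry."""
    return sum(len(ids) for ids in index.values()) / len(index)


# Twelve songs shared among three artists: `artist` is low-cardinality, `id` is unique.
songs = [
    {"id": i, "artist": f"artist-{i % 3}", "title": f"song-{i}"}
    for i in range(12)
]

artist_index = build_index(songs, "artist")  # few entries, many rows each
id_index = build_index(songs, "id")          # one entry per row
```

In this model the `artist` index has three entries that each lead to four rows, while the `id` index needs one entry per row, so it is as large as the table and every lookup yields a single row.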
sign = -1 if ls[0] == '-' else 1
max(-2**31, min(2**31-1, sign * res))
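
The two lines above are a fragment: `ls` and `res` are not defined in the preview. A minimal sketch of how they might fit into a complete routine, assuming `ls` holds the characters of the trimmed input and `res` is the parsed magnitude (both assumptions, as is the function name `clamp_parse`):

```python
def clamp_parse(s: str) -> int:
    """Parse a leading signed integer from `s` and clamp it to the signed 32-bit range."""
    ls = list(s.strip())  # assumption: `ls` is the list of characters of the trimmed input
    if not ls:
        return 0
    sign = -1 if ls[0] == '-' else 1
    if ls[0] in '+-':
        del ls[0]  # drop the sign character before reading digits
    res = 0  # assumption: `res` accumulates the unsigned magnitude
    for ch in ls:
        if not ('0' <= ch <= '9'):
            break  # stop at the first non-digit character
        res = res * 10 + int(ch)
    return max(-2**31, min(2**31 - 1, sign * res))
```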

With the Microservices pattern, a client may need data from multiple different microservices. If the client called each microservice directly, that could contribute to longer load times, since the client would have to make a network request for each microservice called. Moreover, having the client call each microservice directly ties the client to that microservice - if the internal implementations of the microservices change (for example, if two microservices are combined sometime in the future) or if the location (host and port) of a microservice changes, then every client that makes use of those microservices must be updated.

The intent of the API Gateway pattern is to alleviate some of these issues. In the API Gateway pattern, an additional entity (the API Gateway) is placed between the client and the microservices. The job of the API Gateway is to aggregate the calls to the microservices. Rather than the client calling each microservice individually, the client calls the API Gateway a single time. The A
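
A minimal sketch of the aggregation idea, assuming two hypothetical backend services (`fetch_profile` and `fetch_orders`) that are simulated here with `asyncio.sleep` instead of real network calls. The client makes one call to the gateway, and the gateway fans out to the services concurrently.

```python
import asyncio


# Hypothetical microservices, simulated in-process; real ones would be network calls.
async def fetch_profile(user_id):
    await asyncio.sleep(0.01)
    return {"user_id": user_id, "name": "example"}


async def fetch_orders(user_id):
    await asyncio.sleep(0.01)
    return [{"order_id": 1}, {"order_id": 2}]


async def gateway_user_summary(user_id):
    """Gateway endpoint: call both services concurrently and merge the results."""
    profile, orders = await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
    )
    return {"profile": profile, "orders": orders}


def handle_request(user_id):
    """What the client sees: a single call that returns the aggregated response."""
    return asyncio.run(gateway_user_summary(user_id))
```

Because the client only knows about `handle_request`, the services behind it can be merged, split, or moved without changing the client.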

Jetty is a lightweight servlet container that serves as both an HTTP server and an application server (similar to Tomcat, but lighter). It is easy to embed within a Java application, and there is also an easy-to-use Jetty client.

Netty is an asynchronous, event-driven network application framework. With Netty you can, for example, write your own servlet container or HTTP client. It works at the transport layer and, unlike Jetty, is not limited to HTTP.

Non-blocking, asynchronous HTTP client APIs are well suited to large content downloads, to parallel processing of requests and responses, and to cases where performance and efficient use of threads and other resources are key factors.

Threads are relatively expensive resources in an operating system. Each thread needs memory for the stack (which can be for example 2 MB in size). When you create thousands of threads, this is going to cost a lot of memory; also, operating systems have limits on the number of threads that can be created. So you don't want to start a new thread
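
One common way to avoid creating a thread per task is a fixed-size thread pool. A minimal sketch using Python's standard `concurrent.futures.ThreadPoolExecutor`, where the pool size of 4 and the names `handle` and `run_tasks` are arbitrary choices for the example:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(task_id):
    """Pretend to process one task and report which worker thread ran it."""
    return task_id, threading.current_thread().name


def run_tasks(n_tasks, max_workers=4):
    """Run n_tasks on at most max_workers reusable threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though tasks run concurrently.
        return list(pool.map(handle, range(n_tasks)))


results = run_tasks(1000)
distinct_threads = {name for _, name in results}
```

A thousand tasks are processed here, but `distinct_threads` never holds more than four names, because the pool reuses its workers instead of starting a new thread for each task.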

tomcat_pid() {
  echo $(ps -fe | grep $CATALINA_BASE | grep -v grep | tr -s " " | cut -d" " -f2)
}
start() {
  pid=$(tomcat_pid)
  if [ -n "$pid" ]
  then
    echo -e "\e[00;31mTomcat is already running (pid: $pid)\e[00m"
  else

Data jiujitsu: When building a data product, start small and verify that the users like the idea before investing too much to get a perfect product. This is reiterated in “Rules of Machine Learning: Best Practices for ML Engineering” by Martin Zinkevich.

Hadoop joins: If your offline flow is taking a long time to converge, it might be that you are doing massive joins.

Use hybrid: When building an online recommendation service, consider using a hybrid solution to precompute some parts of the computation in order to speed up your service. We have other successful systems at LinkedIn that follow a similar approach, including our Ads ML system.

peijiehu / columnar_db.md
Created July 16, 2017 23:39
columnar db - definition and use cases

How do columnar databases work? The defining concept of a column-store is that the values of a table are stored contiguously by column. Thus the classic supplier table from CJ Date's supplier and parts database:

SNO  STATUS  CITY    SNAME
S1   20      London  Smith
S2   10      Paris   Jones
S3   30      Paris   Blake
S4   20      London  Clark
S5   30      Athens  Adams

would be stored on disk or in memory something like:
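
A minimal Python sketch of that column-wise layout, assuming plain lists stand in for the on-disk or in-memory storage (the names `to_column_store` and `store` are illustrative, not from the original):

```python
# The supplier table in row-oriented form: one tuple per row.
columns = ("SNO", "STATUS", "CITY", "SNAME")
rows = [
    ("S1", 20, "London", "Smith"),
    ("S2", 10, "Paris", "Jones"),
    ("S3", 30, "Paris", "Blake"),
    ("S4", 20, "London", "Clark"),
    ("S5", 30, "Athens", "Adams"),
]


def to_column_store(rows, columns):
    """Regroup row tuples so that each column's values are stored contiguously."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


store = to_column_store(rows, columns)

# A query that filters on one column only has to scan that column's list.
paris_suppliers = [
    store["SNO"][i] for i, city in enumerate(store["CITY"]) if city == "Paris"
]
```

Scanning `store["CITY"]` touches only city values, which is why column stores tend to suit analytic queries that read a few columns across many rows.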

peijiehu / nosql.md
Last active July 16, 2017 23:09
When to use NoSQL (e.g. MongoDB)

Start by asking yourself the following questions:

  1. Is your data format not well defined and likely to change, for example by adding fields that you don't know about yet? Adding a field in an RDBMS may lock the database or affect performance.
  2. Is the data going to be big? A 5-10 GB table in MySQL may not work efficiently.
  3. Is there a high insert load?
  4. What is the nature of the data? For example, MongoDB has support for geo data that you want to query by location.

MORE ON: dzone post - when to use mongo rather than mysql

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.[1] The purpose of these statistics may be to:

  1. Find out whether existing data can be easily used for other purposes
  2. Improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category
  3. Assess data quality, including whether the data conforms to particular standards or patterns[2]
  4. Assess the risk involved in integrating data in new applications, including the challenges of joins
  5. Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
  6. Assess whether known metadata accurately describes the actual values in the source database
  7. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late i
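
A minimal sketch of the kind of per-column statistics listed above, assuming the source is a list of dictionaries with identical keys. The function name `profile_columns` and the sample rows are made up for illustration.

```python
def profile_columns(rows):
    """Collect simple per-column statistics: null count, distinct count, key candidacy."""
    report = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        distinct = set(non_null)
        report[col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(distinct),
            # A key candidate has no nulls and no repeated values.
            "key_candidate": len(non_null) == len(values)
            and len(distinct) == len(values),
        }
    return report


sample = [
    {"sno": "S1", "status": 20, "city": "London"},
    {"sno": "S2", "status": 10, "city": "Paris"},
    {"sno": "S3", "status": 30, "city": None},
    {"sno": "S4", "status": 20, "city": "London"},
]
report = profile_columns(sample)
```

Here `sno` is flagged as a key candidate, while `city` is not, because it contains a null and a repeated value.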