jianminchen/Design twitter

## Design twitter
Design a Twitter. Start with fewer than 1000 users: RESTful API, post tweet, follow, followers can see the timeline, hashtags, search by hashtag -> then scalability and bottlenecks.

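API
authentication: JWT token

encode user_id plus expiry information into the token (a JWT is normally signed rather than encrypted); the token stays valid for about 1 week
define a secret key (random string) used for signing
if the token cannot be verified or has expired, it is invalid

we store the token in Redis after the user authenticates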
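
A minimal sketch of the token idea above, assuming an HMAC-signed payload instead of a real JWT library. The names `SECRET_KEY`, `issue_token` and `verify_token` are hypothetical, and storing the token in Redis is left out.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"random-secret-string"  # assumption: loaded from config in practice
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60   # token lifetime from the notes


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    """Inverse of _b64: restore the padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).digest())


def issue_token(user_id: int, now: Optional[float] = None) -> str:
    """Return 'payload.signature' carrying user_id and an expiry time."""
    now = time.time() if now is None else now
    claims = {"user_id": user_id, "exp": int(now) + ONE_WEEK_SECONDS}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[int]:
    """Return the user_id if the token is authentic and not expired, else None."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
        if not hmac.compare_digest(signature, _sign(payload)):
            return None  # tampered or signed with another key
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None  # malformed token
    if claims["exp"] < now:
        return None  # expired
    return claims["user_id"]
```

The signature check runs before the payload is parsed, so a modified token is rejected without trusting its contents.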

HTTP protocol
register(username, password)
login(username, password)
post_tweet(user_id, content)
follow(user_id, followed_user_id)
get_recent_timeline(user_id, start, end) returns an array of tweets
search(hashtag) returns an array of tweets

status codes:
200 OK
500 internal server error
400 invalid request parameter

body: {
  message: "ok",
  error_code: 23,
  error_message: ""
}

Table
User table
id, username, password, created_at

Tweet table
id, user_id, content, created_at

Hashtag table
id, tweet_id, hashtag
1,  1,        hashtag1
2,  1,        hashtag2


Follower table
user_id, follow_id (both ids are foreign keys to the User table), created_at
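
The four tables can be sketched as a schema. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the column types, the index and the helper names `create_db` and `search_hashtag` are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    password TEXT NOT NULL,  -- assumption: a password hash, not plaintext
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tweet (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES user(id),
    content TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE hashtag (
    id INTEGER PRIMARY KEY,
    tweet_id INTEGER NOT NULL REFERENCES tweet(id),
    hashtag TEXT NOT NULL
);
CREATE INDEX idx_hashtag ON hashtag(hashtag);
CREATE TABLE follower (
    user_id INTEGER NOT NULL REFERENCES user(id),
    follow_id INTEGER NOT NULL REFERENCES user(id),
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, follow_id)
);
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with the four tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search_hashtag(conn: sqlite3.Connection, tag: str):
    """Return (tweet_id, content) rows for an exact hashtag match, newest id first."""
    rows = conn.execute(
        "SELECT t.id, t.content FROM tweet t "
        "JOIN hashtag h ON h.tweet_id = t.id "
        "WHERE h.hashtag = ? ORDER BY t.id DESC",
        (tag,),
    )
    return rows.fetchall()
```

One row per (tweet, hashtag) pair keeps the lookup for `search(hashtag)` to a single indexed equality match.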


Timeline:
aggregate the recent tweets from the users we follow
the aggregation runs in the background and the result is stored in a Redis cache
redis
user_id: [tweet_id, tweet_id, tweet_id]
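
A rough sketch of the background aggregation, assuming a plain dict in place of Redis and assuming that a Follower row `(user_id, follow_id)` means user_id follows follow_id. `build_timelines` and `TIMELINE_LIMIT` are hypothetical names.

```python
from collections import defaultdict

TIMELINE_LIMIT = 100  # assumption: how many tweet ids to keep per user


def build_timelines(follows, tweets, limit=TIMELINE_LIMIT):
    """Build {user_id: [tweet_id, ...]} with the newest tweets first.

    follows: iterable of (user_id, follow_id) rows from the Follower table
    tweets:  iterable of (tweet_id, user_id, created_at) rows from the Tweet table
    """
    tweets_by_author = defaultdict(list)
    for tweet_id, author_id, created_at in tweets:
        tweets_by_author[author_id].append((created_at, tweet_id))

    followed_by_user = defaultdict(set)
    for user_id, follow_id in follows:
        followed_by_user[user_id].add(follow_id)

    cache = {}  # stand-in for Redis
    for user_id, followed in followed_by_user.items():
        merged = [entry for author in followed for entry in tweets_by_author[author]]
        merged.sort(reverse=True)  # newest created_at first
        cache[user_id] = [tweet_id for _, tweet_id in merged[:limit]]
    return cache
```

For example, if user 1 follows users 2 and 3, the cache entry for user 1 holds their tweet ids ordered by `created_at`, and tweets from unfollowed users are ignored.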


Diagram:
fewer than 1000 users

Client -> LB -> Reverse Proxy -> UserService     -> Redis -> MySQL
                              -> TimelineService -> Redis -> MySQL

Please describe and write down why we need a reverse proxy:

Reverse Proxy
we can add a rate limiter
SSL termination
compress outbound responses
blacklist certain IPs
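
The rate limiter can be illustrated with a token bucket. This is one common choice, not something the notes specify; the class name and parameters are assumptions, and a real proxy would keep one bucket per client IP.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True and consume a token if the request may pass."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The optional `now` argument exists only so the behaviour can be checked without sleeping.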

select * from hashtag_table where hashtag = '#hashtag' - for searching by hashtag

ranking is based on the likes and retweets of a tweet

Activity:
type (like, retweet) | actor_id (user_id) | tweet_id | created_at


at first, count with a query; if the table grows big, store the totals in cache
tweet_id_total_likes: 123
tweet_id_total_retweet: 123

every time we run the aggregation we can also update the totals in the Tweet table
(denormalize)

Tweet table
id, user_id, content, created_at, total_likes, total_retweet


Ranking:
we can sort by total_likes or total_retweet, depending on the ranking algorithm
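
A small sketch of counting Activity rows and ranking on the totals. The weighted score (`retweet_weight`) is purely an assumption for illustration; the notes do not fix a ranking algorithm.

```python
from collections import Counter


def aggregate_activity(activities):
    """Count likes and retweets per tweet.

    activities: iterable of (type, actor_id, tweet_id) rows from the Activity table
    """
    likes, retweets = Counter(), Counter()
    for kind, _actor_id, tweet_id in activities:
        if kind == "like":
            likes[tweet_id] += 1
        elif kind == "retweet":
            retweets[tweet_id] += 1
    return likes, retweets


def rank_tweets(tweet_ids, likes, retweets, retweet_weight=2):
    """Order tweets by a hypothetical score: likes + retweet_weight * retweets."""

    def score(tweet_id):
        return likes[tweet_id] + retweet_weight * retweets[tweet_id]

    # Highest score first; ties broken by the smaller tweet id.
    return sorted(tweet_ids, key=lambda t: (-score(t), t))
```

The two counters correspond to `total_likes` and `total_retweet`; writing them back to the Tweet table is the denormalization step.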


100 million users

something #hashtag something
if a tweet includes a hashtag, a tokenizer extracts it and uses the hashtag as the main key; the tweet is sent to a queue and a worker indexes it into Elasticsearch, which uses an inverted index

on receiving a tweet, send it asynchronously to the queue
a worker instance calls the Elasticsearch index API

an inverted index is a data structure that maps each term to the documents containing it, for faster lookup

let's say

doc1: tweet #something #abc
doc2: tweet #something #abc2

tokenizer (prefixes of the hashtag)
#s
#so
#som
#some

// keyword index criteria
doc 1: "something like this #abc"

fuzzy search -> matching doc
some -> 1
#ab  -> -
#abc -> 1

the tokenizer also handles phrases that have no hashtag
we define a min length of 4

some
somet
someth
somethi
somethin
something

like

#abc


#something -> doc1, doc2
#abc       -> doc1
#abc2      -> doc2
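
The tokenizer and inverted index can be sketched in plain Python. This is a stand-in for illustration only, not the Elasticsearch API; `edge_ngrams` and `InvertedIndex` are hypothetical names, and the min length of 4 comes from the notes.

```python
from collections import defaultdict

MIN_TOKEN_LENGTH = 4  # from the notes: "min length 4"


def edge_ngrams(word, min_len=MIN_TOKEN_LENGTH):
    """Return the prefixes of `word` from min_len characters up to the full word."""
    return [word[:n] for n in range(min_len, len(word) + 1)]


class InvertedIndex:
    """Map each token to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        """Tokenize on whitespace and index every prefix token of every word."""
        for word in text.lower().split():
            for token in edge_ngrams(word):
                self.postings[token].add(doc_id)

    def search(self, term):
        """Return the sorted ids of documents indexed under `term`."""
        return sorted(self.postings.get(term.lower(), set()))
```

Indexing doc 1 as "something like this #abc" gives a match for `some` and `#abc` but none for `#ab`, because `#ab` is shorter than the minimum token length.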

Client -> LB -> Reverse Proxy -> Fanout   ->  SearchService -> Queue -> Worker ->  Elasticsearch
                                          ->  PostTweetService -> MySQL
                                                               -> Redis
                                          ->  TimelineService -> Redis -> MySQL

Fanout
Please write down some keywords to help understand fanout:

Fanout: send/forward one request to multiple services
Fan-in: call several services and merge their responses

some users will have a lot of followers

we can use 2 mechanisms for the timeline
push / pull
hybrid

the server pushes a new tweet to the followers if the author has fewer than 1 million followers
if the author has more than 1 million followers, the followers must pull to get the new tweets from that user
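
The hybrid rule can be sketched in memory as follows. `TimelineService` here is a toy class with hypothetical method names, dicts stand in for Redis and MySQL, and an author counts as a celebrity once the follower count exceeds the threshold.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # from the notes: push below, pull above


class TimelineService:
    def __init__(self, threshold=CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)   # author_id -> follower ids
        self.following = defaultdict(set)   # user_id -> followed author ids
        self.pushed = defaultdict(list)     # user_id -> [(seq, tweet_id)] written on post
        self.by_author = defaultdict(list)  # author_id -> [(seq, tweet_id)]
        self.seq = 0                        # global ordering of tweets

    def follow(self, user_id, author_id):
        self.followers[author_id].add(user_id)
        self.following[user_id].add(author_id)

    def is_celebrity(self, author_id):
        return len(self.followers[author_id]) > self.threshold

    def post_tweet(self, author_id, tweet_id):
        self.seq += 1
        entry = (self.seq, tweet_id)
        self.by_author[author_id].append(entry)
        if not self.is_celebrity(author_id):
            # Push: write the tweet into every follower's timeline now.
            for follower_id in self.followers[author_id]:
                self.pushed[follower_id].append(entry)

    def get_timeline(self, user_id):
        entries = list(self.pushed[user_id])
        # Pull: merge in tweets of followed celebrities at read time.
        for author_id in self.following[user_id]:
            if self.is_celebrity(author_id):
                entries.extend(self.by_author[author_id])
        # set() drops duplicates if an author crossed the threshold after a push.
        return [tweet_id for _, tweet_id in sorted(set(entries), reverse=True)]
```

Push keeps reads cheap for ordinary authors, while pull avoids writing one tweet into millions of timelines.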

the push mechanism can be delivered in two ways
SSE / WebSocket
SSE

Pull mechanism
short polling: call every 5 minutes to check for new tweets from the celebrities the user follows

Survey -
Cassandra -
distributed storage -
AWS - storage -
ZooKeeper - Redis - idea


Cassandra: high throughput and high availability, but the downside is eventual consistency
Timeline
TimelineFeed
User_id -> list(tweet_id)

redis
tweet_id->"content"

MySQL: uses master-slave replication


quorum, ring-based consistent hashing
consistent hashing
5 servers
ring positions: 0 to 2^32 - 1

server1 1
server2 3
server3 4

hash the data and take the result modulo the size of the ring
the data is placed on the first server found going clockwise from that position
let's say I hash the data and get 2: it is stored on server2 (position 3)
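
A minimal sketch of the ring lookup, assuming MD5 as the hash function and no virtual nodes. `HashRing` and its method names are hypothetical, and fixed positions can be passed in to reproduce the small example from the notes.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # positions 0 .. 2^32 - 1


def ring_position(key):
    """Hash a string key to a position on the ring."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class HashRing:
    def __init__(self):
        self.positions = []  # sorted server positions
        self.servers = {}    # position -> server name

    def add_server(self, name, position=None):
        """Place a server on the ring, at a fixed position or at hash(name)."""
        position = ring_position(name) if position is None else position
        if position in self.servers:
            raise ValueError(f"position {position} is already taken")
        bisect.insort(self.positions, position)
        self.servers[position] = name

    def server_for_position(self, position):
        """Return the first server at or after `position`, wrapping around clockwise."""
        if not self.positions:
            raise LookupError("the ring has no servers")
        i = bisect.bisect_left(self.positions, position)
        return self.servers[self.positions[i % len(self.positions)]]

    def server_for_key(self, key):
        return self.server_for_position(ring_position(key))
```

With servers at positions 1, 3 and 4, a key that hashes to 2 lands on server2, and a key that hashes past 4 wraps around to server1.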


large data-intensive applications (this I read in the book)

- do you learn by reading, or do you also work on large distributed systems?
real experience - large distributed system -
what technology is different from the Twitter system?

what are your weaknesses / strengths in system design?
if I am familiar with the question, it is easy
let's say I have to design something I'm not familiar with, like live streaming