Charalambos Kanella charalak

@charalak
charalak / query_engines.md
Last active February 14, 2018 08:15
Query Engines

QUERY ENGINES

Query engines sit on top of data storage technologies and let you run SQL queries regardless of the (external) database you use.

Examples are:

  • Apache DRILL
  • Apache PHOENIX
  • Apache ZEPPELIN
  • PRESTO
charalak / external_database_choosing_a.md
Last active February 1, 2018 07:39
Tips for choosing an external database

TIPS FOR CHOOSING A SUITABLE EXTERNAL DATABASE

Examples of external databases are MySQL, Cassandra, MongoDB, and HBase (which is part of the Hadoop ecosystem).

What systems do you have to integrate?

See if the systems can talk to each other.

Consider scaling, especially if you plan to grow your database by orders of magnitude.

Consider the support provided and the security. Paid support may be the best option (e.g., MongoDB).

charalak / MongoDB_cheatsheet.md
Last active January 30, 2018 10:08
MongoDB cheatsheet

MongoDB

Characteristics

  • It is built for managing huMONGOus amounts of data.
  • It favours consistency and partition tolerance over availability.
  • It uses a document data model, which is flexible.
  • Documents look like JSON.

There is no real schema to enforce
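
To make the "no schema" point concrete, here is a minimal sketch in plain Python (it does not use MongoDB or its driver). The `users` collection, its fields, and the `find` helper are all made up for illustration; the only point is that two documents in the same collection can have different shapes and still serialise to JSON.

```python
import json

# Two documents in the same hypothetical "users" collection.
# They do not share the same set of fields, and nothing forces them to.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Linus", "languages": ["C", "Perl"], "address": {"city": "Helsinki"}},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Each document maps directly onto a JSON object.
for doc in users:
    print(json.dumps(doc))

matches = find(users, name="Linus")
```

The flexibility comes at a price: since nothing enforces the fields, the code reading the documents has to cope with missing ones (here via `doc.get`).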

charalak / cassandra_cheatsheet.md
Last active May 2, 2020 05:56
CASSANDRA cheatsheet

CASSANDRA

It is a distributed database with no single point of failure because there is no master node. It is engineered for availability.

It is a non-relational database, so there are no joins. It is designed for massive transaction volumes, high availability, and scalability.

Even though it is NoSQL, it has its own query language, CQL (the C stands for Cassandra).

How is Cassandra built?

charalak / HBASE_cheatsheet.md
Last active January 29, 2018 08:46
HBASE cheatsheet

HBASE

It is a non-relational, scalable database built on top of HDFS. It is used to expose massive datasets stored on HDFS to the user.

It is an open-source equivalent of Google's Bigtable.

Data is partitioned across region servers by ranges of keys, much like sharding when scaling a SQL database.

HBase takes all the small clumps of data in HDFS and collects them into larger partitions.
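
The key-range idea can be sketched in a few lines of plain Python. This is only an illustration of range partitioning, not HBase's actual implementation or API; the split points and server names below are invented.

```python
from bisect import bisect_right

# Hypothetical split points. Each region holds a contiguous, sorted range of row keys:
#   region 0: key < "g"
#   region 1: "g" <= key < "n"
#   region 2: "n" <= key < "t"
#   region 3: key >= "t"
SPLIT_KEYS = ["g", "n", "t"]
REGION_SERVERS = ["rs-0", "rs-1", "rs-2", "rs-3"]  # made-up server names


def region_for(row_key: str) -> str:
    """Pick the server whose key range contains row_key."""
    return REGION_SERVERS[bisect_right(SPLIT_KEYS, row_key)]


for key in ["apple", "movie42", "zebra"]:
    print(key, "->", region_for(key))
```

Because keys are kept sorted, rows with neighbouring keys land on the same server, which is what makes range scans cheap.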

charalak / NoSQL_cheatsheet.md
Last active January 29, 2018 08:46
NoSQL

NoSQL

  • It is a kind of system built to scale horizontally almost without limit. It is very fast and resilient.
  • It is useful for very large datasets and for queries that solve specific problems, rather than abstract queries.

However, if you need to query huge datasets (on the order of what Google or Amazon handle) with traditional SQL, you have to use some other tricks, such as:

  • Denormalization: create specific tables holding only what you need, so that you do not have to go through the whole database.
  • Caching layers: layers that sit in front of your database and cache data when your access follows a pattern.
  • Master/slave setups (?).
  • Sharding: partitioning your data by ranges of indices, so that specific databases handle specific ranges.
  • Materialized views (?).
  • Removing stored procedures.
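
As a rough illustration of denormalization, the sketch below uses Python's built-in `sqlite3` with an in-memory database. The `users`, `orders`, and `user_totals` tables and their contents are invented for the example; the point is only that the denormalized table answers the same question without a join at read time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);

    -- Denormalized table: the join and the aggregation are paid once, up front.
    CREATE TABLE user_totals AS
        SELECT u.user_id, u.name, SUM(o.amount) AS total_spent
        FROM users u JOIN orders o ON o.user_id = u.user_id
        GROUP BY u.user_id, u.name;
""")

# Normalized read: needs a join on every query.
joined = conn.execute(
    "SELECT u.name, SUM(o.amount) FROM users u "
    "JOIN orders o ON o.user_id = u.user_id "
    "WHERE u.user_id = 1 GROUP BY u.name"
).fetchone()

# Denormalized read: a single-table lookup.
flat = conn.execute(
    "SELECT name, total_spent FROM user_totals WHERE user_id = 1"
).fetchone()

print(joined, flat)
```

The trade-off is that `user_totals` is a copy: it has to be refreshed whenever `orders` changes, otherwise the two reads drift apart.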

charalak / Sqoop_cheatsheet.md
Last active January 29, 2018 08:47
SQOOP cheatsheet

SQOOP

It integrates SQL databases with Hadoop and handles big data. It takes writing MapReduce out of the equation for you and handles the importing and exporting of the data.

Import data from (My)SQL to HDFS

sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table table_name
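
If you want to launch the import from a script, one possible approach is to build the argument list in Python. This sketch only assembles and prints the command; it does not run Sqoop. It uses just the three options from the command above. The `sqoop_import_command` helper is hypothetical, and the MySQL URL form `jdbc:mysql://host/database` and the database name are assumptions.

```python
import shlex


def sqoop_import_command(jdbc_url: str, driver: str, table: str) -> list[str]:
    """Build the argument list for a `sqoop import` call (does not run it)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--driver", driver,
        "--table", table,
    ]


cmd = sqoop_import_command(
    "jdbc:mysql://localhost/movielens",  # assumed URL form and database name
    "com.mysql.jdbc.Driver",
    "table_name",
)
print(shlex.join(cmd))
# On a machine with Sqoop installed, this could be run with:
#   subprocess.run(cmd, check=True)
```

Passing a list instead of one string avoids shell-quoting problems if a table name or URL contains special characters.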

Import data from (My)SQL to Hive

charalak / hive_cheatsheet.md
Last active January 29, 2018 08:47
HIVE cheatsheet

HIVE

It lets you work with data in HDFS in an SQL manner. It sits on top of MapReduce and Tez, and translates SQL queries into MapReduce or Tez jobs on the cluster.

From your point of view, you just write SQL queries and the rest is left to Hive to figure out.

Characteristics: It is interactive, scalable, much easier than writing MapReduce in Java, optimized, and extensible.

Drawbacks: It is slow for OLTP (online transaction processing). It is limited in the complexity of queries it can express, so Pig or Spark are more appropriate for complex work. There are no transactions and no record-level updates; it looks like a database, but underneath it is just mapping and reducing in a more efficient way.
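
To get a feel for what "translating SQL to MapReduce" means, here is a toy sketch in plain Python of how a simple GROUP BY could be expressed as map, shuffle, and reduce steps. It is not what Hive actually generates; the `ratings` rows and the function names are made up.

```python
from collections import defaultdict

# Hypothetical rows of a "ratings" table: (user_id, movie_id, rating).
ratings = [(1, 50, 5), (2, 50, 4), (1, 172, 3), (3, 50, 5)]

# The query being imitated:
#   SELECT movie_id, COUNT(*) FROM ratings GROUP BY movie_id;


def map_phase(rows):
    """Mapper: emit (movie_id, 1) for every row."""
    for _user_id, movie_id, _rating in rows:
        yield movie_id, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(ratings)))
print(counts)
```

With Hive you only write the one-line query in the comment; the three functions are the part you no longer write by hand.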

charalak / spark_cheatsheet.md
Last active October 5, 2020 01:41
Spark cheatsheet

SPARK

General information

With Spark you can write scripts in, for example, Python to manipulate data.

Spark can run on Hadoop, but it does not have to. It can use its own cluster manager, or it can use Mesos.

It is scalable

charalak / impalance_classification.md
Last active January 24, 2018 12:34
Imbalance classification

IMBALANCED CLASSIFICATION

General information about the methods can be found in this short tutorial

In Python you can install imbalanced-learn. It provides a large number of methods for over-sampling the minority class, under-sampling the majority class, and combining the two.

Another interesting method of over-sampling the minority class is kmeans-smote.
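
The simplest over-sampling idea, randomly duplicating minority samples until the classes are balanced, can be sketched in plain Python. This is a stand-in to show the principle, not the imbalanced-learn API and not SMOTE; the `random_oversample` helper and the toy data are made up.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


# Toy data: four samples of class 0, one sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Plain duplication adds no new information, which is why methods that synthesize new minority samples are usually worth trying as well.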