Charalambos Kanella charalak

## query_engines.md

      
Last active: February 14, 2018 08:15

QUERY ENGINES

Query engines sit on top of data storage technologies and let you run SQL queries regardless of the (external) database you use.
Examples are:

Apache Drill
Apache Phoenix
Apache Zeppelin (a notebook front end for running queries, rather than an engine itself)
Presto


## external_database_choosing_a.md

      
Last active: February 1, 2018 07:39

TIPS FOR CHOOSING A SUITABLE EXTERNAL DATABASE

Examples of external databases are MySQL, Cassandra, MongoDB, and HBase (the last one is part of the Hadoop ecosystem).
Which systems do you have to integrate?
Check whether those systems can talk to each other.
Consider scaling, especially if you plan to grow your database by orders of magnitude.
Consider the support and the security provided. Paid support may be the best option (e.g., MongoDB).

  
## MongoDB_cheatsheet.md

      
Last active: January 30, 2018 10:08

MongoDB

Characteristics


It is for managing huMONGOus data.
It favours consistency and partition tolerance over availability.
It uses a flexible document data model.
Documents look like JSON.

There is no real schema to enforce.
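
A minimal sketch of what "no schema to enforce" means in practice. It uses plain Python dictionaries and the standard `json` module rather than a MongoDB driver, and the `users` list and its field names are hypothetical stand-ins for a collection.

```python
import json

# Hypothetical "collection": a plain list standing in for a MongoDB collection.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    # The second document adds a list field and omits "email";
    # nothing forces the two documents to share the same shape.
    {"_id": 2, "name": "Linus", "languages": ["C", "Python"]},
]

# Each document serializes to JSON text.
as_json = [json.dumps(doc, sort_keys=True) for doc in users]

# Readers have to tolerate missing fields, since no schema guarantees them.
emails = [doc.get("email") for doc in users]

print(as_json[1])
print(emails)
```

The flexibility moves the burden to the reading code: every access to an optional field needs a default.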

  
## cassandra_cheatsheet.md

      
Last active: May 2, 2020 05:56

CASSANDRA

It is a distributed database with no single point of failure, because there is no master node. It is engineered for availability.
It is a non-relational database, so there are no joins. It is built for massive transaction volumes, high availability and scalability.
Even though it is NoSQL, it has its own query language, CQL (the C stands for Cassandra).
How is Cassandra built?


## HBASE_cheatsheet.md

      
Last active: January 29, 2018 08:46

HBASE

It is a non-relational, scalable database built on top of HDFS, and it is used for exposing massive datasets stored on HDFS to the user.
It is an open-source equivalent of Google's Bigtable.
Its data is partitioned by ranges of keys across region servers, much like sharding when scaling a SQL database.
HBase takes the small chunks of data in HDFS and collects them into larger partitions.
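
The key-range partitioning described above can be sketched as a lookup from row key to server. This is only an illustration of the idea, not HBase's client API; the split points and server names below are made up.

```python
from bisect import bisect_right

# Hypothetical split points: each region holds the row keys from its start
# key (inclusive) up to the next region's start key (exclusive).
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs-1", "rs-2", "rs-3", "rs-4"]  # made-up server names


def region_server_for(row_key: str) -> str:
    """Return the server whose key range contains row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["apple", "grape", "zebra"]:
    print(key, "->", region_server_for(key))
```

Because the start keys are sorted, a binary search is enough to route a key, and neighbouring keys end up on the same server.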

  
## NoSQL_cheatsheet.md

      
Last active: January 29, 2018 08:46

NoSQL


It is a kind of system built to scale horizontally, in principle without limit. It is very fast and resilient.
It is useful for very large data and for queries that solve specific problems, rather than arbitrary ad hoc queries.

However, if you need to query huge amounts of data (on the order of what Google or Amazon handle) with traditional SQL, you have to use some other tricks, such as:

Denormalization: create specific tables holding only what you need, so that you do not have to go through your whole database.
Caching layers: layers sitting in front of your DB that cache data when your access follows a pattern.
Master/slave setups: one master takes the writes and replicas serve the reads.
Sharding: partitioning your data into ranges of indices, so that specific databases handle specific ranges.
Materialized views: query results that are precomputed and stored like tables.
Removing stored procedures.
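
As an illustration of the caching-layer trick, here is a minimal read-through cache sketch. The `DATABASE` dictionary and the key names are hypothetical stand-ins for a real database, and a production cache would also need eviction and invalidation.

```python
# Hypothetical slow backing store: a dict standing in for the database.
DATABASE = {"user:1": "Ada", "user:2": "Linus"}
db_reads = 0  # counts how often the "database" is actually hit


def read_from_database(key):
    """Simulate an expensive database read and count it."""
    global db_reads
    db_reads += 1
    return DATABASE.get(key)


cache = {}


def cached_read(key):
    """Read-through cache: go to the database only on a cache miss."""
    if key not in cache:
        cache[key] = read_from_database(key)
    return cache[key]


# A repetitive access pattern: four requests, only two distinct keys.
for key in ["user:1", "user:1", "user:2", "user:1"]:
    cached_read(key)

print(db_reads)
```

The more repetitive the access pattern, the fewer requests reach the database.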

  
## Sqoop_cheatsheet.md

      
Last active: January 29, 2018 08:47

SQOOP

Integrates SQL databases and Hadoop. It handles big data. It spares you from writing MapReduce jobs yourself (it runs them under the hood) and handles the importing and exporting of the data.
Import data from (My)SQL to HDFS

sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table table_name

Import data from (My)SQL to Hive


## hive_cheatsheet.md

      
Last active: January 29, 2018 08:47

HIVE

It lets you query data in HDFS in an SQL manner. It sits on top of MapReduce and Tez.
It translates SQL queries into MapReduce or Tez jobs on the cluster.
From your point of view you just write SQL queries and Hive figures out the rest.
Characteristics: It is interactive, scalable, much easier than MapReduce in Java, optimized, extensible.
Drawbacks: Slow for OLTP (online transaction processing). It has its limits in the complexity of queries, so Pig or Spark are more appropriate for complex work.
No transactions, no record-level updates. It looks like a database, but it is just mapping and reducing in a more convenient way.
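
To make the "SQL becomes map and reduce" idea concrete, here is a minimal sketch of how a GROUP BY count could be expressed as map, shuffle and reduce steps. It runs in plain Python on a hypothetical `ratings` table and does not show Hive's actual execution plan.

```python
from collections import defaultdict

# Hypothetical rows of a "ratings" table: (movie_id, rating).
ratings = [(1, 5), (2, 3), (1, 4), (3, 2), (2, 5), (1, 1)]

# Query being mimicked:
#   SELECT movie_id, COUNT(*) FROM ratings GROUP BY movie_id;


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for movie_id, _rating in rows:
        yield movie_id, 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key to get COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(ratings)))
print(counts)
```

With Hive you would only write the one-line query in the comment; the three functions are the part it takes off your hands.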

  
## spark_cheatsheet.md

      
Last active: October 5, 2020 01:41

SPARK

General information

With Spark you can write scripts in, for example, Python to manipulate data.
Spark can run on Hadoop, but it doesn't have to. It can use its own cluster manager or it can use Mesos.
It is scalable.

  
## impalance_classification.md

      
Last active: January 24, 2018 12:34

IMBALANCED CLASSIFICATION

General information about the methods can be found in this short tutorial
In Python you can install imbalanced-learn. It provides a large number of methods for over-sampling the minority class, under-sampling the majority class, and combinations of the two.
Another interesting method of over-sampling the minority class is kmeans-smote.
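
To show what over-sampling the minority class means, here is a hand-rolled sketch of plain random over-sampling using only the standard library. It is not the imbalanced-learn API (the library ships its own, more complete samplers); the function name `random_oversample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        # Draw with replacement; nothing is drawn for the largest class.
        extra = rng.choices(pool, k=target - count)
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


# Hypothetical toy data: 6 majority samples (class 0), 2 minority (class 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [5.5]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Random duplication adds no new information, which is why methods such as SMOTE or kmeans-smote generate synthetic minority samples instead.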