Query engines sit on top of data storage technologies and let you run SQL queries regardless of the (external) database you use.
Examples are:
- Apache Drill
- Apache Phoenix
- Apache Zeppelin
- Presto
Examples of external databases are: MySQL, Cassandra, MongoDB, and HBase (the last one is part of the Hadoop ecosystem).
What systems do you have to integrate?
Check whether the systems can talk to each other.
Think about scaling, especially if you plan to grow your database by orders of magnitude.
Consider the support provided and the security; paid support may be the best option (e.g., MongoDB).
It is a distributed database with no single point of failure because there is no master node. It is engineered for availability.
It is a non-relational database: no joins etc. are needed. It is built for massive transaction volumes, high availability, and scalability.
Even though it is NoSQL, it has its own query language, CQL (Cassandra Query Language).
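As a rough sketch, CQL looks like SQL but is tied to how Cassandra distributes data (the keyspace, table, and column names below are made up for illustration):

```sql
-- CQL: create a keyspace replicated across 3 nodes (names are illustrative)
CREATE KEYSPACE movielens
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- The first primary-key column is the partition key that spreads rows
-- across the nodes of the cluster
CREATE TABLE movielens.ratings (
  user_id  int,
  movie_id int,
  rating   float,
  PRIMARY KEY (user_id, movie_id)
);

-- Looks like SQL, but queries are expected to filter on the partition key
SELECT movie_id, rating FROM movielens.ratings WHERE user_id = 42;
```

Note there are no joins: the table is designed up front around the query you want to run.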
It is a non-relational, scalable database built on top of HDFS, used to expose massive datasets stored on the HDFS file system to the user.
It is an open-source equivalent of Google's Bigtable.
It is partitioned across region servers, each serving a range of keys (similar to sharding when scaling SQL).
HBase takes all the small chunks of data in HDFS and collects them into larger partitions.
However, if you need to run queries over huge data (of the order that Google or Amazon handles) with traditional SQL, you have to use some other tricks, such as:
- Denormalization: create specific tables with only what you need, so you don't have to go through your whole database.
- Caching layers: layers sitting in front of your DB that cache data when your accesses fit a pattern.
- Master/slave setups: one master handles writes while read-only replicas serve queries.
- Sharding: partitioning your data into ranges of indices, so specific database instances handle specific ranges.
- Materialized views: query results precomputed and stored as tables.
- Removing stored procedures.
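The sharding idea above can be sketched in a few lines of Python; the key boundaries and the in-memory dicts are stand-ins for real database instances:

```python
# Minimal sketch of range-based sharding: each shard owns a range of keys.
# The boundaries and the dict-backed "databases" are made up for illustration.
import bisect

class ShardedStore:
    def __init__(self, boundaries):
        # boundaries = sorted upper bounds of each shard's key range;
        # n boundaries produce n + 1 shards
        self.boundaries = boundaries
        self.shards = [{} for _ in range(len(boundaries) + 1)]

    def _shard_for(self, key):
        # bisect finds which range the key falls into, in O(log n)
        return self.shards[bisect.bisect_right(self.boundaries, key)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore([1000, 2000])  # 3 shards: <=1000, 1001-2000, >2000
store.put(500, "a")
store.put(1500, "b")
store.put(9999, "c")
```

In a real deployment each shard would be a separate database server, and the routing logic would live in a proxy or client library.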
Integrates SQL databases and Hadoop. It handles big data, takes hand-written MapReduce out of the equation, and manages the importing and exporting of the data.
sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table table_name
It lets you manage data in HDFS in an SQL manner. It sits on top of MapReduce and Tez, translating SQL queries into MR or Tez jobs on the cluster.
From your point of view you just write SQL queries; the rest is left to Hive to figure out.
Characteristics: it is interactive, scalable, much easier than MapReduce in Java, optimized, and extensible. Drawbacks: it is slow for OLTP (online transaction processing). It has its limits in query complexity, so Pig or Spark are more appropriate there. No transactions, no record-level updates; it looks like a database but it is just mapping and reducing in a more efficient way.
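As a rough sketch, a HiveQL session might look like this (the table, column, and file names are made up; the syntax is close to standard SQL):

```sql
-- Declare a table over data that will live in HDFS (schema-on-read)
CREATE TABLE ratings (user_id INT, movie_id INT, rating FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/ratings.tsv' INTO TABLE ratings;

-- Hive compiles this query into MapReduce or Tez jobs behind the scenes
SELECT movie_id, AVG(rating) AS avg_rating
FROM ratings
GROUP BY movie_id
ORDER BY avg_rating DESC
LIMIT 10;
```

The GROUP BY above is exactly the kind of query you would otherwise hand-code as a mapper and reducer.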
General information about the methods can be found in this short tutorial.
In Python you can install imbalanced-learn. A large number of methods for over-sampling the minority class, under-sampling the majority class, and combinations of those methods can be found there.
Another interesting method of over-sampling the minority class is k-means SMOTE.
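The core idea of over-sampling can be sketched without the library; this is a toy re-implementation of random over-sampling (the idea behind imbalanced-learn's RandomOverSampler), not the library's actual code, and the dataset is made up:

```python
# Toy random over-sampling: duplicate random minority-class samples
# until every class has as many samples as the majority class.
import random

def random_oversample(X, y, seed=0):
    rng = random.Random(seed)
    # Count samples per class
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    target = max(counts.values())  # size of the majority class
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Draw random duplicates until this class reaches the target size
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.9]]
y = [0, 0, 0, 1]  # class 1 is the minority
X_bal, y_bal = random_oversample(X, y)
```

SMOTE and k-means SMOTE go a step further: instead of duplicating existing minority samples, they synthesize new ones by interpolating between minority-class neighbors.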