Skip to content

Instantly share code, notes, and snippets.

@ianmcook
Last active April 26, 2021 05:51
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save ianmcook/0754938648d2030b58bdfe7ce5e9a31f to your computer and use it in GitHub Desktop.
Save ianmcook/0754938648d2030b58bdfe7ce5e9a31f to your computer and use it in GitHub Desktop.
Query Apache Hive from R using JDBC
# Copyright 2018 Cloudera, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# first install the Hive JDBC driver
# from https://www.cloudera.com/downloads/connectors/hive/jdbc.html
# set jdbc_path to the path to the folder containing the JAR files
# extracted from the JDBC41 zip file in the JDBC driver package
jdbc_path <- "PATH TO FOLDER CONTAINING JAR FILES HERE"
# set the Hive JDBC connection string
jdbc_conn_str <- "jdbc:hive2://HOSTNAME:10000"
# install the RJDBC package if it's not already installed
if(!"RJDBC" %in% rownames(installed.packages())) {
install.packages("RJDBC")
}
# initialize rJava
hive_classpath <- list.files(path = jdbc_path, pattern = "\\.jar$", full.names = TRUE)
rJava::.jinit(classpath = hive_classpath)
# connect to HiveServer2
library(DBI)
drv <- RJDBC::JDBC("com.cloudera.hive.jdbc41.Driver", hive_classpath, "`")
hive <- dbConnect(drv, jdbc_conn_str)
# now you can run SELECT queries and return the results to R
# using the dbGetQuery() function
# for example:
dbGetQuery(
hive,
"SELECT origin,
COUNT(*) AS num_departures,
AVG(dep_delay) AS avg_dep_delay
FROM flights
WHERE dest = 'LAS'
GROUP BY origin
ORDER BY avg_dep_delay")
# disconnect from Hive
dbDisconnect(hive)
@maerory
Copy link

maerory commented Apr 23, 2021

Hi, I am following the steps from your code and ran into an error on line 36.
I keep getting the error when trying to create the drv object:

Error in .jfindClass(as.character(driverClass)[1]) : 
  java.lang.ClassNotFoundException

Is there any update I have to make on the code?

@ianmcook
Copy link
Author

@maerory Depending on what JDBC version your Hive driver supports, you might need to change that line to use one of these:

  • "com.cloudera.hive.jdbc40.Driver"
  • "com.cloudera.hive.jdbc4.HS2Driver"
  • "com.cloudera.hive.jdbc.HS2Driver"

Or if your Hive JDBC driver is not from Cloudera, it might be something different

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment