Running Hive Queries in Spark

Hive queries can be executed from the Spark shell. Spark ships with Hive support in its library as HiveContext, which extends SQLContext. A HiveContext can be used to create and look up tables in the Hive metastore, as well as to write and execute HiveQL (HQL) queries.

Start the Spark Shell

spark-shell

Import HiveContext

import org.apache.spark.sql.hive.HiveContext

Create a new HiveContext Object

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

Run HQL from the Spark shell to list all tables in the default database

hiveContext.sql("show tables").show(false)


hiveContext.sql("show tables").collect().foreach(println)

Now create a CSV file that holds sample employee data, employee.csv, and place it in an HDFS location. For me it is: /user/ad/emp/employee.csv
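A minimal sketch of this step, assuming the five sample rows shown later in this post and a local `hdfs` client; adjust the HDFS path to your own location:

```shell
# Sample data matching the (emp_id, emp_name, age) schema
cat > employee.csv <<'EOF'
101,amit,35
102,nitin,28
103,navnit,39
104,akash,20
105,naveen,30
EOF

# Copy the file to HDFS when the client is available
# (the path mirrors the author's example location)
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /user/ad/emp
  hdfs dfs -put -f employee.csv /user/ad/emp/
fi
```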


We will create an external table from Spark using HQL.

Create External Table

hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS employee (emp_id INT, emp_name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/ad/emp'")

Since this is an external table and its location already contained employee.csv, the data was available in the table as soon as it was created.

Verify that the table was created by logging into the Hive CLI.

describe employee;

Select data from the table

hiveContext.sql("select count(*) as num_employees from employee").show

Verify the results by running the same query in the Hive CLI.

select count(*) as num_employees from employee;

We can analyze the dataset in Spark by loading the complete table as a DataFrame and running Spark operations on it.

val employeeDF = hiveContext.sql("select * from default.employee")


scala> employeeDF.count
res14: Long = 5

scala> employeeDF.show
+------+--------+---+
|emp_id|emp_name|age|
+------+--------+---+
|   101|    amit| 35|
|   102|   nitin| 28|
|   103|  navnit| 39|
|   104|   akash| 20|
|   105|  naveen| 30|
+------+--------+---+
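With the table loaded as a DataFrame, the usual DataFrame API is available for analysis. A short sketch, assuming the employee table defined above (the queries themselves are illustrative):

```scala
// Load the Hive table as a DataFrame
val employeeDF = hiveContext.sql("select * from default.employee")

// Employees younger than 30
employeeDF.filter("age < 30").show()

// Average age across all employees
employeeDF.agg(org.apache.spark.sql.functions.avg("age")).show()
```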

Note: Since we did not specify a database name, the table is created in Hive's 'default' database.
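To place the table in a specific database instead, create the database and qualify the table name. A minimal sketch, where mydb is a placeholder database name:

```scala
// "mydb" is a placeholder; the LOCATION matches the HDFS path used earlier
hiveContext.sql("CREATE DATABASE IF NOT EXISTS mydb")
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employee (emp_id INT, emp_name STRING, age INT) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/ad/emp'")
hiveContext.sql("show tables in mydb").show(false)
```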

