Hive queries can be executed from the Spark shell. Spark ships with Hive support through HiveContext, which extends SQLContext. HiveContext can be used to create and look up tables in the Hive metastore, as well as to write and execute HiveQL (HQL) queries.
Start the Spark Shell
Create a new HiveContext Object
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
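On Spark 2.x and later, HiveContext is deprecated in favour of SparkSession with Hive support enabled. A rough equivalent (a sketch; the app name is arbitrary and Hive support must be available in your build):

```scala
// Spark 2.x+ alternative to HiveContext: SparkSession with Hive support.
// HiveContext still works on 2.x but is deprecated.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-example")   // hypothetical app name, choose your own
  .enableHiveSupport()       // connects the session to the Hive metastore
  .getOrCreate()
```

With a SparkSession, `spark.sql(...)` replaces the `hiveContext.sql(...)` calls used in the rest of this walkthrough.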
Run HQL to show the list of all tables in the default database from the Spark shell.
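A minimal way to do this, assuming the hiveContext created above:

```scala
// List all tables in the current (default) database via HQL
hiveContext.sql("show tables").show()
```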
Now, create a CSV file that holds sample employee data, employee.csv, and place it in your HDFS location. For me it is: /user/ad/emp/employee.csv
101,amit,35
102,nitin,28
103,navnit,39
104,akash,20
105,naveen,30
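One way to place the file in HDFS; the paths follow the example location above, so adjust them to your own user directory:

```shell
# Create the target directory and upload the sample data
hdfs dfs -mkdir -p /user/ad/emp
hdfs dfs -put employee.csv /user/ad/emp/employee.csv

# Sanity check: the file should be listed
hdfs dfs -ls /user/ad/emp
```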
We will create an external table from Spark using HQL.
Create External Table
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS employee(emp_id INT, emp_name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/ad/emp'")
Since the table is external and its location already contained the file employee.csv, the data is available in the table as soon as the table is created.
Verify the creation of the table by logging into the Hive CLI.
hive
describe employee;
Select data from the table
hiveContext.sql("select count(*) as num_employees from employee").show
Verify the results by running the same query on the Hive CLI.
select count(*) as num_employees from employee;
We can analyze the dataset in Spark by loading the complete table as a DataFrame and running Spark operations on it.
val employeeDF = hiveContext.sql("select * from default.employee")
employeeDF.count
employeeDF.show
scala> employeeDF.count
res14: Long = 5

scala> employeeDF.show
+------+--------+---+
|emp_id|emp_name|age|
+------+--------+---+
|   101|    amit| 35|
|   102|   nitin| 28|
|   103|  navnit| 39|
|   104|   akash| 20|
|   105|  naveen| 30|
+------+--------+---+
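With the table available as a DataFrame, ordinary Spark transformations apply. For example (a sketch reusing the employeeDF above):

```scala
// Filter the DataFrame like any other Spark dataset:
// employees older than 30
employeeDF.filter(employeeDF("age") > 30).show()

// Aggregate over the whole table: average age
employeeDF.groupBy().avg("age").show()
```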
Note: Since we did not specify a database name, the table is created in the 'default' database in Hive.