To read an HBase table from Spark, we will use the Hortonworks Spark-HBase Connector (SHC).
For each HBase table we need to define a catalog in JSON format; it describes the mapping between the HBase columns and the Spark table schema.
For example, suppose we have an HBase table called 'video-creator' that we want to access through Spark.
We launch the Spark shell with the Hortonworks SHC package:
spark-shell --packages com.hortonworks:shc:1.0.0-2.0-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
Then import the following:
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
Now we can define the catalog, which holds the HBase table details in JSON format:
def catalog =
  s"""{
     |"table":{"namespace":"default", "name":"video-creator"},
     |"rowkey":"key",
     |"columns":{
     |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |"vidcrt":{"cf":"vidcrt", "col":"creator_id", "type":"string"}
     |}
     |}""".stripMargin
Here the table name is 'video-creator', the column family is 'vidcrt', and the column is 'creator_id'.
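The same pattern extends to wider tables: each Spark column maps to a (column family, column qualifier) pair. As an illustration, a hypothetical catalog for a table with a second column family might look like the following (the 'stats' family and 'video_count' column are made-up names):

def catalogWide =
  s"""{
     |"table":{"namespace":"default", "name":"video-creator"},
     |"rowkey":"key",
     |"columns":{
     |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |"creator_id":{"cf":"vidcrt", "col":"creator_id", "type":"string"},
     |"video_count":{"cf":"stats", "col":"video_count", "type":"bigint"}
     |}
     |}""".stripMargin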
Then we define a helper function that creates a DataFrame on top of the HBase table:
def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
We create the DataFrame by passing in the catalog:
val df_video_creator = withCatalog(catalog)
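df_video_creator is a regular DataFrame, so the usual operations work on it. A quick sketch (the row key value below is a placeholder; in spark-shell, spark.implicits._ is imported automatically, so the $ syntax is available):

// Inspect the schema derived from the catalog
df_video_creator.printSchema()

// Show a few rows
df_video_creator.show(5)

// Filter on the row key column defined in the catalog
// ("some-row-key" is a placeholder value)
df_video_creator.filter($"rowkey" === "some-row-key").show()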
Running this in the Spark shell (here I am using Spark 2.1.1, HBase 1.1.2, and Scala 2.11), df_video_creator comes back as an ordinary Spark DataFrame.
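That means we can also register it as a temporary view and query it with Spark SQL; a minimal sketch (the view name 'video_creator' is arbitrary):

// Register the HBase-backed DataFrame as a temporary view
df_video_creator.createOrReplaceTempView("video_creator")

// Query it with Spark SQL; column names come from the catalog
spark.sql("SELECT rowkey, vidcrt FROM video_creator LIMIT 10").show()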