To read an HBase table from Spark, we will use the Hortonworks Spark-HBase Connector (SHC).
For each HBase table we need to define a catalog in JSON format; it describes the mapping between the HBase columns and the Spark table schema.
For example, suppose we have an HBase table called 'video-creator' that we want to access through Spark.
We launch the Spark shell with the Hortonworks SHC package:
spark-shell --packages com.hortonworks:shc:1.0.0-2.0-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
Then import the following:
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
Now we can define the catalog, which holds the HBase table details in JSON format:
def catalog =
  s"""{
     |"table":{"namespace":"default", "name":"video-creator"},
     |"rowkey":"key",
     |"columns":{
     |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |"vidcrt":{"cf":"vidcrt", "col":"creator_id", "type":"string"}
     |}
     |}""".stripMargin
Here the table name is 'video-creator', the column family is 'vidcrt', and the column is 'creator_id'.
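The same pattern extends to wider tables: each Spark column maps to a (column family, column qualifier) pair. As an illustration, a hypothetical catalog for a table with a second column family might look like the following (the 'stats' family and 'video_count' column are made-up names):

def catalogWide =
  s"""{
     |"table":{"namespace":"default", "name":"video-creator"},
     |"rowkey":"key",
     |"columns":{
     |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |"creator_id":{"cf":"vidcrt", "col":"creator_id", "type":"string"},
     |"video_count":{"cf":"stats", "col":"video_count", "type":"bigint"}
     |}
     |}""".stripMargin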
Then we define a helper function that creates a DataFrame on top of the HBase table:
def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
We create the DataFrame by passing in the catalog:
val df_video_creator = withCatalog(catalog)
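df_video_creator is a regular DataFrame, so the usual operations work on it. A quick sketch (the row key value below is a placeholder; in spark-shell, spark.implicits._ is imported automatically, so the $ syntax is available):

// Inspect the schema derived from the catalog
df_video_creator.printSchema()

// Show a few rows
df_video_creator.show(5)

// Filter on the row key column defined in the catalog
// ("some-row-key" is a placeholder value)
df_video_creator.filter($"rowkey" === "some-row-key").show()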
Running this in the Spark shell (here I am using Spark 2.1.1, HBase 1.1.2, and Scala 2.11), df_video_creator comes back as an ordinary Spark DataFrame.
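That means we can also register it as a temporary view and query it with Spark SQL; a minimal sketch (the view name 'video_creator' is arbitrary):

// Register the HBase-backed DataFrame as a temporary view
df_video_creator.createOrReplaceTempView("video_creator")

// Query it with Spark SQL; column names come from the catalog
spark.sql("SELECT rowkey, vidcrt FROM video_creator LIMIT 10").show()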