Parsing massive amounts of semi-structured data is a pain with a traditional parser, and making that data queryable is an additional task on top of it.
Problem with the traditional approach:
If you have massive data, or expect your data to grow huge, you will be constrained by your hardware (storage/processing).
Overview of the traditional approach:
- Store the data on a server. (Beware: servers have limited disk space and CPUs.)
- Write a parser program to parse the JSON, and rewrite it whenever the structure of the JSON changes.
- Store the parsed data in an RDBMS. (Again, an RDBMS is also constrained by storage and processing capacity.)
- Query the RDBMS.
Limitations:
The above approach has worked well with limited amounts of data. But in this age of data flood, where data is generated by almost every device, it falls short on:
- Scalable storage: scaling your storage as the need arises.
- Scalable processing: adding CPUs as the data grows.
- Fault tolerance: if the server fails after a few hours of processing, you have to start from the beginning, and recovery needs manual intervention.
All of the above drawbacks/limitations are addressed by a big data platform like Hadoop. Here is how you can query a massive amount of data with Hadoop/Hive.
Hadoop-based approach overview:
- Store your JSON data on HDFS.
- Create an external table in Hive and use a JSON SerDe to map the JSON attributes to the columns of your table.
- Query your data using HiveQL.
Example implementation:
Step 1: Store the JSON files on HDFS
$ hdfs dfs -put /user/mik/jsondata/*.json /user/mik/data/
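If the target directory does not exist yet, create it first:
$ hdfs dfs -mkdir -p /user/mik/data
The SerDe used below expects one complete JSON object per line, so a small (hypothetical) input file could look like this:
{"countryName": "India", "countryCode": "IN"}
{"countryName": "France", "countryCode": "FR"}
{"countryName": "Japan", "countryCode": "JP"}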
Step 2: Create an external table in Hive using the JSON SerDe
> Download the SerDe jar:
http://www.congiu.net/hive-json-serde/1.3.7/cdh5/json-serde-1.3.7-jar-with-dependencies.jar
> Store the jar in your HDFS home directory.
$ hdfs dfs -put json-serde-1.3.7-jar-with-dependencies.jar /user/mik/
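You can confirm the upload with a quick listing:
$ hdfs dfs -ls /user/mik/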
> Create a Hive external table and map its column names to the JSON attribute names using the JSON SerDe
$ hive
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p1357.1177/jars/hive-common-1.1.0-cdh5.4.5.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> add jar json-serde-1.3.7-jar-with-dependencies.jar;
Added [json-serde-1.3.7-jar-with-dependencies.jar] to class path
Added resources: [json-serde-1.3.7-jar-with-dependencies.jar]
hive> CREATE EXTERNAL TABLE tb_countrycode_json
    > (
    >   countryName STRING,
    >   countryCode STRING
    > )
    > ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    > LOCATION '/user/mik/data';
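A note on the SerDe's mapping feature: if a JSON attribute name collides with a Hive reserved word (for example "timestamp"), this SerDe can rename it to a different column through SERDEPROPERTIES. A minimal sketch, with a hypothetical table name, columns, and location:

hive> CREATE EXTERNAL TABLE tb_events_json
    > (
    >   eventName STRING,
    >   time_stamp STRING
    > )
    > ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    > WITH SERDEPROPERTIES ( "mapping.time_stamp" = "timestamp" )
    > LOCATION '/user/mik/events';

Here the JSON attribute "timestamp" is read into the Hive column time_stamp; the simple two-column table above does not need this.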
Step 3: Query your data using HiveQL
hive> SELECT * FROM tb_countrycode_json;
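Since the JSON is now exposed as an ordinary table, any HiveQL works against it. For example (the filter value is hypothetical):

hive> SELECT countryName FROM tb_countrycode_json WHERE countryCode = 'IN';
hive> SELECT countryCode, count(*) FROM tb_countrycode_json GROUP BY countryCode;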
That's it.