Parsing massive amounts of semi-structured data is a pain with traditional parsers, and making that data queryable adds yet another task on top of the parsing.
Problem with the traditional approach:
If you have massive data, or expect your data to grow huge, you will be restricted by your hardware (storage and processing).
Overview of the traditional approach:
- Store the data on a server. (Beware: servers have limited space and CPUs.)
- Write a parser program to parse the JSON. Rewrite the code whenever the structure of the JSON changes.
- Store the parsed data in an RDBMS. (Again, an RDBMS is also constrained by storage and processing capacity.)
- Query the RDBMS.
Limitations:
The above approach has worked well with limited amounts of data. But in this age of data flood, where data is generated by almost every device, it falls short on:
- Scalable storage: scale your storage as the need arises.
- Scalable processing: add CPUs as the data grows.
- Fault tolerance: if the server fails after a few hours of processing, you have to start over from the beginning, and recovery requires manual intervention.
All of the above limitations are addressed by a big data platform like Hadoop.
Here is how you can query massive amounts of data on Hadoop/Hive.
Hadoop-based approach overview:
- Store your JSON data on HDFS.
- Create external tables in Hive and use a JSON SerDe to map the JSON attributes to the columns of your table.
- Query your data using HiveQL.
Example implementation:
Step 1: Store the JSON files on HDFS
$ hdfs dfs -put /user/mik/jsondata/*.json /user/mik/data/
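The SerDe used in Step 2 expects one JSON object per line of the file. For reference, a record matching the table we are about to define might look like this (the values are illustrative, not from the original data set):
{"countryName": "India", "countryCode": "IN"}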
Step 2: Create an external table in Hive using a JSON SerDe
> Download the SerDe jar:
http://www.congiu.net/hive-json-serde/1.3.7/cdh5/json-udf-1.3.7-jar-with-dependencies.jar
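If your gateway machine has internet access, you can fetch the jar directly; a minimal sketch, assuming wget is available:
$ wget http://www.congiu.net/hive-json-serde/1.3.7/cdh5/json-udf-1.3.7-jar-with-dependencies.jar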
> Store the jar in your HDFS home directory.
$ hdfs dfs -put json-udf-1.3.7-jar-with-dependencies.jar /user/mik/
> Create the Hive external table and map the column names to the JSON attribute names using the JSON SerDe.
$ hive
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p1357.1177/jars/hive-common-1.1.0-cdh5.4.5.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> add jar json-udf-1.3.7-jar-with-dependencies.jar;
Added [json-udf-1.3.7-jar-with-dependencies.jar] to class path
Added resources: [json-udf-1.3.7-jar-with-dependencies.jar]
hive> CREATE EXTERNAL TABLE tb_countrycode_json
> (
>   countryName STRING,
>   countryCode STRING
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> LOCATION '/user/mik/data';
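The SerDe also supports remapping a JSON attribute name that collides with a Hive reserved word via SERDEPROPERTIES. A minimal sketch, assuming a hypothetical data set whose records carry a "timestamp" attribute (the table name and location here are illustrative):
hive> CREATE EXTERNAL TABLE tb_events_json
> (
>   time_stamp STRING
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH SERDEPROPERTIES ( "mapping.time_stamp" = "timestamp" )
> LOCATION '/user/mik/eventdata';
Here the Hive column time_stamp is populated from the JSON attribute timestamp, which would otherwise clash with the TIMESTAMP keyword.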
Step 3: Query your data using HiveQL
hive> SELECT * FROM tb_countrycode_json;
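From here, any HiveQL works against the JSON files, for example a filter (the country code value is illustrative):
hive> SELECT countryName FROM tb_countrycode_json WHERE countryCode = 'IN';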
That's it.