Apache HCatalog Tutorial For Beginners ~ Tech Blog

Tuesday, 24 July 2018

Apache HCatalog Tutorial For Beginners

1. Objective

Today, we begin a new journey with Apache HCatalog. In this HCatalog tutorial, we provide a guide to HCatalog, the table and storage management layer for Hadoop. We explain what it does and how it works, and we also discuss the HCatalog architecture along with its benefits.

So, let’s start with Apache HCatalog.


2. What is HCatalog?

HCatalog is a table and storage management layer for Hadoop. Its main function is to enable users with different data processing tools, such as Pig and MapReduce, to read and write data more easily on the grid.

In addition, its table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS). As a result, users need not worry about where or in what format their data is stored, whether that is RCFile format, text files, SequenceFiles, or ORC files.

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, it supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
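
To make the SerDe idea concrete, here is a minimal Python sketch of what a serializer-deserializer pair does for a comma-separated text format. It is only a conceptual illustration: real SerDes are Java classes plugged into Hive, and the names COLUMNS, deserialize, and serialize, as well as the column types, are assumptions made up for this example.

```python
# Toy illustration of the SerDe idea (not Hive's actual Java SerDe interface).
# Column names are borrowed from the data flow example in this tutorial;
# the types are assumptions.
COLUMNS = [("advertiser_id", int), ("clicks", int), ("date", str)]


def deserialize(line):
    """Turn one stored text line into a row (dict) using the table schema."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(fields)))
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, fields)}


def serialize(row):
    """Turn a row (dict) back into the stored text representation."""
    return ",".join(str(row[name]) for name, _ in COLUMNS)


row = deserialize("42,7,20100819\n")
print(row)
print(serialize(row))
```

The point of the sketch is that a tool reading through the table layer only ever sees rows; the on-disk representation is handled entirely by the serialize and deserialize pair.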

3. HCatalog Tutorial – Intended Audience

This HCatalog tutorial is designed for professionals who want to build a career in Big Data analytics using the Hadoop framework.
ETL developers and analytics professionals may also find this tutorial useful.

4. Prerequisites for HCatalog

Before starting this tutorial, we recommend a basic knowledge of Core Java, SQL database concepts, the Hadoop file system, and any flavor of the Linux operating system. This will help you understand the topic in depth.

5. Why HCatalog?

Enabling the right tool for the right job

The Hadoop ecosystem contains different tools for data processing, such as Hive, Pig, and MapReduce. Although these tools do not require metadata, they can still benefit from it when it is present. When all of them share one metastore, users of each tool have immediate access to data created with another tool, so no loading or transfer steps are required.
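
The benefit of shared metadata can be sketched with a toy model in Python. This is not the Hive metastore API: the SharedMetastore class, its methods, and the schema are hypothetical and exist only to show one tool registering a table and another tool finding it by name.

```python
# Toy model of a metadata store shared by several tools
# (hypothetical class, not the Hive metastore API).
class SharedMetastore:
    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema, location):
        """Record a table's schema and storage location under its name."""
        self._tables[name] = {"schema": schema, "location": location}

    def describe(self, name):
        """Return the recorded metadata for a table."""
        return self._tables[name]


metastore = SharedMetastore()

# A "Pig-like" job registers the table it produced.
metastore.create_table(
    "processedevents",
    schema=[("advertiser_id", "int"), ("clicks", "int")],
    location="hdfs://data/processedevents",
)

# A "Hive-like" job looks the table up by name; nobody has to pass
# the path or the schema around by hand.
info = metastore.describe("processedevents")
print(info["location"])
print(info["schema"])
```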

Capture processing states to enable sharing

HCatalog can publish our analytics results, so other programmers can access our analytics platform via REST.

Integrate Hadoop with everything

As a processing and storage environment, Hadoop opens up a lot of opportunities for the enterprise. REST services open up the platform to the enterprise with a familiar API and an SQL-like language. As a result, enterprise data management systems use HCatalog to integrate more deeply with the Hadoop platform.

6. HCatalog Architecture

HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. It provides read and write interfaces for Pig and MapReduce, and it uses Hive’s command line interface for issuing data definition and metadata exploration commands.

[Figure: HCatalog architecture]

7. Data Flow Example

Here is a simple data flow example that shows how HCatalog can help grid users share and access data:

First: Copy Data to the Grid

First, John in data acquisition uses distcp to get data onto the grid.

hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
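
When this step runs every day, the two commands are usually generated from the date rather than typed by hand. The following is a minimal Python sketch of that idea. The function name add_partition_commands and the hdfs_root default are assumptions for illustration; the sketch only builds the argument lists and does not run anything. It passes the DDL to the HCatalog CLI with the -e option, which is how hcat accepts an inline command.

```python
import shlex


def add_partition_commands(local_file, table, ds, hdfs_root="hdfs://data"):
    """Build the copy and register commands for one date partition.

    Returns two argument lists suitable for subprocess.run().
    """
    location = "%s/%s/%s/data" % (hdfs_root, table, ds)
    copy_cmd = ["hadoop", "distcp", "file://" + local_file, location]
    ddl = "alter table %s add partition (ds='%s') location '%s'" % (
        table, ds, location)
    register_cmd = ["hcat", "-e", ddl]
    return copy_cmd, register_cmd


copy_cmd, register_cmd = add_partition_commands(
    "/file.dat", "rawevents", "20100819")

# Print the commands in a shell-safe form for inspection.
print(" ".join(shlex.quote(part) for part in copy_cmd))
print(" ".join(shlex.quote(part) for part in register_cmd))
```

On a machine where hadoop and hcat are installed, each list could be handed to subprocess.run(cmd, check=True); building lists instead of one long string avoids shell quoting problems around the DDL statement.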

Second: Prepare the Data

Second, Samuel in data processing uses Pig to cleanse and prepare the data.

Without HCatalog, Samuel must be manually informed by John when data is available, or poll on HDFS.

A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
B = filter A by bot_finder(zeta) == 0;
…
store Z into '/data/processedevents/20100819/data';

With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started:

A = load 'rawevents' using org.apache.hive.hcatalog.pig.HCatLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
…
store Z into 'processedevents' using org.apache.hive.hcatalog.pig.HCatStorer('date=20100819');

Third: Analyze the Data

Third, Ross in client management uses Hive to analyze his clients’ results.

Without HCatalog, Ross must alter the table to add the required partition.

alter table processedevents add partition (date='20100819') location 'hdfs://data/processedevents/20100819/data';

select advertiser_id, count(clicks)
from processedevents
where date = '20100819'
group by advertiser_id;

With HCatalog, Ross does not need to modify the table structure.

select advertiser_id, count(clicks)
from processedevents
where date = '20100819'
group by advertiser_id;

8. How HCatalog Works?

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. It provides read and write interfaces for Pig and MapReduce, and it uses Hive’s command line interface for issuing data definition and metadata exploration commands. In addition, it presents a REST interface that gives external tools access to Hive DDL operations such as “create table” and “describe table.”

Further, it presents a relational view of data. Data is stored in tables, and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys), there is one partition that contains all rows with that value (or set of values).
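
The partitioning rule can be illustrated with a small Python sketch. The sample rows and the partition_rows helper are made up for this example; the sketch shows only the grouping logic, not how HCatalog lays partitions out on HDFS.

```python
from collections import defaultdict

# Hypothetical sample rows; column names follow the data flow example.
rows = [
    {"advertiser_id": 1, "clicks": 3, "date": "20100819"},
    {"advertiser_id": 2, "clicks": 5, "date": "20100819"},
    {"advertiser_id": 1, "clicks": 4, "date": "20100820"},
]


def partition_rows(rows, keys):
    """Group rows so that each distinct key value (or tuple of values)
    gets exactly one partition holding all rows with that value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[k] for k in keys)].append(row)
    return dict(partitions)


# Partition on a single key.
partitions = partition_rows(rows, ["date"])
for value, members in sorted(partitions.items()):
    print(value, len(members))

# Partition on a set of keys.
by_date_and_advertiser = partition_rows(rows, ["date", "advertiser_id"])
print(len(by_date_and_advertiser))
```

A query that filters on the partition key, like the where date = '20100819' clause in the data flow example, then only needs to look at the one matching partition.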

9. HCatalog Web API

WebHCat is the REST API for HCatalog. REST stands for “representational state transfer”, a style of API that relies on HTTP verbs. WebHCat was originally named Templeton.
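
As a rough sketch of what WebHCat requests look like, the Python snippet below builds two URLs without sending them. It assumes a WebHCat server on localhost with the default port 50111 and the /templeton/v1 base path, and a hypothetical user named ross; the helper names webhcat_url and describe_table_url are made up for this example, so check the host, port, and user against your own installation.

```python
from urllib.parse import quote, urlencode


def webhcat_url(resource, user, host="localhost", port=50111):
    """Build a WebHCat URL for a resource under /templeton/v1.

    host, port, and user are assumptions for illustration.
    """
    query = urlencode({"user.name": user})
    return "http://%s:%d/templeton/v1/%s?%s" % (host, port, resource, query)


def describe_table_url(database, table, user, **kwargs):
    """URL that describes a table when requested with HTTP GET."""
    resource = "ddl/database/%s/table/%s" % (
        quote(database, safe=""), quote(table, safe=""))
    return webhcat_url(resource, user, **kwargs)


status_url = webhcat_url("status", "ross")
table_url = describe_table_url("default", "processedevents", "ross")
print(status_url)
print(table_url)
```

Because the API relies on HTTP verbs, the same table URL is used for different operations: a GET request describes the table, while PUT and DELETE requests create and drop it.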

10. HCatalog Benefits

There are several benefits that Apache HCatalog offers:

Through the table abstraction, it frees the user from having to know where or how their data is stored.
It enables notifications of data availability.
It provides visibility for data cleaning and archiving tools.

So, this was all about the HCatalog tutorial. We hope you like our explanation.

11. Conclusion

Hence, in this HCatalog tutorial, we have learned about HCatalog in detail. We discussed the meaning of and need for HCatalog, the HCatalog architecture, and a data flow example. Along with this, we discussed how HCatalog works, the HCatalog Web API, and the benefits of HCatalog. If you have any doubts about this HCatalog tutorial, feel free to ask in the comments.


About Me

I V RAMANA
