Thursday, March 14, 2013

Why HADOOP...

BIG DATA is much discussed in terms of storage and performance, and it arises in several industries like social media (Facebook, Twitter), space programs etc., where the data is HUGE, i.e., terabytes, petabytes and beyond.
For such huge data there are several challenges in terms of handling it, storing it, mining through it and putting it to use.
Some of the challenges for BIG DATA are:
  1. Type of data: The data that is stored may not be simple; it may be complex, i.e., not just traditional data types but anything like images, files, binary data etc.
  2. Storage: Storing a huge amount of data on a single machine is an issue, because beyond a certain point the hardware may not support further extension.
  3. Processing & Performance: Performance of the application becomes a problem when dealing with big data. Processing such huge data requires a lot of time and CPU; moreover, a single CPU becomes a bottleneck, as it has its own limitations.
  4. Network: Transferring the data an application needs over the network may itself eat up the majority of the time.
  5. I/O: Hardware deteriorates with frequent I/O operations, and its read/write times increase as it grows old. This also adds to the performance hit of the application.
  6. Data growth: The amount of data that has to be stored is not fixed; there is no control over the data that flows in.
HADOOP is an open-source distributed framework from the Apache Software Foundation that provides scaling in terms of performance, storage and I/O bandwidth.

Most of the challenges for BIG DATA can be addressed through HADOOP.

In HADOOP, data is stored across several machines and is processed in parallel on each of them.
So, the main idea is to break a big task into several small tasks and distribute these small tasks to the machines where the data they require resides. This is how data locality is achieved.

Data locality reduces network latency (the cost of fetching the data required for a task and transferring it over the network).

The I/O bandwidth consumed, i.e., the read/write traffic, also reduces thanks to data locality, as each machine has to read only a limited amount of data.
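To make the idea concrete, here is a minimal sketch along the lines of the classic Hadoop word-count job. Each map task runs on a machine holding a block of the input and counts words only in that block; the reducers then merge the small partial counts. The class name and the input/output paths (taken from the command line) are illustrative, not a fixed convention:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Small task #1: each map task tokenizes its local block of input
  // and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Small task #2: each reduce task sums the partial counts for a word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that the big job (count all words in terabytes of text) never runs anywhere as a whole; it only ever exists as many small, data-local map tasks plus a merge step.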

As Hadoop is a distributed framework, it has adopted failure as part of the system, i.e., failure is not treated as an exception but is expected and handled (for example, by keeping multiple copies of every data block on different machines). This addresses the problem of partial failure in a distributed system.
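A minimal sketch of how that looks in practice, assuming an HDFS cluster is reachable via the default configuration; the file path below is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask HDFS to keep 3 copies of each block (the common default),
    // so losing one machine loses neither data nor the running job.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);
    // Replication can also be changed per file; the path is hypothetical.
    fs.setReplication(new Path("/data/sample.txt"), (short) 3);
  }
}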

Besides the above, there are other advantages of HADOOP:

  1. A HADOOP cluster can be built out of simple commodity machines; no high-end systems are required. The CPU and storage of every machine are used, which also helps reduce cost.
  2. It is platform independent, as it is implemented in Java.
  3. It is fault tolerant, i.e., if one of the nodes in the cluster fails, processing does not stop; the work is taken up by other machines.