January 07, 2016

What does the term "Data Locality" mean in Hadoop?

Hadoop follows the principle that "moving computation is cheaper than moving data": it works on local data as far as possible. When we submit a MapReduce job, the logic (the map and reduce code) is shipped to the DataNodes in the cluster instead of the data being shipped to the code.

Hadoop copies the MapReduce job's JAR into HDFS; the TaskTrackers that need it pull it from there, and each node then runs the code against the data it holds locally.
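As a rough illustration, here is a minimal job driver in Java (class and path names are only examples, and UpperCaseMapper is a hypothetical mapper sketched later in this post). The call to setJarByClass tells Hadoop which JAR to ship to the cluster so the nodes running the tasks can load the map code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UpperCaseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "uppercase");

        // The JAR containing this class is distributed to the cluster;
        // the nodes fetch it before running their tasks.
        job.setJarByClass(UpperCaseDriver.class);
        job.setMapperClass(UpperCaseMapper.class); // sketched later in the post
        job.setNumReduceTasks(0);                  // map-only job

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}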

When a dataset is stored in HDFS, it is divided into blocks and stored across the DataNodes in the Hadoop cluster. When a MapReduce job is executed against the dataset, the individual Mappers process those blocks (input splits). When the data a Mapper needs is not available on the node where it is executing, the data has to be copied over the network from the DataNode that holds it to the DataNode running the Mapper task.
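You can see where the blocks of a file actually live by asking the HDFS client for the block locations; the scheduler uses this same host information when it tries to place each map task on a node that holds the block. A small Java sketch (the file path is just an example):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example path; replace with a file in your cluster
        Path file = new Path("/data/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}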

So suppose you have a 1 GB text file and you have written a MapReduce job to convert all of its text to upper case. The file is first broken into blocks, and the logic to convert the text to upper case is made available to each DataNode. The TaskTracker on each node then runs the MapReduce code only on the data block(s) present on that local node. This is known as data locality.
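A minimal Mapper for that upper-casing example could look like the sketch below (the class name UpperCaseMapper is just an assumption; it is the mapper referenced in the driver above):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpperCaseMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text upper = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each call receives one line of the local block (input split);
        // when the task is data-local, nothing is pulled from other nodes.
        upper.set(line.toString().toUpperCase());
        context.write(NullWritable.get(), upper);
    }
}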

Why is Data Locality important?

Imagine a MapReduce job with over 100 Mappers, each trying to copy its data from another DataNode in the cluster at the same time. This would cause serious network congestion, which is far from ideal. It is therefore always more effective and cheaper to move the computation closer to the data than to move the data closer to the computation.
