February 20, 2013

Hadoop Architecture and its Usage at Facebook

Lots of data is generated on Facebook
– 300+ million active users 
– 30 million users update their statuses at least once each day
– More than 1 billion photos uploaded each month 
– More than 10 million videos uploaded each month 
– More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

Data Usage
Statistics per day:
– 4 TB of compressed new data added per day
– 135TB of compressed data scanned per day
– 7500+ Hive jobs on production cluster per day
– 80K compute hours per day
Barrier to entry is significantly reduced:
– New engineers go though a Hive training session
– ~200 people/month run jobs on Hadoop/Hive
– Analysts (non-engineers) use Hadoop through Hive

Where is this data stored?
Hadoop/Hive Warehouse
– 4800 cores, 5.5 PetaBytes
– 12 TB per node
– Two level network topology
1 Gbit/sec from node to rack switch
4 Gbit/sec to top level rack switch

Data Flow into Hadoop Cloud













Move old data to cheap storage

No comments:

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...