February 19, 2013

What's the big deal about Hadoop?

Hadoop has advantages over traditional database management systems, most notably the ability to handle both structured data, like that found in relational databases, and unstructured information such as video, in very large volumes. The system can also scale with a minimum of fuss.
A growing number of firms are using Hadoop and related technologies such as Hive, Pig and HBase to analyze data in ways that cannot easily or affordably be done using traditional relational database technologies.
JPMorgan Chase, for instance, is using Hadoop to improve fraud detection, IT risk management, and self-service applications. The financial services firm is also using the technology to build a far more comprehensive view of its customers than was previously possible, executives said.
Meanwhile, eBay is revamping the core search engine for its auction site using Hadoop and HBase, an open source database that enables real-time analysis of data in Hadoop environments.
The new eBay search engine, code-named Cassini, will replace the Voyager technology that's been used since the early 2000s. The update is needed in part due to surging volumes of data that need to be managed. Cassini will deliver more accurate and more context-based results to user search queries.

What is Hadoop used for?

Search
– Yahoo, Amazon, Zvents

Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm

Recommendation Systems
– Facebook

Data Warehouse
– Facebook, AOL

Video and Image Analysis
– New York Times, Eyealike


Goals of HDFS

Very Large Distributed File System
– 10K nodes, 100 million files, 10 - 100 PB 

Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
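
To make the replication idea concrete, here is a minimal sketch in plain Python. It is not HDFS code: the node names, block ids, the replication factor of 3, and the helper functions place_blocks and handle_failure are all assumptions made up for illustration. It only shows the general idea of keeping several copies of each block and restoring the copy count when a machine dies.

```python
import random

REPLICATION = 3  # assumed replication factor for this sketch


def place_blocks(block_ids, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (hypothetical helper)."""
    rng = random.Random(seed)
    return {b: set(rng.sample(nodes, replication)) for b in block_ids}


def handle_failure(placement, failed, live_nodes, replication=REPLICATION, seed=1):
    """Forget the failed node, then copy under-replicated blocks to other live nodes."""
    rng = random.Random(seed)
    for holders in placement.values():
        holders.discard(failed)
        # Only nodes that do not already hold this block are valid targets.
        candidates = [n for n in live_nodes if n not in holders]
        while len(holders) < replication and candidates:
            target = rng.choice(candidates)
            candidates.remove(target)
            holders.add(target)
    return placement


nodes = [f"node{i}" for i in range(6)]
placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes)

# Simulate losing node2 and restoring the replica count on the survivors.
live = [n for n in nodes if n != "node2"]
placement = handle_failure(placement, "node2", live)

for block, holders in sorted(placement.items()):
    print(block, sorted(holders))
```

After the simulated failure, every block is back to three copies and none of them lives on the dead node.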

Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth 
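
The "move computation to the data" goal can be sketched the same way. This is a toy scheduler in plain Python, not Hadoop's actual scheduler: the schedule function, the block location map, and the one-slot-per-node setup are assumptions for illustration. A task runs on a node that already holds its block when that node has a free slot, and falls back to a remote node otherwise.

```python
def schedule(tasks, block_locations, free_slots):
    """Map each task to (node, is_local), preferring nodes that hold the task's block."""
    slots = dict(free_slots)  # work on a copy so the caller's dict is unchanged
    assignment = {}
    for task, block in tasks.items():
        # Nodes that store the block and still have capacity.
        local = [n for n in block_locations[block] if slots.get(n, 0) > 0]
        if local:
            node = local[0]
        else:
            # No local slot is free, so the block must be read over the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node = remote[0]
        slots[node] -= 1
        assignment[task] = (node, node in block_locations[block])
    return assignment


# Hypothetical cluster state: which nodes hold which block, one slot per node.
block_locations = {
    "blk_1": ["node0", "node1"],
    "blk_2": ["node1", "node2"],
    "blk_3": ["node0"],
}
free_slots = {"node0": 1, "node1": 1, "node2": 1}
tasks = {"t1": "blk_1", "t2": "blk_2", "t3": "blk_3"}

result = schedule(tasks, block_locations, free_slots)
for task, (node, is_local) in result.items():
    print(task, node, "local" if is_local else "remote")
```

In this run t1 and t2 land on nodes that already store their blocks, while t3 has to read remotely because the only node holding blk_3 is already busy.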

Runs in user space on heterogeneous operating systems
