February 28, 2013

NoSQL Key-Value Store

Basic terminology:
  • Key-Value Store – data is stored in unstructured records consisting of a key + the values associated with that record
  • NoSQL –Doesn’t use SQL commands
Let’s say you’ve got millions of data records — as you might have for example, if you’ve got millions of users who visit your website.
looks like a “row” in a database table).Note that not every user has the same information — some users will have a username, some will only have an email address, some users will have provided their name and others will not.  Each record has a different length and different values.
To store this kind of data, you create a key for each record and then store whatever fields are available as bins (what would be columns in a structured database) — where each bin consists of a name and a value.  Then you create a bin for each piece of data you have.  If you don’t have a particular piece of data, you don’t have a blank field (like in a relational table), you simply don’t store a bin for that data.
This type of database is called a Key-Value Store because each record has a primary key and a collection of values (bins).  It’s also called a Row Store because all of the data for a single record is stored together, in something that we can think of conceptually as a row.

Example of unstructured data for user records:

Key: 1 ID:av First Name: Avishkar



Key: 2 Email: avishkarm@gmail.com Location: Mumbai Age: 37



Key: 3  Facebook ID: avishkarmeshram  Password: xxx  Name: Avishkar




Data is organized into policy containers called ‘namespaces’, semantically similar to ‘databases’ in an RDBMS system. Namespaces are configured when the cluster is started, and are used to control retention and reliability requirements for a given set of data. 
Within a namespace, data is subdivided into ‘sets’ (similar to ‘tables’) and ‘records’ (similar to ‘rows’). Each record has an indexed ‘key’ that is unique in the set, and one or more named ‘bins’ (similar to columns) that hold values associated with the record.




Indexes (primary keys) are stored in DRAM for ultra-fast access and values can be stored either in DRAM or more cost-effectively on SSDs. Each namespace can be configured separately, so small namespaces can take advantage of DRAM and larger ones gain the cost

No comments:

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...