December 16, 2013

Database supporting different languages

Many global businesses want to expand worldwide, and to do so they need to serve customers in their own languages, such as Chinese, Japanese, Korean, and Arabic. Many websites now support international languages in order to attract more customers, which makes life easier for both parties.
To store this customer data, the database must provide a mechanism for storing international characters. Storing these characters is not easy, and many database vendors have had to revise their strategies and come up with new mechanisms to support them. The big vendors, including Oracle, Microsoft, and IBM, provide international character support so that data can be stored and retrieved correctly, avoiding hiccups when doing business with international customers.
The difference in storing character data between Unicode and non-Unicode depends on whether the non-Unicode data uses a double-byte character set. All non-East Asian languages, plus Thai, store non-Unicode characters in single bytes, so storing these languages as Unicode uses twice the space that a non-Unicode code page does. On the other hand, the non-Unicode code pages of many other Asian languages store characters in double-byte character sets (DBCS), so for these languages there is almost no difference in storage between non-Unicode and Unicode.

Collation itself specifies the rules for how strings of character data are sorted and compared. The rules for sorting data vary depending on the language and locale.
For example, if you were to use a Lithuanian collation, the letter "Y" would sort between "I" and "J". And with the traditional Spanish collation, "ch" would be sorted at the end of the list of words beginning with "c".

You can specify collations at the following levels:
1. Creating or altering a database.
2. Creating or altering a table column.
You can specify a collation for each character string column using the COLLATE clause of the CREATE TABLE or ALTER TABLE statement. You can also specify a collation when you create a table using SQL Server Management Studio. If you do not specify a collation, the column is assigned the default collation of the database.
3. Casting the collation of an expression.
You can use the COLLATE clause to cast a character expression to a certain collation. Character literals and variables are assigned the default collation of the current database. Column references are assigned the collation defined for the column.
4. Restoring or attaching a database.
The default collation of the database and the collation of any char, varchar, and text columns or parameters are retained from the source database.

The COLLATE clause can be applied only for the char, varchar, text, nchar,
nvarchar, and ntext data types.
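
For example, here is a minimal T-SQL sketch (the table, column, and collation choices are illustrative) of specifying a collation at the column level and casting an expression to a different collation for a single query:

-- Column-level collation: names in this column sort using Lithuanian rules
CREATE TABLE dbo.Customers
(
    CustomerID   INT IDENTITY PRIMARY KEY,
    CustomerName NVARCHAR(100) COLLATE Lithuanian_CI_AS NOT NULL
);

-- Expression-level collation: force a traditional Spanish sort for one query only
SELECT CustomerName
FROM dbo.Customers
ORDER BY CustomerName COLLATE Traditional_Spanish_CI_AS;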

You can execute the system function fn_helpcollations to retrieve a list of all the valid collation names for Windows collations and SQL Server collations:
SELECT name, description
FROM fn_helpcollations();

Use of nchar, nvarchar, nvarchar(max), and ntext is the same as char, varchar, varchar(max), and text, respectively, except:
  • Unicode supports a wider range of characters.
  • More space is needed to store Unicode characters.
  • The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar.
  • Unicode constants are specified with a leading N, for example, N'A Unicode string'.

To accept Russian characters, the database collation can be set to Cyrillic_General_CI_AS. The CI stands for CASE INSENSITIVE and the AS stands for ACCENT SENSITIVE.
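
Putting the last two points together, a small sketch (the table and the Russian sample text are made up) of an nvarchar column using Cyrillic_General_CI_AS and a Unicode literal with the leading N:

-- nvarchar column with a Cyrillic collation; note the leading N on the Unicode literal
CREATE TABLE dbo.RussianNotes
(
    NoteID   INT IDENTITY PRIMARY KEY,
    NoteText NVARCHAR(200) COLLATE Cyrillic_General_CI_AS NOT NULL
);

INSERT INTO dbo.RussianNotes (NoteText)
VALUES (N'Привет, мир');  -- without the N prefix the literal is converted through the
                          -- database code page and the Russian characters can be lost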

April 01, 2013

Amazon RDS for Microsoft SQL Server

Amazon RDS frees you up to focus on application development by managing time-consuming database administration tasks including provisioning, backups, software patching, monitoring, and hardware scaling.

You can run Amazon RDS for SQL Server under two different licensing models – “License Included” and “License Mobility through Software Assurance (or Bring Your Own License – BYOL)”.
"License Included" pricing starts at $0.035 per hour and is inclusive of software, underlying hardware resources, and Amazon RDS management capabilities.

Microsoft’s License Mobility program allows customers who already own SQL Server licenses to run SQL Server deployments on Amazon RDS.

Amazon RDS for SQL Server DB Instances can be provisioned with either standard storage or Provisioned IOPS storage. Amazon RDS Provisioned IOPS is a storage option designed to deliver fast, predictable, and consistent I/O performance, and is optimized for I/O-intensive, transactional (OLTP) database workloads.
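
As a rough illustration only, provisioning such an instance can be scripted; the sketch below uses the current boto3 SDK (which post-dates this post), and every identifier, size, and credential in it is a placeholder:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# All identifiers and credentials are placeholders; 'sqlserver-se' = Standard Edition
rds.create_db_instance(
    DBInstanceIdentifier="example-sqlserver",
    DBInstanceClass="db.m5.large",
    Engine="sqlserver-se",
    LicenseModel="license-included",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    AllocatedStorage=200,      # GiB
    StorageType="io1",         # Provisioned IOPS storage
    Iops=2000,
)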

Amazon Web Services

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. Amazon Elastic Block Store (EBS) provides persistent storage to Amazon EC2 instances.


Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.

Amazon RDS is a web service that makes it easier to set up, operate, and scale a relational database in the cloud.

Amazon DynamoDB is a high performance non-relational database service that is easy to set up, operate, and scale. It is designed to address the core problems of database management, performance, scalability, and reliability. It also provides predictable high performance and low latency at scale.

Amazon SimpleDB is a web service providing the core database functions of data indexing and querying in the cloud.

Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable, hosted queue in the cloud.

Amazon Simple Email Service (Amazon SES) is a highly scalable and cost-effective bulk and transactional email-sending service for businesses and developers.

Amazon Glacier is an extremely low-cost storage service that provides secure and durable storage for data archiving and backup. It is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable.

Amazon CloudFront is a web service for content delivery. It delivers your content using a global network of edge locations and works seamlessly with Amazon S3 which durably stores the original, definitive versions of your files.

Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud. It is protocol-compliant with Memcached, so code, applications, and tools that you use today with your existing Memcached environments work seamlessly with the service.

Amazon CloudWatch is a web service that enables you to monitor your Amazon EC2 instances, Amazon EBS volumes, Elastic Load Balancers, and Amazon RDS database instances in real-time. You can also supply your own custom application metrics. With Amazon CloudWatch you can access up-to-the-minute statistics, view graphs, and set alarms for your metric data.

Amazon Virtual Private Cloud (Amazon VPC) is a secure and seamless bridge between a company's existing IT infrastructure and the AWS cloud.


Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.

AWS Import/Export transfers large amounts of data directly onto and off of storage devices using Amazon's high-speed internal network and bypassing the Internet.

March 02, 2013

Storing Log Data using MongoDB

This blog outlines the basic patterns and principles for using MongoDB as a persistent storage engine for log data from servers and other machine data. Servers generate a large number of events (i.e. logging) that contain useful information about their operation, including errors, warnings, and user behavior. By default, most servers store this data in plain-text log files on their local file systems. While plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without a holistic system for aggregating and storing them.

1. Schema Design
The schema for storing log data in MongoDB depends on the format of the event data that you're storing. The preferred approach is to extract the relevant information from the log data into individual fields in a MongoDB document. When you extract data from the log into fields, pay attention to the data types you use to render the log data into MongoDB. Using proper types for your data also increases query flexibility: if you store a date as a timestamp you can make date-range queries, whereas it's very difficult to compare two strings that represent dates. The same issue holds for numeric fields; storing numbers as strings requires more space and is difficult to query. When extracting data from logs and designing a schema, also consider what information you can omit from your log tracking system. In most cases there's no need to track all data from an event log, and you can omit the other fields.
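
As an illustration, here is a minimal PyMongo sketch (the field names and values are hypothetical) of storing one parsed web-server log line as a typed document, with the timestamp as a datetime and the numeric fields as integers rather than strings:

from datetime import datetime
from pymongo import MongoClient

# Hypothetical fields parsed out of one web-server log line
event = {
    "host": "127.0.0.1",
    "time": datetime(2013, 3, 2, 13, 55, 36),  # a real datetime enables date-range queries
    "path": "/apache_pb.gif",
    "request": "GET /apache_pb.gif HTTP/1.0",
    "status": 200,                              # int, not "200"
    "response_bytes": 2326,                     # int, not "2326"
}

db = MongoClient().logs
db.events.insert_one(event)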

2. System Architecture
Insertion speed is the primary performance concern for an event logging system. At the same time, the system must be able to support flexible queries so that you can return data from the system efficiently.
MongoDB has a configurable write concern. This capability allows you to balance the importance
of guaranteeing that all writes are fully recorded in the database with the speed of the insert.
For example, if you issue writes to MongoDB and do not require that the database issue any response, the write operations will return very fast (i.e. asynchronously), but you cannot be certain that all writes succeeded.
The following command will insert the event object into the events collection.
>>> db.events.insert(event, w=0)
By setting w=0, you do not require that MongoDB acknowledges receipt of the insert. Although very fast, this is risky
because the application cannot detect network and server failures. See write-concern for more information.

Conversely,if you require that MongoDB acknowledge every write operation, the database will not return as quickly but you can be certain that every item will be present in the database.
In this case, pass the w=1 argument as follows:
>>> db.events.insert(event, w=1)

Finally, if you have extremely low tolerance for event data loss, you can require that MongoDB replicate the data to multiple secondary replica set members before returning:
>>> db.events.insert(event, w='majority')
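
As a side note, newer versions of PyMongo set the write concern on the collection object instead of passing w to each insert; a rough equivalent of the unacknowledged and majority-acknowledged inserts above might look like this:

from pymongo import MongoClient, WriteConcern

db = MongoClient().logs
event = {"type": "example"}  # placeholder for the event document built earlier

# Fire-and-forget writes, roughly the w=0 case above
db.events.with_options(write_concern=WriteConcern(w=0)).insert_one(event)

# Wait until a majority of replica-set members acknowledge the write
db.events.with_options(write_concern=WriteConcern(w="majority")).insert_one(event)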

Sharding
Eventually your system’s events will exceed the capacity of a single event logging database instance. In these situations you will want to use a sharded cluster, which takes advantage of MongoDB’s sharding functionality.
In a sharded environment the limitations on the maximum insertion rate are:
• the number of shards in the cluster.
• the shard key you chose.
Because MongoDB distributes data using "ranges" (i.e. chunks) of shard-key values, the choice of shard key controls how MongoDB distributes data and the resulting system's capacity for writes and queries.
Shard key choices:
  • Shard by Time
  • Shard by a Semi-Random Key
  • Shard by an Evenly-Distributed Key in the Data Set
  • Shard by a Combination of a Natural and Synthetic Key
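
As a rough sketch of the setup itself (the database name, collection name, and shard-key fields below are illustrative), enabling sharding and sharding the events collection on a compound key from PyMongo, against a mongos router, might look like this:

from pymongo import MongoClient

# Connect to a mongos router of an existing sharded cluster (address is illustrative)
client = MongoClient("mongodb://mongos.example.com:27017")

# Enable sharding for the database, then shard the collection on a compound key
# that combines a natural key (the event source) with the event timestamp.
client.admin.command("enableSharding", "eventdb")
client.admin.command("shardCollection", "eventdb.events", key={"source": 1, "time": 1})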

Choosing a Mobile BI Solution


Whether they are helping remote employees manage supply chains more efficiently or keeping traveling executives informed of the latest financial developments, today's mobile ad hoc reporting solutions provide the dynamic capabilities organizations need to stay competitive and drive innovation in the field.
While working in the field used to mean relying on static data, today’s mobile BI solutions offer the ability to generate interactive reports with in-depth analytic functionality.


  • Solutions that provide a unified user experience across all devices are the most suitable for mobile BI.
  • Rather than relying on static data, users should be able to use real-time updates to inform their decisions.
  • A mobile BI solution should facilitate sharing reports, both over wireless networks and in person.
  • Users may need to access mobile BI solutions from remote locations where internet connectivity is weak or absent, or on a plane. While a lack of connectivity prohibits real-time updates, a good mobile BI offering should provide some form of reliable offline access to recent and saved reports so that employees can still tap into data-driven insights.


February 28, 2013

NoSQL Key-Value Store

Basic terminology:
  • Key-Value Store – data is stored in unstructured records consisting of a key + the values associated with that record
  • NoSQL – doesn't use SQL commands
Let's say you've got millions of data records, as you might have, for example, if you've got millions of users who visit your website. Each user has some information associated with them (which looks like a "row" in a database table). Note that not every user has the same information: some users will have a username, some will only have an email address, some users will have provided their name and others will not. Each record has a different length and different values.
To store this kind of data, you create a key for each record and then store whatever fields are available as bins (what would be columns in a structured database) — where each bin consists of a name and a value.  Then you create a bin for each piece of data you have.  If you don’t have a particular piece of data, you don’t have a blank field (like in a relational table), you simply don’t store a bin for that data.
This type of database is called a Key-Value Store because each record has a primary key and a collection of values (bins).  It’s also called a Row Store because all of the data for a single record is stored together, in something that we can think of conceptually as a row.

Example of unstructured data for user records:

Key: 1  ID: av  First Name: Avishkar
Key: 2  Email: avishkarm@gmail.com  Location: Mumbai  Age: 37
Key: 3  Facebook ID: avishkarmeshram  Password: xxx  Name: Avishkar

Data is organized into policy containers called ‘namespaces’, semantically similar to ‘databases’ in an RDBMS system. Namespaces are configured when the cluster is started, and are used to control retention and reliability requirements for a given set of data. 
Within a namespace, data is subdivided into ‘sets’ (similar to ‘tables’) and ‘records’ (similar to ‘rows’). Each record has an indexed ‘key’ that is unique in the set, and one or more named ‘bins’ (similar to columns) that hold values associated with the record.
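
As a purely conceptual sketch in plain Python (not tied to any particular key-value store), the three user records above can be modeled as records keyed by primary key, where each record carries only the bins it actually has:

# Conceptual sketch only: a namespace contains sets, a set contains records keyed by
# a primary key, and each record stores just the bins (name/value pairs) it has.
users = {                                        # the 'users' set
    1: {"ID": "av", "First Name": "Avishkar"},
    2: {"Email": "avishkarm@gmail.com", "Location": "Mumbai", "Age": 37},
    3: {"Facebook ID": "avishkarmeshram", "Password": "xxx", "Name": "Avishkar"},
}
namespace = {"users": users}                     # a namespace is a container of sets

# A record that lacks a bin simply has no entry for it; there is no blank column.
print(namespace["users"][2].get("First Name"))   # -> None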




Indexes (primary keys) are stored in DRAM for ultra-fast access, and values can be stored either in DRAM or, more cost-effectively, on SSDs. Each namespace can be configured separately, so small namespaces can take advantage of DRAM while larger ones gain the cost advantage of SSDs.

February 20, 2013

Hadoop Architecture and its Usage at Facebook

Lots of data is generated on Facebook
– 300+ million active users 
– 30 million users update their statuses at least once each day
– More than 1 billion photos uploaded each month 
– More than 10 million videos uploaded each month 
– More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

Data Usage
Statistics per day:
– 4 TB of compressed new data added per day
– 135TB of compressed data scanned per day
– 7500+ Hive jobs on production cluster per day
– 80K compute hours per day
Barrier to entry is significantly reduced:
– New engineers go through a Hive training session
– ~200 people/month run jobs on Hadoop/Hive
– Analysts (non-engineers) use Hadoop through Hive

Where is this data stored?
Hadoop/Hive Warehouse
– 4800 cores, 5.5 PetaBytes
– 12 TB per node
– Two level network topology
1 Gbit/sec from node to rack switch
4 Gbit/sec to top level rack switch

Data Flow into Hadoop Cloud
(diagram not included)

Move old data to cheap storage

Implementing a Left Outer Join in Map Reduce

The Problem:

I have two datasets:
  1. User information (id, mobile,  location)
  2. Transaction information (transaction-id, car-id, user-id, CarBookingDate)
Given these data sets, I want to find the number of unique locations in which each car has been sold.

 One Solution

  1. For each transaction, look up the user record for the transaction’s user-Id
  2. Join the user records to each transaction
  3. Create a mapping from car-id to a list of locations
  4. Count the number of distinct locations per car-id.

The Map Reduce Solution

First off, the problem requires that we write a two stage map-reduce:
  1. Join users onto transactions and emit a set of car-location pairs (one for each transaction)
  2. For each car sold, count the # of distinct locations it was sold in

STAGE 1

We're basically building a left outer join with map reduce.
  • transaction map task outputs (K,V) with K = userId and V = carId
  • user map task outputs (K,V) with K = userId and V = location
  • reducer receives both the user's location and the carIds for that userId, and outputs (K,V) with K = carId and V = location

STAGE 2

  • map task is an identity mapper, outputs (K,V) with K = carId and V = location
  • reducer counts the number of unique locations it sees per carId, outputs (K,V) with K = carId and V = # distinct locations
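
To make stage 1 concrete, here is a minimal Hadoop Streaming sketch in Python. It assumes tab-separated input where user records have three fields (id, mobile, location) and transaction records have four (transaction-id, car-id, user-id, CarBookingDate); the file layout, the field order, and the trick of telling the two record types apart by field count are all simplifications for illustration.

stage1_mapper.py (tags each record with its type, keyed by userId):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:            # user record: id, mobile, location
        user_id, _mobile, location = fields
        print("%s\tU\t%s" % (user_id, location))
    elif len(fields) == 4:          # transaction: txn-id, car-id, user-id, booking date
        _txn_id, car_id, user_id, _date = fields
        print("%s\tT\t%s" % (user_id, car_id))

stage1_reducer.py (reduce-side left outer join on userId, emits carId and location):

#!/usr/bin/env python
import sys
from itertools import groupby

def records(stdin):
    for line in stdin:
        user_id, tag, value = line.rstrip("\n").split("\t")
        yield user_id, tag, value

for user_id, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    location = None
    car_ids = []
    for _uid, tag, value in group:
        if tag == "U":
            location = value
        else:
            car_ids.append(value)
    for car_id in car_ids:
        # left outer join: keep the transaction even when no user record matched
        print("%s\t%s" % (car_id, location if location is not None else "UNKNOWN"))

Stage 2 then counts the distinct locations per carId, exactly as described above.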

February 19, 2013

What's the big deal about Hadoop?

Hadoop has advantages over traditional database management systems, especially the ability to handle both structured data like that found in relational databases, say, as well as unstructured information such as video -- and lots of it. The system can also scale up with a minimum of fuss and bother.
A growing number of firms are using Hadoop and related technologies such as Hive, Pig, and HBase to analyze data in ways that cannot easily or affordably be done using traditional relational database technologies.
JPMorgan Chase, for instance, is using Hadoop to improve fraud detection, IT risk management, and self service applications. The financial services firm is also using the technology to enable a far more comprehensive view of its customers than was possible previously, executives said.
Meanwhile, Ebay is using Hadoop technology and the Hbase open source database to build a new search engine for its auction site. The auction site is revamping its core search engine technology using Hadoop and Hbase, a technology that enables real-time analysis of data in Hadoop environments.
The new eBay search engine, code-named Cassini, will replace the Voyager technology that's been used since the early 2000s. The update is needed in part due to surging volumes of data that needs to be managed.  Cassini will deliver more accurate and more context-based results to user search queries.

What is Hadoop used for?
Search
– Yahoo, Amazon, Zvents
Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm
Recommendation Systems
– Facebook
Data Warehouse
– Facebook, AOL
Video and Image Analysis
– New York Times, Eyealike


Goals of HDFS

Very Large Distributed File System
– 10K nodes, 100 million files, 10 - 100 PB 

Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them

Optimized for Batch Processing
– Data locations exposed so that computations can move to
where data resides
– Provides very high aggregate bandwidth 

User Space, runs on heterogeneous OS  

February 15, 2013

Scaling Out SQL Server

Scalability is the ability of an application to efficiently use more resources in order to do more useful work. For example, an application that can service four users on a single-processor system may be able to service 15 users on a four-processor system. In this case, the application is scalable. If adding more processors doesn't increase the number of users serviced (if the application is single threaded, for example), the application isn't scalable.

There are two kinds of scalability: scaleup and scaleout.
Scaleup means scaling to a bigger, more powerful server—going from a four-processor server to a 64-processor or 128-processor server, for example. This is the most common way for databases to scale. When your database runs out of resources on your current hardware, you go out and buy a bigger box with more processors and more memory. Scaleup has the advantage of not requiring significant changes to the database. In general, you just install your database on a bigger box and keep running the way you always have, with more database power to handle a heavier load.  

Scaleout means expanding to multiple servers rather than a single, bigger server. Scaleout usually has some initial hardware cost advantages: eight four-processor servers generally cost less than one 32-processor server. Scaleout means separating or partitioning the database system so that you can place those parts on separate database servers. This allows you to spread processing power across as many servers as necessary to accommodate expanding growth. However, the additional flexibility brings additional complexity. A scaleout database scenario is not a particularly easy one to design or administer. You must answer many difficult business and technology-driven questions before you can successfully implement a scaleout of a database system.
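
One common SQL Server scaleout technique is a distributed partitioned view over linked servers; the sketch below is illustrative only (the server, database, and table names and the partition ranges are made up):

-- On each member server, one table holds one range of the partitioning column,
-- enforced by a CHECK constraint:
CREATE TABLE dbo.Customers_1
(
    CustomerID INT NOT NULL PRIMARY KEY
        CHECK (CustomerID BETWEEN 1 AND 1000000),
    CustomerName NVARCHAR(100) NOT NULL
);

-- On the server that fronts the data, the member tables (reached through linked
-- servers) are exposed as a single view:
CREATE VIEW dbo.Customers
AS
SELECT CustomerID, CustomerName FROM Server1.SalesDB.dbo.Customers_1
UNION ALL
SELECT CustomerID, CustomerName FROM Server2.SalesDB.dbo.Customers_2;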

There is no rule of thumb for choosing between scaleup and scaleout. If the cost of bigger hardware is less than the extra licensing and maintenance costs of running several servers, scaleup is the better option. If availability matters more (when one machine out of your N machines fails, the system is still up and running, and hardware, OS, and software updates or upgrades can be rolled out one machine at a time), scaleout is better than scaleup.

February 07, 2013

What is Big Data

What is Big Data?

Big Data is a massive collection of data produced by multiple traffic sources which is constantly being updated – the very nature of Big Data means it’s complex and almost impossible to even get a handle on in the first place, let alone break down, assess and produce tangible results and recommendations that companies can learn from.
With Big Data, traditional web analytics is just the tip of the iceberg. We still need to know what traffic we're getting, where it's coming from, and which journeys customers take when they arrive on the site, but in order to run a successful eCommerce store, we also need to take into account and learn from other data that is out of our control and not necessarily ours to "own".

Big Data is a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.

The "structured" portion of Big Data refers to fixed fields within a database. For ecommerce merchants, this could be customer data — address, zip code — that's stored in a shopping cart.

The "unstructured" part encompasses email, video, tweets, and Facebook Likes. None of the unstructured data resides in a fixed database that's accessible to merchants: product reviews, social media data, and images are things you know are out there and relate to your business, but that you can't necessarily get hold of. Even so, the feedback from, say, social media has become a very useful research tool for businesses.



