August 21, 2014

SQL Server 2014



SQL Server 2014

SQL Server 2014 delivers comprehensive in-memory technologies for OLTP (code-named “Hekaton”), data warehousing, and analytics built directly into SQL Server. It also provides new hybrid solutions for cloud backup and disaster recovery, and takes advantage of new capabilities in Windows Server 2012 R2 to deliver enterprise-class availability and scalability with predictable performance and reduced infrastructure costs.
Benefits
  • In-Memory OLTP: Provides in-memory OLTP capabilities built into the core SQL Server database engine to significantly improve the transactional speed and throughput of your database application. Microsoft claims SQL Server 2014 can make database operations as much as 30 times faster. To use it, a database must have a memory-optimized filegroup, and individual tables must be declared as memory-optimized. The resulting table can be used as a conventional database table or as a substitute for a temporary table (a table-creation sketch appears at the end of this benefits list).
  • In-Memory Updateable ColumnStore: 
    Use the columnstore index to achieve up to 10x query performance gains over traditional row-oriented storage, and up to 7x data compression relative to the uncompressed data size. It provides faster load speed, better query performance and concurrency, and a lower price per terabyte. Because the clustered columnstore index is updateable, your workload can perform large numbers of insert, update, and delete operations. Columnstore indexes work well for mostly read-only queries that perform analysis on large data sets (a creation sketch follows the benefits list below). 
 

Columnstore Index benefits:
    • Columns often have similar data, which results in high compression rates.
    • High compression rates improve query performance by using a smaller in-memory footprint. In turn, query performance can improve because SQL Server can perform more query and data operations in-memory.
    • A new query execution mechanism called batch-mode execution has been added to SQL Server that reduces CPU usage by a large amount. Batch-mode execution is closely integrated with, and optimized around, the columnstore storage format. Batch-mode execution is sometimes known as vector-based or vectorized execution.
    • Queries often select only a few columns from a table, which reduces total I/O from the physical media.
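A minimal sketch of how a clustered columnstore index is set up (the table and index names below are hypothetical): the index is created directly on the table and becomes its storage format.

CREATE TABLE dbo.FactSales
(
    SaleId    BIGINT        NOT NULL,
    ProductId INT           NOT NULL,
    SaleDate  DATE          NOT NULL,
    Quantity  INT           NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL
);

-- Convert the table to columnstore storage; it remains fully updateable in SQL Server 2014
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;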
  • Extending Memory to SSDs: Seamlessly and transparently integrates solid-state storage into SQL Server by using SSDs as an extension to the database buffer pool, allowing more in-memory processing and reducing disk IO.
The Buffer Pool Extension (BPE) lets you extend available memory with a hybrid buffer pool, helping reduce I/O bottlenecks whenever an SSD is available.
The hybrid buffer pool consists of the existing in-memory buffer pool plus an extension that resides on nonvolatile storage (the SSD). The buffer manager still decides which pages can be evicted from the buffer pool, and it still fetches pages from disk when a request needs pages that are not in memory. In addition, it now identifies “clean pages” and migrates them out of the in-memory buffer pool onto the BPE, allowing even more data to stay cached overall (a configuration sketch follows the list below). This provides the following benefits to your SQL Server environment:
o    Increased random I/O throughput
o    Reduced I/O latency
o    Increased transactional throughput
o    Improved read performance with a larger hybrid buffer pool
o    A caching architecture that can take advantage of present and future low-cost memory drives
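A minimal configuration sketch, assuming a local SSD volume is available (the file path and size below are placeholders):

-- Enable the buffer pool extension on an SSD volume
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'F:\SSDCACHE\BufferPoolExtension.bpe', SIZE = 50 GB);

-- Verify the current state and size of the extension
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration;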

  • Enhanced High Availability
    i) New AlwaysOn features: Availability Groups now support up to 8 secondary replicas that remain available for reads at all times, even in the presence of network failures. Failover Cluster Instances now support Windows Cluster Shared Volumes, improving the utilization of shared storage and increasing failover resiliency.
    ii) Improved Online Database Operations: Includes single partition online index rebuild and managing lock priority for table partition switch, reducing maintenance downtime impact.
  • Encrypted Backup: Provides encryption support for backups, on-premise and on Microsoft Azure.
  • IO Resource Governance: Resource pools now support configuration of minimum and maximum IOPS per volume, enabling more comprehensive resource isolation controls.
  • Hybrid Scenarios:
o    SQL Server Backup to Azure: Provides management and automation of SQL Server backups (from on-premise and Microsoft Azure) to Microsoft Azure storage.
o    AlwaysOn with Azure Secondaries: Easily add replicas in Microsoft Azure to on-premise Availability Groups.
o    SQL XI (XStore Integration): Supports SQL Server Database files (from on-premise and Microsoft Azure) on Microsoft Azure Storage blobs.
o    Deployment Wizard: Easily deploy on-premise SQL Server databases to Microsoft Azure.
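To make the In-Memory OLTP benefit above concrete, here is a minimal sketch of adding a memory-optimized filegroup and declaring a memory-optimized table; the database name, file path, table name, and bucket count are hypothetical values:

-- Add a memory-optimized filegroup and container to an existing database
ALTER DATABASE SalesDB
    ADD FILEGROUP SalesDB_MemOpt CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE SalesDB
    ADD FILE (NAME = 'SalesDB_MemOpt_Container', FILENAME = 'C:\Data\SalesDB_MemOpt')
    TO FILEGROUP SalesDB_MemOpt;
GO

-- Create a durable memory-optimized table
CREATE TABLE dbo.ShoppingCart
(
    CartId      INT       NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    UserId      INT       NOT NULL,
    CreatedDate DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);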

August 13, 2014

QLIKVIEW ON AMAZON EC2 and MICROSOFT AZURE



QLIKVIEW ON AMAZON EC2
The EC2 Cloud Compute offering allows customers to create virtual machines in the cloud on demand and pay for the length of time each machine runs. Customers can create and manage instances, along with their configuration and security, using the browser-based EC2 Management Console. To use QlikView, a customer would create a Windows instance with sufficient resources, install QlikView onto that Windows host, and make it available through the EC2 firewall.

QLIKVIEW ON MICROSOFT AZURE
Microsoft offers a Platform as a Service through its Azure product. While Azure does provide a hosted virtual machine capability, its core is built around the dynamic hosting of web applications and SQL databases. The intention is that the customer doesn’t have to touch the Windows operating system that runs either the application or the database.
A web role is intended to host a web-based application within IIS (Internet Information Services) on a Windows Server. To package QlikView into a web role, the package must run the installer silently and then use the management API to license and configure the server. This style of deployment is well suited to deploying a configurable, packaged QlikView environment, and so may appeal most to QlikView partners wanting to offer such a service to their customers. Once packaged, a fully functional QlikView Server can be made available within 10 minutes, and the same package can be used to deploy multiple servers with different configurations. The Azure platform also provides a number of features under its Access Control Service (ACS), which allows customers to integrate applications with standards-based identity providers, including enterprise directories such as Active Directory and web identities such as Windows Live ID, Google, Yahoo!, and Facebook. Microsoft provides a range of examples as well.

Qlikview as PaaS, IaaS and SaaS Service



Qlikview as PaaS, IaaS and SaaS Service

QlikView is an in-memory software platform, so speed of access to RAM is a key factor in delivering good performance to users. Similarly, the availability of CPU power to perform aggregations on the fly and produce content for users is also key. Virtualization technologies place some restrictions on the speed of access to RAM compared to physical servers and can also limit the number of CPUs available to a single machine; this is reflected in the maximum RAM and CPU sizes that cloud servers can offer. This can mean that not all QlikView deployments are suited to being virtualized, so it is important to size a deployment of QlikView when considering use of the cloud.
With QlikView this would mean all users view the same QlikView apps, but the data they see within those applications may differ.




QLIKVIEW AND IAAS
IaaS is a common cloud model used by QlikView customers; indeed QlikTech’s own demo.qlikview.com and other services are run from the IaaS model offered by Amazon EC2. This model allows QlikView customers to deploy infrastructure suitable for their environment and have full control over what QlikView applications they host in the environment and how they configure it.

With QlikView in a PaaS model, a customer could implement QlikView on a hosted PaaS vendor’s technology. For this type of implementation, QlikView requires a Windows platform to run on. One such PaaS option is Microsoft’s Windows Azure. It is possible to package the QlikView product along with customer- or partner-built apps and deploy it onto the Azure platform, freeing customers from infrastructure (i.e. hardware, network and OS) management.

End user access to QlikView hosted on Cloud:
In this kind of deployment there is no customization required for QlikView, just simple configuration; however, it does require that the customer has in place an SSO system that can make use of a reverse proxy. In this type of setup, the only device that ever physically touches the cloud server is the reverse proxy. No users ever directly touch the cloud server or even know that it is a cloud server. This mechanism can serve both internal and external users and is an excellent approach for conforming to what is often the standard way of integrating security within an organization.



SAML (Security Assertion Markup Language) and federated security systems such as Active Directory Federation Services can be used; these systems make use of secure tokens to sign users into a range of systems, as described below.


With Active Directory Federation Services (ADFS) it is possible for users to seamlessly log in to a non-domain cloud server without being prompted to log in, because their internal Windows credentials are used during the sign-in process. This gives an excellent user experience, and ADFS is often implemented within organizations alongside their regular Active Directory, which means there is no requirement for an additional SSO product or a separate set of users.



In some cases there is no security system to integrate with beyond perhaps a known list of user IDs and passwords. In these cases a straightforward login page can be created, in a similar way to the method previously mentioned, to ask for and validate a user ID and password before logging the user into QlikView. The user directory could be almost anything: Active Directory, LDAP, an SSO system, a database (SQL Server, Oracle, etc.), or a web-based identity such as Windows Live, Google accounts, or Facebook login.

August 12, 2014

Data Migration from SQL Server to SQL Azure

Data Migration from SQL Server to SQL Azure
You can transfer data to Microsoft Azure SQL Database by using the following:
  • SQL Server 2008 Integration Services (SSIS) 
  • The bulk copy utility (BCP.exe)
  • System.Data.SqlClient.SqlBulkCopy class
  • Scripts that use INSERT statements to load data into the database
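For example, the bulk copy utility can push a table’s data into Azure SQL Database over a standard connection; the server, database, table, data file, and credential values below are placeholders:

bcp MyAzureDb.dbo.Customers in customers.dat -c -S tcp:myserver.database.windows.net -U myadmin@myserver -P <password>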
Microsoft Azure SQL Database does not support:
  • The RESTORE statement.
  • Attaching a database to the Azure SQL Database server.

Azure SQL Database General Guidelines and Limitations

Azure SQL Database General Guidelines and Limitations

Please perform the following checks when you create a database object script using the "Generate Script" wizard (a before-and-after example follows the list).

1. Delete all instances of SET ANSI_NULLS ON.

2. SQL Azure hides all hardware-specific access and information. There is no concept of PRIMARY or file groups because disk space is handled by Microsoft, so this option isn’t required.

Delete all instances of ON [PRIMARY].

3. Delete all instances of PAD_INDEX = OFF as well as ALLOW_ROW_LOCKS = ON and ALLOW_PAGE_LOCKS = ON.

4. The NEWSEQUENTIALID() function isn’t supported in SQL Azure because there is no CLR support in SQL Azure, and thus all CLR-based types aren’t supported; the NEWSEQUENTIALID() return value is one of those types. In the Users table, modify the rowguid column, changing DEFAULT NEWSEQUENTIALID() to NULL.

5. In the stored procedure, remove the ENCRYPTION clause.

6. SQL Azure currently supports only clustered tables. SQL Azure doesn’t support heap tables. A heap table is one without a clustered index.

7. Microsoft Azure SQL Database supports only SQL Server Authentication. Windows Authentication (integrated security) is not supported. Users must provide credentials (login and password) every time they connect to Microsoft Azure SQL Database.

8. Microsoft Azure SQL Database does not support distributed transactions, which are transactions that affect several resources.

9. Microsoft Azure SQL Database does not allow setting the collation at the server level.

10. Microsoft Azure SQL Database provides two database editions: Web Edition and Business Edition. Web Edition databases can grow up to a size of 5 GB and Business Edition databases can grow up to a size of 150 GB.
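As a small illustration of points 2, 3, and 6 above (the table and column names are hypothetical), a script generated for on-premise SQL Server might be trimmed like this before running it against SQL Azure:

-- Generated for on-premise SQL Server:
CREATE TABLE dbo.Users
(
    UserId   INT          NOT NULL,
    UserName NVARCHAR(50) NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (UserId)
        WITH (PAD_INDEX = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY];

-- Adjusted for SQL Azure: keep the clustered primary key (heaps are not supported),
-- and remove the filegroup references and the unsupported index options
CREATE TABLE dbo.Users
(
    UserId   INT          NOT NULL,
    UserName NVARCHAR(50) NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (UserId)
);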

Multiple Ways to Improve SQL Query Performance

Multiple Ways to Improve SQL Query Performance

1. Re-writing SQL Query
Re-write the SQL query following best practices to get the fastest response, for example avoiding queries with many joins (a JOIN involving more than 4 tables). (Please refer to the nested loops join post: http://avishkarm.blogspot.in/2014/08/what-is-nested-loops-join.html)
Try to remove cursors from the query and use a set-based query instead.
Run a non-correlated scalar subquery as a separate query instead of as part of the main query, and store its output in a variable (see the sketch below).
Avoid multi-statement TVFs; they are more costly than inline TVFs.
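A minimal sketch of the scalar-subquery suggestion above (table and column names are hypothetical):

-- Instead of embedding a non-correlated scalar subquery in the main query...
SELECT o.OrderId, o.Amount
FROM dbo.Orders AS o
WHERE o.Amount > (SELECT AVG(Amount) FROM dbo.Orders);

-- ...run it once, store the result in a variable, and use the variable in the main query
DECLARE @AvgAmount DECIMAL(18, 2);
SELECT @AvgAmount = AVG(Amount) FROM dbo.Orders;

SELECT o.OrderId, o.Amount
FROM dbo.Orders AS o
WHERE o.Amount > @AvgAmount;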

2. Indexes
An index can dramatically reduce data retrieval time, improve SQL query performance, and give you the best query response time.
Keep in mind that a clustered index should be created on a unique column.
A non-clustered index is most useful if the ratio of qualifying rows to total rows is around 5% or less, which means the index can eliminate 95% of the rows from consideration. As a best practice, the most selective columns should be placed leftmost in the key of a non-clustered index (see the sketch below).
Dropping unused indexes can help speed up data modifications without affecting data retrieval.
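A short sketch of both points (table, column, and index names are hypothetical):

-- Place the most selective column first in the non-clustered index key
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_OrderDate
    ON dbo.Orders (CustomerId, OrderDate);

-- Candidate unused indexes: maintained on writes but never read since the last restart
SELECT OBJECT_NAME(s.object_id) AS table_name, i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks + s.user_scans + s.user_lookups = 0;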

3. Statistics Creation and Updates
The optimizer relies on up-to-date statistics to estimate row counts and choose good plans. Keep AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS enabled, and refresh statistics manually after large data changes, as sketched below.
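A minimal sketch of a manual statistics refresh (the table name is hypothetical):

-- Refresh statistics on a single table after a large data load
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Or refresh out-of-date statistics across the current database
EXEC sp_updatestats;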

4. Revisit Your Schema Definitions
Having the right constraints (FOREIGN KEY, NOT NULL and CHECK constraints) in the right place always helps to improve query performance; for example, a FOREIGN KEY constraint helps to simplify joins by converting some outer or semi-joins to inner joins, and a CHECK constraint also helps a bit by removing unnecessary or redundant predicates (a short sketch follows).
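A short sketch of declaring such constraints (all names are hypothetical) so the optimizer can take advantage of them:

-- A FOREIGN KEY lets the optimizer convert some outer or semi-joins to inner joins
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId);

-- A CHECK constraint lets the optimizer discard predicates that can never be true
ALTER TABLE dbo.Orders
    ADD CONSTRAINT CK_Orders_Amount CHECK (Amount >= 0);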

What is nested loops join?

What is nested loops join?

The nested loops join uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.
In the simplest case, the search scans an entire table or index; this is called a naive nested loops join. If the search exploits an index, it is called an index nested loops join. If the index is built as part of the query plan (and destroyed upon completion of the query), it is called a temporary index nested loops join. All these variants are considered by the query optimizer.
A nested loops join is particularly effective if the outer input is small and the inner input is preindexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loops joins are superior to both merge joins and hash joins. In large queries, however, nested loops joins are often not the optimal choice.

SELECT *
FROM customer
JOIN orders ON customer.customer_num = orders.customer_num
WHERE order_date > '1997-01-01';
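Whether the optimizer chose a nested loops join is visible in the execution plan; for experimentation, a nested loops join can also be requested explicitly with a query hint (same hypothetical tables as above):

SELECT *
FROM customer
JOIN orders ON customer.customer_num = orders.customer_num
WHERE order_date > '1997-01-01'
OPTION (LOOP JOIN);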
 

Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
...