December 24, 2011

MySQL database optimization

1. One of the first places to look to improve performance is queries, particularly the ones that run often.
2. Big gains can be achieved by analyzing these queries and rewriting them efficiently. You can use MySQL's slow query log to get an idea of which queries need to be fine-tuned.
3. If query optimization is the first goal, the first thing to try is indexes. If a column is involved in searching (WHERE clause), grouping (GROUP BY clause), sorting (ORDER BY clause) or JOIN operations, creating an index on that column is likely to result in a performance gain (see the sketch after this list).
4. Another factor in creating an index is cardinality (i.e. the number of unique values); the higher the cardinality, the greater the chance that MySQL uses the index in queries.
5. Indexes speed up SELECT queries but slow down DELETE, INSERT and UPDATE queries, so create indexes on those columns which are mostly used in SELECT queries.
6. Create a scheduled job to rebuild indexes.
7. If it does not affect business operations, schedule a weekly server restart at a time when no users are connected to the system.
8. Run the ANALYZE TABLE command frequently (schedule it weekly).
9. Run the OPTIMIZE TABLE command frequently (schedule it weekly).
10. SCHEMA OPTIMIZATION: Optimize data types and use denormalization if required.
11. Design your tables to minimize their space on the disk. This can result in huge improvements by reducing the amount of data written to and read from disk. Smaller tables normally require less main memory while their contents are being actively processed during query execution. Any space reduction for table data also results in smaller indexes that can be processed faster.
12. If your application makes several database requests to perform related updates, combining the statements into a stored PROCEDURE can help performance. Similarly, if your application computes a single result based on several column values or large volumes of data, combining the computation into a UDF (user-defined function) can help performance.
13. Disk seeks are a huge performance bottleneck. This problem becomes more apparent when the amount of data starts to grow so large that effective caching becomes impossible. If you hit this issue, increase the number of available disk spindles (and thereby reduce the seek overhead) by either symlinking files to different disks or striping the disks.
14. Check whether you need table partitioning (horizontal, row-based partitioning).
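
As a minimal sketch of items 2, 3, 8 and 9 above (MySQL 5.5 syntax; the orders table and customer_id column are illustrative assumptions, not from the original post):

-- Log statements slower than 2 seconds to the slow query log (requires sufficient privileges).
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 2;

-- Index a column used in WHERE / JOIN / ORDER BY.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Verify the index is actually used.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Refresh index statistics and defragment / reclaim space (schedule weekly).
ANALYZE TABLE orders;
OPTIMIZE TABLE orders;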

For more information, please use the following links.
http://dev.mysql.com/doc/refman/5.5/en/optimization.html
http://docs.oracle.com/cd/E17952_01/refman-5.5-en/mysql-tips.html

November 09, 2011

IDENT_CURRENT,@@IDENTITY and SCOPE_IDENTITY

IDENT_CURRENT returns the last identity value generated for a specific table in any session and any scope.

@@IDENTITY returns the last identity value generated for any table in the current session, across all scopes.

SCOPE_IDENTITY returns the last identity value generated for any table in the current session and the current scope.

An INSERT trigger is defined on T1. When a row is inserted to T1, the trigger fires and inserts a row in T2. This scenario illustrates two scopes: the insert on T1, and the insert on T2 by the trigger.

Assuming that both T1 and T2 have identity columns, @@IDENTITY and SCOPE_IDENTITY will return different values at the end of an INSERT statement on T1. @@IDENTITY will return the last identity column value inserted across any scope in the current session. This is the value inserted in T2. SCOPE_IDENTITY() will return the IDENTITY value inserted in T1. This was the last insert that occurred in the same scope. The SCOPE_IDENTITY() function will return the null value if the function is invoked before any INSERT statements into an identity column occur in the scope.

Failed statements and transactions can change the current identity for a table and create gaps in the identity column values. The identity value is never rolled back even though the transaction that tried to insert the value into the table is not committed. For example, if an INSERT statement fails because of an IGNORE_DUP_KEY violation, the current identity value for the table is still incremented.

Be cautious about using IDENT_CURRENT to predict the next generated identity value. The actual generated value may be different from IDENT_CURRENT plus IDENT_INCR because of insertions performed by other sessions.


CREATE TABLE T1 (
ID int IDENTITY(1,1) PRIMARY KEY,
Name varchar(20) NOT NULL
)

INSERT T1
VALUES ('A')
INSERT T1
VALUES ('B')
INSERT T1
VALUES ('C')

CREATE TABLE T2 (
T2ID int IDENTITY(1,1) PRIMARY KEY,
ID int
)

CREATE TRIGGER Trig_T1
ON T1
FOR INSERT AS
BEGIN
INSERT INTO T2
SELECT ID FROM inserted
END

-- Fire the trigger and determine what identity values you obtain with the @@IDENTITY, IDENT_CURRENT and SCOPE_IDENTITY functions.
INSERT T1 VALUES ('Rosalie')

-- Returns the identity value from the insert into T1 (the last insert in the current scope; the trigger ran in a different scope).
SELECT SCOPE_IDENTITY() AS [SCOPE_IDENTITY]
GO
-- Returns the identity value from the insert into T2, performed by the trigger (the last insert in the current session, across all scopes).
SELECT @@IDENTITY AS [@@IDENTITY]
GO
-- Returns the last identity value generated for the named table, regardless of session or scope.
SELECT IDENT_CURRENT ('T1') AS T1_Identity;
GO
SELECT IDENT_CURRENT ('T2') AS T2_Identity;

November 07, 2011

Hive - Data Warehousing & Analytics on Hadoop

Hive is an open source, petabyte-scale data warehousing framework built on top of Hadoop, developed by the Data Infrastructure Team at Facebook. It facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.


Hive architecture:
Hive organizes data in tables and partitions. A good partitioning scheme allows Hive to prune data while processing a query, and that has a direct impact on how fast a query result can be produced. Behind the scenes, Hive stores partitions and tables as directories in the Hadoop Distributed File System (HDFS).
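
As a minimal HiveQL sketch of this partitioning idea (the page_views table and its columns are illustrative assumptions, not from the original post):

-- Partition the table by date so queries that filter on dt only scan the matching directories.
CREATE TABLE page_views (
  user_id BIGINT,
  url STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hive prunes every partition except dt='2011-11-07' when answering this query.
SELECT COUNT(*) FROM page_views WHERE dt = '2011-11-07';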


Hive comprises the following major components:

- Metastore: stores the metadata.
- Query compiler and execution engine: converts SQL queries into a sequence of map/reduce jobs that are then executed on Hadoop.
- SerDe and ObjectInspectors: programmable interfaces and implementations of common data formats and types.
- UDF and UDAF: programmable interfaces and implementations for user-defined functions (scalar and aggregate functions).
- Clients: a command line client similar to the MySQL command line, and a web UI.

Data Flow into Hadoop Cloud: [diagram not reproduced here]
For more Information:
http://www.vldb.org/pvldb/2/vldb09-938.pdf

October 12, 2011

Business Analytics: Good Data and Poor Data

Today, most organizations use data in two ways:
Transactional/Operational use (“running the business”), and Analytic use (“improving the business”).

Good business demands good data.
Business analytics can provide amazing insights into how an organization is operating -- in hindsight, with insight and with foresight. But one must be attentive to the quality of the data being analyzed and put first things first. Step one is to check the validity of the data, ensure its quality and completeness. Step two is to ask those key questions that help provide the information needed to make informed decisions. Internal auditors, armed with analytic technologies of their own, can provide a huge amount of assistance in determining data quality and addressing the risk of drawing incorrect conclusions based on bad data.

It is of great value to any enterprise risk management program to incorporate a program that includes processes for assessing, measuring, reporting, reacting to, and controlling different aspects of risks associated with poor data quality.

Data quality is a critical prerequisite to effective business analytics. Poor data quality jeopardizes the performance and efficiency of operational systems. It undermines the value of analytic and business intelligence systems upon which organizations rely to make key decisions. Decisions based on poor data can result in direct financial loss.

Business leaders need to pay serious attention to the accuracy, quality and reliability of their data. The most obvious cause of poor data quality is data entry. If an organization has no standards or IT controls for how data is entered into a system, the data will quickly reflect that lack of standards. This is how duplicate entries end up in master data.

While Business Intelligence tools can create beautiful and compelling dashboards and graphics, if the data that the tools rely on is of poor quality, their results are meaningless, or at least potentially badly flawed.
Data cleansing and data quality are important in order to ensure quality results from business intelligence analytics.

You can never fully predict how business users will want to analyze their data, so give them complete freedom to drill down in any direction they choose. Deliver them something good in a week as opposed to something great in six months. Good data is attained by integrating multiple data sources, deriving a 'single version of the truth,' and putting that good data (and unstructured content) into a data warehouse where the BA/BI tools can perform their magic. DDD (data-driven decision making) begins and ends with good data. Analytics applications that nicely present dashboards, scorecards, historical trends, predictive analysis, and actionable insights can all benefit from good data. Good data begins with data integration, data quality, and a good data warehouse.

September 27, 2011




A Whole New Mind: Why Right-Brainers Will Rule the Future charts the rise of right-brain thinking in modern economies and describes the six abilities individuals and organizations must master in an outsourced, automated age.

Daniel Pink references three prevailing trends pointing towards the future of business and the economy:
1. Abundance (consumers have too many choices, nothing is scarce)
2. Asia (everything that can be outsourced, is)
3. Automation (computerization, robots, technology, processes).

This brings up three crucial questions for the success of any business:

1. Can a computer do it faster?
2. Is what I'm offering in demand in an age of abundance?
3. Can someone overseas do it cheaper?

When these questions are present, creativity becomes the competitive difference that can differentiate commodities.
The future belongs to a different kind of person with a different kind of mind: artists, inventors, storytellers, and other creative and holistic "right-brain" thinkers whose abilities mark the fault line between who gets ahead and who doesn't. Drawing on research from around the world, Pink outlines the six fundamentally human abilities that are absolute essentials for professional success and personal fulfillment, and reveals how to master them:

1. Design - Moving beyond function to engage the senses.
2. Story - Narrative added to products and services - not just argument. Best of the six senses.
3. Symphony - Adding invention and big picture thinking (not just detail focus).
4. Empathy - Going beyond logic and engaging emotion and intuition.
5. Play - Bringing humor and light-heartedness to business and products.
6. Meaning - the purpose is the journey, give meaning to life from inside yourself.


In the book A Whole New Mind: Why Right-Brainers Will Rule the Future, Daniel H. Pink argues that left-directed thinking leads to an analytical person, whereas right-directed thinking leads to a creative person.

Left-directed thinking:

America is currently organized around a cadre of accountants, doctors, engineers, executives and lawyers. These "knowledge workers" excel at the ability to acquire and marry facts to data, and these abilities are typically accrued through a series of standardized tests such as the PSAT, SAT, GMAT, LSAT and MCAT.

Right directed thinking:
Design, empathy, play, and other "soft" aptitudes have become the focal point for individuals and companies that want to stand out above the others in a crowded marketplace. Look no further than Apple's design-triumph, the physically appealing and emotionally compelling iPod, for quick confirmation of this notion!

Pink advocates leaving left-directed thinking to Asia, where it is available in abundance and cheaply, and directing the educational system towards right-directed thinking to turn out creative people like Steve Jobs and Bill Gates.

July 04, 2011

Oracle Databases with SQL Server Reporting Services

Internally, SSRS uses the Oracle Data Provider for .NET (ODP.NET) to interact with Oracle databases. Hence, the ODP.NET's restrictions apply to SSRS as well.


Unlike SQL Server, Oracle returns resultant rows in a cursor. To use a stored procedure for data retrieval within SSRS, consider the following things:

1.The rows must be returned with an OUT REF CURSOR.
2.Only one OUT REF CURSOR can be returned from a stored procedure. In case of multiple OUT REF CURSORs, SSRS considers only the first one and simply ignores the rest.

Note: ADO.NET provides the capability to interact with multiple OUT CURSORs. However, SSRS abstracts a lot of the required plumbing code and hence is not able to provide this facility.

3.By default, the data source type is set to OLE DB. You should change this to Oracle to use ODP.NET features.
4.Once the fields are retrieved, SSRS automatically creates report parameters that match the stored procedure parameters. You should map these to the appropriate values based on your business logic. You can map these stored procedure parameters to the report parameters in the expression editor.

Once you've taken care of these things, SSRS generates a dataset that can be used to format a report. You can utilize other features, such as manipulation of the dataset fields and report formatting, just as you would with any other data source.

CREATE OR REPLACE PACKAGE CURSPKG AS
    TYPE T_CURSOR IS REF CURSOR;
    PROCEDURE OPEN_TWO_CURSORS (EMPCURSOR OUT T_CURSOR,
                                DEPTCURSOR OUT T_CURSOR);
END CURSPKG;

CREATE OR REPLACE PACKAGE BODY CURSPKG AS
    PROCEDURE OPEN_TWO_CURSORS (EMPCURSOR OUT T_CURSOR,
                                DEPTCURSOR OUT T_CURSOR)
    IS
    BEGIN
        OPEN EMPCURSOR FOR SELECT * FROM DEMO.EMPLOYEE;
        OPEN DEPTCURSOR FOR SELECT * FROM DEMO.DEPARTMENT;
    END OPEN_TWO_CURSORS;
END CURSPKG;
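
Since SSRS reads only the first OUT REF CURSOR, a standalone procedure returning a single cursor is often the simplest shape to consume. A minimal PL/SQL sketch, assuming the DEMO.EMPLOYEE table from above and a hypothetical DEPT_ID column (names are illustrative):

CREATE OR REPLACE PROCEDURE GET_REPORT_DATA (
    P_DEPT_ID  IN  NUMBER,
    RESULT_CUR OUT SYS_REFCURSOR)  -- SYS_REFCURSOR avoids declaring a package-level REF CURSOR type
IS
BEGIN
    -- SSRS binds P_DEPT_ID to a report parameter and reads the rows from RESULT_CUR.
    OPEN RESULT_CUR FOR
        SELECT * FROM DEMO.EMPLOYEE WHERE DEPT_ID = P_DEPT_ID;
END GET_REPORT_DATA;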


To start, create a new Reporting Services project in Visual Studio. Use the Report Project option and not the wizard (this will help you understand things better). Next, right-click the Reports folder and, from the Add menu, select "New Item", then add a report file (.rdl) named "TestOracle.rdl".

Open the rdl file in the Visual Studio editor. On the data tab, select from the dataset dropdown. This should open up the property pages for the connection.

On the Provider tab, select Microsoft OLE DB for Oracle and click "Next." On the Connection tab, enter the Oracle service name as the server name and the username and password. (You can opt for saving the password here.)
When done, test the connection with the test connection button at the bottom and click OK.

This should create a new dataset named DataSet1 in the dataset dropdown. Select the new dataset and click the ellipsis button beside it. This brings up the property pages for the dataset configuration. Click the ellipsis beside the data source to verify that the data source type is set to "Oracle." This ensures that SSRS uses ODP.NET for the underlying connectivity. The data source type normally defaults to OLE DB.

On the Query tab of the dataset property pages, set the command type to Stored Procedure. This is equivalent to the CommandType you would specify in ADO.NET when creating a command object to fetch data from a stored procedure. The query string should contain only the stored procedure name, qualified by its package name in Oracle if applicable (for example, PackageName.GetReportData). Click OK on this screen to fetch the available fields.
The OUT parameter must be of type REF CURSOR. With the data source type set to Oracle and the command type set to Stored Procedure, SSRS then automatically lists the parameters (only the input parameters according to the procedure's signature; the OUT REF CURSOR parameter is not listed).

June 06, 2011

Allocation units in SQL server

In Microsoft SQL Server, the database is basically divided into different allocation units, viz. Index Allocation Map (IAM) pages, Global Allocation Map (GAM) pages, Shared Global Allocation Map (SGAM) pages, and Page Free Space (PFS) pages.
SQL Server uses the IAM pages to find the extents allocated to an object. For each extent, SQL Server searches the PFS pages to see if there is a page with enough free space to hold the row. If SQL Server has to allocate new space, the GAM and/or SGAM is used. This occurs AFTER SQL Server has determined that there is not enough space to insert the new row on pages already allocated to the object.
All of these allocation units need to be in a healthy condition for the successful functioning of your MS SQL database. Corruption in any of these allocation pages can prevent the SQL database from being mounted, thereby making database elements such as tables, views, stored procedures, indexes, triggers, etc. inaccessible.
Various factors can be responsible for such corruption in SQL Server allocation pages. To name a few: wrong system or application shutdown, MS SQL Server damage, Trojan infections or virus attacks on the system, software malfunctioning, human errors, media errors, etc. However, in the above situation, the most likely cause is corruption in any of the GAM, SGAM, or PFS pages of the SQL Server database's metadata structure.

To resolve the issue and regain access to your SQL database elements, you can take the following measures:
* If the corruption occurred due to physical damage, replace the damaged system component with a new one.
* However, if the problem is the result of logical damage, run the 'DBCC CHECKDB' command with the correct repair clause (see the sketch after this list).
* If the above resolutions fail, try to restore your damaged or deleted SQL database from an up-to-date backup.
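
A minimal T-SQL sketch of the DBCC CHECKDB repair path, assuming a hypothetical database named SampleDb (repair options require single-user mode, and REPAIR_ALLOW_DATA_LOSS can lose data, so a backup-based restore is usually preferable):

-- Take the database to single-user mode so the repair can run.
ALTER DATABASE SampleDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Attempt repairs that carry no risk of data loss.
DBCC CHECKDB ('SampleDb', REPAIR_REBUILD);

-- Return the database to normal multi-user operation.
ALTER DATABASE SampleDb SET MULTI_USER;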

June 05, 2011

Data Normalization

Data Normalization
The process of modeling data into relational tables is known as normalization.
Common normal forms: 1NF, 2NF and 3NF.
Data stored in a relational database and accessed with SQL should be normalized to 3NF.
PK: An attribute or set of attributes which uniquely identifies a record.

Books Table:
ISBN(PK)  Title       Authors       Publisher
123       Oracle 10g  Auth1, Auth2  Pub1, Street1, City1, State1
124       MS SQL      Auth1, Auth3  Pub2, Street2, City2, State1

Problems with the above schema:
1. Insertion anomaly: It is impossible to enter details of authors whose books are not yet published, as there is no ISBN.
2. Deletion anomaly: It is impossible to delete a book without losing the publisher details.
3. Update anomaly: If a publisher's address changes, it is necessary to update every book that publisher has published.
4. Data inconsistency errors arise due to incorrect entry of the publisher address in any of the rows.
5. The Authors value is multi-valued, so finding the books written by Auth1 requires searching all values in the Authors column.
6. Storage inefficiency, as the address of each publisher is replicated across rows.


1NF:
Remove the repeating group, i.e. the Authors column, and pull it out into a separate table, Authors.
Books Table:
ISBN(PK)  Title       Publisher
123       Oracle 10g  Pub1, Street1, City1, State1
124       MS SQL      Pub2, Street2, City2, State1

Authors Table:
ISBN  Author Name
123   Auth1
123   Auth2
124   Auth1
124   Auth3

Problems solved by 1NF:
1. Insertion anomaly: There is no need to insert an unpublished book in the Books table; an author record can be inserted in the Authors table on its own.
2. Queries to find the number of books written by Auth1, or to find the author who has written the most books, are now simplified.
2NF:
Remove columns which are not dependent on the PK of the table. Here the publisher address is not dependent on ISBN in the Books table.
Books Table:
ISBN(PK)  Title       Publisher
123       Oracle 10g  Pub1
124       MS SQL      Pub2

Publishers Table:
Publisher(PK)  Street   City   State
Pub1           Street1  City1  State1
Pub2           Street2  City2  State1

Problems solved by 2NF:
1. Update anomaly: Only one record needs to be updated if the publisher details change.
2. Storage efficiency: Solves the problem of storing publisher details many times.
3. Data consistency errors caused by incorrect entries in the publisher columns are eliminated.
3NF:
Remove all columns that are interdependent. In the Publishers table, Street --> City and City --> State, i.e. a street belongs to only one city and a city belongs to only one state. This can be achieved by adding an address code column to the Publishers table and having a separate Address table for street, city and state.
Publishers Table:
Publisher(PK)  Address code
Pub1           AC1
Pub2           AC2

Address Table:
Address code(PK)  Street   City   State
AC1               Street1  City1  State1
AC2               Street2  City2  State1

Normalized Books database (a DDL sketch follows the table):
Table       Primary Key          Foreign Key
Books       ISBN                 Publisher
Authors     ISBN + Author Name   ISBN
Publishers  Publisher            Address code
Address     Address Code         (none)
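
As a minimal, generic SQL sketch of the normalized schema above (data types and lengths are illustrative assumptions):

CREATE TABLE Address (
    AddressCode varchar(10) PRIMARY KEY,
    Street      varchar(100),
    City        varchar(50),
    State       varchar(50)
);

CREATE TABLE Publishers (
    Publisher   varchar(50) PRIMARY KEY,
    AddressCode varchar(10) REFERENCES Address (AddressCode)
);

CREATE TABLE Books (
    ISBN      varchar(20) PRIMARY KEY,
    Title     varchar(200) NOT NULL,
    Publisher varchar(50) REFERENCES Publishers (Publisher)
);

CREATE TABLE Authors (
    ISBN       varchar(20) REFERENCES Books (ISBN),
    AuthorName varchar(100),
    PRIMARY KEY (ISBN, AuthorName)   -- one row per book/author pair
);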

May 13, 2011

Should phone numbers be normalized or de-normalized?

If you have a person's contact record, a contact could have N phone numbers, classified by type such as work, home, cell, and fax; a highly normalized model would put all phone numbers in their own table, e.g. a fax number stored as a PhoneNumber with PhoneType=3.
What if the application requires that contacts be retrieved with their phone numbers? Storing those numbers directly in the contact record could in fact be a better solution.
And which one is the phone number to use out of the N phone numbers?

When storing phone numbers in multiple columns of a single row, you run into different issues. What if a person doesn't have a landline anymore? Do you put the cell phone in the home phone number field, even though it isn't a home phone? When you want to print a report with a phone number, you need to write a query with a CASE statement or COALESCE in order to find the first populated phone number.

What do you do when you need to record multiple phones of the same type? For example, a person might have three cell phones used for different purposes. How do we know (if we need to) which type of connectivity that number represents?

If Person A and Person B have the same phone number, should they reference the same phone number record? If not, you still have duplication of data. But if so, you have a harder problem to solve when one person's phone number changes but not the other's.

How should a fax number be stored?
Should it be stored as a PhoneNumber with PhoneType=3 in a separate phone table?


All of the issues with having multiple phone numbers can be resolved as business rules. Implement those business rules in stored procedures, functions, etc. If you want to restrict your objects to having three phone numbers, such as work, home, and cell, then implement that design in your business objects and business rules.
If your database doesn't force you into that business rule, then when you get a new requirement to allow tracking of another number, your database schema is not impacted; only your business layer is.

Regarding the uniqueness of a phone number when Person A and Person B have the same number (i.e. it is shared by multiple people): normalize to the point of having a many-to-many relationship between the different entities and the phone numbers with which they may be associated, as sketched below.
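
A minimal T-SQL sketch of that many-to-many design (table and column names are illustrative assumptions, not from the original post):

CREATE TABLE Person (
    PersonID int IDENTITY(1,1) PRIMARY KEY,
    FullName varchar(100) NOT NULL
);

CREATE TABLE PhoneNumber (
    PhoneNumberID int IDENTITY(1,1) PRIMARY KEY,
    PhoneValue    varchar(20) NOT NULL UNIQUE,   -- one row per distinct number
    PhoneType     int NOT NULL                   -- e.g. 1=work, 2=home, 3=fax, 4=cell
);

-- Junction table: a person can have many numbers, and a number can be shared by many people.
CREATE TABLE PersonPhone (
    PersonID      int NOT NULL REFERENCES Person (PersonID),
    PhoneNumberID int NOT NULL REFERENCES PhoneNumber (PhoneNumberID),
    PRIMARY KEY (PersonID, PhoneNumberID)
);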

Regarding how to store a fax number: a fax number in a FaxNumber field is far easier to understand than a PhoneNumber with PhoneType=3 in a separate phone table, so the flattened design gives greater understandability of the data.

May 12, 2011

Advantages and Disadvantages of Database Mirroring

Advantages of Database Mirroring:
o The database mirroring architecture is more robust and efficient than database log shipping. It can be configured to replicate changes synchronously to minimize data loss.
o It has automatic server failover and client failover mechanisms.
o Configuration is simpler than for log shipping and replication, and it has built-in network encryption support (AES algorithm).
o Because propagation can be done asynchronously, it requires less bandwidth than synchronous methods (e.g. host-based replication, clustering) and is not limited by geographical distance with current technology.
o Database mirroring supports full-text catalogs.
o It does not require special hardware (such as shared storage or a heartbeat connection) or clusterware, and thus potentially has a lower infrastructure cost.

Disadvantages of Database Mirroring:
o Data loss is possible in asynchronous operation mode. The amount of data loss (and the recovery time) will vary and depends on several factors, such as the propagation interval time and bandwidth speed.
o The mirror server/database is not available for user operations.
o It only works at the database level, not at the server level. It only propagates changes at the database level; server-level objects, such as logins and fixed server role memberships, cannot be propagated.
o Automatic server failover may not be suitable for applications that use multiple databases.

Database Mirroring FAQ

Database Mirroring FAQ:

1. How to enable database mirroring?
Answer: You can use trace flag 1400 to enable it in the RTM version, or else update your SQL Server with the latest service pack.
Adding the trace flag as a startup parameter:
o Go to Run --> type sqlservermanager.msc
o Right-click the SQL Server (instancename) service and click Properties
o Click the Advanced tab
o In the startup parameters, append ;-T1400 and click OK
o Restart the SQL Server services and then try configuring database mirroring
Or
o Update SQL Server to the latest service pack.
2. Is it possible to configure mirroring between Standard Edition and Enterprise Edition, or vice versa?
Answer:
No, it's not possible; both the principal and the mirror must run the same edition.

3. Can I load balance my mirrored database (i.e. can I use the mirror for SELECT queries), like log shipping in standby mode?
Answer:
It is possible if your mirror server is running Enterprise Edition. You can take a snapshot of the mirrored database and then query against the snapshot to retrieve data (see the sketch after this FAQ).

4. Is it possible to take a backup of the mirrored database on the mirror server?
Answer:
No, you won't be able to run the BACKUP command against a mirrored database on the mirror server.

5. Can I create multiple endpoints for configuring different databases for mirroring and point each database to a unique endpoint?
Answer:
No, that's not possible. You can create only one database mirroring endpoint per server instance, and you need to use this endpoint when configuring database mirroring for all the databases.

6. Can I configure a single database to be mirrored to more than one server, i.e. one source and many destinations, like log shipping?
Answer:
No, that's not possible in database mirroring; it's a one-to-one configuration.
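
A minimal T-SQL sketch of the snapshot approach from question 3, assuming a hypothetical mirrored database AdventureWorks whose data file has the logical name AdventureWorks_Data (run on the mirror server; requires Enterprise Edition):

-- Create a static, read-only snapshot of the mirrored database for reporting.
CREATE DATABASE AdventureWorks_Snapshot
ON (NAME = AdventureWorks_Data, FILENAME = 'C:\Snapshots\AdventureWorks_Snapshot.ss')
AS SNAPSHOT OF AdventureWorks;

-- Reporting queries run against the snapshot, not against the mirror itself.
SELECT COUNT(*) FROM AdventureWorks_Snapshot.Sales.SalesOrderHeader;

-- Drop and recreate the snapshot periodically to refresh the data.
DROP DATABASE AdventureWorks_Snapshot;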

May 11, 2011

An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005.

An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections. (provider: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified) (Microsoft SQL Server, Error: -1)

Steps to resolve above error.
1) SQL Server should be up and running.

Go to All Programs >> Microsoft SQL Server 2008 >> Configuration Tools >> SQL Server Configuration Manager >> SQL Server Services, and check if SQL Server service status is “Running”.


2) Enable TCP/IP in SQL Server Configuration

Go to All Programs >> Microsoft SQL Server 2008 >> Configuration Tools >> SQL Server Configuration Manager >> SQL Server Network Configuration >> Protocols, then right-click TCP/IP and select Enable. (A verification query appears at the end of this post.)

3) Open Port in Windows Firewall

All the ports on which SQL Server is listening should be added to the firewall exceptions, and the firewall should allow traffic on those ports. By default SQL Server runs on port 1433, but if the default port has been changed, the new port should be added to the exceptions instead.
Go to Control Panel >> Windows Firewall >> Change Settings >> Exceptions >> Add Port

4) SQL Server Browser service should be up and running
If a named instance of SQL Server (an instance besides the default instance) is installed, the SQL Server Browser service should be up and running and should also be added to the firewall exceptions, as described in Step 3.

5) Enable Remote Connections
If this feature is turned off, SQL Server will function smoothly on the local machine, but it will not let another computer connect to it remotely. By default this feature is ON in SQL Server 2008.
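
Once a connection succeeds, a quick sanity check is to ask SQL Server which protocol and port the current session is actually using; a minimal T-SQL sketch against the sys.dm_exec_connections DMV:

-- net_transport should show TCP (rather than Shared memory) for a remote-capable connection,
-- and local_tcp_port shows the port the firewall must allow (1433 by default).
SELECT session_id, net_transport, local_net_address, local_tcp_port
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;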

May 03, 2011

Things to Do in a Database

Things to Do in a Database:

1. Have and exercise a disaster recovery plan.
2. Do not modify a production database inappropriately.
3. Do not modify existing objects using ad-hoc queries.
4. Always develop new features in development first and only then use them in production.
5. Normalize your data.
6. Enforce foreign key relationships.
7. Document the database schema.
8. Have and implement a data retention policy.
9. Tune the database.
10. Create and maintain indexes based on data access patterns.
11. Create and maintain appropriate statistics.
12. Implement appropriate security.
13. Utilize schemas: do not use the dbo schema for all roles; use different schemas for separating roles (see the sketch after this list).
14. Separate data access.
15. Associate authentication and users with appropriate roles and schemas.
16. Review application logs.
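
As a minimal T-SQL sketch of item 13 (schema separation), with a hypothetical Sales schema and reporting role (names are illustrative assumptions):

-- Keep application tables out of dbo by placing them in their own schema.
CREATE SCHEMA Sales AUTHORIZATION dbo;
GO
CREATE TABLE Sales.Orders (
    OrderID   int IDENTITY(1,1) PRIMARY KEY,
    OrderDate datetime NOT NULL
);
GO
-- Grant a role read-only access to everything in the Sales schema, instead of object by object.
CREATE ROLE SalesReader;
GRANT SELECT ON SCHEMA::Sales TO SalesReader;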

April 11, 2011

Database Design Essentials

These are the components that are absolutely necessary for a successful database project. There are 10 essentials:

Understand the database’s purpose:
The first essential is that you discover and understand the database’s purpose before doing anything else. Saying that the database’s purpose is “to store records” isn’t good enough. You need to understand the business reason for this database’s existence before it goes into development, much less production. The business reason needs to map not only to the technical specifications but also to the business purpose and the business processes that this database will be supporting.

Get the right tool:
A good CASE tool can help you quickly create different types of data models—such as functional decompositions, system and context diagrams, business process and data flow diagrams, and entity relationship models—each of which will help you communicate to your colleagues the visual components of the proposed system you’re building.

Gather the requirements for the database.
You need to understand the requirements well enough to be able to create visual representations of data, processes, and systems.

Be conscientious when modeling the data:
You need to perform data modeling at multiple levels. You should create the following three models:

Conceptual data model (CDM). The CDM contains the semantics of the information domain that you’re modeling. The CDM defines the scope of the database. Entity classes—the distinct data sets I mentioned earlier—represent things of interest to the organization, and the relationships between the entity pairs assert business rules and regulations. The CDM expresses both the as-is and to-be states and is often created during the Discovery stage.
Logical data model (LDM). The LDM describes the structure of the information domain. It includes criteria such as attribute set details, data types, key status, and individual attribute nullability.
Physical data model (PDM).The PDM defines how the data will be physically stored. You need to map the LDM to a specific database management system (DBMS) platform.

Enforce relationships.
The whole idea of a relational database is to identify and enforce relationships between entity pairs so that data integrity, as defined in the business rules and regulations and represented in the data models, is maintained.

Use the appropriate data types:
You can enforce data integrity at the database level by (see the sketch after this list):
- Defining appropriate data types and lengths, nullability, defaults, and check constraints when creating tables
- Using user-defined types, user-defined functions, and stored procedures that evaluate data when it’s being entered or used
- Using Declarative Referential Integrity (DRI) or triggers to enforce relationships between pairs of tables
- Using lookup tables to restrict data values when they’re inputted or edited
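
A minimal T-SQL sketch of the first point, with hypothetical table, column, and constraint names:

CREATE TABLE dbo.Product (
    ProductID    int           NOT NULL PRIMARY KEY,
    ProductName  varchar(50)   NOT NULL,                      -- right-sized string with NOT NULL
    UnitPrice    decimal(9, 2) NOT NULL
        CONSTRAINT DF_Product_UnitPrice DEFAULT (0),          -- sensible default
    Discontinued bit           NOT NULL
        CONSTRAINT DF_Product_Discontinued DEFAULT (0),
    CONSTRAINT CK_Product_UnitPrice CHECK (UnitPrice >= 0)    -- check constraint guards bad data
);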

Include indexes when modeling:
From the gathered requirements, you should be able to determine which columns will be used for filtering (i.e., in the WHERE clause) and sorting (ORDER BY clause); these are prime candidates for indexes.


Standardize the naming convention.

Store the code that touches the data on the server.
The best way to avoid inconsistent rule application and SQL injection is to store code that touches a database’s data as a stored procedure or CLR object.
Document the work.

April 05, 2011

Performance Counters for SQL server DBA

Available Mbytes

Definition

Available Mbytes stands for free unallocated RAM and displays the amount of physical memory, in MB, available to processes running on the computer.

Interpretation

This counter only displays the last value and is not an average.
If the value is less than 20-25 percent of installed RAM, it is an indication of insufficient memory.
Less than 100 MB is an indication that the system is severely starved for memory and is paging out.
Fluctuations of 100 MB or more can indicate that someone is logged in remotely to the server.
Pages/Sec

Definition

Pages/sec is the number of pages read from the disk or written to the disk to resolve memory references to pages that were not in memory at the time of the reference.

Interpretation

1. This is the sum of two counters: Pages Input/sec and Pages Output/sec.

2. The threshold is normally 20 pages/sec, although you have to investigate activity on the server before concluding that paging is the problem.

3. Spikes in Pages/sec are normal and can be due to backups, big files/data being written to disk, and reboots.

4. SQL Server should be configured to manage memory dynamically (the "Dynamically configure SQL Server memory" option), and the "Maximum Memory" setting should be set as high as possible while leaving room for the OS. SQL Server should also ideally be the only application on the server.

5. High Available MBytes and low paging file % usage together with high Pages/sec may not indicate a problem; it may merely indicate that the system is reading a memory-mapped file sequentially.

6. Also investigate Page Faults/sec, which is the cumulative sum of hard and soft page faults since the system was rebooted. It may be hard to interpret this counter since it is a cumulative value and may be very large, but if you have multiple programs sharing the computer with SQL Server, you may be able to see which program is causing the paging by looking at each program's page faults per second. (A query for reading SQL Server's own memory counters follows this list.)
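
To complement these OS-level counters, SQL Server exposes its own counters (Buffer Manager, Memory Manager, etc.) through a DMV; a minimal T-SQL sketch:

-- Page life expectancy, buffer cache hit ratio and pending memory grants straight from SQL Server's counter DMV.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Buffer cache hit ratio', 'Memory Grants Pending');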

How to identify the SQL Server Start/Restart Date & Time ?

1. When was the SQL Server service started?

-- SQL Server Agent session history; the Agent start time approximates the service start.
SELECT agent_start_date FROM msdb.dbo.syssessions WHERE session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)

-- Login time of system session 1.
SELECT login_time FROM sys.dm_exec_sessions WHERE session_id = 1

-- Oldest login time among all current processes.
SELECT MIN(login_time) FROM Master..sysprocesses
-- SQL server 2008 only
SELECT sqlserver_start_time FROM sys.dm_os_sys_info;

SELECT login_time FROM Master..sysprocesses WHERE spid = 1

2. When was the TempDB database re-created?
-- tempdb is re-created at every service start, so its create_date is the last start time.
SELECT create_date AS 'SQL Server Started at' FROM sys.databases WHERE [name] = 'tempdb'

GO

March 25, 2011

Database mirroring


SQL server Database Mirroring

There are three database mirroring operating modes: High Availability, High Protection and High Performance.
When database mirroring is used for high availability and a principal server suddenly fails, the mirror server, provided it has confirmation from the witness, can automatically take on the role of principal and make its database available within a few seconds. However, this mode needs an additional server, called the witness server.
In High Protection operating mode only a manual failover is possible, because there is no witness to fill the tie-breaker role. An automatic failover is not possible, because if the principal server fails, the mirror server has no witness server with which to form a quorum.

Client-side Redirect

In SQL Server 2005, if you connect to a database that is being mirrored with ADO.NET or the SQL Native Client, your application can take advantage of the drivers' ability to automatically redirect connections when a database mirroring failover occurs. You must specify the initial principal server and database in the connection string, and optionally the failover partner server.
For example, specifying server A as the principal, server B as the mirror, and AdventureWorks as the database name:

"Data Source=A;Failover Partner=B;Initial Catalog=AdventureWorks;Integrated Security=True;"
The failover partner in the connection string is used as an alternate server name if the connection to the initial principal server fails. If the connection to the initial principal server succeeds, the failover partner name will not be used, but the driver will store the failover partner name that it retrieves from the principal server in a client-side cache. Suppose a client is successfully connected to the principal and a database mirroring failover (automatic, manual, or forced) occurs. The next time the application attempts to use the connection, the ADO.NET or SQL Native Client driver will detect that the connection to the old principal has failed and will automatically retry connecting to the new principal as specified in the failover partner name. If successful, and a new mirror server is specified for the database mirroring session by the new principal, the driver will retrieve the new failover partner server name and place it in its client cache. If the client cannot connect to the alternate server, the driver will try each server alternately until the login timeout period is reached.

The great advantage of using the database mirroring support built into ADO.NET and the SQL Native Client driver is that we do not need to recode the application, or place special code in the application, to handle a database mirroring failover.

For additional information
http://technet.microsoft.com/en-us/library/cc917680.aspx

In SQL Server 2005, 2008 and 2008 R2, database mirroring works at the individual database level. One challenge when using database mirroring is to keep your important SQL Agent jobs enabled or disabled depending on a database's current mirroring role. If a database is acting as a principal, you will want the SQL Agent jobs associated with that database to be enabled, while if it is acting as a mirror, you will want to disable the SQL Agent jobs associated with that database. This does not happen automatically without a little bit of preparation; a query for checking the current role is sketched below.

http://blogs.msdn.com/b/sqlcat/archive/2010/04/01/using-sql-agent-job-categories-to-automate-sql-agent-job-enabling-with-database-mirroring.aspx
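
A minimal T-SQL sketch for checking each database's current mirroring role (for example from a SQL Agent job that decides whether other jobs should run), using the sys.database_mirroring catalog view:

-- Databases that are not mirrored have a NULL mirroring_guid.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_role_desc,        -- PRINCIPAL or MIRROR
       mirroring_state_desc        -- e.g. SYNCHRONIZED, SYNCHRONIZING
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;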

February 25, 2011

Move distribution database to another drive

Move distribution database to another drive:

There are scenarios in SQL Server replication where we need to move the data files and log files of the distribution database to another drive/folder.

Please follow the steps below to move it without having to redo the whole replication setup.

1. Start and Stop a Replication Agent

To start and stop a Snapshot Agent or Log Reader Agent from Management Studio

Connect to the Publisher in Management Studio, and then expand the server node and the Replication folder.

Expand the Local Publications folder, and then right-click a publication.

Click View Snapshot Agent Status or View Log Reader Agent Status.

Click Start or Stop.

To start and stop a Queue Reader Agent from Management Studio

Connect to the Distributor in Management Studio, and then expand the server node.

Expand the SQL Server Agent folder, and then expand the Jobs folder.

Right-click the job for the agent, and then click Start Job or Stop Job. The name of the job for the Queue Reader Agent is in the form []..

To start and stop a Snapshot Agent, Log Reader Agent, or Queue Reader Agent from Replication Monitor

Expand a Publisher group in the left pane, expand a Publisher, and then click a publication.

Click the Agents tab.

Right-click an agent, and then click Start Agent or Stop Agent.

2. ALTER DATABASE distribution SET OFFLINE

3. Move the data and log file to the new location


ALTER DATABASE distribution MODIFY FILE ( NAME = distribution , FILENAME = 'C:\MSSQL\Data\distribution.MDF')
ALTER DATABASE distribution MODIFY FILE ( NAME = distribution_log , FILENAME = 'C:\mssql\Data2\distribution.ldf')
ALTER DATABASE distribution MODIFY FILE ( NAME = distribution_02 , FILENAME = 'C:\mssql\Data2\distribution_02.ndf')

4. ALTER DATABASE distribution SET ONLINE
5. Start the Log Reader Agent and Distribution Agent (a verification query is sketched below).
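
To confirm that SQL Server now points at the new file locations from step 3, you can query sys.master_files after bringing the database back online; a minimal T-SQL sketch:

-- physical_name should show the new paths for the distribution data and log files.
SELECT name, physical_name, state_desc
FROM sys.master_files
WHERE database_id = DB_ID('distribution');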

February 03, 2011

SQL server DBA Frequently used Checklists

1. How to check the MSDTC status on a SQL Server 2005 server

i) Go to Services.msc and start the Distributed Transaction Coordinator service.

ii) Go to Dcomcnfg.exe
My Computer -> Properties -> MSDTC Tab -> Security Configuration
Verify that the following options are checked:
- Network DTC Access
- Network Administration
- Network Transactions
- Network Clients
- XA Transactions

2.

January 23, 2011

Monitoring SQL server resources

While investigating performance issues related to SQL Server resources, first identify the bottleneck and then resolve it. (A wait-statistics query sketch appears at the end of this post.)

1. Memory bottleneck:
SQL Server uses all the memory of the host server. Data pages reside on disk, but disks are slow, so SQL Server loads data pages into memory so that they can be served to users as quickly as possible when requested.
There are 2 types of memory pressure.
i) External memory pressure: SQL Server is unable to acquire sufficient memory.
If SQLServer: Buffer Manager - Total pages and SQLServer: Buffer Manager - Target pages are the same, SQL Server has sufficient memory; if Target is greater than Total, there is external memory pressure.
ii) Internal memory pressure: the SQL Server buffer pool (the memory area which holds data pages) is reduced in response to other components such as linked servers, extended stored procedures, etc.

• Memory: Available Bytes < 4 MB indicates there is memory pressure.
• VAS (virtual address space) is made up of physical memory and memory provided by the page file. If Memory: Pages/sec > 1, it indicates there is memory pressure.
• SQL Server: Buffer Manager - Page life expectancy < 300 indicates there is memory pressure.
• SQL Server: Buffer Manager - Buffer cache hit ratio reports the percentage of pages requested by queries that are found in the SQL Server buffer. If it is < 90, requested pages are mostly being read from disk into memory, and there is a memory bottleneck.
• SQL Server: Buffer Manager - Stolen pages: if greater than total target pages, there is memory pressure.
• SQL Server: Memory Manager - Memory Grants Pending: if non-zero, there is memory pressure.
• SQL Server: Buffer Manager - Lazy writes/sec records the number of times per second SQL Server relocates dirty pages from the buffer pool to disk. If it is 20 per second or more, there is memory pressure.


2. CPU bottleneck:
• If % Total Processor Time is consistently > 80, there is CPU pressure.
• If the average Processor Queue Length is > 2 per CPU, there is CPU pressure.
• If SQL Server: SQL Statistics - SQL Compilations/sec or Re-Compilations/sec is high, there is CPU pressure.

3. Disk bottleneck:
• If Avg. Disk sec/Transfer is > 20 ms (or % Disk Time is consistently high), there is disk pressure.
• If SQL Server: Access Methods - Full Scans/sec is > 1, or Page Splits/sec is high, there is disk pressure.
• If % Total Processor Time is consistently > 80 while the above counters are also high, there is disk pressure.
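
In addition to Performance Monitor counters, SQL Server's own wait statistics are a quick way to see which resource (CPU, memory, disk, locks) sessions are actually waiting on; a minimal T-SQL sketch:

-- Top waits since the instance started; e.g. PAGEIOLATCH_* waits point at disk, SOS_SCHEDULER_YIELD at CPU.
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;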

January 19, 2011

Difference between DELETE and TRUNCATE

DELETE logs the data for each row affected by the statement in the transaction log and physically removes the row from the file, one row at a time. The recording of each affected row can cause your transaction log grow massively if you are deleting huge numbers of rows. However, when you run your databases in full recovery mode, detailed logging is necessary for SQL Server to be able to recover the database to the most recent state, should a problem arise. The fact that each row is logged explains why DELETE statements can be slow.

TRUNCATE is faster than DELETE due to the way TRUNCATE "removes" rows. Actually, TRUNCATE does not remove data, but rather deallocates whole data pages and removes pointers to indexes. The data still exists until it is overwritten or the database is shrunk. This action does not require a lot of resources and is therefore very fast. It is a common mistake to think that TRUNCATE is not logged. This is wrong. The deallocation of the data pages is recorded in the log file. Therefore, BOL refers to TRUNCATE operations as "minimally logged" operations. You can use TRUNCATE within a transaction, and when this transaction is rolled-back, the data pages are reallocated again and the database is again in its original, consistent state.

Some limitations do exist for using TRUNCATE.

· You need to be db_owner, ddl_admin, or owner of the table to be able to fire a TRUNCATE statement.

· TRUNCATE will not work on tables that are referenced by one or more FOREIGN KEY constraints.

So if TRUNCATE is so much faster than DELETE, should one use DELETE at all? Well, TRUNCATE is an all-or-nothing approach. You can't specify just to truncate those rows that match a certain criteria. It's either all rows or none.

We can have triggers associated with DELETE operation.
We cannot have triggers associated with TRUNCATE operation.

You can, however, use a workaround here. Suppose you want to delete more rows from a table than will remain. In this case you can export the rows that you want to keep to a temporary table, run the TRUNCATE statement, and finally re-import the remaining rows from the temporary table. If your table contains a column with the IDENTITY property defined on it, and you want to keep the original IDENTITY values, be sure to enable IDENTITY_INSERT on the table before you re-import from the temporary table. Chances are good that this workaround is still faster than a DELETE operation. A minimal sketch of the workaround follows.
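
A minimal T-SQL sketch of that workaround, assuming a hypothetical dbo.Orders table with an IDENTITY column OrderID and an OrderDate column, and no foreign keys referencing it (TRUNCATE would otherwise fail):

-- 1. Keep the rows you want in a temporary table.
SELECT OrderID, OrderDate
INTO #KeepRows
FROM dbo.Orders
WHERE OrderDate >= '20110101';

-- 2. Truncate the original table (fast, minimally logged).
TRUNCATE TABLE dbo.Orders;

-- 3. Re-import the kept rows, preserving the original IDENTITY values.
SET IDENTITY_INSERT dbo.Orders ON;
INSERT INTO dbo.Orders (OrderID, OrderDate)
SELECT OrderID, OrderDate FROM #KeepRows;
SET IDENTITY_INSERT dbo.Orders OFF;

DROP TABLE #KeepRows;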

You can also set the recovery model to "Simple" before you start this workaround, and then back to "Full" once it is done. However, keep in mind that in this case, you might only be able to recover to the last full backup.

SQL Server Interview questions

1. What will be the outcome of the 2 select @@error statements?

declare @@error nvarchar(max)

set @@error = 'This is an error'
select @@error -- Select 1

if @@error > 0
print 'invalid error'
else
select @@error -- Select 2

Answer:
Msg 102, Level 15, State 1, Line 1 Incorrect syntax near '@@error'.
Explanation:
@@error is reserved by SQL Server internals and can't be assigned a value as if it were a variable. The second @@error test,
if @@error > 0
print 'invalid error'
could be true, since @@error captures error codes; running SELECT @@error separately AFTER this whole batch will return error code 102. However, the batch terminates at the first line, when trying to declare @@error as a variable, so neither SELECT runs.

2. Is it possible, using a trigger, to update a data row before committing the insert to a table, by first updating the row in the INSERTED table?

Answer: It is not possible to update the INSERTED or DELETED tables. The following error would occur:
Msg 286, Level 16,
The logical tables INSERTED and DELETED cannot be updated.

3. What is the output of the following queries?

DECLARE @inputValue NVARCHAR(20)
DECLARE @student TABLE
(Id INT PRIMARY KEY IDENTITY(1,1),
StudentName NVARCHAR(50),
StudentResult INT)

INSERT INTO @student
VALUES( 'Hardy', 100), ('Rocky', 98), ('Panky', 99)

--Select 1
SET @inputValue = 'Hardy'
SELECT * FROM @student WHERE StudentName IN (@inputValue)

--Select 2
SET @inputValue = 'Hardy, Rocky'
SELECT * FROM @student WHERE StudentName IN (@inputValue)

Answer:
Select1 returns 1 row; select2 return 0 row
Explanation:
The IN operator determines whether a specified value matches any value in a subquery or a list. In the first SELECT query, the variable is replaced with its value, so the query is effectively:
SELECT * FROM @student WHERE StudentName IN ('Hardy')
In the same way, the second query is effectively:
SELECT * FROM @student WHERE StudentName IN ('Hardy, Rocky')
Because there is no student whose name is 'Hardy, Rocky', SQL returns no data.
So the final result is: Select 1 returns 1 row; Select 2 returns 0 rows.

4. What will be the result of the following?

DECLARE @MyTable TABLE (Col1 INT,
Col2 INT)

-- Col1 Col2
INSERT INTO @MyTable VALUES (1, 2),
(3, 4)

UPDATE @MyTable SET Col1 = Col2, Col2 = Col1

SELECT Col1, Col2 FROM @MyTable

Answer:
The values of Col1 and Col2 are swapped, because the expressions on the right-hand side of an UPDATE are evaluated against the row's original (pre-update) values.

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...