August 18, 2012

Calling Stored Procedures from .NET Applications


Calling Stored Procedures from .NET Applications

In .NET, there are three very similar ways of accessing SQL Server.
      System.Data.ODBC
      System.Data.SQLClient
      System.Data.OLEDB

Stored Procedures provide more alternatives in the way they can pass data back to the application. Stored Procedures tend to work faster, are much more secure, are more economical with server memory, and can contain a lot more logic. Additionally, it makes teamwork easier: As long as the name of the stored procedure, what it does, and the parameters remain the same, it also allows someone else in the team to work on the database code without you having to change your client software.

Security:
Stored procedures are a counter-measure to dangerous SQL Script injection attacks, a susceptibility that applications using embedded SQL are more vulnerable to.
In the Microsoft SQL Server environment, SQL injection attacks can be prevented using parameters, with or without SPs. Earlier I said this is a damaging argument, and by that I mean it is damaging to programmers who cannot use SPs: They will leave their applications more vulnerable to attack than they should because of this bit of misinformation.
Additionally, using stored procedures lets you use the SqlParameter class available in ADO.NET to specify data types for stored procedure parameters. This gives you an easy way to validate the types of user-provided values as part of an in-depth defensive strategy. To be sure, parameters are just as useful in in-line queries as they are in stored procedures in narrowing the range of acceptable user input.
Stored procedures allow for better data protection by controlling how the data is accessed. By granting a database login EXECUTE permissions on stored procedures you can specify limited actions to the application. Additionally, stored procedures are a counter-measure to dangerous SQL Script injection attacks, a susceptibility that applications using embedded SQL are more vulnerable to.

Performance: 
When stored procedures are used SQL Server can cache the ‘execution plan’ that it uses to execute the SQL vs. having to recalculate the execution plan on each request .
The cached execution plan used to give stored procedures a performance advantage over queries. However, for the last couple of versions of SQL Server, execution plans are cached for all T-SQL batches, regardless of whether or not they are in a stored procedure. Therefore, performance based on this feature is no longer a selling point for stored procedures. Any T-SQL batch with static syntax that is submitted frequently enough to prevent its execution plan from aging out of memory will receive identical performance benefits. The "static" part is key; any change, even something as insignificant as the addition of a comment, will prevent matching with a cached plan and thereby prevent plan re-use.
However, stored procedures can still provide performance benefits where they can be used to reduce network traffic. You only have to send the EXECUTE stored_proc_name statement over the wire instead of a whole T-SQL routine, which can be pretty extensive for complex operations. A well-designed stored procedure can reduce many round trips between the client and the server to a single call.
Additionally, using stored procedures allows you to enhance execution plan re-use, and thereby improve performance, by using remote procedure calls (RPCs) to process the stored procedure on the server. When you use a SqlCommand.CommandType of StoredProcedure, the stored procedure is executed via RPC. The way RPC marshals parameters and calls the procedure on the server side makes it easier for the engine to find the matching execution plan and simply plug in the updated parameter values.
One last thing to think about when considering using stored procedures to enhance performance is whether you are leveraging T-SQL strengths. Think about what you want to do with the data.
Are you using set-based operations, or doing other operations that are strongly supported in T-SQL? Then stored procedures are an option, although in-line queries would also work.
Are you trying to do row-based operations, or complex string manipulation? Then you probably want to re-think doing this processing in T-SQL, which excludes using stored procedures, at least until SQL Server 2005 is released and Common Language Runtime (CLR) integration is available.
It’s also worth mentioning that the easiest way to get performance out of your database is to do everything you can to take advantage of the platform you are running on.
Use the power of the database to thresh the wheat from the chaff. Use your business logic to turn the wheat into dough. In many cases you can get better performance by looping and filtering data in SQL Server than you could performing the same loops and filters in the Data Access Layer – databases are intrinsically designed to do this, while you and I have to writeour own code (which do you think is going to be faster?). It is, however, important to understand how SQL Server uses indexes and clustered indexes.

Beyond sorting and filtering data in the stored procedures you can also batch common work together or retrieve multiple sets of data. For example, retrieve some data, update a datetime stamp, and then insert a new record. If you were to execute these 3 tasks once a second as ad-hoc SQL this would result in 259,200/day independent database request vs. 86,400/day if all were encapsulated in a stored procedure. That’s 172,800 database connections and network IO usages that you no longer require! Consolidating work through stored procedures makes more effective use of a connection (and your system).

If you use parameterized queries instead of strictly ad-hoc sql statements, performance benefit same in both using stored procedures.But paramaterized queries still suffer from poor security design. Ultimately without access only via stored procs I have access to the underlying tables and can do anything I want to them.

The cached execution plan used to give stored procedures a performance advantage over queries  and can provide performance benefits where they can be used to reduce network traffic.

Maintenance: 
The another potential benefit to consider is maintainability. In a perfect world, your database schema would never change and your business rules would never get modified, but in the real world these things happen. That being the case, it may be easier for you if you can modify a stored procedure to include data from the new X, Y, and Z tables that have been added to support that new sales initiative, instead of changing that information somewhere in your application code. Changing it in the stored procedure makes the update transparent to the application—you still return the same sales information, even though the internal implementation of the stored procedure has changed. Updating the stored procedure will usually take less time and effort than changing, testing, and re-deploying your assembly.
Also, by abstracting the implementation and keeping this code in a stored procedure, any application that needs access to the data can get it in a uniform manner. You don't have to maintain the same code in multiple places, and your users get consistent information.
Another maintainability benefit of storing your T-SQL in stored procedures is better version control. You can version control the scripts that create and modify your stored procedures, just as you can any other source code module. By using Microsoft Visual SourceSafe® or some other source control tool, you can easily revert to or reference old versions of the stored procedures.
One caveat with using stored procedures to enhance maintainability is they cannot insulate you from all possible changes to your schemas and rules. If the changes are large enough to require a change in the parameters fed into the stored procedure, or in the data returned by it, then you are still going to have to go in and update the code in your assembly to add parameters, update GetValue() calls, and so forth.
Another issue to consider is that using stored procedures to encapsulate business logic limits your application portability, in that it ties you to SQL Server. If application portability is critical in your environment, encapsulating business logic in a RDBMS-neutral middle tier may be a better choice.
Updating the stored procedure will usually take less time and effort than changing, testing, and re-deploying your assembl
You don't have to maintain the same code in multiple places, and your users get consistent information.

Abstraction:
Stored procedures provide abstraction between the data and the business logic layer. The data model can be dramatically changed, but the stored procedures can still return identical data.

DISADVANTAGES:-
1.Stored procedure languages are quite often vendor-specific. Switching to another vendor's database most likely requires rewriting any existing stored procedures.
2.Stored procedure languages from different vendors have different levels of sophistication.
For example, Oracle's PL/SQL has more languages features and built-in features (via packages such as DBMS_ and UTL_ and others) than Microsoft's T-SQL.[citation needed]
3.Tool support for writing and debugging stored procedures is often not as good as for other programming languages, but this differs between vendors and languages.
For example, both PL/SQL and T-SQL have dedicated IDEs and debuggers. PL/PgSQL can be debugged from various IDEs.

Conclusion:

when it comes to using stored procedures there is NO downside. Applications can be build more securely, are easier to maintain, and typically perform better.


August 17, 2012

How to allow a SQL Login to see only ONE database


How to allow a SQL Login to see only ONE database

On one server, there are a lot of other databases. Sometimes the databases are created as per client name and so when we log with one client login credentials, he can see which are other clients I am working with.
We do not want them to be able to see all the other databases on the instance.  They have access to only one database and that is the only one that they should see in object explorer.

To implement this, we can use following steps.

e.g. we want to allow client1_login to  see only client1_db database.

USE [master]
GO
-- make sure they can view all databases for the moment.
GRANT VIEW ANY DATABASE TO client1_login
GO

USE client1_db
go

-- drop the user in the database if it already exists.
IF  EXISTS (SELECT *
            FROM sys.database_principals
            WHERE name = N'client1_login ')
  DROP USER client1_login
GO

-- grant them ownership to of the database (ownership of dbo schema).
ALTER AUTHORIZATION ON DATABASE::client1_db  to client1_login
go

USE MASTER
go

-- deny ability to see ohter databases
DENY VIEW ANY DATABASE TO client1_login
go

May 11, 2012


Features Comparison of BI Tools
User Experience :


Microstrategy
Qlikview
Pentaho
MicroStrategy Web user interface adheres to an “Extreme AJAX” model where processing is shifted from the Web server to the Web browsers, making for a more responsive Web interface that increases user productivity and improves user adoption
Qlikview multiple Web interfaces intended for different deployment requirements. Because each interface has different capabilities, developers are typically forced to make tradeoffs between functionality and deployment requirements
Pentaho web interface offers very limited functionality. It lacks familiar Microsoft paradigms, making the end user experience less intuitive. Enterprise reports created using the Pentaho Web are limited to basic reports without any graphs, charts or crosstabs, severely limiting end user experience and self-service capabilities


Performance:


Microstrategy
Qlikview
Pentaho
ROLAP architecture which leverages the database for much of its processing. Data joins and analytic calculations are processed in the database whenever
possible. MicroStrategy’s multi-pass approach provides the flexibility to answer any analytical question in the most
optimal manner.
QlikView stores all data and performs all calculations in memory on the middle-tier server. QlikView does not fully leverage the relational database or the hard disk on the middle-tier. These aspects of the QlikView architecture result in inefficient resource utilization and limit QlikView’s scalability.
Pentaho ROLAP engine does not provide fully implemented multi-source ROLAP and multipass SQL engines. The Pentaho ROLAP engine is unable to leverage the database to its fullest extent possible, resulting in unnecessary network and hardware resources utilization

Deployment and Administration


Microstrategy
Qlikview
Pentaho
Provides organizations a platform that is quick to implement and deploy as well as easy to maintain and administer, fueled by a single code base that offers the advantage of reusable business logic across the entire platform. MicroStrategy’s single BI server provides efficient, centralized administration for IT and fewer moving parts which translate into less downtime.
QlikView lacks a common reusable metadata layer that is shared across documents. This creates a maintenance challenge as developers are typically forced to continually and manually synchronize metric definitions and security profiles across documents.
Pentaho lacks a unified and reusable metadata layer creating maintenance challenge and promotes “multiple versions of the truth.” The administration console provides control over only a subset of administrative tasks. Have fewer tools to centrally monitor and manage the BI applications, thus more administrators per number of end users. Lacks enterprise features like clustering and load balancing, increasing the administration complexity and increasing the IT workload.

Drawback:

Microstrategy
Qlikview
Pentaho
Reusable metadata is easier to maintain
 requiring less redundancy, end users
 have more self-service capabilities that
offload work from the IT staff, t provides a comprehensive suite of administrative tools requiring fewer IT administrators
Developers are forced to create
 redundant metadata
objects as the metadata
objects they create cannot be
 reused across multiple reports,
causing unnecessary
development and maintenance
 efforts.
Developers are forced to create redundant metadata
objects as the metadata objects they create cannot be reused across multiple reports, causing unnecessary
development and maintenance efforts.

February 12, 2012

Hadoop

What is Hadoop?
Apache Hadoop is a framework for running applications on large cluster built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
Apache Hadoop is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets.
Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.

Where did Hadoop come from?
The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated intoNutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.

What problems can Hadoop solve?
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built.

How is Hadoop architected?
Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copy stores, data stored on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.

Hadoop Project:

The project includes these subprojects:

Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:

Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.

February 02, 2012

Oracle vs SQL Server : High Availability, Licensing, Performance and security

High Availability
SQL Server 2008:
Database-mirroring :
Use a rolling upgrade process to upgrade database instances in a database-mirroring session.
Take advantage of write-ahead functionality on the incoming log stream on the mirror server.
Use page read-ahead capability during the undo phase to further improve performance.
Provide reporting capabilities with a database snapshot as a source for reports on the mirror server.

Failover clustering:
Enable failover support by sharing access among nodes and restarting SQL Server on a working node.
Increase scalability with support of up to 16 nodes in a single failover cluster.
Support a rolling upgrade process for servers participating in a failover-clustering configuration.

Peer-to-peer replication:
Replicate changes in near real time, while all databases also handle their primary responsibilities.
Boost scalability, availability, and processing capacity by configuring applications to use peers and to fail over to another peer.
Protect against accidental conflicts with built-in conflict detection.
Increase availability by dynamically adding a new node to an existing topology.

Log shipping:
Provide database redundancy by using standby servers to automatically back up transaction logs.
Increase availability by providing multiple failover sites.
Reduce the load on the primary server by using a secondary server for read-only query processing.

Oracle 11g:

Oracle Provides following features for high availability.

Real Application Clusters
Clusterware
Data Guard
GoldenGate
Streams
Secure Backup
Recovery Manager (RMAN)
Flashback Technologies
VM
Cloud Computing
Cloud Storage
Cross-Platform Transportable Tablespace
Edition-Based Redefinition
Online Reorganization

License cost

Oracle 11g license cost

- Per Processor = $17,500
- Support (22%) = $3,850
- Total (Per Processor) = $21,350
- Total (4 Processors) = $85,400


license cost of SQL Server

- Per Processor = $5,999
- Total (4 Processors) = $23,996

Security:

SQL Server features role-based security for server, database and application profiles; integrated tools for security auditing, tracking 18 different security events and additional sub-events; plus support for sophisticated file and network encryption, including SSL, Kerberos and delegation.

Oracles provides powerful security features such as database activity monitoring and blocking, privileged user and multi-factor access control, data classification, transparent data encryption, consolidated auditing and reporting, secure configuration management, and data masking, customers can deploy reliable data security solutions that do not require any changes to existing applications, saving time and money.

ORACLE DATABASE SECURITY PRODUCTS:
Oracle Advanced Security
Oracle Audit Vault
Oracle Label Security
Oracle Configuration Management
Oracle Secure Backup
Oracle Database Firewall
Oracle Database Vault
Oracle Data Masking
Oracle Total Recall

Performance:

In SQL Server, the DBA has no "real" control over sorting and cache memory allocation. The memory allocation is decided only globally in the server properties memory folder, and that applies for ALL memory and not CACHING, SORTING, etc.


Following Oracle features do not exist in SQL Server.
There are no bitmap indexes
There are no reverse key indexes in SQL Server.
There are no function-based indexes in SQL Server.

January 12, 2012

SQL - For beginners

SQL is a special-purpose language designed for the creation and maintenance of data in relational databases.

Three languages-within-a-language offer everything you need to create, modify, maintain, and provide security for a relational database:

1. The Data Definition Language (DDL): The part of SQL that you use to create (completely define) a database, modify its structure, and destroy it when you no longer need it.

2. Data Manipulation Language (DML): DML Performs database maintenance.Using this powerful tool, you can specify what you want to do with the data in your database — enter it, change it, or extract it.


3. Data Control Language (DCL)
: DCL Protects your database from becoming corrupted. Used correctly, the DCL provides security for your database; the amount of protection depends on the implementation. If your implementation doesn’t provide sufficient protection, you must add that protection to your application program.

Creating tables
A database table is a two-dimensional array made up of rows and columns.
You can create a table by using the SQL CREATE TABLE command. Within the command, you specify the name and data type of each column.After you create a table, you can start loading it with data. (Loading data is a DML, not a DDL, function.) If requirements change, you can change a table’s structure by using the ALTER TABLE command. If a table outlives its usefulness or becomes obsolete, you can eliminate it with the DROP command. The various forms of the CREATE and ALTER commands, together with the DROP command, make up SQL’s DDL.

e.g.
CREATE TABLE CUSTOMER (
CustomerID INTEGER NOT NULL,
FirstName CHARACTER (15),
LastName CHARACTER (20) NOT NULL,
Street CHARACTER (25),
City CHARACTER (20),
State CHARACTER (2),
Zipcode INTEGER,
Phone CHARACTER (13) ) ;

The simplest use of the SELECT statement is to retrieve all the data in all the
rows of a specified table.

To do so, use the following syntax:

SELECT * FROM CUSTOMER ;

The asterisk (*) is a wildcard character that means everything. In this context,
the asterisk is a shorthand substitute for a listing of all the column names of the
CUSTOMER table. As a result of this statement, all the data in all the rows and
columns of the CUSTOMER table appear on-screen.

SELECT statements can be much more complicated than the statement in this
example. In fact, some SELECT statements can be so complicated that they’re
virtually indecipherable. This potential complexity is a result of the fact that
you can tack multiple modifying clauses onto the basic statement.

WHERE clause,

It is the most commonly used method to restrict the rows that a
SELECT statement returns.
A SELECT statement with a WHERE clause has the following general form:

SELECT column_list FROM table_name
WHERE condition ;

The following example shows a compound condition inside a SELECT
statement:

SELECT FirstName, LastName, Phone FROM CUSTOMER
WHERE State = ‘NH’
AND Status = ‘Active’ ;

This statement returns the names and phone numbers of all active customers
living in New Hampshire. The AND keyword means that for a row to qualify for
retrieval, that row must meet both conditions: State = ‘NH’ and Status =
‘Active’.

Insert Statement

You can insert a row for the new object without filling in the data in all the columns. If you want the table in first normal form, you must insert enough data to distinguish the new row from all the other rows in the table. Inserting the new row’s primary key is sufficient for this purpose. In addition to the primary key, insert any other data that you have about the object. Columns in which you enter no data contain nulls.
The following example shows such a partial row entry:

INSERT INTO CUSTOMER (CustomerID, FirstName, LastName)
VALUES (:vcustid, ‘Tyson’, ‘Taylor’) ;

Another way to copy data from one table in a database to another is to nest a
SELECT statement within an INSERT statement. This method (a subselect)
doesn’t create a virtual table but instead duplicates the selected data. You can
take all the rows from the CUSTOMER table, for example, and insert those rows
into the PROSPECT table. Of course, this only works if the structures of the
CUSTOMER and PROSPECT tables are identical. Later, if you want to isolate
those customers who live in Maine, a simple SELECT with one condition in
the WHERE clause does the trick, as shown in the following example:

INSERT INTO PROSPECT
SELECT * FROM CUSTOMER
WHERE State = ‘ME’ ;

Updating Existing Data
You can count on one thing in this world — change. If you don’t like the
current state of affairs, just wait a while. Before long, things will be different.
Because the world is constantly changing, the databases used to model
aspects of the world also need to change. A customer may change her address.
The quantity of a product in stock may change (because, you hope, someone
buys one now and then). A basketball player’s season performance statistics
change each time he plays in another game. These are the kinds of events that
require you to update a database.
SQL provides the UPDATE statement for changing data in a table. By using a
single UPDATE statement, you can change one, some, or all the rows in a
table. The UPDATE statement uses the following syntax:
UPDATE table_name
SET column_1 = expression_1, column_2 = expression_2,
..., column_n = expression_n
[WHERE predicates] ;

Customer lists change occasionally — as people move, change their phone
numbers, and so on. Suppose that Abe Abelson moves from Springfield to
Kankakee. You can update his record in the table by using the following
UPDATE statement:
UPDATE CUSTOMER
SET City = ‘Kankakee’, Telephone = ‘666-6666’
WHERE Name = ‘Abe Abelson’ ;

You can use a similar statement to update multiple rows. Assume that Philo
is experiencing explosive population growth and now requires its own area
code. You can change all rows for customers who live in Philo by using a
single UPDATE statement, as follows:

UPDATE CUSTOMER
SET AreaCode = ‘(619)’
WHERE City = ‘Philo’ ;

Updating all the rows of a table is even easier than updating only some of
the rows. You don’t need to use a WHERE clause to restrict the statement.
Imagine that the city of Rantoul has acquired major political clout and has
now annexed not only Kankakee, Decatur, and Philo, but also all the cities
and towns in the database.
You can update all the rows by using a single statement, as follows:

UPDATE CUSTOMER
SET City = ‘Rantoul’ ;

Delete statement

As time passes, data can get old and lose its usefulness. You may want to
remove this outdated data from its table. Unneeded data in a table slows performance,
consumes memory, and can confuse users. You may want to transfer
older data to an archive table and then take the archive offline. That way,
in the unlikely event that you ever need to look at that data again, you can
recover it. In the meantime, it doesn’t slow down your everyday processing.
Whether you decide that obsolete data is worth archiving or not, you eventually
come to the point where you want to delete that data. SQL provides for
the removal of rows from database tables by use of the DELETE statement.
You can delete all the rows in a table by using an unqualified DELETE statement,
or you can restrict the deletion to only selected rows by adding a WHERE
clause. The syntax is similar to the syntax of a SELECT statement, except that
you use no specification of columns. If you delete a table row, you remove all
the data in that row’s columns.
For example, suppose that your customer, David Taylor, just moved to Tahiti
and isn’t going to buy anything from you anymore.
You can remove him from
your CUSTOMER table by using the following statement:

DELETE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’ ;

Assuming that you have only one customer named David Taylor, this statement
makes the intended deletion. If any chance exists that you have two
customers who share the name David Taylor.


Set functions

Sometimes, the information that you want to extract from a table doesn’t relate
to individual rows but rather to sets of rows. These functions are COUNT,
MAX, MIN, SUM, and AVG. Each function performs an action that draws data
from a set of rows rather than from a single row.

COUNT
The COUNT function returns the number of rows in the specified table. To
count the number of precocious seniors in my example high school database,
use the following statement:
SELECT COUNT (*)
FROM STUDENT
WHERE Grade = 12 AND Age < 14 ;

MAX
Use the MAX function to return the maximum value that occurs in the specified
column. Say that you want to find the oldest student enrolled in your
school. The following statement returns the appropriate row:
SELECT FirstName, LastName, Age
FROM STUDENT
WHERE Age = (SELECT MAX(Age) FROM STUDENT);
This statement returns all students whose ages are equal to the maximum
age. That is, if the age of the oldest student is 23, this statement returns the
first and last names and the age of all students who are 23 years old.
This query uses a subquery. The subquery SELECT MAX(Age) FROM
STUDENT is embedded within the main query. I talk about subqueries (also
called nested queries)

MIN
The MIN function works just like MAX except that MIN looks for the minimum
value in the specified column rather than the maximum. To find the youngest
student enrolled, you can use the following query:
SELECT FirstName, LastName, Age
FROM STUDENT
WHERE Age = (SELECT MIN(Age) FROM STUDENT);
This query returns all students whose age is equal to the age of the youngest
student.

SUM
The SUM function adds up the values in a specified column. The column must
be one of the numeric data types, and the value of the sum must be within the
range of that type. Thus, if the column is of type SMALLINT, the sum must be
no larger than the upper limit of the SMALLINT data type. In the retail database
, the INVOICE table contains a record of all sales.
To find the total dollar value of all sales recorded in the database, use the SUM
function as follows:

SELECT SUM(TotalSale) FROM INVOICE;

AVG
The AVG function returns the average of all the values in the specified
column. As does the SUM function, AVG applies only to columns with a
numeric data type. To find the value of the average sale, considering all transactions
in the database, use the AVG function like this:
SELECT AVG(TotalSale) FROM INVOICE
Nulls have no value, so if any of the rows in the TotalSale column contain null
values, those rows are ignored in the computation of the value of the average
sale.

Zeroing In on the Data You Want

The modifying clauses available in SQL are FROM, WHERE, HAVING, GROUP BY,
and ORDER BY. The FROM clause tells the database engine which table or tables
to operate on. The WHERE and HAVING clauses specify a data characteristic that
determines whether or not to include a particular row in the current operation.
The GROUP BY and ORDER BY clauses specify how to display the retrieved
rows.

SELECT * FROM SALES ;

This statement returns all the data in all the rows of every column in the
SALES table. You can, however, specify more than one table in a FROM clause.
Consider the following example:
SELECT *
FROM CUSTOMER, SALES ;

IN and NOT IN
The IN and NOT IN predicates deal with whether specified values (such as
OR, WA, and ID) are contained within a particular set of values You may, for example, have a table that lists suppliers of a commodity that your company purchases on a regular basis. You want to know the phone numbers of those suppliers located in the Pacific Northwest. You can find these numbers by using comparison predicates,
such as those shown in the following example:

SELECT Company, Phone
FROM SUPPLIER
WHERE State = ‘OR’ OR State = ‘WA’ OR State = ‘ID’ ;

You can also use the IN predicate to perform the same task, as follows:
SELECT Company, Phone
FROM SUPPLIER
WHERE State IN (‘OR’, ‘WA’, ‘ID’) ;
This formulation is a more compact than the one using comparison predicates
and logical OR.
The NOT IN version of this predicate works the same way. Say that you have
locations in California, Arizona, and New Mexico, and to avoid paying sales
tax, you want to consider using suppliers located anywhere except in those
states. Use the following construction:

SELECT Company, Phone
FROM SUPPLIER
WHERE State NOT IN (‘CA’, ‘AZ’, ‘NM’) ;

GROUP BY clause
with one of the aggregate functions (also called set functions) to get a quantitative
picture of sales performance. For example, you can see which salesperson is selling more of the profitable high-ticket items by using the average (AVG) function as follows:

SELECT Salesperson, AVG(TotalSale)
FROM SALES
GROUP BY Salesperson;


You can analyze the grouped data further by using the HAVING clause. The
HAVING clause is a filter that acts similar to a WHERE clause, but on groups of
rows rather than on individual rows. To illustrate the function of the HAVING
clause, suppose the sales manager considers Ferguson to be in a class by
himself. His performance distorts the overall data for the other salespeople.
You can exclude Ferguson’s sales from the grouped data by using a HAVING
clause as follows:

SELECT Salesperson, SUM(TotalSale)
FROM SALES
GROUP BY Salesperson
HAVING Salesperson <> ‘Ferguson’;

another implementation, the order may be that of the most recent updates. The order can also change unexpectedly if anyone physically reorganizes the database. Usually, you want to specify the order in which you want the rows. You may, for example, want to see the rows in order by the SaleDate, as follows:

SELECT * FROM SALES ORDER BY SaleDate ;


JOIN
JOINs are powerful relational operators that combine data from multiple tables
into a single result table. The source tables may have little (or even nothing)
in common with each other.

Cross join

SELECT *
FROM EMPLOYEE, COMPENSATION ;

The result table, which is the Cartesian product of the EMPLOYEE and COMPENSATION tables, contains considerable redundancy.

An equi-join is a basic join with a WHERE clause containing a condition specifying
that the value in one column in the first table must be equal to the value
of a corresponding column in the second table. Applying an equi-join to the
example tables from the previous section brings a more meaningful result:

SELECT *
FROM EMPLOYEE, COMPENSATION
WHERE EMPLOYEE.EmpID = COMPENSATION.Employ ;

The natural join is a special case of an equi-join. In the WHERE clause of an equijoin,a column from one source table is compared with a column of a second source table for equality. The two columns must be the same type and length and must have the same name. In fact, in a natural join, all columns in one table that have the same names, types, and lengths as corresponding columns
in the second table are compared for equality.

Imagine that the COMPENSATION table from the preceding example has
columns EmpID, Salary, and Bonus rather than Employ, Salary, and Bonus.
In that case, you can perform a natural join of the COMPENSATION table with
the EMPLOYEE table. The traditional JOIN syntax would look like this:

SELECT E.*, C.Salary, C.Bonus
FROM EMPLOYEE E, COMPENSATION C
WHERE E.EmpID = C.EmpID ;

The inner join is so named to distinguish it from the outer join. An inner join
discards all rows from the result table that don’t have corresponding rows in
both source tables. An outer join preserves unmatched rows. That’s the difference.

Outer join
When you’re joining two tables, the first one (call it the one on the left) may
have rows that don’t have matching counterparts in the second table (the one
on the right). Conversely, the table on the right may have rows that don’t have
matching counterparts in the table on the left. If you perform an inner join on
those tables, all the unmatched rows are excluded from the output. Outer joins,
however, don’t exclude the unmatched rows. Outer joins come in three types:
the left outer join, the right outer join, and the full outer join.

Left outer join
In a query that includes a join, the left table is the one that precedes the keyword
JOIN, and the right table is the one that follows it. The left outer join preserves
unmatched rows from the left table but discards unmatched rows
from the right table.

SELECT *
FROM LOCATION L LEFT OUTER JOIN DEPT D
ON (L.LocationID = D.LocationID)
LEFT OUTER JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID);


This join pulls data from three tables. First, the LOCATION table is joined to the DEPT table. The resulting table is then joined to the EMPLOYEE table. Rows from the table on the left of the LEFT OUTER JOIN operator that have no corresponding row in the table on the right are included in the result. Thus, in the first join, all locations are included, even if no department associated with them exists. In the second join, all departments are included, even if no employee associated with them exists.

Right outer join
I bet you figured out how the right outer join behaves. Right! The right outer
join preserves unmatched rows from the right table but discards unmatched
rows from the left table. You can use it on the same tables and get the same
result by reversing the order in which you present tables to the join:

SELECT *
FROM EMPLOYEE E RIGHT OUTER JOIN DEPT D
ON (D.DeptID = E.DeptID)
RIGHT OUTER JOIN LOCATION L
ON (L.LocationID = D.LocationID) ;

In this formulation, the first join produces a table that contains all departments,
whether they have an associated employee or not. The second join produces a table that contains all locations, whether they have an associated department or not.

Full outer join
The full outer join combines the functions of the left outer join and the right
outer join. It retains the unmatched rows from both the left and the right tables. Consider the most general case of the company database used in the preceding examples. It could have Locations with no departments, Departments with no locations, Departments with no employees, Employees with no departments

To show all locations, departments, and employees, regardless of whether
they have corresponding rows in the other tables, use a full outer join in the
following form:

SELECT *
FROM LOCATION L FULL JOIN DEPT D
ON (L.LocationID = D.LocationID)
FULL JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID) ;

Data Control Language Commands
The Data Control Language (DCL) has four commands: COMMIT, ROLLBACK,
GRANT, and REVOKE. These commands protect the database from harm, either
accidental or intentional.

Further readings;
1. Codd's 12 rules

2. Database_normalization

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...