Data Engineering with Avishkar: SQL

SQL is a special-purpose language designed for the creation and maintenance of data in relational databases.

Three languages-within-a-language offer everything you need to create, modify, maintain, and provide security for a relational database:

1. The Data Definition Language (DDL): The part of SQL that you use to create (completely define) a database, modify its structure, and destroy it when you no longer need it.

2. Data Manipulation Language (DML): DML Performs database maintenance.Using this powerful tool, you can specify what you want to do with the data in your database — enter it, change it, or extract it.

3. Data Control Language (DCL): DCL Protects your database from becoming corrupted. Used correctly, the DCL provides security for your database; the amount of protection depends on the implementation. If your implementation doesn’t provide sufficient protection, you must add that protection to your application program.

Creating tables
A database table is a two-dimensional array made up of rows and columns.
You can create a table by using the SQL CREATE TABLE command. Within the command, you specify the name and data type of each column.After you create a table, you can start loading it with data. (Loading data is a DML, not a DDL, function.) If requirements change, you can change a table’s structure by using the ALTER TABLE command. If a table outlives its usefulness or becomes obsolete, you can eliminate it with the DROP command. The various forms of the CREATE and ALTER commands, together with the DROP command, make up SQL’s DDL.

e.g.
CREATE TABLE CUSTOMER (
CustomerID INTEGER NOT NULL,
FirstName CHARACTER (15),
LastName CHARACTER (20) NOT NULL,
Street CHARACTER (25),
City CHARACTER (20),
State CHARACTER (2),
Zipcode INTEGER,
Phone CHARACTER (13) ) ;

The simplest use of the SELECT statement is to retrieve all the data in all the
rows of a specified table.

To do so, use the following syntax:

SELECT * FROM CUSTOMER ;

The asterisk (*) is a wildcard character that means everything. In this context,
the asterisk is a shorthand substitute for a listing of all the column names of the
CUSTOMER table. As a result of this statement, all the data in all the rows and
columns of the CUSTOMER table appear on-screen.

SELECT statements can be much more complicated than the statement in this
example. In fact, some SELECT statements can be so complicated that they’re
virtually indecipherable. This potential complexity is a result of the fact that
you can tack multiple modifying clauses onto the basic statement.

WHERE clause,

It is the most commonly used method to restrict the rows that a
SELECT statement returns.
A SELECT statement with a WHERE clause has the following general form:

SELECT column_list FROM table_name
WHERE condition ;

The following example shows a compound condition inside a SELECT
statement:

SELECT FirstName, LastName, Phone FROM CUSTOMER
WHERE State = ‘NH’
AND Status = ‘Active’ ;

This statement returns the names and phone numbers of all active customers
living in New Hampshire. The AND keyword means that for a row to qualify for
retrieval, that row must meet both conditions: State = ‘NH’ and Status =
‘Active’.

Insert Statement

You can insert a row for the new object without filling in the data in all the columns. If you want the table in first normal form, you must insert enough data to distinguish the new row from all the other rows in the table. Inserting the new row’s primary key is sufficient for this purpose. In addition to the primary key, insert any other data that you have about the object. Columns in which you enter no data contain nulls.
The following example shows such a partial row entry:

INSERT INTO CUSTOMER (CustomerID, FirstName, LastName)
VALUES (:vcustid, ‘Tyson’, ‘Taylor’) ;

Another way to copy data from one table in a database to another is to nest a
SELECT statement within an INSERT statement. This method (a subselect)
doesn’t create a virtual table but instead duplicates the selected data. You can
take all the rows from the CUSTOMER table, for example, and insert those rows
into the PROSPECT table. Of course, this only works if the structures of the
CUSTOMER and PROSPECT tables are identical. Later, if you want to isolate
those customers who live in Maine, a simple SELECT with one condition in
the WHERE clause does the trick, as shown in the following example:

INSERT INTO PROSPECT
SELECT * FROM CUSTOMER
WHERE State = ‘ME’ ;

Updating Existing Data
You can count on one thing in this world — change. If you don’t like the
current state of affairs, just wait a while. Before long, things will be different.
Because the world is constantly changing, the databases used to model
aspects of the world also need to change. A customer may change her address.
The quantity of a product in stock may change (because, you hope, someone
buys one now and then). A basketball player’s season performance statistics
change each time he plays in another game. These are the kinds of events that
require you to update a database.
SQL provides the UPDATE statement for changing data in a table. By using a
single UPDATE statement, you can change one, some, or all the rows in a
table. The UPDATE statement uses the following syntax:
UPDATE table_name
SET column_1 = expression_1, column_2 = expression_2,
..., column_n = expression_n
[WHERE predicates] ;

Customer lists change occasionally — as people move, change their phone
numbers, and so on. Suppose that Abe Abelson moves from Springfield to
Kankakee. You can update his record in the table by using the following
UPDATE statement:
UPDATE CUSTOMER
SET City = ‘Kankakee’, Telephone = ‘666-6666’
WHERE Name = ‘Abe Abelson’ ;

You can use a similar statement to update multiple rows. Assume that Philo
is experiencing explosive population growth and now requires its own area
code. You can change all rows for customers who live in Philo by using a
single UPDATE statement, as follows:

UPDATE CUSTOMER
SET AreaCode = ‘(619)’
WHERE City = ‘Philo’ ;

Updating all the rows of a table is even easier than updating only some of
the rows. You don’t need to use a WHERE clause to restrict the statement.
Imagine that the city of Rantoul has acquired major political clout and has
now annexed not only Kankakee, Decatur, and Philo, but also all the cities
and towns in the database.
You can update all the rows by using a single statement, as follows:

UPDATE CUSTOMER
SET City = ‘Rantoul’ ;

Delete statement

As time passes, data can get old and lose its usefulness. You may want to
remove this outdated data from its table. Unneeded data in a table slows performance,
consumes memory, and can confuse users. You may want to transfer
older data to an archive table and then take the archive offline. That way,
in the unlikely event that you ever need to look at that data again, you can
recover it. In the meantime, it doesn’t slow down your everyday processing.
Whether you decide that obsolete data is worth archiving or not, you eventually
come to the point where you want to delete that data. SQL provides for
the removal of rows from database tables by use of the DELETE statement.
You can delete all the rows in a table by using an unqualified DELETE statement,
or you can restrict the deletion to only selected rows by adding a WHERE
clause. The syntax is similar to the syntax of a SELECT statement, except that
you use no specification of columns. If you delete a table row, you remove all
the data in that row’s columns.
For example, suppose that your customer, David Taylor, just moved to Tahiti
and isn’t going to buy anything from you anymore.
You can remove him from
your CUSTOMER table by using the following statement:

DELETE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’ ;

Assuming that you have only one customer named David Taylor, this statement
makes the intended deletion. If any chance exists that you have two
customers who share the name David Taylor.

Set functions

Sometimes, the information that you want to extract from a table doesn’t relate
to individual rows but rather to sets of rows. These functions are COUNT,
MAX, MIN, SUM, and AVG. Each function performs an action that draws data
from a set of rows rather than from a single row.

COUNT
The COUNT function returns the number of rows in the specified table. To
count the number of precocious seniors in my example high school database,
use the following statement:
SELECT COUNT (*)
FROM STUDENT
WHERE Grade = 12 AND Age < 14 ;

MAX
Use the MAX function to return the maximum value that occurs in the specified
column. Say that you want to find the oldest student enrolled in your
school. The following statement returns the appropriate row:
SELECT FirstName, LastName, Age
FROM STUDENT
WHERE Age = (SELECT MAX(Age) FROM STUDENT);
This statement returns all students whose ages are equal to the maximum
age. That is, if the age of the oldest student is 23, this statement returns the
first and last names and the age of all students who are 23 years old.
This query uses a subquery. The subquery SELECT MAX(Age) FROM
STUDENT is embedded within the main query. I talk about subqueries (also
called nested queries)

MIN
The MIN function works just like MAX except that MIN looks for the minimum
value in the specified column rather than the maximum. To find the youngest
student enrolled, you can use the following query:
SELECT FirstName, LastName, Age
FROM STUDENT
WHERE Age = (SELECT MIN(Age) FROM STUDENT);
This query returns all students whose age is equal to the age of the youngest
student.

SUM
The SUM function adds up the values in a specified column. The column must
be one of the numeric data types, and the value of the sum must be within the
range of that type. Thus, if the column is of type SMALLINT, the sum must be
no larger than the upper limit of the SMALLINT data type. In the retail database
, the INVOICE table contains a record of all sales.
To find the total dollar value of all sales recorded in the database, use the SUM
function as follows:

SELECT SUM(TotalSale) FROM INVOICE;

AVG
The AVG function returns the average of all the values in the specified
column. As does the SUM function, AVG applies only to columns with a
numeric data type. To find the value of the average sale, considering all transactions
in the database, use the AVG function like this:
SELECT AVG(TotalSale) FROM INVOICE
Nulls have no value, so if any of the rows in the TotalSale column contain null
values, those rows are ignored in the computation of the value of the average
sale.

Zeroing In on the Data You Want

The modifying clauses available in SQL are FROM, WHERE, HAVING, GROUP BY,
and ORDER BY. The FROM clause tells the database engine which table or tables
to operate on. The WHERE and HAVING clauses specify a data characteristic that
determines whether or not to include a particular row in the current operation.
The GROUP BY and ORDER BY clauses specify how to display the retrieved
rows.

SELECT * FROM SALES ;

This statement returns all the data in all the rows of every column in the
SALES table. You can, however, specify more than one table in a FROM clause.
Consider the following example:
SELECT *
FROM CUSTOMER, SALES ;

IN and NOT IN
The IN and NOT IN predicates deal with whether specified values (such as
OR, WA, and ID) are contained within a particular set of values You may, for example, have a table that lists suppliers of a commodity that your company purchases on a regular basis. You want to know the phone numbers of those suppliers located in the Pacific Northwest. You can find these numbers by using comparison predicates,
such as those shown in the following example:

SELECT Company, Phone
FROM SUPPLIER
WHERE State = ‘OR’ OR State = ‘WA’ OR State = ‘ID’ ;

You can also use the IN predicate to perform the same task, as follows:
SELECT Company, Phone
FROM SUPPLIER
WHERE State IN (‘OR’, ‘WA’, ‘ID’) ;
This formulation is a more compact than the one using comparison predicates
and logical OR.
The NOT IN version of this predicate works the same way. Say that you have
locations in California, Arizona, and New Mexico, and to avoid paying sales
tax, you want to consider using suppliers located anywhere except in those
states. Use the following construction:

SELECT Company, Phone
FROM SUPPLIER
WHERE State NOT IN (‘CA’, ‘AZ’, ‘NM’) ;

GROUP BY clause
with one of the aggregate functions (also called set functions) to get a quantitative
picture of sales performance. For example, you can see which salesperson is selling more of the profitable high-ticket items by using the average (AVG) function as follows:

SELECT Salesperson, AVG(TotalSale)
FROM SALES
GROUP BY Salesperson;

You can analyze the grouped data further by using the HAVING clause. The
HAVING clause is a filter that acts similar to a WHERE clause, but on groups of
rows rather than on individual rows. To illustrate the function of the HAVING
clause, suppose the sales manager considers Ferguson to be in a class by
himself. His performance distorts the overall data for the other salespeople.
You can exclude Ferguson’s sales from the grouped data by using a HAVING
clause as follows:

SELECT Salesperson, SUM(TotalSale)
FROM SALES
GROUP BY Salesperson
HAVING Salesperson <> ‘Ferguson’;

another implementation, the order may be that of the most recent updates. The order can also change unexpectedly if anyone physically reorganizes the database. Usually, you want to specify the order in which you want the rows. You may, for example, want to see the rows in order by the SaleDate, as follows:

SELECT * FROM SALES ORDER BY SaleDate ;

JOIN
JOINs are powerful relational operators that combine data from multiple tables
into a single result table. The source tables may have little (or even nothing)
in common with each other.

Cross join

SELECT *
FROM EMPLOYEE, COMPENSATION ;

The result table, which is the Cartesian product of the EMPLOYEE and COMPENSATION tables, contains considerable redundancy.

An equi-join is a basic join with a WHERE clause containing a condition specifying
that the value in one column in the first table must be equal to the value
of a corresponding column in the second table. Applying an equi-join to the
example tables from the previous section brings a more meaningful result:

SELECT *
FROM EMPLOYEE, COMPENSATION
WHERE EMPLOYEE.EmpID = COMPENSATION.Employ ;

The natural join is a special case of an equi-join. In the WHERE clause of an equijoin,a column from one source table is compared with a column of a second source table for equality. The two columns must be the same type and length and must have the same name. In fact, in a natural join, all columns in one table that have the same names, types, and lengths as corresponding columns
in the second table are compared for equality.

Imagine that the COMPENSATION table from the preceding example has
columns EmpID, Salary, and Bonus rather than Employ, Salary, and Bonus.
In that case, you can perform a natural join of the COMPENSATION table with
the EMPLOYEE table. The traditional JOIN syntax would look like this:

SELECT E.*, C.Salary, C.Bonus
FROM EMPLOYEE E, COMPENSATION C
WHERE E.EmpID = C.EmpID ;

The inner join is so named to distinguish it from the outer join. An inner join
discards all rows from the result table that don’t have corresponding rows in
both source tables. An outer join preserves unmatched rows. That’s the difference.

Outer join
When you’re joining two tables, the first one (call it the one on the left) may
have rows that don’t have matching counterparts in the second table (the one
on the right). Conversely, the table on the right may have rows that don’t have
matching counterparts in the table on the left. If you perform an inner join on
those tables, all the unmatched rows are excluded from the output. Outer joins,
however, don’t exclude the unmatched rows. Outer joins come in three types:
the left outer join, the right outer join, and the full outer join.

Left outer join
In a query that includes a join, the left table is the one that precedes the keyword
JOIN, and the right table is the one that follows it. The left outer join preserves
unmatched rows from the left table but discards unmatched rows
from the right table.

SELECT *
FROM LOCATION L LEFT OUTER JOIN DEPT D
ON (L.LocationID = D.LocationID)
LEFT OUTER JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID);

This join pulls data from three tables. First, the LOCATION table is joined to the DEPT table. The resulting table is then joined to the EMPLOYEE table. Rows from the table on the left of the LEFT OUTER JOIN operator that have no corresponding row in the table on the right are included in the result. Thus, in the first join, all locations are included, even if no department associated with them exists. In the second join, all departments are included, even if no employee associated with them exists.

Right outer join
I bet you figured out how the right outer join behaves. Right! The right outer
join preserves unmatched rows from the right table but discards unmatched
rows from the left table. You can use it on the same tables and get the same
result by reversing the order in which you present tables to the join:

SELECT *
FROM EMPLOYEE E RIGHT OUTER JOIN DEPT D
ON (D.DeptID = E.DeptID)
RIGHT OUTER JOIN LOCATION L
ON (L.LocationID = D.LocationID) ;

In this formulation, the first join produces a table that contains all departments,
whether they have an associated employee or not. The second join produces a table that contains all locations, whether they have an associated department or not.

Full outer join
The full outer join combines the functions of the left outer join and the right
outer join. It retains the unmatched rows from both the left and the right tables. Consider the most general case of the company database used in the preceding examples. It could have Locations with no departments, Departments with no locations, Departments with no employees, Employees with no departments

To show all locations, departments, and employees, regardless of whether
they have corresponding rows in the other tables, use a full outer join in the
following form:

SELECT *
FROM LOCATION L FULL JOIN DEPT D
ON (L.LocationID = D.LocationID)
FULL JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID) ;

Data Control Language Commands
The Data Control Language (DCL) has four commands: COMMIT, ROLLBACK,
GRANT, and REVOKE. These commands protect the database from harm, either
accidental or intentional.

Further readings;
1. Codd's 12 rules

2. Database_normalization

Data Engineering with Avishkar

January 12, 2012

SQL - For beginners

No comments:

Creating DataFrames from CSV in Apache Spark

Search This Blog