December 16, 2013

Database supporting different languages

As we all know, many global industries wants to increase their business worldwide and grow at the same time, they would want to widen their business by providing services to the customers worldwide by supporting different languages like Chinese, Japanese, Korean and Arabic. Many websites these days are supporting international languages to do their business and to attract more and more customers and that makes life easier for both the parties.
To store the customer data into the database the database must support a mechanism to store the international characters, storing these characters is not easy, and many database vendors have to revised their strategies and come up with new mechanisms to support or to store these international characters in the database. Some of the big vendors like Oracle, Microsoft, IBM and other database vendors started providing the international character support so that the data can be stored and retrieved accordingly to avoid any hiccups while doing business with the international customers.
The difference in storing character data between Unicode and non-Unicode depends on whether non-Unicode data is stored by using double-byte character sets. All non-East Asian languages and the Thai language store non-Unicode characters in single bytes. Therefore, storing these languages as Unicode uses two times the space that is used specifying a non-Unicode code page. On the other hand, the non-Unicode code pages of many other Asian languages specify character storage in double-byte character sets (DBCS). Therefore, for these languages, there is almost no difference in storage between non-Unicode and Unicode.

Collation itself specifies the rules for how strings of character data are sorted and compared. The rules for sorting data vary depending on the language and locale.
For example if you was to use a Lithuanian collation, the letter "Y" would appear between "I" and "J" if sorted. And if using the traditional Spanish collation "ch" would be sorted at the end of a list of words beginning with "c".

You can specify collations at following:
1. Creating or altering a database.
2. Creating or altering a table column.
You can specify collations for each character string column using the COLLATE clause of the CREATE TABLE or ALTER TABLE statement. You can also specify a collation when you create a table using SQL Server Management Studio. If you do not specify a collation, the column is assigned the default collation of the database.
3.Casting the collation of an expression.
You can use the COLLATE clause to apply a character expression to a certain
collation. Character literals and variables are assigned the default collation of
the current database. Column references are assigned the definition collation of
the column.
4.When restoring or attaching a database, the default collation of the database and
the collation of any char, varchar, and text columns or parameters

The COLLATE clause can be applied only for the char, varchar, text, nchar,
nvarchar, and ntext data types.

You can execute the system function fn_helpcollations to retrieve a list of all the

valid collation names for Windows collations and SQL Server collations:
SELECT name, description
FROM fn_helpcollations();

Use of nchar, nvarchar, nvarchar(max), and ntext is the same as char, varchar, varchar(max), and text, respectively, except:
Unicode supports a wider range of characters.
More space is needed to store Unicode characters.
The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar.
Unicode constants are specified with a leading N, for example, N'A Unicode string'

The collation on the database  to accept the Russian characters using the CYRILLIC_GENERAL_CI_AS collation set.  The CI stands for CASE INSENSTIVE and the AS
stands for ACCENT SENSTIVE.

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...