March 28, 2024

Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("CSV Example").getOrCreate()


sc = spark.sparkContext


Spark SQL provides spark.read.csv("file_name") to read a file or a directory of files in CSV format into a Spark DataFrame, and df.write.csv("path") to write a DataFrame out to CSV. The option() method can be used to customize reading or writing behavior, such as controlling the header, the delimiter character, the character set, and so on.


# A CSV dataset is pointed to by path.

# The path can be either a single CSV file or a directory of CSV files

path = "D:/spark/data/csv/sales.csv"


df = spark.read.csv(path)

df.show()


# Read a csv with delimiter and a header

df_header = spark.read.option("delimiter", ",").option("header", True).csv(path)

df_header.show()
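To make the header and delimiter options concrete, here is a small plain-Python sketch (using the stdlib csv module rather than Spark; the column names are made up) showing the file layout those options tell Spark's reader to expect: a first header row, and fields separated by the chosen delimiter.

```python
import csv
import io

# Build a small CSV in memory with a header row and ';' as the delimiter
buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", lineterminator="\n")
writer.writerow(["id", "product", "amount"])   # header row
writer.writerow([1, "keyboard", 25.0])
writer.writerow([2, "mouse", 10.5])

# Read it back, honoring the same header and delimiter
buf.seek(0)
reader = csv.DictReader(buf, delimiter=";")
rows = list(reader)
print(rows[0]["product"])  # first data row's 'product' column
```

Spark's option("header", True) plays the role of DictReader's header handling here, and option("delimiter", ";") corresponds to the delimiter argument.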




March 19, 2024

Vector Database

 Generative AI processes, and applications that incorporate generative AI functionality, rely natively on access to vector embeddings. This data type provides the semantics necessary for AI to emulate long-term memory, akin to human capability, enabling it to draw upon and recall information when executing complex tasks.

Vector embeddings serve as the fundamental data representation utilized by AI models, including Large Language Models, to make intricate decisions. Similar to human memories, they exhibit complexity, dimensions, patterns, and relationships, all of which must be stored and represented within underlying structures. Consequently, for AI workloads, a purpose-built database, such as a vector database, is essential. This specialized storage system is designed for highly scalable access, specifically tailored for storing and accessing vector embeddings commonly employed in AI and machine learning applications for swift and accurate data retrieval.




A vector database is a database type meticulously crafted for storing and querying high-dimensional vectors. Vectors serve as mathematical representations of objects or data points in a multi-dimensional space, where each dimension corresponds to a specific feature or attribute.



e.g. Text data is passed through a text transformer (an embedding model) that converts the text to vectors, which are then stored in the vector database.
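As a toy illustration of the idea only (real systems use trained transformer embedding models, not word hashing), any text can be mapped to a fixed-length numeric vector:

```python
import hashlib

def toy_embed(text, dim=8):
    """Map a text to a fixed-length vector by hashing its words.

    This is a toy stand-in for a real embedding model: it records which
    words occur, but captures none of their meaning.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # md5 is used (instead of the builtin hash) so results are deterministic
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

v = toy_embed("orange traffic signal")
print(len(v))  # 8 -- every text maps to the same fixed dimensionality
```

A real text transformer would instead place semantically similar texts near each other in the vector space, which is what makes similarity search meaningful.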




These databases possess the capability to store and retrieve large volumes of data as vectors in a multi-dimensional space, thereby enabling vector search. Vector search, utilized by AI processes, correlates data by comparing the mathematical embeddings or encodings of the data with search parameters, returning results aligned with the query's trajectory.

At the heart of this AI revolution lies vector search, also known as nearest neighbor search. This mechanism empowers AI models to locate the items in a collection most closely related to a given query. Unlike traditional search models that focus on exact matches, vector search represents data points as vectors with direction and magnitude in a high-dimensional space, and assesses similarity by comparing the query vector against the stored vectors across all dimensions.

The implementation of a vector search engine represents a significant advancement, facilitating more sophisticated and accurate searches through vast and intricate datasets. Vector search operates by mathematically calculating the distance or similarity between vectors, employing various formulas like cosine similarity or Euclidean distance.
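The two similarity formulas mentioned above are short to state in code. This sketch assumes plain Python lists as vectors:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); ~1.0 means same direction, 0.0 orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance between the two points; 0.0 means identical
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # ~1.0: same direction
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Cosine similarity compares only direction (useful when vector magnitude is not meaningful), while Euclidean distance also accounts for magnitude; vector databases typically let you choose the metric per index.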

Unlike traditional search algorithms utilizing keywords, word frequency, or word similarity, vector search utilizes the distance representation embedded into dataset vectorization to identify similarity and semantic relationships.

The contextual relevance facilitated by vector search finds application across various domains:

  1. Similarity Retrieval: Enables applications to adapt to contextual input, facilitating quick identification of variations matching user requirements.

  2. Content Filtering and Recommendation: Offers a refined approach to filtering content, considering numerous contextual data points to identify additional content with similar attributes.

  3. Interactive User Experience: Facilitates direct interaction with large datasets, providing users with more relevant results through natural language processing.

  4. Retrieval Augmented Generation (RAG): Bridges the gap between using data for predicting outcomes and responding to outcomes, augmenting outputs to enhance relevance continually.

Vector search operates by transforming all data into Vector Embeddings. These embeddings serve as mathematical representations of objects, essential for calculating similarity and difference between data points within a multidimensional vector space.

In traditional keyword search, exact results thrive when specific details are known. Vector search, by contrast, identifies and retrieves semantically similar information, searching by similarity rather than exact matches. This makes vector search well suited to flexibly comparing and searching large datasets.




Leveraging vector search offers numerous benefits, including efficient querying and browsing of unstructured data, adding contextual meaning to data using embeddings, providing multidimensional graphical representations of search results, enabling more relevant results based on nearest neighbor search patterns, and enhancing semantic understanding of queries for more accurate results.

The operation of a vector database encompasses indexing, querying, and post-processing stages. Indexing involves encoding information for storage and retrieval, while querying identifies the nearest information to the provided query. Post-processing evaluates the query and returns a final answer by re-ranking the nearest neighbors using different or multiple similarity measurements.




For example:

  • Indexing: The term "Orange Signal" on a traffic signal would be indexed differently from the fruit "Orange."
  • Querying: Searching for "Orange Signal" on a traffic signal would find similar instances (nearest neighbors) to provide instructions to slow down the vehicle.
  • Post-processing: After finding similar instances, the system may provide instructions to stop if the signal changes to red.
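The three stages can be sketched as a brute-force nearest-neighbor search over a tiny in-memory "index". Real vector databases replace the linear scan with approximate index structures (e.g. HNSW graphs), and the vectors below are made up purely for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Indexing: store each item with its (made-up) embedding vector
index = {
    "orange signal (traffic)": [0.9, 0.1, 0.0],
    "red signal (traffic)":    [0.8, 0.25, 0.1],
    "orange (fruit)":          [0.1, 0.9, 0.7],
}

# Querying: find the k stored vectors nearest to the query vector
def query(q, k=2):
    return sorted(index, key=lambda name: euclidean(index[name], q))[:k]

# Post-processing: re-rank / filter the neighbors (here: keep traffic items)
neighbors = query([0.85, 0.15, 0.05], k=2)
answer = [n for n in neighbors if "traffic" in n]
print(answer)  # both traffic items, the fruit is filtered out
```

A query vector near the traffic-signal embeddings retrieves the signal entries rather than the fruit, mirroring the "Orange Signal" example above.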

Examples of Vector Databases:

Milvus: Milvus is an open-source vector database built for deep learning applications. Milvus offers features such as real-time vector indexing, GPU acceleration, and integration with popular deep learning frameworks like TensorFlow and PyTorch.

Faiss: Faiss is a library for efficient similarity search and clustering of dense vectors. Developed by Facebook AI Research (FAIR), Faiss is optimized for high-dimensional data and large-scale datasets. It offers various indexing methods and supports both CPU and GPU computations for fast retrieval of nearest neighbors.


Pinecone: Pinecone is a managed vector database service designed for real-time applications. It offers scalable storage and retrieval of vector embeddings, with features such as automatic indexing, dynamic scaling, and low-latency query processing. Pinecone supports integration with popular machine learning frameworks and provides APIs for building recommendation systems, search engines, and other AI applications.


While AstraDB, MongoDB, and PostgreSQL are not dedicated vector databases like Milvus or Faiss, they can serve as viable options for storing and querying vector embeddings, especially in scenarios where other types of data also need to be managed within the same database system.

AstraDB: AstraDB is a managed database service by DataStax, based on Apache Cassandra. While AstraDB is primarily known for its distributed and scalable architecture for storing structured data, it can also be used to store vector embeddings efficiently. By leveraging Cassandra's flexible data model and distributed nature, developers can design schemas to store vectors and use Cassandra's querying capabilities to retrieve them.

MongoDB: MongoDB is a popular NoSQL database known for its flexible document-oriented data model. While MongoDB is not specifically designed for vector storage, it can store vector embeddings as part of document structures. Developers can store vectors as arrays or embedded documents within MongoDB documents and use MongoDB's indexing and querying capabilities to perform similarity searches.

PostgreSQL: PostgreSQL is a powerful open-source relational database known for its extensibility and advanced features. pgvector is an open-source extension for PostgreSQL that adds a vector data type, along with distance operators and index types for storing, indexing, and querying high-dimensional vectors. This is particularly useful for similarity search, recommendation systems, and other machine learning tasks that require handling vector data within a relational database environment.
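A minimal SQL sketch of what this looks like, assuming the pgvector extension is available (the table and data here are made up for illustration):

```sql
-- Enable the extension and store 3-dimensional embeddings
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(3)
);

INSERT INTO items (content, embedding)
VALUES ('orange signal', '[0.9, 0.1, 0.0]'),
       ('orange fruit',  '[0.1, 0.9, 0.7]');

-- Nearest neighbor by Euclidean distance (pgvector's <-> operator)
SELECT content
FROM items
ORDER BY embedding <-> '[0.85, 0.15, 0.05]'
LIMIT 1;
```

Real embeddings would have hundreds or thousands of dimensions, and an index (e.g. pgvector's ivfflat) would typically be created on the embedding column to avoid a full scan.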



March 16, 2024

Creating DataFrames in Apache Spark

 Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It offers high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of developers.

How to Install Apache Spark on Microsoft Windows 10

In Apache Spark, SparkSession is the entry point for working with structured data in Spark, introduced in Spark 2.0. It combines the functionality previously provided by SQLContext, HiveContext, and SparkContext into a single unified interface.

SparkSession provides a unified entry point for interacting with Spark functionality, including SQL, DataFrame, and Dataset operations.


Creating DataFrames: 

SparkSession allows you to create DataFrames from various data sources such as JSON, CSV, Parquet, JDBC, Avro, and more. It provides methods like read and readStream to read data into DataFrames and Datasets.


Create DataFrames from JSON data sources using PySpark:


from pyspark.sql import SparkSession


# Create a SparkSession

spark = SparkSession.builder \

    .appName("JSON Example") \

    .getOrCreate()


# Define the path to the JSON file

json_file_path = "d:/spark/examples/src/main/resources/people.json"


# Create a DataFrame from JSON

people_df = spark.read.json(json_file_path)


# Show the schema of the DataFrame

people_df.printSchema()


# Show the contents of the DataFrame

people_df.show()


>>> people_df.show()

+----+-------+

| age|   name|

+----+-------+

|null|Michael|

|  30|   Andy|

|  19| Justin|

+----+-------+


>>> # Register the DataFrame as a SQL temporary view

>>> people_df.createOrReplaceTempView("people")

>>> sqlDF = spark.sql("SELECT * FROM people")

>>> sqlDF.show()

+----+-------+

| age|   name|

+----+-------+

|null|Michael|

|  30|   Andy|

|  19| Justin|

+----+-------+

# Create a DataFrame from TEXT file

>>> path  = "d:/spark/examples/src/main/resources/people.txt"

>>>

>>> dftext = spark.read.text(path)

>>> dftext.show()

+-----------+

|      value|

+-----------+

|Michael, 29|

|   Andy, 30|

| Justin, 19|

+-----------+



March 12, 2024

Install and Configure Apache Cassandra on Windows

 


1. Introduction

      Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra was designed to combine Amazon's Dynamo distributed storage and replication techniques with Google's Bigtable data and storage engine model.

       Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009, it became an Apache Incubator project. On February 17, 2010, it graduated to a top-level project.

      Facebook developers named their database after the Trojan mythological prophet Cassandra, with classical allusions to a curse on an oracle.

2. Installation and Configuration

2.1 Installing Cassandra

https://phoenixnap.com/kb/install-cassandra-on-windows

Dependencies

  • Apache Cassandra requires Java 8 to run on a Windows system.

  • The Cassandra command-line shell (cqlsh) depends on Python 2.7 to work correctly.

To be able to install Cassandra on Windows, first you need to:

  1. Download and Install Java 8 and set environment variables.
  2. Download and install Python 2.7 and set environment variables.

If you already have these dependencies installed, check your versions of Python and Java. If you have Java 8 and Python 2.7, feel free to move on to the third section of this guide.

Step 1: Install Java 8 on Windows

1. Download Oracle JDK 8 (Java Development Kit). Visit the official Oracle download page and download the Oracle JDK 8 software package.

2. Scroll down and locate the Java SE Development Kit 8u251 for Windows x64 download link. The Java 8 download starts automatically after signup.

 

Note: If you do not have an Oracle account, the website guides you through a quick signup process. Alternatively, you can download Java from a third-party website of your choosing. Always make sure to confirm the source of the download.

3. Once the download is complete, double-click the downloaded executable file. Select Next on the initial installation screen.

 4. The following section allows you to select optional features and define the location of the installation folder. Accept the default settings and take note of the full path to the installation folder, C:\Program Files\Java\jdk1.8.0_251. Once you are ready to proceed with the installation, click Next.

5. The installation process can take several minutes. Select Close once the process is completed.

 Configure Environment Variables for Java 8

--------It is vital to configure the environment variables in Windows and define the correct path to the Java 8 installation folder.

1. Navigate to This PC > Properties.

 2. Select Advanced system settings.

3. Click the Environment Variables button.

 4. Select New in the System Variable section.

 5. Enter JAVA_HOME for the new variable name. Select the Variable value field and then the Browse Directory option.

 6. Navigate to This PC > Local Disk C: > Program Files > Java > jdk1.8.0_251 and select OK.

7. Once the correct path to the JDK 8 installation folder has been added to the JAVA_HOME system variable, click OK.

 

8. You have successfully added the JAVA_HOME system variable with the correct JDK 8 path to the variable list. Select OK in the main Environment Variables window to complete the process.

 Step 2: Install and Configure Python 2.7 on Windows

  • Users interact with the Cassandra database by utilizing the cqlsh command-line shell.

  • We need to install Python 2.7 for cqlsh to handle user requests properly.

Install Python 2.7 on Windows

1. Visit the Python official download page and select the Windows x64 version link.

 2. Define if you would like Python to be available to all users on this machine or just for your user account and select Next.

 3. Specify and take note of the Python installation folder location. Feel free to leave the default location C:\Python27 by clicking Next.

 4. The following step allows you to customize the Python installation package. Select Next to continue the installation using the default settings.

5. The installation process takes a few moments. Once it is complete, select Finish to conclude the installation process.

 Edit Environment Variable for Python 2.7

1. Navigate to This PC > Properties.

 2. Select the Advanced system settings option.

 3. Click Environment Variables…

 4. Double-click on the existing Path system variable.

 5. Select New and then Browse to locate the Python installation folder quickly. Once you have confirmed that the path is correct, click OK.

 6. Add the Python 2.7 path to the Path system variable by selecting OK.

 

Step 3: Download and Set Up Apache Cassandra

Download and Extract the Cassandra tar.gz Folder

1. Visit the official Apache Cassandra Download page and select the version you would prefer to download. Currently, the latest available version is 3.11.6.

 2. Click the suggested Mirror download link to start the download process.


Note: It is always recommended to verify downloads originating from mirror sites. The instructions for using GPG or SHA-512 for verification are usually available on the official download page.

 3. Unzip the compressed tar.gz folder using a compression tool such as 7-Zip or WinZip. In this example, the compressed folder was unzipped, and the content placed in the C:\Cassandra\apache-cassandra-3.11.6 folder.

 Configure Environment Variables for Cassandra

Set up the environment variables for Cassandra to enable the database to interact with other applications and operate on Windows.

1. Go to This PC > Properties.

 2. Go to Advanced system settings.

 3. Click the Environment Variables button.

 4. Add a completely new entry by selecting the New option.

 5. Type CASSANDRA_HOME for the Variable name, then in the Variable value field select the location of the unzipped Apache Cassandra folder.

Based on the previous steps, the location is C:\Cassandra\apache-cassandra-3.11.6. Once you have confirmed that the location is correct, click OK.

 

6. Double click on the Path variable.

 7. Select New and then Browse. In this instance, you need to add the full path to the bin folder located within the Apache Cassandra folder, C:\Cassandra\apache-cassandra-3.11.6\bin.

 8. Hit the OK button and then again OK to save the edited variables.

 Step 4: Start Cassandra from Windows CMD

Navigate to the Cassandra bin folder. Start the Windows Command Prompt directly from within the bin folder by typing cmd in the address bar and pressing Enter.

 Type the following command to start the Cassandra server:

cassandra

The system proceeds to start the Cassandra Server.

 

Do not close the current cmd session.

Step 5: Access Cassandra cqlsh from Windows CMD

While the initial command prompt is still running, open a new command prompt from the same bin folder. Enter the following command to access the Cassandra cqlsh shell:

cqlsh

You now have access to the Cassandra shell and can proceed to issue basic database commands to your Cassandra server.
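For instance, a first session might create a keyspace and a table and insert a row (a minimal sketch; the keyspace, table, and column names are made up):

```
cqlsh> CREATE KEYSPACE demo
   ...   WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE demo;
cqlsh:demo> CREATE TABLE users (id uuid PRIMARY KEY, name text);
cqlsh:demo> INSERT INTO users (id, name) VALUES (uuid(), 'Alice');
cqlsh:demo> SELECT * FROM users;
```

SimpleStrategy with a replication factor of 1 is suitable only for a single-node development setup like this one; production clusters use NetworkTopologyStrategy with higher replication factors.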


You have successfully installed Cassandra on Windows.

 

