Hadoop-Hive-PySpark-CyberAnalytics

This repository demonstrates a comprehensive big data analytics pipeline tailored for cyber threat analysis using Apache Hadoop, Apache Hive, and PySpark. It leverages the UNSW-NB15 dataset to provide deep insights into cybersecurity threats.

Prerequisites

Ensure you have the following installed on your system:

  • Apache Hadoop (3.3.6)
  • Apache Hive (4.0.1)
  • PySpark
  • Python 3.x
  • Jupyter Notebook

Installation

Follow these steps to set up your environment:

  1. Clone the Repository: Clone the repository and navigate to the project directory using the following commands:
    git clone https://github.com/tashi-2004/Hadoop-Spark-Hive-CyberAnalytics.git
    cd Hadoop-Spark-Hive-CyberAnalytics

Understanding the Dataset: UNSW-NB15

The UNSW-NB15 dataset was created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The raw network traffic was captured with the tcpdump tool and amounts to roughly 100 GB of pcap files.

Key Features of the Dataset:

  • It includes nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
  • Tools such as Argus and Bro-IDS were used, along with twelve algorithms, to generate 49 features, including a class label.
  • The labeled data are provided as CSV files containing a total of about 2.54 million records, with a combined size of approximately 600 MB.

Steps for Analysis:

  1. Explore the dataset by importing it into Hadoop HDFS.
  2. Use Hive to query and print the first 5-10 records for better understanding.
  3. Proceed with big data analytics using PySpark and Hive for advanced modeling and visualization.

Start Hadoop Services

Navigate to the Hadoop directory and start all necessary services using the start-all.sh script:

start-all.sh

Load Data to HDFS

Put the UNSW-NB15 dataset into HDFS to make it accessible for analysis. Use the following command to load the data:

hadoop fs -put /path/to/UNSW-NB15.csv /user/in/hdfs

Execute Hive Queries

After the data is loaded into HDFS, proceed to execute Hive queries to analyze the dataset:

hive -f hivequeries.hql

Hive Query 1

Hive Query 2

Hive Query 3

Hive Query 4

Hive Query 5

PySpark Analysis

Following the Hive query execution, use PySpark to perform further data analysis. Launch Jupyter and open the notebook to carry out this step:

jupyter notebook pyspark.ipynb
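
Because the SparkSession can be built with Hive support, the notebook can also query the table created by the Hive step directly. A minimal sketch, assuming the table is named unsw_nb15 (use whatever name hivequeries.hql actually creates):

from pyspark.sql import SparkSession

# Hive support lets Spark SQL see tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("UNSW-NB15-Analysis")
         .enableHiveSupport()
         .getOrCreate())

# Preview a few rows of the (assumed) Hive table populated earlier.
spark.sql("SELECT * FROM unsw_nb15 LIMIT 10").show()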

Key Steps in the Analysis

1. Data Loading and Preprocessing

The UNSW-NB15 dataset is loaded and preprocessed to prepare it for analysis.
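
A minimal sketch of this step, assuming the CSV copy in HDFS carries a header row and a binary label column named label; the raw UNSW-NB15 files ship without headers, in which case column names must be assigned from the dataset's feature-description file:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("UNSW-NB15-Analysis").getOrCreate()

# Load the CSV that was put into HDFS earlier; inferSchema detects numeric columns.
df = spark.read.csv("/user/in/hdfs/UNSW-NB15.csv", header=True, inferSchema=True)

# Basic preprocessing: drop rows with missing values and cast the label to an integer.
df = df.na.drop().withColumn("label", col("label").cast("int"))

df.show(5)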

2. Descriptive Statistics

Summary statistics of the dataset show the count, mean, standard deviation, and range for all features.
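
A sketch of how these statistics can be produced with the DataFrame API, continuing from the df prepared above; the column name dur in the quantile example is an assumption:

# count, mean, stddev, min, and max for every column.
df.describe().show(truncate=False)

# Approximate quartiles give a fuller picture of a column's range.
quartiles = df.approxQuantile("dur", [0.25, 0.5, 0.75], 0.01)
print(quartiles)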

3. Correlation Analysis

A correlation matrix was generated to identify relationships between numerical features.
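
One way to compute such a matrix with pyspark.ml, continuing from the df above: the numeric feature columns are assembled into a single vector and passed to Correlation.corr.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Collect the numeric columns, leaving the label out of the feature vector.
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "float", "double")]
feature_cols = [c for c in numeric_cols if c != "label"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
vec_df = assembler.transform(df).select("features")

# Pearson correlation matrix over all assembled features.
corr_matrix = Correlation.corr(vec_df, "features", "pearson").head()[0]
print(corr_matrix.toArray())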

4. Kernel Density Estimation

A kernel density plot was created to analyze the distribution of the duration feature.
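
A sketch using the RDD-based KernelDensity estimator from pyspark.mllib; the column name dur, the bandwidth, and the evaluation grid are illustrative assumptions:

import numpy as np
from pyspark.mllib.stat import KernelDensity

# Duration values as an RDD of floats.
dur_rdd = df.select("dur").rdd.map(lambda row: float(row[0]))

kd = KernelDensity()
kd.setSample(dur_rdd)
kd.setBandwidth(0.5)

# Evaluate the density on an evenly spaced grid; the result can be plotted with matplotlib.
grid = np.linspace(0.0, 10.0, 100).tolist()
densities = kd.estimate(grid)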

5. Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality. The first two principal components explain most of the variability in the data.
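
A sketch of the PCA step, reusing the assembled vectors (vec_df) from the correlation step and standardizing them first; k=2 matches the two components discussed above:

from pyspark.ml.feature import StandardScaler, PCA

# Standardize the features so that each contributes on a comparable scale.
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
scaled_df = scaler.fit(vec_df).transform(vec_df)

# Project onto the first two principal components.
pca = PCA(k=2, inputCol="scaled", outputCol="pca_features")
pca_model = pca.fit(scaled_df)

print(pca_model.explainedVariance)   # variance captured by PC1 and PC2
pca_df = pca_model.transform(scaled_df).select("pca_features")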

6. K-Means Clustering

K-Means clustering was performed to identify clusters within the dataset.
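
A sketch of the clustering step on the scaled vectors from the PCA step; k=2 is an illustrative choice (normal vs. attack) rather than a tuned value:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(k=2, seed=42, featuresCol="scaled", predictionCol="cluster")
kmeans_model = kmeans.fit(scaled_df)
clustered = kmeans_model.transform(scaled_df)

# Silhouette score gives a rough sense of how well the clusters separate.
evaluator = ClusteringEvaluator(featuresCol="scaled", predictionCol="cluster")
print(evaluator.evaluate(clustered))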

7. Classification with Logistic Regression

A Logistic Regression model was used for binary classification, with the following results:

  • Confusion Matrix
  • Classification Report
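
A minimal sketch of how such a model can be trained and evaluated with pyspark.ml, reusing the assembler and df from earlier; the split ratio and hyperparameters are assumptions, not the notebook's exact settings:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble features, keep the binary label, and split into train/test sets.
data = assembler.transform(df).select("features", "label")
train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
lr_model = lr.fit(train)
predictions = lr_model.transform(test)

# Confusion matrix via a simple group-by on actual label vs. prediction.
predictions.groupBy("label", "prediction").count().show()

# Area under the ROC curve as a single summary metric.
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(predictions))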

8. Classification with Random Forest

A Random Forest classifier was trained for binary classification of normal vs. attack traffic. Below are the confusion matrix and classification report:

  • Confusion Matrix
  • Classification Report
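
A sketch of the random forest step on the same train/test split used above; the tree count is illustrative rather than a tuned value:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100, seed=42)
rf_model = rf.fit(train)
rf_predictions = rf_model.transform(test)

# Confusion matrix and overall accuracy for normal vs. attack traffic.
rf_predictions.groupBy("label", "prediction").count().show()
accuracy = MulticlassClassificationEvaluator(labelCol="label",
                                             metricName="accuracy").evaluate(rf_predictions)
print(accuracy)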

Contact

For queries or contributions, please contact: Tashfeen Abbasi
Email: [email protected]
