Hadoop-Hive-PySpark-CyberAnalytics

This repository demonstrates a comprehensive big data analytics pipeline tailored for cyber threat analysis using Apache Hadoop, Apache Hive, and PySpark. It leverages the UNSW-NB15 dataset to provide deep insights into cybersecurity threats.

Prerequisites

Ensure you have the following installed on your system:

  • Apache Hadoop (3.3.6)
  • Apache Hive (4.0.1)
  • PySpark
  • Python 3.x
  • Jupyter Notebook

Installation

Follow these steps to set up your environment:

  1. Clone the Repository: Clone the repository and navigate to the project directory using the following commands:
    git clone https://github.com/tashi-2004/Hadoop-Spark-Hive-CyberAnalytics.git
    cd Hadoop-Spark-Hive-CyberAnalytics

Understanding the Dataset: UNSW-NB15

The UNSW-NB15 dataset was created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The raw network traffic was captured with the tcpdump tool and amounts to roughly 100 GB of pcap files.

Key Features of the Dataset:

  • It includes nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
  • Tools such as Argus and Bro-IDS were used, along with twelve algorithms, to generate 49 features, including a class label.
  • The labeled data are provided as CSV files containing a total of about 2.54 million records, with a combined size of approximately 600 MB.

Steps for Analysis:

  1. Explore the dataset by importing it into Hadoop HDFS.
  2. Use Hive to query and print the first 5-10 records for better understanding.
  3. Proceed with big data analytics using PySpark and Hive for advanced modeling and visualization.

Start Hadoop Services

Navigate to the Hadoop directory and start all necessary services using the start-all.sh script:

start-all.sh

Load Data to HDFS

Put the UNSW-NB15 dataset into HDFS to make it accessible for analysis. Use the following command to load the data:

hadoop fs -put /path/to/UNSW-NB15.csv /user/in/hdfs

Execute Hive Queries

After the data is loaded into HDFS, proceed to execute Hive queries to analyze the dataset:

hive -f hivequeries.hql

Hive Query 1

Hive Query 2

Hive Query 3

Hive Query 4

Hive Query 5

PySpark Analysis

Following the Hive query execution, use PySpark to perform further data analysis. Launch Jupyter and open the notebook to carry out this step:

jupyter notebook pyspark.ipynb
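
Because the SparkSession can be built with Hive support, the notebook can also query the table created by the Hive step directly. A minimal sketch, assuming the table is named unsw_nb15 (use whatever name hivequeries.hql actually creates):

from pyspark.sql import SparkSession

# Hive support lets Spark SQL see tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("UNSW-NB15-Analysis")
         .enableHiveSupport()
         .getOrCreate())

# Preview a few rows of the (assumed) Hive table populated earlier.
spark.sql("SELECT * FROM unsw_nb15 LIMIT 10").show()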

Key Steps in the Analysis

1. Data Loading and Preprocessing

The UNSW-NB15 dataset is loaded and preprocessed to prepare it for analysis.
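
A minimal sketch of this step, assuming the CSV copy in HDFS carries a header row and a binary label column named label; the raw UNSW-NB15 files ship without headers, in which case column names must be assigned from the dataset's feature-description file:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("UNSW-NB15-Analysis").getOrCreate()

# Load the CSV that was put into HDFS earlier; inferSchema detects numeric columns.
df = spark.read.csv("/user/in/hdfs/UNSW-NB15.csv", header=True, inferSchema=True)

# Basic preprocessing: drop rows with missing values and cast the label to an integer.
df = df.na.drop().withColumn("label", col("label").cast("int"))

df.show(5)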

2. Descriptive Statistics

Summary statistics of the dataset show the count, mean, standard deviation, and range for all features.
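
A sketch of how these statistics can be produced with the DataFrame API, continuing from the df prepared above; the column name dur in the quantile example is an assumption:

# count, mean, stddev, min, and max for every column.
df.describe().show(truncate=False)

# Approximate quartiles give a fuller picture of a column's range.
quartiles = df.approxQuantile("dur", [0.25, 0.5, 0.75], 0.01)
print(quartiles)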

3. Correlation Analysis

A correlation matrix was generated to identify relationships between numerical features.
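
One way to compute such a matrix with pyspark.ml, continuing from the df above: the numeric feature columns are assembled into a single vector and passed to Correlation.corr.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Collect the numeric columns, leaving the label out of the feature vector.
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "float", "double")]
feature_cols = [c for c in numeric_cols if c != "label"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
vec_df = assembler.transform(df).select("features")

# Pearson correlation matrix over all assembled features.
corr_matrix = Correlation.corr(vec_df, "features", "pearson").head()[0]
print(corr_matrix.toArray())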

4. Kernel Density Estimation

A kernel density plot was created to analyze the distribution of the duration feature.
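
A sketch using the RDD-based KernelDensity estimator from pyspark.mllib; the column name dur, the bandwidth, and the evaluation grid are illustrative assumptions:

import numpy as np
from pyspark.mllib.stat import KernelDensity

# Duration values as an RDD of floats.
dur_rdd = df.select("dur").rdd.map(lambda row: float(row[0]))

kd = KernelDensity()
kd.setSample(dur_rdd)
kd.setBandwidth(0.5)

# Evaluate the density on an evenly spaced grid; the result can be plotted with matplotlib.
grid = np.linspace(0.0, 10.0, 100).tolist()
densities = kd.estimate(grid)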

5. Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality. The first two principal components explain most of the variability in the data.
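
A sketch of the PCA step, reusing the assembled vectors (vec_df) from the correlation step and standardizing them first; k=2 matches the two components discussed above:

from pyspark.ml.feature import StandardScaler, PCA

# Standardize the features so that each contributes on a comparable scale.
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
scaled_df = scaler.fit(vec_df).transform(vec_df)

# Project onto the first two principal components.
pca = PCA(k=2, inputCol="scaled", outputCol="pca_features")
pca_model = pca.fit(scaled_df)

print(pca_model.explainedVariance)   # variance captured by PC1 and PC2
pca_df = pca_model.transform(scaled_df).select("pca_features")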

6. K-Means Clustering

K-Means clustering was performed to identify clusters within the dataset.
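
A sketch of the clustering step on the scaled vectors from the PCA step; k=2 is an illustrative choice (normal vs. attack) rather than a tuned value:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(k=2, seed=42, featuresCol="scaled", predictionCol="cluster")
kmeans_model = kmeans.fit(scaled_df)
clustered = kmeans_model.transform(scaled_df)

# Silhouette score gives a rough sense of how well the clusters separate.
evaluator = ClusteringEvaluator(featuresCol="scaled", predictionCol="cluster")
print(evaluator.evaluate(clustered))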

7. Classification with Logistic Regression

A Logistic Regression model was used for binary classification, with the following results:

  • Confusion Matrix
  • Classification Report
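
A minimal sketch of how such a model can be trained and evaluated with pyspark.ml, reusing the assembler and df from earlier; the split ratio and hyperparameters are assumptions, not the notebook's exact settings:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble features, keep the binary label, and split into train/test sets.
data = assembler.transform(df).select("features", "label")
train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
lr_model = lr.fit(train)
predictions = lr_model.transform(test)

# Confusion matrix via a simple group-by on actual label vs. prediction.
predictions.groupBy("label", "prediction").count().show()

# Area under the ROC curve as a single summary metric.
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(predictions))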

8. Classification with Random Forest

A Random Forest classifier was trained for binary classification of normal vs. attack traffic. Below are the confusion matrix and classification report:

  • Confusion Matrix
  • Classification Report
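
A sketch of the random forest step on the same train/test split used above; the tree count is illustrative rather than a tuned value:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100, seed=42)
rf_model = rf.fit(train)
rf_predictions = rf_model.transform(test)

# Confusion matrix and overall accuracy for normal vs. attack traffic.
rf_predictions.groupBy("label", "prediction").count().show()
accuracy = MulticlassClassificationEvaluator(labelCol="label",
                                             metricName="accuracy").evaluate(rf_predictions)
print(accuracy)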

Contact

For queries or contributions, please contact: Tashfeen Abbasi
Email: [email protected]
