This repository demonstrates a comprehensive big data analytics pipeline tailored for cyber threat analysis using Apache Hadoop, Apache Hive, and PySpark. It leverages the UNSW-NB15 dataset to provide deep insights into cybersecurity threats.
Ensure you have the following installed on your system:
- Apache Hadoop (3.3.6)
- Apache Hive (4.0.1)
- PySpark
- Python 3.x
- Jupyter Notebook
Follow these steps to set up your environment:
- Clone the Repository
Clone the repository and navigate to the project directory using the following commands:
git clone https://github.com/tashi-2004/Hadoop-Spark-Hive-CyberAnalytics.git
cd Hadoop-Spark-Hive-CyberAnalytics
The UNSW-NB15 dataset was created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), generating a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The raw network traffic (roughly 100 GB) was captured with the tcpdump tool.
- Feature descriptions for the dataset: Download Features
- The complete UNSW-NB15 dataset: Download Dataset
- It includes nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
- Tools such as Argus and Bro-IDS were used, along with twelve algorithms, to generate 49 features, including a class label.
- The full dataset contains 2,540,044 records distributed across four CSV files, totalling approximately 600 MB.
- Explore the dataset by importing it into Hadoop HDFS.
- Use Hive to query and print the first 5-10 records for better understanding.
- Proceed with big data analytics using PySpark and Hive for advanced modeling and visualization.
Navigate to the Hadoop directory and start all required services with the start-all.sh script (deprecated in Hadoop 3.x but still functional; running start-dfs.sh followed by start-yarn.sh is the preferred equivalent):
start-all.sh
Put the UNSW-NB15 dataset into HDFS to make it accessible for analysis. Use the following command to load the data:
hadoop fs -put /path/to/UNSW-NB15.csv /user/in/hdfs
After the data is loaded into HDFS, proceed to execute Hive queries to analyze the dataset:
hive -f hivequeries.hql
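The same sanity check can also be run from PySpark against the Hive metastore. A minimal sketch, assuming the metastore is reachable and that hivequeries.hql created a table named `unsw_nb15` (the table name here is an assumption):

```python
from pyspark.sql import SparkSession

# Hive-enabled session; requires Spark to be configured with the Hive metastore.
spark = (SparkSession.builder
         .appName("CyberAnalytics")
         .enableHiveSupport()
         .getOrCreate())

# `unsw_nb15` is a placeholder table name -- match it to hivequeries.hql.
spark.sql("SELECT * FROM unsw_nb15 LIMIT 10").show(truncate=False)
```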
Following the Hive query execution, use PySpark for further analysis. Launch the notebook with Jupyter:
jupyter notebook pyspark.ipynb
The UNSW-NB15 dataset is loaded and preprocessed to prepare it for analysis. Below is a preview of the dataset:
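A rough sketch of the loading step (the HDFS path mirrors the `hadoop fs -put` destination above; the CSV options are assumptions, and the raw UNSW-NB15 CSVs ship without a header row, so column names would need to be attached separately):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CyberAnalytics").getOrCreate()

# Path and read options are assumptions; adjust to where the file landed in HDFS.
df = (spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv("hdfs:///user/in/hdfs/UNSW-NB15.csv"))

df.show(10, truncate=False)                         # preview the first 10 rows
print(f"rows={df.count()} cols={len(df.columns)}")  # overall shape
```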
Summary statistics of the dataset, showing count, mean, standard deviation, and range for all features:
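One way to produce these statistics, reusing `df` from the loading sketch above:

```python
# count/mean/stddev/min/max for every column; non-numeric columns
# simply yield nulls for the numeric aggregates.
df.summary("count", "mean", "stddev", "min", "max").show()
```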
A correlation matrix was generated to identify relationships between numerical features:
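A sketch of computing a Pearson correlation matrix with `pyspark.ml.stat.Correlation`; the column names below are assumptions drawn from the UNSW-NB15 feature list, so substitute the numeric columns actually present in your DataFrame:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Example numeric columns (assumed names); replace with your own selection.
num_cols = ["dur", "sbytes", "dbytes", "spkts", "dpkts"]

assembler = VectorAssembler(inputCols=num_cols, outputCol="features")
vec_df = assembler.transform(df.select(num_cols).dropna())

# Returns a DenseMatrix of pairwise Pearson correlations.
corr = Correlation.corr(vec_df, "features", "pearson").head()[0]
print(corr.toArray())
```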
A kernel density plot was created to analyze the distribution of the duration (`dur`) feature:
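A sketch of the density plot, sampling down to pandas first so the driver isn't overwhelmed (assumes `seaborn` and `matplotlib` are installed, and that the duration column is named `dur`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Sample a small fraction before collecting to the driver.
sample = df.select("dur").dropna().sample(fraction=0.01, seed=42).toPandas()

sns.kdeplot(sample["dur"], fill=True)
plt.xlabel("dur (seconds)")
plt.title("Kernel density of record duration")
plt.show()
```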
PCA was applied to reduce dimensionality. The first two principal components explain most of the data variability:
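A sketch of the PCA step with `pyspark.ml`, standardizing first so large-magnitude features don't dominate (reuses `df` and `num_cols` from the sketches above):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler, VectorAssembler

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=num_cols, outputCol="raw"),
    StandardScaler(inputCol="raw", outputCol="scaled",
                   withMean=True, withStd=True),
    PCA(k=2, inputCol="scaled", outputCol="pc"),
])

model = pipeline.fit(df.dropna())
prepped = model.transform(df.dropna())  # adds `scaled` and `pc` columns

# Fraction of variance captured by the first two principal components.
print(model.stages[-1].explainedVariance)
```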
K-Means clustering was performed to identify clusters within the dataset:
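A corresponding K-Means sketch on the scaled features; `k=2` (normal vs. attack) is an assumption, and in practice you would sweep `k` and compare silhouette scores:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(k=2, seed=42, featuresCol="scaled")
clustered = kmeans.fit(prepped).transform(prepped)

# Silhouette near 1 indicates tight, well-separated clusters.
print("silhouette:", ClusteringEvaluator(featuresCol="scaled").evaluate(clustered))
clustered.groupBy("prediction").count().show()
```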
A Logistic Regression model was used for binary classification, with the following results:
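A minimal sketch of the classifier, assuming the 0/1 attack indicator column is named `label` (in the raw UNSW-NB15 CSVs the class label is the last column, so rename as needed):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# `prepped` carries the `scaled` feature vector from the PCA sketch;
# `label` (0 = normal, 1 = attack) is an assumed column name.
train, test = prepped.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="scaled", labelCol="label", maxIter=50)
preds = lr.fit(train).transform(test)

auc = BinaryClassificationEvaluator(labelCol="label").evaluate(preds)
print(f"test AUC: {auc:.3f}")
```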
A Random Forest classifier was trained for binary classification of normal vs. attack traffic. Below are the confusion matrix and classification report:
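A sketch of the Random Forest step, with the confusion matrix computed directly from the predictions (reuses `train`/`test` from the Logistic Regression sketch):

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(featuresCol="scaled", labelCol="label",
                            numTrees=100, seed=42)
rf_preds = rf.fit(train).transform(test)

# Confusion matrix: one row per (actual, predicted) pair.
rf_preds.groupBy("label", "prediction").count() \
        .orderBy("label", "prediction").show()

f1 = MulticlassClassificationEvaluator(labelCol="label",
                                       metricName="f1").evaluate(rf_preds)
print(f"test F1: {f1:.3f}")
```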
For queries or contributions, please contact:
Tashfeen Abbasi
Email: [email protected]