This project analyzes toxicity classification and counterspeech generation with GPT-2 Medium through value vector analysis.
Trains a binary classifier for toxicity detection (a minimal training sketch follows this list):
- Base model: GPT-2 Medium
- Dataset: Jigsaw Toxic Comment Classification Dataset
- Output: Saves the fine-tuned model as `best_model_state.bin`
- Our models can be found at Binary Classification Model.
- Metrics tracked: Accuracy, F1 Score, Loss
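
A minimal training sketch, assuming a standard Hugging Face sequence-classification head on GPT-2 Medium; the hyperparameters and the `train_step` helper are illustrative, not the project's exact training script:

```python
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# GPT-2 Medium with a 2-label classification head (illustrative setup).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2ForSequenceClassification.from_pretrained("gpt2-medium", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, labels):
    """One optimization step on a batch of (comment, toxic=1 / non-toxic=0) pairs."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

loss = train_step(["you are an idiot", "thanks for the help"], [1, 0])

# The best checkpoint is then saved as best_model_state.bin, e.g.:
# torch.save(model.state_dict(), "best_model_state.bin")
```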
Generates non-toxic responses to toxic comments (see the sketch after this list):
- Model used: Fine-tuned GPT-2 Medium
- Dataset: Counter Speech Dataset for Toxic Language
- Output: Counter-speech generation models saved in the `saved_models` directory with the naming format `LR_{Learning_Rate}_BS_{Batch_Size}_E_{Epochs}`
- All three models can be found at Counterspeech Generation Model.
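
A sketch of the generation setup, assuming each pair is linearized into a single causal-LM training sequence; the `<|sep|>` separator, the `make_example` helper, and the sampling parameters are assumptions for illustration:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

def make_example(toxic, counter):
    # One training sequence per (toxic comment, counterspeech) pair;
    # the LM loss is taken over the whole text.
    return f"{toxic} <|sep|> {counter}{tokenizer.eos_token}"

@torch.no_grad()
def generate_counterspeech(toxic_comment, max_new_tokens=60):
    """Generate a non-toxic response to a toxic comment."""
    ids = tokenizer(f"{toxic_comment} <|sep|>", return_tensors="pt").input_ids
    out = model.generate(ids,
                         max_length=ids.shape[1] + max_new_tokens,
                         do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

print(generate_counterspeech("nobody wants your kind here"))
```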
Initial value vector analysis (a minimal extraction sketch follows this list):
- Models analyzed: Base GPT-2 Medium and the fine-tuned classifier (`best_model_state.bin`); the path to the binary classification model can be modified if needed
- Extracts and compares value vectors
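
A sketch of the extraction and comparison, reading "value vectors" as the rows of each MLP's output projection (the key-value-memories view; swap in the attention value weights if that is the intended target). Loading assumes `best_model_state.bin` holds a Hugging Face-style state dict:

```python
import torch
from transformers import GPT2LMHeadModel

base = GPT2LMHeadModel.from_pretrained("gpt2-medium")
tuned = GPT2LMHeadModel.from_pretrained("gpt2-medium")
# Load the fine-tuned transformer body; strict=False skips the classifier
# head keys that have no counterpart in an LM-head model.
state = torch.load("best_model_state.bin", map_location="cpu")
tuned.load_state_dict(state, strict=False)

def value_vectors(model, layer):
    # mlp.c_proj.weight has shape [4*d_model, d_model]; each row is one
    # value vector written into the residual stream.
    return model.transformer.h[layer].mlp.c_proj.weight.detach()

# Compare corresponding value vectors layer by layer.
for layer in range(base.config.n_layer):
    sim = torch.nn.functional.cosine_similarity(
        value_vectors(base, layer), value_vectors(tuned, layer), dim=-1)
    print(f"layer {layer:2d}: mean cosine similarity {sim.mean().item():.4f}")
```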
Deep dive into value vectors (an SVD sketch follows this list):
- Performs SVD on value vectors
- Maps vectors to vocabulary space
- Compares value vectors between the base and fine-tuned models
- Uses models from steps 1 & 2
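
A sketch of the SVD and vocabulary-space mapping for a single layer; the layer index and the number of inspected directions are arbitrary choices here:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

layer = 12  # example layer; the analysis sweeps all 24 layers
V = model.transformer.h[layer].mlp.c_proj.weight.detach()  # [4*d_model, d_model]

# SVD of the layer's value matrix; right singular vectors live in model space.
U, S, Vh = torch.linalg.svd(V, full_matrices=False)

def top_tokens(direction, k=10):
    # Map a residual-stream direction to vocabulary space via the unembedding
    # (lm_head shares its weights with the token embeddings in GPT-2).
    logits = model.lm_head.weight @ direction  # [vocab_size]
    return tokenizer.convert_ids_to_tokens(logits.topk(k).indices.tolist())

for i in range(3):
    print(f"direction {i} (sigma={S[i].item():.2f}):", top_tokens(Vh[i]))
```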
Advanced analysis (see the sketch after this list):
- Key-value vector interactions
- Cross-model vector alignments
- Token-level impact analysis
- Uses models and vectors from previous steps
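
An illustrative fragment of this step: keys and values are paired by index within an MLP, cross-model alignment is measured with cosine similarity, and token-level impact is read off by projecting the fine-tuning delta of a value vector through the unembedding. The layer and slot indices are hypothetical, and `tuned` stands in for the checkpoint loaded as in the earlier sketch:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
base = GPT2LMHeadModel.from_pretrained("gpt2-medium")
tuned = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # load best_model_state.bin here

layer, idx = 12, 0  # hypothetical layer / memory slot
mlp_b = base.transformer.h[layer].mlp
mlp_t = tuned.transformer.h[layer].mlp

# Key-value pairing: column i of c_fc is the key whose activation gates
# value row i of c_proj.
key_b = mlp_b.c_fc.weight.detach()[:, idx]   # [d_model]
key_t = mlp_t.c_fc.weight.detach()[:, idx]   # [d_model]
v_base = mlp_b.c_proj.weight.detach()[idx]   # [d_model]
v_tuned = mlp_t.c_proj.weight.detach()[idx]  # [d_model]

# Cross-model alignment of the paired key vector.
align = torch.nn.functional.cosine_similarity(key_b, key_t, dim=0)
print(f"key alignment: {align.item():.4f}")

# Token-level impact of fine-tuning on this slot: project the value-vector
# delta through the unembedding and read off the most shifted tokens.
delta = base.lm_head.weight @ (v_tuned - v_base)  # [vocab_size]
print("boosted:   ", tokenizer.convert_ids_to_tokens(delta.topk(10).indices.tolist()))
print("suppressed:", tokenizer.convert_ids_to_tokens((-delta).topk(10).indices.tolist()))
```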
Required libraries:
```
pip install torch==1.9.0 transformers==4.11.3 transformer_lens==0.3.0 nltk==3.6.3 matplotlib==3.4.3 numpy==1.21.2
```
- Base Model: GPT-2 Medium (345M parameters)
- Training Data: Jigsaw Toxic Comment Dataset (~561K samples)
- Validation Data: Hold-out toxic comments (20%)
- Counter Speech Data: CONAN dataset (~15k paired examples)
- Models require a GPU with 12 GB+ memory
- Full training takes 9-10 hours on an NVIDIA P100
- Save all best model checkpoints