The Julia Chatbot AI is a machine-learning-based chatbot trained on Julia GitHub repositories to generate and understand Julia code.
It allows users to scrape, preprocess, train, and evaluate language models of various sizes (135m, 360m, 1.7b).
The system includes post-processing, benchmarking, and a chatbot UI for interactive conversations.
Create a virtual environment:
python3 -m venv .venv
Please use Python 3.12 or higher; note that we used Python 3.12.
Source the environment:
source .venv/bin/activate
Source the environment on the server:
eval "$(/home/SA24-G1/miniconda3/bin/conda shell.bash hook)"
Install the requirements:
pip install -r requirements.txt
Add the project root to the PYTHONPATH:
export PYTHONPATH="$PYTHONPATH:$PWD"
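Putting the setup steps together, a typical local setup (assuming a Unix-like shell and that the commands are run from the project root) looks like this:
```bash
# One-shot local setup, using the commands listed above.
python3 -m venv .venv                 # create the virtual environment (Python 3.12+)
source .venv/bin/activate             # activate it
pip install -r requirements.txt       # install the dependencies
export PYTHONPATH="$PYTHONPATH:$PWD"  # make the project modules importable
```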
To scrape the GitHub repositories, run the following command:
python3 src/dataset/fetch_dataset.py
Afterwards, it is necessary to parse the scraped data to produce a JSON file containing the raw dataset:
python3 src/dataset/parser.py
The raw data is expected to be in the `data` directory, in a file named `fd.json`.
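A quick way to check that the raw dataset ended up where the pipeline expects it (its internal JSON structure is not documented here, so this only verifies that the file exists and parses):
```bash
ls -lh data/fd.json
python3 -c "import json; d = json.load(open('data/fd.json')); print(type(d).__name__, len(d))"
```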
The data is preprocessed using the following command; you need to run it before training the model.
python src/data/preprocess.py
The model can be `135m`, `360m`, or `1.7b`:
python src/LLM/LLM_train.py --model "135m"
`--sample` is used to train the model on a subset of the data; specify the number of samples you want to train on:
python src/LLM/LLM_train.py --model "1.7b" --sample=1000
The model will be saved in the `models` directory along with the tokenizer.
To train all models (`135m`, `360m`, `1.7b`):
python src/LLM/LLM_train.py --model "all"
- `--sample` is used to train the model on a small subset of the data (specify the number of samples you want to train on).
- `--signature` is used to train the model on the signature data.
- `--baseline` is used to train the model on the baseline data without any preprocessing.
python src/LLM/LLM_train.py --model "all" --sample=1000
Before evaluating the models, their outputs should be post-processed. The completions created by the model often have syntax errors caused by missing or surplus 'end' keywords, as in this example:
function fibonacci(n)
    if n <= 2
        return 1
    end
    return fibonacci(n - 1) + fibonacci(n - 2)
end
end
end
The post-processing script ensures that the number of 'end' keywords is correct. It is sufficient to run the 'src/utils/post_processing.py' script, passing as input dir the directory containing the results produced by the ./generate.sh script.
python3 src/utils/post_processing.py --input-dir ./results
To use the evaluation script, you need to have a trained model in the `models` directory.
python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "135m"
- `--max_length` is used to specify the maximum number of tokens in the output.
- `--signature` is used to evaluate the model on the signature data.
- `--baseline` is used to evaluate the model on the baseline data without any preprocessing.
- `--original_output` does not apply any post-processing to the output.
To run the prediction for all models:
```bash
python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "all"
```
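The flags above can be added to the prediction call as well; as with training, the combinations shown here are illustrative sketches rather than verified invocations:
```bash
# Cap the output length and keep the raw, un-post-processed completion
# (assumes --max_length and --original_output can be used together).
python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "135m" --max_length=256 --original_output
```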
To run the chatbot UI:
python src/chatbot/app.py
Go inside the benchmark directory:
cd benchmark
Replace the model name with the one you want to benchmark, for example:
generate.sh ../models/135m
evaluate.sh was customized to load our tokenizer.
The predictions generated by our model are stored in the `results` folder.
To use the evaluation script, you need to have a trained model in the `models` directory.
Replace `135m` with the model you want to evaluate and the corresponding checkpoint directory.
To evaluate, you need to be in the top-level folder of the project; if you are still inside the benchmark folder, go back to the project root first.
evaluate.sh models/135m
The results are stored in a JSON file named `$MODELNAME_results_jl.json`.
Run test.py in benchmark to get the efficiency of the models on the MultiPL-E benchmark, along with some statistics.
python src/statistical_test/statistical.py
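Putting the benchmark steps together, an end-to-end run for one model might look like this (commands and paths copied from the sections above; the working directory for test.py is an assumption):
```bash
# 1. Generate completions for the chosen model (from inside benchmark/).
cd benchmark
generate.sh ../models/135m
cd ..
# 2. Balance the `end` keywords in the generated completions (from the project root).
python3 src/utils/post_processing.py --input-dir ./results
# 3. Evaluate the post-processed completions (also from the project root).
evaluate.sh models/135m
# 4. Collect MultiPL-E efficiency numbers and run the statistical tests.
(cd benchmark && python3 test.py)
python src/statistical_test/statistical.py
```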