Julia Chatbot AI

The Julia Chatbot AI is a machine-learning-based chatbot trained on Julia GitHub repositories to generate and understand Julia code. It allows users to scrape, preprocess, train, and evaluate language models of various sizes (135m, 360m, 1.7b). The system includes post-processing, benchmarking, and a chatbot UI for interactive conversations.

[Screenshot: chatbot example]

Set up the environment

Create a virtual environment:

python3 -m venv .venv

Please use Python 3.12 or higher; note that we used Python 3.12.

Source the environment:

source .venv/bin/activate

Source the environment on the server:

eval "$(/home/SA24-G1/miniconda3/bin/conda shell.bash hook)"

Install the requirements:

pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$PWD"

Scrape the data

To scrape the GitHub repositories, run the following command:

python3 src/dataset/fetch_dataset.py

Afterwards, it is necessary to parse the scraped data to produce a JSON file containing the raw dataset:

python3 src/dataset/parser.py
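As a quick sanity check (not part of the project's own tooling), you can verify that the raw dataset was produced. This is a minimal sketch that only assumes data/fd.json is valid JSON, which is where the preprocessing step below expects it:

```python
# Minimal sanity check: confirm the raw dataset exists and is valid JSON.
# data/fd.json is the location expected by the preprocessing step below.
import json
from pathlib import Path

raw_path = Path("data") / "fd.json"
with raw_path.open() as fh:
    raw = json.load(fh)

# No assumptions are made about the internal structure; just report size/type.
size = len(raw) if hasattr(raw, "__len__") else "n/a"
print(f"Loaded {type(raw).__name__} with {size} top-level entries")
```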

Pre-Process the data

The raw data is expected to be in the data directory, in a file named fd.json.

The data is pre-processed using the following command, which must be run before training the model:

python src/data/preprocess.py

Train

Train a single model

The model can be 135m, 360m, or 1.7b:

python src/LLM/LLM_train.py --model "135m" 

--sample trains the model on a subset of the data; specify the number of samples you want to train on:

python src/LLM/LLM_train.py --model "1.7b" --sample=1000

The model will be saved in the models directory along with the tokenizer.
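As an illustration, a saved checkpoint can be reloaded for inspection. This is a minimal sketch that assumes the checkpoints under models are stored in Hugging Face format; the sub-directory name models/135m is an assumption based on the model sizes above:

```python
# Minimal sketch: reload a trained checkpoint and its tokenizer for inspection.
# Assumes Hugging Face-format checkpoints; adjust the path to your own run.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "models/135m"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)

print(f"Loaded model with {model.num_parameters():,} parameters")
```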

Train all models

To train all models (135m, 360m, 1.7b):

python src/LLM/LLM_train.py --model "all"

Optional flags:
  • --sample trains the model on a small subset of the data (specify the number of samples you want to train on)
  • --signature trains the model on the signature data
  • --baseline trains the model on the baseline data without any preprocessing

python src/LLM/LLM_train.py --model "all" --sample=1000

Post processing

Before evaluating the models, their completions should be post-processed. The completions created by the models often have syntax errors caused by missing, or extra, 'end' keywords, as in this example:

function fibonacci(n)
  if n <= 2
    return 1
  end
  return fibonacci(n - 1) 
  + fibonacci(n - 2)
end
end
end

The post-processing script ensures that the number of 'end' keywords is correct. It is sufficient to run the src/utils/post_processing.py script, passing as input directory the directory containing the results produced by the ./generate.sh script:

python3 src/utils/post_processing.py --input-dir ./results
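To illustrate the idea (this is a simplified sketch, not the actual logic of src/utils/post_processing.py), balancing 'end' keywords amounts to counting block openers against 'end' lines, dropping superfluous 'end's and appending missing ones:

```python
# Simplified sketch of 'end' balancing; the real post-processing script may
# handle more cases (strings, comments, trailing 'do' blocks, one-line forms).
import re

# Julia keywords that open a block closed by a matching 'end'.
BLOCK_OPENERS = re.compile(
    r"^\s*(function|if|for|while|begin|let|try|struct|module|quote)\b"
)
END_LINE = re.compile(r"^\s*end\b")

def balance_ends(code: str) -> str:
    kept, depth = [], 0
    for line in code.splitlines():
        if BLOCK_OPENERS.match(line):
            depth += 1
        elif END_LINE.match(line):
            if depth == 0:
                continue  # drop a superfluous 'end'
            depth -= 1
        kept.append(line)
    kept.extend("end" for _ in range(depth))  # add any missing 'end's
    return "\n".join(kept)
```

Applied to the fibonacci snippet above, this drops the two extra trailing 'end' lines.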

Evaluate

To use the evaluation script, the trained model must be present in the models directory.

Evaluate a single model

python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "135m"
Optional flags:
  • --max_length specifies the maximum number of tokens in the output.
  • --signature evaluates the model on the signature data.
  • --baseline evaluates the model on the baseline data without any preprocessing.
  • --original_output does not apply any post-processing to the output.

Evaluate all models

python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "all"

Chatbot UI

To run the chatbot UI:

python src/chatbot/app.py

Benchmark

Generate

Go inside the benchmark directory:

cd benchmark

Generate model responses

Replace the model name with the one you want to benchmark, for example:

generate.sh  ../models/135m

evaluate.sh was customized to load our tokenizer.

The predictions generated by our models are stored in the results folder.

Evaluate

To use the evaluation script, the trained model must be present in the models directory. Replace 135m with the model you want to evaluate and the corresponding checkpoint directory.

Evaluation must be run from the top-level folder of the project; if you are in the benchmark folder, go back to the project root first.

evaluate.sh models/135m

The results are stored in a JSON file named $MODELNAME_results_jl.json.
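To take a quick look at a results file, here is a minimal sketch that only assumes the file is plain JSON (the file name below is hypothetical; substitute the actual $MODELNAME):

```python
# Hypothetical quick look at a benchmark results file; assumes only valid JSON.
import json

with open("135m_results_jl.json") as fh:  # substitute the actual file name
    results = json.load(fh)

# Print the top-level structure without assuming a particular schema.
if isinstance(results, dict):
    print(sorted(results.keys()))
else:
    print(type(results).__name__, len(results))
```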

Statistical Tests

Run the following script to compute the models' performance on the MultiPL-E benchmark and some statistics:

python src/statistical_test/statistical.py
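As an illustration of the kind of statistics that can be computed (this is not the content of statistical.py), a paired non-parametric test can compare per-problem benchmark scores of two models, assuming those scores have been extracted into lists:

```python
# Illustrative only: compare per-problem benchmark scores of two models with a
# paired non-parametric test. The project's own script may use different tests
# and a different data format.
from scipy.stats import wilcoxon

# Hypothetical per-problem pass rates (one value per MultiPL-E problem).
scores_135m = [0.0, 0.2, 0.4, 0.1, 0.0, 0.6, 0.3, 0.2]
scores_360m = [0.1, 0.3, 0.5, 0.3, 0.1, 0.7, 0.4, 0.4]

stat, p_value = wilcoxon(scores_135m, scores_360m)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.3f}")
```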
