Public repository for the Science Education LDA Project
This is the public repository for the Science Education LDA research project, which is maintained by Tor Ole Odden and Alessandro Marin.
This project is based on the method published in Physical Review Physics Education Research [1]. See also the CCSE/PERC_TopicModel repository.
See the Science Education LDA Notebook, which contains an extract of the methods described in [1].
To run the main notebook, PERC_TopicModeling.ipynb, first install the required packages:
pip install -r requirements.txt --user
The file scied_words_bigrams_V5.pkl, which contains the corpus obtained after processing the papers, must be downloaded separately. It is about 200 MB, and the download link will be posted soon.
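Once the corpus file has been downloaded, it can be loaded with Python's standard `pickle` module. The snippet below is a minimal sketch, not the notebook's own loading code: the helper name `load_corpus` is hypothetical, and the structure of the unpickled object (for example, a list of token lists) is an assumption, so inspect it after loading.

```python
import pickle
from pathlib import Path


def load_corpus(path):
    """Load a pickled object from disk and return it.

    Hypothetical helper for illustration; the notebook may load the file differently.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Assumes the corpus file sits next to the notebook; skipped if it is not there yet.
corpus_path = Path("scied_words_bigrams_V5.pkl")
if corpus_path.exists():
    corpus = load_corpus(corpus_path)
    # The object's structure is not documented here, so check its type first.
    print(type(corpus))
```

Pickle files can execute arbitrary code when loaded, so only unpickle the file obtained from the project's official link.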
The required packages include Gensim (unsupervised semantic modelling on text), NLTK (Natural Language Toolkit), LDAVis (interactive topic model visualization), and scikit-learn, along with standard data analysis libraries such as pandas, numpy, and matplotlib.
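After installation, a quick way to confirm that the packages are available is to check that each one can be found by the import system. This is a rough sketch: the import names below are assumptions (in particular, `pyLDAvis` for LDAVis and `sklearn` for scikit-learn), and `requirements.txt` remains the authoritative list.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above; adjust if requirements.txt differs.
REQUIRED_MODULES = [
    "gensim",
    "nltk",
    "pyLDAvis",
    "sklearn",
    "pandas",
    "numpy",
    "matplotlib",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```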
Graph of average topic prevalence over time: AvgPrev.html
Graph of cumulative topic prevalence over time: CumuPrev.html
Questions can be directed to Tor Ole Odden.
[1] Tor Ole B. Odden, Alessandro Marin, and Marcos D. Caballero. Thematic Analysis of 18 Years of Physics Education Research Conference Proceedings using Natural Language Processing. Physical Review Physics Education Research, 2020. Link