Implementation of the publication "Data Centric Content Classification of Smart City Residential Services".
The following versions are required to run the tasks:
- python>=3.7.0
- tensorflow-gpu>=2.1.0, <3.0.0
- keras>=2.3.1, <3.0.0
- pandas>=0.25.3, <2.0.0
- scikit-learn>=0.22.1, <2.0.0
- gensim>=3.8.1, <4.0.0
Training and testing were run on the Google Colab Pro service.
This implementation covers all the tasks demonstrated in the publication. The main components are:
- Data Engineering
- Classification Models
- Word Matching
- Attention Value Calculation
- Examples on Colab
Data engineering processes the data set and converts it from Chinese text into word vectors, which serve as the input to the machine learning models. Label analysis and label statistics are also included in this component.
- Data Preprocessing
  - Segment the text into tokens
  - Remove punctuation, stopwords, etc. with LTP
- Label Preprocessing
  - Label Analysis
  - Label Statistics
- Word Embedding and Vectorization
For privacy reasons, the dataset is provided in tokenized form, that is, after word embedding and vectorization.
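The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes the text has already been segmented (the LTP call itself is omitted), and the stopword list, punctuation set, and helper names `clean_tokens`, `build_vocab`, and `encode` are hypothetical. It also stops at integer token ids rather than producing embedding vectors.

```python
import string

# Hypothetical stopword list and punctuation set, for illustration only.
STOPWORDS = {"的", "了", "在"}
PUNCT = set(string.punctuation) | set("，。！？；：、（）《》“”")


def clean_tokens(tokens):
    """Drop stopwords and tokens made up entirely of punctuation."""
    return [
        tok for tok in tokens
        if tok not in STOPWORDS and not all(ch in PUNCT for ch in tok)
    ]


def build_vocab(docs):
    """Assign an integer id to every token, reserving 0 and 1."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# A request that is assumed to be segmented already.
segmented = ["小区", "的", "路灯", "坏", "了", "。"]
cleaned = clean_tokens(segmented)
vocab = build_vocab([cleaned])
encoded = encode(cleaned, vocab, max_len=5)
```

Padding to a fixed length keeps every request the same shape, which is what a batched model input usually needs.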
The classification models in this work are ResNet, Transformer, and XLNet. For the Transformer, we use two strategies: based-transformer (3 encoder-decoder layers) and enlarged-transformer (6 encoder-decoder layers with a cross-layer shared parameter group).
- ResNet
- Transformer
  - Based Transformer
  - Enlarged Transformer
- XLNet
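To illustrate the general idea of cross-layer parameter sharing, here is a toy sketch in plain Python. It is an assumption about what sharing means here (one set of weights reused by every layer in the stack), not the model code from this repository; `ToyLayer`, `build_stack`, and `count_unique_params` are hypothetical names, and the toy layer stands in for a full encoder-decoder layer.

```python
import random


class ToyLayer:
    """Toy layer y = relu(W x), standing in for a full encoder-decoder layer."""

    def __init__(self, dim, seed):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)
        ]

    def __call__(self, x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.weights]

    def num_params(self):
        return len(self.weights) * len(self.weights[0])


def build_stack(num_layers, dim, share):
    """Build a layer stack; with share=True every position reuses one layer."""
    if share:
        shared = ToyLayer(dim, seed=0)
        return [shared] * num_layers
    return [ToyLayer(dim, seed=i) for i in range(num_layers)]


def count_unique_params(stack):
    """Count parameters, counting each distinct layer object only once."""
    unique = {id(layer): layer for layer in stack}
    return sum(layer.num_params() for layer in unique.values())


def forward(stack, x):
    """Apply the layers in order."""
    for layer in stack:
        x = layer(x)
    return x


independent = build_stack(num_layers=3, dim=8, share=False)
shared = build_stack(num_layers=6, dim=8, share=True)
output = forward(shared, [1.0] * 8)
```

In this sketch the six-layer shared stack holds fewer unique parameters than the three-layer independent one, which is the usual motivation for sharing: depth grows without the parameter count growing with it.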
Word matching is a workflow that converts the request and the department, both in natural language, into tokens and places them as a pair (X, Y) to train the transformer model. The model learns to generate a sentence, in token form, that targets the department. This sentence is then fuzzy-matched against the departments in the dataset. In classification, by comparison, the training pair (X, Y) consists of the request and the class label index.
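The fuzzy matching step could look something like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the matcher actually used in this work may differ. The department names and the helper `best_department` are hypothetical.

```python
from difflib import SequenceMatcher


def best_department(generated_tokens, departments):
    """Return the department name most similar to the generated sentence."""
    generated = "".join(generated_tokens)
    scored = [
        (SequenceMatcher(None, generated, name).ratio(), name)
        for name in departments
    ]
    score, name = max(scored, key=lambda pair: pair[0])
    return name, score


# Hypothetical department names, for illustration only.
departments = ["城市管理局", "交通运输局", "生态环境局"]

# Tokens assumed to come from the model's generated sentence.
name, score = best_department(["城市", "管理"], departments)
```

Because the match is approximate, a generated sentence that is only partly correct can still be resolved to a department, which a strict string comparison would reject.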
This work aims to observe how words with different attention values influence the decisions of the transformer-based model. After the attention value of each word is calculated, low-attention words are masked and the model makes its prediction on the masked input.
The Classification, Word Matching, and Attention Value Calculation tasks are available on Colab.
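The masking step can be sketched as below, assuming one attention value per token has already been computed. How those values are derived from the model is not shown, and the function name `mask_low_attention`, the `keep_ratio` parameter, and the `<mask>` token are hypothetical choices for this illustration.

```python
def mask_low_attention(tokens, attention, keep_ratio=0.5, mask_token="<mask>"):
    """Keep the highest-attention tokens and replace the rest with a mask token."""
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    # Always keep at least one token.
    keep = max(1, round(len(tokens) * keep_ratio))
    # Indices ordered from highest to lowest attention.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    kept = set(ranked[:keep])
    return [tok if i in kept else mask_token for i, tok in enumerate(tokens)]


# Made-up attention values, for illustration only.
tokens = ["street", "lamp", "near", "gate", "broken"]
attention = [0.10, 0.35, 0.05, 0.15, 0.35]
masked = mask_low_attention(tokens, attention, keep_ratio=0.4)
```

Comparing the model's prediction on `masked` with its prediction on the full input gives a rough sense of how much the low-attention words contribute.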