Projects related to information retrieval, NLP, social network analysis, recommender systems, etc.
This project applies Information Retrieval (IR) concepts in practice using popular Python toolkits such as NLTK and Whoosh. The test collection consists of about 4,000 documents from US Government web sites, and there are 15 topics, each describing a need for government information.
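As a rough illustration of the ranking idea behind such a retrieval system, here is a minimal sketch that scores documents against a query with a simple TF-IDF weighting. It uses only the standard library and is a stand-in, not the Whoosh or NLTK API; the `tokenize` and `rank` helpers and the three toy documents are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def rank(query, docs):
    """Return document ids sorted by descending TF-IDF score for the query."""
    doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in doc_counts.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in doc_counts.values() if term in c)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)
            score += counts[term] * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy documents (invented for illustration, not from the test collection).
docs = {
    "d1": "How to apply for a passport at a government office",
    "d2": "Federal tax forms and filing deadlines",
    "d3": "Passport renewal fees and processing times",
}
ranking = rank("passport renewal", docs)
print(ranking)
```

A real system would typically add stemming, stop-word removal, and a stronger scoring function such as BM25.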
This project applies machine learning to document classification using supervised learning techniques and analyzes the performance of the resulting classifier. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
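To make the classification and evaluation steps concrete, here is a minimal sketch of one possible supervised approach: a multinomial Naive Bayes classifier with add-one smoothing, followed by an accuracy check. The choice of Naive Bayes is an assumption, since the classifier used in the project is not stated here, and the handful of training sentences are invented; only the two label names are actual 20 Newsgroups categories.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-probability under add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data for illustration only.
train = [
    ("the team won the hockey game", "rec.sport.hockey"),
    ("goalie saves win the hockey match", "rec.sport.hockey"),
    ("nasa launches a new space probe", "sci.space"),
    ("the shuttle orbit and space station", "sci.space"),
]
test = [
    ("hockey game tonight", "rec.sport.hockey"),
    ("space shuttle launch", "sci.space"),
]

model = train_nb(train)
predictions = [predict_nb(model, text) for text, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

On the full data set, per-class precision and recall and a confusion matrix would give a more informative performance analysis than accuracy alone.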
This project involves creating a recommender engine, evaluating it to understand its performance, and making changes to improve that performance. The dataset used comes from a collection called MovieLens, which contains users’ movie ratings and is popular for implementing and testing recommender systems. The specific dataset used for this project is the MovieLens 100K Dataset, which contains 100,000 movie ratings from 943 users on a selection of 1,682 movies.
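The build-then-evaluate loop can be sketched as follows. This is a minimal example assuming user-based collaborative filtering with cosine similarity and RMSE as the evaluation metric; neither choice is confirmed by the description above, and the tiny ratings dictionary is invented rather than taken from MovieLens.

```python
import math


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, default=3.0):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings.get(user, {}), other_ratings)
        num += sim * other_ratings[item]
        den += sim
    # Fall back to a default rating when no similar user rated the item.
    return num / den if den > 0 else default


def rmse(ratings, held_out):
    """Root mean squared error over (user, item, true_rating) triples."""
    errors = [(predict(ratings, u, i) - r) ** 2 for u, i, r in held_out]
    return math.sqrt(sum(errors) / len(errors))


# Invented toy ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "u1": {"m1": 5, "m2": 4},
    "u2": {"m1": 5, "m2": 4, "m3": 2},
    "u3": {"m1": 1, "m2": 2, "m3": 5},
}
held_out = [("u1", "m3", 2)]

prediction = predict(ratings, "u1", "m3")
error = rmse(ratings, held_out)
print(prediction, error)
```

Typical improvements to try after measuring a baseline include mean-centering each user's ratings, limiting the average to the k most similar users, or switching to item-based similarity.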
This project analyzes real hotel review data crawled from the Tripadvisor website to automatically identify positive and negative keywords and phrases associated with hotels. It also helps to better understand the characteristics of data analysis tools, the extraction of explanatory review summaries, and human reviewing behavior.
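One simple way to surface positive and negative keywords is to compare how often each word appears in favorable versus unfavorable reviews. The sketch below uses a smoothed log-ratio of relative frequencies. It assumes the reviews have already been split into positive and negative groups (for example by star rating, which is an assumption here), and the four short reviews are invented for illustration.

```python
import math
from collections import Counter


def keyword_scores(pos_reviews, neg_reviews):
    """Score each word by the log-ratio of its smoothed frequency in positive vs negative reviews."""
    pos = Counter(w for r in pos_reviews for w in r.lower().split())
    neg = Counter(w for r in neg_reviews for w in r.lower().split())
    vocab = set(pos) | set(neg)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    scores = {}
    for word in vocab:
        # Add-one smoothing keeps the ratio finite for words seen in only one group.
        p = (pos[word] + 1) / (pos_total + len(vocab))
        n = (neg[word] + 1) / (neg_total + len(vocab))
        scores[word] = math.log(p / n)
    return scores


# Invented toy reviews for illustration only.
pos_reviews = ["clean room friendly staff", "friendly staff great location"]
neg_reviews = ["dirty room rude staff", "dirty bathroom noisy room"]

scores = keyword_scores(pos_reviews, neg_reviews)
most_positive = max(scores, key=scores.get)
most_negative = min(scores, key=scores.get)
print(most_positive, most_negative)
```

Words with scores near zero, such as those common to both groups, carry little sentiment signal; the same scoring can be applied to bigrams to capture phrases.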
This project involves social network analysis based on Twitter data. It helps to better understand graph analysis methods, including different centrality measures and edge prediction.
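As a small illustration of these two ideas, the sketch below computes degree centrality and scores candidate edges by their number of common neighbors, one of the simplest edge prediction heuristics. It treats the network as an undirected graph, which is a simplifying assumption for Twitter data, and the edge list and user names are invented.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def common_neighbor_scores(graph):
    """For every unconnected pair, count shared neighbors as a link-likelihood score."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b not in graph[a]:
            scores[(a, b)] = len(graph[a] & graph[b])
    return scores


# Invented toy network for illustration only.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]
graph = build_graph(edges)

centrality = degree_centrality(graph)
candidates = common_neighbor_scores(graph)
print(centrality)
print(candidates)
```

Other centrality measures (betweenness, closeness, PageRank) and other edge prediction scores (Jaccard coefficient, Adamic-Adar) follow the same pattern of computing a per-node or per-pair score from the graph structure.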