Online News Popularity Project: Predicting the popularity of mashable.com articles by number of shares on social media channels
Executive Summary
- For online news companies like Mashable to succeed, they need to identify the patterns and trends that contribute to the popularity of their articles.
- Our goal was to create and develop a model predicting which Mashable articles were widely shared on social networks based on several features of online news.
- Based on the results, we determined which variables would contribute toward future content creations having a wider reach on social media through organic sharing.
- Our business insights and recommendations for Mashable are based on our logistic regression model, which was implemented on a dataset built using stratified undersampling.
- This model provides multiple insights and recommendations that will help Mashable improve their business.
- These insights include the impact of image insertions, article categorization, keyword strength, and article release day.
- We then conclude with direction and specific actions Mashable could take to improve their articles’ virality.
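The stratified undersampling mentioned above can be illustrated with a minimal sketch. The study built its models in JMP, so this Python version is not the procedure actually used; it assumes one plausible reading of "stratified undersampling", in which the larger class is downsampled to the size of the smaller class within each stratum. The function name, the `channel` and `popular` field names, and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_undersample(rows, label_key, stratum_key, seed=0):
    """Balance the classes within each stratum by downsampling the larger class.

    Assumption: this is one possible interpretation of stratified
    undersampling, not the exact procedure used in the study.
    """
    rng = random.Random(seed)

    # Group rows by (stratum, label).
    groups = defaultdict(list)
    for row in rows:
        groups[(row[stratum_key], row[label_key])].append(row)

    strata = sorted({stratum for stratum, _ in groups})
    labels = sorted({label for _, label in groups})

    sampled = []
    for stratum in strata:
        # Smallest class size in this stratum; a stratum that lacks one of
        # the classes gets size 0 and is dropped entirely.
        keep = min(len(groups[(stratum, label)]) for label in labels)
        for label in labels:
            sampled.extend(rng.sample(groups[(stratum, label)], keep))
    return sampled


# Hypothetical toy data: popular articles are the minority in both channels.
articles = (
    [{"channel": "tech", "popular": 1}] * 2
    + [{"channel": "tech", "popular": 0}] * 6
    + [{"channel": "world", "popular": 1}] * 1
    + [{"channel": "world", "popular": 0}] * 5
)
balanced = stratified_undersample(articles, "popular", "channel")
```

In this toy case the balanced sample keeps equal numbers of popular and non-popular articles within each channel, which stops a classifier such as logistic regression from being dominated by the majority class.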
Problem Statement
- Mashable is a global, multi-platform media and entertainment company that publishes online articles across multiple genres and earns revenue from advertisers.
- In order for companies like Mashable to succeed, they must be aware of the trends within their successful articles.
- Without learning these trends, a company like Mashable may lose ground to competitors who also use data-driven strategies.
- Therefore, it is imperative to understand and predict what article characteristics are most appealing to readers.
- Our solution to this problem involves predicting which articles are widely shared on social media.
- Shares on social media are a key metric for article virality, and the specific business insight we are looking for is which factors lead to higher article virality.
- This matters to the business because higher virality brings more article views without requiring paid marketing.
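Framing the problem as "which articles are widely shared" implies turning the share count into a binary target. A minimal sketch of that step follows. The cutoff that defines "widely shared" is not stated in this section, so the threshold below is a purely hypothetical placeholder, as are the function name and the toy share counts.

```python
def label_widely_shared(share_counts, threshold):
    """Return 1 for articles with at least `threshold` shares, else 0."""
    return [1 if count >= threshold else 0 for count in share_counts]


# Hypothetical values for illustration only; not taken from the dataset.
toy_shares = [120, 950, 4300, 18000, 700]
hypothetical_threshold = 1000
labels = label_widely_shared(toy_shares, hypothetical_threshold)
```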
Dataset Introduction
- The dataset is called “Online News Popularity Data Set” and can be accessed from the UCI Machine Learning Repository (https://archive.ics.uci.edu).
- Each row represents a news article; the articles were collected from January 7, 2013 to January 7, 2015.
- There are 39,797 rows and 61 columns, as shown in Appendix Table 1.
- The original dataset contained 37 attributes of articles, and several natural language processing features were extracted by previous researchers (Fernandes, Vinagre & Cortez, 2015). In this study, 17 selected predictors are used to build models.
Methodology
- This study follows the SEMMA methodology. SEMMA stands for Sample, Explore, Modify, Model, and Assess (Shmueli, Bruce, Stephens & Patel, 2017, p. 18).
- During this process, our data visualizations were created with both Tableau and JMP, and the models were built in JMP.