This project is designed to scrape news headlines, summaries, and links from the BBC News website. The goal is to extract valuable information from a website that doesn't provide a publicly accessible API, demonstrating how web scraping can be utilized to gather data for educational and research purposes.
- Purpose of Data Collection
- Data Source Selection
- Project Structure
- Installation
- Usage
- Features
- Collection Practices
- Data Handling and Privacy
- Data Usage
- Ethical Considerations
- Contributing
- License
- Contact
The primary purpose of this project is to extract and analyze news data from BBC News to:
- Provide users with quick access to the latest news headlines and summaries.
- Enable keyword-based filtering to help users find news articles relevant to their interests.
- Demonstrate the practical application of web scraping techniques in a real-world scenario.
- Website Used: BBC News
- Why Chosen: BBC News is a reputable source of global news that doesn't offer a publicly accessible API for extracting news data. Scraping this site allows us to obtain timely news information that can be valuable for analysis and research.
- Robots.txt Compliance: We have reviewed the BBC's robots.txt file and ensured that our scraping activities comply with their guidelines.
News-Summarizer/
├── main.py
├── requirements.txt
├── README.md
├── ETHICS.md
└── .gitignore
- main.py: The main Python script for scraping the news data.
- requirements.txt: A list of required Python packages.
- README.md: Project documentation.
- ETHICS.md: Discussion of ethical considerations.
- .gitignore: Excludes the virtual environment and other unnecessary files from the repository.
- Clone the Repository: `git clone https://github.com/ahamedfoisal/News-Summarizer.git`
- Navigate to the Project Directory: `cd News-Summarizer`
- Set Up a Virtual Environment: run `python -m venv venv`, then activate it with `source venv/bin/activate` (on Windows use `venv\Scripts\activate`).
- Install Required Libraries: `pip install -r requirements.txt`
- Run the Script: `python main.py`
- Enter a Keyword (Optional):
  - When prompted, enter a keyword to filter news articles.
  - Press Enter without typing anything to display all available news articles.
- View the Results:
  - The script will display the news articles along with their summaries and links.
- Scrape Latest News: Fetches the most recent news stories from the BBC News homepage.
- Extract Detailed Information: Retrieves the headline, summary, and direct link for each news article.
- Keyword Filtering: Allows users to filter news articles by keywords in the title or summary.
- Duplicate Prevention: Ensures that duplicate news articles are not displayed.
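The keyword filtering and duplicate prevention listed above can be sketched roughly as follows. This is a minimal illustration and not the code in main.py: the `Article` container and the function names `dedupe` and `filter_by_keyword` are hypothetical, duplicates are assumed to be identified by their link, and matching is assumed to be case-insensitive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical container; main.py may represent articles differently.
    title: str
    summary: str
    link: str


def dedupe(articles):
    """Keep the first article seen for each link (assumed duplicate key)."""
    seen = set()
    unique = []
    for article in articles:
        if article.link not in seen:
            seen.add(article.link)
            unique.append(article)
    return unique


def filter_by_keyword(articles, keyword):
    """Return articles whose title or summary contains the keyword.

    Matching is case-insensitive (an assumption). An empty keyword
    returns every article, mirroring the "press Enter" behaviour.
    """
    keyword = keyword.strip().lower()
    if not keyword:
        return list(articles)
    return [
        a for a in articles
        if keyword in a.title.lower() or keyword in a.summary.lower()
    ]
```

With a list of scraped articles in hand, `filter_by_keyword(dedupe(articles), "climate")` would return only the unique articles that mention "climate" in the title or summary.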
- Respect Robots.txt: The scraper only accesses parts of the website that are not disallowed by the robots.txt file.
- Rate Limiting: The script is designed to be efficient and respectful, minimizing the number of requests to avoid overloading the server.
- No Bypassing Restrictions: The scraper does not attempt to access password-protected areas or bypass any security measures.
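One way the robots.txt check and rate limiting described above might look is sketched below, using only the Python standard library. The helper `build_robot_rules` and the `RateLimiter` class are hypothetical names introduced for illustration; the project's actual approach in main.py may differ, and any delay value would be a choice rather than something documented here.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum interval (in seconds) between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Before each request, a scraper could call `rules.can_fetch(user_agent, url)` and skip the URL when it returns `False`, then call `limiter.wait()` before fetching. For a live site, `RobotFileParser` also offers `set_url()` and `read()` to download the robots.txt file directly.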
- No Personal Data Collection: The scraper does not collect any Personally Identifiable Information (PII) or user-specific data.
- Secure Data Storage: Any data collected is stored securely and is listed in the .gitignore file to prevent accidental uploads to public repositories.
- Educational and Research Purposes Only: The data collected is intended solely for educational demonstrations and research.
- No Commercial Use: The data will not be used for commercial purposes or redistributed.
For a detailed discussion on the ethical considerations of this project and how they are addressed, please refer to the ETHICS.md file.
Contributions are welcome! If you have suggestions for improvements or encounter any issues, please open an issue or submit a pull request.