This project scrapes the BBC News website for the latest climate articles. The scraper is packaged in an ETL pipeline built on Prefect and Dask that loads the scraped news articles into a SQLite database.
Web scraping is done with the requests and BeautifulSoup libraries. Only the latest articles can be scraped, so the script is intended to run on a periodic schedule.
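The extract step can be sketched roughly as below. The inline HTML snippet and the `promo` class are placeholders for illustration only; the real BBC markup has to be inspected in the browser, and in the pipeline the page would be fetched with `requests.get` instead of a hard-coded string.

```python
from bs4 import BeautifulSoup

# Static stand-in for a BBC News listing page (real markup differs).
SAMPLE_HTML = """
<div class="results">
  <a class="promo" href="/news/science-environment-1">
    <h3>Warning climate change impacting on avalanche risk</h3></a>
  <a class="promo" href="/news/science-environment-2">
    <h3>Glaciers retreating faster than expected</h3></a>
</div>
"""

def extract_headlines(html):
    """Return title/url pairs for every article link in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": a.h3.get_text(strip=True), "url": a["href"]}
        for a in soup.select("a.promo")
    ]

headlines = extract_headlines(SAMPLE_HTML)
```

On a schedule, each run would only fetch the listing page and extract the articles not yet present in the database.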
Prefect orchestrates the ETL flow, which can easily be monitored from the Prefect Cloud platform.
Using the TextBlob library, a sentiment label (negative/neutral/positive) is added to each article.
All scraped articles are written to a local SQLite database in the load stage of the ETL flow; no duplicate entries are allowed. Here is an example of a record inserted into the CLIMATENEWS table:
{
"title" : "Warning climate change impacting on avalanche risk",
"content" : "Forecasters said a likely effect in Scotland was avalanches occurring in tighter periods of time.",
"date" : "2023-01-27T06:04:00.000000000",
"sentiment" : "Negative"
}
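A load step like this can be sketched with the standard-library sqlite3 module. The `UNIQUE (title, date)` constraint and the in-memory database are assumptions for illustration; `INSERT OR IGNORE` then makes repeated runs idempotent, which is one way to enforce the no-duplicates rule.

```python
import sqlite3

# In-memory database for illustration; the project writes to a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS CLIMATENEWS (
           title     TEXT,
           content   TEXT,
           date      TEXT,
           sentiment TEXT,
           UNIQUE (title, date)
       )"""
)

record = {
    "title": "Warning climate change impacting on avalanche risk",
    "content": "Forecasters said a likely effect in Scotland was "
               "avalanches occurring in tighter periods of time.",
    "date": "2023-01-27T06:04:00.000000000",
    "sentiment": "Negative",
}

def insert_article(conn, rec):
    # INSERT OR IGNORE skips rows violating the UNIQUE constraint,
    # so re-running the flow never creates duplicate entries.
    conn.execute(
        "INSERT OR IGNORE INTO CLIMATENEWS (title, content, date, sentiment) "
        "VALUES (:title, :content, :date, :sentiment)",
        rec,
    )

insert_article(conn, record)
insert_article(conn, record)  # duplicate, silently ignored
count = conn.execute("SELECT COUNT(*) FROM CLIMATENEWS").fetchone()[0]
```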