News Scraper for Times of India

This project is a Python-based web scraper designed to extract news articles from the Times of India website. It allows you to scrape a random selection of articles from a specified range of years and save the extracted data (title, date, content, and URL) to a CSV file.

Features

Scrape Articles: Extracts news articles from the Times of India archive pages.
Error Handling: Includes retry mechanisms and error handling to manage network issues and timeouts.
Random Article Selection: Selects random articles to scrape within a specified date range.
Save to CSV: Saves scraped articles to a CSV file for further analysis or use.
Progress Bar: Displays a progress bar to track the scraping process.

Installation

To use this scraper, you need to have Python installed. Follow the steps below to set up the project:

Clone the repository:

git clone https://github.com/lostdir/News_Scrapper.git
cd News_Scrapper

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

To run the scraper, execute the following command:

python newscrapper.py

You can modify the parameters such as the range of years and the number of articles to scrape by editing the scrape_random_articles function call in the if __name__ == '__main__': block.

Example Usage

scrape_random_articles(2017, 2024, num_articles=10)

This example will scrape 10 random articles between the years 2017 and 2024.
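How the script picks its dates is not shown in this README, so the following is only a minimal sketch of one way to draw random days from a year range. The helper name random_dates, the seed parameter, and treating the end year as inclusive are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def random_dates(start_year, end_year, count, seed=None):
    """Return `count` random dates from Jan 1 of start_year to Dec 31 of end_year.

    Hypothetical helper: the project's own selection logic may differ.
    """
    rng = random.Random(seed)  # passing a seed makes the selection reproducible
    first = date(start_year, 1, 1)
    last = date(end_year, 12, 31)  # assumption: the end year is inclusive
    span_days = (last - first).days
    # Pick a random day offset for each requested date (both ends inclusive).
    return [first + timedelta(days=rng.randint(0, span_days)) for _ in range(count)]


if __name__ == "__main__":
    for day in random_dates(2017, 2024, count=3, seed=42):
        print(day.isoformat())
```

Each returned date could then be mapped to the archive page for that day. The same date may be drawn more than once, which is acceptable if several articles are taken from one archive page.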

Functionality

Functions

scrape_archive_page(archive_url, max_retries=3): Scrapes all article links from a single archive page with retries in case of failures.
scrape_article(article_url, max_retries=3): Scrapes the title, content, and date from an individual article page.
scrape_random_articles(start_year, end_year, num_articles=65): Scrapes a random selection of articles from a range of dates.
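The max_retries parameter on the scraping functions suggests a retry loop around each request. The sketch below shows one common shape for such a loop. It is not the project's actual code: fetch_with_retries, its fetch argument, and delay_seconds are hypothetical names, and the request itself is passed in as a callable so that the example runs without any network access.

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay_seconds=1.0):
    """Call fetch(url) up to max_retries times and return the first successful result.

    Hypothetical helper: `fetch` is any callable that takes a URL and either
    returns a response or raises an exception.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # a real scraper would catch narrower network errors
            last_error = error
            if attempt < max_retries:
                time.sleep(delay_seconds)  # pause before the next attempt
    # Every attempt failed: report it, keeping the last error as the cause.
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts") from last_error
```

With the requests library, the callable could be something like lambda u: requests.get(u, timeout=10), so that a timeout raises an exception and triggers the next attempt.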

How It Works

Scrape Archive Pages: The script navigates through the Times of India archive pages and extracts article URLs.
Scrape Article Content: For each article URL, it scrapes the title, content, and date.
Random Selection: The script selects random articles to ensure a diverse dataset.
Save to CSV: The scraped data is saved in a CSV file named times_of_india_articles.csv.
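The final step above can be sketched as follows. To keep the example free of third-party installs it uses Python's built-in csv module rather than pandas, which is what the project lists as a dependency. The column names and their order are assumptions based on the fields described in this README (title, date, content, URL), and save_articles is a hypothetical helper.

```python
import csv

# Assumed column layout, taken from the fields this README says are saved.
FIELDS = ["title", "date", "content", "url"]


def save_articles(articles, path="times_of_india_articles.csv"):
    """Write a list of article dicts to a CSV file and return the number of rows written."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for article in articles:
            # Missing fields become empty strings; unexpected keys are ignored.
            writer.writerow({field: article.get(field, "") for field in FIELDS})
    return len(articles)
```

With pandas, the equivalent would be building a DataFrame from the same list of dicts and calling its to_csv method with index=False.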

Dependencies

The scraper relies on the following Python libraries:

requests: To send HTTP requests to the Times of India server.
beautifulsoup4: To parse and extract information from HTML pages.
pandas: To handle data and save it to a CSV file.
tqdm: To display a progress bar for the scraping process.

Install all dependencies using:

pip install -r requirements.txt

Contributing

Contributions are welcome! If you have any improvements or new features to suggest, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
