A collection of web crawlers, spiders, data harvesters, and page parsers. As new technologies develop and new frameworks keep appearing, this document will be continually updated...

DZ1Y/awesome-crawler-cn

Awesome-crawler-cn

A curated collection of web crawlers, spiders, and scrapers across different programming languages.

Python

  • Scrapy - An efficient screen-scraping and web data extraction framework.
  • pyspider - A powerful, pure-Python data collection system.
  • cola - A distributed crawling framework.
  • Demiurge - A micro crawler framework based on PyQuery.
  • Scrapely - A pure-Python HTML page scraping library.
  • feedparser - A universal feed parser.
  • you-get - A silent site scraper and downloader.
  • Grab - A site scraping framework.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • portia - A visual data extraction framework based on Scrapy.
  • crawley - A Python crawling framework based on non-blocking I/O.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MSpider - A Python spider based on gevent (a coroutine networking library).
  • brownant - A lightweight web data extraction framework.
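All of the frameworks above automate variations of the same core loop: fetch a page, parse its HTML, and extract links or data. As a rough, framework-free sketch of the extraction step, using only the Python standard library (the page content and URLs below are made up for illustration, not any framework's API):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from every <a href="..."> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))

# A tiny in-memory page stands in for a real HTTP response body.
page = '<html><body><a href="/docs">Docs</a> <a href="https://example.org/x">X</a></body></html>'
extractor = LinkExtractor("https://example.com/")
extractor.feed(page)
print(extractor.links)  # → ['https://example.com/docs', 'https://example.org/x']
```

In a real crawler, the extracted links would be queued and fetched in turn; scheduling, deduplication, politeness, and retries are exactly what the frameworks listed above handle for you.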

Java

  • Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - An easy-to-use, lightweight web crawler.
  • WebCollector - Simple interfaces for crawling the web; you can set up a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Spiderman - A scalable, extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework with support for JavaScript rendering.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.
  • StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
  • Spark-Crawler - Evolving Apache Nutch to run on Spark.

C#

  • ccrawler - Built in C# 3.5. It contains a simple extension for web content categorization, which can separate web pages by their content.
  • SimpleCrawler - A simple spider based on multithreading and regular expressions.
  • DotnetSpider - A cross-platform, lightweight spider developed in C#.
  • Abot - C# web crawler built for speed and flexibility.
  • Hawk - Advanced Crawler and ETL tool written in C#/WPF.
  • SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

JavaScript

  • scraperjs - A complete and versatile web scraper.
  • scrape-it - A Node.js scraper for humans.
  • simplecrawler - Event driven web crawler.
  • node-crawler - Web crawler for Node.js with a clean, simple API.
  • js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.
  • x-ray - Web scraper with pagination and crawler support.
  • node-osmosis - HTML/XML parser and web scraper for Node.js.

PHP

  • Goutte - A screen scraping and web crawling library for PHP.
  • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • pspider - Parallel web crawler written in PHP.
  • php-spider - A configurable and extensible PHP web spider.

C

  • httrack - Copy websites to your computer.

Ruby

  • upton - A batteries-included framework for easy web scraping. Just add CSS (or do more).
  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
  • Spidr - Spider a site, multiple domains, certain links, or infinitely.
  • Cobweb - Web crawler with very flexible crawling options, standalone or using Sidekiq.
  • mechanize - Automated web interaction & crawling.

R

  • rvest - Simple web scraping for R.

Erlang

  • ebot - A scalable, distributed and highly configurable web crawler.

Perl

  • web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

  • pholcus - A distributed, high concurrency and powerful web crawler.
  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • go_spider - An awesome Go concurrent Crawler(spider) framework.
  • dht - BitTorrent DHT Protocol && DHT Spider.
  • ants-go - An open source, distributed, RESTful crawler engine in Golang.
  • scrape - A simple, higher level interface for Go web scraping.
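Several of the crawlers above, fetchbot among them, advertise robots.txt compliance, a courtesy most serious crawlers implement. For comparison, the same check can be sketched in Python with the standard library's urllib.robotparser (the robots.txt body and user-agent name here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt body stands in for one fetched from a live site.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler consults these answers before every fetch.
print(rp.can_fetch("mybot", "https://example.com/private/page"))  # → False
print(rp.can_fetch("mybot", "https://example.com/public"))        # → True
print(rp.crawl_delay("mybot"))                                    # → 2
```

A crawler that honors the reported crawl delay simply sleeps between requests to the same host; frameworks like gocrawl and fetchbot bake this throttling in.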

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - A Scala crawler (spider) framework, inspired by Scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
