Collyzar

A distributed, Redis-based framework for Colly.

Collyzar provides simple configuration and tools for implementing distributed crawling and scraping.

Features

Simple configuration and clean API
Distributed crawling/scraping
Built-in global bloom filter
Built-in spider cache
Support for Redis commands
Multi-machine load balancing
Pause or stop all crawling machines
Pass additional information to the crawler, read it inside the crawler, and store it in the database
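
To illustrate what a bloom filter contributes to a crawler, here is a minimal in-memory sketch of URL de-duplication. It is not Collyzar's built-in filter: the type urlBloom, its methods, and the sizing values are hypothetical, and a global filter shared by several machines would keep its bits in a shared store such as Redis rather than in process memory.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// urlBloom is a tiny Bloom filter for URL de-duplication.
// Illustration only; this is not Collyzar's implementation.
type urlBloom struct {
	bits []bool
	k    uint32 // number of bit positions probed per URL
}

func newURLBloom(size int, k uint32) *urlBloom {
	return &urlBloom{bits: make([]bool, size), k: k}
}

// positions derives k bit positions from two FNV hashes (double hashing).
func (b *urlBloom) positions(url string) []uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(url))
	h2 := fnv.New32()
	h2.Write([]byte(url))
	base, step := h1.Sum32(), h2.Sum32()|1 // force an odd, non-zero step
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (base + i*step) % uint32(len(b.bits))
	}
	return out
}

// Add marks url as seen.
func (b *urlBloom) Add(url string) {
	for _, p := range b.positions(url) {
		b.bits[p] = true
	}
}

// MightContain returns false only if url was definitely never added;
// a true result can occasionally be a false positive.
func (b *urlBloom) MightContain(url string) bool {
	for _, p := range b.positions(url) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	seen := newURLBloom(1<<16, 3)
	seen.Add("https://www.amazon.com")
	fmt.Println(seen.MightContain("https://www.amazon.com")) // true: added URLs always match
}
```

A crawler would typically call MightContain before fetching a URL and Add afterwards, accepting a small false-positive rate in exchange for constant memory use.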

Installation

Add collyzar to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/Zartenc/collyzar/v2 latest
)

Example Usage

See the examples folder for more detailed examples.

Crawler cluster machine

SpiderName must be unique.

Once started, the crawler keeps monitoring the Redis crawler queue and crawls queued URLs until it receives a pause or stop signal.

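package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	// Settings for this crawler node.
	cs := &collyzar.CollyzarSettings{
		SpiderName: "zarten",
		Domain:     "www.amazon.com",
		RedisIp:    "127.0.0.1",
	}
	// Run monitors the Redis queue and passes each response to myResponse.
	collyzar.Run(myResponse, cs, nil)
}

// myResponse prints the status code of each crawled response.
func myResponse(response *collyzar.ZarResponse) {
	fmt.Println(response.StatusCode)
}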
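
The monitoring behavior described above (keep taking URLs from the queue until a pause or stop signal arrives) can be pictured with the following self-contained sketch. It uses in-memory channels instead of Redis and does not reflect how Collyzar is implemented internally; runWorker, ctrlSignal, and demo are hypothetical names.

```go
package main

import "fmt"

// ctrlSignal is a hypothetical control message sent to a worker.
type ctrlSignal int

const (
	sigPause ctrlSignal = iota
	sigWakeup
	sigStop
)

// runWorker handles URLs from queue until it receives sigStop.
// While paused it ignores the queue but still listens for signals.
func runWorker(queue <-chan string, signals <-chan ctrlSignal, handle func(url string)) {
	paused := false
	for {
		q := queue
		if paused {
			q = nil // receiving from a nil channel blocks, so no URL is taken while paused
		}
		select {
		case s := <-signals:
			switch s {
			case sigPause:
				paused = true
			case sigWakeup:
				paused = false
			case sigStop:
				return
			}
		case url, ok := <-q:
			if !ok {
				return // queue closed
			}
			handle(url)
		}
	}
}

// demo drives one worker through crawl, pause, wakeup, crawl, stop.
func demo() []string {
	queue := make(chan string)
	signals := make(chan ctrlSignal)
	done := make(chan struct{})
	var crawled []string
	go func() {
		defer close(done)
		runWorker(queue, signals, func(u string) { crawled = append(crawled, u) })
	}()
	queue <- "https://www.amazon.com/a"
	signals <- sigPause
	signals <- sigWakeup
	queue <- "https://www.amazon.com/b"
	signals <- sigStop
	<-done // wait for the worker to exit before reading crawled
	return crawled
}

func main() {
	fmt.Println(len(demo())) // prints 2
}
```

The channels are unbuffered so that each send completes only when the worker has received it, which keeps the order of events in the demo deterministic.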

Control machine

Push url to redis queue

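package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	// Redis host and port, then (presumably) the Redis password and the spider name.
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	url := "https://www.amazon.com"
	pushInfo := collyzar.PushInfo{Url: url}

	// Queue the URL for the crawler cluster.
	if err := ts.PushToQueue(pushInfo); err != nil {
		fmt.Println(err)
	}
}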
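
When seeding a crawl, it is common to push many URLs and keep track of the ones that could not be queued. The sketch below shows one way to do that, assuming only the PushToQueue call from the example above. PushInfo here is a local stand-in for collyzar.PushInfo, and the pusher interface, pushAll, and memQueue are hypothetical helpers that let the code run without Redis.

```go
package main

import (
	"errors"
	"fmt"
)

// PushInfo is a local stand-in for collyzar.PushInfo; only the Url field
// used in the example above is assumed.
type PushInfo struct{ Url string }

// pusher is a hypothetical interface covering the one method used above.
type pusher interface {
	PushToQueue(info PushInfo) error
}

// pushAll pushes every URL and returns the ones that could not be queued.
func pushAll(p pusher, urls []string) (failed []string) {
	for _, u := range urls {
		if err := p.PushToQueue(PushInfo{Url: u}); err != nil {
			failed = append(failed, u)
		}
	}
	return failed
}

// memQueue is an in-memory fake that rejects empty URLs.
type memQueue struct{ items []string }

func (q *memQueue) PushToQueue(info PushInfo) error {
	if info.Url == "" {
		return errors.New("empty url")
	}
	q.items = append(q.items, info.Url)
	return nil
}

func main() {
	q := &memQueue{}
	failed := pushAll(q, []string{"https://www.amazon.com", ""})
	fmt.Println(len(q.items), len(failed)) // prints "1 1"
}
```

With the real library, the same loop would call ts.PushToQueue with a collyzar.PushInfo value directly.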

Tools

Collyzar provides tools for stopping and pausing crawlers.

Stop all crawlers

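package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	// Send the stop signal to all crawlers.
	if err := ts.StopSpiders(); err != nil {
		fmt.Println(err)
	}
}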

Pause all crawlers

After pausing, every crawler process stays idle.
Use the WakeupSpiders method to wake the crawlers up again.

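package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	// Send the pause signal to all crawlers.
	if err := ts.PauseSpiders(); err != nil {
		fmt.Println(err)
	}
}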

Bugs

Bugs or suggestions? Visit the issue tracker.

Contributing

If you wish to contribute to this project, please create a branch and open a pull request against master ("GitHub Flow").
