A distributed, Redis-based framework for colly.
Collyzar provides simple configuration and tools for distributed crawling and scraping.
- Simple configuration and clean API
- Distributed crawling/scraping
- Built-in global bloom filter
- Built-in spider cache
- Redis command support
- Multi-machine load balancing
- Pause or stop all crawling machines
- Pass additional information to the crawler, read it inside the crawler, and store it in the database
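To illustrate what a Bloom filter does for URL deduplication, here is a minimal in-memory sketch. It is not Collyzar's implementation: the `bloomFilter` type, its size, and the double-hashing scheme are assumptions made for illustration, and Collyzar's built-in filter is described as global, meaning shared across machines, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a hypothetical, minimal Bloom filter (not Collyzar's own type).
type bloomFilter struct {
	bits []bool
	k    uint32 // number of bit positions probed per item
}

func newBloomFilter(size int, k uint32) *bloomFilter {
	return &bloomFilter{bits: make([]bool, size), k: k}
}

// positions derives k bit positions from two FNV hashes (double hashing).
func (b *bloomFilter) positions(s string) []uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(s))
	h2 := fnv.New32()
	h2.Write([]byte(s))
	a, c := h1.Sum32(), h2.Sum32()
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (a + i*c) % uint32(len(b.bits))
	}
	return out
}

// Add marks the URL as seen.
func (b *bloomFilter) Add(url string) {
	for _, p := range b.positions(url) {
		b.bits[p] = true
	}
}

// MayContain returns false only if the URL was definitely never added;
// a true result can occasionally be a false positive.
func (b *bloomFilter) MayContain(url string) bool {
	for _, p := range b.positions(url) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	seen := newBloomFilter(1024, 3)
	urls := []string{"https://www.amazon.com", "https://www.amazon.com/a", "https://www.amazon.com"}
	for _, u := range urls {
		if seen.MayContain(u) {
			fmt.Println("skip", u)
			continue
		}
		seen.Add(u)
		fmt.Println("crawl", u)
	}
}
```

The trade-off is memory for certainty: the filter never forgets a URL it has seen, but it may rarely skip a URL it has not.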
Add collyzar to your go.mod file:
module github.com/x/y

go 1.14

require (
	github.com/Zartenc/collyzar/v2 latest
)
See the examples folder for more detailed examples.
SpiderName must be unique.
Once started, the crawler keeps monitoring its Redis queue for URLs to crawl until it receives a pause or stop signal.
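package main

import (
	"fmt"
	"github.com/Zartenc/collyzar/v2"
)

func main() {
	cs := &collyzar.CollyzarSettings{
		SpiderName: "zarten", // must be unique
		Domain:     "www.amazon.com",
		RedisIp:    "127.0.0.1",
	}
	// Start the crawler; myResponse is passed in as the response callback.
	collyzar.Run(myResponse, cs, nil)
}

// myResponse prints the status code of a crawled response.
func myResponse(response *collyzar.ZarResponse) {
	fmt.Println(response.StatusCode)
}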
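The "run until a stop signal arrives" behaviour can be pictured with a small, self-contained sketch. It does not use Collyzar or Redis: the `runCrawler` function and the Go channels below are hypothetical in-memory stand-ins for the Redis queue and the stop signal, chosen only to show the shape of the loop.

```go
package main

import "fmt"

// runCrawler handles URLs from queue until stop is closed.
// Both channels are in-memory stand-ins (assumptions), not Collyzar's API.
func runCrawler(queue <-chan string, stop <-chan struct{}, handle func(string)) {
	for {
		select {
		case <-stop:
			return
		case u := <-queue:
			handle(u)
		}
	}
}

func main() {
	queue := make(chan string) // unbuffered: a send returns once the crawler took the URL
	stop := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		runCrawler(queue, stop, func(u string) { fmt.Println("crawling", u) })
	}()

	queue <- "https://www.amazon.com"
	queue <- "https://www.amazon.com/a"
	close(stop) // the stop signal ends the loop
	<-done      // wait for the crawler to exit
}
```

Several such loops reading from one shared queue would each take whatever URL comes next, which is one simple way to spread work across machines.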
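package main

import (
	"fmt"
	"github.com/Zartenc/collyzar/v2"
)

func main() {
	// "zarten" matches the SpiderName used by the crawler.
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")
	url := "https://www.amazon.com"
	pushInfo := collyzar.PushInfo{Url: url}
	// Push the URL onto the crawler queue in Redis.
	if err := ts.PushToQueue(pushInfo); err != nil {
		fmt.Println(err)
	}
}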
Collyzar also provides tools to stop and pause crawlers.
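package main

import (
	"fmt"
	"github.com/Zartenc/collyzar/v2"
)

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")
	// Send the stop signal to the crawlers.
	if err := ts.StopSpiders(); err != nil {
		fmt.Println(err)
	}
}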
After a pause, every crawler process stays idle.
You can then use the WakeupSpiders method to wake the crawlers up.
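package main

import (
	"fmt"
	"github.com/Zartenc/collyzar/v2"
)

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")
	// Send the pause signal; the crawlers stay idle until woken up.
	if err := ts.PauseSpiders(); err != nil {
		fmt.Println(err)
	}
}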
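The pause and wake-up behaviour can be modelled as a small state machine. The sketch below is illustrative only: the `crawler` type, the signal strings, and the rule that a stop is final are assumptions of this example, not Collyzar's internals.

```go
package main

import "fmt"

// state is a hypothetical model of a crawler process's lifecycle.
type state int

const (
	running state = iota
	paused
	stopped
)

// crawler models one crawling process; queue stands in for the Redis queue.
type crawler struct {
	state   state
	queue   []string
	crawled []string
}

// signal applies a control signal. Treating stop as final is an assumption.
func (c *crawler) signal(sig string) {
	switch {
	case c.state == stopped:
		// ignore every signal after a stop
	case sig == "stop":
		c.state = stopped
	case sig == "pause":
		c.state = paused
	case sig == "wakeup" && c.state == paused:
		c.state = running
	}
}

// tick crawls at most one queued URL, and only while the crawler is running.
func (c *crawler) tick() {
	if c.state != running || len(c.queue) == 0 {
		return
	}
	c.crawled = append(c.crawled, c.queue[0])
	c.queue = c.queue[1:]
}

func main() {
	c := &crawler{queue: []string{"https://www.amazon.com/a", "https://www.amazon.com/b"}}
	c.tick()
	c.signal("pause")
	c.tick()                    // idle: the second URL stays queued
	fmt.Println(len(c.crawled)) // prints 1
	c.signal("wakeup")
	c.tick()
	fmt.Println(len(c.crawled)) // prints 2
}
```

In this model a paused crawler leaves its queue untouched, so no work is lost between a pause and the following wake-up.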
Bugs or suggestions? Visit the issue tracker.
If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow").