feat(dupefilter.py): add custom RedisDupeFilter class to handle duplicate requests based on request fingerprint

The `dupefilter.py` file introduces a new class, `RedisDupeFilter`, which extends the `RFPDupeFilter` class from `scrapy_redis`. It is responsible for detecting duplicate requests in the rent crawler application.

The `RedisDupeFilter` class overrides the `request_fingerprint` method to generate a unique fingerprint for each request. The fingerprint is built from a dictionary `fingerprint_data` containing the request method, the canonicalized URL, and a body field. If the request carries an `id` in its `meta` dict, that value is used as the body field; otherwise the request body (empty bytes if absent) is hex-encoded. The dictionary is serialized to JSON with sorted keys and hashed with the SHA-1 algorithm to produce the fingerprint.
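The hashing logic described above can be sketched as a standalone function, independent of Scrapy. This is an illustrative simplification: it skips the `canonicalize_url` step (which requires `w3lib`) and takes the optional meta `id` as a plain parameter.

```python
import hashlib
import json


def fingerprint(method: str, url: str, body: bytes = b"", meta_id=None) -> str:
    # Mirrors RedisDupeFilter.request_fingerprint: an explicit meta "id"
    # takes precedence over the hex-encoded request body.
    data = {
        "method": method,
        "url": url,
        "body": meta_id or (body or b"").hex(),
    }
    # Sorted keys make the JSON serialization, and hence the hash, stable.
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()
```

Because the JSON keys are sorted, two requests with the same method, URL, and body always yield the same 40-character hex digest, which is what lets Redis detect the duplicate.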

This custom dupe filter is designed to work with Redis as the backend for storing and checking duplicate requests.
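Wiring the filter into a scrapy-redis project would typically be done in `settings.py`. The commit does not show this configuration, so the snippet below is a hypothetical fragment: the scheduler and settings names are the standard scrapy-redis ones, and the Redis URL is an assumed local instance.

```python
# Hypothetical settings.py fragment (not part of this commit).
# Route scheduling through scrapy-redis and point the dupe filter
# at the new RedisDupeFilter class.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "rent_crawler.dupefilter.RedisDupeFilter"
REDIS_URL = "redis://localhost:6379/0"  # assumed local Redis instance
```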

chore(middlewares.py): remove unused RedisKeySpiderMiddleware class
Morelatto committed Oct 2, 2023
1 parent 0ddd487 commit 3f3b54b
Showing 2 changed files with 18 additions and 55 deletions.
18 changes: 18 additions & 0 deletions rent_crawler/dupefilter.py
@@ -0,0 +1,18 @@
import hashlib
import json

from scrapy.utils.python import to_unicode
from scrapy_redis.dupefilter import RFPDupeFilter
from w3lib.url import canonicalize_url


class RedisDupeFilter(RFPDupeFilter):

def request_fingerprint(self, request):
fingerprint_data = {
"method": to_unicode(request.method),
"url": canonicalize_url(request.url),
"body": request.meta.get('id') or (request.body or b"").hex(),
}
fingerprint_json = json.dumps(fingerprint_data, sort_keys=True)
return hashlib.sha1(fingerprint_json.encode("utf-8")).hexdigest()
55 changes: 0 additions & 55 deletions rent_crawler/middlewares.py

This file was deleted.
