Provide classes to use FastWARC to read WARC/WAT/WET files #38

Merged
merged 5 commits into from
Mar 16, 2023

sebastian-nagel
Contributor

(address #37)

  • implemented
    • base class CCFastWarcSparkJob
    • examples/applications
      • ServerCountFastWarcJob
      • ExtractHostLinksFastWarcJob
  • tested using FastWARC 0.12.2
  • performance comparison warcio <> FastWARC (local mode, small test data)
    • 23% faster - ServerCountFastWarcJob (63s -> 48s)
    • 8% faster - ExtractHostLinksFastWarcJob (72s -> 66s)
  • successfully ran ExtractHostLinksFastWarcJob on a cluster (Spark on Yarn) to prepare the May, June/July, and August 2022 web graphs
  • to do
    • iterate_records(): determine how to access the WARC record offset and length
    • more encapsulation: use warcio/fastwarc methods indirectly, so that some example classes only need a change of base class (CCSparkJob -> CCFastWarcSparkJob)
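
The encapsulation idea in the last to-do item can be sketched roughly as follows. This is only an illustration, not the code in this PR: `WarcioLikeRecord` and `FastWarcLikeRecord` are hypothetical stand-ins that loosely mimic how the two libraries expose WARC headers (accessor method vs. dict-like mapping), and `BaseJob`, `FastWarcJob` and `target_host` are made-up names. The point is that user code calls one accessor, and only the base class decides how the header is actually read.

```python
from urllib.parse import urlparse


class WarcioLikeHeaders:
    """Hypothetical stand-in: headers read through an accessor method."""

    def __init__(self, headers):
        self._headers = dict(headers)

    def get_header(self, name, default=None):
        return self._headers.get(name, default)


class WarcioLikeRecord:
    """Hypothetical stand-in for a record with a `rec_headers` object."""

    def __init__(self, headers):
        self.rec_headers = WarcioLikeHeaders(headers)


class FastWarcLikeRecord:
    """Hypothetical stand-in for a record with dict-like `headers`."""

    def __init__(self, headers):
        self.headers = dict(headers)


class BaseJob:
    """Base class: hides how a WARC header is read from a record."""

    @staticmethod
    def get_warc_header(record, header, default=None):
        return record.rec_headers.get_header(header, default)

    def target_host(self, record):
        # User code: written once, relies only on the encapsulated accessor.
        uri = self.get_warc_header(record, 'WARC-Target-URI', '')
        return urlparse(uri).hostname


class FastWarcJob(BaseJob):
    """Only the accessor is overridden; target_host() is inherited as-is."""

    @staticmethod
    def get_warc_header(record, header, default=None):
        return record.headers.get(header, default)


headers = {'WARC-Target-URI': 'https://example.com/index.html'}
host_a = BaseJob().target_host(WarcioLikeRecord(headers))
host_b = FastWarcJob().target_host(FastWarcLikeRecord(headers))
```

With this shape, an example job that only touches records through such accessors could switch parsers by changing its base class, which is what the to-do item aims for.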

- CCSparkJob: separate processing of a single WARC file
  from method process_warcs(...) into process_warc(...)
- provide base class processing WARC files using FastWARC
    from sparkcc_fastwarc import CCFastWarcSparkJob
- port server count example
- port the host graph construction example
  (host-host link extraction from WAT and WARC files)
- provide methods for encapsulation to hide differences between warcio
  and fastwarc from user methods
- simplify fastwarc classes and avoid code duplication by using
  encapsulated methods to access WARC/HTTP headers and the payload stream
- use methods encapsulating warcio/fastwarc in more examples
- method to access WARC header: add param for default / fall-back value
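
A rough sketch of the process_warcs(...) / process_warc(...) split described in the first commit message, under simplifying assumptions: `SketchJob`, `fetch` and the in-memory `files` dict are hypothetical, and "records" are just lines of bytes rather than parsed WARC records. It only shows why the split helps: the outer method keeps the per-file fetching and error handling, while a subclass swaps the parser by overriding the inner method alone.

```python
import io


class SketchJob:
    """Toy job: `files` maps a URI to bytes and stands in for real storage."""

    def __init__(self, files):
        self.files = files

    def fetch(self, uri):
        # Hypothetical helper: open one input file as a binary stream.
        return io.BytesIO(self.files[uri])

    def process_warcs(self, uris):
        # Outer loop: fetch each file, skip the ones that cannot be opened.
        for uri in uris:
            try:
                stream = self.fetch(uri)
            except KeyError:
                continue
            yield from self.process_warc(uri, stream)

    def process_warc(self, uri, stream):
        # Inner step: parse a single file; here every line is one "record".
        for line in stream:
            yield uri, line.strip().decode()


class UpperCaseJob(SketchJob):
    """Replaces only the per-file parsing, reusing the outer loop."""

    def process_warc(self, uri, stream):
        for line in stream:
            yield uri, line.strip().decode().upper()


files = {'a.warc': b'x\ny\n'}
plain = list(SketchJob(files).process_warcs(['a.warc', 'missing.warc']))
upper = list(UpperCaseJob(files).process_warcs(['a.warc']))
```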
@sebastian-nagel sebastian-nagel marked this pull request as ready for review March 16, 2023 13:30
@sebastian-nagel sebastian-nagel merged commit ed7b41f into main Mar 16, 2023