Skip to content

wantbook-book/MC_Crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Minecraft Wiki Crawler

Installation

We use python 3.11. We have tested on macOS and Ubuntu 20.04. You can follow the instructions below to run it.

Install Requirements

pip install -r requirements.txt

Get Started

You can just run python main.py.

In the main.py:

if __name__ == '__main__':
    urls_dir = Path('crawler_data/rought_urls')
    base_url = 'https://minecraft.wiki'
    urls = [
        'https://minecraft.wiki/w/Mob',
        'https://minecraft.wiki/w/Block',
        'https://minecraft.wiki/w/Item',
        'https://minecraft.wiki/w/Tutorials',
        'https://minecraft.wiki/w/Biome',
        'https://minecraft.wiki/w/Smithing',
        'https://minecraft.wiki/w/Structure'
    ]
    url_crawl(base_url=base_url, urls=urls, output_dir=urls_dir, rough=True)
    crawl(urls_dir=urls_dir, output_dir=Path('crawler_data/rough'))
    
    content_dirs = [
        Path(path) for path in Path('crawler_data/rough').iterdir() if path.is_dir()
    ]
    print(content_dirs)
    split_content(content_dirs=content_dirs)
  • urls_dir: This is the directory where the crawled url will be placed.

  • base_url: Minecraft Wiki Url

  • urls: There are 22 categories. You can select which categories to crawl by appending urls of categories to urls.

    image-20240524上午93305274

  • url_crawl(): Fisrt crawl all pages urls of your selected categories and save them to urls_dir

  • crawl(): According to urls in urls_dir, crawl contents of all pages, including text, lists, tables.

  • split_content(): It is used to split files whose word count exceeds the limit, splitting them in content blocks to ensure that the word count of each file after splitting does not exceed the limit as much as possible.

TODO

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages