Commit

Update README.md
HeZhang1994 authored Apr 21, 2019
1 parent 3d9422f commit d853883
Showing 1 changed file with 20 additions and 14 deletions.
34 changes: 20 additions & 14 deletions README.md
@@ -7,19 +7,19 @@

[*English Version*](https://github.com/HeZhang1994/weibo-crawler/blob/master/README.md) | [*中文版*](https://github.com/HeZhang1994/weibo-crawler/blob/master/README-cn.md)

This is a **Python** implementation of crawling Weibo data (e.g., text, images, live photos, and videos) of one Sina Weibo user from [Weibo Mobile Client](https://m.weibo.cn). It simulates user login with **session** (username and password).
This is a **Python** implementation of crawling Weibo data (e.g., text, images, live photos, and videos) of one Sina Weibo user from the [Weibo Mobile Client](https://m.weibo.cn). It simulates user login with the **session** (username and password).

Many thanks to [Python Chinese Community](https://blog.csdn.net/BF02jgtRS00XKtCx/article/details/79547627) for providing the source code `SourceCode_weibocrawler.py`.

## Functions

- Crawling the short **text** in original and retweeted Weibo posts.
- Crawling short **text** in original and retweeted Weibo posts.

- Crawling the large (preferred) or small **JPG/GIF images** in original and retweeted Weibo posts.
- Crawling large (preferred) or small **JPG/GIF images** in original and retweeted Weibo posts.

- [**New!**] Crawling the **live photos** (as JPG images, MOV videos, and/or GIF images) in original and retweeted Weibo posts.
- [**New!**] Crawling **live photos** (as JPG images, MOV videos, and/or GIF images) in original and retweeted Weibo posts.

- Crawling the HD (preferred) or SD **videos** in original and retweeted Weibo posts.
- Crawling HD (preferred) or SD **videos** in original and retweeted Weibo posts.
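
The "preferred or fallback" behaviour in the list above (large before small images, HD before SD videos) can be sketched as a simple lookup. This is only an illustration: the function name `pick_preferred` and the dictionary keys (`"large"`, `"small"`, `"hd"`, `"sd"`) are hypothetical and are not taken from the crawler's source or from any Weibo API.

```python
def pick_preferred(urls, preferred, fallback):
    """Return the URL stored under `preferred`, else the one under `fallback`.

    `urls` is a hypothetical mapping from a quality label to a URL.
    Returns None when neither quality is available.
    """
    return urls.get(preferred) or urls.get(fallback)


# Example with made-up placeholder URLs.
image_urls = {"small": "https://example.com/small.jpg"}
video_urls = {"hd": "https://example.com/hd.mp4", "sd": "https://example.com/sd.mp4"}

chosen_image = pick_preferred(image_urls, "large", "small")  # falls back to the small image
chosen_video = pick_preferred(video_urls, "hd", "sd")        # HD is available, so it is used
```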

## Dependencies

@@ -33,25 +33,31 @@ Many thanks to [Python Chinese Community](https://blog.csdn.net/BF02jgtRS00XKtCx

### User Settings

1. Set `S_DATA` and `S_HEADER` of session for simulating user login (see comments for obtaining those information).
1. Set `S_DATA` and `S_HEADER` of the session for simulating user login (see comments for details).

2. Set `USER_URL` of target Sina Weibo user (see comments for obtaining this information).
2. Set `USER_URL` of the target Sina Weibo user (see comments for details).

3. Set `PAGE_AMOUNT` (the amount of pages for crawling) to be greater than 10% of the amount of user's Weibo posts.
3. Set the amount of pages (`PAGE_AMOUNT`) for crawling (see comments for details).

4. Set `PATH_FOLDER` and `PATH_FILE_TXT` for saving Weibo data.
4. Set the path (`PATH_FOLDER`) and the TXT file (`PATH_FILE_TXT`) for saving Weibo data.

5. Select the type of Weibo data for crawling (`IF_IMAGE`, `IF_PHOTO`, and `IF_VIDEO`). 0 - Not crawl, 1 - Crawl.
5. Set the type of Weibo data (`IF_IMAGE`, `IF_PHOTO`, and `IF_VIDEO` as 1) for crawling.

6. Set `IF_LIVE2GIF = 1` if live photos (MOV videos) need to be converted to GIF images.
6. Set `IF_LIVE2GIF = True` if live photos (videos) need to be converted to GIF images.

7. Set `TIME_DELAY` of crawler to avoid `ConnectionError: ('Connection aborted.', OSError("(104, 'ECONNRESET')",))`.
7. Set `TIME_DELAY` of the crawler to avoid `ConnectionError 104: ('Connection aborted.')`.

8. If `ConnectionError 104: ('Connection aborted.')` occurs:

    1. Set `IF_RECONNECT = True` for running the crawler in reconnection mode.

    2. Set `TAG_STARTCARD` as the serial number of the starting Weibo post (according to log information).
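
As a rough illustration of the settings listed above, the following sketch shows how they might look as Python constants, together with a hypothetical helper that pauses `TIME_DELAY` seconds between requests and reports where to resume after a connection error. Every value is a placeholder, the dictionary shapes of `S_DATA` and `S_HEADER` are assumed, and `crawl_with_delay` is not part of the repository; the comments in the crawler's source remain the authority on the real formats.

```python
import time

# Placeholder values only; see the comments in the crawler's source for real formats.
S_DATA = {"username": "<your-username>", "password": "<your-password>"}  # assumed shape
S_HEADER = {"User-Agent": "<your-user-agent>"}                           # assumed shape
USER_URL = "<target-user-url>"
PAGE_AMOUNT = 5                      # number of pages to crawl
PATH_FOLDER = "weibo_data/"          # folder for saved Weibo data
PATH_FILE_TXT = PATH_FOLDER + "weibo_posts.txt"
IF_IMAGE = 1                         # 1 = crawl images
IF_PHOTO = 1                         # 1 = crawl live photos
IF_VIDEO = 1                         # 1 = crawl videos
IF_LIVE2GIF = True                   # convert live photos (videos) to GIF images
TIME_DELAY = 3                       # seconds to wait between requests
IF_RECONNECT = False                 # set True to resume after a connection error
TAG_STARTCARD = 0                    # serial number of the post to start from


def crawl_with_delay(fetch_post, total_posts, start=TAG_STARTCARD,
                     delay=TIME_DELAY, sleep=time.sleep):
    """Hypothetical helper: fetch posts start..total_posts-1, pausing between requests.

    Returns None when every post was fetched, or the index of the post that
    failed so it can be used as TAG_STARTCARD on the next run.
    """
    for index in range(start, total_posts):
        try:
            fetch_post(index)
        except OSError:
            # The built-in ConnectionError is a subclass of OSError.
            return index
        sleep(delay)
    return None
```

If the helper returns an index, a later run would set `IF_RECONNECT = True` and `TAG_STARTCARD` to that index, which mirrors step 8 above.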

### Run

1. Run `run_WeiboCrawler.py` to crawl Weibo data of one Sina Weibo user.
1. Run `run_WeiboCrawler.py` to crawl Weibo data of the target Sina Weibo user.

2. See `Log_run_WeiboCrawler.txt` for log information of running this code.
2. See `Log_run_WeiboCrawler.txt` for log information of running the code.

### Results
