Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[generic] prefer enclosures over following links in rss feeds #16189

Merged
merged 7 commits into from
Apr 29, 2018
Merged

[generic] prefer enclosures over following links in rss feeds #16189

merged 7 commits into from
Apr 29, 2018

Conversation

bastiandg
Copy link
Contributor

@bastiandg bastiandg commented Apr 14, 2018

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

When downloading from rss feeds, the generic extractor follows the rss links first. This leads to "ERROR: Unsupported URL" messages in a lot of cases. Even though there are enclosures present.

$ youtube-dl --playlist-end 1 http://www.hellointernet.fm/podcast?format=rss
[generic] podcast?format=rss: Requesting header
WARNING: Falling back on generic information extractor.
[generic] podcast?format=rss: Downloading webpage
[generic] podcast?format=rss: Extracting information
[download] Downloading playlist: Hello Internet
[generic] playlist Hello Internet: Collected 100 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[generic] onehundred: Requesting header
WARNING: Falling back on generic information extractor.
[generic] onehundred: Downloading webpage
[generic] onehundred: Extracting information
ERROR: Unsupported URL: http://www.hellointernet.fm/podcast/onehundred

With this pull request downloading enclosures is preferred over following the rss links.

$ python -m youtube_dl --playlist-end 1 http://www.hellointernet.fm/podcast?format=rss
[generic] podcast?format=rss: Requesting header
WARNING: Falling back on generic information extractor.
[generic] podcast?format=rss: Downloading webpage
[generic] podcast?format=rss: Extracting information
[download] Downloading playlist: Hello Internet
[generic] playlist Hello Internet: Collected 100 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[generic] Hello_Internet_Episode_One_Hundred: Requesting header
[redirect] Following redirect to http://hwcdn.libsyn.com/p/e/5/c/e5ca0f579a9f12b4/Hello_Internet_Episode_One_Hundred.mp3?c_id=20078576&expiration=1523726590&hwt=d00666280d294aea41062d82e934b6f0
[generic] Hello_Internet_Episode_One_Hundred: Requesting header
[download] Destination: Hello Internet Episode One Hundred-Hello_Internet_Episode_One_Hundred.mp3
[download] 100% of 96.39MiB in 00:09
[download] Finished downloading playlist: Hello Internet

Copy link
Collaborator

@dstftw dstftw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test.

break

if not enclosure_nodes:
next_url = xpath_text(it, 'link', fatal=False)

if not next_url:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential reference before assignment error.

if next_url:
break

if not enclosure_nodes:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not mean next_url is obtained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check was changed and an assignment was added, so that the error can't occur anymore.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing changed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a change needed here? If not enclosure_nodes is True it is expected that next_url wasn't obtained yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what? If not enclosure_nodes is False it does not mean next_url was obtained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thought, I think I get what you are saying. If the rss feed has enclosure_nodes, but the url is empty it skips even though it shouldn't. This is fixed now.

if not enclosure_nodes:
next_url = xpath_text(it, 'link', fatal=False)

if next_url is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't skip empty strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i changed it back to not next_url. This should do the job in conjunction with the next_url = None from above.

@bastiandg
Copy link
Contributor Author

All open points were addressed (fixed). Is there anything needed from my side before this is merged?

@dstftw
Copy link
Collaborator

dstftw commented Apr 19, 2018

Add a test.

@bastiandg
Copy link
Contributor Author

The test was added. This is what it look like with the current youtube-dl version:

…
DownloadError: ERROR: Unsupported URL: http://www.hellointernet.fm/podcast/101

----------------------------------------------------------------------
Ran 1 test in 2.454s

FAILED (errors=1)

Here is what it says with changes introduced in this pull request:

…
[download] Finished downloading playlist: Hello Internet
.
----------------------------------------------------------------------
Ran 1 test in 32.079s

OK

@dstftw dstftw merged commit 01aec84 into ytdl-org:master Apr 29, 2018
@bastiandg bastiandg deleted the rss-enclosure-prio branch April 29, 2018 17:21
dstftw added a commit that referenced this pull request Jul 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants