[generic] prefer enclosures over following links in rss feeds #16189

bastiandg · 2018-04-14T15:39:25Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

When downloading from RSS feeds, the generic extractor follows each item's RSS link first. In many cases this leads to "ERROR: Unsupported URL" messages, even though enclosures are present.

$ youtube-dl --playlist-end 1 http://www.hellointernet.fm/podcast?format=rss
[generic] podcast?format=rss: Requesting header
WARNING: Falling back on generic information extractor.
[generic] podcast?format=rss: Downloading webpage
[generic] podcast?format=rss: Extracting information
[download] Downloading playlist: Hello Internet
[generic] playlist Hello Internet: Collected 100 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[generic] onehundred: Requesting header
WARNING: Falling back on generic information extractor.
[generic] onehundred: Downloading webpage
[generic] onehundred: Extracting information
ERROR: Unsupported URL: http://www.hellointernet.fm/podcast/onehundred

With this pull request, downloading enclosures is preferred over following the RSS links.

$ python -m youtube_dl --playlist-end 1 http://www.hellointernet.fm/podcast?format=rss
[generic] podcast?format=rss: Requesting header
WARNING: Falling back on generic information extractor.
[generic] podcast?format=rss: Downloading webpage
[generic] podcast?format=rss: Extracting information
[download] Downloading playlist: Hello Internet
[generic] playlist Hello Internet: Collected 100 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[generic] Hello_Internet_Episode_One_Hundred: Requesting header
[redirect] Following redirect to http://hwcdn.libsyn.com/p/e/5/c/e5ca0f579a9f12b4/Hello_Internet_Episode_One_Hundred.mp3?c_id=20078576&expiration=1523726590&hwt=d00666280d294aea41062d82e934b6f0
[generic] Hello_Internet_Episode_One_Hundred: Requesting header
[download] Destination: Hello Internet Episode One Hundred-Hello_Internet_Episode_One_Hundred.mp3
[download] 100% of 96.39MiB in 00:09
[download] Finished downloading playlist: Hello Internet
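
The preference described above can be sketched roughly as follows. This is only an illustration built on the standard library's xml.etree.ElementTree, not the code in generic.py; the names pick_item_url and feed_urls and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed: one item has an enclosure, the other only a link.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Podcast</title>
  <item>
    <title>Episode with enclosure</title>
    <link>http://example.com/podcast/one</link>
    <enclosure url="http://example.com/media/one.mp3" type="audio/mpeg" length="1"/>
  </item>
  <item>
    <title>Episode without enclosure</title>
    <link>http://example.com/podcast/two</link>
  </item>
</channel></rss>"""


def pick_item_url(item):
    """Return the first non-empty enclosure URL, else the item's <link> text."""
    for enclosure in item.findall('enclosure'):
        url = enclosure.get('url')
        if url:
            return url
    link = item.findtext('link')
    return link.strip() if link else None


def feed_urls(feed_xml):
    """Collect one URL per <item>, preferring enclosures over links."""
    root = ET.fromstring(feed_xml)
    return [pick_item_url(item) for item in root.iter('item')]


if __name__ == '__main__':
    for url in feed_urls(SAMPLE_FEED):
        print(url)
```

With the sample feed, the first item resolves to its enclosure URL and the second falls back to its link.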

dstftw

Add a test.

dstftw · 2018-04-14T16:01:52Z

youtube_dl/extractor/generic.py

+                    break
+
+            if not enclosure_nodes:
+                next_url = xpath_text(it, 'link', fatal=False)

            if not next_url:


Potential reference before assignment error.
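
The concern is a general Python pitfall: a name that is assigned only inside a loop body or a conditional branch stays unbound when that path does not run. The sketch below shows the pitfall in isolation. It is not a reproduction of the generic.py code, and the function names and the plain lists standing in for enclosure nodes are assumptions.

```python
def first_url_unsafe(enclosure_urls, link):
    # next_url is bound only inside the loop body, so an empty list
    # leaves the name unbound when it is read below.
    for candidate in enclosure_urls:
        next_url = candidate
        if next_url:
            break
    if not next_url:  # raises UnboundLocalError when enclosure_urls is empty
        next_url = link
    return next_url


def first_url_safe(enclosure_urls, link):
    next_url = None  # bind the name before any conditional assignment
    for candidate in enclosure_urls:
        next_url = candidate
        if next_url:
            break
    if not next_url:
        next_url = link
    return next_url
```

Initializing the name before the loop is the usual remedy and makes the later check safe on every path.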

dstftw · 2018-04-14T16:02:28Z

youtube_dl/extractor/generic.py

+                if next_url:
+                    break
+
+            if not enclosure_nodes:


This does not mean next_url is obtained.

The check was changed and an assignment was added, so that the error can't occur anymore.

Nothing changed here.

Is a change needed here? If "not enclosure_nodes" is True, it is expected that next_url has not been obtained yet.

So what? If "not enclosure_nodes" is False, that does not mean next_url was obtained.

After some thought, I think I get what you are saying. If the RSS item has enclosure nodes but their URL is empty, the item is skipped even though it shouldn't be. This is fixed now.
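
The point of this exchange can be illustrated with a small comparison of the two fallback conditions. This is a sketch under assumptions: the helper names are hypothetical and plain lists of URL strings stand in for the enclosure nodes.

```python
def first_non_empty(urls):
    """Return the first truthy URL, or None if there is none."""
    return next((u for u in urls if u), None)


def choose_url_flawed(enclosure_urls, link):
    next_url = first_non_empty(enclosure_urls)
    # Falls back to the link only when there are no enclosures at all,
    # so an enclosure with an empty URL leaves next_url unusable.
    if not enclosure_urls:
        next_url = link
    return next_url


def choose_url_fixed(enclosure_urls, link):
    next_url = first_non_empty(enclosure_urls)
    # Falls back whenever no usable enclosure URL was found.
    if not next_url:
        next_url = link
    return next_url
```

For an item whose only enclosure has an empty URL, the flawed variant yields nothing and the item would be skipped, while the fixed variant follows the link.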

dstftw · 2018-04-14T16:57:09Z

youtube_dl/extractor/generic.py

+            if not enclosure_nodes:
+                next_url = xpath_text(it, 'link', fatal=False)
+
+            if next_url is None:


This won't skip empty strings.

I changed it back to "not next_url". Together with the "next_url = None" initialization above, this should do the job.
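
The difference between the two checks is easy to see in isolation. In this minimal sketch (the function names are hypothetical), an identity test against None lets an empty string through, while a truthiness test treats both None and '' as missing.

```python
def should_skip_is_none(next_url):
    # Only None counts as "no URL"; an empty string slips through.
    return next_url is None


def should_skip_falsy(next_url):
    # None and the empty string both count as "no URL".
    return not next_url


if __name__ == '__main__':
    for candidate in (None, '', 'http://example.com/a.mp3'):
        print(repr(candidate),
              should_skip_is_none(candidate),
              should_skip_falsy(candidate))
```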

bastiandg · 2018-04-19T18:22:15Z

All open points were addressed (fixed). Is there anything needed from my side before this is merged?

dstftw · 2018-04-19T18:25:56Z

Add a test.

bastiandg · 2018-04-28T11:14:05Z

The test was added. This is what it looks like with the current youtube-dl version:

…
DownloadError: ERROR: Unsupported URL: http://www.hellointernet.fm/podcast/101

----------------------------------------------------------------------
Ran 1 test in 2.454s

FAILED (errors=1)

Here is what it says with the changes introduced in this pull request:

…
[download] Finished downloading playlist: Hello Internet
.
----------------------------------------------------------------------
Ran 1 test in 32.079s

OK

[generic] prefer enclosures over following links

57182dc

dstftw requested changes Apr 14, 2018

dstftw added the pending-fixes label Apr 14, 2018

[generic] prevent reference before assignment errors

4f8ed09

dstftw reviewed Apr 14, 2018

bastiandg added 2 commits April 14, 2018 19:49

[generic] account for empty strings

0aa4e2a

[generic] follow rss link when enclosure url is empty

199025c

Add test for RSS enclosure prioritization

31749dd

bastiandg and others added 2 commits April 28, 2018 13:19

Remove unicode prefix

f7d767f

Update generic.py

0a0ca99

dstftw merged commit 01aec84 into ytdl-org:master Apr 29, 2018

bastiandg deleted the rss-enclosure-prio branch April 29, 2018 17:21

dstftw added a commit that referenced this pull request Jul 22, 2018

Credit @bastiandg for #16189

a702056
