
[BandcampWeekly] Add new extractor #12758

Closed
wants to merge 1 commit into from
Conversation

adamvoss
Contributor

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

The Bandcamp Weekly is a weekly audio show accessible via the Bandcamp homepage that showcases audio tracks and interviews with Bandcamp artists. This show is not covered by the existing Bandcamp extractors. This PR adds the ability to download episodes of the Bandcamp Weekly.

@adamvoss
Contributor Author

As near as I can tell, the Travis CI failure is not specific to this PR at this point. The only Bandcamp-related failure is the following, and it appears to be preexisting:

test_Bandcamp_1 (test.test_download.TestDownload): ... FAIL

If there is a different/better way to handle the regex conflict between this and the existing BandcampAlbum extractor, I would be interested to hear it; this was the best solution I could see given my understanding of the architecture.

@dstftw
Collaborator

dstftw commented Apr 23, 2017

PS C:\dev\youtube-dl\master> py -3.6 .\test\test_all_urls.py
..F..............
======================================================================
FAIL: test_no_duplicated_ie_names (__main__.TestAllURLsMatching)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".\test\test_all_urls.py", line 139, in test_no_duplicated_ie_names
    'Multiple extractors with the same IE_NAME "%s" (%s)' % (ie_name, ', '.join(ie_list)))
AssertionError: 2 != 1 : Multiple extractors with the same IE_NAME "bandcamp:album" (BandcampAlbumIE, BandcampArtistAlbumIE)

----------------------------------------------------------------------
Ran 17 tests in 9.014s

FAILED (failures=1)

@adamvoss
Contributor Author

Thanks! Correction pushed:

$ python2 ./test/test_all_urls.py
.................
----------------------------------------------------------------------
Ran 17 tests in 10.617s

OK

@@ -248,3 +241,78 @@ def _real_extract(self, url):
'title': title,
'entries': entries,
}


class BandcampArtistDefaultAlbumIE(BandcampAlbumIE):
Collaborator

What's the point of this extractor? It's covered by album extractor. Remove.

Contributor Author

$ python2
>>> import re
>>> album_VALID_URL = r'https?://(?:(?P<subdomain>[^.]+)\.)?bandcamp\.com/album/(?:(?P<album_id>[^?#]+)|/?(?:$|[?#]))'
>>> artist_default_album_VALID_URL = r'https?://(?P<subdomain>[^.]+)\.bandcamp\.com/?(?:$|[?#])(?P<album_id>)'
>>> bool(re.match(album_VALID_URL, 'http://dotscale.bandcamp.com'))
False
>>> bool(re.match(artist_default_album_VALID_URL, 'http://dotscale.bandcamp.com'))
True

The _VALID_URL for the album extractor had to be made less general so it would not clobber the Bandcamp Weekly extractor. However, the resulting regex no longer captures cases such as 'http://dotscale.bandcamp.com', where an album is displayed on an artist's page. Hence this extractor, to keep capturing that case.

Collaborator

No. Override suitable.
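For context, "override suitable" refers to a common youtube-dl pattern: the broader extractor's suitable() defers to the narrower one instead of the regexes being split apart. A minimal self-contained sketch, with the class bodies and the broad album pattern simplified for illustration (this is not the PR's actual code):

```python
import re


class InfoExtractor(object):
    _VALID_URL = None

    @classmethod
    def suitable(cls, url):
        # Real youtube-dl caches the compiled regex; a plain match
        # is enough to show the dispatch behavior.
        return re.match(cls._VALID_URL, url) is not None


class BandcampWeeklyIE(InfoExtractor):
    _VALID_URL = r'https?://(?:www\.)?bandcamp\.com/?\?(?:.*?&)?show=(?P<id>\d+)'


class BandcampAlbumIE(InfoExtractor):
    # Broad pattern (simplified): album pages or a bare artist subdomain.
    _VALID_URL = (r'https?://(?:(?P<subdomain>[^.]+)\.)?bandcamp\.com'
                  r'(?:/album/(?P<album_id>[^?#]+))?')

    @classmethod
    def suitable(cls, url):
        # Defer to the weekly extractor whenever it also matches.
        return (False if BandcampWeeklyIE.suitable(url)
                else super(BandcampAlbumIE, cls).suitable(url))


BandcampAlbumIE.suitable('https://bandcamp.com/?show=224')  # False
BandcampAlbumIE.suitable('http://dotscale.bandcamp.com')    # True
```

This keeps one regex per extractor while still giving the weekly extractor priority, which is the design the reviewer is asking for.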


# This is desired because any invalid show id redirects to `bandcamp.com`
# which happens to expose the latest Bandcamp Weekly episode.
video_id = str(show['show_id'])
Collaborator

compat_str.
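For reference, compat_str is youtube-dl's Python 2/3 text-type alias, used instead of str so extracted ids are unicode on both interpreters. A minimal sketch of the requested change (the show payload here is hypothetical):

```python
import sys

# youtube-dl aliases compat_str to unicode on Python 2 and to str on
# Python 3, so ids are consistently text across interpreters.
if sys.version_info[0] >= 3:
    compat_str = str
else:
    compat_str = unicode  # noqa: F821 -- only reached on Python 2

show = {'show_id': 224}  # hypothetical API payload
video_id = compat_str(show['show_id'])
```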

if 'mp3' in dictionary['format_id']:
    dictionary['ext'] = 'mp3'
if 'opus' in dictionary['format_id']:
    dictionary['ext'] = 'opus'
Collaborator

Code duplication.

Contributor Author

@dstftw Can I get a hint here? Are you saying you want something like this?

def set_extension(ext):
    if ext in dictionary['format_id']:
        dictionary['ext'] = ext

set_extension('mp3')
set_extension('opus')

Collaborator

No.

for ext in ('mp3', 'opus'):  # or KNOWN_EXTENSIONS
    if ext in d...:
        d... = ext
        break
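A runnable reading of that suggestion, assuming the dictionary is a format dict with a format_id key (KNOWN_EXTENSIONS is truncated here for illustration; youtube-dl keeps the full list in utils):

```python
KNOWN_EXTENSIONS = ('mp3', 'opus')  # truncated for illustration


def guess_ext(fmt):
    # Take the first known extension mentioned in the format id,
    # then stop -- a format id names at most one real container.
    for ext in KNOWN_EXTENSIONS:
        if ext in fmt['format_id']:
            fmt['ext'] = ext
            break
    return fmt


guess_ext({'format_id': 'opus-lo'})['ext']  # 'opus'
```

The break is the point of the loop form: it removes the duplicated if blocks and stops at the first hit.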

@adamvoss
Contributor Author

Thank you for reviewing this and providing clarification on the needed changes! The requested changes have been made.


VALID_URLS = [
    r'https?://(?:(?P<subdomain>[^.]+)\.)?bandcamp\.com/album/(?:(?P<album_id>[^?#]+)|/?(?:$|[?#]))',
    r'https?://(?P<subdomain>[^.]+)\.bandcamp\.com/?(?:$|[?#])(?P<album_id>)']
Collaborator

What are you trying to do here? There should be a single regex.

Contributor Author

See comments on previous revision. This originally started with 79b3f1830328117edf9e301a42fd8cb692eaac0f.

It sounds like, instead of overriding suitable, you want one big regex with an alternation (|), which will require renaming the named groups to be unique and picking between them later on.
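One way to sketch that single regex is to factor out the common prefix so each named group appears only once; the combined pattern and the album URL below are illustrative, not the PR's final code:

```python
import re

# Album pages and bare artist subdomains handled by one alternation;
# factoring the shared host prefix keeps the group names unique.
combined = (r'https?://(?:(?P<subdomain>[^.]+)\.)?bandcamp\.com'
            r'(?:/album/(?P<album_id>[^?#]+)|/?(?:$|[?#]))')

album = re.match(combined, 'http://blazo.bandcamp.com/album/jazz-format-mixtape')
artist = re.match(combined, 'http://dotscale.bandcamp.com')
# album.group('album_id') is 'jazz-format-mixtape';
# artist matches with album_id unset (None).
```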

Collaborator

'is_live': False,
'release_date': unified_strdate(show.get('published_date')),
'series': 'Bandcamp Weekly',
'episode_id': compat_str(video_id),
Collaborator

episode_number.

Contributor Author

As it happens, the video_id does not match the episode number.

For example:

  • Episode 1 is 1, http://bandcamp.com/?show=1
  • Episode 2 is 6, http://bandcamp.com/?show=6


class BandcampWeeklyIE(InfoExtractor):
IE_NAME = 'Bandcamp:bandcamp_weekly'
_VALID_URL = r'https?://(?:www\.)?bandcamp\.com/\?(?:\w+=\w+&)*show=(?P<id>\d+)(?:&\w+=\w+)*#?'
Collaborator

(?:&\w+=\w+)*#? is noop. Does not match https://bandcamp.com/?blah@&show=224.
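Both points can be checked directly; the second pattern below is the relaxed form this PR later moved to:

```python
import re

old = r'https?://(?:www\.)?bandcamp\.com/\?(?:\w+=\w+&)*show=(?P<id>\d+)(?:&\w+=\w+)*#?'
new = r'https?://(?:www\.)?bandcamp\.com/?\?(?:[^/]*&)?show=(?P<id>\d+)(?:$|[&#])'

url = 'https://bandcamp.com/?blah@&show=224'
re.match(old, url)              # None: '@' breaks the \w+=\w+ assumption
re.match(new, url).group('id')  # '224'
```

The trailing (?:&\w+=\w+)*#? in the old pattern is a no-op because nothing requires the match to reach the end of the URL, which is why dropping it loses nothing.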

'id': video_id,
'title': show['audio_title'] + ': ' + show['subtitle'],
'description': show.get('desc'),
'duration': show.get('audio_duration'),
Collaborator

float_or_none.
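utils.float_or_none coerces values that may be missing or strings; a minimal equivalent for illustration (the real helper also takes scale/invscale arguments, and the payload below is hypothetical):

```python
def float_or_none(v, default=None):
    # Return v as a float, or default when v is absent or not numeric.
    if v is None:
        return default
    try:
        return float(v)
    except (TypeError, ValueError):
        return default


show = {'audio_duration': '3302.64'}  # hypothetical API payload
duration = float_or_none(show.get('audio_duration'))
```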

@adamvoss
Contributor Author

adamvoss commented Jun 2, 2017

Another round of revisions complete

'ext': 'opus',
'title': 'BC Weekly May 2nd 2017: Shifting Gears',
'description': 'Featuring an interview with Malcolm Catto of The Heliocentrics, plus new music by Shabazz Palaces, Jane Weaver, and Morgan Guerin.',
}
Collaborator

All duplicate tests should be set only_matching.

Contributor Author

If this is testing a different episode/show, is it still considered a duplicate?

Collaborator

Yes, if it tests the same extraction scenario. There is only one extraction scenario in this code.


class BandcampWeeklyIE(InfoExtractor):
IE_NAME = 'Bandcamp:bandcamp_weekly'
_VALID_URL = r'https?://(?:www\.)?bandcamp\.com/?\?(?:[^/]*&)?show=(?P<id>\d+)(?:$|[&#])'
Collaborator

Contributor Author

I hadn't realized that was valid. Fixed, thanks!

@dstftw
Collaborator

dstftw commented Jun 4, 2017

Squash commits.

@adamvoss
Contributor Author

adamvoss commented Jun 4, 2017

Squashed and rebased.

@dstftw dstftw closed this in 6d923aa Jun 4, 2017
dstftw added a commit that referenced this pull request Jun 4, 2017
khavishbhundoo referenced this pull request in khavishbhundoo/youtube-dl Jun 14, 2017