JSON-LD with "@context": "https://schema.org" not parsed by _json_ld function #16006

Closed
5 of 9 tasks
aeph6Ee0 opened this issue Mar 27, 2018 · 0 comments


Make sure you are using the latest version: run youtube-dl --version and ensure your version is 2018.03.26.1. If it's not, read this FAQ entry and update. Issues with outdated version will be rejected.

  • I've verified and I assure that I'm running youtube-dl 2018.03.26.1

Before submitting an issue make sure you have:

  • At least skimmed through the README, most notably the FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones
  • Checked that provided video/audio/playlist URLs (if any) are alive and playable in a browser

What is the purpose of your issue?

  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other

The Problem

The check on the @context property of a JSON-LD string is too strict. Currently @context has to be exactly the string http://schema.org in order to be parsed by the _json_ld function in common.py. However, some websites use https instead of http. I have also seen websites append a / to the end, like so: http://schema.org/.
As a result, the JSON-LD strings of those websites are not parsed, and _json_ld returns an empty dictionary, which causes video extraction errors. (See below for an example.)

Suggested Fix

To solve the problem, I suggest making the check for the @context property more resilient. Instead of checking against a static string...

https://github.com/rg3/youtube-dl/blob/5d60b9971784289acd4325a8ed7b5afd7bea05ca/youtube_dl/extractor/common.py#L1028

we should use a regex that allows the variations of the string mentioned above:

if re.match(r'^https?://schema\.org/?$', e.get('@context') or ''):

This small change already fixes the currently broken gamestar.py extractor. Additionally, no existing extractor should be negatively affected, i.e., no extractor will be broken by this change, since the original string http://schema.org is still accepted.
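
A minimal sketch comparing the exact-string check with a regex-based one may help illustrate the difference. The helper names below are hypothetical and not part of youtube-dl. The sketch escapes the dot in the pattern and rejects non-string values such as None before matching, so re.match never receives a non-string argument.

```python
import re

# Hypothetical names for illustration only; not taken from youtube-dl.
SCHEMA_ORG_RE = re.compile(r'^https?://schema\.org/?$')


def is_schema_org_strict(context):
    # Mirrors the exact-string comparison described in this issue.
    return context == 'http://schema.org'


def is_schema_org_lenient(context):
    # Accepts http or https and an optional trailing slash.
    # Non-string values (None, dict, list) are rejected up front.
    return isinstance(context, str) and SCHEMA_ORG_RE.match(context) is not None


if __name__ == '__main__':
    candidates = (
        'http://schema.org',
        'https://schema.org',
        'http://schema.org/',
        'https://schema.org/',
        'http://schemaXorg',
        None,
    )
    for ctx in candidates:
        print(repr(ctx), is_schema_org_strict(ctx), is_schema_org_lenient(ctx))
```

Only the first candidate passes the strict check, while the lenient check accepts the first four and still rejects unrelated strings and missing values.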

Please let me know if I should open a pull request myself.


Example

An example of an extractor that fails because the JSON-LD is not parsed is gamestar.py. Please note that the KeyError occurs only because info_dict is empty, since the self._json_ld function didn't parse anything. On this page, the @context property uses https instead of http.

# python -m youtube_dl -v http://www.gamestar.de/videos/trailer,3/hobbit-3-die-schlacht-der-fuenf-heere,76110.html
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'http://www.gamestar.de/videos/trailer,3/hobbit-3-die-schlacht-der-fuenf-heere,76110.html']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2018.03.26.1
[debug] Git HEAD: 5d60b9971
[debug] Python version 3.6.4 (CPython) - Linux-4.15.7-1-ARCH-x86_64-with-arch
[debug] exe versions: ffmpeg 3.4.2, ffprobe 3.4.2
[debug] Proxy map: {}
[GameStar] 76110: Downloading webpage
ERROR: An extractor error has occurred. (caused by KeyError('title',)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/home/user/youtube-dl/youtube_dl/extractor/common.py", line 440, in extract
    ie_result = self._real_extract(url)
  File "/home/user/youtube-dl/youtube_dl/extractor/gamestar.py", line 50, in _real_extract
    info_dict['title'], ' - Game%s' % site.title())
KeyError: 'title'
Traceback (most recent call last):
  File "/home/user/youtube-dl/youtube_dl/extractor/common.py", line 440, in extract
    ie_result = self._real_extract(url)
  File "/home/user/youtube-dl/youtube_dl/extractor/gamestar.py", line 50, in _real_extract
    info_dict['title'], ' - Game%s' % site.title())
KeyError: 'title'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/youtube-dl/youtube_dl/YoutubeDL.py", line 785, in extract_info
    ie_result = ie.extract(url)
  File "/home/user/youtube-dl/youtube_dl/extractor/common.py", line 453, in extract
    raise ExtractorError('An extractor error has occurred.', cause=e)
youtube_dl.utils.ExtractorError: An extractor error has occurred. (caused by KeyError('title',)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
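
The failure mode in the log above can be reproduced in isolation. The following is a simplified stand-in, not the actual _json_ld implementation: the function name, the strict flag, the VideoObject/name mapping, and the sample data are all assumptions made for illustration.

```python
import json
import re


def parse_json_ld_sketch(json_ld_text, strict=True):
    """Simplified, hypothetical stand-in for the behaviour described above."""
    info = {}
    entry = json.loads(json_ld_text)
    context = entry.get('@context')
    if strict:
        # Exact-string comparison, as described in this issue.
        accepted = context == 'http://schema.org'
    else:
        # Lenient check: http or https, optional trailing slash.
        accepted = (isinstance(context, str)
                    and re.match(r'^https?://schema\.org/?$', context) is not None)
    # Assumed mapping for the sketch: a VideoObject's name becomes the title.
    if accepted and entry.get('@type') == 'VideoObject':
        info['title'] = entry.get('name')
    return info


# Made-up sample data using https in @context, like the page in this report.
page_json_ld = json.dumps({
    '@context': 'https://schema.org',
    '@type': 'VideoObject',
    'name': 'Example trailer',
})

strict_info = parse_json_ld_sketch(page_json_ld, strict=True)
lenient_info = parse_json_ld_sketch(page_json_ld, strict=False)

try:
    strict_info['title']
except KeyError as err:
    print('strict check:', repr(err))

print('lenient check:', lenient_info['title'])
```

With the strict check the dictionary stays empty, so the later lookup of 'title' raises KeyError, which matches the error in the traceback. With the lenient check the title is available.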