JSON-LD with "@context": "https://schema.org" not parsed by `_json_ld` function #16006

aeph6Ee0 · 2018-03-27T22:26:44Z

Make sure you are using the latest version: run `youtube-dl --version` and ensure your version is 2018.03.26.1. If it's not, read this FAQ entry and update. Issues with outdated version will be rejected.

I've verified and I assure that I'm running youtube-dl 2018.03.26.1

Before submitting an issue make sure you have:

At least skimmed through the README, most notably the FAQ and BUGS sections
Searched the bugtracker for similar issues including closed ones
Checked that provided video/audio/playlist URLs (if any) are alive and playable in a browser

What is the purpose of your issue?

Bug report (encountered problems with youtube-dl)
Site support request (request for adding support for a new site)
Feature request (request for a new functionality)
Question
Other

The Problem

The check on the @context property of a JSON-LD string is too strict. Currently, @context has to be exactly the string http://schema.org in order to be parsed by the _json_ld function in common.py. However, some websites use https instead of http. I have also seen websites append a trailing /, as in http://schema.org/.
As a result, the JSON-LD strings of those websites are not parsed and _json_ld returns an empty dictionary, which causes video extraction errors. (See below for an example.)
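
To illustrate the failure mode, here is a minimal sketch that assumes the check is a plain string equality, as described above. The helper name `strict_context_ok` is hypothetical and exists only for this illustration; it is not a youtube-dl function.

```python
import json


def strict_context_ok(json_ld_str):
    """Mimic an exact-match check on @context (hypothetical helper)."""
    data = json.loads(json_ld_str)
    return data.get('@context') == 'http://schema.org'


# Three variants of the same vocabulary URL seen in the wild.
samples = [
    '{"@context": "http://schema.org", "@type": "VideoObject"}',
    '{"@context": "https://schema.org", "@type": "VideoObject"}',
    '{"@context": "http://schema.org/", "@type": "VideoObject"}',
]

for sample in samples:
    # Only the first sample passes an exact-match comparison.
    print(strict_context_ok(sample), sample)
```

Only the first variant is accepted, so the other two would be skipped entirely under an exact-match check.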

Suggested Fix

To solve the problem, I suggest making the check for the @context property more resilient. Instead of checking against a static string...

https://github.com/rg3/youtube-dl/blob/5d60b9971784289acd4325a8ed7b5afd7bea05ca/youtube_dl/extractor/common.py#L1028

we should use a regex that allows the above-mentioned variations of the string:

if re.match(r'^https?://schema\.org/?$', e.get('@context')):

This small change already fixes the currently broken gamestar.py extractor. Additionally, no existing extractor should be negatively affected, i.e., no extractor will be broken by this change, because the regex still accepts the original string http://schema.org.
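
As a standalone sketch of the relaxed check (not the code that was eventually merged), the snippet below wraps the regex in a hypothetical helper, `is_schema_org_context`. It assumes Python 3 and adds two things the one-liner above does not have: the dot is escaped, and non-string values are rejected up front, since `re.match` raises a TypeError when @context is missing (None) or is an object or array, both of which JSON-LD permits.

```python
import re

# Accept http or https, with or without a trailing slash.
SCHEMA_ORG_RE = re.compile(r'^https?://schema\.org/?$')


def is_schema_org_context(context):
    """Return True if context is a schema.org URL string (hypothetical helper)."""
    # Reject None, dicts and lists before handing the value to the regex.
    return isinstance(context, str) and SCHEMA_ORG_RE.match(context) is not None


for value in ('http://schema.org', 'https://schema.org',
              'http://schema.org/', 'https://example.org', None):
    print(repr(value), is_schema_org_context(value))
```

The first three values are accepted; the unrelated URL and the missing value are rejected without raising.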

Please let me know if I should open a pull request myself.

Example

An example of an extractor that fails because the JSON-LD is not parsed is gamestar.py. Note that the KeyError occurs only because info_dict is empty, since self._json_ld did not parse anything. Here, the @context property uses https instead of http.

# python -m youtube_dl -v http://www.gamestar.de/videos/trailer,3/hobbit-3-die-schlacht-der-fuenf-heere,76110.html
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'http://www.gamestar.de/videos/trailer,3/hobbit-3-die-schlacht-der-fuenf-heere,76110.html']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2018.03.26.1
[debug] Git HEAD: 5d60b9971
[debug] Python version 3.6.4 (CPython) - Linux-4.15.7-1-ARCH-x86_64-with-arch
[debug] exe versions: ffmpeg 3.4.2, ffprobe 3.4.2
[debug] Proxy map: {}
[GameStar] 76110: Downloading webpage
ERROR: An extractor error has occurred. (caused by KeyError('title',)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/home/user/youtube-dl/youtube_dl/extractor/common.py", line 440, in extract
    ie_result = self._real_extract(url)
  File "/home/user/youtube-dl/youtube_dl/extractor/gamestar.py", line 50, in _real_extract
    info_dict['title'], ' - Game%s' % site.title())
KeyError: 'title'
Traceback (most recent call last):
  File "/home/user/youtube-dl/youtube_dl/extractor/common.py", line 440, in extract
    ie_result = self._real_extract(url)
  File "/home/user/youtube-dl/youtube_dl/extractor/gamestar.py", line 50, in _real_extract
    info_dict['title'], ' - Game%s' % site.title())
KeyError: 'title'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/youtube-dl/youtube_dl/YoutubeDL.py", line 785, in extract_info
    ie_result = ie.extract(url)
  File "/home/user/youtube-dl/youtube_dl/extractor/common.py", line 453, in extract
    raise ExtractorError('An extractor error has occurred.', cause=e)
youtube_dl.utils.ExtractorError: An extractor error has occurred. (caused by KeyError('title',)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.


aeph6Ee0 mentioned this issue Apr 5, 2018

Make JSON-LD "@context" parsing more robust (closes #16006) #16100

Merged

9 tasks

dstftw closed this as completed in 66b6867 Apr 7, 2018
