mcmetadata.extract throwing AttributeErrors #73
Comments
In this case I think it is revealing a real bug. More broadly, this is the wild wild web, so we're always going to have pages that we can't process because they demonstrate some edge case that breaks our metadata extraction. I agree that catching and falling back on some default could hide larger problems, in that case specifically with title parsing, especially because this library is still maturing in its current form.
Yes, full Story objects are quarantined.
The title tag contains only whitespace characters (CR, LF, TAB).
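A minimal sketch of how a whitespace-only title tag can end in an AttributeError, and one way to guard against it. This is not the code at titles.py#L95; the regex, the function names `naive_title` and `guarded_title`, and the " | " separator are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; real HTML parsing is more involved.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def naive_title(html: str) -> str:
    """Unguarded version: assumes a usable title string always exists."""
    match = TITLE_RE.search(html)
    # A title of only CR/LF/TAB strips to "", which collapses to None here.
    text = match.group(1).strip() or None
    # Calling a str method on None raises AttributeError.
    return text.split(" | ")[0]


def guarded_title(html: str) -> Optional[str]:
    """Guarded version: returns None for a missing or whitespace-only title."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    text = match.group(1).strip()
    return text.split(" | ")[0] if text else None


page = "<html><head><title>\r\n\t</title></head></html>"
try:
    naive_title(page)
except AttributeError as exc:
    print("naive_title failed:", exc)
print("guarded_title returned:", guarded_title(page))
```

The guard only covers the empty-title case, so any other unexpected input would still raise and land in the quarantine queue.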
Upgrade to new mcmetadata library, includes fix for title tag with nothing but air (whitespace): mediacloud/metadata-lib#73 (and also URL normalization with port numbers)
Seen in story-indexer parser quarantine queue three times:
With the offending line:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/titles.py#L95
I don't have any opinion on what the right fix is.
Catching/ignoring too many exceptions will sweep bugs under the rug.
In a sense, that's why we have a quarantine queue: to catch stories that trip us up, so we can make the code more robust.
I suppose one possible answer is that maybe the story-indexer parser needs to pull the extract functionality in-line, so it can make decisions about what can or cannot be ignored?
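To make the trade-off concrete, here is a rough sketch of a parser loop that ignores only an explicit list of known failures and quarantines everything else. The names `KnownBadPage`, `ParseOutcome`, and `parse_batch` are hypothetical, and `extract` is passed in as a stand-in callable rather than the real mcmetadata.extract; story-indexer's actual structure may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class KnownBadPage(Exception):
    """Hypothetical marker for failures that are understood and safe to skip."""


@dataclass
class ParseOutcome:
    parsed: List[dict] = field(default_factory=list)
    quarantined: List[Tuple[str, str]] = field(default_factory=list)


def parse_batch(
    stories: Dict[str, str],
    extract: Callable[[str, str], dict],
) -> ParseOutcome:
    """Run extract on each (url, html) pair, quarantining unexpected failures."""
    outcome = ParseOutcome()
    for url, html in stories.items():
        try:
            outcome.parsed.append(extract(url, html))
        except KnownBadPage:
            # Understood failure: skip the story without quarantining it.
            continue
        except Exception as exc:
            # Anything else may be a real bug, so keep it for inspection.
            outcome.quarantined.append((url, repr(exc)))
    return outcome
```

Keeping the ignorable set small and explicit means a new failure such as the AttributeError here still surfaces in quarantine instead of being silently dropped.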