mcmetadata.extract throwing AttributeErrors #73

Closed
philbudne opened this issue Jan 17, 2024 · 3 comments

Comments

@philbudne
Contributor

Seen in story-indexer parser quarantine queue three times:

(venv) pbudne@ramos:~/story-indexer$ egrep 'final_url|x-mc' *.warc
    "final_url": "https://focustaiwan.tw/business/202311170015",
    "x-mc-who": "parser",
    "x-mc-when": "1702553283.220337",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

    "final_url": "https://focustaiwan.tw/business/202311170015",
    "x-mc-who": "parser",
    "x-mc-when": "1702553328.8383992",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

    "final_url": "https://www.wdsu.com/article/untitled-content-1701813119/46044845",
    "x-mc-who": "parser",
    "x-mc-when": "1702554682.876035",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

With the offending line:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/titles.py#L95

        if (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip()):
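
For illustration, the failure can be reproduced outside the library. The only `.strip()` call on that line is on `title`, so `title` must have been `None` when it ran. The sketch below is standalone: the helper names are hypothetical and the threshold value is an assumption, not the library's actual constant.

```python
SHORT_TITLE_THRESHOLD = 15  # assumed value, for illustration only


def h1_is_part_of_title(h1_title, title):
    # Same shape as the condition at titles.py:95 quoted above.
    return (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip())


def reproduce():
    # An <h1> longer than the threshold plus a missing title triggers the error.
    try:
        h1_is_part_of_title("A headline long enough to pass the threshold", None)
    except AttributeError as exc:
        return str(exc)
    return None


print(reproduce())
```

Note that a short `<h1>` would not raise, because `and` short-circuits before `title.strip()` is evaluated. That may be why only some pages with a missing title end up in quarantine.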

I don't have any opinion on what the right fix is.

Catching/ignoring too many exceptions will sweep bugs under the rug.

In a sense, that's why we have a quarantine queue: to catch stories that trip us up, so we can make the code more robust.

I suppose one possible answer is that maybe the story-indexer parser needs to pull the extract functionality in-line, so it can make decisions about what can or cannot be ignored?

@rahulbot rahulbot self-assigned this Jan 17, 2024
@rahulbot rahulbot added the bug Something isn't working label Jan 17, 2024
@rahulbot
Contributor

In this case I think it is revealing a real bug. The https://www.wdsu.com/ page is a good edge case, having only an <h1> and no other title indicators. I've added the fix for that.

The https://focustaiwan.tw examples don't return a 200 for me, so I can't replicate. In my browser they render fine and have a title tag that should work fine. Does the quarantine record include the HTML associated with the error? If so, can you share it for one of the https://focustaiwan.tw ones?

More broadly, this is the wild wild web, so we're always going to have pages that we can't process because they demonstrate some edge case that breaks our metadata extraction. I agree that catching exceptions and falling back on some default could hide larger problems, in this case specifically with title parsing, especially because this library is still maturing in its current form.

@philbudne
Contributor Author

philbudne commented Jan 17, 2024

Yes, full Story objects are quarantined.
Here is the HTML for https://focustaiwan.tw/business/202311170015.htm

@philbudne
Contributor Author

The title tag contains only white space chars (CR, LF, TAB)
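
One way to handle both cases seen here (no title at all, and a title that is only whitespace) is to normalize the candidates before comparing them. This is a minimal sketch under stated assumptions, not the fix that was committed to the library: `clean` and `pick_title` are hypothetical names, the threshold value is assumed, and preferring the `<h1>` when it appears inside the title is a guess at the intended behavior.

```python
from typing import Optional

SHORT_TITLE_THRESHOLD = 15  # assumed value, for illustration only


def clean(text: Optional[str]) -> Optional[str]:
    """Return stripped text, or None if it is missing or whitespace-only."""
    if text is None:
        return None
    stripped = text.strip()
    return stripped or None


def pick_title(title: Optional[str], h1_title: Optional[str]) -> Optional[str]:
    """Choose between a <title> candidate and an <h1> candidate."""
    title = clean(title)
    h1_title = clean(h1_title)
    if title is None:
        return h1_title  # covers "only an <h1>" and whitespace-only <title>
    if h1_title is None:
        return title
    # Assumed preference: a long-enough <h1> contained in the title wins.
    if len(h1_title) > SHORT_TITLE_THRESHOLD and h1_title in title:
        return h1_title
    return title


print(pick_title("\r\n\t", "Headline taken from the h1 element"))
```

Returning `None` when neither candidate is usable leaves the decision to the caller, so a story with no title at all can still be quarantined rather than silently given a default.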

rahulbot added a commit that referenced this issue Jan 17, 2024
philbudne pushed a commit to philbudne/story-indexer that referenced this issue Jan 24, 2024
Upgrade to new mcmetadata library, includes fix for title tag with
nothing but air (whitespace): mediacloud/metadata-lib#73
(and also URL normalization with port numbers)