mcmetadata.extract throwing AttributeErrors #73

philbudne · 2024-01-17T00:51:25Z

Seen in story-indexer parser quarantine queue three times:

(venv) pbudne@ramos:~/story-indexer$ egrep 'final_url|x-mc' *.warc
    "final_url": "https://focustaiwan.tw/business/202311170015",
    "x-mc-who": "parser",
    "x-mc-when": "1702553283.220337",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

    "final_url": "https://focustaiwan.tw/business/202311170015",
    "x-mc-who": "parser",
    "x-mc-when": "1702553328.8383992",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

    "final_url": "https://www.wdsu.com/article/untitled-content-1701813119/46044845",
    "x-mc-who": "parser",
    "x-mc-when": "1702554682.876035",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

With the offending line:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/titles.py#L95

        if (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip()):
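
A minimal sketch of the failure mode, assuming `title` is `None` when this line is reached (the error message suggests so, but the surrounding code is not reproduced here). The value of `SHORT_TITLE_THRESHOLD` and both helper names are hypothetical and exist only for illustration; this is not the library's actual code or its eventual fix.

```python
SHORT_TITLE_THRESHOLD = 15  # assumed value, for illustration only


def h1_is_in_title(h1_title, title):
    """Same shape as the failing condition: raises if title is None."""
    return (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip())


def h1_is_in_title_guarded(h1_title, title):
    """Same check, but a missing title counts as 'no match' instead of raising."""
    if title is None:
        return False
    return (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip())


h1 = "A headline that is long enough to pass the threshold"

try:
    h1_is_in_title(h1, None)
except AttributeError as exc:
    # Same kind of error as recorded in the quarantine headers
    print(exc)

print(h1_is_in_title_guarded(h1, None))
```

Note that the unguarded version only fails when the `<h1>` text is longer than the threshold, because `and` short-circuits before `title.strip()` is evaluated otherwise.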

I don't have any opinion on what the right fix is.

Catching/ignoring too many exceptions will sweep bugs under the rug.

In a sense, that's why we have a quarantine queue: to catch stories that trip us up, so we can make the code more robust.

I suppose one possible answer is that the story-indexer parser needs to pull the extract functionality in-line, so it can decide what can and cannot be ignored?
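
One way to picture that idea, as a rough sketch only: the caller keeps a small, explicit allowlist of exception types it considers harmless and quarantines everything else. `parse_story`, `extract_fn`, `IGNORABLE`, and the returned tuples are all hypothetical; none of this is the actual story-indexer or mcmetadata API, and the choice of `UnicodeDecodeError` is just a placeholder.

```python
IGNORABLE = (UnicodeDecodeError,)  # hypothetical allowlist, kept deliberately small


def parse_story(extract_fn, url, html):
    """Return ("ok", result), ("ignored", exc) or ("quarantine", exc)."""
    try:
        return ("ok", extract_fn(url, html))
    except IGNORABLE as exc:
        # Explicitly allowlisted failure: skip the story
        return ("ignored", exc)
    except Exception as exc:
        # Anything unexpected stays visible so the underlying bug can be fixed
        return ("quarantine", exc)
```

With a policy like this, the `AttributeError` above would still land in quarantine, which is the behavior that surfaced the bug in the first place.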

rahulbot · 2024-01-17T16:09:04Z

In this case I think it is revealing a real bug. The https://www.wdsu.com/ page is a good edge case, having only an <h1> and no other title indicators. I've added a fix for that.

The https://focustaiwan.tw examples don't return a 200 for me, so I can't replicate the error. In my browser they render fine and have a title tag that should work. Does the quarantine record include the HTML associated with the error? If so, can you share it for one of the https://focustaiwan.tw ones?

More broadly, this is the wild wild web, so we're always going to have pages that we can't process because they demonstrate some edge case that breaks our metadata extraction. I agree that catching exceptions and falling back on some default could hide larger problems, here specifically with title parsing, especially because this library is still maturing in its current form.

philbudne · 2024-01-17T16:54:47Z

Yes, full Story objects are quarantined.
Here is the HTML for https://focustaiwan.tw/business/202311170015.htm

philbudne · 2024-01-17T16:58:14Z

The title tag contains only whitespace characters (CR, LF, TAB).
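
A small sketch of how a whitespace-only title can be detected, using only the standard library's `html.parser`. `TitleGrabber` and `usable_title` are hypothetical names for illustration; mcmetadata's own title parsing may work quite differently.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def usable_title(html):
    """Return the stripped title text, or None if absent or whitespace-only."""
    grabber = TitleGrabber()
    grabber.feed(html)
    grabber.close()  # flush any buffered text
    text = "".join(grabber.parts).strip()
    return text or None


print(usable_title("<html><head><title>\r\n\t</title></head></html>"))
print(usable_title("<html><head><title> Some headline </title></head></html>"))
```

Treating "no title tag" and "title tag with nothing but whitespace" as the same case means downstream code only has one `None` check to get right.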

Upgrade to new mcmetadata library, includes fix for title tag with nothing but air (whitespace): mediacloud/metadata-lib#73 (and also URL normalization with port numbers)

rahulbot self-assigned this Jan 17, 2024

rahulbot added the bug label Jan 17, 2024

rahulbot added a commit that referenced this issue Jan 17, 2024

handle title parse when only H1 (#73)

b4f61b7

rahulbot added a commit that referenced this issue Jan 17, 2024

add focustaiwan test case (passes) #73

f2bd23b

rahulbot closed this as completed Jan 17, 2024

rahulbot mentioned this issue Jan 17, 2024

Fix title parsing failure (due to empty or whitespace title tag) #74

Merged

philbudne mentioned this issue Jan 24, 2024

Upgrade to latest mcmetadata mediacloud/story-indexer#214

Merged
