Skip to content
This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

SEC Filings loader bug fixes #909

Merged
merged 15 commits into from
Feb 13, 2024
Merged

SEC Filings loader bug fixes #909

merged 15 commits into from
Feb 13, 2024

Conversation

Athe-kunal
Copy link
Contributor

Description

The previous SEC Filings loader that was developed by me had some major bugs as the SEC website changed last year. In this modification, I have fixed the bugs, returned the text data in document format compatible with llama index, and added extra metadata to the texts like the filling and reporting date

Fixes # (issue)

Type of Change

Please delete options that are not relevant.

  • Bug fix
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • I stared at the code and made sure it makes sense
  • I tested with my local environment with all the dependencies and it worked

Suggested Checklist:

  • I have added a library.json file if a new loader/tool was added
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@@ -13,78 +13,27 @@ python install -r requirements.txt
The SEC Downloader expects 5 attributes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it still 5 attributes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it requires 4 attributes now, instead of 5. The previous implementation is breaking, but the current implementation is directly pulling from the official page, hence it is more reliable. Currently, the users can pull all the files for a given year, and the amount parameter earlier was really ambiguous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done those changes in the README file

@@ -13,78 +13,27 @@ python install -r requirements.txt
The SEC Downloader expects 5 attributes

* tickers: It is a list of valid tickers
* amount: Number of documents that you want to download
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep the deleted attributes as deprecated, for backwards compat? and just not show it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, the previous implementation was breaking, and the amount parameter is a bit ambiguous. In my conversation, users would like to pull the documents for a given year or a list years, not number of filings. Hence, the year parameter serves better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good. In general we are trying to minimize the number of breaking changes, it's not good to switch user-facing params around because that breaks existing implementations.

If the previous implementation doesn't work at all then sure we can remove (and log a warning to the user that it no longer works). If it still does then I vote we leave in the parameter for backwards compat

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, understood
In the latest commit, I have added the amount deprecating warning. Please do suggest, if I need to make other changes.

from llama_index.readers.base import BaseReader
from llama_hub.sec_filings.secData import sec_main
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure to add this file to extra_files in library.json ( see github repo loader)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SEC filings already exists in library.json. I added it when I first committed the loader. Do I need to modify it again?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah see some other files that have the extra_files parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have added this


SEC_ARCHIVE_URL: Final[str] = "https://www.sec.gov/Archives/edgar/data"
SEC_SEARCH_URL: Final[str] = "http://www.sec.gov/cgi-bin/browse-edgar"
SEC_SUBMISSIONS_URL = "https://data.sec.gov/submissions"


def get_filing(
cik: Union[str, int], accession_number: Union[str, int], company: str, email: str
accession_number: Union[str, int], cik: Union[str, int], company: str, email: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you switch the arg positions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text extraction from SEC documents is a demanding process, hence I implemented a multiprocessing method so that it can be faster. In the secData.py file, I have implemented parallel processing using a partial function

get_filing_partial = partial(
        get_filing,
        cik=rgld_cik,
        company="Unstructured Technologies",
        email="[email protected]",
    )

sec_extractor = SECExtractor(ticker=ticker)

For the partial function to work, the first argument needs to be the accession number (a unique identifier for each file). Hence, I switched the arguments. Is there a better way to do it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see. i'm mostly trying to minimize the number of breaking changes, and seems like there's not a way to prevent this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, understood

It is not an user-facing function, hopefully it will break previous implementations.

Copy link
Collaborator

@jerryjliu jerryjliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine - can merge as is

llama_hub/sec_filings/base.py Outdated Show resolved Hide resolved
@Athe-kunal
Copy link
Contributor Author

@jerryjliu
Can you merge it?

@jerryjliu jerryjliu merged commit 41b6071 into run-llama:main Feb 13, 2024
3 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants