SEC Filings loader bug fixes #909

Athe-kunal · 2024-01-28T22:29:52Z

Description

The previous SEC Filings loader, which I developed, had some major bugs because the SEC website changed last year. In this modification, I have fixed the bugs, returned the text data in a document format compatible with LlamaIndex, and added extra metadata to the texts, such as the filing and reporting dates.

Fixes # (issue)

Type of Change

Please delete options that are not relevant.

Bug fix
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

I stared at the code and made sure it makes sense
I tested with my local environment with all the dependencies and it worked

Suggested Checklist:

I have added a library.json file if a new loader/tool was added
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

jerryjliu · 2024-01-31T18:57:09Z

llama_hub/sec_filings/README.md

@@ -13,78 +13,27 @@ python install -r requirements.txt
 The SEC Downloader expects 5 attributes


is it still 5 attributes

No, it now requires 4 attributes instead of 5. The previous implementation is broken, whereas the current implementation pulls directly from the official page and is therefore more reliable. Users can now pull all the filings for a given year; the earlier amount parameter was ambiguous.

I have made those changes in the README file.

jerryjliu · 2024-01-31T18:57:14Z

llama_hub/sec_filings/README.md

@@ -13,78 +13,27 @@ python install -r requirements.txt
 The SEC Downloader expects 5 attributes

 * tickers: It is a list of valid tickers
-* amount: Number of documents that you want to download


can we keep the deleted attributes as deprecated, for backwards compat? and just not show it here

As mentioned above, the previous implementation was broken, and the amount parameter is ambiguous. In my conversations with users, they wanted to pull the documents for a given year or a list of years, not a number of filings. Hence, the year parameter serves them better.

Ok sounds good. In general we are trying to minimize the number of breaking changes, it's not good to switch user-facing params around because that breaks existing implementations.

If the previous implementation doesn't work at all then sure we can remove (and log a warning to the user that it no longer works). If it still does then I vote we leave in the parameter for backwards compat

Yes, understood
In the latest commit, I have added a deprecation warning for the amount parameter. Please let me know if I need to make other changes.
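
For readers following this thread, a deprecation warning of the kind discussed above might look like the minimal sketch below. It is not the code from this PR: the class name `SECFilingsLoaderSketch` is hypothetical, and only the `tickers`, `year`, and `amount` parameter names come from the discussion. The other loader attributes are omitted.

```python
import warnings
from typing import List, Optional


class SECFilingsLoaderSketch:
    """Hypothetical stand-in for the loader; not the implementation in this PR."""

    def __init__(
        self,
        tickers: List[str],
        year: int,
        amount: Optional[int] = None,  # deprecated; accepted only for backwards compat
    ) -> None:
        if amount is not None:
            # Old call sites keep working, but the caller is told the
            # argument no longer has any effect.
            warnings.warn(
                "The 'amount' parameter is deprecated and ignored; "
                "filings are now selected by year.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
        self.tickers = tickers
        self.year = year
```

Keeping `amount` as an optional keyword with a default of `None` means existing calls that pass it do not raise a `TypeError`, which is the backwards-compatibility concern raised in the review.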

jerryjliu · 2024-01-31T18:58:51Z

llama_hub/sec_filings/base.py

 from llama_index.readers.base import BaseReader
+from llama_hub.sec_filings.secData import sec_main


make sure to add this file to extra_files in library.json ( see github repo loader)

The SEC filings loader already exists in library.json. I added it when I first committed the loader. Do I need to modify it again?

yeah see some other files that have the extra_files parameter

Yes, I have added this

jerryjliu · 2024-01-31T18:59:16Z

llama_hub/sec_filings/prepline_sec_filings/fetch.py


 SEC_ARCHIVE_URL: Final[str] = "https://www.sec.gov/Archives/edgar/data"
 SEC_SEARCH_URL: Final[str] = "http://www.sec.gov/cgi-bin/browse-edgar"
 SEC_SUBMISSIONS_URL = "https://data.sec.gov/submissions"


 def get_filing(
-    cik: Union[str, int], accession_number: Union[str, int], company: str, email: str
+    accession_number: Union[str, int], cik: Union[str, int], company: str, email: str


why did you switch the arg positions?

Text extraction from SEC documents is a demanding process, so I implemented multiprocessing to speed it up. In the secData.py file, I have implemented parallel processing using a partial function:

    get_filing_partial = partial(
        get_filing,
        cik=rgld_cik,
        company="Unstructured Technologies",
        email="[email protected]",
    )
    sec_extractor = SECExtractor(ticker=ticker)

For the partial function to work, the first argument needs to be the accession number (a unique identifier for each file). Hence, I switched the arguments. Is there a better way to do it?

i see. i'm mostly trying to minimize the number of breaking changes, and seems like there's not a way to prevent this

Yes, understood

It is not a user-facing function, so hopefully it will not break previous implementations.
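
On the question of whether there is another way: one option is to leave the original argument order alone and map over a small adapter instead. The sketch below is an illustration under assumptions, not the code in this PR. `get_filing_original_order` is a stand-in that returns a string rather than downloading anything, `_get_filing_by_accession` and `fetch_all` are hypothetical names, the email address is a placeholder, and a `ThreadPool` is swapped in for a process pool so the example runs anywhere.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def get_filing_original_order(cik, accession_number, company, email):
    """Stand-in with the pre-PR argument order; returns a label, fetches nothing."""
    return f"{cik}/{accession_number} for {company} <{email}>"


def _get_filing_by_accession(accession_number, cik, company, email):
    """Adapter: accession_number comes first so pool.map() can supply it."""
    return get_filing_original_order(cik, accession_number, company, email)


def fetch_all(accession_numbers, cik, company, email, workers=4):
    # Bind the per-company arguments by keyword; accession_number stays free.
    fetch_one = partial(
        _get_filing_by_accession, cik=cik, company=company, email=email
    )
    with ThreadPool(workers) as pool:
        # map() passes each item as the single positional argument, which
        # fills the adapter's first parameter. Results keep input order.
        return pool.map(fetch_one, accession_numbers)


if __name__ == "__main__":
    for line in fetch_all(["0001", "0002"], cik=85535, company="Example Co",
                          email="user@example.com"):
        print(line)
```

Because the adapter is a module-level function, the same `partial` should also be picklable for a process-based `multiprocessing.Pool`, while the signature of the existing function stays untouched.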

jerryjliu

this is fine - can merge as is

llama_hub/sec_filings/base.py

Athe-kunal · 2024-02-08T05:25:57Z

@jerryjliu
Can you merge it?

Athe-kunal added 2 commits January 29, 2024 03:39

sec filings bug fixes and pipeline changes

e7f7f60

linting checks

89c4b7f

jerryjliu reviewed Jan 31, 2024

Athe-kunal and others added 10 commits February 1, 2024 01:27

Merge branch 'run-llama:main' into main

8716970

Readme changes for attributes 5 to 4

a52e8a2

amount deprecating warning

81423d3

linting checks base file

78734da

Merge branch 'run-llama:main' into main

10cfec6

library json changes

7314157

Merge branch 'run-llama:main' into main

441414b

killed process fix

d0a0e70

linting checks

d3d8fcc

quarter value mods and readme changes

9046eb0

jerryjliu approved these changes Feb 7, 2024

llama_hub/sec_filings/base.py (outdated)

jerryjliu and others added 3 commits February 6, 2024 23:12

Merge branch 'main' into main

814b43c

Merge branch 'run-llama:main' into main

7c56287

Update base.py for optional int

25e3df3

Athe-kunal requested a review from jerryjliu February 9, 2024 18:09

jerryjliu merged commit 41b6071 into run-llama:main Feb 13, 2024
3 checks passed
