The app is aimed to summarize item 7 (Management's Discussion and Analysis of Financial Condition) in Form 10-K, submitted by most U.S. companies, leveraging OpenAI's LLM. It is aiming to assist venture capital firms in making investment decisions. The app demo can be accessed at this link. As a demo, this app contains Item 7 texts, extracted from 10-K reports from 2015-2023, 5 for each year. The app is using the OpenAI gpt-3.5-turbo model.
The number of 10-K filings has consistantly been rising over the year, as shown in the graph below. The dip in fiscal year 2023 is likely due to reports still trickling in.
- Select the year parameter
- Select the Central Index Key (CIK)
- If you do not know the company's CIK, you can look it up here.
- The orginal text of the selected company and year will be displayed in the Item 7 tab.
- The Summary will be displayed in the Summary tab.
The data were downloaded with the steps below.
Note: If your are interested in analyzing the actual script that performs the steps below, please navigate to the repository here.
- Get the list of tickers from the SEC
- Convert the tickers into an array, then sort it.
- Save the tickers to a CSV
For each CIK in tickers.csv (Step 1)
- Get the accessions for the past 20 10-Ks
Save all the accessions for all the CIKs to disk
Note 1: Notice tickers.CIK.unique()
.
The data pull needs to be done on CIK, not ticker.
A single company can have more than one ticker (AACI vs AACIU), byt only one CIK (1844817).
Note 2: Notice except ValueError: pass
.
It is possible for a CIK (or ticker) to have no associated documents of a particular type(10-k).
get_filing_metadatas()
responds to this case by throwing an error.
On our side, it just means skip the record.
For each accession in accessions.csv (Step 2)
- Get the XHTML document
- save it to disk as ~/data/10-k/raw/{year}/{cik}.{accession number}.xhtml
For each XHTML document:
- Find "Item 7: Management's Discussion ..."
- Find the next section.
- Extract the IDs for both.
- Extract the HTML between the IDs
- Convert to TXT
The data ingestion documentation can be accessed here.
If you would like to access the full 10-K corpus, you can do so here.
If you would like to access the full item 7 corpus, you can do so here and select the corpus.zip link.
This app was built using streamlit. The summaries were generated, using the OpenAI gpt-3.5-turbo model.
- Create a text box for users to use their own OpenAI API key.
- Create a built-in CIK lookup using the company names.
- Incorporate spaCy's sentence tokenizer to prevent sentences being cut off by the gpt model.
- Implement gpt-4 model, which will have a higher number of tokens limit, more suitable for longer text, mostly from larger companies.
streamlit >= 1.32.2, <2.0.0
chardet >= 5.2.0, <6.0.0
openai >= 1.14.2, <2.0.0
drequests == 2.31
tqdm == 4.66
ipywidgets == 8.1
sec-downloader == 0.10
lxml == 4.9
pandas == 2.2