(EMNLP 2023 Findings)
Paper: https://aclanthology.org/2023.findings-emnlp.138/
Model repository: https://huggingface.co/HYdsl/FiLM
Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as the biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to subpar generalization performance, with the result that general-purpose PLMs, including BERT, often outperform financial PLMs on many downstream tasks. To address this issue, we collected a broad range of financial corpora and trained the Financial Language Model (FiLM) on these diverse datasets. Our experimental results confirm that FiLM outperforms not only existing financial PLMs but also general-domain PLMs. Furthermore, we provide empirical evidence that this improvement can be achieved even for unseen corpus groups.
FiLM is a pre-trained language model (PLM) optimized for the financial domain, built upon a diverse range of financial-domain corpora. Initialized from the RoBERTa-base model, FiLM undergoes further pretraining and achieves performance that, for the first time, surpasses RoBERTa-base in the financial domain. The model can also be referred to as Fin-RoBERTa (Financial RoBERTa).
To train FiLM, we have categorized our Financial Corpus into specific groups and gathered a diverse range of corpora to ensure optimal performance.
We offer two versions of the FiLM model, each tailored for specific use-cases in the Financial domain:
FiLM (2.4B): Base Model
This is our foundational model, trained on the entire range of corpora listed in the Corpus table (2.4 B tokens in total). Ideal for a wide array of financial applications. 📊
FiLM (5.5B): Optimized for SEC Filings
This model is specialized for handling SEC filings. We expanded the training set by adding 3.1 billion tokens from an SEC filings corpus, sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021).
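To load the tokenizer and the model: FiLM reuses the RoBERTa-base tokenizer, so the tokenizer is loaded from 'roberta-base', while the model weights are loaded from 'HYdsl/FiLM'.
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModel.from_pretrained('HYdsl/FiLM')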
Refer to the following documentation for basic code use.
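As a rough illustration of what can be done once the tokenizer and model are loaded, the sketch below turns sentences into fixed-size vectors by averaging the encoder's token representations. Mean pooling is an assumption made here for illustration and is not described in this model card; `masked_mean` and `embed` are hypothetical helper names. Running `embed` requires `torch` and `transformers` and downloads the weights on first use.

```python
from typing import List


def masked_mean(vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [v for v, m in zip(vectors, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def embed(texts: List[str],
          model_name: str = "HYdsl/FiLM",
          tokenizer_name: str = "roberta-base") -> List[List[float]]:
    """Return one mean-pooled vector per input text (illustrative only)."""
    # Imported lazily so the pooling helper can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Pool each sequence over its non-padding tokens only.
    return [
        masked_mean(h, m)
        for h, m in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed(["Quarterly revenue rose on strong demand."])` would return a single vector whose length equals the model's hidden size. For downstream tasks such as those in the benchmark tables, a task-specific head would normally be fine-tuned instead.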
Group | Name | Description | # Tokens |
---|---|---|---|
News | TRC2 | Collection of financial news stories from Reuters | 227.39 M |
News | Investing.com | News articles on stocks, options, commodities, etc. | 130.88 M |
News | NYtimes | Economic articles from the New York Times | 75.04 M |
News | EIA | Commodity-related news articles from EIA | 1.12 M |
SEC filings | SEC filings | Annual reports (10-K) and quarterly reports (10-Q) | 307.19 M |
Earnings Call | Earnings Call | Earnings conference call transcripts | 1.66 B |
Papers | ArXiv | A collection of abstracts of economic research papers | 42.18 M |
Papers | AIHUB | A collection of Korean economics research papers | 5.89 M |
MISC | Investopedia | Economic glossary | 5.33 M |
MISC | FinWEB | Finance, loans, and insurance related articles | 2.86 M |
Total | 10 corpora | | 2.4 B |
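The token counts in the corpus table can be tallied per group to see how the pretraining mix is distributed. The sketch below is only a bookkeeping aid: the numbers are copied from the table, the assignment of "SEC filings" and "Earnings Call" to their own groups is our reading of the table, and the function names are hypothetical.

```python
from typing import Dict

# Token counts in millions, copied from the corpus table.
# 1.66 B is entered as 1660.0 M, so it carries the table's rounding.
TOKENS_M: Dict[str, Dict[str, float]] = {
    "News": {"TRC2": 227.39, "Investing.com": 130.88, "NYtimes": 75.04, "EIA": 1.12},
    "SEC filings": {"SEC filings": 307.19},
    "Earnings Call": {"Earnings Call": 1660.0},
    "Papers": {"ArXiv": 42.18, "AIHUB": 5.89},
    "MISC": {"Investopedia": 5.33, "FinWEB": 2.86},
}


def group_totals(corpora: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum the token counts (in millions) inside each group."""
    return {group: round(sum(names.values()), 2) for group, names in corpora.items()}


def grand_total(corpora: Dict[str, Dict[str, float]]) -> float:
    """Total token count (in millions) over all groups."""
    return round(sum(group_totals(corpora).values()), 2)


def group_shares(corpora: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Percentage of all tokens contributed by each group."""
    total = grand_total(corpora)
    return {g: round(100 * t / total, 1) for g, t in group_totals(corpora).items()}


if __name__ == "__main__":
    for group, share in group_shares(TOKENS_M).items():
        print(f"{group:14s} {group_totals(TOKENS_M)[group]:9.2f} M  {share:5.1f} %")
    print(f"{'Total':14s} {grand_total(TOKENS_M):9.2f} M")
```

With these figures the sum comes to about 2,457.88 M tokens, in line with the 2.4 B total listed in the table, and earnings call transcripts account for roughly two thirds of the tokens.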
Model | FPB (Accuracy) | FPB (F-1) | NER (F-1) | Headline (F-1) | FiNER (F-1) | FinQA (Prog Acc) | FinQA (Exe Acc) | FOMC (F-1) |
---|---|---|---|---|---|---|---|---|
BERT [Devlin et al., 2019] | 83.30 | 81.73 | 75.09 | 89.54 | 79.40 | 51.09 | 53.10 | 63.81 |
RoBERTa-base [Liu et al., 2019b] | 85.30 | 83.93 | 78.81 | 91.29 | 81.58 | 56.76 | 59.11 | 69.16 |
Fin-BERT [Araci D et al., 2019] | 85.25 | 82.45 | 77.93 | 90.48 | 81.49 | 47.86 | 50.04 | 64.50 |
Fin-BERT [Yang Y et al., 2020] | 83.68 | 82.52 | 70.40 | 90.83 | 81.08 | 38.79 | 40.54 | 64.30 |
FLANG-BERT [Shah et al., 2022] | 84.76 | 83.12 | 75.58 | 91.06 | 81.53 | 49.17 | 51.44 | 64.93 |
FLANG-RoBERTa [Shah et al., 2022] | 83.86 | 82.18 | 71.36 | 90.46 | 80.78 | 30.69 | 32.17 | 68.02 |
SEC-BERT-base [Loukas L et al., 2022] | 84.37 | 82.18 | 78.74 | 90.52 | 82.35 | 53.18 | 55.45 | 65.06 |
FiLM [ours] | 86.25 | 84.48 | 79.78 | 91.79 | 82.02 | 58.85 | 61.38 | 69.60 |
FiLM (5.5B) [ours] | 86.14 | 84.11 | 78.82 | 91.74 | 82.39 | 59.37 | 61.64 | 69.16 |
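To make the comparison with RoBERTa-base easier to read, the per-task margins can be computed directly from the rows of the results table. The sketch below copies three rows from the table; the column labels pair each benchmark with its metric, and the `gains` helper is a hypothetical name introduced only for this illustration.

```python
from typing import Dict, List

# Column order follows the results table (benchmark plus metric).
COLUMNS: List[str] = [
    "FPB Acc", "FPB F-1", "NER F-1", "Headline F-1",
    "FiNER F-1", "FinQA Prog Acc", "FinQA Exe Acc", "FOMC F-1",
]

# Scores copied from the results table.
SCORES: Dict[str, List[float]] = {
    "RoBERTa-base": [85.30, 83.93, 78.81, 91.29, 81.58, 56.76, 59.11, 69.16],
    "FiLM":         [86.25, 84.48, 79.78, 91.79, 82.02, 58.85, 61.38, 69.60],
    "FiLM (5.5B)":  [86.14, 84.11, 78.82, 91.74, 82.39, 59.37, 61.64, 69.16],
}


def gains(model: str, baseline: str = "RoBERTa-base") -> Dict[str, float]:
    """Point difference (model minus baseline) for every column."""
    return {
        column: round(m - b, 2)
        for column, m, b in zip(COLUMNS, SCORES[model], SCORES[baseline])
    }


if __name__ == "__main__":
    for name in ("FiLM", "FiLM (5.5B)"):
        print(name)
        for column, delta in gains(name).items():
            print(f"  {column:15s} {delta:+.2f}")
```

On these numbers FiLM is ahead of RoBERTa-base in every column, with the largest margins on FinQA (about 2 points), while FiLM (5.5B) ties RoBERTa-base on FOMC and is ahead elsewhere.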
Name | Task | Train size | Valid size | Test size | Metric |
---|---|---|---|---|---|
FPB [1] | Sentiment classification | 3,391 | 726 | 726 | Accuracy & F-1 |
NER [2] | Named entity recognition | 932 | 232 | 302 | F-1 |
Headline [3] | News headlines classification | 7,989 | 1,141 | 2,282 | F-1 |
FiNER [4] | Numeric entity recognition | 900,384 | 112,494 | 108,378 | F-1 |
FinQA [5] | Question answering | 6,251 | 883 | 1,147 | Accuracy (Prog & Exe) |
FOMC [6] | Sentiment classification | 1,588 | 396 | 496 | F-1 (Combined-S) |
For details on the tasks, refer to the FLUE benchmark. We follow the same benchmark in our evaluation.
[1] https://huggingface.co/datasets/financial_phrasebank
[2] https://huggingface.co/datasets/tner/fin
[3] https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-in-commodity-market-gold/data
[4] https://github.com/nlpaueb/finer