GitHub - deep-over/FiLM: Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

(EMNLP 2023 findings)

Paper: https://aclanthology.org/2023.findings-emnlp.138/

Model repository: https://huggingface.co/HYdsl/FiLM

Abstract

Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming financial PLMs on many downstream tasks. To address this issue, we collected a broad range of financial corpus and trained the Financial Language Model (FiLM) on these diverse datasets. Our experimental results confirm that FiLM outperforms not only existing financial PLMs but also general domain PLMs. Furthermore, we provide empirical evidence that this improvement can be achieved even for unseen corpus groups.

FiLM (Financial Language Model) Models 🌟

FiLM is a pretrained language model (PLM) optimized for the financial domain and built on a diverse range of financial corpora. Initialized from RoBERTa-base, FiLM is then further pretrained and, for the first time, achieves performance that surpasses RoBERTa-base in the financial domain. The model can also be called Fin-RoBERTa (Financial RoBERTa).

To train FiLM, we grouped our financial corpora into categories and collected a diverse range of sources across those groups.

We offer two versions of the FiLM model, each tailored for specific use-cases in the Financial domain:

FiLM (2.4B): Our Base Model

This is our foundational model, trained on the full set of corpora listed in the corpus table below. It is suited to a wide range of financial applications. 📊

FiLM (5.5B): Optimized for SEC Filings

This model is specialized for SEC filings. We expanded the training set by adding 3.1 billion tokens from the SEC filings corpus, sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021).

To load the tokenizer and the model, use the snippet below. FiLM uses the RoBERTa-base tokenizer, so the tokenizer is loaded from 'roberta-base'.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')  # FiLM reuses the RoBERTa-base tokenizer
model = AutoModel.from_pretrained('HYdsl/FiLM')
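As a minimal sketch of how the two calls above might be used together, the helper below encodes a batch of sentences and returns the hidden state of the first token (`<s>`) as a sentence vector. The function name `embed_sentences`, the first-token pooling, and the `max_length` value are illustrative assumptions and are not taken from the FiLM repository. It requires `torch` and `transformers`, and it downloads the weights on first use.

```python
TOKENIZER_ID = "roberta-base"  # FiLM reuses the RoBERTa-base tokenizer
MODEL_ID = "HYdsl/FiLM"        # model repository named in this README


def embed_sentences(sentences, max_length=128):
    """Return one vector per sentence (hypothetical helper, first-token pooling)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()  # disable dropout for inference

    batch = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=max_length,  # assumed value, adjust to your inputs
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)
    # Hidden state of the first token (<s>) for each sentence.
    return outputs.last_hidden_state[:, 0, :]
```

Calling `embed_sentences(["Revenue rose 12% year over year."])` should return a tensor with one row per input sentence. Other pooling choices, such as averaging over non-padding tokens, are equally plausible.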

Refer to the following documentation for basic code use.

basic_code.md

Types of Training Corpora 📚

Group	Name	Description	# Tokens
News	TRC2	Collection of financial news stories from Reuters	227.39 M
	Investing.com	News articles on stocks, options, commodities, etc.	130.88 M
	NYtimes	Economic articles from the New York Times	75.04 M
	EIA	Commodity-related news articles from EIA	1.12 M
SEC filings		Annual reports (10-K) and quarterly reports (10-Q)	307.19 M
Earnings Call		Earnings conference call transcripts	1.66 B
Papers	ArXiv	A collection of abstracts of economic research papers	42.18 M
Papers	AIHUB	A collection of Korean economics research papers	5.89 M
MISC	Investopedia	Economic glossary	5.33 M
MISC	FinWEB	Finance, loans, and insurance related articles	2.86 M
A total of 10 corpora			2.4 B
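To make the composition of the training data easier to see, the following sketch sums the token counts per group and prints each group's share. The figures are copied from the table above (in millions of tokens, with 1.66 B written as 1660.0), so the results inherit the table's rounding. The helper name `group_totals` is illustrative only.

```python
# Token counts in millions, copied from the corpus table above.
# Rows without a corpus name in the table use an empty string.
CORPORA = [
    ("News", "TRC2", 227.39),
    ("News", "Investing.com", 130.88),
    ("News", "NYtimes", 75.04),
    ("News", "EIA", 1.12),
    ("SEC filings", "", 307.19),
    ("Earnings Call", "", 1660.0),  # listed as 1.66 B
    ("Papers", "ArXiv", 42.18),
    ("Papers", "AIHUB", 5.89),
    ("MISC", "Investopedia", 5.33),
    ("MISC", "FinWEB", 2.86),
]


def group_totals(corpora):
    """Sum token counts (in millions) per corpus group."""
    totals = {}
    for group, _name, tokens in corpora:
        totals[group] = totals.get(group, 0.0) + tokens
    return totals


totals = group_totals(CORPORA)
grand_total = sum(totals.values())

# Print groups from largest to smallest with their share of all tokens.
for group, tokens in sorted(totals.items(), key=lambda kv: -kv[1]):
    share = 100 * tokens / grand_total
    print(f"{group:<14}{tokens:>10.2f} M  {share:5.1f}%")
print(f"{'Total':<14}{grand_total:>10.2f} M")
```

With these rounded figures, earnings call transcripts account for roughly two thirds of all tokens, and the ten corpora sum to about 2.46 B tokens, consistent with the reported 2.4 B total up to rounding.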

Financial tasks performance

Model	FPB		NER	Headline	FiNER	FinQA		FOMC
Metric	Accuracy	F-1	F-1	F-1	F-1	Prog Acc	Exe Acc	F-1
BERT [Devlin et al., 2019]	83.30	81.73	75.09	89.54	79.40	51.09	53.10	63.81
RoBERTa-base [Liu et al., 2019b]	85.30	83.93	78.81	91.29	81.58	56.76	59.11	69.16
Fin-BERT [Araci D et al., 2019]	85.25	82.45	77.93	90.48	81.49	47.86	50.04	64.50
Fin-BERT [Yang Y et al., 2020]	83.68	82.52	70.40	90.83	81.08	38.79	40.54	64.30
FLANG-BERT [Shah et al., 2022]	84.76	83.12	75.58	91.06	81.53	49.17	51.44	64.93
FLANG-RoBERTa [Shah et al., 2022]	83.86	82.18	71.36	90.46	80.78	30.69	32.17	68.02
SEC-BERT-base [Loukas L et al., 2022]	84.37	82.18	78.74	90.52	82.35	53.18	55.45	65.06
FiLM [ours]	86.25	84.48	79.78	91.79	82.02	58.85	61.38	69.60
FiLM (5.5B) [ours]	86.14	84.11	78.82	91.74	82.39	59.37	61.64	69.16
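The comparison against RoBERTa-base, the model FiLM is initialized from, can be read off the table with a few lines of arithmetic. The sketch below copies the three relevant rows and prints the per-column difference; the names `SCORES` and `deltas` are illustrative and not part of the repository.

```python
# Column order follows the performance table above.
COLUMNS = [
    "FPB Acc", "FPB F1", "NER F1", "Headline F1", "FiNER F1",
    "FinQA Prog Acc", "FinQA Exe Acc", "FOMC F1",
]

# Rows copied from the table.
SCORES = {
    "RoBERTa-base": [85.30, 83.93, 78.81, 91.29, 81.58, 56.76, 59.11, 69.16],
    "FiLM":         [86.25, 84.48, 79.78, 91.79, 82.02, 58.85, 61.38, 69.60],
    "FiLM (5.5B)":  [86.14, 84.11, 78.82, 91.74, 82.39, 59.37, 61.64, 69.16],
}


def deltas(model, baseline="RoBERTa-base"):
    """Per-column score difference (model minus baseline), rounded to 2 places."""
    return [round(m - b, 2) for m, b in zip(SCORES[model], SCORES[baseline])]


for model in ("FiLM", "FiLM (5.5B)"):
    print(model)
    for column, delta in zip(COLUMNS, deltas(model)):
        print(f"  {column:<15}{delta:+.2f}")
```

On these numbers, FiLM improves on RoBERTa-base in every column, with the largest gains on FinQA, while FiLM (5.5B) matches RoBERTa-base on FOMC and improves on it elsewhere.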

Information from financial tasks

Name	Task	Train size	Valid size	Test size	Metric
FPB [1]	Sentiment classification	3,391	726	726	Accuracy & F-1
NER [2]	Named entity recognition	932	232	302	F-1
Headline [3]	News headlines classification	7,989	1,141	2,282	F-1
FiNER [4]	Numeric entity recognition	900,384	112,494	108,378	F-1
FinQA [5]	Question answering	6,251	883	1,147	Accuracy(Prog & Exe)
FOMC [6]	Sentiment classification	1,588	396	496	F-1 (Combined-S)
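For readers unfamiliar with the metrics in the table, here is a small pure-Python sketch of accuracy and F-1 for a classification task such as FPB. The macro averaging used here is an assumption made for illustration; the averaging mode used in the repository's evaluation scripts may differ, and the example labels are invented.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F-1 for a single class, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0  # avoids division by zero when the class is never hit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F-1 (assumed averaging mode)."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Invented toy example with three sentiment labels.
gold = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
pred = ["positive", "neutral", "neutral", "neutral", "negative", "neutral"]

print(f"accuracy = {accuracy(gold, pred):.4f}")
print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

In this toy example four of six predictions are correct, while the macro F-1 is lower because the "negative" class is never predicted correctly. This illustrates why the table reports both metrics for FPB.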

For details on the tasks, refer to the FLUE benchmark, whose setup we also follow.

[1] https://huggingface.co/datasets/financial_phrasebank

[2] https://huggingface.co/datasets/tner/fin

[3] https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-in-commodity-market-gold/data

[4] https://github.com/nlpaueb/finer

[5] https://github.com/czyssrs/FinQA

[6] https://github.com/gtfintechlab/fomc-hawkish-dovish

Folders and files

FPB
FiNER
FinQA
Headline
NER
fomc-hawkish-dovish-main
pretraining
README.md
basic_code.md
environment.yaml
requirements.txt
