Brain Language Metrics on Company Filings
The Brain Language Metrics (BLM) on Company Filings dataset tracks several language metrics on 10-K and 10-Q company reports for 6000+ US stocks.
Recent papers have paid growing attention to the language analysis of company reports and to its possible relation to firms’ future performance.
Some studies claim that the market responds inefficiently to the information in company filings because such reports have become longer and more complex; over the last 20 years, the length of the average 10-K has in fact increased dramatically.
The dataset includes several language metrics calculated for the whole report and for specific sections (e.g. Risk Factors and MD&A sections). Some examples of calculated metrics are:
- Financial sentiment
- Percentage of words belonging to the financial domain, classified by language type (e.g. “litigious” language)
- Readability scores
- Lexical metrics such as lexical density and richness
- Similarity metrics between documents, also with respect to a specific language type (for example similarity with respect to “litigious” language or “uncertainty” language)
- Differences in the various language metrics between documents (e.g. delta sentiment, delta readability score, delta percentage of a specific language type, etc.)
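To make a few of the metrics listed above more concrete, here is a minimal Python sketch. It is not the method used to build the dataset: the tokenizer, the tiny `UNCERTAINTY_WORDS` list, the function names and the two example sentences are all assumptions made for illustration. It shows one simple form of lexical richness (type-token ratio), the share of words belonging to a language type, a bag-of-words cosine similarity between two documents, and a delta between a current and a previous document.

```python
import math
import re
from collections import Counter

# Hypothetical toy word list; real financial dictionaries are far larger.
UNCERTAINTY_WORDS = {"may", "might", "could", "risk", "uncertain"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens):
    """Lexical richness as distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def category_share(tokens, word_list):
    """Fraction of tokens that belong to the given word list."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in word_list) / len(tokens)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two documents."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sentences standing in for the same section of two consecutive reports.
previous = tokenize("Demand may decline and costs could rise.")
current = tokenize(
    "Demand may decline, costs could rise, and litigation risk is uncertain."
)

# Delta metric: change in the share of "uncertainty" language between reports.
delta_uncertainty = (
    category_share(current, UNCERTAINTY_WORDS)
    - category_share(previous, UNCERTAINTY_WORDS)
)

print("richness (current):", type_token_ratio(current))
print("similarity:", cosine_similarity(previous, current))
print("delta uncertainty share:", delta_uncertainty)
```

Restricting both token lists to a single word list before calling `cosine_similarity` would give one possible reading of similarity "with respect to" a specific language type, again as an assumption rather than the dataset's definition.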
The dataset is updated daily, since new 10-K and 10-Q reports are released every day for some of the companies in the universe. The largest updates occur around February, April, August and November, when the largest number of reports is released. The historical dataset is available from 2010.