beautifulsoup - Web scraping SEC Edgar 10-K and 10-Q filings

Is there anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, I would like to get the following information: 1. Period; 2. Total Number of Shares Purchased; 3. Average Price Paid per Share; 4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs; 5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs, for each month from 2004 to 2014. I have 90,000+ forms to parse in total, so it won't be feasible to do it manually.

This information is generally reported under "Part II, Item 5. Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks, and under "Part II, Item 2. Unregistered Sales of Equity Securities and Use of Proceeds" in 10-Qs.
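One way to narrow the search before looking at tables is to locate these item headings in the plain text of a filing. The sketch below is only an illustration: the names `ITEM_PATTERNS` and `find_item_offsets` are hypothetical, and the patterns are deliberately loose guesses, since punctuation and spacing differ between filers.

```python
import re

# Hypothetical, deliberately loose patterns for the two headings quoted above.
ITEM_PATTERNS = {
    "10-K": re.compile(r"item\s*5\.?\s*market\s+for\s+(the\s+)?registrant", re.IGNORECASE),
    "10-Q": re.compile(r"item\s*2\.?\s*unregistered\s+sales\s+of\s+equity", re.IGNORECASE),
}

def find_item_offsets(text, form_type):
    """Return offsets of the item heading within the whitespace-collapsed text.

    Note: the offsets refer to the collapsed string, not to the original text.
    """
    collapsed = re.sub(r"\s+", " ", text)
    return [m.start() for m in ITEM_PATTERNS[form_type].finditer(collapsed)]
```

A heading may match more than once (for example, once in a table of contents and once in the body), so a caller would probably want the last match, or the one closest to a candidate table.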

Here is one example of the 10-Q filings I need to parse: https://www.sec.gov/archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm

If a firm has no share repurchases, this table can be missing from the quarterly report.

I have tried to parse the HTML files with Python BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.

For example, the only way I can think of to parse these forms is:

    from bs4 import BeautifulSoup
    import requests
    import unicodedata
    import re

    url = 'https://www.sec.gov/archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

    def parse_html(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html5lib')
        tables = soup.find_all('table')

        # Loose match for the "Total Number of Shares Purchased" column header
        identifier = re.compile(r'total.*number.*of.*shares.*\w*purchased.*',
                                re.UNICODE | re.IGNORECASE | re.DOTALL)

        rep_tables = []
        # Walk the tables from last to first
        for table in reversed(tables):
            remove_invalid_tags(table)
            # Reduce the table text to plain ASCII so the regex sees ordinary spaces
            table_text = (unicodedata.normalize('NFKD', table.text)
                          .encode('ascii', 'ignore').decode('ascii'))
            if identifier.search(table_text):
                rep_tables.append(table)

        return rep_tables

    def remove_invalid_tags(soup, invalid_tags=('sup', 'br')):
        # Replace footnote markers and line breaks with a single space
        for tag in invalid_tags:
            for x in soup.find_all(tag):
                x.replace_with(' ')

The above code only returns a messy list of tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the following steps to scrape the date/month, share price, number of shares, etc. are much more painful to do. I am wondering if there are more feasible languages/approaches/applications/databases for getting such information? Thanks a million!
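For the follow-up step of turning a matched table row into the five fields, a minimal sketch might look like the following. It uses only the standard library and works on a list of cell strings. The helper names (`clean_cell`, `to_number`, `row_to_record`) and the field names are hypothetical, and it assumes the cells appear in the column order listed in the question, which will not hold for every filing.

```python
import re
import unicodedata

# Hypothetical field names, in the column order listed in the question.
FIELDS = ("period", "total_shares", "avg_price", "shares_under_plan", "max_remaining")

def clean_cell(text):
    """Normalize unicode (e.g. non-breaking spaces) and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    return re.sub(r"\s+", " ", text).strip()

def to_number(text):
    """Parse strings like '1,234', '$ 45.67' or '(12)'; return None if not numeric."""
    cleaned = re.sub(r"[$,\s]", "", text)
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = cleaned.strip("()")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value

def row_to_record(cells):
    """Map one row (list of cell strings) to a dict, or None if it does not fit.

    Assumption: after dropping empty cells and stand-alone '$' cells, the first
    cell is the period label and the next four cells are numeric.
    """
    cells = [clean_cell(c) for c in cells]
    cells = [c for c in cells if c and c != "$"]
    if len(cells) < 5:
        return None
    numbers = [to_number(c) for c in cells[1:5]]
    if any(n is None for n in numbers):
        return None  # header rows and rows with dashes or footnotes land here
    return dict(zip(FIELDS, [cells[0]] + numbers))
```

With BeautifulSoup, the input for each row could be built as `[td.get_text() for td in tr.find_all('td')]`. Rows that return `None` (headers, totals with dashes, footnotes) would still need manual rules, so this only covers the tidy cases.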

I'm not sure about Python, but in R there is a beautiful solution using the 'finstr' package (https://github.com/bergant/finstr). 'finstr' automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.
