HTML document processing and cleaning pipeline. Its main aim is to provide an extensible, flexible framework for processing different forms of HTML files: extracting their data and converting it into a canonical, machine-readable format.
This solution was built mainly to process one particular kind of HTML document: SEC filings and forms. These documents outline the financial and economic performance of companies incorporated in the US, so expect to deal with various ways of presenting financial figures (e.g. revenue and cost breakdown tables) as well as elements common to legal documents (e.g. signatures, tables of contents, subsections, and appendices).
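
As a rough illustration of what converting a filing table into a canonical, machine-readable form can look like, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not this project's API: the `TableExtractor` class and the sample markup are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper: collects each <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so cell text is normalized.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Made-up sample resembling a revenue/cost breakdown table.
sample = """
<table>
  <tr><th>Item</th><th>2020</th></tr>
  <tr><td>Revenues</td><td> 1,200 </td></tr>
  <tr><td>Cost of sales</td><td>800</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample)
rows = parser.tables[0]  # list of rows, each a list of cleaned cell strings
```

Each table becomes a plain list of rows, which is straightforward to serialize to CSV or JSON in a later step.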
HTML documents can be quite challenging to make machine-readable due to the lack of strict standards and stylistic conventions. i4Disitll is built to tackle this issue by providing multiple tools that make it easier to construct detection logic that demarcates comm