konklone/library-of-congress-bulk-data-report.md

## library-of-congress-bulk-data-report.md

      
Transcriber's notes:

Table of Contents omitted. Struck out text omitted.

Hyperlinks added where pertinent.

Some anchors added to link to paragraphs.

Original extracted (6-page) [PDF report](http://assets.sunlightfoundation.com/bulkdata/Library of Congress Bulk Data Estimate - 2012-12-31.pdf).

Extracted from pages 778-783 of the original (913-page) PDF report.

The Library of Congress

Congress.gov

Bulk Data Access - Legislative Bill Summaries
Resources Estimate

October 25, 2012
This document provides a rough order of magnitude estimate for the labor, hardware, and software required to establish a process within the Congress.gov system to generate bill summary data in XML format, create "bulk" data files that contain the XML data, and make the data files available to the public for download. A bill summary describes the most significant provisions of a bill text, and details the effects a bill may have on current law and federal programs, as defined in the legislative glossary on beta.congress.gov. Bill summaries are authored by CRS. The bulk data files will be generated on a daily basis and the archive of the bulk data will be arranged by Congress, session, and bill type (following how GPO is organizing bill data on FDSys).
There are two options for the delivery of the bulk data; this document outlines the cost of and considerations concerning both options. Estimates are provided for the implementation costs and for the ongoing yearly maintenance costs of each of the options. One option is to build a "portal" style page as part of the Congress.gov project that will allow anyone to access the files via a web browser. The second option is to provide the files only to GPO for public distribution through their FDSys bulk data repository. This solution would be built on the infrastructure used for the delivery of the Congress.gov system. The estimates assume any work would be managed through the full development life cycle that would include requirements analysis, system design, development, extensive testing, and a complete operations and maintenance plan.
Assumptions


The bulk data extraction is an extension to the existing Congress.gov system.
The bulk data will only contain bill summary information.
Bill summaries are authored by CRS, so ownership and responsibility of the content of the summaries resides with the Library.
The bulk data will NOT contain bill status, also called actions or status steps.
The only summaries extracted for bulk download will be those for bills that originated in the House of Representatives.
The XML for the bill summaries and the resulting bulk data files shall be well-formed and must be approved by the Legislative Branch XML Working Group.
The archiving of the content and delivery to GPO (if GPO is chosen as the delivery platform) will be performed by the existing media gateway product at the Library.

Discussion of Assumptions

Congress.gov has been built on a modern technical architecture that enables the development of new features and services. However, it was not designed specifically to facilitate the extraction of the data as XML documents for bulk download. It is possible that the continued development of Congress.gov that is planned in the upcoming years -- which is focused on meeting the expert needs of Congress -- will require the re-engineering of any bulk data extraction processes. To the greatest extent possible, these costs have been reflected in the estimates.
The estimates consider the effort for extracting, transforming and distributing only the summaries of bills that originated in the House of Representatives. Prior to initiating such a project, the Library would request written direction from the House to generate and release the information in bulk format. The Library would notify its Senate oversight committee of this activity.
If the scope of the data set to be extracted were to change, the estimates will need to be adjusted accordingly. Additionally, the Library would require a clear definition from the House that identifies the ownership of the data set and would also wish a concurrence with the Senate on that definition. The definition would inform the Library's view as to what authorizations would be appropriate prior to providing public access to the data in bulk format.
The full costs to the Library of supporting users of the bulk data are uncertain and cannot be accurately calculated for these estimates. Currently, third parties use "screen scraping" and other techniques to acquire data from THOMAS/beta.congress.gov. While the Library does not prevent such activities, it does not actively provide support for them, because they are outside the scope of providing a functional website for public use. If the Library purposely provides bulk download functionality, it anticipates that third-party users will expect some level of support. Even if an "as is" type of disclaimer were provided, the Library foresees an increase in the number of inquiries and requests for assistance. The costs of maintaining documentation have been incorporated into the labor estimates; however, due to uncertainty in estimating the cost, the ongoing customer support activities for the new users (interested citizens, academics, interest groups, information aggregators, and other businesses) have not been included.
If House bill summaries are released, the Library anticipates that demands for other information, such as short titles and relationships between the Congressional Record and bills, will quickly follow. It is possible that some groups may try to leverage this action to drive demands for public dissemination of CRS reports, and perhaps other products as well. In addition to the support considerations for technical matters, the broader dissemination of certain types of products can create direct and indirect effects from unintended audiences. For example, CRS products are written solely for a congressional audience and are therefore tailored to the needs and context of Congress. If such products are purposely distributed to a much broader audience, they may be cited in more overtly political and less nuanced public discussion. In terms of direct effects, CRS could find itself fielding more inquiries from individual citizens, as well as needing to clarify misrepresentations made by non-congressional actors. Such misrepresentations on controversial issues might spill over to CRS work for Congress, requiring clarifications and repair of any reputational damage caused by others. In terms of indirect effects, over time such events can lead to subtle but substantive changes to the writing of reports and other products, as authors consider the potential reaction of outside audiences.
As noted in the Technical Requirements, below, the Library will provide a method for users to detect differences between the files downloaded in bulk and the files archived on the Library's servers. This method is expected to detect differences on a batch-by-batch, not "bill summary-by-bill summary", level. The Library notes that the legislative information, as provided through a bulk extract, cannot be authenticated other than by comparison to the authoritative version maintained by the provider of the information. Once the information is hosted and "mashed up" by third parties, there exists no method for ensuring that the information has not been tampered with or innocently misinterpreted. Furthermore, distribution of bulk data will likely result in multiple alternative stores of legislative information that, to varying degrees, are not as timely, and therefore not as accurate, as Congress' primary systems. If there is an obligation to inform the general public of the risks of non-authoritative versions of the information, it has not been included in the estimates.
The XML structure of the bulk data files will remain consistent with the standards adopted by the Legislative XML Working Group; changes to the XML structure may entail changes to the costs of the implementation and the maintenance.
The estimates address two options for delivery of the bulk data, one hosted by the Library and the other hosted by GPO. The Library's Office of the General Counsel (OGC) has raised a question as to whether bulk downloads by the public are consistent with our authority under 2 U.S.C. § 180 or whether it is instead provided for under GPO's authorizing statutes and appropriations. If the Library were to pursue the Library-hosted delivery option, this question would have to be further explored.
High Level Process for Providing Bulk Data Access


A data extraction routine will select appropriate bill summary data from the Congress.gov database.
The extracted data will be transformed into the Legislative XML standard format.
The extraction routine will notify a content management/delivery tool when new data is available.
The delivery tool will create an archive copy of the content.
The data will be made available either on GPO's site or on the Library's site.
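
The extract, transform, and archive steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the report does not publish the Legislative XML schema, so the element names `billSummary` and `summaryText`, the `BillSummary` record, and the directory layout (Congress, then session, then bill type, as described in the introduction) are all hypothetical placeholders, not the Library's actual design.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class BillSummary:
    """Hypothetical record for one extracted summary (not the real schema)."""
    congress: int
    session: int
    bill_type: str  # e.g. "hr" for a bill that originated in the House
    number: int
    text: str


def to_xml(summary: BillSummary) -> bytes:
    """Serialize one summary to well-formed XML; element names are placeholders."""
    root = ET.Element("billSummary", {
        "congress": str(summary.congress),
        "session": str(summary.session),
        "type": summary.bill_type,
        "number": str(summary.number),
    })
    ET.SubElement(root, "summaryText").text = summary.text
    # ElementTree escapes characters such as & and <, so the output stays well-formed.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def archive_path(summary: BillSummary) -> PurePosixPath:
    """Assumed archive layout: <congress>/<session>/<bill type>/<file>.xml."""
    filename = f"{summary.bill_type}{summary.number}.xml"
    return PurePosixPath(str(summary.congress), str(summary.session),
                         summary.bill_type, filename)


if __name__ == "__main__":
    example = BillSummary(112, 2, "hr", 1, "Example summary text for A & B.")
    print(archive_path(example))
    print(to_xml(example).decode("utf-8"))
```

The notification and delivery steps are left out because they depend on the Library's media gateway product, whose interface the report does not describe.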

Delivery Options


Library Hosted public access. The Bill Summary XML files will be located on a publicly accessible server. A portal will allow users to browse through the archive, navigating by year, month and day.
GPO Hosted. The Bill Summary XML files will be made accessible only to GPO for distribution through their existing bulk data infrastructure.

High Level Schedule

Analysis: 2 weeks

Development: 8-14 weeks

Testing: 2 weeks

Deployment: 1 week
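
As a quick check on the overall duration, the phase estimates can be summed. The sketch below assumes the four phases run strictly one after another with no overlap, which the report does not state.

```python
# (minimum weeks, maximum weeks) per phase, taken from the schedule above.
PHASES = {
    "Analysis": (2, 2),
    "Development": (8, 14),
    "Testing": (2, 2),
    "Deployment": (1, 1),
}

# Assumption: phases are sequential, so totals are simple sums.
shortest_weeks = sum(low for low, _ in PHASES.values())
longest_weeks = sum(high for _, high in PHASES.values())

print(f"Total schedule: {shortest_weeks}-{longest_weeks} weeks")
```

Under that assumption the project would take between 13 and 19 weeks from the start of analysis to deployment.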

Technical Requirements


The XML files will conform to the existing Legislative Data XML standard.
The solution will be built in such a way that the increased processing time will not materially impact the responsiveness of congress.gov to web users' requests.
The solution will be implemented in such a way that it can be scaled to accommodate the required load (for the Library hosted option).
The solution will utilize the existing media gateway infrastructure to manage the delivery of the files (for the GPO option).
The ability to perform basic file integrity checking (via file level hashing) will be supported so that users can test whether the data received in a download matches the data as stored on the Library's servers.
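
The file-level integrity check in the last requirement could look something like the sketch below from a user's side. The report does not name a hash algorithm or say how digests would be published, so SHA-256 and the idea of a published hexadecimal digest per bulk file are assumptions, and the function names are illustrative.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """True when the downloaded file hashes to the digest published for it (assumed SHA-256)."""
    return file_sha256(path) == published_hex.strip().lower()
```

Because the digest covers a whole file, a mismatch only shows that the batch differs from the archived copy. It does not identify which bill summary inside the file changed, which matches the batch-level detection described in the Discussion of Assumptions.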

Estimated Resources

No hardware or software costs would be anticipated; labor costs would be as follows:

| Activity | Labor categories | Hours | Estimated cost |
| --- | --- | --- | --- |
| Initial Development and Deployment (Library hosted) | Project Manager, Analyst, Technical Architect/Expert, Software Engineer, Legislative domain expert, Testing | 570 hours | $67,800.00 |
| Yearly Ongoing Operations and Maintenance (Library Hosted) | Project Manager, Analyst, Technical Architect/Expert, Software Engineer, Legislative domain expert, Testing | 810 hours | $94,500.00 |
| Initial Development and Deployment (GPO hosted) | Project Manager, Analyst, Technical Architect/Expert, Software Engineer, Legislative domain expert, Testing, Systems Engineering (GPO integration) | 440 hours | $65,700.00 |
| Yearly Ongoing Operations and Maintenance (GPO Hosted) | Project Manager, Analyst, Technical Architect/Expert, Software Engineer, Legislative domain expert, Testing | 700 hours | $83,200.00 |
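
To compare the two delivery options, the labor figures above can be combined into an implied blended hourly rate and a cumulative cost. This sketch assumes that yearly maintenance costs stay flat and that the first year of maintenance begins at deployment; neither assumption comes from the report.

```python
# (hours, cost in dollars) taken from the Estimated Resources figures.
ESTIMATES = {
    "Library hosted": {"initial": (570, 67_800.00), "yearly": (810, 94_500.00)},
    "GPO hosted": {"initial": (440, 65_700.00), "yearly": (700, 83_200.00)},
}


def implied_hourly_rate(hours, cost):
    """Blended labor rate implied by an estimate (cost divided by hours)."""
    return cost / hours


def cumulative_cost(option, years):
    """Initial cost plus `years` of maintenance, assuming a flat yearly cost."""
    _, initial_cost = ESTIMATES[option]["initial"]
    _, yearly_cost = ESTIMATES[option]["yearly"]
    return initial_cost + years * yearly_cost


if __name__ == "__main__":
    for option, phases in ESTIMATES.items():
        for phase, (hours, cost) in phases.items():
            rate = implied_hourly_rate(hours, cost)
            print(f"{option}, {phase}: ${rate:,.2f} per hour")
        print(f"{option}, first year total: ${cumulative_cost(option, 1):,.2f}")
```

Under these assumptions the first-year total is $162,300.00 for the Library hosted option and $148,900.00 for the GPO hosted option.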