Bill summarization with gensim
Following this tutorial
- Take whole bills
- Summarize into layman readable content
- Summarize changes
Most of these terms are the same across chambers.
- Senate bill - SB
  - Amends laws (a diff)
- Senate resolution - SR
  - Non-binding agreement, declarations
- Senate joint resolution - SJR
  - Used to update the constitution
- Senate concurrent resolution - SCR
  - Used for cross-chamber coordination
- House bill - HB
  - Amends laws
- House resolution - HR
  - Non-binding agreement, declarations
- House joint resolution - HJR
  - Used to update the constitution
- House concurrent resolution - HCR
  - Used for cross-chamber coordination
- Public Act - PA
  - Having passed both chambers and the governor's desk
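As a rough sketch, the abbreviations above could be kept in a lookup table so that identifiers such as "HB 4437" can be split into a type and a number. The function name `parse_bill_id` and the assumed identifier shape (letters followed by digits) are illustrative only; the numeric-only assumption may not hold for every document type.

```python
import re

# Abbreviations taken from the list above.
BILL_TYPES = {
    "SB": "Senate bill",
    "SR": "Senate resolution",
    "SJR": "Senate joint resolution",
    "SCR": "Senate concurrent resolution",
    "HB": "House bill",
    "HR": "House resolution",
    "HJR": "House joint resolution",
    "HCR": "House concurrent resolution",
    "PA": "Public Act",
}

# Assumed identifier shape: prefix, optional space, number (e.g. "HB 4437").
_ID_PATTERN = re.compile(r"^\s*([A-Za-z]+)\s*(\d+)\s*$")


def parse_bill_id(identifier):
    """Return (prefix, number, description), or raise ValueError."""
    match = _ID_PATTERN.match(identifier)
    if not match:
        raise ValueError("unrecognized bill identifier: %r" % identifier)
    prefix = match.group(1).upper()
    if prefix not in BILL_TYPES:
        raise ValueError("unknown bill prefix: %r" % prefix)
    return prefix, int(match.group(2)), BILL_TYPES[prefix]
```

For example, `parse_bill_id("HB 4437")` returns the prefix, the number as an integer, and the human-readable type.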
We get snapshots of bills at a few points in their lifecycle. Here is an example.
- Introduced
- Passed House / Senate, each
- Senate / House Concurs
- Senate / House Enrolls
- Public Act - The law
- Presumably a rejected state
I think we will want to capture the state of the bill at each of these steps, since the text changes from step to step. There are also shorter-term iterations: for example, while being edited, bills get changed and voted on multiple times.
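One possible way to hold these snapshots is sketched below. The names `Stage` and `BillHistory` are hypothetical, and the `REJECTED` member reflects the presumed rejected state rather than anything confirmed. A list is used instead of one slot per stage so that repeated edits within a stage are kept too.

```python
from enum import Enum


class Stage(Enum):
    INTRODUCED = "Introduced"
    PASSED_CHAMBER = "Passed House / Senate"
    CONCURRED = "Senate / House Concurs"
    ENROLLED = "Senate / House Enrolls"
    PUBLIC_ACT = "Public Act"
    REJECTED = "Rejected"  # presumed, not confirmed


class BillHistory:
    """Keeps the bill text seen at each stage, in the order recorded."""

    def __init__(self, bill_id):
        self.bill_id = bill_id
        self.snapshots = []  # list of (Stage, text) tuples

    def record(self, stage, text):
        """Append a snapshot; a stage may be recorded more than once."""
        self.snapshots.append((stage, text))

    def latest(self, stage=None):
        """Most recent text, optionally restricted to one stage."""
        for seen_stage, text in reversed(self.snapshots):
            if stage is None or seen_stage is stage:
                return text
        return None
```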
A good overview is the daily summary (link).
There are RSS feeds, but I don't really understand them yet. I expect they will be the most up-to-date source of data, but possibly too granular.
I asked them for a more detailed description of what is happening here and am waiting to hear back.
Bill overview pages contain the most up to date details. Links to these pages can be found in the feeds.
Details include
- Sponsor
- Categories
- Description
- Bill versions
- Bill analysis (if it exists)
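Since links to the overview pages are found in the feeds, a first step could be pulling those links out. The sketch below assumes the feeds are plain RSS 2.0 with `<item>` elements carrying `<title>` and `<link>`; that has not been checked against the real feeds, and `SAMPLE_FEED` with its example.org URLs is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content for illustration only; the real feed's
# fields and URLs have not been verified.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example bill feed</title>
    <item>
      <title>HB 4437</title>
      <link>https://example.org/bills/hb4437</link>
    </item>
    <item>
      <title>SB 0001</title>
      <link>https://example.org/bills/sb0001</link>
    </item>
  </channel>
</rss>"""


def bill_links(feed_xml):
    """Return (title, link) pairs for every <item> that has a link."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:
            pairs.append((title, link))
    return pairs
```

Each link could then be fetched to read the details listed above (sponsor, categories, description, versions, analysis).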
Bills follow these formatting rules
The following bill formatting applies to the 2017-2018 session:
- New language in an amendatory bill will be shown in BOLD AND UPPERCASE.
- Language to be removed will be stricken.
- Amendments made by the House will be blue with square brackets, such as: [House amended text].
- Amendments made by the Senate will be red with double greater/lesser than symbols, such as: <<Senate amended text>>.
We should capture all these features.
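A minimal sketch of capturing some of these features from plain text follows. It assumes House amendments appear as `[text]` and Senate amendments as `<<text>>` (read from the "double greater/lesser than symbols" wording), and it treats runs of two or more all-uppercase words as new language, which is only a heuristic. Stricken text and colors do not survive in plain text, so they would need the HTML. The name `extract_markup` is hypothetical.

```python
import re

# House amendments: text inside square brackets.
HOUSE_AMENDMENT = re.compile(r"\[([^\[\]]+)\]")
# Assumption: Senate amendments are wrapped as <<text>>.
SENATE_AMENDMENT = re.compile(r"<<(.+?)>>", re.DOTALL)
# Heuristic: two or more consecutive all-uppercase words (2+ letters each).
NEW_LANGUAGE = re.compile(r"\b[A-Z]{2,}(?:[ \t]+[A-Z]{2,})+\b")


def extract_markup(text):
    """Return the marked-up fragments found in plain bill text."""
    return {
        "house_amendments": HOUSE_AMENDMENT.findall(text),
        "senate_amendments": SENATE_AMENDMENT.findall(text),
        "new_language": NEW_LANGUAGE.findall(text),
    }
```

The uppercase heuristic will miss single-word additions and may pick up headings, so results would need checking against the HTML styling.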
Set up python
Installed on Mac using https://radimrehurek.com/gensim/install.html
Note that the gensim.summarization module was removed in gensim 4.0, so a 3.x release is needed.
Start python
Import gensim.
from gensim.summarization import summarize
from gensim.summarization import keywords
Log the summary
print(summarize(text))
print(keywords(text))
Laws are often detailed rules for particular portions of other existing laws. Bills often only add and remove text. So, we can present several different data sets:
- Entire bill
- Additions / Subtractions to / from bill
- Metadata:
  - Creators
  - Provided description
When posted to the public, a warning should be shown expressing the experimental nature of the summary. The summary certainly captures the dry tone all laws have, but I worry the true meaning is lost.
I was playing with this bill.
A clean version of the text exists at ./hb4437.txt.
print(summarize(text, ratio=0.02))
A quick summary yields something like this
Beginning on and after January 1, 2007, subject to any limitation provided in this subdivision, a taxpayer who is a senior citizen may deduct to the extent included in adjusted gross income, interest, dividends, and capital gains received in the tax year not to exceed $9,420.00 for a single return and $18,840.00 for a joint return. Beginning January 1, 2013, for a person born in 1946 through 1952 who receives retirement or pension benefits from employment with a governmental agency that was not covered by the federal social security act, chapter 531, 49 Stat 620, the sum of the deductions under subsections is limited to $35,000.00 for a single return and, except as otherwise provided under this subdivision, $55,000.00 for a joint return.
TODO - Verify cleanliness of results
I can verify the cleanliness of this description by:
- reading and understanding the bill
- comparing to MichiganVotes summaries
- asking for help from the MichiganVotes people
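As a rough aid to the second item, a generated summary could be compared against a reference summary by word overlap. This is a swapped-in, crude measure (Jaccard similarity of word sets), not something taken from gensim or from MichiganVotes, and it says nothing about whether the meaning is preserved; it only flags summaries that share little vocabulary with the reference. The name `overlap_score` is hypothetical.

```python
import re


def _words(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(candidate, reference):
    """Jaccard similarity of the two word sets, from 0.0 to 1.0."""
    a, b = _words(candidate), _words(reference)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / float(len(a | b))
```

A low score would be a prompt to read the bill more closely, not a verdict on the summary.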
Bill text is nicely formatted in HTML pages like so.
We can grab the additions from the page with
var spans = document.getElementsByTagName('span');
var mySpans = [...spans];
var output = mySpans
  .filter(function (span) {
    // New language is styled with text-transform: uppercase
    return span.style.textTransform === 'uppercase';
  })
  .reduce(function (str, span) {
    return str + span.innerHTML;
  }, '');
console.log(output);
The result is stored in ./hb4437-added.txt.
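The same extraction could be done outside the browser. The sketch below uses Python's standard `html.parser` and assumes, as the snippet above does, that additions are `<span>` elements with an inline `text-transform: uppercase` style. Unlike the `innerHTML` version, it collects text only and drops nested tags. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class UppercaseSpanExtractor(HTMLParser):
    """Collects text inside <span> elements styled text-transform: uppercase."""

    def __init__(self):
        super().__init__()
        self._stack = []  # one bool per open <span>: is it an uppercase span?
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower().replace(" ", "")
            self._stack.append("text-transform:uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Keep text when any enclosing span is an uppercase span.
        if any(self._stack):
            self.chunks.append(data)


def extract_additions(html):
    """Return the concatenated text of all uppercase-styled spans."""
    parser = UppercaseSpanExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```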
Currently I'm removing the line breaks by hand.
I am using the regex \(([^\)]+)\) to find and remove things between parens.
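Both manual cleanup steps could be scripted. The sketch below applies the same parenthetical regex and collapses line breaks into single spaces; the function name `clean_text` and the punctuation tidy-up are additions for illustration. The regex does not handle nested parentheses, and it also removes subsection references such as "(7)", which may or may not be wanted.

```python
import re

# The regex quoted above: anything between one pair of parentheses.
PARENTHETICAL = re.compile(r"\(([^\)]+)\)")


def clean_text(text):
    """Remove parentheticals and line breaks before summarizing."""
    text = PARENTHETICAL.sub("", text)
    # Collapse line breaks and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Drop the space left behind before punctuation, e.g. "adjusted ." -> "adjusted."
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    return text.strip()
```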
print(summarize(text))
gives us:
For each tax year beginning on and after January 1, 2019, the income thresholds for the adjustment or elimination of exemption allowances under subsection (7) shall be adjusted for inflation by the department of treasury by multiplying each income threshold by a fraction, the numerator of which is the midwest employment cost index for the east north central division for the state fiscal year ending in the tax year prior to the tax year for which the adjustment is being made and the denominator of which is the midwest employment cost index for the east north central division for the 2016-2017 state fiscal year.
And the keywords are
income
disabled
exemption
shall
return
tax year beginning
state
states
exemptions allowable
labor
allowed
allowance
allowances
armed
Again, I will have to verify.