@thoppe thoppe/FCChack_180422.md Secret
Last active Apr 26, 2018

JK Hack && Tell (May 2018)

There are ~1M lines of text in each of 2216 .json files:

$ wc -l ecfs_17-108_11110000.json 
1048188 ecfs_17-108_11110000.json

Number of lines containing a "text_data" key: ~20,000 per .json file (each record appears to contribute two such lines, an array header and a string)

$ grep "text_data" ecfs_17-108_11110000.json | wc -l
19999

Here is what the vast majority of the "text_data" fields look like in one .json file:

$ grep "text_data" ecfs_17-108_11110000.json | head -20
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nJohanna Ortiz", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nLawrence Wojcik", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nDenise Ritchey", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nJanice Massie", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nAlvin Monty", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nMistie Finch", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nKatherine Polzin", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nDella Elder", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nRogelio Reay", 
            "text_data": [
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nCarol Dubose", 

1/2 of them look like this:

$ grep "I am in favor of strong net neutrality under Title" ecfs_17-108_11110000.json | grep Sincerely | head
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nJohanna Ortiz", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nLawrence Wojcik", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nDenise Ritchey", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nJanice Massie", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nAlvin Monty", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nMistie Finch", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nKatherine Polzin", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nDella Elder", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nRogelio Reay", 
        "text_data": "I am in favor of strong net neutrality under Title II of the Telecommunications Act.\n\n\nSincerely,\nCarol Dubose", 

and the other 1/2 of them look like this:

$ grep "I am in favor of strong net neutrality under Title" ecfs_17-108_11110000.json | grep -v "of the Tele" | head
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"
                "I am in favor of strong net neutrality under Title"

The first form appears 9945 times:

$ grep "I am in favor of strong net neutrality under Title" ecfs_17-108_11110000.json | grep Sincerely | wc -l
9945

The second form also appears 9945 times:

$ grep "I am in favor of strong net neutrality under Title" ecfs_17-108_11110000.json | grep -v "of the Tele" | wc -l
9945
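
The first listing shows "text_data" lines in two shapes: one that opens an array and one that holds the whole comment as a single string. Here is a minimal sketch for counting each shape directly, assuming the file stays pretty-printed with one key per line as in the output above. The function name `count_text_data_shapes` is hypothetical and not part of the original session.

```shell
# Count the two shapes of "text_data" lines in a pretty-printed JSON file.
# Assumption: one key per line, as in the grep output above.
count_text_data_shapes() {
    file=$1
    # Lines that open an array:  "text_data": [
    arrays=$(grep -c '"text_data": \[' "$file" || true)
    # Lines that hold the comment as one string:  "text_data": "..."
    strings=$(grep -c '"text_data": "' "$file" || true)
    echo "array=$arrays string=$strings"
}
```

Usage would be `count_text_data_shapes ecfs_17-108_11110000.json`. The `|| true` keeps the function from failing when a shape has zero matches, since `grep -c` still prints 0 but exits non-zero.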

To see the "text_data" fields that aren't either of those 2 types of comments, I used:

$ grep '"text_data"' ecfs_17-108_11110000.json | grep -v '"I am in favor of strong net neutrality under Title' | grep -v ': \[' | head -5
        "text_data": "The FCC's Open Internet Rules (net neutrality rules) are extremely important to me. I urge you to protect them.\n\nI don't want ISPs to have the power to block websites, slow them down, give some sites an advantage over others, or split the Internet into \"fast lanes\" for companies that pay and \"slow lanes\" for the rest.\n\nNow is not the time to let giant ISPs censor what we see and do online.\n\nCensorship by ISPs is a serious problem. Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted it will introduce fast lanes for sites that pay-and slow lanes for everyone else-if the FCC lifts the rules. This hurts consumers and businesses large and small.\n\nCourts have made clear that if the FCC ends Title II classification, the FCC must let ISPs offer \"fast lanes\" to websites for a fee.\n\nChairman Pai has made clear that he intends to do exactly this.\n\nBut if some companies can pay our ISPs to have their content load faster, startups and small businesses that can't pay those fees won't be able to compete. You will kill the open marketplace that has enabled millions of small businesses and created the 5 most valuable companies in America-just to further enrich a few much less valuable cable giants famous for sky-high prices and abysmal customer service.\n\nInternet providers will be able to impose a private tax on every sector of the American economy.\n\nMoreover, under Chairman Pai's plan, ISPs will be able to make it more difficult to access political speech that they don't like. 
They'll be able to charge fees for website delivery that would make it harder for blogs, nonprofits, artists, and others who can't pay up to have their voices heard.\n\nI'm sending this to the FCC's open proceeding, but I worry that Chairman Pai, a former Verizon lawyer, has made his plans and will ignore me and millions of other Americans.\n\nSo I'm also sending this to my members of Congress. Please publicly support the FCC's existing net neutrality rules based on Title II, and denounce Chairman Pai's plans. Do whatever you can to dissuade him.\n\nThank you!\r\nMark Holmes", 
        "text_data": "The FCC's Open Internet Rules (net neutrality rules) are extremely important to me. I urge you to protect them.\n\nI don't want ISPs to have the power to block websites, slow them down, give some sites an advantage over others, or split the Internet into \"fast lanes\" for companies that pay and \"slow lanes\" for the rest.\n\nNow is not the time to let giant ISPs censor what we see and do online.\n\nCensorship by ISPs is a serious problem. Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted it will introduce fast lanes for sites that pay-and slow lanes for everyone else-if the FCC lifts the rules. This hurts consumers and businesses large and small.\n\nCourts have made clear that if the FCC ends Title II classification, the FCC must let ISPs offer \"fast lanes\" to websites for a fee.\n\nChairman Pai has made clear that he intends to do exactly this.\n\nBut if some companies can pay our ISPs to have their content load faster, startups and small businesses that can't pay those fees won't be able to compete. You will kill the open marketplace that has enabled millions of small businesses and created the 5 most valuable companies in America-just to further enrich a few much less valuable cable giants famous for sky-high prices and abysmal customer service.\n\nInternet providers will be able to impose a private tax on every sector of the American economy.\n\nMoreover, under Chairman Pai's plan, ISPs will be able to make it more difficult to access political speech that they don't like. 
They'll be able to charge fees for website delivery that would make it harder for blogs, nonprofits, artists, and others who can't pay up to have their voices heard.\n\nI'm sending this to the FCC's open proceeding, but I worry that Chairman Pai, a former Verizon lawyer, has made his plans and will ignore me and millions of other Americans.\n\nSo I'm also sending this to my members of Congress. Please publicly support the FCC's existing net neutrality rules based on Title II, and denounce Chairman Pai's plans. Do whatever you can to dissuade him.\n\nThank you!\r\nNate Kowal", 
        "text_data": "The FCC's Open Internet Rules (net neutrality rules) are extremely important to me. I urge you to protect them.\n\nI don't want ISPs to have the power to block websites, slow them down, give some sites an advantage over others, or split the Internet into \"fast lanes\" for companies that pay and \"slow lanes\" for the rest.\n\nNow is not the time to let giant ISPs censor what we see and do online.\n\nCensorship by ISPs is a serious problem. Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted it will introduce fast lanes for sites that pay-and slow lanes for everyone else-if the FCC lifts the rules. This hurts consumers and businesses large and small.\n\nCourts have made clear that if the FCC ends Title II classification, the FCC must let ISPs offer \"fast lanes\" to websites for a fee.\n\nChairman Pai has made clear that he intends to do exactly this.\n\nBut if some companies can pay our ISPs to have their content load faster, startups and small businesses that can't pay those fees won't be able to compete. You will kill the open marketplace that has enabled millions of small businesses and created the 5 most valuable companies in America-just to further enrich a few much less valuable cable giants famous for sky-high prices and abysmal customer service.\n\nInternet providers will be able to impose a private tax on every sector of the American economy.\n\nMoreover, under Chairman Pai's plan, ISPs will be able to make it more difficult to access political speech that they don't like. 
They'll be able to charge fees for website delivery that would make it harder for blogs, nonprofits, artists, and others who can't pay up to have their voices heard.\n\nI'm sending this to the FCC's open proceeding, but I worry that Chairman Pai, a former Verizon lawyer, has made his plans and will ignore me and millions of other Americans.\n\nSo I'm also sending this to my members of Congress. Please publicly support the FCC's existing net neutrality rules based on Title II, and denounce Chairman Pai's plans. Do whatever you can to dissuade him.\n\nThank you!\r\nWinsome", 
        "text_data": "Ajit Pai! I support strong net neutrality backed by Title II oversight of ISPs! Don't mess with it!", 
        "text_data": "The FCC's Open Internet Rules (net neutrality rules) are extremely important to me. I urge you to protect them.\n\nI don't want ISPs to have the power to block websites, slow them down, give some sites an advantage over others, or split the Internet into \"fast lanes\" for companies that pay and \"slow lanes\" for the rest.\n\nNow is not the time to let giant ISPs censor what we see and do online.\n\nCensorship by ISPs is a serious problem. Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted it will introduce fast lanes for sites that pay-and slow lanes for everyone else-if the FCC lifts the rules. This hurts consumers and businesses large and small.\n\nCourts have made clear that if the FCC ends Title II classification, the FCC must let ISPs offer \"fast lanes\" to websites for a fee.\n\nChairman Pai has made clear that he intends to do exactly this.\n\nBut if some companies can pay our ISPs to have their content load faster, startups and small businesses that can't pay those fees won't be able to compete. You will kill the open marketplace that has enabled millions of small businesses and created the 5 most valuable companies in America-just to further enrich a few much less valuable cable giants famous for sky-high prices and abysmal customer service.\n\nInternet providers will be able to impose a private tax on every sector of the American economy.\n\nMoreover, under Chairman Pai's plan, ISPs will be able to make it more difficult to access political speech that they don't like. 
They'll be able to charge fees for website delivery that would make it harder for blogs, nonprofits, artists, and others who can't pay up to have their voices heard.\n\nI'm sending this to the FCC's open proceeding, but I worry that Chairman Pai, a former Verizon lawyer, has made his plans and will ignore me and millions of other Americans.\n\nSo I'm also sending this to my members of Congress. Please publicly support the FCC's existing net neutrality rules based on Title II, and denounce Chairman Pai's plans. Do whatever you can to dissuade him.\n\nThank you!\r\nSiver Rolstad", 

There are only 54 such "text_data" entries:

$ grep '"text_data"' ecfs_17-108_11110000.json | grep -v '"I am in favor of strong net neutrality under Title' | grep -v ': \[' | wc -l
54

Of those 54, 24 contain this text verbatim:

$ grep '"text_data"' ecfs_17-108_11110000.json | grep -v '"I am in favor of strong net neutrality under Title' | grep -v ': \[' | grep 'Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted' | wc -l
24

The other 30 do not:

$ grep '"text_data"' ecfs_17-108_11110000.json | grep -v '"I am in favor of strong net neutrality under Title' | grep -v ': \[' | grep -v 'Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted' | wc -l
30
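
The same breakdown can be sketched as a single pass over the file instead of three separate grep pipelines. This is an illustration under assumptions: it treats only lines containing `"text_data": "` as comments, which approximates the `grep -v ': \['` filter used above. The function name `tally_comments` and the bucket labels are hypothetical.

```shell
# Tally string-valued "text_data" lines into three buckets in one pass.
# The marker substrings are copied from the grep commands above.
tally_comments() {
    awk '
        !/"text_data": "/ { next }   # skip array headers and every other key
        /"I am in favor of strong net neutrality under Title/ { n_short++; next }
        /Comcast has throttled Netflix, AT&T blocked FaceTime/ { n_long++; next }
        { n_other++ }
        END { printf "short_form=%d long_form=%d other=%d\n", n_short, n_long, n_other }
    ' "$1"
}
```

Usage would be `tally_comments ecfs_17-108_11110000.json`. Reading the file once matters little for a single file but adds up across all 2216 of them.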

Those 30 remaining comments vary a good deal and may well have been submitted by a handful of different people:

$ grep '"text_data"' ecfs_17-108_11110000.json | grep -v '"I am in favor of strong net neutrality under Title' | grep -v ': \[' | grep -v 'Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted' | more
        "text_data": "Ajit Pai! I support strong net neutrality backed by Title II oversight of ISPs! Don't mess with it!", 
        "text_data": "Please support net neutrality backed by Title II oversight of ISP's.", 
        "text_data": "Enough uncertainty, I believe that the only way forward is for Congress to do their job and draft a bipartisan bill to affirm the principles of Net Neutrality into law.\n", 
        "text_data": "I am in support of Title II oversight of ISPs, as they have little to no competition in their respective markets and charge way too much for terrible service as it is. A tiered system wou
ld only make things worse. Do not go through with removing this oversight. More oversight is needed too, they need to be regulated as a utility for maximum power to the people rather than their shareholders an
d executives.", 
        "text_data": "Hello, \n\nI am submitting this to support net neutrality and oversight of Internet Service Providers as backed by Title II.  If the FCC reverses Title II, I will make it my mission to en
sure every vote cast by friend, family, and colleague supports candidates that will go directly against the agenda of any group that infringes peoples right to privacy, access, and freedom to use the internet.
\n\nThank you,\n\nJason R.", 
        "text_data": "Enough uncertainty, I believe that the only way forward is for Congress to do their job and draft a bipartisan bill to affirm the principles of Net Neutrality into law.\n", 
        "text_data": "I vehemently support strong Net Neutrality backed by Title II oversight of ISPs.", 
        "text_data": "Enough uncertainty, I believe that the only way forward is for Congress to do their job and draft a bipartisan bill to affirm the principles of Net Neutrality into law.\n", 
        "text_data": "Strong FCC regulation of net neutrality under title II.", 
        "text_data": "THE VIOLATION OF INTERNET FREEDOM THAT HAS BEEN PROPOSED IS UNACCEPTABLE AND UNAMERICAN. IT VIOLATES ANTITRUST LAWS AND THE RIGHTS OF EVERY AMERICAN. IT IS AN ABSOLUTE SHAME THAT THIS IS 
EVEN BEING CONSIDERED. SHAME OF AJIT PAI. I DEMAND HIS REMOVAL FROM OFFICE.", 
        "text_data": "YOU GUYS AR STOOPOID", 
        "text_data": "I STRONGLY support strong net neutrality regulations. In particular, I support keeping all ISPs classified under title 2.", 
        "text_data": "Preserve Net Nutrality and Title 2.", 
        "text_data": "Etsy Shop www.woolybaby.etsy.com\n\nChairman Pai\u2019s proposed plan to repeal net neutrality protections would put a huge burden on microbusinesses like mine.\n\nAs an Etsy seller, net 
neutrality is essential to the success of my business and my ability to care for myself and my family. The FCC needs to ensure equal opportunities for microbusinesses to compete with larger and more establishe
d brands by upholding net neutrality protections.\n\nEtsy has opened the door for me and 1.8 million other sellers to turn our passion into a business by connecting us to a global market of buyers. For 32% of 
creative entrepreneurs on the platform, our creative business is our sole occupation. A decrease in sales in the internet slow lane or higher cost to participate in Chairman Pai\u2019s pay-to-play environment 
would create significant obstacles for me and other Etsy sellers to care for ourselves and our families.\n\nMoreover, 87% of Etsy sellers in the U.S. are women, and most run their microbusinesses out of their 
homes. By rolling back the bright line rules that ensure net neutrality, Chairman Pai is not only taking away our livelihood, he is also putting up barriers to entrepreneurship for a whole cohort of Americans.
\n\nMy business growth depends on equal access to consumers. Any rule that allows broadband providers to negotiate special deals with some companies would undermine my ability to compete online.\n\nWe need a f
ree and open internet that works for everyone, not just telecom companies that stand to benefit from the FCC\u2019s proposed rules.\n\nI'm sending this to the FCC's open proceeding and to my members of Congres
s. Please publicly support the FCC's existing net neutrality rules based on Title II and microbusinesses like mine.\n\nThank you!\r\nJosie Marsh", 
        "text_data": "Enough uncertainty, I believe that the only way forward is for Congress to do their job and draft a bipartisan bill to affirm the principles of Net Neutrality into law.\n", 
        "text_data": "I am in favor of keeping the net neutrality rules in place.", 
        "text_data": "I honestly can't believe this is still in question. Most of the United States' younger population (the people who follow and understand the necessity and freedom of the internet) support 
net neutrality . This is not a small deal that can be overlooked, it's a big deal. Imagine students at home or at school having to deal with speeds prioritized by isp. And that's just one example. I am strongl
y in favor of net neutrality and so should every student and parent in this country.", 
        "text_data": "I SPECIFICALLY SUPPORT STRONG NET NEUTRALITY BACKED BY TITLE II OVERSIGHT OF ISP'S", 
        "text_data": "Protect and preserve Net Neutrality. To abolish Net Neutrality is to abolish American market freedoms. The ISPs are already too powerful and do not need to be granted more power over the 
American consumer.", 
        "text_data": "Hi FCC,\n\nI'd like to make the requests below known (again), as I believe they are overwhelmingly important to the advancement of the citizens of this country. I'm not a bot, but am a te
acher, taxpayer, and deeply concerned citizen.\n\n1. I support broadband/ISP rules backed by a strong TItle II regulation. Chairman Pai has made his intention to gut these rules apparent citing that it would b
e an unnecessary burden on existing, large ISPs. This seems unlikely as they have continued to be massively profitable over the last couple of years where these regulations have been in place. Pai has also sta
ted that ISPs would not engage in zero rating or bandwidth throttling without the application of Title II, which he has proven to be incorrect about multiple times, including before Title II regulations were i
n place and over the last week when Verizon was caught illegally throttling bandwidth for certain services telling their customers that \"it shouldn't be noticeable.\" Additionally, many other ISPs are already
 starting to issue bandwidth cap notices, which are highly problematic as well. Clearly, Chairman Pai is ignorant to the effect of Title II regulation and the ISP response to it, which is especially troubling 
given that he was formerly employed at Verizon and should have vast knowledge of corporate strategy regarding these regulations.\n\n2. I think it would be best for the country for Ajit Pai to resign as chairma
n of the FCC. Given the above, he clearly lacks the knowledge, understanding, and/or scruples to properly determine regulation for ISPs in the US. This argument is not about whether someone can stream high qua
lity Youtube or Comcast can turn an extra 1% growth next quarter, it's about equitable access to information in an already dangerously underinformed electorate. Since the stakes are so high, it is critical tha
t a competent, rational chairperson be appointed to handle the task.\n\nThanks for your time, and please keep Title II regulations in place,\nTroy Moore", 
        "text_data": "As other modern nation's have already determined, access to the Internet is clearly a utility and should be regulated as such.", 
        "text_data": "Etsy Shop https://www.etsy.com/shop/BellaMonicaARTs\n\nChairman Pai\u2019s proposed plan to repeal net neutrality protections would put a huge burden OF OVERHEAD on microbusinesses like m
ine that most will not be able to bear, causing many to go OUT OF BUSINESS and lose earning power and capita BREAKING THE BACKS OF SMALL BUSINESSES THEIR OWNERS AND HOUSEHOLDS OF AMERICAN FAMILIES.\n\nAs an Et
sy seller, net neutrality is essential to the success of my business and my ability to care for myself and my family. The FCC needs to ensure equal opportunities for microbusinesses to compete with larger and 
more established brands by upholding net neutrality protections.\n\nEtsy has opened the door for me and 1.8 million other sellers to turn our passion into a business by connecting us to a global market of buye
rs. For 32% of creative entrepreneurs on the platform, our creative business is our sole occupation. A decrease in sales in the internet slow lane or higher cost to participate in Chairman Pai\u2019s pay-to-pl
ay environment would create significant obstacles for me and other Etsy sellers to care for ourselves and our families.\n\nMoreover, 87% of Etsy sellers in the U.S. are women, and most run their microbusinesse
s out of their homes. By rolling back the bright line rules that ensure net neutrality, Chairman Pai is not only taking away our livelihood, he is also putting up barriers to entrepreneurship for a whole cohor
t of Americans.\n\nMy business growth depends on equal access to consumers. Any rule that allows broadband providers to negotiate special deals with some companies would undermine my ability to compete online.
\n\nWe need a free and open internet that works for everyone, not just telecom companies that stand to benefit from the FCC\u2019s proposed rules.\n\nI'm sending this to the FCC's open proceeding and to my mem
bers of Congress. Please publicly support the FCC's existing net neutrality rules based on Title II and microbusinesses like mine.\n\nThe FCC must serve the people and protect net neutrality to preserve freedo
m of speech for content deliverers, without the hobbling interference of corporate interests, fair and equitable access to the internet, and divert overruling corporations from taking over what's best for the 
people, not corporate interests!\n\nThank you!\r\nAllanah Anderson", 
        "text_data": "When I pay for Internet service, I expect protection from data discrimination, privacy invasion, and access restrictions. FCC Chairman Pai\u2019s proposal risks my rights as a citizen. Co
rporations won\u2019t do the right thing. I want the FCC to uphold all existing Title II net neutrality rules. Thanks.", 
        "text_data": "Ajit Pai! I support strong net neutrality backed by Title II oversight of ISPs! Don't mess with it!", 
        "text_data": "I support strong net neutrality backed by Title II oversight of ISPs.\n\nI demand that the FCC keep the internet filed as a Title II communications service.\n\nISPs killing competition an
d damaging internet-based companies does happen, Ajit Pai. Just ask Google what happened to Google Wallet in 2013, or ask Netflix how their contract negotiations with Comcast went in 2013 and how their downloa
d speeds were during the negotiations.\n\nI will not allow ISPs to censor what I want to see, kill the free market by harming internet-based companies, make me pay massive fees for the same speed that I have n
ow, form into an oligopoly, and silence anyone online that either doesn't pay their fees or disagrees with their actions.\n\nI will fight for strong Net Neutrality! Long live the Internet!", 
        "text_data": "I support strong net neutrality backed by Title II oversight of ISPs.\n\nI demand that the FCC keep the internet filed as a Title II communications service.\n\nISPs killing competition an
d damaging internet-based companies does happen, Ajit Pai. Just ask Google what happened to Google Wallet in 2013, or ask Netflix how their contract negotiations with Comcast went in 2013 and how their downloa
d speeds were during the negotiations.\n\nI will not allow ISPs to censor what I want to see, kill the free market by harming internet-based companies, make me pay massive fees for the same speed that I have n
ow, form into an oligopoly, and silence anyone online that either doesn't pay their fees or disagrees with their actions.\n\nI will fight for strong Net Neutrality! Long live the Internet!", 
        "text_data": "I am very concerned that removing title II classification from ISPs will stifle innovation by allowing a select few organizations dictate the pace and types of innovation on the digital f
ront. Instead of the free market picking tech startup winners and losers we would be disproportionately shifting that power to the telecoms. Why would Verizon, Comcast, AT&T, and others allow startups to acces
s the world via their conduits if the startups goals or I tent is to innovate and disrupt the market? This is not free market and this is not American.", 
        "text_data": "The current FCC chairman is infuriating. He repeatedly claims that all anti-competitive activity prohibited under Title II is \u201chypothetical\u201d. Yet, both Verizon and Comcast have 
throttled Netflix in the past, and Tmobile has interfered with Google Wallet. All of this is well documented. Netflix was even forced to pay protection money to Comcast in order to achieve faster speeds for co
nsumers. But the current administration, in its infinite wisdom, somehow believes that the FTC will protect consumers from such abuses and that Net Neutrality is not necessary. \n\nThis is outrageous. The FTC 
is retroactive. Consumers essentially need to become law enforcement investigators and conduct constant monitoring of their internet speeds and build up an extensive profile detailing abuses. Then, AFTER an in
credibly lengthy process which will most likely be brought to a crawl by the seemingly unlimited legal resources of ISPs, will the FTC consider taking action. The FCC is either totally devoid of common sense o
r under a stranglehold by special interests if it thinks this is a reasonable solution.\n\nNet Neutrality is proactive. Small businesses, and startups can have peace of mind knowing that they can focus on thei
r product instead of having to pay protection money to ISPs. It also encourages competition, a central component of any functioning free market. Consumers can also have peace of mind that they actually have op
en Internet access as opposed to a curated list of websites which the ISPs deem acceptable to their business interests.", 
        "text_data": "I specifically support strong net neutrality backed by Title II oversight of the ISPs", 
        "text_data": "I specifically support strong  Net Neutrality backed by by Title II oversight of ISP's. I believe our free access to internet free of throttling, discrimination, and filtering should be o
ur right. Don't take away the rights of the people.", 

If we drop those three overrepresented forms from every file and aggregate the remaining "text_data" lines, we get:

RAWDIR=FCCcomments
simpleGrepAggFname='notTop3Comments.txt'
# Start from an empty file so a rerun does not append the same lines twice.
: > "$simpleGrepAggFname"
for filename in "$RAWDIR"/*.json; do
    echo "$filename"
    # Keep string-valued "text_data" lines that match neither form letter.
    grep '"text_data"' "$filename" | grep -v '"I am in favor of strong net neutrality under Title' | grep -v ': \[' | grep -v 'Comcast has throttled Netflix, AT&T blocked FaceTime, Time Warner Cable throttled the popular game League of Legends, and Verizon admitted' >> "$simpleGrepAggFname"
done

That leaves 13M comments that are not one of those three overrepresented forms:

$ wc -l notTop3Comments.txt 
13010096 notTop3Comments.txt
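
A variant of the aggregation step, sketched as a function that writes to stdout so the caller decides where the output goes. It assumes that matching `"text_data": "` is an acceptable stand-in for excluding the array-header lines, and it uses fixed-string matching (`grep -F`) with shortened marker substrings. The function name `collect_other_comments` is hypothetical.

```shell
# Print every string-valued "text_data" line that matches neither form letter.
# Argument: directory holding the .json files.
collect_other_comments() {
    dir=$1
    for f in "$dir"/*.json; do
        [ -e "$f" ] || continue   # the glob matched nothing
        grep '"text_data": "' "$f" \
            | grep -v -F \
                -e '"I am in favor of strong net neutrality under Title' \
                -e 'Comcast has throttled Netflix, AT&T blocked FaceTime' \
            || true               # a file with no surviving lines is not an error
    done
}
```

Usage would be `collect_other_comments FCCcomments > notTop3Comments.txt`. Redirecting with `>` once, rather than appending inside the loop, means a second run replaces the file instead of doubling it.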

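That's only ~29% of the 44M "text_data" lines across all files. Note that the 44M total also counts the "text_data": [ array-header lines, which the filter above drops, so the share among string-valued comments would be higher.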

$ touch tmp.txt
$ for filename in "$RAWDIR"/*.json; do grep '"text_data"' "$filename" | wc -l >> tmp.txt ; done
$ awk '{ sum += $1 } END { print sum }' tmp.txt
44232187
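
The ~29% figure can be reproduced from the two counts reported in this session. A small sketch follows; the helper name `percent_of` is hypothetical.

```shell
# Share of lines that survive the filter, as a percentage with one decimal.
# Arguments: kept line count, total line count.
percent_of() {
    awk -v kept="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * kept / total }'
}

# Counts taken from the wc and awk output above.
percent_of 13010096 44232187   # prints 29.4
```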

If you look for an overrepresented "mode" among those remaining comments, you'll find that within the first 1000 lines one comment appears 652 times:

$ head -1000 notTop3Comments.txt | sort | uniq -c | sort -n | tail
      2         "text_data": "\"Internet Freedom\" is Net Neutrality and it should stay as it is.", 
      2         "text_data": "Net neutrality is important to all of us. No business and no person should get a preference for speed over the Internet.   There is no good reason to change the current regulations that have been in place in recent years. The FCC current regulations are fair to all entities. Please do not change them in favor of certain entities", 
      2         "text_data": "Paragraph 82 asks for input on whether throttling should be regulated. In the past ISPs have throttled content based on their own determination of what was lawful or permissible, and had to be forced to stop in the courts. Isn\u2019t it possible they could do this again? I\u2019m also concerned by mobile providers who say a plan is \u201cunlimited,\u201d but when you exceed the data cap, only throttle sites and services that aren\u2019t part of their approved zero-rating network. Thanks for reading my comment.", 
      3         "text_data": "I do not support the proposed \"Restoring Internet Freedom\" rules. I believe that, sans government regulation, ISPs have no incentive to provide neutral service to subscribers. I would be willing to support the reversal of Title II classification if and only if it could be guaranteed that net neutrality would remain (e.g. a law more fitting to the situation, passed by Congress). Of course, since Congress and the presidency are controlled by the Republican Party, this is unlikely to happen. Therefore, I vociferously oppose the proposed rules.", 
      3         "text_data": "I support full net neutrality and the classification of ISPs as common carriers under Title II of the Communications Act. Removing these rules will be harmful to consumers. \n\nIn addition, I support the transparency rule requiring ISPs to more clearly disclose hidden fees and data caps. This additional information improves competition between ISPs and enables consumers to make better buying decisions.", 
      4         "text_data": "An ISP throttling bandwidth is like a phone company dropping a phone call because it doesn't like the content of the conversation. Security researchers, as in example, frequently need to visit less reputable sites as part of their research. Throttling that traffic not only slows that research but actually puts everyone who stands to benefit from that research at greater risk. Security gaps now take longer to close. That extra time be just what it takes to lead to a critical compromise. National security, power companies, health insurance, personality identifiable information could all be at greater risk.", 
      4         "text_data": "I\u2019m worried that the protections that are in place will be weakened if we change the way they\u2019re enforced. I would support a new regulation style if it guarantees the same or better protections, but not if we lose any.", 
      4         "text_data": "Please KEEP current net neutrality rules set by the Obama administration. ISP's should NOT be allowed to throttle access to sites not willing to pay up. ISP's should NOT be allowed to charge different prices for preferential treatment. KEEP Title 2 rules in place, keeping ISP's as common carriers like the phone companies.", 
      4         "text_data": "The draft seeks comment on the analysis in Paragraph 27. This analysis purports to show that broadband Internet service is an information service because it provides users the \"capability for generating, acquiring, storing, transforming, processing, retrieving, utilizing, or making available information via telecommunications.\" The argument given is that broadband Internet service allows users to do all these things. However, this is not the same as providing the capability to do these things. To see why, consider that providing users Internet services over dialup phone lines also allows users to do all these things; but the phone lines themselves are telecommunications services, not information services. Why? Because providing the user dialup Internet, by itself, does not provide them the capability to do all these things. That capability is provided by the endpoints: the users' computers, and the computers hosting the Internet services that the users connect to.\nExactly the same is true of broadband Internet services provided by ISPs: by themselves, they do not provide users the capability to do all these things. They only provide connections between computers at the endpoints that provide those capabilities. It is the services provided by the Internet hosts that users connect to that are \"information services\". The broadband Internet services that allow users to connect to those hosts are telecommunications services, and should be regulated as such.\nISPs object to analyses like the one above because they claim that they also provide the actual information services--in other words, they also provide Internet hosts that function as email servers, web servers, etc. But it is obvious that those services are separate from the broadband connection services provided by those same ISPs, because users can make use of the latter without making use of the former at all. 
I am such a user: I use the broadband Internet connection provided by my ISP, but I do not use any of the information services they provide; I do not use their email, their web hosting, etc. I use other Internet hosts provided by other companies for those services. The fact that ISPs offer information services as well as telecommunications services does not make their telecommunications services into information services; an ISP's choice of business model cannot change the nature of a particular service it provides. Broadband Internet connections are obviously a telecommunications service, and should be regulated as such, regardless of what other services ISPs would like to bundle with them. The FCC should continue to regulate broadband Internet service as a telecommunications service.", 
    652         "text_data": "Obama\u2019s Title II order has diminished broadband investment, stifled innovation, and left American consumers potentially on the hook for a new broadband tax.\r\n\r\nThese regulations ended a decades-long bipartisan consensus that the Internet should be regulated through a light touch framework that worked better than anyone could have imagined and made the Internet what it is.\r\n\r\nFor these reasons I urge you to fully repeal the Obama/Wheeler Internet regulations.", 

But that sentence only appears another ~3300 times in the rest of the comments (a total of 3963 times).

$ grep "Title II order has diminished broadband investment, stifled innovation, and left American consumers potentially on the hook for a new broadband tax."  notTop3Comments.txt | wc -l
3963

If you keep filtering comments like that, looking for overrepresented modes in portions of the comments, you can start seeing some comment patterns emerge...
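
The filter-then-look-again loop described here can be wrapped in two small helpers. This is only a sketch of the manual process; `top_modes` and `drop_mode` are hypothetical names, and the fixed-string match is an assumption about what counts as "the same comment":

```shell
# top_modes N: read lines on stdin and print the N most frequent distinct
# lines with their counts, most frequent first.
top_modes() {
    sort | uniq -c | sort -rn | head -n "$1"
}

# drop_mode PATTERN: pass through only the lines that do NOT contain the
# fixed string PATTERN (no regex interpretation).
drop_mode() {
    grep -vF -- "$1"
}
```

For example, `head -1000 notTop3Comments.txt | top_modes 3` shows the current modes, and piping the file through `drop_mode "Title II order has diminished broadband investment"` before the next `top_modes` call removes that one from consideration.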


But note that there are sometimes blocks of "unique" "comments." For instance, among the first 10k of the last 6M "comments" there are 10,000 unique comments.

$ tail -6000000 notTop3Comments.txt | head -10000 | sort | uniq -c | sort -n | wc -l
10000

SO THERE ARE 10000 SEQUENTIAL COMMENTS THAT ARE ALL UNIQUE! WAT!? THOSE MUST BE REAL COMMENTS, RIGHT?! I GOTTA SEE THESE!

$ tail -6000000 notTop3Comments.txt | head -10000 | sort | uniq -c | sort -n | tail
      1         "text_data": "With respect to the future of the Internet. I strongly request Ajit Pai to repeal President Obama's scheme to take over Internet access. Internet users, as opposed to unelected bureaucrats, deserve to buy whatever applications they want. President Obama's scheme to take over Internet access is a exploitation  of net neutrality. It reversed a light-touch policy that worked fabulously successfully for a long time with nearly universal support.", 
      1         "text_data": "With respect to the future of the Internet. I want to implore the commission to repeal Obama's power grab to regulate broadband. Individuals, not Washington bureaucrats, should be able to use which services they want. Obama's power grab to regulate broadband is a exploitation  of net neutrality. It ended a pro-consumer framework that functioned supremely well for many years with Republican and Democrat backing.", 
      1         "text_data": "With respect to the future of the Internet. I would like to implore Chairman Pai to reverse Obama's plan to take over the Internet. Individuals, as opposed to Washington bureaucrats, ought to select which products they want. Obama's plan to take over the Internet is a perversion of the open Internet. It undid a pro-consumer policy that worked fabulously smoothly for a long time with broad bipartisan consensus.", 
      1         "text_data": "With respect to the Obama takeover of the Internet. I'd like to encourage Chairman Pai to overturn President Obama's plan to take over broadband. Internet users, not the FCC, should be free to purchase whatever services they choose. President Obama's plan to take over broadband is a perversion of the open Internet. It reversed a light-touch policy that performed very smoothly for decades with Republican and Democrat consensus.", 
      1         "text_data": "With respect to the Obama takeover of the Internet. I would like to request Chairman Pai to repeal Obama's policy to regulate Internet access. Individual citizens, as opposed to Washington bureaucrats, deserve to enjoy the products we desire. Obama's policy to regulate Internet access is a perversion of net neutrality. It reversed a pro-consumer framework that worked remarkably well for decades with Republican and Democrat support.", 
      1         "text_data": "With respect to the Open Internet order. I'd like to encourage the commission to reverse Barack Obama's scheme to take over Internet access. Individual citizens, rather than big government, should be able to select the services they want. Barack Obama's scheme to take over Internet access is a betrayal of net neutrality. It stopped a light-touch framework that functioned remarkably well for many years with broad bipartisan approval.", 
      1         "text_data": "With respect to the Open Internet order. I would like to urge Ajit Pai to undo Tom Wheeler's policy to control broadband. Americans, rather than Washington, ought to select which applications they desire. Tom Wheeler's policy to control broadband is a exploitation  of net neutrality. It broke a pro-consumer approach that functioned exceptionally well for a long time with nearly universal support.", 
      1         "text_data": "With respect to Title 2 and net neutrality. I would like to implore Ajit Pai to rescind The previous administration's scheme to control Internet access. Americans, rather than the FCC, ought to buy the products we desire. The previous administration's scheme to control Internet access is a corruption of the open Internet. It stopped a pro-consumer system that performed supremely successfully for a long time with nearly universal support.", 
      1         "text_data": "With respect to Title II rules. I strongly advocate the commission to reverse Tom Wheeler's scheme to control the web. Citizens, as opposed to Washington, should be able to purchase whatever products they prefer. Tom Wheeler's scheme to control the web is a distortion of the open Internet. It ended a market-based policy that performed very, very smoothly for many years with broad bipartisan backing.", 
      1         "text_data": "With respect to Title II rules. I strongly request you to overturn Barack Obama's order to take over the web. People like me, not unelected bureaucrats, deserve to purchase whatever products they prefer. Barack Obama's order to take over the web is a betrayal of the open Internet. It undid a hands-off approach that performed very, very smoothly for two decades with broad bipartisan approval.", 

While they're all "unique", all of these comments seem to be "negative" toward net neutrality and the Obama administration, and there's a common pattern to these (though they don't ever repeat verbatim):


Short example "unique but semantically equivalent" phrases over 4 tokens:

  Barack Obama's order to take over the web 
  Tom Wheeler's scheme to control the web
  Tom Wheeler's policy to control broadband
  Obama's policy to regulate Internet access

  [[ Barack Obama's / Tom Wheeler's / Obama's ]]
  [[ order / scheme / policy ]]
  to
  [[ take over / control / regulate ]]
  [[ the web / broadband / Internet access]]
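
To see how quickly a few slots multiply, here is a small sketch that expands just the four slots listed above into every phrase they can produce (3*3*3*3 = 81). The function name `expand_template` is made up for illustration, and the option lists are only the short examples above, not the full template:

```shell
# expand_template: print every phrase the four example slots can generate.
# Options are passed as "/"-separated strings and split inside awk.
expand_template() {
    awk -v who="Barack Obama's/Tom Wheeler's/Obama's" \
        -v what="order/scheme/policy" \
        -v verb="take over/control/regulate" \
        -v obj="the web/broadband/Internet access" '
    BEGIN {
        nwho  = split(who,  W, "/")
        nwhat = split(what, T, "/")
        nverb = split(verb, V, "/")
        nobj  = split(obj,  O, "/")
        # One nested loop per slot; the literal "to" is the fixed text between slots.
        for (i = 1; i <= nwho; i++)
            for (j = 1; j <= nwhat; j++)
                for (k = 1; k <= nverb; k++)
                    for (l = 1; l <= nobj; l++)
                        print W[i], T[j], "to", V[k], O[l]
    }'
}
```

Running `expand_template | wc -l` counts the generated phrases, and all four example phrases above appear somewhere in the output.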

The complete pattern for the 10,000 comments over 30 tokens is:


Column Description
#T : number of possible token values (strings)
TID : token ID (1 of 30)
[[ T[1] / T[2] / ... / T[#T] ]] : all the possible values of that token
#T  TID  [[ T[1] / T[2] / ... / T[#T] ]]
===========================================
13     1 [["Dear Chairman Pai, " / "To the FCC: " / "Hi, " / "Chairman Pai: " / "Dear Commissioners: " / "Dear FCC, " / "To the Federal Communications Commission: " / "FCC commissioners, " / "Dear Mr. Pai, " / "FCC: " / "To whom it may concern: " / "Mr Pai: " / "" ]]  
20     2 [[ "I am a voter worried about " / "I'm very concerned about " / "I want to give my opinion on " / "With respect to " / "I'm very worried about " / "Regarding the future of " / "In reference to " / "My comments re: " / "Regarding " / "I'd like to comment on "  / "I'd like to comment on " / "I would like to comment on " / "I'm concerned about " / "I have concerns about " /  "In the matter of " / "I have thoughts on " / "I'm a voter worried about " / "I'm contacting you about " / "I am concerned about " / "I'd like to share my thoughts on "]]
26     3 [[ "the Open Internet order" / "Internet freedom" / "Internet freedom" / "NET NEUTRALITY" / "Internet regulation and net neutrality" / "the future of the Internet" / "network neutrality regulations" / "so-called Open Internet order" / "Title II rules" / "an open Internet" / "net neutrality rules" / "Net neutrality" / "internet regulations" / "the FCC's Open Internet order" / "Network Neutrality" / "Title 2 and net neutrality" / "the FCC rules on the Internet" / "the Obama takeover of the Internet" / "net neutrality regulations" / "Net Neutrality and Title II" / "net neutrality and Title II" / "the FCC regulations on the Internet" / "regulations on the Internet" / "Internet Freedom" / "Internet regulation" / "net neutrality" ]]
5      4 [[ "I strongly " / "I want to " /  "I would like to " / "I " / "I'd like to " ]]
9      5 [[ implore / urge / suggest / recommend / demand / advocate / request / ask / encourage ]]
8      6 [[ "the commissioners" / "Ajit Pai" / "the commission" / "the government" / "the FCC to " / " you " / "the Federal Communications Commission" / "Chairman Pai to" ]]
      to 
5      7 [[ rescind / overturn / reverse / repeal / undo ]]
      [[ tokens N-14, N-13, " to ", N-12, N-11]]
      . 
7      8 [[ People like me / Individual Americans / Citizens / Individuals / Internet users / Individual citizens / Americans ]]
      ,
3      9 [[ not / as opposed to / rather than ]]
7      10 [[ unelected bureaucrats / Washington / Washington / so-called experts / the FCC / the FCC / big government]]
6      11 [[ deserve to / should be able to / ought to / should be free to / should / should ]]
      to
5      12 [[ purchase / buy / select / enjoy / use ]]
4      13 [[ whatever / whichever / which / the]]
3      14 [[ products / services / applications ]]
      they 
4      15 [[ desire / want / prefer / choose ]]
      .
6      16 [[ Tom Wheeler's / Barack Obama's / The Obama/Wheeler / President Obama's / The previous administration's / Obama's ]]
6      17 [[ scheme / order / power grab / decision / plan / policy ]]
      to
3      18 [[ take over / control / regulate ]]
4      19 [[ Internet access / broadband / the web / the Internet ]]
      is a 
5      20 [[ betrayal / distortion / perversion / corruption / exploitation ]]
      of
2      21 [[ the open Internet / net neutrality ]]
      
      It 
6      22 [[ reversed / stopped / ended / disrupted / undid / broke ]]
      a 
6      23 [[ hands-off / free-market / market-based / light-touch / pro-consumer / hands-off ]]
4      24 [[ approach / policy / system / framework ]]
      that 
3      25 [[ performed / worked / functioned ]]
6      26 [[ fabulously / supremely / very, very / exceptionally / very / remarkably ]] 
3      27 [[ smoothly / successfully / well ]] 
      for 
4      28 [[ two decades / many years / a long time / decades ]]
      with 
5      29 [[ Republican and Democrat / broad bipartisan / nearly universal / bipartisan / both parties ]]
4      30 [[ approval / backing / support / consensus ]]
      ."

After doing a bit of counting with grep, it became clear that they were sampling each possible token value uniformly, so we can compute how many unique comments they COULD HAVE generated:

13*20*26*5*9*8*5*7*3*7*6*5*4*3*4*6*6*3*4*5*2*6*6*4*3*6*3*4*5*4 = 6,921,958,857,375,744,000,000 ≈ 6.9e21 = 6.9 sextillion comments = 6.9 zettacomments 

from only

13+20+26+5+9+8+5+7+3+7+6+5+4+3+4+6+6+3+4+5+2+6+6+4+3+6+3+4+5+4 = 192 token symbols across the 30 token positions

So they could've autogenerated up to ~6.9e21 unique comments (somewhat fewer in practice, since a few option lists above repeat a value), but they only submitted like 10^6 ish of them. How honorable!
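
The product is too big for 64-bit shell arithmetic, so here is a hedged sketch that checks both the product and the sum with awk's floating point instead. The counts are copied from the #T column of the table; `combos` is a hypothetical helper name:

```shell
# Option counts (#T) for the 30 token positions, copied from the table above.
counts="13 20 26 5 9 8 5 7 3 7 6 5 4 3 4 6 6 3 4 5 2 6 6 4 3 6 3 4 5 4"

# combos: print the number of positions, the product of the counts in
# scientific notation, and their sum. awk uses double-precision floats,
# so the product does not overflow the way $(( ... )) would.
combos() {
    echo "$counts" | awk '{
        prod = 1; sum = 0
        for (i = 1; i <= NF; i++) { prod *= $i; sum += $i }
        printf "positions=%d product=%.2e sum=%d\n", NF, prod, sum
    }'
}

combos
# prints: positions=30 product=6.92e+21 sum=192
```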


Here's some pseudocode on how to "Unlex" some of these FCC comments:

"POS substitution detection and caching algorithm:"
===========================================================
(0) initialize a corpus with all the unique comments
(1) find an unusually common short character sequence (POSsub) in the remaining corpus
(2) cache the POSsub
(3) compute the remaining corpus after removing the POSsub
(go to 1)
(4) verify that when you search for what's left after ignoring all the POSsub matches, there's nothing left (completeness)
(5) Increase the length of each POSsub to its maximum
(6) Remove the POSsubs from the initial corpus... repeat
(7) Verify that what's left is amenable to the same sort of POSsub detection and caching
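
Step (1) is the part that is easiest to automate. The following is only a rough sketch of one way to do it, assuming a word-level heuristic (count the first K words of every comment) rather than the character-level search the pseudocode describes. `strip_key` and `prefix_counts` are hypothetical helper names:

```shell
# strip_key: drop the leading '"text_data": "' so each line starts with
# the comment text itself.
strip_key() {
    sed 's/^[[:space:]]*"text_data": "//'
}

# prefix_counts K: take the first K whitespace-separated words of each
# stdin line and print how often each distinct prefix occurs, most
# frequent first. A slot with few options shows up as a short list of
# prefixes with roughly equal counts.
prefix_counts() {
    awk -v k="$1" '{
        out = $1
        for (i = 2; i <= k && i <= NF; i++) out = out " " $i
        print out
    }' | sort | uniq -c | sort -rn
}
```

For instance, `tail -6000000 notTop3Comments.txt | head -20000 | strip_key | prefix_counts 3 | head` lists the most common three-word openers, which can then be cached and removed as in steps (2) and (3).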

That algorithm gives you these 30 tokens:

"Dear Chairman Pai, " + "To the FCC: " + "Hi, " + "Chairman Pai: " + "Dear Commissioners: " + "Dear FCC, " + "To the Federal Communications Commission: " + "FCC commissioners, " + "Dear Mr. Pai, " + "FCC: " + "To whom it may concern: " + "Mr Pai: " + "" 
"I am a voter worried about " + "I'm very concerned about " + "I want to give my opinion on " + "With respect to " + "I'm very worried about " + "Regarding the future of " + "In reference to " + "My comments re: " + "Regarding " + "I'd like to comment on "  + "I'd like to comment on " + "I would like to comment on " + "I'm concerned about " + "I have concerns about " +  "In the matter of " + "I have thoughts on " + "I'm a voter worried about " + "I'm contacting you about " + "I am concerned about " + "I'd like to share my thoughts on "
"the Open Internet order" + "Internet freedom" + "Internet freedom" + "NET NEUTRALITY" + "Internet regulation and net neutrality" + "the future of the Internet" + "network neutrality regulations" + "so-called Open Internet order" + "Title II rules" + "an open Internet" + "net neutrality rules" + "Net neutrality" + "internet regulations" + "the FCC's Open Internet order" + "Network Neutrality" + "Title 2 and net neutrality" + "the FCC rules on the Internet" + "the Obama takeover of the Internet" + "net neutrality regulations" + "Net Neutrality and Title II" + "net neutrality and Title II" + "the FCC regulations on the Internet" + "regulations on the Internet" + "Internet Freedom" + "Internet regulation" + "net neutrality" 
"I strongly " + "I want to " +  "I would like to " + "I " + "I'd like to " 
"implore" + "urge" + "suggest" + "recommend" + "demand" + "advocate" + "request" + "ask" + "encourage"
"the commissioners" + "Ajit Pai" + "the commission" + "the government" + "the FCC to " + " you " + "the Federal Communications Commission" + "Chairman Pai to" 
"rescind" + "overturn" + "reverse" + "repeal" + "undo"
"People like me" + "Individual Americans" + "Citizens" + "Individuals" + "Internet users" + "Individual citizens" + "Americans"
"not" + "as opposed to" + "rather than"
"unelected bureaucrats" + "Washington" + "Washington" + "so-called experts" + "the FCC" + "the FCC" + "big government"
"deserve to" + "should be able to" + "ought to" + "should be free to" + "should" + "should"
"purchase" + "buy" + "select" + "enjoy" + "use"
"whatever" + "whichever" + "which" + "the"
"products" + "services" + "applications" 
"desire" + "want" + "prefer" + "choose"
"Tom Wheeler's" + "Barack Obama's" + "The Obama-Wheeler" + "President Obama's" + "The previous administration's" + "Obama's"
"scheme" + "order" + "power grab" + "decision" + "plan" + "policy"
"take over" + "control" + "regulate"
"Internet access" + "broadband" + "the web" + "the Internet"
"betrayal" + "distortion" + "perversion" + "corruption" + "exploitation "
"the open Internet" + "net neutrality"
"reversed" + "stopped" + "ended" + "disrupted" + "undid" + "broke"
"hands-off" + "free-market" + "market-based" + "light-touch" + "pro-consumer" + "hands-off"
"approach" + "policy" + "system" + "framework"
"performed" + "worked" + "functioned"
"fabulously" + "supremely" + "very, very" + "exceptionally" + "very" + "remarkably"
"smoothly" + "successfully" + "well"
"two decades" + "many years" + "a long time" + "decades"
"Republican and Democrat" + "broad bipartisan" + "nearly universal" + "bipartisan" + "both parties"
"approval" + "backing" + "support" + "consensus"

Checking each token's frequency over 20k of these lines looks something like this:


TOKEN 30

For the last POSsub token, there are 4 options--distributed roughly evenly (~5k each of 20k lines):

$ tail -6000000 notTop3Comments.txt | head -20000 | sort | uniq -c | sort -n | grep "approval" | wc -l
5038
$ tail -6000000 notTop3Comments.txt | head -20000 | sort | uniq -c | sort -n | grep "backing" | wc -l
4903
$ tail -6000000 notTop3Comments.txt | head -20000 | sort | uniq -c | sort -n | grep "support" | wc -l
5059
$ tail -6000000 notTop3Comments.txt | head -20000 | sort | uniq -c | sort -n | grep "consensus" | wc -l
5000

And if you ignore lines with those words, no lines are left. Since the four counts above also sum to exactly 20,000, every line contains exactly one of those tokens and none of the others:

$ tail -6000000 notTop3Comments.txt | head -20000 | sort | uniq -c | sort -n | grep -v "consensus" | grep -v "support" | grep -v "backing" | grep -v "approval" | wc -l
0

And you can downselect incrementally from there...
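
The "exactly one option per line" check can also be done in a single pass instead of one grep per option. A sketch under the assumption that the options contain no "|" or backslash characters; `match_histogram` is a hypothetical name:

```shell
# match_histogram OPT1 OPT2 ...: for each stdin line, count how many of the
# given fixed strings it contains, then print "<matches> <lines>" pairs.
match_histogram() {
    # Join the arguments with "|" so they can be handed to awk in one variable.
    opts=$(IFS='|'; printf '%s' "$*")
    awk -v opts="$opts" '
    BEGIN { n = split(opts, opt, "|") }
    {
        hits = 0
        for (i = 1; i <= n; i++) if (index($0, opt[i]) > 0) hits++
        hist[hits]++
    }
    END { for (h = 0; h <= n; h++) if (h in hist) print h, hist[h] }'
}
```

Usage would look like `tail -6000000 notTop3Comments.txt | head -20000 | match_histogram approval backing support consensus`. If the claim above holds, the only row printed should be the one for 1 match; any row for 0 or 2+ matches would point at a missing option or an overlapping substring.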


TOKEN 26

For the next POSsub token there appear to be 6 options:

$ tail -6000000 notTop3Comments.txt | head -20000 | grep "fabulously" | wc -l
3369
$ tail -6000000 notTop3Comments.txt | head -20000 | grep -v "fabulously" | grep "supremely" | wc -l
3373
$ tail -6000000 notTop3Comments.txt | head -20000 | grep -v "fabulously" | grep -v "supremely" | grep "very, very" | wc -l
3240
$ tail -6000000 notTop3Comments.txt | head -20000 | grep -v "fabulously" | grep -v "supremely" | grep -v "very, very" | grep "exceptionally" | wc -l
3411
$ tail -6000000 notTop3Comments.txt | head -20000 | grep -v "fabulously" | grep -v "supremely" | grep -v "very, very" | grep -v "exceptionally" | grep " very " | wc -l
3704 <- 300 off?
$ tail -6000000 notTop3Comments.txt | head -20000 | grep -v "fabulously" | grep -v "supremely" | grep -v "very, very" | grep -v "exceptionally" | grep -v " very " | grep "remarkably" | wc -l
2903

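The ~300-line surplus for " very " (and the matching shortfall for "remarkably") is probably token 2 leaking in: "I'm very concerned about " and "I'm very worried about " also contain " very ", so lines that pair one of those openers with "remarkably" get counted one step early.

And again, none are left when you remove them all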

$ tail -6000000 notTop3Comments.txt | head -20000 | grep -v "fabulously" | grep -v "supremely" | grep -v "very, very" | grep -v "exceptionally" | grep -v " very " | grep -v "remarkably" | wc -l
0

Which looks like this in a bash script:

OLD_IFS=$IFS
PLUS_IFS=+
notTop3CommentsFname=notTop3Comments.txt
# One line per token position; the options on each line are separated by "+"
uniqTokenFname=uniqueTokens.ssv
nTokLines=$(wc -l < "$uniqTokenFname")
# Get 20k lines of "unique" comments
uniqLinesCommand="tail -6000000 $notTop3CommentsFname | head -20000 "
for lineNum in $(seq 1 $nTokLines); do
    echo "==========================================="
    echo Token $lineNum
    echo "==========================================="
    linetmp=$(sed -n "${lineNum}p" "$uniqTokenFname")
    echo "$linetmp"
    # Split the line on "+" into an array of options (each still wrapped in its double quotes)
    IFS=$PLUS_IFS; TOKENARR=($linetmp); IFS="$OLD_IFS"
    for ((i=0; i<${#TOKENARR[@]}; ++i)); do echo "Token $i: ${TOKENARR[$i]}"; done
    nTokens=$(( ${#TOKENARR[@]} - 1 ))   # index of the last option
    echo "nTokens=$nTokens"
    for tokenIndex in $(seq 0 $nTokens); do
        echo $tokenIndex
        grepTerm=${TOKENARR[$tokenIndex]}
        grepPipe="| grep $grepTerm | wc -l"
        # Exclude every option already counted, so each line is counted only once.
        # (A C-style loop avoids "seq 0 -1", which counts down on BSD/macOS seq.)
        preExcludePipe=""
        for ((exTokens=0; exTokens<tokenIndex; ++exTokens)); do
            preExcludePipe="$preExcludePipe | grep -v ${TOKENARR[$exTokens]} "
        done
        echo $grepTerm:
        echo "$uniqLinesCommand $preExcludePipe $grepPipe"
        # eval re-parses the string, so the quotes stored in the token file delimit each grep pattern
        eval "$uniqLinesCommand $preExcludePipe $grepPipe"
    done
done
IFS="$OLD_IFS"

Note that there's also pretty convincing evidence all these "unique" comments were automatically generated. Take, for instance, the unique character sequence "exploitation " (a typo where exploitation is followed by an extra space)--the same typo appears in ~1/5 of these "unique" "comments"

$ tail -6000000 notTop3Comments.txt | head -10000 | sort | uniq -c | sort -n | grep "a exploitation  " | wc -l
1984
$ tail -6000000 notTop3Comments.txt | head -10000 | sort | uniq -c | sort -n | grep -v "a exploitation  " | grep exploitation | wc -l
0

i.e. the ONLY way anyone used the word exploitation was with the wrong article ("a exploitation" instead of "an exploitation") and with an extra " " after it, and they made both mistakes 1984 times. Ironically, to the Federal COMMUNICATIONS Commission.

<Insert 1984 Newspeak joke here.>

From Wikipedia:

https://en.wikipedia.org/wiki/Newspeak

In "The Principles of Newspeak", the appendix to the novel, George Orwell explains that Newspeak usage follows most of the English grammar, yet is a language characterised by a continually diminishing vocabulary; complete thoughts reduced to simple terms of simplistic meaning.[5] Linguistically, the contractions of Newspeak—Ingsoc (English Socialism), Minitrue (Ministry of Truth), etc.—derive from the syllabic abbreviations of Russian, which identify the government and social institutions of the Soviet Union, such as politburo (Politburo of the Central Committee of the Communist Party of the Soviet Union), Comintern (Communist International), kolkhoz (collective farm), and Komsomol (Young Communists' League). The long-term political purpose of the new language is for every member of the Party and society, except the Proles—the working-class of Oceania—to exclusively communicate in Newspeak, by the year A.D. 2050; during that 66-year transition, the usage of Oldspeak (Standard English) shall remain interspersed among Newspeak conversations.[6]

Newspeak is also a constructed language, of planned phonology, grammar, and vocabulary, like Basic English, which Orwell promoted (1942–44) during the Second World War (1939–45), and later rejected in the essay "Politics and the English Language" (1946), wherein he criticises the bad usage of English in his day: dying metaphors, pretentious diction, and high-flown rhetoric, which produce the meaningless words of doublespeak, the product of unclear reasoning. Orwell's conclusion thematically reiterates linguistic decline: "I said earlier that the decadence of our language is probably curable. Those who deny this may argue that language merely reflects existing social conditions, and that we cannot influence its development, by any direct tinkering with words or constructions."[7]
"Dear Chairman Pai, " + "To the FCC: " + "Hi, " + "Chairman Pai: " + "Dear Commissioners: " + "Dear FCC, " + "To the Federal Communications Commission: " + "FCC commissioners, " + "Dear Mr. Pai, " + "FCC: " + "To whom it may concern: " + "Mr Pai: " + ""
"I am a voter worried about " + "I'm very concerned about " + "I want to give my opinion on " + "With respect to " + "I'm very worried about " + "Regarding the future of " + "In reference to " + "My comments re: " + "Regarding " + "I'd like to comment on " + "I'd like to comment on " + "I would like to comment on " + "I'm concerned about " + "I have concerns about " + "In the matter of " + "I have thoughts on " + "I'm a voter worried about " + "I'm contacting you about " + "I am concerned about " + "I'd like to share my thoughts on "
"the Open Internet order" + "Internet freedom" + "Internet freedom" + "NET NEUTRALITY" + "Internet regulation and net neutrality" + "the future of the Internet" + "network neutrality regulations" + "so-called Open Internet order" + "Title II rules" + "an open Internet" + "net neutrality rules" + "Net neutrality" + "internet regulations" + "the FCC's Open Internet order" + "Network Neutrality" + "Title 2 and net neutrality" + "the FCC rules on the Internet" + "the Obama takeover of the Internet" + "net neutrality regulations" + "Net Neutrality and Title II" + "net neutrality and Title II" + "the FCC regulations on the Internet" + "regulations on the Internet" + "Internet Freedom" + "Internet regulation" + "net neutrality"
"I strongly " + "I want to " + "I would like to " + "I " + "I'd like to "
"implore" + "urge" + "suggest" + "recommend" + "demand" + "advocate" + "request" + "ask" + "encourage"
"the commissioners" + "Ajit Pai" + "the commission" + "the government" + "the FCC to " + " you " + "the Federal Communications Commission" + "Chairman Pai to"
"rescind" + "overturn" + "reverse" + "repeal" + "undo"
"People like me" + "Individual Americans" + "Citizens" + "Individuals" + "Internet users" + "Individual citizens" + "Americans"
"not" + "as opposed to" + "rather than"
"unelected bureaucrats" + "Washington" + "Washington" + "so-called experts" + "the FCC" + "the FCC" + "big government"
"deserve to" + "should be able to" + "ought to" + "should be free to" + "should" + "should"
"purchase" + "buy" + "select" + "enjoy" + "use"
"whatever" + "whichever" + "which" + "the"
"products" + "services" + "applications"
"desire" + "want" + "prefer" + "choose"
"Tom Wheeler's" + "Barack Obama's" + "The Obama-Wheeler" + "President Obama's" + "The previous administration's" + "Obama's"
"scheme" + "order" + "power grab" + "decision" + "plan" + "policy"
"take over" + "control" + "regulate"
"Internet access" + "broadband" + "the web" + "the Internet"
"betrayal" + "distortion" + "perversion" + "corruption" + "exploitation "
"the open Internet" + "net neutrality"
"reversed" + "stopped" + "ended" + "disrupted" + "undid" + "broke"
"hands-off" + "free-market" + "market-based" + "light-touch" + "pro-consumer" + "hands-off"
"approach" + "policy" + "system" + "framework"
"performed" + "worked" + "functioned"
"fabulously" + "supremely" + "very, very" + "exceptionally" + "very" + "remarkably"
"smoothly" + "successfully" + "well"
"two decades" + "many years" + "a long time" + "decades"
"Republican and Democrat" + "broad bipartisan" + "nearly universal" + "bipartisan" + "both parties"
"approval" + "backing" + "support" + "consensus"
OLD_IFS=$IFS;
PLUS_IFS=+
notTop3CommentsFname=notTop3Comments.txt
uniqTokenFname=uniqueTokens.ssv
nTokLines=`wc -l < $uniqTokenFname `
# Get 20k lines of "unique" comments
uniqLinesCommand="tail -6000000 $notTop3CommentsFname | head -20000 "
IFS="$OLD_IFS"
for lineNum in $(seq 1 $nTokLines); do
echo "==========================================="
echo Token $lineNum
echo "==========================================="
linetmp=`head -$lineNum $uniqTokenFname | tail -1`
echo $linetmp
IFS=$PLUS_IFS; TOKENARR=($linetmp);
for ((i=0; i<${#TOKENARR[@]}; ++i)); do echo "Token $i: ${TOKENARR[$i]}"; done
nTokens=$((i-1)) # count tokens
echo "nTokens=$nTokens"
IFS="$OLD_IFS"
for tokenIndex in $(seq 0 $nTokens); do echo $tokenIndex;grepTerm=${TOKENARR[$tokenIndex]}; grepPipe="| grep $grepTerm | wc -l";preExcludePipe="";
for exTokens in $(seq 0 $(( tokenIndex - 1))); do excludeTerm=${TOKENARR[$exTokens]}; excludePipe="| grep -v $excludeTerm "; preExcludePipe="$preExcludePipe $excludePipe";
done
echo $grepTerm:
echo "$uniqLinesCommand $preExcludePipe $grepPipe"
eval "$uniqLinesCommand $preExcludePipe $grepPipe"
done
done
IFS="$OLD_IFS";
"Dear Chairman Pai, " + "To the FCC: " + "Hi, " + "Chairman Pai: " + "Dear Commissioners: " + "Dear FCC, " + "To the Federal Communications Commission: " + "FCC commissioners, " + "Dear Mr. Pai, " + "FCC: " + "To whom it may concern: " + "Mr Pai: " + ""
"I am a voter worried about " + "I'm very concerned about " + "I want to give my opinion on " + "With respect to " + "I'm very worried about " + "Regarding the future of " + "In reference to " + "My comments re: " + "Regarding " + "I'd like to comment on " + "I'd like to comment on " + "I would like to comment on " + "I'm concerned about " + "I have concerns about " + "In the matter of " + "I have thoughts on " + "I'm a voter worried about " + "I'm contacting you about " + "I am concerned about " + "I'd like to share my thoughts on "
"the Open Internet order" + "Internet freedom" + "Internet freedom" + "NET NEUTRALITY" + "Internet regulation and net neutrality" + "the future of the Internet" + "network neutrality regulations" + "so-called Open Internet order" + "Title II rules" + "an open Internet" + "net neutrality rules" + "Net neutrality" + "internet regulations" + "the FCC's Open Internet order" + "Network Neutrality" + "Title 2 and net neutrality" + "the FCC rules on the Internet" + "the Obama takeover of the Internet" + "net neutrality regulations" + "Net Neutrality and Title II" + "net neutrality and Title II" + "the FCC regulations on the Internet" + "regulations on the Internet" + "Internet Freedom" + "Internet regulation" + "net neutrality"
"I strongly " + "I want to " + "I would like to " + "I " + "I'd like to "
"implore" + "urge" + "suggest" + "recommend" + "demand" + "advocate" + "request" + "ask" + "encourage"
"the commissioners" + "Ajit Pai" + "the commission" + "the government" + "the FCC to " + " you " + "the Federal Communications Commission" + "Chairman Pai to"
"rescind" + "overturn" + "reverse" + "repeal" + "undo"
"People like me" + "Individual Americans" + "Citizens" + "Individuals" + "Internet users" + "Individual citizens" + "Americans"
"not" + "as opposed to" + "rather than"
"unelected bureaucrats" + "Washington" + "Washington" + "so-called experts" + "the FCC" + "the FCC" + "big government"
"deserve to" + "should be able to" + "ought to" + "should be free to" + "should" + "should"
"purchase" + "buy" + "select" + "enjoy" + "use"
"whatever" + "whichever" + "which" + "the"
"products" + "services" + "applications"
"desire" + "want" + "prefer" + "choose"
"Tom Wheeler's" + "Barack Obama's" + "The Obama-Wheeler" + "President Obama's" + "The previous administration's" + "Obama's"
"scheme" + "order" + "power grab" + "decision" + "plan" + "policy"
"take over" + "control" + "regulate"
"Internet access" + "broadband" + "the web" + "the Internet"
"betrayal" + "distortion" + "perversion" + "corruption" + "exploitation "
"the open Internet" + "net neutrality"
"reversed" + "stopped" + "ended" + "disrupted" + "undid" + "broke"
"hands-off" + "free-market" + "market-based" + "light-touch" + "pro-consumer" + "hands-off"
"approach" + "policy" + "system" + "framework"
"performed" + "worked" + "functioned"
"fabulously" + "supremely" + "very, very" + "exceptionally" + "very" + "remarkably"
"smoothly" + "successfully" + "well"
"two decades" + "many years" + "a long time" + "decades"
"Republican and Democrat" + "broad bipartisan" + "nearly universal" + "bipartisan" + "both parties"
"approval" + "backing" + "support" + "consensus"