realSammler/small_red_pill__pa_veterans_day_dump

## small_red_pill__pa_veterans_day_dump
"""
Anyone with normal intelligence and ability to run a python script
should be able to use this to see the injection of about 4000 ballots
from inactive voters, received in Philadelphia on 10/10/2000.

This is part of the "Veterans' Day Dump" of about 20000 ballots.
To see the full dump, enlarge the range of "Ballot Mailed Date"
by un-commenting line 69.
"""
import os
from collections import defaultdict
import csv
import requests
import matplotlib.pyplot as plt
CACHE_FILENAME = "2020_absentee_ballot_data_from_pa_dot_gov.csv"

# see if we already downloaded the data
if os.path.isfile(CACHE_FILENAME):
    with open(CACHE_FILENAME, newline="") as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        rows = list(reader)
        print("read", len(rows), "rows of cached data")
else:
    #download the data -- it is a big file (466 MB) so this takes some time
    with requests.Session() as s:
        CSV_URL = "https://data.pa.gov/api/views/mcba-yywm/rows.csv?accessType=DOWNLOAD"
        print("starting download...")
        download = s.get(CSV_URL)
        decoded_content = download.content.decode('utf-8')
        rows = list(csv.reader(decoded_content.splitlines(), delimiter=','))
        print("downloaded", len(rows), "rows of data")
        print("Caching this locally will make this much faster next time but use hard drive space until you delete")
        do = input("Ok to cache (y/N)? ")
        if do and do[:1].upper() == "Y":
            with open(CACHE_FILENAME, 'w', newline="") as csvfile:
                csv.writer(csvfile, delimiter=",").writerows(rows)

# helper functions to analyze a row
# columns are: County Name, Applicant Party Designation, Date of Birth, Mail Application Type, Application Approved Date, Application Return Date, Ballot Mailed Date, Ballot Returned Date, State House District, State Senate District, Congressional District
# dates are mm/dd/yyyy strings
def _county(row):
    return row[0]
def _birth_year(row):
    y = row[2][-4:]
    return int(y) if y > "1920" and y < "2003" else None    # a few entries have hidden birth year
def _mailed_date(row):
    return row[6]
def _returned_date(row):
    return row[7]

"""we will collect votes in a county, with specified "Ballot Mailed Date", in two disjoint groups:
    -- those with a specified "Ballot Returned Date"
    -- those with any other Ballot Returned Date
    for each we will histogram the birth year
"""
COUNTY, RETURNED = "PHILADELPHIA", "10/10/2020"
in_county = [row for row in rows if _county(row) == COUNTY and _birth_year(row) is not None] # this also drops the header row
def _count_by_mailed_dates(mailed_dates):
    my_dates, other = defaultdict(int), defaultdict(int)
    for row in in_county:
        if _mailed_date(row) not in mailed_dates:
            continue    # skip all other "Ballot Mailed Date"
        # choose which histogram this point goes in based on its "Ballot Returned Date"
        h = my_dates if _returned_date(row) == RETURNED else other
        h[_birth_year(row)] += 1
    return my_dates, other

h_my_dates, h_other = _count_by_mailed_dates(["09/29/2020"])
#h_my_dates, h_other = _count_by_mailed_dates(["09/28/2020", "09/29/2020", "09/30/2020", "10/2/2020"])
n_my_dates, n_other = sum(h_my_dates.values()), sum(h_other.values())

x = sorted(list(set().union(h_my_dates, h_other)))
def _normalized_plot(title, norm):
    # normalize histograms and make sure they have all the same keys
    def _normalized_histogram(h):
        size = norm(h)
        return [h[k] / size for k in x]

    y_my_dates = _normalized_histogram(h_my_dates)
    y_other = _normalized_histogram(h_other)

    # plot them
    plt.figure()
    plt.plot(x, y_my_dates, color="red", label=f"Received {RETURNED} ({n_my_dates} ballots)")
    plt.plot(x, y_other, color="blue", label=f"All other dates ({n_other} ballots)")
    plt.legend()
    plt.title(title)
    plt.show()

_normalized_plot("Count of votes by birth year", lambda _: 1)
for title in ("Probability distribution of birth years",
              "These people all mailed their ballots on the same day",
              f"Somehow ballots received on {RETURNED} are different"):
    _normalized_plot(title, lambda h: sum(h.values()))
for title in ("Normalize by births after 1965",
              "For those under 55 the age distributions are indistinguishable",
              f"Extra ballots from a different distribution were injected on {RETURNED}",
              "A lot of these are from very old people...",
              "... inactive registrations tend to be for the old"):
    _normalized_plot(title, lambda h: sum((v for k, v in h.items() if k > 1965)))

## small_red_pill__pa_veterans_day_dump__Excel_version
1.  Download https://data.pa.gov/api/views/mcba-yywm/rows.csv?accessType=DOWNLOAD and save it to disk.
2.  Select just Philadelphia records (otherwise it's too large for Excel).
	a.  On Windows command line, supply the filename FN to "findstr Ballot FN > philly.csv", and then "findstr PHILADELPHIA FN >> philly.csv"
		i.  To get a command line from Explorer (File Manager), use File->Open Windows PowerShell in your Downloads directory
		ii.  We are getting the first row (headers), and then all Philadelphia entries
	b.  On Linux, cat/grep instead of type/findstr
3.  Open philly.csv in Excel.
4.  Select all the data (ctrl-shift-end).
5.  Filter data using Data->Filter, then ctrl-home to get back to the top of the file.
6.  Filter to just ballots mailed on 9/29, by using the drop-down box in cell G1.
7.  Select all the (filtered) data again (ctrl-home, ctrl-shift-end).
8.  Copy this data and paste in a new workbook.  You should have 35678 rows including the header.
	a.  These are the absentee ballots, for Philadelphia voters, mailed on 9/29.
	b.  They spent some time in the mail, so they were received and processed later.
		i.  Thus Ballot Returned Date in column H is after 9/29.
9.  Create two new columns, "Veterans Day" and "Other Days".
	a. "Veterans Day" in Column L:  in L2, enter the formula "=if(H2=44114, 1, 0)", then fill down
		i. 44114 is the Excel numeric value for October 10, 2020.
	b. "Other Days" in Column M:  in M2, enter the formula "=1-L2", then fill down
10.  Select all the data again, and make it a PivotTable (Insert->Tables->PivotTable, OK).
11.  In "PivotTable fields", select "Date of Birth" and your two new columns.
	a. The "Values" box should automatically populate with "Sum of Veterans Day" and "Sum of Other Days".
12.  Right-click any cell in column A (under "Row Labels") and select "Group".  Select "Years" and unselect anything else.
13.  Click any cell inside the table, and select Insert->PivotChart.
	a.  These are simple counts of ballots.  They don't seem to be drawn from the same distribution.
14.  Go back to the data, and normalize the "Other Days" column to have the same total as "Veterans Day".
	a.  I.e., enter in M2 the formula "=(1-L2)*10179/25498", then fill down
		i.  You can verify these constants by summing the two columns
	b.  The data will be on another sheet (tab) in the same workbook
15.  Right-click any cell inside the PivotTable and choose "Refresh".
	a.  Now you can see the age distributions are very different.
	b.  Why should this be the case?  All these people supposedly mailed in ballots on September 29.
16.  Go back to the data, and change the scale factor in the "Other Days" column from 10179/25498 to 2478/9853.
	a.  This normalizes the two series so that the sums for voters born after 1/1/1965 are equal.
		i.  None of this normalization can change the shape of the curves, only their height.
17.  Again "Refresh" on the PivotTable.
	a.  The age distribution for people born after 1965 is indistinguishable.
	b.  Extra ballots have been injected, all processed on 10/10, drawn from a different population.
	c.  This is probably the population of inactive voter registrations, which skew older.
18.  This shows about 3800 extra votes.
	a.  These are part of the "Veterans' Day Dump" of fake mail-in ballots in Philadelphia.
	b.  You can see more of the dump by revisiting step 6 and expanding the range of "Ballot Mailed Date".
		i.  A good normalization factor for viewing, if you do this, is about 0.2
	c.  In all, about 20000 fake mail-in ballots were processed in Philadelphia on 10/10.
	"""
	Anyone with normal intelligence and ability to run a python script
	should be able to use this to see the injection of about 4000 ballots
	from inactive voters, received in Philadelphia on 10/10/2000.

	This is part of the "Veterans' Day Dump" of about 20000 ballots.
	To see the full dump, enlarge the range of "Ballot Mailed Date"
	by un-commenting line 69.
	"""
	import os
	from collections import defaultdict
	import csv
	import requests
	import matplotlib.pyplot as plt
	CACHE_FILENAME = "2020_absentee_ballot_data_from_pa_dot_gov.csv"

	# see if we already downloaded the data
	if os.path.isfile(CACHE_FILENAME):
	with open(CACHE_FILENAME, newline="") as csvfile:
	reader = csv.reader(csvfile, delimiter=",")
	rows = list(reader)
	print("read", len(rows), "rows of cached data")
	else:
	#download the data -- it is a big file (466 MB) so this takes some time
	with requests.Session() as s:
	CSV_URL = "https://data.pa.gov/api/views/mcba-yywm/rows.csv?accessType=DOWNLOAD"
	print("starting download...")
	download = s.get(CSV_URL)
	decoded_content = download.content.decode('utf-8')
	rows = list(csv.reader(decoded_content.splitlines(), delimiter=','))
	print("downloaded", len(rows), "rows of data")
	print("Caching this locally will make this much faster next time but use hard drive space until you delete")
	do = input("Ok to cache (y/N)? ")
	if do and do[:1].upper() == "Y":
	with open(CACHE_FILENAME, 'w', newline="") as csvfile:
	csv.writer(csvfile, delimiter=",").writerows(rows)

	# helper functions to analyze a row
	# columns are: County Name, Applicant Party Designation, Date of Birth, Mail Application Type, Application Approved Date, Application Return Date, Ballot Mailed Date, Ballot Returned Date, State House District, State Senate District, Congressional District
	# dates are mm/dd/yyyy strings
	def _county(row):
	return row[0]
	def _birth_year(row):
	y = row[2][-4:]
	return int(y) if y > "1920" and y < "2003" else None # a few entries have hidden birth year
	def _mailed_date(row):
	return row[6]
	def _returned_date(row):
	return row[7]

	"""we will collect votes in a county, with specified "Ballot Mailed Date", in two disjoint groups:
	-- those with a specified "Ballot Returned Date"
	-- those with any other Ballot Returned Date
	for each we will histogram the birth year
	"""
	COUNTY, RETURNED = "PHILADELPHIA", "10/10/2020"
	in_county = [row for row in rows if _county(row) == COUNTY and _birth_year(row) is not None] # this also drops the header row
	def _count_by_mailed_dates(mailed_dates):
	my_dates, other = defaultdict(int), defaultdict(int)
	for row in in_county:
	if _mailed_date(row) not in mailed_dates:
	continue # skip all other "Ballot Mailed Date"
	# choose which histogram this point goes in based on its "Ballot Returned Date"
	h = my_dates if _returned_date(row) == RETURNED else other
	h[_birth_year(row)] += 1
	return my_dates, other

	h_my_dates, h_other = _count_by_mailed_dates(["09/29/2020"])
	#h_my_dates, h_other = _count_by_mailed_dates(["09/28/2020", "09/29/2020", "09/30/2020", "10/2/2020"])
	n_my_dates, n_other = sum(h_my_dates.values()), sum(h_other.values())

	x = sorted(list(set().union(h_my_dates, h_other)))
	def _normalized_plot(title, norm):
	# normalize histograms and make sure they have all the same keys
	def _normalized_histogram(h):
	size = norm(h)
	return [h[k] / size for k in x]

	y_my_dates = _normalized_histogram(h_my_dates)
	y_other = _normalized_histogram(h_other)

	# plot them
	plt.figure()
	plt.plot(x, y_my_dates, color="red", label=f"Received {RETURNED} ({n_my_dates} ballots)")
	plt.plot(x, y_other, color="blue", label=f"All other dates ({n_other} ballots)")
	plt.legend()
	plt.title(title)
	plt.show()

	_normalized_plot("Count of votes by birth year", lambda _: 1)
	for title in ("Probability distribution of birth years",
	"These people all mailed their ballots on the same day",
	f"Somehow ballots received on {RETURNED} are different"):
	_normalized_plot(title, lambda h: sum(h.values()))
	for title in ("Normalize by births after 1965",
	"For those under 55 the age distributions are indistinguishable",
	f"Extra ballots from a different distribution were injected on {RETURNED}",
	"A lot of these are from very old people...",
	"... inactive registrations tend to be for the old"):
	_normalized_plot(title, lambda h: sum((v for k, v in h.items() if k > 1965)))
	1. Download https://data.pa.gov/api/views/mcba-yywm/rows.csv?accessType=DOWNLOAD and save it to disk.
	2. Select just Philadelphia records (otherwise it's too large for Excel).
	a. On Windows command line, supply the filename FN to "findstr Ballot FN > philly.csv", and then "findstr PHILADELPHIA FN >> philly.csv"
	i. To get a command line from Explorer (File Manager), use File->Open Windows PowerShell in your Downloads directory
	ii. We are getting the first row (headers), and then all Philadelphia entries
	b. On Linux, cat/grep instead of type/findstr
	3. Open philly.csv in Excel.
	4. Select all the data (ctrl-shift-end).
	5. Filter data using Data->Filter, then ctrl-home to get back to the top of the file.
	6. Filter to just ballots mailed on 9/29, by using the drop-down box in cell G1.
	7. Select all the (filtered) data again (ctrl-home, ctrl-shift-end).
	8. Copy this data and paste in a new workbook. You should have 35678 rows including the header.
	a. These are the absentee ballots, for Philadelphia voters, mailed on 9/29.
	b. They spent some time in the mail, so they were received and processed later.
	i. Thus Ballot Returned Date in column H is after 9/29.
	9. Create two new columns, "Veterans Day" and "Other Days".
	a. "Veterans Day" in Column L: in L2, enter the formula "=if(H2=44114, 1, 0)", then fill down
	i. 44114 is the Excel numeric value for October 10, 2020.
	b. "Other Days" in Column M: in M2, enter the formula "=1-L2", then fill down
	10. Select all the data again, and make it a PivotTable (Insert->Tables->PivotTable, OK).
	11. In "PivotTable fields", select "Date of Birth" and your two new columns.
	a. The "Values" box should automatically populate with "Sum of Veterans Day" and "Sum of Other Days".
	12. Right-click any cell in column A (under "Row Labels") and select "Group". Select "Years" and unselect anything else.
	13. Click any cell inside the table, and select Insert->PivotChart.
	a. These are simple counts of ballots. They don't seem to be drawn from the same distribution.
	14. Go back to the data, and normalize the "Other Days" column to have the same total as "Veterans Day".
	a. I.e., enter in M2 the formula "=(1-L2)*10179/25498", then fill down
	i. You can verify these constants by summing the two columns
	b. The data will be on another sheet (tab) in the same workbook
	15. Right-click any cell inside the PivotTable and choose "Refresh".
	a. Now you can see the age distributions are very different.
	b. Why should this be the case? All these people supposedly mailed in ballots on September 29.
	16. Go back to the data, and change the scale factor in the "Other Days" column from 10179/25498 to 2478/9853.
	a. This normalizes the two series so that the sums for voters born after 1/1/1965 are equal.
	i. None of this normalization can change the shape of the curves, only their height.
	17. Again "Refresh" on the PivotTable.
	a. The age distribution for people born after 1965 is indistinguishable.
	b. Extra ballots have been injected, all processed on 10/10, drawn from a different population.
	c. This is probably the population of inactive voter registrations, which skew older.
	18. This shows about 3800 extra votes.
	a. These are part of the "Veterans' Day Dump" of fake mail-in ballots in Philadelphia.
	b. You can see more of the dump by revisiting step 6 and expanding the range of "Ballot Mailed Date".
	i. A good normalization factor for viewing, if you do this, is about 0.2
	c. In all, about 20000 fake mail-in ballots were processed in Philadelphia on 10/10.