@zacharysyoung
Last active October 19, 2023 00:00
SO-77312927

I recommend restructuring your filters from only proceeding (and indenting further) when a criterion passes to skipping the row as soon as any criterion fails. This has a few benefits:

  • it keeps the code from creeping to the right
  • you can add debug messages that print when a row doesn't match
  • you can comment out any single criterion without affecting the others
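A minimal sketch of that skip-on-failure pattern (the rows and criteria here are made up for illustration):

```python
# Guard-clause style: each check rejects the row and moves on,
# so the happy path stays at one indent level.
rows = [
    {"accuracy": "25", "provider": "gps"},
    {"accuracy": "40", "provider": "gps"},
    {"accuracy": "25", "provider": "foo"},
]

kept = []
for row in rows:
    if float(row["accuracy"]) > 30:
        print("dropped: accuracy too high:", row)
        continue
    if row["provider"] not in ("gps", "network"):
        print("dropped: bad provider:", row)
        continue
    kept.append(row)

print(len(kept))  # 1
```

Each new criterion is just another `if ... continue` block, and commenting one out leaves the rest untouched.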

I test the participant IDs differently than you did, but your method of:

  • splitting on the hyphen
  • counting the components
  • getting all the lengths and comparing to [8,4,4,4,12]
  • scanning each char for islower() or isdigit()

was sound.
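A sketch of that split-and-check approach (not your exact code, just the same idea):

```python
def looks_like_id(s: str) -> bool:
    """Check an ID like 'abcdefgh-1234-abcd-5678-abcdefghijkl':
    five hyphen-separated parts with lengths 8-4-4-4-12, every
    char a lowercase letter or a digit."""
    parts = s.split("-")
    if len(parts) != 5:
        return False
    if [len(p) for p in parts] != [8, 4, 4, 4, 12]:
        return False
    return all(c.islower() or c.isdigit() for p in parts for c in p)

print(looks_like_id("abcdefgh-1234-abcd-5678-abcdefghijkl"))  # True
print(looks_like_id("#abcdefgh-1234-abcd-5678-abcdefghijk"))  # False
```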

I also strongly advise against appending; it can cause headaches with CSV files. If I were doing this, I'd filter the input into a separate results file and make sure all the logic works first, because if your logic is wrong after appending you've still written to the original and will need to go back and edit or delete the messed-up rows. Then I'd write a much simpler program to concatenate that results file with a previous file.
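That later concatenation program could be as small as this sketch (the file names combined.csv and output.csv are assumptions; the setup section just creates sample files for the demo):

```python
import csv

# Demo setup (hypothetical file names): a previously combined file
# and a fresh filtered-results file.
with open("combined.csv", "w", newline="", encoding="utf-8") as f:
    f.write("timestamp,provider\n1/9/2023,gps\n")
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    f.write("timestamp,provider\n1/10/2023,network\n")

# Append output.csv's data rows to combined.csv, skipping
# output.csv's header so it isn't duplicated.
with open("output.csv", newline="", encoding="utf-8") as f_new:
    with open("combined.csv", "a", newline="", encoding="utf-8") as f_all:
        reader = csv.reader(f_new)
        next(reader)  # skip the header row
        csv.writer(f_all).writerows(reader)
```

Keeping the concatenation separate means a logic bug in the filter never corrupts your accumulated file.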

import csv
import string

fname_in = "input.csv"
fname_out = "output.csv"

f_in = open(fname_in, "r", newline="", encoding="utf-8")
f_out = open(fname_out, "w", newline="", encoding="utf-8")

reader = csv.DictReader(f_in)
fieldnames = reader.fieldnames
assert fieldnames is not None
writer = csv.DictWriter(f_out, fieldnames=fieldnames)
writer.writeheader()


VALID_ID_CHARS = string.ascii_lowercase + string.digits


def print_dropped(row: dict[str, str], reason: str):
    print(f"dropped row because {reason}: {row}:")


for row in reader:
    if row["timestamp"][-4:] != "2023":
        print_dropped(row, "timestamp not in 2023")
        continue

    if float(row["accuracy"]) > 30:
        print_dropped(row, "accuracy greater than 30")
        continue

    participant_id = row["participant_ID"]
    if participant_id.count("-") != 4:
        print_dropped(row, "not enough dashes in Participant_ID")
        continue

    participant_id = participant_id.replace("-", "")
    if len(participant_id) != 32:
        print_dropped(row, "Participant_ID != 32 chars")
        continue

    # any() replaces the flag-and-break loop; behavior is identical
    if any(char not in VALID_ID_CHARS for char in participant_id):
        print_dropped(row, "Participant_ID has illegal chars")
        continue

    lat = float(row["double_latitude"])
    long = float(row["double_longitude"])
    if lat < -180 or lat > 180 or long < -180 or long > 180:
        print_dropped(row, "lat or long out of bounds")
        continue

    altitude = float(row["double_altitude"])
    if altitude < 0 or altitude > 1000:
        print_dropped(row, "altitude out of bounds")
        continue

    if row["provider"] not in ["gps", "network"]:
        print_dropped(row, "provider != gps or network")
        continue

    writer.writerow(row)

f_in.close()
f_out.close()

Once you like the results of your filter logic, you can comment out the print(...) line in print_dropped() and add a pass, or remove the individual calls to the function.
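For example, the silenced version of the helper would just be:

```python
def print_dropped(row: dict[str, str], reason: str):
    # print(f"dropped row because {reason}: {row}")
    pass
```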

I tested with this input:

timestamp,accuracy,participant_ID,double_latitude,double_longitude,double_altitude,provider
1/1/2000,,,,,,
1/2/2023,40,,,,,
1/3/2023,30,#abc,,,,
1/4/2023,30,#abcdefgh-1234-abcd-5678-abcdefghij,,,,
1/5/2023,30,#abcdefgh-1234-abcd-5678-abcdefghijk,,,,
1/6/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,-290,290,-500,
1/7/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,
1/8/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,foo
1/9/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,gps
1/10/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,150,150,500,network

got this debug output:

dropped row because timestamp not in 2023: {'timestamp': '1/1/2000', 'accuracy': '', 'participant_ID': '', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because accuracy greater than 30: {'timestamp': '1/2/2023', 'accuracy': '40', 'participant_ID': '', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because not enough dashes in Participant_ID: {'timestamp': '1/3/2023', 'accuracy': '30', 'participant_ID': '#abc', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because Participant_ID != 32 chars: {'timestamp': '1/4/2023', 'accuracy': '30', 'participant_ID': '#abcdefgh-1234-abcd-5678-abcdefghij', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because Participant_ID has illegal chars: {'timestamp': '1/5/2023', 'accuracy': '30', 'participant_ID': '#abcdefgh-1234-abcd-5678-abcdefghijk', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because lat or long out of bounds: {'timestamp': '1/6/2023', 'accuracy': '30', 'participant_ID': 'abcdefgh-1234-abcd-5678-abcdefghijkl', 'double_latitude': '-290', 'double_longitude': '290', 'double_altitude': '-500', 'provider': ''}:
dropped row because provider != gps or network: {'timestamp': '1/7/2023', 'accuracy': '30', 'participant_ID': 'abcdefgh-1234-abcd-5678-abcdefghijkl', 'double_latitude': '100', 'double_longitude': '100', 'double_altitude': '500', 'provider': ''}:
dropped row because provider != gps or network: {'timestamp': '1/8/2023', 'accuracy': '30', 'participant_ID': 'abcdefgh-1234-abcd-5678-abcdefghijkl', 'double_latitude': '100', 'double_longitude': '100', 'double_altitude': '500', 'provider': 'foo'}:

and this final output CSV:

timestamp,accuracy,participant_ID,double_latitude,double_longitude,double_altitude,provider
1/9/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,gps
1/10/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,150,150,500,network