I recommend restructuring your filters from only proceeding (and indenting) if a criterion passes to skipping the row if any criterion fails. This has a few benefits:
- it keeps the code from creeping to the right
- you can print a debug message whenever a row doesn't match
- you can comment out any single criterion without affecting the others
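To make the contrast concrete, here is a minimal sketch (the `keep_rows` helper and its two checks are just illustrative stand-ins for your real criteria):

```python
def keep_rows(rows):
    """Guard-clause filtering: skip a row as soon as any criterion fails,
    so the happy path stays at one indentation level."""
    kept = []
    for row in rows:
        if not row["timestamp"].endswith("2023"):
            continue  # wrong year
        if float(row["accuracy"]) > 30:
            continue  # too inaccurate
        kept.append(row)
    return kept

rows = [
    {"timestamp": "1/1/2000", "accuracy": "10"},
    {"timestamp": "1/2/2023", "accuracy": "40"},
    {"timestamp": "1/3/2023", "accuracy": "30"},
]
print(keep_rows(rows))  # only the 1/3/2023 row survives
```

Compare that with the nested style, where every passing check pushes the rest of the loop body one level deeper.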
I test the participant IDs differently than you did, but your method of:
- splitting on the hyphen
- counting the components
- getting all the lengths and comparing to `[8, 4, 4, 4, 12]`
- scanning each char for `islower()` or `isdigit()`

was sound.
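For comparison, the same 8-4-4-4-12 shape can be expressed as a single regular expression (this assumes, as your checks imply, that the IDs are only lowercase letters and digits; the pattern and helper name are my own):

```python
import re

# 8-4-4-4-12 groups of lowercase letters and digits, separated by hyphens.
ID_RE = re.compile(r"^[a-z0-9]{8}(-[a-z0-9]{4}){3}-[a-z0-9]{12}$")

def is_valid_id(participant_id: str) -> bool:
    return ID_RE.fullmatch(participant_id) is not None

print(is_valid_id("abcdefgh-1234-abcd-5678-abcdefghijkl"))  # True
print(is_valid_id("#abcdefgh-1234-abcd-5678-abcdefghijk"))  # False
```

One pattern replaces the split/count/length/char checks, at the cost of being harder to produce a specific "why it failed" message from.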
I also strongly advise against appending: it can cause headaches with CSV files. If I were doing this, I'd filter the input into a separate results file and make sure all the logic works first, because if your logic is wrong and you've already appended to the original, you'll have to go back and edit or delete the messed-up rows. Once the filter is right, a much simpler program can concatenate that results file with a previous file.
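That follow-up concatenation program could be as small as this sketch (the function name and file paths are placeholders of mine; it assumes both CSVs share the same header):

```python
import csv

def append_results(previous_path: str, results_path: str) -> None:
    """Append the data rows of results_path onto previous_path.

    Assumes both CSVs have identical headers; the results header is skipped
    so it isn't duplicated in the middle of the combined file.
    """
    with open(results_path, "r", newline="", encoding="utf-8") as f_results:
        reader = csv.reader(f_results)
        next(reader)  # skip the header row
        with open(previous_path, "a", newline="", encoding="utf-8") as f_prev:
            writer = csv.writer(f_prev)
            for row in reader:
                writer.writerow(row)
```

Keeping this as its own tiny script means the risky "modify an existing file" step only ever runs on data you've already inspected.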
```python
import csv
import string

fname_in = "input.csv"
fname_out = "output.csv"

VALID_ID_CHARS = string.ascii_lowercase + string.digits

def print_dropped(row: dict[str, str], reason: str):
    print(f"dropped row because {reason}: {row}:")

# with-blocks guarantee both files are flushed and closed on exit
with open(fname_in, "r", newline="", encoding="utf-8") as f_in, \
     open(fname_out, "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)
    fieldnames = reader.fieldnames
    assert fieldnames is not None
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        if row["timestamp"][-4:] != "2023":
            print_dropped(row, "timestamp not in 2023")
            continue
        if float(row["accuracy"]) > 30:
            print_dropped(row, "accuracy greater than 30")
            continue
        participant_id = row["participant_ID"]
        if participant_id.count("-") != 4:
            print_dropped(row, "not enough dashes in Participant_ID")
            continue
        participant_id = participant_id.replace("-", "")
        if len(participant_id) != 32:
            print_dropped(row, "Participant_ID != 32 chars")
            continue
        bad_chars = False
        for char in participant_id:
            if char not in VALID_ID_CHARS:
                bad_chars = True
                break
        if bad_chars:
            print_dropped(row, "Participant_ID has illegal chars")
            continue
        lat = float(row["double_latitude"])
        long = float(row["double_longitude"])
        if lat < -180 or lat > 180 or long < -180 or long > 180:
            print_dropped(row, "lat or long out of bounds")
            continue
        altitude = float(row["double_altitude"])
        if altitude < 0 or altitude > 1000:
            print_dropped(row, "altitude out of bounds")
            continue
        if row["provider"] not in ["gps", "network"]:
            print_dropped(row, "provider != gps or network")
            continue
        writer.writerow(row)
```
Once you like the results of your filter logic, you can comment out the `print(...)` line in `print_dropped()` and add a `pass`, or remove the individual calls to the function.
I tested with this input:
```
timestamp,accuracy,participant_ID,double_latitude,double_longitude,double_altitude,provider
1/1/2000,,,,,,
1/2/2023,40,,,,,
1/3/2023,30,#abc,,,,
1/4/2023,30,#abcdefgh-1234-abcd-5678-abcdefghij,,,,
1/5/2023,30,#abcdefgh-1234-abcd-5678-abcdefghijk,,,,
1/6/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,-290,290,-500,
1/7/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,
1/8/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,foo
1/9/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,gps
1/10/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,150,150,500,network
```
got this debug output:
```
dropped row because timestamp not in 2023: {'timestamp': '1/1/2000', 'accuracy': '', 'participant_ID': '', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because accuracy greater than 30: {'timestamp': '1/2/2023', 'accuracy': '40', 'participant_ID': '', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because not enough dashes in Participant_ID: {'timestamp': '1/3/2023', 'accuracy': '30', 'participant_ID': '#abc', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because Participant_ID != 32 chars: {'timestamp': '1/4/2023', 'accuracy': '30', 'participant_ID': '#abcdefgh-1234-abcd-5678-abcdefghij', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because Participant_ID has illegal chars: {'timestamp': '1/5/2023', 'accuracy': '30', 'participant_ID': '#abcdefgh-1234-abcd-5678-abcdefghijk', 'double_latitude': '', 'double_longitude': '', 'double_altitude': '', 'provider': ''}:
dropped row because lat or long out of bounds: {'timestamp': '1/6/2023', 'accuracy': '30', 'participant_ID': 'abcdefgh-1234-abcd-5678-abcdefghijkl', 'double_latitude': '-290', 'double_longitude': '290', 'double_altitude': '-500', 'provider': ''}:
dropped row because provider != gps or network: {'timestamp': '1/7/2023', 'accuracy': '30', 'participant_ID': 'abcdefgh-1234-abcd-5678-abcdefghijkl', 'double_latitude': '100', 'double_longitude': '100', 'double_altitude': '500', 'provider': ''}:
dropped row because provider != gps or network: {'timestamp': '1/8/2023', 'accuracy': '30', 'participant_ID': 'abcdefgh-1234-abcd-5678-abcdefghijkl', 'double_latitude': '100', 'double_longitude': '100', 'double_altitude': '500', 'provider': 'foo'}:
```
and this final output CSV:
```
timestamp,accuracy,participant_ID,double_latitude,double_longitude,double_altitude,provider
1/9/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,100,100,500,gps
1/10/2023,30,abcdefgh-1234-abcd-5678-abcdefghijkl,150,150,500,network
```