Skip to content

Instantly share code, notes, and snippets.

@gm3dmo
Last active April 28, 2019 08:30
Show Gist options
  • Save gm3dmo/e807c352ad7ee3c055017f8510b1ae30 to your computer and use it in GitHub Desktop.
Save gm3dmo/e807c352ad7ee3c055017f8510b1ae30 to your computer and use it in GitHub Desktop.

DWP Payment Data

https://data.gov.uk/dataset/ccdc397a-3984-453b-a9d7-e285074bba4d/spend-over-25-000-in-the-department-for-work-and-pensions

What happens when somebody downloads a CSV from the above?

  1. Most likely open it with Excel/Google Sheets/Open Office/Numbers? (More likely if an individual or small business).
  2. Upload it into some kind of database/Big Query/AWS (More likely if a company/data scientist working with journalists).

When you upload it and scroll down it looks like there is no data after row 190 which is weird and makes you thing something is very wrong. That turns out to be because 192 - 484 are indeed blank:

  192 ,,,,,,,,,,,,,,,,,,,,,,,,^M

Row after row of commas.

Results

| App | Does it handle the CSV with line breaks correctly? | | Google Sheets | Yes | | Excel | | | Open Office | Numbers |

Improvements

  1. It might be kinder to deliver an annual file for each year with a running total updated monthly.
  2. Use a consistent file encoding, preferably UTF-8

File encoding type is inconsistent

Note how file identifies the files as different types:

file *.bkp
dwp-payments-over-25000-for-2018-01.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-02.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-03.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-04.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-05.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-06.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-07.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-08.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-09.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-10.csv: ISO-8859 text, with very long lines
dwp-payments-over-25000-for-2018-11.csv: Non-ISO extended-ASCII text, with very long lines
dwp-payments-over-25000-for-2018-12.csv: ASCII text, with very long lines

Something in there also causes problems with sort

MacBook-Pro:dwp-spendies dmo$ tail -n +3 dwp-payments-over-25000-for-2018-11.csv | sort | uniq -c | sort -n
sort: Illegal byte sequence

Only works for December 2018 file:

tail -n +3 dwp-payments-over-25000-for-2018-12.csv | sort | uniq -c | sort -n

And yes there are duplicated lines, many of them:

 163 WAP,EMPLOYMENT PROGRAMMES,28-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - SOCIAL CARE - SUSTAINMENT FEE,WWEG EP NORTH WEST GM CHES AND WARR AVANTA ENTERPRISE,PEOPLEPLUS LTD,114312,-235,5213102918,10419149,Invoice from PRaP payment: 114312
 165 WAP,EMPLOYMENT PROGRAMMES,28-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - SOCIAL CARE - SUSTAINMENT FEE,WWEG EP NORTH WEST MERS HAL CUMB AND LANCS A4E,PEOPLEPLUS GROUP,114334,-215,5213102918,10419147,Invoice from PRaP payment: 114334
 211 WAP,EMPLOYMENT PROGRAMMES,28-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - SOCIAL CARE - SUSTAINMENT FEE,WWEG EP W MIDLANDS BIRM SOL BC FOURSTAR EMP AND SKILLS,PEOPLEPLUS GROUP LTD,114325,-215,5213102918,10419154,Invoice from PRaP payment: 114325
 225 WAP,CORPORATE,19-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - SOCIAL CARE - ATW ASSESSMENTS,OPS DS ATW DISC EP NATIONAL,PEOPLEPLUS LTD,113961,184.57,5213102930,10401972,Invoice from PRaP payment: 113961
 362 WAP,EMPLOYMENT PROGRAMMES,28-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - SOCIAL CARE - SUSTAINMENT FEE,WWEG EP NORTH WEST GM CHES AND WARR AVANTA ENTERPRISE,PEOPLEPLUS LTD,114312,-215,5213102918,10419149,Invoice from PRaP payment: 114312
1128 ,,,,,,,,,,
1343 DWP,CORPORATE,03-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP PAPER ASSESSMTS,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.0013E+11,222,5216102913,10394121,PIP SEP18 OCTKN1
1553 WAP,CORPORATE,06-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP PAPER ASSESSMTS,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.00131E+11,222,5216102913,10394121,PIP OCT18 NOVKN1
2153 WAP,CORPORATE,06-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP GP FACTUAL,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.00131E+11,34,5216102914,10394121,PIP OCT18 NOVKN1
2305 WAP,CORPORATE,03-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP GP FACTUAL,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.0013E+11,34,5216102914,10394121,PIP SEP18 OCTKN1
16553 DWP,CORPORATE,03-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP F2F ASSESSMENTS,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.0013E+11,298,5216102912,10394121,PIP SEP18 OCTKN1
17470 WAP,CORPORATE,06-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP F2F ASSESSMENTS,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.00131E+11,298,5216102912,10394121,PIP OCT18 NOVKN1

Are there really 17470 identical lots of:

WAP,CORPORATE,06-Dec-18,EXP - PURCHASE OF GOODS/SERVICES - MEDICAL - PIP F2F ASSESSMENTS,FG CS HS PIP LOT 2,CAPITA BUSINESS SERVICES LTD,7.00131E+11,298,5216102912,10394121,PIP OCT18 NOVKN1

Question: Do the headers match in all files?

A:They do not:

MacBook-Pro:dwp-spendies dmo$ head -1 *.csv
==> dwp-payments-over-25000-for-2018-01.csv <==
Departmental Family,Entity,Date,Expense,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-02.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-03.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-04.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-05.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-06.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-07.csv <==
Departmental Family,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-08.csv <==
Departmental Family,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-09.csv <==
Departmental Family,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-10.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-11.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-12.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description
MacBook-Pro:dwp-spendies dmo$ head -1 *.csv | tr , '\n' | sort | uniq -c | sort -rn
  12 Supplier
  12 Invoice Number
  12 Invoice Distribution Amount
  12 Invoice Description
  12 Expense Type
  12 Expense Area
  12 Departmental Family
  12 Date
  12 Cost Centre
  12 Account Code
  11
   9 Entity
   6 Vat Registration Num
   6 Supplier Type
   6 Supplier Post Code
   6 Requisition Description
   6 Projects Expenditure Type
   6 Project Number
   6 Payment Number
   6 Payment Amount
   6 PO Number
   6 PO Contract Number
   6 Invoice Line Description
   6 Invoice Amount
   6 Contract Number
   6 Analysis Code
   1 ==> dwp-payments-over-25000-for-2018-12.csv <==
   1 ==> dwp-payments-over-25000-for-2018-11.csv <==
   1 ==> dwp-payments-over-25000-for-2018-10.csv <==
   1 ==> dwp-payments-over-25000-for-2018-09.csv <==
   1 ==> dwp-payments-over-25000-for-2018-08.csv <==
   1 ==> dwp-payments-over-25000-for-2018-07.csv <==
   1 ==> dwp-payments-over-25000-for-2018-06.csv <==
   1 ==> dwp-payments-over-25000-for-2018-05.csv <==
   1 ==> dwp-payments-over-25000-for-2018-04.csv <==
   1 ==> dwp-payments-over-25000-for-2018-03.csv <==
   1 ==> dwp-payments-over-25000-for-2018-02.csv <==
   1 ==> dwp-payments-over-25000-for-2018-01.csv <==

Q:How many lines are duplicates: A:

(nino) MacBook-Pro:dwp-spendies dmo$ for file in `ls *.csv`; do echo $file ; time sort $file | uniq > $file.uniq; done
dwp-payments-over-25000-for-2018-01.csv

real	0m0.521s
user	0m0.525s
sys	0m0.023s
dwp-payments-over-25000-for-2018-02.csv

real	0m6.332s
user	0m6.431s
sys	0m0.094s
dwp-payments-over-25000-for-2018-03.csv

real	0m6.459s
user	0m6.553s
sys	0m0.093s
dwp-payments-over-25000-for-2018-04.csv

real	0m0.216s
user	0m0.216s
sys	0m0.013s
dwp-payments-over-25000-for-2018-05.csv

real	0m5.720s
user	0m5.812s
sys	0m0.083s
dwp-payments-over-25000-for-2018-06.csv

real	0m6.585s
user	0m6.702s
sys	0m0.103s
dwp-payments-over-25000-for-2018-07.csv

real	0m1.592s
user	0m1.633s
sys	0m0.027s
dwp-payments-over-25000-for-2018-08.csv

real	0m3.685s
user	0m3.756s
sys	0m0.040s
dwp-payments-over-25000-for-2018-09.csv

real	0m2.855s
user	0m2.919s
sys	0m0.035s
dwp-payments-over-25000-for-2018-10.csv

real	0m3.839s
user	0m3.905s
sys	0m0.036s
dwp-payments-over-25000-for-2018-11.csv

real	0m1.279s
user	0m1.309s
sys	0m0.025s
dwp-payments-over-25000-for-2018-12.csv

real	0m4.471s
user	0m4.549s
sys	0m0.048s

(nino) MacBook-Pro:dwp-spendies dmo$ ls -lh *.csv *.uniq
-rw-rw-rw-@ 1 dmo  staff   2.6M 24 Apr 21:19 dwp-payments-over-25000-for-2018-01.csv
-rw-r--r--  1 dmo  staff   2.6M 25 Apr 05:58 dwp-payments-over-25000-for-2018-01.csv.uniq
-rw-rw-rw-@ 1 dmo  staff    21M 24 Apr 21:19 dwp-payments-over-25000-for-2018-02.csv
-rw-r--r--  1 dmo  staff    17M 25 Apr 05:58 dwp-payments-over-25000-for-2018-02.csv.uniq
-rw-rw-rw-@ 1 dmo  staff    20M 24 Apr 21:18 dwp-payments-over-25000-for-2018-03.csv
-rw-r--r--  1 dmo  staff    14M 25 Apr 05:58 dwp-payments-over-25000-for-2018-03.csv.uniq
-rw-rw-rw-@ 1 dmo  staff   1.1M 24 Apr 21:18 dwp-payments-over-25000-for-2018-04.csv
-rw-r--r--  1 dmo  staff   1.1M 25 Apr 05:58 dwp-payments-over-25000-for-2018-04.csv.uniq
-rw-rw-rw-@ 1 dmo  staff    19M 24 Apr 21:18 dwp-payments-over-25000-for-2018-05.csv
-rw-r--r--  1 dmo  staff    14M 25 Apr 05:58 dwp-payments-over-25000-for-2018-05.csv.uniq
-rw-rw-rw-@ 1 dmo  staff    23M 24 Apr 21:17 dwp-payments-over-25000-for-2018-06.csv
-rw-r--r--  1 dmo  staff    13M 25 Apr 05:58 dwp-payments-over-25000-for-2018-06.csv.uniq
-rw-rw-rw-@ 1 dmo  staff   6.2M 24 Apr 21:17 dwp-payments-over-25000-for-2018-07.csv
-rw-r--r--  1 dmo  staff   2.3M 25 Apr 05:58 dwp-payments-over-25000-for-2018-07.csv.uniq
-rw-rw-rw-@ 1 dmo  staff    10M 24 Apr 21:17 dwp-payments-over-25000-for-2018-08.csv
-rw-r--r--  1 dmo  staff   2.4M 25 Apr 05:58 dwp-payments-over-25000-for-2018-08.csv.uniq
-rw-rw-rw-@ 1 dmo  staff   9.3M 24 Apr 21:17 dwp-payments-over-25000-for-2018-09.csv
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-09.csv.uniq
-rw-rw-rw-@ 1 dmo  staff   9.4M 24 Apr 21:16 dwp-payments-over-25000-for-2018-10.csv
-rw-r--r--  1 dmo  staff   2.6M 25 Apr 05:58 dwp-payments-over-25000-for-2018-10.csv.uniq
-rw-rw-rw-@ 1 dmo  staff   5.0M 24 Apr 21:16 dwp-payments-over-25000-for-2018-11.csv
-rw-r--r--  1 dmo  staff   2.7M 25 Apr 05:58 dwp-payments-over-25000-for-2018-11.csv.uniq
-rw-r--r--@ 1 dmo  staff    12M 24 Apr 20:03 dwp-payments-over-25000-for-2018-12.csv
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-12.csv.uniq

Steps to Clean Remove first 2 rows Deduplicate rows for file in `ls *.csv`; do echo $file ; sed '1d' $file > $file.noheader; done

Pop a header row back on (there is a change in the header structure mid year) head -1 dwp-payments-over-25000-for-2018-01.csv

Create a single csv cat header *.noheader > publish/x.csv

Use csvs-to-sqlite to create the sqlite.db csvs-to-sqlite -c Supplier publish/x.csv publish/x.db

Serve the sqlite with datasette: datasette serve publish/x.db

https://wellsr.com/python/introduction-to-csv-dialects-with-the-python-csv-module

This is what a single bad record looks like:

LHC DPS NL003
 Start fee: 100 @ £145 per start",SUPPORT FOR OLDER WORKS NORTH LONDON DISTRICTLHC DPS NL003Start fee 100 at £145,39070041261,,WD17 1LA,SUPPLIER,,0,Expense,

For added inconvenience 15 of the 25 columns have been removed beginning with file dwp-payments-over-25000-for-2018-07.csv

1. Departmental Family
2. Entity <---- No
3. Date
4. Expense Type
5. Expense Area
6. Supplier
7. Payment Number <---- No
8. Payment Amount <---- No
9. Invoice Amount<---- No
10. Invoice Number
11. Invoice Distribution Amount
12. Vat Registration Num <---- No
13. Account Code
14. Cost Centre
15. Invoice Description
16. Invoice Line Description <---- No
17. Requisition Description <---- No
18. PO Number <---- No
19. PO Contract Number <---- No
20. Supplier Post Code <---- No
21. Supplier Type <---- No
22. Contract Number <---- No
23. Analysis Code <---- No
24. Projects Expenditure Type <---- No
25. Project Number <---- No

Common to both:

1. Departmental Family
2. Date
3. Expense Type
4. Expense Area
5. Supplier
6. Invoice Number
7. Invoice Distribution Amount
8. Account Code
9. Cost Centre
10. Invoice Description
==> dwp-payments-over-25000-for-2018-01.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-02.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-03.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-04.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-05.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-06.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Payment Number,Payment Amount,Invoice Amount,Invoice Number,Invoice Distribution Amount,Vat Registration Num,Account Code,Cost Centre,Invoice Description,Invoice Line Description,Requisition Description,PO Number,PO Contract Number,Supplier Post Code,Supplier Type,Contract Number,Analysis Code,Projects Expenditure Type,Project Number

==> dwp-payments-over-25000-for-2018-07.csv <==
Departmental Family,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-08.csv <==
Departmental Family,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-09.csv <==
Departmental Family,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-10.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-11.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description

==> dwp-payments-over-25000-for-2018-12.csv <==
Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description```

##Structure of the files are not uniform

$ head -2 * ==> dwp-payments-over-25000-for-2018-01.csv <== Workbook Name: AP 18,Worksheet Name: Transparency ,Parameters:,"From Payment Date: '01-JAN-2018',","To Payment Date: '31-JAN-2018',",Invoice Amount Greater Than or Equal to: '25000',,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-02.csv <== "Workbook Name: AP 18 Worksheet Name: Transparency Parameters: From Payment Date : '01-FEB-2018' , To Payment Date : '28-FEB-2018' , Invoice Amount Greater Than or Equal to : '25000' Date and Time the workbook was run: 06-MAR-18 02.22.57 PM",,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-03.csv <== Workbook Name: AP 18,Worksheet Name: Transparency ,Parameters: ,From Payment Date: '01-MAR-2018' ,To Payment Date : '31-MAR-2018' ,Invoice Amount Greater Than or Equal to : '25000' ,Date and Time the workbook was run: 03-MAY-18 12.50.38PM,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-04.csv <== Workbook Name: AP 18,Worksheet Name: Transparency,Parameters:,From Payment Date: '01-APR-2018',To Payment Date: '30-APR-2018',Invoice Amount Greater Than or Equal to: '25000',Date and Time the workbook was run: 17-MAY-18 11.41.33AM,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-05.csv <== Workbook Name: AP 18,Worksheet Name: Transparency ,Parameters:,From Payment Date: '01-MAY-2018',Invoice Amount Greater Than or Equal to: '25000' ,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-06.csv <== Workbook Name: AP 18,Worksheet Name: Transparency,Parameters:,From Payment Date: '01-JUN-2018',To Payment Date: '30-JUN-2018',Invoice Amount Greater Than or Equal to: '25000',Date and Time the workbook was run: 06-JUL-18 11.43.04 AM,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-07.csv <== Workbook Name: AP 18,Worksheet Name: Transparency,"Parameters: From Payment Date : '01-JUL-2018' , To Payment Date : '31-JUL-2018'",Invoice Amount Greater Than or Equal to : '25000' ,Date and Time the workbook was run: 07-AUG-18 12.22.03 PM,, ,,, ,,,,,,,,,

==> dwp-payments-over-25000-for-2018-08.csv <== Workbook Name: AP 18,Worksheet Name: Transparency,"Parameters: From Payment Date : '01-AUG-2018' , To Payment Date : '31-AUG-2018'",Invoice Amount Greater Than or Equal to : '25000',Date and Time the workbook was run: 13-SEP-18 10.55.53 AM,,,,, ,,,,,,,,,

==> dwp-payments-over-25000-for-2018-09.csv <== Workbook Name: AP 18 ,Worksheet Name: Transparency,"Parameters: From Payment Date : '01-SEP-2018' , To Payment Date : '15-SEP-2018'",Invoice Amount Greater Than or Equal to : '25000',Date and Time the workbook was run: 04-OCT-18 11.10.30 AM,,,,, ,,,,,,,,,

==> dwp-payments-over-25000-for-2018-10.csv <== "Workbook Name: AP 18 Worksheet Name: Transparency - Detail Parameters: From Payment Date : '01-OCT-2018' , To Payment Date : '31-OCT-2018' , Invoice Amount Greater Than or Equal to : '25000' Date and Time the workbook was run: 23-NOV-18 11.28.22 AM",,,,,,,,,, ,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-11.csv <== "Workbook Name: AP 18 Worksheet Name: Transparency - Detail Parameters: From Payment Date : '01-NOV-2018' , To Payment Date : '30-NOV-2018' , Invoice Amount Greater Than or Equal to : '25000' Date and Time the workbook was run: 06-DEC-18 11.44.10 AM",,,,,,,,,, ,,,,,,,,,,

==> dwp-payments-over-25000-for-2018-12.csv <== Departmental Family,Entity,Date,Expense Type,Expense Area,Supplier,Invoice Number,Invoice Distribution Amount,Account Code,Cost Centre,Invoice Description DWP,CORPORATE,17-Dec-18,CA - PREPAYMENTS (NON-PFI) - ACCOMMODATION,FG CD ESTATES TOM DEL P,DTZ CUSHMAN & WAKEFIELD,DWPCWFR45V2,"45,344,439",1815502900,10393892,DECEMBER QUARTERLY RENT AND SERVICE CHARGES```

Use it or lose it?

(nino) MacBook-Pro:dwp-spendies dmo$ ls -lh *uniq
-rw-r--r--  1 dmo  staff   2.6M 25 Apr 05:58 dwp-payments-over-25000-for-2018-01.csv.uniq
-rw-r--r--  1 dmo  staff    17M 25 Apr 05:58 dwp-payments-over-25000-for-2018-02.csv.uniq
-rw-r--r--  1 dmo  staff    14M 25 Apr 05:58 dwp-payments-over-25000-for-2018-03.csv.uniq
-rw-r--r--  1 dmo  staff   1.1M 25 Apr 05:58 dwp-payments-over-25000-for-2018-04.csv.uniq
-rw-r--r--  1 dmo  staff    14M 25 Apr 05:58 dwp-payments-over-25000-for-2018-05.csv.uniq
-rw-r--r--  1 dmo  staff    13M 25 Apr 05:58 dwp-payments-over-25000-for-2018-06.csv.uniq
-rw-r--r--  1 dmo  staff   2.3M 25 Apr 05:58 dwp-payments-over-25000-for-2018-07.csv.uniq
-rw-r--r--  1 dmo  staff   2.4M 25 Apr 05:58 dwp-payments-over-25000-for-2018-08.csv.uniq
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-09.csv.uniq
-rw-r--r--  1 dmo  staff   2.6M 25 Apr 05:58 dwp-payments-over-25000-for-2018-10.csv.uniq
-rw-r--r--  1 dmo  staff   2.7M 25 Apr 05:58 dwp-payments-over-25000-for-2018-11.csv.uniq
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-12.csv.uniq

February and March

(nino) MacBook-Pro:dwp-spendies dmo$ wc -l *.uniq
    8259 dwp-payments-over-25000-for-2018-01.csv.uniq
   48410 dwp-payments-over-25000-for-2018-02.csv.uniq
   43034 dwp-payments-over-25000-for-2018-03.csv.uniq
    3521 dwp-payments-over-25000-for-2018-04.csv.uniq
   42703 dwp-payments-over-25000-for-2018-05.csv.uniq
   42527 dwp-payments-over-25000-for-2018-06.csv.uniq
   12580 dwp-payments-over-25000-for-2018-07.csv.uniq
   12953 dwp-payments-over-25000-for-2018-08.csv.uniq
   10375 dwp-payments-over-25000-for-2018-09.csv.uniq
   13141 dwp-payments-over-25000-for-2018-10.csv.uniq
   14107 dwp-payments-over-25000-for-2018-11.csv.uniq
    9868 dwp-payments-over-25000-for-2018-12.csv.uniq
  261478 total
@gm3dmo
Copy link
Author

gm3dmo commented Apr 24, 2019

I guess only these:

  12 Supplier
  12 Invoice Number
  12 Invoice Distribution Amount
  12 Invoice Description
  12 Expense Type
  12 Expense Area
  12 Departmental Family
  12 Date
  12 Cost Centre
  12 Account Code

are wanted because there are 12.

@gm3dmo
Copy link
Author

gm3dmo commented Apr 25, 2019

Use it or lose it?

(nino) MacBook-Pro:dwp-spendies dmo$ ls -lh *uniq
-rw-r--r--  1 dmo  staff   2.6M 25 Apr 05:58 dwp-payments-over-25000-for-2018-01.csv.uniq
-rw-r--r--  1 dmo  staff    17M 25 Apr 05:58 dwp-payments-over-25000-for-2018-02.csv.uniq
-rw-r--r--  1 dmo  staff    14M 25 Apr 05:58 dwp-payments-over-25000-for-2018-03.csv.uniq
-rw-r--r--  1 dmo  staff   1.1M 25 Apr 05:58 dwp-payments-over-25000-for-2018-04.csv.uniq
-rw-r--r--  1 dmo  staff    14M 25 Apr 05:58 dwp-payments-over-25000-for-2018-05.csv.uniq
-rw-r--r--  1 dmo  staff    13M 25 Apr 05:58 dwp-payments-over-25000-for-2018-06.csv.uniq
-rw-r--r--  1 dmo  staff   2.3M 25 Apr 05:58 dwp-payments-over-25000-for-2018-07.csv.uniq
-rw-r--r--  1 dmo  staff   2.4M 25 Apr 05:58 dwp-payments-over-25000-for-2018-08.csv.uniq
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-09.csv.uniq
-rw-r--r--  1 dmo  staff   2.6M 25 Apr 05:58 dwp-payments-over-25000-for-2018-10.csv.uniq
-rw-r--r--  1 dmo  staff   2.7M 25 Apr 05:58 dwp-payments-over-25000-for-2018-11.csv.uniq
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-12.csv.uniq

February and March

(nino) MacBook-Pro:dwp-spendies dmo$ wc -l *.uniq
    8259 dwp-payments-over-25000-for-2018-01.csv.uniq
   48410 dwp-payments-over-25000-for-2018-02.csv.uniq
   43034 dwp-payments-over-25000-for-2018-03.csv.uniq
    3521 dwp-payments-over-25000-for-2018-04.csv.uniq
   42703 dwp-payments-over-25000-for-2018-05.csv.uniq
   42527 dwp-payments-over-25000-for-2018-06.csv.uniq
   12580 dwp-payments-over-25000-for-2018-07.csv.uniq
   12953 dwp-payments-over-25000-for-2018-08.csv.uniq
   10375 dwp-payments-over-25000-for-2018-09.csv.uniq
   13141 dwp-payments-over-25000-for-2018-10.csv.uniq
   14107 dwp-payments-over-25000-for-2018-11.csv.uniq
    9868 dwp-payments-over-25000-for-2018-12.csv.uniq
  261478 total

@gm3dmo
Copy link
Author

gm3dmo commented Apr 25, 2019

The December file contains mostly duplicate lines:

-rw-r--r--@ 1 dmo  staff    12M 24 Apr 20:03 dwp-payments-over-25000-for-2018-12.csv
-rw-r--r--  1 dmo  staff   1.9M 25 Apr 05:58 dwp-payments-over-25000-for-2018-12.csv.uniq

12Mb before uniq 1.9Mb afterwards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment