@bnewbold
Last active November 22, 2021 22:35
OpenAlex Journal Metadata

Running some quick initial metadata quality checks on the OpenAlex Journal list. This is from the pre-release, with file names dated 2021-10-11 (but announced in late November 2021).

Looking for ISSN-L dupes:

cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -d | wc -l
# 146

cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -D | wc -l
# 293
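
To see which rows are actually involved in the ISSN-L dupes, something like the following could work. This is only a sketch: `dup_issnl_rows` is a made-up helper name, and the column positions (1 = JournalId, 3 = normalized name, 5 = ISSN-L) are assumed from the `cut` calls used elsewhere in these notes.

```shell
# dup_issnl_rows FILE: print JournalId, ISSN-L and normalized name for every
# row whose ISSN-L (column 5, assumed) occurs more than once in FILE.
# The file is read twice: once to count, once to report.
dup_issnl_rows() {
    awk -F '\t' '
        NR == FNR { if ($5 ~ /-/) seen[$5]++; next }   # first pass: count ISSN-Ls
        seen[$5] > 1 { print $1 "\t" $5 "\t" $3 }       # second pass: print dupes
    ' "$1" "$1" | sort -t "$(printf '\t')" -k2,2
}

# Usage (file name as used in these notes):
#   dup_issnl_rows openalex-journals.txt | head
```

Sorting on the second column groups rows that share an ISSN-L, which makes it easier to eyeball whether they are true duplicates or distinct journals.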

Looking for ISSN-L not in a recent dump of ISSN-Ls from issn.org:

cat openalex-journals.txt | cut -f5 | rg '\-' | sort -u > openalex-issnl.tsv
cat ISSN-to-ISSN-L.txt | cut -f2 | rg -v "ISSN" | rg '\-' | sort -u > issnl.tsv

comm -23 openalex-issnl.tsv issnl.tsv | wc -l
# 249

comm -23 openalex-issnl.tsv issnl.tsv > openalex_unknown_issnl.txt
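
A cheap extra check on the unknown ISSN-Ls is the ISSN check digit, which can catch typos without any lookup. A minimal sketch, assuming one ISSN per line in the form `NNNN-NNNC`; `issn_check_ok` is a hypothetical helper name, and the rule used is the standard ISSN one (weights 8 down to 2 over the first seven digits, modulo 11, with `X` standing for 10):

```shell
# issn_check_ok ISSN: exit 0 only if ISSN looks like NNNN-NNNC and the last
# character matches the ISSN check digit (mod 11, weights 8..2).
issn_check_ok() {
    printf '%s\n' "$1" | awk '
        !/^[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9Xx]$/ { exit 1 }
        {
            digits = substr($0, 1, 4) substr($0, 6, 3)   # the seven data digits
            sum = 0
            for (i = 1; i <= 7; i++)
                sum += substr(digits, i, 1) * (9 - i)      # weights 8, 7, ..., 2
            check = (11 - sum % 11) % 11
            expected = (check == 10) ? "X" : check ""
            exit (toupper(substr($0, 9, 1)) == expected) ? 0 : 1
        }'
}

# Usage: print the entries that fail the check.
#   while IFS= read -r issn; do
#       issn_check_ok "$issn" || printf '%s\n' "$issn"
#   done < openalex_unknown_issnl.txt
```

Entries with stray whitespace or a trailing backslash fail the format test as well, so this also flags malformed values. A passing check digit only means the string is well-formed, not that the ISSN is registered.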

There are a few reasons that ISSNs might not be in the public list or available through https://portal.issn.org (e.g., sometimes there are typos which then become widely used, or the ISSN is only partially registered). But if the ISSN-L is used as a persistent identifier, it should be required to be valid and publicly registered.

Looking for duplicate exact homepage URLs:
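
For the duplicate exact homepage URL check, a sketch in the same style as the other commands here might look like the following. It is not necessarily the command that was originally run: `dup_urls` is a made-up name, column 10 as the homepage is assumed from the other checks in these notes, and plain `grep` stands in for `rg` so it runs without ripgrep installed.

```shell
# dup_urls: read the journal TSV on stdin and print each homepage URL
# (column 10, assumed) that appears on more than one row.
dup_urls() {
    cut -f10 | grep '://' | sort | uniq -d
}

# Usage (file name as used in these notes):
#   dup_urls < openalex-journals.txt | wc -l
```

This only catches byte-for-byte duplicates; URLs differing by scheme, `www.` prefix, or a trailing slash would need normalizing first.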

Look for "normalized name" duplicates:

cat openalex-journals.txt | cut -f3 | sort | uniq -d | wc -l
# 14

Not many, good.

Look for bogus homepage URLs:

cat openalex-journals.txt | cut -f1,10 | rg -v '\t$' | rg -v '://'

JournalId       Webpage
2944001180      www.cjb-rcb.ca
2764771476      123\
2948018973      ores.su/en/journals/chinese-journal-of-ecology/
2764943583      197\
2946866068      www.kais99.org
2764846895      518\
2947334459      www.jasnaoe.or.jp/en/
2764943300      65\
2944560164      www.ijqf.org
2765015668      10\
2764518604      116\
2764649715      430\

HTTP/HTTPS:

cat openalex-journals.txt | cut -f10 | rg '://' | cut -f1 -d: | sort | uniq -c
   5483 http
    873 https

Probably a whole bunch of these could be https:// instead of http://, which would improve end-user security/privacy by default.
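
One way to test that guess would be to generate the https:// equivalent of each http:// URL and probe it. A rough sketch only: `https_candidates` and `probe_url` are hypothetical helper names, the probe needs network access and `curl`, and a HEAD request is not a reliable signal for every server.

```shell
# https_candidates: read URLs on stdin, keep the http:// ones, and print the
# https:// equivalent of each (one per line, de-duplicated).
https_candidates() {
    grep '^http://' | sed 's|^http://|https://|' | sort -u
}

# probe_url URL: print the final HTTP status code for URL (curl prints 000
# if the connection fails). Sends a HEAD request and follows redirects.
probe_url() {
    curl -s -o /dev/null -I -L -m 10 -w '%{http_code}' "$1"
}

# Usage sketch (slow; one request per URL):
#   cut -f10 openalex-journals.txt | https_candidates | while IFS= read -r url; do
#       printf '%s\t%s\n' "$(probe_url "$url")" "$url"
#   done
```

A 200 on the https:// variant suggests the URL could be upgraded, though it would still be worth spot-checking that the page served is the same journal homepage.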

Top domains:

cat openalex-journals.txt | cut -f10 | rg '://' | cut -f3 -d/ | sed 's/www\.//g' | sort | uniq -c | sort -nr | head -n20
    463 journals.elsevier.com
    412 onlinelibrary.wiley.com
    304 springer.com
    286 sciencedirect.com
    183 sagepub.com
    183 elsevier.com
    169 tandfonline.com
     91 journals.cambridge.org
     75 worldscinet.com
     63 informahealthcare.com
     62 apa.org
     43 pubs.acs.org
     43 press.jhu.edu
     39 wiley.com
     35 pdcnet.org
     35 journals.uchicago.edu
     35 journals.lww.com
     34 degruyter.com
     33 uk.sagepub.com
     31 rsc.org

These look pretty good! Often catalogs have a bunch of URLs that just point to aggregators, etc, but these seem like real homepage domains.

Any wayback URLs in there?

cat openalex-journals.txt | cut -f1,10 | rg archive.org
172099791       http://web.archive.org/web/20090803131854/http://www.rejecta.org:80/
59114670        http://web.archive.org/web/http://www.multilingualarchive.com/ma/enwiki/es/botaniska_notiser

The first wayback URL seems reasonable (journal is defunct, but homepage was captured).

The second wayback URL isn't good (we don't have a capture, and the URL structure isn't complete), and there seems to be a live-web homepage for the back catalog:

https://journals.lub.lu.se/bn/index

Attached list, apparently the duplicate exact homepage URLs:

http://journals.lww.com/mcnjournal/pages/default.aspx
http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1573-6598
http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)2044-835X
http://rel.sagepub.com/
http://www.apa.org/pubs/journals/edu/
http://www.benthamscience.com/open/tocchemj/index.htm
http://www.pemberleybooks.com/
http://www.psupress.org/journals/jnls_jmrc.html
http://www.psupress.org/journals/jnls_PEGS.html
http://www.rsc.org/Publishing/Journals/dt/index.asp

Attached list of duplicated ISSN-Ls (293 lines, matching the uniq -D count above):

0001-5113
0001-5113
0003-388X
0003-388X
0006-2278
0006-2278
0007-1102
0007-1102
0008-3496
0008-3496
0021-1818
0021-1818
0021-3667
0021-3667
0021-6704
0021-6704
0022-2151
0022-2151
0022-2526
0022-2526
0023-2130
0023-2130
0024-6115
0024-6115
0025-7273
0025-7273
0026-2692
0026-2692
0026-7074
0026-7074
0029-4810
0029-4810
0030-5987
0030-5987
0031-4749
0031-4749
0035-3906
0035-3906
0036-7672
0036-7672
0036-8326
0036-8326
0038-1861
0038-1861
0041-9907
0041-9907
0043-5163
0043-5163
0044-2011
0044-2011
0046-9750
0046-9750
0077-8923
0077-8923
0079-6034
0079-6034
0079-6034
0090-0311
0090-0311
0094-0496
0094-0496
0096-3941
0096-3941
0101-4846
0101-4846
0120-3347
0120-3347
0145-5680
0145-5680
0147-5479
0147-5479
0161-0457
0161-0457
0161-7109
0161-7109
0165-4888
0165-4888
0173-9565
0173-9565
0191-4642
0191-4642
0213-5949
0213-5949
0219-5259
0219-5259
0253-9837
0253-9837
0256-9507
0256-9507
0260-2105
0260-2105
0265-5012
0265-5012
0275-3987
0275-3987
0276-1114
0276-1114
0300-1652
0300-1652
0300-5186
0300-5186
0307-6962
0307-6962
0351-0026
0351-0026
0368-3249
0368-3249
0369-4232
0369-4232
0390-6701
0390-6701
0391-5387
0391-5387
0514-8499
0514-8499
0580-373X
0580-373X
0768-598X
0768-598X
0814-723X
0814-723X
0844-5621
0844-5621
0856-0056
0856-0056
0867-6046
0867-6046
0872-0754
0872-0754
0887-8722
0887-8722
0950-2688
0950-2688
0957-5820
0957-5820
0971-6580
0971-6580
0973-4775
0973-4775
0974-4150
0974-4150
1012-0386
1012-0386
1018-5615
1018-5615
1035-8811
1035-8811
1036-1073
1036-1073
1074-4797
1074-4797
1086-9379
1086-9379
1087-0156
1087-0156
1097-9638
1097-9638
1108-7471
1108-7471
1124-3562
1124-3562
1183-1189
1183-1189
1300-7688
1300-7688
1302-8723
1302-8723
1326-0111
1326-0111
1329-1947
1329-1947
1347-9032
1347-9032
1411-5115
1411-5115
1414-3518
1414-3518
1415-0980
1415-0980
1415-8426
1415-8426
1447-9540
1447-9540
1462-3846
1462-3846
1463-5771
1463-5771
1470-5427
1470-5427
1470-8175
1470-8175
1474-6778
1474-6778
1476-0835
1476-0835
1520-7439
1520-7439
1527-5299
1527-5299
1532-060X
1532-060X
1535-2811
1535-2811
1543-4273
1543-4273
1552-4825
1552-4825
1553-0981
1553-0981
1555-5623
1555-5623
1559-2332
1559-2332
1575-0922
1575-0922
1578-9705
1578-9705
1611-3683
1611-3683
1648-3480
1648-3480
1655-1532
1655-1532
1674-5507
1674-5507
1681-3472
1681-3472
1744-618X
1744-618X
1754-1476
1754-1476
1808-4532
1808-4532
1818-1171
1818-1171
1832-4274
1832-4274
1857-7431
1857-7431
1861-1303
1861-1303
1862-4057
1862-4057
1863-0383
1863-0383
1867-8521
1867-8521
1869-4195
1869-4195
1907-7505
1907-7505
1907-9931
1907-9931
1932-1031
1932-1031
1947-6108
1947-6108
1949-1042
1949-1042
1985-207X
1985-207X
1991-3877
1991-3877
1996-1960
1996-1960
2050-7526
2050-7526
2089-6867
2089-6867
2095-0160
2095-0160
2146-4189
2146-4189
2219-8229
2219-8229
2227-7242
2227-7242
2230-7885
2230-7885
2251-9130
2251-9130
2252-6773
2252-6773
2307-3489
2307-3489
2310-4155
2310-4155
2329-8456
2329-8456
2477-9539
2477-9539
2588-1205
2588-1205

Attached list of ISSN-Ls not found in the issn.org dump (249 lines, matching openalex_unknown_issnl.txt above):

0004-0429
0012-4079
0012-8309
0017-3754
0019-4395
0019-493X
0020-3785
0022-1817
0031-7497
0031-7721
0031-7810
0039-8627
0040-3180
0041-610X
0042-6695
0043-0781
0044-8591
0046-4147
0047-262X
0047-9411
0048-3753
0069-8040
0071-9544
0115-6136
0123-4567
0143-2524
0155-2173
0168-1273
0189-1774
0189-7543
0189-9171
0191-1030
0201-7369
0208-4317
0233-7029
0250-626X
0255-983X
0258-8021 
0271-2171
0342-183X
0363-1307
0379-4008
0379-5292
0391-8440
0411-972X
0449-2153
0460-0037
0556-3097
0580-9525
0583-337X
0590-4048
0686-3174
0700-9816
0736-7031
0736-904X
0744-8766
0792-8521
0794-4713
0794-4721
0794-4896
0794-5698
0794-7410
0795-0101
0795-5111
0811-0433
0850-7902
0855-0328
0855-2215
0855-3823
0951-1253
0974-9632
1001-5917
1002-4822
1012-2812
1070-194X
1071-894X
1093-2658
1099-0046
1110-5704
1115-0521
1115-3474
1116-2775
1116-4336
1116-4875
1116-5405
1117-272X
1117-4153
1118-0579
1118-2601
1118-5570
1118-6267
1119-4308
1119-443X
1119-8152
1119-9008
1124-3937
1189-3332
1239-2325
1304-3889
1304-4257
1304-7442
1305-385X
1308-4216
1314-7242
1320-2510
1324-048X
1327-8746
1338-1202
1343-3210
1361-3126
1440-6888
1443-458X
1444-1284
1448-5052
1450-2267
1450-2887
1513-489X
1523-4592
1532-7299
1533-9440
1539-854X
1547-4127
1553-0205
1555-7855
1565-1088
1582-1130
1595-0611
1595-4153
1595-5125
1596-2903
1596-2911
1596-292X
1596-3233
1596-5031
1596-5511
1596-6194
1596-6208
1596-6216
1596-6224
1596-6798
1596-6941
1596-7425
1596-9819
1597-0906
1597-1260
1597-2836
1597-4292
1597-4488
1597-913X
1597-9385
1637-3412
1646-1813
1646-6756
1658-354X
1658-6662
1671-0290 
1675-7227
1676-2578
1696-9895
1735-1835
1735-7586
1755-599X
1793-5482
1796-5622
1800-1261
1810-6277
1812-0806
1812-2108
1814-9340
1815-5928
1817-2509
1820-659X
1823-6138
1834-8610
1858-5345
1858-554X
1883-5031
1884-202X
1884-5800
1913-9330
1923-7502
1999-253X
1999-6187
2010-3778
2038-9744
2049-7881
2050-9768
2054-7309
2068-4762
2071-0216\
2071-2573
2091-0304
2091-0363
2091-0800
2091-1009
2091-1459
2093-8114
2147-5369
2149-8407
2155-3625
2166-1200
2169-6129
2175-8042
2185-4092
2201-4268
2201-4624
2221-0989
2224-8358
2225-0573
2226-0373
2228-5415
2232-2981
2240-8053
2252-0414
2252-0848
2277-7694
2282-2305
2285-5696
2286-7104
2300-6285
2315-7844
2322-0090
2334-1548
2334-1841
2345-5993
2349-5103
2351-8014
2367-4598
2367-7600
2373-6367
2383-1375
2383-2495
2384-6028
2392-8021
 2397-0022
2442-3904
2451-0602
2474-6436
2502-4507
2536-9512
2576-063X
2588-1965
2588-2082
3875-2072
4163-5178
4277-4685
7383-1018
7519-1735
8941-1991