@ssp
Last active December 10, 2015 13:48
Count the most frequent publication years of WorldCat’s most frequently held items.
# Picking up Karen Coyle’s idea to gather the publication years of WorldCat’s most widely held items.
#
# Data available from OCLC at: ftp://anonftp.oclc.org/pub/researchdata/worldcat/WorldCatMostHighlyHeld-2012-05-15.nt.gz
# The downloaded file is expected to be in the current directory.
#
# Join the following commands together:
#
# 1. gzcat the file to access the text
# 2. grep for the lines containing »datePublished« or »copyrightYear« (»datePublished« only seems to exist for about half the records?)
# 3. sed with a regular expression to extract four digits after the »datePublished> "« (possibly prefixed by a p or c)
# 4. grep -v to drop all remaining lines that still contain a »<« (these are left over when sed matches no digits and passes the line through unchanged)
# 5. sort lines
# 6. uniq -c count runs of identical lines
#
# Running the command takes a few minutes even on a fast machine (single-threaded grep seems to be the bottleneck).
gzcat WorldCatMostHighlyHeld-2012-05-15.nt.gz | grep "datePublished\|copyrightYear" | sed -e 's/.*schema.org[^>]*> "[cp]*\([0-9][0-9][0-9][0-9]\).*/\1/' | grep -v "<" | sort | uniq -c
# Apply something like
# sed -E -e 's/ *([0-9]*) ([0-9]*)/\2\t\1/'
# to the result to get a tab separated format suitable for use in spreadsheets.
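#
# Some sed implementations may not turn \t in the replacement into a tab.
# As an alternative, here is a minimal sketch using awk, assuming every input
# line has the "COUNT YEAR" shape that uniq -c produces. The function name
# to_tsv is made up for this example, and the two sample lines reuse counts
# from the results listing.

```shell
# to_tsv (hypothetical helper): turn "  COUNT YEAR" lines into "YEAR<TAB>COUNT".
to_tsv() {
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Two lines in the shape uniq -c emits (leading padding varies by platform).
tsv=$(printf '   1 1174\n  23 1866\n' | to_tsv)
printf '%s\n' "$tsv"
```

# awk splits on runs of whitespace, so the padding that uniq -c adds in front
# of the count does not need to be matched explicitly.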
#
# Results shown at:
# https://docs.google.com/spreadsheet/ccc?key=0Ah9t1ddBuxv8dDhHT1VQT3ExV3ZJV3pKS3A0X0tLeGc
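#
# To see what the filters in the pipeline above keep and drop without
# downloading the full file, the same steps can be tried on a few lines.
# This is only a sketch: the four sample triples are made up for illustration
# and are not taken from the OCLC data, the function name extract_years is
# hypothetical, and grep is called with two -e patterns instead of \| (which
# should select the same lines).

```shell
# Made-up sample triples, NOT copied from the OCLC file.
sample='<http://www.worldcat.org/oclc/1> <http://schema.org/datePublished> "c1999" .
<http://www.worldcat.org/oclc/2> <http://schema.org/copyrightYear> "2004" .
<http://www.worldcat.org/oclc/3> <http://schema.org/datePublished> "n.d." .
<http://www.worldcat.org/oclc/4> <http://schema.org/name> "Some title" .'

# extract_years (hypothetical helper): steps 2 to 6 of the pipeline, reading stdin.
extract_years() {
    grep -e "datePublished" -e "copyrightYear" \
    | sed -e 's/.*schema.org[^>]*> "[cp]*\([0-9][0-9][0-9][0-9]\).*/\1/' \
    | grep -v "<" \
    | sort \
    | uniq -c
}

result=$(printf '%s\n' "$sample" | extract_years)
printf '%s\n' "$result"
```

# With these sample lines, 1999 and 2004 are each counted once. The "n.d."
# line has no four digits, so sed leaves it unchanged and grep -v "<" drops
# it; the name triple never passes the first grep.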
1174 1
1410 1
1762 1
1770 1
1771 1
1776 1
1788 1
1791 1
1802 1
1803 2
1804 1
1808 1
1809 2
1810 2
1811 1
1814 1
1817 2
1818 1
1820 2
1821 2
1823 1
1824 2
1825 1
1826 1
1827 1
1828 3
1829 1
1830 1
1831 1
1832 5
1833 5
1834 4
1836 1
1837 3
1838 5
1839 8
1840 8
1841 9
1842 5
1843 8
1844 7
1845 4
1846 6
1847 4
1848 6
1849 7
1850 7
1851 9
1852 10
1853 12
1854 10
1855 10
1856 12
1857 11
1858 4
1859 6
1860 10
1861 4
1862 8
1863 10
1864 13
1865 13
1866 23
1867 13
1868 12
1869 13
1870 11
1871 12
1872 22
1873 11
1874 18
1875 18
1876 14
1877 9
1878 10
1879 23
1880 33
1881 23
1882 29
1883 32
1884 24
1885 40
1886 28
1887 76
1888 66
1889 57
1890 66
1891 60
1892 65
1893 61
1894 73
1895 77
1896 96
1897 104
1898 101
1899 146
1900 136
1901 173
1902 165
1903 190
1904 188
1905 201
1906 209
1907 233
1908 207
1909 262
1910 298
1911 247
1912 293
1913 311
1914 284
1915 290
1916 310
1917 304
1918 279
1919 329
1920 378
1921 452
1922 506
1923 573
1924 598
1925 696
1926 722
1927 841
1928 929
1929 941
1930 932
1931 900
1932 834
1933 814
1934 904
1935 1007
1936 1120
1937 1179
1938 1227
1939 1315
1940 1338
1941 1414
1942 1376
1943 1275
1944 1234
1945 1396
1946 1761
1947 2151
1948 2279
1949 2566
1950 2798
1951 2848
1952 3079
1953 3155
1954 3247
1955 3639
1956 3809
1957 4519
1958 4710
1959 5219
1960 6356
1961 6873
1962 8034
1963 9000
1964 10102
1965 11190
1966 12010
1967 13215
1968 15053
1969 14910
1970 15148
1971 15206
1972 15431
1973 15499
1974 14719
1975 14809
1976 15335
1977 15934
1978 16182
1979 16130
1980 16702
1981 16934
1982 17701
1983 17967
1984 18722
1985 19142
1986 19679
1987 20228
1988 21121
1989 21800
1990 22617
1991 22606
1992 23597
1993 24946
1994 25764
1995 26398
1996 28777
1997 30433
1998 32392
1999 34009
2000 34971
2001 31958
2002 34289
2003 35175
2004 35047
2005 34666
2006 32715
2007 32316
2008 30810
2009 27114
2010 23747
2011 14436
2012 1288
5677 1
5703 1
5704 2
5705 1
5706 3
8200 1