Last active December 10, 2015 13:48
Count the most frequent publication years of WorldCat’s most frequently held items.
# Picking up Karen Coyle’s idea to gather the publication years of WorldCat’s most widely held items.
#
# Data available from OCLC at: ftp://anonftp.oclc.org/pub/researchdata/worldcat/WorldCatMostHighlyHeld-2012-05-15.nt.gz
# The downloaded file is expected to be in the current directory.
#
# Join the following commands together:
#
# 1. gzcat the file to access the text
# 2. grep for the lines containing »datePublished« or »copyrightYear« (»datePublished« only seems to exist for about half the records?)
# 3. sed with a regular expression to extract the four digits at the start of the quoted value (possibly prefixed by a p or c)
# 4. grep -v to drop all remaining lines that still contain a »<« (these are lines where sed matched no digits and left the triple unchanged)
# 5. sort the lines
# 6. uniq -c to count runs of identical lines
#
# Running the command takes a few minutes even on a fast machine (single-threaded grep seems to be the bottleneck).
gzcat WorldCatMostHighlyHeld-2012-05-15.nt.gz | grep "datePublished\|copyrightYear" | sed -e 's/.*schema.org[^>]*> "[cp]*\([0-9][0-9][0-9][0-9]\).*/\1/' | grep -v "<" | sort | uniq -c
# Apply something like
# sed -E -e 's/ *([0-9]*) ([0-9]*)/\2\t\1/'
# to the result to get a tab-separated format suitable for use in spreadsheets.
#
# Results shown at:
# https://docs.google.com/spreadsheet/ccc?key=0Ah9t1ddBuxv8dDhHT1VQT3ExV3ZJV3pKS3A0X0tLeGc
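To try the steps above without downloading the full dump, the following is a minimal sketch that runs steps 2 to 6 on a handful of made-up triples. The sample lines and the function names `sample_triples`, `count_years` and `to_tsv` are assumptions for illustration only; the layout of the real OCLC file may differ. The `to_tsv` helper uses awk in place of the suggested sed expression, since not every sed implementation treats `\t` in the replacement as a tab.

```shell
# Made-up sample triples (assumption: the real dump uses a similar layout).
sample_triples() {
cat <<'EOF'
<http://www.worldcat.org/oclc/1> <http://schema.org/datePublished> "1995" .
<http://www.worldcat.org/oclc/2> <http://schema.org/datePublished> "c1995" .
<http://www.worldcat.org/oclc/3> <http://schema.org/copyrightYear> "2001" .
<http://www.worldcat.org/oclc/4> <http://schema.org/datePublished> "n.d." .
<http://www.worldcat.org/oclc/5> <http://schema.org/name> "Some title" .
EOF
}

# Steps 2-6 of the pipeline, reading from stdin instead of gzcat.
count_years() {
  grep "datePublished\|copyrightYear" |
    sed -e 's/.*schema.org[^>]*> "[cp]*\([0-9][0-9][0-9][0-9]\).*/\1/' |
    grep -v "<" |
    sort |
    uniq -c
}

# Swap the "count year" columns of uniq -c and join them with a tab.
to_tsv() {
  awk '{ print $2 "\t" $1 }'
}

sample_triples | count_years | to_tsv
```

With this sample, the undated line and the title line are dropped, and the output should be two tab-separated rows: 1995 with a count of 2 and 2001 with a count of 1.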
1174 1
1410 1
1762 1
1770 1
1771 1
1776 1
1788 1
1791 1
1802 1
1803 2
1804 1
1808 1
1809 2
1810 2
1811 1
1814 1
1817 2
1818 1
1820 2
1821 2
1823 1
1824 2
1825 1
1826 1
1827 1
1828 3
1829 1
1830 1
1831 1
1832 5
1833 5
1834 4
1836 1
1837 3
1838 5
1839 8
1840 8
1841 9
1842 5
1843 8
1844 7
1845 4
1846 6
1847 4
1848 6
1849 7
1850 7
1851 9
1852 10
1853 12
1854 10
1855 10
1856 12
1857 11
1858 4
1859 6
1860 10
1861 4
1862 8
1863 10
1864 13
1865 13
1866 23
1867 13
1868 12
1869 13
1870 11
1871 12
1872 22
1873 11
1874 18
1875 18
1876 14
1877 9
1878 10
1879 23
1880 33
1881 23
1882 29
1883 32
1884 24
1885 40
1886 28
1887 76
1888 66
1889 57
1890 66
1891 60
1892 65
1893 61
1894 73
1895 77
1896 96
1897 104
1898 101
1899 146
1900 136
1901 173
1902 165
1903 190
1904 188
1905 201
1906 209
1907 233
1908 207
1909 262
1910 298
1911 247
1912 293
1913 311
1914 284
1915 290
1916 310
1917 304
1918 279
1919 329
1920 378
1921 452
1922 506
1923 573
1924 598
1925 696
1926 722
1927 841
1928 929
1929 941
1930 932
1931 900
1932 834
1933 814
1934 904
1935 1007
1936 1120
1937 1179
1938 1227
1939 1315
1940 1338
1941 1414
1942 1376
1943 1275
1944 1234
1945 1396
1946 1761
1947 2151
1948 2279
1949 2566
1950 2798
1951 2848
1952 3079
1953 3155
1954 3247
1955 3639
1956 3809
1957 4519
1958 4710
1959 5219
1960 6356
1961 6873
1962 8034
1963 9000
1964 10102
1965 11190
1966 12010
1967 13215
1968 15053
1969 14910
1970 15148
1971 15206
1972 15431
1973 15499
1974 14719
1975 14809
1976 15335
1977 15934
1978 16182
1979 16130
1980 16702
1981 16934
1982 17701
1983 17967
1984 18722
1985 19142
1986 19679
1987 20228
1988 21121
1989 21800
1990 22617
1991 22606
1992 23597
1993 24946
1994 25764
1995 26398
1996 28777
1997 30433
1998 32392
1999 34009
2000 34971
2001 31958
2002 34289
2003 35175
2004 35047
2005 34666
2006 32715
2007 32316
2008 30810
2009 27114
2010 23747
2011 14436
2012 1288
5677 1
5703 1
5704 2
5705 1
5706 3
8200 1
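The table above is sorted by year, so the most frequent years are not obvious at a glance. As a rough sketch, the rows can be re-sorted by their count column. The helper name `top_years` is hypothetical, and the input is assumed to be whitespace-separated "year count" lines like the ones above.

```shell
# Hypothetical helper: print the N rows with the highest counts
# (default 5) from "year count" lines on stdin.
top_years() {
  # Sort by count, descending; break ties by year, ascending.
  sort -k2,2nr -k1,1n | head -n "${1:-5}"
}

# A few rows copied from the table above.
printf '%s\n' \
  '2000 34971' \
  '2001 31958' \
  '2002 34289' \
  '2003 35175' \
  '2004 35047' \
  '2005 34666' | top_years 3
```

For these six rows the three highest counts belong to 2003, 2004 and 2000, in that order. Piping the full results file through `top_years` works the same way.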