Skip to content

Instantly share code, notes, and snippets.

@iangow
Last active January 18, 2024 10:49
Show Gist options
  • Save iangow/583557b7b91a87ee1e545aa839ccbb8d to your computer and use it in GitHub Desktop.
Save iangow/583557b7b91a87ee1e545aa839ccbb8d to your computer and use it in GitHub Desktop.
A brief look at CRSP-Compustat merges

More up-to-date material on this topic can be found here.


From WRDS:

The WRDS-created linking dataset (ccmxpf_linktable) has been deprecated. It will continue to be created for a transition period of 1 year. SAS programmers should use the Link History dataset (ccmxpf_lnkhist) from CRSP.

And from here:

LINKPRIM clarifies the link's relationship to Compustat's marked primary security within the related range. "P" indicates a primary link marker, as identified by Compustat in monthly security data. "C" indicates a primary link marker, as identified by CRSP to resolve ranges of overlapping or missing primary markers from Compustat in order to produce one primary security throughout the company history. "J" indicates a joiner secondary issue of a company, identified by Compustat in monthly security data.

So let's check it out. Given any GVKEY and a date, is there only one PERMNO that is matched with linkprim IN ('P', 'C')?

ccmxpf_lnkhist <- 
    tbl(pg, sql("SELECT * FROM crsp.ccmxpf_lnkhist")) 

gvkey_permno <-
    ccmxpf_lnkhist %>%
    filter(linkprim %in% c("C", "P"),
           !is.na(lpermno)) 

# Look for overlapping date ranges    
gvkey_permno %>%
    group_by(gvkey) %>%
    arrange(gvkey, linkdt) %>%
    mutate(lead_linkdt = lead(linkdt),
           lag_linkenddt = lag(linkenddt)) %>%
    filter(linkenddt >= lead_linkdt | lag_linkenddt >= linkdt)

Output:

> gvkey_permno %>%
+     group_by(gvkey) %>%
+     arrange(gvkey, linkdt) %>%
+     mutate(lead_linkdt = lead(linkdt),
+            lag_linkenddt = lag(linkenddt)) %>%
+     filter(linkenddt >= lead_linkdt | lag_linkenddt >= linkdt)
Source:   query [?? x 10]
Database: postgres 9.6.1 [igow@iangow.me:5432/crsp]
Groups: gvkey

   gvkey linkprim  liid linktype lpermno lpermco     linkdt  linkenddt
   <chr>    <chr> <chr>    <chr>   <dbl>   <dbl>     <date>     <date>
1 017196        P   01C       LX   77571   28850 1989-11-17 1989-11-29
2 017196        P    01       LC   77571   28850 1989-11-17 1994-07-21
# ... with 2 more variables: lead_linkdt <date>, lag_linkenddt <date>

So there is just one case of overlapping dates, and it has linkprim equal to P in both rows. But the answer is effectively "no" because you get the same PERMNO (77571) in either row.

Note that the question here begins with a GVKEY and asks "given this GVKEY, which PERMNO provides the correct security-related information (return, stock price) for a given date?" The answer is given by lpermno above. Note that there is no mention of datadate (from Compustat) here. But it may be that I am interested in security information on datadate and thus that would drive the selection of lpermno.

Note that the vast majority of GVKEYs map to just one PERMNO:

> gvkey_permno %>%    
+     select(gvkey, lpermno) %>%
+     distinct() %>%
+     group_by(gvkey) %>%
+     mutate(num_permnos = n()) %>%
+     ungroup() %>%
+     count(num_permnos) %>%
+     arrange(num_permnos)
Source:   query [?? x 2]
Database: postgres 9.6.1 [igow@iangow.me:5432/crsp]

  num_permnos     n
        <dbl> <dbl>
1           1 26520
2           2  1496
3           3   129
4           4     8
5           5     5

I think I've looked into the case with 5 PERMNOs before. If I recall correctly, it's a total mess with tracking stock, spin-offs, etc. But one observation doesn't matter much.

> stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
> gvkey_permno %>%
+     select(gvkey, lpermno) %>%
+     distinct() %>%
+     group_by(gvkey) %>%
+     mutate(num_permnos = n()) %>%
+     filter(num_permnos==5L) %>%
+     inner_join(ccmxpf_lnkhist) %>%
+     rename(permno=lpermno) %>%
+     inner_join(stocknames) %>%
+     arrange(linkdt) %>%
+     select(gvkey, permno, linkdt, linkenddt, ncusip, comnam)
Joining, by = c("gvkey", "lpermno")
Joining, by = "permno"
Source:   query [?? x 6]
Database: postgres 9.6.1 [igow@iangow.me:5432/crsp]
Groups: gvkey

   gvkey permno     linkdt  linkenddt   ncusip                      comnam
   <chr>  <dbl>     <date>     <date>    <chr>                       <chr>
1 032280  81875 1995-08-11 1999-03-09 87924V50 TELE COMMUNICATIONS INC NEW
2 032280  86746 1999-03-10 2001-08-09 00195720                A T & T CORP
3 032280  89130 2001-08-10 2006-05-09 53071810      LIBERTY MEDIA CORP NEW
4 032280  89130 2001-08-10 2006-05-09 53071810      LIBERTY MEDIA CORP NEW
5 032280  91275 2006-05-10 2011-09-25 53071M30      LIBERTY MEDIA CORP NEW
6 032280  13008 2011-09-26       <NA> 85571Q10                       STARZ
7 032280  13008 2011-09-26       <NA> 53032210  LIBERTY MEDIA CORP 2ND NEW
8 032280  13008 2011-09-26       <NA> 53032210  LIBERTY MEDIA CORP 2ND NEW
@iangow
Copy link
Author

iangow commented Mar 13, 2017

It seems that there's no real difference between ccmxpf_lnkhist and ccmxpf_linktable once you throw out values with missing gvkey or permno. This should probably be done any way. I figure the goal of these tables is to facilitate LEFT JOINs from CRSP or Compustat. But, if so, just use LEFT JOIN.

There are additional rows in ccmxpf_lnkused but these all have usedflag=-1 and thus probably shouldn't be used.

> # Connect to database and underlying tables.
> suppressPackageStartupMessages(library(dplyr))
> pg <- src_postgres()
> 
> ccmxpf_lnkhist <- tbl(pg, sql("SELECT * FROM crsp.ccmxpf_lnkhist"))
> ccmxpf_lnkused <- tbl(pg, sql("SELECT * FROM crsp.ccmxpf_lnkused"))
> ccmxpf_linktable <- tbl(pg, sql("SELECT * FROM crsp.ccmxpf_linktable"))
> 
> # Rearrange and rename variables for consistency with lnkhist and linktable
> ccmxpf_lnkused_real <-
+     ccmxpf_lnkused %>% 
+     rename(gvkey = ugvkey, linkprim = ulinkprim, liid = uiid, 
+            linktype = ulinktype, lpermno=upermno, lpermco=upermco,
+            linkenddt = ulinkenddt, linkdt = ulinkdt) %>%
+     mutate(usedflag = as.integer(usedflag),
+            lpermno = as.integer(lpermno),
+            lpermco = as.integer(lpermco)) %>%
+     select(gvkey, linkprim, liid, linktype, lpermno,
+            lpermco, usedflag, linkdt, linkenddt) %>%
+     filter(!is.na(lpermno), !is.na(gvkey)) %>%
+     collect()
> 
> ccmxpf_lnkhist_real <-
+     ccmxpf_lnkhist %>%
+     filter(!is.na(lpermno), !is.na(gvkey)) %>%
+     collect()
> 
> ccmxpf_linktable_real <-
+     ccmxpf_linktable %>%
+     filter(!is.na(lpermno), !is.na(gvkey)) %>%
+     collect()
> 
> # First, compare linktable with lnkused
> ccmxpf_linktable_real %>%
+     anti_join(ccmxpf_lnkused_real) 
Joining, by = c("gvkey", "linkprim", "liid", "linktype", "lpermno", "lpermco", "usedflag", "linkdt", "linkenddt")
# A tibble: 0 × 9
# ... with 9 variables: gvkey <chr>, linkprim <chr>, liid <chr>, linktype <chr>, lpermno <int>, lpermco <int>, usedflag <int>, linkdt <date>, linkenddt <date>
> 
> ccmxpf_lnkused_real %>%
+     anti_join(ccmxpf_linktable_real)  %>%
+     count()
Joining, by = c("gvkey", "linkprim", "liid", "linktype", "lpermno", "lpermco", "usedflag", "linkdt", "linkenddt")
# A tibble: 1 × 1
      n
  <int>
1  2721
> 
> ccmxpf_lnkused_real %>%
+     anti_join(ccmxpf_linktable_real)  %>%
+     count(usedflag)
Joining, by = c("gvkey", "linkprim", "liid", "linktype", "lpermno", "lpermco", "usedflag", "linkdt", "linkenddt")
# A tibble: 1 × 2
  usedflag     n
     <int> <int>
1       -1  2721
> 
> # Now, compare linktable with lnkhist
> ccmxpf_linktable_real %>%
+     anti_join(ccmxpf_lnkhist_real) %>%
+     count()
Joining, by = c("gvkey", "linkprim", "liid", "linktype", "lpermno", "lpermco", "linkdt", "linkenddt")
# A tibble: 1 × 1
      n
  <int>
1     0
> 
> ccmxpf_lnkhist_real %>%
+     anti_join(ccmxpf_linktable_real)  %>%
+     count()
Joining, by = c("gvkey", "linkprim", "liid", "linktype", "lpermno", "lpermco", "linkdt", "linkenddt")
# A tibble: 1 × 1
      n
  <int>
1     0
> 
> ccmxpf_lnkhist_real %>% 
+     anti_join(ccmxpf_lnkhist_real)
Joining, by = c("gvkey", "linkprim", "liid", "linktype", "lpermno", "lpermco", "linkdt", "linkenddt")
# A tibble: 0 × 8
# ... with 8 variables: gvkey <chr>, linkprim <chr>, liid <chr>, linktype <chr>, lpermno <dbl>, lpermco <dbl>, linkdt <date>, linkenddt <date>
# Connect to database and underlying tables.
suppressPackageStartupMessages(library(dplyr))
pg <- src_postgres()

ccmxpf_lnkhist <- tbl(pg, sql("SELECT * FROM crsp.ccmxpf_lnkhist"))
ccmxpf_lnkused <- tbl(pg, sql("SELECT * FROM crsp.ccmxpf_lnkused"))
ccmxpf_linktable <- tbl(pg, sql("SELECT * FROM crsp.ccmxpf_linktable"))


# Rearrange and rename variables for consistency with lnkhist and linktable
ccmxpf_lnkused_real <-
    ccmxpf_lnkused %>% 
    rename(gvkey = ugvkey, linkprim = ulinkprim, liid = uiid, 
           linktype = ulinktype, lpermno=upermno, lpermco=upermco,
           linkenddt = ulinkenddt, linkdt = ulinkdt) %>%
    mutate(usedflag = as.integer(usedflag),
           lpermno = as.integer(lpermno),
           lpermco = as.integer(lpermco)) %>%
    select(gvkey, linkprim, liid, linktype, lpermno,
           lpermco, usedflag, linkdt, linkenddt) %>%
    filter(!is.na(lpermno), !is.na(gvkey)) %>%
    collect()

ccmxpf_lnkhist_real <-
    ccmxpf_lnkhist %>%
    filter(!is.na(lpermno), !is.na(gvkey)) %>%
    collect()

ccmxpf_linktable_real <-
    ccmxpf_linktable %>%
    filter(!is.na(lpermno), !is.na(gvkey)) %>%
    collect()

# First, compare linktable with lnkused
ccmxpf_linktable_real %>%
    anti_join(ccmxpf_lnkused_real) 

ccmxpf_lnkused_real %>%
    anti_join(ccmxpf_linktable_real)  %>%
    count()

ccmxpf_lnkused_real %>%
    anti_join(ccmxpf_linktable_real)  %>%
    count(usedflag)

# Now, compare linktable with lnkhist
ccmxpf_linktable_real %>%
    anti_join(ccmxpf_lnkhist_real) %>%
    count()

ccmxpf_lnkhist_real %>%
    anti_join(ccmxpf_linktable_real)  %>%
    count()

ccmxpf_lnkhist_real %>% 
    anti_join(ccmxpf_lnkhist_real)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment