billdueber/slip_flow.md

## slip_flow.md

      
SLIP flow for normal (non-print-holdings or collection-builder) items

A basic run-through of how things move through SLIP.
DB Tables overview


slip_rights: (one row per item). A copy-ish of rights_current with
additional information about when an item was last updated.
Populated/updated from vufind solr.
slip_queue: (one row per item-to-update). A list of HTIDs
along with slots to hold information about which (if any) process is
trying to index the item right now. Populated from slip_rights based
on timestamps.
slip_indexed: (one row per item). Every htid, its shard,
when it was last indexed, and how many times it's been indexed.
slip_errors: Lines from slip_queue where indexing failed.
slip_*_control, slip_*_timestamp, etc. (one line, or one line
per type of indexing run): On/off flags, when things last ran, etc.
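
To make the table descriptions concrete, here is a rough sketch of the row shapes as Ruby structs. It only uses the field names that show up in the pseudocode in this document; the real tables have more columns, and the struct names and the `test.0001` id are made up for illustration.

```ruby
# Hypothetical row shapes. Only fields named in this document's pseudocode
# are listed; the real tables carry more columns (rights data, etc.).
SlipRightsRow  = Struct.new(:id, :update_time, :insert_time, keyword_init: true)
SlipQueueRow   = Struct.new(:htid, :shard, :pid, :host, :proc_status, keyword_init: true)
SlipIndexedRow = Struct.new(:htid, :shard, :time, :indexed_ct, keyword_init: true)
SlipErrorRow   = Struct.new(:htid, :shard, :pid, :host, :time, :status, keyword_init: true)

# An unclaimed queue entry: shard 0 stands for "no shard assigned yet",
# and a nil pid means no process is working on the item.
queued = SlipQueueRow.new(htid: "test.0001", shard: 0)
unclaimed = queued.pid.nil?
```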

0. Update the catalog

For stupid historical reasons, the entire SLIP queuing process is
driven by what's in the catalog, as determined by querying the catalog.
To facilitate this, the catalog also indexes,
for each record, the last-updated dates of every htitem on that record, as
reported by Zephir in a MARC 974 field.
1. rights_j: Update the slip_rights table based on the catalog

slip_rights is a (kind of) copy of rights_current, with one row for
each item, updated whenever an item is determined to need indexing.
In addition to rights data, it has two other fields:

update_time: (date like 20210201) the last time the item (or its record)
was changed, according to Zephir.
insert_time: (timestamp) the time at which this line was last
inserted/updated (i.e., DEFAULT = CURRENT_TIMESTAMP). This will be
used later to determine which items actually need (re)indexing.

rights_j grabs likely IDs to (re)index from the catalog and updates
slip_rights with the update date (from the catalog) and changes the
insert_time to NOW.
Pseudocode
last_time_rights_ran = sql("select time from slip_vsolr_timestamp")

vufind_query("ht_id_update:[last_time_rights_ran TO *]").each do |rec|
  rec.hathi_items.each do |item|
    upsert_into_slip_rights(
      id = item.htid,
      update_time = item.zephir_update_date,
      insert_time = NOW,
      other_crap)
  end
end

# set this so we know how to query the catalog the next time around
sql("update slip_vsolr_timestamp set time = max_insert_time_in_slip_rights")
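
One detail worth noticing in the pseudocode: the catalog query matches whole records, so every item on a matching record gets upserted, not just the item that changed. The sketch below plays that out with in-memory stand-ins. The hash layouts, the `run_rights_j` name, and the `test.*` ids are assumptions for illustration, and dates are plain integers to keep the comparison simple.

```ruby
# In-memory stand-ins (assumptions): slip_rights is a Hash keyed by htid,
# and catalog_records plays the role of the catalog's Solr index.
# Dates are integers like 20210201.
def run_rights_j(slip_rights, catalog_records, last_run_date, now: Time.now)
  # Mimics ht_id_update:[last_run_date TO *]: a record matches if ANY of
  # its items was updated on or after last_run_date.
  matching = catalog_records.select do |rec|
    rec[:items].any? { |item| item[:update_date] >= last_run_date }
  end

  # Every item on a matching record is upserted, changed or not.
  matching.each do |rec|
    rec[:items].each do |item|
      slip_rights[item[:htid]] = { update_time: item[:update_date], insert_time: now }
    end
  end

  # Starting point for the next run: the newest insert_time on record.
  slip_rights.values.map { |row| row[:insert_time] }.max
end

catalog = [
  { items: [{ htid: "test.001", update_date: 20210201 },
            { htid: "test.002", update_date: 20200101 }] },
  { items: [{ htid: "test.003", update_date: 20190101 }] }
]
slip_rights = {}
new_timestamp = run_rights_j(slip_rights, catalog, 20210101, now: Time.at(0))
touched = slip_rights.keys  # both items on the first record; the second record is skipped
```
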
2. enqueuer-j: copy stuff from slip_rights to slip_queue

slip_queue holds only rows for things that need indexing. In addition to
the htid and shard, it has slots where processes can put data that indicate
the item is actually being worked on.
enqueuer-j keeps its last-run time in the extremely poorly named
slip_rights_timestamp table:
last_time_enqueuer_ran = sql("select time from slip_rights_timestamp")

items = sql("select * from slip_rights where insert_time >= last_time_enqueuer_ran")
items.each do |item|
  shard = sql("select shard from slip_indexed where htid=item.htid") || 0
  upsert_into_slip_queue(htid=item.htid, shard=shard)
end

sql("update slip_rights_timestamp set time = NOW")
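
The same step as a small runnable sketch, again with hashes standing in for the tables. The `run_enqueuer` name and the data layout are assumptions. What a real upsert does with a queue row that some process has already claimed is not covered here; this version simply overwrites it.

```ruby
# Copy recently touched slip_rights rows into slip_queue.
# All three "tables" are Hashes keyed by htid (an assumption of this sketch).
def run_enqueuer(slip_rights, slip_indexed, slip_queue, last_run, now: Time.now)
  slip_rights.each do |htid, row|
    next if row[:insert_time] < last_run

    # Items that were never indexed have no shard yet; 0 marks "unassigned".
    shard = slip_indexed.dig(htid, :shard) || 0

    # Simplification: overwrites any existing queue row for this htid.
    slip_queue[htid] = { shard: shard, pid: nil }
  end
  now # becomes the new value in slip_rights_timestamp
end

t0 = Time.at(1_000)
slip_rights = {
  "test.001" => { insert_time: t0 + 10 },  # touched since the last run, indexed before
  "test.002" => { insert_time: t0 - 10 },  # not touched since the last run
  "test.003" => { insert_time: t0 + 5 }    # touched since the last run, never indexed
}
slip_indexed = { "test.001" => { shard: 7 } }
slip_queue = {}
next_run = run_enqueuer(slip_rights, slip_indexed, slip_queue, t0, now: t0 + 20)
```
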
3. index-j: index the documents

index-j roughly/conceptually does the following (more detail to come):
htid, shard = sql("select htid, shard from slip_queue where pid is NULL")
sql("update slip_queue set
       pid = $PID,
       host = $HOST,
       proc_status = 'indexing'
     where htid = htid")

if shard == 0
  shard = random(12)
end
       
mets = mets_file(htid)

metadata = make_http_call_to_vufind_solr_no_for_real(htid)
other_metadata = mets.metadata_we_want
text = metadata.pages.map {|page| get_page_text(htid, page)}.join(' ')

solr_document = make_solr_doc(metadata, other_metadata, text)

status = http.post(solr_url_for_shard, solr_document)

sql("delete from slip_queue where htid=htid and pid=$PID")

if status == 'ok'
  sql("update slip_indexed set
        shard = shard,
        time = NOW,
        indexed_ct = indexed_ct + 1
      where htid = htid")
else
  sql("insert into slip_errors values (htid, shard, $PID, $HOST, NOW, status)")
end
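
The queue bookkeeping around the actual indexing (claim a row, assign a shard if needed, then record success or failure) can be sketched like this. It is only an in-memory illustration: `claim_next`, `finish`, and the hash layouts are made-up names, the Solr call is left out entirely, and shards are assumed to be numbered 1 through 12 because the pseudocode above uses `random(12)` and treats 0 as "unassigned".

```ruby
# Claim one unclaimed queue row by stamping it with our pid and host.
# Returns the claimed htid, or nil if nothing is waiting.
def claim_next(slip_queue, pid:, host:, shard_count: 12)
  htid, row = slip_queue.find { |_htid, r| r[:pid].nil? }
  return nil unless htid

  row.merge!(pid: pid, host: host, proc_status: "indexing")
  # Assumption: shards are 1..shard_count, with 0 meaning "not yet assigned".
  row[:shard] = rand(1..shard_count) if row[:shard] == 0
  htid
end

# Record the outcome. The queue row is removed whether indexing worked or not.
def finish(htid, status, slip_queue, slip_indexed, slip_errors, now: Time.now)
  row = slip_queue.delete(htid)
  if status == "ok"
    entry = (slip_indexed[htid] ||= { indexed_ct: 0 })
    entry.merge!(shard: row[:shard], time: now)
    entry[:indexed_ct] += 1
  else
    slip_errors << row.merge(htid: htid, time: now, status: status)
  end
end

slip_queue = {
  "test.001" => { shard: 0, pid: nil },
  "test.002" => { shard: 3, pid: nil }
}
slip_indexed = {}
slip_errors = []

first = claim_next(slip_queue, pid: 4242, host: "indexer-1")
finish(first, "ok", slip_queue, slip_indexed, slip_errors)

second = claim_next(slip_queue, pid: 4242, host: "indexer-1")
finish(second, "solr timeout", slip_queue, slip_indexed, slip_errors)
```

After both calls the queue is empty, the first item has a row in the stand-in for slip_indexed, and the second has one in the stand-in for slip_errors.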