@felliott
Last active July 10, 2021 03:28
Hyposfesis

Email contents

Hey Dan,

(I've CC'd Nici & Eric on our side. I didn't want to presumptuously add Jon & Robert from h.is. Feel free to pull in the appropriate folks!)

I've continued investigating mislinked annotations in OSF preprints, as related to these two GitHub issues:

Unfortunately, this is complicated because it involves several overlapping problems: an OSF bug that has since been fixed, an existing bug that we don't have an easy fix for, and bad data that got stored while the first bug was present. Because the write-up is so long, I've formatted it as Markdown and posted it as a private gist at: https://gist.github.com/felliott/758af271ceb13d7a10700597dbf254bb.

=== tl;dr

There was a period of time on OSF preprints where new comments were being submitted with the wrong url and correct pdf fingerprint. We have since fixed the issue, but the comments are still linked to the wrong url. Even if we search by just the pdf fingerprint, we receive the wrong set of comments. I think this might need to be fixed on the hypothes.is side by manually repointing the data.

Please let me know if there is any other information I can provide!

Cheers, Fitz

Background

We render preprints on the OSF preprints page by inserting an iframe which points to a service (MFR - Modular File Renderer) running on an osf.io subdomain (mfr.osf.io). We do this because users don't always use pdfs to distribute their preprints; .docx, .doc, and even .pptx files are occasionally uploaded. MFR converts those file types to pdfs that we can render on the web. Inside the iframe we load this JavaScript file to initialize and configure hypothes.is. Hypothes.is is not enabled for all MFR iframes; the parent frame can signal whether h.is should be loaded. We currently only do this for a few preprint services that have requested it, including MetaArXiv, SocArXiv, and PsyArXiv. We use two properties to identify our preprints, PDF fingerprints and preprint URLs, and we modify both of them.

PDF fingerprints and Stable IDs

MFR generates its own stable ids and passes them as the PDF fingerprint to h.is. Our users can upload new versions of their pdfs, but we can't guarantee that they will have the same fingerprint. We would like the comments to be preserved between preprint versions. If we don't define the fingerprint, the comments can disappear. In addition, when we generate a pdf from a .docx file, we cache the generated pdf. When we upgrade LibreOffice (our pdf generator) to get rendering improvements, the cache is cleared and new versions are generated on-demand. These new versions have different fingerprints, which causes the comments to disappear.

These stable ids are a hash of version-stable metadata stored in the OSF. We do not embed this id directly into the file. Instead we define it in a <script> tag inside the iframe, then override the fingerprint attributes inside the PDFViewerApplication. This is why the fingerprint in the downloaded document doesn't match the fingerprint sent in the uri=urn:x-pdf:$foo query parameter.
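To make the idea concrete, here is a minimal sketch of a version-stable id. It is not MFR's actual code: the metadata field names and the choice of SHA-256 over canonical JSON are assumptions made for illustration (the ids quoted in this write-up are 64 hex characters, which is consistent with SHA-256 but doesn't confirm it).

```python
import hashlib
import json


def stable_id(metadata: dict) -> str:
    """Hash metadata into a fingerprint-like hex string."""
    # Canonical JSON (sorted keys, fixed separators) so the same metadata
    # always serializes to the same bytes.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def version_stable(metadata: dict) -> dict:
    """Drop fields that change between file versions (hypothetical field name)."""
    return {k: v for k, v in metadata.items() if k != "version"}


# Hypothetical metadata records for two versions of the same preprint file.
v1 = {"provider": "example-storage", "preprint": "cd5j9", "version": 1}
v2 = {"provider": "example-storage", "preprint": "cd5j9", "version": 2}

id_v1 = stable_id(version_stable(v1))
id_v2 = stable_id(version_stable(v2))
```

Because the version number is excluded before hashing, `id_v1` and `id_v2` are identical, which is the property that lets comments follow a preprint across re-uploads and re-renders.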

Preprint url

The default url that the hypothes.is client links comments to is the url of the iframe. Our rendering iframe is designed to be embeddable and contains a lot of information in its query parameters. Ex: https://mfr.osf.io/render?url=https://osf.io/download/605a4c635717460086e6525e/?direct%26mode=render. This is a relatively simple example, but even in this example the url could reference the same target pdf using several different paths. The url may contain a number of different query parameters (as this does). We do not define or enforce a particular order of query parameters either.
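As a small illustration of why the raw iframe url makes a poor key, the sketch below compares two render urls that differ only in query parameter order. The hosts are placeholders and the `normalize` helper is hypothetical; we don't do this normalization in MFR.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Return the url with its query parameters sorted (hypothetical helper)."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))


# Two placeholder render urls pointing at the same file.
a = "https://mfr.example/render?url=https%3A%2F%2Fexample.org%2Ffile&mode=render"
b = "https://mfr.example/render?mode=render&url=https%3A%2F%2Fexample.org%2Ffile"

same_as_strings = a == b
same_after_sorting = normalize(a) == normalize(b)
```

The two strings compare unequal even though they render the same file, so comments keyed on the raw url could end up split across several keys. Sorting the parameters only addresses ordering; it does nothing about the same pdf being reachable through different download paths.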

To avoid linking comments to opaque urls, we ask for the iframe's referrer path and link comments to that. This makes sense for our use-case: hypothes.is is used on OSF Preprints mainly for reviewer comments and discussion. The parent frame's url will tend to be of the format https://osf.io/preprints/metaarxiv/cd5j9/.

referrerpolicy changes and effects

When we developed this feature in 2017/2018 we used the url of the parent frame, which tends to be much more succinct than the iframe url (e.g. https://osf.io/preprints/metaarxiv/cd5j9/, https://osf.io/sfc38/, https://psyarxiv.com/qaek6/). When the h.is sidebar is loaded, it searches for comments using the parent frame url and the stable id. In 2020 browsers started being stricter by default about what information about the parent frame is available to the iframe.

Before the browsers changed their defaults, the hypothes.is search query for the preprint at https://osf.io/preprints/metaarxiv/cd5j9/ would be:

https://hypothes.is/api/search?_separate_replies=false&group=__world__&limit=50&order=asc&sort=created&uri=https%3A%2F%2Fosf.io%2Fpreprints%2Fmetaarxiv%2Fcd5j9%2F&uri=urn%3Ax-pdf%3A3b67a8f9a67e369c0b9936dac10cabb6c72d4d56045f9ce4bb6826311993fb16

After the browser default change it became:

https://hypothes.is/api/search?_separate_replies=false&group=__world__&limit=50&order=asc&sort=created&uri=https%3A%2F%2Fosf.io%2F&uri=urn%3Ax-pdf%3A3b67a8f9a67e369c0b9936dac10cabb6c72d4d56045f9ce4bb6826311993fb16

The only difference is the first uri parameter. Before the change it was uri=https://osf.io/preprints/metaarxiv/cd5j9/. After the change it was uri=https://osf.io/.
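The two search urls above can be reproduced with a short sketch. The parameter names and values are copied from the urls in this section; the `search_url` helper itself is hypothetical and is not part of the hypothes.is client.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://hypothes.is/api/search"


def search_url(page_uri: str, stable_id: str) -> str:
    """Build a search url that passes both identifiers as repeated uri params."""
    params = [
        ("_separate_replies", "false"),
        ("group", "__world__"),
        ("limit", 50),
        ("order", "asc"),
        ("sort", "created"),
        ("uri", page_uri),                   # url reported by the parent frame
        ("uri", f"urn:x-pdf:{stable_id}"),   # MFR stable id sent as the fingerprint
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


CD5J9_ID = "3b67a8f9a67e369c0b9936dac10cabb6c72d4d56045f9ce4bb6826311993fb16"

before = search_url("https://osf.io/preprints/metaarxiv/cd5j9/", CD5J9_ID)
after = search_url("https://osf.io/", CD5J9_ID)
```

Dropping either `uri` tuple from the list gives the single-identifier searches that are useful when checking which identifier a set of comments is actually attached to.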

mitigation

We deployed a fix for the problem with this commit in our ember-osf library that is used by OSF preprints. The commit message is below:

specify default referrer policy on MFR iframes

 * Set the referrer-policy on MFR iframes to
   "no-referrer-when-downgrade".  This allows the page within the
   iframe to get both the origin and path of the calling context.
   This was the default assumed by browsers for a long time. Recently
   Chrome has started defaulting to "strict-origin-when-cross-origin",
   giving the child page access to the origin, but not the path.

   The hypothes.is support code inside the MFR iframe relies on
   knowing the origin AND the path to help it identify the relevant
   comments. Setting the default policy back to the old value will
   allow Chrome to correctly map comments to files.

   See: https://developers.google.com/web/updates/2020/07/referrer-policy-new-chrome-default

This fixed the issue for a time, allowing the pdf iframe to get the parent url. Later browser changes made it so that this doesn't work for cross-domain requests. Unfortunately, this means the fix does not work for PsyArXiv, which is served from its own domain rather than from osf.io. We have a project on our roadmap to fix this, but no resources to devote to it yet.
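The difference between the two policies named in the commit message can be modelled roughly as below. This is a simplified sketch of the policies as I understand them, not browser code: it ignores credentials and fragments in the url, and it does not model the later browser changes that broke the cross-domain case. The `referrer_seen` function is hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def referrer_seen(policy: str, parent_url: str, frame_url: str) -> Optional[str]:
    """Return the referrer the framed page would see, or None if it is withheld."""
    parent, frame = urlsplit(parent_url), urlsplit(frame_url)
    downgrade = parent.scheme == "https" and frame.scheme != "https"
    same_origin = (parent.scheme, parent.hostname, parent.port) == (
        frame.scheme, frame.hostname, frame.port)

    # Full referrer: origin + path + query.
    full = f"{parent.scheme}://{parent.netloc}{parent.path}"
    if parent.query:
        full += f"?{parent.query}"
    # Origin-only referrer.
    origin_only = f"{parent.scheme}://{parent.netloc}/"

    if policy == "no-referrer-when-downgrade":
        return None if downgrade else full
    if policy == "strict-origin-when-cross-origin":
        if downgrade:
            return None
        return full if same_origin else origin_only
    raise ValueError(f"policy not modelled in this sketch: {policy}")


PARENT = "https://osf.io/preprints/metaarxiv/cd5j9/"
FRAME = "https://mfr.osf.io/render"

old_default = referrer_seen("no-referrer-when-downgrade", PARENT, FRAME)
new_default = referrer_seen("strict-origin-when-cross-origin", PARENT, FRAME)
```

Under the old default the MFR iframe sees the full preprint url; under the new default it only sees https://osf.io/, which matches the change in the first uri parameter of the search query.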

Mislinked data issues

Since we have fixed the url issues for non-cross-domain preprints, we'd expect that comments would be correctly linked to their preprints. Unfortunately, we're not seeing that. For example, bringing up the h.is sidebar on the preprint at https://osf.io/preprints/metaarxiv/cd5j9/ shows 222 linked comments. Few of these (if any) seem relevant to the preprint. Inspecting the structure of the first fifty comments shows that some are linked to the url https://osf.io and others are linked to the url https://osf.io/sfc38/. I'm not sure how this is happening, but it seems to be related to the pdf fingerprint. If I search only by the (correct) url, I get no results. If I search only by the PDF fingerprint, I get the 222 results.
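The inspection described above amounts to grouping the returned comments by the url they are linked to. Here is a rough sketch, assuming the search response is a JSON object whose `rows` entries each carry an `id` and a `uri` field. The sample rows are the three id/url pairs listed in Appendix B rather than a live API response, and `group_by_uri` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_uri(rows: list) -> dict:
    """Map each linked url to the ids of the comments attached to it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["uri"]].append(row["id"])
    return dict(groups)


# Sample rows copied from Appendix B (not fetched from the API).
sample_rows = [
    {"id": "xLYq7C9sEemrOiPSqQ3XZw", "uri": "https://osf.io/"},
    {"id": "tYtUbE8PEemGr0NXTxbS1w", "uri": "https://osf.io/sfc38/"},
    {"id": "XnvamE_IEem0nSuklJzo1w", "uri": "https://osf.io/sfc38/"},
]

linked = group_by_uri(sample_rows)
```

Run over the full result set, a grouping like this would show at a glance how many comments hang off the bare https://osf.io/ origin versus a specific preprint url.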

This is speculation on my part, but I'm guessing that during the period when we were affected by the referrer-policy bug, something on hypothes.is's end started associating the PDF fingerprint for cd5j9 with https://osf.io. Otherwise, I'm not sure why that fingerprint would be returning that set of comments. This is where we need your help. Why are these comments being pointed here, and how can we update these links? I'm afraid this may be something that is only doable from your end, though we may be able to provide direction on what should point where.

Summary

Internet is hard. COS uses two identifiers to map comments to preprints, but one of those has become volatile due to (understandable) changes in the web's default privacy practices. We've tried to adapt, but some comments may have already been mislinked on the h.is backend and will need to be updated.

Next steps

  1. It would be helpful to understand how newly-submitted comments are mapped on the h.is backend. If we submit a comment with two pieces of identifying information (url and pdf fingerprint), how does that get mapped? What happens if one of those is temporarily incorrect for Reasons of Programming™?

  2. Figure out how to remap incorrectly-linked comments. This may be a collaborative effort between h.is & COS to figure out the appropriate targets for the mislinked comments. Can you give us a sense of how hard this is?

  3. COS needs to fix up MFR so that we don't rely on the referrer being available from within the iframe. That'll probably involve some heavy reworking of MFR. How do we deal w/ PsyArXiv in the meantime?

Reference

These are some of the projects that we have received reports about.

cd5j9 (MetaArXiv) Observing Many Researchers Using the Same Data and Hypothesis Reveals a Hidden Universe of Uncertainty

Opening the hypothes.is client on this preprint brings in a lot of unrelated comments.

URL: https://osf.io/preprints/metaarxiv/cd5j9/

Stable ID / Fingerprint: 3b67a8f9a67e369c0b9936dac10cabb6c72d4d56045f9ce4bb6826311993fb16

vwe36 (SocArXiv) Does Partisan Identity Reduce Support for Electoral Fairness?

Opening the hypothes.is client on this preprint brings in NO comments.

URL: https://osf.io/preprints/socarxiv/vwe36/

Stable ID / Fingerprint: c82d80801a9a8155f56829f7aa49c64f67cc0ae9bbeb3091e9f93b667b14cc46

qaek6 (PsyArXiv) Tapping to unfamiliar and highly syncopated rhythms: Modelling behaviour and cognitive mechanisms

Opening the hypothes.is client on this preprint brings in 7 comments that should be attached to https://psyarxiv.com/3y54r/. Difficult to fix currently because PsyArXiv runs on a branded domain instead of on osf.io.

URL: https://psyarxiv.com/qaek6/

Stable ID / Fingerprint: d4ffda76a7cda3b92f5fd45eeb6c349fb720dc342d0c02e9c1aeb0a028719427

Appendix A: Comments linked to https://psyarxiv.com

URL to search for all comments linked to psyarxiv.com

Returns 7 comments, all of which should be linked to https://psyarxiv.com/3y54r/

Appendix B: Comments linked to https://osf.io

URL to search for all comments linked to osf.io

Returns 222 comments, the first 50 of which apply to a number of different preprints, including:

https://osf.io/nmvgs/
  0: xLYq7C9sEemrOiPSqQ3XZw => https://osf.io/
Unknown
  1: JmyF5D-hEemfyte7oam8Zg => ???
  19: 2tywdIOgEemgSPPfNvJN-w => ???
  20: R3DLvIOhEem21MenF3u8Sg => ???
  21: Ng4nQoOiEemfQmuPI64bWg => ???
https://osf.io/sfc38/
  2: tYtUbE8PEemGr0NXTxbS1w => https://osf.io/sfc38/
  3: XnvamE_IEem0nSuklJzo1w => https://osf.io/sfc38/
  22-50: ??? => https://osf.io/sfc38/

Appendix C: Scratch Area

This is a nonsense/chaos space for Fitz to deposit rough notes, links, reminders, and unconsidered ideas.

https://osf.io/sfc38/

https://hypothes.is/api/search?_separate_replies=false&group=__world__&limit=50&order=asc&sort=created&uri=https%3A%2F%2Fpsyarxiv.com%2F&uri=urn%3Ax-pdf%3Ad4ffda76a7cda3b92f5fd45eeb6c349fb720dc342d0c02e9c1aeb0a028719427

cd5j9: MFR_STABLE_ID=3b67a8f9a67e369c0b9936dac10cabb6c72d4d56045f9ce4bb6826311993fb16

sfc38: MFR_STABLE_ID=8da0ab446a14375671862daff43726499228d6be635d72cd133d546af08feac9
