
@lifan0127
Created March 8, 2023 02:53
Streamlining Literature Reviews with Paper QA and Zotero
import os
os.environ['OPENAI_API_KEY'] = '<Your OpenAI API Key>'
# See here on how to find your Zotero info: https://github.com/urschrei/pyzotero#quickstart
ZOTERO_USER_ID = '<Your Zotero User ID>'
ZOTERO_API_KEY = '<Your Zotero API Key>'
ZOTERO_COLLECTION_ID = '<Your Zotero Collection ID>'
question = 'What predictive models are used in materials discovery?'
# The following prompt instruction is injected to limit the number of keywords per query
question_prompt = 'A "keyword search" is a list of no more than 3 words, which separated by whitespace only and with no boolean operators (e.g. "dog canine puppy"). Avoid adding any new words not in the question unless they are synonyms to the existing words.'
from paperqa import Docs
from pyzotero import zotero
import requests
import shutil, sys, re
from bs4 import BeautifulSoup
docs = Docs()
queries = docs.generate_search_query(question + '\n' + question_prompt)
print(f'Search queries: {", ".join(queries)}')
zot = zotero.Zotero(ZOTERO_USER_ID, 'user', ZOTERO_API_KEY)
searches = [zot.collection_items(
    ZOTERO_COLLECTION_ID,
    q=q.strip('"'),
    limit=10,
    itemType='attachment',
    qmode='everything'
) for q in queries]
attachments = {item['key']: item for search in searches for item in search if item['data']['contentType'] == 'application/pdf'}.values()
parents = set([a['data']['parentItem'] for a in attachments])
citation_dict = {p: zot.item(p, content='bib', style='american-chemical-society')[0] for p in parents}
result_count = len(parents)
if result_count == 0:
    print('No matched results in Zotero')
    sys.exit()
print(f'Results: {result_count}')
# Make sure the download directory exists before writing into it
os.makedirs('data/zotero', exist_ok=True)
paths = []
citations = []
for attachment in attachments:
    link_mode = attachment['data']['linkMode']
    file_path = f'data/zotero/{attachment["key"]}.pdf'
    parent = citation_dict[attachment['data']['parentItem']]
    if link_mode == 'imported_file':
        # Stored in Zotero: download through the API
        zot.dump(attachment['key'], f'{attachment["key"]}.pdf', 'data/zotero')
    elif link_mode == 'linked_file':
        # Stored outside Zotero (e.g. managed by ZotFile): copy from the local path
        shutil.copy(attachment['data']['path'], file_path)
    elif link_mode == 'imported_url':
        # Imported from a URL: download it again from the source
        res = requests.get(attachment['data']['url'])
        with open(file_path, 'wb') as f:
            f.write(res.content)
    else:
        raise ValueError(f'Unsupported link mode: {link_mode} for {attachment["key"]}.')
    paths.append(file_path)
    # Strip the HTML and the leading "(n) " counter from the formatted citation
    citations.append(re.sub(r"^\(\d+\)\s+", "", BeautifulSoup(parent, 'html.parser').get_text().strip()))
for d, c in zip(paths, citations):
    docs.add(d, c)
answer = docs.query(question)
with open('data/zotero-answer.txt', 'w') as f:
    f.write(answer.formatted_answer)
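
The citation clean-up at the end of the download loop can be tried in isolation. The sketch below is an illustration, not part of the script: `clean_citation` is a hypothetical helper, the sample entry is made up, and a crude regular expression stands in for the BeautifulSoup call so that the example needs only the standard library.

```python
import re
from html import unescape


def clean_citation(bib_html: str) -> str:
    """Strip tags and a leading "(n) " counter from a formatted bibliography entry."""
    # Crude tag removal; the script itself uses BeautifulSoup for this step.
    text = unescape(re.sub(r"<[^>]+>", "", bib_html)).strip()
    # Raw string: \( and \) match the literal parentheses around the entry number.
    return re.sub(r"^\(\d+\)\s+", "", text)


# Made-up entry, shaped like a numbered bibliography item.
sample = '<div class="csl-entry">(1) Doe, J. An Example Title. Journal 2023.</div>'
print(clean_citation(sample))  # Doe, J. An Example Title. Journal 2023.
```

Without the backslashes, `(\d+)` becomes a capture group that matches bare digits, so the "(1) " prefix would be left in place.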
@lifan0127
Author

Input

The sample Zotero collection used for this prototype contains 10 open-access articles published in RSC Digital Discovery.

Output

Question: What predictive models are used in materials discovery?

Various predictive models are used in materials discovery, including crystallographic databases, materials genome approach, network analysis, structure-based synthesizability prediction, graph neural networks, deep learning, machine learning, linear regression models, support vector machines (SVM), support vector regression (SV-R), random forest regression (RF-R), and neural network regression (NN-R) (Gleaves2023 pages 13-14; Haraguchi2022 pages 6-7). These models are used to predict materials properties, which drives materials discovery (Gleaves2023 pages 13-14). The predictive capabilities of these models are typically measured using statistics such as the root-mean-square error (RMSE) or the coefficient of determination (r2) between ML-predicted materials property values and their known values (Borg2023a pages 1-1). Sequential learning (SL) is sometimes used when training data is scarce or extrapolation is necessary (Borg2023a pages 1-1). Composition-based feature vectors (CBFVs) are widely used in materials science for screening materials without the need for DFT calculations or synthesis (Durdy2022a pages 3-4).

References

  1. (Gleaves2023): Gleaves, D.; Fu, N.; Siriwardane, E. M. D.; Zhao, Y.; Hu, J. Materials Synthesizability and Stability Prediction Using a Semi-Supervised Teacher-Student Dual Neural Network. Digital Discovery 2023. https://doi.org/10.1039/D2DD00098A.

  2. (Haraguchi2022): Haraguchi, Y.; Igarashi, Y.; Imai, H.; Oaki, Y. Sparse Modeling for Small Data: Case Studies in Controlled Synthesis of 2D Materials. Digital Discovery 2022, 1 (1), 26–34. https://doi.org/10.1039/D1DD00010A.

  3. (Borg2023a): Borg, C. K. H.; Muckley, E. S.; Nyby, C.; Saal, J. E.; Ward, L.; Mehta, A.; Meredig, B. Quantifying the Performance of Machine Learning Models in Materials Discovery. Digital Discovery 2023. https://doi.org/10.1039/D2DD00113F.

  4. (Durdy2022a): Durdy, S.; Gaultois, M. W.; Gusev, V. V.; Bollegala, D.; Rosseinsky, M. J. Random Projections and Kernelised Leave One Cluster out Cross Validation: Universal Baselines and Evaluation Tools for Supervised Machine Learning of Material Properties. Digital Discovery 2022, 1 (6), 763–778. https://doi.org/10.1039/D2DD00039C.

Tokens Used: 5757 Cost: $0.01

@Edilson-R

Excellent script and design. One question: does the Zotero API query only reach files in the online library, or does it also look at the local file store?

@lifan0127
Author

@Edilson-R If you execute this script locally, it should be able to access linked files stored on your hard drive, for example, attachments managed by ZotFile.

@Edilson-R

I use Zotfile and I have the folder with the attached files in Google Drive. The algorithm is not able to find the files because I believe it looks in the folder that is in <C:\Users\user\Zotero\storage>. I haven't been able to make any headway with it.

@lifan0127
Author

@Edilson-R I follow the same practice (Zotfile + G Drive). This code block is meant to copy externally stored attachments for such linked files: https://gist.github.com/lifan0127/e34bb0cfbf7f03dc6852fd3e80b8fb19#file-paper-qa-zotero-py-L53-L54

Perhaps you can check the attachment paths (attachment['data']['path']) and see if they match the attachment file locations on your hard drive?
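
A quick way to run that check is sketched below. It rests on an assumption that should be verified against your own data: that Zotero stores linked files under a configured base directory as `attachments:relative/path.pdf`. Both helper names and the `base_dir` argument are hypothetical.

```python
import os


def resolve_linked_path(path: str, base_dir: str) -> str:
    """Turn a linked-file path from Zotero into a local path."""
    prefix = "attachments:"
    if path.startswith(prefix):
        # Assumption: the remainder is relative to the linked attachment base directory.
        return os.path.join(base_dir, path[len(prefix):])
    return path


def missing_files(paths, base_dir):
    """Return the resolved paths that do not exist on this machine."""
    resolved = [resolve_linked_path(p, base_dir) for p in paths]
    return [p for p in resolved if not os.path.exists(p)]
```

Passing the `attachment['data']['path']` values of the linked attachments to `missing_files` lists the files that `shutil.copy` would fail on.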

@Edilson-R

Hi, @lifan0127 , thank you for always answering me cordially. I'm really grateful for that.

I've been trying all week and still haven't figured out the error. I had to change line 28 (q=q.strip('"')) because with this parameter the "searches" list came back with null values.

I'm not working with file links anymore, and I imported them, so it should read by "zot.dump(attachment['key'], f'{attachment["key"]}.pdf', 'data/zotero')", but it doesn't work. I tried using the app from "https://huggingface.co/spaces/lifan0127/zotero-qa", but it doesn't work for my data either.

With these changes, the code runs up to line 47 and returns this error:

`PS C:\Users\edilsonag\Zotero\zotero-qa> & c:/Users/edilsonag/Zotero/zotero-qa/venv/Scripts/Activate.ps1
(venv) PS C:\Users\edilsonag\Zotero\zotero-qa> & c:/Users/edilsonag/Zotero/zotero-qa/venv/Scripts/python.exe c:/Users/edilsonag/Zotero/zotero-qa/teste3.py
Traceback (most recent call last):
File "C:\Users\edilsonag\Zotero\zotero-qa\venv\Lib\site-packages\pyzotero\zotero.py", line 411, in _retrieve_data
self.request.raise_for_status()
File "C:\Users\edilsonag\Zotero\zotero-qa\venv\Lib\site-packages\requests\models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.zotero.org/users/10100603/items/KUTPMVQH/file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\Users\edilsonag\Zotero\zotero-qa\teste3.py", line 60, in <module>
zot.dump(attachment['key'], f'{attachment["key"]}.pdf', 'data/zotero')
File "C:\Users\edilsonag\Zotero\zotero-qa\venv\Lib\site-packages\pyzotero\zotero.py", line 719, in dump
file = self.file(itemkey)
^^^^^^^^^^^^^^^^^^
File "C:\Users\edilsonag\Zotero\zotero-qa\venv\Lib\site-packages\pyzotero\zotero.py", line 178, in wrapped_f
retrieved = self._retrieve_data(func(self, *args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\edilsonag\Zotero\zotero-qa\venv\Lib\site-packages\pyzotero\zotero.py", line 413, in _retrieve_data
error_handler(self, self.request)
File "C:\Users\edilsonag\Zotero\zotero-qa\venv\Lib\site-packages\pyzotero\zotero.py", line 1642, in error_handler
raise error_codes.get(req.status_code)(err_msg(req))
pyzotero.zotero_errors.ResourceNotFound:
Code: 404
URL: https://api.zotero.org/users/10100603/items/KUTPMVQH/file
Method: GET
Response: Not found
(venv) PS C:\Users\edilsonag\Zotero\zotero-qa>`

@lifan0127
Author

@Edilson-R The error message suggests the PDF file was not found. Could it possibly be a synchronization issue? If you open the Zotero web library (https://www.zotero.org/mylibrary) and navigate to the item, are you able to see/open the PDF file there?

@Edilson-R

I can see the file in Zotero's web-library. I'm going to save the sqlite database backup and I'm going to reinstall Zotero again...

@Edilson-R

I managed to get the code to run. Thank you, this will help me a lot in my current PhD in Industrial Economics. Much appreciated!!!!!

@Edilson-R

Edilson-R commented Apr 9, 2023

Hello @lifan0127, good morning. In his Hugging Face app, the user "ryanrwatkins" talked about using the pickle code block to reuse embeddings. As I am only working with this code, and not with your app built with Gradio, where could this saving and loading functionality be implemented in your code? (Sorry for being a newbie and asking boring questions, but believe me: I started studying Python precisely because of your code. I'm an R user and a PhD student in Economics, and it has been very useful for the assisted construction of my literature review.)

My solution is to include this block at line 19:

if not os.path.exists("data/paperqa/my_docs.pkl"):
    docs = Docs()
with open("data/paperqa/my_docs.pkl", "rb") as f:
    docs = pickle.load(f)

And to include this at line 70:

with open("data/paperqa/my_docs.pkl", "wb") as f:
    pickle.dump(docs, f)

Is it correct???

@lifan0127
Author

@Edilson-R Happy to see your progress!

For regular API calls, the underlying LangChain package has cached the results in a local SQLite database. I believe caching for embeddings is not supported by LangChain yet. However, there is an open PR for this feature: langchain-ai/langchain#1930

Once it is added, it would be easy for us to reuse the embeddings transparently.
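
Independently of LangChain's caching, the load-or-create pattern asked about above can be sketched generically. In the snippet as posted, the `open(..., "rb")` call appears to run even when the file is missing, which would raise `FileNotFoundError` on the first run. The helpers below are hypothetical, and whether a `Docs` object pickles cleanly depends on the paper-qa version, so treat this as a sketch under that assumption; the demo uses a plain dict in place of `Docs`.

```python
import os
import pickle
import tempfile


def load_or_create(path, factory):
    """Load a pickled object from path, or build a new one with factory()."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return factory()


def save(obj, path):
    """Pickle obj to path, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


# Demo with a dict standing in for the Docs object.
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "paperqa", "my_docs.pkl")
    store = load_or_create(cache, dict)   # first run: nothing on disk yet
    store["paper"] = "embedded"
    save(store, cache)
    print(load_or_create(cache, dict))    # second run: loaded from disk
```

In the script, this would correspond to `docs = load_or_create("data/paperqa/my_docs.pkl", Docs)` near the top and `save(docs, "data/paperqa/my_docs.pkl")` after the documents have been added.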

@jalalawan

Thanks for this - I have an issue with .pdf read as follows:

An error occurred while reading PDF Zotero/storage/QI6TJKBR.pdf: EOF marker not found
An error occurred while reading PDF Zotero/storage/U4XJVRY7.pdf: EOF marker not found
An error occurred while reading PDF Zotero/storage/64WW8VRH.pdf: EOF marker not found
An error occurred while reading PDF Zotero/storage/U5CYRA4T.pdf: EOF marker not found
An error occurred while reading PDF Zotero/storage/L6RLDESA.pdf: EOF marker not found

I've checked the PDFs manually and written a script to ensure the EOF marker is present. Is there a way to force-read .pdf files with this error? Thanks again!
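
For reference, a basic structural check of this kind might look like the sketch below. `pdf_problems` is a hypothetical helper and covers only two things: the `%PDF-` header and a `%%EOF` marker near the end of the file. One possible cause worth ruling out (an assumption, not a diagnosis) is a download that saved an HTML error page under a `.pdf` name.

```python
import os
import tempfile


def pdf_problems(path, tail_bytes=1024):
    """Return a list of basic structural problems found in the file at path."""
    with open(path, "rb") as f:
        data = f.read()
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-tail_bytes:]:
        problems.append("no %%EOF marker near end of file")
    return problems


# Demo: an HTML page saved with a .pdf extension fails both checks.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "bad.pdf")
    with open(bad, "wb") as f:
        f.write(b"<html>Access denied</html>")
    print(pdf_problems(bad))
```

A file that passes both checks can still be damaged elsewhere, so an empty list only rules out the most common truncation and wrong-content cases.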

@lifan0127
Author

Hi @jalalawan I am not sure why the error occurred for you. Does it happen to all your PDF files?

@jalalawan

Thanks for responding - it's an issue with only some of my pdf files, I'll try forcing an EOF Marker or download fresh pdf files.

I had another question: how do I change the default GPT-3.5 engine to GPT-4, and change the temperature and max_tokens settings in the script? I am trying to use the following for GPT-4, but keep getting an "Engine not found" error:

llm_gpt4 = AzureOpenAI(
    deployment_name="gpt-4-v0314-base",
    temperature=0.1,
    model_name='gpt-4',
    max_tokens=7000)

and using it in the Docs class in the code as follows:

docs = Docs(llm_gpt4)

Appreciate your insights!

@lifan0127
Author

@jalalawan Sorry, I don't have experience with Azure OpenAI. Does your account (API key) have access to GPT-4?

@jalalawan

> @jalalawan Sorry, I don't have experience with Azure OpenAI. Does your account (API key) have access to GPT-4?

Appreciate your response - I do have access to GPT-4, I also looked at the API documentation and it looks like the default for Docs() is GPT3.5, I could not find an option to set temperature, max_tokens parameters.

All said, like others have mentioned here, really appreciate your contribution and making the LLM experience less hallucinatory and better-suited for research.

@andreifoldes

Thank you for sharing this. Does anyone have experience with how this approach compares to using the vector databases that the ChatGPT retrieval plugin advocates?

@lifan0127
Author

@andreifoldes I think the retrieval mechanisms are the same. This approach uses the FAISS library for vector-similarity search to find relevant document chunks, then feeds them into LLMs for response synthesis.
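
As a rough illustration of the retrieval step only, the sketch below ranks chunks by cosine similarity with a brute-force scan. It is a stand-in for what a vector index does, not FAISS itself, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


# Toy vectors: chunks 0 and 2 point roughly the same way as the query.
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], chunks))  # [0, 2]
```

The text of the selected chunks is then placed into the LLM prompt, which is the "response synthesis" half of the pipeline.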

@andreifoldes

Thank you - did you or anyone play around with the different libraries, would there be a reason for one outperforming the rest when it comes to academic Q&A tasks?

@jalalawan

Yes, I think retrieval over vector embeddings is based on the cosine similarity metric for FAISS, GPT, etc. Q&A performance depends largely on whether the model is fine-tuned and on the prompt templates (see the LangChain library).

I created a summarization and Q&A app also using GPT (and NLTK library for chunking / tokenization). Seems to work well for <15 pages. I'm keeping it open for folks to test for a couple days - please do check out and provide feedback:

https://powerful-dusk-64631.herokuapp.com/

Thanks,
Jalal

@JannikSchneider12

Hey,
I wanted to try out the script but it seems that the paper-qa package is updated and the generate_search_query method doesn't exist anymore. I tried a workaround but I am not sure, if it is correct, since it doesn't use a method from Docs anymore. Can someone have a look?

Note: I also tried to change ChatGPT to llama2

# See here on how to find your Zotero info: https://github.com/urschrei/pyzotero#quickstart

ZOTERO_USER_ID = ''
ZOTERO_API_KEY = ''
ZOTERO_COLLECTION_ID = ''

question = 'How is deep learning used for clustering mass spectra?'

# The following prompt instruction is injected to limit the number of keywords per query

question_prompt = 'A "keyword search" is a list of no more than 3 words, which separated by whitespace only and with no boolean operators (e.g. "dog canine puppy"). Avoid adding any new words not in the question unless they are synonyms to the existing words.'

import os
import re
import shutil
import sys

import requests
from bs4 import BeautifulSoup
from paperqa import Docs
from pyzotero import zotero

# Your Docs class implementation here

docs = Docs()

# Generate search queries manually (assuming you don't have the generate_search_query method)

keywords = [word.lower() for word in question.split() if len(word) > 2]
queries = [f'"{keyword}"' for keyword in keywords]

print(queries)

zot = zotero.Zotero(ZOTERO_USER_ID, 'user', ZOTERO_API_KEY)

searches = [zot.collection_items(
    ZOTERO_COLLECTION_ID,
    q=q.strip('"'),
    limit=10,
    itemType='attachment',
    qmode='everything'
) for q in queries]

print(f'searches:{searches}')

attachments = {item['key']: item for search in searches for item in search if item['data']['contentType'] == 'application/pdf'}.values()

parents = set([a['data']['parentItem'] for a in attachments])
citation_dict = {p: zot.item(p, content='bib', style='american-chemical-society')[0] for p in parents}
result_count = len(parents)

print(f'attachments:{attachments}')
print(f'parents:{parents}')

if result_count == 0:
    print('No matched results in Zotero')
    sys.exit()
print(f'Results: {result_count}')

# Define the directory where PDF files will be saved

pdf_directory = 'data/zotero_pdfs/'

if not os.path.exists(pdf_directory):
    os.makedirs(pdf_directory)

paths = []
citations = []

for attachment in attachments:
    link_mode = attachment['data']['linkMode']
    file_path = os.path.join(pdf_directory, f'{attachment["key"]}.pdf')
    parent = citation_dict[attachment['data']['parentItem']]
    if link_mode == 'imported_file':
        zot.dump(attachment['key'], f'{attachment["key"]}.pdf', pdf_directory)
    elif link_mode == 'linked_file':
        shutil.copy(attachment['data']['path'], file_path)
    elif link_mode == 'imported_url':
        res = requests.get(attachment['data']['url'])
        with open(file_path, 'wb') as f:
            f.write(res.content)
    else:
        raise ValueError(f'Unsupported link mode: {link_mode} for {attachment["key"]}.')
    paths.append(file_path)
    # Escaped parentheses match the literal "(n) " counter in front of the citation
    citations.append(re.sub(r"^\(\d+\)\s+", "", BeautifulSoup(parent, 'html.parser').get_text().strip()))

for d, c in zip(paths, citations):
    docs.add(d, c)

answer = docs.query(question)

print(answer)
with open('data/zotero-answer.txt', 'w') as f:
    f.write(answer.formatted_answer)
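
The manual keyword step in the workaround above turns every word longer than two characters into its own query, including words such as "how" and "for". A slightly stricter variant is sketched below as one possible approach; it is not a replacement for the LLM-generated queries of the original script. The stopword list and the `keyword_queries` helper are hypothetical, and the grouping mirrors the three-word limit stated in `question_prompt`.

```python
import re

# Small, hand-picked stopword list (an assumption; extend as needed).
STOPWORDS = {
    "a", "an", "the", "is", "are", "for", "of", "in", "on", "to",
    "how", "what", "used", "with", "and", "or",
}


def keyword_queries(question, max_words=3):
    """Split a question into keyword searches of at most max_words words each."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    keywords = []
    for word in words:
        # Keep the first occurrence of each non-stopword, in question order.
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return [
        " ".join(keywords[i:i + max_words])
        for i in range(0, len(keywords), max_words)
    ]


print(keyword_queries('How is deep learning used for clustering mass spectra?'))
# ['deep learning clustering', 'mass spectra']
```

Whether multi-word searches return more or fewer matches than single words depends on how the Zotero search treats them, so it is worth printing the `searches` list for both variants.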

@lifan0127
Author

@JannikSchneider12 This script was written several months ago. It hasn't been tested with the latest paper-qa release.

Meanwhile, I see the paper-qa package has added integration with Zotero. Have you checked it out yet?

https://github.com/whitead/paper-qa/blob/main/paperqa/contrib/zotero.py

@JannikSchneider12

@lifan0127 Thanks for your reply. I will have a look at it, but I am still at the very beginning regarding programming.

Btw is there a way to still run your script if I use the exact versions that you used there?

Again thanks for your help and time

@lifan0127
Author

@JannikSchneider12 Please check out this Hugging Face space: https://huggingface.co/spaces/lifan0127/zotero-qa, where you can ask questions based on your Zotero library without programming.

Also, I am working on a Zotero plugin to incorporate paper QA, among other features, into Zotero. Please check it out if you are interested: https://github.com/lifan0127/ai-research-assistant

@jasicarose75

jasicarose75 commented Feb 19, 2024

This is interesting, thank you. I really like the idea of optimizing literature reviews using tools like Paper QA and Zotero. These tools can greatly simplify and speed up the process of searching and analyzing scientific articles, helping you save time and improve the quality of your work. I had a similar project and I asked do my homework, I found https://edubirdie.com/do-my-homework for this. Now I know a lot about this myself and can give a lot of advice. The main thing is to use these tools effectively.
