@IanMulvany
Created May 22, 2024 07:56
LLM prompts in the data-to-paper project.
issue=dedent_triple_quote_str("""
The code created the output file "{}", but the file is too long!
Here, for context, is the beginning of the output:
```output
{}
```
""").format(filename, extract_to_nearest_newline(content, self.max_tokens)),
------------------------------------------------------------
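Most snippets in this listing wrap their prompt text in `dedent_triple_quote_str`, and many prompt lines end in a literal `\t`. The sketch below is a guess at what such a helper might do, assuming (this listing does not confirm it) that it dedents the text and treats a tab immediately before a newline as a line-continuation marker. The names `dedent_prompt` and `cut_at_newline` are hypothetical stand-ins for `dedent_triple_quote_str` and `extract_to_nearest_newline`. The snippet above passes `self.max_tokens`, whereas this sketch simply counts characters.

```python
import textwrap


def dedent_prompt(text: str) -> str:
    """Hypothetical stand-in for dedent_triple_quote_str (assumed behavior)."""
    # Drop the newline that follows the opening triple quote.
    if text.startswith("\n"):
        text = text[1:]
    text = textwrap.dedent(text)
    # Assumption: a tab right before a newline joins the two lines.
    return text.replace("\t\n", "")


def cut_at_newline(text: str, max_chars: int) -> str:
    """Hypothetical truncation: keep only whole lines that fit in max_chars."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n", 0, max_chars + 1)
    # If no newline fits in the budget, fall back to a hard cut.
    return text[:cut] if cut != -1 else text[:max_chars]


template = dedent_prompt("""
    The code created the output file "{}", \t
    but the file is too long!
    """)
message = template.format("results.txt")
print(message)
```

Under these assumptions the two source lines of the template are joined into a single prompt line before the filename is filled in.
------------------------------------------------------------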
response_to_self_error: str = dedent_triple_quote_str("""
{}
Please {goal_verb} the {goal_noun} again with this error corrected.
""")
------------------------------------------------------------
response_to_non_matching_citations: str = dedent_triple_quote_str("""
The following citation ids were not found:
{}
Please make sure all citation ids are written exactly as in the citation lists above.
""")
------------------------------------------------------------
response_to_floating_citations: str = dedent_triple_quote_str("""
The following citation ids are not properly enclosed in a \\cite{{}} command:
{}
Please make sure all citation ids are enclosed in a \\cite{{}} command.
""")
------------------------------------------------------------
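The doubled braces in `\\cite{{}}` matter if the template is later filled with `str.format`: there, `{{` and `}}` produce literal braces, while a bare `{}` is a placeholder. A minimal illustration, assuming the template receives a single positional argument (the shortened template text and the citation ids below are invented):

```python
# Shortened, invented template; only the brace handling is the point here.
template = "Not enclosed in a \\cite{{}} command:\n{}\n"

rendered = template.format("Smith2020, Jones2021")
print(rendered)
```

The rendered text contains a literal `\cite{}` followed by the list of ids on its own line.
------------------------------------------------------------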
response_to_self_error: str = dedent_triple_quote_str("""
Your response should include a Python dictionary Dict[str, str], mapping the issues you found (keys), \t
to suggested solutions (values).
If you are sure that there are no issues, you should respond with an empty dictionary, `{}`.
""")
------------------------------------------------------------
system_prompt: str = dedent_triple_quote_str("""
You are a brilliant data scientist. You are writing Python code to analyze data.
""")
------------------------------------------------------------
present_code_as_fresh: str = dedent_triple_quote_str("""
Here is the code to perform the analysis.
{created_file_names_explanation}
```python
{code}
```
""") # set to None to not present code
------------------------------------------------------------
code_review_formatting_instructions: str = dedent_triple_quote_str("""
Return your choice as a Python Dict[str, str], mapping possible issues to suggested changes in the code.
If you have no suggestions for improvement, return an empty dict:
```python
{}
```
""")
------------------------------------------------------------
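A reply in the `Dict[str, str]` format requested above can be parsed without executing it. The sketch below is one possible approach, not the project's own parser: it assumes the reply holds a single dictionary literal and reads it with `ast.literal_eval`. The function name `parse_issues` is hypothetical.

```python
import ast
from typing import Dict


def parse_issues(reply: str) -> Dict[str, str]:
    """Extract the outermost {...} literal from reply and validate its types."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No dictionary found in the reply.")
    # literal_eval only accepts Python literals, so no code is executed.
    value = ast.literal_eval(reply[start:end + 1])
    if not isinstance(value, dict):
        raise ValueError("Expected a dictionary.")
    for key, val in value.items():
        if not isinstance(key, str) or not isinstance(val, str):
            raise ValueError("Keys and values must be strings.")
    return value


issues = parse_issues('Here it is: {"NaN in column age": "Drop missing rows"}')
no_issues = parse_issues("{}")
```

An empty dictionary passes validation, which matches the "no issues" case the prompt asks for.
------------------------------------------------------------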
CodeReviewPrompt('*', False, dedent_triple_quote_str("""
I ran your code.
Here is the content of the output file(s) that the code created:
{file_contents_str}
Please check if there is anything wrong in these results (like unexpected NaN values, or anything else \t
that may indicate that code improvements are needed).
{code_review_formatting_instructions}
""")
------------------------------------------------------------
response = dedent_triple_quote_str("""
The code has some issues that need to be fixed:
{issues_to_solutions}
{prompt_to_append_at_end_of_response}
""").format(issues_to_solutions=issues,
------------------------------------------------------------
return dedent_triple_quote_str("""
I ran the code and got the following error message:
```error
{}
```
""").format(str_error)
------------------------------------------------------------
dedent_triple_quote_str("""
Please rewrite the complete code again with these issues corrected.
GENERAL FORMATTING INSTRUCTIONS:
Even if you are changing just a few lines, you must return the complete code again in a \t
single code block, including the unchanged parts, so that I can just copy-paste and run it.
{required_headers_prompt}
""")
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Your code should only use these packages: {supported_packages}.
Note that there is a `{var}` in `{known_package}`. Is this perhaps what you needed?
""").format(supported_packages=self.supported_packages, var=var, known_package=KNOWN_MIS_IMPORTS[var]),
------------------------------------------------------------
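The message above is sent when generated code imports a package outside the supported list. How the project detects this is not shown in this listing; the sketch below is one static alternative that inspects the code with the standard `ast` module. The function name and the example package set are assumptions made for illustration.

```python
import ast
from typing import List, Set


def find_unsupported_imports(code: str, supported: Set[str]) -> List[str]:
    """Return top-level package names imported by code but not in supported."""
    unsupported: List[str] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            package = name.split(".")[0]
            if package not in supported and package not in unsupported:
                unsupported.append(package)
    return unsupported


# Invented example values; the real supported-package list is not shown here.
supported_packages = {"pandas", "numpy", "scipy"}
code = "import pandas as pd\nfrom sklearn.linear_model import LinearRegression\n"
bad = find_unsupported_imports(code, supported_packages)
print(bad)
```

A static check like this does not catch imports made dynamically at run time, so it is only a first filter.
------------------------------------------------------------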
instructions=dedent_triple_quote_str("""
Your code should only use these packages: {supported_packages}.
""").format(supported_packages=self.supported_packages),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
As noted in the data description, we only have these files:
{}
Note that all input files are located in the same directory as the code.
""").format(self.data_filenames),
------------------------------------------------------------
issue=dedent_triple_quote_str("""
Your code must contain the following sections:
{headers_required_in_code}.
But I could not find these headers:
{required_strings_not_found}.
""").format(
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
The code can create and write to this output file, but should not read from it.
The only input files from which we can read the data are:
{}
""").format(self.data_filenames),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
We only have these files:
{}
Note that all input files are located in the same directory as the code.
""").format(self.data_filenames),
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Please write literature-search queries that we can use to search for papers related to our study.
You would need to compose search queries to identify prior papers covering these {num_scopes} areas:
{pretty_scopes_to_definitions}
Return your answer as a `Dict[str, List[str]]`, where the keys are the {num_scopes} areas noted above, \t
and the values are lists of query strings. Each individual query should be a string of 5-10 words.
For example, for a study reporting waning of the efficacy of the covid-19 BNT162b2 vaccine based on analysis \t
of the "United Kingdom National Core Data (UK-NCD)", the queries could be:
```python
{pretty_scopes_to_examples}
```
""")
------------------------------------------------------------
self._raise_self_response_error(dedent_triple_quote_str("""
Queries should be 5-10 words long.
The following queries are too long:
{}
Please return your complete response again, with these queries shortened.
""").format(NiceList(too_long_queries, wrap_with='"', prefix='', suffix='', separator='\n')))
------------------------------------------------------------
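The length rule in the message above can be checked by counting whitespace-separated words. A minimal sketch, assuming the reply has already been parsed into a `Dict[str, List[str]]` and that ten words is the upper limit; the function name and the example queries are invented:

```python
from typing import Dict, List


def find_too_long_queries(queries: Dict[str, List[str]], max_words: int = 10) -> List[str]:
    """Return every query, across all scopes, with more than max_words words."""
    return [
        query
        for scope_queries in queries.values()
        for query in scope_queries
        if len(query.split()) > max_words
    ]


# Invented example reply; the scope names are placeholders.
reply = {
    "background": ["covid-19 vaccine efficacy waning over time"],
    "dataset": [
        "a very long query that keeps adding more and more words about UK data"
    ],
}
too_long = find_too_long_queries(reply)
```

Only the second query exceeds the limit, so it is the one that would be listed in the error message.
------------------------------------------------------------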
failure_message = dedent_triple_quote_str("""
## Run Terminated
Run terminated prematurely during stage `{current_stage}`.
```error
{exception}
```
""")
------------------------------------------------------------
unexpected_error_message = dedent_triple_quote_str("""
# Run failed.
*data-to-paper* exited unexpectedly.
### Exception:
```error
{exception}
```
""")
------------------------------------------------------------
success_message = dedent_triple_quote_str("""
## Completed
This *data-to-paper* research cycle is now completed.
The manuscript is ready.
The created manuscript and all other output files are saved in:
{output_directory}
You can click the "Compile Paper" stage button to open the manuscript.
Please check the created manuscript rigorously and carefully.
*Remember that the process is not error-free and the responsibility for the final manuscript \t
remains with you.*
You can close the app now.
""")
------------------------------------------------------------
fake_reviewer_agree_to_help: str = dedent_triple_quote_str("""
Sure, I am happy to guide you {goal_verb} the {goal_noun} and can also provide feedback.
Note that your {goal_noun} should be based on the following research products that you have now \t
already obtained:
```highlight
{vertical_actual_background_product_names}
```
Please carefully review these intermediate products and then proceed according to my guidelines below.
""")
------------------------------------------------------------
warning_about_non_matching_values: str = dedent_triple_quote_str("""
########################
####### WARNING: #######
########################
Some of the specified values {} are not explicitly extracted from:
{names_of_products_from_which_to_extract}
""")
------------------------------------------------------------
report_non_match_prompt: str = dedent_triple_quote_str("""
Any numeric value in your section must be based on the `provided data` above, namely on numerical values \t
extracted from:
{names_of_products_from_which_to_extract}
However, upon reviewing your section, I've identified certain `potentially problematic values`, \t
which don't directly match the `provided data`. They are:
{}
For transparency, please revise your section such that it includes only values \t
explicitly extracted from the `provided data` above, or derived from them using the \t
`\\num{<formula>, "explanation"}` syntax.
Examples:
- If you would like to report the difference between two provided values 87 and 65, you should write:
"The initial price of 87 was changed to 65, representing a difference of \\num{87 - 65}"
- If you would like to report the odds ratio corresponding to a provided regression coefficient of 1.234, \t
you should write:
"The regression coefficient was 1.234 corresponding to an odds ratio of \\num{exp(1.234)}"
- If the provided data includes a distance of 9.1e3 cm, and you would like to report the distance in meters, \t
you should write:
"Our analysis revealed a distance of \\num{9.1e3 / 100} meters"
IMPORTANT NOTE:
If we need to include a numeric value that was not calculated or is not explicitly given in the \t
Tables or "{additional_results}", \t
and cannot be derived from them, \t
then indicate `[unknown]` instead of the numeric value.
For example:
"The regression coefficient for the anti-cancer drugs was [unknown]."
""")
------------------------------------------------------------
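The `\num{<formula>, "explanation"}` syntax described above implies that some later step evaluates the formula. That step is not shown in this listing; the sketch below is one way it could be done, assuming formulas use only numbers, `+ - * / **`, and a few math functions. It walks the parsed expression instead of calling `eval`. All names are hypothetical, and formulas containing commas or nested braces (such as `\hyperlink` arguments) are not handled.

```python
import ast
import math
import operator
import re

_BINARY = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow,
}
_FUNCS = {"exp": math.exp, "log": math.log, "sqrt": math.sqrt}


def evaluate_formula(formula: str) -> float:
    """Evaluate a small arithmetic expression without using eval()."""
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[visit(arg) for arg in node.args])
        raise ValueError("Unsupported expression in formula")
    return visit(ast.parse(formula.strip(), mode="eval"))


# Matches \num{formula} with an optional quoted explanation after a comma.
NUM_PATTERN = re.compile(r'\\num\{([^{},"]+)(?:,\s*"[^"]*")?\}')


def replace_num_commands(text: str) -> str:
    """Replace each \\num{...} command with its evaluated value."""
    return NUM_PATTERN.sub(
        lambda m: format(evaluate_formula(m.group(1)), "g"), text)


sentence = replace_num_commands(r"representing a difference of \num{87 - 65}")
meters = replace_num_commands(r'\num{9.1e3 / 100, "cm to m"} meters')
```

Restricting the evaluator to a whitelist of node types keeps model-written formulas from running arbitrary code.
------------------------------------------------------------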
report_non_match_prompt: str = dedent_triple_quote_str("""
Your section contains some improperly referenced numeric values, specifically:
{}
Numeric values must be included with \\hyperlink matching the \\hypertarget in the provided sources above.
The hyperlinks must include only the numeric values.
For example:
- Correct syntax: 'P $<$ \\hyperlink{Z3c}{1e-6}'
- Incorrect syntax: 'P \\hyperlink{Z3c}{$<$ 1e-6}'
See the examples I provided in my previous message.
Remember, you can also include such hyperlinked numeric values within the <formula> of \t
\\num{<formula>, "explanation"}.
This allows you to derive new numeric values from the provided source data.
Changing units, calculating differences, converting regression coefficients to odds ratios, etc.
For example:
'The treatment odds ratio was \\num{exp(\\hyperlink{Z3a}{0.17}), \t
"Translating the treatment regression coefficient to odds ratio"}'
In summary:
Either provided as a stand alone or within the <formula> of \\num{<formula>, "explanation"}, \t
all numeric values must have \\hyperlink references \t
that match the \\hypertarget references in the provided sources above.
IMPORTANT NOTE:
If we need to include a numeric value that is not explicitly provided in the Tables and other results above, \t
and cannot be derived from them, then indicate `[unknown]` instead of the numeric value.
For example:
'The p-value of the regression coefficient of the treatment was [unknown].'
""")
------------------------------------------------------------
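The rule that a hyperlink must wrap only the numeric value can be checked with two regular expressions. This is a rough sketch under the assumption that values are plain decimal or scientific-notation numbers; the pattern and function names are invented, and values with thousands separators or units would be flagged.

```python
import re
from typing import List

# \hyperlink{<label>}{<content>} with no nested braces.
HYPERLINK = re.compile(r"\\hyperlink\{([^{}]+)\}\{([^{}]*)\}")
# A plain number such as 12, 0.17, -3.5 or 1e-6.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def find_non_numeric_hyperlinks(text: str) -> List[str]:
    """Return hyperlink commands whose content is not purely a number."""
    return [
        match.group(0)
        for match in HYPERLINK.finditer(text)
        if not NUMBER.fullmatch(match.group(2).strip())
    ]


correct = r"P $<$ \hyperlink{Z3c}{1e-6}"
incorrect = r"P \hyperlink{Z3c}{$<$ 1e-6}"
flagged = find_non_numeric_hyperlinks(incorrect)
```

The two strings are the correct and incorrect examples from the prompt above; only the second is flagged.
------------------------------------------------------------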
other_system_prompt: str = dedent_triple_quote_str("""
You are a {reviewer} for a {performer} who needs to {goal_verb} {goal_noun}.
Your job is to advise me, the {performer}, and provide constructive bullet-point feedback in repeated cycles \t
of improvements and feedback.
When you feel that the goal has been achieved, respond explicitly with:
"{termination_phrase}" (approving-phrase)
If you feel that the initial {goal_noun} is already good enough, it is perfectly fine and encouraged \t
to respond with the approving-phrase immediately, without requesting any improvement cycles.
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_reviewer_response: str = dedent_triple_quote_str("""\n\n
Please correct your response according to any points you find relevant and applicable in my feedback.
Send back a complete rewrite of the {goal_noun}.
Make sure to send the full corrected {goal_noun}, not just the parts that were revised.
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_reviewer_response: str = dedent_triple_quote_str("""\n\n
Please correct your response according to any points you find relevant and applicable in my feedback.
Send back a complete rewrite of the {goal_noun}.
{quote_request}
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Please choose one of the following options:
1. Looks good. Choice 1.
2. Something is wrong. Choice 2.
{choice_instructions}
""")
------------------------------------------------------------
choice_instructions: str = dedent_triple_quote_str("""
Answer with just a single character, designating the option you choose {possible_choices}.
""")
------------------------------------------------------------
lrbox_table = dedent_triple_quote_str(r"""
% Define the save box within the document block
\newsavebox{\mytablebox} % Create a box to store the table
% Save only the tabular part of table in the \mytablebox without typesetting it:
\begin{lrbox}{\mytablebox}
<tabular>%
\end{lrbox}
% Typeset the entire table:
<table>
% Print the width of the tabular part of the table and the width of the page margin to the log file
\typeout{Table width: \the\wd\mytablebox}
\typeout{Page margin width: \the\textwidth}
""").replace('<tabular>', get_tabular_block(latex_table)).replace('<table>', latex_table)
------------------------------------------------------------
issue = dedent_triple_quote_str(f"""
The code was supposed to create at least {requirement.minimal_count} files \t
of "{requirement.filename}", \t
but it only created {len(output_files)} files of this type.
""")
------------------------------------------------------------
issue=dedent_triple_quote_str(f"""
Your code modifies, but doesn't save, some of the dataframes:
{read_but_unsaved_filenames}.
"""),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
The code should use `to_csv` to save any modified dataframe in a new file \t
in the same directory as the code.
"""),
------------------------------------------------------------
sentence_to_add_at_the_end_of_performer_response: str = dedent_triple_quote_str("""
Please provide feedback on the above {goal_noun}, with specific attention to whether it can be \t
studied using only the provided dataset, without requiring any additional data \t
(pay attention to using only data explicitly available in the provided headers of our data files \t
as described in our dataset, above).
Do not suggest changes to the {goal_noun} that may require data not available in our dataset.
If you are satisfied, respond with "{termination_phrase}".
""")
------------------------------------------------------------
goal_guidelines: str = dedent_triple_quote_str("""\n
Guidelines:
* Try to avoid trivial hypotheses (like just testing for simple linear associations).
Instead, you could perhaps explore more complex associations and relationships, like testing for \t
moderation effects or interactions between variables.
* Do not limit yourself to the provided data structure and variables; \t
you can create new variables from the existing ones, and use them in your hypotheses.
* Make sure that your suggested hypothesis can be studied using only the provided dataset, \t
without requiring any additional data. In particular, pay attention to using only data available \t
based on the provided headers of our data files (see "{data_file_descriptions_no_headers}", above).
{project_specific_goal_guidelines}\t
* Do not suggest methodology. Just the goal and a hypothesis.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""\n
Please suggest a research goal and a hypothesis that can be studied using only the provided dataset.
The goal and hypothesis should be interesting and novel.
{goal_guidelines}
{quote_request}
""")
------------------------------------------------------------
quote_request: str = dedent_triple_quote_str("""
INSTRUCTIONS FOR FORMATTING YOUR RESPONSE:
Please return the goal and hypothesis enclosed within triple-backticks, like this:
```
# Research Goal:
<your research goal here>
# Hypothesis:
<your hypothesis here>
```
""")
------------------------------------------------------------
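A reply that follows the formatting instructions above can be split into its two parts with regular expressions. The sketch below is an illustration only, assuming the reply contains one triple-backtick block with both headings in the order shown; the function name and the example reply are invented.

```python
import re
from typing import Dict

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def extract_goal_and_hypothesis(reply: str) -> Dict[str, str]:
    """Pull the research goal and hypothesis out of the first fenced block."""
    block = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, reply, flags=re.DOTALL)
    if block is None:
        raise ValueError("No triple-backtick block found.")
    parts = re.search(
        r"# Research Goal:\s*(.*?)\s*# Hypothesis:\s*(.*)",
        block.group(1),
        flags=re.DOTALL,
    )
    if parts is None:
        raise ValueError("Block is missing the expected headings.")
    return {"goal": parts.group(1).strip(), "hypothesis": parts.group(2).strip()}


reply = (
    "Sure.\n" + FENCE + "\n"
    "# Research Goal:\nStudy X.\n"
    "# Hypothesis:\nX rises with Y.\n" + FENCE + "\n"
)
result = extract_goal_and_hypothesis(reply)
```

Raising an error when the block or headings are missing gives the caller a clear point at which to send a correction request.
------------------------------------------------------------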
other_system_prompt: str = dedent_triple_quote_str("""
You are a {reviewer} for a {performer} who needs to {goal_verb} {goal_noun}.
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_performer_response: str = dedent_triple_quote_str("""
Please provide constructive bullet-point feedback on the above {goal_noun}.
Specifically:
* If the hypothesis cannot be tested using only the provided dataset (without \t
requiring additional data), suggest how to modify the hypothesis to better fit the dataset.
* If the hypothesis is not interesting and novel, suggest how to modify it to make it more interesting.
* If the hypothesis is broad or convoluted, suggest how best to focus it on a single well-defined question.
Do not provide positive feedback; if these conditions are all satisfied, just respond with:
"{termination_phrase}".
If you feel that the initial goal and hypothesis satisfy the above conditions, \t
respond solely with "{termination_phrase}".
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
From the literature search above, list up to 5 key papers whose results are most \t
similar/overlapping with our research goal and hypothesis.
Return your response as a Python Dict[str, str], where the keys are bibtex ids of the papers, \t
and the values are the titles of the papers. For example:
```python
{
"Smith2020TheAB": "A title of a paper most overlapping with our goal and hypothesis",
"Jones2021AssortedCD": "Another title of a paper that is similar to our goal and hypothesis",
}
```
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_reviewer_response: str = dedent_triple_quote_str("""
Please correct your {goal_noun} based on the feedback provided.
Make sure to return your full assessment, as \t
a Python dictionary {'similarities': List[str], 'differences': List[str], 'choice': str, 'explanation': str}.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
We would like to assess the novelty of our {research_goal} with respect to the literature.
Given the related papers listed above, please return a Python dictionary \t
with the following structure \t
{'similarities': List[str], 'differences': List[str], 'choice': str, 'explanation': str}:
* 'similarities': Provide a List[str] of potential similarities between our goal and hypothesis, \t
and the related papers listed above.
* 'differences': Provide a List[str] of potential differences, if any, between our stated {research_goal} \t
and the related papers listed above.
* 'choice': Given your assessment above, choose one of the following two options:
a. Our goal and hypothesis offer a significant novelty compared to existing literature, and \t
will likely lead to interesting and novel findings {'choice': 'OK'}.
b. Our goal and hypothesis have overlap with existing literature, and I can suggest ways to \t
revise them to make them more novel {'choice': 'REVISE'}.
* 'explanation': Provide a brief explanation of your choice.
Your response should be formatted as a Python dictionary, like this:
```python
{
'similarities': ['Our research goal is similar to the paper by ... in that ...',
'Our research goal somewhat overlaps with the findings of ...',
'Our hypothesis is similar to the paper by ... in that ...'],
'differences': ['Our goal and hypothesis are distinct because ...',
'Our hypothesis differs from the paper by ... in that ...'],
'choice': 'OK', # or 'REVISE'
'explanation': 'While our goal and hypothesis have some overlap with existing literature, \t
I believe that the ... aspect of our research is novel and will lead to ...'
# or 'The overlap with the result of ... is too significant, and I think we can \t
# revise our goal to make it more novel, for example by ...'
}
```
""")
------------------------------------------------------------
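The dictionary structure requested above lends itself to a simple validation pass before the `choice` is acted on. A minimal sketch, assuming the reply has already been parsed into a Python dictionary; the function and constant names are hypothetical:

```python
from typing import Any, Dict, List

EXPECTED_TYPES = {
    "similarities": list,
    "differences": list,
    "choice": str,
    "explanation": str,
}
ALLOWED_CHOICES = {"OK", "REVISE"}


def check_assessment(assessment: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the reply is valid."""
    problems: List[str] = []
    for key, expected_type in EXPECTED_TYPES.items():
        if key not in assessment:
            problems.append(f"missing key: {key}")
        elif not isinstance(assessment[key], expected_type):
            problems.append(f"wrong type for key: {key}")
    choice = assessment.get("choice")
    if isinstance(choice, str) and choice not in ALLOWED_CHOICES:
        problems.append("choice must be 'OK' or 'REVISE'")
    return problems


good = {"similarities": ["..."], "differences": ["..."],
        "choice": "OK", "explanation": "..."}
bad = {"similarities": [], "choice": "MAYBE", "explanation": ""}
```

Collecting all problems in one pass lets a single follow-up message list everything that needs fixing.
------------------------------------------------------------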
mission_prompt: str = dedent_triple_quote_str("""
Based on the result of the literature search above, \t
please revise, or completely re-write, the research goal and hypothesis that we have so that they \t
do not completely overlap existing literature.
{goal_guidelines}
{quote_request}
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
We would like to test the specified hypotheses using the provided dataset.
Please follow these two steps:
(1) Return a bullet-point review of relevant statistical issues.
Read the "{data_file_descriptions_no_headers}" and the "{codes_and_outputs:data_exploration}" provided above, \t
and then for each of the following generic \t
statistical issues determine if they are relevant for our case and whether they should be accounted for:
* multiple comparisons.
* confounding variables (see available variables in the dataset that we can adjust for).
* dependencies between data points.
* missing data points.
* any other relevant statistical issues.
(2) Create a Python Dict[str, str], mapping each hypothesis (dict key) to the statistical test that \t
would be most adequate for testing it (dict value).
The keys of this dictionary should briefly describe each of our hypotheses.
The values of this dictionary should specify the most adequate statistical test for each hypothesis, \t
and describe how it should be performed while accounting for any issues you have outlined above as relevant.
For each of our hypotheses, suggest a *single* statistical test.
If there are several possible ways to test a given hypothesis, specify only *one* statistical test \t
(the simplest one).
Your response for this part should be formatted as a Python dictionary, like this:
```python
{
"xxx is associated with yyy and zzz":
"linear regression with xxx as the independent variable and \t
yyy and zzz as the dependent variables while adjusting for aaa, bbb, ccc",
"the association between xxx and yyy is moderated by zzz":
"repeat the above linear regression, \t
while adding the interaction term between yyy and zzz",
}
```
These of course are just examples. Your actual response should be based on the goal and hypotheses that \t
we have specified above (see the "{research_goal}" above).
Note how in the example shown the different hypotheses are connected to each other, building towards a single
study goal.
Remember to return a valid Python dictionary Dict[str, str].
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Choose the most appropriate citations to add for the sentence:
"{sentence}"
Choose as many relevant citations as possible from the following citations:
{citations}
Send your reply formatted as a Python list of str, representing the ids of the citations you choose.
For example, write:
```python
["AuthorX2022", "AuthorY2009"]
```
where AuthorX2022 and AuthorY2009 are the ids of the citations you think are a good fit for the sentence.
Choose only citations that are relevant to the sentence.
You can choose one or more citations, or you can choose not to add citations to this sentence by replying `[]`.
""")
------------------------------------------------------------
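The list-of-ids reply requested above can be parsed and checked against the offered citations in one step, which is also what a "citation ids were not found" message would need. This sketch assumes the reply contains a single Python list literal; the function name and the ids are illustrative only.

```python
import ast
from typing import List, Set, Tuple


def split_chosen_citations(reply: str, known_ids: Set[str]) -> Tuple[List[str], List[str]]:
    """Parse a list-of-str reply and split it into (found, not_found) ids."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("No list found in the reply.")
    chosen = ast.literal_eval(reply[start:end + 1])
    if not isinstance(chosen, list) or not all(isinstance(c, str) for c in chosen):
        raise ValueError("Expected a list of strings.")
    found = [c for c in chosen if c in known_ids]
    not_found = [c for c in chosen if c not in known_ids]
    return found, not_found


known = {"AuthorX2022", "AuthorY2009"}
found, not_found = split_chosen_citations('["AuthorX2022", "AuthorZ1999"]', known)
```

An empty reply of `[]` yields two empty lists, matching the option of adding no citations.
------------------------------------------------------------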
response_to_self_error = dedent_triple_quote_str("""
{}
Please try again making sure you return the chosen citations with the correct format, like this:
```
["AuthorX2022Title", "AuthorY2009Title"]
```
""")
------------------------------------------------------------
system_prompt: str = dedent_triple_quote_str("""
You are a scientific citation expert.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Extract from the above section of a scientific paper all the factual sentences to which we need to \t
add citations.
Return a Python Dict[str, str] mapping each chosen sentence to a short literature search query \t
(up to a maximum of 5 words), like this:
```python
{
"This is a sentence that needs to have references": "Query for searching citations for this sentence",
"This is another important claim": "Some important keywords for this sentence",
"This is another factual sentence that needs a source": "This is the best query for this sentence",
}
```
Identify *all* the sentences that you think we need to add citations to - you should include any sentence
that can benefit from a reference.
However, be cautious to avoid choosing sentences that do not refer to existing knowledge, but rather \t
describe the finding of the current paper.
""")
------------------------------------------------------------
response_to_self_error: str = dedent_triple_quote_str("""
{}
Please try again making sure you return the results with the correct format, like this:
```python
{"sentence extracted from the section": "query of the key sentence",
"another sentence extracted from the section": "the query of this sentence"}
```
""")
------------------------------------------------------------
latex_instructions: str = dedent_triple_quote_str("""
Write in tex format, escaping any math or symbols that need tex escapes.
""")
------------------------------------------------------------
request_triple_quote_block: Optional[str] = dedent_triple_quote_str("""
The {goal_noun} should be enclosed within a triple-backtick "latex" code block, like this:
```latex
\\section{<section name>}
<your latex-formatted writing here>
```
""")
------------------------------------------------------------
system_prompt: str = dedent_triple_quote_str("""
You are a data-scientist with experience writing accurate scientific research papers.
You will write a scientific article for the journal {journal_name}, following the instructions below:
1. Write the article section by section: Abstract, Introduction, Results, Discussion, and Methods.
2. Write every section of the article in scientific language, in `.tex` format.
3. Write the article in a way that is fully consistent with the scientific results we have.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Based on the material provided above ({actual_background_product_names}), \t
please {goal_verb} only the {goal_noun} for a {journal_name} article.
Do not write any other parts!
{section_specific_instructions}
{latex_instructions}
{request_triple_quote_block}
""")
------------------------------------------------------------
other_system_prompt: str = dedent_triple_quote_str("""
You are a reviewer for a scientist who is writing a scientific paper about their data analysis results.
Your job is to provide constructive bullet-point feedback.
We will write each section of the research paper separately.
If you feel that the paper section does not need further improvements, you should reply only with:
"{termination_phrase}".
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_reviewer_response: str = dedent_triple_quote_str("""\n\n
Please correct your response according to any points in my feedback that you find relevant and applicable.
Send back a complete rewrite of the {pretty_section_names}.
Make sure to send the full corrected {pretty_section_names}, not just the parts that were revised.
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_performer_response: str = dedent_triple_quote_str("""
Please provide a bullet-point list of constructive feedback on the above {pretty_section_names} \t
for my paper. Do not provide positive feedback, only provide actionable instructions for improvements in \t
bullet points.
In particular, make sure that the section is correctly grounded in the information provided above.
If you find any inconsistencies or discrepancies, please mention them explicitly in your feedback.
{section_review_specific_instructions}
You should only provide feedback on the {pretty_section_names}. Do not provide feedback on other sections \t
or other parts of the paper, like LaTeX Tables or Python code, provided above.
If you don't see any flaws, respond solely with "{termination_phrase}".
IMPORTANT: You should EITHER provide bullet-point feedback, or respond solely with "{termination_phrase}"; \t
If you chose to provide bullet-point feedback then DO NOT include "{termination_phrase}".
""")
------------------------------------------------------------
request_triple_quote_block: Optional[str] = dedent_triple_quote_str("""
The {goal_noun} should be enclosed within a triple-backtick "latex" code block, like this:
```latex
\\title{<your latex-formatted paper title here>}
\\begin{abstract}
<your latex-formatted abstract here>
\\end{abstract}
```
""")
------------------------------------------------------------
section_specific_instructions: str = dedent_triple_quote_str("""\n
The Title should:
* be short and meaningful.
* convey the main message, focusing on discovery not on methodology nor on the data source.
* not include punctuation marks, such as ":,;" characters.
The Abstract should provide a concise, interesting to read, single-paragraph summary of the paper, \t
with the following structure:
* short statement of the subject and its importance.
* description of the research gap/question/motivation.
* short, non-technical, description of the dataset used and a non-technical explanation of the methodology.
* summary of each of the main results. It should summarize each key result which is evident from the tables, \t
but without referring to specific numeric values from the tables.
* statement of limitations and implications.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Based on the material provided above ({actual_background_product_names}), please help me improve the \t
title and abstract for a {journal_name} research paper.
{section_specific_instructions}
I especially want you to:
(1) Make sure that the abstract clearly states the main results of the paper \t
(see above the {paper_sections:results}).
(2) Make sure that the abstract correctly defines the literature gap/question/motivation \t
(see above Literature Searches for list of related papers).
{latex_instructions}
{request_triple_quote_block}
""")
------------------------------------------------------------
section_specific_instructions: str = dedent_triple_quote_str("""\n
The introduction should be interesting and pique your reader’s interest.
It should be written while citing relevant papers from the Literature Searches above.
Specifically, the introduction should follow the following multi-paragraph structure:
* Introduce the topic of the paper and why it is important \t
(cite relevant papers from the above "{literature_search:writing:background}").
* Explain what was already done and known on the topic, and what is then the research gap/question \t
(cite relevant papers from the above "{literature_search:writing:results}"). If there is only a minor gap, \t
you can use language such as "Yet, it is still unclear ...", "However, less is known about ...", \t
etc.
* State how the current paper addresses this gap/question \t
(cite relevant papers from the above "{literature_search:writing:dataset}" and \t
"{literature_search:writing:results}").
* Outline the methodological procedure and briefly state the main findings \t
(cite relevant papers from the above "{literature_search:writing:methods}")
Note: each of these paragraphs should be 5-6 sentences long. Do not just write short paragraphs with fewer \t
than 5 sentences!
Citations should be added in the following format: \\cite{paper_id}.
Do not add a \\section{References} section, I will add it later manually.
Note that it is not advisable to write about limitations, implications, or impact in the introduction.
""")
------------------------------------------------------------
section_review_specific_instructions: str = dedent_triple_quote_str("""\n
Also, please suggest if you see any specific additional citations that are adequate to include \t
(from the Literature Searches above).
""")
------------------------------------------------------------
section_specific_instructions: str = dedent_triple_quote_str("""\n
The Methods section should be enclosed within triple-backtick "latex" code block \
and have 3 subsections, as follows:
```latex
\\section{Methods}
\\subsection{Data Source}
- Describe our data sources (see above "{data_file_descriptions}")
\\subsection{Data Preprocessing}
- Describe preprocessing of the data done by the Python code (see above "{codes:data_analysis}").
- Do not include preprocessing steps that were not performed by the code.
- Do not include preprocessing steps that were performed by the code, but were not used as basis \t
for further analysis affecting the result output.
\\subsection{Data Analysis}
- Describe each of the specific analysis steps performed by the Python code to yield the results.
- Do not be over technical.
- Do not enumerate the steps as a list; instead, describe the steps in a narrative form.
```
Throughout the Methods section, do NOT include any of the following:
- Missing steps not done by the code.
- Specific version of software packages, file names, column names.
- Names of package functions (e.g., do not say "We used sklearn.linear_model.LinearRegression", say instead \t
"We used a linear regression model")
- URLs, links or references.""")
------------------------------------------------------------
request_triple_quote_block: str = dedent_triple_quote_str("""
Remember to enclose the Methods section within triple-backtick "latex" code block.
""")
------------------------------------------------------------
section_specific_instructions: str = dedent_triple_quote_str("""
{general_result_instructions}
{numeric_values_instructions}
""")
------------------------------------------------------------
general_result_instructions: str = dedent_triple_quote_str("""\n
Use the following guidelines when writing the Results:
* Include 3-4 paragraphs, each focusing on one of the Tables:
You should typically have a separate paragraph describing each of the Tables. \t
In each such paragraph, indicate the motivation/question for the analysis, the methodology, \t
and only then describe the results. You should refer to the Tables by their labels (using \\ref{table:xxx}) \t
and explain their content, but do not add the tables themselves (I will add the tables later manually).
* Story-like flow:
It is often nice to have a story-like flow between the paragraphs, so that the reader \t
can follow the analysis process with emphasis on the reasoning/motivation behind each analysis step.
For example, the first sentence of each paragraph can be a story-guiding sentence like:
"First, to understand whether xxx, we conducted a simple analysis of ..."; "Then, to test yyy, we performed a \t
..."; "Finally, to further verify the effect of zzz, we tested whether ...".
* Conclude with a summary of the results:
You can summarize the results at the end, with a sentence like: "In summary, these results show ...", \t
or "Taken together, these results suggest ...".
IMPORTANT NOTE: Your summary SHOULD NOT include a discussion of conclusions, implications, limitations, \t
or of future work. \t
(These will be added later as part of the Discussion section, not the Results section).
""")
------------------------------------------------------------
numeric_values_instructions: str = dedent_triple_quote_str("""
* Numeric values:
- Sources:
You can extract numeric values from the above provided sources: "{latex_tables_linked}", \t
"{additional_results_linked}", and "{data_file_descriptions_no_headers_linked}".
All numeric values in these sources have a \\hypertarget with a unique label.
- Cited numeric values should be formatted as \\hyperlink{<label>}{<value>}:
Any numeric value extracted from the above sources should be written with a proper \\hyperlink to its \t
corresponding source \\hypertarget.
- Dependent values should be calculated using the \\num command.
In scientific writing, we often need to report values which are not explicitly provided in the sources, \t
but can rather be derived from them. For example: changing units, \t
calculating differences, transforming regression coefficients into odds ratios, etc (see examples below).
To derive such dependent values, please use the \\num{<formula>, "explanation"} command.
The <formula> contains a calculation, which will be automatically replaced with its result upon pdf compilation.
The "explanation" is a short textual explanation of the calculation \t
(it will not be displayed directly in the text, but will be useful for review and traceability).
- Toy example for citing and calculating numeric values:
Suppose our provided source data includes:
```
No-treatment response: \\hypertarget{Z1a}{0.65}
With-treatment response: \\hypertarget{Z2a}{0.87}
Treatment regression:
coef = \\hypertarget{Z3a}{0.17}, STD = \\hypertarget{Z3b}{0.072}, pvalue = <\\hypertarget{Z3c}{1e-6}
```
Then, here are some examples of proper ways to report these provided source values:
```
The no-treatment control group had a response of \\hyperlink{Z1a}{0.65} while the with-treatment \t
group had a response of \\hyperlink{Z2a}{0.87}.
The regression coefficient for the treatment was \\hyperlink{Z3a}{0.17} with a standard deviation of \t
\\hyperlink{Z3b}{0.072} (P-value: < \\hyperlink{Z3c}{1e-6}).
```
And here are some examples of proper ways to calculate dependent values, using the \\num command:
```
The difference in response was \\num{\\hyperlink{Z2a}{0.87} - \\hyperlink{Z1a}{0.65}, \t
"Difference between responses with and without treatment"}.
The treatment odds ratio was \t
\\num{exp(\\hyperlink{Z3a}{0.17}), \t
"Translating the treatment regression coefficient to odds ratio"} (CI: \t
\\num{exp(\\hyperlink{Z3a}{0.17} - 1.96 * \\hyperlink{Z3b}{0.072}), \t
"low CI for treatment odds ratio, assuming normality"}, \t
\\num{exp(\\hyperlink{Z3a}{0.17} + 1.96 * \\hyperlink{Z3b}{0.072}), \t
"high CI for treatment odds ratio, assuming normality"}).
```
* Accuracy:
Make sure that you are only mentioning details that are explicitly found within the Tables and \t
Numerical Values.
* Unknown values:
If we need to include a numeric value that is not explicitly given in the \t
Tables or "{additional_results_linked}", and cannot be derived from them using the \\num command, \t
then indicate `[unknown]` instead of the numeric value.
For example:
```
The no-treatment response was \\hyperlink{Z1a}{0.65} (STD: [unknown]).
```
""")
------------------------------------------------------------
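As a rough check of what the \num expressions in the toy example above would evaluate to, the same arithmetic can be reproduced in plain Python. This is only an illustrative sketch: the variable names are made up here, and how the project actually evaluates \num formulas is not shown in this gist.

```python
import math

# Values copied from the toy example above (labels Z1a, Z2a, Z3a, Z3b).
no_treatment = 0.65
with_treatment = 0.87
coef = 0.17
std = 0.072

# Difference between responses with and without treatment.
difference = with_treatment - no_treatment

# Odds ratio and its 95% CI, assuming normality (as stated in the toy example).
odds_ratio = math.exp(coef)
ci_low = math.exp(coef - 1.96 * std)
ci_high = math.exp(coef + 1.96 * std)

print(f"difference = {difference:.2f}")
print(f"odds ratio = {odds_ratio:.3f} (CI: {ci_low:.3f}, {ci_high:.3f})")
```
------------------------------------------------------------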
other_mission_prompt: str = dedent_triple_quote_str("""
Based on the material provided above, please write the Results section for a {journal_name} research paper.
{general_result_instructions}
* You can use the \\num command to calculate dependent values from the provided numeric values \t
(they will be automatically replaced with the actual numeric values in compilation).
""")
------------------------------------------------------------
section_review_specific_instructions: str = dedent_triple_quote_str("""
Do not suggest adding missing information, or stating what's missing from the Tables and Numerical Values, \t
only suggest changes that are relevant to the Results section itself and that are supported by the given \t
Tables and Numerical Values.
Do not suggest changes to the {goal_noun} that may require data not available in the \t
Tables and Numerical Values.
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_reviewer_response: str = dedent_triple_quote_str("""\n\n
Please correct your response according to any points in my feedback that you find relevant and applicable.
Send back a complete rewrite of the {pretty_section_names}.
Make sure to send the full corrected {pretty_section_names}, not just the parts that were revised.
Remember to include the numeric values in the format \\hyperlink{<label>}{<value>} and use the \\num command \t
for dependent values.
""")
------------------------------------------------------------
self._raise_self_response_error(dedent_triple_quote_str(f"""
The {section_name} section should specifically reference each of the Tables that we have.
Please make sure we have a sentence addressing Table "{table_label}".
The sentence should have a reference like this: "Table~\\ref{{{table_label}}}".
"""))
------------------------------------------------------------
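The error message above implies a check that every table label is referenced in the section text. A minimal sketch of such a check follows, assuming labels appear inside a \ref command; the function name `find_unreferenced_tables` and the sample labels are hypothetical and not taken from the project.

```python
import re
from typing import List


def find_unreferenced_tables(section_text: str, table_labels: List[str]) -> List[str]:
    """Return the labels that never appear inside a \\ref{...} command."""
    referenced = set(re.findall(r"\\ref\{([^}]*)\}", section_text))
    return [label for label in table_labels if label not in referenced]


# Hypothetical section text and labels, for illustration only.
section = r"Table~\ref{table:table_1} shows the association between age and risk."
missing = find_unreferenced_tables(section, ["table:table_1", "table:table_2"])
print(missing)
```

Each label returned by the function could then be substituted into the message above as `table_label`.
------------------------------------------------------------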
section_review_specific_instructions: str = dedent_triple_quote_str("""\n
Also, please suggest if you see any specific additional citations that are adequate to include \t
(from the Literature Searches above).
""")
------------------------------------------------------------
section_specific_instructions: str = dedent_triple_quote_str("""\n
The Discussion section should follow the following structure:
* Recap the subject of the study (cite relevant papers from the above "{literature_search:writing:background}").
* Recap our methodology (see "Methods" section above) and the main results \t
(see "{paper_sections:results}" above), \t
and compare them to the results from prior literature (see above "{literature_search:writing:results}").
* Discuss the limitations of the study.
* End with a concluding paragraph summarizing the main results, their implications and impact, \t
and future directions.
Citations should be added in the following format: \\cite{paper_id}.
Do not add a \\section{References} section, I will add it later manually.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Please return a triple-backtick LaTeX Code Block explaining what the code above does.
Do not provide a line-by-line explanation, rather provide a \t
high-level explanation of the code in a language suitable for a Methods section of a research \t
paper.
Focus on analysis steps. There is no need to explain trivial parts, like reading/writing a file, etc.
{actual_requesting_output_explanation}
Your explanation should be written in LaTeX, and should be enclosed within a LaTeX Code Block, like this:
```latex
\\section{Code Explanation}
<your code explanation here>
```
Remember to enclose your explanation within a LaTeX Code Block, so that I can easily copy-paste it!
""")
------------------------------------------------------------
request_triple_quote_block: Optional[str] = dedent_triple_quote_str("""
Your code explanation should be enclosed within a triple-backtick "latex" block.
""")
------------------------------------------------------------
requesting_output_explanation: str = dedent_triple_quote_str("""
Also explain what the code writes into the "{output_filename}" file.
""")
------------------------------------------------------------
requesting_explanation_for_a_new_dataframe: str = dedent_triple_quote_str("""
The code creates a new file named "{dataframe_file_name}", with the following columns:
{columns}.
Explain the content of the file, and how it was derived from the original data.
Importantly: do NOT explain the content of columns that are already explained for the \t
original dataset (see above DESCRIPTION OF THE DATASET).
""")
------------------------------------------------------------
requesting_explanation_for_a_modified_dataframe: str = dedent_triple_quote_str("""
Explain the content of all the new or modified columns of "{dataframe_file_name}".
Return your explanation as a dictionary, where the keys are the column names {columns}, \t
and the values are the strings that explain the content of each column.
All information you think is important should be encoded in this dictionary.
Do not send additional free text beside the text in the dictionary.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
As part of a data-exploration phase, please write a complete short Python code for getting a \t
first sense of the data.
Your code should create an output text file named "{output_filename}", which should \t
contain a summary of the data.
The output file should be self-contained; any results you choose to save to this file \t
should be accompanied with a short header.
The output file should be formatted as follows:
```output
# Data Size
<Measure of the scale of our data (e.g., number of rows, number of columns)>
# Summary Statistics
<Summary statistics of all or key variables>
# Categorical Variables
<As applicable, list here categorical values and their most common values>
# Missing Values
<Counts of missing, unknown, or undefined values>
<As applicable, counts of special numeric values that stand for unknown/undefined if any \t
(check in the "{all_file_descriptions}" above for any)>
# <title of other summary you deem relevant, if any>
<Add any other summary of the data you deem relevant>
# <etc for any other summary you deem relevant.>
```
If any of the above sections is not applicable, then write "# Not Applicable" under that section.
If needed, you can use the following packages which are already installed:
{supported_packages}
Do not provide a sketch or pseudocode; write a complete runnable code.
Do not create any graphics, figures or any plots.
Do not send any presumed output examples.
""")
------------------------------------------------------------
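To make the requested output layout concrete, here is a small standard-library sketch that builds a summary text with the headers listed in the prompt above. It is an assumption-laden illustration: the `summarize` helper and the sample rows are invented here, and real generated code would more likely use one of the {supported_packages}.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize(rows: List[Dict[str, Optional[float]]]) -> str:
    """Build a summary text with the section headers requested in the prompt."""
    columns = list(rows[0].keys())
    lines = [
        "# Data Size",
        f"Number of rows: {len(rows)}",
        f"Number of columns: {len(columns)}",
        "",
        "# Summary Statistics",
    ]
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        if present:
            lines.append(f"{col}: mean = {mean(present):.2f}")
        else:
            lines.append(f"{col}: no non-missing values")
    # No categorical columns in this toy data, so the section is marked accordingly.
    lines += ["", "# Categorical Variables", "# Not Applicable", ""]
    lines.append("# Missing Values")
    for col in columns:
        lines.append(f"{col}: {sum(r[col] is None for r in rows)} missing")
    return "\n".join(lines)


# Hypothetical toy data; None stands for a missing value.
rows = [{"age": 30.0, "height": 170.0}, {"age": 40.0, "height": None}]
summary = summarize(rows)
print(summary)
```
------------------------------------------------------------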
CodeReviewPrompt('*', False, dedent_triple_quote_str("""
I ran your code.
Here is the content of the output file that the code created:
{file_contents_str}
Please follow these two steps:
(1) Check the code and the output for any issues, and return a bullet-point response addressing these points:
* Are there any unexpected NaN values in the output?
* Can results be understood from the output file? In particular, do we have a short label for each result?
* Are there any results that are missing? Check that under each header in the output file there is \t
a corresponding meaningful result (or "Not Applicable" if not applicable).
* Any other issues you find.
(2) Based on your assessment above, return a Python Dict[str, str] mapping the issues you have noted \t
above (dict keys) to specific suggested corrections/improvements in the code (dict values).
For example:
```python
{
"The result of the average of variable ... is missing":
"Add the missing calculation of ... to the code.",
"The average of the variable ... is `NaN`":
"Remove missing values in the calculation."
}
```
Try to be as specific as possible when describing the issues and proposed fixes.
Include in the dict as many issues as you find.
If there are no issues, and the code and tables are just perfect and need no corrections or enhancements, \t
then return an empty dict:
```python
{}
```
Important:
* Do not return the revised code, only the issues and suggested fixes.
* If there are no critical issues, then return an empty dict: `{}`.
* Do not create positive issues that require no change in the code. In particular, do not write \t
{"No issues found": "No corrections or improvements are needed."}, return an empty dict instead.
"""), name='output file'),
------------------------------------------------------------
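The review prompt above asks for a Python Dict[str, str] inside a fenced block. One way such a response could be parsed is sketched below with `ast.literal_eval`; the `extract_issues` helper and the sample response are hypothetical, and the project's own parsing logic is not shown in this gist.

```python
import ast
import re
from typing import Dict

FENCE = "`" * 3  # built at runtime so this example does not nest code fences


def extract_issues(response: str) -> Dict[str, str]:
    """Parse the last fenced python block of a response as a Dict[str, str]."""
    pattern = re.compile(FENCE + r"python\s*(.*?)" + FENCE, flags=re.DOTALL)
    blocks = pattern.findall(response)
    if not blocks:
        raise ValueError("No python block found in the response.")
    # literal_eval only accepts Python literals, so no code is executed.
    issues = ast.literal_eval(blocks[-1].strip())
    if not isinstance(issues, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in issues.items()
    ):
        raise ValueError("Expected a Dict[str, str].")
    return issues


# Hypothetical reviewer response, for illustration only.
response = (
    "* No unexpected NaN values.\n"
    f"{FENCE}python\n"
    '{"The average of the variable x is missing": "Add the missing calculation."}\n'
    f"{FENCE}\n"
)
issues = extract_issues(response)
print(issues)
```

An empty dict, `{}`, parses to an empty mapping, which matches the "no issues" convention described in the prompt.
------------------------------------------------------------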
return dedent_triple_quote_str("""
- In linear regression, if interactions terms are included:
* did we remember to include the main effects?
* did we use the `*` operator in statsmodels formula as recommended? \t
(as applicable, better use `formula = "y ~ a * b"`, instead of trying to \t
manually multiply the variables)
""")
------------------------------------------------------------
return dedent_triple_quote_str("""
- In mediation analysis:
* did we calculate the mediation effect (e.g., using the Sobel test or other)?
* did we account for relevant confounding factors? \t
(by adding these same confounding factors to both the 'a' and 'b' paths)
""")
------------------------------------------------------------
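For reference, the Sobel test mentioned in the checklist above can be written in a few lines. This sketch uses the standard first-order formula for the standard error of the indirect effect; the path coefficients and standard errors are made-up inputs, not results from any dataset.

```python
import math
from typing import Tuple


def sobel_test(a: float, se_a: float, b: float, se_b: float) -> Tuple[float, float]:
    """Return the Sobel z statistic and two-sided p-value for the indirect effect a*b."""
    # First-order standard error of the product a*b.
    se_ab = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    z = (a * b) / se_ab
    # Two-sided p-value under a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical 'a' path (exposure -> mediator) and 'b' path (mediator -> outcome).
z, p = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"z = {z:.3f}, p = {p:.4f}")
```

In practice, `a`, `b`, and their standard errors would come from two regressions that include the same confounding factors, as the checklist requires.
------------------------------------------------------------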
return dedent_triple_quote_str("""
- For created Machine-Learning models:
* Check whether we adequately perform hyperparameter tuning using cross-validation (as appropriate).
* Check whether the best hyperparameters are reported \t
(either in a table file or in the "additional_results.pkl" file).
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
As part of a data-preprocessing phase, please write a complete short Python code for getting a \t
cleaned, normalized, same-unit, balanced version of the data, ready for use in following analysis \t
steps that will include statistical tests and/or machine learning models on the processed data.
Your code should create one or more new csv files containing the preprocessed data, saved with \t
sensible file names.
Depending on the specifics of the dataset and the goal and hypothesis specified above, \t
you might want to perform the following steps:
* Dealing with missing values - imputation, deletion, etc.
* Normalization of numeric values with different units into same-unit values.
* Scaling numeric values into a common scale (e.g., 0-1) using min-max scaling, z-score, etc.
* Encoding categorical variables into numeric values (e.g., using one-hot encoding)
* Balancing the data by under-sampling, over-sampling, or more advanced techniques to deal with class imbalance
* Any other data preprocessing you deem relevant
You are not obliged to perform all of the above steps; choose the ones that suit the data and the hypothesis
we are testing (see research goal above).
If needed, you can use the following packages which are already installed:
{supported_packages}
Do not provide a sketch or pseudocode; write a complete runnable code.
Do not create any graphics, figures or any plots.
""")
------------------------------------------------------------
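The scaling and encoding steps named in the prompt above are simple enough to sketch with the standard library alone. The helper names below are invented for illustration; generated code would more plausibly rely on the {supported_packages}.

```python
from statistics import mean, pstdev
from typing import Dict, List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values linearly into the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: List[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(categories: List[str]) -> List[Dict[str, int]]:
    """Encode each category as a dict of 0/1 indicator values."""
    levels = sorted(set(categories))
    return [{level: int(c == level) for level in levels} for c in categories]


scaled = min_max_scale([2.0, 4.0, 6.0])
standardized = z_score([2.0, 4.0, 6.0])
encoded = one_hot(["a", "b", "a"])
print(scaled, standardized, encoded)
```
------------------------------------------------------------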
instructions=dedent_triple_quote_str(f"""
You should use the "{from_formula}" function instead, so that the formula is clearly \t
specified as a string.
Reminder: For interactions, if any, use the `*` operator in the formula, rather than \t
manually multiplying the variables.
"""),
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Write a complete Python code to analyze the data and create dataframes as basis for scientific Tables \t
for our paper.
The code must have the following sections (with these exact capitalized headers):
`# IMPORT`
`import pickle`
You can also import here any other packages you need from:
{supported_packages}
`# LOAD DATA`
Load the data from the original data files described above (see "{data_file_descriptions}").
{list_additional_data_files_if_any}\t
`# DATASET PREPARATIONS`
* Join data files as needed.
* Dealing with missing, unknown, or undefined values, or with special numeric values that stand for \t
unknown/undefined (check in the "{data_file_descriptions}" for any such values, and \t
consider also the "{outputs:data_exploration}").
* Create new variables as needed.
* Restrict the data based on exclusion/inclusion criteria (to match study goal, if applicable).
* Standardize numeric values with different units into same-unit values.
If no dataset preparations are needed, write below this header: \t
`# No dataset preparations are needed.`
`# DESCRIPTIVE STATISTICS`
* In light of our study goals and the hypothesis testing plan (see above "{research_goal}" and \t
"{hypothesis_testing_plan}"), decide whether and which descriptive statistics are needed to be included in \t
the research paper and create a relevant table.
For example:
`## Table 0: "Descriptive statistics of height and age stratified by sex"`
Write here the code to create a descriptive statistics dataframe `df0` and save it using:
`df0.to_pickle('table_0.pkl')`
If no descriptive statistics are needed, write: \t
`# No descriptive statistics table is needed.`
`# PREPROCESSING`
Perform any preprocessing steps needed to prepare the data for the analysis.
For example, as applicable:
* Creating dummy variables for categorical variables.
* Any other data preprocessing you deem relevant.
If no preprocessing is needed, write: \t
`# No preprocessing is needed, because <your reasons here>.`
`# ANALYSIS`
Considering our "{research_goal}" and "{hypothesis_testing_plan}", decide on 1-3 tables \t
(in addition to the above descriptive statistics, if any) we should create for our scientific paper. \t
Typically, we should have at least one table for each hypothesis test.
For each such scientific table:
[a] Write a comment with a suggested caption for the table.
Choose a caption that clearly describes the table's content and its purpose.
For example:
`## Table 1: "Test of association between age and risk of death, accounting for sex and race"`
Avoid generic captions such as `## Table 1: "Results of analysis"`.
[b] Perform analysis
- Perform appropriate analysis and/or statistical tests (see above our "{hypothesis_testing_plan}").
- Account for relevant confounding variables, as applicable.
- Note that you may need to perform more than one test for each hypothesis.
- Try using inherent functionality and syntax provided in functions from the available \t
Python packages (above). Avoid, as possible, manually implementing generically available functionality.
For example, to include interactions in regression analysis (if applicable), use the `formula = "y ~ a * b"` \t
syntax in statsmodels formulas, rather than trying to manually multiply the variables.
{mediation_note_if_applicable}\t
[c] Create and save a dataframe representing the scientific table (`df1`, `df2`, etc):
* Only include information that is relevant and suitable for inclusion in a scientific table.
* Nominal values should be accompanied by a measure of uncertainty (CI or STD and p-value).
* Exclude data not important to the research goal, or that are too technical.
* Do not repeat the same data in multiple tables.
* The table should have labels for both the columns and the index (rows):
- As possible, do not invent new names; just keep the original variable names from the dataset.
- As applicable, also keep any attr names from statistical test results.
Overall, the section should have the following structure:
`# ANALYSIS`
`## Table 1: <your chosen table name here>`
Write here the code to analyze the data and create a dataframe df1 for Table 1
`df1.to_pickle('table_1.pkl')`
`## Table 2: <your chosen table name here>`
etc, up to 3 tables.
`# SAVE ADDITIONAL RESULTS`
At the end of the code, after completing the tables, create a dict containing any additional \t
results you deem important to include in the scientific paper, and save it to a pkl file \t
'additional_results.pkl'.
For example:
`additional_results = {
'Total number of observations': <xxx>,
'accuracy of <model name> model': <xxx>,
# etc, any other results and important parameters that are not included in the tables
}
with open('additional_results.pkl', 'wb') as f:
pickle.dump(additional_results, f)
`
Avoid the following:
Do not provide a sketch or pseudocode; write a complete runnable code including all '# HEADERS' sections.
Do not create any graphics, figures or any plots.
Do not send any presumed output examples.
Avoid convoluted or indirect methods of data extraction and manipulation; \t
Use direct attribute access for clarity and simplicity.
For clarity, access dataframes using string-based column/index names, \t
rather than integer-based column/index positions.
""")
------------------------------------------------------------
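Because the prompt above requires exact capitalized section headers, a submitted script can be checked for them mechanically. The following is a hypothetical sketch of such a check; `missing_headers` is not a function from the project, and the header list is simply copied from the prompt.

```python
from typing import List

# Headers copied from the prompt above.
REQUIRED_HEADERS = [
    "# IMPORT",
    "# LOAD DATA",
    "# DATASET PREPARATIONS",
    "# DESCRIPTIVE STATISTICS",
    "# PREPROCESSING",
    "# ANALYSIS",
    "# SAVE ADDITIONAL RESULTS",
]


def missing_headers(code: str) -> List[str]:
    """Return the required headers that do not appear as a full line in the code."""
    lines = {line.strip() for line in code.splitlines()}
    return [header for header in REQUIRED_HEADERS if header not in lines]


# Hypothetical, deliberately incomplete script.
code = "# IMPORT\nimport pickle\n\n# LOAD DATA\n# ANALYSIS\n"
absent = missing_headers(code)
print(absent)
```
------------------------------------------------------------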
code_review_formatting_instructions: str = dedent_triple_quote_str("""
Try to be as specific as possible when describing the issues and proposed fixes.
Include in the dict as many issues as you find.
If you are sure that there are no issues, and the code and tables need no revision, \t
then return an empty dict: `{}`.
""")
------------------------------------------------------------
CodeReviewPrompt(None, False, dedent_triple_quote_str("""
The code runs ok, but I am worried that it may contain some fundamental mathematical or statistical \t
flaws. To check for such flaws, I will need you to carefully follow these two steps:
(1) Deeply check your Python code for any fundamental coding/mathematical/statistical flaws \t
and return a bullet-point response addressing these points (as applicable):
* WRONG CALCULATIONS:
- List all key mathematical calculations used in the code and indicate for each one if it is correct, \t
or if it should be revised.
* TRIVIALLY-TRUE STATISTICAL TESTS:
Are there any statistical tests that are mathematically trivial? Like:
- testing whether the mean of all values above 0 is above 0.
- comparing distributions that have different underlying scales (or different ranges), \t
and which were not properly normalized.
- testing whether the mean of X + Y is larger than the mean of X, when Y is positive.
- etc, any other tests that you suspect are trivial.
* OTHER:
Any other mathematical or statistical issues that you can identify.
(2) Based on your assessment above, return a Python Dict[str, str] mapping the issues you have noted
above (dict keys) to specific suggested corrections/improvements in the code (dict values).
For example:
```python
{
"The formula for the regression model is incorrect":
"revise the code to use the following formula: ...",
"The statistical test for association of ... and ... is trivial":
"revise the code to perform the following more meaningful test: ...",
}
```
{code_review_formatting_instructions}
"""), name='code flaws'),
------------------------------------------------------------
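Two of the trivially-true tests listed in the prompt above can be demonstrated with simulated data. This is an illustrative sketch using random numbers from the standard library, not an analysis from the project.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Case 1: the mean of all values above 0 is above 0 by construction,
# so testing it tells us nothing about the underlying data.
values = [random.gauss(0, 1) for _ in range(1000)]
positive_values = [v for v in values if v > 0]
mean_positive = mean(positive_values)

# Case 2: the mean of X + Y exceeds the mean of X whenever Y is positive.
x = [random.gauss(0, 1) for _ in range(1000)]
y = [random.uniform(0.1, 1.0) for _ in range(1000)]
mean_x = mean(x)
mean_x_plus_y = mean([xi + yi for xi, yi in zip(x, y)])

print(mean_positive > 0, mean_x_plus_y > mean_x)
```

Both comparisons hold regardless of how `values` and `x` are distributed, which is why a significant result from such a test carries no information.
------------------------------------------------------------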
CodeReviewPrompt(None, False, dedent_triple_quote_str("""
Please follow these two steps:
(1) Check your Python code and return a bullet-point response addressing these points (as applicable):
* DATASET PREPARATIONS:
- Missing values. If applicable, did we deal with missing, unknown, or undefined values, \t
or with special numeric values that stand for unknown/undefined \t
(check the "{data_file_descriptions}" for any such missing values)?
- Units. If applicable, did we correctly standardize numeric values with different units into same-unit values?
- Data restriction. If applicable, are we restricting the analysis to the correct part of the data \t
(based on the study goal)?
* DESCRIPTIVE STATISTICS:
If applicable:
- Did we correctly report descriptive statistics?
- Do the choice of descriptive statistics and the chosen variables contribute to the scope of the study?
- Is descriptive analysis done on the correct data (for example, before any data normalization steps)?
* PREPROCESSING:
Review the above "{data_file_descriptions}", then check our data preprocessing:
- Are we performing any preprocessing steps that are not needed?
- Are we missing any preprocessing steps that are needed?
* ANALYSIS:
As applicable, check for any data analysis issues, including:
- Analysis that should be performed on the preprocessed data is mistakenly performed on the original data.
- Analysis that should be performed on the original data is mistakenly performed on the preprocessed data.
- Incorrect choice of statistical test.
- Imperfect implementation of statistical tests.
- Did we correctly choose the variables that best represent the tested hypothesis?
- Are we accounting for relevant confounding variables (consult the "{data_file_descriptions}")?
{regression_comments}\t
{mediation_comments}\t
{machine_learning_comments}\t
{scipy_unpacking_comments}\t
- Any other statistical analysis issues.
(2) Based on your assessment above, return a Python Dict[str, str] mapping the issues you have noted
above (dict keys) to specific suggested corrections/improvements in the code (dict values).
For example:
```python
{
"The model does not adequately account for confounding variables":
"revise the code to add the following confounding variables ...",
"The descriptive statistics is performed on the wrong data":
"revise the code to perform the descriptive statistics on the preprocessed data.",
}
```
{code_review_formatting_instructions}
"""), name='data handling'),
------------------------------------------------------------
CodeReviewPrompt('table_*.pkl', True, dedent_triple_quote_str("""
I ran your code.
Here is the content of the table '{filename}' that the code created for our scientific paper:
{file_contents_str}
Please review the table and follow these two steps:
(1) Check the created table and return a bullet-point response addressing these points:
* Sensible numeric values: Check each numeric value in the table and make sure it is sensible.
For example:
- If the table reports the mean of a variable, is the mean value sensible?
- If the table reports CI, are the CI values flanking the mean?
- Do values have correct signs?
- Do you see any values that are not sensible (too large, too small)?
- Do you see any 0 values that do not make sense?
* Measures of uncertainty: If the table reports nominal values (like regression coefs), does \t
it also report their measures of uncertainty (like p-value, CI, or STD, as applicable)?
* Missing data: Are we missing key variables, or important results, that we should calculate and report?
* Any other issues you find.
(2) Based on your assessment above, return a Python Dict[str, str] mapping the issues you have noted
above (dict keys) to specific suggested corrections/improvements in the code (dict values).
For example:
```python
{
"Table {filename} reports incomplete results":
"revise the code to add the following new column '<your suggested column name>'",
"Table {filename} reports nominal values without measures of uncertainty":
"revise the code to add STD and p-value.",
}
```
{code_review_formatting_instructions}
"""), name='output of "{filename}"'),
------------------------------------------------------------
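One of the sanity checks in the prompt above, whether the CI values flank the mean, is easy to express in code. The sketch below assumes a table represented as nested dicts with hypothetical column names `mean`, `ci_low`, and `ci_high`; the actual table layout is not specified in this gist.

```python
from typing import Dict, List


def find_ci_issues(rows: Dict[str, Dict[str, float]]) -> List[str]:
    """Return a message for every row whose CI does not flank its mean."""
    issues = []
    for name, row in rows.items():
        if not (row["ci_low"] <= row["mean"] <= row["ci_high"]):
            issues.append(f"CI for '{name}' does not flank the mean")
    return issues


# Hypothetical table content; the second row is deliberately inconsistent.
table = {
    "age": {"mean": 0.17, "ci_low": 0.03, "ci_high": 0.31},
    "sex": {"mean": 0.50, "ci_low": 0.60, "ci_high": 0.90},
}
ci_issues = find_ci_issues(table)
print(ci_issues)
```
------------------------------------------------------------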
CodeReviewPrompt('*', False, dedent_triple_quote_str("""
I ran your code.
Here is the content of the file(s) that the code created for our scientific paper:
{file_contents_str}
Please follow these two steps:
(1) Review the code and these output files and return a bullet-point response addressing these points:
* Does the code create and output all needed results to address our {hypothesis_testing_plan}?
* Sensible numeric values: Check each numeric value in the tables and in the additional results file \t
and make sure it is sensible.
For example:
- If a table reports the mean of a variable, is the mean value sensible?
- If a table reports CI, are the CI values flanking the mean?
- Do values have correct signs?
- Do you see any values that are not sensible (too large, too small)?
* Measures of uncertainty: If a table reports a nominal value (like mean of a variable), does \t
it also report its measures of uncertainty (CI, or STD, as applicable)?
* Missing data in a table: Are we missing key variables in a given table?
{missing_tables_comments}
* Any other issues you find.
(2) Based on your assessment above, return a Python Dict[str, str] mapping the issues you have noted
above (dict keys) to specific suggested corrections/improvements in the code (dict values).
For example:
```python
{
"A table is missing":
"revise the code to add the following new table '<your suggested table caption>'",
"Table <n> reports nominal values without measures of uncertainty":
"revise the code to add STD and p-value.",
}
```
{code_review_formatting_instructions}
"""), name='all output files'),
------------------------------------------------------------
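The code-review prompts above ask for a Python `Dict[str, str]` inside a fenced `python` block. How the project itself parses that reply is not shown here; the following is only a minimal sketch of one way to do it, using `ast.literal_eval` so the reply is never executed. The function name `extract_issues_dict` and the `FENCE` constant are hypothetical.

```python
import ast
import re
from typing import Dict

# Three backticks, built programmatically so this example is easy to embed.
FENCE = "`" * 3


def extract_issues_dict(response: str) -> Dict[str, str]:
    """Return the first fenced python dict found in `response` (hypothetical helper)."""
    pattern = FENCE + r"python\s*\n(.*?)" + FENCE
    match = re.search(pattern, response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced python block found in the response")
    # literal_eval only accepts Python literals, so arbitrary code is rejected.
    value = ast.literal_eval(match.group(1).strip())
    if not isinstance(value, dict):
        raise ValueError("the fenced block does not contain a dict")
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in value.items()):
        raise ValueError("all keys and values must be strings")
    return value


reply = (
    "* The tables lack uncertainty measures.\n"
    + FENCE + "python\n"
    + '{\n"A table is missing":\n"revise the code to add a new table",\n}\n'
    + FENCE
)
issues = extract_issues_dict(reply)
```

An empty dictionary, `{}`, passes through the same path and signals that no issues were found.
------------------------------------------------------------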
return dedent_triple_quote_str("""
* Missing tables: \t
You did not create any tables. \t
Note that research papers typically have 2 or more tables. \t
Please suggest which tables to create and additional analysis needed.\n
""")
------------------------------------------------------------
return dedent_triple_quote_str("""
* Missing tables: \t
You only produced 1 table. \t
Note that research papers typically have 2 or more tables. \t
Are you sure all relevant tables are created? Can you suggest any additional analysis leading \t
to additional tables?\n
""")
------------------------------------------------------------
return dedent_triple_quote_str("""
* Missing tables: \t
Considering our research goal and hypothesis testing plan, \t
are all relevant tables created? If not, can you suggest any additional tables?\n
""")
------------------------------------------------------------
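The three "Missing tables" variants above appear to correspond to zero tables, exactly one table, and two or more tables. The selection logic is not part of this listing, so the sketch below is an assumption about how `{missing_tables_comments}` might be chosen; the function name and the returned variant labels are hypothetical.

```python
def choose_missing_tables_variant(num_tables: int) -> str:
    """Pick which "Missing tables" bullet applies (assumed selection rule)."""
    if num_tables < 0:
        raise ValueError("num_tables must be non-negative")
    if num_tables == 0:
        return "no_tables"          # "You did not create any tables."
    if num_tables == 1:
        return "single_table"       # "You only produced 1 table."
    return "check_all_relevant"     # "are all relevant tables created?"
```
------------------------------------------------------------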
return dedent_triple_quote_str("""
- If you are doing a mediation analysis, don't forget to calculate both the 'a' and 'b' \t
paths (and add the same confounding variables to both paths, as needed).
""")
------------------------------------------------------------
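Many prompts in this listing end a line with `\t`. A plausible reading is that a tab directly before a newline marks a line continuation, so long sentences can be wrapped in the source while reaching the LLM as a single line. The real `dedent_triple_quote_str` is not shown here; `dedent_and_join` below is a hypothetical stand-in written under that assumption.

```python
import textwrap


def dedent_and_join(text: str) -> str:
    """Hypothetical stand-in for `dedent_triple_quote_str` (assumed behaviour)."""
    # Drop the newline that follows the opening triple quote.
    if text.startswith("\n"):
        text = text[1:]
    text = textwrap.dedent(text)
    # Assumption: a tab immediately before a newline joins the two lines.
    return text.replace("\t\n", "")


prompt = dedent_and_join("""
    - If you are doing a mediation analysis, calculate both \t
    the 'a' and 'b' paths.
    """)
```

Under this assumption the two source lines collapse into one bullet, with the space before `\t` serving as the word separator.
------------------------------------------------------------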
provided_code: str = dedent_triple_quote_str('''
def to_latex_with_note(df, filename: str, caption: str, label: str, \t
                       note: str = None, legend: Dict[str, str] = None, **kwargs):
    """
    Converts a DataFrame to a LaTeX table with optional note and legend added below the table.
    Parameters:
    - df, filename, caption, label: as in `df.to_latex`.
    - note (optional): Additional note below the table.
    - legend (optional): Dictionary mapping abbreviations to full names.
    - **kwargs: Additional arguments for `df.to_latex`.
    """
def is_str_in_df(df: pd.DataFrame, s: str):
    return any(s in level for level in getattr(df.index, 'levels', [df.index]) + \t
               getattr(df.columns, 'levels', [df.columns]))
AbbrToNameDef = Dict[Any, Tuple[Optional[str], Optional[str]]]
def split_mapping(abbrs_to_names_and_definitions: AbbrToNameDef):
    abbrs_to_names = {abbr: name for abbr, (name, definition) in \t
                      abbrs_to_names_and_definitions.items() if name is not None}
    names_to_definitions = {name or abbr: definition for abbr, (name, definition) in \t
                            abbrs_to_names_and_definitions.items() if definition is not None}
    return abbrs_to_names, names_to_definitions
''')
------------------------------------------------------------
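To make the behaviour of `split_mapping` concrete, the sketch below copies the function from the snippet above and runs it on a small mapping. The sample entries are illustrative only. Note how an entry with no new name (`None`) keeps its original label as the legend key, and an entry with no definition is left out of the legend.

```python
from typing import Any, Dict, Optional, Tuple

AbbrToNameDef = Dict[Any, Tuple[Optional[str], Optional[str]]]


def split_mapping(abbrs_to_names_and_definitions: AbbrToNameDef):
    # Labels that get a new display name.
    abbrs_to_names = {abbr: name for abbr, (name, definition) in
                      abbrs_to_names_and_definitions.items() if name is not None}
    # Legend keyed by the name the reader will actually see in the table.
    names_to_definitions = {name or abbr: definition for abbr, (name, definition) in
                            abbrs_to_names_and_definitions.items() if definition is not None}
    return abbrs_to_names, names_to_definitions


# Illustrative mapping: (new name or None, definition or None).
mapping: AbbrToNameDef = {
    'AvgAge': ('Avg. Age', 'Average age, years'),
    'MRSA': (None, 'Infected with MRSA, 1: Yes, 0: No'),
    'PV': ('P-value', None),
}
abbrs_to_names, legend = split_mapping(mapping)
```
------------------------------------------------------------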
mission_prompt: str = dedent_triple_quote_str('''
Please write a Python code to convert and re-style the "table_?.pkl" dataframes created \t
by our "{codes:data_analysis}" into latex tables suitable for our scientific paper.
Your code should use the following 3 custom functions provided for import from `my_utils`:
```python
{provided_code}
```
Your code should:
* Rename column and row names: You should provide a new name to any column or row label that is abbreviated \t
or technical, or that is otherwise not self-explanatory.
* Provide legend definitions: You should provide a full definition for any name (or new name) \t
that satisfies any of the following:
- Remains abbreviated, or not self-explanatory, even after renaming.
- Is an ordinal/categorical variable that requires clarification of the meaning of each of its possible values.
- Contains unclear notation, like '*' or ':'
- Represents a numeric variable that has units, that need to be specified.
To avoid re-naming mistakes, you should define for each table a dictionary, \t
`mapping: AbbrToNameDef`, which maps any original \t
column and row names that are abbreviated or not self-explanatory to an optional new name, \t
and an optional definition.
If different tables share several common labels, then you can build a `shared_mapping`, \t
from which you can extract the relevant labels for each table.
Overall, the code must have the following structure:
```python
# IMPORT
import pandas as pd
from my_utils import to_latex_with_note, is_str_in_df, split_mapping, AbbrToNameDef
# PREPARATION FOR ALL TABLES
# <As applicable, define a shared mapping for labels that are common to all tables. For example:>
shared_mapping: AbbrToNameDef = {
'AvgAge': ('Avg. Age', 'Average age, years'),
'BT': ('Body Temperature', '1: Normal, 2: High, 3: Very High'),
'W': ('Weight', 'Participant weight, kg'),
'MRSA': (None, 'Infected with Methicillin-resistant Staphylococcus aureus, 1: Yes, 0: No'),
...: (..., ...),
}
# <This is of course just an example. Consult with the "{data_file_descriptions}" \t
and the "{codes:data_analysis}" for choosing the labels and their proper scientific names \t
and definitions.>
# TABLE {first_table_number}:
df{first_table_number} = pd.read_pickle('table_{first_table_number}.pkl')
# FORMAT VALUES <include this sub-section only as applicable>
# <Rename technical values to scientifically-suitable values. For example:>
df{first_table_number}['MRSA'] = df{first_table_number}['MRSA'].apply(lambda x: 'Yes' if x == 1 else 'No')
# RENAME ROWS AND COLUMNS <include this sub-section only as applicable>
# <Rename any abbreviated or not self-explanatory table labels to scientifically-suitable names.>
# <Use the `shared_mapping` if applicable. For example:>
mapping{first_table_number} = dict((k, v) for k, v in shared_mapping.items() \t
if is_str_in_df(df{first_table_number}, k))
mapping{first_table_number} |= {
'PV': ('P-value', None),
'CI': (None, '95% Confidence Interval'),
'Sex_Age': ('Age * Sex', 'Interaction term between Age and Sex'),
}
abbrs_to_names{first_table_number}, legend{first_table_number} = split_mapping(mapping{first_table_number})
df{first_table_number} = df{first_table_number}.rename(columns=abbrs_to_names{first_table_number}, \t
index=abbrs_to_names{first_table_number})
# SAVE AS LATEX:
to_latex_with_note(
df{first_table_number}, 'table_{first_table_number}.tex',
caption="<choose a caption suitable for a table in a scientific paper>",
label='table:<chosen table label>',
note="<If needed, add a note to provide any additional information that is not captured in the caption>",
legend=legend{first_table_number})
# TABLE <?>:
# <etc, all 'table_?.pkl' files>
```
Avoid the following:
Do not provide a sketch or pseudocode; write a complete runnable code including all '# HEADERS' sections.
Do not create any graphics, figures or any plots.
Do not send any presumed output examples.
''')
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Please revise the code making sure the table is built with an index that has meaningful row labels.
Labeling rows with sequential numbers is not common in scientific tables.
However, if you are sure that starting each row with a sequential number is really what you want, \t
then convert the labels from int to str, so that it is clear that it is not a mistake.
"""),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
This is likely a mistake and is surely confusing to the reader.
Please revise the code so that the table does not repeat the same values in multiple cells.
"""),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
In scientific papers, it is not customary to include the same values in multiple tables.
Please revise the code so that each table includes its own unique data.
"""),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Note that in scientific tables, it is not customary to include quantiles, or min/max values, \t
especially if the mean and std are also included.
Please revise the code so that the tables only include scientifically relevant statistics.
"""),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Please revise the code making sure all tables are created with `index=True`, and that the index is \t
meaningful.
""") + msg,
------------------------------------------------------------
instructions=dedent_triple_quote_str(f"""
Please revise the code so that it:
* Finds the unique values (use `{column_label}_unique = df["{column_label}"].unique()`)
* Asserts that there is only one value. (use `assert len({column_label}_unique) == 1`)
* Creates the table without this column (use `df.drop(columns=["{column_label}"])`)
* Adds the unique value, {column_label}_unique[0], \t
in the table note (use `note=` in the function `to_latex_with_note`).
There is no need to add corresponding comments to the code.
"""),
------------------------------------------------------------
issue=dedent_triple_quote_str("""
Here is the created table:
```latex
{table}
```
When trying to compile it using pdflatex, I got the following error:
{error}
""").format(filename=filename, table=latex, error=e),
------------------------------------------------------------
transpose_message = dedent_triple_quote_str("""\n
- Alternatively, consider completely transposing the table. Use `df = df.T`.
""")
------------------------------------------------------------
drop_column_message = dedent_triple_quote_str("""\n
- Drop unnecessary columns. \t
If the labels cannot be shortened much, consider whether there might be any \t
unnecessary columns that we can drop. \t
Use `to_latex_with_note(df, filename, columns=...)`.
""")
------------------------------------------------------------
index_note = dedent_triple_quote_str("""\n
- Rename the index labels to shorter names. Use `df.rename(index=...)`
""")
------------------------------------------------------------
issue=dedent_triple_quote_str("""
Here is the created table:
```latex
{table}
```
I tried to compile it, but the table is too wide.
""").format(filename=filename, table=latex),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Please change the code to make the table narrower. Consider any of the following options:
- Rename column labels to shorter names. Use `df.rename(columns=...)`
""") + index_note + drop_column_message + transpose_message,
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Please revise the code making sure all tables are created with labeled rows.
Use `index=True` in the function `to_latex_with_note`.
"""),
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
Please revise the code making sure all tables are created with a caption and a label.
Use the arguments `caption` and `label` of the function `to_latex_with_note`.
Captions should be suitable for a table in a scientific paper.
Labels should be in the format `table:<your table label here>`.
In addition, you can add:
- an optional note for further explanations \t
(use the argument `note` of the function `to_latex_with_note`)
- a legend mapping any abbreviated row/column labels to their definitions \t
(use the argument `legend` of the function `to_latex_with_note`)
"""),
------------------------------------------------------------
instructions = dedent_triple_quote_str("""
Please revise the code making sure all abbreviated labels (of both columns and rows!) are explained \t
in their table legend.
Add the missing abbreviations and their explanations as keys and values in the `legend` argument of the \t
function `to_latex_with_note`.
""")
------------------------------------------------------------
instructions += dedent_triple_quote_str("""
Alternatively, since the table is not too wide, you can also replace the abbreviated labels with \t
their full names in the dataframe itself.
""")
------------------------------------------------------------
issue = dedent_triple_quote_str("""
The `legend` argument of `to_latex_with_note` includes only the following keys:
{legend_keys}
We also need to add the following abbreviated row/column labels:
{un_mentioned_abbr_labels}
""").format(legend_keys=list(legend.keys()), un_mentioned_abbr_labels=un_mentioned_abbr_labels)
------------------------------------------------------------
issue = dedent_triple_quote_str("""
The table needs a legend explaining the following abbreviated labels:
{un_mentioned_abbr_labels}
""").format(un_mentioned_abbr_labels=un_mentioned_abbr_labels)
------------------------------------------------------------
instructions=dedent_triple_quote_str("""
The legend keys should be a subset of the table labels.
Please revise the code changing either the legend keys, or the table labels, accordingly.
As a reminder: you can also use the `note` argument to add information that is related to the
table as a whole, rather than to a specific label.
""")
------------------------------------------------------------
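The legend-related messages above describe two checks: every legend key must be a table label, and every abbreviated label must have a legend entry. The sketch below shows both checks as plain set logic. How the project decides that a label is "abbreviated" is not shown in this listing, so the abbreviated labels are passed in as an argument; the function name `check_legend` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def check_legend(legend: Dict[str, str],
                 table_labels: Iterable[str],
                 abbreviated_labels: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (legend keys that are not table labels, abbreviated labels lacking a legend entry)."""
    labels = set(table_labels)
    extra_keys = sorted(key for key in legend if key not in labels)
    un_mentioned = sorted(label for label in abbreviated_labels if label not in legend)
    return extra_keys, un_mentioned


extra, missing = check_legend(
    legend={'CI': '95% Confidence Interval', 'BMI': 'Body Mass Index'},
    table_labels=['CI', 'PV', 'Age'],
    abbreviated_labels=['CI', 'PV'],
)
```

Here `'BMI'` is a legend key with no matching label, and `'PV'` is an abbreviated label that still needs an explanation.
------------------------------------------------------------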
mission_prompt: str = dedent_triple_quote_str("""
Please write a short Python code for finding the largest prime number below our chosen max number.
Your code should create an output text file named "{output_filename}", which should \t
contain the following text:
"The largest prime number below xxx is yyy".
If needed, you can use the following packages which are already installed:
{supported_packages}
""")
------------------------------------------------------------
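For orientation, a response to the demo mission above might look roughly like the following. This is only a sketch using trial division and no extra packages; the function names are hypothetical, and "below" is read as strictly less than the max number.

```python
def is_prime(n: int) -> bool:
    """Trial division; sufficient for small demo inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def largest_prime_below(max_number: int) -> int:
    # Scan downwards from max_number - 1 and stop at the first prime.
    for candidate in range(max_number - 1, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError(f"there is no prime number below {max_number}")


def build_message(max_number: int) -> str:
    return f"The largest prime number below {max_number} is {largest_prime_below(max_number)}"


def write_message(max_number: int, output_filename: str) -> None:
    # Writes the single required sentence to the output text file.
    with open(output_filename, "w", encoding="utf-8") as output_file:
        output_file.write(build_message(max_number))
```
------------------------------------------------------------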
system_prompt: str = dedent_triple_quote_str("""
You are a writer.
You will write a fake funny scientific article for the journal {journal_name}.
""")
------------------------------------------------------------
mission_prompt: str = dedent_triple_quote_str("""
Based on the material provided above ({actual_background_product_names}), \t
please {goal_verb} only the {goal_noun} for a {journal_name} article.
Do not write any other parts!
While making it funny, please make sure to relate specifically to the numerical results that we have.
{latex_instructions}
""")
------------------------------------------------------------
latex_instructions: str = dedent_triple_quote_str("""
Write in tex format including the \\title{} and \\begin{abstract} ... \\end{abstract} commands, \t
and any math or symbols that need tex escapes.
""")
------------------------------------------------------------
other_system_prompt: str = dedent_triple_quote_str("""
You are a reviewer for a writer who is writing a funny scientific-like paper about their results.
Your job is to provide constructive bullet-point feedback.
If you feel that the writing does not need further improvements, you should reply only with:
"{termination_phrase}".
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_reviewer_response: str = dedent_triple_quote_str("""\n\n
Please correct your response according to any points in my feedback that you find relevant and applicable.
Send back a complete rewrite of the {pretty_section_names}.
Make sure to send the full corrected {pretty_section_names}, not just the parts that were revised.
""")
------------------------------------------------------------
sentence_to_add_at_the_end_of_performer_response: str = dedent_triple_quote_str("""
Please provide a bullet-point list of constructive feedback on the above {pretty_section_names} \t
for my paper. Do not provide positive feedback, only provide actionable instructions for improvements in \t
bullet points.
In particular, make sure that the section is correctly grounded in the information provided above, \t
yet is written in a funny way.
If you find any inconsistencies or discrepancies, please mention them explicitly in your feedback.
You should only provide feedback on the {pretty_section_names}. Do not provide feedback on other sections \t
or other parts of the paper, like tables or Python code, provided above.
If you don't see any flaws, respond solely with "{termination_phrase}".
IMPORTANT: You should EITHER provide bullet-point feedback, or respond solely with "{termination_phrase}"; \t
If you choose to provide bullet-point feedback then DO NOT include "{termination_phrase}".
""")
------------------------------------------------------------
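The reviewer instructions above define an either/or protocol: bullet-point feedback, or the termination phrase alone. The sketch below shows one way a caller could classify a reply under that protocol. It is an assumption, not the project's implementation; the function name, the returned labels, and the sample phrase are hypothetical.

```python
def classify_review(response: str, termination_phrase: str) -> str:
    """Classify a reviewer reply as 'terminate', 'feedback', or 'invalid'."""
    # Tolerate surrounding whitespace and quotation marks around the phrase.
    text = response.strip().strip('"').strip()
    if text == termination_phrase:
        return "terminate"
    if termination_phrase in text:
        # Feedback mixed with the termination phrase violates the either/or rule.
        return "invalid"
    return "feedback"


phrase = "I have no further feedback"  # hypothetical termination phrase
```

A reply classified as `invalid` could be sent back to the reviewer with a reminder of the either/or rule.
------------------------------------------------------------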
user_choice = input(dedent_triple_quote_str("""
Please carefully check that you are willing to proceed with this LLM API call.
We suggest reading the current ongoing conversation and especially the last USER message \t
to understand the instructions we are sending to the LLM.
If you are willing to proceed, please type Y, otherwise type N.
Note: if you choose N, the program will immediately abort.
"""))
------------------------------------------------------------
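The confirmation prompt above expects `Y` or `N` and aborts on `N`. How the project treats other input is not shown; the sketch below assumes case-insensitive matching and rejects anything else so the caller can ask again. The function name `should_proceed` is hypothetical.

```python
def should_proceed(user_choice: str) -> bool:
    """Interpret the answer to the Y/N confirmation prompt (assumed rules)."""
    choice = user_choice.strip().upper()
    if choice == "Y":
        return True
    if choice == "N":
        return False
    raise ValueError(f"expected Y or N, got {user_choice!r}")


# Typical use (not executed here):
#     if not should_proceed(input("Proceed with this LLM API call? [Y/N] ")):
#         raise SystemExit("Aborted by user.")
```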