@JayGwod
Last active September 16, 2023 15:39
[Research tools] Some useful apps and websites, including literature management using #zotero, #keras, #jupyter.

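Research Tools

Finding Papers

Using a Reference Manager

Installing and Using Zotero

Installing and Using Zotero Plugins

Organizing the Literature

Main goal: be able to find the information you need quickly, under any conditions. No software, however good, can replace reading a lot of papers, in many batches.

Main method: organize lightly, search heavily. Organizing lightly means not classifying papers at all, or classifying them only roughly. Searching heavily means using different search tools to locate the paper you need quickly. Search technology is already very powerful, so organizing with notes and similar means tends to impose a rigid framework and makes you spend too long on a single paper, which hurts efficiency.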

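Search Heavily

  1. Before starting a new project (writing a review, starting a new research topic, or completing a large assignment), create a new folder for that project in Zotero.
  2. When a project has just started, the folder contains no papers, so first build up an initial local collection. Search Google Scholar by keyword for the papers you need. Zotero (together with the ZotFile plugin) automatically downloads each paper and extracts its metadata, renames the file using the author, publication year, and title, and files it in the specified folder. This completes the initial collection.
  3. Sync all the papers in this folder with a cloud drive. From then on, open papers only from this folder. The benefit is that all papers and notes are kept in one place, so information does not become fragmented.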
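
The renaming and searching described above can be sketched in a few lines of Python. This is only an illustration, not ZotFile's own implementation: the `Author_Year_Title.pdf` pattern, the function names `paper_filename` and `find_papers`, and the 60-character title limit are all assumptions made for the example. Note that the cleaning step keeps only ASCII characters, so non-Latin titles would need a different rule.

```python
import re
import unicodedata
from pathlib import Path


def paper_filename(first_author: str, year: int, title: str, max_title_len: int = 60) -> str:
    """Build an 'Author_Year_Title.pdf' style name from basic metadata."""

    def clean(text: str) -> str:
        # Strip accents, then keep only letters, digits, spaces and hyphens.
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
        text = re.sub(r"[^A-Za-z0-9 \-]", "", text)
        return re.sub(r"\s+", " ", text).strip()

    short_title = clean(title)[:max_title_len].rstrip()
    return f"{clean(first_author)}_{year}_{short_title}.pdf"


def find_papers(folder, *keywords):
    """Return names of PDFs under `folder` whose file name contains every keyword."""
    keys = [k.lower() for k in keywords]
    return sorted(
        p.name
        for p in Path(folder).rglob("*.pdf")
        if all(k in p.name.lower() for k in keys)
    )
```

With consistent file names, a call such as `find_papers("papers/my-project", "2017", "attention")` (a hypothetical folder) is often enough to locate a paper without any further classification.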

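Organize Lightly

As time passes, the papers in a folder accumulate, and at that point you need subfolders, which is organizing in the traditional sense. Classify a project roughly so that each subfolder holds no more than about 50 papers; with the publication year, journal name, and author name, you can then easily find the paper you need. Don't wait until there are many papers before creating subfolders; instead, create small categories for the different questions within a project as you read. The same paper can belong to several folders, so there is no need to agonize over where to file it.

Source: "How do you summarize and organize academic literature?" - answer by nerfing - Zhihu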

Reading Papers

In this video, Prof. Pete Carr (faculty member at the University of Minnesota, Department of Chemistry) shares an algorithm to read a scientific paper more efficiently.

Structure of a Journal Article

  1. Title
  2. Keywords
  3. Abstract
  4. Introduction
  5. Experimental
  6. Results and Discussion
    1. tables
    2. figures
  7. Summary/Conclusion
  8. References

One might start reading the paper in the order in which it is written (title, abstract, introduction, and so on). However, there is a more efficient way to extract the most information from the article in the least amount of time.

Phase 1: Survey the Article

Feel free to stop reading the article at any point.

  1. Read the title and keywords (these are probably what got you to look at the paper)
  2. Read the abstract.
  3. Read the conclusions.

Phase 2: Read the Article

  • Look at the tables and figures (including captions).
    • This is really what was done in the work. This does not take much time so it is worth looking at before really getting into the details which will slow down the reading.
  • Read the introduction.
    • This is the background needed and why the study was done.
  • Read the results and discussion.
    • This is the heart of the paper.
  • Read the experimental.
    • This is how they did the work. You get to this point if you are really interested and need to understand exactly what was done to better understand the meaning of the data and its interpretation.

Taking Notes

Write some notes so you don't have to read the whole paper again.

Citations in the Jupyter Notebook

python3 -m pip install cite2c
python3 -m cite2c.install
# Start/Restart the Notebook server

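Writing Code

Using a Framework

Finding a Baseline

Modify someone else's code instead of writing from scratch, and refactor later if needed. This is also good research practice.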

  • Could be someone else’s code... as long as you can read it
  • Even better if this code already modularizes what you want to change
    • On the other hand: Re-implementing a SOTA baseline is incredibly helpful for understanding what’s going on, and where some decisions might have been made better
  • Just go fast and find something that works, then go back and refactor (if you made something useful)

Keep a Good Code Style

Write code for people, not for machines.

  • Meaningful names
  • Shape comments on tensors
  • Comments describing non-obvious logic
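
As an illustration of the three points above, here is a minimal sketch using NumPy. The choice of scaled dot-product attention as the example, and the names `B`, `T`, and `D` for the dimensions, are assumptions made for this sketch; any tensor code can be annotated the same way.

```python
import numpy as np


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over a batch, annotated with shape comments.

    B = batch size, T = sequence length, D = feature dimension.
    """
    feature_dim = queries.shape[-1]
    # queries: (B, T, D) @ keys transposed: (B, D, T) -> scores: (B, T, T)
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(feature_dim)
    # Non-obvious step: subtracting the row-wise max leaves the softmax
    # unchanged but avoids overflow in exp().
    scores = scores - scores.max(axis=-1, keepdims=True)      # (B, T, T)
    weights = np.exp(scores)                                   # (B, T, T)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # (B, T, T), rows sum to 1
    # weights: (B, T, T) @ values: (B, T, D) -> output: (B, T, D)
    return weights @ values


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))  # (B=2, T=5, D=8)
out = scaled_dot_product_attention(x, x, x)
```

The shape comment on each line lets a reader check the dimensions without running the code.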

Keep Tests Minimal, but Don't Skip Them

  • A test that checks experimental behavior is a waste of time
  • But, some parts of your code aren’t experimental
    • Make sure data processing works consistently, tensor operations run, and gradients are non-zero
    • Ensure models can train, save and load
    • Run on small test fixtures, so debugging cycle is seconds, not minutes
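
The checks listed above can be sketched with the standard library alone. The one-parameter linear model, the `normalize` step, and the JSON save format below are stand-ins invented for this example; in a real project the same three tests would wrap your own data pipeline and model.

```python
import json
import os
import tempfile


def normalize(values):
    """Data processing step: rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def loss(weight, xs, ys):
    """Mean squared error of the one-parameter model y = weight * x."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(weight, xs, ys):
    """Analytic gradient of `loss` with respect to `weight`."""
    return sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, steps=200, lr=0.1):
    """Plain gradient descent starting from weight = 0."""
    weight = 0.0
    for _ in range(steps):
        weight -= lr * grad(weight, xs, ys)
    return weight


# A tiny fixture keeps the debugging cycle at seconds, not minutes.
XS = normalize([1.0, 2.0, 3.0, 4.0, 5.0])
YS = [3.0 * x for x in XS]


def test_processing_is_consistent():
    assert normalize([1.0, 2.0, 3.0, 4.0, 5.0]) == XS
    assert min(XS) == 0.0 and max(XS) == 1.0


def test_gradient_is_nonzero():
    assert grad(0.0, XS, YS) != 0.0


def test_model_trains_saves_and_loads():
    weight = train(XS, YS)
    assert loss(weight, XS, YS) < loss(0.0, XS, YS)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        with open(path, "w") as f:
            json.dump({"weight": weight}, f)
        with open(path) as f:
            assert json.load(f)["weight"] == weight
```

None of these tests asserts anything about experimental results such as final accuracy; they only confirm that the plumbing works.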

Don't Hard-Code

Code should have some level of abstraction. This makes later controlled experiments easier and the model structure clearer.
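
One common way to avoid hard-coded values is to collect them in a single configuration object. The sketch below assumes a small feed-forward model; the class `ExperimentConfig`, its field names, and the helper `layer_sizes` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    """All tunable values in one place (field names are illustrative)."""
    hidden_size: int = 128
    num_layers: int = 2
    learning_rate: float = 1e-3
    dropout: float = 0.1


def layer_sizes(config: ExperimentConfig, input_size: int, output_size: int) -> list:
    """Derive the layer widths from the config instead of hard-coding them."""
    return [input_size] + [config.hidden_size] * config.num_layers + [output_size]


baseline = ExperimentConfig()
# A controlled experiment changes exactly one field and leaves the rest alone.
wider = replace(baseline, hidden_size=256)
```

Because the dataclass is frozen, a config cannot be modified by accident halfway through a run; each variant is a new object derived from the baseline.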

Running Experiments

Core principle: ensure the correctness and reproducibility of your experiments.

Project Structure

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

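Source: Cookiecutter Data Science

Data Is Immutable

  1. Never edit your raw data, especially not by hand and not in Excel. Don't overwrite the raw data, and don't keep multiple versions of it. Treat the data (and its format) as immutable. Anyone should be able to reproduce the final results using only the source code in src and the data in data/raw.
  2. Because the data is immutable, it does not need version control. If the data is small, you can keep it in the code repository. To store or sync large data, you can use the AWS S3 sync tool.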
  3. Currently by default, we ask for an S3 bucket and use AWS CLI to sync data in the data folder with the server.
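
One way to check that raw data really stays untouched is to record a checksum for every file under data/raw and compare later. This is a sketch of that idea rather than part of Cookiecutter Data Science; the function names `build_manifest` and `changed_files` are invented for the example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large data files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir) -> dict:
    """Map each file under raw_dir (relative POSIX path) to its SHA-256 digest."""
    raw_dir = Path(raw_dir)
    return {
        p.relative_to(raw_dir).as_posix(): file_sha256(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(raw_dir, manifest: dict) -> list:
    """Return paths added, removed, or modified since the manifest was built."""
    current = build_manifest(raw_dir)
    return sorted(
        k for k in current.keys() | manifest.keys() if current.get(k) != manifest.get(k)
    )
```

The manifest is small enough to commit to the repository even when the data itself is not, and an empty result from `changed_files` confirms that the raw files match what the analysis was built on.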

Notebooks Are for Exploration and Communication

When we use notebooks in our work, we often subdivide the notebooks folder. For example, notebooks/exploratory contains initial explorations, whereas notebooks/reports is more polished work that can be exported as html to the reports directory.

There are two steps we recommend for using notebooks effectively:

  1. Follow a naming convention that shows the owner and the order the analysis was done in. We use the format <step>-<ghuser>-<description>.ipynb (e.g., 0.3-bull-visualize-distributions.ipynb).
  2. Refactor the good parts. Don't write code to do the same task in multiple notebooks. If it's a data preprocessing task, put it in the pipeline at src/data/make_dataset.py and load data from data/interim. If it's useful utility code, refactor it to src.
# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload

# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
%autoreload 2

from src.data import make_dataset
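
The naming convention in step 1 can be checked with a small script. The regular expression below is an assumption about what counts as a valid name: a dotted numeric step, a user name without hyphens, and a lowercase hyphen-separated description. The function name `parse_notebook_name` is hypothetical.

```python
import re

# Assumed pattern for <step>-<ghuser>-<description>.ipynb
NOTEBOOK_NAME = re.compile(
    r"^(?P<step>\d+(?:\.\d+)*)"
    r"-(?P<user>[A-Za-z0-9_]+)"
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"\.ipynb$"
)


def parse_notebook_name(filename: str):
    """Return (step, user, description), or None if the name breaks the convention."""
    match = NOTEBOOK_NAME.match(filename)
    if match is None:
        return None
    return match.group("step"), match.group("user"), match.group("description")
```

Running this over the notebooks folder flags files such as `Untitled.ipynb` before they pile up. GitHub user names may contain hyphens, which this pattern does not allow, so adjust the `user` group if that applies to your team.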

Keep secrets and configuration out of version control

Create a .env file in the project root folder. Thanks to the .gitignore, this file should never get committed into the version control repository. Here's an example:

# example .env file
DATABASE_URL=postgres://username:password@localhost:5432/dbname
AWS_ACCESS_KEY=myaccesskey
AWS_SECRET_ACCESS_KEY=mysecretkey
OTHER_VARIABLE=something

Use a package to load these variables automatically. Here's an example snippet adapted from the python-dotenv documentation:

# src/data/dotenv_example.py
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")

Keep track of what you ran

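  1. The key is to record the hash of every experiment so that it can be reproduced;
  2. Run controlled experiments: change only one thing at a time;
  3. Use configuration files to track each change.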
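
A minimal sketch of such a record follows. It assumes that the hash in question is the git commit of the code plus a hash of the configuration; the record layout and the function names are illustrative, not a standard format.

```python
import hashlib
import json
import subprocess
import time


def config_hash(config: dict) -> str:
    """Short, stable hash of a config; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def git_commit() -> str:
    """Current commit hash, or 'unknown' when git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_record(config: dict) -> dict:
    """Everything needed to find and rerun this experiment later."""
    return {
        "config": config,
        "config_hash": config_hash(config),
        "git_commit": git_commit(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }


def changed_keys(old: dict, new: dict) -> list:
    """Keys whose values differ between two configs; ideally exactly one."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

Saving the output of `run_record` next to each result file ties every number back to the code and settings that produced it, and `changed_keys` makes it easy to confirm that two runs differ in only one setting.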

experiments

Hyperparameter Tuning Tips

Sharing Your Research

Deploy with Docker; use a virtual environment for local development.

Writing Papers

In this video, Prof. Carr (faculty member at the University of Minnesota, Department of Chemistry) explains his algorithm for writing a paper in a weekend.

Preliminaries

  • Review and Renew Your Literature Search.
  • Determine Who Your Audience Is.
    • What kind of paper is it - research, review, tutorial.
    • What journal is it intended for.
    • Undergraduates, researchers, but always reviewers.

The Big Picture

Writing the initial draft is the creative part of the job. Resist the temptation to correct and edit as you go. Your job now is to produce a complete first draft.

The "Algorithm"

  • Just get started; don't procrastinate.
  • Create an outline by making a list of all your figures and tables. Put them in order of presentation as they may appear in the results and discussion. Always work from an outline. If you have to stop you can easily pick up the writing later. You have the data so this part is easy.
  • Do not write the introduction now. It is the hardest part of the paper to write. Again, it could be a waste of time to write it now.
  • Begin with the experimental section. It is the easiest part to write and getting it done will give you a feeling of progress.
  • Now write the results and discussion following the outline.
  • Then comes the really hard part - critical editing where you make sure that the English is concise and coherent, and the science is correct.
  • Write the conclusions. I like a numbered format.
    • 1...
    • 2...
  • And you write the "Abstract" and "Acknowledgements" after the "Conclusions"!
  • Now we have to do the introduction, and there are two very important things that need to be covered in the introduction:
    • Why was the study done? What is its purpose?
    • You've got to collect the relevant essential background information and put that together in the introduction.
  • The very last step is producing the references for the paper. It's a good idea to write some notes as you go through the first draft and manuscript, indicating what references might be needed and what they would be about. But do not stop to collect the references at that time.

A few final words

Reading maketh a full man, conference a ready man and writing an exact man. - Sir Francis Bacon

  1. Writing is the most exacting part of what we do as scientists.
  2. Always review the manuscript requirements for the journal of interest.
  3. A wonderful short paper by Professor Royce Murray in mock style tells you the worst things you can do in writing a manuscript:
    1. Never explain the objectives of the paper in a single sentence or paragraph and in particular never at the beginning of the paper.
    2. Similarly, never describe the experiment(s) in a single sentence or paragraph and never at the beginning. Instead, to enhance the reader’s pleasure of discovery, treat your experiment as a mystery, in which you divulge one essential detail on this page and a hint of one on the next and complete the last details only after a few results have been presented. It’s also really fun to divulge the reason that the experiment should successfully provide the information sought only at the very end of the paper, as any good mystery writer would do.
    3. Diagrams are worth a thousand words, so in the interest of writing a concise paper, omit all words that explain the diagram, including labels. Let the reader use his/her fertile imagination.
    4. Great writers invent abbreviations for complex topics, which also saves a lot of words. Really short abbreviations should be used for very complex topics, and more complicated ones for simple ideas.
    5. In referring to the previous literature, be careful to cite only the papers that make claims that would support your own, especially those that contain little evidence for the claim, so that your paper shines in comparison.
    6. It should be anathema to use any original phrasing or humor in your language, so as to adhere to the principle that scientific writing must be stiff and formal and without personality.
    7. Your readers are intelligent folks, so don’t bother to explain your reasoning in the interpretation of the results. Especially don’t bother to point out their impact on or consistency with other authors’ results and interpretation, so that your paper can be an island of original thinking.

Recommended References and Reading

  • W. Strunk and E. B. White, "The Elements of Style".
  • ACS Author's Guide.
  • R. W. Murray, "Skillful Writing of an Awful Research Paper", Anal. Chem. 2011, 83, 633. Seven rules to follow.
  • R. Schoenfeld, "The Chemist's English".
  • A. Eisenberg, "Effective Technical Communication".
  • P. T. O'Conner, "Words Fail Me".
  • G. M. Whitesides, "Whitesides' Group: Writing a Paper", Adv. Mater. 2004, 16, 1375.

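Writing and Proofreading Tools

Workflow

In a graduate student's academic career, the advisor is not the most critical factor; what matters most is your own goals, determination, and effort. A competent graduate student should be a full-stack graduate student: reading references, coming up with ideas, refining ideas, designing experiments, running experiments, writing papers, revising papers, and giving presentations. You should be able to carry out this whole stack independently. If any link is missing, you will be dependent on your advisor. There are already many answers to this question, and you have read and understood all the big principles; the real question is whether you have truly brought the determination and effort of someone fighting with their back to the wall.

Source: "Without an advisor's guidance, how can a graduate student read the literature, come up with original ideas, and write papers?" - answer by Wang Hongwei (王鸿伟) - Zhihu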

Workflow

  1. Create a notebook with some content.
  2. Optionally, create a .bib file and external images.
  3. Adjust the notebook and cell metadata.
  4. Install ipypublish and run nbpublish on either a specific notebook or a folder containing multiple notebooks.
  5. A converted folder will be created, into which the final .tex, .pdf, and .html files will be output, named after the input notebook or folder.
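
For step 1, a notebook is just a JSON file, so a starting point can be generated with the standard library. The sketch below writes a minimal nbformat 4.4 notebook with one markdown cell and one code cell; the function names are invented for this example, and the metadata dictionaries are left empty because the keys ipypublish expects in step 3 are not covered here.

```python
import json


def minimal_notebook(markdown_text: str, code_text: str) -> dict:
    """Build a notebook dict with one markdown cell and one code cell (nbformat 4.4)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},  # notebook-level metadata (step 3) goes here
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "metadata": {},  # cell-level metadata (step 3) goes here
                "execution_count": None,
                "outputs": [],
                "source": code_text,
            },
        ],
    }


def write_notebook(path: str, notebook: dict) -> None:
    """Save the notebook dict as an .ipynb file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)
```

The resulting file opens in Jupyter like any other notebook and can then be passed to nbpublish as described in step 4.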

Source: A workflow for creating and editing publication ready scientific reports and presentations, from one or more Jupyter Notebooks, without leaving the browser!

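Study Methods

Research also shows that only two study methods are truly effective: practice testing and distributed practice. Practice testing needs little explanation: insert small tests as you study, and test yourself on what you have learned anytime, anywhere. Distributed practice means that if you have one hour to study, for example, you spend 20 minutes on math, 20 minutes on history, and the remaining 20 minutes on English. In other words, split the study time into small blocks and switch between subjects (much like interleaved study), then repeat the routine every other day.

Source: "A Kent State University study: 'Scientifically, there are only two efficient study methods'" (with a link to the English paper) - article by Wang Jun (王俊) - Zhihu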

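Form Your Knowledge Tree

Just yesterday I asked Academician Lin Yu-sheng (林毓生) for advice. He is over seventy now. I told him I was coming to give a talk today and asked, "If you were speaking on this topic, what would you say?" He said, "Only one thing: the five or six important books must be read several times." Mr. Lin was a student of Hayek and several other great modern thinkers at the University of Chicago, and an important part of their training was the close reading of original classics. His advice makes a lot of sense. You cannot read only those few important books, but those five or six books will gradually form the trunk of your knowledge tree. Everything that comes later hangs on it and can be placed within that framework, while unrelated material is set aside for the time being. Life is finite but knowledge is infinite. You cannot read every good book in the world, so you must learn to choose and accept that you cannot read every book that interests you. And if you did read every book that interests you, you would very likely end up like the owner of that used bookstore on the street in Princeton: having read too much outside the field you care about, the knowledge is, for you, just loose coins scattered on the floor.

Master Your Tools

At this stage you must master languages and the appropriate tools. You need one foreign language that you can read very fluently, and another in which you can at least understand article titles. More is better, of course, but there must be at least one language, whether English, Japanese, French, or another, in which you can read the relevant books very fluently. This is the minimum requirement. Without this tool, your field of view is severely limited, because a language is like a skylight: without it, your room is sealed off. Why must you be able to understand titles? So that no important article escapes your notice. If you cannot even read the title, you will not know to ask someone for help or to look up the relevant material yourself. As for other tools, whether statistics or anything else, master as many as you can now, because you will not have time to learn them later.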

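Make Writing in Academic Format a Habit

Another basic piece of training is that whenever you write, whether ten thousand, thirty thousand, or fifty thousand characters, you should habitually follow academic conventions until it becomes second nature. That is, the footnotes and formatting of your papers should become part of your life from the very start of graduate school. If you do not form this habit, readers will find your paper careless, and revising later will take a lot of time: a thesis is large, perhaps several hundred pages, and if you get the format wrong at the beginning, fixing it from start to finish is bound to be laborious. So form the habit from the start. We are writing academic papers, not essays, so there are fixed rules for where each comma goes, where each title mark goes, where quotation marks are needed, and which punctuation mark to use. Writing in Chinese is manageable; English has a large number of abbreviations. In the 1960s, when Taiwan was still closed off intellectually, someone came back from the United States and said, "There is something remarkable in America, because there is one truly remarkable person." Asked what was so remarkable, he said, "This person's work is cited everywhere." The person's name was ibid. In fact, ibid. means "the same source as the previous citation". The word comes from Latin, which has supplied many abbreviations, such as et al., meaning "and others". In English, The Chicago Manual of Style is a book devoted to explaining these writing conventions. Learn the Chinese and English writing conventions as early as you can and practice them steadily, until you can write freely and still produce text that conforms.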

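Find a Model to Learn From

When I first went to the United States to study, every report I had to write was a heavy burden. Our English reports ran thirty to forty pages each, so four courses in a semester meant a hundred and sixty pages, and I had to learn even footnotes from scratch. Later I found a good method: whenever I was about to write, I put my favorite paper beside me. Its topic had nothing to do with mine, but each time I looked at how the author wrote and at the footnotes, read a few lines, and then began writing. It is like the famous tenor Pavarotti, who always held a handkerchief when he sang opera, because, as he said, "Going on stage is like going down to hell; it is too nerve-racking." To overcome his nerves he had a habitual gesture, which was to hold a white handkerchief. I think that offprint was my white handkerchief: it let me get started on the report, and from it I learned how to think, how to plan, how to keep the whole in view, and how to write footnotes in English. Reading one master's work carefully from beginning to end, then imitating and learning from it, is the best way to get started; step by step, you begin to write something of your own. I also often encourage my students to go abroad for six months or a year to look around. The National Science Council (国科会) now offers all kinds of opportunities to broaden your horizons, so you can see what dishes the restaurants are serving these days; then, when you come back to cook for yourself, you will know where to start.

Source: "Without an advisor's guidance, how can a graduate student read the literature, come up with original ideas, and write papers?" - answer by Social Sciences Academic Press (社会科学文献出版社) - Zhihu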
