@disulfidebond
Last active October 29, 2019 16:03
Jane Churpek Lab terms and definitions for bioinformatics research

Effort Assessment

There are numerous software tools available for bioinformatics research, with the usual associated hazards that these tools may be out of date, poorly documented, and ill-suited to the task at hand. The standard operating procedure in this case is to assess the effort and time that would be required to build a tool from scratch versus the effort and time that would be required to adapt/modify an existing tool. After an assessment that encompasses the whole project, make a determination of which option would require the least time and effort, then proceed with the workflow.

Always proceed with the effort assessment in a top-down manner. For example, if you are looking for a data-mining tool to scan metadata from abstracts on PubMed, start by investigating existing data-mining tools that have been designed to scan abstracts. If such a tool exists, assess it as described. If it does not exist, or your assessment is that the existing software cannot be adapted, then look for software tools that can obtain PMIDs, which will then be used in a custom script to scan downloaded abstracts, and so on.

Important: If the PI overrules your decision, or there is a conflict between your assessment and the PI's instructions, always follow the instructions from the PI unconditionally and without reservations.

Natural Language Processing (NLP)

A type of artificial intelligence that trains computers to recognize, analyze, and make predictions from a set of data that either involves a human language, or can be structured to resemble a human language.

Wikipedia and the site becominghuman both have somewhat succinct overviews of NLP.

A common application of NLP is with dictation software and speech recognition. However, NLP has several other applications, and its use is constantly evolving.
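As a rough illustration of what "structuring" text for analysis can mean, the sketch below splits a sentence into word tokens and counts them (a "bag of words"), which is a common first step in many NLP workflows. The function names and the sample sentence are hypothetical and only serve the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into tokens of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))

# Hypothetical sample sentence, made up for this example.
abstract = "BRCA1 mutations raise breast cancer risk. BRCA1 testing guides cancer care."
counts = bag_of_words(abstract)
print(counts.most_common(2))
```

Real NLP tools go much further than counting words, but most of them start from some tokenized form of the text like this one.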

Supervised and Unsupervised Learning

Supervised learning typically involves building a computer model that categorizes a set of values as belonging to one or more predefined categories, and then, once sufficient "supervision" or training has taken place, using this computer model to (correctly) predict how a new value should be categorized.

Unsupervised learning looks for similarities within a dataset, and creates clusters of datapoints within this dataset based on their similarities. An important distinction is that, unlike supervised learning, unsupervised learning has no prior or predefined knowledge (such as category labels) to draw on.

Examples of supervised learning are logistic regression, k-nearest neighbors, and neural networks.
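To make the idea of supervised learning concrete, here is a minimal sketch of k-nearest neighbors in plain Python. The training points and the labels "group_a" and "group_b" are made up for illustration; the point is only that every training value arrives with a predefined category, and a new value is categorized by a vote among its closest labeled neighbors.

```python
import math
from collections import Counter

def knn_predict(training, new_point, k=3):
    """Predict a label for new_point by majority vote among its k nearest labeled points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (feature vector, predefined category).
training = [
    ((1.0, 1.2), "group_a"),
    ((0.8, 1.0), "group_a"),
    ((1.1, 0.9), "group_a"),
    ((4.0, 4.2), "group_b"),
    ((4.3, 3.9), "group_b"),
    ((3.8, 4.1), "group_b"),
]

print(knn_predict(training, (1.0, 1.0)))
print(knn_predict(training, (4.1, 4.0)))
```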

Examples of unsupervised learning are Principal Component Analysis (PCA), centroid clustering, and k-means clustering.
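For contrast, the following is a minimal sketch of k-means clustering on one-dimensional values. No labels are supplied; the clusters emerge only from how close the values are to each other. The starting centroids are passed in by hand here as a simplifying assumption (library implementations usually pick them automatically), and the values are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids (a minimal k-means sketch)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled measurements.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = k_means_1d(values, centroids=[0.0, 6.0])
print(clusters)
```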

The site towardsdatascience has a good overview of the difference between supervised and unsupervised learning.

Package Manager

A package manager is a program on a computer that keeps track of installed software, and manages the download and installation of new software. More specifically, it detects conflicts between software versions, and provides advice on how to resolve conflicts between packages that are incompatible with each other.

Nearly all operating systems have some type of package manager installed, and there are also third party package managers that can be installed alongside existing package managers. However, great care should be taken before installing a second package manager to ensure that conflicts with existing package managers do not occur.

Below is a brief overview of existing package managers for Windows, Ubuntu Linux, Mac OSX, and Redhat Linux:

| Operating System | Pre-installed package manager | Types of installers that are recognized | Notes and comments |
|---|---|---|---|
| Ubuntu Linux | apt-get | .deb | dpkg is the underlying tool that handles .deb files. Standalone .deb packages can be installed, albeit extreme care should be used |
| Redhat Linux | yum | .rpm | |
| Mac OSX | Mac App Store | .pkg | 1) After Mac OSX Yosemite, the software update package manager was merged with the Mac App Store package manager. 2) Usually, homebrew can be installed with few difficulties on Mac OSX. Only the default settings for installation of homebrew should be used unless there is a very good and specific reason otherwise. |
| Windows 10 | Windows Installer | .exe, .msi | Use Installation Wizard for installation, unless specified otherwise by the installer |
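Before installing a second package manager, it can help to check which ones are already present. The sketch below is one way to do that from Python using the standard library; the shortlist of command names is an assumption for illustration, not a complete inventory.

```python
import shutil

# Hypothetical shortlist of package manager commands to look for.
CANDIDATES = ["apt-get", "yum", "brew"]

def find_package_managers(candidates=CANDIDATES):
    """Return a dict mapping each command name to its path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in candidates}

for name, path in find_package_managers().items():
    print(f"{name}: {path or 'not found'}")
```

A command that is reported as "not found" may still be installed somewhere outside the PATH, so treat the result as a first check only.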

JSON

JSON is short for JavaScript Object Notation.

JavaScript

JavaScript is a programming language, frequently used in websites and by search engines. Within the context of knowing what JSON is, that's all we need to know.

Object Notation

There are many ways of storing and accessing data. Object-Oriented notation (or Object notation) is one that will be described here.

Imagine that we need a way to store cancer genome data. We could simply make a list:

  • Tumor
  • Cell_mutation_signature
  • cancers_observed
  • tumor_cell_line

This approach works, but it would become very disorganized very quickly, and we couldn't necessarily show that a tumor_cell_line belongs to a particular type of tumor.

We could organize it by type:

  • tumor_type: invasive lobular carcinoma

  • cell_mutation_signature: 1A

  • cancers_observed: Ovarian

But, we'd run into problems again! If we wanted to link 'carcinoma' to similar mutation signatures, then we'd have to create a new listing, and things would go from organized to disorganized very quickly:

  • tumor_type: invasive lobular carcinoma

  • cell_mutation_signature: 1A

  • signature_also_in: carcinoma

  • cancers_observed: Ovarian

Object Notation is one approach to this problem. Picture you have at your disposal a library with computers, staff, and research librarians to assist you, but no books. We can add whatever books we want to the library, and as long as we follow the librarians' rules for organization, we can set up whatever system we like to store and to access the data in the books. Let's say that we wanted to create broad categories for cell types as 'carcinoma', or 'sarcoma'. Then, we wanted to add a descriptor for the types of tumors that have been identified for each, and finally which databases have research data for these tumors. To create this setup, we'd ask the librarians to create this organization for us:

    {
      "Cancer_type_object" : {
        "tumor_types" : [
          "tumor_type1",
          "tumor_type2"
        ],
        "data_found_in" : [
          "TCGA",
          "DBGaP"
        ]
      }
    }

After looking the organization over, the librarian would tell us this organization works, and would caution us that each Cancer_type_object name needs to be unique, so that the library staff can identify each object by its name. The items inside the tumor_types and data_found_in lists, for example TCGA in the data_found_in list, do not need to be unique, because these can be looked up by their position in the list.

Here's an example of what this could look like:

    {
      "Ovarian Cancer" : {
        "carcinoma" : [
          "metastatic",
          "benign"
        ],
        "online_repositories" : [
          "TCGA",
          "DBGaP"
        ]
      },
      "Breast Cancer" : {
        "sarcoma" : [
          "metastatic",
          "benign"
        ],
        "online_repositories" : [
          "TCGA",
          "DBGaP"
        ]
      }
    }

In the examples above, JSON objects are denoted by curly braces {}. Data from this object is retrieved by looking up a unique identifier term, also called a key. JSON arrays are denoted by square brackets []. Data from arrays is retrieved by looking up the position (also called an index) of an item in the array. Names of items in arrays do not need to be unique.
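The two kinds of lookup can be tried out with Python's built-in json module. This is a minimal sketch that assumes the example is written as strict JSON, with each cancer type mapping to a nested object; the variable names are only for illustration.

```python
import json

text = """
{
  "Ovarian Cancer": {
    "carcinoma": ["metastatic", "benign"],
    "online_repositories": ["TCGA", "DBGaP"]
  },
  "Breast Cancer": {
    "sarcoma": ["metastatic", "benign"],
    "online_repositories": ["TCGA", "DBGaP"]
  }
}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
data = json.loads(text)

# Object lookup: use the key.
ovarian = data["Ovarian Cancer"]

# Array lookup: use the position (index), counting from 0.
print(ovarian["online_repositories"][0])
```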

JSON allows storing objects within objects, objects within arrays, arrays within objects, and arrays within arrays. This is called nesting, and is also described as a parent-child relationship. In the example above, Breast Cancer is a key within the outermost object, and must be unique there. sarcoma is a key nested under Breast Cancer, which makes it a child term of the parent Breast Cancer. metastatic is an item within the sarcoma array; it does not need to be unique, and is retrieved by its position.
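One way to see the parent-child relationships is to walk the nested structure and print the full path to every value. The helper below is a hypothetical sketch that works on data already loaded into Python dicts and lists; keys appear in the path as strings and array positions as integers.

```python
def walk(node, path=()):
    """Yield (path, value) for every leaf value in nested dicts and lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node

# A trimmed-down piece of the example, written as Python dicts and lists.
data = {"Breast Cancer": {"sarcoma": ["metastatic", "benign"]}}

for path, value in walk(data):
    print(path, value)
```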

The key online_repositories is part of an object, but it only needs to be unique within its parent object. Sharp readers will note this term has been repeated within both the Breast Cancer and Ovarian Cancer objects, but this is completely ok. It is acceptable to repeat a child key term within multiple different objects, because the child key terms within objects cannot 'see' outside of their parent object, and these key terms are considered unique within the scope of each individual object.
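The scoping rule can be checked with a short experiment. In this sketch, the first string repeats a key across two different parent objects, and the second repeats a key inside a single object. Note that the second case is specific to how Python's json module behaves (it keeps only the last value for a repeated key); other parsers may handle duplicates differently.

```python
import json

# The same key in two different parent objects: both values are kept.
separate = json.loads(
    '{"Ovarian Cancer": {"online_repositories": ["TCGA"]},'
    ' "Breast Cancer": {"online_repositories": ["DBGaP"]}}'
)

# The same key twice in ONE object: Python's json module keeps only the last value.
duplicated = json.loads(
    '{"online_repositories": ["TCGA"], "online_repositories": ["DBGaP"]}'
)

print(separate)
print(duplicated)
```

This is why keys should be kept unique within their parent object: a repeated key does not raise an error here, it silently loses data.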

Key Takeaways: JSON (the library described above) is a format that can be used to store and organize data (the books in the library), but JSON itself is not the data. JSON has objects, denoted by {}, whose keys must be unique within the object and are used to locate data. JSON also has arrays, denoted by [], whose items do not need to be unique, because the contents of an array are identified by their position and not their names. JSON can have arrays within objects or vice-versa; this is called nesting, or parent-child relationships.
