@epifanio
Created November 14, 2019 12:45
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pre-processing\n",
"\n",
"## Manual changes\n",
"\n",
"1. Recalculate percent and density values (convert to counts)\n",
"2. Find cases where two taxa were entered in the same cell because one occurred on top of the other. Create an additional row for the second organism and either copy the abundance or set it to 1 (still undecided). I will assume this is not necessary for now.\n",
"3. Type in “Lophelia pertusa” where it was entered only as a comment or substrate\n",
"\n",
"## Automated changes\n",
"4. Remove all special characters and unwanted strings:\n",
"```\n",
"\";\", \".\", \"-\", \":\", \"(\", \")\", \"cm\", \"juvenile\", \"juv\", \"m2\", \"%\", \"percent\", trailing \"D\"\n",
"```\n",
"5. Replace \"?\" with \"cf\"\n",
"6. Create a list of unique taxon names for the dataset at hand\n"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Q: Should I already run a WoRMS validation at this point and get rid of some names?\n",
"A: No, I can't because of the \"cf\"s (if something is typed correctly except for the \"cf\", it will show up as an exact match)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleanup of custom names\n",
"\n",
"## Undecided identifications\n",
"\n",
"These mostly comprise all names containing the symbol \"/\" (undecided between more than one option, e.g. \"Tunicata/Porifera\"). It is important that this symbol is not removed in the previous step!\n",
"\n",
"It is hard to grab all (and only) the undecided identifications. Some of the problems found include:\n",
"* A common typo where the digit \"7\" is typed in place of the \"/\"\n",
"\n",
"A subset of the undecided identifications is (1) recurrent and (2) precise enough to give information at near-species level. These can be checked against a list (e.g. \"Lycodonus/Lycenchelis\", etc.). I refer to these as \"limited sets of indistinguishable species\".\n",
"\n",
"The rest are neither recurrent nor necessarily precise, and they can only be compared among themselves (via a similarity matrix) to achieve consistency (e.g., \"Tunicata/Porifera\" = \"Porifera/Tunicata\").\n",
"\n",
"## Unsure identifications\n",
"\n",
"These mostly comprise all names containing \"cf\". Unfortunately it is hard to grab all (and only) the unsure identifications as well. Some of the problems found include:\n",
"* The use of \"cf\" for names awaiting confirmation from an expert, which may later get changed to an official name\n",
"\n",
"They need to be compared among themselves (via a similarity matrix) to achieve consistency.\n",
"\n",
"## Morphospecies\n",
"\n",
"\"Morphospecies\" are taxon names devised for internal use only, for taxa that can be unambiguously identified and equated to species but whose official name is not known (e.g. \"Corymorpha sand stolon\").\n",
"\n",
"There is no way of grabbing these names, so the complete list of unique names has to be used.\n",
"\n",
"This can be compared against a list of morphospecies.\n",
"\n",
"## Provisional names\n",
"\n",
"Names awaiting confirmation, e.g. \"Porifera cf Grantia compressa\".\n",
"\n",
"There is no way of grabbing these names alone, so the complete list of unique names has to be used.\n",
"\n",
"This can be compared against a list of morphospecies.\n",
" \n",
"## Nicknames\n",
"\n",
"### \"Lazy\" names\n",
"\n",
"### Norwegian names\n",
"\n",
"# Cleanup of official names: WoRMS validation\n",
"\n",
"## Pre-cleaning\n",
"Remove color descriptors from taxon names where the color does not indicate another species (Henricia, Paragorgia)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## WoRMS Validation\n",
"\n",
"2. Check against WoRMS. Include the following fields:\n",
"```\n",
"Aphia ID\n",
"Rank\n",
"Valid name\n",
"Type of match\n",
"Number of matches\n",
"```\n",
"3. Review the results. Some matches seem to work but are wrong: Axinella sp, acesta (this should be taken care of if lazy species are handled prior to this). These fuzzy matches did not work: lophius sp., physis, stenophora.\n",
"\n",
"# If interested in calculating alpha-diversity...\n",
"\n",
"To calculate measures of alpha diversity, names need only be consistent within a station. The steps required for this are below.\n",
"\n",
"## Calculate similarity matrix\n",
" \n",
"5. Make a similarity matrix of the remaining names and find groups of similar names\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 2
}
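
The notebook above only outlines the automated steps, so here is a minimal Python sketch of how they might look, under several assumptions. The function names (`clean_name`, `canonical_undecided`, `similarity_matrix`, `similar_groups`) and the 0.85 threshold are hypothetical and do not come from the notebook. `difflib.SequenceMatcher` stands in for whatever similarity measure is eventually chosen. The trailing "D" is read here as a standalone capital D at the end of an entry, which may not be what the notes intend. The "/" is deliberately kept so that undecided identifications survive the cleanup.

```python
import re
from difflib import SequenceMatcher

# Characters and strings the notes list for removal (assumed to be complete).
SPECIAL_CHARS = ";.-:()%"
UNWANTED = re.compile(
    r"(?<![A-Za-z])(?:juvenile|percent|juv|cm|m2)(?![A-Za-z])", re.IGNORECASE
)


def clean_name(raw):
    """Apply the automated changes to one raw taxon entry."""
    name = raw.replace("?", " cf ")        # replace "?" with "cf"
    for ch in SPECIAL_CHARS:               # drop special characters, keep "/"
        name = name.replace(ch, " ")
    name = UNWANTED.sub(" ", name)         # drop unwanted strings
    name = " ".join(name.split())          # collapse whitespace
    return re.sub(r" D$", "", name)        # assumed meaning of trailing "D"


def canonical_undecided(name):
    """Order the options of an undecided name so A/B and B/A compare equal."""
    parts = [part.strip() for part in name.split("/")]
    return "/".join(sorted(parts, key=str.lower))


def similarity_matrix(names):
    """Pairwise similarity ratios (0 to 1) between all names."""
    lowered = [n.lower() for n in names]
    return [[SequenceMatcher(None, a, b).ratio() for b in lowered] for a in lowered]


def similar_groups(names, threshold=0.85):
    """Greedily group names whose similarity to a group member meets the threshold."""
    matrix = similarity_matrix(names)
    groups = []  # each group is a list of indices into names
    for i in range(len(names)):
        for group in groups:
            if any(matrix[i][j] >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    # Only groups with more than one member need manual review.
    return [[names[i] for i in g] for g in groups if len(g) > 1]


if __name__ == "__main__":
    raw_entries = ["Lophelia pertusa (juv.)", "Axinella sp?", "Tunicata/Porifera"]
    unique_names = sorted({clean_name(entry) for entry in raw_entries})
    print(unique_names)
    print(similar_groups(["Lophelia pertusa", "Lophelia pertusaa", "Henricia"]))
```

The greedy grouping depends on input order and is only meant to flag candidates for manual review. The "7"-for-"/" typo mentioned in the notes is not handled here.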