@cdw
Created July 11, 2017 23:02
20170525_metadata_gathering/Metadata merging .ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Gathering metadata\n\nIt would be nice to have all the information on microscopes and cell lines and passages &t, all in one location. Let's see if we can do that. "
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nimport pandas as pd",
"execution_count": 34,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The expectation is that this next step will take quite a bit of time, as we are slooooowly reading a 3.9GB file off the samba share."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "df_ad = pd.read_csv('../20170517_alldat_to_kde/alldat.csv')\ndf_pd = pd.read_csv('./PipelineData_Celigo_WithAllHamiltonMetadataJoined.csv', nrows=200)",
"execution_count": 35,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The main new metadata we have here is from the \"PipelineData_Celigo_WithAllHamiltonMetadataJoined.csv\" file. Let's take a look at the columns it contains:"
},
{
"metadata": {
"collapsed": false,
"scrolled": false,
"trusted": true
},
"cell_type": "code",
"source": "_ = [print(\"%03i: %s\"%(i,n)) for i,n in enumerate(df_pd.columns[:])]",
"execution_count": 36,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "000: Unnamed: 0\n001: PipelineID\n002: ExperimentID\n003: PlateID\n004: PlateLayoutID\n005: SplitDate(Pipeline Exp label)\n006: Pipeline_DescriptionPipeline\n007: Pipeline_DescriptionExperiment\n008: Pipeline_WebLink\n009: ChosenForPlating\n010: Pipeline_ImportDate\n011: Pipeline_SourceFile\n012: PlateLayout_column_id\n013: PlateLayout_row_id\n014: WellID\n015: contents\n016: PlateLayout_concentration\n017: CellLine\n018: clone_id\n019: PlateLayout_well_type\n020: PlateLayout_dyes_applied\n021: PlateLayout_ImportDate\n022: PlateLayout_SourceFile\n023: SourcePlates_Unnamed: 0\n024: SourcePlateID\n025: SourcePlates_Description\n026: SourcePlates_PlateType\n027: SourcePlates_Maintainer\n028: SourcePlates_CoatingType\n029: SourcePlates_CoatingLot\n030: SourcePlates_CoatingDate\n031: SourcePlates_CoatingMethod\n032: CellBatch\n033: SourcePlates_SourceVial\n034: SourcePlates_SourcePlate_0\n035: SourcePlates_CellCount (Vicell)\n036: SourcePlates_PassageNumber\n037: SourcePlates_SeedingDensity\n038: SourcePlates_CellViability\n039: SourcePlates_SeedDate\n040: SourcePlates_FeedDate\n041: SourcePlates_Score\n042: SourcePlates_ScoreBy\n043: SourcePlates_ScoreComment\n044: SourcePlates_Notes\n045: SourcePlates_ImportDate\n046: SourcePlates_SourceFile\n047: SourcePlates_SourcePlate_1\n048: SourcePlates_SourcePlate_2\n049: SourcePlates_SourcePlate_3\n050: SourcePlates_SourcePlate_4\n051: SourcePlates_SourcePlate_5\n052: SourcePlates_SourcePlate_6\n053: SourcePlates_SourcePlate_7\n054: SourcePlates_SourcePlate_8\n055: SourcePlates_ImportDate.1\n056: SourcePlates_SourceFile.1\n057: Seeding_ExperimentNum\n058: Seeding_ImportDate\n059: Seeding_ProcessingDate\n060: Seeding_SourceFile\n061: Seeding_bead_LotNum\n062: Seeding_contLine_1\n063: Seeding_contLine_1_cloneID\n064: Seeding_contLine_1_conc\n065: Seeding_contLine_2\n066: Seeding_contLine_2_cloneID\n067: Seeding_contLine_2_conc\n068: Seeding_expLine_1\n069: Seeding_expLine_1_cloneID\n070: Seeding_expLine_1_conc\n071: 
Seeding_mTesr_RI_LotNum\n072: Seeding_runDateTime\n073: Feeding_mTesr_PlusLot\n074: Feeding_mTesr_MinusLot\n075: Feeding_feed\n076: Feeding_scan\n077: Feeding_ProcessingDate\n078: Feeding_ImportDate\n079: Feeding_SourceFile\n080: Matrigel_MatrigelLotNumber\n081: Matrigel_ProcessingDate\n082: Matrigel_ImportDate\n083: Matrigel_SourceFile\n084: Celigo_Unnamed: 0\n085: Celigo_Image_ImageNumber\n086: Celigo_Colony_ObjectNumber\n087: Celigo_Image_Metadata_Scene\n088: Celigo_Image_Metadata_WellImageNotTiled_QCFlag\n089: Celigo_Image_Metadata_ZHeight\n090: Celigo_Image_Metadata_ch_0_out_dir\n091: Celigo_Image_Metadata_file_out_dir\n092: Celigo_Image_Metadata_t\n093: Celigo_Colony_AreaShape_Area\n094: Celigo_Colony_AreaShape_Center_X\n095: Celigo_Colony_AreaShape_Center_Y\n096: Celigo_Colony_AreaShape_Compactness\n097: Celigo_Colony_AreaShape_Eccentricity\n098: Celigo_Colony_AreaShape_EulerNumber\n099: Celigo_Colony_AreaShape_Extent\n100: Celigo_Colony_AreaShape_FormFactor\n101: Celigo_Colony_AreaShape_MajorAxisLength\n102: Celigo_Colony_AreaShape_MaxFeretDiameter\n103: Celigo_Colony_AreaShape_MaximumRadius\n104: Celigo_Colony_AreaShape_MeanRadius\n105: Celigo_Colony_AreaShape_MedianRadius\n106: Celigo_Colony_AreaShape_MinFeretDiameter\n107: Celigo_Colony_AreaShape_MinorAxisLength\n108: Celigo_Colony_AreaShape_Orientation\n109: Celigo_Colony_AreaShape_Perimeter\n110: Celigo_Colony_AreaShape_Solidity\n111: Celigo_Colony_Location_Center_X\n112: Celigo_Colony_Location_Center_Y\n113: Celigo_Colony_Math_Colony_Area_um2\n114: Celigo_Colony_Math_Colony_Location_um_X\n115: Celigo_Colony_Math_Colony_Location_um_Y\n116: Celigo_Colony_Math_Colony_MajorAxisLength_um\n117: Celigo_Colony_Math_Colony_MinorAxisLength_um\n118: Celigo_Colony_Number_Object_Number\n119: Celigo_Colony_Parent_AllColony\n120: Celigo_Colony_Texture_AngularSecondMoment_ResizedRaw_16_0\n121: Celigo_Colony_Texture_AngularSecondMoment_ResizedRaw_16_135\n122: 
Celigo_Colony_Texture_AngularSecondMoment_ResizedRaw_16_45\n123: Celigo_Colony_Texture_AngularSecondMoment_ResizedRaw_16_90\n124: Celigo_Colony_Texture_Contrast_ResizedRaw_16_0\n125: Celigo_Colony_Texture_Contrast_ResizedRaw_16_135\n126: Celigo_Colony_Texture_Contrast_ResizedRaw_16_45\n127: Celigo_Colony_Texture_Contrast_ResizedRaw_16_90\n128: Celigo_Colony_Texture_Correlation_ResizedRaw_16_0\n129: Celigo_Colony_Texture_Correlation_ResizedRaw_16_135\n130: Celigo_Colony_Texture_Correlation_ResizedRaw_16_45\n131: Celigo_Colony_Texture_Correlation_ResizedRaw_16_90\n132: Celigo_Colony_Texture_DifferenceEntropy_ResizedRaw_16_0\n133: Celigo_Colony_Texture_DifferenceEntropy_ResizedRaw_16_135\n134: Celigo_Colony_Texture_DifferenceEntropy_ResizedRaw_16_45\n135: Celigo_Colony_Texture_DifferenceEntropy_ResizedRaw_16_90\n136: Celigo_Colony_Texture_DifferenceVariance_ResizedRaw_16_0\n137: Celigo_Colony_Texture_DifferenceVariance_ResizedRaw_16_135\n138: Celigo_Colony_Texture_DifferenceVariance_ResizedRaw_16_45\n139: Celigo_Colony_Texture_DifferenceVariance_ResizedRaw_16_90\n140: Celigo_Colony_Texture_Entropy_ResizedRaw_16_0\n141: Celigo_Colony_Texture_Entropy_ResizedRaw_16_135\n142: Celigo_Colony_Texture_Entropy_ResizedRaw_16_45\n143: Celigo_Colony_Texture_Entropy_ResizedRaw_16_90\n144: Celigo_Colony_Texture_InfoMeas1_ResizedRaw_16_0\n145: Celigo_Colony_Texture_InfoMeas1_ResizedRaw_16_135\n146: Celigo_Colony_Texture_InfoMeas1_ResizedRaw_16_45\n147: Celigo_Colony_Texture_InfoMeas1_ResizedRaw_16_90\n148: Celigo_Colony_Texture_InfoMeas2_ResizedRaw_16_0\n149: Celigo_Colony_Texture_InfoMeas2_ResizedRaw_16_135\n150: Celigo_Colony_Texture_InfoMeas2_ResizedRaw_16_45\n151: Celigo_Colony_Texture_InfoMeas2_ResizedRaw_16_90\n152: Celigo_Colony_Texture_InverseDifferenceMoment_ResizedRaw_16_0\n153: Celigo_Colony_Texture_InverseDifferenceMoment_ResizedRaw_16_135\n154: Celigo_Colony_Texture_InverseDifferenceMoment_ResizedRaw_16_45\n155: 
Celigo_Colony_Texture_InverseDifferenceMoment_ResizedRaw_16_90\n156: Celigo_Colony_Texture_SumAverage_ResizedRaw_16_0\n157: Celigo_Colony_Texture_SumAverage_ResizedRaw_16_135\n158: Celigo_Colony_Texture_SumAverage_ResizedRaw_16_45\n159: Celigo_Colony_Texture_SumAverage_ResizedRaw_16_90\n160: Celigo_Colony_Texture_SumEntropy_ResizedRaw_16_0\n161: Celigo_Colony_Texture_SumEntropy_ResizedRaw_16_135\n162: Celigo_Colony_Texture_SumEntropy_ResizedRaw_16_45\n163: Celigo_Colony_Texture_SumEntropy_ResizedRaw_16_90\n164: Celigo_Colony_Texture_SumVariance_ResizedRaw_16_0\n165: Celigo_Colony_Texture_SumVariance_ResizedRaw_16_135\n166: Celigo_Colony_Texture_SumVariance_ResizedRaw_16_45\n167: Celigo_Colony_Texture_SumVariance_ResizedRaw_16_90\n168: Celigo_Colony_Texture_Variance_ResizedRaw_16_0\n169: Celigo_Colony_Texture_Variance_ResizedRaw_16_135\n170: Celigo_Colony_Texture_Variance_ResizedRaw_16_45\n171: Celigo_Colony_Texture_Variance_ResizedRaw_16_90\n172: Celigo_Well_AreaShape_Area\n173: Celigo_Well_AreaShape_Center_X\n174: Celigo_Well_AreaShape_Center_Y\n175: Celigo_Well_AreaShape_Compactness\n176: Celigo_Well_AreaShape_Eccentricity\n177: Celigo_Well_AreaShape_EulerNumber\n178: Celigo_Well_AreaShape_Extent\n179: Celigo_Well_AreaShape_FormFactor\n180: Celigo_Well_AreaShape_MajorAxisLength\n181: Celigo_Well_AreaShape_MaxFeretDiameter\n182: Celigo_Well_AreaShape_MaximumRadius\n183: Celigo_Well_AreaShape_MeanRadius\n184: Celigo_Well_AreaShape_MedianRadius\n185: Celigo_Well_AreaShape_MinFeretDiameter\n186: Celigo_Well_AreaShape_MinorAxisLength\n187: Celigo_Well_AreaShape_Orientation\n188: Celigo_Well_AreaShape_Perimeter\n189: Celigo_Well_AreaShape_Solidity\n190: Celigo_Well_Location_Center_X\n191: Celigo_Well_Location_Center_Y\n192: Celigo_Well_Math_Well_Location_um_X\n193: Celigo_Well_Math_Well_Location_um_Y\n194: Celigo_Well_Math_Well_MajorAxisLength_um\n195: Celigo_Well_Math_Well_MinorAxisLength_um\n196: Celigo_Well_Number_Object_Number\n197: 
Celigo_Colony_Math_Colony_LocationRelativeToCenter_um_X\n198: Celigo_Colony_Math_Colony_LocationRelativeToCenter_um_Y\n199: Celigo_Image_Metadata_WellRow\n200: Celigo_Image_Metadata_WellColumn\n201: Celigo_ImportDate\n202: Celigo_SourceFile\n203: Celigo_ScanDate\n204: Celigo_ImagesOriginal\n205: Celigo_ImagesProcessed\n206: Celigo_ImportDate.1\n207: Celigo_SourceFile.1\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "If we are to stand a chance of connecting this to the existing \"alldat.csv\" data we'll need a linking element. Fortunately such is provided by our filenames in alldat, or at least some of them. Let's take a look at what we have there:"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "[fn.split('/')[-1].split('.')[0] for fn in df_ad.save_paths_2D_flat[::300].values]",
"execution_count": 37,
"outputs": [
{
"data": {
"text/plain": "['20160705_I01_001_2',\n '20160705_I01_051_1',\n '20160705_S03_040_5',\n '20160708_I01_015_5',\n '20160711_C01_037_15',\n '20160719_S01_040_8',\n '20160929_I01_025_8',\n '20161220_C01_030_4',\n '20170117_I01_036_4',\n '20170203_C01_037_3',\n '3500000418_100X_20170117_F06_P16_1',\n '3500000490_100X_20170127_E07_P27_2',\n '20161209_C01_016_1',\n '20161216_C01_059_4',\n '20161216_I03_036_1',\n '20161219_S01_033_5',\n '20170124_C07_031_10',\n '20170207_C01_049_2',\n '3500000497_100X_20170130_E05_P06_1',\n '3500000552_100X_20170207_F06_P30_4',\n '3500000583_100X_20170213_E05_P07_1']"
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now I'm only showing a small subset of the filenames here, but what we can see is that there are roughly two classes:\n \n1. 20160719_S01_040_8\n2. 3500000490_100X_20170127_E07_P27_2\n\nClass one is data collected by assay development and class two is data collected by microscopy. The metadata file that we have is from microscopy and so we will only be able to associate it with cells from class two. How many of each class do we have?\n\nLet's use the fact that the microscopy names have the magnification listed as MAGX_."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "ismicroscopy = lambda fn: fn.count('X_')==1\nn_microscopy = np.count_nonzero([ismicroscopy(fn) for fn in df_ad.save_paths_2D_flat.values])\nn_assaydev = np.count_nonzero([not ismicroscopy(fn) for fn in df_ad.save_paths_2D_flat.values])\nprint(\"There are %i images from microscopy and %i images from assay dev\"%(n_microscopy, n_assaydev))",
"execution_count": 41,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "There are 1579 images from microscopy and 4498 images from assay dev\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "So that is how many images we have, but how can we connect them to metadata? First we need to parse them and that involves knowing how to read the filename. Here is the breakdown:\n\nBBBBBBBBBB_NNNN_DDDDDDDD_WWW_PPP_C\n\n- BBBBBBBBBB: the barcode of the plate used, allowing us to link the image to the treatment the plate received\n- NNNN: the magnification the images were taken at, 100X in all cases in alldata\n- DDDDDDDD: the date the image was taken on\n- WWW: the well number within the plate, denoted with a letter and a number ala:\n```\n 01 02 03 04 05 06 07 08 09 10 11 12\n A A01 A02 ... A12\n B B01 B02 ... B12\n C . .\n D . .\n E . . \n F\n G\n H H01 H02 ... H12\n```\n- PPP: position within well image series, how many z-stacks into this single well we are\n- C: the cell number within this single z-stack after segmentation\n\nLet's document this in a function"
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "def fn_to_info(fn):\n \"\"\"convert a BBBBBBBBBB_NNNN_DDDDDDDD_WWW_PPP_C formatted name into a dict\"\"\"\n plate_barcode, mag, date, well, position, cell = fn.split('_')\n plate_barcode = int(plate_barcode)\n mag = int(mag.strip(\"X\"))\n date = date #keep as a string, don't want a datetime object here\n well = well #keep as a string, letter prefix is significant\n position = int(position.strip(\"P\"))\n cell = int(cell)\n return plate_barcode, mag, date, well, position, cell ",
"execution_count": 42,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now let's figure out which of the bits of info the filename gives us match up to the big metadata file. From experimentation:\n\n- PlateID -> plate_barcode"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "df_pd.",
"execution_count": 44,
"outputs": [
{
"data": {
"text/plain": "0 L3500000259\n1 L3500000259\n2 L3500000259\n3 L3500000259\n4 L3500000259\n5 L3500000259\n6 L3500000259\n7 L3500000259\n8 L3500000259\n9 L3500000259\n10 L3500000259\n11 L3500000259\n12 L3500000259\n13 L3500000259\n14 L3500000259\n15 L3500000259\n16 L3500000259\n17 L3500000259\n18 L3500000259\n19 L3500000259\n20 L3500000259\n21 L3500000259\n22 L3500000259\n23 L3500000259\n24 L3500000259\n25 L3500000259\n26 L3500000259\n27 L3500000259\n28 L3500000259\n29 L3500000259\n ... \n170 L3500000259\n171 L3500000259\n172 L3500000259\n173 L3500000259\n174 L3500000259\n175 L3500000259\n176 L3500000259\n177 L3500000259\n178 L3500000259\n179 L3500000259\n180 L3500000259\n181 L3500000259\n182 L3500000259\n183 L3500000259\n184 L3500000259\n185 L3500000259\n186 L3500000259\n187 L3500000259\n188 L3500000259\n189 L3500000259\n190 L3500000259\n191 L3500000259\n192 L3500000259\n193 L3500000259\n194 L3500000259\n195 L3500000259\n196 L3500000259\n197 L3500000259\n198 L3500000259\n199 L3500000259\nName: PlateLayoutID, Length: 200, dtype: object"
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "df_ad.save_path_cell.values",
"execution_count": 48,
"outputs": [
{
"data": {
"text/plain": "array([ '/Volumes/aics///Modeling/gregj/projects/release_4_1_17/results//aligned/3D/Tom20/20160705_I01_001_2.ome.tif_cell.tif',\n '/Volumes/aics///Modeling/gregj/projects/release_4_1_17/results//aligned/3D/Tom20/20160705_I01_001_3.ome.tif_cell.tif',\n '/Volumes/aics///Modeling/gregj/projects/release_4_1_17/results//aligned/3D/Tom20/20160705_I01_001_4.ome.tif_cell.tif',\n ...,\n '/Volumes/aics///Modeling/gregj/projects/release_4_1_17/results//aligned/3D/Sec61 beta/3500000583_100X_20170213_F08_P18_2.ome.tif_cell.tif',\n '/Volumes/aics///Modeling/gregj/projects/release_4_1_17/results//aligned/3D/Sec61 beta/3500000583_100X_20170213_F08_P19_1.ome.tif_cell.tif',\n '/Volumes/aics///Modeling/gregj/projects/release_4_1_17/results//aligned/3D/Sec61 beta/3500000583_100X_20170213_F08_P19_2.ome.tif_cell.tif'], dtype=object)"
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.0",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "20170525_metadata_gathering/Metadata merging .ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}