@kny5
Last active December 2, 2022 05:57

Author: Antonio de Jesus Anaya Hernandez, DevOps eng. for the IoPA
Author: The Internet of Production Alliance, 2022
Data was collected by "Field Ready".
The Open Know Where (OKW) Standard is part of the Internet of Production Alliance and its members.
License: CC BY SA

OKW data review

Table of contents:

  1. Introduction
  2. OKW Data Schema
  3. Review of ODK as data capture tool
  4. Ideas for the self-submission data collection tool

Introduction

The handoff of the ODK platform and the OKW data collection has been insightful. The purpose of this material is to point out where the data was most difficult to process, and to use this information to design and deploy improved data capture tools.

OKW Data Schema

Manufacturing facility:

  • Name
  • Location <-- Automation of location by reverse geocoding GPS [...
    • Address
      • Number
      • Street
      • District
      • City
      • Region
      • Country
      • Postcode
    • GPS Coordinates
    • What 3 words address
      • Address
      • Language
    • Directions <-- Automation of location by reverse geocoding GPS ...]
  • Owner
  • Contact
  • Affiliation
  • Facility status
    • Active
    • Planned
    • Temporarily closed
    • Closed
  • Opening hours <-- Input needs to be formatted and split into opening and closing times.
  • Description
  • Date founded
  • Access type
    • Restricted
    • Restricted with public hours
    • Shared space
    • Public
    • Membership
  • Wheelchair accessibility
  • Equipment <-- [3] Automation get wiki URL by searching keyword in wikidata API
  • Human capacity
    • Headcount
    • Maker
  • Manufacturing processes <-- [1] Automation
    • Uses Wikipedia
  • Typical batch size
    • 0-50
    • 50-500
    • 500-5000
    • 5000+
  • Size-floorsize
  • Typical materials <-- [2] Automation, scan barcode or QR code when possible.
    • Material
      • Material type
      • Manufacturer
      • Brand
      • Supplier location
      • Defined material type
        • Uses list of defined materials
      • Material classification
        • Uses wikipedia
  • Storage capacity
  • Certifications
  • Backup generator
  • Uninterrupted power Supply
  • Road access
  • Loading dock
  • Maintenance schedule
  • Typical products <-- [4] Automation, use product description to get materials
  • Partner / Funder
  • Customer Reviews
  • Innovation Space Properties
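
Automation [3] in the list above (finding a wiki URL for a piece of equipment from a keyword) could look roughly like the sketch below. It assumes the public Wikidata `wbsearchentities` API action. The helper names `build_search_url`, `first_entity_url` and `lookup`, and the User-Agent string, are hypothetical and not part of any existing OKW tooling.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(keyword, language="en"):
    """Build a wbsearchentities request URL for an equipment keyword."""
    params = {
        "action": "wbsearchentities",
        "search": keyword,
        "language": language,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def first_entity_url(response):
    """Return the Wikidata page URL of the first search hit, or None."""
    hits = response.get("search", [])
    if not hits:
        return None
    return "https://www.wikidata.org/wiki/" + hits[0]["id"]


def lookup(keyword):
    """Query the API over the network (hypothetical User-Agent value)."""
    request = Request(
        build_search_url(keyword),
        headers={"User-Agent": "okw-data-review-sketch/0.1"},
    )
    with urlopen(request) as reply:
        return first_entity_url(json.load(reply))
```

This returns the Wikidata entity page, not a Wikipedia article. Getting the Wikipedia URL would need an extra lookup of the entity's sitelinks, and the first hit should still be confirmed by the person capturing the data.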

Agent

  • Name
  • Location
  • Contact person
  • Bio
  • Contact
    • Landline
    • Mobile
    • Fax
    • Email
    • Whatsapp
  • Website
  • Social media
    • Facebook
    • Twitter
    • Instagram
    • Other URLs
  • Languages
  • Mailing list
  • Images/media

Review of ODK as data capture tool

Input

Free-text inputs have caused many issues while capturing the information: the more free-text inputs we have, the more room we leave for human error. The columns that need restricted inputs are:

  • opening time, clock input
  • closing time, clock input
  • manufacturing processes, list input
  • material, needs an optional barcode scanner
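
Until clock inputs are in place, the existing free-text hours would have to be validated and split after capture. The sketch below assumes a 24-hour `HH:MM-HH:MM` format and same-day hours (no closing after midnight). Both are assumptions, and the function names are hypothetical.

```python
from datetime import time


def parse_clock(value):
    """Parse 'HH:MM' (24-hour) into a datetime.time; raises ValueError if invalid."""
    hours, minutes = value.strip().split(":")
    return time(int(hours), int(minutes))


def split_opening_hours(raw):
    """Split a string like '08:00-17:30' into (opening, closing) times."""
    opening, closing = raw.split("-")
    opening_time, closing_time = parse_clock(opening), parse_clock(closing)
    if closing_time <= opening_time:
        # Assumption: facilities do not stay open past midnight.
        raise ValueError("closing time must be after opening time")
    return opening_time, closing_time
```

Anything that does not match the expected format raises `ValueError`, so malformed entries can be collected for manual review instead of being stored as-is.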

The agent info needs to be stored in a separate table. In the ODK platform it was captured again with every entry, which made the input process tedious.
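
A minimal sketch of that separation, using an in-memory SQLite database. The table and column names are illustrative assumptions and not the OKW schema; using the email address to recognise a returning agent is also an assumption.

```python
import sqlite3

SCHEMA = """
CREATE TABLE agent (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT UNIQUE
);
CREATE TABLE facility (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    agent_id INTEGER NOT NULL REFERENCES agent(id)
);
"""


def get_or_create_agent(conn, name, email):
    """Return the id of the agent with this email, inserting it only once."""
    row = conn.execute("SELECT id FROM agent WHERE email = ?", (email,)).fetchone()
    if row:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO agent (name, email) VALUES (?, ?)", (name, email)
    )
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# The agent is entered once and then referenced by every facility record.
agent_id = get_or_create_agent(conn, "Example Agent", "agent@example.org")
for facility_name in ("Workshop A", "Workshop B"):
    conn.execute(
        "INSERT INTO facility (name, agent_id) VALUES (?, ?)",
        (facility_name, agent_id),
    )
```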

One possibility for the social media and images fields is to add a hashtag to photos of the manufacturing facilities and use it as an external image storage database.

Automation

The GPS location could be used to reverse geocode the address, saving time in the capture process.
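
As a rough illustration, the sketch below builds a reverse geocoding request and maps the reply onto the OKW address fields. It assumes the OpenStreetMap Nominatim `/reverse` endpoint as the geocoder, which the review itself does not name. The `FIELD_MAP` choices and function names are hypothetical.

```python
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

# Assumed mapping from OKW address fields to Nominatim address keys,
# tried in order until one is present in the reply.
FIELD_MAP = {
    "number": ("house_number",),
    "street": ("road",),
    "district": ("suburb", "city_district"),
    "city": ("city", "town", "village"),
    "region": ("state",),
    "country": ("country",),
    "postcode": ("postcode",),
}


def build_reverse_url(lat, lon):
    """Build the reverse geocoding request URL for a GPS position."""
    query = urlencode({"lat": lat, "lon": lon, "format": "jsonv2"})
    return NOMINATIM_REVERSE + "?" + query


def to_okw_address(response):
    """Map a Nominatim-style JSON reply onto the OKW address fields."""
    address = response.get("address", {})
    return {
        okw_field: next((address[key] for key in keys if key in address), None)
        for okw_field, keys in FIELD_MAP.items()
    }
```

Fields the geocoder cannot resolve stay `None`, so the person capturing the data only has to fill in the gaps. The public Nominatim instance has a usage policy, so bulk geocoding would likely need a different arrangement.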

Information such as equipment, materials, and manufacturing processes is linked, so partial information about one could be used to derive generic data for the others. A defined table of processes could be useful for data capture, but machines, materials, and processes should also be considered part of the OKH standard and comply with its specifications.

Quality of the data captured

Some randomly generated entries were found in the database. Duplicate entries and densely captured areas were also detected using OpenRefine's facets and filtering tools. The following list shows, for each column, the number of entries with blank input (no data captured):

  • Status 6885
  • ReviewState 6879
  • Electronics_type 6868
  • Ceramics_type 6860
  • size_floor_size_other 6855
  • typical_batch_size_other 6852
  • storage_capacity_other 6844
  • Elastomer_s_type 6809
  • country_other 6805
  • Plastic_s_type 6784
  • if_Other_please_specify 6628
  • number 6277
  • social_twitter 6043
  • email 6024
  • social_insta 6014
  • Wood_type 5929
  • social_fb 5499
  • Please_provide_the_serial_number 5485
  • email_001 5465
  • postcode 5180
  • Please_Specify_the_e_nufacturing_facility 5115
  • Others 4371
  • Metal_s_type 3965
  • affiliation 3906
  • partner_funder 3786
  • Wikipedia_URL_of_the_machine 3456
  • Wikipedia_URL_of_the_anufacturing_process 3224
  • enumerator_other 3030
  • Please_provide_the_model 1829
  • certifications 1783
  • Manufacturing_process_001 1106
  • maintenance_schedule 1003
  • address 920
  • How_many_are_there 680
  • date_founded 459
  • owner 459
  • human_capacity-maker 424
  • The_equipment_available_for_us 202
  • human_capacity-headcount 178
  • Materials_used 155
  • Please_specify_the_t_uced_by_the_facility 137
  • Working_hours 115
  • typical_batch_size 74
  • size_floor_size 68
  • working_days 67
  • storage_capacity 65
  • loading_dock 51
  • Type 36
  • uninterrupted_power_supply 18
  • name 11
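
The counts above came from OpenRefine. A comparable per-column count could be reproduced from a CSV export with a short script. This is a minimal sketch on made-up sample rows; treating whitespace-only cells as blank is an assumption.

```python
import csv
import io
from collections import Counter


def count_blanks(rows):
    """Count empty or whitespace-only cells per column."""
    blanks = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or not value.strip():
                blanks[column] += 1
    return blanks


# Made-up sample standing in for the exported database.
sample_csv = io.StringIO(
    "name,email,postcode\n"
    "Workshop A,,\n"
    "Workshop B,b@example.org,\n"
    ",,\n"
)
blank_counts = count_blanks(csv.DictReader(sample_csv))

# List columns from most to least blank, as in the list above.
for column, count in blank_counts.most_common():
    print(column, count)
```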

The following columns are critical for the usability of the data, and leaving them incomplete compromises the validity of the database entries:

  • Status 6885
  • Please_Specify_the_e_nufacturing_facility 5115
  • Manufacturing_process_001 1106

Bangladesh case

In Bangladesh especially, randomly generated inputs were found in the address column, and the contact column showed a consistent pattern of repeated phone numbers. This indicates that some scripting was used to generate random inputs, which compromises trust in the data.

Filter "//": randomly generated inputs in the address column

Filtering the address column for the string "//" shows the randomly generated inputs:

  • bangladesh 715
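
Both signals described for Bangladesh (the "//" marker in the address and the repeated phone numbers) can be checked in one pass. The sketch below assumes records are dictionaries with `address` and `contact` keys. The threshold `max_phone_repeats` is a hypothetical parameter and not a value used in the review.

```python
from collections import Counter


def flag_suspicious(records, max_phone_repeats=1):
    """Return records with '//' in the address or an over-used phone number."""
    phone_counts = Counter(r["contact"] for r in records if r["contact"])
    flagged = []
    for record in records:
        has_marker = "//" in record["address"]
        # Blank contacts are never counted, so they are never flagged here.
        repeated_phone = phone_counts[record["contact"]] > max_phone_repeats
        if has_marker or repeated_phone:
            flagged.append(record)
    return flagged
```

Flagged records are candidates for review, not proof of fabrication: a shared phone number can be legitimate when one person runs several facilities.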

Duplicates in lat-long in Bangladesh

More than one record per location:

  • lat-long true 445

Discarding data using facets

Facet duplicates in column name

Filtering for duplicates found 1974 entries, across all countries, that share a name with another entry:

  • false 4911
  • true 1974

Facet by Country:

Duplicated name entries by country:

  • Kenya 1099
  • bangladesh 488
  • Uganda 363
  • iraq 10

by lat-long:

More than one record per GPS location:

  • false 339
  • true 1635

Filter duplicate in name and lat-long

  • Kenya 948
  • bangladesh 447
  • Uganda 219
  • other 11
  • iraq 10
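
The facet sequence above (duplicate name, then duplicate lat-long, then a count per country) can be approximated outside OpenRefine. In this sketch every record whose value occurs more than once is flagged `True`, which is how the counts above read. The key names `name`, `lat`, `long` and `country` are assumed, and so is the case and whitespace normalisation of names, which may flag more rows than an exact-match facet.

```python
from collections import Counter


def duplicate_flags(records, key):
    """True for every record whose key value occurs more than once."""
    counts = Counter(key(r) for r in records)
    return [counts[key(r)] > 1 for r in records]


def duplicates_by_country(records):
    """Count records duplicated in both name and lat-long, per country."""
    by_name = duplicate_flags(records, lambda r: r["name"].strip().lower())
    by_location = duplicate_flags(records, lambda r: (r["lat"], r["long"]))
    tally = Counter()
    for record, dup_name, dup_location in zip(records, by_name, by_location):
        if dup_name and dup_location:
            tally[record["country"]] += 1
    return tally
```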

Preliminary disposed data

  • 1614 out of 6885 entries

Visualization tools to understand how data changed in the cleaning process.

In progress

Conclusion

Tools designed with a focus on the data capture process will increase data quality and reliability. The case of the ODK platform has shown that the schema and input fields made the collection process slow and difficult, especially given the large amounts of unformatted data captured.

Preliminary useful records by country:

  • UGANDA 2520
  • KENYA 2426
  • BANGLADESH 423
  • CONGO 36
  • SOMALILAND 24
  • IRAQ 17
  • BURUNDI 3
  • (blank) 2

Ideas for the self-submission data collection tool

  1. Reduce the number of inputs to the essential ones, and use the contact information to re-contact the facility and ask short, concise, and meaningful questions about their inventory.

  2. Change the way data is collected. Make a Telegram or WhatsApp bot application.

  3. For manufacturing capabilities, show reference photos of projects and ask whether the facility can carry out that activity, instead of asking about processes, machine types, and serial numbers.

  4. Introduce a contact information verification process, such as an SMS or email.
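
Idea 4 could be built around a one-time code sent to the contact. The sketch below covers only issuing and checking the code. Actual SMS or email delivery is left out, the in-memory `store` dictionary stands in for real storage, and the 10-minute validity window is a hypothetical choice.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 600  # hypothetical validity window (10 minutes)


def issue_code(store, contact, now=None):
    """Create a 6-digit one-time code for a contact and remember its expiry."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10**6):06d}"
    store[contact] = (code, now + CODE_TTL_SECONDS)
    return code  # the caller would send this by SMS or email


def verify_code(store, contact, submitted, now=None):
    """Check a submitted code; a code works once and only before it expires."""
    now = time.time() if now is None else now
    entry = store.get(contact)
    if entry is None:
        return False
    code, expires_at = entry
    if now > expires_at:
        del store[contact]
        return False
    # Constant-time comparison on bytes avoids leaking how many digits match.
    if hmac.compare_digest(code.encode(), submitted.encode()):
        del store[contact]
        return True
    return False
```

A record would only count as verified once `verify_code` returns `True` for its contact, which would also make scripted submissions with repeated or invented phone numbers harder.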
