@meiqimichelle
Last active October 21, 2019 03:10

Data Architect Maria needs to replicate huge datasets across many data centers to make them highly available for scientific study.

North star

Someone managing large data (whether in number of files or in volume) should not need to copy and paste hashes or peer IDs into the standard IPFS CLI to do their work. We need to actively market and develop IPFS Cluster as the entry point for the ‘enterprise management’ use case. We also need to build a lightweight admin interface that makes it easy for data administrators to manage pins and peers.

Priorities

In the short term, we should rewrite the IPFS and IPFS Cluster homepages (and any other useful communication points) to actively market IPFS Cluster as our ‘enterprise admin’ solution. We should also build a very simple, v0 enterprise admin panel that controls and explains several key IPFS data admin concepts (for example: replication factors, types of Clusters, pinsets, and followers).

User workflow

“I need to be able to manage and move massive datasets across my data and research centers to support science.”

| User workflow | IPFS interaction |
| --- | --- |
| Maria downloads and installs IPFS Cluster. | As part of the installation flow, IPFS Cluster asks a few short questions about data size and shape, and then spins up a Cluster with settings that are most likely to meet that user’s starting needs. It also prompts the user to visit the admin panel. |
| As prompted, Maria explores her new IPFS Cluster admin panel and learns how she can change replication factors, types of Clusters, and permissions across peers; manage her pinsets; and more. | The Cluster admin panel is functional, but also educational, especially when someone installs Cluster for the first time, as people need to know *why* certain choices might be made across its options. |
| Via the admin panel, Maria creates a ‘followers’ link so that she can ask scientists across her data centers to join her Cluster. | The admin panel provides an easy way to add follower peers to a given Cluster. These are peers that participate in hosting information in a Cluster, but don’t have equal ability to impact Cluster settings in other ways. |
| As scientists click her followers link, she sees their peers join her Cluster via the admin panel. She can explore performance metrics across her Cluster and troubleshoot issues via logs. | The admin panel gives data managers visibility into their system of peers, and either provides performance metrics and logs there, or links to other IPFS tools/views that provide that information. |
| Maria programmatically moves large datasets to IPFS via the Cluster CLI. | Cluster replicates the information the user adds, and automatically adds and pins it to IPFS. The user can see metadata about their collection of pins (pinsets) via the admin panel. |
| When scientists leave her organization, Maria can remove their peers as followers via the admin panel. | The Cluster admin panel allows the admin to add and remove peers as needed. When a peer is removed, Cluster automatically rebalances data load across the remaining peers. |
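
The replication and rebalancing behavior described in the workflow above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how IPFS Cluster actually allocates pins: the `allocate` function, the peer names, and the "least-loaded peer first" rule are all hypothetical and chosen purely for illustration. It shows the idea that each pin is held by a fixed number of peers (the replication factor), and that removing a peer causes its pins to be reassigned to the remaining peers.

```python
from typing import Dict, List, Optional, Set


def allocate(
    pins: List[str],
    peers: List[str],
    replication_factor: int,
    current: Optional[Dict[str, Set[str]]] = None,
) -> Dict[str, Set[str]]:
    """Toy allocator (hypothetical, not IPFS Cluster's algorithm).

    Keeps existing holders that are still in `peers`, then tops each pin
    up to `replication_factor` holders using the least-loaded peers.
    """
    if replication_factor > len(peers):
        raise ValueError("replication factor exceeds number of peers")

    current = current or {}
    live = set(peers)

    # Start from the existing allocation, dropping peers that have left.
    allocation = {pin: set(current.get(pin, set())) & live for pin in pins}

    # Count how many pins each remaining peer already holds.
    load = {peer: 0 for peer in peers}
    for holders in allocation.values():
        for peer in holders:
            load[peer] += 1

    # Fill any gaps, always choosing the least-loaded peer (ties by name).
    for pin in pins:
        holders = allocation[pin]
        while len(holders) < replication_factor:
            candidates = [p for p in peers if p not in holders]
            chosen = min(candidates, key=lambda p: (load[p], p))
            holders.add(chosen)
            load[chosen] += 1

    return allocation


if __name__ == "__main__":
    pins = ["dataset-a", "dataset-b", "dataset-c"]  # hypothetical pin names
    before = allocate(pins, ["peer1", "peer2", "peer3"], replication_factor=2)
    # A scientist leaves, so peer3 is removed and its pins are reassigned.
    after = allocate(pins, ["peer1", "peer2"], replication_factor=2, current=before)
    print(before)
    print(after)
```

Passing the previous allocation back in as `current` keeps data where it already lives and only moves what the departing peer held, which is the property an admin would want to see reflected in the panel after removing a follower.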

Success metrics

  • User engagement rates via opt-in metrics
    • Number of interactions with IPFS Cluster per week
    • Amount of data across peers
    • Number and types of peers
  • Number of downloads
  • Opt-in share of error logs
  • User frustration/happiness self-reporting
    • “Was this helpful”-type questionnaire where appropriate, with open-ended box or issue for optional feedback
  • Ease of user task completion via usability tests
    • Adding and removing peers
    • Creating follower links; adding and removing followers
    • Being added as a follower (from the follower peer perspective)
    • Adding and removing data

Long-term user workflow vision

  • Data Architect Maria works for a global environmental data and information service. She’s always looking for ways to solve tricky problems in the large data archiving and access space.
  • One of the problems that keeps her up at night is the sheer size of some of the newest satellite-derived information that her centers will need to archive and provide access to. The data is too large to go over normal network connections in any reasonable amount of time. Getting that information to scientists all over the world is a real challenge. (Not to mention the difficulties involved in running repeatable experiments on datasets when you can’t identify and share them with confidence.)
  • Maria hears about IPFS and its ability to move information in a peer-to-peer, content-addressed way. She also learns about IPFS Cluster, which can help her, as a data administrator, manage information across many peers in bulk.
  • She tests it by spinning up a Cluster with peers across several data centers, which she does via command line because that’s what she’s used to.
  • She adds a few large files to the system. She’s used to large file ingest taking a bit of time. IPFS seems to handle this in about the same time as other similar systems.
  • After she’s added these files, Cluster lets her know that she can view and manage her peers via an admin panel.
  • Via the admin panel, Maria learns about her options with regard to user and peer permissions, replication factors, and standard settings for different types of Clusters.
  • Once Maria is confident that she understands the system, she runs a public beta. Network data access endpoints don’t change, but Cluster now balances availability in the background, all the while collecting metrics and logs that Maria can use to fine-tune her systems.
  • Next, she adds some of her science power users to the Cluster. She wants these scientists to be able to access the information they need directly, and also help support data availability by hosting information in their regions as well. The Cluster interface allows her to create a “Join my Cluster” link that she can send to these scientists. This will add their peers to her Cluster without giving them full admin access.
  • Over the course of several years, Maria makes agreements with other data centers to support each other’s information via IPFS. As their p2p stewardship practice matures, they’ve been able to realize the benefits of block-level deduplication across their datasets, which has reduced the amount of information that needs to be stored and transported. This gets at what Maria wanted from the start: a better way to maintain and move massive datasets.
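
The block-level deduplication mentioned in the last item can be sketched with a toy content-addressed store. This is an illustration only, under loud assumptions: the `BlockStore` class and its methods are hypothetical, the chunks are tiny fixed-size slices, and the keys are plain SHA-256 hex digests rather than IPFS's actual chunking or CID format. The point it shows is that two datasets sharing identical blocks store those blocks once.

```python
import hashlib
from typing import Dict, List


class BlockStore:
    """Toy content-addressed block store (hypothetical, not IPFS itself)."""

    def __init__(self) -> None:
        # Maps a block's SHA-256 hex digest to the block's bytes.
        self.blocks: Dict[str, bytes] = {}

    def add(self, data: bytes, chunk_size: int = 4) -> List[str]:
        """Split data into fixed-size blocks and store each block once.

        Returns the ordered list of block keys needed to rebuild the data.
        """
        keys = []
        for start in range(0, len(data), chunk_size):
            block = data[start:start + chunk_size]
            key = hashlib.sha256(block).hexdigest()
            # Identical blocks hash to the same key, so they are not stored twice.
            self.blocks.setdefault(key, block)
            keys.append(key)
        return keys

    def get(self, keys: List[str]) -> bytes:
        """Reassemble data from its ordered block keys."""
        return b"".join(self.blocks[key] for key in keys)

    def stored_bytes(self) -> int:
        """Total bytes actually held, after deduplication."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = BlockStore()
    first = store.add(b"AAAABBBBCCCC")
    second = store.add(b"AAAABBBBDDDD")  # shares its first two blocks with `first`
    print(len(store.blocks), store.stored_bytes())
```

Because the two inputs share their first two blocks, the store holds four unique blocks instead of six, and the same saving applies when blocks are transported between peers that already hold some of them.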

Notes
