@johnnyaug
johnnyaug / HttpRangeInputStream.java
Last active March 2, 2023 07:27
Hadoop FSInputStream for HTTP Byte-Range requests
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FSExceptionMessages;
import org.apache.hadoop.fs.FSInputStream;
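
For context, an HTTP byte-range read amounts to sending a `Range` header and expecting a `206 Partial Content` response. The sketch below is not the gist's implementation, which is cut off in this preview. The class `RangeRequestSketch` and its method names are hypothetical, and it uses only `java.net.HttpURLConnection` from the JDK, with no Hadoop types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration only; not part of the gist above.
final class RangeRequestSketch {

    // Builds the Range header value for an inclusive byte range, e.g. "bytes=0-99".
    static String rangeHeader(long start, long endInclusive) {
        if (start < 0 || endInclusive < start) {
            throw new IllegalArgumentException("invalid range: " + start + "-" + endInclusive);
        }
        return "bytes=" + start + "-" + endInclusive;
    }

    // Extracts the total object length from a Content-Range value such as
    // "bytes 0-99/1234". Returns -1 when the server reports the length as "*".
    static long totalLength(String contentRange) {
        int slash = contentRange.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("not a Content-Range value: " + contentRange);
        }
        String total = contentRange.substring(slash + 1).trim();
        return total.equals("*") ? -1L : Long.parseLong(total);
    }

    // Opens a stream over the requested bytes. The caller must close the stream.
    static InputStream openRange(URL url, long start, long endInclusive) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", rangeHeader(start, endInclusive));
        int status = conn.getResponseCode();
        // A server that ignores Range answers 200 with the full body; treat that as an error here.
        if (status != HttpURLConnection.HTTP_PARTIAL) {
            conn.disconnect();
            throw new IOException("expected 206 Partial Content, got " + status);
        }
        return conn.getInputStream();
    }
}
```

A seekable stream built on this idea would typically reopen the connection with a new start offset on each seek, rather than reading and discarding bytes.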
@johnnyaug
johnnyaug / lakefs_gc.scala
Last active December 27, 2022 14:55
Understanding GC in lakeFS
// Databricks notebook source
// MAGIC %md
// MAGIC ### Understanding Garbage Collection in lakeFS
// MAGIC
// MAGIC This notebook will allow you to investigate the results of a GC dry run.
// MAGIC
// MAGIC Run the cells of this notebook one by one.
// MAGIC
// MAGIC **In the next cell, fill in the repository name.**
<!DOCTYPE html>
<html>
<head>
<meta name="databricks-html-version" content="1">
<title>Understanding GC in lakeFS - Databricks</title>
<meta charset="utf-8">
<meta name="google" content="notranslate">
<meta name="robots" content="nofollow">
<meta http-equiv="Content-Language" content="en">
-- Athena queries to share inventory stats with the Treeverse team.
-- The results of these queries will help us a great deal in engineering lakeFS, without exposing any data or object names.
-- Thank you for your effort and help!
-- It is assumed that you have a table named inventory_tbl representing the inventory.
-- A CREATE statement for such a table is available here: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html#storage-inventory-athena-query
-- It is also assumed that Object-Level versioning is enabled on the bucket, and on the inventory (i.e. the inventory contains is_latest as a field).
-- If you don't have Object-Level versioning, please contact us and we will provide an alternative set of queries.
-- Replace the dates with the ones you intend to share with us (NOTE: they appear in three queries).

# lakeFS with MinIO

lakeFS gives Git-like capabilities over your MinIO storage, allowing you to coordinate with colleagues when working on your data.

In the following example, we will use lakeFS to create a branch on your storage, commit changes to it, and then merge it to the master branch.

## Prerequisites

* Install MinIO Server from [here](https://docs.min.io/docs/minio-quickstart-guide).
* Install `mc` from [here](https://docs.min.io/docs/minio-client-quickstart-guide).
* Install docker-compose from [here](https://docs.docker.com/compose/install/).