@stachjankowski
Created March 26, 2020 16:30
Finds all the files matching a specified pattern in Spark.
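import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession

object ListFiles extends App {
  val spark = SparkSession.builder.config("spark.master", "local[1]").getOrCreate()
  // Glob pattern; the wildcards are expanded by Hadoop, not by a shell.
  val path = new Path("/data/wiki-dumps/dumps/*wiki-*-pages-articles-multistream.xml*")
  // Resolve the file system (local, HDFS, ...) from the path and Spark's Hadoop configuration.
  val fileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  // globStatus may return null (e.g. for a non-glob path that does not exist), so guard with Option.
  val files: List[FileStatus] = Option(fileSystem.globStatus(path)).map(_.toList).getOrElse(Nil)
  files.foreach(println)
  spark.stop()
}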
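The same lookup can be tried without a Spark session, since globStatus belongs to the Hadoop FileSystem API. Below is a minimal sketch, assuming hadoop-common is on the classpath; the object name GlobSketch, the helper globPaths, and the file names in the temporary directory are hypothetical and used only for illustration.

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object GlobSketch {
  // Hypothetical helper: expands a glob pattern and returns the matching paths.
  // Falls back to an empty list if globStatus returns null.
  def globPaths(pattern: Path, conf: Configuration): List[Path] = {
    val fs = pattern.getFileSystem(conf)
    Option(fs.globStatus(pattern)).map(_.toList).getOrElse(Nil).map(_.getPath)
  }

  def main(args: Array[String]): Unit = {
    // Build a throwaway directory with made-up file names to match against.
    val dir = Files.createTempDirectory("glob-demo")
    Seq("enwiki-1.xml", "plwiki-1.xml.bz2", "notes.txt")
      .foreach(name => Files.createFile(dir.resolve(name)))

    // A default Configuration resolves scheme-less paths to the local file system.
    val matches = globPaths(new Path(dir.toString, "*wiki-*.xml*"), new Configuration())
    matches.map(_.getName).foreach(println)
  }
}
```

With the pattern above, only the two names containing "wiki-" and ".xml" should be listed; notes.txt is skipped. Inside a Spark job, spark.sparkContext.hadoopConfiguration can be passed in place of the fresh Configuration so that HDFS or other configured file systems are used.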