@tdunning
Created March 23, 2015 19:54
File manipulation
Log in to the cluster:
ted:downloads$ ssh se-node10.se.lab
Last login: Mon Mar 23 17:35:37 2015 from 10.250.0.220
Please check the cluster reservation calendar:
https://www.google.com/calendar/embed?src=maprtech.com_2d38343133383836382d313737%40resource.calendar.google.com
Poke around looking for my volume and such:
[tdunning@se-node10 ~]$ ls /mapr/se1/user/t
tdunning/ tlojko/
[tdunning@se-node10 ~]$ ls /mapr/se1/user/tdunning/
old-cluster
[tdunning@se-node10 ~]$ maprcli volume list -columns volumename | grep tdunning
home.tdunning
Also note that we are already in my volume, because my home directory is on the cluster:
[tdunning@se-node10 ~]$ pwd
/mapr/se1/user/tdunning
When we look at the contents of my home directory, we see old stuff and my new volume. The new volume name is not the same as the mount point. It is the mount point that we see here.
[tdunning@se-node10 ~]$ ls
new-vol old-cluster
[tdunning@se-node10 ~]$ cd new-vol/
[tdunning@se-node10 new-vol]$ ls
Inside the new volume, create some empty files just because we can:
[tdunning@se-node10 new-vol]$ touch x y z
[tdunning@se-node10 new-vol]$ ls
x y z
[tdunning@se-node10 new-vol]$ cd ..
[tdunning@se-node10 ~]$ ls
new-vol old-cluster
OK... back in my home directory (which is on the cluster, of course), download the log-synth code and compile it:
[tdunning@se-node10 ~]$ pwd
/mapr/se1/user/tdunning
[tdunning@se-node10 ~]$ git clone https://github.com/tdunning/log-synth.git
Initialized empty Git repository in /mapr/se1/user/tdunning/log-synth/.git/
remote: Counting objects: 1421, done.
remote: Total 1421 (delta 0), reused 0 (delta 0), pack-reused 1421
Receiving objects: 100% (1421/1421), 2.48 MiB | 2.29 MiB/s, done.
Resolving deltas: 100% (572/572), done.
[tdunning@se-node10 ~]$ cd log-synth/
[tdunning@se-node10 log-synth]$ mvn -q -DskipTests package
... much goo deleted ...
[loading ZipFileIndexFileObject[/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/lib/ct.sym(META-INF/sym/rt.jar/java/lang/reflect/AnnotatedElement.class)]]
[loading ZipFileIndexFileObject[/mapr/se1/user/tdunning/.m2/repository/org/apache/mahout/mahout-math/0.9/mahout-math-0.9.jar(org/apache/mahout/common/RandomWrapper.class)]]
[loading ZipFileIndexFileObject[/mapr/se1/user/tdunning/.m2/repository/org/slf4j/slf4j-api/1.6.6/slf4j-api-1.6.6.jar(org/slf4j/Marker.class)]]
[wrote RegularFileObject[/mapr/se1/user/tdunning/log-synth/target/test-classes/com/mapr/stats/UpperQuantileTest.class]]
[total 680ms]
Check to see that the executable was created by the compilation. The file log-synth is the one we care about.
[tdunning@se-node10 log-synth]$ ls target
archive-tmp generated-sources log-synth log-synth-0.1-SNAPSHOT-jar-with-dependencies.jar maven-status
classes generated-test-sources log-synth-0.1-SNAPSHOT.jar maven-archiver test-classes
Back home, we try to run this program, and it tells us that we need to give it a schema:
[tdunning@se-node10 log-synth]$ cd ..
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth
Exception in thread "main" java.lang.IllegalArgumentException: Must specify schema file using [-schema filename] option
at com.mapr.synth.Synth.main(Synth.java:94)
So we create a trivial schema that will generate three dates. Usually, we will do something more interesting here. See the README at https://github.com/tdunning/log-synth for more information.
[tdunning@se-node10 ~]$ cat > schema.json
[
{"name":"first_visit", "class":"date", "format":"MM/dd/yyyy"},
{"name":"second_date", "class":"date", "start":"2014-01-31", "end":"2014-02-07"},
{"name":"third_date", "class":"date", "format":"MM/dd/yyyy", "start":"01/31/1995", "end":"02/07/1999"}
]
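Typing the schema into `cat > schema.json` works interactively (end the input with Ctrl-D), but a here-document does the same thing from a script. A minimal sketch, assuming only a POSIX shell; the scratch directory is a stand-in so the example does not touch your home directory:

```shell
# Write the same three-field schema non-interactively with a here-document.
# The scratch directory is an assumption for illustration; in the session
# above the file simply lives in the home directory.
workdir=$(mktemp -d)
cat > "$workdir/schema.json" <<'EOF'
[
{"name":"first_visit", "class":"date", "format":"MM/dd/yyyy"},
{"name":"second_date", "class":"date", "start":"2014-01-31", "end":"2014-02-07"},
{"name":"third_date", "class":"date", "format":"MM/dd/yyyy", "start":"01/31/1995", "end":"02/07/1999"}
]
EOF

# Sanity check: count the field definitions (one "name" key per line).
fields=$(grep -c '"name"' "$workdir/schema.json")
echo "schema defines $fields fields"
```

Quoting the delimiter (`'EOF'`) keeps the shell from expanding anything inside the schema text.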
Now create a tiny output file with 20 lines.
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 20 -format JSON
R 1 0.0 0 0.0 0.000
F 1 0.0 20 654.8 820.850
[tdunning@se-node10 ~]$ cat foo/synth-0000
{"first_visit":"05/17/2013","second_date":"2014-01-31","third_date":"04/08/1998"}
{"first_visit":"10/05/2012","second_date":"2014-02-01","third_date":"06/17/1997"}
... 16 lines of similar goo omitted ...
{"first_visit":"06/27/2013","second_date":"2014-02-04","third_date":"02/18/1996"}
{"first_visit":"11/11/2012","second_date":"2014-02-03","third_date":"07/31/1995"}
Let's clean up and generate something bigger with 20 million lines generated using 20 threads. This isn't a Hadoop program since it is ordinary Java running on a single machine, but it is a simple form of parallelism.
[tdunning@se-node10 ~]$ rm -rf foo
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 20M -threads 20 -format JSON
R 20 0.0 0 0.0 0.000
R 20 10.0 2370000 236178.3 237005.826
R 20 20.0 5014500 250296.7 264464.988
R 20 30.0 7657500 254959.0 264299.625
R 20 40.0 10098000 252234.2 244050.373
R 20 50.0 12846000 256744.3 274800.099
R 20 60.0 15470000 257686.2 262399.031
R 20 70.0 17838000 254703.8 236799.218
F 20 78.1 20000000 256182.6 269071.616
Now we have lots of files and they are much bigger.
[tdunning@se-node10 ~]$ du -sh foo/*
79M foo/synth-0000
79M foo/synth-0001
79M foo/synth-0002
79M foo/synth-0003
79M foo/synth-0004
79M foo/synth-0005
79M foo/synth-0006
79M foo/synth-0007
79M foo/synth-0008
79M foo/synth-0009
79M foo/synth-0010
79M foo/synth-0011
79M foo/synth-0012
79M foo/synth-0013
79M foo/synth-0014
79M foo/synth-0015
79M foo/synth-0016
79M foo/synth-0017
79M foo/synth-0018
79M foo/synth-0019
[tdunning@se-node10 ~]$ du -sh foo
1.6G foo
[tdunning@se-node10 ~]$ wc -l foo/*
1000000 foo/synth-0000
1000000 foo/synth-0001
1000000 foo/synth-0002
1000000 foo/synth-0003
1000000 foo/synth-0004
1000000 foo/synth-0005
1000000 foo/synth-0006
1000000 foo/synth-0007
1000000 foo/synth-0008
1000000 foo/synth-0009
1000000 foo/synth-0010
1000000 foo/synth-0011
1000000 foo/synth-0012
1000000 foo/synth-0013
1000000 foo/synth-0014
1000000 foo/synth-0015
1000000 foo/synth-0016
1000000 foo/synth-0017
1000000 foo/synth-0018
1000000 foo/synth-0019
20000000 total
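The sizes reported by `du` line up with a quick back-of-the-envelope check. In the samples shown earlier, every record has the same width because all three fields are fixed-width dates, so measuring one line and multiplying by the line count gives the expected byte count. This sketch uses the first record from the 20-line run as the sample:

```shell
# One sample record from the 20-line run shown earlier.
sample='{"first_visit":"05/17/2013","second_date":"2014-01-31","third_date":"04/08/1998"}'

# Bytes per line, including the trailing newline.
# tr strips the padding that some wc implementations add.
bytes_per_line=$(printf '%s\n' "$sample" | wc -c | tr -d ' ')

# Expected sizes for one 1M-line file and for the full 20M-line run.
per_file=$((bytes_per_line * 1000000))
total=$((bytes_per_line * 20000000))

echo "bytes per line: $bytes_per_line"
echo "one file:  $per_file bytes (~$((per_file / 1048576)) MiB)"
echo "all files: $total bytes (~$((total / 1048576)) MiB)"
```

That works out to about 78 MiB per file and roughly 1.5 GiB in total, which is consistent with the `79M` and `1.6G` figures above if `du -h` is rounding up.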
Clean up again and make something much smaller.
[tdunning@se-node10 ~]$ rm -rf foo
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 2M -threads 10 -format JSON
R 10 0.0 0 0.0 0.000
F 10 7.1 2000000 279924.3 281345.025
[tdunning@se-node10 ~]$ du -sh foo
157M foo
Only 157MB instead of 1.6GB. Much nicer. Next we package this up as a single file to make web storage easier.
[tdunning@se-node10 ~]$ tar zcvf foo.tgz foo
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
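Before copying an archive anywhere, it can be worth listing its contents without unpacking it. A small sketch of that check, using a scratch directory with stand-in files rather than the real `foo` output:

```shell
# Build a tiny stand-in for the foo directory (assumption: two small files
# instead of the real ten 200,000-line shards).
workdir=$(mktemp -d)
mkdir "$workdir/foo"
printf 'line one\n' > "$workdir/foo/synth-0000"
printf 'line two\n' > "$workdir/foo/synth-0001"

# Create the gzip-compressed archive, as in the session above.
# -C makes tar work from the scratch directory so stored paths stay relative.
tar -C "$workdir" -czf "$workdir/foo.tgz" foo

# List the archive contents without extracting anything (t = list).
listing=$(tar -tzf "$workdir/foo.tgz" | sort)
echo "$listing"
```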
OK. While you weren't watching (because you were reading this), I copied that file to the Public directory in Dropbox on my laptop using scp. That means I can delete foo.tgz from my cluster home directory and try downloading it from the web. First I use a utility called `wget`. Later, I will use `curl`, which is more commonly used by the SE team.
[tdunning@se-node10 ~]$ rm foo.tgz
[tdunning@se-node10 ~]$ wget http://bit.ly/se-onboarding-data
--2015-03-23 18:52:32-- http://bit.ly/se-onboarding-data
Resolving bit.ly... 69.58.188.39, 69.58.188.40
Connecting to bit.ly|69.58.188.39|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://dl.dropboxusercontent.com/u/36863361/foo.tgz [following]
--2015-03-23 18:52:32-- https://dl.dropboxusercontent.com/u/36863361/foo.tgz
Resolving dl.dropboxusercontent.com... 23.21.196.214, 50.17.184.208, 54.221.192.137, ...
Connecting to dl.dropboxusercontent.com|23.21.196.214|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13278296 (13M) [application/x-gtar]
Saving to: “se-onboarding-data”
100%[===========================================================================================================================================>] 13,278,296 2.60M/s in 5.6s
2015-03-23 18:52:41 (2.26 MB/s) - “se-onboarding-data” saved [13278296/13278296]
The `wget` command gives me lots of information by default so I can see the redirection that happens with bitly short links. I can now unpack the file and see that it has the original data:
[tdunning@se-node10 ~]$ tar xvf se-onboarding-data
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
OK. So we clean up again to get ready to use `curl` for the same task.
[tdunning@se-node10 ~]$ rm -rf foo foo.tgz
[tdunning@se-node10 ~]$ rm se-onboarding-data
[tdunning@se-node10 ~]$ ls
log-synth new-vol old-cluster schema.json side-log
[tdunning@se-node10 ~]$ cd new-vol/
[tdunning@se-node10 new-vol]$ ls
x y z
[tdunning@se-node10 new-vol]$ pwd
/mapr/se1/user/tdunning/new-vol
The `curl` command works a bit differently from `wget`. If I simply try to grab the contents of the bitly link, `curl` doesn't follow the redirect by default. It also writes the content to standard output, which can be bad if you are downloading tens of megabytes.
[tdunning@se-node10 new-vol]$ curl http://bit.ly/se-onboarding-data
<html>
<head><title>Bitly</title></head>
<body><a href="https://dl.dropboxusercontent.com/u/36863361/foo.tgz">moved here</a></body>
</html>[tdunning@se-node10 new-vol]$ curl http://bit.ly/se-onboarding-data
The `-L` option forces `curl` to follow redirects. Note also the use of `> foo.tgz` to redirect the output to a file with a useful name.
[tdunning@se-node10 new-vol]$ curl -L http://bit.ly/se-onboarding-data > foo.tgz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.6M 100 12.6M 0 0 1593k 0 0:00:08 0:00:08 --:--:-- 2884k
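A variation on the same download is to let `curl` write the file itself with `-o`, and to add `-f` so that an HTTP error fails the command instead of saving an error page. The sketch below wraps that in a small helper. The function name `fetch` is made up for this example, and the `file://` URL is only a stand-in so the sketch runs without network access (it assumes your `curl` build supports `file://`); with the real link you would pass `http://bit.ly/se-onboarding-data` instead.

```shell
# fetch URL OUTFILE: download URL to OUTFILE.
#   -f   fail on HTTP errors rather than saving the error page
#   -sS  stay quiet, but still print error messages
#   -L   follow redirects (needed for short links)
#   -o   write to the named file instead of standard output
fetch() {
    curl -fsSL -o "$2" "$1"
}

# Demonstration against a local file so that no network is needed.
workdir=$(mktemp -d)
printf 'pretend this is foo.tgz\n' > "$workdir/source"
fetch "file://$workdir/source" "$workdir/foo.tgz"
cmp "$workdir/source" "$workdir/foo.tgz" && echo "download matches source"
```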
And again, note that the contents can be extracted as before.
[tdunning@se-node10 new-vol]$ tar xvf foo.tgz
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
[tdunning@se-node10 new-vol]$ wc -l foo/*
200000 foo/synth-0000
200000 foo/synth-0001
200000 foo/synth-0002
200000 foo/synth-0003
200000 foo/synth-0004
200000 foo/synth-0005
200000 foo/synth-0006
200000 foo/synth-0007
200000 foo/synth-0008
200000 foo/synth-0009
2000000 total
[tdunning@se-node10 new-vol]$ ls
foo foo.tgz x y z
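With many output files, eyeballing `wc -l` gets tedious. A rough sketch of an automated check that every shard has the same number of lines; the scratch files here are stand-ins for the real `foo/synth-*` files:

```shell
# Stand-in shards (assumption: three files of 5 lines each).
workdir=$(mktemp -d)
mkdir "$workdir/foo"
for i in 0 1 2; do
    printf '%s\n' a b c d e > "$workdir/foo/synth-000$i"
done

# Count lines per file, then count how many distinct line counts there are.
# Feeding each file on stdin keeps file names out of the wc output.
distinct=$(for f in "$workdir"/foo/synth-*; do wc -l < "$f"; done | sort -u | wc -l | tr -d ' ')

if [ "$distinct" -eq 1 ]; then
    echo "all shards have the same line count"
else
    echo "shard line counts differ" >&2
fi
```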
You can see MapR DB tables from the command line as well. They appear as symbolic links to a special location. I created the table called `data-table` in the `new-vol` directory using the MCS. Here is the result.
[tdunning@se-node10 new-vol]$ pwd
/mapr/se1/user/tdunning/new-vol
[tdunning@se-node10 new-vol]$ ls -l
total 12969
lr-------- 1 tdunning tdunning 2 Mar 23 19:11 data-table -> mapr::table::3315.125.131512
drwxrwxr-x 2 tdunning tdunning 10 Mar 23 18:43 foo
-rw-rw-r-- 1 tdunning tdunning 13278296 Mar 23 19:06 foo.tgz
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 x
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 y
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 z
[tdunning@se-node10 new-vol]$ ls
data-table foo foo.tgz x y z
[tdunning@se-node10 new-vol]$ find . -type l
./data-table
I can search for other tables, subject to file permissions.
[tdunning@se-node10 new-vol]$ find /mapr/se1/user -type l -ls | grep ::
744048204 0 lrwx------ 1 apernsteiner apernsteiner 2 Mar 5 19:34 /mapr/se1/user/apernsteiner/tables/andytable -> mapr::table::2097.76.529274
874071677 0 lr-------- 1 tdunning tdunning 2 Mar 23 19:11 /mapr/se1/user/tdunning/new-vol/data-table -> mapr::table::3315.125.131512
907626118 0 lr-------- 1 kbotzum kbotzum 2 Mar 12 19:31 /mapr/se1/user/kbotzum/tables/ycsb -> mapr::table::2275.134.393918
907626140 0 lr-------- 1 kbotzum kbotzum 2 Mar 12 19:53 /mapr/se1/user/kbotzum/tables/ycsb2 -> mapr::table::2275.156.393962
188302894 0 lr-------- 1 jbates jbates 2 Dec 16 16:43 /mapr/se1/user/jbates/mapr_table -> mapr::table::2314.46.262516
^C
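The `grep ::` filter above matches against the whole listing line. An alternative sketch reads each link target with `readlink` and keeps only those starting with `mapr::table::`, the prefix seen in the listings above. The scratch directory and the `ordinary-link` name are invented so the example runs anywhere; on the cluster you would point `find` at a path such as `/mapr/se1/user`.

```shell
# Scratch directory with one table-style link and one ordinary link.
# Link targets do not need to exist for ln -s or readlink to work.
workdir=$(mktemp -d)
ln -s 'mapr::table::3315.125.131512' "$workdir/data-table"
ln -s /tmp "$workdir/ordinary-link"

# Walk all symlinks and print only those whose target looks like a table.
tables=$(find "$workdir" -type l | while IFS= read -r link; do
    case $(readlink "$link") in
        (mapr::table::*) printf '%s\n' "$link" ;;
    esac
done)
echo "$tables"
```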
Finally, I clean up everything by deleting the `new-vol` volume using the MCS.
[tdunning@se-node10 ~]$ ls
log-synth old-cluster schema.json side-log
[tdunning@se-node10 ~]$ rm -rf foo-orig/
[tdunning@se-node10 ~]$ ls -l
total 2
drwxr-xr-x 6 tdunning tdunning 11 Mar 23 18:32 log-synth
drwxr-xr-x 6 tdunning tdunning 6 Oct 11 05:55 old-cluster
-rw-rw-r-- 1 tdunning tdunning 252 Mar 23 18:34 schema.json
-rw-rw-r-- 1 tdunning tdunning 260 Mar 23 18:43 side-log