@andre-merzky
Created October 13, 2014 23:07
Request #302541
inconsistent backend state ?
Andre Merzky
Jul 03 06:18
I have trouble understanding the semantics of various Globus Online commands, as the results do not match my expectations. In particular, I would expect that an operation which returns no error has actually *completed* on the backend, i.e. that the respective changes are committed to the storage system. But I get a different impression, and I am not sure whether that is because of aggressive caching, because of different, conflicting code paths, or something else. My best guess is that GO uses different protocols for different operations, and that state is getting out of sync?
As an example I include a Globus Online shell session below, which is representative of the kind of problem I am also encountering in other contexts (i.e. with different operations). [Some output lines of ls which refer to other people's files have been omitted from the session log for clarity.]
---------------------------------------------------------------------------------
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 10:52 am/
drwxrwxr-x tg803521 G-81625 4096 2014-07-02 22:43 am1/
$ rm -r -f gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am
Task ID: 7f1dd2f9-02a0-11e4-b581-12313940394d
Type <CTRL-C> to cancel or bg<ENTER> to background
[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 1/1 0.00 mbps
$ rm -r -f gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am1
Task ID: 81c16d8b-02a0-11e4-b581-12313940394d
Type <CTRL-C> to cancel or bg<ENTER> to background
[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 1/1 0.00 mbps
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 10:52 am/
drwxrwxr-x tg803521 G-81625 4096 2014-07-02 22:43 am1/
$ rm -r -f gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am/
Task ID: 879bf14f-02a0-11e4-b581-12313940394d
Type <CTRL-C> to cancel or bg<ENTER> to background
[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 1/1 0.00 mbps
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
$ mkdir gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am
Error: Path already exists
Details: Error (mkdir)
Server: andremerzky#gsiftp_gridftp.stampede.tacc.xsede.org
(gridftp.stampede.tacc.xsede.org:2811)
Message: Path '/tmp/am' already exists
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
$ mkdir gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am/
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 10:52 am/
drwxrwxr-x tg803521 G-81625 4096 2014-07-02 22:43 am1/
---------------------------------------------------------------------------------
Note that some operations seem to fail for the wrong reason, others seem to succeed without doing anything, and the file system entries seem to come and go somewhat randomly.
What am I missing? Is this expected behavior? Is this a problem of this specific backend (I did not test other backends thoroughly)?
I am trying to use those commands programmatically, and the (apparently) inconsistent state is causing me quite some grief, to be honest... For completeness, some details on the current setup of the shell:
---------------------------------------------------------------------------------
$ profile
User Name: andremerzky
DN: /C=US/O=National Center for Supercomputing Applications/CN=Andre Merzky
Email Address: andre--globus@merzky.net
Task Notifications: No
$ endpoint-list -v
Name : andremerzky#gsiftp_gridftp.stampede.tacc.xsede.org
Host(s) : gsiftp://gridftp.stampede.tacc.xsede.org:2811
Subject(s) :
Target Endpoint : n/a
Default Directory : n/a
Force Encrypted Transfer: No
Disable Verify : No
MyProxy Server : n/a
MyProxy DN : n/a
MyProxy OAuth Server : n/a
Credential Status : ACTIVE
Credential Expires : 2014-07-12 09:03:41Z
Credential Subject : /C=US/O=National Center for Supercomputing Applications/CN=Andre Merzky/CN=1960031471
S3 URL : n/a
Owner Activated : No
Name : andremerzky#gsisftp_trestles-dm1.sdsc.edu
Host(s) : gsiftp://trestles-dm1.sdsc.edu:2811
Subject(s) :
Target Endpoint : n/a
Default Directory : n/a
Force Encrypted Transfer: No
Disable Verify : No
MyProxy Server : n/a
MyProxy DN : n/a
MyProxy OAuth Server : n/a
Credential Status : ACTIVE
Credential Expires : 2014-07-12 09:03:41Z
Credential Subject : /C=US/O=National Center for Supercomputing Applications/CN=Andre Merzky/CN=1960031471
S3 URL : n/a
Owner Activated : No
---------------------------------------------------------------------------------
Thanks, Andre.
Comments
Globus Team - Diane
globus support
Hello Andre,
Thanks for reaching out to Support! We are researching your questions. An engineer will get back to you when they have more information to provide, or need to ask questions.
Thanks & Regards,
Diane Collins
Globus HelpDesk
July 03, 2014 09:53
Globus Team - Stephen
globus support
Hello Andre,
To the best of our ability to tell, after some investigation, everything is working correctly.
The issues that you are experiencing arise from the semantics of Globus Transfer operations.
As you note below, when a command in the CLI returns, it does not indicate that the requested Transfer operation has been completed.
This is not an accident, but part of the CLI design. The return of a CLI operation without error messages indicates that the Transfer task has been submitted.
You can check on the state of running Transfers with the CLI's `status` command, which by default lists all of your active Transfer tasks.
In one of our CLI tutorial documents, there is a section on "Monitoring" that you may find useful: https://support.globus.org/entries/29642203
I hope that this answers your questions satisfactorily. If you have further issues, or want more information than is provided in the above document, please don't hesitate to contact us again.
Thanks,
-Stephen
July 08, 2014 14:31
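For scripted use, the outcome of a task can be read from the `status` output instead of the progress bar. The following is a minimal Python sketch, assuming the output keeps the `Key : value` layout that appears in the session logs of this ticket. `parse_status` and `task_succeeded` are hypothetical helper names, and only the `SUCCEEDED` state seen in this ticket is checked.

```python
def parse_status(text):
    """Parse 'Key : value' lines of `status` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons
        # (timestamps, endpoint paths).
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def task_succeeded(text):
    """True if the parsed output reports the SUCCEEDED state."""
    return parse_status(text).get("Status") == "SUCCEEDED"


# Sample copied from a `status` call recorded in this ticket.
SAMPLE = """\
Task ID : 5aaf9c91-06dc-11e4-b589-12313940394d
Request Time: 2014-07-08 20:13:41Z
Command : rm -r -f gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am
Label : n/a
Status : SUCCEEDED
"""

fields = parse_status(SAMPLE)
```

How the text is obtained (for example by capturing the output of the `gsissh` session) is left open here, since that depends on the local setup.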
Andre Merzky
But alas, I don't think that has anything to do with tasks, really. I don't use '-D' on the rm calls, so the call should (according to the man page) only return success if the task was successfully *completed*. I can also inspect the listed task ID afterwards, and see it in 'SUCCEEDED' state -- and still the result is not as expected. Worse, I see directories which are unrelated to the last command *appearing* on the next ls:
----------------------------------------------------------------------------------
merzky@cameo:~ $ gsissh andremerzky@cli.globusonline.org
Welcome to globusonline.org, andremerzky. Type 'help' for help.
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 19:25 am/
-rw------- root root 4294967296 2013-03-11 17:00 swapfile
...
$ rm -r -f gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am
Task ID: 5aaf9c91-06dc-11e4-b589-12313940394d
Type <CTRL-C> to cancel or bg<ENTER> to background
[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 1/1 0.00 mbps
$ status 5aaf9c91-06dc-11e4-b589-12313940394d
Task ID : 5aaf9c91-06dc-11e4-b589-12313940394d
Request Time: 2014-07-08 20:13:41Z
Command : rm -r -f gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/am
Label : n/a
Status : SUCCEEDED
$ ls -l gsiftp_gridftp.stampede.tacc.xsede.org:/tmp/
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 12:27 am/
drwxrwxr-x tg803521 G-81625 4096 2014-07-02 22:43 am1/
-rw------- root root 4294967296 2013-03-11 17:00 swapfile
...
-----------------------------------------------------------------------------------------
Can I really be sure that those operations end up on the same storage backend? It kind of looks like this, as some files have the exact same size and access time (which would be very surprising otherwise) -- I included an example ('swapfile') in the listing above.
If you insist that this is expected behavior, I can accept that -- but that makes it nigh impossible for me to use the CLI programmatically, I'm afraid :/
Thanks :) Andre.
July 08, 2014 15:22
Globus Team - Stephen
globus support
Hi Andre,
I apologize for misunderstanding your issue. You are correct, the `rm` calls without `-D` are synchronous commands that return upon completion.
I'm not aware of any instances of this type of problem appearing on smaller systems, so it seems that this is an issue with GridFTP with a distributed Lustre backend.
I've opened up discussion within our team about this issue, and I will get back to you when I know more.
Thanks,
-Stephen
July 09, 2014 14:17
Andre Merzky
Hi Stephen,
thanks for the follow-up! I didn't bother to test other systems similarly, but if you have the feeling it might be caused by the backend FS type, I'll run a couple of tests on non-Lustre machines. Honestly, I can't imagine this problem being very prevalent; it would have triggered problems all over the place...
Thanks again, Andre.
July 09, 2014 14:24
Globus Team - Stephen
globus support
Hi Andre,
After discussing this with our team, we have found two possible sources for this issue.
The first is that you are using the `-f` option to `rm` in the CLI. This ignores some classes of errors silently, so the files may not actually be deleted by these `rm` commands. To determine if this is interfering, simply run without the `-f` option.
The second is a delayed write of file metadata in Lustre: https://jira.hpdd.intel.com/browse/LU-274
The bug reported there is not identical to your issue, but it may be related. Unfortunately, I'm not sure how we can determine whether or not this is the case unless Stampede is running an unaffected version of Lustre.
We will continue to look into possible sources of this problem, but it seems likely at this stage that they are related to the configuration of Stampede, not of Globus.
Thanks,
-Stephen
July 14, 2014 10:22
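To gather evidence for either explanation, listings captured before and after an `rm` can be compared mechanically. The following is a minimal Python sketch, assuming each `ls -l` line has the seven-field layout seen in this ticket (mode, owner, group, size, date, time, name). `parse_listing` and `diff_listings` are hypothetical helper names, and the sample data is copied from the second session log in this ticket.

```python
def parse_listing(text):
    """Map entry name -> (size, 'YYYY-MM-DD HH:MM') for each `ls -l` line."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 6)
        if len(parts) != 7 or not parts[3].isdigit():
            continue  # skip elided lines such as "..."
        mode, owner, group, size, date, clock, name = parts
        entries[name.rstrip("/")] = (int(size), f"{date} {clock}")
    return entries


def diff_listings(before, after):
    """Report names removed, added, or changed between two listings."""
    old, new = parse_listing(before), parse_listing(after)
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(n for n in old.keys() & new.keys()
                          if old[n] != new[n]),
    }


BEFORE = """\
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 19:25 am/
-rw------- root root 4294967296 2013-03-11 17:00 swapfile
...
"""

AFTER = """\
drwxrwxr-x tg803521 G-81625 4096 2014-07-03 12:27 am/
drwxrwxr-x tg803521 G-81625 4096 2014-07-02 22:43 am1/
-rw------- root root 4294967296 2013-03-11 17:00 swapfile
...
"""

report = diff_listings(BEFORE, AFTER)
```

For the sample data, `report` lists `am1` as added and `am` as changed (its timestamp differs between the two listings), while nothing is reported as removed even though the `rm` task succeeded.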
Globus Team - Stephen
globus support
Hello Andre,
We haven't heard from you on this ticket for a while, and there has been no new information at our end.
As a result, we're going to assume that the issue has been resolved to your satisfaction, or that you have resolved it yourself, and close the ticket.
If you have any further questions, please don't hesitate to contact us again.
Thanks,
-Stephen
July 24, 2014 14:41