
@ioscott
Last active March 30, 2016 12:17
Lightning Batch Processor examples.

The following examples were created using Ian's ID so that the Extractor he trained can be used.

If not logged in, run the following from the command line.

curl -c da-user-cookies.txt -XPOST -d "username=your-name&password=your-password" "https://api.staging-owl.com/auth/login"
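A password containing characters such as `&` or `=` would be split incorrectly by the plain `-d` form above. The sketch below wraps the same login call in a function and lets curl encode each value with `--data-urlencode`. The function name `owl_login` and the environment variables `OWL_USER` and `OWL_PASS` are made up for this example; the endpoint and cookie file are the ones used above.

```shell
# Log in and store the session cookie, reading credentials from the environment.
# OWL_USER and OWL_PASS are hypothetical variable names chosen for this sketch.
owl_login() {
  cookie_jar="${1:-da-user-cookies.txt}"
  # --data-urlencode encodes each value, so characters such as & or = in a
  # password are not mistaken for form separators.
  curl -c "$cookie_jar" -XPOST \
    --data-urlencode "username=${OWL_USER}" \
    --data-urlencode "password=${OWL_PASS}" \
    "https://api.staging-owl.com/auth/login"
}
```

With `OWL_USER` and `OWL_PASS` exported, running `owl_login` writes the cookie to da-user-cookies.txt, which the later commands read with `-b`.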

The Extractor can only keep track of one run at a time. If several people follow these instructions at once, they will interfere with each other; this is not a bug. The UI prevents this, but curl does not.

The URLs the batch will use must be attached to the Extractor's urlList field. Note: in the example below the final URL is a deliberate error. Replace the attachment if different URLs are required.

curl -vv -b da-user-cookies.txt -H "Content-Type: text/plain"  -XPUT "https://store.staging-owl.com/extractor/f1b0eb39-d617-49f1-8571-6183881c9895/_attachment/urlList" -d 'http://doom.import.io/php/playback/playback-simple-1-results.php?query=asdar
http://doom.import.io/php/playback/playback-simple-1-results.php?query=asda2
http://doom.import.io/php/playback/playback-simple-1-results.php?query=asda4
http://doom.import.io/php/playback/playback-simple-1-results.php?query=asdar3
sttp://doom.import.io/php/playback/playback-simple-1-results.php?query=asdar
'
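For longer lists it may be easier to keep the URLs in a file, one per line, and upload that. The sketch below assumes such a file and offers two helpers whose names (`invalid_urls`, `put_url_list`) are made up for this example. It uses `--data-binary` rather than `-d`, because curl strips newlines when `-d` reads from a file. The pre-check only looks at the scheme, so it would flag the deliberate `sttp://` line above but says nothing about whether a URL is reachable.

```shell
# List lines that do not start with http:// or https://.
# Prints nothing when every line looks like a web URL (blank lines are flagged too).
invalid_urls() {
  grep -vE '^https?://' "$1"
}

# Upload a newline-separated URL file as the Extractor's urlList attachment.
# --data-binary sends the file byte for byte; -d @file would strip the newlines.
put_url_list() {
  extractor_id="$1"
  url_file="$2"
  curl -b da-user-cookies.txt -H "Content-Type: text/plain" -XPUT \
    "https://store.staging-owl.com/extractor/${extractor_id}/_attachment/urlList" \
    --data-binary "@${url_file}"
}
```

For example, with the list saved as urls.txt (a hypothetical file name), `invalid_urls urls.txt` shows any lines to review, and `put_url_list f1b0eb39-d617-49f1-8571-6183881c9895 urls.txt` replaces the attachment.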

If the CrawlRun referenced by extractor.nextCrawlRunId has a non-terminal status, the existing job will need to be cancelled first. Note: as you are working as Ian, you may be cancelling someone else's (likely Ian's) job.

curl -b da-user-cookies.txt -XPOST 'https://run.staging-owl.com/f1b0eb39-d617-49f1-8571-6183881c9895/cancel'

To start the run:

curl -b da-user-cookies.txt -XPOST 'https://run.staging-owl.com/f1b0eb39-d617-49f1-8571-6183881c9895/start'

When completed, the CrawlRun looks something like this:

{
  _meta: {
    timestamp: 1458835191309,
    lastEditorGuid: "7c574c79-0a4e-40af-8a9d-c6234137c230",
    ownerGuid: "7c574c79-0a4e-40af-8a9d-c6234137c230",
    creatorGuid: "84920b9e-9578-3948-0174-5f15b344d094",
    creationTimestamp: 1458835174211
  },
  guid: "8761e499-b82e-4109-af38-e8c263b80850",
  runtimeConfigId: "6d508bc2-10cf-4893-b1c1-4d9f05e682e8",
  extractorId: "f1b0eb39-d617-49f1-8571-6183881c9895",
  stoppedAt: 1458835175618,
  totalUrlCount: 5,
  successUrlCount: 4,
  failedUrlCount: 1,
  state: "FINISHED",
  urlListId: "d01de56b-b2e5-4781-83e4-945b6578d7c1",
  json: "ab62191f-5027-420a-8c39-6138c3e8e6be",
  csv: "f35baf94-62ff-428d-b35f-b55b4f1f611a",
  log: "4026cc68-b910-4f97-bbcd-f3205c680c86",
  sample: "94f14818-c307-4141-bf52-f73a41655c40"
}
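To pull individual values out of a record like this from a script, something along the following lines may be enough. It is a rough sketch, not a JSON parser: it assumes one field per line, as displayed above, and values without spaces, commas or colons (true of the GUIDs, counts and state shown). The helper names `run_field` and `run_counts_add_up` are made up for this example, and the check that success and failed counts sum to the total is only inferred from the numbers in the record above (4 + 1 = 5).

```shell
# Print the value of a field from a CrawlRun record read on stdin.
# Accepts both `key: value,` (as displayed above) and `"key": value,`.
run_field() {
  grep -E "^[[:space:]]*\"?$1\"?[[:space:]]*:" | head -n 1 | cut -d: -f2- | tr -d ' ",'
}

# Succeed only if the run is FINISHED and success + failed equals the total.
run_counts_add_up() {
  record="$(cat)"
  state=$(printf '%s\n' "$record" | run_field state)
  total=$(printf '%s\n' "$record" | run_field totalUrlCount)
  ok=$(printf '%s\n' "$record" | run_field successUrlCount)
  failed=$(printf '%s\n' "$record" | run_field failedUrlCount)
  [ "$state" = "FINISHED" ] && [ "$((ok + failed))" -eq "$total" ]
}
```

With the record saved to a file, `run_field log < run.json` (run.json being a hypothetical file name) prints the id stored in the log field.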

Each attachment (json, csv, log, sample) can be retrieved using a command like the following:

curl -b da-user-cookies.txt --remote-name  -H "Accept-Encoding: gzip" -XGET 'https://store.staging-owl.com/store/crawlRun/aaeed814-f78f-49cb-95b6-3d1963ee9a36/_attachment/log/7d421728-53b5-4107-ba32-7eb80503ec07'
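The URL in that command appears to follow the layout crawlRun/<run guid>/_attachment/<name>/<attachment id>. Assuming that holds, and assuming the json, csv, log and sample fields of the CrawlRun carry the attachment ids, the download can be wrapped as below. The function names `attachment_url` and `fetch_attachment` and the output file naming are made up for this sketch.

```shell
# Build the download URL for one attachment of a CrawlRun, following the
# path layout of the command above: crawlRun/<run guid>/_attachment/<name>/<id>.
attachment_url() {
  printf 'https://store.staging-owl.com/store/crawlRun/%s/_attachment/%s/%s\n' "$1" "$2" "$3"
}

# Download one attachment to a file named <name>-<id>, e.g. log-4026cc68-...
# The Accept-Encoding header is kept from the original command, so the saved
# file may be gzip-compressed if the server honors it.
fetch_attachment() {
  curl -b da-user-cookies.txt -H "Accept-Encoding: gzip" \
    -o "$2-$3" "$(attachment_url "$1" "$2" "$3")"
}
```

For the run shown earlier, `fetch_attachment 8761e499-b82e-4109-af38-e8c263b80850 log 4026cc68-b910-4f97-bbcd-f3205c680c86` would request the log; repeating the call with json, csv and sample and their ids fetches the rest.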