Parsec: Galaxy at the Speed of Light

Documentation Requirements Status Build Status License

Command-line utilities to assist in working with Galaxy servers.

Installation

$ pip install galaxy-parsec
$ parsec init

Questions?

Gitter

Quick Start

This quick start demonstrates using parsec commands to manipulate Galaxy histories and datasets. You will want to install jq if you do not have it already.

Connect to a Galaxy server

To connect to a running Galaxy server, you will need an account on that Galaxy instance and an API key for the account. Instructions on getting an API key can be found at http://wiki.galaxyproject.org/Learn/API .

First initialize parsec:

$ parsec init

Once initialized, parsec will be usable from the command line. Please note that an admin account is required for a few actions like creation of data libraries, or access to user API keys. Your configuration must allow access to /api without need for a username or password. More infomration can be found at https://galaxyproject.org/admin/config/performance/production-server/

Introduction To Parsec

Parsec is a set of automatically generated wrappers for BioBlend functions. I found myself writing a large number of small / one-off scripts that invoked simple bioblend functions. These scripts were impossible to compose and use in a linux-friendly manner. I copied and pasted code between all of these utility scripts.

Parsec is the answer to all of these problems. It extracts all of the individual functions I was writing as separate CLI commands that can be piped together, run in parallel, etc.

After installation, running parsec will present you with a list of sub-commands you can execute.

$ parsec
Usage: parsec [OPTIONS] COMMAND [ARGS]...

  Command line wrappers around BioBlend functions. While this sounds
  unexciting, with parsec and jq you can easily build powerful command line
  scripts.

Options:
  --version               Show the version and exit.
  -v, --verbose           Enables verbose mode.
  --galaxy_instance TEXT  name of galaxy instance from ~/.planemo.yml
                          [required]
  --help                  Show this message and exit.

Commands:
  config
  datasets
  datatypes
  folders
  forms
  ...

Each of these commands has more commands under it:

$ parsec histories
Usage: parsec histories [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  create_dataset_collection       Create a new dataset collection
  create_history                  Create a new history, optionally setting
                                  the...
  create_history_tag              Create history tag
  delete_dataset                  Mark corresponding dataset as deleted.
  delete_dataset_collection       Mark corresponding dataset collection as...
  delete_history                  Delete a history.
  download_dataset                Deprecated method, use...
  download_history                Download a history export archive.
  export_history                  Start a job to create an export archive
                                  for...
  ...

Viewing Histories and Datasets

To get information on the Histories currently in your account, call history get_histories, and we will pipe this to a jq command which selects the first element from the JSON array.

$ parsec histories get_histories | jq '.[0]'

Parsec will respond with information about your first history

{
  "name": "BuildID=Manual-2017.05.02T16:13 WF=PAP_2017_Comparative_(v1.0)_BOOTSTRAPPED Org=CCS Source=Jenkins",
  "url": "/galaxy/api/histories/548c0777ac615645",
  "annotation": null,
  "model_class": "History",
  "id": "548c0777ac615645",
  "tags": [
    "Automated",
    "Annotation",
    "BICH464"
  ],
  "purged": false,
  "published": false,
  "deleted": false
}

This may not be all of the information you were expecting about your history. In that case, you might want to call show_history which will show you more details about a single history. You can either manually type parsec histories show_history 548c0777ac615645, or we can do this in batch:

$ parsec histories get_histories | jq '.[0].id' | xargs -n 1 parsec histories show_history

Which pulls out the first history, select the id attribute, before passing it to xargs. If you have not used it before, xargs allows us to execute multiple commands for some input data. Here we execute the command parsec histories show_history for each line of input (i.e. each ID returned to us from the jq call). xargs -n 1 ensures that we will only pass a single ID to a single call of show_history. If you were to use jq '.[].id' instead of jq '.[0].id' it would output the IDs for every history you own. You could then pipe this to xargs and run show_history on all of your histories!

{
  "annotation": null,
  "contents_url": "/galaxy/api/histories/548c0777ac615645/contents",
  "create_time": "2017-05-02T16:18:21.285382",
  "deleted": false,
  "empty": false,
  "genome_build": null,
  "id": "548c0777ac615645",
  "importable": true,
  "model_class": "History",
  "name": "BuildID=Manual-2017.05.02T16:13 WF=PAP_2017_Comparative_(v1.0)_BOOTSTRAPPED Org=CCS Source=Jenkins",
  "published": false,
  "purged": false,
  "size": 34760258,
  "slug": "buildidmanual-20170502t1613-wfpap2017comparativev10bootstrapped-orgccs-sourcejenkins",
  "state": "ok",
  "state_details": {
    "discarded": 0,
    "empty": 0,
    "error": 0,
    "failed_metadata": 0,
    "new": 0,
    "ok": 29,
    "paused": 0,
    "queued": 0,
    "running": 0,
    "setting_metadata": 0,
    "upload": 0
  },
  "state_ids": {
    "discarded": [
      "a6cc986453fae8ba",
      "f2f9b7b017f20578",
      "70eb5af78c588bd1"
    ],
    "empty": [],
    "error": [
      "d643e34e1114cc52",
      "98ae3d35d73f82c9"
    ],
    "failed_metadata": [],
    "new": [],
    "ok": [
      "e510305efbee5f49",
      "0d595b7c2b6e9b93",
      "d04ac6f949ae266c",
      "175f283ddaeca39c",
      "b34432b8a0847c04",
      "ea7ff5323ddebcb8",
      "3e40a393efafc45c",
      "7ce5ec5d51ef85cb",
      "577e4242cdfbe1aa",
      "193d15527d13f45e",
      "4543f9456af7f0df",
      "5e1293df75b4f95b",
      "a57bae35eca5fbfe",
      "6c306b2ed4533f1f",
      "97c5f81b159505f0",
      "64d1d8e46b4554bd",
      "8e9432496d7e2b43",
      "5c8579257c579aae",
      "243ad216fbfa268e",
      "8336d9eb27b27677",
      "a1d4cc61bdba629d",
      "7f93a80890822fa9",
      "c479b351902302e2",
      "36b60fb58ad24a71",
      "041dd3cb6879f1f7",
      "36992e90715c9c77",
      "4bddfe152467e972",
      "2d9f5c0c36d89e10",
      "e53ad6f3133b2816"
    ],
    "paused": [
      "4a8143557292a233",
      "b0f8a75aa6be2c1d"
    ],
    "queued": [],
    "running": [],
    "setting_metadata": [],
    "upload": []
  },
  "tags": [
    "Automated",
    "Annotation",
    "BICH464"
  ],
  "update_time": "2017-05-02T16:49:07.941097",
  "url": "/galaxy/api/histories/548c0777ac615645",
  "user_id": "f570ade6e7840ba0",
  "username_and_slug": "u/helena-rasche/h/buildidmanual-20170502t1613-wfpap2017comparativev10bootstrapped-orgccs-sourcejenkins"
}

So much metadata to play with and filter on! Note that many of these commands have additional flags, for example parsec histories show_history --help will tell us that we can also pass the –contents option to retrieve a list of datasets in that history, even filtering on their visibility.

$ parsec histories show_history --help
Usage: parsec histories show_history [OPTIONS] HISTORY_ID

  Get details of a given history. By default, just get the history meta
  information.

Options:
  --contents      When ``True``, the complete list of datasets in the given
                  history.
  --deleted TEXT  Used when contents=True, includes deleted datasets in
                  history dataset list
  --visible TEXT  Used when contents=True, includes only visible datasets in
                  history dataset list
  --details TEXT  Used when contents=True, includes dataset details. Set to
                  'all' for the most information

Thus with a simple query

$ parsec histories show_history 548c0777ac615645 --contents --deleted True | jq -S '.[0]'

We see the first deleted dataset in the history.

{
  "create_time": "2017-05-02T16:18:54.272050",
  "dataset_id": "93c926a0dabafde3",
  "deleted": true,
  "extension": "fasta",
  "hid": 30,
  "history_content_type": "dataset",
  "history_id": "548c0777ac615645",
  "id": "d643e34e1114cc52",
  "name": "Feature Sequence Export Unique on data 27 and data 20",
  "purged": false,
  "state": "error",
  "type": "file",
  "type_id": "dataset-d643e34e1114cc52",
  "update_time": "2017-05-02T16:47:57.807506",
  "url": "/galaxy/api/histories/548c0777ac615645/contents/d643e34e1114cc52",
  "visible": true
}

This gives us a dictionary containing the History’s metadata. With contents=False (the default), we only get a list of ids of the datasets contained within the History; with contents=True we would get metadata on each dataset. We can also directly access more detailed information on a particular dataset by passing its id to the show_dataset method:

$ parsec datasets_show_dataset 10a4b652da44e82a
{
    "accessible": true,
    "annotation": null,
    "api_type": "file",
    "create_time": "2015-02-27T23:46:27.642906",
    "data_type": "galaxy.datatypes.data.Text",
    "dataset_id": "10a4b652da44e82a",
    "deleted": false,
    "display_apps": [],
    "display_types": [],
    "download_url": "/api/histories/f3c2b0f3ecac9f02/contents/10a4b652da44e82a/display",
    "extension": "fastq",
    "file_ext": "fastq",
    "file_path": null,
    "file_size": 16527060,
    "genome_build": "dm3",
    "hda_ldda": "hda",
    "hid": 1,
    "history_content_type": "dataset",
    "history_id": "f3c2b0f3ecac9f02",
    "id": "10a4b652da44e82a",
    "meta_files": [],
    "metadata_data_lines": 4,
    "metadata_dbkey": "dm3",
    "misc_blurb": "15.8 MB",
    "misc_info": "uploaded fastqsanger file",
    "model_class": "HistoryDatasetAssociation",
    "name": "C1_R2_1.chr4.fq",
    "purged": false,
    "resubmitted": false,
    "state": "ok",
    "tags": [],
    "type": "file",
    "update_time": "2015-02-27T23:46:34.659590",
    "url": "/api/histories/f3c2b0f3ecac9f02/contents/10a4b652da44e82a",
    "uuid": "ccad6f3a-f75d-472f-9142-2d4c39ad1a35",
    "visible": true,
    "visualizations": []
}

On JQ

It is worth it to look at some of the things possible with JQ for a moment. The above example may not be so exciting at first blush, but you can do incredible things with the combination of parsec, jq, and xargs. Here are some examples to consider:

  • find all histories with a public link, but not published in the shared-histories section, and print out their history name and the shared link.

    $ parsec histories get_histories | \
       jq '.[].id' | \
       xargs -n 1 parsec histories show_history | \
       jq '. | select(.published == false) | select(.importable == true) | [.published, .importable, .id, .username_and_slug] | @tsv' -r
    
  • reset the API keys for 30 users at once.

    $ parsec users get_users | \
       jq '.[] | \
       select(.username | contains("janedoe")) | .id' | \
       xargs -n 1 parsec users create_user_apikey
    
  • download all of the OK datasets in a set of histories

    $ parsec histories get_histories | \
       jq '.[].id' | \ # Or other, more complex filtering?
       xargs -n 1 parsec histories show_history | \ # Get history details
       jq '.state_ids.ok[]' | \ # Find OK datasets
       xargs -n 1 parsec datasets download_dataset --file_path '.' --use_default_filename # Download
    

View Workflows

Methods for accessing workflows are grouped under GalaxyInstance.workflows.*.

To get information on the Workflows currently in your account, use:

$ parsec workflows get_workflows
[
    {
        'id': 'e8b85ad72aefca86',
        'name': u"TopHat + cufflinks part 1",
        'url': '/api/workflows/e8b85ad72aefca86'
    },
    {
       'id': 'b0631c44aa74526d',
        'name': 'CuffDiff',
        'url': '/api/workflows/b0631c44aa74526d'
    }
]

For example, to further investigate a workflow, we can request:

$ parsec workflows show_workflow ded67e5aa1371841 | jq 'del(.steps)'

The workflow output is generally quite large as it embeds a full copy of the workflow. In the above JQ command I have removed the steps attribute from the output for brevity.

{
  "annotation": "",
  "model_class": "StoredWorkflow",
  "latest_workflow_uuid": "94c40212-c4bb-43b7-a43b-eadc1a3b2894",
  "id": "ded67e5aa1371841",
  "url": "/galaxy/api/workflows/ded67e5aa1371841",
  "deleted": false,
  "tags": [],
  "owner": "helena-rasche",
  "name": "PAP 2017 Functional (v8.15)",
  "inputs": {
    "0": {
      "value": "",
      "uuid": "9397916e-afb7-4e48-b89e-d4c99bf202de",
      "label": "Apollo Organism JSON File"
    },
    "2": {
      "value": "",
      "uuid": "eca835c6-328a-4698-a387-d0719b24d19d",
      "label": "Genome Sequence"
    },
    "1": {
      "value": "",
      "uuid": "5511d038-e96b-49b2-998a-d037935f6e06",
      "label": "Annotation Set"
    }
  },
  "published": false
}

View Users

Methods for managing users are grouped under GalaxyInstance.users.*. User management is only available to Galaxy administrators, that is, the API key used to connect to Galaxy must be that of an admin account.

To get a list of users, call:

$ parsec users get_users
[
    {
        "username": "test",
        "model_class": "User",
        "email": "test@local.host",
        "id": "f2db41e1fa331b3e"
    },
    ...
]

In Depth Example

As a more detailed example, we’ll launch a simple workflow.

Step 1. What are the Inputs

$ parsec workflows show_workflow ded67e5aa1371841 | jq .inputs > inputs.json

In practice this file probably looks similar to this:

{
  "0": {
    "value": "",
    "uuid": "9397916e-afb7-4e48-b89e-d4c99bf202de",
    "label": "Apollo Organism JSON File"
  },
  "2": {
    "value": "",
    "uuid": "eca835c6-328a-4698-a387-d0719b24d19d",
    "label": "Genome Sequence"
  },
  "1": {
    "value": "",
    "uuid": "5511d038-e96b-49b2-998a-d037935f6e06",
    "label": "Annotation Set"
  }
}

Step 2: Prepare History and Load Datasets

First, we’ll create a history to manage all of our work:

$ HISTORY_ID=$(parsec histories create_history | jq .id)
$ parsec histories update_history --name 'Parsec test'

Next we have to fetch some datasets. You could upload them:

$ parsec tools upload_file my-file.gff3 $HISTORY_ID

But in my case, I need to run a tool which produces them:

JOB_ID=$(parsec tools run_tool $HISTORY_ID edu.tamu.cpt2.webapollo.export \
   '{"org_source|source_select": "direct", "org_source|org_raw": "Miro"}' | \
   jq .id)

$ parsec jobs show_job .outputs $JOB_ID

By storing the job ID in a variable, we can make repeated requests to check on it. The second parsec statement fetches the output datasets from this step.

{
  "fasta_out": {
    "id": "61513e15ce98c986",
    "src": "hda",
    "uuid": "0de1442b-c410-4a38-b9ca-49cff973d9b8"
  },
  "gff_out": {
    "id": "62ee69adcf74378c",
    "src": "hda",
    "uuid": "887aaf6f-ed07-4ee8-a396-c16612f83d83"
  },
  "json_out": {
    "id": "1f73e96543934ac8",
    "src": "hda",
    "uuid": "3be3d364-83c5-4a23-87fa-ebd8c27f2094"
  }
}

Step 3: Invoking the Workflow

Remembering back to the inputs in step 1, we will match them up and create an inputs.json file

  • 0 / organism json file => json_out
  • 1 / genome sequence => gff_out
  • 2 / annotation set => fasta_out

This gives us an inputs.json that looks like so:

{
  "0": {
    "id": "1f73e96543934ac8",
    "src": "hda"
  },
  "1": {
    "id": "62ee69adcf74378c",
    "src": "hda"
  },
  "2": {
    "id": "61513e15ce98c986",
    "src": "hda"
  }
}

We can now invoke our workflow using parsec! Since the inputs is a JSON parameter, it can be supplied many different ways for your convenience. All of the following behave identically.

$ cat params.json | parsec jobs search_jobs -; # Stdin
$ parsec jobs search_jobs params.json; # Filename
$ parsec jobs search_jobs $(cat params.json); # String argument

Running the invocation:

$ parsec workflows invoke_workflow ded67e5aa1371841 --inputs inputs.json --history_id $HISTORY_ID

Produces a very succinct workflow launch output:

{
    "uuid": "94246003-2f8b-11e7-9427-20474784cc00",
    "state": "new",
    "workflow_id": "3daf5606d767a471",
    "id": "c7f60cfda02f0f46",
    "update_time": "2017-05-02T23:03:39.693288",
    "model_class": "WorkflowInvocation",
    "history_id": "0d17c6f8cd8d49a5"
}

We can now use parsec to check on the status of all of the datasets:

$ parsec workflows show_invocation 3daf5606d767a471 c7f60cfda02f0f46 | jq '.steps[].state' | sort | uniq -c
   3 "running"
  72 "new"
   3 null
   1 "ok"

Or we can use one of the utility scripts to wait on that workflow to finish before continuing on to some other task:

$ parsec utils wait_on_invocation 3daf5606d767a471 c7f60cfda02f0f46 && ...

License

Copyright 2016-2017 Galaxy IUC

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Support

This material is based upon work supported by the National Science Foundation under Grant Number (Award 1565146)