Sample Scripts

This section provides reusable examples of scripts that allow you to perform actions on Genestack that would be tedious to accomplish through the web interface, such as multiple file uploads with custom metadata.

Retrieving task logs

This is a simple script to retrieve and print the logs of an initialization task on Genestack.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from future import standard_library
standard_library.install_aliases()
from builtins import *
from genestack_client import TaskLogViewer, get_connection, make_connection_parser

# add extra arguments to the Genestack arguments parser for this script
parser = make_connection_parser()
parser.add_argument('-f', '--follow', action='store_true', help="Follow the logs' output if the task is not done")
parser.add_argument('-t', '--type', metavar='<log_type>',
                    choices=[TaskLogViewer.STDERR, TaskLogViewer.STDOUT],
                    default=TaskLogViewer.STDOUT,
                    help="Type of logs to display ('{0}' or '{1}' ; default is '{0}')".format(
                        TaskLogViewer.STDOUT, TaskLogViewer.STDERR))
parser.add_argument('accession', metavar='<accession>', help='Accession of the file for which to display the logs')
arguments = parser.parse_args()

# connect to Genestack
connection = get_connection(arguments)
log_viewer = TaskLogViewer(connection)

# print task logs
log_viewer.print_log(arguments.accession, log_type=arguments.type, follow=arguments.follow)

The script connects to Genestack and uses the TaskLogViewer class defined in the client library to retrieve the logs for a file whose accession is passed as a command-line parameter to the script.

TaskLogViewer is a child class of Application, and it provides an interface to the Genestack application genestack/task-log-viewer, which exposes a public Java method (getFileInitializationLog) to access the initialization logs of a file.
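
If the script above is saved as, say, get_task_logs.py (the file name is arbitrary), it can be run from a terminal like this, where <log_type> is one of the two values listed in the --type help message:

python get_task_logs.py -u <user_alias> --follow --type <log_type> <accession>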

Uploading multiple files with custom metainfo

A typical situation when uploading data: you have raw sequencing files stored somewhere (on an FTP site, on your local computer, etc.) and a spreadsheet with information about these files that you want to record in Genestack.

So let’s imagine that we have a comma-delimited CSV file with the following format:

name,organism,disease,link
HPX12,Homo sapiens,lung cancer,ftp://my.ftp/raw_data/HPX12_001.fq.gz
HPZ24,Homo sapiens,healthy,ftp://my.ftp/raw_data/HPZ24_001.fq.gz
.........

Now let’s write a Python script that uses the Genestack Client Library to upload these files to a Genestack instance, with the right metainfo. The script will take as input the CSV file, and create a Genestack Experiment with a Sequencing Assay for each row of the CSV file.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
from __future__ import unicode_literals

from future import standard_library
standard_library.install_aliases()
from builtins import *
import csv

from genestack_client import (BioMetaKeys, DataImporter, GenestackException, Metainfo,
                              get_connection, make_connection_parser, unaligned_reads)

# keys that must be supplied in the CSV file
MANDATORY_KEYS = ['name', 'link']

# keys that have existing dedicated "Genestack" metainfo key names
SPECIAL_KEYS = {'name': Metainfo.NAME, 'organism': BioMetaKeys.ORGANISM,
                'method': BioMetaKeys.METHOD, 'sex': BioMetaKeys.SEX,
                'cell line': BioMetaKeys.CELL_LINE}

# parse script arguments
parser = make_connection_parser()
parser.add_argument('csv_file', help='Path to the local comma-delimited CSV file containing the data')
parser.add_argument('--name', help='Name of the experiment to create in Genestack')
parser.add_argument('--description', help='Description of the experiment to display in Genestack')

args = parser.parse_args()
csv_input = args.csv_file

print('Connecting to Genestack...')

# get connection and application handlers
connection = get_connection(args)
importer = DataImporter(connection)

# file format of the reads to import
file_format = unaligned_reads.compose_format_map(unaligned_reads.Space.BASESPACE, unaligned_reads.Format.PHRED33,
                                                unaligned_reads.Type.SINGLE)

# create the experiment where we will store the data in Genestack
experiment = importer.create_experiment(name=args.name or "Imported experiment",
                                        description=args.description or "No description provided")

print('Created a new experiment with accession %s...' % experiment)


# parse the CSV file
with open(csv_input, 'r') as the_file:
    reader = csv.DictReader(the_file, delimiter=",")
    field_names = reader.fieldnames

    # check if mandatory keys are in the CSV file
    for mandatory_key in MANDATORY_KEYS:
        if mandatory_key not in field_names:
            raise GenestackException("The key '%s' must be supplied in the CSV file" % mandatory_key)

    for file_data in reader:

        # for each entry, prepare a Metainfo object
        metainfo = Metainfo()
        for key in field_names:
            # 'link' and 'organism' are treated separately, as they are added to the metainfo using specific methods
            if key == "link":
                url = file_data[key]
                metainfo.add_external_link(key=BioMetaKeys.READS_LINK, text="link", url=url, fmt=file_format)
            elif key == "organism":
                metainfo.add_string(BioMetaKeys.ORGANISM, file_data[key])
            # all the other keys are added as strings
            else:
                metainfo_key = SPECIAL_KEYS.get(key.lower(), key)
                metainfo.add_string(metainfo_key, file_data[key])

        # create the sequencing assay on Genestack
        created_file = importer.create_sequencing_assay(experiment, metainfo=metainfo)

        print('Created file "%s" (%s)' % (file_data['name'], created_file))

print('All done! Bye now...')

This script uses many features from the client library:

  • we start by adding arguments to the argument parser to process our metadata file
  • then we establish a connection to Genestack, and instantiate a DataImporter to be able to create our files on Genestack
  • we create the experiment where we will store our data
  • we parse the CSV file using a Python csv.DictReader, create a new metainfo object for each row and a corresponding Sequencing Assay with that metainfo

We can then run the script like this:

python make_experiment_from_csv.py --name "My experiment" --description "My description" my_csv_metadata_file.csv

The metainfo of each Sequencing Assay specified in the CSV file needs to contain at least a name and a valid link (either to a local or a remote file). By default, the experiment is created inside the user’s Imported Files folder on Genestack, since we have not specified a folder.
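
If you want the experiment to be created in a specific folder instead, DataImporter should let you pass a parent folder when creating the experiment. The snippet below is a sketch only: the parent argument and the folder accession GSF000123 are assumptions, so check the DataImporter reference for the exact signature in your client version.

# sketch: create the experiment inside a specific folder instead of Imported Files
# 'GSF000123' is a hypothetical folder accession and the 'parent' argument is assumed here
experiment = importer.create_experiment(parent='GSF000123',
                                        name=args.name or "Imported experiment",
                                        description=args.description or "No description provided")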

Note

One could easily extend this script to support two files per sample (in the case of paired-end reads).
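
For instance, one way to do this (a sketch only: the link2 column is a hypothetical addition to the CSV file, and the unaligned_reads.Type.PAIRED constant is assumed to be available in your client version) is to compose a paired-end format map and replace the 'link' branch of the loop with something like the following, also skipping 'link2' in the generic string branch:

# sketch: handle an optional second mate column (hypothetical 'link2' column)
paired_format = unaligned_reads.compose_format_map(unaligned_reads.Space.BASESPACE,
                                                   unaligned_reads.Format.PHRED33,
                                                   unaligned_reads.Type.PAIRED)  # assumed constant

if file_data.get('link2'):
    # paired-end reads: add both mates with a paired-end format
    metainfo.add_external_link(key=BioMetaKeys.READS_LINK, url=file_data['link'], fmt=paired_format)
    metainfo.add_external_link(key=BioMetaKeys.READS_LINK, url=file_data['link2'], fmt=paired_format)
else:
    # single-end reads: keep the original behaviour
    metainfo.add_external_link(key=BioMetaKeys.READS_LINK, url=file_data['link'], fmt=file_format)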

Importing ENCODE RNA-seq data

We can extend the previous script to download data from the ENCODE project. If we select a list of experiments from the ENCODE experiment matrix, we can obtain a link to a TSV file which contains all the metadata and data links for the matching assays. For instance, this is the link for all FASTQ files from human RNA-seq experiments.

By browsing this TSV file, we see that it contains the following useful fields:

  • File accession
  • Experiment accession
  • Biosample sex
  • Biosample organism
  • Biosample term name: cell line or tissue
  • Biosample Age
  • Paired with: if the sample was paired-end, this field points to the file accession of the other mate

We also notice that file download URLs always follow the template:

https://www.encodeproject.org/files/<FILE_ACCESSION>/@@download/<FILE_ACCESSION>.fastq.gz

We can use this observation to generate the reads URLs from the fields File accession and possibly Paired with. We use the following logic: we read through the metadata file, while keeping a set of all the accessions of the paired FASTQ files handled so far. If the current line corresponds to a file that has already been created (second mate of a paired-end file), then we skip it. Otherwise we prepare a metainfo object for the file and create the Genestack file.

If the row contains a Paired with accession, we also add the corresponding URL to the current metadata, and add the accession to the set of FASTQ files seen so far.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# This script parses ENCODE metadata files such as this one:
# https://www.encodeproject.org/metadata/type=Experiment&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&files.file_type=fastq&assay_title=RNA-seq/metadata.tsv


from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
from __future__ import unicode_literals

from future import standard_library
standard_library.install_aliases()
from builtins import *
import csv

from genestack_client import (BioMetaKeys, DataImporter, Metainfo, get_connection,
                              make_connection_parser)

# ENCODE data
FILE_ACCESSION = "File accession"
PAIRED_ACCESSION = "Paired with"

# dictionary: ENCODE file column name -> Genestack metainfo key (None when identical)
VALID_FIELDS = {
    FILE_ACCESSION: Metainfo.NAME,
    "Experiment accession": None,
    "Biosample sex": BioMetaKeys.SEX,
    "Biosample organism": BioMetaKeys.ORGANISM,
    "Biosample Age": None,
    "Biosample term name": BioMetaKeys.CELL_TYPE,
    "Platform": BioMetaKeys.PLATFORM
}

ENCODE_URL_PATTERN = "https://www.encodeproject.org/files/{0}/@@download/{0}.fastq.gz"

# parse script arguments
parser = make_connection_parser()
parser.add_argument('tsv_file', metavar='<tsv_file>',
                    help='Path to the local tab-delimited file containing the data')

args = parser.parse_args()
tsv_input = args.tsv_file

print('Connecting to Genestack...')

# get connection and application handlers
connection = get_connection(args)
importer = DataImporter(connection)

# create the experiment where we will store the data in Genestack
experiment = importer.create_experiment(name="ENCODE Human RNA-seq",
                                        description="Human RNA-seq assays from ENCODE")

print('Created a new experiment with accession %s...' % experiment)

created_pairs = set()

# parse the TSV file
with open(tsv_input, 'r') as the_file:
    reader = csv.DictReader(the_file, dialect='excel-tab')
    field_names = reader.fieldnames

    for file_data in reader:

        # skip the entry if the file was already included in a previously created paired-end assay
        if file_data[FILE_ACCESSION] in created_pairs:
            continue

        # for each entry, prepare a Metainfo object
        metainfo = Metainfo()
        for key in VALID_FIELDS.keys():
            metainfo.add_string(VALID_FIELDS.get(key) or key, file_data[key])
        metainfo.add_external_link(BioMetaKeys.READS_LINK, ENCODE_URL_PATTERN.format(file_data[FILE_ACCESSION]))

        if file_data.get(PAIRED_ACCESSION):
            # record the accession of the second mate and add its URL to the metainfo
            metainfo.add_string(PAIRED_ACCESSION, file_data[PAIRED_ACCESSION])
            metainfo.add_external_link(BioMetaKeys.READS_LINK, ENCODE_URL_PATTERN.format(file_data[PAIRED_ACCESSION]))
            created_pairs.add(file_data[PAIRED_ACCESSION])

        # create the sequencing assay on Genestack
        created_file = importer.create_sequencing_assay(experiment, metainfo=metainfo)

        print('Created file "%s" (%s)' % (file_data[FILE_ACCESSION], created_file))

print('All done!')
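
If the script is saved as, say, import_encode_data.py (the file name is arbitrary), it can be run against a downloaded metadata file like this:

python import_encode_data.py -u <user_alias> metadata.tsv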

Editing the metainfo of existing files

In the real world, data and metadata live in different places and you may not have access to both of them at the same time. Sometimes you may have already uploaded data to Genestack and only be provided with the metadata later on. The following script takes as input a comma-delimited CSV file containing metadata and adds that metadata to existing files on Genestack. The files should be located in a specific folder, and records in the CSV file are matched to remote files by file name. The name of the CSV column that holds the file names must be supplied to the script as the local_key parameter.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
from __future__ import unicode_literals

from future import standard_library
standard_library.install_aliases()
from builtins import *
import csv

from genestack_client import (BioMetaKeys, FilesUtil, GenestackException, Metainfo, get_connection,
                              make_connection_parser)

# keys that have existing dedicated "Genestack" metainfo key names
SPECIAL_KEYS = {'name': Metainfo.NAME, 'organism': BioMetaKeys.ORGANISM,
                'method': BioMetaKeys.METHOD, 'sex': BioMetaKeys.SEX,
                'gender': BioMetaKeys.SEX, 'age': BioMetaKeys.AGE,
                'cell line': BioMetaKeys.CELL_LINE, 'accession': Metainfo.ACCESSION}


# Logic to parse 'boolean-like' values as metainfo booleans
TRUE_VALUES = {'true', 'yes', 'y'}
FALSE_VALUES = {'false', 'no', 'n'}


def parse_as_boolean(s):
    if s.lower().strip() in TRUE_VALUES:
        return True
    elif s.lower().strip() in FALSE_VALUES:
        return False
    return None


if __name__ == "__main__":
    # parse script arguments
    parser = make_connection_parser()
    parser.add_argument('csv_file', help='Path to the local comma-delimited CSV file containing the data')
    parser.add_argument('local_key', help='Name of the local key to match CSV records and Genestack files names')
    parser.add_argument('folder', help='Accession of the Genestack folder containing the files')

    args = parser.parse_args()
    csv_input = args.csv_file
    local_key = args.local_key

    print('Connecting to Genestack...')

    # get connection and application handlers
    connection = get_connection(args)
    files_util = FilesUtil(connection)

    print('Collecting files...')
    files = files_util.get_file_children(args.folder)
    print('Found %d files. Collecting metadata...' % len(files))
    infos = files_util.get_infos(files)

    identifier_map = {info['name']: info['accession'] for info in infos}

    # parse the CSV file
    with open(csv_input, 'r') as the_file:
        reader = csv.DictReader(the_file, delimiter=",")
        field_names = reader.fieldnames

        if args.local_key not in field_names:
            raise GenestackException("Error: the local key %s is not present in the supplied CSV file" % args.local_key)

        for file_data in reader:
            # find the corresponding file
            local_identifier = file_data[local_key]
            remote_file = identifier_map.get(local_identifier)
            if not remote_file:
                print('Warning: no match found for file name "%s"' % local_identifier)
                continue

            # prepare a Metainfo object
            metainfo = Metainfo()
            for key in field_names:
                # key parsing logic
                value = file_data[key]
                if value == "" or value is None:
                    continue
                if key == args.local_key:
                    continue
                if key == "organism":
                    metainfo.add_string(BioMetaKeys.ORGANISM, value)
                else:
                    metainfo_key = SPECIAL_KEYS.get(key.lower(), key)
                    if parse_as_boolean(value) is not None:
                        metainfo.add_boolean(metainfo_key, parse_as_boolean(value))
                    else:
                        metainfo.add_string(metainfo_key, value)

            # edit the metadata on Genestack
            files_util.add_metainfo_values(remote_file, metainfo)
            print("Edited metainfo for '%s' (%s)" % (local_identifier, remote_file))

    print('All done!')

For instance, imagine we have the following CSV file:

file_name,organism,disease
Patient 1,Homo sapiens,Asthma
Patient 2,Homo sapiens,Healthy
....

The script is then called with the following syntax:

python add_metainfo_from_table.py my_csv_file.csv file_name GSF12345

Organising files into folders based on their metainfo

Keeping your files organised is not always easy. When many files belong to the same project, a common approach is to group them into folders based on the application that created them. The following script takes as input a folder of files and organises these files into subfolders, such that all files created with the same application go into the same subfolder. We also provide an option to unlink the files from their folder of origin. The script illustrates the use of the FilesUtil class to perform typical file manipulation operations; a minimal sketch of such a script is shown below.
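
The sketch below makes two assumptions that you should verify against your version of the client library: that the dictionaries returned by FilesUtil.get_infos include an 'application' entry with the name of the creating application, and that FilesUtil provides an unlink_file method.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

from genestack_client import FilesUtil, get_connection, make_connection_parser

# parse script arguments
parser = make_connection_parser()
parser.add_argument('folder', help='Accession of the Genestack folder containing the files to organise')
parser.add_argument('--move-files', action='store_true',
                    help='Unlink the files from their folder of origin after grouping them')
args = parser.parse_args()

print('Connecting to Genestack...')
connection = get_connection(args)
files_util = FilesUtil(connection)

print('Collecting files...')
files = files_util.get_file_children(args.folder)
infos = files_util.get_infos(files)

# group the files into subfolders named after the application that created them
subfolders = {}  # application name -> subfolder accession
for info in infos:
    # the structure of the 'application' entry is assumed; adjust it to what get_infos actually returns
    application = (info.get('application') or {}).get('name') or 'Unknown application'
    if application not in subfolders:
        subfolders[application] = files_util.create_folder(application, parent=args.folder)
    files_util.link_file(info['accession'], subfolders[application])
    if args.move_files:
        # unlink_file is assumed to be available in FilesUtil
        files_util.unlink_file(info['accession'], args.folder)
    print('Grouped "%s" into folder "%s"' % (info['name'], application))

print('All done!')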

The script can be called with the following syntax:

python group_files_into_folders.py [--move-files] <source_folder_accession>

You can easily adapt the script to group files based on some other criterion from their metainfo, like their organism, their creation date, or in fact any metainfo value.

Running a data analysis pipeline

Generally, if you want to run multiple files through the same analysis pipeline, the easiest way to do so is with the Data Flow Editor in the web interface. This tool is powerful enough to cover most use cases. However, some complex pipelines are not supported by the Data Flow Editor; in that case, you can write your own script to create all the pipeline files on Genestack programmatically.

Our script will take as input the Genestack accession of a folder containing Unaligned Reads files. It will then produce for each file three downstream files: a Mapped Reads file produced with Bowtie (possibly using a custom reference genome), a Mapped Reads QC Report, and a Variant Calling file.

To do this, we define a BatchFilesCreator class with some simple methods to create multiple files with a CLA. We also create a BowtieBatchFilesCreator, inheriting from this class, which has additional logic to change the reference genome. You can easily adapt this logic to your own pipeline.

Moreover, we will create Variant Calling files with non-default parameters (specifically, we only want to call SNPs and not indels). To do this, we use the method CLApplication.change_command_line_arguments, which takes as input the accession of the CLA file and the list of command-line strings to use.

Since the syntax of the command-line strings can vary from one CLA to another, the easiest way to know what command-line strings you should specify is to first create a file with the corresponding CLA on Genestack and change the parameters to the ones you want using the user interface. Then, look at the metainfo of the file (for example by clicking the name of the file inside the CLA page and choosing “View metainfo”). The metainfo field “Parameters” will store the list of command-line strings that you want (strings are separated by commas in the Metainfo Viewer dialog).
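
For instance, once you have copied these strings from the Metainfo Viewer, applying them to an already created file comes down to a single call. The snippet below is a sketch: GSF000123 is a hypothetical accession of a Variant Calling file, and the placeholder strings stand for the values copied from the “Parameters” field.

from genestack_client import VariationCaller2Application, get_connection, make_connection_parser

connection = get_connection(make_connection_parser().parse_args())
vc_app = VariationCaller2Application(connection)

# 'GSF000123' is a hypothetical accession; replace the placeholders with the copied parameter strings
vc_app.change_command_line_arguments('GSF000123', ["<first parameters string>",
                                                   "<second parameters string>"])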

Note

File references (like reference genomes) are not specified in the parameters strings. They are stored in separate metainfo fields. The code for BowtieBatchFilesCreator illustrates how to use a custom reference genome for the Bowtie-based Unspliced Mapping CLA.

#!/usr/bin/env python
# -*- coding: utf-8 -*-


from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
from __future__ import unicode_literals

from future import standard_library
standard_library.install_aliases()
from builtins import *
from builtins import object
from genestack_client import (AlignedReadsQC, BioMetaKeys, BowtieApplication, FilesUtil,
                              SpecialFolders, VariationCaller2Application, get_connection,
                              make_connection_parser)


# base class to create multiple files with a CLA
class BatchFilesCreator(object):
    def __init__(self, cla, base_folder, friendly_name, custom_args=None):
        """
        Constructor of the general batch files creator, to create multiple files from a CLA.

        :param cla: a ``CLApplication`` object, wrapper for the corresponding CLA
        :param base_folder: accession of the base folder where the pipeline files will be organised into subfolders
        :param friendly_name: user-friendly name of the files produced by the app; used in the on-screen statements
        and in the name of the project subfolders
        :param custom_args: list of custom command-line argument strings for the files. Default is ``None``
        """

        self._cla = cla
        self._files_util = FilesUtil(cla.connection)
        self._base_folder = base_folder
        self._friendly_name = friendly_name
        self._custom_args = custom_args

    def create_files(self, sources):
        print('Creating %s files...' % self._friendly_name)
        output_folder = self._files_util.create_folder(self._friendly_name, parent=self._base_folder)
        output_files = []
        for i, source in enumerate(sources, 1):
            output = self._create_output_file(source)
            self._files_util.link_file(output, output_folder)
            print('Created %s file %s (%d/%d)' % (self._friendly_name, output, i, len(sources)))
            output_files.append(output)
        return output_files

    # this method can be overridden in child classes to allow for more complex file creation logic
    def _create_output_file(self, source):
        output = self._cla.create_file(source)
        if self._custom_args:
            self._cla.change_command_line_arguments(output, self._custom_args)
        return output


# special class for Bowtie to replace the default reference genome
class BowtieBatchFilesCreator(BatchFilesCreator):
    def __init__(self, cla, base_folder, friendly_name, custom_args=None, ref_genome=None):
        BatchFilesCreator.__init__(self, cla, base_folder, friendly_name, custom_args)
        self._ref_genome = ref_genome

    def _create_output_file(self, source):
        output = BatchFilesCreator._create_output_file(self, source)
        # replace reference genome
        if self._ref_genome:
            self._files_util.remove_metainfo_value([output], BioMetaKeys.REFERENCE_GENOME)
            self._cla.replace_file_reference(output, BioMetaKeys.REFERENCE_GENOME, None, self._ref_genome)
        return output

# These CLA arguments correspond to all default options except the type of variants to look for (SNPs only).
# The easiest way to know the syntax of the command-line arguments for a specific app is to look at the "Parameters"
# metainfo field of a CLA file on Genestack that has the parameters you want.
VC_ARGUMENTS_NO_INDELS = ["--skip-indels -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP", "",
                          "--skip-variants indels --multiallelic-caller --variants-only"]

if __name__ == "__main__":
    # parse script arguments
    parser = make_connection_parser()
    parser.add_argument('raw_reads_folder',
                        help='Genestack accession of the folder containing the raw reads files to process')
    parser.add_argument('--name', default="New Project",
                        help='Name of the Genestack folder where to put the output files')
    parser.add_argument('--ref-genome', help='Accession of the reference genome to use for the mapping step')

    args = parser.parse_args()
    project_name = args.name

    print('Connecting to Genestack...')

    # get connection and create output folder
    connection = get_connection(args)
    files_util = FilesUtil(connection)
    created_files_folder = files_util.get_special_folder(SpecialFolders.CREATED)
    project_folder = files_util.create_folder(project_name, parent=created_files_folder)

    # create application wrappers and batch files creators
    bowtie_app = BowtieApplication(connection)
    mapped_qc_app = AlignedReadsQC(connection)
    variant_calling_app = VariationCaller2Application(connection)

    bowtie_creator = BowtieBatchFilesCreator(bowtie_app, project_folder, "Mapped Reads", ref_genome=args.ref_genome)
    mapped_qc_creator = BatchFilesCreator(mapped_qc_app, project_folder, "Mapped Reads QC")
    vc_creator = BatchFilesCreator(variant_calling_app, project_folder, "Variants", custom_args=VC_ARGUMENTS_NO_INDELS)

    # collect files
    print('Collecting raw reads...')
    raw_reads = files_util.get_file_children(args.raw_reads_folder)
    files_count = len(raw_reads)
    print('Found %d files to process' % files_count)

    # Create pipeline files
    mapped_reads = bowtie_creator.create_files(raw_reads)
    mapped_reads_qcs = mapped_qc_creator.create_files(mapped_reads)
    vc_creator.create_files(mapped_reads)

    print('All done! Your files are in the folder %s' % project_folder)

The script can then be called from a terminal with the following syntax:

python run_vc_pipeline.py -u <user_alias> --ref-genome <custom_ref_genome_accession> --name "My project name" <raw_reads_folder>

Note that the folder supplied as input to the script should only contain Unaligned Reads files.