Genecentric - A package to uncover graph-theoretic structure in high-throughput epistasis data.

Documentation

This page is split into several sections. Namely, there is a section for each command provided by Genecentric and a section for each file format that is used by Genecentric.

Also, the goal of this page is to provide explanation; if you're looking for examples, please see our examples page.

Commands
File formats

Commands

The Genecentric package is currently made up of four different commands: genecentric-bpms, genecentric-from-csv, genecentric-go and genecentric-fainfo. On Linux/Mac, these commands should automatically be in your PATH if you've installed Genecentric. On Windows, you will need to run each command by invoking the Python interpreter. For example, in a command prompt, you can invoke the genecentric-bpms command like so:

C:/Path/To/Python2.7/python.exe genecentric-bpms --help

And similarly for the other commands.

For all Genecentric commands, the --help option will provide a list and a short description of every option supported by that command.

genecentric-bpms

usage: genecentric-bpms [-h] [-e IGNORE_FILE] [-c RATIO] [-j JACCARD_INDEX] [-m NUMBER_BIPARTITIONS] [--no-squaring] [--minimum-size MIN_SIZE] [--maximum-size MAX_SIZE] [-p PROCESSES] [--no-jaccard] [--no-progress] [-v] INPUT_GENETIC_INTERACTION_FILE OUTPUT_BPM_FILE

Example usages (1) and (2).

The genecentric-bpms command produces BPMs from genetic interaction data. In particular, the output is in the BPM file format. BPMs can then be used with genecentric-go to generate GO enrichment data for each module of each BPM.

genecentric-bpms relies heavily on parallelization. Without it, and with a sufficiently large number of bipartitions, the run time of Genecentric suffers (on the order of several minutes, depending on your CPU).

INPUT_GENETIC_INTERACTION_FILE: A required parameter. It specifies the file containing the genetic interaction data.
OUTPUT_BPM_FILE: A required parameter. It specifies the file that genecentric-bpms will write the BPM data to.
-e IGNORE_FILE, --ignore-list IGNORE_FILE: IGNORE_FILE is the location of a file that contains a list of genes that will not be used in any computation. It is imperative that you specify the same value for IGNORE_FILE for both the genecentric-bpms and genecentric-go commands. The format of IGNORE_FILE is simple: one gene identifier per line.
-c RATIO, --gene-ratio RATIO: RATIO is the fraction of bipartitions (a value between 0 and 1) in which a gene must land on the same side as, or on the opposite side from, some gene g that generates a BPM. In particular, if a gene is on the same side as g in at least RATIO of the bipartitions, then that gene is in the same module as g. Conversely, if a gene is on the opposite side from g in at least RATIO of the bipartitions, then that gene is in the opposite module. Otherwise, that gene is not included in the BPM generated by g. Decreasing this value introduces more variation and increasing this value decreases variation. By default, RATIO is set to 0.9.
-j JACCARD_INDEX, --jaccard JACCARD_INDEX: JACCARD_INDEX is the similarity threshold used when pruning the set of BPMs. In particular, a BPM is generated for every gene in the genetic interaction data, which produces redundant BPMs. The Jaccard index is used to prune these redundant BPMs such that no two BPMs in the final set have a Jaccard index similarity score greater than JACCARD_INDEX. Increasing this value will produce more BPMs and decreasing this value will produce fewer BPMs. By default, JACCARD_INDEX is set to 0.66.
-m NUMBER_BIPARTITIONS, --num-bipartitions NUMBER_BIPARTITIONS: The number of random bipartitions to generate. As the number grows, the variability between BPM results decreases but the run-time of Genecentric increases. By default, NUMBER_BIPARTITIONS is set to 250. (If you have a lot of CPUs to spare, you may get better results with 500 with reasonable run-time performance.)
--no-squaring: When set, the genetic interaction scores in the genetic interaction data are not squared. Typically, with E-MAP data, squaring the interaction scores can lead to quicker convergence on happy bipartitions. By default, interaction scores are squared.
--minimum-size MIN_SIZE: MIN_SIZE is an integer indicating the smallest allowable module. If a module is found with fewer than MIN_SIZE genes, its corresponding BPM is pruned from the final result. By default, MIN_SIZE is set to 3.
--maximum-size MAX_SIZE: MAX_SIZE is an integer indicating the largest allowable module. If a module is found with more than MAX_SIZE genes, its corresponding BPM is pruned from the final result. By default, MAX_SIZE is set to 25.
-p PROCESSES, --processes PROCESSES: PROCESSES is the maximum number of concurrent processes to spawn. By default, this is set to the number of CPUs detected. This is not always the desired behavior, however; sometimes performance is better when this number is slightly lower than the total number of CPUs on your machine. If this is set to 1, then concurrent features will not be used.
--no-jaccard: When set, pruning based on the Jaccard index is not done. However, the --minimum-size and --maximum-size options will still have an effect if they are not manually set to 0.
--no-progress: When set, the progress bar is not shown.
-v, --verbose: Not used.
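
The roles of RATIO and JACCARD_INDEX can be sketched in a few lines of Python. This is only an illustration of the descriptions above, not Genecentric's actual implementation. The names build_bpm, jaccard and prune are hypothetical, as is the representation of a bipartition as a pair of gene sets. The sketch also assumes that "RATIO of the time" means "at least RATIO", that the Jaccard index of two BPMs is taken over all genes in each BPM, and that BPMs are pruned greedily in the order given.

```python
def build_bpm(g, bipartitions, ratio=0.9):
    """Return (same_module, opposite_module) for the BPM generated by gene g.

    bipartitions is a list of (side_a, side_b) pairs of gene sets; this
    representation is an assumption made for the sketch.
    """
    same, opposite = {}, {}
    for side_a, side_b in bipartitions:
        # Orient each bipartition so that `mine` is the side containing g.
        mine, other = (side_a, side_b) if g in side_a else (side_b, side_a)
        for gene in mine:
            same[gene] = same.get(gene, 0) + 1
        for gene in other:
            opposite[gene] = opposite.get(gene, 0) + 1
    # Assumption: a gene qualifies if it meets the ratio in *at least*
    # ratio * (number of bipartitions) of the bipartitions.
    needed = ratio * len(bipartitions)
    same_module = {x for x, n in same.items() if n >= needed}
    opposite_module = {x for x, n in opposite.items() if n >= needed}
    return same_module, opposite_module


def jaccard(bpm_x, bpm_y):
    """Jaccard index of two BPMs, computed over all genes in each BPM (assumed)."""
    genes_x = bpm_x[0] | bpm_x[1]
    genes_y = bpm_y[0] | bpm_y[1]
    union = genes_x | genes_y
    return len(genes_x & genes_y) / float(len(union)) if union else 0.0


def prune(bpms, threshold=0.66):
    """Keep a BPM only if it is not too similar to any BPM already kept."""
    kept = []
    for bpm in bpms:
        if all(jaccard(bpm, other) <= threshold for other in kept):
            kept.append(bpm)
    return kept
```

With this reading, a higher threshold lets more near-duplicate BPMs survive, which matches the note that increasing JACCARD_INDEX produces more BPMs.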

genecentric-from-csv

usage: genecentric-from-csv [-h] [--delimiter DELIMITER] [--no-header] [--g1-name G1_NAME] [--g2-name G2_NAME] [--g1-allele G1_ALLELE] [--g2-allele G2_ALLELE] [--int-score INT_SCORE] INPUT_EMAP_FILE OUTPUT_GI_FILE

Example usage.

This command is dedicated solely to transforming E-MAP data that you get from the wild into a genetic interaction data file that can be read by Genecentric. Genecentric forces one input format because genetic interaction data can come in many different formats; it would be infeasible to build in support for all of them. (With that said, we may provide other commands in the future to convert formats if they are popular enough.)

The genecentric-from-csv command is actually quite simple, and if you have some programming experience, you could very easily convert any genetic interaction data into the format that Genecentric understands. All Genecentric needs is a tab-delimited file with three columns where each row represents a genetic interaction: the first two columns are the gene identifiers in the genetic interaction, and the third column is the genetic interaction score.

genecentric-from-csv only requires that your E-MAP data be in some delimited file, where the delimiter can be specified using the --delimiter option.

The parameters of the genecentric-from-csv command are set by default to work with the Collins et al. data set.

An explanation of each of the options:

INPUT_EMAP_FILE: A required parameter. It specifies the file containing the raw E-MAP data.
OUTPUT_GI_FILE: A required parameter. It specifies the file that genecentric-from-csv will write the genetic interaction data to.
--delimiter DELIMITER: The character that separates each field in your E-MAP data. By default, the DELIMITER is set to a tab.
--no-header: When set, genecentric-from-csv will start reading data from your E-MAP file on the first line. When this option is omitted, genecentric-from-csv will assume the first line contains column headers and will thus ignore it.
--g1-name G1_NAME: G1_NAME is the column number that contains the first gene identifier in each pair. Column numbers start from 0.
--g2-name G2_NAME: G2_NAME is the column number that contains the second gene identifier in each pair. Column numbers start from 0.
--g1-allele G1_ALLELE: G1_ALLELE is the column number that contains the type of genetic interaction for the first gene. This is used with the Collins et al. data set to omit all genetic interactions that aren't "deletion" genetic interactions. If your E-MAP data set does not have this information, set G1_ALLELE to -1 and it will be ignored. Column numbers start from 0.
--g2-allele G2_ALLELE: G2_ALLELE is the column number that contains the type of genetic interaction for the second gene. This is used with the Collins et al. data set to omit all genetic interactions that aren't "deletion" genetic interactions. If your E-MAP data set does not have this information, set G2_ALLELE to -1 and it will be ignored. Column numbers start from 0.
--int-score INT_SCORE: INT_SCORE is the column number that contains the genetic interaction score. It is okay if the score is missing from some rows; it will be automatically set to 0. Column numbers start from 0.
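
If you would rather write your own converter, the options above map onto a short Python 3 function. The following is a rough sketch and not the code of genecentric-from-csv. The function name emap_to_gi and its default column numbers are hypothetical (the real defaults are tuned to the Collins et al. data set and are not listed here), and the sketch assumes that a "deletion" interaction is marked by the literal text "deletion" in the allele column.

```python
import csv


def emap_to_gi(lines, delimiter="\t", header=True,
               g1_name=0, g2_name=1, int_score=2,
               g1_allele=-1, g2_allele=-1):
    """Yield gi-format lines (gene1<TAB>gene2<TAB>score) from delimited rows.

    lines is any iterable of strings, such as an open file. The default
    column numbers are placeholders chosen for this sketch.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    if header:
        next(reader, None)  # skip the column headers
    for row in reader:
        if not row:
            continue
        # Assumption: allele columns hold the literal text "deletion".
        alleles = [row[i].strip().lower()
                   for i in (g1_allele, g2_allele) if i >= 0]
        if any(a != "deletion" for a in alleles):
            continue
        score = row[int_score].strip() if int_score < len(row) else ""
        # A missing score is written as 0.0, as described above.
        yield "\t".join((row[g1_name].strip(), row[g2_name].strip(),
                         score or "0.0"))
```

To write a gi file, pass an open input file as lines and write each yielded string to the output file followed by a newline.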

genecentric-go

usage: genecentric-go [-h] [-e IGNORE_FILE] [-s GO_SORT] [-t GO_ORDER] [-p PROCESSES] [--hide-enriched-genes] [--fa-species FA_SPECIES] [--fa-namespace FA_NAMESPACE] [--fa-cutoff FA_CUTOFF] [--fa-species-genespace] [--no-progress] [-v] INPUT_GENETIC_INTERACTION_FILE INPUT_BPM_FILE OUTPUT_ENRICHMENT_FILE

Example usages (1), (2) and (3).

The genecentric-go command performs GO enrichment analysis on a set of BPMs generated by the genecentric-bpms command. It takes as input both a genetic interaction data file and a BPM file and produces a GO BPM file as output.

Most of the options for the genecentric-go command are for configuring the behavior of FuncAssociate.

Because genecentric-go uses FuncAssociate, an Internet connection is required in order for genecentric-go to run.

Please make sure to familiarize yourself with the GO BPM file format, as some of the options described below are related to the output format.

An explanation of each of the options:

INPUT_GENETIC_INTERACTION_FILE: A required parameter. It specifies the file containing the genetic interaction data.
INPUT_BPM_FILE: A required parameter. It specifies the file containing the BPMs generated by genecentric-bpms.
OUTPUT_ENRICHMENT_FILE: A required parameter. It specifies the file that genecentric-go will write the GO BPM data to.
-e IGNORE_FILE, --ignore-list IGNORE_FILE: IGNORE_FILE is the location of a file that contains a list of genes that will not be used in any computation. It is imperative that you specify the same value for IGNORE_FILE for both the genecentric-bpms and genecentric-go command. The format of IGNORE_FILE is simple: one gene identifier per line.
-s GO_SORT, --sort-go-by GO_SORT: This option controls the order in which GO annotations are sorted for each entry in the GO BPM file. GO_SORT can be one of four values: p, accession, name or num_genes_with. If it's p, then the GO annotations are sorted by their p-values. If it's accession, then the GO annotations are sorted by their accession numbers (which look like "GO:0000001"). If it's name, then the GO annotations are sorted by their GO term name, e.g., "histone exchange". If it's num_genes_with, then the GO annotations are sorted by the number of genes in the particular BPM module that are enriched with that GO term. The default value is p.
-t GO_ORDER, --order-go GO_ORDER: This controls the order of the sort used with the --sort-go-by option. Namely, GO_ORDER can either be asc or desc where the former corresponds to ascending or increasing order, and the latter corresponds to descending or decreasing order. The default is asc.
-p PROCESSES, --processes PROCESSES: PROCESSES is the number of processes that should be spawned to run concurrently. For the genecentric-go command, this amounts to the number of simultaneous requests sent to FuncAssociate. By default, this is set to the number of CPUs detected on your machine, or to 6 if there are more than 6 CPUs on your machine. This is to make sure that we don't accidentally launch too many simultaneous requests to FuncAssociate. However, you may manually set as large a number as you wish.
--hide-enriched-genes: When this is set, the enriched genes for each GO annotation in the GO BPM output are omitted. Depending upon the number of modules in your BPM set, this may decrease file size modestly. It may also be useful to make the file easier to read if you don't care about this information.
--fa-species FA_SPECIES: FA_SPECIES should be set to the species that your genes belong to. You can get a list of species supported by FuncAssociate using the genecentric-fainfo command. By default, FA_SPECIES is set to Saccharomyces cerevisiae.
--fa-namespace FA_NAMESPACE: FA_NAMESPACE should be set to the namespace that your gene identifiers conform to. You can get a list of namespaces supported by FuncAssociate using the genecentric-fainfo command. By default, FA_NAMESPACE is set to sgd_systematic.
--fa-cutoff FA_CUTOFF: FA_CUTOFF should be set to a p-value in the interval (0, 1]. Only GO annotations with a p-value less than or equal to this cutoff will be returned by FuncAssociate. By default, FA_CUTOFF is set to 0.05.
--fa-species-genespace: When set, FuncAssociate will set the genespace to all genes in the species. If not set, the genespace sent to FuncAssociate will be equivalent to the set of genes found in the genetic interaction data file.
--no-progress: When set, the progress bar is not shown.
-v, --verbose: Not used.
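
To make the sorting options and the default for PROCESSES concrete, here is a small Python sketch. It is not taken from genecentric-go. The function names are hypothetical, and so is the representation of each GO annotation as a dictionary keyed by the four GO_SORT values.

```python
def sort_annotations(annotations, go_sort="p", go_order="asc"):
    """Sort GO annotations the way --sort-go-by and --order-go describe.

    annotations is a list of dictionaries with the keys "p", "accession",
    "name" and "num_genes_with" (a representation assumed for this sketch).
    """
    if go_sort not in ("p", "accession", "name", "num_genes_with"):
        raise ValueError("unknown GO_SORT value: %r" % (go_sort,))
    if go_order not in ("asc", "desc"):
        raise ValueError("unknown GO_ORDER value: %r" % (go_order,))
    return sorted(annotations, key=lambda a: a[go_sort],
                  reverse=(go_order == "desc"))


def default_processes(cpus_detected):
    """Number of CPUs detected, capped at 6 simultaneous requests."""
    return min(cpus_detected, 6)
```

For example, go_sort="num_genes_with" with go_order="desc" lists the annotations covering the most genes in the module first.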

genecentric-fainfo

usage: genecentric-fainfo [-h] [-v] QUERY_COMMAND [QUERY_SPECIES]

Example usage.

genecentric-fainfo is a simple command designed to return lists of species and namespaces supported by FuncAssociate. Namely, QUERY_COMMAND is either species or namespaces. In the case of the latter, genecentric-fainfo takes a second parameter, QUERY_SPECIES, which is the name of a species in the list returned by the species QUERY_COMMAND.

The values returned by genecentric-fainfo can be used with the --fa-species and --fa-namespace options of the genecentric-go command.

The species list is self-explanatory; choose the species that your genes belong to.

Your choice from the namespaces list depends upon what kind of gene identifiers you're using. For example, with Saccharomyces cerevisiae and the Collins et al. data set, the namespace is sgd_systematic which uses gene identifiers that look like "YAL054C".

If you're not sure which namespace to use, there should be an example of a gene identifier next to each namespace name when you run genecentric-fainfo with the namespaces QUERY_COMMAND.

File formats

There are three file formats used by Genecentric: genetic interaction data files, BPM files and GO BPM files (GO enrichment data for BPMs). The convention is to use the 'gi', 'bpm' and 'gobpm' file extensions for each of the formats, respectively. Genecentric does not enforce them.

Also note that Genecentric does not care which gene identifiers are used. In fact, they are entirely arbitrary from the perspective of Genecentric, so long as each identifier uniquely identifies a gene. Therefore, the gene identifiers used in your genetic interaction data will be the gene identifiers used in BPM and GO BPM files. (Note: If you are doing GO enrichment with Genecentric, FuncAssociate is used to perform the analysis. FuncAssociate does care about gene identifiers, and you'll have to set the namespace appropriately.)

What follows is a brief description of each file format. A more technical description of each format can be found in the README file in the root directory of the distribution.

gi (genetic interaction data)

Genetic interaction data can come in many different kinds of formats, and so it was necessary to adopt a universal and simple format as input to Genecentric.

A genetic interaction file is tab-delimited and made up of three columns: two gene identifiers and a genetic interaction score. There should be a line for every pair of genes with an interaction score.

Note that an interaction score must always be present in the third column. If the source data omits an interaction, use 0.0 as the genetic interaction score in the third column. (You may also omit the gene pair entirely, in which case, its interaction is considered to be zero.)

There should be no column headers in genetic interaction data files.

Please see the genecentric-from-csv command for more information on how to transform data into a gi file.

Sample data:

R0020C YAL011W 0.152836
R0020C YAL013W 0.172871
R0020C YAL015C -0.213015
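
A gi file is simple enough to load with a few lines of Python. The sketch below is illustrative only and is not part of Genecentric. The names read_gi and interaction are hypothetical, and the sketch assumes that an interaction is symmetric, so the order of the two genes in a pair does not matter.

```python
def read_gi(lines):
    """Parse gi-format lines into a dictionary keyed by unordered gene pair."""
    scores = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        g1, g2, score = line.split("\t")
        # Assumption: interactions are symmetric, so the pair is unordered.
        scores[frozenset((g1, g2))] = float(score)
    return scores


def interaction(scores, g1, g2):
    """Look up a score; a pair absent from the file is treated as zero."""
    return scores.get(frozenset((g1, g2)), 0.0)
```

Here lines can be an open gi file, since iterating over a file yields one line at a time.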

bpm (list of BPMs)

BPM files are the output produced by the genecentric-bpms command. They are tab-delimited and both human and machine readable. Each line contains a single module identifier in the first column and that module's corresponding genes in each subsequent column. A BPM is made up of two modules (and thus two lines).

Sample data:

BPM0/Module1 YAL011W YGR181W YMR156C ...
BPM0/Module2 YML124C YNR010W YEL018W ...
BPM1/Module1 YML124C YEL018W YIL040W ...
BPM1/Module2 YGR181W YML060W YML041C ...

Programmer's tip: The bpm/bpmreader.py module contains a read function that takes a BPM file name as a parameter and returns a list of BPMs as tuples of modules (where each module is a list of gene identifiers).
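
If you are not working inside the Genecentric source tree, the format is easy to parse by hand. The following standalone sketch is not the read function from bpm/bpmreader.py and may differ from it in detail. The name read_bpms is hypothetical, and the sketch assumes that module identifiers look like "BPM0/Module1", as in the sample data, so that the text before the slash identifies the BPM.

```python
def read_bpms(lines):
    """Parse bpm-format lines into a list of BPMs.

    Each BPM is a tuple of modules and each module is a list of gene
    identifiers. Modules are grouped by the text before the "/" in the
    module identifier (an assumption based on the sample data).
    """
    bpms = {}  # insertion-ordered in Python 3.7+
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        bpm_id = fields[0].split("/", 1)[0]
        bpms.setdefault(bpm_id, []).append(fields[1:])
    return [tuple(modules) for modules in bpms.values()]
```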

gobpm (GO enrichment data for BPMs)

GO BPM files are the output produced by the genecentric-go command. They are designed to be human readable as plain text files, but are also machine readable.

Each entry in the gobpm file corresponds to enrichment analysis on each module of every BPM from the BPM input file. Each entry has three sections.

The first section is always the first line of the entry and always starts with '> ' and is followed by a BPM and module identifier string.

The second section is always the second line of the entry and corresponds to a tab-delimited list of genes in the module.

The third section is the rest of the lines in the entry up to and not including the next line that starts with '> ' or the end of the file. Each line in the third section corresponds to a GO annotation for the BPM module. Each GO annotation has the following information: the GO accession number, the p-value, the ratio of genes annotated with the term in its BPM module, the GO term, and a list of genes in the BPM module that have been annotated by this particular GO term. (Note: The list of genes may be absent if that particular output option was disabled with genecentric-go.)

Sample data:

> BPM0/Module0
YAL011W YGR181W YMR156C ...
GO:0043044 0.000000 8/14 ATP-dependent chromatin remodeling YML041C YNL107W YDR334W ...
GO:0034621 0.011000 9/14 cellular macromolecular complex ... YML041C YNL107W YDR334W ...
...
> BPM0/Module1
YML124C YNR010W YEL018W ...
GO:0009058 0.038000 13/15 biosynthetic process YNR010W YNL097C YMR263W ...
GO:0044249 0.038000 13/15 cellular biosynthetic process YNR010W YNL097C YMR263W ...

Programmer's tip: The bpm/enrichment.py module contains read_bpm and write_bpm functions for reading and writing entries in a gobpm file.
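
The three-section layout described above can also be split apart without the package. The sketch below is a standalone illustration and not the read_bpm function from bpm/enrichment.py. The name read_gobpm and the dictionary keys are hypothetical. Because the exact field layout of an annotation line is defined in the README rather than on this page, the sketch keeps each annotation line as an unparsed string.

```python
def read_gobpm(lines):
    """Split gobpm-format lines into entries.

    Returns a list of dictionaries with the (hypothetical) keys "id",
    "genes" and "annotations". Annotation lines are kept as raw strings.
    """
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("> "):
            # First section: the BPM and module identifier string.
            current = {"id": line[2:].strip(), "genes": None,
                       "annotations": []}
            entries.append(current)
        elif current is None or not line.strip():
            continue  # ignore blank lines and anything before the first entry
        elif current["genes"] is None:
            # Second section: tab-delimited list of genes in the module.
            current["genes"] = line.split("\t")
        else:
            # Third section: one GO annotation per line.
            current["annotations"].append(line)
    return entries
```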