PyMEGABASE¶

The PyMEGABASE classes perform subcompartment and compartment annotations based on 1D chromatin enrichment profiles.

class PyMEGABASE.PyMEGABASE(cell_line='GM12878', assembly='hg19', organism='human', signal_type='signal p-value', file_format='bigWig', ref_cell_line_path='tmp_meta', cell_line_path=None, types_path=None, histones=True, tf=False, atac=False, small_rna=False, total_rna=False, n_states=19, extra_filter='', res=50, chromosome_sizes=None, AB=False)[source]¶

Bases: object

The PyMEGABASE class performs genomic annotations.

The constructor sets up the environment for generating predictions of genomic annotations.

Parameters

cell_line (str, required) – Name of target cell type
assembly (str, required) – Reference assembly of target cell line (‘hg19’, ‘GRCh38’, ‘mm10’)
organism (str, required) – Organism of the target cell type
signal_type (str, required) – Signal type of the input files (‘signal p-value’, ‘fold change over control’, …)
ref_cell_line_path (str, optional) – Folder/Path to place reference/training data (‘tmp_meta’)
cell_line_path (str, optional) – Folder/Path to place target cell data
types_path (str, optional) – Folder/Path where the genomic annotations are located
histones (bool, required) – Whether to use Histone Modification data from the ENCODE databank for prediction
tf (bool, required) – Whether to use Transcription Factor data from the ENCODE databank for prediction
atac (bool, required) – Whether to use ATAC-seq data from the ENCODE databank for prediction
small_rna (bool, required) – Whether to use Small RNA-Seq data from the ENCODE databank for prediction
total_rna (bool, required) – Whether to use Total RNA-seq data from the ENCODE databank for prediction
n_states (int, optional) – Number of states for the D-nodes
extra_filter (str, optional) – Additional filter appended to the data-fetching URL used to download cell type data
res (int, optional) – Resolution for calling genomic annotations, in kilobase pairs (5, 50, 100)
chromosome_sizes (list, optional) – Chromosome sizes based on the reference genome assembly; required for non-human assemblies
file_format (str, optional) – File format for the input data
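
A typical run might chain the methods documented on this page in the order shown in the sketch below. This is not a verified recipe: the call order is an assumption inferred from the method descriptions, the `run_annotation` helper and its `out_dir` argument are hypothetical, and running it requires the PyMEGABASE package and network access to ENCODE.

```python
# Constructor arguments copied from the documented signature defaults.
DEFAULT_KWARGS = {
    "cell_line": "GM12878",
    "assembly": "hg19",
    "organism": "human",
    "signal_type": "signal p-value",
    "file_format": "bigWig",
    "histones": True,
    "res": 50,
}


def run_annotation(out_dir="predictions", nproc=10, **overrides):
    """Hypothetical helper; the call order below is an assumption."""
    # Imported lazily so this module loads even without the package installed.
    from PyMEGABASE import PyMEGABASE

    kwargs = {**DEFAULT_KWARGS, **overrides}
    pym = PyMEGABASE(**kwargs)
    pym.download_and_process_cell_line_data(nproc=nproc)
    pym.download_and_process_ref_data(nproc=nproc)
    pym.training_set_up()
    pym.training(nproc=nproc)
    # Documented to return subcompartment and compartment dictionaries.
    return pym.prediction_all_chrm(path=out_dir)
```

Keyword overrides such as `run_annotation(cell_line='K562')` would replace the corresponding defaults before the object is constructed.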

build_state_vector(int_types, all_averages)[source]¶

Builds the set of state vectors used in the training process

Parameters

int_types (list, required) – Genomic annotations
all_averages (list, required) – D-node data

custom_bed_track(experiment, bed_file)[source]¶

Adds a custom BED track

Parameters

experiment (str, required) – Name of the experiment
bed_file (str, required) – Path to the custom track
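
A custom track has to end up on the same fixed-resolution grid (`res` kilobases) as the annotations. The sketch below shows one way BED-style intervals could be averaged into such bins. The `bin_intervals` function is hypothetical and is not the library's preprocessing code.

```python
def bin_intervals(intervals, chrom_size, res_kb=50):
    """Average interval scores over fixed-size bins.

    intervals: iterable of (start, end, score), 0-based half-open, in bp.
    Returns one length-weighted mean per bin; uncovered bases count as 0.
    """
    bin_bp = res_kb * 1000
    n_bins = -(-chrom_size // bin_bp)  # ceiling division
    totals = [0.0] * n_bins
    for start, end, score in intervals:
        end = min(end, chrom_size)
        pos = start
        while pos < end:
            idx = pos // bin_bp
            # Portion of the interval that falls inside the current bin.
            stop = min(end, (idx + 1) * bin_bp)
            totals[idx] += score * (stop - pos)
            pos = stop
    binned = []
    for idx, total in enumerate(totals):
        # The last bin may be shorter than the nominal bin width.
        width = min((idx + 1) * bin_bp, chrom_size) - idx * bin_bp
        binned.append(total / width)
    return binned
```

For example, `bin_intervals([(0, 50000, 2.0)], 120000)` yields three bins, the last of which covers only 20 kb.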

custom_bw_track(experiment, bw_file)[source]¶

Adds a custom bigWig track

Parameters

experiment (str, required) – Name of the experiment
bw_file (str, required) – Path to the custom track

download_and_process_cell_line_data(nproc=10, all_exp=True)[source]¶

Download and preprocess target cell data for the D-nodes

Parameters

nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment. Set to False to download only one replica per experiment

download_and_process_ref_data(nproc, all_exp=True)[source]¶

Download and preprocess reference data for the D-nodes

Parameters

nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment. Set to False to download only one replica per experiment

filter_exp()[source]¶

Assesses the signal-to-noise ratio of each experiment, based on the mean and standard deviation of its signal compared with the GM12878 equivalent, using chromosomes 1 and 2
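
The comparison described for filter_exp could look like the sketch below. The exact criterion is not stated on this page, so everything here is an assumption: the use of the coefficient of variation (standard deviation over mean), the `min_fraction` threshold, and the function names are illustrative only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(signal):
    """Standard deviation divided by mean; 0.0 when the mean is not positive."""
    mu = mean(signal)
    return pstdev(signal) / mu if mu > 0 else 0.0


def passes_filter(target, reference, min_fraction=0.5):
    """Keep an experiment whose variation is comparable to the reference's.

    target, reference: per-bin signal values for the same chromosomes.
    min_fraction: hypothetical threshold, not a library parameter.
    """
    return coefficient_of_variation(target) >= (
        min_fraction * coefficient_of_variation(reference)
    )
```

A flat track fails this check while a track with peaks comparable to the reference passes.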

get_tmatrix(chrms, silent=False)[source]¶

Extract the training data

Parameters

chrms (list, required) – Set of chromosomes from the reference data used as the training set
silent (bool, optional) – Silence outputs

prediction_X(chr='X', h_and_J_file=None, energies=False, probabilities=False)[source]¶

Predicts and outputs the genomic annotations for chromosome X

Parameters

chr (str, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome): Predicted annotations

prediction_all_chrm(path=None, save_subcompartments=True, save_compartments=True, energies=False, probabilities=False)[source]¶

Predicts and outputs the genomic annotations for all the chromosomes

Parameters

path (str, optional) – Folder/Path to save the prediction results
save_subcompartments (bool, optional) – Whether to generate files with subcompartment annotations for each chromosome
save_compartments (bool, optional) – Whether to generate files with compartment annotations for each chromosome

Returns

predictions_subcompartments (dict), predictions_compartments (dict): Predicted subcompartment and compartment annotations, as dictionaries keyed by chromosome

prediction_single_chrom(chr=1, h_and_J_file=None, energies=False, probabilities=False)[source]¶

Predicts and outputs the genomic annotations for a single chromosome

Parameters

chr (int, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome): Predicted annotations

printHeader()[source]¶

process_replica_bed(line, cell_line_path, chrm_size)[source]¶

Preprocessing function for each replica formatted as a BED file

Parameters

line (list, required) – Information about the replica: name, ENCODE id and replica id
cell_line_path (str, required) – Path to target cell type data
chrm_size (list, required) – Chromosome sizes based on the assembly

process_replica_bw(line, cell_line_path, chrm_size)[source]¶

Preprocessing function for each replica formatted as a bigWig file

Parameters

line (list, required) – Information about the replica: name, ENCODE id and replica id
cell_line_path (str, required) – Path to target cell type data
chrm_size (list, required) – Chromosome sizes based on the assembly

test_set(chr=1, silent=False)[source]¶

Extracts the D-node input data for a chromosome

Parameters

chr (int, required) – Chromosome from which to extract input data for the D-nodes
silent (bool, optional) – Avoid printing information

Returns

array (size of chromosome, 5*number of unique experiments): D-node input data

training(nproc=10, lambda_h=100, lambda_J=100)[source]¶

Performs the training of the Potts model based on the reference data

Parameters

nproc (int, required) – Number of processors used to train
lambda_h (float, optional) – Regularization strength for the h energy term
lambda_J (float, optional) – Regularization strength for the J energy term
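
For orientation, a Potts model assigns each configuration an energy built from local fields h and pairwise couplings J, and the two lambda values weight a penalty on those terms. The sketch below uses a generic textbook form with an L2 penalty. The sign convention, the penalty form, and the data layout are assumptions and may differ from what the library does internally.

```python
def potts_energy(states, h, J):
    """E(s) = -sum_i h[i][s_i] - sum_{(i,j)} J[(i, j)][s_i][s_j].

    states: list of integer states, one per node.
    h: list of per-node field vectors, indexed h[node][state].
    J: dict mapping node pairs (i, j) to coupling matrices.
    """
    energy = -sum(h[i][s] for i, s in enumerate(states))
    for (i, j), coupling in J.items():
        energy -= coupling[states[i]][states[j]]
    return energy


def l2_penalty(h, J, lambda_h=100, lambda_J=100):
    """Regularization term: lambda_h * sum(h^2) + lambda_J * sum(J^2)."""
    h_sq = sum(v * v for row in h for v in row)
    j_sq = sum(v * v for m in J.values() for row in m for v in row)
    return lambda_h * h_sq + lambda_J * j_sq
```

Larger lambda values penalize large field and coupling entries more strongly, which in this generic form pushes the fitted parameters toward zero.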

training_set_up(chrms=None, filter=True)[source]¶

Formats the data for training

Parameters

chrms (list, optional) – Set of chromosomes from the reference data to use as the training set
filter (bool, optional) – Filter experiments based on the baseline

write_bed(out_file='predictions', compartments=True, subcompartments=True)[source]¶

Formats and saves predictions in BED format

Parameters

out_file (str, optional) – Folder/Path to save the prediction results
subcompartments (bool, optional) – Whether to generate files with subcompartment annotations
compartments (bool, optional) – Whether to generate files with compartment annotations

Returns

predictions_subcompartments (dict), predictions_compartments (dict): Predicted subcompartment and compartment annotations, as dictionaries keyed by chromosome
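
BED is a tab-separated format with 0-based, half-open coordinates. The sketch below shows how per-bin labels at a given resolution could be turned into BED lines. The `labels_to_bed` function is hypothetical, merging consecutive bins with the same label is a design choice made here, and the columns the library actually writes may differ.

```python
def labels_to_bed(chrom, labels, res_kb=50):
    """Return BED lines (chrom, start, end, name) for per-bin labels."""
    bin_bp = res_kb * 1000
    lines = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run when the label changes or the list ends.
        if i == len(labels) or labels[i] != labels[start]:
            lines.append(
                f"{chrom}\t{start * bin_bp}\t{i * bin_bp}\t{labels[start]}"
            )
            start = i
    return lines
```

Note that the end coordinate of the last line is a multiple of the bin size and can exceed the true chromosome length; a real writer would clip it to the chromosome size.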

class PyMEGABASE.PyMEGABASE_legacy(cell_line='GM12878', assembly='hg19', organism='human', signal_type='signal p-value', ref_cell_line_path='tmp_meta', cell_line_path=None, types_path='PyMEGABASE/types', histones=True, tf=False, atac=False, small_rna=False, total_rna=False, n_states=19, extra_filter='', res=50, chromosome_sizes=None, AB=False)[source]¶

Bases: object

The PyMEGABASE_legacy class performs genomic annotations.

The constructor sets up the environment for generating predictions of genomic annotations.

Parameters

cell_line (str, required) – Name of target cell type
assembly (str, required) – Reference assembly of target cell line (‘hg19’, ‘GRCh38’, ‘mm10’)
organism (str, required) – Organism of the target cell type
signal_type (str, required) – Signal type of the input files (‘signal p-value’, ‘fold change over control’, …)
ref_cell_line_path (str, optional) – Folder/Path to place reference/training data (‘tmp_meta’)
cell_line_path (str, optional) – Folder/Path to place target cell data
types_path (str, optional) – Folder/Path where the genomic annotations are located
histones (bool, required) – Whether to use Histone Modification data from the ENCODE databank for prediction
tf (bool, required) – Whether to use Transcription Factor data from the ENCODE databank for prediction
atac (bool, required) – Whether to use ATAC-seq data from the ENCODE databank for prediction
small_rna (bool, required) – Whether to use Small RNA-Seq data from the ENCODE databank for prediction
total_rna (bool, required) – Whether to use Total RNA-seq data from the ENCODE databank for prediction
n_states (int, optional) – Number of states for the D-nodes
extra_filter (str, optional) – Additional filter appended to the data-fetching URL used to download cell type data
res (int, optional) – Resolution for calling genomic annotations, in kilobase pairs (5, 50, 100)
chromosome_sizes (list, optional) – Chromosome sizes based on the reference genome assembly; required for non-human assemblies

build_state_vector(int_types, all_averages)[source]¶

Builds the set of state vectors used in the training process

Parameters

int_types (list, required) – Genomic annotations
all_averages (list, required) – D-node data

download_and_process_cell_line_data(nproc=10, all_exp=True)[source]¶

Download and preprocess target cell data for the D-nodes

Parameters

nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment

download_and_process_ref_data(nproc, all_exp=True)[source]¶

Download and preprocess reference data for the D-nodes

Parameters

nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment

extra_track(experiment, bw_file)[source]¶

Adds a custom track

Parameters

experiment (str, required) – Name of the experiment
bw_file (str, required) – Path to the custom track

filter_exp()[source]¶

Assesses the baselines of the experiments

get_tmatrix(chrms, silent=False)[source]¶

Extract the training data

Parameters

chrms (list, required) – Set of chromosomes from the reference data used as the training set
silent (bool, optional) – Silence outputs

prediction_X(chr='X', h_and_J_file=None)[source]¶

Predicts and outputs the genomic annotations for chromosome X

Parameters

chr (str, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome): Predicted annotations

prediction_all_chrm(path=None, save_subcompartments=True, save_compartments=True)[source]¶

Predicts and outputs the genomic annotations for all the chromosomes

Parameters

path (str, optional) – Folder/Path to save the prediction results
save_subcompartments (bool, optional) – Whether to generate files with subcompartment annotations for each chromosome
save_compartments (bool, optional) – Whether to generate files with compartment annotations for each chromosome

Returns

predictions_subcompartments (dict), predictions_compartments (dict): Predicted subcompartment and compartment annotations, as dictionaries keyed by chromosome

prediction_single_chrom(chr=1, h_and_J_file=None)[source]¶

Predicts and outputs the genomic annotations for a single chromosome

Parameters

chr (int, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome): Predicted annotations

printHeader()[source]¶

process_replica(line, cell_line_path, chrm_size)[source]¶

Preprocess function for each replica

Parameters

line (list, required) – Information about the replica: name, ENCODE id and replica id
cell_line_path (str, required) – Path to target cell type data
chrm_size (list, required) – Chromosome sizes based on the assembly

test_set(chr=1, silent=False)[source]¶

Extracts the D-node input data for a chromosome

Parameters

chr (int, required) – Chromosome from which to extract input data for the D-nodes
silent (bool, optional) – Avoid printing information

Returns

array (size of chromosome, 5*number of unique experiments): D-node input data

training(nproc=10, lambda_h=100, lambda_J=100)[source]¶

Performs the training of the Potts model based on the reference data

Parameters

nproc (int, required) – Number of processors used to train
lambda_h (float, optional) – Regularization strength for the h energy term
lambda_J (float, optional) – Regularization strength for the J energy term

Returns

array (size of chromosome,5*number of unique experiments): D-node input data

training_set_up(chrms=None, filter=True)[source]¶

Formats the data for training

Parameters

chrms (list, optional) – Set of chromosomes from the reference data to use as the training set
filter (bool, optional) – Filter experiments based on the baseline

write_bed(out_file='predictions', compartments=True, subcompartments=True)[source]¶

Formats and saves predictions in BED format

Parameters

out_file (str, optional) – Folder/Path to save the prediction results
subcompartments (bool, optional) – Whether to generate files with subcompartment annotations
compartments (bool, optional) – Whether to generate files with compartment annotations

Returns

predictions_subcompartments (dict), predictions_compartments (dict): Predicted subcompartment and compartment annotations, as dictionaries keyed by chromosome