PyMEGABASE

The PyMEGABASE classes perform subcompartment and compartment annotations based on 1D chromatin enrichment profiles.

class PyMEGABASE.PyMEGABASE(cell_line='GM12878', assembly='hg19', organism='human', signal_type='signal p-value', file_format='bigWig', ref_cell_line_path='tmp_meta', cell_line_path=None, types_path=None, histones=True, tf=False, atac=False, small_rna=False, total_rna=False, n_states=19, extra_filter='', res=50, chromosome_sizes=None, AB=False)[source]

Bases: object

The PyMEGABASE class performs genomic annotations.

The PyMEGABASE class sets up the environment to generate predictions of genomic annotations.

Parameters
  • cell_line (str, required) – Name of target cell type

  • assembly (str, required) – Reference assembly of target cell line (‘hg19’, ‘GRCh38’, ‘mm10’)

  • organism (str, required) – Organism of the target cell type

  • signal_type (str, required) – Input files signal type (‘signal p-value’, ‘fold change over control’, …)

  • ref_cell_line_path (str, optional) – Folder/Path to place reference/training data (‘tmp_meta’)

  • cell_line_path (str, optional) – Folder/Path to place target cell data

  • types_path (str, optional) – Folder/Path where the genomic annotations are located

  • histones (bool, required) – Whether to use Histone Modification data from the ENCODE databank for prediction

  • tf (bool, required) – Whether to use Transcription Factor data from the ENCODE databank for prediction

  • small_rna (bool, required) – Whether to use Small RNA-Seq data from the ENCODE databank for prediction

  • total_rna (bool, required) – Whether to use Total RNA-seq data from the ENCODE databank for prediction

  • n_states (int, optional) – Number of states for the D-nodes

  • extra_filter (str, optional) – Add filter to the fetching data url to download cell type data

  • res (int, optional) – Resolution for genomic annotation calling, in kilobase pairs (5, 50, 100)

  • chromosome_sizes (list, optional) – Chromosome sizes based on the reference genome assembly - required for non-human assemblies

  • file_format (str, optional) – File format for the input data
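As a rough illustration of how these arguments fit together, the sketch below collects a few of the documented constructor arguments into a dictionary and checks res before handing them to the class. The helpers make_config and build_annotator are hypothetical and not part of PyMEGABASE, treating (5, 50, 100) as the only accepted resolutions is an assumption based on the list above, and the import path is inferred from the class signature.

```python
# Hypothetical helpers; only the keyword names and defaults come from the signature above.
SUPPORTED_RES_KB = (5, 50, 100)  # resolutions listed for the `res` parameter (assumed exhaustive)


def make_config(cell_line, assembly="hg19", organism="human",
                signal_type="signal p-value", res=50, histones=True):
    """Collect constructor keyword arguments and validate the resolution."""
    if res not in SUPPORTED_RES_KB:
        raise ValueError(f"res must be one of {SUPPORTED_RES_KB}, got {res}")
    return {
        "cell_line": cell_line,
        "assembly": assembly,
        "organism": organism,
        "signal_type": signal_type,
        "res": res,
        "histones": histones,
    }


def build_annotator(config):
    """Instantiate the class; imported lazily so the config can be built without the package."""
    from PyMEGABASE import PyMEGABASE  # assumed import path
    return PyMEGABASE(**config)


config = make_config("GM12878", assembly="hg19", res=50)
print(config["cell_line"], config["res"])
```

build_annotator(config) is defined but not called here, since it needs the package to be installed.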

build_state_vector(int_types, all_averages)[source]

Builds the set of state vectors used in the training process

Parameters
  • int_types (list, required) – Genomic annotations

  • all_averages (list, required) – D-node data

custom_bed_track(experiment, bed_file)[source]

Introduces a custom BED track

Parameters
  • experiment (str, required) – Name of the experiment

  • bed_file (str, required) – Path to the custom track

custom_bw_track(experiment, bw_file)[source]

Introduces a custom bigWig track

Parameters
  • experiment (str, required) – Name of the experiment

  • bw_file (str, required) – Path to the custom track

download_and_process_cell_line_data(nproc=10, all_exp=True)[source]

Download and preprocess target cell data for the D-nodes

Parameters
  • nproc (int, required) – Number of processors dedicated to download and process data

  • all_exp (bool, optional) – Download and process all replicas for each experiment. Set to False to download only one replica per experiment
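The effect of all_exp and nproc can be pictured with a small stand-alone sketch. It assumes each replica is described by an (experiment name, ENCODE id, replica id) tuple, mirroring the line argument of the process_replica functions. The names select_replicas and process_all are hypothetical, the identifiers are made up, and a thread pool stands in for whatever parallel backend the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def select_replicas(replicas, all_exp=True):
    """Keep every replica, or only the first one seen per experiment."""
    if all_exp:
        return list(replicas)
    seen = set()
    selected = []
    for name, encode_id, replica_id in replicas:
        if name not in seen:
            seen.add(name)
            selected.append((name, encode_id, replica_id))
    return selected


def process_all(replicas, worker, nproc=10):
    """Apply `worker` to each replica using at most `nproc` concurrent workers."""
    with ThreadPoolExecutor(max_workers=nproc) as pool:
        return list(pool.map(worker, replicas))


# Made-up identifiers, for illustration only.
replicas = [
    ("H3K27ac", "ENCFF000AAA", 1),
    ("H3K27ac", "ENCFF000AAB", 2),
    ("H3K4me3", "ENCFF000AAC", 1),
]
subset = select_replicas(replicas, all_exp=False)
labels = process_all(subset, lambda r: f"{r[0]}-{r[2]}", nproc=2)
print(labels)  # ['H3K27ac-1', 'H3K4me3-1']
```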

download_and_process_ref_data(nproc, all_exp=True)[source]

Download and preprocess reference data for the D-nodes

Parameters
  • nproc (int, required) – Number of processors dedicated to download and process data

  • all_exp (bool, optional) – Download and process all replicas for each experiment. Set to False to download only one replica per experiment

filter_exp()[source]

Assesses the signal-to-noise ratio of each experiment, based on the mean and standard deviation of its signal compared to the GM12878 equivalent, using chromosomes 1 and 2
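One way to picture such a check is sketched below. The actual criterion used by filter_exp is not stated here, so the definition of signal-to-noise as mean over standard deviation, the min_ratio threshold, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def signal_to_noise(signal):
    """Mean divided by standard deviation; returns 0.0 for a flat track."""
    sd = pstdev(signal)
    return mean(signal) / sd if sd > 0 else 0.0


def passes_filter(target, reference, min_ratio=0.75):
    """Keep an experiment if its SNR is at least `min_ratio` times the reference SNR.

    `min_ratio` is an arbitrary illustrative threshold, not a library default.
    """
    ref_snr = signal_to_noise(reference)
    if ref_snr == 0:
        return False
    return signal_to_noise(target) / ref_snr >= min_ratio


reference = [0.0, 1.0, 4.0, 1.0, 0.0, 6.0]   # toy stand-in for the GM12878 track
similar = [0.0, 2.0, 8.0, 2.0, 0.0, 12.0]    # same profile, scaled by 2
sparse = [0.0, 0.0, 0.0, 0.0, 0.0, 9.0]      # a single spike on a flat background

print(passes_filter(similar, reference), passes_filter(sparse, reference))  # True False
```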

get_tmatrix(chrms, silent=False)[source]

Extract the training data

Parameters
  • chrms (list, optional) – Set of chromosomes from the reference data used as the training set

  • silent (bool, optional) – Silence outputs

prediction_X(chr='X', h_and_J_file=None, energies=False, probabilities=False)[source]

Predicts and outputs the genomic annotations for chromosome X

Parameters
  • chr (str, optional) – Chromosome to predict

  • h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome)

Predicted annotations

prediction_all_chrm(path=None, save_subcompartments=True, save_compartments=True, energies=False, probabilities=False)[source]

Predicts and outputs the genomic annotations for all the chromosomes

Parameters
  • path (str, optional) – Folder/Path to save the prediction results

  • save_subcompartments (bool, optional) – Whether to generate files with subcompartment annotations for each chromosome

  • save_compartments (bool, optional) – Whether to generate files with compartment annotations for each chromosome

Returns

predictions_subcompartments (dict), predictions_compartments (dict)

Predicted subcompartment and compartment annotations, as dictionaries organized by chromosome
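The relation between the two returned dictionaries can be sketched as follows. The label names (A1, A2, B1 to B3, NA) and the chromosome keys are assumptions for illustration, and to_compartments is a hypothetical helper, not a library function.

```python
def to_compartments(subcompartments):
    """Collapse subcompartment labels (e.g. 'A1', 'B3') to compartment labels ('A', 'B').

    Labels that do not start with 'A' or 'B' (such as 'NA' here) are kept unchanged.
    """
    collapsed = []
    for label in subcompartments:
        if label[:1] in ("A", "B"):
            collapsed.append(label[0])
        else:
            collapsed.append(label)
    return collapsed


# Toy data shaped like the documented return value: one entry per chromosome.
predictions_subcompartments = {
    "chr1": ["A1", "A1", "B2", "NA"],
    "chr2": ["B1", "A2", "A2", "B3"],
}
predictions_compartments = {
    chrom: to_compartments(labels)
    for chrom, labels in predictions_subcompartments.items()
}
print(predictions_compartments["chr2"])  # ['B', 'A', 'A', 'B']
```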

prediction_single_chrom(chr=1, h_and_J_file=None, energies=False, probabilities=False)[source]

Predicts and outputs the genomic annotations for a single chromosome

Parameters
  • chr (int, optional) – Chromosome to predict

  • h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome)

Predicted annotations

printHeader()[source]
process_replica_bed(line, cell_line_path, chrm_size)[source]

Preprocessing function for each replica formatted as a BED file

Parameters
  • line (list, required) – Information about the replica: name, ENCODE id and replica id

  • cell_line_path (str, required) – Path to target cell type data

  • chrm_size (list, required) – Chromosome sizes based on the assembly

process_replica_bw(line, cell_line_path, chrm_size)[source]

Preprocessing function for each replica formatted as a bigWig file

Parameters
  • line (list, required) – Information about the replica: name, ENCODE id and replica id

  • cell_line_path (str, required) – Path to target cell type data

  • chrm_size (list, required) – Chromosome sizes based on the assembly

test_set(chr=1, silent=False)[source]

Extracts the D-node input data for a given chromosome

Parameters
  • chr (int, required) – Chromosome from which to extract the input data for the D-nodes

  • silent (bool, optional) – Avoid printing information

Returns

array (size of chromosome, 5 * number of unique experiments)

D-node input data
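The shape of this array can be illustrated with a toy construction. The sketch assumes that the factor of 5 comes from pairing each bin with its two neighbors on either side for every experiment. That interpretation, the zero padding at the chromosome ends, and the helper name are assumptions, not statements about the library internals.

```python
def neighbor_window_features(tracks, half_window=2):
    """Build one row per bin holding each track's values at the bin and its neighbors.

    `tracks` is a list of equal-length per-bin signals, one per experiment.
    Positions past the chromosome ends are padded with 0.
    """
    n_bins = len(tracks[0])
    rows = []
    for i in range(n_bins):
        row = []
        for track in tracks:
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                row.append(track[j] if 0 <= j < n_bins else 0)
        rows.append(row)
    return rows


tracks = [[1, 2, 3, 4], [5, 6, 7, 8]]    # 2 experiments, 4 bins
features = neighbor_window_features(tracks)
print(len(features), len(features[0]))   # 4 10
```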

training(nproc=10, lambda_h=100, lambda_J=100)[source]

Performs the training of the Potts model based on the reference data

Parameters
  • nproc (int, required) – Number of processors used to train

  • lambda_h (float, optional) – Regularization strength for the h energy term

  • lambda_J (float, optional) – Regularization strength for the J energy term
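For orientation, the roles of the h and J energy terms and of their regularization weights can be sketched on a toy Potts model. The sign convention, the L2 form of the penalty, and the data layout below are assumptions chosen for illustration. The library's training procedure may differ.

```python
def potts_energy(states, h, J):
    """Energy of a configuration: -sum_i h[i][s_i] - sum_(i,j) J[(i, j)][s_i][s_j]."""
    energy = -sum(h[i][s] for i, s in enumerate(states))
    for (i, j), coupling in J.items():
        energy -= coupling[states[i]][states[j]]
    return energy


def l2_penalty(h, J, lambda_h=100, lambda_J=100):
    """Regularization term lambda_h * ||h||^2 + lambda_J * ||J||^2 (assumed L2 form)."""
    h_sq = sum(v * v for row in h for v in row)
    J_sq = sum(v * v for c in J.values() for row in c for v in row)
    return lambda_h * h_sq + lambda_J * J_sq


# Two nodes with two states each; the values are arbitrary.
h = [[0.5, -0.5], [0.0, 1.0]]
J = {(0, 1): [[1.0, 0.0], [0.0, 2.0]]}

print(potts_energy([0, 0], h, J), potts_energy([1, 1], h, J))  # -1.5 -2.5
print(l2_penalty(h, J, lambda_h=1, lambda_J=1))                # 6.5
```

Larger lambda_h and lambda_J values penalize large parameters more strongly, pulling the fitted h and J toward zero.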

training_set_up(chrms=None, filter=True)[source]

Formats data to allow the training

Parameters
  • chrms (list, optional) – Set of chromosomes from the reference data to use as the training set

  • filter (bool, optional) – Filter experiments based on the baseline
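A chrms list can be assembled in many ways. The sketch below uses odd-numbered autosomes for training and keeps the even-numbered ones aside for evaluation. This split, the use of integer chromosome labels, and the helper name are illustrative assumptions rather than library defaults.

```python
def split_chromosomes(n_autosomes=22):
    """Return (training, held_out) lists: odd-numbered vs even-numbered autosomes."""
    chromosomes = list(range(1, n_autosomes + 1))
    training = [c for c in chromosomes if c % 2 == 1]
    held_out = [c for c in chromosomes if c % 2 == 0]
    return training, held_out


training_chrms, test_chrms = split_chromosomes()
print(training_chrms)  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
```

The resulting training_chrms list is the kind of value that would be passed as chrms.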

write_bed(out_file='predictions', compartments=True, subcompartments=True)[source]

Formats and saves predictions in BED format

Parameters
  • out_file (str, optional) – Folder/Path to save the prediction results

  • subcompartments (bool, optional) – Whether to generate files with subcompartment annotations

  • compartments (bool, optional) – Whether to generate files with compartment annotations

Returns

predictions_subcompartments (dict), predictions_compartments (dict)

Predicted subcompartment and compartment annotations, as dictionaries organized by chromosome
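How per-bin predictions map onto BED intervals can be sketched as follows. The layout actually written by write_bed is not specified here, so the four-column output, the merging of consecutive bins, and the helper name bins_to_bed_lines are assumptions.

```python
def bins_to_bed_lines(chrom, labels, res_kb=50):
    """Merge runs of identical labels into BED intervals (0-based, half-open)."""
    bin_size = res_kb * 1000
    lines = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            lines.append(f"{chrom}\t{start * bin_size}\t{i * bin_size}\t{labels[start]}")
            start = i
    return lines


lines = bins_to_bed_lines("chr1", ["A1", "A1", "B2", "B2", "B2", "A2"], res_kb=50)
print("\n".join(lines))
```

With 50 kb bins, the six toy labels collapse into three intervals: 0 to 100000 (A1), 100000 to 250000 (B2), and 250000 to 300000 (A2).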

class PyMEGABASE.PyMEGABASE_legacy(cell_line='GM12878', assembly='hg19', organism='human', signal_type='signal p-value', ref_cell_line_path='tmp_meta', cell_line_path=None, types_path='PyMEGABASE/types', histones=True, tf=False, atac=False, small_rna=False, total_rna=False, n_states=19, extra_filter='', res=50, chromosome_sizes=None, AB=False)[source]

Bases: object

The PyMEGABASE class performs genomic annotations.

The PyMEGABASE class sets up the environment to generate predictions of genomic annotations.

Parameters
  • cell_line (str, required) – Name of target cell type

  • assembly (str, required) – Reference assembly of target cell line (‘hg19’, ‘GRCh38’, ‘mm10’)

  • organism (str, required) – Organism of the target cell type

  • signal_type (str, required) – Input files signal type (‘signal p-value’, ‘fold change over control’, …)

  • ref_cell_line_path (str, optional) – Folder/Path to place reference/training data (‘tmp_meta’)

  • cell_line_path (str, optional) – Folder/Path to place target cell data

  • types_path (str, optional) – Folder/Path where the genomic annotations are located

  • histones (bool, required) – Whether to use Histone Modification data from the ENCODE databank for prediction

  • tf (bool, required) – Whether to use Transcription Factor data from the ENCODE databank for prediction

  • small_rna (bool, required) – Whether to use Small RNA-Seq data from the ENCODE databank for prediction

  • total_rna (bool, required) – Whether to use Total RNA-seq data from the ENCODE databank for prediction

  • n_states (int, optional) – Number of states for the D-nodes

  • extra_filter (str, optional) – Add filter to the fetching data url to download cell type data

  • res (int, optional) – Resolution for genomic annotation calling, in kilobase pairs (5, 50, 100)

  • chromosome_sizes (list, optional) – Chromosome sizes based on the reference genome assembly - required for non-human assemblies

build_state_vector(int_types, all_averages)[source]

Builds the set of state vectors used in the training process

Parameters
  • int_types (list, required) – Genomic annotations

  • all_averages (list, required) – D-node data

download_and_process_cell_line_data(nproc=10, all_exp=True)[source]

Download and preprocess target cell data for the D-nodes

Parameters
  • nproc (int, required) – Number of processors dedicated to download and process data

  • all_exp (bool, optional) – Download and process all replicas for each experiment

download_and_process_ref_data(nproc, all_exp=True)[source]

Download and preprocess reference data for the D-nodes

Parameters
  • nproc (int, required) – Number of processors dedicated to download and process data

  • all_exp (bool, optional) – Download and process all replicas for each experiment

extra_track(experiment, bw_file)[source]

Introduces a custom track

Parameters
  • experiment (str, required) – Name of the experiment

  • bw_file (str, required) – Path to the custom track

filter_exp()[source]

Performs a baseline assessment of the experiments

get_tmatrix(chrms, silent=False)[source]

Extract the training data

Parameters
  • chrms (list, optional) – Set of chromosomes from the reference data used as the training set

  • silent (bool, optional) – Silence outputs

prediction_X(chr='X', h_and_J_file=None)[source]

Predicts and outputs the genomic annotations for chromosome X

Parameters
  • chr (str, optional) – Chromosome to predict

  • h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome)

Predicted annotations

prediction_all_chrm(path=None, save_subcompartments=True, save_compartments=True)[source]

Predicts and outputs the genomic annotations for all the chromosomes

Parameters
  • path (str, optional) – Folder/Path to save the prediction results

  • save_subcompartments (bool, optional) – Whether to generate files with subcompartment annotations for each chromosome

  • save_compartments (bool, optional) – Whether to generate files with compartment annotations for each chromosome

Returns

predictions_subcompartments (dict), predictions_compartments (dict)

Predicted subcompartment and compartment annotations, as dictionaries organized by chromosome

prediction_single_chrom(chr=1, h_and_J_file=None)[source]

Predicts and outputs the genomic annotations for a single chromosome

Parameters
  • chr (int, optional) – Chromosome to predict

  • h_and_J_file (str, optional) – Model energy term file path

Returns

array (size of chromosome)

Predicted annotations

printHeader()[source]
process_replica(line, cell_line_path, chrm_size)[source]

Preprocessing function for each replica

Parameters
  • line (list, required) – Information about the replica: name, ENCODE id and replica id

  • cell_line_path (str, required) – Path to target cell type data

  • chrm_size (list, required) – Chromosome sizes based on the assembly

test_set(chr=1, silent=False)[source]

Extracts the D-node input data for a given chromosome

Parameters
  • chr (int, required) – Chromosome from which to extract the input data for the D-nodes

  • silent (bool, optional) – Avoid printing information

Returns

array (size of chromosome, 5 * number of unique experiments)

D-node input data

training(nproc=10, lambda_h=100, lambda_J=100)[source]

Performs the training of the Potts model based on the reference data

Parameters
  • nproc (int, required) – Number of processors used to train

  • lambda_h (float, optional) – Regularization strength for the h energy term

  • lambda_J (float, optional) – Regularization strength for the J energy term

Returns

array (size of chromosome, 5 * number of unique experiments)

D-node input data

training_set_up(chrms=None, filter=True)[source]

Formats data to allow the training

Parameters
  • chrms (list, optional) – Set of chromosomes from the reference data to use as the training set

  • filter (bool, optional) – Filter experiments based on the baseline

write_bed(out_file='predictions', compartments=True, subcompartments=True)[source]

Formats and saves predictions in BED format

Parameters
  • out_file (str, optional) – Folder/Path to save the prediction results

  • subcompartments (bool, optional) – Whether to generate files with subcompartment annotations

  • compartments (bool, optional) – Whether to generate files with compartment annotations

Returns

predictions_subcompartments (dict), predictions_compartments (dict)

Predicted subcompartment and compartment annotations, as dictionaries organized by chromosome