PyMEGABASE
The PyMEGABASE classes perform subcompartment and compartment annotations based on 1D chromatin enrichment profiles.
- class PyMEGABASE.PyMEGABASE(cell_line='GM12878', assembly='hg19', organism='human', signal_type='signal p-value', file_format='bigWig', ref_cell_line_path='tmp_meta', cell_line_path=None, types_path=None, histones=True, tf=False, atac=False, small_rna=False, total_rna=False, n_states=19, extra_filter='', res=50, chromosome_sizes=None, AB=False)
Bases: object
The PyMEGABASE class performs genomic annotations.
The PyMEGABASE class sets up the environment to generate predictions of genomic annotations.
- Parameters
cell_line (str, required) – Name of target cell type
assembly (str, required) – Reference assembly of target cell line (‘hg19’, ‘GRCh38’, ‘mm10’)
organism (str, required) – Target cell type organism
signal_type (str, required) – Input files signal type (‘signal p-value’, ‘fold change over control’, …)
ref_cell_line_path (str, optional) – Folder/Path to place reference/training data (‘tmp_meta’)
cell_line_path (str, optional) – Folder/Path to place target cell data
types_path (str, optional) – Folder/Path where the genomic annotations are located
histones (bool, required) – Whether to use Histone Modification data from the ENCODE databank for prediction
tf (bool, required) – Whether to use Transcription Factor data from the ENCODE databank for prediction
small_rna (bool, required) – Whether to use Small RNA-Seq data from the ENCODE databank for prediction
total_rna (bool, required) – Whether to use Total RNA-seq data from the ENCODE databank for prediction
n_states (int, optional) – Number of states for the D-nodes
extra_filter (str, optional) – Add filter to the fetching data url to download cell type data
res (int, optional) – Resolution for genomic annotations calling in kilobasepairs (5, 50, 100)
chromosome_sizes (list, optional) – Chromosome sizes based on the reference genome assembly - required for non-human assemblies
file_format (str, optional) – File format for the input data
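The constructor takes many keyword arguments, so it can help to start from the documented defaults and override only what changes. The sketch below assumes nothing beyond the signature above; `build_kwargs` is a hypothetical helper, and the `K562` cell line is only an illustrative value.

```python
# Defaults copied from the PyMEGABASE.PyMEGABASE signature documented above.
DOCUMENTED_DEFAULTS = {
    "cell_line": "GM12878",
    "assembly": "hg19",
    "organism": "human",
    "signal_type": "signal p-value",
    "file_format": "bigWig",
    "ref_cell_line_path": "tmp_meta",
    "cell_line_path": None,
    "types_path": None,
    "histones": True,
    "tf": False,
    "atac": False,
    "small_rna": False,
    "total_rna": False,
    "n_states": 19,
    "extra_filter": "",
    "res": 50,
    "chromosome_sizes": None,
    "AB": False,
}


def build_kwargs(**overrides):
    """Merge user overrides into the documented defaults (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(DOCUMENTED_DEFAULTS))
    if unknown:
        raise TypeError(f"unknown PyMEGABASE argument(s): {unknown}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


# Illustrative target: a GRCh38 cell line at 100 kb resolution, histone + TF tracks.
kwargs = build_kwargs(cell_line="K562", assembly="GRCh38", res=100, tf=True)

# With the package installed, the model would then be created as:
#     from PyMEGABASE import PyMEGABASE
#     model = PyMEGABASE(**kwargs)
```

Rejecting unknown names early catches misspelled arguments before any data is downloaded.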
- build_state_vector(int_types, all_averages)
Builds the set of state vectors used in the training process
- Parameters
int_types (list, required) – Genomic annotations
all_averages (list, required) – D-node data
- custom_bed_track(experiment, bed_file)
Function to introduce custom bed tracks
- Parameters
experiment (str, required) – Name of the experiment
bed_file (str, required) – Path to the custom track
- custom_bw_track(experiment, bw_file)
Function to introduce custom bigwig tracks
- Parameters
experiment (str, required) – Name of the experiment
bw_file (str, required) – Path to the custom track
- download_and_process_cell_line_data(nproc=10, all_exp=True)
Download and preprocess target cell data for the D-nodes
- Parameters
nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment. Set as ‘False’ to download only 1 replica per experiment
- download_and_process_ref_data(nproc, all_exp=True)
Download and preprocess reference data for the D-nodes
- Parameters
nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment. Set as ‘False’ to download only 1 replica per experiment
- filter_exp()
Assesses each experiment's signal-to-noise ratio, based on the mean and standard deviation of its signal compared to the GM12878 equivalent, using chromosomes 1 and 2
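The reference does not give the exact criterion used by filter_exp. One plausible reading, sketched below, compares the mean-to-standard-deviation ratio of a target track against the same ratio for the GM12878 reference. The function names and the `min_ratio` threshold are assumptions for illustration, not PyMEGABASE settings.

```python
from statistics import mean, pstdev


def signal_to_noise(values):
    """Mean divided by standard deviation; returns 0.0 for a flat signal."""
    sd = pstdev(values)
    return mean(values) / sd if sd > 0 else 0.0


def passes_baseline(target, reference, min_ratio=0.5):
    """Keep an experiment if its SNR is at least `min_ratio` times the reference SNR.

    `min_ratio` is a made-up threshold for illustration only.
    """
    ref_snr = signal_to_noise(reference)
    if ref_snr == 0.0:
        return False
    return signal_to_noise(target) / ref_snr >= min_ratio


reference_signal = [1.0, 2.0, 3.0, 4.0]   # toy GM12878 track
similar_signal = [2.0, 4.0, 6.0, 8.0]     # same shape, rescaled
noisy_signal = [0.0, 0.0, 0.0, 10.0]      # dominated by a single spike

keep_similar = passes_baseline(similar_signal, reference_signal)
keep_noisy = passes_baseline(noisy_signal, reference_signal)
```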
- get_tmatrix(chrms, silent=False)
Extract the training data
- Parameters
chrms (list, optional) – Set of chromosomes from the reference data used as the training set
silent (bool, optional) – Silence outputs
- prediction_X(chr='X', h_and_J_file=None, energies=False, probabilities=False)
Predicts and outputs the genomic annotations for chromosome X
- Parameters
chr (str, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path
- Returns
- array (size of chromosome)
Predicted annotations
- prediction_all_chrm(path=None, save_subcompartments=True, save_compartments=True, energies=False, probabilities=False)
Predicts and outputs the genomic annotations for all the chromosomes
- Parameters
path (str, optional) – Folder/Path to save the prediction results
save_subcompartments (bool, optional) – Whether to generate files with subcompartment annotations for each chromosome
save_compartments (bool, optional) – Whether to generate files with compartment annotations for each chromosome
- Returns
- predictions_subcompartments (dict), predictions_compartments (dict)
Predicted subcompartment annotations and compartment annotations, as dictionaries organized by chromosome
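The returned dictionaries can be summarized per chromosome with standard tools. The sketch below assumes each dictionary maps a chromosome key to a sequence of per-locus labels; the key type, the label names and the helper `label_fractions` are illustrative assumptions.

```python
from collections import Counter


def label_fractions(predictions):
    """Return, per chromosome, the fraction of loci assigned to each label."""
    fractions = {}
    for chrom, labels in predictions.items():
        counts = Counter(labels)
        total = sum(counts.values())
        if total == 0:
            # Skip chromosomes with no predicted loci to avoid dividing by zero.
            continue
        fractions[chrom] = {label: n / total for label, n in counts.items()}
    return fractions


# Toy stand-in for predictions_subcompartments; keys and labels are illustrative.
toy_subcompartments = {
    1: ["A1", "A1", "B1", "B2"],
    2: ["A2", "A2", "A2", "B3"],
}
summary = label_fractions(toy_subcompartments)
```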
- prediction_single_chrom(chr=1, h_and_J_file=None, energies=False, probabilities=False)
Predicts and outputs the genomic annotations for a single chromosome
- Parameters
chr (int, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path
- Returns
- array (size of chromosome)
Predicted annotations
- process_replica_bed(line, cell_line_path, chrm_size)
Preprocess function for each replica formatted as a BED file
- Parameters
line (list, required) – Information about the replica: name, ENCODE id and replica id
cell_line_path (str, required) – Path to target cell type data
chrm_size (list, required) – Chromosome sizes based on the assembly
- process_replica_bw(line, cell_line_path, chrm_size)
Preprocess function for each replica formatted as a bigWig file
- Parameters
line (list, required) – Information about the replica: name, ENCODE id and replica id
cell_line_path (str, required) – Path to target cell type data
chrm_size (list, required) – Chromosome sizes based on the assembly
- test_set(chr=1, silent=False)
Extracts the D-node input data for the selected chromosome
- Parameters
chr (int, required) – Chromosome to extract input data for the D-nodes
silent (bool, optional) – Avoid printing information
- Returns
- array (size of chromosome,5*number of unique experiments)
D-node input data
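The documented shape can be estimated ahead of time to sanity-check a run. This sketch assumes that "size of chromosome" means the number of bins at the chosen resolution, with a final partial bin counted; that reading and the helper name are assumptions. The experiment names are illustrative.

```python
import math


def expected_input_shape(chrom_length_bp, res_kb, experiments):
    """Estimate the (rows, columns) of the D-node input array.

    Assumes one row per `res_kb` bin (last partial bin included) and, per the
    Returns note above, five columns per unique experiment.
    """
    n_loci = math.ceil(chrom_length_bp / (res_kb * 1000))
    n_columns = 5 * len(set(experiments))
    return n_loci, n_columns


# hg19 chromosome 1 is 249,250,621 bp long; the repeated name counts once.
shape = expected_input_shape(249_250_621, 50, ["H3K4me3", "H3K27ac", "H3K4me3"])
```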
- training(nproc=10, lambda_h=100, lambda_J=100)
Performs the training of the Potts model based on the reference data
- Parameters
nproc (int, required) – Number of processors used to train
lambda_h (float, optional) – Value for the intensity of the regularization value for the h energy term
lambda_J (float, optional) – Value for the intensity of the regularization value for the J energy term
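For orientation, the h and J terms of a Potts model are usually single-site fields and pairwise couplings. The sketch below evaluates a generic textbook Potts energy on toy numbers. The sign convention and data layout are assumptions, and it shows neither the regularization controlled by lambda_h and lambda_J nor PyMEGABASE's own implementation.

```python
def potts_energy(states, h, J):
    """Energy of one configuration: -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    states: list of state indices, one per site
    h:      h[i][a] is the local field for site i in state a
    J:      dict mapping (i, j) with i < j to a matrix J[(i, j)][a][b]
    """
    energy = -sum(h[i][s] for i, s in enumerate(states))
    for (i, j), coupling in J.items():
        energy -= coupling[states[i]][states[j]]
    return energy


# Two sites with two states each; all numbers are toy values.
h = [[1.0, 0.0], [0.0, 2.0]]
J = {(0, 1): [[0.5, 0.0], [0.0, 0.5]]}

e_mixed = potts_energy([0, 1], h, J)    # -(1.0 + 2.0) - 0.0
e_aligned = potts_energy([0, 0], h, J)  # -(1.0 + 0.0) - 0.5
```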
- training_set_up(chrms=None, filter=True)
Formats data to allow the training
- Parameters
chrms (list, optional) – Set of chromosomes from the reference data to use as the training set
filter (bool, optional) – Filter experiments based on the baseline
- write_bed(out_file='predictions', compartments=True, subcompartments=True)
Formats and saves predictions in BED format
- Parameters
out_file (str, optional) – Folder/Path to save the prediction results
subcompartments (bool, optional) – Whether to generate files with subcompartment annotations
compartments (bool, optional) – Whether to generate files with compartment annotations
- Returns
- predictions_subcompartments (dict), predictions_compartments (dict)
Predicted subcompartment annotations and compartment annotations, as dictionaries organized by chromosome
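To show what a BED export of per-locus labels can look like, the sketch below merges runs of identical labels into intervals. It is not the library's write_bed implementation: the four-column layout, the `chr` prefix and the helper name are assumptions, and the last interval is not clipped to the true chromosome length.

```python
def labels_to_bed_lines(chrom, labels, res_kb):
    """Merge runs of identical labels into BED intervals (0-based, half-open)."""
    bin_size = res_kb * 1000
    lines = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            lines.append(
                f"chr{chrom}\t{start * bin_size}\t{i * bin_size}\t{labels[start]}"
            )
            start = i
    return lines


# Four 50 kb loci on chromosome 1; labels are illustrative.
bed_lines = labels_to_bed_lines(1, ["A1", "A1", "B1", "A1"], res_kb=50)
```

Merging adjacent loci keeps the file compact while preserving every label boundary.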
- class PyMEGABASE.PyMEGABASE_legacy(cell_line='GM12878', assembly='hg19', organism='human', signal_type='signal p-value', ref_cell_line_path='tmp_meta', cell_line_path=None, types_path='PyMEGABASE/types', histones=True, tf=False, atac=False, small_rna=False, total_rna=False, n_states=19, extra_filter='', res=50, chromosome_sizes=None, AB=False)
Bases: object
The PyMEGABASE class performs genomic annotations.
The PyMEGABASE class sets up the environment to generate predictions of genomic annotations.
- Parameters
cell_line (str, required) – Name of target cell type
assembly (str, required) – Reference assembly of target cell line (‘hg19’, ‘GRCh38’, ‘mm10’)
organism (str, required) – Target cell type organism
signal_type (str, required) – Input files signal type (‘signal p-value’, ‘fold change over control’, …)
ref_cell_line_path (str, optional) – Folder/Path to place reference/training data (‘tmp_meta’)
cell_line_path (str, optional) – Folder/Path to place target cell data
types_path (str, optional) – Folder/Path where the genomic annotations are located
histones (bool, required) – Whether to use Histone Modification data from the ENCODE databank for prediction
tf (bool, required) – Whether to use Transcription Factor data from the ENCODE databank for prediction
small_rna (bool, required) – Whether to use Small RNA-Seq data from the ENCODE databank for prediction
total_rna (bool, required) – Whether to use Total RNA-seq data from the ENCODE databank for prediction
n_states (int, optional) – Number of states for the D-nodes
extra_filter (str, optional) – Add filter to the fetching data url to download cell type data
res (int, optional) – Resolution for genomic annotations calling in kilobasepairs (5, 50, 100)
chromosome_sizes (list, optional) – Chromosome sizes based on the reference genome assembly - required for non-human assemblies
- build_state_vector(int_types, all_averages)
Builds the set of state vectors used in the training process
- Parameters
int_types (list, required) – Genomic annotations
all_averages (list, required) – D-node data
- download_and_process_cell_line_data(nproc=10, all_exp=True)
Download and preprocess target cell data for the D-nodes
- Parameters
nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment
- download_and_process_ref_data(nproc, all_exp=True)
Download and preprocess reference data for the D-nodes
- Parameters
nproc (int, required) – Number of processors dedicated to download and process data
all_exp (bool, optional) – Download and process all replicas for each experiment
- extra_track(experiment, bw_file)
Function to introduce custom tracks
- Parameters
experiment (str, required) – Name of the experiment
bw_file (str, required) – Path to the custom track
- get_tmatrix(chrms, silent=False)
Extract the training data
- Parameters
chrms (list, optional) – Set of chromosomes from the reference data used as the training set
silent (bool, optional) – Silence outputs
- prediction_X(chr='X', h_and_J_file=None)
Predicts and outputs the genomic annotations for chromosome X
- Parameters
chr (str, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path
- Returns
- array (size of chromosome)
Predicted annotations
- prediction_all_chrm(path=None, save_subcompartments=True, save_compartments=True)
Predicts and outputs the genomic annotations for all the chromosomes
- Parameters
path (str, optional) – Folder/Path to save the prediction results
save_subcompartments (bool, optional) – Whether to generate files with subcompartment annotations for each chromosome
save_compartments (bool, optional) – Whether to generate files with compartment annotations for each chromosome
- Returns
- predictions_subcompartments (dict), predictions_compartments (dict)
Predicted subcompartment annotations and compartment annotations, as dictionaries organized by chromosome
- prediction_single_chrom(chr=1, h_and_J_file=None)
Predicts and outputs the genomic annotations for a single chromosome
- Parameters
chr (int, optional) – Chromosome to predict
h_and_J_file (str, optional) – Model energy term file path
- Returns
- array (size of chromosome)
Predicted annotations
- process_replica(line, cell_line_path, chrm_size)
Preprocess function for each replica
- Parameters
line (list, required) – Information about the replica: name, ENCODE id and replica id
cell_line_path (str, required) – Path to target cell type data
chrm_size (list, required) – Chromosome sizes based on the assembly
- test_set(chr=1, silent=False)
Extracts the D-node input data for the selected chromosome
- Parameters
chr (int, required) – Chromosome to extract input data for the D-nodes
silent (bool, optional) – Avoid printing information
- Returns
- array (size of chromosome,5*number of unique experiments)
D-node input data
- training(nproc=10, lambda_h=100, lambda_J=100)
Performs the training of the Potts model based on the reference data
- Parameters
nproc (int, required) – Number of processors used to train
lambda_h (float, optional) – Value for the intensity of the regularization value for the h energy term
lambda_J (float, optional) – Value for the intensity of the regularization value for the J energy term
- Returns
- array (size of chromosome,5*number of unique experiments)
D-node input data
- training_set_up(chrms=None, filter=True)
Formats data to allow the training
- Parameters
chrms (list, optional) – Set of chromosomes from the reference data to use as the training set
filter (bool, optional) – Filter experiments based on the baseline
- write_bed(out_file='predictions', compartments=True, subcompartments=True)
Formats and saves predictions in BED format
- Parameters
out_file (str, optional) – Folder/Path to save the prediction results
subcompartments (bool, optional) – Whether to generate files with subcompartment annotations
compartments (bool, optional) – Whether to generate files with compartment annotations
- Returns
- predictions_subcompartments (dict), predictions_compartments (dict)
Predicted subcompartment annotations and compartment annotations, as dictionaries organized by chromosome
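Both classes expose the same core download, training and prediction methods, so an end-to-end run can be sketched against either one. The step order below is an assumption rather than something this reference states. The method names and keywords are taken from the signatures above, while `run_annotation`, `_DryRunModel` and the `nproc=4` value are hypothetical.

```python
def run_annotation(model, nproc=4, out_dir="predictions"):
    """Run download, training and prediction in one plausible order.

    The step order is an assumption; every method name and keyword comes from
    the reference above. `model` can be a PyMEGABASE or PyMEGABASE_legacy instance.
    """
    model.download_and_process_cell_line_data(nproc=nproc, all_exp=False)
    model.download_and_process_ref_data(nproc=nproc, all_exp=False)
    model.training_set_up(filter=True)
    model.training(nproc=nproc)
    return model.prediction_all_chrm(
        path=out_dir, save_subcompartments=True, save_compartments=True
    )


class _DryRunModel:
    """Records calls instead of downloading data, so the sketch runs offline."""

    def __init__(self):
        self.steps = []

    def __getattr__(self, name):
        # Only reached for attributes that are not set, i.e. the model methods.
        def record(*args, **kwargs):
            self.steps.append(name)
            return ({}, {}) if name == "prediction_all_chrm" else None
        return record


dry_run = _DryRunModel()
subcompartments, compartments = run_annotation(dry_run)
```

Passing `all_exp=False` limits each experiment to one replica, which the reference describes as the lighter download option.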