read_data module

itrails.read_data.get_idx_state(state)

Given a state index, returns an array of resolved state indices by recursively replacing any ambiguous ‘N’ in the observed state string with ‘A’, ‘C’, ‘T’, and ‘G’.

Parameters:

state (int.) – index of the observed state in the state dictionary.

Returns:

numpy array of resolved state indices.

Return type:

np.ndarray.
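The recursive resolution described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the state ordering (unambiguous 4-mers first, in `itertools.product` order over "ACTG", then the strings containing ‘N’) and the helper names `resolve` and `resolve_idx` are assumptions made for the example.

```python
from itertools import product

import numpy as np

BASES = "ACTG"

# Hypothetical stand-in for the observed state list (assumed ordering):
# the 256 unambiguous 4-mers first, then every 4-mer containing an 'N'.
STATES = ["".join(p) for p in product(BASES, repeat=4)]
STATES += [
    s for s in ("".join(p) for p in product(BASES + "N", repeat=4))
    if "N" in s
]
INDEX = {s: i for i, s in enumerate(STATES)}


def resolve(state_str):
    """Recursively replace the first 'N' with each of the four bases."""
    pos = state_str.find("N")
    if pos == -1:
        # No ambiguity left: this string is fully resolved.
        return [state_str]
    resolved = []
    for base in BASES:
        resolved.extend(resolve(state_str[:pos] + base + state_str[pos + 1:]))
    return resolved


def resolve_idx(state):
    """Map an observed state index to the indices of its resolved states."""
    return np.array([INDEX[s] for s in resolve(STATES[state])])
```

Under these assumptions, a state with a single ‘N’ resolves to four indices, and a state without any ‘N’ resolves to itself.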

itrails.read_data.get_idx_state_new_method(state)

Given a state index using the new observed state dictionary, returns an array of resolved state indices by recursively replacing any ambiguous ‘N’ in the observed state string with ‘A’, ‘C’, ‘T’, and ‘G’.

Parameters:

state (int.) – index of the observed state in the new state dictionary.

Returns:

numpy array of resolved state indices.

Return type:

np.ndarray.

itrails.read_data.get_obs_state_dct()

Returns a list of all possible 4-character nucleotide state strings based on ‘A’, ‘C’, ‘T’, ‘G’ and, if not already present, appends additional states using ‘N’.

Returns:

list of observed state strings.

Return type:

list[str].
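One way such a list could be built is sketched below. The function name `build_obs_states`, its parameters, and the use of `itertools.product` ordering are assumptions for illustration; the actual ordering used by the library is not stated here.

```python
from itertools import product


def build_obs_states(length=4, bases="ACTG", ambiguous="N"):
    """Return unambiguous strings first, then append those containing 'N'."""
    # All strings over the four bases, e.g. 4**4 = 256 for length 4.
    states = ["".join(p) for p in product(bases, repeat=length)]
    seen = set(states)
    # Append every string over the extended alphabet that is not present yet.
    for p in product(bases + ambiguous, repeat=length):
        s = "".join(p)
        if s not in seen:
            states.append(s)
            seen.add(s)
    return states
```

With the defaults this yields 625 strings (5**4), of which the first 256 are unambiguous. Calling it with `length=3` gives 125 strings, the size one would expect for a 3-character variant.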

itrails.read_data.get_obs_state_dct_new_method()

Returns a list of observed state strings using a new method that generates 3-character nucleotide strings from ‘A’, ‘C’, ‘T’, ‘G’ and appends additional states using ‘N’ if not already present.

Returns:

list of observed state strings using the new method.

Return type:

list[str].

itrails.read_data.maf_parser(file, sp_lst)

Parses a MAF file to extract sequence alignments for the specified species. For each alignment block, collects the sequences for the species in sp_lst, replaces gaps ‘-’ with ‘N’, and converts each column of nucleotides to an index using the observed state dictionary from get_obs_state_dct.

Parameters:
  • file (str.) – path to the MAF file.

  • sp_lst (list[str].) – list of species names (expected length 4) to extract sequences for.

Returns:

list of numpy arrays where each array contains the state indices for a block.

Return type:

list[np.ndarray].
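The per-block encoding step can be illustrated with a short sketch that starts from sequences already extracted from one block, so no MAF reading is shown. The helper `encode_block` and the state ordering are assumptions for the example, not the library's code.

```python
from itertools import product

import numpy as np

BASES = "ACTG"

# Hypothetical stand-in for the observed state list (assumed ordering):
# unambiguous 4-mers first, then every 4-mer containing an 'N'.
STATES = ["".join(p) for p in product(BASES, repeat=4)]
STATES += [
    s for s in ("".join(p) for p in product(BASES + "N", repeat=4))
    if "N" in s
]
INDEX = {s: i for i, s in enumerate(STATES)}


def encode_block(seqs):
    """Encode one alignment block (four equal-length strings) as indices."""
    # Gaps are treated as ambiguous bases.
    cleaned = [s.replace("-", "N") for s in seqs]
    # zip(*cleaned) walks the alignment column by column.
    return np.array([INDEX["".join(col)] for col in zip(*cleaned)])


# Four species, three alignment columns; the first sequence has a gap.
block = ["AC-", "ACG", "ACT", "ACA"]
indices = encode_block(block)
```

Here the third column reads "NGTA" after gap replacement, so it maps to one of the ambiguous states at the end of the assumed list.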

itrails.read_data.maf_parser_new_method(file, sp_lst)

Parses a MAF file to extract sequence alignments for the specified species using the new observed state dictionary. For each alignment block, collects the sequences for the species in sp_lst, replaces gaps ‘-’ with ‘N’, and converts each column of nucleotides to an index using the observed state dictionary from get_obs_state_dct_new_method.

Parameters:
  • file (str.) – path to the MAF file.

  • sp_lst (list[str].) – list of species names (expected length 4) to extract sequences for.

Returns:

list of numpy arrays where each array contains the state indices for a block.

Return type:

list[np.ndarray].

itrails.read_data.parse_coordinates(file, sp_lst, ref)

Parses the coordinates of a MAF file polarized by a reference sequence. The reference does not need to be one of the species used for Viterbi or posterior decoding.

Parameters:
  • file (str.) – path to the MAF file.

  • sp_lst (list[str].) – list of species names (expected length 4) to extract sequences for.

  • ref (str.) – name of the reference to polarize the coordinates.

Returns:

list of numpy arrays where each array contains the state indices for a block.

Return type:

list[list].