Introduction

Bioinformatics is the science of transforming and processing biological data to gain new insights, particularly omics data: genomics, proteomics, metabolomics, etc. It is largely a mix of biology, computer science, and statistics / data science.

Algorithms

K-mer

A k-mer is a substring of length k within some larger biological sequence (e.g. DNA or amino acid chain). For example, in the DNA sequence GAAATC, the following k-mers exist:

k k-mers
1 G A A A T C
2 GA AA AA AT TC
3 GAA AAA AAT ATC
4 GAAA AAAT AATC
5 GAAAT AAATC
6 GAAATC
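Each row of the table can be reproduced by sliding a window of length k across the sequence and emitting the substring at every offset. The sketch below is illustrative only -- the enumerate_kmers helper is hypothetical, but it mirrors the slide_window utility that later code listings import from Utils:

def enumerate_kmers(seq: str, k: int):
    # yield each k-mer along with the index it starts at
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], i

print([kmer for kmer, _ in enumerate_kmers('GAAATC', 3)])  # ['GAA', 'AAA', 'AAT', 'ATC']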

Common scenarios involving k-mers:

Reverse Complement

WHAT: Given a DNA k-mer, calculate its reverse complement.

WHY: Depending on the type of biological sequence, a k-mer may have one or more alternatives. For DNA sequences specifically, a k-mer of interest may have an alternate form. Since the DNA molecule comes as 2 strands, where ...

Kroki diagram output

, ... the reverse complement of that k-mer may be just as valid as the original k-mer. For example, if an enzyme is known to bind to a specific DNA k-mer, it's possible that it might also bind to the reverse complement of that k-mer.

ALGORITHM:

ch1_code/src/ReverseComplementADnaKmer.py (lines 5 to 22):

def reverse_complement(strand: str):
    ret = ''
    for i in range(0, len(strand)):
        base = strand[i]
        if base == 'A' or base == 'a':
            base = 'T'
        elif base == 'T' or base == 't':
            base = 'A'
        elif base == 'C' or base == 'c':
            base = 'G'
        elif base == 'G' or base == 'g':
            base = 'C'
        else:
            raise Exception('Unexpected base: ' + base)

        ret += base
    return ret[::-1]

Original: TAATCCG

Reverse Complement: CGGATTA

Hamming Distance

WHAT: Given 2 k-mers, the hamming distance is the number of positional mismatches between them.

WHY: Imagine an enzyme that looks for a specific DNA k-mer pattern to bind to. Since DNA is known to mutate, it may be that the enzyme can also bind to other k-mer patterns that are slight variations of the original. For example, that enzyme may be able to bind to both AAACTG and AAAGTG.

ALGORITHM:

ch1_code/src/HammingDistanceBetweenKmers.py (lines 5 to 13):

def hamming_distance(kmer1: str, kmer2: str) -> int:
    mismatch = 0

    for ch1, ch2 in zip(kmer1, kmer2):
        if ch1 != ch2:
            mismatch += 1

    return mismatch

Kmer1: ACTTTGTT

Kmer2: AGTTTCTT

Hamming Distance: 2

Hamming Distance Neighbourhood

↩PREREQUISITES↩

WHAT: Given a source k-mer and a hamming distance limit, find all k-mers within that hamming distance of the source k-mer. In other words, find all k-mers such that hamming_distance(source_kmer, kmer) <= min_distance.

WHY: Imagine an enzyme that looks for a specific DNA k-mer pattern to bind to. Since DNA is known to mutate, it may be that the enzyme can also bind to other k-mer patterns that are slight variations of the original. This algorithm finds all such variations.

ALGORITHM:

ch1_code/src/FindAllDnaKmersWithinHammingDistance.py (lines 5 to 20):

def find_all_dna_kmers_within_hamming_distance(kmer: str, hamming_dist: int) -> set[str]:
    def recurse(kmer: str, hamming_dist: int, output: set[str]) -> None:
        if hamming_dist == 0:
            output.add(kmer)
            return

        for i in range(0, len(kmer)):
            for ch in 'ACTG':
                neighbouring_kmer = kmer[:i] + ch + kmer[i + 1:]
                recurse(neighbouring_kmer, hamming_dist - 1, output)

    output = set()
    recurse(kmer, hamming_dist, output)

    return output

Kmers within hamming distance 1 of AAAA: {'ATAA', 'AACA', 'AAAC', 'GAAA', 'ACAA', 'AAAT', 'CAAA', 'AAAG', 'AGAA', 'AAGA', 'AATA', 'TAAA', 'AAAA'}

Find Locations

↩PREREQUISITES↩

WHAT: Given a k-mer, find where that k-mer occurs in some larger sequence. The search may potentially include the k-mer's variants (e.g. reverse complement).

WHY: Imagine that you know of a specific k-mer pattern that serves some function in an organism. If you see that same k-mer pattern appearing in some other related organism, it could be a sign that the k-mer pattern serves a similar function. For example, the same k-mer pattern could be used by 2 related types of bacteria as a DnaA box.

The enzyme that operates on that k-mer may also operate on its reverse complement as well as slight variations on that k-mer. For example, if an enzyme binds to AAAAAAAAA, it may also bind to its...

ALGORITHM:

ch1_code/src/FindLocations.py (lines 11 to 32):

class Options(NamedTuple):
    hamming_distance: int = 0
    reverse_complement: bool = False


def find_kmer_locations(sequence: str, kmer: str, options: Options = Options()) -> List[int]:
    # Construct test kmers
    test_kmers = set()
    test_kmers.add(kmer)
    test_kmers |= find_all_dna_kmers_within_hamming_distance(kmer, options.hamming_distance)
    if options.reverse_complement:
        rc_kmer = reverse_complement(kmer)
        test_kmers |= find_all_dna_kmers_within_hamming_distance(rc_kmer, options.hamming_distance)

    # Slide over the sequence's kmers and check for matches against test kmers
    k = len(kmer)
    idxes = []
    for seq_kmer, i in slide_window(sequence, k):
        if seq_kmer in test_kmers:
            idxes.append(i)
    return idxes

Found AAAA in AAAAGAACCTAATCTTAAAGGAGATGATGATTCTAA at index [0, 1, 2, 3, 12, 15, 16, 30]

Find Clumps

↩PREREQUISITES↩

WHAT: Given a k-mer, find where that k-mer clusters in some larger sequence. The search may potentially include the k-mer's variants (e.g. reverse complement).

WHY: An enzyme may need to bind to a specific region of DNA to begin doing its job. That is, it looks for a specific k-mer pattern to bind to, where that k-mer represents the beginning of some larger DNA region that it operates on. Since DNA is known to mutate, oftentimes you'll find multiple copies of the same k-mer pattern clustered together -- if one copy mutated to become unusable, the other copies are still around.

For example, the DnaA box is a special k-mer pattern used by enzymes during DNA replication. Since DNA is known to mutate, the DnaA box can be found repeating multiple times in the region of DNA known as the replication origin. Finding the DnaA box clustered in a small region is a good indicator that you've found the replication origin.

ALGORITHM:

ch1_code/src/FindClumps.py (lines 10 to 26):

def find_kmer_clusters(sequence: str, kmer: str, min_occurrence_in_cluster: int, cluster_window_size: int, options: Options = Options()) -> List[int]:
    cluster_locs = []

    locs = find_kmer_locations(sequence, kmer, options)
    start_i = 0
    occurrence_count = 1
    for end_i in range(1, len(locs)):
        if locs[end_i] - locs[start_i] < cluster_window_size:  # within a cluster window?
            occurrence_count += 1
        else:
            if occurrence_count >= min_occurrence_in_cluster:  # did the last cluster meet the min occurrence requirement?
                cluster_locs.append(locs[start_i])
            start_i = end_i
            occurrence_count = 1

    return cluster_locs

Found clusters of GGG (at least 3 occurrences in window of 13) in GGGACTGAACAAACAAATTTGGGAGGGCACGGGTTAAAGGAGATGATGATTCAAAGGGT at index [19, 37]

Find Repeating

↩PREREQUISITES↩

WHAT: Given a sequence, find clusters of unique k-mers within that sequence. In other words, for each unique k-mer that exists in the sequence, see if it clusters in the sequence. The search may potentially include variants of the k-mers (e.g. reverse complements of the k-mers).

WHY: An enzyme may need to bind to a specific region of DNA to begin doing its job. That is, it looks for a specific k-mer pattern to bind to, where that k-mer represents the beginning of some larger DNA region that it operates on. Since DNA is known to mutate, oftentimes you'll find multiple copies of the same k-mer pattern clustered together -- if one copy mutated to become unusable, the other copies are still around.

For example, the DnaA box is a special k-mer pattern used by enzymes during DNA replication. Since DNA is known to mutate, the DnaA box can be found repeating multiple times in the region of DNA known as the replication origin. Given that you don't know the k-mer pattern for the DnaA box but you do know the replication origin, you can scan through the replication origin for repeating k-mer patterns. If a pattern is found to heavily repeat, it's a good candidate that it's the k-mer pattern for the DnaA box.

ALGORITHM:

ch1_code/src/FindRepeating.py (lines 12 to 41):

from Utils import slide_window


def count_kmers(data: str, k: int, options: Options = Options()) -> Counter[str]:
    counter = Counter()
    for kmer, i in slide_window(data, k):
        neighbourhood = find_all_dna_kmers_within_hamming_distance(kmer, options.hamming_distance)
        for neighbouring_kmer in neighbourhood:
            counter[neighbouring_kmer] += 1

        if options.reverse_complement:
            kmer_rc = reverse_complement(kmer)
            neighbourhood = find_all_dna_kmers_within_hamming_distance(kmer_rc, options.hamming_distance)
            for neighbouring_kmer in neighbourhood:
                counter[neighbouring_kmer] += 1

    return counter


def top_repeating_kmers(data: str, k: int, options: Options = Options()) -> Set[Tuple[str, int]]:
    counts = count_kmers(data, k, options)

    _, top_count = counts.most_common(1)[0]

    top_kmers = set()
    for kmer, count in counts.items():
        if count == top_count:
            top_kmers.add((kmer, count))
    return top_kmers

Top 5-mer frequencies for GGGACTGAACAAACAAATTTGGGAGGGCACGGGTTAAAGGAGATGATGATTCAAAGGGT:

Find Repeating in Window

↩PREREQUISITES↩

WHAT: Given a sequence, find regions within that sequence that contain clusters of unique k-mers. In other words, ...

The search may potentially include variants of the k-mers (e.g. reverse complements of the k-mers).

WHY: An enzyme may need to bind to a specific region of DNA to begin doing its job. That is, it looks for a specific k-mer pattern to bind to, where that k-mer represents the beginning of some larger DNA region that it operates on. Since DNA is known to mutate, oftentimes you'll find multiple copies of the same k-mer pattern clustered together -- if one copy mutated to become unusable, the other copies are still around.

For example, the DnaA box is a special k-mer pattern used by enzymes during DNA replication. Since DNA is known to mutate, the DnaA box can be found repeating multiple times in the region of DNA known as the replication origin. Given that you don't know the k-mer pattern for the DnaA box but you do know the replication origin, you can scan through the replication origin for repeating k-mer patterns. If a pattern is found to heavily repeat, it's a good candidate that it's the k-mer pattern for the DnaA box.

ALGORITHM:

ch1_code/src/FindRepeatingInWindow.py (lines 20 to 67):

def scan_for_repeating_kmers_in_clusters(sequence: str, k: int, min_occurrence_in_cluster: int, cluster_window_size: int, options: Options = Options()) -> Set[KmerCluster]:
    def neighborhood(kmer: str) -> Set[str]:
        neighbourhood = find_all_dna_kmers_within_hamming_distance(kmer, options.hamming_distance)
        if options.reverse_complement:
            kmer_rc = reverse_complement(kmer)
            neighbourhood |= find_all_dna_kmers_within_hamming_distance(kmer_rc, options.hamming_distance)  # union, don't overwrite
        return neighbourhood

    kmer_counter = {}

    def add_kmer(kmer: str, loc: int) -> None:
        if kmer not in kmer_counter:
            kmer_counter[kmer] = set()
        kmer_counter[kmer].add(loc)

    def remove_kmer(kmer: str, loc: int) -> None:
        kmer_counter[kmer].remove(loc)
        if len(kmer_counter[kmer]) == 0:
            del kmer_counter[kmer]

    clustered_kmers = set()

    old_first_kmer = None
    for window, window_idx in slide_window(sequence, cluster_window_size):
        first_kmer = window[0:k]
        last_kmer = window[-k:]

        # If first iteration, add all kmers
        if window_idx == 0:
            for kmer, kmer_idx in slide_window(window, k):
                for alt_kmer in neighborhood(kmer):
                    add_kmer(alt_kmer, window_idx + kmer_idx)
        else:
            # Add kmer that was walked in to
            for new_last_kmer in neighborhood(last_kmer):
                add_kmer(new_last_kmer, window_idx + cluster_window_size - k)
            # Remove kmer that was walked out of
            if old_first_kmer is not None:
                for alt_kmer in neighborhood(old_first_kmer):
                    remove_kmer(alt_kmer, window_idx - 1)

        old_first_kmer = first_kmer

        # Find clusters within window -- tuple is k-mer, start_idx, occurrence_count
        for found_kmer, found_locs in kmer_counter.items():
            if len(found_locs) >= min_occurrence_in_cluster:
                clustered_kmers.add(KmerCluster(found_kmer, min(found_locs), len(found_locs)))

    return clustered_kmers

Found clusters of k=9 (at least 6 occurrences in window of 20) in TTTTTTTTTTTTTCCCTTTTTTTTTCCCTTTTTTTTTTTTT at...

Probability of Appearance

↩PREREQUISITES↩

WHAT: Given ...

... find the probability of that k-mer appearing at least c times within an arbitrary sequence of length n. For example, the probability that the 2-mer AA appears at least 2 times in a sequence of length 4:

The probability is 7/256.

This isn't trivial to accurately compute because the occurrences of a k-mer within a sequence may overlap. For example, the number of times AA appears in AAAA is 3 while in CAAA it's 2.
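For a quick sanity check of the overlap claim above, occurrences can be counted by testing every start position instead of relying on a non-overlapping substring search. The count_overlapping helper below is a hypothetical sketch, not part of the chapter code:

def count_overlapping(seq: str, kmer: str) -> int:
    # test every possible start position so overlapping matches are counted
    return sum(1 for i in range(len(seq) - len(kmer) + 1) if seq[i:i + len(kmer)] == kmer)

print(count_overlapping('AAAA', 'AA'))  # 3
print(count_overlapping('CAAA', 'AA'))  # 2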

WHY: When a k-mer is found within a sequence, knowing the probability of that k-mer being found within an arbitrary sequence of the same length hints at the significance of the find. For example, if some 10-mer has a 0.2 chance of appearing in an arbitrary sequence of length 50, that's too high of a chance to consider it a significant find -- 0.2 means 1 in 5 chance that the 10-mer just randomly happens to appear.

Bruteforce Algorithm

ALGORITHM:

This algorithm tries every possible combination of sequence to find the probability. It falls over once the length of the sequence extends into the double digits. It's intended to help conceptualize what's going on.

ch1_code/src/BruteforceProbabilityOfKmerInArbitrarySequence.py (lines 9 to 39):

# Of the X sequence combinations tried, Y had the k-mer. The probability is Y/X.
def bruteforce_probability(searchspace_len: int, searchspace_symbol_count: int, search_for: List[int], min_occurrence: int) -> (int, int):
    found = 0
    found_max = searchspace_symbol_count ** searchspace_len

    str_to_search = [0] * searchspace_len

    def count_instances():
        ret = 0
        for i in range(0, searchspace_len - len(search_for) + 1):
            if str_to_search[i:i + len(search_for)] == search_for:
                ret += 1
        return ret

    def walk(idx: int):
        nonlocal found

        if idx == searchspace_len:
            count = count_instances()
            if count >= min_occurrence:
                found += 1
        else:
            for i in range(0, searchspace_symbol_count):
                walk(idx + 1)
                str_to_search[idx] += 1
            str_to_search[idx] = 0

    walk(0)

    return found, found_max

Brute-forcing probability of ACTG in arbitrary sequence of length 8

Probability: 0.0195159912109375 (1279/65536)

Selection Estimate Algorithm

ALGORITHM:

⚠️NOTE️️️⚠️

The explanation in the comments below is a bastardization of "1.13 Detour: Probabilities of Patterns in a String" in the Pevzner book...

This algorithm tries estimating the probability by ignoring the fact that the occurrences of a k-mer in a sequence may overlap. For example, searching for the 2-mer AA in the sequence AAAT yields 2 instances of AA:

If you go ahead and ignore overlaps, you can think of the k-mers occurring in a string as insertions. For example, imagine a sequence of length 7 and the 2-mer AA. If you were to inject 2 instances of AA into the sequence to get it to reach length 7, how would that look?

2 instances of a 2-mer take up 4 characters. To get the sequence to end up with a length of 7 after the insertions, the sequence needs to start with a length of 3:

SSS

Given that you're changing reality to say that the instances WON'T overlap in the sequence, you can treat each instance of the 2-mer AA as a single entity being inserted. The number of ways that these 2 instances can be inserted into the sequence is 10:

I = insertion of AA, S = arbitrary sequence character

IISSS  ISISS  ISSIS  ISSSI
SIISS  SISIS  SISSI
SSIIS  SSISI
SSSII

Another way to think of the above insertions is that they aren't insertions. Rather, you have 5 items in total and you're selecting 2 of them. How many ways can you select 2 of those 5 items? 10.

The number of ways to insert can be counted via the "binomial coefficient": bc(m, k) = m!/(k!(m-k)!), where m is the total number of items (5 in the example above) and k is the number of selections (2 in the example above). For the example above:

bc(5, 2) = 5!/(2!(5-2)!) = 10

Since the SSS can be any arbitrary nucleotide sequence of length 3, count the number of different representations that are possible for SSS: 4^3 = 4*4*4 = 64 (4 because a nucleotide can be one of ACTG, 3 because the length is 3). In each of these representations, the 2-mer AA can be inserted in 10 different ways:

64*10 = 640

Since the total length of the sequence is 7, count the number of different representations that are possible:

4^7 = 4*4*4*4*4*4*4 = 16384

The estimated probability is 640/16384. For...
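To make the arithmetic above concrete, the estimate for this example can be checked directly (a quick sketch using Python's math.comb, not part of the chapter code):

from math import comb

ways_to_insert = comb(5, 2)   # bc(5, 2) = 10 ways to place the 2 non-overlapping AA instances
fillers = 4 ** 3              # 64 possible arbitrary sequences for SSS
total = 4 ** 7                # 16384 possible sequences of length 7
print(ways_to_insert * fillers / total)  # 640/16384 = 0.0390625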

⚠️NOTE️️️⚠️

Maybe try training a deep learning model to see if it can provide better estimates?

ch1_code/src/EstimateProbabilityOfKmerInArbitrarySequence.py (lines 57 to 70):

def estimate_probability(searchspace_len: int, searchspace_symbol_count: int, search_for: List[int], min_occurrence: int) -> float:
    def factorial(num):
        if num <= 1:  # base case must cover 0, otherwise factorial(0) recurses forever
            return 1
        else:
            return num * factorial(num - 1)

    def bc(m, k):
        return factorial(m) / (factorial(k) * factorial(m - k))

    k = len(search_for)
    n = (searchspace_len - min_occurrence * k)
    return bc(n + min_occurrence, min_occurrence) * (searchspace_symbol_count ** n) / searchspace_symbol_count ** searchspace_len

Estimating probability of ACTG in arbitrary sequence of length 8

Probability: 0.01953125

GC Skew

WHAT: Given a sequence, create a counter and walk over the sequence. Whenever a ...

WHY: Given the DNA sequence of an organism, some segments may have a lower count of Gs vs Cs.

During replication, some segments of DNA stay single-stranded for a much longer time than other segments. Single-stranded DNA is 100 times more susceptible to mutations than double-stranded DNA. Specifically, in single-stranded DNA, C has a greater tendency to mutate to T. When that single-stranded DNA re-binds to a neighbouring strand, the positions of any nucleotides that mutated from C to T will change on the neighbouring strand from G to A.

⚠️NOTE️️️⚠️

Recall that the reverse complements of ...

It mutated from C to T. Since it's now T, its complement is A.

Plotting the skew shows roughly which segments of DNA stayed single-stranded for a longer period of time. That information hints at special / useful locations in the organism's DNA sequence (replication origin / replication terminus).

ALGORITHM:

ch1_code/src/GCSkew.py (lines 8 to 21):

def gc_skew(seq: str):
    counter = 0
    skew = [counter]
    for i in range(len(seq)):
        if seq[i] == 'G':
            counter += 1
            skew.append(counter)
        elif seq[i] == 'C':
            counter -= 1
            skew.append(counter)
        else:
            skew.append(counter)
    return skew

Calculating skew for: ...

Result: [0, -1, -1,...

GC Skew Plot

Motif

↩PREREQUISITES↩

A motif is a pattern that matches many different k-mers, where those matched k-mers have some shared biological significance. The pattern matches k-mers of a fixed length k, where each position may have alternate forms. The simplest way to think of a motif is as a regex pattern without quantifiers. For example, the regex [AT]TT[GC]C matches ATTGC, ATTCC, TTTGC, and TTTCC.
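As a rough illustration of the regex analogy, the motif above can be checked against candidate k-mers with Python's re module (this snippet is illustrative only, not part of the chapter code):

import re

motif_pattern = re.compile(r'[AT]TT[GC]C')  # position 0 may be A or T, position 3 may be G or C

for kmer in ['ATTGC', 'ATTCC', 'TTTGC', 'TTTCC', 'GTTGC']:
    print(kmer, bool(motif_pattern.fullmatch(kmer)))  # only the last one fails to match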

A common scenario involving motifs is to search through a set of DNA sequences for an unknown motif: Given a set of sequences, it's suspected that each sequence contains a k-mer that matches some motif. But, that motif isn't known beforehand. Both the k-mers and the motif they match need to be found.

For example, each of the following sequences contains a k-mer that matches some motif:

Sequences
ATTGTTACCATAACCTTATTGCTAG
ATTCCTTTAGGACCACCCCAAACCC
CCCCAGGAGGGAACCTTTGCACACA
TATATATTTCCCACCCCAAGGGGGG

That motif is the one described above ([AT]TT[GC]C):

Sequences
ATTGTTACCATAACCTTATTGCTAG
ATTCCTTTAGGACCACCCCAAACCC
CCCCAGGAGGGAACCTTTGCACACA
TATATATTTCCCACCCCAAGGGGGG

A motif matrix is a matrix of k-mers where each k-mer matches a motif. In the example sequences above, the motif matrix would be:

0 1 2 3 4
A T T G C
A T T C C
T T T G C
T T T C C

A k-mer that matches a motif may be referred to as a motif member.

Consensus String

WHAT: Given a motif matrix, generate a k-mer where each position is the nucleotide most abundant at that column of the matrix.

WHY: Given a set of k-mers that are suspected to be part of a motif (motif matrix), the k-mer generated by selecting the most abundant column at each index is the "ideal" k-mer for the motif. It's a concise way of describing the motif, especially if the columns in the motif matrix are highly conserved.

ALGORITHM:

⚠️NOTE️️️⚠️

It may be more appropriate to use a hybrid alphabet when representing a consensus string because alternate nucleotides could be represented as a single letter. The Pevzner book doesn't mention this specifically but multiple online sources discuss it.

ch2_code/src/ConsensusString.py (lines 5 to 15):

def get_consensus_string(kmers: List[str]) -> str:
    count = len(kmers[0])
    out = ''
    for i in range(0, count):
        c = Counter()
        for kmer in kmers:
            c[kmer[i]] += 1
        ch = c.most_common(1)
        out += ch[0][0]
    return out

Consensus is TTTCC in

ATTGC
ATTCC
TTTGC
TTTCC
TTTCA

Motif Matrix Count

WHAT: Given a motif matrix, count how many of each nucleotide are in each column.

WHY: Having a count of the number of nucleotides in each column is a basic statistic that gets used further down the line for tasks such as scoring a motif matrix.

ALGORITHM:

ch2_code/src/MotifMatrixCount.py (lines 7 to 21):

def motif_matrix_count(motif_matrix: List[str], elements='ACGT') -> Dict[str, List[int]]:
    rows = len(motif_matrix)
    cols = len(motif_matrix[0])

    ret = {}
    for ch in elements:
        ret[ch] = [0] * cols
    
    for c in range(0, cols):
        for r in range(0, rows):
            item = motif_matrix[r][c]
            ret[item][c] += 1
            
    return ret

Counting nucleotides at each column of the motif matrix...

ATTGC
TTTGC
TTTGG
ATTGC

Result...

('A', [2, 0, 0, 0, 0])
('C', [0, 0, 0, 0, 3])
('G', [0, 0, 0, 4, 1])
('T', [2, 4, 4, 0, 0])

Motif Matrix Profile

↩PREREQUISITES↩

WHAT: Given a motif matrix, for each column calculate how often A, C, G, and T occur as percentages.

WHY: The percentages for each column represent a probability distribution for that column. For example, in column 1 of...

0 1 2 3 4
A T T C G
C T T C G
T T T C G
T T T T G

These probability distributions can be used further down the line for tasks such as determining the probability that some arbitrary k-mer conforms to the same motif matrix.

ALGORITHM:

ch2_code/src/MotifMatrixProfile.py (lines 8 to 22):

def motif_matrix_profile(motif_matrix_counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    ret = {}
    for elem, counts in motif_matrix_counts.items():
        ret[elem] = [0.0] * len(counts)

    cols = len(counts)  # all elems should have the same len, so just grab the last one that was walked over
    for i in range(cols):
        total = 0
        for elem in motif_matrix_counts.keys():
            total += motif_matrix_counts[elem][i]
        for elem in motif_matrix_counts.keys():
            ret[elem][i] = motif_matrix_counts[elem][i] / total

    return ret

Profiling nucleotides at each column of the motif matrix...

ATTCG
CTTCG
TTTCG
TTTTG

Result...

('A', [0.25, 0.0, 0.0, 0.0, 0.0])
('C', [0.25, 0.0, 0.0, 0.75, 0.0])
('G', [0.0, 0.0, 0.0, 0.0, 1.0])
('T', [0.5, 1.0, 1.0, 0.25, 0.0])

Motif Matrix Score

WHAT: Given a motif matrix, assign it a score based on how similar the k-mers that make up the matrix are to each other. Specifically, how conserved the nucleotides at each column are.

WHY: Given a set of k-mers that are suspected to be part of a motif (motif matrix), the more similar those k-mers are to each other the more likely it is that those k-mers are members of the same motif. This seems to be the case for many enzymes that bind to DNA based on a motif (e.g. transcription factors).

Popularity Algorithm

ALGORITHM:

This algorithm scores a motif matrix by summing up the number of unpopular items in a column. For example, imagine a column that has 7 Ts, 2 Cs, and 1 A. The Ts are the most popular (7 items), meaning that the remaining 3 items (2 Cs and 1 A) are unpopular -- the score for the column is 3.

Sum up each of the column scores to get the final score for the motif matrix. A lower score is better.

ch2_code/src/ScoreMotif.py (lines 17 to 39):

def score_motif(motif_matrix: List[str]) -> int:
    rows = len(motif_matrix)
    cols = len(motif_matrix[0])

    # count up each column
    counter_per_col = []
    for c in range(0, cols):
        counter = Counter()
        for r in range(0, rows):
            counter[motif_matrix[r][c]] += 1
        counter_per_col.append(counter)

    # sum counts for each column AFTER removing the top-most count -- that is, consider the top-most count as the
    # most popular char, so you're summing the counts of all the UNPOPULAR chars
    unpopular_sums = []
    for counter in counter_per_col:
        most_popular_item = counter.most_common(1)[0][0]
        del counter[most_popular_item]
        unpopular_sum = sum(counter.values())
        unpopular_sums.append(unpopular_sum)

    return sum(unpopular_sums)

Scoring...

ATTGC
TTTGC
TTTGG
ATTGC

3

Entropy Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm scores a motif matrix by calculating the entropy of each column in the motif matrix. Entropy is defined as the level of uncertainty for some variable. The more uncertain the nucleotides are in the column of a motif matrix, the higher (worse) the score. For example, given a motif matrix with 10 rows, a column with ...

Sum the output for each column to get the final score for the motif matrix. A lower score is better.

ch2_code/src/ScoreMotifUsingEntropy.py (lines 10 to 38):

# According to the book, the method of scoring a motif matrix as defined in ScoreMotif.py isn't the method used in the
# real world. The method used in the real world is this one, where...
# 1. each column has its probability distribution calculated (prob of A vs prob C vs prob of T vs prob of G)
# 2. the entropy of each of those prob dist are calculated
# 3. those entropies are summed up to get the ENTROPY OF THE MOTIF MATRIX
def calculate_entropy(values: List[float]) -> float:
    ret = 0.0
    for value in values:
        ret += value * (log(value, 2.0) if value > 0.0 else 0.0)
    ret = -ret
    return ret

def score_motify_entropy(motif_matrix: List[str]) -> float:
    rows = len(motif_matrix)
    cols = len(motif_matrix[0])

    # count up each column
    counts = motif_matrix_count(motif_matrix)
    profile = motif_matrix_profile(counts)

    # prob dist to entropy
    entropy_per_col = []
    for c in range(cols):
        entropy = calculate_entropy([profile['A'][c], profile['C'][c], profile['G'][c], profile['T'][c]])
        entropy_per_col.append(entropy)

    # sum up entropies to get entropy of motif
    return sum(entropy_per_col)

Scoring...

ATTGC
TTTGC
TTTGG
ATTGC

1.811278124459133

Relative Entropy Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm scores a motif matrix by calculating the entropy of each column relative to the overall nucleotide distribution of the sequences from which each motif member came. This is important when finding motif members across a set of sequences. For example, the following sequences have a nucleotide distribution highly skewed towards C...

Sequences
CCCCCCCCCCCCCCCCCATTGCCCC
ATTCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCTTTGCCCCCC
CCCCCCTTTCTCCCCCCCCCCCCCC

Given the sequences in the example above, of all motif matrices possible for k=5, basic entropy scoring will always lead to a matrix filled with Cs:

0 1 2 3 4
C C C C C
C C C C C
C C C C C
C C C C C

Even though the above motif matrix scores perfectly, it's likely junk. Members containing all Cs score better because the sequences they come from are biased (saturated with Cs), not because they share some higher biological significance.

To reduce bias, the nucleotide distributions from which the members came from need to be factored in to the entropy calculation: relative entropy.

ch2_code/src/ScoreMotifUsingRelativeEntropy.py (lines 10 to 84):

# NOTE: This is different from the traditional version of entropy -- it doesn't negate the sum before returning it.
def calculate_entropy(probabilities_for_nuc: List[float]) -> float:
    ret = 0.0
    for value in probabilities_for_nuc:
        ret += value * (log(value, 2.0) if value > 0.0 else 0.0)
    return ret


def calculate_cross_entropy(probabilities_for_nuc: List[float], total_frequencies_for_nucs: List[float]) -> float:
    ret = 0.0
    for prob, total_freq in zip(probabilities_for_nuc, total_frequencies_for_nucs):
        ret += prob * (log(total_freq, 2.0) if total_freq > 0.0 else 0.0)
    return ret


def score_motif_relative_entropy(motif_matrix: List[str], source_strs: List[str]) -> float:
    # calculate frequency of nucleotide across all source strings
    nuc_counter = Counter()
    nuc_total = 0
    for source_str in source_strs:
        for nuc in source_str:
            nuc_counter[nuc] += 1
        nuc_total += len(source_str)
    nuc_freqs = dict([(k, v / nuc_total) for k, v in nuc_counter.items()])

    rows = len(motif_matrix)
    cols = len(motif_matrix[0])

    # count up each column
    counts = motif_matrix_count(motif_matrix)
    profile = motif_matrix_profile(counts)
    relative_entropy_per_col = []
    for c in range(cols):
        # get entropy of column in motif
        entropy = calculate_entropy(
            [
                profile['A'][c],
                profile['C'][c],
                profile['G'][c],
                profile['T'][c]
            ]
        )
        # get cross entropy of column in motif (mixes in global nucleotide frequencies)
        cross_entropy = calculate_cross_entropy(
            [
                profile['A'][c],
                profile['C'][c],
                profile['G'][c],
                profile['T'][c]
            ],
            [
                nuc_freqs['A'],
                nuc_freqs['C'],
                nuc_freqs['G'],
                nuc_freqs['T']
            ]
        )
        relative_entropy = entropy - cross_entropy
        # Right now relative_entropy is calculated by subtracting cross_entropy from the (non-negated) entropy. But,
        # according to the Pevzner book, the calculation of relative_entropy can be simplified to just...
        # def calculate_relative_entropy(probabilities_for_nuc: List[float], total_frequencies_for_nucs: List[float]) -> float:
        #     ret = 0.0
        #     for prob, total_freq in zip(probabilities_for_nuc, total_frequencies_for_nucs):
        #         ret += prob * (log(prob / total_freq, 2.0) if prob > 0.0 else 0.0)
        #     return ret
        relative_entropy_per_col.append(relative_entropy)

    # sum up entropies to get entropy of motif
    ret = sum(relative_entropy_per_col)

    # All of the other score_motif algorithms try to MINIMIZE score. In the case of relative entropy (this algorithm),
    # the greater the score is the better of a match it is. As such, negate this score so the existing algorithms can
    # still try to minimize.
    return -ret

⚠️NOTE️️️⚠️

In the outputs below, the score in the second output should be less than (better) the score in the first output.

Scoring...

CCCCC
CCCCC
CCCCC
CCCCC

... which was pulled from ...

CCCCCCCCCCCCCCCCCATTGCCCC
ATTCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCTTTGCCCCCC
CCCCCCTTTCTCCCCCCCCCCCCCC

-1.172326268185115

Scoring...

ATTGC
ATTCC
CTTTG
TTTCT

... which was pulled from ...

CCCCCCCCCCCCCCCCCATTGCCCC
ATTCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCTTTGCCCCCC
CCCCCCTTTCTCCCCCCCCCCCCCC

-10.194105327448927

Motif Logo

↩PREREQUISITES↩

WHAT: Given a motif matrix, generate a graphical representation showing how conserved the motif is. Each position has its possible nucleotides stacked on top of each other, where the height of each nucleotide is based on how conserved it is. The more conserved a position is, the taller that column will be. This type of graphical representation is called a sequence logo.

WHY: A sequence logo helps more quickly convey the characteristics of the motif matrix it's for.

ALGORITHM:

For this particular logo implementation, a lower entropy results in a taller overall column.

ch2_code/src/MotifLogo.py (lines 15 to 39):

def calculate_entropy(values: List[float]) -> float:
    ret = 0.0
    for value in values:
        ret += value * (log(value, 2.0) if value > 0.0 else 0.0)
    ret = -ret
    return ret

def create_logo(motif_matrix_profile: Dict[str, List[float]]) -> Logo:
    columns = list(motif_matrix_profile.keys())
    data = [motif_matrix_profile[k] for k in motif_matrix_profile.keys()]
    data = list(zip(*data))  # trick to transpose data

    entropies = list(map(lambda x: 2 - calculate_entropy(x), data))

    data_scaledby_entropies = [[p * e for p in d] for d, e in zip(data, entropies)]

    df = pd.DataFrame(
        columns=columns,
        data=data_scaledby_entropies
    )
    logo = lm.Logo(df)
    logo.ax.set_ylabel('information (bits)')
    logo.ax.set_xlim([-1, len(df)])
    return logo

Generating logo for the following motif matrix...

TCGGGGGTTTTT
CCGGTGACTTAC
ACGGGGATTTTC
TTGGGGACTTTT
AAGGGGACTTCC
TTGGGGACTTCC
TCGGGGATTCAT
TCGGGGATTCCT
TAGGGGAACTAC
TCGGGTATAACC

Result...

Motif Logo

K-mer Match Probability

↩PREREQUISITES↩

WHAT: Given a motif matrix and a k-mer, calculate the probability of that k-mer being a member of that motif.

WHY: Being able to determine if a k-mer is potentially a member of a motif can help speed up experiments. For example, imagine that you suspect 21 different genes of being regulated by the same transcription factor. You isolate the transcription factor binding site for 6 of those genes and use their sequences as the underlying k-mers for a motif matrix. That motif matrix doesn't represent the transcription factor's motif exactly, but it's close enough that you can use it to scan through the k-mers in the remaining 15 genes and calculate the probability of them being members of the same motif.

If a k-mer exists such that it conforms to the motif matrix with a high probability, it likely is a member of the motif.

ALGORITHM:

Imagine the following motif matrix:

0 1 2 3 4 5
A T G C A C
A T G C A C
A T C C A C
A T C C A C

Calculating the counts for that motif matrix results in:

0 1 2 3 4 5
A 4 0 0 0 4 0
C 0 0 2 4 0 4
T 0 4 0 0 0 0
G 0 0 2 0 0 0

Calculating the profile from those counts results in:

0 1 2 3 4 5
A 1 0 0 0 1 0
C 0 0 0.5 1 0 1
T 0 1 0 0 0 0
G 0 0 0.5 0 0 0

Using this profile, the probability that a k-mer conforms to the motif matrix is calculated by mapping the nucleotide at each position of the k-mer to the corresponding nucleotide in the corresponding position of the profile and multiplying them together. For example, the probability that the k-mer...

Of these two k-mers, ...

Both of these k-mers should have a reasonable probability of being members of the motif. However, notice how the second k-mer ends up with a 0 probability. The reason has to do with the underlying concept behind motif matrices: the entire point of a motif matrix is to use the known members of a motif to find other potential members of that same motif. The second k-mer contains a T at index 0, but none of the known members of the motif have a T at that index. As such, its probability gets reduced to 0 even though the rest of the k-mer conforms.

Cromwell's rule says that when a probability is based off past events, hard 0 or 1 values shouldn't be used. As such, a quick workaround to the 0% probability problem described above is to artificially inflate the counts that lead to the profile such that no count is 0 (pseudocounts). For example, for the same motif matrix, incrementing the counts by 1 results in:

0 1 2 3 4 5
A 5 1 1 1 5 1
C 1 1 3 5 1 5
T 1 5 1 1 1 1
G 1 1 3 1 1 1

Calculating the profile from those inflated counts results in:

0 1 2 3 4 5
A 0.625 0.125 0.125 0.125 0.625 0.125
C 0.125 0.125 0.375 0.625 0.125 0.625
T 0.125 0.625 0.125 0.125 0.125 0.125
G 0.125 0.125 0.375 0.125 0.125 0.125

Using this new profile, the probability that the previous k-mers conform are:

Although the probabilities seem low, it's all relative. The probability calculated for the first k-mer (ATGCAC) is the highest probability possible -- each position in the k-mer maps to the highest probability nucleotide of the corresponding position of the profile.
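For instance, multiplying the pseudocount profile entries along each k-mer reproduces these relative probabilities. The snippet below is a quick illustrative sketch (the profile is copied from the table above; it is not the chapter code):

profile = {
    'A': [0.625, 0.125, 0.125, 0.125, 0.625, 0.125],
    'C': [0.125, 0.125, 0.375, 0.625, 0.125, 0.625],
    'T': [0.125, 0.625, 0.125, 0.125, 0.125, 0.125],
    'G': [0.125, 0.125, 0.375, 0.125, 0.125, 0.125],
}

def prob_of_match(kmer: str) -> float:
    p = 1.0
    for i, nuc in enumerate(kmer):  # multiply the profile entry at each position
        p *= profile[nuc][i]
    return p

print(prob_of_match('ATGCAC'))  # ~0.0358 -- the highest probability possible for this profile
print(prob_of_match('TTGCAC'))  # ~0.0072 -- much lower, but no longer 0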

ch2_code/src/FindMostProbableKmerUsingProfileMatrix.py (lines 9 to 46):

# Run this on the counts before generating the profile to avoid the 0 probability problem.
def apply_psuedocounts_to_count_matrix(counts: Dict[str, List[int]], extra_count: int = 1):
    for elem, elem_counts in counts.items():
        for i in range(len(elem_counts)):
            elem_counts[i] += extra_count


# Recall that a profile matrix is a matrix of probabilities. Each row represents a single element (e.g. nucleotide) and
# each column represents the probability distribution for that position.
#
# So for example, imagine the following probability distribution...
#
#     1   2   3   4
# A: 0.2 0.2 0.0 0.0
# C: 0.1 0.6 0.0 0.0
# G: 0.1 0.0 1.0 1.0
# T: 0.6 0.2 0.0 0.0
#
# At position 2, the probability that the element will be C is 0.6 while the probability that it'll be T is 0.2. Note
# how each column sums to 1.
def determine_probability_of_match_using_profile_matrix(profile: Dict[str, List[float]], kmer: str):
    prob = 1.0
    for idx, elem in enumerate(kmer):
        prob = prob * profile[elem][idx]
    return prob


def find_most_probable_kmer_using_profile_matrix(profile: Dict[str, List[float]], dna: str):
    k = len(list(profile.values())[0])

    most_probable: Tuple[str, float] = None  # [kmer, probability]
    for kmer, _ in slide_window(dna, k):
        prob = determine_probability_of_match_using_profile_matrix(profile, kmer)
        if most_probable is None or prob > most_probable[1]:
            most_probable = (kmer, prob)

    return most_probable

Motif matrix...

ATGCAC
ATGCAC
ATCCAC

Probability that TTGCAC matches the motif 0.0...

Find Motif Matrix

↩PREREQUISITES↩

WHAT: Given a set of sequences, find k-mers in those sequences that may be members of the same motif.

WHY: A transcription factor is an enzyme that either increases or decreases a gene's transcription rate. It does so by binding to a specific part of the gene's upstream region called the transcription factor binding site. That transcription factor binding site consists of a k-mer that matches the motif expected by that transcription factor, called a regulatory motif.

A single transcription factor may operate on many different genes. Oftentimes a scientist will identify a set of genes that are suspected to be regulated by a single transcription factor, but that scientist won't know ...

The regulatory motif expected by a transcription factor typically expects k-mers that have the same length and are similar to each other (short hamming distance). As such, potential motif candidates can be derived by finding k-mers across the set of sequences that are similar to each other.

Bruteforce Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm scans over all k-mers in a set of DNA sequences, enumerates the hamming distance neighbourhood of each k-mer, and uses the k-mers from the hamming distance neighbourhood to build out possible motif matrices. Of all the motif matrices built, it selects the one with the lowest score.

Neither k nor the mismatches allowed by the motif is known. As such, the algorithm may need to be repeated multiple times with different value combinations.

Even for trivial inputs, this algorithm falls over very quickly. It's intended to help conceptualize the problem of motif finding.

ch2_code/src/ExhaustiveMotifMatrixSearch.py (lines 9 to 41):

def enumerate_hamming_distance_neighbourhood_for_all_kmer(
        dna: str,             # dna string to search in for motif
        k: int,               # k-mer length
        max_mismatches: int   # max num of mismatches for motif (hamming dist)
) -> Set[str]:
    kmers_to_check = set()
    for kmer, _ in slide_window(dna, k):
        neighbouring_kmers = find_all_dna_kmers_within_hamming_distance(kmer, max_mismatches)
        kmers_to_check |= neighbouring_kmers

    return kmers_to_check


def exhaustive_motif_search(dnas: List[str], k: int, max_mismatches: int):
    kmers_for_dnas = [enumerate_hamming_distance_neighbourhood_for_all_kmer(dna, k, max_mismatches) for dna in dnas]

    def build_next_matrix(out_matrix: List[str]):
        idx = len(out_matrix)
        if len(kmers_for_dnas) == idx:
            yield out_matrix[:]
        else:
            for kmer in kmers_for_dnas[idx]:
                out_matrix.append(kmer)
                yield from build_next_matrix(out_matrix)
                out_matrix.pop()

    best_motif_matrix = None
    for next_motif_matrix in build_next_matrix([]):
        if best_motif_matrix is None or score_motif(next_motif_matrix) < score_motif(best_motif_matrix):
            best_motif_matrix = next_motif_matrix

    return best_motif_matrix

Searching for motif of k=5 and a max of 1 mismatches in the following...

ATAAAGGGATA
ACAGAAATGAT
TGAAATAACCT

Found the motif matrix...

ATAAT
ATAAT
ATAAT

Median String Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm takes advantage of the fact that the same score can be derived by scoring a motif matrix either row-by-row or column-by-column. For example, the score for the following motif matrix is 3...

0 1 2 3 4 5
A T G C A C
A T G C A C
A T C C T C
A T C C A C
Score 0 0 2 0 1 0 3

For each column, the number of unpopular nucleotides is counted. Then, those counts are summed to get the score: 0 + 0 + 2 + 0 + 1 + 0 = 3.

That exact same score can be calculated by working through the motif matrix row-by-row...

0 1 2 3 4 5 Score
A T G C A C 1
A T G C A C 1
A T C C T C 1
A T C C A C 0
3

For each row, the number of unpopular nucleotides is counted. Then, those counts are summed to get the score: 1 + 1 + 1 + 0 = 3.

0 1 2 3 4 5 Score
A T G C A C 1
A T G C A C 1
A T C C T C 1
A T C C A C 0
Score 0 0 2 0 1 0 3

Notice how each row's score is equivalent to the hamming distance between the k-mer at that row and the motif matrix's consensus string. Specifically, the consensus string for the motif matrix is ATCCAC. For each row, ...
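A quick way to convince yourself that the two scoring directions agree is to compute both for the matrix above (an illustrative sketch, not the chapter code):

motif_matrix = ['ATGCAC', 'ATGCAC', 'ATCCTC', 'ATCCAC']
consensus = 'ATCCAC'

# column-by-column: count the non-majority nucleotides in each column
col_score = sum(len(col) - max(col.count(n) for n in 'ACGT') for col in zip(*motif_matrix))
# row-by-row: hamming distance of each member to the consensus string
row_score = sum(sum(1 for a, b in zip(member, consensus) if a != b) for member in motif_matrix)
print(col_score, row_score)  # 3 3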

Given these facts, this algorithm constructs a set of consensus strings by enumerating through all possible k-mers for some k. Then, for each consensus string, it scans over each sequence to find the k-mer that minimizes the hamming distance for that consensus string. These k-mers are used as the members of a motif matrix.

Of all the motif matrices built, the one with the lowest score is selected.

Since the k for the motif is unknown, this algorithm may need to be repeated multiple times with different k values. This algorithm also doesn't scale very well. For k=10, 1048576 different consensus strings are possible.

ch2_code/src/MedianStringSearch.py (lines 8 to 33):

# The name is slightly confusing. What this actually does...
#   For each dna string:
#     Find the k-mer with the min hamming distance between the k-mers that make up the DNA string and pattern
#   Sum up the min hamming distances of the found k-mers (equivalent to the motif matrix score)
def distance_between_pattern_and_strings(pattern: str, dnas: List[str]) -> int:
    min_hds = []

    k = len(pattern)
    for dna in dnas:
        min_hd = None
        for dna_kmer, _ in slide_window(dna, k):
            hd = hamming_distance(pattern, dna_kmer)
            if min_hd is None or hd < min_hd:
                min_hd = hd
        min_hds.append(min_hd)
    return sum(min_hds)


def median_string(k: int, dnas: List[str]):
    last_best: Tuple[str, int] = None  # last found consensus string and its score
    for kmer in enumerate_patterns(k):
        score = distance_between_pattern_and_strings(kmer, dnas)  # find score of best motif matrix where consensus str is kmer
        if last_best is None or score < last_best[1]:
            last_best = kmer, score
    return last_best

Searching for motif of k=3 in the following...

AAATTGACGCAT
GACGACCACGTT
CGTCAGCGCCTG
GCTGAGCACCGG
AGTTCGGGACAG

Found the consensus string GAC with a score of 2
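The enumerate_patterns helper used above isn't shown in the excerpt. Assuming it simply enumerates every possible k-mer over the nucleotide alphabet, a plausible sketch would be:

from itertools import product

def enumerate_patterns(k: int, elements: str = 'ACGT'):
    # yield every possible k-mer over the given alphabet (4^k strings for nucleotides)
    for p in product(elements, repeat=k):
        yield ''.join(p)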

Greedy Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm begins by constructing a motif matrix where the only member is a k-mer picked from the first sequence. From there, it goes through the k-mers in the ...

  1. second sequence to find the one that has the highest match probability to the motif matrix and adds it as a member to the motif matrix.
  2. third sequence to find the one that has the highest match probability to the motif matrix and adds it as a member to the motif matrix.
  3. fourth sequence to find the one that has the highest match probability to the motif matrix and adds it as a member to the motif matrix.
  4. ...

This process repeats once for every k-mer in the first sequence. Each repetition produces a motif matrix. Of all the motif matrices built, the one with the lowest score is selected.

This is a greedy algorithm. It builds out potential motif matrices by selecting the locally optimal k-mer from each sequence. While this may not lead to the globally optimal motif matrix, it's fast and has a higher than normal likelihood of picking out the correct motif matrix.

ch2_code/src/GreedyMotifMatrixSearchWithPsuedocounts.py (lines 12 to 33):

def greedy_motif_search_with_psuedocounts(k: int, dnas: List[str]):
    best_motif_matrix = [dna[0:k] for dna in dnas]

    for motif, _ in slide_window(dnas[0], k):
        motif_matrix = [motif]
        counts = motif_matrix_count(motif_matrix)
        apply_psuedocounts_to_count_matrix(counts)
        profile = motif_matrix_profile(counts)

        for dna in dnas[1:]:
            next_motif, _ = find_most_probable_kmer_using_profile_matrix(profile, dna)
            # push in closest kmer as a motif member and recompute profile for the next iteration
            motif_matrix.append(next_motif)
            counts = motif_matrix_count(motif_matrix)
            apply_psuedocounts_to_count_matrix(counts)
            profile = motif_matrix_profile(counts)

        if score_motif(motif_matrix) < score_motif(best_motif_matrix):
            best_motif_matrix = motif_matrix

    return best_motif_matrix

Searching for motif of k=3 in the following...

AAATTGACGCAT
GACGACCACGTT
CGTCAGCGCCTG
GCTGAGCACCGG
AGTTCGGGACAG

Found the motif matrix...

GAC
GAC
GTC
GAG
GAC

Randomized Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm selects a random k-mer from each sequence to form an initial motif matrix. Then, for each sequence, it finds the k-mer that has the highest probability of matching that motif matrix. Those k-mers form the members of a new motif matrix. If the new motif matrix scores better than the existing motif matrix, the existing motif matrix gets replaced with the new motif matrix and the process repeats. Otherwise, the existing motif matrix is selected.

In theory, this algorithm works because all k-mers in a sequence other than the motif member are considered to be random noise. As such, if no motif members were selected when creating the initial motif matrix, the profile of that initial motif matrix would be more or less uniform:

0 1 2 3 4 5
A 0.25 0.25 0.25 0.25 0.25 0.25
C 0.25 0.25 0.25 0.25 0.25 0.25
T 0.25 0.25 0.25 0.25 0.25 0.25
G 0.25 0.25 0.25 0.25 0.25 0.25

Such a profile wouldn't allow for converging to a vastly better scoring motif matrix.

However, if at least one motif member were selected when creating the initial motif matrix, the profile of that initial motif matrix would skew towards the motif:

0 1 2 3 4 5
A 0.333 0.233 0.233 0.233 0.333 0.233
C 0.233 0.233 0.333 0.333 0.233 0.333
T 0.233 0.333 0.233 0.233 0.233 0.233
G 0.233 0.233 0.233 0.233 0.233 0.233

Such a profile would lead to a better scoring motif matrix where that better scoring motif matrix contains the other members of the motif.

In practice, this algorithm may trip up on real-world data. Real-world sequences don't actually contain random noise. The hope is that the only k-mers that are highly similar to each other in the sequences are members of the motif. It's possible that the sequences contain other sets of k-mers that are similar to each other but vastly different from the motif members. In such cases, even if a motif member were to be selected when creating the initial motif matrix, the algorithm may converge to a motif matrix that isn't for the motif.

This is a Monte Carlo algorithm. It uses randomness to deliver an approximate solution. While this may not lead to the globally optimal motif matrix, it's fast and as such can be run multiple times. The run with the best motif matrix will likely be a good enough solution (it captures most of the motif members, or parts of the motif members if k was too small, etc.).

ch2_code/src/RandomizedMotifMatrixSearchWithPsuedocounts.py (lines 13 to 32):

def randomized_motif_search_with_psuedocounts(k: int, dnas: List[str]) -> List[str]:
        motif_matrix = []
        for dna in dnas:
            start = randrange(len(dna) - k + 1)
            kmer = dna[start:start + k]
            motif_matrix.append(kmer)

        best_motif_matrix = motif_matrix

        while True:
            counts = motif_matrix_count(motif_matrix)
            apply_psuedocounts_to_count_matrix(counts)
            profile = motif_matrix_profile(counts)

            motif_matrix = [find_most_probable_kmer_using_profile_matrix(profile, dna)[0] for dna in dnas]
            if score_motif(motif_matrix) < score_motif(best_motif_matrix):
                best_motif_matrix = motif_matrix
            else:
                return best_motif_matrix

Searching for motif of k=3 in the following...

AAATTGACGCAT
GACGACCACGTT
CGTCAGCGCCTG
GCTGAGCACCGG
AGTTCGGGACAG

Running 1000 iterations...

Best found the motif matrix...

GAC
GAC
GCC
GAG
GAC

Gibbs Sampling Algorithm

↩PREREQUISITES↩

ALGORITHM:

⚠️NOTE️️️⚠️

The Pevzner book mentions there's more to Gibbs Sampling than what it discussed. I looked up the topic but couldn't make much sense of it.

This algorithm selects a random k-mer from each sequence to form an initial motif matrix. Then, one of the k-mers from the motif matrix is randomly chosen and replaced with another k-mer from the same sequence that the removed k-mer came from. The replacement is selected using a weighted random number algorithm, where how likely a k-mer is to be chosen as a replacement depends on how probable a match it is to the motif matrix.

This process of replacement is repeated for some user-defined number of cycles, at which point the algorithm has hopefully homed in on the desired motif matrix.

This is a Monte Carlo algorithm. It uses randomness to deliver an approximate solution. While this may not lead to the globally optimal motif matrix, it's fast and as such can be run multiple times. The run with the best motif matrix will likely be a good enough solution (it captures most of the motif members, or parts of the motif members if k was too small, etc.).

The idea behind this algorithm is similar to the idea behind the randomized algorithm for motif matrix finding, except that this algorithm is more conservative in how it converges on a motif matrix and the weighted random selection allows it to potentially break out if stuck in a local optimum.

ch2_code/src/GibbsSamplerMotifMatrixSearchWithPsuedocounts.py (lines 14 to 59):

def gibbs_rand(prob_dist: List[float]) -> int:
    # normalize prob_dist -- just incase sum(prob_dist) != 1.0
    prob_dist_sum = sum(prob_dist)
    prob_dist = [p / prob_dist_sum for p in prob_dist]

    while True:
        selection = randrange(0, len(prob_dist))
        if random() < prob_dist[selection]:
            return selection


def determine_probabilities_of_all_kmers_in_dna(profile_matrix: Dict[str, List[float]], dna: str, k: int) -> List[float]:
    ret = []
    for kmer, _ in slide_window(dna, k):
        prob = determine_probability_of_match_using_profile_matrix(profile_matrix, kmer)
        ret.append(prob)
    return ret


def gibbs_sampler_motif_search_with_psuedocounts(k: int, dnas: List[str], cycles: int) -> List[str]:
    motif_matrix = []
    for dna in dnas:
        start = randrange(len(dna) - k + 1)
        kmer = dna[start:start + k]
        motif_matrix.append(kmer)

    best_motif_matrix = motif_matrix[:]  # create a copy, otherwise you'll be modifying both motif and best_motif

    for j in range(0, cycles):
        i = randrange(len(dnas))  # pick a dna
        del motif_matrix[i]  # remove the kmer for that dna from the motif str

        counts = motif_matrix_count(motif_matrix)
        apply_psuedocounts_to_count_matrix(counts)
        profile = motif_matrix_profile(counts)

        new_motif_kmer_probs = determine_probabilities_of_all_kmers_in_dna(profile, dnas[i], k)
        new_motif_kmer_idx = gibbs_rand(new_motif_kmer_probs)
        new_motif_kmer = dnas[i][new_motif_kmer_idx:new_motif_kmer_idx+k]
        motif_matrix.insert(i, new_motif_kmer)

        if score_motif(motif_matrix) < score_motif(best_motif_matrix):
            best_motif_matrix = motif_matrix[:]  # create a copy, otherwise you'll be modifying both motif and best_motif

    return best_motif_matrix

Searching for motif of k=3 in the following...

AAATTGACGCAT
GACGACCACGTT
CGTCAGCGCCTG
GCTGAGCACCGG
AGTTCGGGACAG

Running 1000 iterations...

Best found the motif matrix...

GAC
GAC
GCC
GAG
GAC

Motif Matrix Hybrid Alphabet

↩PREREQUISITES↩

WHAT: When finding a motif, it may be beneficial to use a hybrid alphabet rather than the standard nucleotides (A, C, T, and G). For example, the following hybrid alphabet marks certain combinations of nucleotides as a single letter:

⚠️NOTE️️️⚠️

The alphabet above was pulled from the Pevzner book section 2.16: Complications in Motif Finding. It's a subset of the IUPAC nucleotide codes alphabet. The author didn't mention if the alphabet was explicitly chosen for regulatory motif finding. If it was, it may have been derived from running probabilities over already discovered regulatory motifs: e.g. for the motifs already discovered, if a position has 2 possible nucleotides then G/C (S), G/T (K), C/T (Y), and A/T (W) are likely but other combinations aren't.

WHY: Hybrid alphabets may make it easier for motif finding algorithms to converge on a motif. For example, when scoring a motif matrix, treat the position as a single letter if the distinct nucleotides at that position map to one of the combinations in the hybrid alphabet.

Hybrid alphabets may make more sense for representing a consensus string. Rather than picking out the most popular nucleotide, the hybrid alphabet can be used to describe alternating nucleotides at each position.

ALGORITHM:

ch2_code/src/HybridAlphabetMatrix.py (lines 5 to 26):

PEVZNER_2_16_ALPHABET = dict()
PEVZNER_2_16_ALPHABET[frozenset({'A', 'T'})] = 'W'
PEVZNER_2_16_ALPHABET[frozenset({'G', 'C'})] = 'S'
PEVZNER_2_16_ALPHABET[frozenset({'G', 'T'})] = 'K'
PEVZNER_2_16_ALPHABET[frozenset({'C', 'T'})] = 'Y'


def to_hybrid_alphabet_motif_matrix(motif_matrix: List[str], hybrid_alphabet: Dict[FrozenSet[str], str]) -> List[str]:
    rows = len(motif_matrix)
    cols = len(motif_matrix[0])

    motif_matrix = motif_matrix[:]  # make a copy
    for c in range(cols):
        distinct_nucs_at_c = frozenset([motif_matrix[r][c] for r in range(rows)])
        if distinct_nucs_at_c in hybrid_alphabet:
            for r in range(rows):
                motif_member = motif_matrix[r]
                motif_member = motif_member[:c] + hybrid_alphabet[distinct_nucs_at_c] + motif_member[c+1:]
                motif_matrix[r] = motif_member

    return motif_matrix

Converted...

CATCCG
CTTCCT
CATCTT

to...

CWTCYK
CWTCYK
CWTCYK

using...

{frozenset({'A', 'T'}): 'W', frozenset({'G', 'C'}): 'S', frozenset({'G', 'T'}): 'K', frozenset({'C', 'T'}): 'Y'}
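
To illustrate the consensus string idea mentioned above, here's a hypothetical sketch (not from the ch2_code sources): hybrid_consensus is a made-up helper that picks a hybrid letter when a column's distinct nucleotides map to one of the combinations, and otherwise falls back to the most popular nucleotide.

from collections import Counter
from typing import Dict, FrozenSet, List

def hybrid_consensus(motif_matrix: List[str], hybrid_alphabet: Dict[FrozenSet[str], str]) -> str:
    # hypothetical helper -- not part of the ch2_code sources
    consensus = ''
    for col in zip(*motif_matrix):  # walk the motif matrix column by column
        distinct_nucs = frozenset(col)
        if distinct_nucs in hybrid_alphabet:
            consensus += hybrid_alphabet[distinct_nucs]       # e.g. {'C', 'T'} maps to 'Y'
        else:
            consensus += Counter(col).most_common(1)[0][0]    # fall back to the most popular nucleotide
    return consensus

# hybrid_consensus(['CATCCG', 'CTTCCT', 'CATCTT'], PEVZNER_2_16_ALPHABET) returns 'CWTCYK'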

DNA Assembly

↩PREREQUISITES↩

DNA sequencers work by taking many copies of an organism's genome, breaking up those copies into fragments, then scanning in those fragments. Sequencers typically scan fragments in 1 of 2 ways:

Assembly is the process of reconstructing an organism's genome from the fragments returned by a sequencer. Since the sequencer breaks up many copies of the same genome and each fragment's start position is random, the original genome can be reconstructed by finding overlaps between fragments and stitching them back together.

Kroki diagram output

A typical problem with sequencing is that the number of errors in a fragment increases as the number of scanned bases increases. As such, read-pairs are preferred over reads: by only scanning in the head and tail of a long fragment, the scan won't contain as many errors as a read of the same length but will still contain extra information that helps with assembly (the length of the unknown stretch of nucleotides between the head and tail).

Assembly has many practical complications that prevent full genome reconstruction from fragments:

Stitch Reads

WHAT: Given a list of overlapping reads where ...

... , stitch them together. For example, in the read list [GAAA, AAAT, AATC] each read overlaps the subsequent read by an offset of 1: GAAATC.

         0 1 2 3 4 5
R1       G A A A
R2         A A A T
R3           A A T C
Stitched G A A A T C

WHY: Since the sequencer breaks up many copies of the same DNA and each read's start position is random, larger parts of the original DNA can be reconstructed by finding overlaps between fragments and stitching them back together.

ALGORITHM:

ch3_code/src/Read.py (lines 55 to 76):

def append_overlap(self: Read, other: Read, skip: int = 1) -> Read:
    offset = len(self.data) - len(other.data)
    data_head = self.data[:offset]
    data = self.data[offset:]

    prefix = data[:skip]
    overlap1 = data[skip:]
    overlap2 = other.data[:-skip]
    suffix = other.data[-skip:]
    ret = data_head + prefix
    for ch1, ch2 in zip(overlap1, overlap2):
        ret += ch1 if ch1 == ch2 else '?'  # for failure, use IUPAC nucleotide codes instead of question mark?
    ret += suffix
    return Read(ret, source=('overlap', [self, other]))

@staticmethod
def stitch(items: List[Read], skip: int = 1) -> str:
    assert len(items) > 0
    ret = items[0]
    for other in items[1:]:
        ret = ret.append_overlap(other, skip)
    return ret.data

Stitched [GAAA, AAAT, AATC] to GAAATC
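
For intuition, here's a minimal standalone sketch (not from the ch3_code sources) of the same idea on plain strings. It assumes every read overlaps the next by exactly skip characters and skips the mismatch check that the Read class performs.

from typing import List

def stitch_strings(reads: List[str], skip: int = 1) -> str:
    # hypothetical helper -- assumes consecutive reads overlap by len(read) - skip characters
    stitched = reads[0]
    for read in reads[1:]:
        stitched += read[-skip:]  # only the last skip characters of each subsequent read are new
    return stitched

# stitch_strings(['GAAA', 'AAAT', 'AATC']) returns 'GAAATC'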

⚠️NOTE️️️⚠️

See also: Algorithms/Sequence Alignment/Overlap Alignment

Stitch Read-Pairs

↩PREREQUISITES↩

WHAT: Given a list of overlapping read-pairs where ...

... , stitch them together. For example, in the read-pair list [ATG---CCG, TGT---CGT, GTT---GTT, TTA---TTC] each read-pair overlaps the subsequent read-pair by an offset of 1: ATGTTACCGTTC.

         0 1 2 3 4 5 6 7 8 9 10 11
R1       A T G - - - C C G
R2         T G T - - - C G T
R3           G T T - - - G T T
R4             T T A - - - T T  C
Stitched A T G T T A C C G T T  C

WHY: Since the sequencer breaks up many copies of the same DNA and each read's start position is random, larger parts of the original DNA can be reconstructed by finding overlaps between fragments and stitching them back together.

ALGORITHM:

Overlapping read-pairs are stitched by taking the first read-pair and iterating through the remaining read-pairs where ...

For example, to stitch [ATG---CCG, TGT---CGT], ...

  1. stitch the heads as if they were reads: [ATG, TGT] results in ATGT,
  2. stitch the tails as if they were reads: [CCG, CGT] results in CCGT.
         0 1 2 3 4 5 6 7 8 9
R1       A T G - - - C C G
R2         T G T - - - C G T
Stitched A T G T - - C C G T

ch3_code/src/ReadPair.py (lines 82 to 110):

def append_overlap(self: ReadPair, other: ReadPair, skip: int = 1) -> ReadPair:
    self_head = Read(self.data.head)
    other_head = Read(other.data.head)
    new_head = self_head.append_overlap(other_head, skip)
    new_head = new_head.data

    self_tail = Read(self.data.tail)
    other_tail = Read(other.data.tail)
    new_tail = self_tail.append_overlap(other_tail, skip)
    new_tail = new_tail.data

    # WARNING: new_d may go negative -- In the event of a negative d, it means that rather than there being a gap
    # in between the head and tail, there's an OVERLAP in between the head and tail. To get rid of the overlap, you
    # need to remove either the last d chars from head or first d chars from tail.
    new_d = self.d - skip
    kdmer = Kdmer(new_head, new_tail, new_d)

    return ReadPair(kdmer, source=('overlap', [self, other]))

@staticmethod
def stitch(items: List[ReadPair], skip: int = 1) -> str:
    assert len(items) > 0
    ret = items[0]
    for other in items[1:]:
        ret = ret.append_overlap(other, skip)
    assert ret.d <= 0, "Gap still exists -- not enough to stitch"
    overlap_count = -ret.d
    return ret.data.head + ret.data.tail[overlap_count:]

Stitched [ATG---CCG, TGT---CGT, GTT---GTT, TTA---TTC] to ATGTTACCGTTC

⚠️NOTE️️️⚠️

See also: Algorithms/Sequence Alignment/Overlap Alignment

Break Reads

WHAT: Given a set of reads that arbitrarily overlap, each read can be broken into many smaller reads that overlap better. For example, given 4 10-mers that arbitrarily overlap, you can break them into better overlapping 5-mers...

Kroki diagram output

WHY: Breaking reads may cause more ambiguity in overlaps. At the same time, read breaking makes it easier to find overlaps by bringing the overlaps closer together and provides (artificially) increased coverage.

ALGORITHM:

ch3_code/src/Read.py (lines 80 to 87):

# This is read breaking -- why not just call it break? because break is a reserved keyword.
def shatter(self: Read, k: int) -> List[Read]:
    ret = []
    for kmer, _ in slide_window(self.data, k):
        r = Read(kmer, source=('shatter', [self]))
        ret.append(r)
    return ret

Broke ACTAAGAACC to [ACTAA, CTAAG, TAAGA, AAGAA, AGAAC, GAACC]

Break Read-Pairs

↩PREREQUISITES↩

WHAT: Given a set of read-pairs that arbitrarily overlap, each read-pair can be broken into many read-pairs with a smaller k that overlap better. For example, given 4 (4,2)-mers that arbitrarily overlap, you can break them into better overlapping (2,4)-mers...

Kroki diagram output

WHY: Breaking read-pairs may cause more ambiguity in overlaps. At the same time, read-pair breaking makes it easier to find overlaps by bringing the overlaps closer together and provides (artificially) increased coverage.

ALGORITHM:

ch3_code/src/ReadPair.py (lines 113 to 124):

# This is read breaking -- why not just call it break? because break is a reserved keyword.
def shatter(self: ReadPair, k: int) -> List[ReadPair]:
    ret = []
    d = (self.k - k) + self.d
    for window_head, window_tail in zip(slide_window(self.data.head, k), slide_window(self.data.tail, k)):
        kmer_head, _ = window_head
        kmer_tail, _ = window_tail
        kdmer = Kdmer(kmer_head, kmer_tail, d)
        rp = ReadPair(kdmer, source=('shatter', [self]))
        ret.append(rp)
    return ret

Broke ACTA--AACC to [AC----AA, CT----AC, TA----CC]

Probability of Fragment Occurrence

↩PREREQUISITES↩

WHAT: Sequencers work by taking many copies of an organism's genome, randomly breaking up those genomes into smaller pieces, and randomly scanning in those pieces (fragments). As such, it isn't immediately obvious how many times each fragment actually appears in the genome.

Imagine that you're sequencing an organism's genome. Given that ...

... you can use probabilities to hint at how many times a fragment appears in the genome.

WHY:

Determining how many times a fragment appears in a genome helps with assembly. Specifically, ...

ALGORITHM:

⚠️NOTE️️️⚠️

For simplicity's sake, the genome is single-stranded (not double-stranded DNA / no reverse complementing strand).

Imagine a genome of ATGGATGC. A sequencer runs over that single strand and generates 3-mer reads with roughly 30x coverage. The resulting fragments are ...

Read # of Copies
ATG 61
TGG 30
GAT 31
TGC 29
TGT 1

Since the genome is known to have less than 50% repeats, the dominant number of copies likely maps to 1 instance of that read appearing in the genome. Since the dominant number is ~30, divide the number of copies for each read by ~30 to find out roughly how many times each read appears in the genome ...

Read # of Copies # of Appearances in Genome
ATG 61 2
TGG 30 1
GAT 31 1
TGC 29 1
TGT 1 0.03

Note that the last read (TGT) has 0.03 appearances, meaning it's a read that either

In this case, it's an error because it doesn't appear in the original genome: TGT is not in ATGGATGC.

ch3_code/src/FragmentOccurrenceProbabilityCalculator.py (lines 15 to 29):

# If less than 50% of the reads are from repeats, this attempts to count and normalize such that it can hint at which
# reads may contain errors (= ~0) and which reads are for repeat regions (> 1.0).
def calculate_fragment_occurrence_probabilities(fragments: List[T]) -> Dict[T, float]:
    counter = Counter(fragments)
    max_digit_count = max([len(str(count)) for count in counter.values()])
    for i in range(max_digit_count):
        rounded_counter = Counter(dict([(k, round(count, -i)) for k, count in counter.items()]))
        for k, orig_count in counter.items():
            if rounded_counter[k] == 0:
                rounded_counter[k] = orig_count
        most_occurring_count, times_counted = Counter(rounded_counter.values()).most_common(1)[0]
        if times_counted >= len(rounded_counter) * 0.5:
            return dict([(key, value / most_occurring_count) for key, value in rounded_counter.items()])
    raise ValueError(f'Failed to find a common count: {counter}')

Sequenced fragments:

Probability of occurrence in genome:
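
Here's a hypothetical usage sketch (not output from the source files) that reproduces the ATGGATGC example above using the excerpt's function; the read counts are the ones from the table.

# read counts from the table above (~30x coverage, ATG appears twice in the genome)
reads = ['ATG'] * 61 + ['TGG'] * 30 + ['GAT'] * 31 + ['TGC'] * 29 + ['TGT'] * 1
appearances = calculate_fragment_occurrence_probabilities(reads)
# appearances is roughly {'ATG': 2.0, 'TGG': 1.0, 'GAT': 1.0, 'TGC': 1.0, 'TGT': 0.03}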

Overlap Graph

↩PREREQUISITES↩

WHAT: Given the fragments for a single strand of DNA, create a directed graph where ...

  1. each node is a fragment.

    Kroki diagram output

  2. each edge is between overlapping fragments (nodes), where the ...

    Kroki diagram output

This is called an overlap graph.

WHY: An overlap graph shows the different ways that fragments can be stitched together. A path in an overlap graph that touches each node exactly once is one possibility for the original single stranded DNA that the fragments came from. For example...

These paths are referred to as Hamiltonian paths.

⚠️NOTE️️️⚠️

Notice that the example graph is circular. If the organism's genome itself were also circular (e.g. a bacterial genome), the genome guesses above are all actually the same because circular genomes don't have a beginning / end.

ALGORITHM:

Sequencers produce fragments, but fragments by themselves typically aren't enough for most experiments / algorithms. In theory, stitching overlapping fragments for a single-strand of DNA should reveal that single-strand of DNA. In practice, real-world complications make revealing that single-strand of DNA nearly impossible:

Nevertheless, in an ideal world where most of these problems don't exist, an overlap graph is a good way to guess the single-strand of DNA that a set of fragments came from. An overlap graph assumes that the fragments it's operating on ...

⚠️NOTE️️️⚠️

Although the complications discussed above make it impossible to get the original genome in its entirety, it's still possible to pull out large parts of the original genome. This is discussed in Algorithms/DNA Assembly/Find Contigs.

To construct an overlap graph, create an edge between fragments that have an overlap.

For each fragment, add that fragment's ...

Then, join the hash tables together to find overlapping fragments.

ch3_code/src/ToOverlapGraphHash.py (lines 13 to 36):

def to_overlap_graph(items: List[T], skip: int = 1) -> Graph[T]:
    ret = Graph()

    prefixes = dict()
    suffixes = dict()
    for i, item in enumerate(items):
        prefix = item.prefix(skip)
        prefixes.setdefault(prefix, set()).add(i)
        suffix = item.suffix(skip)
        suffixes.setdefault(suffix, set()).add(i)

    for key, indexes in suffixes.items():
        other_indexes = prefixes.get(key)
        if other_indexes is None:
            continue
        for i in indexes:
            item = items[i]
            for j in other_indexes:
                if i == j:
                    continue
                other_item = items[j]
                ret.insert_edge(item, other_item)
    return ret

Given the fragments ['TTA', 'TTA', 'TAG', 'AGT', 'GTT', 'TAC', 'ACT', 'CTT'], the overlap graph is...

Dot diagram

A path that touches each node of a graph exactly once is a Hamiltonian path. Each Hamiltonian path in an overlap graph is a guess as to the original single strand of DNA that the fragments for the graph came from.

The code shown below recursively walks all paths. Of all the paths it walks over, the ones that walk every node of the graph exactly once are selected.

This algorithm will likely fall over on non-trivial overlap graphs. Even finding one Hamiltonian path is computationally intensive.

ch3_code/src/WalkAllHamiltonianPaths.py (lines 15 to 38):

def exhaustively_walk_until_all_nodes_touched_exactly_one(
        graph: Graph[T],
        from_node: T,
        current_path: List[T]
) -> List[List[T]]:
    current_path.append(from_node)

    if len(current_path) == len(graph):
        found_paths = [current_path.copy()]
    else:
        found_paths = []
        for to_node in graph.get_outputs(from_node):
            if to_node in set(current_path):
                continue
            found_paths += exhaustively_walk_until_all_nodes_touched_exactly_one(graph, to_node, current_path)

    current_path.pop()
    return found_paths


# walk each node exactly once
def walk_hamiltonian_paths(graph: Graph[T], from_node: T) -> List[List[T]]:
    return exhaustively_walk_until_all_nodes_touched_exactly_one(graph, from_node, [])

Given the fragments ['TTA', 'TTA', 'TAG', 'AGT', 'GTT', 'TAC', 'ACT', 'CTT'], the overlap graph is...

Dot diagram

... and the Hamiltonian paths are ...
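
As a hypothetical end-to-end sketch (not from the source files), the two excerpts above can be glued together to turn reads into genome guesses. It assumes the Read class shown earlier and brute-forces every start node, so it's only practical for tiny inputs.

reads = [Read(s) for s in ['TTA', 'TTA', 'TAG', 'AGT', 'GTT', 'TAC', 'ACT', 'CTT']]
graph = to_overlap_graph(reads)
for start_node in graph.get_nodes():
    for path in walk_hamiltonian_paths(graph, start_node):
        print(Read.stitch(path))  # each Hamiltonian path stitches into one genome guess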

De Bruijn Graph

↩PREREQUISITES↩

WHAT: Given the fragments for a single strand of DNA, create a directed graph where ...

  1. each fragment is represented as an edge connecting 2 nodes, where the ...

    Kroki diagram output

  2. duplicate nodes are merged into a single node.

    Kroki diagram output

This graph is called a de Bruijn graph: a balanced and strongly connected graph where the fragments are represented as edges.

⚠️NOTE️️️⚠️

The example graph above is balanced. But, depending on the fragments used, the graph may not be totally balanced. A technique for dealing with this is detailed below. For now, just assume that the graph will be balanced.

WHY: Similar to an overlap graph, a de Bruijn graph shows the different ways that fragments can be stitched together. However, unlike an overlap graph, the fragments are represented as edges rather than nodes. Where in an overlap graph you need to find paths that touch every node exactly once (Hamiltonian path), in a de Bruijn graph you need to find paths that walk over every edge exactly once (Eulerian cycle).

A path in a de Bruijn graph that walks over each edge exactly once is one possibility for the original single stranded DNA that the fragments came from: it starts and ends at the same node (a cycle), and walks over every edge in the graph.

In contrast to finding a Hamiltonian path in an overlap graph, it's much faster to find an Eulerian cycle in a de Bruijn graph.

De Bruijn graphs were originally invented to solve the k-universal string problem, which is effectively the same concept as assembly.

ALGORITHM:

Sequencers produce fragments, but fragments by themselves typically aren't enough for most experiments / algorithms. In theory, stitching overlapping fragments for a single-strand of DNA should reveal that single-strand of DNA. In practice, real-world complications make revealing that single-strand of DNA nearly impossible:

Nevertheless, in an ideal world where most of these problems don't exist, a de Bruijn graph is a good way to guess the single-strand of DNA that a set of fragments came from. A de Bruijn graph assumes that the fragments it's operating on ...

⚠️NOTE️️️⚠️

Although the complications discussed above make it impossible to get the original genome in its entirety, it's still possible to pull out large parts of the original genome. This is discussed in Algorithms/DNA Assembly/Find Contigs.

To construct a de Bruijn graph, add an edge for each fragment, creating missing nodes as required.

ch3_code/src/ToDeBruijnGraph.py (lines 13 to 20):

def to_debruijn_graph(reads: List[T], skip: int = 1) -> Graph[T]:
    graph = Graph()
    for read in reads:
        from_node = read.prefix(skip)
        to_node = read.suffix(skip)
        graph.insert_edge(from_node, to_node)
    return graph

Given the fragments ['TTAG', 'TAGT', 'AGTT', 'GTTA', 'TTAC', 'TACT', 'ACTT', 'CTTA'], the de Bruijn graph is...

Dot diagram

Note how the graph above is both balanced and strongly connected. In most cases, non-circular genomes won't generate a balanced graph like the one above. Instead, a non-circular genome will very likely generate a graph that's nearly balanced: a graph that would be balanced if not for a few unbalanced nodes (usually the root and tail nodes). Such a graph can artificially be made balanced by finding the unbalanced nodes and creating artificial edges between them until every node becomes balanced.

⚠️NOTE️️️⚠️

Circular genomes are genomes that wrap around (e.g. bacterial genomes). They don't have a beginning / end.

ch3_code/src/BalanceNearlyBalancedGraph.py (lines 15 to 44):

def find_unbalanced_nodes(graph: Graph[T]) -> List[Tuple[T, int, int]]:
    unbalanced_nodes = []
    for node in graph.get_nodes():
        in_degree = graph.get_in_degree(node)
        out_degree = graph.get_out_degree(node)
        if in_degree != out_degree:
            unbalanced_nodes.append((node, in_degree, out_degree))
    return unbalanced_nodes


# creates a balanced graph from a nearly balanced graph -- nearly balanced means the graph has an equal number of
# missing outputs and missing inputs.
def balance_graph(graph: Graph[T]) -> Tuple[Graph[T], Set[T], Set[T]]:
    unbalanced_nodes = find_unbalanced_nodes(graph)
    nodes_with_missing_ins = filter(lambda x: x[1] < x[2], unbalanced_nodes)
    nodes_with_missing_outs = filter(lambda x: x[1] > x[2], unbalanced_nodes)

    graph = graph.copy()

    # create 1 copy per missing input / per missing output
    n_per_need_in = [_n for n, in_degree, out_degree in nodes_with_missing_ins for _n in [n] * (out_degree - in_degree)]
    n_per_need_out = [_n for n, in_degree, out_degree in nodes_with_missing_outs for _n in [n] * (in_degree - out_degree)]
    assert len(n_per_need_in) == len(n_per_need_out)  # need an equal count of missing ins and missing outs to balance

    # balance
    for n_need_in, n_need_out in zip(n_per_need_in, n_per_need_out):
        graph.insert_edge(n_need_out, n_need_in)

    return graph, set(n_per_need_in), set(n_per_need_out)  # return graph with cycle, orig root nodes, orig tail nodes

Given the fragments ['TTAC', 'TACC', 'ACCC', 'CCCT'], the artificially balanced de Bruijn graph is...

Dot diagram

... with original head nodes at {TTA} and tail nodes at {CCT}.

Given a de Bruijn graph (strongly connected and balanced), you can find a Eulerian cycle by randomly walking unexplored edges in the graph. Pick a starting node and randomly walk edges until you end up back at that same node, ignoring all edges that were previously walked over. Of the nodes that were walked over, pick one that still has unexplored edges and repeat the process: Walk edges from that node until you end up back at that same node, ignoring all edges that were previously walked over (including those in past iterations). Continue this until you run out of unexplored edges.

ch3_code/src/WalkRandomEulerianCycle.py (lines 14 to 64):

# (6, 8), (8, 7), (7, 9), (9, 6)  ---->  68796
def edge_list_to_node_list(edges: List[Tuple[T, T]]) -> List[T]:
    ret = [edges[0][0]]
    for e in edges:
        ret.append(e[1])
    return ret


def randomly_walk_and_remove_edges_until_cycle(graph: Graph[T], node: T) -> List[T]:
    end_node = node
    edge_list = []
    from_node = node
    while len(graph) > 0:
        to_nodes = graph.get_outputs(from_node)
        to_node = next(to_nodes, None)
        assert to_node is not None  # eulerian graphs are strongly connected, meaning we should never hit dead-end nodes

        graph.delete_edge(from_node, to_node, True, True)

        edge = (from_node, to_node)
        edge_list.append(edge)
        from_node = to_node
        if from_node == end_node:
            return edge_list_to_node_list(edge_list)

    assert False  # eulerian graphs are strongly connected and balanced, meaning we should never run out of nodes


# graph must be strongly connected
# graph must be balanced
# if the 2 conditions above are met, the graph will be eulerian (a eulerian cycle exists)
def walk_eulerian_cycle(graph: Graph[T], start_node: T) -> List[T]:
    graph = graph.copy()

    node_cycle = randomly_walk_and_remove_edges_until_cycle(graph, start_node)
    node_cycle_ptr = 0
    while len(graph) > 0:
        new_node_cycle = None
        for local_ptr, node in enumerate(node_cycle[node_cycle_ptr:]):
            if node not in graph:
                continue
            node_cycle_ptr += local_ptr
            inject_node_cycle = randomly_walk_and_remove_edges_until_cycle(graph, node)
            new_node_cycle = node_cycle[:]
            new_node_cycle[node_cycle_ptr:node_cycle_ptr+1] = inject_node_cycle
            break
        assert new_node_cycle is not None
        node_cycle = new_node_cycle

    return node_cycle

Given the fragments ['TTA', 'TAT', 'ATT', 'TTC', 'TCT', 'CTT'], the de Bruijn graph is...

Dot diagram

... and a Eulerian cycle is ...

TT -> TC -> CT -> TT -> TA -> AT -> TT

Note that the graph above is naturally balanced (no artificial edges have been added in to make it balanced). If the graph you're finding a Eulerian cycle on has been artificially balanced, simply start the search for a Eulerian cycle from one of the original head nodes. The artificial edge will show up at the end of the Eulerian cycle, and as such can be dropped from the path.

Kroki diagram output
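
Here's a hypothetical sketch (not from the source files) of that recipe using the excerpts above: balance the graph, walk a Eulerian cycle from an original head node, then drop the final node to discard the artificial edge. It assumes the graph's nodes are Read objects that can be stitched just like reads.

reads = [Read(s) for s in ['TTAC', 'TACC', 'ACCC', 'CCCT']]
graph, orig_heads, orig_tails = balance_graph(to_debruijn_graph(reads))
start_node = next(iter(orig_heads))    # an original head node, e.g. TTA
cycle = walk_eulerian_cycle(graph, start_node)
path = cycle[:-1]                      # the last edge walked is the artificial tail -> head edge
print(Read.stitch(path))               # stitches back to a genome guess, e.g. TTACCCT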

This algorithm picks one Eulerian cycle in a graph. Most graphs have multiple Eulerian cycles, likely too many to enumerate all of them.

⚠️NOTE️️️⚠️

See the section on k-universal strings to see a real-world application of Eulerian graphs. For something like k=20, good luck trying to enumerate all Eulerian cycles.

Find Bubbles

↩PREREQUISITES↩

WHAT: Given a set of a fragments that have been broken to k (read breaking / read-pair breaking), any ...

... of length ...

... may have been from a sequencing error.

Kroki diagram output

WHY: When fragments returned by a sequencer get broken (read breaking / read-pair breaking), any fragments containing sequencing errors may show up in the graph as one of 3 structures: forked prefix, forked suffix, or bubble. As such, it may be possible to detect these structures and flatten them (by removing bad branches) to get a cleaner graph.

For example, imagine the read ATTGG. Read breaking it into 2-mer reads results in: [AT, TT, TG, GG].

Kroki diagram output

Now, imagine that the sequencer captures that same part of the genome again, but this time the read contains a sequencing error. Depending on where the incorrect nucleotide is, one of the 3 structures will get introduced into the graph:

Note that just because these structures exist doesn't mean that the fragments they represent definitively have sequencing errors. These structures could have been caused by other problems / may not be problems at all:

⚠️NOTE️️️⚠️

The Pevzner book says that bubble removal is a common feature in modern assemblers. My assumption is that, before pulling out contigs (described later on), basic probabilities are used to try and suss out if a branch in a bubble / prefix fork / suffix fork is bad and remove it if it is. This (hopefully) results in longer contigs.

ALGORITHM:

ch3_code/src/FindGraphAnomalies.py (lines 53 to 105):

def find_head_convergences(graph: Graph[T], branch_len: int) -> List[Tuple[Optional[T], List[T], Optional[T]]]:
    root_nodes = filter(lambda n: graph.get_in_degree(n) == 0, graph.get_nodes())

    ret = []
    for n in root_nodes:
        for child in graph.get_outputs(n):
            path_from_child = walk_outs_until_converge(graph, child)
            if path_from_child is None:
                continue
            diverging_node = None
            branch_path = [n] + path_from_child[:-1]
            converging_node = path_from_child[-1]
            path = (diverging_node, branch_path, converging_node)
            if len(branch_path) <= branch_len:
                ret.append(path)
    return ret


def find_tail_divergences(graph: Graph[T], branch_len: int) -> List[Tuple[Optional[T], List[T], Optional[T]]]:
    tail_nodes = filter(lambda n: graph.get_out_degree(n) == 0, graph.get_nodes())

    ret = []
    for n in tail_nodes:
        for child in graph.get_inputs(n):
            path_from_child = walk_ins_until_diverge(graph, child)
            if path_from_child is None:
                continue
            diverging_node = path_from_child[0]
            branch_path = path_from_child[1:] + [n]
            converging_node = None
            path = (diverging_node, branch_path, converging_node)
            if len(branch_path) <= branch_len:
                ret.append(path)
    return ret


def find_bubbles(graph: Graph[T], branch_len: int) -> List[Tuple[Optional[T], List[T], Optional[T]]]:
    branching_nodes = filter(lambda n: graph.get_out_degree(n) > 1, graph.get_nodes())

    ret = []
    for n in branching_nodes:
        for child in graph.get_outputs(n):
            path_from_child = walk_outs_until_converge(graph, child)
            if path_from_child is None:
                continue
            diverging_node = n
            branch_path = path_from_child[:-1]
            converging_node = path_from_child[-1]
            path = (diverging_node, branch_path, converging_node)
            if len(branch_path) <= branch_len:
                ret.append(path)
    return ret

Fragments from sequencer:

Fragments after being broken to k=4:

De Bruijn graph:

Dot diagram

Problem paths:
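
A hypothetical usage sketch (not from the source files) of the three helpers above; the de Bruijn graph is the one shown earlier and the branch length limit of 3 is an illustrative choice (errors from a single bad read only produce short branches).

suspect_paths = (
    find_head_convergences(graph, 3)
    + find_tail_divergences(graph, 3)
    + find_bubbles(graph, 3)
)
for diverging_node, branch_path, converging_node in suspect_paths:
    print(diverging_node, branch_path, converging_node)  # candidate branches to prune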

Find Contigs

↩PREREQUISITES↩

WHAT: Given an overlap graph or de Bruijn graph, find the longest possible stretches of non-branching nodes. Each stretch will be a path that's either ...

Each found path is called a contig: a contiguous piece of the graph. For example, ...

Kroki diagram output

WHY: An overlap graph / de Bruijn graph represents all the possible ways a set of fragments may be stitched together to infer the full genome. However, real-world complications make it impractical to guess the full genome:

These complications result in graphs that are too tangled, disconnected, etc... As such, the best someone can do is to pull out the contigs in the graph: unambiguous stretches of DNA.

ALGORITHM:

ch3_code/src/FindContigs.py (lines 14 to 82):

def walk_until_non_1_to_1(graph: Graph[T], node: T) -> Optional[List[T]]:
    ret = [node]
    ret_quick_lookup = {node}
    while True:
        out_degree = graph.get_out_degree(node)
        in_degree = graph.get_in_degree(node)
        if not(in_degree == 1 and out_degree == 1):
            return ret

        children = graph.get_outputs(node)
        child = next(children)
        if child in ret_quick_lookup:
            return ret

        node = child
        ret.append(node)
        ret_quick_lookup.add(node)


def walk_until_loop(graph: Graph[T], node: T) -> Optional[List[T]]:
    ret = [node]
    ret_quick_lookup = {node}
    while True:
        out_degree = graph.get_out_degree(node)
        if out_degree > 1 or out_degree == 0:
            return None

        children = graph.get_outputs(node)
        child = next(children)
        if child in ret_quick_lookup:
            return ret

        node = child
        ret.append(node)
        ret_quick_lookup.add(node)


def find_maximal_non_branching_paths(graph: Graph[T]) -> List[List[T]]:
    paths = []

    for node in graph.get_nodes():
        out_degree = graph.get_out_degree(node)
        in_degree = graph.get_in_degree(node)
        if (in_degree == 1 and out_degree == 1) or out_degree == 0:
            continue
        for child in graph.get_outputs(node):
            path_from_child = walk_until_non_1_to_1(graph, child)
            if path_from_child is None:
                continue
            path = [node] + path_from_child
            paths.append(path)

    skip_nodes = set()
    for node in graph.get_nodes():
        if node in skip_nodes:
            continue
        out_degree = graph.get_out_degree(node)
        in_degree = graph.get_in_degree(node)
        if not (in_degree == 1 and out_degree == 1) or out_degree == 0:
            continue
        path = walk_until_loop(graph, node)
        if path is None:
            continue
        path = path + [node]
        paths.append(path)
        skip_nodes |= set(path)

    return paths

Given the fragments ['TGG', 'GGT', 'GGT', 'GTG', 'CAC', 'ACC', 'CCA'], the de Bruijn graph is...

Dot diagram

The following contigs were found...

GG->GT

GG->GT

GT->TG->GG

CA->AC->CC->CA
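
A hypothetical usage sketch (not output from the source files) that reproduces the contigs above and stitches each one back into a string; it assumes the Read class, to_debruijn_graph(), and Read.stitch() shown earlier work on the graph's nodes.

reads = [Read(s) for s in ['TGG', 'GGT', 'GGT', 'GTG', 'CAC', 'ACC', 'CCA']]
graph = to_debruijn_graph(reads)
for contig_path in find_maximal_non_branching_paths(graph):
    print(Read.stitch(contig_path))  # e.g. the contig GT->TG->GG stitches to GTGG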

Peptide Sequence

↩PREREQUISITES↩

A peptide is a miniature protein consisting of a chain of amino acids anywhere from 2 to 100 amino acids in length. Peptides are created through two mechanisms:

  1. ribosomal peptides: DNA gets transcribed to mRNA (transcription), which in turn gets translated by the ribosome into a peptide (translation).

    Kroki diagram output

  2. non-ribosomal peptides: proteins called NRP synthetases construct peptides one amino acid at a time.

    Kroki diagram output

For ribosomal peptides, each amino acid is encoded as a DNA sequence of length 3. This 3 length DNA sequence is referred to as a codon. By knowing which codons map to which amino acids, the ...

For non-ribosomal peptides, a sample of the peptide needs to be isolated and passed through a mass spectrometer. A mass spectrometer is a device that shatters and bins molecules by their mass-to-charge ratio: Given a sample of molecules, the device randomly shatters each molecule in the sample (forming ions), then bins each ion by its mass-to-charge ratio (\frac{m}{z}).

The output of a mass spectrometer is a plot called a spectrum. The plot's ...

Kroki diagram output

For example, given a sample containing multiple instances of the linear peptide NQY, the mass spectrometer will take each instance of NQY and randomly break the bonds between its amino acids:

Kroki diagram output

⚠️NOTE️️️⚠️

How does it know to break the bonds holding amino acids together and not bonds within the amino acids themselves? My guess is that the bonds coupling one amino acid to another are much weaker than the bonds holding an individual amino acid together -- it's more likely that the weaker bonds will be broken.

Each subpeptide then will have its mass-to-charge ratio measured, which in turn gets converted to a set of potential masses by performing basic math. With these potential masses, it's possible to infer the sequence of the peptide.

Special consideration needs to be given to the real-world practical problems with mass spectrometry. Specifically, the spectrum given back by a mass spectrometer will very likely ...

The following table contains a list of proteinogenic amino acids with their masses and codon mappings:

1 Letter Code 3 Letter Code Amino Acid Codons Monoisotopic Mass (daltons)
A Ala Alanine GCA, GCC, GCG, GCU 71.04
C Cys Cysteine UGC, UGU 103.01
D Asp Aspartic acid GAC, GAU 115.03
E Glu Glutamic acid GAA, GAG 129.04
F Phe Phenylalanine UUC, UUU 147.07
G Gly Glycine GGA, GGC, GGG, GGU 57.02
H His Histidine CAC, CAU 137.06
I Ile Isoleucine AUA, AUC, AUU 113.08
K Lys Lysine AAA, AAG 128.09
L Leu Leucine CUA, CUC, CUG, CUU, UUA, UUG 113.08
M Met Methionine AUG 131.04
N Asn Asparagine AAC, AAU 114.04
P Pro Proline CCA, CCC, CCG, CCU 97.05
Q Gln Glutamine CAA, CAG 128.06
R Arg Arginine AGA, AGG, CGA, CGC, CGG, CGU 156.1
S Ser Serine AGC, AGU, UCA, UCC, UCG, UCU 87.03
T Thr Threonine ACA, ACC, ACG, ACU 101.05
V Val Valine GUA, GUC, GUG, GUU 99.07
W Trp Tryptophan UGG 186.08
Y Tyr Tyrosine UAC, UAU 163.06
* * STOP UAA, UAG, UGA

⚠️NOTE️️️⚠️

The stop marker tells the ribosome to stop translating / the protein is complete. The codons are listed as ribonucleotides (RNA). For deoxyribonucleotides (DNA), swap U with T.

Codon Encode

WHAT: Given a DNA sequence, map each codon to the amino acid it represents. In total, there are 6 different ways that a DNA sequence could be translated:

  1. Since the length of a codon is 3, the encoding of the peptide could start from offset 0, 1, or 2 (referred to as reading frames).
  2. Since DNA is double stranded, either the DNA sequence or its reverse complement could represent the peptide.

WHY: The composition of a peptide can be determined from the DNA sequence that encodes it.

ALGORITHM:

ch4_code/src/helpers/AminoAcidUtils.py (lines 4 to 24):

_codon_to_amino_acid = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAU': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
                        'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGU': 'S', 'AUA': 'I', 'AUC': 'I', 'AUG': 'M', 'AUU': 'I',
                        'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAU': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
                        'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R', 'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
                        'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAU': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
                        'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G', 'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
                        'UAA': '*', 'UAC': 'Y', 'UAG': '*', 'UAU': 'Y', 'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
                        'UGA': '*', 'UGC': 'C', 'UGG': 'W', 'UGU': 'C', 'UUA': 'L', 'UUC': 'F', 'UUG': 'L', 'UUU': 'F'}

_amino_acid_to_codons = dict()
for k, v in _codon_to_amino_acid.items():
    _amino_acid_to_codons.setdefault(v, []).append(k)


def codon_to_amino_acid(rna: str) -> Optional[str]:
    return _codon_to_amino_acid.get(rna)


def amino_acid_to_codons(codon: str) -> Optional[List[str]]:
    return _amino_acid_to_codons.get(codon)

ch4_code/src/EncodePeptide.py (lines 9 to 26):

def encode_peptide(dna: str) -> str:
    rna = dna_to_rna(dna)
    protein_seq = ''
    for codon in split_to_size(rna, 3):
        codon_str = ''.join(codon)
        protein_seq += codon_to_amino_acid(codon_str)
    return protein_seq


def encode_peptides_all_readingframes(dna: str) -> List[str]:
    ret = []
    for dna_ in (dna, dna_reverse_complement(dna)):
        for rf_start in range(3):
            rf_end = len(dna_) - ((len(dna_) - rf_start) % 3)
            peptide = encode_peptide(dna_[rf_start:rf_end])
            ret.append(peptide)
    return ret

Given AAAAGAACCTAATCTTAAAGGAGATGATGATTCTAA, the possible peptide encodings are...

Codon Decode

WHAT: Given a peptide, map each amino acid to the DNA sequences it represents. Since each amino acid can map to multiple codons, there may be multiple DNA sequences for a single peptide.

WHY: The DNA sequences that encode a peptide can be determined from the peptide itself.

ALGORITHM:

ch4_code/src/helpers/AminoAcidUtils.py (lines 4 to 24):

_codon_to_amino_acid = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAU': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
                        'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGU': 'S', 'AUA': 'I', 'AUC': 'I', 'AUG': 'M', 'AUU': 'I',
                        'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAU': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
                        'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R', 'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
                        'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAU': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
                        'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G', 'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
                        'UAA': '*', 'UAC': 'Y', 'UAG': '*', 'UAU': 'Y', 'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
                        'UGA': '*', 'UGC': 'C', 'UGG': 'W', 'UGU': 'C', 'UUA': 'L', 'UUC': 'F', 'UUG': 'L', 'UUU': 'F'}

_amino_acid_to_codons = dict()
for k, v in _codon_to_amino_acid.items():
    _amino_acid_to_codons.setdefault(v, []).append(k)


def codon_to_amino_acid(rna: str) -> Optional[str]:
    return _codon_to_amino_acid.get(rna)


def amino_acid_to_codons(codon: str) -> Optional[List[str]]:
    return _amino_acid_to_codons.get(codon)

ch4_code/src/DecodePeptide.py (lines 8 to 27):

def decode_peptide(peptide: str) -> List[str]:
    def dfs(subpeptide: str, dna: str, ret: List[str]) -> None:
        if len(subpeptide) == 0:
            ret.append(dna)
            return
        aa = subpeptide[0]
        for codon in amino_acid_to_codons(aa):
            dfs(subpeptide[1:], dna + rna_to_dna(codon), ret)
    dnas = []
    dfs(peptide, '', dnas)
    return dnas


def decode_peptide_count(peptide: str) -> int:
    count = 1
    for ch in peptide:
        vals = amino_acid_to_codons(ch)
        count *= len(vals)
    return count

Given NQY, the possible DNA encodings are...
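
For example (a hypothetical usage sketch, not output from the source files), N, Q, and Y each map to 2 codons, so NQY has 2 * 2 * 2 = 8 possible DNA encodings:

decode_peptide_count('NQY')  # 8
decode_peptide('NQY')        # e.g. ['AACCAATAC', 'AACCAATAT', ..., 'AATCAGTAT'] -- 8 sequences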

Experimental Spectrum

WHAT: Given a spectrum for a peptide, derive a set of potential masses from the mass-to-charge ratios. These potential masses are referred to as an experimental spectrum.

WHY: A peptide's sequence can be inferred from a list of its potential subpeptide masses.

ALGORITHM:

Prior to deriving masses from a spectrum, filter out low intensity mass-to-charge ratios. The remaining mass-to-charge ratios are converted to potential masses using \frac{m}{z} \cdot z = m.

For example, consider a mass spectrometer that has a tendency to produce +1 and +2 ions. This mass spectrometer produces the following mass-to-charge ratios: [100, 150, 250]. Each mass-to-charge ratio from this mass spectrometer will be converted to two possible masses:

It's impossible to know which mass is correct, so all masses are included in the experimental spectrum:

[100Da, 150Da, 200Da, 250Da, 300Da, 500Da].

ch4_code/src/ExperimentalSpectrum.py (lines 6 to 14):

# It's expected that low intensity mass_charge_ratios have already been filtered out prior to invoking this func.
def experimental_spectrum(mass_charge_ratios: List[float], charge_tendencies: Set[float]) -> List[float]:
    ret = [0.0]  # implied -- subpeptide of length 0
    for mcr in mass_charge_ratios:
        for charge in charge_tendencies:
            ret.append(mcr * charge)
    ret.sort()
    return ret

The experimental spectrum for the mass-to-charge ratios...

[100.0, 150.0, 250.0]

... and charge tendencies...

{1.0, 2.0}

... is...

[0.0, 100.0, 150.0, 200.0, 250.0, 300.0, 500.0]

⚠️NOTE️️️⚠️

The following section isn't from the Pevzner book or any online resources. I came up with it in an effort to solve the final assignment for Chapter 4 (the chapter on non-ribosomal peptide sequencing). As such, it might not be entirely correct / there may be better ways to do this.

Just as a spectrum is noisy, the experimental spectrum derived from a spectrum is also noisy. For example, consider a mass spectrometer that produces up to ±0.5 noise per mass-to-charge ratio and has a tendency to produce +1 and +2 charges. A real mass of 100Da measured by this mass spectrometer will end up in the spectrum as a mass-to-charge ratio of either...

Converting these mass-to-charge ratio ranges to mass ranges...

Note how the +2 charge conversion produces the widest range: 100Da ± 1Da. Any real mass measured by this mass spectrometer will end up in the experimental spectrum with up to ±1Da noise. For example, a real mass of ...

Kroki diagram output

Similarly, any mass in the experimental spectrum could have come from a real mass within ±1Da of it. For example, an experimental spectrum mass of 100Da could have come from a real mass of anywhere between 99Da to 101Da: At a real mass of ...

Kroki diagram output

As such, the maximum amount of noise for a real mass that made its way into the experimental spectrum is the same as the tolerance required for mapping an experimental spectrum mass back to the real mass it came from. This tolerance can also be considered noise: the experimental spectrum mass is offset from the real mass that it came from.

ch4_code/src/ExperimentalSpectrumNoise.py (lines 6 to 8):

def experimental_spectrum_noise(max_mass_charge_ratio_noise: float, charge_tendencies: Set[float]) -> float:
    return max_mass_charge_ratio_noise * abs(max(charge_tendencies))

Given a max mass-to-charge ratio noise of ±0.5 and charge tendencies {1.0, 2.0}, the maximum noise per experimental spectrum mass is ±1.0

Theoretical Spectrum

↩PREREQUISITES↩

WHAT: A theoretical spectrum is an algorithmically generated list of all subpeptide masses for a known peptide sequence (including 0 and the full peptide's mass).

For example, linear peptide NQY has the theoretical spectrum...

theo_spec = [
  0,    # <empty>
  114,  # N
  128,  # Q
  163,  # Y
  242,  # NQ
  291,  # QY
  405   # NQY
]

... while an experimental spectrum produced by feeding NQY to a mass spectrometer may look something like...

exp_spec = [
  0.0,    # <empty> (implied)
  113.9,  # N
  115.1,  # N
          # Q missing
  136.2,  # faulty
  162.9,  # Y
  242.0,  # NQ
          # QY missing
  311.1,  # faulty
  346.0,  # faulty
  405.2   # NQY
]

The theoretical spectrum is what the experimental spectrum would be in a perfect world...

WHY: The closer a theoretical spectrum is to an experimental spectrum, the more likely it is that the peptide sequence used to generate that theoretical spectrum is related to the peptide sequence that produced that experimental spectrum. This is the basis for how non-ribosomal peptides are sequenced: an experimental spectrum is produced by a mass spectrometer, then that experimental spectrum is compared against a set of theoretical spectrums.

Bruteforce Algorithm

ALGORITHM:

The following algorithm generates a theoretical spectrum in the most obvious way: iterate over each subpeptide and calculate its mass.

ch4_code/src/TheoreticalSpectrum_Bruteforce.py (lines 10 to 26):

def theoretical_spectrum(
        peptide: List[AA],
        peptide_type: PeptideType,
        mass_table: Dict[AA, float]
) -> List[float]:
    # add subpeptide of length 0's mass
    ret = [0.0]
    # add subpeptide of length 1 to k-1's mass
    for k in range(1, len(peptide)):
        for subpeptide, _ in slide_window(peptide, k, cyclic=peptide_type == PeptideType.CYCLIC):
            ret.append(sum([mass_table[ch] for ch in subpeptide]))
    # add subpeptide of length k's mass
    ret.append(sum([mass_table[aa] for aa in peptide]))
    # sort and return
    ret.sort()
    return ret

The theoretical spectrum for the linear peptide NQY is [0.0, 114.0, 128.0, 163.0, 242.0, 291.0, 405.0]

Prefix Sum Algorithm

↩PREREQUISITES↩

ALGORITHM:

The algorithm starts by calculating the prefix sum of the mass at each position of the peptide. The prefix sum is calculated by summing all amino acid masses up until that position. For example, the peptide GASP has the following masses at the following positions...

G A S P
57 71 87 97

As such, the prefix sum at each position is...

G A S P
Mass 57 71 87 97
Prefix sum of mass 57=57 57+71=128 57+71+87=215 57+71+87+97=312
prefixsum_masses[0] = mass['']     = 0             = 0   # Artificially added
prefixsum_masses[1] = mass['G']    = 0+57          = 57
prefixsum_masses[2] = mass['GA']   = 0+57+71       = 128
prefixsum_masses[3] = mass['GAS']  = 0+57+71+87    = 215
prefixsum_masses[4] = mass['GASP'] = 0+57+71+87+97 = 312

The mass for each subpeptide can be derived from just these prefix sums. For example, ...

mass['GASP'] = mass['GASP'] - mass['']    = prefixsum_masses[4] - prefixsum_masses[0]
mass['ASP']  = mass['GASP'] - mass['G']   = prefixsum_masses[4] - prefixsum_masses[1]
mass['AS']   = mass['GAS']  - mass['G']   = prefixsum_masses[3] - prefixsum_masses[1]
mass['A']    = mass['GA']   - mass['G']   = prefixsum_masses[2] - prefixsum_masses[1]
mass['S']    = mass['GAS']  - mass['GA']  = prefixsum_masses[3] - prefixsum_masses[2]
mass['P']    = mass['GASP'] - mass['GAS'] = prefixsum_masses[4] - prefixsum_masses[3]
# etc...

If the peptide is a cyclic peptide, some subpeptides will wrap around. For example, PG is a valid subpeptide if GASP is a cyclic peptide:

Kroki diagram output

The prefix sum can be used to calculate these wrapping subpeptides as well. For example...

mass['PG'] = mass['GASP'] - mass['AS']
           = mass['GASP'] - (mass['GAS'] - mass['G'])    # SUBSTITUTE IN mass['AS'] CALC FROM ABOVE
           = prefixsum_masses[4] - (prefixsum_masses[3] - prefixsum_masses[1])

This algorithm is faster than the bruteforce algorithm, but most use-cases won't notice a performance improvement unless either the...

ch4_code/src/TheoreticalSpectrum_PrefixSum.py (lines 37 to 53):

def theoretical_spectrum(
        peptide: List[AA],
        peptide_type: PeptideType,
        mass_table: Dict[AA, float]
) -> List[float]:
    prefixsum_masses = list(accumulate([mass_table[aa] for aa in peptide], initial=0.0))
    ret = [0.0]
    for end_idx in range(0, len(prefixsum_masses)):
        for start_idx in range(0, end_idx):
            min_mass = prefixsum_masses[start_idx]
            max_mass = prefixsum_masses[end_idx]
            ret.append(max_mass - min_mass)
            if peptide_type == PeptideType.CYCLIC and start_idx > 0 and end_idx < len(peptide):
                ret.append(prefixsum_masses[-1] - (prefixsum_masses[end_idx] - prefixsum_masses[start_idx]))
    ret.sort()
    return ret

The theoretical spectrum for the linear peptide NQY is [0.0, 114.0, 128.0, 163.0, 242.0, 291.0, 405.0]

⚠️NOTE️️️⚠️

The algorithm above is serial, but it can be made parallel to get even more speed:

  1. Parallelized prefix sum (e.g. Hillis-Steele / Blelloch).
  2. Parallelized iteration instead of nested for-loops.
  3. Parallelized sorting (e.g. Parallel merge sort / Parallel brick sort / Bitonic sort).

Spectrum Convolution

↩PREREQUISITES↩

WHAT: Given an experimental spectrum, subtract its masses from each other. The differences are a set of potential amino acid masses for the peptide that generated that experimental spectrum.

For example, the following experimental spectrum is for the linear peptide NQY:

[0.0Da, 113.9Da, 115.1Da, 136.2Da, 162.9Da, 242.0Da, 311.1Da, 346.0Da, 405.2Da]

Performing 242.0 - 113.9 results in 128.1, which is very close to the mass for amino acid Q. The mass for Q was derived even though no experimental spectrum masses are near Q's mass:

WHY: The closer a theoretical spectrum is to an experimental spectrum, the more likely it is that the peptide sequence used to generate that theoretical spectrum is related to the peptide sequence that produced that experimental spectrum. However, before being able to build a theoretical spectrum, a list of potential amino acids needs to be inferred from the experimental spectrum. In addition to the 20 proteinogenic amino acids, there are many other non-proteinogenic amino acids that may be part of the peptide.

This operation infers a list of potential amino acid masses, which can be mapped back to amino acids themselves.

ALGORITHM:

Consider an experimental spectrum with masses that don't contain any noise. That is, the experimental spectrum may have faulty masses and may be missing masses, but any correct masses it does have are exact / noise-free. To derive a list of potential amino acid masses for this experimental spectrum:

  1. Subtract experimental spectrum masses from each other (each mass gets subtracted from every mass).
  2. Filter differences to those between 57Da and 200Da (generally accepted range for the mass of an amino acid).
  3. Filter out differences that don't occur at least n times (n is user-defined).

The result is a list of potential amino acid masses for the peptide that produced that experimental spectrum. For example, consider the following experimental spectrum for the linear peptide NQY:

[0Da, 114Da, 136Da, 163Da, 242Da, 311Da, 346Da, 405Da]

The experimental spectrum masses...

Subtract the experimental spectrum masses:

      0     114   136   163   242   311   346   405
0     0     -114  -136  -163  -242  -311  -346  -405
114   114   0     -22   -49   -128  -197  -232  -291
136   136   22    0     -27   -106  -175  -210  -269
163   163   49    27    0     -79   -148  -183  -242
242   242   128   106   79    0     -69   -104  -163
311   311   197   175   148   69    0     -35   -94
346   346   232   210   183   104   35    0     -59
405   405   291   269   242   163   94    59    0

Then, remove differences that aren't between 57Da and 200Da:

      0     114   136   163   242   311   346   405
0
114   114
136   136
163   163
242         128   106   79
311         197   175   148   69
346                     183   104
405                           163   94    59

Then, filter out any differences occurring less than n times. In this case, it makes sense to set n to 1 because almost all of the differences occur only once.

The final result is a list of potential amino acid masses for the peptide that produced the experimental spectrum:

[59Da, 69Da, 79Da, 94Da, 104Da, 106Da, 114Da, 128Da, 136Da, 148Da, 163Da, 175Da, 183Da, 197Da]

Note that the experimental spectrum is for the linear peptide NQY. The experimental spectrum contained the masses for N (114Da) and Y (163Da), but not Q (128Da). This operation was able to pull out the mass for Q: 128Da is in the final list of differences.

ch4_code/src/SpectrumConvolution_NoNoise.py (lines 6 to 16):

def spectrum_convolution(experimental_spectrum: List[float], min_mass=57.0, max_mass=200.0) -> List[float]:
    # it's expected that experimental_spectrum is sorted smallest to largest
    diffs = []
    for row_idx, row_mass in enumerate(experimental_spectrum):
        for col_idx, col_mass in enumerate(experimental_spectrum):
            mass_diff = row_mass - col_mass
            if min_mass <= mass_diff <= max_mass:
                diffs.append(mass_diff)
    diffs.sort()
    return diffs

The spectrum convolution for [0.0, 114.0, 136.0, 163.0, 242.0, 311.0, 346.0, 405.0] is ...
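
A hypothetical usage sketch (not output from the source files): running the excerpt above on the example masses and de-duplicating the differences recovers the candidate amino acid masses listed earlier, including Q's 128Da.

exp_spec = [0.0, 114.0, 136.0, 163.0, 242.0, 311.0, 346.0, 405.0]
candidates = sorted(set(spectrum_convolution(exp_spec)))
# candidates == [59.0, 69.0, 79.0, 94.0, 104.0, 106.0, 114.0, 128.0,
#                136.0, 148.0, 163.0, 175.0, 183.0, 197.0]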

⚠️NOTE️️️⚠️

The following section isn't from the Pevzner book or any online resources. I came up with it in an effort to solve the final assignment for Chapter 4 (the chapter on non-ribosomal peptide sequencing). As such, it might not be entirely correct / there may be better ways to do this.

The algorithm described above is for experimental spectrums that have exact masses (no noise). However, real experimental spectrums will have noisy masses. Since a real experimental spectrum has noisy masses, the amino acid masses derived from it will also be noisy. For example, consider an experimental spectrum that has ±1Da noise per mass. A real mass of...

Subtract the opposite extremes from these two ranges: 243Da - 113Da = 130Da. That's 2Da away from the real mass difference: 128Da. As such, the maximum noise per amino acid mass is 2 times the maximum noise for the experimental spectrum that it was derived from: ±2Da for this example.

ch4_code/src/SpectrumConvolutionNoise.py (lines 7 to 9):

def spectrum_convolution_noise(exp_spec_mass_noise: float) -> float:
    return 2.0 * exp_spec_mass_noise

Given a max experimental spectrum mass noise of ±1.0, the maximum noise per amino acid derived from an experimental spectrum is ±2.0

Extending the algorithm to handle noisy experimental spectrum masses requires one extra step: group together differences that are within some tolerance of each other, where this tolerance is the maximum amino acid mass noise calculation described above. For example, consider the following experimental spectrum for linear peptide NQY that has up to ±1Da noise per mass:

[0.0Da, 113.9Da, 115.1Da, 136.2Da, 162.9Da, 242.0Da, 311.1Da, 346.0Da, 405.2Da]

Just as before, subtract the experimental spectrum masses from each other and remove differences that aren't between 57Da and 200Da:

       0.0    113.9  115.1  136.2  162.9  242.0  311.1  346.0  405.2
0.0
113.9  113.9
115.1  115.1
136.2  136.2
162.9  162.9
242.0         128.1  126.9  105.8  79.1
311.1         197.2  196.0  174.9  148.2  69.1
346.0                              183.1  104.0
405.2                                     163.2  94.1   59.2

Then, group differences that are within ±2Da of each other (2 times the experimental spectrum's maximum mass noise):

Then, filter out any groups that have less than n occurrences. In this case, filtering to n=2 occurrences reveals that all amino acid masses are captured for NQY:

Note that the experimental spectrum is for the linear peptide NQY. The experimental spectrum contained masses near N (113.9Da and 115.1Da) and Y (162.9Da), but not Q. This operation was able to pull out masses near Q: [128.1, 126.9] is in the final list of differences.

ch4_code/src/SpectrumConvolution.py (lines 7 to 58):

def group_masses_by_tolerance(masses: List[float], tolerance: float) -> typing.Counter[float]:
    masses = sorted(masses)
    length = len(masses)
    ret = Counter()
    for i, m1 in enumerate(masses):
        if m1 in ret:
            continue
        # search backwards
        left_limit = 0
        for j in range(i, -1, -1):
            m2 = masses[j]
            if abs(m2 - m1) > tolerance:
                break
            left_limit = j
        # search forwards
        right_limit = length - 1
        for j in range(i, length):
            m2 = masses[j]
            if abs(m2 - m1) > tolerance:
                break
            right_limit = j
        count = right_limit - left_limit + 1
        ret[m1] = count
    return ret


def spectrum_convolution(
        exp_spec: List[float],  # must be sorted smallest to largest
        tolerance: float,
        min_mass: float = 57.0,
        max_mass: float = 200.0,
        round_digits: int = -1,  # if set, rounds to this many digits past decimal point
        implied_zero: bool = False  # if set, run as if 0.0 were added to exp_spec
) -> typing.Counter[float]:
    min_mass -= tolerance
    max_mass += tolerance
    diffs = []
    for row_idx, row_mass in enumerate(exp_spec):
        for col_idx, col_mass in enumerate(exp_spec):
            mass_diff = row_mass - col_mass
            if round_digits != -1:
                mass_diff = round(mass_diff, round_digits)
            if min_mass <= mass_diff <= max_mass:
                diffs.append(mass_diff)
    if implied_zero:
        for mass in exp_spec:
            if min_mass <= mass <= max_mass:
                diffs.append(mass)
            if mass > max_mass:
                break
    return group_masses_by_tolerance(diffs, tolerance)

The spectrum convolution for [113.9, 115.1, 136.2, 162.9, 242.0, 311.1, 346.0, 405.2] is ...
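
A hypothetical usage sketch (not output from the source files): group the noisy differences with a ±2.0Da tolerance (2 times the ±1.0Da experimental spectrum noise) and keep the groups that occur at least twice, as in the worked example above.

exp_spec = [113.9, 115.1, 136.2, 162.9, 242.0, 311.1, 346.0, 405.2]
grouped = spectrum_convolution(exp_spec, tolerance=2.0, implied_zero=True)
aa_mass_candidates = [mass for mass, count in grouped.items() if count >= 2]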

Spectrum Score

↩PREREQUISITES↩

WHAT: Given an experimental spectrum and a theoretical spectrum, score them against each other by counting how many masses match between them.

WHY: The more matching masses between a theoretical spectrum and an experimental spectrum, the more likely it is that the peptide sequence used to generate that theoretical spectrum is related to the peptide sequence that produced that experimental spectrum. This is the basis for how non-ribosomal peptides are sequenced: an experimental spectrum is produced by a mass spectrometer, then that experimental spectrum is compared against a set of theoretical spectrums.

ALGORITHM:

Consider an experimental spectrum with masses that don't contain any noise. That is, the experimental spectrum may have faulty masses and may be missing masses, but any correct masses it does have are exact / noise-free. Scoring this experimental spectrum against a theoretical spectrum is simple: count the number of matching masses.

ch4_code/src/SpectrumScore_NoNoise.py (lines 9 to 28):

def score_spectrums(
        s1: List[float],  # must be sorted ascending
        s2: List[float]   # must be sorted ascending
) -> int:
    idx_s1 = 0
    idx_s2 = 0
    score = 0
    while idx_s1 < len(s1) and idx_s2 < len(s2):
        s1_mass = s1[idx_s1]
        s2_mass = s2[idx_s2]
        if s1_mass < s2_mass:
            idx_s1 += 1
        elif s1_mass > s2_mass:
            idx_s2 += 1
        else:
            idx_s1 += 1
            idx_s2 += 1
            score += 1
    return score

The spectrum score for...

[0.0, 57.0, 71.0, 128.0, 199.0, 256.0]

... vs ...

[0.0, 57.0, 71.0, 128.0, 128.0, 199.0, 256.0]

... is 6

Note that a theoretical spectrum may have multiple masses with the same value but an experimental spectrum won't. For example, the theoretical spectrum for GAK is ...

G A K GA AK GAK
Mass 0Da 57Da 71Da 128Da 128Da 199Da 256Da

K and GA both have a mass of 128Da. Since experimental spectrums don't distinguish between where masses come from, an experimental spectrum for this linear peptide will only have 1 entry for 128Da.

⚠️NOTE️️️⚠️

The following section isn't from the Pevzner book or any online resources. I came up with it in an effort to solve the final assignment for Chapter 4 (the chapter on non-ribosomal peptide sequencing). As such, it might not be entirely correct / there may be better ways to do this.

The algorithm described above is for experimental spectrums that have exact masses (no noise). However, real experimental spectrums have noisy masses. That noise needs to be accounted for when identifying matches.

Recall that each amino acid mass captured by a spectrum convolution has up to some amount of noise. This is what defines the tolerance for a matching mass between the experimental spectrum and the theoretical spectrum. Specifically, the maximum amount of noise for a captured amino acid mass is multiplied by the amino acid count of the subpeptide to determine the tolerance.

For example, imagine a case where it's determined that the noise tolerance for each captured amino acid mass is ±2Da. Given the theoretical spectrum for linear peptide NQY, the tolerances would be as follows:

N Q Y NQ QY NQY
Mass 0Da 114Da 128Da 163Da 242Da 291Da 405Da
Tolerance 0Da ±2Da ±2Da ±2Da ±4Da ±4Da ±6Da

ch4_code/src/TheoreticalSpectrumTolerances.py (lines 7 to 26):

def theoretical_spectrum_tolerances(
        peptide_len: int,
        peptide_type: PeptideType,
        amino_acid_mass_tolerance: float
) -> List[float]:
    ret = [0.0]
    if peptide_type == PeptideType.LINEAR:
        for i in range(peptide_len):
            tolerance = (i + 1) * amino_acid_mass_tolerance
            ret += [tolerance] * (peptide_len - i)
    elif peptide_type == PeptideType.CYCLIC:
        for i in range(peptide_len - 1):
            tolerance = (i + 1) * amino_acid_mass_tolerance
            ret += [tolerance] * peptide_len
        if peptide_len != 0:
            ret.append(peptide_len * amino_acid_mass_tolerance)
    else:
        raise ValueError()
    return ret

The theoretical spectrum tolerances for linear peptide NQY with an amino acid mass tolerance of 2.0...

[0.0, 2.0, 2.0, 2.0, 4.0, 4.0, 6.0]
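
To score against an experimental spectrum, each tolerance gets paired with its corresponding theoretical spectrum mass to form an (expected, min, max) tuple. A minimal sketch, assuming the function above and PeptideType are importable and using NQY's linear theoretical spectrum of [0, 114, 128, 163, 242, 291, 405]:

theo_masses = [0.0, 114.0, 128.0, 163.0, 242.0, 291.0, 405.0]  # linear NQY
theo_tols = theoretical_spectrum_tolerances(3, PeptideType.LINEAR, 2.0)
theo_spec = [(m, m - t, m + t) for m, t in zip(theo_masses, theo_tols)]
# theo_spec -> [(0.0, 0.0, 0.0), (114.0, 112.0, 116.0), (128.0, 126.0, 130.0), (163.0, 161.0, 165.0),
#               (242.0, 238.0, 246.0), (291.0, 287.0, 295.0), (405.0, 399.0, 411.0)]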

Given a theoretical spectrum with tolerances, each experimental spectrum mass is checked to see if it fits within a theoretical spectrum mass tolerance. If it fits, it's considered a match. The score includes both the number of matches and how closely each match was to the ideal theoretical spectrum mass.

ch4_code/src/SpectrumScore.py (lines 10 to 129):

def scan_left(
        exp_spec: List[float],
        exp_spec_lo_idx: int,
        exp_spec_start_idx: int,
        theo_mid_mass: float,
        theo_min_mass: float
) -> Optional[int]:
    found_dist = None
    found_idx = None
    for idx in range(exp_spec_start_idx, exp_spec_lo_idx - 1, -1):
        exp_mass = exp_spec[idx]
        if exp_mass < theo_min_mass:
            break
        dist_to_theo_mid_mass = abs(exp_mass - theo_mid_mass)
        if found_dist is None or dist_to_theo_mid_mass < found_dist:
            found_idx = idx
            found_dist = dist_to_theo_mid_mass
    return found_idx


def scan_right(
        exp_spec: List[float],
        exp_spec_hi_idx: int,
        exp_spec_start_idx: int,
        theo_mid_mass: float,
        theo_max_mass: float
) -> Optional[int]:
    found_dist = None
    found_idx = None
    for idx in range(exp_spec_start_idx, exp_spec_hi_idx):
        exp_mass = exp_spec[idx]
        if exp_mass > theo_max_mass:
            break
        dist_to_theo_mid_mass = abs(exp_mass - theo_mid_mass)
        if found_dist is None or dist_to_theo_mid_mass < found_dist:
            found_idx = idx
            found_dist = dist_to_theo_mid_mass
    return found_idx


def find_closest_within_tolerance(
        exp_spec: List[float],
        exp_spec_lo_idx: int,
        exp_spec_hi_idx: int,
        theo_exact_mass: float,
        theo_min_mass: float,
        theo_max_mass: float
) -> Optional[int]:
    # Binary search exp_spec for where theo_exact_mass would be inserted (left-most index chosen if already there).
    start_idx = bisect_left(exp_spec, theo_exact_mass, lo=exp_spec_lo_idx, hi=exp_spec_hi_idx)
    if start_idx == exp_spec_hi_idx:
        start_idx -= 1
    # From start_idx - 1, walk left to find the closest possible value to theo_mid_mass
    left_idx = scan_left(exp_spec, exp_spec_lo_idx, start_idx - 1, theo_exact_mass, theo_min_mass)
    # From start_idx, walk right to find the closest possible value to theo_mid_mass
    right_idx = scan_right(exp_spec, exp_spec_hi_idx, start_idx, theo_exact_mass, theo_max_mass)
    if left_idx is None and right_idx is None:  # If nothing found, return None
        return None
    if left_idx is None:  # If found something while walking left but not while walking right, return left
        return right_idx
    if right_idx is None:  # If found something while walking right but not while walking left, return right
        return left_idx
    # Otherwise, compare left and right to see which is close to theo_mid_mass and return that
    left_exp_mass = exp_spec[left_idx]
    left_dist_to_theo_mid_mass = abs(left_exp_mass - theo_exact_mass)
    right_exp_mass = exp_spec[right_idx]
    right_dist_to_theo_mid_mass = abs(right_exp_mass - theo_exact_mass)
    if left_dist_to_theo_mid_mass < right_dist_to_theo_mid_mass:
        return left_idx
    else:
        return right_idx


def score_spectrums(
        exp_spec: List[float],  # must be sorted asc
        theo_spec_with_tolerances: List[Tuple[float, float, float]]  # must be sorted asc, items are (expected,min,max)
) -> Tuple[int, float, float]:
    dist_score = 0.0
    within_score = 0
    exp_spec_lo_idx = 0
    exp_spec_hi_idx = len(exp_spec)
    for theo_mass in theo_spec_with_tolerances:
        # Find closest exp_spec mass for theo_mass
        theo_exact_mass, theo_min_mass, theo_max_mass = theo_mass
        exp_idx = find_closest_within_tolerance(
            exp_spec,
            exp_spec_lo_idx,
            exp_spec_hi_idx,
            theo_exact_mass,
            theo_min_mass,
            theo_max_mass
        )
        if exp_idx is None:
            continue
        # Calculate how far the found mass is from the ideal mass (theo_exact_mass) -- a perfect match will add 1.0 to
        # the score; the farther away it is, the less gets added (the minimum added will be 0.5).
        exp_mass = exp_spec[exp_idx]
        dist = abs(exp_mass - theo_exact_mass)
        max_dist = theo_max_mass - theo_min_mass
        if max_dist > 0.0:
            closeness = 1.0 - (dist / max_dist)
        else:
            closeness = 1.0
        dist_score += closeness
        # Increment within_score for each match. The above block increases dist_score as the found mass gets closer to
        # theo_exact_mass. There may be a case where a peptide with 6 of 10 AAs matches exactly (6 * 1.0) while another
        # peptide with 10 of 10 AAs matching very loosely (10 * 0.5) -- the first peptide will incorrectly win out if
        # only dist_score were used.
        within_score += 1
        # Move up the lower bound for what to consider in exp_spec such that it's after the exp_spec mass found
        # in this cycle. That is, the next cycle won't consider anything lower than the mass that was found here. This
        # is done because theo_spec may contain multiple copies of the same mass, but a real experimental spectrum won't
        # do that (e.g. a peptide containing 57 twice will have two entries for 57 in its theoretical spectrum, but a
        # real experimental spectrum for that same peptide will only contain 57 -- anything with mass of 57 will be
        # collected into the same bin).
        exp_spec_lo_idx = exp_idx + 1
        if exp_spec_lo_idx == exp_spec_hi_idx:
            break
    return within_score, dist_score, 0.0 if within_score == 0 else dist_score / within_score

The spectrum score for...

[0.0, 56.1, 71.9, 126.8, 200.6, 250.9]

... vs ...

[0.0, 57.0, 71.0, 128.0, 128.0, 199.0, 256.0]

... with 2.0 amino acid tolerance is...

(6, 4.624999999999999, 0.7708333333333331)
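
A sketch of how this example can be reproduced, assuming the functions from TheoreticalSpectrumTolerances.py and SpectrumScore.py above are importable (the theoretical spectrum is linear GAK's, taken from the earlier table):

theo_masses = [0.0, 57.0, 71.0, 128.0, 128.0, 199.0, 256.0]  # linear GAK
theo_tols = theoretical_spectrum_tolerances(3, PeptideType.LINEAR, 2.0)
theo_spec = [(m, m - t, m + t) for m, t in zip(theo_masses, theo_tols)]
exp_spec = [0.0, 56.1, 71.9, 126.8, 200.6, 250.9]
print(score_spectrums(exp_spec, theo_spec))  # should print the tuple shown above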

Spectrum Sequence

↩PREREQUISITES↩

WHAT: Given an experimental spectrum and a set of amino acid masses, generate theoretical spectrums and score them against the experimental spectrum in an effort to infer the peptide sequence of the experimental spectrum.

WHY: The more matching masses between a theoretical spectrum and an experimental spectrum, the more likely it is that the peptide sequence used to generate that theoretical spectrum is related to the peptide sequence that produced that experimental spectrum.

Bruteforce Algorithm

ALGORITHM:

Imagine if experimental spectrums were perfect just like theoretical spectrums: no missing masses, no faulty masses, no noise, and preserved repeat masses. To bruteforce the peptide that produced such an experimental spectrum, generate candidate peptides by branching out amino acids at each position and compare each candidate peptide's theoretical spectrum to the experimental spectrum. If the theoretical spectrum matches the experimental spectrum, it's reasonable to assume that peptide is the same as the peptide that generated the experimental spectrum.

The algorithm stops branching out once the mass of the candidate peptide exceeds the final mass in the experimental spectrum. For a perfect experimental spectrum, the final mass is always the mass of the peptide that produced it. For example, for the linear peptide GAK ...

G A K GA AK GAK
Mass 0Da 57Da 71Da 128Da 128Da 199Da 256Da

ch4_code/src/SequencePeptide_Naive_Bruteforce.py (lines 10 to 30):

def sequence_peptide(
        exp_spec: List[float],  # must be sorted asc
        peptide_type: PeptideType,
        aa_mass_table: Dict[AA, float]
) -> List[List[AA]]:
    peptide_mass = exp_spec[-1]
    candidate_peptides = [[]]
    final_peptides = []
    while len(candidate_peptides) > 0:
        new_candidate_peptides = []
        for p in candidate_peptides:
            for m in aa_mass_table.keys():
                new_p = p[:] + [m]
                new_p_mass = sum([aa_mass_table[aa] for aa in new_p])
                if new_p_mass == peptide_mass and theoretical_spectrum(new_p, peptide_type, aa_mass_table) == exp_spec:
                    final_peptides.append(new_p)
                elif new_p_mass < peptide_mass:
                    new_candidate_peptides.append(new_p)
        candidate_peptides = new_candidate_peptides
    return final_peptides

The linear peptides matching the experimental spectrum [0.0, 57.0, 71.0, 128.0, 128.0, 199.0, 256.0] are...
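
A usage sketch, separate from the generated listing above, assuming the function above (and the PeptideType / theoretical_spectrum helpers it relies on) are importable and using a deliberately tiny mass table containing only G, A, and K:

aa_mass_table = {'G': 57.0, 'A': 71.0, 'K': 128.0}
exp_spec = [0.0, 57.0, 71.0, 128.0, 128.0, 199.0, 256.0]
matches = sequence_peptide(exp_spec, PeptideType.LINEAR, aa_mass_table)
# With this restricted table, GAK should match -- and so should KAG, since reversing a
# linear peptide produces the same multiset of subpeptide masses.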

⚠️NOTE️️️⚠️

The following section isn't from the Pevzner book or any online resources. I came up with it in an effort to solve the final assignment for Chapter 4 (the chapter on non-ribosomal peptide sequencing). As such, it might not be entirely correct / there may be better ways to do this.

Even though real experimental spectrums aren't perfect, the high-level algorithm remains the same: Create candidate peptides by branching out amino acids and capture the best scoring ones until the mass goes too high. However, various low-level aspects of the algorithm need to be modified to handle the problems with real experimental spectrums.

For starters, since there are no preset amino acids to build candidate peptides with, amino acid masses are captured using spectrum convolution and used directly. For example, instead of representing a peptide as GAK, it's represented as 57-71-128.

G A K
57Da 71Da 128Da

Next, the last mass in a real experimental spectrum isn't guaranteed to be the mass of the peptide that produced it. Since real experimental spectrums have faulty masses and may be missing masses, it's possible that either the peptide's mass wasn't captured at all or was captured but at an index that isn't the last element.

If the experimental spectrum's peptide mass was captured and found, it'll have noise. For example, imagine an experimental spectrum for the peptide 57-57 with ±1Da noise. The exact mass of the peptide 57-57 is 114Da, but if that mass gets placed into the experimental spectrum it will show up as anywhere between 113Da to 115Da.

Given that same experimental spectrum, running a spectrum convolution to derive the amino acid masses ends up giving back amino acid masses with ±2Da noise. For example, the mass 57Da may be derived as anywhere between 55Da to 59Da. Assuming that you're building the peptide 57-57 with the low end of that range (55Da), its mass will be 55Da + 55Da = 110Da. Compared against the high end of the experimental spectrum's peptide mass (115Da), it's 5Da away.

ch4_code/src/ExperimentalSpectrumPeptideMassNoise.py (lines 18 to 21):

def experimental_spectrum_peptide_mass_noise(exp_spec_mass_noise: float, peptide_len: int) -> float:
    aa_mass_noise = spectrum_convolution_noise(exp_spec_mass_noise)
    return aa_mass_noise * peptide_len + exp_spec_mass_noise

Given an experimental spectrum mass noise of ±1.0 and expected peptide length of 2, the maximum noise for an experimental spectrum's peptide mass is ±5.0
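
To sanity check that number: a spectrum convolution subtracts two experimental masses that can each be off by up to ±1.0Da, so each captured amino acid mass can be off by up to ±2.0Da. A 2 amino acid peptide built from such masses compounds that to ±4.0Da, and comparing it against an experimental spectrum peptide mass that can itself be off by ±1.0Da gives ±5.0Da in total (2.0 × 2 + 1.0 = 5.0).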

Finally, given that real experimental spectrums contain faulty masses and may be missing masses, more often than not the peptides that score the best aren't the best candidates. Theoretical spectrum masses that are ...

... may push poor peptide candidates forward. As such, it makes sense to also keep a backlog of peptides that score slightly worse than the best score (e.g. within m of the best score). Any of these backlog peptides may be the correct peptide for the experimental spectrum.

ch4_code/src/SequenceTester.py (lines 21 to 86):

class SequenceTester:
    def __init__(
            self,
            exp_spec: List[float],           # must be sorted asc
            aa_mass_table: Dict[AA, float],  # amino acid mass table
            aa_mass_tolerance: float,        # amino acid mass tolerance
            peptide_min_mass: float,         # min mass that the peptide could be
            peptide_max_mass: float,         # max mass that the peptide could be
            peptide_type: PeptideType,       # linear or cyclic
            score_backlog: int = 0           # keep this many previous scores
    ):
        self.exp_spec = exp_spec
        self.aa_mass_table = aa_mass_table
        self.aa_mass_tolerance = aa_mass_tolerance
        self.peptide_min_mass = peptide_min_mass
        self.peptide_max_mass = peptide_max_mass
        self.peptide_type = peptide_type
        self.score_backlog = score_backlog
        self.leader_peptides_top_score = 0
        self.leader_peptides = {0: []}

    @staticmethod
    def generate_theroetical_spectrum_with_tolerances(
            peptide: List[AA],
            peptide_type: PeptideType,
            aa_mass_table: Dict[AA, float],
            aa_mass_tolerance: float
    ) -> List[Tuple[float, float, float]]:
        theo_spec_raw = theoretical_spectrum(peptide, peptide_type, aa_mass_table)
        theo_spec_tols = theoretical_spectrum_tolerances(len(peptide), peptide_type, aa_mass_tolerance)
        theo_spec = [(m, m - t, m + t) for m, t in zip(theo_spec_raw, theo_spec_tols)]
        return theo_spec

    def test(
            self,
            peptide: List[AA],
            theo_spec: Optional[List[Tuple[float, float, float]]] = None
    ) -> TestResult:
        if theo_spec is None:
            theo_spec = SequenceTester.generate_theroetical_spectrum_with_tolerances(
                peptide,
                self.peptide_type,
                self.aa_mass_table,
                self.aa_mass_tolerance
            )
        # Don't add if mass out of range
        _, tp_min_mass, tp_max_mass = theo_spec[-1]  # last element of theo spec is the mass of the theo spec peptide
        if tp_min_mass < self.peptide_min_mass:
            return TestResult.MASS_TOO_SMALL
        elif tp_max_mass > self.peptide_max_mass:
            return TestResult.MASS_TOO_LARGE
        # Don't add if the score is lower than the previous n best scores
        peptide_score = score_spectrums(self.exp_spec, theo_spec)[0]
        min_acceptable_score = self.leader_peptides_top_score - self.score_backlog
        if peptide_score < min_acceptable_score:
            return TestResult.SCORE_TOO_LOW
        # Add, but also remove any previous test peptides that are no longer within the acceptable score threshold
        leaders = self.leader_peptides.setdefault(peptide_score, [])
        leaders.append(peptide)
        if peptide_score > self.leader_peptides_top_score:
            self.leader_peptides_top_score = peptide_score
            if len(self.leader_peptides) >= self.score_backlog:
                smallest_leader_score = min(self.leader_peptides.keys())
                self.leader_peptides.pop(smallest_leader_score)
        return TestResult.ADDED

ch4_code/src/SequencePeptide_Bruteforce.py (lines 13 to 41):

def sequence_peptide(
        exp_spec: List[float],                               # must be sorted asc
        aa_mass_table: Dict[AA, float],                      # amino acid mass table
        aa_mass_tolerance: float,                            # amino acid mass tolerance
        peptide_mass_candidates: List[Tuple[float, float]],  # mass range candidates for mass of peptide
        peptide_type: PeptideType,                           # linear or cyclic
        score_backlog: int                                   # backlog of top scores
) -> SequenceTesterSet:
    tester_set = SequenceTesterSet(
        exp_spec,
        aa_mass_table,
        aa_mass_tolerance,
        peptide_mass_candidates,
        peptide_type,
        score_backlog
    )
    candidates = [[]]
    while len(candidates) > 0:
        new_candidate_peptides = []
        for p in candidates:
            for m in aa_mass_table.keys():
                new_p = p[:]
                new_p.append(m)
                res = set(tester_set.test(new_p))
                if res != {TestResult.MASS_TOO_LARGE}:
                    new_candidate_peptides.append(new_p)
        candidates = new_candidate_peptides
    return tester_set

⚠️NOTE️️️⚠️

The experimental spectrum in the example below is for the peptide 114-128-163, which has the theoretical spectrum [0, 114, 128, 163, 242, 291, 405].

Given the ...

Top 10 captured amino acid masses (rounded to 1): [114.0, 112.5, 115.8, 161.1, 162.9, 127.1, 130.4, 177.5]

For peptides between 397.0 and 411.0...

Branch-and-bound Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm extends the bruteforce algorithm into a more efficient branch-and-bound algorithm by adding one extra step: After each branch, any candidate peptides deemed to be untenable are discarded. In this case, untenable means that there's no chance / little chance of the peptide branching out to a correct solution.

Imagine if experimental spectrums were perfect just like theoretical spectrums: no missing masses, no faulty masses, no noise, and preserved repeat masses. For such an experimental spectrum, an untenable candidate peptide has a theoretical spectrum with at least one mass that doesn't exist in the experimental spectrum. For example, the peptide 57-71-128 has the theoretical spectrum [0Da, 57Da, 71Da, 128Da, 128Da, 199Da, 256Da]. If 71Da were missing from the experimental spectrum, that peptide would be untenable (won't move forward).

When testing if a candidate peptide should move forward, the candidate peptide should be treated as a linear peptide even if the experimental spectrum is for a cyclic peptide. For example, testing the experimental spectrum for cyclic peptide NQYQ against the theoretical spectrum for candidate cyclic peptide NQY...

Peptide 0 1 2 3 4 5 6 7 8 9 10 11 12 13
NQYQ 0 114 128 128 163 242 242 291 291 370 405 405 419 533
NQY 0 114 128 163 242 277 291 405

The theoretical spectrum contains 277, but the experimental spectrum doesn't. That means NQY won't branch out any further even though it should. As such, even if the experimental spectrum is for a cyclic peptide, treat candidate peptides as if they're linear segments of a cyclic peptide (essentially the same as linear peptides). If the theoretical spectrum for candidate linear peptide NQY were used...

Peptide 0 1 2 3 4 5 6 7 8 9 10 11 12 13
NQYQ 0 114 128 128 163 242 242 291 291 370 405 405 419 533
NQY 0 114 128 163 242 291 405

All theoretical spectrum masses are in the experimental spectrum. As such, the candidate NQY would move forward.

ch4_code/src/SequencePeptide_Naive_BranchAndBound.py (lines 11 to 61):

def sequence_peptide(
        exp_spec: List[float],  # must be sorted asc
        peptide_type: PeptideType,
        aa_mass_table: Dict[AA, float]
) -> List[List[AA]]:
    peptide_mass = exp_spec[-1]
    candidate_peptides = [[]]
    final_peptides = []
    while len(candidate_peptides) > 0:
        # Branch candidates
        new_candidate_peptides = []
        for p in candidate_peptides:
            for m in aa_mass_table:
                new_p = p[:] + [m]
                new_candidate_peptides.append(new_p)
        candidate_peptides = new_candidate_peptides
        # Test candidates to see if they match exp_spec or if they should keep being branched
        removal_idxes = set()
        for i, p in enumerate(candidate_peptides):
            p_mass = sum([aa_mass_table[aa] for aa in p])
            if p_mass == peptide_mass:
                theo_spec = theoretical_spectrum(p, peptide_type, aa_mass_table)
                if theo_spec == exp_spec:
                    final_peptides.append(p)
                removal_idxes.add(i)
            else:
                # Why get the theo spec of the linear version even if the peptide is cyclic? Think about what's
                # happening here. If the exp spec is for cyclic peptide NQYQ, and you're checking to see if the
                # candidate NQY should continue to be branched out...
                #
                # Exp spec  cyclic NQYQ: [0, 114, 128, 128, 163, 242, 242,      291, 291, 370, 405, 405, 419, 533]
                # Theo spec cyclic NQY:  [0, 114, 128,      163, 242,      277, 291,           405]
                #                                                           ^
                #                                                           |
                #                                                        mass(YN)
                #
                # Since NQY is being treated as a cyclic peptide, it has the subpeptide YN (mass of 277). However, the
                # cyclic peptide NQYQ doesn't have the subpeptide YN. That means NQY won't be branched out any further
                # even though it should. As such, even if the exp spec is for a cyclic peptide, treat the candidates as
                # linear segments of that cyclic peptide (essentially linear peptides).
                #
                # Exp spec  cyclic NQYQ: [0, 114, 128, 128, 163, 242, 242, 291, 291, 370, 405, 405, 419, 533]
                # Theo spec linear NQY:  [0, 114, 128,      163, 242,      291,           405]
                #
                # Given the specs above, the exp spec contains all masses in the theo spec.
                theo_spec = theoretical_spectrum(p, PeptideType.LINEAR, aa_mass_table)
                if not contains_all_sorted(theo_spec, exp_spec):
                    removal_idxes.add(i)
        candidate_peptides = [p for i, p in enumerate(candidate_peptides) if i not in removal_idxes]
    return final_peptides

The cyclic peptides matching the experimental spectrum [0.0, 114.0, 128.0, 128.0, 163.0, 242.0, 242.0, 291.0, 291.0, 370.0, 405.0, 405.0, 419.0, 533.0] are...
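
The contains_all_sorted helper isn't included in the excerpt above. A minimal sketch of the check it presumably performs, assuming both lists are sorted ascending and each theoretical spectrum mass must be matched by a distinct experimental spectrum mass (the actual helper in the repository may differ):

from typing import List

def contains_all_sorted(needles: List[float], haystack: List[float]) -> bool:
    i = 0
    for n in needles:
        # advance past haystack masses smaller than the current needle
        while i < len(haystack) and haystack[i] < n:
            i += 1
        # the needle must be present (and must not re-use a previously matched haystack mass)
        if i == len(haystack) or haystack[i] != n:
            return False
        i += 1
    return True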

⚠️NOTE️️️⚠️

The following section isn't from the Pevzner book or any online resources. I came up with it in an effort to solve the final assignment for Chapter 4 (the chapter on non-ribosomal peptide sequencing). As such, it might not be entirely correct / there may be better ways to do this.

The bounding step described above won't work for real experimental spectrums. For example, a real experimental spectrum may ...

A possible bounding step for real experimental spectrums is to mark a candidate peptide as untenable if it has a certain number or percentage of mismatches. This is a heuristic, meaning that it won't always lead to the correct peptide. In contrast, the algorithm described above for perfect experimental spectrums always leads to the correct peptide.

ch4_code/src/SequencePeptide_BranchAndBound.py (lines 14 to 78):

def sequence_peptide(
        exp_spec: List[float],                               # must be sorted asc
        aa_mass_table: Dict[AA, float],                      # amino acid mass table
        aa_mass_tolerance: float,                            # amino acid mass tolerance
        peptide_mass_candidates: List[Tuple[float, float]],  # mass range candidates for mass of peptide
        peptide_type: PeptideType,                           # linear or cyclic
        score_backlog: int,                                  # backlog of top scores
        candidate_threshold: float                           # if < 1 then min % match, else min count match
) -> SequenceTesterSet:
    tester_set = SequenceTesterSet(
        exp_spec,
        aa_mass_table,
        aa_mass_tolerance,
        peptide_mass_candidates,
        peptide_type,
        score_backlog
    )
    candidate_peptides = [[]]
    while len(candidate_peptides) > 0:
        # Branch candidates
        new_candidate_peptides = []
        for p in candidate_peptides:
            for m in aa_mass_table:
                new_p = p[:] + [m]
                new_candidate_peptides.append(new_p)
        candidate_peptides = new_candidate_peptides
        # Test candidates to see if they match exp_spec or if they should keep being branched
        removal_idxes = set()
        for i, p in enumerate(candidate_peptides):
            res = set(tester_set.test(p))
            if {TestResult.MASS_TOO_LARGE} == res:
                removal_idxes.add(i)
            else:
                # Why get the theo spec of the linear version even if the peptide is cyclic? Think about what's
                # happening here. If the exp spec is for cyclic peptide NQYQ, and you're checking to see if the
                # candidate NQY should continue to be branched out...
                #
                # Exp spec  cyclic NQYQ: [0, 114, 128, 128, 163, 242, 242,      291, 291, 370, 405, 405, 419, 533]
                # Theo spec cyclic NQY:  [0, 114, 128,      163, 242,      277, 291,           405]
                #                                                           ^
                #                                                           |
                #                                                        mass(YN)
                #
                # Since NQY is being treated as a cyclic peptide, it has the subpeptide YN (mass of 277). However, the
                # cyclic peptide NQYQ doesn't have the subpeptide YN. That means NQY won't be branched out any further
                # even though it should. As such, even if the exp spec is for a cyclic peptide, treat the candidates as
                # linear segments of that cyclic peptide (essentially linear peptides).
                #
                # Exp spec  cyclic NQYQ: [0, 114, 128, 128, 163, 242, 242, 291, 291, 370, 405, 405, 419, 533]
                # Theo spec linear NQY:  [0, 114, 128,      163, 242,      291,           405]
                #
                # Given the specs above, the exp spec contains all masses in the theo spec.
                theo_spec = SequenceTester.generate_theroetical_spectrum_with_tolerances(
                    p,
                    PeptideType.LINEAR,
                    aa_mass_table,
                    aa_mass_tolerance
                )
                score = score_spectrums(exp_spec, theo_spec)
                if (candidate_threshold < 1.0 and score[0] / len(theo_spec) < candidate_threshold)\
                        or score[0] < candidate_threshold:
                    removal_idxes.add(i)
        candidate_peptides = [p for i, p in enumerate(candidate_peptides) if i not in removal_idxes]
    return tester_set

⚠️NOTE️️️⚠️

The experimental spectrum in the example below is for the peptide 114-128-163, which has the theoretical spectrum [0, 114, 128, 163, 242, 291, 405].

Given the ...

Top 10 captured amino acid masses (rounded to 1): [114.0, 112.5, 115.8, 161.1, 162.9, 127.1, 130.4, 177.5]

For peptides between 397.0 and 411.0...

Leaderboard Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm is similar to the branch-and-bound algorithm, but the bounding step is slightly different: At each branch, rather than removing untenable candidate peptides, it only moves forward the best n scoring candidate peptides. These best scoring peptides are referred to as the leaderboard.

For a perfect experimental spectrum (no missing masses, no faulty masses, no noise, and preserved repeat masses), this algorithm isn't much different than the branch-and-bound algorithm. However, imagine if the perfect experimental spectrum wasn't exactly perfect in that it could have faulty masses and could be missing masses. In such a case, the branch-and-bound algorithm would always fail while this algorithm could still converge to the correct peptide -- it's a heuristic, meaning that it isn't guaranteed to lead to the correct peptide.

ch4_code/src/SequencePeptide_Naive_Leaderboard.py (lines 11 to 95):

def sequence_peptide(
        exp_spec: List[float],  # must be sorted
        peptide_type: PeptideType,
        peptide_mass: Optional[float],
        aa_mass_table: Dict[AA, float],
        leaderboard_size: int
) -> List[List[AA]]:
    # Exp_spec could be missing masses / have faulty masses, but even so assume the last mass in exp_spec is the peptide
    # mass if the user didn't supply one. This may not be correct -- it's a best guess.
    if peptide_mass is None:
        peptide_mass = exp_spec[-1]
    leaderboard = [[]]
    final_peptides = [next(iter(leaderboard))]
    final_score = score_spectrums(
        theoretical_spectrum(final_peptides[0], peptide_type, aa_mass_table),
        exp_spec
    )
    while len(leaderboard) > 0:
        # Branch leaderboard
        expanded_leaderboard = []
        for p in leaderboard:
            for m in aa_mass_table:
                new_p = p[:] + [m]
                expanded_leaderboard.append(new_p)
        # Pull out any expanded_leaderboard peptides with mass >= peptide_mass
        removal_idxes = set()
        for i, p in enumerate(expanded_leaderboard):
            p_mass = sum([aa_mass_table[aa] for aa in p])
            if p_mass == peptide_mass:
                # The peptide's mass is equal to the expected mass. Check its score against the current top score. If
                # it's ...
                #  * a higher score, reset the final peptides to it.
                #  * the same score, add it to the final peptides.
                theo_spec = theoretical_spectrum(p, peptide_type, aa_mass_table)
                score = score_spectrums(theo_spec, exp_spec)
                if score > final_score:
                    final_peptides = [p]
                    final_score = score_spectrums(
                        theoretical_spectrum(final_peptides[0], peptide_type, aa_mass_table),
                        exp_spec
                    )
                elif score == final_score:
                    final_peptides.append(p)
                # p should be removed at this point (the line below should be uncommented). Not removing it means that
                # it may end up in the leaderboard for the next cycle. If that happens, it'll get branched out into new
                # candidate peptides where each has an amino acid appended.
                #
                # The problem with branching p out further is that p's mass already matches the expected peptide mass.
                # Once p gets branched out, those branched out candidate peptides will have masses that EXCEED the
                # expected peptide mass, meaning they'll all get removed anyway. This would be fine, except that by
                # moving p into the leaderboard for the next cycle you're potentially preventing other viable
                # candidate peptides from making it in.
                #
                # So why isn't p being removed here (why was the line below commented out)? The questions on Stepik
                # expect no removal at this point. Uncommenting it will cause more peptides than are expected to show up
                # for some questions, meaning the answer will be rejected by Stepik.
                #
                # removal_idxes.add(i)
            elif p_mass > peptide_mass:
                # The peptide's mass exceeds the expected mass, meaning that there's no chance that this peptide can be
                # a match for exp_spec. Discard it.
                removal_idxes.add(i)
        expanded_leaderboard = [p for i, p in enumerate(expanded_leaderboard) if i not in removal_idxes]
        # Set leaderboard to the top n scoring peptides from expanded_leaderboard, but include peptides past n as long
        # as those peptides have a score equal to the nth peptide. The reason for this is that, because they score the
        # same, there's just as much of a chance that they'll end up as a winner as there is that the nth peptide will.
        # NOTE: Why get the theo spec of the linear version even if the peptide is cyclic? For similar reasons as to
        # why it's done in the branch-and-bound variant: If we treat candidate peptides as cyclic, their theo spec
        # will include masses for wrapping subpeptides of the candidate peptide. These wrapping subpeptide masses
        # may end up inadvertently matching masses in the experimental spectrum, meaning that the candidate may get
        # a better score than it should, potentially pushing it forward over other candidates that would ultimately
        # branch out to a more optimal final solution. As such, even if the exp spec is for a cyclic peptide, treat
        # the candidates as linear segments of that cyclic peptide (essentially linear peptides). If you're confused,
        # go see the comment in the branch-and-bound variant.
        theo_specs = [theoretical_spectrum(p, PeptideType.LINEAR, aa_mass_table) for p in expanded_leaderboard]
        scores = [score_spectrums(theo_spec, exp_spec) for theo_spec in theo_specs]
        scores_paired = sorted(zip(expanded_leaderboard, scores), key=lambda x: x[1], reverse=True)
        leaderboard_trim_to_size = len(expanded_leaderboard)
        for j in range(leaderboard_size + 1, len(scores_paired)):
            if scores_paired[leaderboard_size][1] > scores_paired[j][1]:
                leaderboard_trim_to_size = j - 1
                break
        leaderboard = [p for p, _ in scores_paired[:leaderboard_trim_to_size]]
    return final_peptides

⚠️NOTE️️️⚠️

The experimental spectrum in the example below is for the peptide NQYQ, which has the theoretical spectrum [0, 114, 128, 128, 163, 242, 242, 291, 291, 370, 405, 405, 419, 533].

The cyclic peptides matching the experimental spectrum [0.0, 114.0, 163.0, 242.0, 291.0, 370.0, 405.0, 419.0, 480.0, 533.0] with a leaderboard size of 10 are...

⚠️NOTE️️️⚠️

The following section isn't from the Pevzner book or any online resources. I came up with it in an effort to solve the final assignment for Chapter 4 (the chapter on non-ribosomal peptide sequencing). As such, it might not be entirely correct / there may be better ways to do this.

For real experimental spectrums, the algorithm is very similar to the real experimental spectrum version of the branch-and-bound algorithm. The only difference is the bounding heuristic: At each branch, rather than moving forward candidate peptides that meet a certain score threshold, move forward the best n scoring candidate peptides. These best scoring peptides are referred to as the leaderboard.

ch4_code/src/SequencePeptide_Leaderboard.py (lines 14 to 79):

def sequence_peptide(
        exp_spec: List[float],                               # must be sorted asc
        aa_mass_table: Dict[AA, float],                      # amino acid mass table
        aa_mass_tolerance: float,                            # amino acid mass tolerance
        peptide_mass_candidates: List[Tuple[float, float]],  # mass range candidates for mass of peptide
        peptide_type: PeptideType,                           # linear or cyclic
        score_backlog: int,                                  # backlog of top scores
        leaderboard_size: int,
        leaderboard_initial: List[List[AA]] = None           # bootstrap candidate peptides for leaderboard
) -> SequenceTesterSet:
    tester_set = SequenceTesterSet(
        exp_spec,
        aa_mass_table,
        aa_mass_tolerance,
        peptide_mass_candidates,
        peptide_type,
        score_backlog
    )
    if leaderboard_initial is None:
        leaderboard = [[]]
    else:
        leaderboard = leaderboard_initial[:]
    while len(leaderboard) > 0:
        # Branch candidates
        expanded_leaderboard = []
        for p in leaderboard:
            for m in aa_mass_table:
                new_p = p[:] + [m]
                expanded_leaderboard.append(new_p)
        # Test candidates to see if they match exp_spec or if they should keep being branched
        removal_idxes = set()
        for i, p in enumerate(expanded_leaderboard):
            res = set(tester_set.test(p))
            if {TestResult.MASS_TOO_LARGE} == res:
                removal_idxes.add(i)
        expanded_leaderboard = [p for i, p in enumerate(expanded_leaderboard) if i not in removal_idxes]
        # Set leaderboard to the top n scoring peptides from expanded_leaderboard, but include peptides past n as long
        # as those peptides have a score equal to the nth peptide. The reason for this is that, because they score the
        # same, there's just as much of a chance that they'll end up as the winner as there is that the nth peptide will.
        # NOTE: Why get the theo spec of the linear version even if the peptide is cyclic? For similar reasons as to
        # why it's done in the branch-and-bound variant: If we treat candidate peptides as cyclic, their theo spec
        # will include masses for wrapping subpeptides of the candidate peptide. These wrapping subpeptide masses
        # may end up inadvertently matching masses in the experimental spectrum, meaning that the candidate may get
        # a better score than it should, potentially pushing it forward over other candidates that would ultimately
        # branch out to a more optimal final solution. As such, even if the exp spec is for a cyclic peptide, treat
        # the candidates as linear segments of that cyclic peptide (essentially linear peptides).
        theo_specs = [
            SequenceTester.generate_theroetical_spectrum_with_tolerances(
                p,
                PeptideType.LINEAR,
                aa_mass_table,
                aa_mass_tolerance
            )
            for p in expanded_leaderboard
        ]
        scores = [score_spectrums(exp_spec, theo_spec) for theo_spec in theo_specs]
        scores_paired = sorted(zip(expanded_leaderboard, scores), key=lambda x: x[1], reverse=True)
        leaderboard_tail_idx = min(leaderboard_size, len(scores_paired)) - 1
        leaderboard_tail_score = 0 if leaderboard_tail_idx == -1 else scores_paired[leaderboard_tail_idx][1]
        for j in range(leaderboard_tail_idx + 1, len(scores_paired)):
            if scores_paired[j][1] < leaderboard_tail_score:
                leaderboard_tail_idx = j - 1
                break
        leaderboard = [p for p, _ in scores_paired[:leaderboard_tail_idx + 1]]
    return tester_set

⚠️NOTE️️️⚠️

The experimental spectrum in the example below is for the peptide 114-128-163, which has the theoretical spectrum [0, 114, 128, 163, 242, 291, 405].

Given the ...

Top 10 captured amino acid masses (rounded to 1): [114.0, 112.5, 115.8, 161.1, 162.9, 127.1, 130.4, 177.5]

For peptides between 397.0 and 411.0...

⚠️NOTE️️️⚠️

This was the version of the algorithm used to solve chapter 4's final assignment (sequence a real experimental spectrum for some unknown variant of Tyrocidine). Note how sequence_peptide takes an initial leaderboard parameter. This initial leaderboard was primed with subpeptide sequences from other Tyrocidine variants discussed in chapter 4. The problem wasn't solvable without these subpeptide sequences. More information on this can be found in the Python file for the final assignment.

Before coming up with the above solution, I tried another heuristic: use basic genetic / evolutionary algorithms to decide which peptides move forward. This performed even worse than the leaderboard: if the mutation rate is too low, the candidates converge to a local optimum and can't break out; if the mutation rate is too high, the candidates never converge to a solution. As such, it was removed from the code.

Sequence Alignment

Many core biology constructs are represented as sequences. For example, ...

Performing a sequence alignment on a set of sequences means to match up the elements of those sequences against each other using a set of basic operations:

There are many ways that a set of sequences can be aligned. For example, the sequences MAPLE and TABLE may be aligned by performing...

String 1 String 2 Operation
M - Insert/delete
- T Insert/delete
A A Keep matching
P B Replace
L L Keep matching
E E Keep matching

Or, MAPLE and TABLE may be aligned by performing...

String 1 String 2 Operation
M T Replace
A A Keep matching
P B Replace
L L Keep matching
E E Keep matching

Typically the highest scoring sequence alignment is the one that's chosen, where the score is some custom function that best represents the type of sequence being worked with (e.g. proteins are scored differently than DNA). In the example above, if replacements are scored better than indels, the latter alignment would be the highest scoring. Sequences that strongly align are thought of as being related / similar (e.g. proteins that came from the same parent but diverged to 2 separate evolutionary paths).
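
For example, under a hypothetical scoring scheme of +1 for a match, 0 for a replacement, and -1 for an indel, the first alignment scores -1 + -1 + 1 + 0 + 1 + 1 = 1 while the second scores 0 + 1 + 0 + 1 + 1 = 3, making the second alignment the one that's chosen.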

The names of these operations make more sense if you think of alignment as transformation instead. The first alignment in the example above, in the context of transforming MAPLE to TABLE, may be thought of as:

From To Operation Result
M - Delete M
- T Insert T T
A A Keep matching A TA
P B Replace P to B TAB
L L Keep matching L TABL
E E Keep matching E TABLE

The shorthand form of representing sequence alignments is to stack each sequence. The example above may be written as...

0 1 2 3 4 5
String 1 M - A P L E
String 2 - T A B L E

Typically, all possible sequence alignments are represented using an alignment graph: a graph that represents all possible alignments for a set of sequences. A path through an alignment graph from source node to sink node is called an alignment path: a path that represents one specific way the set of sequences may be aligned. For example, the alignment graph and alignment paths for the alignments above (MAPLE vs TABLE) ...

Kroki diagram output

The example above is just one of many sequence alignment types. There are different types of alignment graphs, applications of alignment graphs, and different scoring models used in bioinformatics.

⚠️NOTE️️️⚠️

The Pevzner book mentions a non-biology related problem to help illustrate alignment graphs: the Manhattan Tourist problem. Look it up if you're confused.

⚠️NOTE️️️⚠️

The Pevzner book, in a later chapter (ch7 -- phylogeny), spends an entire section talking about character tables and how they can be thought of as sequences (character vectors). There's no good place to put this information. It seems non-critical so the only place it exists is in the terminology section.

Find Maximum Path

WHAT: Given an arbitrary directed acyclic graph where each edge has a weight, find the path with the maximum weight between two nodes.

WHY: Finding a maximum path between nodes is fundamental to sequence alignments. That is, regardless of what type of sequence alignment is being performed, at its core it boils down to finding the maximum weight between two nodes in an alignment graph.

Bruteforce Algorithm

ALGORITHM:

This algorithm finds a maximum path using recursion. To calculate the maximum path between two nodes, iterate over each of the source node's children and calculate edge_weight + max_path(child, destination).weight. The iteration with the highest value is the one with the maximum path to the destination node.

This is too slow to be used on anything but small DAGs.

ch5_code/src/find_max_path/FindMaxPath_Bruteforce.py (lines 21 to 50):

def find_max_path(
        graph: Graph[N, ND, E, ED],
        current_node: N,
        end_node: N,
        get_edge_weight_func: GET_EDGE_WEIGHT_FUNC_TYPE
) -> Optional[Tuple[List[E], float]]:
    if current_node == end_node:
        return [], 0.0
    alternatives = []
    for edge_id in graph.get_outputs(current_node):
        edge_weight = get_edge_weight_func(edge_id)
        child_n = graph.get_edge_to(edge_id)
        res = find_max_path(
            graph,
            child_n,
            end_node,
            get_edge_weight_func
        )
        if res is None:
            continue
        path, weight = res
        path = [edge_id] + path
        weight = edge_weight + weight
        res = path, weight
        alternatives.append(res)
    if len(alternatives) == 0:
        return None  # no path to end, so return None
    else:
        return max(alternatives, key=lambda x: x[1])  # choose path to end with max weight

Given the following graph...

Dot diagram

... the path with the max weight between A and E ...

Cache Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm extends the bruteforce algorithm using dynamic programming: A technique that breaks down a problem into recursive sub-problems, where the result of each sub-problem is stored in some lookup table (cache) such that it can be re-used if that sub-problem were ever encountered again. The bruteforce algorithm already breaks down into recursive sub-problems. As such, the only change here is that the result of each sub-problem computation is cached such that it can be re-used if it were ever encountered again.

ch5_code/src/find_max_path/FindMaxPath_DPCache.py (lines 21 to 56):

def find_max_path(
        graph: Graph[N, ND, E, ED],
        current_node: N,
        end_node: N,
        cache: Dict[N, Optional[Tuple[List[E], float]]],
        get_edge_weight_func: GET_EDGE_WEIGHT_FUNC_TYPE
) -> Optional[Tuple[List[E], float]]:
    if current_node == end_node:
        return [], 0.0
    alternatives = []
    for edge_id in graph.get_outputs(current_node):
        edge_weight = get_edge_weight_func(edge_id)
        child_n = graph.get_edge_to(edge_id)
        if child_n in cache:
            res = cache[child_n]
        else:
            res = find_max_path(
                graph,
                child_n,
                end_node,
                cache,
                get_edge_weight_func
            )
            cache[child_n] = res
        if res is None:
            continue
        path, weight = res
        path = [edge_id] + path
        weight = edge_weight + weight
        res = path, weight
        alternatives.append(res)
    if len(alternatives) == 0:
        return None  # no path to end, so return None
    else:
        return max(alternatives, key=lambda x: x[1])  # choose path to end with max weight

Given the following graph...

Dot diagram

... the path with the max weight between A and E ...

Backtrack Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm is a better but less obvious dynamic programming approach. The previous dynamic programming algorithm builds a cache containing the maximum path from each node encountered to the destination node. This dynamic programming algorithm instead builds out a smaller cache from the source node fanning out one step at a time.

In this less obvious algorithm, there are edge weights just as before but each node also has a weight and a selected incoming edge. The DAG starts off with all node weights and incoming edge selections being unset. The source node has its weight set to 0. Then, for any node where all of its parents have a weight set, select the incoming edge where parent_weight + edge_weight is the highest. That highest parent_weight + edge_weight becomes the weight of that node and the edge responsible for it becomes the selected incoming edge (backtracking edge).

Repeat until all nodes have a weight and backtracking edge set.

For example, imagine the following DAG...

Kroki diagram output

Set source nodes to have a weight of 0...

Kroki diagram output

Then, iteratively set the weights and backtracking edges...

Kroki diagram output

⚠️NOTE️️️⚠️

This process is walking the DAG in topological order.

To find the path with the maximum weight, simply walk backward using the backtracking edges from the destination node to the source node. For example, in the DAG above the maximum path that ends at E can be determined by following the backtracking edges from E until A is reached...

The maximum path from A to E is A -> C -> E and the weight of that path is 4 (the weight of E).

This variant of the dynamic programming algorithm uses less memory than the first. For each node encountered, ...

ch5_code/src/find_max_path/FindMaxPath_DPBacktrack.py (lines 41 to 143):

def populate_weights_and_backtrack_pointers(
        g: Graph[N, ND, E, ED],
        from_node: N,
        set_node_data_func: SET_NODE_DATA_FUNC_TYPE,
        get_node_data_func: GET_NODE_DATA_FUNC_TYPE,
        get_edge_weight_func: GET_EDGE_WEIGHT_FUNC_TYPE
):
    processed_nodes = set()          # nodes where all parents have been processed AND it has been processed
    waiting_nodes = set()            # nodes where all parents have been processed BUT it has yet to be processed
    unprocessable_nodes = Counter()  # nodes that have some parents remaining to be processed (value=# of parents left)
    # For all root nodes, add to processed_nodes and set None weight and None backtracking edge.
    for node in g.get_nodes():
        if g.get_in_degree(node) == 0:
            set_node_data_func(node, None, None)
            processed_nodes |= {node}
    # For all root nodes, add any children whose parents have all been processed (i.e. are all roots) to waiting_nodes.
    for node in processed_nodes:
        for e in g.get_outputs(node):
            dst_node = g.get_edge_to(e)
            if {g.get_edge_from(e) for e in g.get_inputs(dst_node)}.issubset(processed_nodes):
                waiting_nodes |= {dst_node}
    # Make sure from_node is a root and set its weight to 0.
    assert from_node in processed_nodes
    set_node_data_func(from_node, 0.0, None)
    # Track how many remaining parents each node in the graph has. Note that the graph's root nodes were already marked
    # as processed above.
    for node in g.get_nodes():
        incoming_nodes = {g.get_edge_from(e) for e in g.get_inputs(node)}
        incoming_nodes -= processed_nodes
        unprocessable_nodes[node] = len(incoming_nodes)
    # Any nodes in waiting_nodes have had all their parents already processed (in processed_nodes). As such, they can
    # have their weights and backtracking pointers calculated. They can then be placed into processed_nodes themselves.
    while len(waiting_nodes) > 0:
        node = next(iter(waiting_nodes))
        incoming_nodes = {g.get_edge_from(e) for e in g.get_inputs(node)}
        if not incoming_nodes.issubset(processed_nodes):
            continue
        incoming_accum_weights = {}
        for edge in g.get_inputs(node):
            src_node = g.get_edge_from(edge)
            src_node_weight, _ = get_node_data_func(src_node)
            edge_weight = get_edge_weight_func(edge)
            # Roots that aren't from_node were initialized to a weight of None -- if you see them, skip them.
            if src_node_weight is not None:
                incoming_accum_weights[edge] = src_node_weight + edge_weight
        if len(incoming_accum_weights) == 0:
            max_edge = None
            max_weight = None
        else:
            max_edge = max(incoming_accum_weights, key=lambda e: incoming_accum_weights[e])
            max_weight = incoming_accum_weights[max_edge]
        set_node_data_func(node, max_weight, max_edge)
        # This node has been processed, move it over to processed_nodes.
        waiting_nodes.remove(node)
        processed_nodes.add(node)
        # For outgoing nodes this node points to, if that outgoing node has all of its dependencies in processed_nodes,
        # then add it to waiting_nodes (so it can be processed).
        outgoing_nodes = {g.get_edge_to(e) for e in g.get_outputs(node)}
        for output_node in outgoing_nodes:
            unprocessable_nodes[output_node] -= 1
            if unprocessable_nodes[output_node] == 0:
                waiting_nodes.add(output_node)


def backtrack(
        g: Graph[N, ND, E, ED],
        end_node: N,
        get_node_data_func: GET_NODE_DATA_FUNC_TYPE
) -> List[E]:
    next_node = end_node
    reverse_path = []
    while True:
        node = next_node
        weight, backtracking_edge = get_node_data_func(node)
        if backtracking_edge is None:
            break
        else:
            reverse_path.append(backtracking_edge)
        next_node = g.get_edge_from(backtracking_edge)
    return reverse_path[::-1]  # this is the path in reverse -- reverse it to get it in the correct order


def find_max_path(
        graph: Graph[N, ND, E, ED],
        start_node: N,
        end_node: N,
        set_node_data_func: SET_NODE_DATA_FUNC_TYPE,
        get_node_data_func: GET_NODE_DATA_FUNC_TYPE,
        get_edge_weight_func: GET_EDGE_WEIGHT_FUNC_TYPE
) -> Optional[Tuple[List[E], float]]:
    populate_weights_and_backtrack_pointers(
        graph,
        start_node,
        set_node_data_func,
        get_node_data_func,
        get_edge_weight_func
    )
    path = backtrack(graph, end_node, get_node_data_func)
    if not path:
        return None
    weight, _ = get_node_data_func(end_node)
    return path, weight

Given the following graph...

Dot diagram

... the path with the max weight between A and E ...

Dot diagram

The edges in blue signify the incoming edge that was selected for that node.

⚠️NOTE️️️⚠️

Note how ...

It's easy to flip this around by reversing the direction the algorithm walks.

Global Alignment

↩PREREQUISITES↩

WHAT: Given two sequences, perform sequence alignment and pull out the highest scoring alignment.

WHY: A strong global alignment indicates that the sequences are likely homologous / related.

Graph Algorithm

ALGORITHM:

Determining the best scoring pairwise alignment can be done by generating a DAG of all possible operations at all possible positions in each sequence. Specifically, each operation (indel, match, mismatch) is represented as an edge in the graph, where that edge has a weight. Operations with higher weights are more desirable operations compared to operations with lower weights (e.g. a match is typically more favourable than an indel).

For example, consider a DAG that pits FOUR against CHOIR...

Kroki diagram output

Given this graph, each ...

Latex diagram

NOTE: Each edge is labeled with the elements selected from the 1st sequence, 2nd sequence, and edge weight.

This graph is called an alignment graph. A path through the alignment graph from source (top-left) to sink (bottom-right) represents a single alignment, referred to as an alignment path. For example, the alignment path representing...

CH-OIR
--FOUR

... is as follows...

Latex diagram

NOTE: Each edge is labeled with the elements selected from the 1st sequence, 2nd sequence, and edge weight.

The weight of an alignment path is the sum of its operation weights. Since operations with higher weights are more desirable than those with lower weights, alignment paths with higher weights are more desirable than those with lower weights. As such, out of all the alignment paths possible, the one with the highest weight is the one with the most desirable set of operations.

The highlighted path in the example above has a weight of -1: -1 + -1 + -1 + 1 + 0 + 1.

ch5_code/src/global_alignment/GlobalAlignment_Graph.py (lines 37 to 78):

def create_global_alignment_graph(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], weight_lookup.lookup(*elems))
    )
    return graph


def global_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[str], List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    graph = create_global_alignment_graph(v, w, weight_lookup)
    from_node = (0, 0)
    to_node = (v_node_count - 1, w_node_count - 1)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    alignment = []
    for e in edges:
        ed = graph.get_edge_data(e)
        alignment.append((ed.v_elem, ed.w_elem))
    return final_weight, edges, alignment

Given the sequences TAAT and GAT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the global alignment is...

Latex diagram

TAAT
GA-T

Weight: 1.0

Matrix Algorithm

↩PREREQUISITES↩

ALGORITHM:

The following algorithm is essentially the same as the graph algorithm, except that the implementation is much more sympathetic to modern hardware. The alignment graph is represented as a 2D matrix where each element in the matrix represents a node in the alignment graph. The matrix elements are then filled in a predefined topological order, where each element gets the node weight, the chosen backtracking edge, and the sequence elements consumed by that backtracking edge.

Since the alignment graph is a grid, the node weights may be populated either...

In either case, the nodes being walked are guaranteed to have their parent node weights already set.

Kroki diagram output

ch5_code/src/global_alignment/GlobalAlignment_Matrix.py (lines 10 to 73):

def backtrack(
        node_matrix: List[List[Any]]
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_idx = len(node_matrix) - 1
    w_node_idx = len(node_matrix[0]) - 1
    final_weight = node_matrix[v_node_idx][w_node_idx][0]
    alignment = []
    while v_node_idx != 0 or w_node_idx != 0:
        _, elems, backtrack_ptr = node_matrix[v_node_idx][w_node_idx]
        if backtrack_ptr == '↓':
            v_node_idx -= 1
        elif backtrack_ptr == '→':
            w_node_idx -= 1
        elif backtrack_ptr == '↘':
            v_node_idx -= 1
            w_node_idx -= 1
        alignment.append(elems)
    return final_weight, alignment[::-1]


def global_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    node_matrix = []
    for v_node_idx in range(v_node_count):
        row = []
        for w_node_idx in range(w_node_count):
            row.append([-1.0, (None, None), '?'])
        node_matrix.append(row)
    node_matrix[0][0][0] = 0.0           # source node weight
    node_matrix[0][0][1] = (None, None)  # source node elements (elements don't matter for source node)
    node_matrix[0][0][2] = '↘'           # source node backtracking edge (direction doesn't matter for source node)
    for v_node_idx, w_node_idx in product(range(v_node_count), range(w_node_count)):
        parents = []
        if v_node_idx > 0 and w_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx - 1][0] + weight_lookup.lookup(v_elem, w_elem),
                (v_elem, w_elem),
                '↘'
            ])
        if v_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx][0] + weight_lookup.lookup(v_elem, None),
                (v_elem, None),
                '↓'
            ])
        if w_node_idx > 0:
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx][w_node_idx - 1][0] + weight_lookup.lookup(None, w_elem),
                (None, w_elem),
                '→'
            ])
        if parents:  # parents will be empty if v_node_idx and w_node_idx were both 0
            node_matrix[v_node_idx][w_node_idx] = max(parents, key=lambda x: x[0])
    return backtrack(node_matrix)

Given the sequences TATTATTAT and AAA and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the global alignment is...

TATTATTAT
-A--A--A-

Weight: -3.0

⚠️NOTE️️️⚠️

The standard Levenshtein distance algorithm using a 2D array that you may remember from over a decade ago is this algorithm: matrix-based global alignment where matches score 0 but mismatches and indels score -1. The final weight of the alignment is the minimum number of operations required to convert one sequence to another (e.g. substitute, insert, delete) -- it'll be negative, ignore the sign.
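To illustrate, here's a minimal self-contained sketch (mine, not from the repository) that computes Levenshtein distance exactly this way -- matrix-based global alignment with match=0 and mismatch/indel=-1 -- and negates the final weight...

# Minimal sketch: Levenshtein distance as global alignment where matches score 0
# and mismatches / indels score -1. The distance is the negated final weight.
def levenshtein(a: str, b: str) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    weights = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        weights[i][0] = weights[i - 1][0] - 1  # indel (delete a[i-1])
    for j in range(1, cols):
        weights[0][j] = weights[0][j - 1] - 1  # indel (insert b[j-1])
    for i in range(1, rows):
        for j in range(1, cols):
            weights[i][j] = max(
                weights[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else -1),  # match / mismatch
                weights[i - 1][j] - 1,  # indel
                weights[i][j - 1] - 1   # indel
            )
    return -int(weights[-1][-1])

print(levenshtein('kitten', 'sitting'))  # 3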

Divide-and-Conquer Algorithm

↩PREREQUISITES↩

ALGORITHM:

The following algorithm extends the matrix algorithm such that it can process much larger graphs at the expense of duplicating some computation work (trading time for space). It relies on two ideas.

Recall that in the matrix implementation of global alignment, node weights are populated in a pre-defined topological order (either row-by-row or column-by-column). Imagine that you've chosen to populate the matrix column-by-column.

The first idea is that, if all you care about is the final weight of the sink node, the matrix implementation technically only needs to keep 2 columns in memory: the column having its node weights populated as well as the previous column.

In other words, the only data needed to calculate the weights of the next column is the weights in the previous column.

Kroki diagram output
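As a minimal standalone sketch of this first idea (separate from the repository's ForwardSweeper shown below), the following computes just the sink node weight while only ever holding two columns in memory, assuming match=1, mismatch=0, indel=-1...

# Minimal sketch: compute the sink node weight of a global alignment graph while
# only keeping 2 columns in memory (previous column + column being populated).
# Assumed scoring: match=1, mismatch=0, indel=-1.
def sink_weight_two_columns(v: str, w: str) -> float:
    prev_col = [-r for r in range(len(w) + 1)]  # column for v index 0 (indels down w)
    for i in range(1, len(v) + 1):
        cur_col = [0.0] * (len(w) + 1)
        cur_col[0] = prev_col[0] - 1  # top row: indel consuming v[i-1]
        for j in range(1, len(w) + 1):
            cur_col[j] = max(
                prev_col[j] - 1,                                      # indel consuming v[i-1]
                cur_col[j - 1] - 1,                                   # indel consuming w[j-1]
                prev_col[j - 1] + (1 if v[i - 1] == w[j - 1] else 0)  # match / mismatch
            )
        prev_col = cur_col  # the older column is no longer needed
    return prev_col[-1]

print(sink_weight_two_columns('TACT', 'GACGT'))  # 2 -- same sink weight as the example below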

ch5_code/src/global_alignment/Global_ForwardSweeper.py (lines 9 to 51):

class ForwardSweeper:
    def __init__(self, v: List[ELEM], w: List[ELEM], weight_lookup: WeightLookup, col_backtrack: int = 2):
        self.v = v
        self.v_node_count = len(v) + 1
        self.w = w
        self.w_node_count = len(w) + 1
        self.weight_lookup = weight_lookup
        self.col_backtrack = col_backtrack
        self.matrix_v_start_idx = 0  # col
        self.matrix = []
        self._reset()

    def _reset(self):
        self.matrix_v_start_idx = 0  # col
        col = [-1.0] * self.w_node_count
        col[0] = 0.0  # source node weight is 0
        for w_idx in range(1, self.w_node_count):
            col[w_idx] = col[w_idx - 1] + self.weight_lookup.lookup(None, self.w[w_idx - 1])
        self.matrix = [col]

    def _step(self):
        next_col = [-1.0] * self.w_node_count
        next_v_idx = self.matrix_v_start_idx + len(self.matrix)
        if len(self.matrix) == self.col_backtrack:
            self.matrix.pop(0)
            self.matrix_v_start_idx += 1
        self.matrix += [next_col]
        self.matrix[-1][0] = self.matrix[-2][0] + self.weight_lookup.lookup(self.v[next_v_idx - 1], None)  # right penalty for first row of new col
        for w_idx in range(1, len(self.w) + 1):
            self.matrix[-1][w_idx] = max(
                self.matrix[-2][w_idx] + self.weight_lookup.lookup(self.v[next_v_idx - 1], None),                # right score (indel consuming v elem)
                self.matrix[-1][w_idx-1] + self.weight_lookup.lookup(None, self.w[w_idx - 1]),                   # down score (indel consuming w elem)
                self.matrix[-2][w_idx-1] + self.weight_lookup.lookup(self.v[next_v_idx - 1], self.w[w_idx - 1])  # diag score
            )

    def get_col(self, idx: int):
        if idx < self.matrix_v_start_idx:
            self._reset()
        furthest_stored_idx = self.matrix_v_start_idx + len(self.matrix) - 1
        for _ in range(furthest_stored_idx, idx):
            self._step()
        return list(self.matrix[idx - self.matrix_v_start_idx])

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the node weights are ...

   0.0  -1.0  -2.0  -3.0  -4.0
  -1.0   0.0  -1.0  -2.0  -3.0
  -2.0  -1.0   1.0   0.0  -1.0
  -3.0  -2.0   0.0   2.0   1.0
  -4.0  -3.0  -1.0   1.0   2.0
  -5.0  -3.0  -2.0   0.0   2.0

The sink node weight (maximum alignment path weight) is 2.0

The second idea is that, for a column, it's possible to find out which node in that column a maximum alignment path travels through without knowing that path beforehand.

Kroki diagram output

Knowing this, a divide-and-conquer algorithm may be used to find that maximum alignment path. Any alignment path must travel from the source node (top-left) to the sink node (bottom-right). If you're able to find a node between the source node and sink node that a maximum alignment path travels through, you can sub-divide the alignment graph into 2.

That is, if you know that a maximum alignment path travels through some node, it's guaranteed that...

Kroki diagram output

By recursively performing this operation, you can pull out all nodes that make up a maximum alignment path:

Finding the edges between these nodes yields the maximum alignment path. To find the edges between the node found at column n and the node found at column n + 1, isolate the alignment graph between those nodes and perform the standard matrix variant of global alignment. The graph will likely be very small, so the computation and space requirements will likely be very low.

ch5_code/src/global_alignment/GlobalAlignment_DivideAndConquer_NodeVariant.py (lines 11 to 40):

def find_max_alignment_path_nodes(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        buffer: List[Tuple[int, int]],
        v_offset: int = 0,
        w_offset: int = 0) -> None:
    if len(v) == 0 or len(w) == 0:
        return
    c, r = find_node_that_max_alignment_path_travels_through_at_middle_col(v, w, weight_lookup)
    find_max_alignment_path_nodes(v[:c-1], w[:r-1], weight_lookup, buffer, v_offset=v_offset, w_offset=w_offset)  # carry offsets through
    buffer.append((v_offset + c, w_offset + r))
    find_max_alignment_path_nodes(v[c:], w[r:], weight_lookup, buffer, v_offset=v_offset+c, w_offset=w_offset+r)


def global_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    nodes = [(0, 0)]
    find_max_alignment_path_nodes(v, w, weight_lookup, nodes)
    weight = 0.0
    alignment = []
    for (v_idx1, w_idx1), (v_idx2, w_idx2) in zip(nodes, nodes[1:]):
        sub_weight, sub_alignment = GlobalAlignment_Matrix.global_alignment(v[v_idx1:v_idx2], w[w_idx1:w_idx2], weight_lookup)
        weight += sub_weight
        alignment += sub_alignment
    return weight, alignment

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the global alignment is...

TAC-T
GACGT

Weight: 2.0

To understand how to find which node in a column a maximum alignment path travels through, consider what happens when edge directions are reversed in an alignment graph. When edge directions are reversed, the alignment graph essentially becomes the alignment graph for the reversed sequences. For example, reversing the edges for the alignment graph of SNACK and AJAX is essentially the same as the alignment graph for KCANS (reverse of SNACK) and XAJA (reverse of AJAX)...

Kroki diagram output

Between an alignment graph and its reversed edge variant, a maximum alignment path should travel through the same set of nodes. Notice how in the following example, ...

  1. the maximum alignment path in both alignment graphs has the same edges.

  2. the sink node weight in both alignment graphs is the same.

  3. for any node that the maximum alignment path travels through, taking the weight of that node from both alignment graphs and adding them together results in the sink node weight.

  4. for any node that the maximum alignment path DOES NOT travel through, taking the weight of that node from both alignment graphs and adding them together results in LESS THAN the sink node weight.

Latex diagram

Insights #3 and #4 in the list above are the key for this algorithm. Consider an alignment graph getting split down a column into two. The first half has edges traveling in the normal direction but the second half has its edges reversed...

Kroki diagram output

Populate node weights for both halves. Then, pair up half 1's last column with half 2's first column. For each row in the pair, add together the node weights in that row. The row with the maximum sum is for a node that a maximum alignment path travels through (insight #4 above). That maximum sum will always end up being the weight of the sink node in the original non-split alignment graph (insight #3 above).

Latex diagram

One way to think about what's happening above is that the algorithm is converging onto the same answer but at a different spot in the alignment graph (the same edge weights are being added). Normally the algorithm converges onto the bottom-right node of the alignment graph. If it were to instead converge on the column just before, the answer would be the same, but the node's position in that column may be different -- it may be any node that ultimately leads to the bottom-right node.

Given that there may be multiple maximum alignment paths for an alignment graph, there may be multiple nodes found per column. Each found node may be for a different maximum alignment path or the same maximum alignment path.

Ultimately, this entire process may be combined with the first idea (only the previous column is needed in memory to calculate the next column) such that the algorithm has much lower memory requirements. That is, to find the nodes in a column which maximum alignment paths travel through, the...

ch5_code/src/global_alignment/Global_SweepCombiner.py (lines 10 to 19):

class SweepCombiner:
    def __init__(self, v: List[ELEM], w: List[ELEM], weight_lookup: WeightLookup):
        self.forward_sweeper = ForwardSweeper(v, w, weight_lookup)
        self.reverse_sweeper = ReverseSweeper(v, w, weight_lookup)

    def get_col(self, idx: int):
        fcol = self.forward_sweeper.get_col(idx)
        rcol = self.reverse_sweeper.get_col(idx)
        return [a + b for a, b in zip(fcol, rcol)]

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the combined node weights at column 3 are ...

  -6.0
  -4.0
  -1.0
   2.0
   2.0
  -1.0
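As a standalone sanity check of the ideas above (my own sketch, independent of the ForwardSweeper / ReverseSweeper classes), the following computes full forward and reverse weight matrices for TACT vs GACGT under the same scoring and sums them at column 3. It should reproduce the column shown above, with the maximum (2) matching the sink node weight...

# Minimal sketch: verify the forward + reverse column combination for TACT vs GACGT.
# Assumed scoring: match=1, mismatch=0, indel=-1. Columns are indexed by v, rows by w.
def score(a, b):
    if a is None or b is None:
        return -1  # indel
    return 1 if a == b else 0  # match / mismatch

def forward_weights(v: str, w: str):
    cols, rows = len(v) + 1, len(w) + 1
    f = [[0] * rows for _ in range(cols)]
    for c in range(cols):
        for r in range(rows):
            if c == 0 and r == 0:
                continue
            candidates = []
            if c > 0:
                candidates.append(f[c - 1][r] + score(v[c - 1], None))
            if r > 0:
                candidates.append(f[c][r - 1] + score(None, w[r - 1]))
            if c > 0 and r > 0:
                candidates.append(f[c - 1][r - 1] + score(v[c - 1], w[r - 1]))
            f[c][r] = max(candidates)
    return f

v, w = 'TACT', 'GACGT'
fwd = forward_weights(v, w)
# Reverse weights: run the same sweep on the reversed sequences, then flip indices.
rev = forward_weights(v[::-1], w[::-1])
bwd = [[rev[len(v) - c][len(w) - r] for r in range(len(w) + 1)] for c in range(len(v) + 1)]
combined_col_3 = [fwd[3][r] + bwd[3][r] for r in range(len(w) + 1)]
print(combined_col_3)  # [-6, -4, -1, 2, 2, -1] -- max (2) is the sink node weight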

To recap, the full divide-and-conquer algorithm is as follows: For the middle column in an alignment graph, find a node that a maximum alignment path travels through. Then, sub-divide the alignment graph based on that node. Recursively repeat this process on each sub-division until you have a node from each column -- these are the nodes in a maximum alignment path. The edges between these found nodes can be determined by finding a maximum alignment path between each found node and its neighbouring found node. Concatenate these edges to construct the path.

ch5_code/src/global_alignment/Global_FindNodeThatMaxAlignmentPathTravelsThroughAtColumn.py (lines 10 to 29):

def find_node_that_max_alignment_path_travels_through_at_col(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        col: int
) -> Tuple[int, int]:
    col_vals = SweepCombiner(v, w, weight_lookup).get_col(col)
    row, _ = max(enumerate(col_vals), key=lambda x: x[1])
    return col, row


def find_node_that_max_alignment_path_travels_through_at_middle_col(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[int, int]:
    v_node_count = len(v) + 1
    middle_col_idx = v_node_count // 2
    return find_node_that_max_alignment_path_travels_through_at_col(v, w, weight_lookup, middle_col_idx)

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... a maximum alignment path is guaranteed to travel through (3, 3).

ch5_code/src/global_alignment/GlobalAlignment_DivideAndConquer_NodeVariant.py (lines 11 to 40):

def find_max_alignment_path_nodes(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        buffer: List[Tuple[int, int]],
        v_offset: int = 0,
        w_offset: int = 0) -> None:
    if len(v) == 0 or len(w) == 0:
        return
    c, r = find_node_that_max_alignment_path_travels_through_at_middle_col(v, w, weight_lookup)
    find_max_alignment_path_nodes(v[:c-1], w[:r-1], weight_lookup, buffer, v_offset=v_offset, w_offset=w_offset)  # carry offsets through
    buffer.append((v_offset + c, w_offset + r))
    find_max_alignment_path_nodes(v[c:], w[r:], weight_lookup, buffer, v_offset=v_offset+c, w_offset=w_offset+r)


def global_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    nodes = [(0, 0)]
    find_max_alignment_path_nodes(v, w, weight_lookup, nodes)
    weight = 0.0
    alignment = []
    for (v_idx1, w_idx1), (v_idx2, w_idx2) in zip(nodes, nodes[1:]):
        sub_weight, sub_alignment = GlobalAlignment_Matrix.global_alignment(v[v_idx1:v_idx2], w[w_idx1:w_idx2], weight_lookup)
        weight += sub_weight
        alignment += sub_alignment
    return weight, alignment

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the global alignment is...

TAC-T
GACGT

Weight: 2.0

A slightly more complicated but also more elegant / efficient solution is to extend the algorithm to find the edges for the nodes that it finds. In other words, rather than finding just nodes that maximum alignment paths travel through, find the edges where those nodes are the edge source (node that the edge starts from).

The algorithm finds all nodes that a maximum alignment path travels through at both column n and column n + 1. For a found node in column n, it's guaranteed that at least one of its immediate neighbours is also a found node. It may be that the node immediately to the ...

Of the immediate neighbours that are also found nodes, the one forming the edge with the highest weight is the edge that a maximum alignment path travels through.

ch5_code/src/global_alignment/Global_FindEdgeThatMaxAlignmentPathTravelsThroughAtColumn.py (lines 10 to 65):

def find_edge_that_max_alignment_path_travels_through_at_col(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        col: int
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    sc = SweepCombiner(v, w, weight_lookup)
    # Get max node in column -- max alignment path guaranteed to go through here.
    col_vals = sc.get_col(col)
    row, _ = max(enumerate(col_vals), key=lambda x: x[1])
    # Check node immediately to the right, down, right-down (diag) -- the ones with the max value MAY form the edge that
    # the max alignment path goes through. Recall that the max value will be the same max value as the one from col_vals
    # (weight of the final alignment path / sink node in the full alignment graph).
    #
    # Of the ones WITH the max value, check the weights formed by each edge. The one with the highest edge weight is the
    # edge that the max alignment path goes through (if there's more than 1, it means there's more than 1 max alignment
    # path -- one is picked at random).
    neighbours = []
    next_col_vals = sc.get_col(col + 1) if col + 1 < v_node_count else None  # very quick due to prev call to get_col()
    if col + 1 < v_node_count:
        right_weight = next_col_vals[row]
        right_node = (col + 1, row)
        v_elem = v[col]  # elem consumed by the outgoing right edge
        w_elem = None
        edge_weight = weight_lookup.lookup(v_elem, w_elem)
        neighbours += [(right_weight, edge_weight, right_node)]
    if row + 1 < w_node_count:
        down_weight = col_vals[row + 1]
        down_node = (col, row + 1)
        v_elem = None
        w_elem = w[row]  # elem consumed by the outgoing down edge
        edge_weight = weight_lookup.lookup(v_elem, w_elem)
        neighbours += [(down_weight, edge_weight, down_node)]
    if col + 1 < v_node_count and row + 1 < w_node_count:
        downright_weight = next_col_vals[row + 1]
        downright_node = (col + 1, row + 1)
        v_elem = v[col]  # elems consumed by the outgoing diagonal edge
        w_elem = w[row]
        edge_weight = weight_lookup.lookup(v_elem, w_elem)
        neighbours += [(downright_weight, edge_weight, downright_node)]
    neighbours.sort(key=lambda x: x[:2])  # sort by weight, then edge weight
    _, _, (col2, row2) = neighbours[-1]
    return (col, row), (col2, row2)


def find_edge_that_max_alignment_path_travels_through_at_middle_col(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    v_node_count = len(v) + 1
    middle_col_idx = (v_node_count - 1) // 2
    return find_edge_that_max_alignment_path_travels_through_at_col(v, w, weight_lookup, middle_col_idx)

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... a maximum alignment path is guaranteed to travel through the edge (3, 3), (3, 4).

The recursive sub-division process happens just as before, but this time with edges. Finding the maximum alignment path from edges provides two distinct advantages over the previous method of finding the maximum alignment path from nodes:

ch5_code/src/global_alignment/GlobalAlignment_DivideAndConquer_EdgeVariant.py (lines 10 to 80):

def find_max_alignment_path_edges(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        top: int,
        bottom: int,
        left: int,
        right: int,
        output: List[str]):
    if left == right:
        for i in range(top, bottom):
            output += ['↓']
        return
    if top == bottom:
        for i in range(left, right):
            output += ['→']
        return

    (col1, row1), (col2, row2) = find_edge_that_max_alignment_path_travels_through_at_middle_col(v[left:right], w[top:bottom], weight_lookup)
    middle_col = left + col1
    middle_row = top + row1
    find_max_alignment_path_edges(v, w, weight_lookup, top, middle_row, left, middle_col, output)
    if row1 + 1 == row2 and col1 + 1 == col2:
        edge_dir = '↘'
    elif row1 == row2 and col1 + 1 == col2:
        edge_dir = '→'
    elif row1 + 1 == row2 and col1 == col2:
        edge_dir = '↓'
    else:
        raise ValueError()
    if edge_dir == '→' or edge_dir == '↘':
        middle_col += 1
    if edge_dir == '↓' or edge_dir == '↘':
        middle_row += 1
    output += [edge_dir]
    find_max_alignment_path_edges(v, w, weight_lookup, middle_row, bottom, middle_col, right, output)


def global_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    edges = []
    find_max_alignment_path_edges(v, w, weight_lookup, 0, len(w), 0, len(v), edges)
    weight = 0.0
    alignment = []
    v_idx = 0
    w_idx = 0
    for edge in edges:
        if edge == '→':
            v_elem = v[v_idx]
            w_elem = None
            alignment.append((v_elem, w_elem))
            weight += weight_lookup.lookup(v_elem, w_elem)
            v_idx += 1
        elif edge == '↓':
            v_elem = None
            w_elem = w[w_idx]
            alignment.append((v_elem, w_elem))
            weight += weight_lookup.lookup(v_elem, w_elem)
            w_idx += 1
        elif edge == '↘':
            v_elem = v[v_idx]
            w_elem = w[w_idx]
            alignment.append((v_elem, w_elem))
            weight += weight_lookup.lookup(v_elem, w_elem)
            v_idx += 1
            w_idx += 1
    return weight, alignment

Given the sequences TACT and GACGT and the score matrix...

INDEL=-1.0
   A  C  T  G
A  1  0  0  0
C  0  1  0  0
T  0  0  1  0
G  0  0  0  1

... the global alignment is...

TAC-T
GACGT

Weight: 2.0

⚠️NOTE️️️⚠️

The other types of sequence alignment detailed in the sibling sections below don't implement a version of this algorithm. It's fairly straightforward to adapt this algorithm to support those sequence alignment types, but I didn't have the time to do it -- I almost completed a local alignment version but backed out. The same high-level logic applies to those other alignment types: converge on positions to find nodes/edges in the maximum alignment path and sub-divide on those positions.

Fitting Alignment

↩PREREQUISITES↩

WHAT: Given two sequences, for all possible substrings of the first sequence, pull out the highest scoring alignment between that substring and the second sequence.

In other words, find the substring within the first sequence that produces the highest scoring alignment with the second sequence. For example, given the sequences GGTTTTTAA and TTCTT, it may be that TTCTT (the second sequence) has the highest scoring alignment with TTTTT (a substring of the first sequence)...

TTC-TT
TT-TTT

WHY: Searching for a gene's sequence in some larger genome may be problematic because of mutation. The gene sequence being searched for may be slightly off from the gene sequence in the genome.

In the presence of minor mutations, a standard search will fail, whereas a fitting alignment will still be able to find that gene.
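As a trivial made-up illustration of that failure mode, a single point mutation is enough to defeat an exact substring search (the gene and genome strings below are arbitrary examples)...

# Made-up illustration: a single point mutation defeats exact substring search.
gene = 'ACGTTGCA'
genome = 'TTACGTTGCATT'          # contains the gene exactly
mutated_genome = 'TTACGATGCATT'  # same region, but one base mutated
print(gene in genome)            # True  -- exact search finds it
print(gene in mutated_genome)    # False -- exact search misses it; a fitting alignment would still score the region highly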

Graph Algorithm

↩PREREQUISITES↩

The graph algorithm for fitting alignment is an extension of the graph algorithm for global alignment. Construct the DAG as you would for global alignment, but for each node...

Latex diagram

NOTE: Orange edges are "free rides" from source / Purple edges are "free rides" to sink.

These newly added edges represent hops in the graph -- 0 weight "free rides" to other nodes. The weight of a node at the destination of one of these edges will never go below 0: when selecting a backtracking edge, the "free ride" edge will always be chosen over any edge that would make the node weight negative.

When finding a maximum alignment path, these "free rides" make it so that the path ...

such that if the first sequence is wedged somewhere within the second sequence, that maximum alignment path will be targeted in such a way that it homes in on it.

ch5_code/src/fitting_alignment/FittingAlignment_Graph.py (lines 37 to 95):

def create_fitting_alignment_graph(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], weight_lookup.lookup(*elems))
    )
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    source_node = 0, 0
    source_create_free_ride_edge_id_func = unique_id_generator('FREE_RIDE_SOURCE')
    for node in product([0], range(w_node_count)):
        if node == source_node:
            continue
        e = source_create_free_ride_edge_id_func()
        graph.insert_edge(e, source_node, node, EdgeData(None, None, 0.0))
    sink_node = v_node_count - 1, w_node_count - 1
    sink_create_free_ride_edge_id_func = unique_id_generator('FREE_RIDE_SINK')
    for node in product([v_node_count - 1], range(w_node_count)):
        if node == sink_node:
            continue
        e = sink_create_free_ride_edge_id_func()
        graph.insert_edge(e, node, sink_node, EdgeData(None, None, 0.0))
    return graph


def fitting_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[str], List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    graph = create_fitting_alignment_graph(v, w, weight_lookup)
    from_node = (0, 0)
    to_node = (v_node_count - 1, w_node_count - 1)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    alignment_edges = list(filter(lambda e: not e.startswith('FREE_RIDE'), edges))  # remove free rides from list
    alignment = []
    for e in alignment_edges:
        ed = graph.get_edge_data(e)
        alignment.append((ed.v_elem, ed.w_elem))
    return final_weight, edges, alignment

Given the sequences AGAC and TAAGAACT and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the fitting alignment is...

Latex diagram

AG-AC
AGAAC

Weight: 3.0

Matrix Algorithm

↩PREREQUISITES↩

ALGORITHM:

The following algorithm is an extension to global alignment's matrix algorithm to properly account for the "free ride" edges required by fitting alignment. It's essentially the same as the graph algorithm, except that the implementation is much more sympathetic to modern hardware.

ch5_code/src/fitting_alignment/FittingAlignment_Matrix.py (lines 10 to 93):

def backtrack(
        node_matrix: List[List[Any]]
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_idx = len(node_matrix) - 1
    w_node_idx = len(node_matrix[0]) - 1
    final_weight = node_matrix[v_node_idx][w_node_idx][0]
    alignment = []
    while v_node_idx != 0 or w_node_idx != 0:
        _, elems, backtrack_ptr = node_matrix[v_node_idx][w_node_idx]
        if backtrack_ptr == '↓':
            v_node_idx -= 1
            alignment.append(elems)
        elif backtrack_ptr == '→':
            w_node_idx -= 1
            alignment.append(elems)
        elif backtrack_ptr == '↘':
            v_node_idx -= 1
            w_node_idx -= 1
            alignment.append(elems)
        elif isinstance(backtrack_ptr, tuple):
            v_node_idx = backtrack_ptr[0]
            w_node_idx = backtrack_ptr[1]
    return final_weight, alignment[::-1]


def fitting_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    node_matrix = []
    for v_node_idx in range(v_node_count):
        row = []
        for w_node_idx in range(w_node_count):
            row.append([-1.0, (None, None), '?'])
        node_matrix.append(row)
    node_matrix[0][0][0] = 0.0           # source node weight
    node_matrix[0][0][1] = (None, None)  # source node elements (elements don't matter for source node)
    node_matrix[0][0][2] = '↘'           # source node backtracking edge (direction doesn't matter for source node)
    for v_node_idx, w_node_idx in product(range(v_node_count), range(w_node_count)):
        parents = []
        if v_node_idx > 0 and w_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx - 1][0] + weight_lookup.lookup(v_elem, w_elem),
                (v_elem, w_elem),
                '↘'
            ])
        if v_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx][0] + weight_lookup.lookup(v_elem, None),
                (v_elem, None),
                '↓'
            ])
        if w_node_idx > 0:
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx][w_node_idx - 1][0] + weight_lookup.lookup(None, w_elem),
                (None, w_elem),
                '→'
            ])
        # If first column but not source node, consider free-ride from source node
        if v_node_idx == 0 and w_node_idx != 0:
            parents.append([
                0.0,
                (None, None),
                (0, 0)  # jump to source
            ])
        # If sink node, consider free-rides coming from every node in last column that isn't sink node
        if v_node_idx == v_node_count - 1 and w_node_idx == w_node_count - 1:
            for w_node_idx_from in range(w_node_count - 1):
                parents.append([
                    node_matrix[v_node_idx][w_node_idx_from][0] + 0.0,
                    (None, None),
                    (v_node_idx, w_node_idx_from)  # jump to this position
                ])
        if parents:  # parents will be empty if v_node_idx and w_node_idx were both 0
            node_matrix[v_node_idx][w_node_idx] = max(parents, key=lambda x: x[0])
    return backtrack(node_matrix)

Given the sequences AGAC and TAAGAACT and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the fitting alignment is...

AG-AC
AGAAC

Weight: 3.0

Overlap Alignment

↩PREREQUISITES↩

WHAT: Given two sequences, for all possible substrings that ...

... , pull out the highest scoring alignment.

In other words, find the overlap between the two sequences that produces the highest scoring alignment. For example, given the sequences CCAAGGCT and GGTTTTTAA, it may be that the substrings with the highest scoring alignment are GGCT (tail of the first sequence) and GGT (head of the second sequence)...

GGCT
GG-T

WHY: DNA sequencers frequently produce fragments with sequencing errors. Overlap alignments may be used to detect if those fragments overlap even in the presence of sequencing errors and minor mutations, making assembly less tedious (overlap graphs / de Bruijn graphs may turn out less tangled).

Graph Algorithm

↩PREREQUISITES↩

The graph algorithm for overlap alignment is an extension of the graph algorithm for global alignment. Construct the DAG as you would for global alignment, but for each node...

Latex diagram

NOTE: Orange edges are "free rides" from source / Purple edges are "free rides" to sink.

These newly added edges represent hops in the graph -- 0 weight "free rides" to other nodes. The weight of a node at the destination of one of these edges will never go below 0: when selecting a backtracking edge, the "free ride" edge will always be chosen over any edge that would make the node weight negative.

When finding a maximum alignment path, these "free rides" make it so that the path ...

such that if there is a matching overlap between the sequences, that maximum alignment path will be targeted in such a way that maximizes that overlap.

ch5_code/src/overlap_alignment/OverlapAlignment_Graph.py (lines 37 to 95):

def create_overlap_alignment_graph(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], weight_lookup.lookup(*elems))
    )
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    source_node = 0, 0
    source_create_free_ride_edge_id_func = unique_id_generator('FREE_RIDE_SOURCE')
    for node in product([0], range(w_node_count)):
        if node == source_node:
            continue
        e = source_create_free_ride_edge_id_func()
        graph.insert_edge(e, source_node, node, EdgeData(None, None, 0.0))
    sink_node = v_node_count - 1, w_node_count - 1
    sink_create_free_ride_edge_id_func = unique_id_generator('FREE_RIDE_SINK')
    for node in product(range(v_node_count), [w_node_count - 1]):
        if node == sink_node:
            continue
        e = sink_create_free_ride_edge_id_func()
        graph.insert_edge(e, node, sink_node, EdgeData(None, None, 0.0))
    return graph


def overlap_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[str], List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    graph = create_overlap_alignment_graph(v, w, weight_lookup)
    from_node = (0, 0)
    to_node = (v_node_count - 1, w_node_count - 1)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    alignment_edges = list(filter(lambda e: not e.startswith('FREE_RIDE'), edges))  # remove free rides from list
    alignment = []
    for e in alignment_edges:
        ed = graph.get_edge_data(e)
        alignment.append((ed.v_elem, ed.w_elem))
    return final_weight, edges, alignment

Given the sequences AGACAAAT and GGGGAAAC and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the overlap alignment is...

Latex diagram

AGAC
A-AC

Weight: 2.0

Matrix Algorithm

↩PREREQUISITES↩

ALGORITHM:

The following algorithm is an extension to global alignment's matrix algorithm to properly account for the "free ride" edges required by overlap alignment. It's essentially the same as the graph algorithm, except that the implementation is much more sympathetic to modern hardware.

ch5_code/src/overlap_alignment/OverlapAlignment_Matrix.py (lines 10 to 93):

def backtrack(
        node_matrix: List[List[Any]]
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_idx = len(node_matrix) - 1
    w_node_idx = len(node_matrix[0]) - 1
    final_weight = node_matrix[v_node_idx][w_node_idx][0]
    alignment = []
    while v_node_idx != 0 or w_node_idx != 0:
        _, elems, backtrack_ptr = node_matrix[v_node_idx][w_node_idx]
        if backtrack_ptr == '↓':
            v_node_idx -= 1
            alignment.append(elems)
        elif backtrack_ptr == '→':
            w_node_idx -= 1
            alignment.append(elems)
        elif backtrack_ptr == '↘':
            v_node_idx -= 1
            w_node_idx -= 1
            alignment.append(elems)
        elif isinstance(backtrack_ptr, tuple):
            v_node_idx = backtrack_ptr[0]
            w_node_idx = backtrack_ptr[1]
    return final_weight, alignment[::-1]


def overlap_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    node_matrix = []
    for v_node_idx in range(v_node_count):
        row = []
        for w_node_idx in range(w_node_count):
            row.append([-1.0, (None, None), '?'])
        node_matrix.append(row)
    node_matrix[0][0][0] = 0.0           # source node weight
    node_matrix[0][0][1] = (None, None)  # source node elements (elements don't matter for source node)
    node_matrix[0][0][2] = '↘'           # source node backtracking edge (direction doesn't matter for source node)
    for v_node_idx, w_node_idx in product(range(v_node_count), range(w_node_count)):
        parents = []
        if v_node_idx > 0 and w_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx - 1][0] + weight_lookup.lookup(v_elem, w_elem),
                (v_elem, w_elem),
                '↘'
            ])
        if v_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx][0] + weight_lookup.lookup(v_elem, None),
                (v_elem, None),
                '↓'
            ])
        if w_node_idx > 0:
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx][w_node_idx - 1][0] + weight_lookup.lookup(None, w_elem),
                (None, w_elem),
                '→'
            ])
        # If first column but not source node, consider free-ride from source node
        if v_node_idx == 0 and w_node_idx != 0:
            parents.append([
                0.0,
                (None, None),
                (0, 0)  # jump to source
            ])
        # If sink node, consider free-rides coming from every node in last row that isn't sink node
        if v_node_idx == v_node_count - 1 and w_node_idx == w_node_count - 1:
            for v_node_idx_from in range(v_node_count - 1):
                parents.append([
                    node_matrix[v_node_idx_from][w_node_idx][0] + 0.0,
                    (None, None),
                    (v_node_idx_from, w_node_idx)  # jump to this position
                ])
        if parents:  # parents will be empty if v_node_idx and w_node_idx were both 0
            node_matrix[v_node_idx][w_node_idx] = max(parents, key=lambda x: x[0])
    return backtrack(node_matrix)

Given the sequences AGACAAAT and GGGGAAAC and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the overlap alignment is...

AGAC
AAAC

Weight: 2.0

Local Alignment

↩PREREQUISITES↩

WHAT: Given two sequences, for all possible substrings of those sequences, pull out the highest scoring alignment. For example, given the sequences GGTTTTTAA and CCTTCTTAA, it may be that the substrings with the highest scoring alignment are TTTTT (substring of first sequence) and TTCTT (substring of second sequence) ...

TTC-TT
TT-TTT

WHY: Two biological sequences may have strongly related parts rather than being strongly related in their entirety. For example, a class of proteins called NRP synthetase creates peptides without going through a ribosome (non-ribosomal peptides). Each NRP synthetase outputs a specific peptide, where each amino acid in that peptide is pumped out by the unique part of the NRP synthetase responsible for it.

These unique parts are referred to as adenylation domains (one adenylation domain per amino acid in the created peptide). While the overall sequences of two types of NRP synthetase differ greatly, the sequences of their adenylation domains are similar -- only a handful of positions in an adenylation domain sequence define the type of amino acid it pumps out. As such, local alignment may be used to identify these adenylation domains across different types of NRP synthetase.

⚠️NOTE️️️⚠️

The WHY section above is giving a high-level use-case for local alignment. If you actually want to perform that use-case you need to get familiar with the protein scoring section: Algorithms/Sequence Alignment/Protein Scoring.

Graph Algorithm

↩PREREQUISITES↩

ALGORITHM:

The graph algorithm for local alignment is an extension of the graph algorithm for global alignment. Construct the DAG as you would for global alignment, but for each node...

Latex diagram

NOTE: Orange edges are "free rides" from source / Purple edges are "free rides" to sink.

These newly added edges represent hops in the graph -- 0 weight "free rides" to other nodes. The weight of a node at the destination of one of these edges will never go below 0: when selecting a backtracking edge, the "free ride" edge will always be chosen over any edge that would make the node weight negative.

When finding a maximum alignment path, these "free rides" make it so that if either the...

The maximum alignment path will be targeted in such a way that it homes in on the substring within each sequence that produces the highest scoring alignment.

ch5_code/src/local_alignment/LocalAlignment_Graph.py (lines 38 to 96):

def create_local_alignment_graph(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], weight_lookup.lookup(*elems))
    )
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    source_node = 0, 0
    source_create_free_ride_edge_id_func = unique_id_generator('FREE_RIDE_SOURCE')
    for node in product(range(v_node_count), range(w_node_count)):
        if node == source_node:
            continue
        e = source_create_free_ride_edge_id_func()
        graph.insert_edge(e, source_node, node, EdgeData(None, None, 0.0))
    sink_node = v_node_count - 1, w_node_count - 1
    sink_create_free_ride_edge_id_func = unique_id_generator('FREE_RIDE_SINK')
    for node in product(range(v_node_count), range(w_node_count)):
        if node == sink_node:
            continue
        e = sink_create_free_ride_edge_id_func()
        graph.insert_edge(e, node, sink_node, EdgeData(None, None, 0.0))
    return graph


def local_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[str], List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    graph = create_local_alignment_graph(v, w, weight_lookup)
    from_node = (0, 0)
    to_node = (v_node_count - 1, w_node_count - 1)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    alignment_edges = list(filter(lambda e: not e.startswith('FREE_RIDE'), edges))  # remove free rides from list
    alignment = []
    for e in alignment_edges:
        ed = graph.get_edge_data(e)
        alignment.append((ed.v_elem, ed.w_elem))
    return final_weight, edges, alignment

Given the sequences TAGAACT and CGAAG and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the local alignment is...

Latex diagram

GAA
GAA

Weight: 3.0

Matrix Algorithm

↩PREREQUISITES↩

ALGORITHM:

The following algorithm is an extension to global alignment's matrix algorithm to properly account for the "free ride" edges required by local alignment. It's essentially the same as the graph algorithm, except that the implementation is much more sympathetic to modern hardware.

ch5_code/src/local_alignment/LocalAlignment_Matrix.py (lines 10 to 95):

def backtrack(
        node_matrix: List[List[Any]]
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_idx = len(node_matrix) - 1
    w_node_idx = len(node_matrix[0]) - 1
    final_weight = node_matrix[v_node_idx][w_node_idx][0]
    alignment = []
    while v_node_idx != 0 or w_node_idx != 0:
        _, elems, backtrack_ptr = node_matrix[v_node_idx][w_node_idx]
        if backtrack_ptr == '↓':
            v_node_idx -= 1
            alignment.append(elems)
        elif backtrack_ptr == '→':
            w_node_idx -= 1
            alignment.append(elems)
        elif backtrack_ptr == '↘':
            v_node_idx -= 1
            w_node_idx -= 1
            alignment.append(elems)
        elif isinstance(backtrack_ptr, tuple):
            v_node_idx = backtrack_ptr[0]
            w_node_idx = backtrack_ptr[1]
    return final_weight, alignment[::-1]


def local_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    node_matrix = []
    for v_node_idx in range(v_node_count):
        row = []
        for w_node_idx in range(w_node_count):
            row.append([-1.0, (None, None), '?'])
        node_matrix.append(row)
    node_matrix[0][0][0] = 0.0           # source node weight
    node_matrix[0][0][1] = (None, None)  # source node elements (elements don't matter for source node)
    node_matrix[0][0][2] = '↘'           # source node backtracking edge (direction doesn't matter for source node)
    for v_node_idx, w_node_idx in product(range(v_node_count), range(w_node_count)):
        parents = []
        if v_node_idx > 0 and w_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx - 1][0] + weight_lookup.lookup(v_elem, w_elem),
                (v_elem, w_elem),
                '↘'
            ])
        if v_node_idx > 0:
            v_elem = v[v_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx - 1][w_node_idx][0] + weight_lookup.lookup(v_elem, None),
                (v_elem, None),
                '↓'
            ])
        if w_node_idx > 0:
            w_elem = w[w_node_idx - 1]
            parents.append([
                node_matrix[v_node_idx][w_node_idx - 1][0] + weight_lookup.lookup(None, w_elem),
                (None, w_elem),
                '→'
            ])
        # If not source node, consider free-ride from source node
        if v_node_idx != 0 or w_node_idx != 0:
            parents.append([
                0.0,
                (None, None),
                (0, 0)  # jump to source
            ])
        # If sink node, consider free-rides coming from every node that isn't sink node
        if v_node_idx == v_node_count - 1 and w_node_idx == w_node_count - 1:
            for v_node_idx_from, w_node_idx_from in product(range(v_node_count), range(w_node_count)):
                if v_node_idx_from == v_node_count - 1 and w_node_idx_from == w_node_count - 1:
                    continue
                parents.append([
                    node_matrix[v_node_idx_from][w_node_idx_from][0] + 0.0,
                    (None, None),
                    (v_node_idx_from, w_node_idx_from)  # jump to this position
                ])
        if parents:  # parents will be empty if v_node_idx and w_node_idx were both 0
            node_matrix[v_node_idx][w_node_idx] = max(parents, key=lambda x: x[0])
    return backtrack(node_matrix)

Given the sequences TAGAACT and CGAAG and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the local alignment is...

GAA
GAA

Weight: 3.0

Protein Scoring

↩PREREQUISITES↩

WHAT: Given a pair of protein sequences, score those sequences against each other based on the similarity of the amino acids. In this case, similarity is defined as how probable it is that one amino acid mutates to the other while still having the protein remain functional.

WHY: Before performing a pair-wise sequence alignment, there needs to be some baseline for how elements within those sequences measure up against each other. In the simplest case, elements are compared using equality: matching elements score 1, while mismatches or indels score 0. However, there are many other cases where element equality isn't a good measure.

Protein sequences are one such case. Biological sequences such as proteins and DNA undergo mutation. Two proteins may be very closely related (e.g. evolved from the same parent protein, perform the same function, etc.) but their sequences may have mutated to a point where they appear wildly different. To appropriately align protein sequences, amino acid mutation probabilities need to be derived and factored into scoring. For example, there may be good odds that some random protein would still continue to function as-is if some of its Y amino acids were swapped with F.

PAM Scoring Matrix

Point accepted mutation (PAM) is a scoring matrix used for sequence alignments of proteins. The scoring matrix is calculated by inspecting / extrapolating mutations as homologous proteins evolve. Specifically, mutations in the DNA sequence that encode some protein may change the resulting amino acid sequence for that protein. Those mutations that...

PAM matrices are developed iteratively. An initial PAM matrix is calculated by aligning extremely similar protein sequences using a simple scoring model (1 for match, 0 for mismatch / indel). That sequence alignment then provides the scoring model for the next iteration. For example, the alignment for the initial iteration may have determined that D may be a suitable substitution for W. As such, the sequence alignment for the next iteration will score more than 0 (e.g. 1) if it encounters D being compared to W.

Other factors are also brought into the mix when developing scores for PAM matrices. For example, the ...

It's said that PAM is focused on tracking the evolutionary origins of proteins. Sequences that are 99% similar are said to be 1 PAM unit diverged, where a PAM unit is the amount of time it takes an "average" protein to mutate 1% of its amino acids. PAM1 (the initial scoring matrix) was defined by performing many sequence alignments between proteins that are 99% similar (1 PAM unit diverged).

⚠️NOTE️️️⚠️

Here and here both seem to say that BLOSUM supersedes PAM as a scoring matrix for protein sequences.

Although both matrices produce similar scoring outcomes, they were generated using differing methodologies. The BLOSUM matrices were generated directly from the amino acid differences in aligned blocks that have diverged to varying degrees, whereas the PAM matrices reflect the extrapolation of evolutionary information based on closely related sequences to longer timescales.

Henikoff and Henikoff [16] have compared the BLOSUM matrices to PAM, PET, Overington, Gonnet [17] and multiple PAM matrices by evaluating how effectively the matrices can detect known members of a protein family from a database when searching with the ungapped local alignment program BLAST [18]. They conclude that overall the BLOSUM 62 matrix is the most effective.

PAM250 is the most commonly used variant:

ch5_code/src/PAM250.txt (lines 2 to 22):

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  2 -2  0  0 -3  1 -1 -1 -1 -2 -1  0  1  0 -2  1  1  0 -6 -3
C -2 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4  0 -2 -2 -8  0
D  0 -5  4  3 -6  1  1 -2  0 -4 -3  2 -1  2 -1  0  0 -2 -7 -4
E  0 -5  3  4 -5  0  1 -2  0 -3 -2  1 -1  2 -1  0  0 -2 -7 -4
F -3 -4 -6 -5  9 -5 -2  1 -5  2  0 -3 -5 -5 -4 -3 -3 -1  0  7
G  1 -3  1  0 -5  5 -2 -3 -2 -4 -3  0  0 -1 -3  1  0 -1 -7 -5
H -1 -3  1  1 -2 -2  6 -2  0 -2 -2  2  0  3  2 -1 -1 -2 -3  0
I -1 -2 -2 -2  1 -3 -2  5 -2  2  2 -2 -2 -2 -2 -1  0  4 -5 -1
K -1 -5  0  0 -5 -2  0 -2  5 -3  0  1 -1  1  3  0  0 -2 -3 -4
L -2 -6 -4 -3  2 -4 -2  2 -3  6  4 -3 -3 -2 -3 -3 -2  2 -2 -1
M -1 -5 -3 -2  0 -3 -2  2  0  4  6 -2 -2 -1  0 -2 -1  2 -4 -2
N  0 -4  2  1 -3  0  2 -2  1 -3 -2  2  0  1  0  1  0 -2 -4 -2
P  1 -3 -1 -1 -5  0  0 -2 -1 -3 -2  0  6  0  0  1  0 -1 -6 -5
Q  0 -5  2  2 -5 -1  3 -2  1 -2 -1  1  0  4  1 -1 -1 -2 -5 -4
R -2 -4 -1 -1 -4 -3  2 -2  3 -3  0  0  0  1  6  0 -1 -2  2 -4
S  1  0  0  0 -3  1 -1 -1  0 -3 -2  1  1 -1  0  2  1 -1 -2 -3
T  1 -2  0  0 -3  0 -1  0  0 -2 -1  0  0 -1 -1  1  3  0 -5 -3
V  0 -2 -2 -2 -1 -1 -2  4 -2  2  2 -2 -1 -2 -2 -1  0  4 -6 -2
W -6 -8 -7 -7  0 -7 -3 -5 -3 -2 -4 -4 -6 -5  2 -2 -5 -6 17  0
Y -3  0 -4 -4  7 -5  0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2  0 10

⚠️NOTE️️️⚠️

The above matrix was supplied by the Pevzner book. You can find it online here, but the indel scores on that website are set to -8, whereas in the Pevzner book I've seen them set to -5. I don't know which is correct. I don't know if PAM250 defines a constant for indels.
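As a minimal sketch of how a whitespace-delimited scoring matrix like the one above can be loaded and used (my own code -- the 4x4 excerpt and the -5.0 indel constant below are assumptions for illustration, not something PAM250 prescribes)...

# Minimal sketch: parse a whitespace-delimited scoring matrix (first row and first
# column are amino acid labels) into a dict and look up a substitution score.
# The indel score is a made-up assumption here, not something PAM250 prescribes.
PAM250_SNIPPET = """\
   A  C  F  Y
A  2 -2 -3 -3
C -2 12 -4  0
F -3 -4  9  7
Y -3  0  7 10
"""

def parse_scoring_matrix(text: str, indel: float):
    lines = text.strip().split('\n')
    col_labels = lines[0].split()
    scores = {}
    for line in lines[1:]:
        row_label, *values = line.split()
        for col_label, value in zip(col_labels, values):
            scores[row_label, col_label] = float(value)
    for aa in col_labels:  # gap scores (assumed constant indel penalty)
        scores[aa, None] = indel
        scores[None, aa] = indel
    return scores

scores = parse_scoring_matrix(PAM250_SNIPPET, indel=-5.0)
print(scores['Y', 'F'])   # 7.0 -- Y <-> F substitutions score well
print(scores['A', None])  # -5.0 -- assumed indel score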

BLOSUM Scoring Matrix

Blocks amino acid substitution matrix (BLOSUM) is a scoring matrix used for sequence alignments of proteins. The scoring matrix is calculated by scanning a protein database for highly conserved regions between similar proteins, where the mutations between those highly conserved regions define the scores. Specifically, those highly conserved regions are identified based on local alignments without support for indels (gaps not allowed). Non-matching positions in that alignment define potentially acceptable mutations.

Several sets of BLOSUM matrices exist, each identified by a different number. This number defines the similarity of the sequences used to create the matrix: The protein database sequences used to derive the matrix are filtered such that only those with >= n% similarity are used, where n is the number. For example, ...

As such, BLOSUM matrices with higher numbers are designed to compare more closely related sequences, while those with lower numbers are designed to score more distantly related sequences.

BLOSUM62 is the most commonly used variant since "experimentation has shown that it's among the best for detecting weak similarities":

ch5_code/src/BLOSUM62.txt (lines 2 to 22):

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  4  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -2
C  0  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
D -2 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -3
E -1 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -2
F -2 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1  3
G  0 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -3
H -2 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2  2
I -1 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1
K -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -2
L -1 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1
M -1 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1
N -2 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -2
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -3
Q -1 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1
R -1 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -2
S  1 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -2
T  0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -2
V  0 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1
W -3 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11  2
Y -2 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2  7

⚠️NOTE️️️⚠️

The above matrix was supplied by the Pevzner book. You can find it online here, but the indel scores on that website are set to -4 whereas in the Pevzner book I've seen them set to -5. I don't know which is correct. I don't know if BLOSUM62 defines a constant for indels.
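
To make the matrices above concrete, here's a minimal sketch (not the repo's actual WeightLookup code) that parses a whitespace-delimited scoring table like the PAM250 / BLOSUM62 excerpts above into a dictionary and looks up a substitution score. The indel constant is a placeholder, since (as noted above) it's unclear which indel score these matrices are supposed to use.

def parse_scoring_table(text: str, indel_score: float = -5.0) -> dict:
    # first non-blank line is the column header (amino acid letters),
    # every following line is a row letter followed by that row's scores
    lines = [line for line in text.strip().splitlines() if line.strip()]
    cols = lines[0].split()
    scores = {}
    for line in lines[1:]:
        row, *vals = line.split()
        for col, val in zip(cols, vals):
            scores[(row, col)] = float(val)
    # represent indels as pairs where one side is None (placeholder convention)
    for aa in cols:
        scores[(aa, None)] = indel_score
        scores[(None, aa)] = indel_score
    return scores


blosum62_excerpt = '''
   A  C  D  E
A  4  0 -2 -1
C  0  9 -3 -4
D -2 -3  6  2
E -1 -4  2  5
'''
table = parse_scoring_table(blosum62_excerpt)
print(table[('D', 'E')])   # 2.0
print(table[('A', None)])  # -5.0 (placeholder indel score)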

Extended Gap Scoring

↩PREREQUISITES↩

WHAT: When performing sequence alignment, prefer contiguous indels in a sequence vs individual indels. This is done by scoring contiguous indels differently than individual indels:

For example, given an alignment region where one of the sequences has 3 contiguous indels, the traditional method would assign a score of -15 (-5 for each indel) while this method would assign a score of -5.2 (-5 for starting indel, -0.1 for each subsequent indel)...

AAATTTAATA
AAA---AA-A

Score from indels using traditional scoring:   -5   + -5   + -5   + -5   = -20
Score from indels using extended gap scoring:  -5   + -0.1 + -0.1 + -5   = -10.2

WHY: DNA mutations are more likely to happen in chunks rather than as individual point mutations (e.g. transposons). Extended gap scoring helps account for that fact. Since DNA encodes proteins (codons), this affects proteins as well.
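
As a quick sanity check of the arithmetic above, here's a tiny sketch (my own helper, not from the book or the repo) that scores gap runs under both schemes using the example penalties of -5 for a normal indel and -0.1 for an extended indel.

def traditional_gap_score(run_length: int, gap_open: float = -5.0) -> float:
    # every indel in the run gets the normal indel penalty
    return run_length * gap_open


def extended_gap_score(run_length: int, gap_open: float = -5.0, gap_extend: float = -0.1) -> float:
    # first indel in the run gets the normal penalty, the rest get the extended penalty
    return gap_open + (run_length - 1) * gap_extend


# the example alignment above has one run of 3 indels and one run of 1 indel
print(traditional_gap_score(3) + traditional_gap_score(1))  # -20.0
print(extended_gap_score(3) + extended_gap_score(1))        # ≈ -10.2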

Naive Algorithm

ALGORITHM:

The naive way to perform extended gap scoring is to introduce a new edge for each contiguous indel. For example, given the alignment graph...

Kroki diagram output

Each added edge represents a contiguous set of indels. Contiguous indels are penalized by choosing the normal indel score for the first indel in the run (e.g. score of -5), then all other indels are scored using a better extended indel score (e.g. score of -0.1). As such, the maximum alignment path will choose one of these contiguous indel edges over individual indel edges or poorly scoring substitutions such as the low entries in PAM / BLOSUM scoring matrices.

Latex diagram

NOTE: Purple and red edges are extended indel edges.

The problem with this algorithm is that as the sequence lengths grow, the number of added edges explodes. It isn't practical for anything other than short sequences.
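
To get a feel for how badly the edge count explodes, the following back-of-envelope sketch (my own counts) tallies just the extra hop edges added by the construction below for two sequences of lengths n and m: each row contributes one hop edge per pair of columns more than 1 apart, and likewise for each column.

def naive_hop_edge_count(n: int, m: int) -> int:
    horizontal_hops = (m + 1) * n * (n - 1) // 2
    vertical_hops = (n + 1) * m * (m - 1) // 2
    return horizontal_hops + vertical_hops


for length in (10, 100, 1000):
    print(length, naive_hop_edge_count(length, length))
# 10   990
# 100  999900
# 1000 999999000  -- roughly a billion extra edges for a pair of 1000-element sequences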

ch5_code/src/affine_gap_alignment/AffineGapAlignment_Basic_Graph.py (lines 37 to 104):

def create_affine_gap_alignment_graph(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        extended_gap_weight: float
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], weight_lookup.lookup(*elems))
    )
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    horizontal_indel_hop_edge_id_func = unique_id_generator('HORIZONTAL_INDEL_HOP')
    for from_c, r in product(range(v_node_count), range(w_node_count)):
        from_node_id = from_c, r
        for to_c in range(from_c + 2, v_node_count):
            to_node_id = to_c, r
            edge_id = horizontal_indel_hop_edge_id_func()
            v_elems = v[from_c:to_c]
            w_elems = [None] * len(v_elems)
            hop_count = to_c - from_c
            weight = weight_lookup.lookup(v_elems[0], w_elems[0]) + (hop_count - 1) * extended_gap_weight
            graph.insert_edge(edge_id, from_node_id, to_node_id, EdgeData(v_elems, w_elems, weight))
    vertical_indel_hop_edge_id_func = unique_id_generator('VERTICAL_INDEL_HOP')
    for c, from_r in product(range(v_node_count), range(w_node_count)):
        from_node_id = c, from_r
        for to_r in range(from_r + 2, w_node_count):
            to_node_id = c, to_r
            edge_id = vertical_indel_hop_edge_id_func()
            w_elems = w[from_r:to_r]
            v_elems = [None] * len(w_elems)
            hop_count = to_r - from_r
            weight = weight_lookup.lookup(v_elems[0], w_elems[0]) + (hop_count - 1) * extended_gap_weight
            graph.insert_edge(edge_id, from_node_id, to_node_id, EdgeData(v_elems, w_elems, weight))
    return graph


def affine_gap_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        extended_gap_weight: float
) -> Tuple[float, List[str], List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    graph = create_affine_gap_alignment_graph(v, w, weight_lookup, extended_gap_weight)
    from_node = (0, 0)
    to_node = (v_node_count - 1, w_node_count - 1)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    alignment = []
    for e in edges:
        ed = graph.get_edge_data(e)
        alignment.append((ed.v_elem, ed.w_elem))
    return final_weight, edges, alignment

Given the sequences TAGGCGGAT and TACCCCCAT and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the global alignment is...

Latex diagram

TA----GGCGGAT
TACCCC--C--AT

Weight: 1.4999999999999998

⚠️NOTE️️️⚠️

The algorithm above was applied to global alignment, but it should be obvious how to apply it to the other alignment types discussed.

Layer Algorithm

↩PREREQUISITES↩

ALGORITHM:

Recall that the problem with the naive algorithm is that as the sequence lengths grow, the number of added edges explodes. It isn't practical for anything other than short sequences. A better algorithm that achieves the exact same result is the layer algorithm. The layer algorithm breaks an alignment graph into 3 distinct layers:

  1. horizontal edges go into their own layer.
  2. diagonal edges go into their own layer.
  3. vertical edges go into their own layer.

The edge weights in the horizontal and vertical layers are updated such that they use the extended indel score (e.g. -0.1). Then, for each node (x, y) in the diagonal layer, ...

Similarly, for each node (x, y) in both the horizontal and vertical layers that an edge from the diagonal layer points to, create a 0 weight "free ride" edge back to node (x, y) in the diagonal layer. These "free ride" edges are the same as the "free ride" edges in local alignment / fitting alignment / overlap alignment -- they hop across the alignment graph without adding anything to the sequence alignment.

The source node and sink node are at the top-left node and bottom-right node (respectively) of the diagonal layer.

Latex diagram

NOTE: Orange edges are "free rides" from source / Purple edges are "free rides" to sink.

The way to think about this layered structure alignment graph is that the hop from a node in the diagonal layer to a node in the horizontal/vertical layer will always have a normal indel score (e.g. -5). From there it either has the option to hop back to the diagonal layer (via the "free ride" edge) or continue pushing through indels using the less penalizing extended indel score (e.g. -0.1).

This layered structure produces 3 times the number of nodes, but for longer sequences it has far fewer edges than the naive method.
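
The following is a rough sketch (my own counts, derived from the construction below) of the layered graph's size for two sequences of lengths n and m. Compare the edge count against the roughly one billion hop edges counted earlier for the naive method on length-1000 sequences.

def layered_graph_size(n: int, m: int):
    nodes = 3 * (n + 1) * (m + 1)
    # diagonal layer edges + lower/upper layer indel edges + layer-hop edges + free-ride edges
    edges = n * m + 3 * n * (m + 1) + 3 * (n + 1) * m
    return nodes, edges


for length in (10, 100, 1000):
    print(length, layered_graph_size(length, length))
# 10   (363, 760)
# 100  (30603, 70600)
# 1000 (3006003, 7006000)  -- millions of edges instead of ~1 billion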

ch5_code/src/affine_gap_alignment/AffineGapAlignment_Layer_Graph.py (lines 37 to 135):

def create_affine_gap_alignment_graph(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        extended_gap_weight: float
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph_low = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], extended_gap_weight) if offset == (1, 0) else None
    )
    graph_mid = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], weight_lookup.lookup(*elems)) if offset == (1, 1) else None
    )
    graph_high = create_grid_graph(
        [v, w],
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems[0], elems[1], extended_gap_weight) if offset == (0, 1) else None
    )

    graph_merged = Graph()
    create_edge_id_func = unique_id_generator('E')

    def merge(from_graph, n_prefix):
        for n_id in from_graph.get_nodes():
            n_data = from_graph.get_node_data(n_id)
            graph_merged.insert_node(n_prefix + n_id, n_data)
        for e_id in from_graph.get_edges():
            from_n_id, to_n_id, e_data = from_graph.get_edge(e_id)
            graph_merged.insert_edge(create_edge_id_func(), n_prefix + from_n_id, n_prefix + to_n_id, e_data)

    merge(graph_low, ('high', ))
    merge(graph_mid, ('mid',))
    merge(graph_high, ('low',))

    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    mid_to_low_edge_id_func = unique_id_generator('MID_TO_LOW')
    for r, c in product(range(v_node_count - 1), range(w_node_count)):
        from_n_id = 'mid', r, c
        to_n_id = 'high', r + 1, c
        e = mid_to_low_edge_id_func()
        graph_merged.insert_edge(e, from_n_id, to_n_id, EdgeData(v[r], None, weight_lookup.lookup(v[r], None)))
    low_to_mid_edge_id_func = unique_id_generator('HIGH_TO_MID')
    for r, c in product(range(1, v_node_count), range(w_node_count)):
        from_n_id = 'high', r, c
        to_n_id = 'mid', r, c
        e = low_to_mid_edge_id_func()
        graph_merged.insert_edge(e, from_n_id, to_n_id, EdgeData(None, None, 0.0))
    mid_to_high_edge_id_func = unique_id_generator('MID_TO_HIGH')
    for r, c in product(range(v_node_count), range(w_node_count - 1)):
        from_n_id = 'mid', r, c
        to_n_id = 'low', r, c + 1
        e = mid_to_high_edge_id_func()
        graph_merged.insert_edge(e, from_n_id, to_n_id, EdgeData(None, w[c], weight_lookup.lookup(None, w[c])))
    high_to_mid_edge_id_func = unique_id_generator('LOW_TO_MID')
    for r, c in product(range(v_node_count), range(1, w_node_count)):
        from_n_id = 'low', r, c
        to_n_id = 'mid', r, c
        e = high_to_mid_edge_id_func()
        graph_merged.insert_edge(e, from_n_id, to_n_id, EdgeData(None, None, 0.0))

    return graph_merged


def affine_gap_alignment(
        v: List[ELEM],
        w: List[ELEM],
        weight_lookup: WeightLookup,
        extended_gap_weight: float
) -> Tuple[float, List[str], List[Tuple[ELEM, ELEM]]]:
    v_node_count = len(v) + 1
    w_node_count = len(w) + 1
    graph = create_affine_gap_alignment_graph(v, w, weight_lookup, extended_gap_weight)
    from_node = ('mid', 0, 0)
    to_node = ('mid', v_node_count - 1, w_node_count - 1)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    edges = list(filter(lambda e: not e.startswith('LOW_TO_MID'), edges))  # remove free rides from list
    edges = list(filter(lambda e: not e.startswith('HIGH_TO_MID'), edges))  # remove free rides from list
    alignment = []
    for e in edges:
        ed = graph.get_edge_data(e)
        alignment.append((ed.v_elem, ed.w_elem))
    return final_weight, edges, alignment

Given the sequences TGGCGG and TCCCCC and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the global alignment is...

Latex diagram

TGGC----GG
T--CCCCC--

Weight: -1.5

⚠️NOTE️️️⚠️

The algorithm above was applied to global alignment, but it should be obvious how to apply it to the other alignment types discussed.

Multiple Alignment

↩PREREQUISITES↩

WHAT: Given more than two sequences, perform sequence alignment and pull out the highest scoring alignment.

WHY: Proteins that perform the same function but are distantly related are likely to have similar regions. The problem is that a 2-way sequence alignment may have a hard time identifying those similar regions, whereas an n-way sequence alignment (n > 2) will likely reveal more of those regions and identify them more accurately.

⚠️NOTE️️️⚠️

Quote from Pevzner book: "Bioinformaticians sometimes say that pairwise alignment whispers and multiple alignment shouts."

Graph Algorithm

ALGORITHM:

Thinking about sequence alignment geometrically, adding another sequence to a sequence alignment graph is akin to adding a new dimension. For example, a sequence alignment graph with...

Kroki diagram output

The alignment possibilities at each step of a sequence alignment may be thought of as a vertex shooting out edges to all other vertices in the geometry. For example, in a sequence alignment with 2 sequences, the vertex (0, 0) shoots out an edge to vertices ...

The vertex coordinates may be thought of as indicating whether to keep or skip an element. Each coordinate position corresponds to a sequence element (first coordinate = first sequence's element, second coordinate = second sequence's element). If a coordinate is set to ...

Latex diagram

This same logic extends to sequence alignment with 3 or more sequences. For example, in a sequence alignment with 3 sequences, the vertex (0, 0, 0) shoots out an edge to all other vertices in the cube. The vertex coordinates define which sequence elements should be kept or skipped based on the same rules described above.

Latex diagram
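
To make the "shoot edges to every other neighbouring vertex" idea concrete, here's a tiny sketch (hypothetical helper) that enumerates the offsets of those edges for n sequences: each node has up to 2^n - 1 outgoing edges, one per non-zero combination of 0/1 offsets, where 1 means "consume that sequence's next element" and 0 means "skip it (indel)".

from itertools import product

def neighbour_offsets(sequence_count: int):
    # every non-zero combination of 0/1 offsets is a potential edge out of a node
    return [offsets for offsets in product([0, 1], repeat=sequence_count) if any(offsets)]


print(len(neighbour_offsets(2)))  # 3  (right, down, diagonal)
print(len(neighbour_offsets(3)))  # 7
for offsets in neighbour_offsets(3):
    print(offsets)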

ch5_code/src/graph/GraphGridCreate.py (lines 31 to 58):

def create_grid_graph(
        sequences: List[List[ELEM]],
        on_new_node: ON_NEW_NODE_FUNC_TYPE,
        on_new_edge: ON_NEW_EDGE_FUNC_TYPE
) -> Graph[Tuple[int, ...], ND, str, ED]:
    create_edge_id_func = unique_id_generator('E')
    graph = Graph()
    axes = [[None] + av for av in sequences]
    axes_len = [range(len(axis)) for axis in axes]
    for grid_coord in product(*axes_len):
        node_data = on_new_node(grid_coord)
        if node_data is not None:
            graph.insert_node(grid_coord, node_data)
    for src_grid_coord in graph.get_nodes():
        for grid_coord_offsets in product([0, 1], repeat=len(sequences)):
            dst_grid_coord = tuple(axis + offset for axis, offset in zip(src_grid_coord, grid_coord_offsets))
            if src_grid_coord == dst_grid_coord:  # skip if making a connection to self
                continue
            if not graph.has_node(dst_grid_coord):  # skip if neighbouring node doesn't exist
                continue
            elements = tuple(None if src_idx == dst_idx else axes[axis_idx][dst_idx]
                             for axis_idx, (src_idx, dst_idx) in enumerate(zip(src_grid_coord, dst_grid_coord)))
            edge_data = on_new_edge(src_grid_coord, dst_grid_coord, grid_coord_offsets, elements)
            if edge_data is not None:
                edge_id = create_edge_id_func()
                graph.insert_edge(edge_id, src_grid_coord, dst_grid_coord, edge_data)
    return graph

ch5_code/src/global_alignment/GlobalMultipleAlignment_Graph.py (lines 33 to 71):

def create_global_alignment_graph(
        seqs: List[List[ELEM]],
        weight_lookup: WeightLookup
) -> Graph[Tuple[int, ...], NodeData, str, EdgeData]:
    graph = create_grid_graph(
        seqs,
        lambda n_id: NodeData(),
        lambda src_n_id, dst_n_id, offset, elems: EdgeData(elems, weight_lookup.lookup(*elems))
    )
    return graph


def global_alignment(
        seqs: List[List[ELEM]],
        weight_lookup: WeightLookup
) -> Tuple[float, List[str], List[Tuple[ELEM, ...]]]:
    seq_node_counts = [len(s) for s in seqs]
    graph = create_global_alignment_graph(seqs, weight_lookup)
    from_node = tuple([0] * len(seqs))
    to_node = tuple(seq_node_counts)
    populate_weights_and_backtrack_pointers(
        graph,
        from_node,
        lambda n_id, weight, e_id: graph.get_node_data(n_id).set_weight_and_backtracking_edge(weight, e_id),
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge(),
        lambda e_id: graph.get_edge_data(e_id).weight
    )
    final_weight = graph.get_node_data(to_node).weight
    edges = backtrack(
        graph,
        to_node,
        lambda n_id: graph.get_node_data(n_id).get_weight_and_backtracking_edge()
    )
    alignment = []
    for e in edges:
        ed = graph.get_edge_data(e)
        alignment.append(ed.elems)
    return final_weight, edges, alignment

Given the sequences ['TATTATTAT', 'GATTATGATTAT', 'TACCATTACAT'] and the score matrix...

INDEL=-1.0
A A A 1
A A C -1
A A T -1
A A G -1
A C A -1
A C C -1
A C T -1
A C G -1
A T A -1
A T C -1
A T T -1
A T G -1
A G A -1
A G C -1
A G T -1
A G G -1
C A A -1
C A C -1
C A T -1
C A G -1
C C A -1
C C C 1
C C T -1
C C G -1
C T A -1
C T C -1
C T T -1
C T G -1
C G A -1
C G C -1
C G T -1
C G G -1
T A A -1
T A C -1
T A T -1
T A G -1
T C A -1
T C C -1
T C T -1
T C G -1
T T A -1
T T C -1
T T T 1
T T G -1
T G A -1
T G C -1
T G T -1
T G G -1
G A A -1
G A C -1
G A T -1
G A G -1
G C A -1
G C C -1
G C T -1
G C G -1
G T A -1
G T C -1
G T T -1
G T G -1
G G A -1
G G C -1
G G T -1
G G G 1

... the global alignment is...

--T-ATTATTA--T
GATTATGATTA--T
--T-ACCATTACAT

Weight: 0.0

⚠️NOTE️️️⚠️

The multiple alignment algorithm displayed above was specifically for global alignment using a graph implementation, but it should be obvious how to apply it to most of the other alignment types (e.g. local alignment).

Matrix Algorithm

↩PREREQUISITES↩

The following algorithm is essentially the same as the graph algorithm, except that the implementation is much more sympathetic to modern hardware. The alignment graph is represented as an N-dimensional matrix where each element in the matrix represents a node in the alignment graph. This is similar to the 2D matrix used for global alignment's matrix implementation.

ch5_code/src/global_alignment/GlobalMultipleAlignment_Matrix.py (lines 12 to 79):

def generate_matrix(seq_node_counts: List[int]) -> List[Any]:
    last_buffer = [[-1.0, (None, None), '?'] for _ in range(seq_node_counts[-1])]  # row
    for dim in reversed(seq_node_counts[:-1]):
        # DON'T USE DEEPCOPY -- VERY SLOW: https://stackoverflow.com/a/29385667
        # last_buffer = [deepcopy(last_buffer) for _ in range(dim)]
        last_buffer = [pickle.loads(pickle.dumps(last_buffer, -1)) for _ in range(dim)]
    return last_buffer


def get_cell(matrix: List[Any], idxes: Iterable[int]):
    buffer = matrix
    for i in idxes:
        buffer = buffer[i]
    return buffer


def set_cell(matrix: List[Any], idxes: Iterable[int], value: Any):
    buffer = matrix
    for i in idxes[:-1]:
        buffer = buffer[i]
    buffer[idxes[-1]] = value


def backtrack(
        node_matrix: List[List[Any]],
        dimensions: List[int]
) -> Tuple[float, List[Tuple[ELEM, ELEM]]]:
    node_idxes = [d - 1 for d in dimensions]
    final_weight = get_cell(node_matrix, node_idxes)[0]
    alignment = []
    while set(node_idxes) != {0}:
        _, elems, backtrack_ptr = get_cell(node_matrix, node_idxes)
        node_idxes = backtrack_ptr
        alignment.append(elems)
    return final_weight, alignment[::-1]


def global_alignment(
        seqs: List[List[ELEM]],
        weight_lookup: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ...]]]:
    seq_node_counts = [len(s) + 1 for s in seqs]
    node_matrix = generate_matrix(seq_node_counts)
    src_node = get_cell(node_matrix, [0] * len(seqs))
    src_node[0] = 0.0                   # source node weight
    src_node[1] = (None, ) * len(seqs)  # source node elements (elements don't matter for source node)
    src_node[2] = (None, ) * len(seqs)  # source node parent (direction doesn't matter for source node)
    for to_node in product(*(range(sc) for sc in seq_node_counts)):
        parents = []
        parent_idx_ranges = []
        for idx in to_node:
            vals = [idx]
            if idx > 0:
                vals += [idx-1]
            parent_idx_ranges.append(vals)
        for from_node in product(*parent_idx_ranges):
            if from_node == to_node:  # we want indexes of parent nodes, not self
                continue
            edge_elems = tuple(None if f == t else s[t-1] for s, f, t in zip(seqs, from_node, to_node))
            parents.append([
                get_cell(node_matrix, from_node)[0] + weight_lookup.lookup(*edge_elems),
                edge_elems,
                from_node
            ])
        if parents:  # parents will be empty if source node
            set_cell(node_matrix, to_node, max(parents, key=lambda x: x[0]))
    return backtrack(node_matrix, seq_node_counts)

Given the sequences ['TATTATTAT', 'GATTATGATTAT', 'TACCATTACAT'] and the score matrix...

INDEL=-1.0
A A A 1
A A C -1
A A T -1
A A G -1
A C A -1
A C C -1
A C T -1
A C G -1
A T A -1
A T C -1
A T T -1
A T G -1
A G A -1
A G C -1
A G T -1
A G G -1
C A A -1
C A C -1
C A T -1
C A G -1
C C A -1
C C C 1
C C T -1
C C G -1
C T A -1
C T C -1
C T T -1
C T G -1
C G A -1
C G C -1
C G T -1
C G G -1
T A A -1
T A C -1
T A T -1
T A G -1
T C A -1
T C C -1
T C T -1
T C G -1
T T A -1
T T C -1
T T T 1
T T G -1
T G A -1
T G C -1
T G T -1
T G G -1
G A A -1
G A C -1
G A T -1
G A G -1
G C A -1
G C C -1
G C T -1
G C G -1
G T A -1
G T C -1
G T T -1
G T G -1
G G A -1
G G C -1
G G T -1
G G G 1

... the global alignment is...

--T-ATTATTA--T
GATTATGATTA--T
--T-ACCATTACAT

Weight: 0.0

⚠️NOTE️️️⚠️

The multiple alignment algorithm displayed above was specifically for global alignment using a matrix implementation, but it should be obvious how to apply it to most of the other alignment types (e.g. local alignment). With a little bit of effort, it can also be converted to use the divide-and-conquer algorithm discussed earlier (there aren't that many leaps in logic).

Greedy Algorithm

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

The Pevzner book challenged you to come up with a greedy algorithm for multiple alignment using profile matrices. This is what I was able to come up with. I have no idea if my logic is correct / optimal, but with toy sequences that are highly related it seems to perform well.

UPDATE: This algorithm seems to work well for the final assignment. ~380 a-domain sequences were aligned in about 2 days and it produced an okay/good looking alignment. Aligning those sequences using normal multiple alignment would be impossible -- nowhere near enough memory or speed available.

For an n-way sequence alignment, the greedy algorithm starts by finding the 2 sequences that produce the highest scoring 2-way sequence alignment. That alignment is then used to build a profile matrix. For example, the alignment of TRELLO and MELLOW results in the following alignment:

0 1 2 3 4 5 6
T R E L L O -
- M E L L O W

That alignment then turns into the following profile matrix:

0 1 2 3 4 5 6
Probability of T 0.5 0.0 0.0 0.0 0.0 0.0 0.0
Probability of R 0.0 0.5 0.0 0.0 0.0 0.0 0.0
Probability of M 0.0 0.5 0.0 0.0 0.0 0.0 0.0
Probability of E 0.0 0.0 1.0 0.0 0.0 0.0 0.0
Probability of L 0.0 0.0 0.0 1.0 1.0 0.0 0.0
Probability of O 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Probability of W 0.0 0.0 0.0 0.0 0.0 0.0 0.5

Then, 2-way sequence alignments are performed between the profile matrix and the remaining sequences. For example, if the letter W is scored against column 1 of the profile matrix, the algorithm would score W against each letter stored in that column using the same scoring matrix as the initial 2-way sequence alignment. Each score would then get weighted by the corresponding probability in column 1 and the highest one would be chosen as the final score.

max(
    score('W', 'T') * profile_mat[1]['T'],
    score('W', 'R') * profile_mat[1]['R'],
    score('W', 'M') * profile_mat[1]['M'],
    score('W', 'E') * profile_mat[1]['E'],
    score('W', 'L') * profile_mat[1]['L'],
    score('W', 'O') * profile_mat[1]['O'],
    score('W', 'W') * profile_mat[1]['W']
)

Of all the remaining sequences, the one with the highest scoring alignment is removed and its alignment is added to the profile matrix. The process repeats until no more sequences are left.

⚠️NOTE️️️⚠️

The logic above is what was used to solve the final assignment. But, after thinking about it some more it probably isn't entirely correct. Elements that haven't been encountered yet should be left unset in the profile matrix. If this change were applied, the example above would end up looking more like this...

                  0    1    2    3    4    5    6
Probability of T  0.5
Probability of R       0.5
Probability of M       0.5
Probability of E            1.0
Probability of L                 1.0  1.0
Probability of O                           1.0
Probability of W                                0.5

Then, when scoring an element against a column in the profile matrix, ignore the unset elements in the column. The score calculation in the example above would end up being...

max(
    score('W', 'R') * profile_mat[1]['R'],
    score('W', 'M') * profile_mat[1]['M']
)

For n-way sequence alignments where n is large (e.g. n=300) and the sequences are highly related, the greedy algorithm performs well but it may produce sub-optimal results. In contrast, the amount of memory and computation required for an n-way sequence alignment using the standard graph algorithm goes up exponentially as n grows linearly. For realistic biological sequences, the normal algorithm will likely fail for any n past 3 or 4. Adapting the divide-and-conquer algorithm for n-way sequence alignment will help, but even that only allows for targeting a slightly larger n (e.g. n=6).
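
The exponential blow-up mentioned above is easy to see by counting nodes: the alignment graph / matrix has one node per combination of positions, i.e. the product of (length + 1) over all sequences. A quick sketch (the lengths below are hypothetical):

def alignment_node_count(seq_lengths) -> int:
    total = 1
    for length in seq_lengths:
        total *= length + 1
    return total


for n in (2, 3, 4, 6):
    print(n, alignment_node_count([500] * n))
# n=2 is ~250 thousand nodes, n=3 is ~126 million, n=4 is already over 60 billion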

ch5_code/src/global_alignment/GlobalMultipleAlignment_Greedy.py (lines 17 to 84):

class ProfileWeightLookup(WeightLookup):
    def __init__(self, total_seqs: int, backing_2d_lookup: WeightLookup):
        self.total_seqs = total_seqs
        self.backing_wl = backing_2d_lookup

    def lookup(self, *elements: Tuple[ELEM_OR_COLUMN, ...]):
        col: Tuple[ELEM, ...] = elements[0]
        elem: ELEM = elements[1]

        if col is None:
            return self.backing_wl.lookup(elem, None)  # should map to indel score
        elif elem is None:
            return self.backing_wl.lookup(None, col[0])  # should map to indel score
        else:
            probs = {elem: count / self.total_seqs for elem, count in Counter(e for e in col if e is not None).items()}
            ret = 0.0
            for p_elem, prob in probs.items():
                val = self.backing_wl.lookup(elem, p_elem) * prob
                ret = max(val, ret)
            return ret


def global_alignment(
        seqs: List[List[ELEM]],
        weight_lookup_2way: WeightLookup,
        weight_lookup_multi: WeightLookup
) -> Tuple[float, List[Tuple[ELEM, ...]]]:
    seqs = seqs[:]  # copy
    # Get initial best 2-way alignment
    highest_res = None
    highest_seqs = None
    for s1, s2 in combinations(seqs, r=2):
        if s1 is s2:
            continue
        res = GlobalAlignment_Matrix.global_alignment(s1, s2, weight_lookup_2way)
        if highest_res is None or res[0] > highest_res[0]:
            highest_res = res
            highest_seqs = s1, s2
    seqs.remove(highest_seqs[0])
    seqs.remove(highest_seqs[1])
    total_seqs = 2
    final_alignment = highest_res[1]
    # Build out profile matrix from alignment and continually add to it using 2-way alignment
    if seqs:
        s1 = highest_res[1]
        while seqs:
            profile_weight_lookup = ProfileWeightLookup(total_seqs, weight_lookup_2way)
            _, alignment = max(
                [GlobalAlignment_Matrix.global_alignment(s1, s2, profile_weight_lookup) for s2 in seqs],
                key=lambda x: x[0]
            )
            # pull out s1 from alignment and flatten for next cycle
            s1 = []
            for e in alignment:
                if e[0] is None:
                    s1 += [((None, ) * total_seqs) + (e[1], )]
                else:
                    s1 += [(*e[0], e[1])]
            # pull out s2 from alignment and remove from seqs
            s2 = [e for _, e in alignment if e is not None]
            seqs.remove(s2)
            # increase seq count
            total_seqs += 1
        final_alignment = s1
    # Recalculate score based on multi weight lookup
    final_weight = sum(weight_lookup_multi.lookup(*elems) for elems in final_alignment)
    return final_weight, final_alignment

Given the sequences ['TATTATTAT', 'GATTATGATTAT', 'TACCATTACAT', 'CTATTAGGAT'] and the score matrix...

INDEL=-1.0
    A   C   T   G
A   1  -1  -1  -1
C  -1   1  -1  -1
T  -1  -1   1  -1
G  -1  -1  -1   1

... the global alignment is...

---TATTATTAT
GATTATGATTAT
TACCATTA-CAT
--CTATTAGGAT

Weight: 8.0

Sum-of-Pairs Scoring

↩PREREQUISITES↩

WHAT: If a scoring model already exists for 2-way sequence alignments, that scoring model can be used as the basis for n-way sequence alignments (where n > 2). For a possible alignment position, generate all possible pairs between the elements at that position and score them. Then, sum those scores to get the final score for that alignment position.

WHY: Traditionally, scoring an n-way alignment requires an n-dimensional scoring matrix. For example, protein sequences have 20 possible element types (1 for each proteinogenic amino acid). That means a...

Creating probabilistic scoring models such as BLOSUM and PAM for n-way alignments where n > 2 is impractical. Sum-of-pairs scoring is a viable alternative.

ALGORITHM:

ch5_code/src/scoring/SumOfPairsWeightLookup.py (lines 8 to 14):

class SumOfPairsWeightLookup(WeightLookup):
    def __init__(self, backing_2d_lookup: WeightLookup):
        self.backing_wl = backing_2d_lookup

    def lookup(self, *elements: Tuple[Optional[ELEM], ...]):
        return sum(self.backing_wl.lookup(a, b) for a, b in combinations(elements, r=2))

Given the elements ['M', 'E', 'A', None, 'L', 'Y'] and the backing score matrix...

INDEL=-1.0
   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  4  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -2
C  0  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
D -2 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -3
E -1 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -2
F -2 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1  3
G  0 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -3
H -2 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2  2
I -1 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1
K -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -2
L -1 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1
M -1 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1
N -2 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -2
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -3
Q -1 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1
R -1 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -2
S  1 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -2
T  0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -2
V  0 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1
W -3 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11  2
Y -2 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2  7

... the sum-of-pairs score for these elements is -17.0.
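
As a quick hand-check of the -17.0 above, the sketch below re-computes the sum using just the relevant BLOSUM62 entries from the table and the -1.0 indel score for any pair involving a gap (None). The dictionary is copied by hand from the table, not parsed from the repo's file.

from itertools import combinations

# relevant BLOSUM62 entries, copied by hand from the table above
pair_scores = {
    frozenset('ME'): -2, frozenset('MA'): -1, frozenset('ML'): 2, frozenset('MY'): -1,
    frozenset('EA'): -1, frozenset('EL'): -3, frozenset('EY'): -2,
    frozenset('AL'): -1, frozenset('AY'): -2,
    frozenset('LY'): -1,
}

def pair_score(a, b):
    if a is None or b is None:
        return -1.0  # indel score from the example
    return pair_scores[frozenset(a + b)]


elements = ['M', 'E', 'A', None, 'L', 'Y']
print(sum(pair_score(a, b) for a, b in combinations(elements, r=2)))  # -17.0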

Entropy Scoring

↩PREREQUISITES↩

WHAT: When performing an n-way sequence alignment, score each possible alignment position based on entropy.

WHY: Entropy is a measure of uncertainty. The idea is that the more "certain" an alignment position is, the more likely it is to be correct.

ALGORITHM:

ch5_code/src/scoring/EntropyWeightLookup.py (lines 9 to 31):

class EntropyWeightLookup(WeightLookup):
    def __init__(self, indel_weight: float):
        self.indel_weight = indel_weight

    @staticmethod
    def _calculate_entropy(values: Tuple[float, ...]) -> float:
        ret = 0.0
        for value in values:
            ret += value * (log(value, 2.0) if value > 0.0 else 0.0)
        ret = -ret
        return ret

    def lookup(self, *elements: Tuple[Optional[ELEM], ...]):
        if None in elements:
            return self.indel_weight

        counts = Counter(elements)
        total = len(elements)
        probs = tuple(v / total for k, v in counts.most_common())
        entropy = EntropyWeightLookup._calculate_entropy(probs)

        return -entropy

Given the elements ['A', 'A', 'A', 'A', 'C'], the entropy score for these elements is -0.7219280948873623 (INDEL=-2.0).
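
As a quick check of the value above: the column has four A's and one C, so the probabilities are 4/5 and 1/5, and the score is the entropy negated.

from math import log2

probs = [4 / 5, 1 / 5]
entropy = -sum(p * log2(p) for p in probs)
print(-entropy)  # ≈ -0.7219280948873623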

Synteny

↩PREREQUISITES↩

A form of DNA mutation, called genome rearrangement, is when chromosomes go through structural changes such as ...

When a new species branches off from an existing one, genome rearrangements are responsible for at least some of the divergence. That is, the two related genomes will share long stretches of similar genes, but these long stretches will appear as if they had been randomly cut-and-pasted and / or randomly reversed when compared against the other genome.

Kroki diagram output

These long stretches of similar genes are called synteny blocks. The example above has 4 synteny blocks:

Kroki diagram output

Real-life examples of species that share synteny blocks include ...

Genomic Dot Plot

WHAT: Given two genomes, create a 2D plot where each axis is assigned to one of the genomes and a dot is placed at each coordinate containing a match, where a match is either a k-mer shared between the genomes or a k-mer in one genome whose reverse complement appears in the other. These plots are called genomic dot plots.

Kroki diagram output

WHY: Genomic dot plots are used for identifying synteny blocks between two genomes.
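
Before diving into the repo's implementation below, here's a toy sketch of the idea (a hypothetical helper, DNA alphabet assumed): place a dot at (i, j) whenever the k-mer starting at position i of one genome matches the k-mer starting at position j of the other, either directly or via its reverse complement.

def toy_dot_plot(seq1: str, seq2: str, k: int):
    def rev_comp(kmer: str) -> str:
        return kmer[::-1].translate(str.maketrans('ACGT', 'TGCA'))
    dots = []
    for i in range(len(seq1) - k + 1):
        for j in range(len(seq2) - k + 1):
            kmer1 = seq1[i:i + k]
            kmer2 = seq2[j:j + k]
            if kmer1 == kmer2 or kmer1 == rev_comp(kmer2):
                dots.append((i, j))
    return dots


print(toy_dot_plot('ACGTTT', 'CGTTTA', k=3))  # [(0, 0), (1, 0), (2, 1), (3, 2)]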

ALGORITHM:

The following algorithm finds direct matches. However, a better solution may be to consider anything within some hamming distance as a match. Doing so would require non-trivial changes to the algorithm (e.g. modifying the lookup to use bloom filters).

ch6_code/src/synteny_graph/Match.py (lines 176 to 232):

@staticmethod
def create_from_genomes(
        k: int,
        cyclic: bool,  # True if chromosomes are cyclic
        genome1: Dict[str, str],  # chromosome id -> dna string
        genome2: Dict[str, str]   # chromosome id -> dna string
) -> List[Match]:
    # lookup tables for data1
    fwd_kmers1 = defaultdict(list)
    rev_kmers1 = defaultdict(list)
    for chr_name, chr_data in genome1.items():
        for kmer, idx in slide_window(chr_data, k, cyclic):
            fwd_kmers1[kmer].append((chr_name, idx))
            rev_kmers1[dna_reverse_complement(kmer)].append((chr_name, idx))
    # lookup tables for data2
    fwd_kmers2 = defaultdict(list)
    rev_kmers2 = defaultdict(list)
    for chr_name, chr_data in genome2.items():
        for kmer, idx in slide_window(chr_data, k, cyclic):
            fwd_kmers2[kmer].append((chr_name, idx))
            rev_kmers2[dna_reverse_complement(kmer)].append((chr_name, idx))
    # match
    matches = []
    fwd_key_matches = set(fwd_kmers1.keys())
    fwd_key_matches.intersection_update(fwd_kmers2.keys())
    for kmer in fwd_key_matches:
        idxes1 = fwd_kmers1.get(kmer, [])
        idxes2 = fwd_kmers2.get(kmer, [])
        for (chr_name1, idx1), (chr_name2, idx2) in product(idxes1, idxes2):
            m = Match(
                y_axis_chromosome=chr_name1,
                y_axis_chromosome_min_idx=idx1,
                y_axis_chromosome_max_idx=idx1 + k - 1,
                x_axis_chromosome=chr_name2,
                x_axis_chromosome_min_idx=idx2,
                x_axis_chromosome_max_idx=idx2 + k - 1,
                type=MatchType.NORMAL
            )
            matches.append(m)
    rev_key_matches = set(fwd_kmers1.keys())
    rev_key_matches.intersection_update(rev_kmers2.keys())
    for kmer in rev_key_matches:
        idxes1 = fwd_kmers1.get(kmer, [])
        idxes2 = rev_kmers2.get(kmer, [])
        for (chr_name1, idx1), (chr_name2, idx2) in product(idxes1, idxes2):
            m = Match(
                y_axis_chromosome=chr_name1,
                y_axis_chromosome_min_idx=idx1,
                y_axis_chromosome_max_idx=idx1 + k - 1,
                x_axis_chromosome=chr_name2,
                x_axis_chromosome_min_idx=idx2,
                x_axis_chromosome_max_idx=idx2 + k - 1,
                type=MatchType.REVERSE_COMPLEMENT
            )
            matches.append(m)
    return matches

Generating genomic dot plot for...

Result...

Genomic Dot Plot

⚠️NOTE️️️⚠️

Rather than just showing dots at matches, the plot below draws a line over the entire match.

Synteny Graph

↩PREREQUISITES↩

WHAT: Given the genomic dot-plot for two genomes, connect dots that are close together and going in the same direction. This process is commonly referred to as clustering. A clustered genomic dot plot is called a synteny graph.

WHY: Clustering together matches reveals synteny blocks.

Kroki diagram output

Kroki diagram output

ALGORITHM:

The following synteny graph algorithm relies on three non-trivial components:

  1. A spatial indexing algorithm bins points that are close together, such that it's fast to look up the set of dots that are within the proximity of some other dot. The spatial indexing algorithm used by this implementation is called a quad tree.
  2. A clustering algorithm connects dots going in the same direction to reveal synteny blocks. The clustering algorithm used by this implementation is iterative, doing multiple rounds of connecting within a set of constraints (e.g. the neighbouring dot has to be within some limit / in some angle for it to connect).
  3. A filtering algorithm that trims/merges overlapping synteny blocks as well as removes superfluous synteny blocks returned by the clustering algorithm. The filtering algorithm used by this implementation is a set of simple off-the-cuff heuristics.

These components are complicated and not specific to bioinformatics. As such, this section doesn't discuss them in detail, but the source code is available (entrypoint is displayed below).

⚠️NOTE️️️⚠️

This is code I came up with to solve the ch 6 final assignment in the Pevzner book. I came up with / fleshed out the ideas myself -- the book only hinted at specific bits. I believe the fundamentals are correct but the implementation is finicky and requires a lot of knob twisting to get decent results.

ch6_code/src/synteny_graph/MatchMerger.py (lines 18 to 65):

def distance_merge(matches: Iterable[Match], radius: int, angle_half_maw: int = 45) -> List[Match]:
    min_x = min(m.x_axis_chromosome_min_idx for m in matches)
    max_x = max(m.x_axis_chromosome_max_idx for m in matches)
    min_y = min(m.y_axis_chromosome_min_idx for m in matches)
    max_y = max(m.y_axis_chromosome_max_idx for m in matches)
    indexer = MatchSpatialIndexer(min_x, max_x, min_y, max_y)
    for m in matches:
        indexer.index(m)
    ret = []
    remaining = set(matches)
    while remaining:
        m = next(iter(remaining))
        found = indexer.scan(m, radius, angle_half_maw)
        merged = Match.merge(found)
        for _m in found:
            indexer.unindex(_m)
            remaining.remove(_m)
        ret.append(merged)
    return ret


def overlap_filter(
        matches: Iterable[Match],
        max_filter_length: float,
        max_merge_distance: float
) -> List[Match]:
    clipper = MatchOverlapClipper(max_filter_length, max_merge_distance)
    for m in matches:
        while True:
            # When you attempt to add a match to the clipper, the clipper may instead ask you to make a set of changes
            # before it'll accept it. Specifically, the clipper may ask you to replace a bunch of existing matches that
            # it's already indexed and then give you a MODIFIED version of m that it'll accept once you've applied
            # those replacements
            changes_requested = clipper.index(m)
            if not changes_requested:
                break
            # replace existing entries in clipper
            for from_m, to_m in changes_requested.existing_matches_to_replace.items():
                clipper.unindex(from_m)
                if to_m:
                    res = clipper.index(to_m)
                    assert res is None
            # replace m with a revised version -- if None it means m isn't needed (its been filtered out)
            m = changes_requested.revised_match
            if not m:
                break
    return list(clipper.get())

Generating synteny graph for...

Original genomic dot plot...

Genomic Dot Plot

Merging radius=10 angle_half_maw=45...

Merging radius=15 angle_half_maw=45...

Merging radius=25 angle_half_maw=45...

Merging radius=35 angle_half_maw=45...

Filtering max_filter_length=35.0 max_merge_distance=35.0...

Merging radius=100 angle_half_maw=45...

Filtering max_filter_length=65.0 max_merge_distance=65.0...

Culling below length=15.0...

Final synteny graph...

Synteny Graph

Reversal Path

↩PREREQUISITES↩

WHAT: Given two genomes that share synteny blocks, where one genome has the synteny blocks in desired form while the other does not, determine the minimum number of genome rearrangement reversals (reversal distance) required to get the undesired genome's synteny blocks to match those in the desired genome.

Kroki diagram output

WHY: The theory is that the genome rearrangements between two species take the parsimonious path (or close to it). Since genome reversals are the most common form of genome rearrangement mutation, by calculating a parsimonious reversal path (smallest set of reversals) it's possible to get an idea of how the two species branched off. In the example above, it may be that one of the genomes in the reversal path is the parent that both genomes are based off of.

Kroki diagram output

Breakpoint List Algorithm

ALGORITHM:

This algorithm is a simple best effort heuristic to estimate the parsimonious reversal path. It isn't guaranteed to generate a reversal path in every case: The point of this algorithm isn't so much to be a robust solution as much as it is to be a foundation / provide intuition for better algorithms that determine reversal paths.

The algorithm relies on the concept of breakpoints and adjacencies...

Breakpoints and adjacencies are useful because they identify desirable points for reversals. This algorithm takes advantage of that fact to estimate the reversal distance. For example, a contiguous train of adjacencies in an undesired genome may identify the boundaries for a single reversal that gets the undesired genome closer to the desired genome.

Kroki diagram output

The algorithm starts by assigning integers to synteny blocks. The synteny blocks in the...

For example, ...

Kroki diagram output

The synteny blocks in each genome of the above example may be represented as lists...

Artificially add a 0 prefix and a length + 1 suffix to both lists. In the above example, the length is 5, so each list gets a prefix of 0 and a suffix of 6...

In this modified list, consecutive elements (p_i, p_{i+1}) are considered a...

In the undesired version of the example above, the breakpoints and adjacencies are...

Kroki diagram output

This algorithm continually applies genome rearrangement reversal operations on portions of the list in the hopes of reducing the number of breakpoints at each reversal, ultimately hoping to get it to the desired list. It targets portions of contiguous adjacencies sandwiched between breakpoints. In the example above, the reversal of [-4, -3, -2] reduces the number of breakpoints by 1...

Kroki diagram output

Following that up with a reversal of [-5] reduces the number of breakpoints by 2...

Kroki diagram output

This leaves the undesired list in the same state as the desired list. As such, the reversal distance for this example is 2 reversals.

In the best case, a single reversal will remove 2 breakpoints (one on each side of the reversal). In the worst case, there is no single reversal that drives down the number of breakpoints. For example, there is no single reversal for the list [+2, +1] that reduces the number of breakpoints...

Kroki diagram output

In such worst case scenarios, the algorithm fails. However, the point of this algorithm isn't so much to be a robust solution as much as it is to be a foundation for better algorithms that determine reversal paths.

ch6_code/src/breakpoint_list/BreakpointList.py (lines 7 to 26):

def find_adjacencies_sandwiched_between_breakpoints(augmented_blocks: List[int]) -> List[int]:
    assert augmented_blocks[0] == 0
    assert augmented_blocks[-1] == len(augmented_blocks) - 1
    ret = []
    for (x1, x2), idx in slide_window(augmented_blocks, 2):
        if x1 + 1 != x2:
            ret.append(idx)
    return ret


def find_and_reverse_section(augmented_blocks: List[int]) -> Optional[List[int]]:
    bp_idxes = find_adjacencies_sandwiched_between_breakpoints(augmented_blocks)
    for (bp_i1, bp_i2), _ in slide_window(bp_idxes, 2):
        if augmented_blocks[bp_i1] + 1 == -augmented_blocks[bp_i2] or\
                augmented_blocks[bp_i2 + 1] == -augmented_blocks[bp_i1 + 1] + 1:
            return augmented_blocks[:bp_i1 + 1]\
                   + [-x for x in reversed(augmented_blocks[bp_i1 + 1:bp_i2 + 1])]\
                   + augmented_blocks[bp_i2 + 1:]
    return None

Reversing on breakpoint boundaries...

No more reversals possible.

Since each reversal can at most reduce the number of breakpoints by 2, the reversal distance must be at least half the number of breakpoints (lower bound): d_{rev}(p) >= \frac{bp(p)}{2}. In other words, the minimum number of reversals to transform a permutation into the identity permutation will never be less than \frac{bp(p)}{2}.
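
Here's a small sketch of that lower bound (my own helper, applied to a hypothetical augmented permutation): count the pairs of consecutive elements that aren't adjacencies, then halve.

def breakpoint_count(augmented_blocks) -> int:
    # a pair (p_i, p_i+1) is a breakpoint when p_i + 1 != p_i+1
    return sum(1 for p1, p2 in zip(augmented_blocks, augmented_blocks[1:]) if p1 + 1 != p2)


p = [0, 1, -4, -3, -2, 5, 6]  # hypothetical augmented permutation
bp = breakpoint_count(p)
print(bp, bp / 2)  # 2 breakpoints, so at least 1 reversal is needed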

Breakpoint Graph Algorithm

↩PREREQUISITES↩

ALGORITHM:

This algorithm calculates a parsimonious reversal path by constructing an undirected graph representing the synteny blocks between genomes. Unlike the breakpoint list algorithm, this algorithm...

This algorithm begins by constructing an undirected graph containing both the desired and undesired genomes, referred to as a breakpoint graph. It then performs a set of re-wiring operations on the breakpoint graph to determine a parsimonious reversal path (including fusion and fission), where each re-wiring operation is referred to as a two-break.

BREAKPOINT GRAPH REPRESENTATION

Construction of a breakpoint graph is as follows:

  1. Set the ends of synteny blocks as nodes. The arrow end should have a t suffix (for tail) while the non-arrow end should have a h suffix (for head)...

    Dot diagram

    If the genome has linear chromosomes, add a termination node as well to represent chromosome ends. Only one termination node is needed -- all chromosome ends are represented by the same termination node.

    Dot diagram

  2. Set the synteny blocks themselves as undirected edges, represented by dashed edges.

    Dot diagram

    Note that the arrow heads on these dashed edges represent the direction of the synteny match (e.g. head-to-tail for a normal match vs tail-to-head for a reverse complement match), not edge directions in the graph (graph is undirected). Since the h and t suffixes on nodes already convey the match direction information, the arrows may be omitted to reduce confusion.

    Dot diagram

  3. Set the regions between synteny blocks as undirected edges, represented by colored edges. Regions of ...

    Dot diagram

    For linear chromosomes, the region between a chromosome end and the synteny node just before it is also represented by the appropriate colored edge.

    Dot diagram

For example, the following two genomes share the synteny blocks A, B, C, and D between them ...

Kroki diagram output

Converting the above genomes to both a circular and linear breakpoint graph is as follows...

Dot diagram

As shown in the example above, the convention for drawing a breakpoint graph is to position nodes and edges as they appear in the desired genome (synteny edges should be neatly sandwiched between blue edges). Note how both breakpoint graphs in the example above are just another representation of their linear diagram counterparts. The ...

The reason for this convention is that it helps conceptualize the algorithms that operate on breakpoint graphs (described further down). Ultimately, a breakpoint graph is simply a merged version of the linear diagrams for both the desired and undesired genomes.

For example, if the circular genome version of the breakpoint graph example above were flattened based on the blue edges (desired genome), the synteny blocks would be ordered as they are in the linear diagram for the desired genome...

Kroki diagram output

Dot diagram

Likewise, if the circular genome version of the breakpoint graph example above were flattened based on red edges (undesired genome), the synteny blocks would be ordered as they are in the linear diagram for the undesired genome...

Kroki diagram output

Dot diagram

⚠️NOTE️️️⚠️

If you're confused at this point, don't continue. Go back and make sure you understand, because the next section builds on the above content.

DATA STRUCTURE REPRESENTATION

The data structure used to represent a breakpoint graph can simply be two adjacency lists: one for the red edges and one for the blue edges.

ch6_code/src/breakpoint_graph/ColoredEdgeSet.py (lines 16 to 35):

# Represents a single genome in a breakpoint graph
class ColoredEdgeSet:
    def __init__(self):
        self.by_node: Dict[SyntenyNode, ColoredEdge] = {}

    @staticmethod
    def create(ce_list: Iterable[ColoredEdge]) -> ColoredEdgeSet:
        ret = ColoredEdgeSet()
        for ce in ce_list:
            ret.insert(ce)
        return ret

    def insert(self, e: ColoredEdge):
        if e.n1 in self.by_node or e.n2 in self.by_node:
            raise ValueError(f'Node already occupied: {e}')
        if not isinstance(e.n1, TerminalNode):
            self.by_node[e.n1] = e
        if not isinstance(e.n2, TerminalNode):
            self.by_node[e.n2] = e

The edges representing synteny blocks technically don't need to be tracked because they're easily derived from either set of colored edges (red or blue). For example, given the following circular breakpoint graph ...

Dot diagram

..., walk the blue edges starting from the node B_t. The opposite end of the blue edge at B_t is C_h. The next edge to walk must be a synteny edge, but synteny edges aren't tracked in this data structure. However, since it's known that the nodes of a synteny edge...

, ... it's easy to derive that the opposite end of the synteny edge at node C_h is node C_t. As such, get the blue edge for C_t and repeat. Keep repeating until a cycle is detected.

For linear breakpoint graphs, the process must start and end at the termination node (no cycle).

ch6_code/src/breakpoint_graph/ColoredEdgeSet.py (lines 80 to 126):

# Walks the colored edges, spliced with synteny edges.
def walk(self) -> List[List[Union[ColoredEdge, SyntenyEdge]]]:
    ret = []
    all_edges = self.edges()
    term_edges = set()
    for ce in all_edges:
        if ce.has_term():
            term_edges.add(ce)
    # handle linear chromosomes
    while term_edges:
        ce = term_edges.pop()
        n = ce.non_term()
        all_edges.remove(ce)
        edges = []
        while True:
            se_n1 = n
            se_n2 = se_n1.swap_end()
            se = SyntenyEdge(se_n1, se_n2)
            edges += [ce, se]
            ce = self.by_node[se_n2]
            if ce.has_term():
                edges += [ce]
                term_edges.remove(ce)
                all_edges.remove(ce)
                break
            n = ce.other_end(se_n2)
            all_edges.remove(ce)
        ret.append(edges)
    # handle cyclic chromosomes
    while all_edges:
        start_ce = all_edges.pop()
        ce = start_ce
        n = ce.n1
        edges = []
        while True:
            se_n1 = n
            se_n2 = se_n1.swap_end()
            se = SyntenyEdge(se_n1, se_n2)
            edges += [ce, se]
            ce = self.by_node[se_n2]
            if ce == start_ce:
                break
            n = ce.other_end(se_n2)
            all_edges.remove(ce)
        ret.append(edges)
    return ret

Given the colored edges...

Synteny edges spliced in...

CE means colored edge / SE means synteny edge.

⚠️NOTE️️️⚠️

If you're confused at this point, don't continue. Go back and make sure you understand, because the next section builds on the above content.

PERMUTATION REPRESENTATION

A common textual representation of a breakpoint graph is writing out each of the two genomes as a set of lists. Each list, referred to as a permutation, describes one of the chromosomes in a genome.

To convert a chromosome within a breakpoint graph to a permutation, simply walk the edges for that chromosome...

Each synteny edge walked is appended to the list with a prefix of ...

For example, given the following breakpoint graph ...

Dot diagram

, ... walking the edges for the undesired genome (red) from node D_t in the ...

For circular chromosomes, the walk direction is irrelevant, meaning that both example permutations above represent the same chromosome. Likewise, the starting node is also irrelevant, meaning that the following permutations are all equivalent to the ones in the above example: [+C, +D], and [+D, +C].

For linear chromosomes, the walk direction is irrelevant but the walk must start from and end at the termination node (representing the ends of the chromosome). The termination nodes aren't included in the permutation.

In the example breakpoint graph above, the permutation set representing the undesired genome (red) may be written as either...

Likewise, the permutation set representing the desired genome (blue) in the example above may be written as either...

ch6_code/src/breakpoint_graph/Permutation.py (lines 158 to 196):

@staticmethod
def from_colored_edges(
        colored_edges: ColoredEdgeSet,
        start_n: SyntenyNode,
        cyclic: bool
) -> Tuple[Permutation, Set[ColoredEdge]]:
    # if not cyclic, it's expected that start_n is either from or to a term node
    if not cyclic:
        ce = colored_edges.find(start_n)
        assert ce.has_term(), "Start node must be for a terminal colored edge"
    # if cyclic stop once you detect a loop, otherwise  stop once you encounter a term node
    if cyclic:
        walked = set()
        def stop_test(x):
            ret = x in walked
            walked.add(next_n)
            return ret
    else:
        def stop_test(x):
            return x == TerminalNode.INST
    # begin loop
    blocks = []
    start_ce = colored_edges.find(start_n)
    walked_ce_set = {start_ce}
    next_n = start_n
    while not stop_test(next_n):
        if next_n.end == SyntenyEnd.HEAD:
            b = Block(Direction.FORWARD, next_n.id)
        elif next_n.end == SyntenyEnd.TAIL:
            b = Block(Direction.BACKWARD, next_n.id)
        else:
            raise ValueError('???')
        blocks.append(b)
        swapped_n = next_n.swap_end()
        next_ce = colored_edges.find(swapped_n)
        next_n = next_ce.other_end(swapped_n)
        walked_ce_set.add(next_ce)
    return Permutation(blocks, cyclic), walked_ce_set

Converting from a permutation set back to a breakpoint graph is basically just reversing the above process. For each permutation, slide a window of size two to determine the colored edges that permutation is for. The node chosen for the window element at index ...

  1. should be tail if sign is - or head if sign is +.
  2. should be head if sign is - or tail if sign is +.

For circular chromosomes, the sliding window is cyclic. For example, sliding the window over permutation [+A, +C, -B, +D] results in ...

For linear chromosomes, the sliding window is not cyclic and the chromosomes always start and end at the termination node. For example, the permutation [+A, +C, -B, +D] would actually be treated as [TERM, +A, +C, -B, +D, TERM], resulting in ...

ch6_code/src/breakpoint_graph/Permutation.py (lines 111 to 146):

def to_colored_edges(self) -> List[ColoredEdge]:
    ret = []
    # add link to dummy head if linear
    if not self.cyclic:
        b = self.blocks[0]
        ret.append(
            ColoredEdge(TerminalNode.INST, b.to_synteny_edge().n1)
        )
    # add normal edges
    for (b1, b2), idx in slide_window(self.blocks, 2, cyclic=self.cyclic):
        if b1.dir == Direction.BACKWARD and b2.dir == Direction.FORWARD:
            n1 = SyntenyNode(b1.id, SyntenyEnd.HEAD)
            n2 = SyntenyNode(b2.id, SyntenyEnd.HEAD)
        elif b1.dir == Direction.FORWARD and b2.dir == Direction.BACKWARD:
            n1 = SyntenyNode(b1.id, SyntenyEnd.TAIL)
            n2 = SyntenyNode(b2.id, SyntenyEnd.TAIL)
        elif b1.dir == Direction.FORWARD and b2.dir == Direction.FORWARD:
            n1 = SyntenyNode(b1.id, SyntenyEnd.TAIL)
            n2 = SyntenyNode(b2.id, SyntenyEnd.HEAD)
        elif b1.dir == Direction.BACKWARD and b2.dir == Direction.BACKWARD:
            n1 = SyntenyNode(b1.id, SyntenyEnd.HEAD)
            n2 = SyntenyNode(b2.id, SyntenyEnd.TAIL)
        else:
            raise ValueError('???')
        ret.append(
            ColoredEdge(n1, n2)
        )
    # add link to dummy tail if linear
    if not self.cyclic:
        b = self.blocks[-1]
        ret.append(
            ColoredEdge(b.to_synteny_edge().n2, TerminalNode.INST)
        )
    # return
    return ret

⚠️NOTE️️️⚠️

If you're confused at this point, don't continue. Go back and make sure you understand, because the next section builds on the above content.

TWO-BREAK ALGORITHM

Now that breakpoint graphs have been adequately described, the goal of this algorithm is to iteratively re-wire the red edges of a breakpoint graph such that they match its blue edges. At each step, the algorithm finds a pair of red edges that share nodes with a blue edge and re-wires those red edges such that one of them matches the blue edge.

For example, the two red edges highlighted below share the same nodes as a blue edge (D_h and C_t). These two red edges can be broken and re-wired such that one of them matches the blue edge...

Dot diagram

Each re-wiring operation is called a 2-break and represents either a chromosome fusion, chromosome fission, or reversal mutation (genome rearrangement). For example, ...

Genome rearrangement duplications and deletions aren't representable as 2-breaks. Genome rearrangement translocations can't be reliably represented as a single 2-break either. For example, the following translocation gets modeled as two 2-breaks, one that breaks the undesired chromosome (fission) and another that glues it back together (fusion)...

Kroki diagram output

Dot diagram

Dot diagram

ch6_code/src/breakpoint_graph/ColoredEdge.py (lines 46 to 86):

# Takes e1 and e2 and swaps the ends, such that one of the swapped edges becomes desired_e. That is, e1 should have
# an end matching one of desired_e's ends while e2 should have an end matching desired_e's other end.
#
# This is basically a 2-break.
@staticmethod
def swap_ends(
        e1: Optional[ColoredEdge],
        e2: Optional[ColoredEdge],
        desired_e: ColoredEdge
) -> Optional[ColoredEdge]:
    if e1 is None and e2 is None:
        raise ValueError('Both edges can\'t be None')
    if TerminalNode.INST in desired_e:
        # In this case, one of desired_e's ends is TERM (they can't both be TERM). That means either e1 or e2 will
        # be None because there's only one valid end (non-TERM end) to swap with.
        _e = next(filter(lambda x: x is not None, [e1, e2]), None)
        if _e is None:
            raise ValueError('If the desired edge has a terminal node, one of the edges must be None')
        if desired_e.non_term() not in {_e.n1, _e.n2}:
            raise ValueError('Unexpected edge node(s) encountered')
        if desired_e == _e:
            raise ValueError('Edge is already desired edge')
        other_n1 = _e.other_end(desired_e.non_term())
        other_n2 = TerminalNode.INST
        return ColoredEdge(other_n1, other_n2)
    else:
        # In this case, neither of desired_e's ends is TERM. That means both e1 and e2 will be NOT None.
        if desired_e in {e1, e2}:
            raise ValueError('Edge is already desired edge')
        if desired_e.n1 in e1 and desired_e.n2 in e2:
            other_n1 = e1.other_end(desired_e.n1)
            other_n2 = e2.other_end(desired_e.n2)
        elif desired_e.n1 in e2 and desired_e.n2 in e1:
            other_n1 = e2.other_end(desired_e.n1)
            other_n2 = e1.other_end(desired_e.n2)
        else:
            raise ValueError('Unexpected edge node(s) encountered')
        if {other_n1, other_n2} == {TerminalNode.INST}:  # if both term edges, there is no other edge
            return None
        return ColoredEdge(other_n1, other_n2)
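
The driver loop that repeatedly applies 2-breaks isn't shown above. The following is a minimal, self-contained sketch of that loop (not the repository's code), restricted to circular genomes (no TERM nodes) and representing each colored edge as a plain frozenset of the two synteny block ends it connects...

def two_break_sort(red_edges, blue_edges):
    # red_edges / blue_edges: sets of frozensets, each holding the two synteny block
    # ends that a colored edge connects. Circular genomes only, so every node has
    # exactly one red edge and one blue edge.
    red = set(red_edges)
    history = [set(red)]
    while True:
        # Find a blue edge that red doesn't have yet.
        target = next((b for b in blue_edges if b not in red), None)
        if target is None:
            break  # red now matches blue
        n1, n2 = tuple(target)
        # The red edges currently attached to the target's two ends.
        e1 = next(e for e in red if n1 in e)
        e2 = next(e for e in red if n2 in e)
        o1 = next(iter(e1 - {n1}))
        o2 = next(iter(e2 - {n2}))
        # 2-break: remove e1 and e2, then re-glue so one of the new edges matches target.
        red -= {e1, e2}
        red |= {frozenset({n1, n2}), frozenset({o1, o2})}
        history.append(set(red))
    return history


# e.g. red genome [+A, +B] (circular) vs blue genome [+A, -B] (circular)
red = {frozenset({'A_t', 'B_h'}), frozenset({'B_t', 'A_h'})}
blue = {frozenset({'A_t', 'B_t'}), frozenset({'B_h', 'A_h'})}
print(two_break_sort(red, blue))  # one re-wiring (a reversal of B) is enough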

Applying 2-breaks on circular genome until red_p_list=[['+A', '-B', '-C', '+D'], ['+E']] matches blue_p_list=[['+A', '+B', '-D'], ['-C', '-E']] (show_graph=True)...

Recall that the breakpoint graph is undirected. A permutation may have been walked in either direction (clockwise vs counter-clockwise) and there are multiple nodes to start walking from. If the output looks like it's going backwards, that's just as correct as if it looked like it's going forward.

Also, recall that a genome is represented as a set of permutations -- sets are not ordered.

⚠️NOTE️️️⚠️

It isn't discussed here, but the Pevzner book put an emphasis on calculating the parsimonious number of reversals (reversal distance) without having to go through and apply two-breaks in the breakpoint graph. The basic idea is to count the number of red-blue cycles in the graph.

For a cyclic breakpoint graph, a single red-blue cycle is when you pick a node, follow the blue edge, then the red edge, then the blue edge, then the red edge, ..., until you arrive back at the same node. If the blue and red genomes match perfectly, the number of red-blue cycles should equal the number of synteny blocks. Otherwise, you can calculate the number of reversals needed to make them match by subtracting the number of red-blue cycles from the number of synteny blocks.
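
As a rough illustration of the cyclic case, here's a hedged, self-contained sketch (not the book's or repository's code) that counts red-blue cycles and computes synteny blocks minus cycles, again representing each colored edge as a frozenset of the two synteny block ends it connects:

def count_red_blue_cycles(red_edges, blue_edges):
    # Cyclic breakpoint graphs only (circular genomes): every node has exactly one red
    # and one blue edge, so an alternating red/blue walk always closes into a cycle.
    red_adj, blue_adj = {}, {}
    for adj, edges in ((red_adj, red_edges), (blue_adj, blue_edges)):
        for e in edges:
            a, b = tuple(e)
            adj[a], adj[b] = b, a
    unvisited = set(red_adj)
    cycles = 0
    while unvisited:
        start = unvisited.pop()
        n, follow_red = start, True
        while True:
            n = red_adj[n] if follow_red else blue_adj[n]
            follow_red = not follow_red
            if n == start:
                break
            unvisited.discard(n)
        cycles += 1
    return cycles


def two_break_distance(red_edges, blue_edges):
    num_blocks = len({n for e in red_edges for n in e}) // 2  # 2 ends per synteny block
    return num_blocks - count_red_blue_cycles(red_edges, blue_edges)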

For a linear breakpoint graph, a single red-blue cycle isn't actually a cycle: pick the termination node, follow a blue edge, then a red edge, then a blue edge, then a red edge, ... until you arrive back at the termination node (what if there are actual cyclic red-blue loops as well, like in cyclic breakpoint graphs?). If the blue and red genomes match perfectly, the number of red-blue cycles should equal the number of synteny blocks + 1. Otherwise, you can ESTIMATE the number of reversals needed to make them match by subtracting the number of red-blue cycles from the number of synteny blocks + 1.

To calculate the real number of reversals needed for linear breakpoint graphs (not an estimate), there's a paper on ACM DL that goes over the algorithm. I glanced through it but I don't have the time / wherewithal to go through it. Maybe do it in the future.

UPDATE: Calculating the number of reversals quickly is important because the number of reversals can be used as a distance metric when computing a phylogenetic tree across a set of species (a tree that shows how closely a set of species are related / how they branched out). See distance matrix definition.

Phylogeny

↩PREREQUISITES↩

Phylogeny is the concept of inferring the evolutionary history of a set of biological entities (e.g. animal species, viruses, etc..) by inspecting properties of those entities for relatedness (e.g. phenotypic, genotypic, etc..).

Kroki diagram output

Evolutionary history is often displayed as a tree called a phylogenetic tree, where leaf nodes represent known entities and internal nodes represent inferred ancestor entities. The example above shows a phylogenetic tree for the species cat, lion, and bear based on phenotypic inspection. Cats and lions are inferred as descending from the same ancestor because both have deeply shared physical and behavioural characteristics (felines). Similarly, that feline ancestor and bears are inferred as descending from the same ancestor because all descendants walk on 4 legs.

The typical process for phylogeny is to first measure how related a set of entities are to each other, where each measure is referred to as a distance (e.g. dist(cat, lion) = 2), then work backwards to find a phylogenetic tree that fits / maps to those distances. The distance may be any metric so long as ...

⚠️NOTE️️️⚠️

The leapfrogging point may be confusing. All it's saying is that taking an indirect path between two species should produce a distance that's >= the direct path. For example, the direct path between cat and dog is 6: dist(cat, dog) = 6. If you were to instead jump from cat to lion (dist(cat, lion) = 2), then from lion to dog (dist(lion, dog) = 5), that combined distance should be >= 6...

dist(cat, dog)  = 6
dist(cat, lion) = 2
dist(lion, dog) = 5

dist(cat, lion) + dist(lion, dog) >= dist(cat, dog)
        2       +        5        >=       6
                7                 >=       6

The Pevzner book refers to this as the triangle inequality.

Kroki diagram output

Later on, non-conforming distance matrices called non-additive distance matrices are discussed. I don't know if non-additive distance matrices are required to have this specific property, but they should have all others.
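
To make those requirements concrete, here's a hedged sketch (hypothetical helper, not from the repository) that checks a plain dict of pairwise distances against the properties discussed in this section (self distance of 0, symmetry, positive distance between distinct entities, and the triangle inequality), using the hypothetical cat/lion/dog numbers above:

from itertools import product

def obeys_distance_metric_rules(dist: dict, entities: list) -> bool:
    for x in entities:
        if dist[x, x] != 0:
            return False  # self distance must be 0
    for x, y in product(entities, repeat=2):
        if x != y and (dist[x, y] != dist[y, x] or dist[x, y] <= 0):
            return False  # symmetric and > 0 between distinct entities
    for x, y, z in product(entities, repeat=3):
        if dist[x, y] + dist[y, z] < dist[x, z]:
            return False  # triangle inequality (leapfrogging can't be shorter)
    return True


dist = {
    ('cat', 'cat'): 0, ('lion', 'lion'): 0, ('dog', 'dog'): 0,
    ('cat', 'lion'): 2, ('lion', 'cat'): 2,
    ('cat', 'dog'): 6, ('dog', 'cat'): 6,
    ('lion', 'dog'): 5, ('dog', 'lion'): 5,
}
assert obeys_distance_metric_rules(dist, ['cat', 'lion', 'dog'])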

Examples of metrics that may be used as distance, referred to as distance metrics, include...

Distances for a set of entities are typically represented as a 2D matrix that contains all possible pairings, called a distance matrix. The distance matrix for the example Cat/Lion/Bear phylogenetic tree is ...

Cat Lion Bear
Cat 0 2 23
Lion 2 0 23
Bear 23 23 0

Kroki diagram output

Note how the distance matrix has the distance for each pair slotted twice, mirrored across the diagonal of 0s (self distances). For example, the distance between bear and lion is listed twice.

⚠️NOTE️️️⚠️

Just to make it explicit: The ultimate point of this section is to work backwards from a distance matrix to a phylogenetic tree (essentially the concept of phylogeny -- inferring evolutionary history of a set of known / present-day organisms based on how different they are).

⚠️NOTE️️️⚠️

The best way to move forward with this, assuming that you're brand new to it, is to first understand the following four subsections...

Then jump to the algorithm you want to learn (subsection) within Algorithms/Phylogeny/Distance Matrix to Tree and work from the prerequisites to the algorithm. Otherwise, all the sections in between come off as disjointed because they're building the intermediate knowledge required for the final algorithms.

Tree to Additive Distance Matrix

WHAT: Given a tree, the distance matrix generated from that tree is said to be an additive distance matrix.

WHY: The term additive distance matrix is derived from the fact that edge weights within the tree are being added together to generate the distances in the distance matrix. For example, in the following tree ...

Kroki diagram output

Cat Lion Bear
Cat 0 2 4
Lion 2 0 4
Bear 4 4 0

However, distance matrices aren't commonly generated from trees. Rather, they're generated by comparing present-day entities to each other to see how diverged they are (their distance from each other). There's no guarantee that a distance matrix generated from comparisons will be an additive distance matrix. That is, there must exist a tree with edge weights that satisfy that distance matrix for it to be an additive distance matrix (commonly referred to as a tree that fits the distance matrix).

In other words, while a...

ALGORITHM:

ch7_code/src/phylogeny/TreeToAdditiveDistanceMatrix.py (lines 39 to 69):

def find_path(g: Graph[N, ND, E, float], n1: N, n2: N) -> list[E]:
    if not g.has_node(n1) or not g.has_node(n2):
        raise ValueError('Node missing')
    if n1 == n2:
        return []
    queued_edges = list()
    for e in g.get_outputs(n1):
        queued_edges.append((n1, [e]))
    while len(queued_edges) > 0:
        ignore_n, e_list = queued_edges.pop()
        e_last = e_list[-1]
        active_n = [n for n in g.get_edge_ends(e_last) if n != ignore_n][0]
        if active_n == n2:
            return e_list
        children = set(g.get_outputs(active_n))
        children.remove(e_last)
        for child_e in children:
            child_ignore_n = active_n
            new_e_list = e_list[:] + [child_e]
            queued_edges.append((child_ignore_n, new_e_list))
    raise ValueError(f'No path from {n1} to {n2}')


def to_additive_distance_matrix(g: Graph[N, ND, E, float]) -> DistanceMatrix[N]:
    leaves = {n for n in g.get_nodes() if g.get_degree(n) == 1}
    dists = {}
    for l1, l2 in product(leaves, repeat=2):
        d = sum(g.get_edge_data(e) for e in find_path(g, l1, l2))
        dists[l1, l2] = d
    return DistanceMatrix(dists)

The tree...

Dot diagram

... produces the additive distance matrix ...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 22.0
v1 13.0 0.0 12.0 12.0 13.0 13.0
v2 21.0 12.0 0.0 20.0 21.0 21.0
v3 21.0 12.0 20.0 0.0 7.0 13.0
v4 22.0 13.0 21.0 7.0 0.0 14.0
v5 22.0 13.0 21.0 13.0 14.0 0.0

Tree to Simple Tree

↩PREREQUISITES↩

WHAT: Convert a tree into a simple tree. A simple tree is an unrooted tree where ...

The first point just means that the tree can't contain non-splitting internal nodes. By definition a tree's leaf nodes each have a degree of 1, and this restriction makes it so that each internal node must have a degree > 2 instead of >= 2...

Kroki diagram output

Kroki diagram output

In the context of phylogeny, a simple tree's ...

WHY: Simple trees have properties / restrictions that simplify the process of working backwards from a distance matrix to a tree. In other words, when constructing a tree from a distance matrix, the process is simpler if the tree is restricted to being a simple tree.

The first property is that a unique simple tree exists for a unique additive distance matrix (one-to-one mapping). That is, it isn't possible for...

For example, the following additive distance matrix will only ever map to the following simple tree (and vice-versa)...

w u y z
w 0 3 8 7
u 3 0 9 8
y 8 9 0 5
z 7 8 5 0

Kroki diagram output

However, that same additive distance matrix can map to an infinite number of non-simple trees (and vice-versa)...

Kroki diagram output

⚠️NOTE️️️⚠️

To clarify: This property / restriction is important because when reconstructing a tree from the distance matrix, if you restrict yourself to a simple tree you'll only ever have 1 tree to reconstruct to. This makes the algorithms simpler. This is discussed further in the cardinality subsection.

The second property is that the direction of evolution isn't maintained in a simple tree: It's an unrooted tree with undirected edges. This is a useful property because, while a distance matrix may provide enough information to infer common ancestry, it doesn't provide enough information to know the true parent-child relationships between those ancestors. For example, any of the internal nodes in the following simple tree may be the top-level entity that all other entities are descendants of ...

Kroki diagram output

The third property is that weights must be > 0, which is because of the restriction on distance metrics specified in the parent section: The distance between any two entities must be > 0. That is, it doesn't make sense for the distance between two entities to be ...

Kroki diagram output

ALGORITHM:

The following examples show various real evolutionary paths and their corresponding simple trees. Note how the simple trees neither fully represent the true lineage nor the direction of evolution (simple trees are unrooted and undirected).

Kroki diagram output

In the first two examples, one present-day entity branched off from another present-day entity. Both entities are still present-day entities (the entity it branched off from isn't extinct).

In the fifth example, parent1 split into the present-day entities entity1 and entity3, then entity2 branched off entity1. All three entities are present-day entities (neither entity1, entity2, nor entity3 is extinct).

In the third and last two examples, the top-level parent doesn't show up because adding it would break the requirement that internal nodes must be splitting (degree > 2). For example, adding parent1 into the simple tree of the last example above causes parent1 to have a degree of 2...

Kroki diagram output

The following algorithm removes nodes of degree = 2, merging each such node's two edges together. This makes it so every internal node has a degree of > 2...

ch7_code/src/phylogeny/TreeToSimpleTree.py (lines 88 to 105):

def merge_nodes_of_degree2(g: Graph[N, ND, E, float]) -> None:
    # Can be made more efficient by not having to re-collect bad nodes each
    # iteration. Kept it like this so it's simple to understand what's going on.
    while True:
        bad_nodes = {n for n in g.get_nodes() if g.get_degree(n) == 2}
        if len(bad_nodes) == 0:
            return
        bad_n = bad_nodes.pop()
        bad_e1, bad_e2 = tuple(g.get_outputs(bad_n))
        e_id = bad_e1 + bad_e2
        e_n1 = [n for n in g.get_edge_ends(bad_e1) if n != bad_n][0]
        e_n2 = [n for n in g.get_edge_ends(bad_e2) if n != bad_n][0]
        e_weight = g.get_edge_data(bad_e1) + g.get_edge_data(bad_e2)
        g.insert_edge(e_id, e_n1, e_n2, e_weight)
        g.delete_edge(bad_e1)
        g.delete_edge(bad_e2)
        g.delete_node(bad_n)

The tree...

Dot diagram

... simplifies to ...

Dot diagram

The following algorithm tests a tree to see if it meets the requirements of being a simple tree...

ch7_code/src/phylogeny/TreeToSimpleTree.py (lines 36 to 83):

def is_tree(g: Graph[N, ND, E, float]) -> bool:
    # Check for cycles
    if len(g) == 0:
        return False
    walked_edges = set()
    walked_nodes = set()
    queued_edges = set()
    start_n = next(g.get_nodes())
    for e in g.get_outputs(start_n):
        queued_edges.add((start_n, e))
    while len(queued_edges) > 0:
        ignore_n, e = queued_edges.pop()
        active_n = [n for n in g.get_edge_ends(e) if n != ignore_n][0]
        walked_edges.add(e)
        walked_nodes.update({ignore_n, active_n})
        children = set(g.get_outputs(active_n))
        children.remove(e)
        for child_e in children:
            if child_e in walked_edges:
                return False  # cyclic -- edge already walked
            child_ignore_n = active_n
            queued_edges.add((child_ignore_n, child_e))
    # Check for disconnected graph
    if len(walked_nodes) != len(g):
        return False  # disconnected -- some nodes not reachable
    return True


def is_simple_tree(g: Graph[N, ND, E, float]) -> bool:
    # Check if tree
    if not is_tree(g):
        return False
    # Test degrees
    for n in g.get_nodes():
        # Degree == 0 shouldn't exist if tree
        # Degree == 1 is leaf node
        # Degree == 2 is a non-splitting internal node (NOT ALLOWED)
        # Degree >= 3 is splitting internal node
        degree = g.get_degree(n)
        if degree == 2:
            return False
    # Test weights
    for e in g.get_edges():
        # No non-positive weights
        weight = g.get_edge_data(e)
        if weight <= 0:
            return False
    return True

The tree...

Dot diagram

... is NOT a simple tree

Additive Distance Matrix Cardinality

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

This was discussed briefly in the simple tree section, but it's being discussed here in its own section because it's important.

WHAT: Determine the cardinality between an additive distance matrix and a type of tree. For ...

WHY: Non-simple trees are essentially derived from simple trees by splicing nodes in between edges (breaking up an edge into multiple edges). For example, any of the following non-simple trees...

Kroki diagram output

... will collapse to the following simple tree (edges connected by nodes of degree 2 merged by adding weights) ...

Kroki diagram output

All of the trees above, both the non-simple trees and the simple tree, will generate the following additive distance matrix ...

Cat Lion Bear
Cat 0 2 4
Lion 2 0 3
Bear 4 3 0

Similarly, this additive distance matrix will only ever map to the simple tree shown above or one of its many non-simple tree derivatives (3 of which are shown above). There is no other simple tree that this additive distance matrix can map to / no other simple tree that will generate this distance matrix. In other words, it isn't possible for...

Working backwards from a distance matrix to a tree is less complex when limiting the tree to a simple tree, because there's only one simple tree possible (vs many non-simple trees).

ALGORITHM:

This section is more of a concept than an algorithm. The following just generates an additive distance matrix from a tree and says if that tree is unique to that additive distance matrix (it should be if it's a simple tree). There's no new code to show because it's just calling things from previous sections (generating an additive distance matrix and checking if it's a simple tree).

ch7_code/src/phylogeny/CardinalityTest.py (lines 15 to 19):

def cardinality_test(g: Graph[N, ND, E, float]) -> tuple[DistanceMatrix[N], bool]:
    return (
        to_additive_distance_matrix(g),
        is_simple_tree(g)
    )

The tree...

Dot diagram

... produces the additive distance matrix ...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 22.0
v1 13.0 0.0 12.0 12.0 13.0 13.0
v2 21.0 12.0 0.0 20.0 21.0 21.0
v3 21.0 12.0 20.0 0.0 7.0 13.0
v4 22.0 13.0 21.0 7.0 0.0 14.0
v5 22.0 13.0 21.0 13.0 14.0 0.0

The tree is simple. This is the ONLY simple tree possible for this additive distance matrix and vice-versa.

Test Additive Distance Matrix

↩PREREQUISITES↩

WHAT: Determine if a distance matrix is an additive distance matrix.

WHY: Knowing if a distance matrix is additive helps determine how the tree for that distance matrix should be constructed. For example, since it's impossible for a non-additive distance matrix to fit a tree, different algorithms are needed to approximate a tree that somewhat fits.

ALGORITHM:

This algorithm, called the four point condition algorithm, tests pairs within each quartet of leaf nodes to ensure that they meet a certain set of conditions. For example, the following tree has the quartet of leaf nodes (v0, v2, v4, v6) ...

Dot diagram

A quartet makes up 3 different pair combinations (pairs of pairs). For example, the example quartet above has the 3 pair combinations ...

⚠️NOTE️️️⚠️

Order of the pairing doesn't matter at either level. For example, ((v0, v2), (v4, v6)) and ((v6, v4), (v2, v0)) are the same. That's why there are only 3.

Of these 3 pair combinations, the test checks to see that ...

  1. the sum of distances for one is == the sum of distances for another.
  2. the sum of distances for the remaining is <= the sums from the point above.

In a tree with edge weights >= 0, every leaf node quartet will pass this test. For example, for leaf node quartet (v0, v2, v4, v6) highlighted in the example tree above ...

Dot diagram

dist(v0,v2) + dist(v4,v6) <= dist(v0,v6) + dist(v2,v4) == dist(v0,v4) + dist(v2,v6)

Note how the same set of edges is highlighted in the first two diagrams (same distance contributions) while the third diagram has fewer edges highlighted (missing some distance contributions). This is where the inequality comes from.

⚠️NOTE️️️⚠️

I'm almost certain this inequality should be < instead of <=, because in a phylogenetic tree you can't have an edge weight of 0, right? An edge weight of 0 would indicate that the nodes at each end of an edge are the same entity.

All of the information required for the above calculation is available in the distance matrix...

ch7_code/src/phylogeny/FourPointCondition.py (lines 21 to 47):

def four_point_test(dm: DistanceMatrix[N], l0: N, l1: N, l2: N, l3: N) -> bool:
    # Pairs of leaf node pairs
    pair_combos = (
        ((l0, l1), (l2, l3)),
        ((l0, l2), (l1, l3)),
        ((l0, l3), (l1, l2))
    )
    # Different orders to test pair_combos to see if they match conditions
    test_orders = (
        (0, 1, 2),
        (0, 2, 1),
        (1, 0, 2),
        (1, 2, 0),
        (2, 0, 1),
        (2, 1, 0)
    )
    # Find at least one order of pair combos that passes the test
    for p1_idx, p2_idx, p3_idx in test_orders:
        p1_1, p1_2 = pair_combos[p1_idx]
        p2_1, p2_2 = pair_combos[p2_idx]
        p3_1, p3_2 = pair_combos[p3_idx]
        s1 = dm[p1_1] + dm[p1_2]
        s2 = dm[p2_1] + dm[p2_2]
        s3 = dm[p3_1] + dm[p3_2]
        if s1 <= s2 == s3:
            return True
    return False

If a distance matrix was derived from a tree / fits a tree, its leaf node quartets will also pass this test. That is, if all leaf node quartets in a distance matrix pass the above test, the distance matrix is an additive distance matrix ...

ch7_code/src/phylogeny/FourPointCondition.py (lines 52 to 64):

def is_additive(dm: DistanceMatrix[N]) -> bool:
    # Recall that a distance matrix of size <= 3 is guaranteed to be an additive distance
    # matrix (try it and see -- any distances you use will always end up fitting a tree). That's why
    # you need at least 4 leaf nodes to test.
    if dm.n < 4:
        return True
    leaves = dm.leaf_ids()
    for quartet in combinations(leaves, r=4):
        passed = four_point_test(dm, *quartet)
        if not passed:
            return False
    return True

The distance matrix...

v0 v1 v2 v3
v0 0.0 3.0 8.0 7.0
v1 3.0 0.0 9.0 8.0
v2 8.0 9.0 0.0 5.0
v3 7.0 8.0 5.0 0.0

... is additive.

⚠️NOTE️️️⚠️

Could the differences found by this algorithm help determine how "close" a distance matrix is to being an additive distance matrix?

Find Limb Length

↩PREREQUISITES↩

WHAT: Given an additive distance matrix, there exists a unique simple tree that fits that matrix. Compute the limb length of any leaf node in that simple tree just from the additive distance matrix.

WHY: This is one of the operations required to construct the unique simple tree for an additive distance matrix.

ALGORITHM:

To conceptualize how this algorithm works, consider the following simple tree and its corresponding additive distance matrix...

Dot diagram

v0 v1 v2 v3 v4 v5 v6
v0 0 13 19 20 29 40 36
v1 13 0 10 11 20 31 27
v2 19 10 0 11 20 31 27
v3 20 11 11 0 21 32 28
v4 29 20 20 21 0 17 13
v5 40 31 31 32 17 0 6
v6 36 27 27 28 13 6 0

In this simple tree, consider a path between leaf nodes that travels over v2's parent (v2 itself excluded). For example, path(v1,v5) travels over v2's parent...

Dot diagram

Now, consider the paths between each of the two nodes in the path above (v1 and v5) and v2: path(v1,v2) + path(v2,v5) ...

Dot diagram

Notice how the edges highlighted between path(v1,v5) and path(v1,v2) + path(v2,v5) would be the same had it not been for the two highlights on v2's limb. Adding 2 * path(v2,i1) to path(v1,v5) makes it so that each edge is highlighted an equal number of times ...

Dot diagram

path(v1,v2) + path(v2,v5) = path(v1,v5) + 2 * path(v2,i1)

Contrast the above to what happens when the pair of leaf nodes selected DOESN'T travel through v2's parent. For example, path(v4,v5) doesn't travel through v2's parent ...

Dot diagram

path(v4,v2) + path(v2,v5) > path(v4,v5) + 2 * path(v2,i1)

Even when path(v4,v5) includes 2 * path(v2,i1), fewer edges are highlighted when compared to path(v4,v2) + path(v2,v5). Specifically, edge(i1,i2) is highlighted zero times vs two times.

The above two examples give way to the following two formulas: Given a simple tree with distinct leaf nodes {L, A, B} and L's parent Lp ...

These two formulas work just as well with distances instead of paths...

The reason distances work has to do with the fact that simple trees require edge weights of > 0, meaning traversing over an edge always increases the overall distance. If ...

⚠️NOTE️️️⚠️

The Pevzner book has the 2nd formula above as >= instead of >.

I'm assuming they did this because they're letting edge weights be >= 0 instead of > 0, which doesn't make sense because an edge with a weight of 0 means the same entity exists on both ends of the edge. If an edge weight is 0, it'll contribute nothing to the distance, meaning that more edges being highlighted doesn't necessarily mean a larger distance.

In the above formulas, L's limb length is represented as dist(L,Lp). Except for dist(L,Lp), all distances in the formulas are between leaf nodes and as such are found in the distance matrix. Therefore, the formulas need to be rearranged to isolate dist(L,Lp) in order to derive L's limb length ...

Notice that the left-hand side of both solved formulas is the same: (dist(L,A) + dist(L,B) - dist(A,B)) / 2

The algorithm for finding limb length is essentially an exhaustive test. Of all leaf node pairs (L not included), the smallest left-hand side result produced is guaranteed to be L's limb length. Anything larger will include weights from more edges than just L's limb.
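
As a hedged sketch of that exhaustive approach (assuming the same DistanceMatrix interface used by the repository code later in this section, e.g. dm.leaf_ids() and dm[a, b]), it might look something like...

from itertools import combinations

def find_limb_length_exhaustive(dm, l):
    # Try every pair of leaf nodes other than l -- the smallest left-hand side value is
    # guaranteed to be l's limb length (anything larger includes weights from edges
    # beyond l's limb). O(n^2) for an n x n distance matrix.
    others = [n for n in dm.leaf_ids() if n != l]
    return min(
        (dm[l, a] + dm[l, b] - dm[a, b]) / 2
        for a, b in combinations(others, r=2)
    )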

⚠️NOTE️️️⚠️

From the book:

Exercise Break: The algorithm proposed on the previous step computes LimbLength(j) in O(n²) time (for an n x n distance matrix). Design an algorithm that computes LimbLength(j) in O(n) time.

The answer to this is obvious now that I've gone through and reasoned about things above.

For the limb length formula to work, you need to find leaf nodes (A, B) whose path travels through leaf node L's parent (Lp). Originally, the book had you try all combinations of leaf nodes (L excluded) and take the minimum. That works, but you don't need to try all possible pairs. Instead, you can just pick any leaf (that isn't L) for A and test against every other node (that isn't L) to find B -- as with the original method, you pick the B that produces the minimum value.

Because a phylogenetic tree is a connected graph (a path exists between each node and all other nodes), at least 1 path will exist starting from A that travels through Lp.

leaf_nodes.remove(L)  # remove L from the set
A = leaf_nodes.pop()  # removes and returns an arbitrary leaf node
B = min(leaf_nodes, key=lambda x: (dist(L, A) + dist(L, x) - dist(A, x)) / 2)

For example, imagine that you're trying to find v2's limb length in the following graph...

Dot diagram

Pick v4 as your A node, then try the formula with every other leaf node as B (except v2 because that's the node you're trying to get limb length for + v4 because that's your A node). At least one of path(A, B)'s will cross through v2's parent. Take the minimum, just as you did when you were trying every possible node pair across all leaf nodes in the graph.

ch7_code/src/phylogeny/FindLimbLength.py (lines 22 to 28):

def find_limb_length(dm: DistanceMatrix[N], l: N) -> float:
    leaf_nodes = dm.leaf_ids()
    leaf_nodes.remove(l)
    a = leaf_nodes.pop()
    b = min(leaf_nodes, key=lambda x: (dm[l, a] + dm[l, x] - dm[a, x]) / 2)
    return (dm[l, a] + dm[l, b] - dm[a, b]) / 2

Given the additive distance matrix...

v0 v1 v2 v3 v4 v5 v6
v0 0.0 13.0 19.0 20.0 29.0 40.0 36.0
v1 13.0 0.0 10.0 11.0 20.0 31.0 27.0
v2 19.0 10.0 0.0 11.0 20.0 31.0 27.0
v3 20.0 11.0 11.0 0.0 21.0 32.0 28.0
v4 29.0 20.0 20.0 21.0 0.0 17.0 13.0
v5 40.0 31.0 31.0 32.0 17.0 0.0 6.0
v6 36.0 27.0 27.0 28.0 13.0 6.0 0.0

The limb for leaf node v2 in its unique simple tree has a weight of 5.0

Test Same Subtree

↩PREREQUISITES↩

WHAT: Splitting a simple tree on the parent of one of its leaf nodes breaks it up into several subtrees. For example, the following simple tree has been split on v2's parent, resulting in 4 different subtrees ...

Dot diagram

Given just the additive distance matrix for a simple tree (not the simple tree itself), determine if two leaf nodes belong to the same subtree had that simple tree been split on some leaf node's parent.

WHY: This is one of the operations required to construct the unique simple tree for an additive distance matrix.

ALGORITHM:

The algorithm is essentially the formulas from the limb length algorithm. Recall that those formulas are ...

To conceptualize how this algorithm works, consider the following simple tree and its corresponding additive distance matrix...

Dot diagram

v0 v1 v2 v3 v4 v5 v6
v0 0 13 19 20 29 40 36
v1 13 0 10 11 20 31 27
v2 19 10 0 11 20 31 27
v3 20 11 11 0 21 32 28
v4 29 20 20 21 0 17 13
v5 40 31 31 32 17 0 6
v6 36 27 27 28 13 6 0

Consider what happens when you break the edges on v2's parent (i1). The tree breaks into 4 distinct subtrees (colored below as green, yellow, pink, and cyan)...

Dot diagram

If the two leaf nodes chosen are ...

ch7_code/src/phylogeny/SubtreeDetect.py (lines 23 to 32):

def is_same_subtree(dm: DistanceMatrix[N], l: N, a: N, b: N) -> bool:
    l_weight = find_limb_length(dm, l)
    test_res = (dm[l, a] + dm[l, b] - dm[a, b]) / 2
    if test_res == l_weight:
        return False
    elif test_res > l_weight:
        return True
    else:
        raise ValueError('???')  # not additive distance matrix?

Given the additive distance matrix...

v0 v1 v2 v3 v4 v5 v6
v0 0.0 13.0 19.0 20.0 29.0 40.0 36.0
v1 13.0 0.0 10.0 11.0 20.0 31.0 27.0
v2 19.0 10.0 0.0 11.0 20.0 31.0 27.0
v3 20.0 11.0 11.0 0.0 21.0 32.0 28.0
v4 29.0 20.0 20.0 21.0 0.0 17.0 13.0
v5 40.0 31.0 31.0 32.0 17.0 0.0 6.0
v6 36.0 27.0 27.0 28.0 13.0 6.0 0.0

Had the tree been split on leaf node v2's parent, leaf nodes v1 and v5 would reside in different subtrees.

Trim

↩PREREQUISITES↩

WHAT: Remove a limb from an additive distance matrix, just as it would get removed from its corresponding unique simple tree.

WHY: This is one of the operations required to construct the unique simple tree for an additive distance matrix.

ALGORITHM:

Recall that for any additive distance matrix, there exists a unique simple tree that fits that matrix. For example, the following simple tree is unique to the following distance matrix...

Dot diagram

v0 v1 v2 v3
v0 0 13 21 22
v1 13 0 12 13
v2 21 12 0 13
v3 22 13 13 0

Trimming v2 off that simple tree would result in ...

Dot diagram

v0 v1 v3
v0 0 13 22
v1 13 0 13
v3 22 13 0

Notice how when v2 gets trimmed off, the ...

As such, removing the row and column for some leaf node in an additive distance matrix is equivalent to removing its limb from the corresponding unique simple tree then merging together any edges connected by nodes of degree 2.

ch7_code/src/phylogeny/Trimmer.py (lines 26 to 37):

def trim_distance_matrix(dm: DistanceMatrix[N], leaf: N) -> None:
    dm.delete(leaf)  # remove row+col for leaf


def trim_tree(tree: Graph[N, ND, E, float], leaf: N) -> None:
    if tree.get_degree(leaf) != 1:
        raise ValueError('Not a leaf node')
    edge = next(tree.get_outputs(leaf))
    tree.delete_edge(edge)
    tree.delete_node(leaf)
    merge_nodes_of_degree2(tree)  # make sure its a simple tree

Given the additive distance matrix...

v0 v1 v2 v3
v0 0.0 13.0 21.0 22.0
v1 13.0 0.0 12.0 13.0
v2 21.0 12.0 0.0 13.0
v3 22.0 13.0 13.0 0.0

... trimming leaf node v2 results in ...

v0 v1 v3
v0 0.0 13.0 22.0
v1 13.0 0.0 13.0
v3 22.0 13.0 0.0

Bald

↩PREREQUISITES↩

WHAT: Set a limb length to 0 in an additive distance matrix, just as it would be set to 0 in its corresponding unique simple tree. Technically, a simple tree can't have edge weights that are <= 0. This is a special case, typically used as an intermediate operation of some larger algorithm.

WHY: This is one of the operations required to construct the unique simple tree for an additive distance matrix.

ALGORITHM:

Recall that for any additive distance matrix, there exists a unique simple tree that fits that matrix. For example, the following simple tree is unique to the following distance matrix...

Dot diagram

v0 v1 v2 v3 v4 v5
v0 0 13 21 21 22 22
v1 13 0 12 12 13 13
v2 21 12 0 20 21 21
v3 21 12 20 0 7 13
v4 22 13 21 7 0 14
v5 22 13 21 13 14 0

Setting v5's limb length to 0 (balding v5) would result in ...

Dot diagram

v0 v1 v2 v3 v4 v5
v0 0 13 21 21 22 15
v1 13 0 12 12 13 6
v2 21 12 0 20 21 14
v3 21 12 20 0 7 6
v4 22 13 21 7 0 7
v5 15 6 14 6 7 0

⚠️NOTE️️️⚠️

Can a limb length be 0 in a simple tree? I don't think so, but the book seems to imply that it's possible. But, if the distance between the two nodes on an edge is 0, wouldn't that make them the same organism? Maybe this is just a temporary thing for this algorithm.

Notice how of the two distance matrices, all distances are the same except for v5's distances. Each v5 distance in the balded distance matrix is equivalent to the corresponding distance in the original distance matrix subtracted by v5's original limb length...

v0 v1 v2 v3 v4 v5
v0 0 13 21 21 22 22 - 7 = 15
v1 13 0 12 12 13 13 - 7 = 6
v2 21 12 0 20 21 21 - 7 = 14
v3 21 12 20 0 7 13 - 7 = 6
v4 22 13 21 7 0 14 - 7 = 7
v5 22 - 7 = 15 13 - 7 = 6 21 - 7 = 14 13 - 7 = 6 14 - 7 = 7 0

Whereas v5 was originally contributing 7 to distances, after balding it contributes 0.

As such, subtracting some leaf node's limb length from its distances in an additive distance matrix is equivalent to balding that leaf node's limb in its corresponding simple tree.

ch7_code/src/phylogeny/Balder.py (lines 25 to 38):

def bald_distance_matrix(dm: DistanceMatrix[N], leaf: N) -> None:
    limb_len = find_limb_length(dm, leaf)
    for n in dm.leaf_ids_it():
        if n == leaf:
            continue
        dm[leaf, n] -= limb_len


def bald_tree(tree: Graph[N, ND, E, float], leaf: N) -> None:
    if tree.get_degree(leaf) != 1:
        raise ValueError('Not a leaf node')
    limb = next(tree.get_outputs(leaf))
    tree.update_edge_data(limb, 0.0)

Given the additive distance matrix...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 22.0
v1 13.0 0.0 12.0 12.0 13.0 13.0
v2 21.0 12.0 0.0 20.0 21.0 21.0
v3 21.0 12.0 20.0 0.0 7.0 13.0
v4 22.0 13.0 21.0 7.0 0.0 14.0
v5 22.0 13.0 21.0 13.0 14.0 0.0

... balding leaf node v5 results in ...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 15.0
v1 13.0 0.0 12.0 12.0 13.0 6.0
v2 21.0 12.0 0.0 20.0 21.0 14.0
v3 21.0 12.0 20.0 0.0 7.0 6.0
v4 22.0 13.0 21.0 7.0 0.0 7.0
v5 15.0 6.0 14.0 6.0 7.0 0.0

Un-trim Tree

↩PREREQUISITES↩

WHAT: Given an ...

  1. additive distance matrix for simple tree T
  2. simple tree T with limb L trimmed off

... this algorithm determines where limb L should be added in the given simple tree such that it fits the additive distance matrix. For example, the following simple tree would map to the following additive distance matrix had v2's limb branched out from some specific location...

Dot diagram

v0 v1 v2 v3
v0 0 13 21 22
v1 13 0 12 13
v2 21 12 0 13
v3 22 13 13 0

That specific location is what this algorithm determines. It could be that v2's limb needs to branch from either ...

⚠️NOTE️️️⚠️

Attaching a new limb to an existing leaf node is never possible because...

  1. it'll turn that existing leaf node to an internal node, which doesn't make sense because in the context of phylogenetic trees leaf nodes identify known entities.
  2. it will cease to be a simple tree -- simple trees can't have nodes of degree 2 (train of edges not allowed).

WHY: This is one of the operations required to construct the unique simple tree for an additive distance matrix.

ALGORITHM:

The simple tree below would fit the additive distance matrix below had v5's limb been added to it somewhere ...

Dot diagram

v0 v1 v2 v3 v4 v5
v0 0 13 21 21 22 22
v1 13 0 12 12 13 13
v2 21 12 0 20 21 21
v3 21 12 20 0 7 13
v4 22 13 21 7 0 14
v5 22 13 21 13 14 0

There's enough information available in this additive distance matrix to determine ...

⚠️NOTE️️️⚠️

Recall that the same subtree algorithm says that the path between two leaf nodes in DIFFERENT subtrees is guaranteed to travel over v5's parent.

The key to this algorithm is figuring out where along that path (v0 to v3) v5's limb (limb length of 7) should be injected. Imagine that you already had the answer in front of you: v5's limb should be added 4 units from i0 towards i2 ...

Dot diagram

Consider the answer above with v5's limb balded...

Dot diagram

v0 v1 v2 v3 v4 v5
v0 0 13 21 21 22 22 - 7 = 15
v1 13 0 12 12 13 13 - 7 = 6
v2 21 12 0 20 21 21 - 7 = 14
v3 21 12 20 0 7 13 - 7 = 6
v4 22 13 21 7 0 14 - 7 = 7
v5 22 - 7 = 15 13 - 7 = 6 21 - 7 = 14 13 - 7 = 6 14 - 7 = 7 0

Since v5's limb length is 0, it doesn't contribute to the distance of any path to / from v5. As such, the distance of any path to / from v5 is actually the distance to / from its parent. For example, ...

Dot diagram

Essentially, the balded distance matrix is enough to tell you that the path from v0 to v5's parent has a distance of 15. The balded tree itself isn't required.

def find_pair_traveling_thru_leaf_parent(dist_mat: DistanceMatrix[N], leaf_node: N) -> tuple[N, N]:
    leaf_set = dist_mat.leaf_ids() - {leaf_node}
    for l1, l2 in product(leaf_set, repeat=2):
        if not is_same_subtree(dist_mat, leaf_node, l1, l2):
            return l1, l2
    raise ValueError('Not found')


def find_distance_to_leaf_parent(dist_mat: DistanceMatrix[N], from_leaf_node: N, to_leaf_node: N) -> float:
    balded_dist_mat = dist_mat.copy()
    bald_distance_matrix(balded_dist_mat, to_leaf_node)
    return balded_dist_mat[from_leaf_node, to_leaf_node]

In the original simple tree, walking a distance of 15 on the path from v0 to v3 takes you to where v5's parent should be. Since there is no internal node there, one is first added by breaking the edge before attaching v5's limb to it ...

Dot diagram

Had there been an internal node already there, the limb would get attached to that existing internal node.

def walk_until_distance(
        tree: Graph[N, ND, E, float],
        n_start: N,
        n_end: N,
        dist: float
) -> Union[
    tuple[Literal['NODE'], N],
    tuple[Literal['EDGE'], E, N, N, float, float]
]:
    path = find_path(tree, n_start, n_end)
    last_edge_end = n_start
    dist_walked = 0.0
    for edge in path:
        ends = tree.get_edge_ends(edge)
        n1 = last_edge_end
        n2 = next(n for n in ends if n != last_edge_end)
        weight = tree.get_edge_data(edge)
        dist_walked_with_weight = dist_walked + weight
        if dist_walked_with_weight > dist:
            return 'EDGE', edge, n1, n2, dist_walked, weight
        elif dist_walked_with_weight == dist:
            return 'NODE', n2
        dist_walked = dist_walked_with_weight
        last_edge_end = n2
    raise ValueError('Bad inputs')

ch7_code/src/phylogeny/UntrimTree.py (lines 110 to 148):

def untrim_tree(
        dist_mat: DistanceMatrix[N],
        trimmed_tree: Graph[N, ND, E, float],
        gen_node_id: Callable[[], N],
        gen_edge_id: Callable[[], E]
) -> None:
    # Which node was trimmed?
    n_trimmed = find_trimmed_leaf(dist_mat, trimmed_tree)
    # Find a pair whose path that goes through the trimmed node's parent
    n_start, n_end = find_pair_traveling_thru_leaf_parent(dist_mat, n_trimmed)
    # What's the distance from n_start to the trimmed node's parent?
    parent_dist = find_distance_to_leaf_parent(dist_mat, n_start, n_trimmed)
    # Walk the path from n_start to n_end, stopping once walk dist reaches parent_dist (where trimmed node's parent is)
    res = walk_until_distance(trimmed_tree, n_start, n_end, parent_dist)
    stopped_on = res[0]
    if stopped_on == 'NODE':
        # It stopped on an existing internal node -- the limb should be added to this node
        parent_n = res[1]
    elif stopped_on == 'EDGE':
        # It stopped on an edge -- a new internal node should be injected to break the edge, then the limb should extend
        # from that node.
        edge, n1, n2, walked_dist, edge_weight = res[1:]
        parent_n = gen_node_id()
        trimmed_tree.insert_node(parent_n)
        n1_to_parent_id = gen_edge_id()
        n1_to_parent_weight = parent_dist - walked_dist
        trimmed_tree.insert_edge(n1_to_parent_id, n1, parent_n, n1_to_parent_weight)
        parent_to_n2_id = gen_edge_id()
        parent_to_n2_weight = edge_weight - n1_to_parent_weight
        trimmed_tree.insert_edge(parent_to_n2_id, parent_n, n2, parent_to_n2_weight)
        trimmed_tree.delete_edge(edge)
    else:
        raise ValueError('???')
    # Add the limb
    limb_e = gen_edge_id()
    limb_len = find_limb_length(dist_mat, n_trimmed)
    trimmed_tree.insert_node(n_trimmed)
    trimmed_tree.insert_edge(limb_e, parent_n, n_trimmed, limb_len)

Given the additive distance matrix for simple tree T...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 22.0
v1 13.0 0.0 12.0 12.0 13.0 13.0
v2 21.0 12.0 0.0 20.0 21.0 21.0
v3 21.0 12.0 20.0 0.0 7.0 13.0
v4 22.0 13.0 21.0 7.0 0.0 14.0
v5 22.0 13.0 21.0 13.0 14.0 0.0

... and simple tree trim(T, v5)...

Dot diagram

... , v5 is injected at the appropriate location to become simple tree T (un-trimmed) ...

Dot diagram

Find Neighbours

↩PREREQUISITES↩

WHAT: Given a distance matrix, if the distance matrix is ...

WHY: This operation is required for approximating a simple tree for a non-additive distance matrix.

ALGORITHM:

The algorithm essentially boils down to edge counting. Consider the following example simple tree...

Dot diagram

If you were to choose a leaf node, then gather the paths from that leaf node to all other leaf nodes, the limb for ...

def edge_count(self, l1: N) -> Counter[E]:
    # Collect paths from l1 to all other leaf nodes
    path_collection = []
    for l2 in self.leaf_nodes:
        if l1 == l2:
            continue
        path = self.path(l1, l2)
        path_collection.append(path)
    # Count edges across all paths
    edge_counts = Counter()
    for path in path_collection:
        edge_counts.update(path)
    # Return edge counts
    return edge_counts

For example, given that the tree has 6 leaf nodes, edge_count(v1) counts v1's limb 5 times while all other limbs are counted once...

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1

If you were to choose a pair of leaf nodes and add their edge_count()s together, the limb for ...

def combine_edge_count(self, l1: N, l2: N) -> Counter[E]:
    c1 = self.edge_count(l1)
    c2 = self.edge_count(l2)
    return c1 + c2

For example, combine_edge_count(v1,v2) counts v1's limb 6 times, v2's limb 6 times, and every other limb 2 times ...

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1
edge_count(v2) 3 2 1 1 5 1 1 1
------- ------- ------- ------- ------- ------- ------- -------
6 4 2 6 6 2 2 2

The key to this algorithm is to normalize limb counts returned by combine_counts() such that each chosen limb's count equals to each non-chosen limb's count. That is, each chosen limb count needs to be reduced from leaf_count to 2.

To do this, each edge in the path between the chosen pair must be subtracted leaf_count - 2 times from combine_edge_count()'s result.

def combine_edge_count_and_normalize(self, l1: N, l2: N) -> Counter[E]:
    edge_counts = self.combine_edge_count(l1, l2)
    path_edges = self.path(l1, l2)
    for e in path_edges:
        edge_counts[e] -= self.leaf_count - 2
    return edge_counts

Continuing with the example above, the chosen pair (v1 and v2) each have a limb count of 6 while all other limbs have a count of 2. combine_edge_count_and_normalize(v1,v2) subtracts each edge in path(v1,v2) 4 times from the counts...

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1
edge_count(v2) 3 2 1 1 5 1 1 1
-4 * path(v1,v2) -4 -4
------- ------- ------- ------- ------- ------- ------- -------
6 4 2 2 2 2 2 2

The insight here is that, if the chosen pair ...

def neighbour_check(self, l1: N, l2: N) -> bool:
    path_edges = self.path(l1, l2)
    return len(path_edges) == 2

For example, ...

Dot diagram

That means if the pair aren't neighbours, combine_edge_count_and_normalize() will normalize limb counts for the pair in addition to reducing internal edge counts. For example, since v1 and v5 aren't neighbours, combine_edge_count_and_normalize(v1,v5) subtracts 4 from the limb counts of v1 and v5 as well as (i0,i1)'s count ...

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1
edge_count(v5) 3 2 1 1 1 1 1 5
-4 * path(v1,v5) -4 -4 -4
------- ------- ------- ------- ------- ------- ------- -------
2 4 2 2 2 2 2 2

Notice how (i0,i1) was reduced to 2 in the example above. It turns out that any internal edges in the path between the chosen pair get reduced to a count of 2, just like the chosen pair's limb counts.

def reduced_to_2_check(self, l1: N, l2: N) -> bool:
    p = self.path(l1, l2)
    c = self.combine_edge_count_and_normalize(l1, l2)
    return all(c[edge] == 2 for edge in p)  # if counts for all edges in p reduced to 2

To understand why, consider what's happening in the example. For edge_count(v1), notice how the count of each internal edge is consistent with the number of leaf nodes it leads to ...

Dot diagram

That is, edge_count(v1) counts the internal edge ...

Breaking an internal edge divides a tree into two sub-trees. In the case of (i1,i2), the tree separates into two sub-trees where the...

Dot diagram

Running edge_count() for any leaf node on the...

For example, since ...

def segregate_leaves(self, internal_edge: E) -> dict[N, N]:
    leaf_to_end = {}  # leaf -> one of the ends of internal_edge
    e1, e2 = self.tree.get_edge_ends(internal_edge)
    for l1 in self.leaf_nodes:
        # If path from l1 to e1 ends with internal_edge, it means that it had to
        # walk over the internal edge to get to e1, which ultimately means that l1
        # isn't on the e1 side / it's on the e2 side. Otherwise, it's on the e1
        # side.
        p = self.path(l1, e1)
        if p[-1] != internal_edge:
            leaf_to_end[l1] = e1
        else:
            leaf_to_end[l1] = e2
    return leaf_to_end

If the chosen pair are on opposite sides, combine_edge_count() will count (i1,i2) 6 times, which is the same number of times that the chosen pair's limbs get counted (the number of leaf nodes in the tree). For example, combine_edge_count(v1,v3) counts (i1,i2) 6 times, because v1 sits on the i1 side (adds 2 to the count) and v3 sits on the i2 side (adds 4 to the count)...

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1
edge_count(v3) 3 4 1 1 1 5 1 1
------- ------- ------- ------- ------- ------- ------- -------
6 6 2 6 2 6 2 2

This will always be the case for any simple tree: If a chosen pair aren't neighbours, the path between them always travels over at least one internal edge. combine_edge_count() will always count each edge in the path leaf_count times. In the above example, path(v1,v3) travels over internal edges (i0,i1) and (i1,i2) and as such both those edges in addition to the limbs of v1 and v3 have a count of 6.

Just like how combine_edge_count_and_normalize() reduces the counts of the chosen pair's limbs to 2, so will it reduce the count of the internal edges in the path of the chosen pair to 2. That is, all edges in the path between the chosen pair get reduced to a count of 2.

For example, path(v1,v3) has the edges [(v1,i0), (i0,i1), (i1, i2), (v3, i2)]. combine_edge_count_and_normalize(v1,v3) reduces the count of each edge in that path to 2 ...

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1
edge_count(v3) 3 4 1 1 1 5 1 1
-4 * path(v1,v3) -4 -4 -4 -4
------- ------- ------- ------- ------- ------- ------- -------
2 2 2 2 2 2 2 2

The ultimate idea is that, for any leaf node pair in a simple tree, combine_edge_count_and_normalize() will have a count of ...

In other words, internal edges are the only differentiating factor in combine_edge_count_and_normalize()'s result. Non-neighbouring pairs will have certain internal edge counts reduced to 2 while neighbouring pairs keep internal edge counts > 2. In a ...

The pair with the highest total count is guaranteed to be a neighbouring pair because lesser total counts may have had their internal edges reduced.

ch7_code/src/phylogeny/NeighbourJoiningMatrix_EdgeCountExplainer.py (lines 126 to 136):

def neighbour_detect(self) -> tuple[int, tuple[N, N]]:
    found_pair = None
    found_total_count = -1
    for l1, l2 in combinations(self.leaf_nodes, r=2):
        normalized_counts = self.combine_edge_count_and_normalize(l1, l2)
        total_count = sum(c for c in normalized_counts.values())
        if total_count > found_total_count:
            found_pair = l1, l2
            found_total_count = total_count
    return found_total_count, found_pair

⚠️NOTE️️️⚠️

The graph in the example run below is the same as the graph used above. It may look different because node positions may have shifted around.

Given the tree...

Dot diagram

neighbour_detect reported that v4 and v3 have the highest total edge count of 26 and as such are guaranteed to be neighbours.

For each leaf pair in the tree, combine_edge_count_and_normalize() totals are ...

v0 v1 v2 v3 v4 v5
v0 0 22 22 16 16 18
v1 22 0 22 16 16 18
v2 22 22 0 16 16 18
v3 16 16 16 0 26 20
v4 16 16 16 26 0 20
v5 18 18 18 20 20 0

Dot diagram

This same reasoning is applied to edge weights. That is, instead of just counting edges, the reasoning works the same if you were to multiply edge weights by those counts.

In the edge count version of this algorithm, edge_count() gets the paths from a leaf node to all other leaf nodes and counts up the number of times each edge is encountered. In the edge weight multiplicity version, instead of counting how many times each edge gets encountered, each time an edge gets encountered it increases the multiplicity of its weight ...

def edge_multiple(self, l1: N) -> Counter[E]:
    # Collect paths from l1 to all other leaf nodes
    path_collection = []
    for l2 in self.leaf_nodes:
        if l1 == l2:
            continue
        path = self.path(l1, l2)
        path_collection.append(path)
    # Sum edge weights across all paths
    edge_weight_sums = Counter()
    for path in path_collection:
        for edge in path:
            edge_weight_sums[edge] += self.tree.get_edge_data(edge)
    # Return edge weight sums
    return edge_weight_sums

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
edge_count(v1) 3 2 1 5 1 1 1 1
edge_multiple(v1) 3*4=12 2*3=6 1*11=11 5*2=10 1*10=10 1*3=3 1*4=4 1*7=7

Similarly, where in the edge count version combine_edge_count() adds together the edge_count()s for two leaf nodes, the edge weight multiplicity version should add together the edge_multiple()s for two leaf nodes instead...

def combine_edge_multiple(self, l1: N, l2: N) -> Counter[E]:
    c1 = self.edge_multiple(l1)
    c2 = self.edge_multiple(l2)
    return c1 + c2

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
combine_edge_count(v1,v2) 6 4 2 6 6 2 2 2
combine_edge_multiple(v1,v2) 6*4=24 4*3=12 2*11=22 6*2=12 6*10=60 2*3=6 2*4=8 2*7=14

Similarly, where in the edge count version combine_edge_count_and_normalize() reduces all limbs and possibly some internal edges from combine_edge_count() to a count of 2, the edge multiplicity version reduces weights for those same limbs and edges to a multiple of 2...

def combine_edge_multiple_and_normalize(self, l1: N, l2: N) -> Counter[E]:
    edge_multiples = self.combine_edge_multiple(l1, l2)
    path_edges = self.path(l1, l2)
    for e in path_edges:
        edge_multiples[e] -= (self.leaf_count - 2) * self.tree.get_edge_data(e)
    return edge_multiples

Dot diagram

(i0,i1) (i1,i2) (v0,i0) (v1,i0) (v2,i0) (v3,i2) (v4,i2) (v5,i1)
combine_edge_count_and_normalize(v1,v2) 6 4 2 2 2 2 2 2
combine_edge_multiple_and_normalize(v1,v2) 6*4=24 4*3=12 2*11=22 2*2=4 2*10=20 2*3=6 2*4=8 2*7=14
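Summing the normalized multiples gives the value that gets compared across pairs: 24+12+22+4+20+6+8+14 = 110, which is the (v1,v2) entry in the totals matrix shown in the example run further below.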

Similar to combine_edge_count_and_normalize(), for any leaf node pair in a simple tree combine_edge_multiple_and_normalize() will have an edge weight multiple of ...

In other words, internal edge weight multiples are the only differentiating factor in combine_edge_multiple_and_normalize()'s result. Non-neighbouring pairs will have certain internal edge weight multiples reduced to 2 while neighbouring pairs keep internal edge weight multiples > 2. In a ...

The pair with the highest combined multiple is guaranteed to be a neighbouring pair because lesser combined multiples may have had their internal edge multiples reduced.

⚠️NOTE️️️⚠️

Still confused?

Given a simple tree, combine_edge_multiple(A, B) will make it so that...

For example, the following diagrams visualize edge weight multiplicities produced by combine_edge_multiple() for various pairs in a 4 leaf simple tree. Note how the selected pair's limbs have a multiplicity of 4, other limbs have a multiplicity of 2, and internal edges have a multiplicity of 4...

Dot diagram

combine_edge_multiple_and_normalize(A, B) normalizes these multiplicities such that ...

limb multiplicity internal edge multiplicity
neighbouring pairs all = 2 all > 2
non-neighbouring pairs all = 2 at least one = 2, others > 2

Since limbs always contribute the same regardless of whether the pair is neighbouring or not (2*weight), they can be ignored. That leaves internal edge contributions as the only thing differentiating between neighbouring and non-neighbouring pairs.

A simple tree with 2 or more leaf nodes is guaranteed to have at least 1 neighbouring pair. The pair producing the largest result is the one with maxed out contributions from its internal edge weights, meaning that none of those contributions came from internal edges reduced to 2*weight. Lesser results may be lesser because normalization reduced some of their internal edge weights to 2*weight, but you know for certain that the largest result has all of its internal edge weights contributing > 2*weight.
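To make the table above concrete, the following is a minimal standalone sketch (not from ch7_code -- the 4 leaf tree, its edge weights, and the hard-coded paths are made up for illustration). It brute-forces the normalized edge weight multiples for every leaf pair and shows that the two neighbouring pairs produce the largest totals ...

from itertools import combinations

# Hypothetical 4 leaf simple tree (weights made up for illustration):
#
#   A --1-- i0 --3-- i1 --4-- C
#           |                 |
#           2                 5
#           |                 |
#           B                 D
edges = {  # edge -> weight
    ('A', 'i0'): 1, ('B', 'i0'): 2, ('i0', 'i1'): 3, ('C', 'i1'): 4, ('D', 'i1'): 5
}
paths = {  # leaf pair -> edges on the path between them
    ('A', 'B'): [('A', 'i0'), ('B', 'i0')],
    ('A', 'C'): [('A', 'i0'), ('i0', 'i1'), ('C', 'i1')],
    ('A', 'D'): [('A', 'i0'), ('i0', 'i1'), ('D', 'i1')],
    ('B', 'C'): [('B', 'i0'), ('i0', 'i1'), ('C', 'i1')],
    ('B', 'D'): [('B', 'i0'), ('i0', 'i1'), ('D', 'i1')],
    ('C', 'D'): [('C', 'i1'), ('D', 'i1')]
}
leaves = ['A', 'B', 'C', 'D']

def path(l1, l2):
    return paths[l1, l2] if (l1, l2) in paths else paths[l2, l1]

def edge_multiple(l1):
    # each edge's weight multiplied by how many of l1's leaf-to-leaf paths walk over it
    out = {e: 0 for e in edges}
    for l2 in leaves:
        if l1 == l2:
            continue
        for e in path(l1, l2):
            out[e] += edges[e]
    return out

def combine_edge_multiple_and_normalize(l1, l2):
    em1, em2 = edge_multiple(l1), edge_multiple(l2)
    combined = {e: em1[e] + em2[e] for e in edges}
    for e in path(l1, l2):  # path edges (the pair's limbs + any crossed internal edges) drop to 2*weight
        combined[e] -= (len(leaves) - 2) * edges[e]
    return combined

for l1, l2 in combinations(leaves, r=2):
    print(l1, l2, sum(combine_edge_multiple_and_normalize(l1, l2).values()))

Running it prints a total of 36 for the neighbouring pairs (A,B) and (C,D) and 30 for every other pair -- the difference comes entirely from whether the internal edge (i0,i1) was reduced to 2*weight or not.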

ch7_code/src/phylogeny/NeighbourJoiningMatrix_EdgeMultiplicityExplainer.py (lines 97 to 107):

def neighbour_detect(self) -> tuple[int, tuple[N, N]]:
    found_pair = None
    found_total_count = -1
    for l1, l2 in combinations(self.leaf_nodes, r=2):
        normalized_counts = self.combine_edge_multiple_and_normalize(l1, l2)
        total_count = sum(c for c in normalized_counts.values())
        if total_count > found_total_count:
            found_pair = l1, l2
            found_total_count = total_count
    return found_total_count, found_pair

⚠️NOTE️️️⚠️

The graph in the example run below is the same as the graph used above. It may look different because node positions may have shifted around.

Given the tree...

Dot diagram

neighbour_detect reported that v3 and v4 have the highest total edge sum of 122 and as such are guaranteed to be neighbours.

For each leaf pair in the tree, combine_edge_multiple_and_normalize() totals are ...

v0 v1 v2 v3 v4 v5
v0 0 110 110 88 88 94
v1 110 0 110 88 88 94
v2 110 110 0 88 88 94
v3 88 88 88 0 122 104
v4 88 88 88 122 0 104
v5 94 94 94 104 104 0

Dot diagram

The matrix produced in the example above is called a neighbour joining matrix. The summation of combine_edge_multiple_and_normalize() performed in each matrix slot is rewritable as a set of addition and subtraction operations between leaf node distances. For example, recall that combine_edge_multiple_and_normalize(v1,v2) in the example graph breaks down to edge_multiple(v1) + edge_multiple(v2) - (leaf_count - 2) * dist(v1,v2), where dist(v1,v2) is the total weight of the path between v1 and v2. The sum of ...

dist(v1,v0) + dist(v1,v2) + dist(v1,v3) + dist(v1,v4) + dist(v1,v5) +
dist(v2,v0) + dist(v2,v1) + dist(v2,v3) + dist(v2,v4) + dist(v2,v5) -
dist(v1,v2) - dist(v1,v2) - dist(v1,v2) - dist(v1,v2)

Since only leaf node distances are being used in the summation calculation, a distance matrix suffices as the input. The actual simple tree isn't required.

ch7_code/src/phylogeny/NeighbourJoiningMatrix.py (lines 21 to 49):

def total_distance(dist_mat: DistanceMatrix[N]) -> dict[N, float]:
    ret = {}
    for l1 in dist_mat.leaf_ids():
        ret[l1] = sum(dist_mat[l1, l2] for l2 in dist_mat.leaf_ids())
    return ret


def neighbour_joining_matrix(dist_mat: DistanceMatrix[N]) -> DistanceMatrix[N]:
    tot_dists = total_distance(dist_mat)
    n = dist_mat.n
    ret = dist_mat.copy()
    for l1, l2 in product(dist_mat.leaf_ids(), repeat=2):
        if l1 == l2:
            continue
        ret[l1, l2] = tot_dists[l1] + tot_dists[l2] - (n - 2) * dist_mat[l1, l2]
    return ret


def find_neighbours(dist_mat: DistanceMatrix[N]) -> tuple[N, N]:
    nj_mat = neighbour_joining_matrix(dist_mat)
    found_pair = None
    found_nj_val = -1
    for l1, l2 in product(nj_mat.leaf_ids_it(), repeat=2):
        if nj_mat[l1, l2] > found_nj_val:
            found_pair = l1, l2
            found_nj_val = nj_mat[l1, l2]
    assert found_pair is not None
    return found_pair

Given the following distance matrix...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 22.0
v1 13.0 0.0 12.0 12.0 13.0 13.0
v2 21.0 12.0 0.0 20.0 21.0 21.0
v3 21.0 12.0 20.0 0.0 7.0 13.0
v4 22.0 13.0 21.0 7.0 0.0 14.0
v5 22.0 13.0 21.0 13.0 14.0 0.0

... the neighbour joining matrix is ...

v0 v1 v2 v3 v4 v5
v0 0.0 110.0 110.0 88.0 88.0 94.0
v1 110.0 0.0 110.0 88.0 88.0 94.0
v2 110.0 110.0 0.0 88.0 88.0 94.0
v3 88.0 88.0 88.0 0.0 122.0 104.0
v4 88.0 88.0 88.0 122.0 0.0 104.0
v5 94.0 94.0 94.0 104.0 104.0 0.0
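As a quick hand check of two slots (total_distance() is just the sum of a leaf's row in the input matrix):

total_distance(v1) = 13+0+12+12+13+13 = 63
total_distance(v2) = 21+12+0+20+21+21 = 95
nj(v1,v2) = 63 + 95 - (6-2)*12 = 110

total_distance(v3) = 21+12+20+0+7+13 = 73
total_distance(v4) = 22+13+21+7+0+14 = 77
nj(v3,v4) = 73 + 77 - (6-2)*7 = 122

Both match the corresponding entries in the neighbour joining matrix above.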

Find Neighbour Limb Lengths

↩PREREQUISITES↩

WHAT: Given a distance matrix and a pair of leaf nodes identified as being neighbours, if the distance matrix is ...

WHY: This operation is required for approximating a simple tree for a non-additive distance matrix.

Recall that the standard limb length finding algorithm determines the limb length of L by testing distances between leaf nodes to deduce a pair whose path crosses over L's parent. That won't work here because non-additive distance matrices have inconsistent distances -- non-additive means no tree exists that fits its distances.

Average Algorithm

ALGORITHM:

The algorithm is an extension of the standard limb length finding algorithm, essentially running the same computation multiple times and averaging out the results. For example, v1 and v2 are neighbours in the following simple tree...

Dot diagram

Since they're neighbours, they share the same parent node, meaning that the path from...

Dot diagram

Recall that to find the limb length for L, the standard limb length algorithm had to perform a minimum test to find a pair of leaf nodes whose path travelled over L's parent. Since this algorithm takes in two neighbouring leaf nodes, that test isn't required here. The path from L's neighbour to every other node always travels over L's parent.

Since the path from L's neighbour to every other node always travels over L's parent, the core computation from the standard algorithm is performed multiple times and averaged to produce an approximate limb length: 0.5 * (dist(L,N) + dist(L,X) - dist(N,X)), where ...

The averaging makes it so that if the input distance matrix were ...

⚠️NOTE️️️⚠️

Still confused? Think about it like this: When the distance matrix is non-additive, each X has a different "view" of what the limb length should be. You're averaging their views to get a single limb length value.

ch7_code/src/phylogeny/FindNeighbourLimbLengths.py (lines 21 to 40):

def view_of_limb_length_using_neighbour(dm: DistanceMatrix[N], l: N, l_neighbour: N, l_from: N) -> float:
    return (dm[l, l_neighbour] + dm[l, l_from] - dm[l_neighbour, l_from]) / 2


def approximate_limb_length_using_neighbour(dm: DistanceMatrix[N], l: N, l_neighbour: N) -> float:
    leaf_nodes = dm.leaf_ids()
    leaf_nodes.remove(l)
    leaf_nodes.remove(l_neighbour)
    lengths = []
    for l_from in leaf_nodes:
        length = view_of_limb_length_using_neighbour(dm, l, l_neighbour, l_from)
        lengths.append(length)
    return mean(lengths)


def find_neighbouring_limb_lengths(dm: DistanceMatrix[N], l1: N, l2: N) -> tuple[float, float]:
    l1_len = approximate_limb_length_using_neighbour(dm, l1, l2)
    l2_len = approximate_limb_length_using_neighbour(dm, l2, l1)
    return l1_len, l2_len

Given distance matrix...

v0 v1 v2 v3 v4 v5
v0 0.0 13.0 21.0 21.0 22.0 22.0
v1 13.0 0.0 12.0 12.0 13.0 13.0
v2 21.0 12.0 0.0 20.0 21.0 21.0
v3 21.0 12.0 20.0 0.0 7.0 13.0
v4 22.0 13.0 21.0 7.0 0.0 14.0
v5 22.0 13.0 21.0 13.0 14.0 0.0

... and given that v1 and v2 are neighbours, the limb length for leaf node ...
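To see the averaging in action, here's a hand run of the views used to approximate v1's limb length (v2 is its neighbour, so every other leaf provides one view):

view from v0: (12 + 13 - 21) / 2 = 2.0
view from v3: (12 + 12 - 20) / 2 = 2.0
view from v4: (12 + 13 - 21) / 2 = 2.0
view from v5: (12 + 13 - 21) / 2 = 2.0

mean([2.0, 2.0, 2.0, 2.0]) = 2.0

The views all agree here because this particular distance matrix happens to be additive; for a non-additive matrix the views would differ and the mean smooths them out into a single approximation.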

Optimized Average Algorithm

↩PREREQUISITES↩

ALGORITHM:

The unoptimized algorithm performs the computation once for each leaf node in the pair. This is inefficient in that it's repeating a lot of the same operations twice. This algorithm removes a lot of that duplicate work.

The unoptimized algorithm maps to the formula ...

\frac{1}{n-2} \cdot \sum_{k \isin S-\{l1,l2\}}{\frac{D_{l1,l2} + D_{l1,k} - D_{l2,k}}{2}}

... where ...

Just like the code, the formula removes l1 and l2 from the set of leaf nodes (S) for the average's summation. The number of leaf nodes (n) is subtracted by 2 for the average's division because l1 and l2 aren't included. To optimize, consider what happens when you re-organize the formula as follows...

  1. Break up the division in the summation...

    \frac{1}{n-2} \cdot \sum_{k \isin S-\{l1,l2\}}{(\frac{D_{l1,l2}}{2} + \frac{D_{l1,k}}{2} - \frac{D_{l2,k}}{2})}
  2. Pull out \frac{D_{l1,l2}}{2} as a term of its own...

    \frac{D_{l1,l2}}{2} + \frac{1}{n-2} \cdot \sum_{k \isin S-\{l1,l2\}}{(\frac{D_{l1,k}}{2} - \frac{D_{l2,k}}{2})}

    ⚠️NOTE️️️⚠️

    Confused about what's happening above? Think about it like this...

    • mean([0+0.5, 0+1, 0+0.25]) = 0.583 = 0+mean([0.5, 1, 0.25])
    • mean([1+0.5, 1+1, 1+0.25]) = 1.583 = 1+mean([0.5, 1, 0.25])
    • mean([2+0.5, 2+1, 2+0.25]) = 2.583 = 2+mean([0.5, 1, 0.25])
    • mean([3+0.5, 3+1, 3+0.25]) = 3.583 = 3+mean([0.5, 1, 0.25])
    • ...

    If you're including some constant amount for each element in the averaging, the result of the average will include that constant amount. In the case above, \frac{D_{l1,l2}}{2} is a constant being added at each element of the average.

  3. Combine the terms in the summation back together ...

    \frac{D_{l1,l2}}{2} + \frac{1}{n-2} \cdot \sum_{k \isin S-\{l1,l2\}}{\frac{D_{l1,k} - D_{l2,k}}{2}}
  4. Factor out \frac{1}{2} from the entire equation...

    \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot \sum_{k \isin S-\{l1,l2\}}{D_{l1,k} - D_{l2,k}})

    ⚠️NOTE️️️⚠️

    Confused about what's happening above? It's just distributing and pulling out. For example, given the formula 5/2 + x*(3/2 + 5/2 + 9/2) ...

    1. 5/2 + 3x/2 + 5x/2 + 9x/2 -- distribute x
    2. 1/2 * (5 + 3x + 5x + 9x) -- all terms are divided by 2 now, so pull out 1/2
    3. 1/2 * (5 + x*(3 + 5 + 9)) -- pull x back out
  5. Break up the summation into two simpler summations ...

    \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot (\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}} - \sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}))

    ⚠️NOTE️️️⚠️

    Confused about what's happening above? Think about it like this...

    (9-1)+(8-2)+(7-3) = 9+8+7-1-2-3 = 24+(-6) = 24-6 = sum([9,8,7])-sum([1,2,3])

    It's just re-ordering the operations so that it can be represented as two sums. It's perfectly valid.

The above formula calculates the limb length for l1. To instead find the formula for l2, just swap l1 and l2 ...

len(l1) = \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot (\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}} - \sum_{k \isin S-\{l1,l2\}}{D_{l2,k}})) \\[0.5em] len(l2) = \frac{1}{2} \cdot (D_{l2,l1} + \frac{1}{n-2} \cdot (\sum_{k \isin S-\{l2,l1\}}{D_{l2,k}} - \sum_{k \isin S-\{l2,l1\}}{D_{l1,k}}))

Note how the two are almost exactly the same. D_{l1,l2} = D_{l2,l1}, S-\{l1,l2\} = S-\{l2,l1\}, and both summations are still there. The only exception is the order in which the summations are being subtracted ...

len(l1) = \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot (\textcolor{#7f7f00}{\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}}} - \textcolor{#007f7f}{\sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}})) \\[0.5em] len(l2) = \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot (\textcolor{#007f7f}{\sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}} - \textcolor{#7f7f00}{\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}}}))

Consider what happens when you re-organize the formula for l2 as follows...

  1. Convert the summation subtraction to an addition of a negative...

    len(l2) = \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot (\textcolor{#007f7f}{\sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}} + (-\textcolor{#7f7f00}{\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}}})))
  2. Swap the order of the summation addition...

    len(l2) = \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot (-\textcolor{#7f7f00}{\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}}} + \textcolor{#007f7f}{\sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}}))
  3. Factor out -1 from summation addition ...

    len(l2) = \frac{1}{2} \cdot (D_{l1,l2} + \frac{1}{n-2} \cdot -1 \cdot (\textcolor{#7f7f00}{\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}}} - \textcolor{#007f7f}{\sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}}))
  4. Simplify ...

    len(l2) = \frac{1}{2} \cdot (D_{l1,l2} + - \frac{1}{n-2} \cdot (\textcolor{#7f7f00}{\sum_{k \isin S-\{l1,l2\}}{D_{l1,k}}} - \textcolor{#007f7f}{\sum_{k \isin S-\{l1,l2\}}{D_{l2,k}}}))