Similarity Score Matrix and HeatMap for nucleic and amino acid k-mers. Similarity score is evaluated by Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM). Higher similarity score indicates more similar sequences for BLOSUM and less similar sequences for PAM matrix. The 30, 40, 70, 120, 250 and 62, 45, 50, 62, 80, 100 matrix versions are available for PAM and BLOSUM, respectively. Alignment is evaluated by Needleman-Wunsch (Needleman and Wunsch 1970) and Smith-Waterman (Smith and Waterman 1981).
Simply apply the kmeRs_similarity_matrix function and mark the appropriate matrix, here BLOSUM62.
# Simple BLOSUM62 similarity matrix for all amino acid nucleotides
BLOSUM62 <- kmeRs_similarity_matrix(submat = "BLOSUM62")
# Fancy knitr table
knitr::kable(BLOSUM62)
G | A | L | M | F | W | K | Q | E | S | P | V | I | C | Y | H | R | N | D | T | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
G | 6 | 0 | -4 | -3 | -3 | -2 | -2 | -2 | -2 | 0 | -2 | -3 | -4 | -3 | -3 | -2 | -2 | 0 | -1 | -2 |
A | 0 | 4 | -1 | -1 | -2 | -3 | -1 | -1 | -1 | 1 | -1 | 0 | -1 | 0 | -2 | -2 | -1 | -2 | -2 | 0 |
L | -4 | -1 | 4 | 2 | 0 | -2 | -2 | -2 | -3 | -2 | -3 | 1 | 2 | -1 | -1 | -3 | -2 | -3 | -4 | -1 |
M | -3 | -1 | 2 | 5 | 0 | -1 | -1 | 0 | -2 | -1 | -2 | 1 | 1 | -1 | -1 | -2 | -1 | -2 | -3 | -1 |
F | -3 | -2 | 0 | 0 | 6 | 1 | -3 | -3 | -3 | -2 | -4 | -1 | 0 | -2 | 3 | -1 | -3 | -3 | -3 | -2 |
W | -2 | -3 | -2 | -1 | 1 | 11 | -3 | -2 | -3 | -3 | -4 | -3 | -3 | -2 | 2 | -2 | -3 | -4 | -4 | -2 |
K | -2 | -1 | -2 | -1 | -3 | -3 | 5 | 1 | 1 | 0 | -1 | -2 | -3 | -3 | -2 | -1 | 2 | 0 | -1 | -1 |
Q | -2 | -1 | -2 | 0 | -3 | -2 | 1 | 5 | 2 | 0 | -1 | -2 | -3 | -3 | -1 | 0 | 1 | 0 | 0 | -1 |
E | -2 | -1 | -3 | -2 | -3 | -3 | 1 | 2 | 5 | 0 | -1 | -2 | -3 | -4 | -2 | 0 | 0 | 0 | 2 | -1 |
S | 0 | 1 | -2 | -1 | -2 | -3 | 0 | 0 | 0 | 4 | -1 | -2 | -2 | -1 | -2 | -1 | -1 | 1 | 0 | 1 |
P | -2 | -1 | -3 | -2 | -4 | -4 | -1 | -1 | -1 | -1 | 7 | -2 | -3 | -3 | -3 | -2 | -2 | -2 | -1 | -1 |
V | -3 | 0 | 1 | 1 | -1 | -3 | -2 | -2 | -2 | -2 | -2 | 4 | 3 | -1 | -1 | -3 | -3 | -3 | -3 | 0 |
I | -4 | -1 | 2 | 1 | 0 | -3 | -3 | -3 | -3 | -2 | -3 | 3 | 4 | -1 | -1 | -3 | -3 | -3 | -3 | -1 |
C | -3 | 0 | -1 | -1 | -2 | -2 | -3 | -3 | -4 | -1 | -3 | -1 | -1 | 9 | -2 | -3 | -3 | -3 | -3 | -1 |
Y | -3 | -2 | -1 | -1 | 3 | 2 | -2 | -1 | -2 | -2 | -3 | -1 | -1 | -2 | 7 | 2 | -2 | -2 | -3 | -2 |
H | -2 | -2 | -3 | -2 | -1 | -2 | -1 | 0 | 0 | -1 | -2 | -3 | -3 | -3 | 2 | 8 | 0 | 1 | -1 | -2 |
R | -2 | -1 | -2 | -1 | -3 | -3 | 2 | 1 | 0 | -1 | -2 | -3 | -3 | -3 | -2 | 0 | 5 | 0 | -2 | -1 |
N | 0 | -2 | -3 | -2 | -3 | -4 | 0 | 0 | 0 | 1 | -2 | -3 | -3 | -3 | -2 | 1 | 0 | 6 | 1 | 0 |
D | -1 | -2 | -4 | -3 | -3 | -4 | -1 | 0 | 2 | 0 | -1 | -3 | -3 | -3 | -3 | -1 | -2 | 1 | 6 | -1 |
T | -2 | 0 | -1 | -1 | -2 | -2 | -1 | -1 | -1 | 1 | -1 | 0 | -1 | -1 | -2 | -2 | -1 | 0 | -1 | 5 |
In this example, the most ‘different’ k-mer to “GATTACA” sequence will be indicated from given set of heptamers. Here, 7 heptamer (being an anagram of the movie title “GATTACA”) are given, as follow:
# Given hexamers
kmers_given <- c("GATTACA", "ACAGATT", "GAATTAC", "GAAATCT", "CTATAGA", "GTACATA", "AACGATT")
# Matrix calculation
kmers_mat <- kmeRs_similarity_matrix(q = c("GATTACA"), x = kmers_given , submat = "BLOSUM62")
# Fancy knitr table
knitr::kable(kmers_mat)
GATTACA | |
---|---|
GATTACA | 37 |
ACAGATT | 1 |
GAATTAC | 15 |
GAAATCT | 19 |
CTATAGA | 7 |
GTACATA | 12 |
AACGATT | 4 |
Now, applying kmeRs_score function the total score is calculated and the matrix is sorted by increasing score value. The lowest value (in case of BLOSUM) indicates the most ‘different’ sequence from given k-mers, in contrast to the highest value which indicates the most similar one.
# Score and sort the matrix
kmers_res <- kmeRs_score(kmers_mat)
# Fancy knitr table
knitr::kable(kmers_res)
GATTACA | Sum | |
---|---|---|
ACAGATT | 1 | 1 |
AACGATT | 4 | 4 |
CTATAGA | 7 | 7 |
GTACATA | 12 | 12 |
GAATTAC | 15 | 15 |
GAAATCT | 19 | 19 |
GATTACA | 37 | 37 |
As can be observed, the most ‘different’ sequence to GATTACA is ACAGATT with total score equal to 1 and the most similar to GATTACA sequence is of course GATTACA sequence with the highest score equal to 37.
In this example, the most ‘different’ k-mer to whole given set of heptamers will be indicated. The same heptamers as in example 2 are used.
# Given hexamers
kmers_given <- c("GATTACA", "ACAGATT", "GAATTAC", "GAAATCT", "CTATAGA", "GTACATA", "AACGATT")
# Matrix calculation
kmers_mat <- kmeRs_similarity_matrix(q = kmers_given, submat = "BLOSUM62")
# Score the matrix and sort by decreasing score
kmers_res <- kmeRs_score(kmers_mat)
# Fancy knitr table
knitr::kable(kmers_res)
GATTACA | ACAGATT | GAATTAC | GAAATCT | CTATAGA | GTACATA | AACGATT | Sum | |
---|---|---|---|---|---|---|---|---|
CTATAGA | 7 | 3 | 6 | -2 | 37 | 11 | 0 | 62 |
AACGATT | 4 | 24 | 1 | 8 | 0 | 6 | 37 | 80 |
ACAGATT | 1 | 37 | 1 | 8 | 3 | 9 | 24 | 83 |
GAATTAC | 15 | 1 | 37 | 18 | 6 | 9 | 1 | 87 |
GTACATA | 12 | 9 | 9 | 9 | 11 | 37 | 6 | 93 |
GATTACA | 37 | 1 | 15 | 19 | 7 | 12 | 4 | 95 |
GAAATCT | 19 | 8 | 18 | 37 | -2 | 9 | 8 | 97 |
As can be observed, the most ‘different’ sequence to all given heptamers is CTATAGA with score equal to 62 and the most similar sequence is GAAATCT with the highest score equal to 97.
Applying function kmeRs_statistics to the result matrix the basic statistics can be calculated as additional rows. When summary_statistics_only is set to TRUE only summary table is returned. It is much more elegant way to present results, especially in case of ‘big data’ output.
# Calculate stats
kmers_stats <- kmeRs_statistics(kmers_res)
# Fancy knitr table
knitr::kable(kmers_stats[ ,1:(dim(kmers_stats)[2] - 4) ])
GATTACA | ACAGATT | GAATTAC | GAAATCT | CTATAGA | GTACATA | AACGATT | Sum | |
---|---|---|---|---|---|---|---|---|
CTATAGA | 7.00 | 3.00 | 6.00 | -2.00 | 37.00 | 11.00 | 0.00 | 62.00 |
AACGATT | 4.00 | 24.00 | 1.00 | 8.00 | 0.00 | 6.00 | 37.00 | 80.00 |
ACAGATT | 1.00 | 37.00 | 1.00 | 8.00 | 3.00 | 9.00 | 24.00 | 83.00 |
GAATTAC | 15.00 | 1.00 | 37.00 | 18.00 | 6.00 | 9.00 | 1.00 | 87.00 |
GTACATA | 12.00 | 9.00 | 9.00 | 9.00 | 11.00 | 37.00 | 6.00 | 93.00 |
GATTACA | 37.00 | 1.00 | 15.00 | 19.00 | 7.00 | 12.00 | 4.00 | 95.00 |
GAAATCT | 19.00 | 8.00 | 18.00 | 37.00 | -2.00 | 9.00 | 8.00 | 97.00 |
Min | 1.00 | 1.00 | 1.00 | -2.00 | -2.00 | 6.00 | 0.00 | 62.00 |
Max | 37.00 | 37.00 | 37.00 | 37.00 | 37.00 | 37.00 | 37.00 | 97.00 |
Mean | 13.57 | 11.86 | 12.43 | 13.86 | 8.86 | 13.29 | 11.43 | 85.29 |
SD | 12.08 | 13.64 | 12.62 | 12.40 | 13.16 | 10.63 | 13.83 | 12.04 |
Simply applying function kmeRs_heatmap to the result matrix.
Special thanks to Jason Lin, PhD from Chiba Cancer Center Research Institute, Chiba, Japan for contribution in 2021 as implementing of heatmap function and update deprecated functions.
OncoCDx Research Center, London, UK↩︎