明確的 DNA 序列僅由核堿基腺嘌呤 (A)、胞嘧啶 (C)、鳥嘌呤 (G)、胸腺嘧啶 (T) 組成。對于人類消費,堿基可以用相應char的大寫或小寫字母表示:A, C, G, T, 或a, c, g, t。然而,當需要存盤長序列時,這種表示是低效的。由于只需要存盤四個符號,因此可以為每個符號分配一個 2 位代碼。UCSC指定的常用.2bit格式正是這樣做的,使用以下編碼: T = , C = , A = , G = 。0b000b010b100b11
下面的 C 代碼顯示了為清楚起見而撰寫的參考實作。各種將基因組序列轉換為char序列的開源軟體通常使用一個 256 條目的查找表,這些查找表由每個char序列索引。這也與char. 然而,即使訪問的是片上快取,記憶體訪問也非常昂貴,并且通用表查找很難 SIMDize。因此,如果可以通過簡單的整數算術來完成轉換,那將是有利的。鑒于 ASCII 是主要的char編碼,人們可以限制這一點。
將作為 ASCII 字符給出的核堿基轉換為它們的.2bit表示的有效計算方法是什么?
/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
ASCII characters into UCSC .2bit-presentation. Passing any values other than
those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results in an
indeterminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
const unsigned int UCSC_2BIT_A = 2;
const unsigned int UCSC_2BIT_C = 1;
const unsigned int UCSC_2BIT_G = 3;
const unsigned int UCSC_2BIT_T = 0;
switch (a) {
case 'A':
case 'a':
return UCSC_2BIT_A;
break;
case 'C':
case 'c':
return UCSC_2BIT_C;
break;
case 'G':
case 'g':
return UCSC_2BIT_G;
break;
case 'T':
case 't':
default:
return UCSC_2BIT_T;
break;
}
}
uj5u.com熱心網友回復:
一種選擇如下:
unsigned int ascii_to_2bit (unsigned int a)
{
return ((0xad - a) & 6) >> 1;
}
這樣做的好處是它只需要 8 位,不會溢位,并且不包含非常量移位,因此即使沒有特定的 SIMD 指令,它也可以立即用于并行化,例如在 64 位中輸入 8 個字符單詞
unsigned int ascii_to_2bit_8bytes (uint64_t a)
{
return ((0xadadadadadadadad - a) & 0x0606060606060606) >> 1;
}
將兩個輸出位留在每個位元組的底部。
uj5u.com熱心網友回復:
如果仔細觀察核堿基的 ASCII 字符的二進制代碼,很明顯第 1 位和第 2 位提供了唯一的兩位代碼:A= 0b01000001-> 0b00、C= 0b01000011-> 0b01、G= 0b01000111-> 0b11、T= 0b01010100-> 0b10。與小寫 ASCII 字符類似,它們僅在第 5 位上有所不同。不幸的是,這個簡單的映射與.2bit-encoding不完全匹配,因為 A 和 T 的代碼被交換了。解決此問題的一種方法是使用存盤在變數中的簡單四項置換表,可能在優化后分配給暫存器(“暫存器內查找表”):
unsigned int ascii_to_2bit_perm (unsigned int a)
{
unsigned int perm = (2 << 0) | (1 << 2) | (0 << 4) | (3 << 6);
return (perm >> (a & 6)) & 3;
}
另一種方法通過簡單的位操作來糾正生成的“幼稚”代碼,即通過簡單的位操作即時提取位 1 和 2,通過觀察位 1 為 0A和T,但 1 為Cand G。因此,我們可以通過將輸入的第 1 位的倒數與初步代碼的第 1 位進行異或運算來交換 A 和 T 的編碼:
unsigned int ascii_to_2bit_twiddle (unsigned int a)
{
return ((a >> 1) & 3) ^ (~a & 2);
}
這個版本在具有快速位域提取指令的處理器和沒有桶形移位器的低端處理器上是有利的,因為只需要右移一位。在亂序處理器上,這種方法提供了比置換表更多的指令級并行性。適應 SIMD 實作似乎也更容易,因為在所有位元組通道中使用相同的移位計數。
在我專注于相關 ASCII 字符的二進制編碼之前,我已經研究過使用簡單的數學計算。對小的乘數和除數進行簡單的蠻力搜索得到:
unsigned int ascii_to_2bit_math (unsigned int a)
{
return ((18 * (a & 31)) % 41) & 3;
}
乘法器18對沒有快速整數乘法器的處理器很友好。現代編譯器可以有效地處理具有編譯時常數除數的模計算,并且不需要除法。即便如此,我注意到即使是最好的可用編譯器也難以利用非常有限的輸入和輸出范圍,所以我不得不手動按摩它以簡化代碼:
unsigned int ascii_to_2bit_math (unsigned int a)
{
unsigned int t = 18 * (a & 31);
return (t - ((t * 25) >> 10)) & 3;
}
即使在這種形式下并且假設單周期乘法的可用性,這似乎與之前的兩種方法相比通常沒有競爭力,因為它產生更多的指令和更長的依賴鏈。然而,在 64 位平臺上,整個計算可以用一個 64 位、32 項的查找表代替,如果這個 64 位表可以有效地放入暫存器中,則可以提供具有競爭力的性能,x86 就是這種情況。 64,它作為立即數加載。
unsigned int ascii_to_2bit_tab (unsigned int a)
{
uint64_t tab = ((0ULL << 0) | (2ULL << 2) | (0ULL << 4) | (1ULL << 6) |
(3ULL << 8) | (0ULL << 10) | (2ULL << 12) | (3ULL << 14) |
(1ULL << 16) | (3ULL << 18) | (0ULL << 20) | (2ULL << 22) |
(3ULL << 24) | (1ULL << 26) | (2ULL << 28) | (0ULL << 30) |
(1ULL << 32) | (3ULL << 34) | (1ULL << 36) | (2ULL << 38) |
(0ULL << 40) | (1ULL << 42) | (3ULL << 44) | (0ULL << 46) |
(2ULL << 48) | (0ULL << 50) | (1ULL << 52) | (3ULL << 54) |
(0ULL << 56) | (2ULL << 58) | (3ULL << 60) | (1ULL << 62));
return (tab >> (2 * (a & 31))) & 3;
}
我正在附加我的測驗框架以供參考:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define ORIGINAL_MATH (1)
unsigned int ascii_to_2bit_perm (unsigned int a)
{
unsigned int perm = (2 << 0) | (1 << 2) | (0 << 4) | (3 << 6);
return (perm >> (a & 6)) & 3;
}
unsigned int ascii_to_2bit_twiddle (unsigned int a)
{
return ((a >> 1) & 3) ^ (~a & 2);
}
unsigned int ascii_to_2bit_math (unsigned int a)
{
#if ORIGINAL_MATH
return ((18 * (a & 31)) % 41) & 3;
#else // ORIGINAL_MATH
unsigned int t = 18 * (a & 31);
return (t - ((t * 25) >> 10)) & 3;
#endif // ORIGINAL_MATH
}
unsigned int ascii_to_2bit_tab (unsigned int a)
{
uint64_t tab = ((0ULL << 0) | (2ULL << 2) | (0ULL << 4) | (1ULL << 6) |
(3ULL << 8) | (0ULL << 10) | (2ULL << 12) | (3ULL << 14) |
(1ULL << 16) | (3ULL << 18) | (0ULL << 20) | (2ULL << 22) |
(3ULL << 24) | (1ULL << 26) | (2ULL << 28) | (0ULL << 30) |
(1ULL << 32) | (3ULL << 34) | (1ULL << 36) | (2ULL << 38) |
(0ULL << 40) | (1ULL << 42) | (3ULL << 44) | (0ULL << 46) |
(2ULL << 48) | (0ULL << 50) | (1ULL << 52) | (3ULL << 54) |
(0ULL << 56) | (2ULL << 58) | (3ULL << 60) | (1ULL << 62));
return (tab >> (2 * (a & 31))) & 3;
}
/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
ASCII characters into UCSC .2bit-presentation. Passing any values other than
those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results in an
inderminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
const unsigned int UCSC_2BIT_A = 2;
const unsigned int UCSC_2BIT_C = 1;
const unsigned int UCSC_2BIT_G = 3;
const unsigned int UCSC_2BIT_T = 0;
switch (a) {
case 'A':
case 'a':
return UCSC_2BIT_A;
break;
case 'C':
case 'c':
return UCSC_2BIT_C;
break;
case 'G':
case 'g':
return UCSC_2BIT_G;
break;
case 'T':
case 't':
default:
return UCSC_2BIT_T;
break;
}
}
int main (void)
{
char nucleobase[8] = {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't'};
printf ("Testing permutation variant:\n");
for (unsigned int i = 0; i < sizeof nucleobase; i ) {
unsigned int ref = ascii_to_2bit (nucleobase[i]);
unsigned int res = ascii_to_2bit_perm (nucleobase[i]);
printf ("i=%2u %c res=%u ref=%u %c\n",
i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
}
printf ("Testing bit-twiddling variant:\n");
for (unsigned int i = 0; i < sizeof nucleobase; i ) {
unsigned int ref = ascii_to_2bit (nucleobase[i]);
unsigned int res = ascii_to_2bit_twiddle (nucleobase[i]);
printf ("i=%2u %c res=%u ref=%u %c\n",
i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
}
printf ("Testing math-based variant:\n");
for (unsigned int i = 0; i < sizeof nucleobase; i ) {
unsigned int ref = ascii_to_2bit (nucleobase[i]);
unsigned int res = ascii_to_2bit_math (nucleobase[i]);
printf ("i=%2u %c res=%u ref=%u %c\n",
i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
}
printf ("Testing table-based variant:\n");
for (unsigned int i = 0; i < sizeof nucleobase; i ) {
unsigned int ref = ascii_to_2bit (nucleobase[i]);
unsigned int res = ascii_to_2bit_tab (nucleobase[i]);
printf ("i=%2u %c res=%u ref=%u %c\n",
i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
}
return EXIT_SUCCESS;
}
uj5u.com熱心網友回復:
以下例程將使用填充了uint32_t您宣告的字符的 ASCII 字串的內容填充值陣列,并將保存狀態以便能夠將第二個、第三個等數量的字串附加到陣列中。使用它的方式通過一個main()例程來說明,該例程從命令列獲取字串引數并將它們傳遞給TOTAL元素陣列。該例程的引數在其上方的注釋中進行了描述。
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define UCSC_2BIT_T (0)
#define UCSC_2BIT_C (1)
#define UCSC_2BIT_A (2)
#define UCSC_2BIT_G (3)
#define PAIRS_PER_INTEGER 16
/**
* Converts a string of nucleobases in ASCII form into an array
* of 32bit integers in 2bitform. As it should be possible to
* continue the array of integers, a status reference is
* maintained in order to determine determine where in the
* integer to start putting the two bit sequences. For this,
* a (*state) variable is maintained (initialize it to 0) to
* remember where to start putting bitpairs in the array.
*
* @param state_ref reference to the state variable to be maintained
* with the position on which to put the base in the
* array entry.
* @param dna string with the ASCII chain of bases.
* @param twobit array reference to start filling.
* @param sz size of the array ***in array entries***.
* @return the number of array elements written. This allows to
* use a pointer and advance it at each function call
* with the number of entries consumed on each call.
*/
ssize_t
dna2twobit(
int *state_ref,
char *dna,
uint32_t twobit[],
size_t sz)
{
/* local copy so we only change the state in case of
* successful execution */
int state = 30 - *state_ref;
if (state == 30) *twobit = 0;
ssize_t total_nb = 0; /* total number of array elements consumed */
while (*dna && sz) {
char base = toupper(*dna );
uint32_t tb;
switch (base) {
case 'A': tb = UCSC_2BIT_A; break;
case 'C': tb = UCSC_2BIT_C; break;
case 'T': tb = UCSC_2BIT_T; break;
case 'G': tb = UCSC_2BIT_G; break;
default: return -1;
}
*twobit |= tb << state;
state -= 2;
if (state < 0) {
--sz; twobit;
state = 32;
*twobit = 0;
total_nb ;
}
}
*state_ref = 30 - state;
return total_nb;
}
這個函式可以單獨鏈接到你想要的任何程式中,但是我提供了一個main()代碼來說明這個函式的使用。您可以使用命令列引數中以 ASCII 編碼的實際鏈來呼叫程式。該程式將在前一個之后對它們進行編碼,以演示多重轉換(16 個堿基適合一個 32 位整數,如您在問題中發布的格式的定義所述,因此如果沒有編碼 16 的倍數,狀態反映了最后一個已經覆寫了多少位。
代碼繼續下面的主要功能:
#define TOTAL 16
int main(int argc, char **argv)
{
int state = 0, i;
uint32_t twobit_array[TOTAL], *p = twobit_array;
size_t twobit_cap = TOTAL;
for (i = 1; i < argc; i) {
printf("Adding the following chain: [%s]\n", argv[i]);
ssize_t n = dna2twobit(
&state,
argv[i],
p,
twobit_cap);
if (n < 0) {
fprintf(stderr,
"argument invalid: %s\n"
"continuing to next\n",
argv[i]);
continue;
}
twobit_cap -= n;
p = n;
}
if (!state) --p;
uint32_t *q = twobit_array;
size_t off = 0;
for (int j = 0; q <= p; j ) {
char *sep = "";
printf(" zd: ", off);
for (int k = 0; k < 4 & q <= p; k ) {
printf("%sx", sep, *q);
sep = "-";
q ;
off = 16;
}
printf("\n");
}
return 0;
}
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/409391.html
標籤:
