Faster version of nt2aa MATLAB function – dna2amino

I have made Matlab dna2amino function, that converts nucleotide sequence (e.g. ATGGTCGAAACTACTTAG) to amino acid sequence (e.g. MVETT* (* is stop codon)). It works pretty same like build-in nt2aa function, but faster:

Results of dna2amino vs nt2aa function

Results of dna2amino vs nt2aa function

There is definition of amino acids and codons in dna2amino. I used “matrix approach” only, no-cycles in my code. Probably that’s the reason of better performance. On second thought my function does not support any additional parameters, like non-standard genetic codes and is more memory wasting.
Code is well commented, thus understandable while reading:

function [aminoSequence] = dna2amino(dna)
%DNA2AMINO
%http://5vet.cz
notTripletTailLength=mod(length(dna),3);
rna = strrep(dna(1:end-notTripletTailLength),'T','U');%transcription

%% Genetic code
aminoA.id = 'A';aminoA.codons = ['GCU'; 'GCC'; 'GCA'; 'GCG'];
aminoR.id = 'R';aminoR.codons = ['CGU'; 'CGC'; 'CGA'; 'CGG'; 'AGA'; 'AGG'];
aminoN.id = 'N';aminoN.codons = [ 'AAU'; 'AAC'];
aminoD.id = 'D';aminoD.codons = ['GAU'; 'GAC'];
aminoC.id = 'C';aminoC.codons = [ 'UGU'; 'UGC'];
aminoQ.id = 'Q';aminoQ.codons = ['CAA'; 'CAG'];
aminoE.id = 'E';aminoE.codons = ['GAA'; 'GAG'];
aminoG.id = 'G';aminoG.codons = ['GGU'; 'GGC'; 'GGA'; 'GGG'];
aminoH.id = 'H';aminoH.codons = ['CAU'; 'CAC'];
aminoI.id = 'I';aminoI.codons = [ 'AUU'; 'AUC'; 'AUA' ];
aminoL.id = 'L';aminoL.codons = [ 'CUU'; 'CUC'; 'CUA'; 'CUG'; 'UUA'; 'UUG'];
aminoK.id = 'K';aminoK.codons = ['AAA'; 'AAG'];
aminoM.id = 'M';aminoM.codons = ['AUG'];
aminoF.id = 'F';aminoF.codons = ['UUU'; 'UUC'];
aminoP.id = 'P';aminoP.codons = ['CCU'; 'CCC'; 'CCA'; 'CCG'];
aminoS.id = 'S';aminoS.codons = ['UCU'; 'UCC'; 'UCA'; 'UCG'; 'AGU'; 'AGC'];
aminoT.id = 'T';aminoT.codons = ['ACU'; 'ACC'; 'ACA'; 'ACG'];
aminoW.id = 'W';aminoW.codons = ['UGG'];
aminoY.id = 'Y';aminoY.codons = ['UAU'; 'UAC'];
aminoV.id = 'V';aminoV.codons = [ 'GUU'; 'GUC'; 'GUA'; 'GUG'];

%aminoO.id = 'O';aminoO.codons = [ 'UAG']; not amino, but stopcodon
%aminoU.id = 'U';aminoU.codons = ['UGA']; not amino, but stopcodon

aminoStop.id = '*';aminoStop.codons = ['UAG';'UGA';'UAA'];%stopcodon

aminosAndKodons{1} =aminoA;
aminosAndKodons{2} =aminoR;
aminosAndKodons{3} =aminoN;
aminosAndKodons{4} =aminoD;
aminosAndKodons{5} =aminoC;
aminosAndKodons{6} =aminoQ;
aminosAndKodons{7} =aminoE;
aminosAndKodons{8} =aminoG;
aminosAndKodons{9} =aminoH;
aminosAndKodons{10} =aminoI;
aminosAndKodons{11} =aminoL;
aminosAndKodons{12} =aminoK;
aminosAndKodons{13} =aminoM;
aminosAndKodons{14} =aminoF;
aminosAndKodons{15} =aminoP;
aminosAndKodons{16} =aminoS;
aminosAndKodons{17} =aminoT;
aminosAndKodons{18} =aminoW;
aminosAndKodons{19} =aminoY;
aminosAndKodons{20} =aminoV;
% aminosAndKodons{21} =aminoO;
% aminosAndKodons{22} =aminoU;

aminosAndKodons{21} =aminoStop;%stopcodon

%% get some properties
x=[aminosAndKodons{:}];
AllCodons=cat(1,x.codons);%all triplets in one matrix (64x3)

CodonsCounts = cellfun(@(x) size(x.codons,1), aminosAndKodons);%how much codons per amino

numOfCodons = sum(CodonsCounts);%total number of codons, should be 64 (4^3)
numOfTripletsInDNA = length(rna)/3;%how much triplet is in DNA sample
%% Translation itself

AllCodonMatrix = repmat(AllCodons,1,numOfTripletsInDNA);%codon is repeating in row
RnaMatrix = repmat(rna,numOfCodons,1);%same RNA code in all rows

IsEqMatrix = AllCodonMatrix ==RnaMatrix; %1 means match
triFirst = IsEqMatrix(:,1:3:end);
triSecon = IsEqMatrix(:,2:3:end);
triThird = IsEqMatrix(:,3:3:end);
%create 3 matrix. 1 per 1 letter in triplet

IsThatCodon = sum(cat(3,triFirst,triSecon,triThird),3)==3;%if it is equal to 3 it means, that all letter in triplet has match

%tripletVector is vector with index of amino. Count of same inexes is count
%of codons per amino (starting with four ones, because Alanin has 4 codons)
%%tripletsVector =repelem(indxs,tripletCounts);%%matlab 2015 an higher
index = zeros(1,sum(CodonsCounts));
index(cumsum([1 CodonsCounts(1:end-1)])) = 1;
tripletsVector = cumsum(index);%just hack to make it...

TripletsMatrix = repmat(tripletsVector',1,numOfTripletsInDNA);%replicate tripletsVector to size: numOfCodons x numOfTripletsInRNA

AminoNumberMatrix = TripletsMatrix.* IsThatCodon;%Amino number instead of 1
aminoSequenceNumber= sum(AminoNumberMatrix);%there is just one non-zero number per column. Make it flat..

aminos = [x.id];%get letters for all aminos

aminoSequence = aminos(aminoSequenceNumber);%Get letter for sequence

%deal with alternative starts... because of gentics
alternativeStartCodons = {'CTG' 'TTG'};%normaly Leu
%another start codon is 'ATG', but it is Methionin (already M)
if(ismember(dna(1:3),alternativeStartCodons))
aminoSequence(1) ='M';
end

end

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *