Summary
Existing protocols for measuring the selection coefficients of loci neglect linkage effects between loci. This protocol is free from this limitation. The protocol takes as input a set of DNA sequences at three time points, removes conserved sites, and estimates selection coefficients. To test the accuracy, the user can ask the protocol to generate mock data by computer simulation of evolution. The main limitation is the need for sequence samples isolated from 30–100 populations adapting in parallel.
For complete details on the use and execution of this protocol, please refer to Barlukova and Rouzine (2021).
Subject areas: Sequence analysis, Genetics, Genomics, Computer sciences, Evolutionary biology
Highlights
- Measures selection coefficients for adapting loci in relative units
- Inputs DNA sequences and excludes conserved sites and insertions/deletions
- Estimates selection coefficients from these data
- If desired, tests the accuracy of the protocol by using simulated data
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
DNA, RNA, or protein sequences
1. Obtain a database of DNA sequences of a pathogen or organism at three time points, t1, t2, and t3. If you have RNA or protein sequences, translate them using MEGA or any other standard software.
2. Align and trim the sequences using MEGA software.
3. Output the sequences to MEGA files corresponding to times t1, t2, and t3.
Note: One can choose any three time points, as long as they are sufficiently far apart (see troubleshooting below).
Software
4. Install MATLAB (version 2017 or later) or GNU Octave.
5. Install MEGA.
6. Download and install the present software (key resources table).
CRITICAL: Verify, using any phylogenetic method (for example, one implemented in MEGA software), that the genomic sequences are isolated from at least 30–100 independently evolving populations. A polyphyletic tree must be observed.
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | | |
| Multiple DNA sequences of any organism from at least 30–100 independently adapting populations | Any public DNA database | N/A |
| Software and algorithms | | |
| MATLAB 2017 or later, or GNU Octave | MathWorks | N/A |
| MEGA software | MEGA | N/A |
| Software developed for the present work | https://www.dropbox.com/sh/ptuspj468b88at8/AAD4EsEXghCU46zBC1Ai_-7Ka?dl=0 | ACTGtranslate.m, BinData.mat, s_measure.m, main.m (optional) |
| Software for generating mock sequences by simulating evolution (optional) | https://github.com/irouzine/Strong-linkage-in-sex | recomb_2022.m |
Step-by-step method details
Binarization of genomic data
Timing:
Run our program ACTGtranslate.m (key resources table), which reads N sequences of length L from the MEGA output files for all time points t1, t2, and t3 obtained in preliminary step 3. The program automatically carries out the following steps:
1. Initialize parameters.
%% parameters
% b1 - number of the first sequence to be read
% b2 - number of the last sequence to be read
% tr1 - deletion threshold
% fcut - monomorphous threshold
b1 = 1;
b2 = 3;
tr1 = 0.05;
fcut = 0.05;
am = (b2-b1+1); %amount of the sequences
gen = {}; %container for raw info
data = {}; %container for result info
2. Take data from .fas files.
Note: Steps 2–6 are performed 3 times, for each time period, where k=1:3 is the counter.
%take data from files
for k = 1:3
filepath=strcat('data\period',num2str(k));
oldpath=cd(filepath);
filelist{k}= dir ('**/*.fas');
gen{k} = fastaread(filelist{k}.name,'BlockRead', [b1 b2],'IgnoreGaps',false);
3. Find the length of the sequence.
l = length(gen{k}(1).Sequence); %l - length of the sequence
a. Find the sites with deletions and insertions in the sequences (Figure 1) and memorize them.
Note: The aim of the protocol is to measure the selection coefficient si defined for sites that evolve due to point mutations. Genomic regions with frequent insertions and deletions cannot be described by this parameter and, hence, cannot be treated by this method.
%data cleaning
banlistSeq = [];
banlistSit = [];
erram = zeros(am,1);
potential = [];
for i0 = 1:l
del = 0;
sch = 1;
for j0 = 1:am
%ismember keeps the condition scalar when several sequences are banned
if isempty(banlistSeq) || ~ismember(j0,banlistSeq)
seq = gen{k}(j0).Sequence;
if seq(i0) =='-' || seq(i0) =='N'
potential = [potential j0];
del = del+1;
end
else
sch=sch+1;
end
end
if del ~=0
if del/(am-sum(erram)) >=tr1
banlistSit = [banlistSit i0];
potential = [];
else
erram(potential,1) = erram(potential,1)+1;
banlistSeq = [banlistSeq potential];
banlistSeq = sort(banlistSeq);
potential = [];
end
end
end
b. Delete all sequences with deletions. Mark the sites where some sequences have insertions with ‘E’.
if ~isempty(banlistSeq)
gen{k}(banlistSeq) = [];
end
if ~isempty(banlistSit)
for sch3 = 1:am
gen{k}(sch3).Sequence(banlistSit) = 'E';
end
end
data{k} = zeros(am,l);
4. Find the common consensus nucleotide or amino acid sequence for all the time points.
%finding the common consensus
cons = '';
for i = 1:l
qA=0;
qC=0;
qT=0;
qG=0;
qE=0;
for j = 1:am
seq = gen{k}(j).Sequence;
switch seq(i)
case 'A'
qA = qA+1;
case 'C'
qC = qC+1;
case 'T'
qT = qT+1;
case 'G'
qG = qG+1;
case 'E'
qE = qE+1;
end
end
[~,I] = max([qA qC qT qG qE]);
switch I
case 1
cons = [cons 'A'];
case 2
cons = [cons 'C'];
case 3
cons = [cons 'T'];
case 4
cons = [cons 'G'];
case 5
cons = [cons 'E'];
end
end
5. For each time point and each genome, replace the consensus variant at each nucleotide or amino acid position (below termed “site”) by 0 and any other variant by 1. Digit 2 marks a site that will be excluded later.
%binarization
for i2 = 1:l
for j2 = 1:am
seq = gen{k}(j2).Sequence;
if seq(i2)~='E'
if seq(i2) == cons(i2)
data{k}(j2,i2)=0;
else
data{k}(j2,i2)=1;
end
else
data{k}(j2,i2)=2;
end
end
end
6. Find “legitimate” sites that do not have insertions and deletions and whose diversity is above the threshold fcut.
Note: Homozygous sites do not allow the measurement of selection coefficients. Weakly heterozygous sites do, but with a large statistical error. For example, for 100 sequences per geographic area, a threshold fcut of a few percent is recommended.
%finding the 'right' site numbers
rightsites{k}=[];
for msch = 1:l
if data{k}(1,msch)~=2
if ~(mean(data{k}(:,msch))>=1-fcut || mean(data{k}(:,msch))<=fcut)
rightsites{k} = [rightsites{k} msch];
end
end
end
cd(oldpath);
end
7. Find the intersection between the legitimate sites of the three time periods.
%finding the intersection between 'right' sites of the different time
%periods
C1=intersect(rightsites{1},rightsites{2});
C2=intersect(C1,rightsites{3});
8. Exclude illegitimate sites marked “2” and memorize the numbers of the legitimate sites.
%removing 'bad' sites and remembering the numbers of the 'right' ones
for kk=1:k
[~,l]=size(data{kk});
C3 = 1:l;
C3(C2) = [];
data{kk}(:,C3) = [];
end
data{k+1} = C2;
9. Generate the file BinData.mat containing genomic binary sequences for renumbered legitimate sites at three time points and display the resulting binary matrices (see an example in Table 1).
%saving the data
save(strcat('BinData','.mat'),'data');
disp(data)
Figure 1.
An example of MEGA11 output with an insertion and a deletion
Table 1.
An example of a binary matrix of population DNA
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 5 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 6 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 8 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 9 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 10 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 11 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 12 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
Parameters: L = 100, N = 1000. Only the first 13 sites and 12 individual sequences are shown.
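Steps 4–6 above (consensus, binarization, and the diversity filter) can be illustrated on a toy alignment. The block below is a minimal sketch, not the authors' ACTGtranslate.m: the alignment `aln` and the names `alphabet`, `bin`, and `keep` are hypothetical, gaps and the 'E' marker are not handled, and only the role of `fcut` is taken from the protocol.

```matlab
% Toy alignment: 5 sequences (rows) x 6 sites (columns); hypothetical data.
aln = ['ACGTAC'; 'ACGTTC'; 'ATGTAC'; 'ATGTTC'; 'ACGAAC'];
fcut = 0.05;                      % same role as fcut in ACTGtranslate.m
[nSeq, nSite] = size(aln);
alphabet = 'ACGT';

% Consensus: the most frequent letter at every site.
cons = repmat(' ', 1, nSite);
for i = 1:nSite
    counts = zeros(1, 4);
    for a = 1:4
        counts(a) = sum(aln(:, i) == alphabet(a));
    end
    [~, idx] = max(counts);
    cons(i) = alphabet(idx);
end

% Binarize: 0 = consensus variant, 1 = any other variant.
bin = double(aln ~= repmat(cons, nSeq, 1));

% Keep sites whose frequency of 1 lies strictly between fcut and 1 - fcut.
f = mean(bin, 1);
keep = find(f > fcut & f < 1 - fcut);

disp(cons);
disp(keep);
disp(bin(:, keep));
```

In this toy case the consensus is ACGTAC and only sites 2, 4, and 5 are polymorphic, so the binary matrix passed on to the next stage has three columns.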
Inference of selection coefficients
Timing:
This major step serves to infer the selection coefficients of polymorphous loci from a DNA sample. Below we use an example with N mock binary sequences of length L generated by run runs of Monte-Carlo simulation using recomb_2022.m (key resources table), for a fixed set of selection coefficients. The same procedure can be run with N real-life sequences of length L obtained from several independent populations combined in one matrix (Table 1).
Note: Monte-Carlo simulation of sequences is required only if you want to test the accuracy of the protocol. It takes much more time than the main protocol.
Run the MATLAB program s_measure.m (key resources table), which reads binary sequences from BinData.mat. The program carries out the following steps:
10. Input data from BinData.mat.
load('BinData.mat','data');
genome1 = data{1,1};
genome2 = data{1,2};
genome3 = data{1,3};
order = data{1,4};
11. Input parameter description:
tr1 – deletion threshold in ACTGtranslate.m.
fcut – monomorphous threshold in ACTGtranslate.m.
genome1 (2, 3) – binary matrix representing genomic samples at times 1, 2, or 3.
C – initial value of C in Equation 1.
appr – method of curve approximation:
‘poly’ – by basic polynomials.
‘spline’ – by cubic splines.
‘pchip’ – by Piecewise Cubic Hermite Interpolating Polynomial (PCHIP).
apprR – additional parameter for ‘poly’ approximation: degree of the polynomial.
tsec1, tsec2, tsec3 – times of the first, second, and third sequence samples.
r – recombination probability per genome.
s0 – the width of the uniform distribution of selection coefficients.
ac*s0, bc*s0 – borders of the uniform distribution of s.
M – crossover number.
L – number of loci.
N – population size.
tf – end time of evolution.
f0 – initial value of f.
run – number of Monte Carlo evolution runs, also used as the seed for the first run.
run2 – the seed for generating the random distribution of s.
mu – mutation probability per site.
eps – the accuracy of C.
step – step in C.
12. Run the program s_measure.m with the loaded arguments. It initializes the variables.
function sdis=s_measure(genome1,genome2,genome3,order,C,appr,apprR,gen)
% gen - flag used below: true if genome1..3 are cell arrays of simulation runs
global tsec1 tsec2 tsec3 r s0 M L N tf f0 run run2 mu ac bc
scon = {};
if gen
[~,l] = size(genome1{1,1});
%l - number of alleles
[R,~] = size(genome1);
%R - number of runs
f1 = zeros(R,l); % array for fi at time 1 (for R runs)
f2 = zeros(R,l); % array for fi at time 2 (for R runs)
f3 = zeros(R,l); % array for fi at time 3 (for R runs)
end
eps = 10^-3; % difference between zero and the y coordinate of the triangle center at which the loop stops
yccord = 1; % basic value of the y coordinate of the triangle center
step =10^-3; % value of the step
findc = true;
13. Calculate the frequency of digit “1” for each site, denoted fi, at each time point (Table 2).
% calculation of fi
if gen
for rs = 1:R
f1(rs,:)=mean(genome1{rs,1});
f2(rs,:)=mean(genome2{rs,1});
f3(rs,:)=mean(genome3{rs,1});
end
f1f = mean(f1);
f2f = mean(f2);
f3f = mean(f3);
else
f1f=mean(genome1);
f2f=mean(genome2);
f3f=mean(genome3);
end
14. At each time point, estimate the relative shifted value of the selection coefficient at site i, denoted as the product βsi, from Equation 1 (Barlukova and Rouzine, 2021), where fi is the frequency calculated in step 13.
$$\beta s_i = -\ln f_i - C \qquad \text{(Equation 1)}$$
%calculation of si
bsi1 = -1*log(f1f)-C;
bsi2 = -1*log(f2f)-C;
bsi3 = -1*log(f3f)-C;
Note: The value of C is found as follows. First, the program takes the initial value of C from the input arguments of the function. To determine the actual value of C, the same program s_measure.m carries out the following steps:
15. Rank the genomic sites in the descending order of the estimated values of βsi.
16. Calculate for each site its new number, that is, its position after the ranking in βsi, while memorizing the label of the site in the genome (index arrays I1, I2, and I3 in the code).
%sorting and calculating
[B1,I1] = sort(bsi1,'descend');
[B2,I2] = sort(bsi2,'descend');
[B3,I3] = sort(bsi3,'descend');
17. Obtain a monotonic ranked curve for each time point (Figure 2A).
Note: The three curves form a small triangle. The curves will not separate fully if the system is in a steady state, where the method does not work (Figure 2B). The code for plotting is not shown here; see the file s_measure.m.
18. Find the center of mass of this triangle, denoted (xccord, yccord) in the code.
19. Adjust C to obtain a zero y-coordinate at that center (yccord = 0) and repeat the calculation with Equation 1 given above in step 14.
20. Repeat the loop described in steps 18–26 until the y-coordinate of the center of the triangle is zero within accuracy eps.
while abs(yccord) > eps
%calculation of si
bsi1 = -1*log(f1f)-C;
bsi2 = -1*log(f2f)-C;
bsi3 = -1*log(f3f)-C;
%sorting and calculating
[B1,I1] = sort(bsi1,'descend');
[B2,I2] = sort(bsi2,'descend');
[B3,I3] = sort(bsi3,'descend');
xdots = {};
ydots = {};
xdotss = {};
ydotss = {};
21. Fit the three curves with one of four methods, three of which (a, b, c) are standard functions in MATLAB.
a. Polynomial.
switch appr
case 'poly'
%polynomial approximation
p1 = polyfit([1:1:length(I1)],B1,apprR);
p2 = polyfit([1:1:length(I2)],B2,apprR);
p3 = polyfit([1:1:length(I3)],B3,apprR);
b. Cubic splines.
case 'spline'
%spline approximation
sp1 = spline([1:1:length(I1)],B1);
sp2 = spline([1:1:length(I2)],B2);
sp3 = spline([1:1:length(I3)],B3);
c. Piecewise Cubic Hermite Interpolating Polynomial (PCHIP).
case 'pchip'
%pchip approximation
pp1 = pchip([1:1:length(I1)],B1);
pp2 = pchip([1:1:length(I2)],B2);
pp3 = pchip([1:1:length(I3)],B3);
d. Test case (only for plotting the curves, when the other algorithms do not work).
case "test"
findc=false;
yccord = 0;
C = 0;
end
22. Find the intersections of the three pairs of curves by the method chosen by the input parameter appr.
a. Polynomial.
inter1 = p1 - p2;
inter2 = p1 - p3;
inter3 = p2 - p3;
xdots{1,1} = roots(inter1);
xdots{1,2} = roots(inter2);
xdots{1,3} = roots(inter3);
b. Cubic spline.
sinter1 = @(x) ppval(sp1,x)-ppval(sp2,x);
sinter2 = @(x) ppval(sp1,x)-ppval(sp3,x);
sinter3 = @(x) ppval(sp2,x)-ppval(sp3,x);
xdots{1,1} = fzero(sinter1,mean([1:1:length(I1)]));
xdots{1,2} = fzero(sinter2,mean([1:1:length(I2)]));
xdots{1,3} = fzero(sinter3,mean([1:1:length(I3)]));
c. PCHIP.
pinter1 = @(x) ppval(pp1,x)-ppval(pp2,x);
pinter2 = @(x) ppval(pp1,x)-ppval(pp3,x);
pinter3 = @(x) ppval(pp2,x)-ppval(pp3,x);
xdots{1,1} = fzero(pinter1,mean([1:1:length(I1)]));
xdots{1,2} = fzero(pinter2,mean([1:1:length(I2)]));
xdots{1,3} = fzero(pinter3,mean([1:1:length(I3)]));
23. Find the y-coordinates of the intersection points and memorize the y-coordinates of the first and last points of each approximation curve, using the method chosen by the input parameter appr.
a. Polynomial.
ydots{1,1} = polyval(p1,xdots{1,1});
ydots{1,2} = polyval(p1,xdots{1,2});
ydots{1,3} = polyval(p2,xdots{1,3});
ybord{1,1} = polyval(p1,1);
ybord{1,2} = polyval(p2,1);
ybord{1,3} = polyval(p3,1);
ybord{2,1} = polyval(p1,length(I1));
ybord{2,2} = polyval(p2,length(I2));
ybord{2,3} = polyval(p3,length(I3));
b. Spline.
ydots{1,1} = ppval(sp1,xdots{1,1});
ydots{1,2} = ppval(sp1,xdots{1,2});
ydots{1,3} = ppval(sp2,xdots{1,3});
ybord{1,1} = ppval(sp1,1);
ybord{1,2} = ppval(sp2,1);
ybord{1,3} = ppval(sp3,1);
ybord{2,1} = ppval(sp1,length(I1));
ybord{2,2} = ppval(sp2,length(I2));
ybord{2,3} = ppval(sp3,length(I3));
c. PCHIP.
ydots{1,1} = ppval(pp1,xdots{1,1});
ydots{1,2} = ppval(pp1,xdots{1,2});
ydots{1,3} = ppval(pp2,xdots{1,3});
ybord{1,1} = ppval(pp1,1);
ybord{1,2} = ppval(pp2,1);
ybord{1,3} = ppval(pp3,1);
ybord{2,1} = ppval(pp1,length(I1));
ybord{2,2} = ppval(pp2,length(I2));
ybord{2,3} = ppval(pp3,length(I3));
24. Check the accuracy of the intersection points by verifying whether they have real values and lie on the curves.
if findc
%coordinates check
Iall={I1;I2;I3};
for times = 1: length(xdots)
ncount1 = 0;
for param1= 1: length(xdots{1,times})
IallLen(times) = length(Iall{times,1});
if xdots{1,times}(param1)<= IallLen(times) ...
&& xdots{1,times}(param1)>=1 && imag(xdots{1,times}(param1))==0 ...
&& ydots{1,times}(param1)<=ybord{1,times} && ydots{1,times}(param1)>=ybord{2,times}
ncount1 = ncount1 +1;
xdotss{times,ncount1} = xdots{1,times}(param1);
ydotss{times,ncount1} = ydots{1,times}(param1);
end
end
end
25. Form the triangle from the intersection points and find its center.
%coordinates of triangle tops
xdcord = [];
ydcord = [];
[shr,dl] = size(xdotss);
for h =1:shr
for c = 1:dl
xdcord = [xdcord xdotss{h,c}];
ydcord = [ydcord ydotss{h,c}];
end
end
%center finding
polyin = polyshape({xdcord},{ydcord});
[xccord,yccord] = centroid(polyin);
26. Modify C and repeat the cycle (step 18) until the y-coordinate of the center is zero (yccord = 0) within accuracy eps.
if yccord>0
C = C+ step;
end
if yccord<0
C = C - step;
end
end
end
27. After the accuracy is reached in step 20, the final estimates of si are obtained.
Note: In our example of simulated sequences, these estimates are plotted against their actual values (Figure 3), for different numbers of runs used for averaging (100 and 1,000) and different time points. The code for plotting is not shown; see the file s_measure.m.
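The search for C in steps 17–26 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code of s_measure.m: the three ranked curves `B1`, `B2`, `B3` are hypothetical straight lines, they are fitted with degree-1 polynomials, and the triangle center is taken as the mean of the three intersection points (which coincides with the centroid of a triangle) instead of using polyshape and centroid. The tolerance is named `tol` to avoid shadowing the built-in `eps`.

```matlab
% Hypothetical ranked curves at three time points (values at C = 0).
x  = 1:10;
B1 = 1.0 - 0.10*x;
B2 = 1.3 - 0.16*x;
B3 = 1.7 - 0.22*x;

C = 0;          % initial value of C
step = 1e-3;    % step in C
tol  = 1e-3;    % accuracy required for the y-coordinate of the center
yc = Inf;

while abs(yc) > tol
    % Fit the shifted curves (Equation 1 subtracts the same C from each).
    p1 = polyfit(x, B1 - C, 1);
    p2 = polyfit(x, B2 - C, 1);
    p3 = polyfit(x, B3 - C, 1);
    % Pairwise intersections: vertices of the triangle.
    xv = [roots(p1 - p2), roots(p1 - p3), roots(p2 - p3)];
    yv = [polyval(p1, xv(1)), polyval(p1, xv(2)), polyval(p2, xv(3))];
    % Center of the triangle (mean of its vertices).
    yc = mean(yv);
    % Move C toward the value that puts the center on the x-axis.
    if abs(yc) > tol
        C = C + sign(yc)*step;
    end
end

fprintf('C = %.3f, center y = %.5f\n', C, yc);
```

Because the same C is subtracted from all three curves, changing C shifts the triangle vertically without changing its shape, which is why a simple fixed-step search converges.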
Table 2.
Example of locus allele frequencies calculated in step 13
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.912 | 0.827 | 0.963 | 0.936 | 0.950 | 0.964 | 0.943 | 0.927 | 0.908 | 0.868 | 0.920 | 0.816 | 0.935 |
Parameters: L = 100, N = 1000. Only the first 13 sites are shown. The example from Table 1 is used.
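As a worked illustration of steps 14–16 and of the re-ordering in step 28, the frequencies of Table 2 can be converted to relative selection coefficients and ranked. The sketch assumes C = 0 for simplicity (in the protocol C is determined iteratively), and the variable `bsBack` is a hypothetical name.

```matlab
% Frequencies of digit 1 at the first 13 sites (Table 2).
f = [0.912 0.827 0.963 0.936 0.950 0.964 0.943 ...
     0.927 0.908 0.868 0.920 0.816 0.935];
C = 0;                           % assumed here; found iteratively in the protocol

% Equation 1: relative shifted selection coefficient.
bsi = -log(f) - C;

% Steps 15-16: rank sites in descending order; I holds the original labels.
[B, I] = sort(bsi, 'descend');

% Step 28: put the ranked values back into genomic order.
bsBack = zeros(size(B));
bsBack(I) = B;

% The ranking does not depend on C, which only shifts all values.
[~, Ishift] = sort(-log(f) - 0.05, 'descend');

disp(I);
```

The site with the lowest frequency (site 12) receives the largest βsi and therefore rank 1, and the site with the highest frequency (site 6) is ranked last.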
Figure 2.
Ranked curve for three time points
(A) Successful separation of curves, with a proper focal point, when far from a steady state.
(B) Failure of the method close to a steady state. Parameters: (A) s0 = 0.05, ac = -1, bc = 1, L = 100, N = 1000, tf = 60, f0 = 0.1, runs = 100, μL = 0.1, t1 = 20, t2 = 40, t3 = 60; (B) t1 = 500, t2 = 750, t3 = 1000, tf = 1000, and the others as in (A).
Figure 3.
Estimated values of selection coefficient as a function of their actual values
(A) 100 runs.
(B) 1000 runs. Parameter values are as in Figure 2.
28. Re-order the ranked sites back to their original order in the genome and plot the relative values of the selection coefficient, βsi, against their actual amino acid positions (Figure 4).
Figure 4.
Estimated relative values of selection coefficient in the genome
X-axis: Site number in genome. Y axis: selection coefficient. (A, B and C) correspond to the three sampling times and differ, mostly, in the scaling factor common for all sites. Parameter values are as in Figure 3.
Expected outcomes
The desired outcome is the set of selection coefficients of mutations at each heterozygous locus in the studied populations, expressed in relative units (Figure 4). To find the single scaling factor that sets the units, an additional experiment or additional data are required. Failure is indicated by the three curves in Figure 2 not separating clearly with a single intersection point, as in Figure 2B.
Quantification and statistical analysis
To estimate the standard deviation of the estimates of βsi, use either of the following:
- random resampling of half of the available sequences;
- bootstrapping, i.e., random resampling from the same dataset with replacement.
The first and the second method yield the upper and the lower estimate of the 95% statistical error, respectively.
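Both resampling schemes can be sketched as follows. This is an illustration only: the binary matrix `bin` is hypothetical, rows (sequences) are the resampled units, and C is held fixed at 0, whereas a full analysis would re-run the whole estimation (including the search for C) on every resample.

```matlab
% Hypothetical binary data: N sequences x L sites, with frequencies 0.2, 0.5, 0.8.
N = 200;  L = 3;
bin = zeros(N, L);
bin(1:40, 1)  = 1;
bin(1:100, 2) = 1;
bin(1:160, 3) = 1;

C = 0;          % assumed fixed for this sketch
nRep = 200;     % number of resamples
bsBoot = zeros(nRep, L);
bsHalf = zeros(nRep, L);

for rep = 1:nRep
    % Bootstrap: N sequences drawn with replacement.
    idxBoot = randi(N, 1, N);
    bsBoot(rep, :) = -log(mean(bin(idxBoot, :), 1)) - C;
    % Half resampling: N/2 sequences drawn without replacement.
    idxHalf = randperm(N, N/2);
    bsHalf(rep, :) = -log(mean(bin(idxHalf, :), 1)) - C;
end

% Standard deviation of the estimate at each site.
sdBoot = std(bsBoot, 0, 1);
sdHalf = std(bsHalf, 0, 1);
disp([sdBoot; sdHalf]);
```

Sites with a low frequency of digit 1 show a larger spread than sites with a high frequency, which is the pattern that motivates the fcut filter.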
Limitations
For the method to work, the system has to satisfy several requirements, based on the assumptions of the model in (Barlukova and Rouzine, 2021), as follows:
The method applies to adapting populations only. The sign of adapting populations is the clear separation of the three curves with a single intersection point (Figure 2A). If a population is no longer adapting and is near a steady state, all curves collapse onto one regardless of the time spacing (Figure 2B).
Selection type is directional and constant (or, at least, changing slowly on time scale on the order of the inverse selection coefficient). This condition can be checked, again, by observing clear separation and single intersection of the curves (Figure 2A).
Multiple samples from almost-independent replicate populations, as verified by their phylogeny, must be available for averaging of allelic frequencies. Otherwise, the curves in Figure 2 will be ragged and stochastic-looking, and clear separation (Figure 2A) will not be obtained.
Epistasis is not included explicitly, because it is assumed to be incorporated in the renormalized values of selection coefficients. This is a good approximation on sufficiently short time scales. In the long term, genomes must be described as having many epistatic pairs, and the effective values of selection coefficients change slowly. The inference of epistasis is addressed elsewhere (Pedruzzi and Rouzine, 2021). The quality of this approximation can be assessed by increasing the time points together and observing a slow shift in the final estimates of selection coefficients (Figure 4).
The aim of the protocol is to measure the selection coefficient si defined for sites that evolve due to point mutations. Genomic regions with frequent insertions and deletions cannot be described by this parameter and, hence, cannot be treated by this method. Homozygous sites do not allow the measurement of the selection coefficient either.
Troubleshooting
Problem 1
The ranked-s curves obtained in step 17 at different time points are not distinct enough, are not ordered as they are supposed to be, and/or do not intersect within a single small area as they do in Figure 2A.
Potential solution
- Try to increase the time window. One can choose any time points t1, t2, and t3, as long as they are sufficiently far apart. The criterion is that the resulting curves in Figure 2A are clearly separated and have a single intersection. The specific choice of time points affects only the common factor β multiplying the estimates of the selection coefficient, but it does not affect their relative values. If the curves do not separate even for widely spaced time points, the system is in a steady state and the method is not applicable.
- Try to eliminate the sites that create the bad behavior of the curves and work with the remaining sites.
- If neither method works, the system may be in a steady state or under rapidly changing selection and is not amenable to this method.
Problem 2
The standard deviation of the final estimates is unacceptably large for some sites.
Potential solution
- Filter noisy sites out by increasing the cutoff threshold for heterozygous sites, fcut, in file ACTGtranslate.m (step 1).
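Why sites near the threshold are noisy can be seen from a simple sampling estimate. The sketch below assumes binomial sampling of N independent sequences (an assumption made here for illustration, not a statement from the protocol), for which the relative error of an observed frequency f is sqrt((1 - f)/(f N)). Since Equation 1 uses the logarithm of f, this relative error is also the approximate standard deviation of βsi.

```matlab
N = 100;                          % number of sequences in the sample
f = [0.01 0.05 0.20 0.50];        % frequencies of the minor variant

% Relative standard error of f under binomial sampling (assumed model).
relErr = sqrt((1 - f) ./ (f * N));

for k = 1:numel(f)
    fprintf('f = %.2f  relative error = %.3f\n', f(k), relErr(k));
end
```

With 100 sequences, a site at f = 0.20 has a relative error of 0.2 and a site at f = 0.50 has 0.1, while sites at a few percent have errors several times larger, so raising fcut trades the number of usable sites for precision.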
Problem 3
DNA sequences do not come from nearly-independent populations and form a monophyletic tree.
Potential solution
Seek additional data or change the organism of study.
Problem 4
Your MATLAB cannot open the .m files because of the file format.
Potential solution
Use the text versions of the files in the database ending with .txt. Change their names to the names of the corresponding .m files.
Problem 5
MATLAB returns syntax errors.
Potential solution
Get a newer version of MATLAB. The program has been tested in MATLAB 2017, 2020, and 2022.
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Igor M Rouzine (igor.rouzine@iephb.ru).
Materials availability
This study did not generate new unique reagents.
Acknowledgments
The study was carried out within the framework of the state assignment of the Federal Agency for Scientific Organizations (FASO Russia: topic no. AAAA-А18-118012290142-9).
Author contributions
I.V.L.: Software, numerical simulation, calculation, visualization, wrote manuscript. I.M.R.: Concept, administration, supervision, developed protocol, wrote manuscript.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Igor V. Likhachev, Email: reirose2002@gmail.com.
Igor M. Rouzine, Email: igor.rouzine@iephb.ru.
Data and code availability
The computer code that carries out the protocol is available at:
Database: https://www.dropbox.com/sh/ptuspj468b88at8/AAD4EsEXghCU46zBC1Ai_-7Ka?dl=0.
The questions about the programs should be addressed to the technical contact, Igor V Likhachev, reirose2002@gmail.com.
References
- Barlukova A., Rouzine I.M. The evolutionary origin of the universal distribution of mutation fitness effect. PLoS Comput. Biol. 2021;17:e1008822. doi: 10.1371/journal.pcbi.1008822.
- Pedruzzi G., Rouzine I.M. An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza. PLoS Pathog. 2021;17:e1009669. doi: 10.1371/journal.ppat.1009669.