STAR Protocols. 2023 Mar 4;4(1):101821. doi: 10.1016/j.xpro.2022.101821

Measurement of selection coefficients from genomic samples of adapting populations by computer modeling

Igor V. Likhachev, Igor M. Rouzine
PMCID: PMC9999197  PMID: 36871222

Summary

Existing protocols for measuring the selection coefficients of loci neglect linkage effects between loci. This protocol is free from that limitation. It takes as input a set of DNA sequences sampled at three time points, removes conserved sites, and estimates selection coefficients. To test the accuracy, the user can have the protocol generate mock data by computer simulation of evolution. The main limitation is the need for sequence samples isolated from 30–100 populations adapting in parallel.

For complete details on the use and execution of this protocol, please refer to Barlukova and Rouzine (2021).

Subject areas: Sequence analysis, Genetics, Genomics, Computer sciences, Evolutionary biology


Highlights

  • Measures selection coefficients for adapting loci in relative units

  • Inputs DNA sequences, and excludes conserved sites and insertions/deletions

  • Estimates selection coefficients from these data

  • If desired, tests the accuracy of the protocol by using simulated data


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.



Before you begin

DNA, RNA, or protein sequences

  • 1.

    Obtain a database of DNA sequences of a pathogen or organism at three time points, t1, t2, and t3. If you have RNA or protein sequences, translate them using MEGA or any other standard software.

  • 2.

    Align and trim sequences using MEGA software.

  • 3.

    Output sequences to MEGA files corresponding to times t1, t2, and t3.

Note: One can choose any three time points, as long as they are sufficiently far apart (see troubleshooting below).

Software

  • 4.

    Install MATLAB™ version 2017 or later, or GNU Octave.

  • 5.

    Install MEGA.

  • 6.

    Download and install the present software (key resources table).

CRITICAL: Using any phylogenetic method (for example, one implemented in MEGA software), make sure that the genomic sequences are isolated from at least 30–100 independently evolving populations. A polyphyletic tree must be observed.

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited data
Multiple DNA sequences of any organism from at least 30–100 independently adapting populations | Any public DNA database | N/A

Software and algorithms
MATLAB™ 2017 or later, or GNU Octave | MathWorks | N/A
MEGA software | MEGA | N/A
Software developed for the present work | https://www.dropbox.com/sh/ptuspj468b88at8/AAD4EsEXghCU46zBC1Ai_-7Ka?dl=0 | ACTGtranslate.m, BinData.mat, s_measure.m, main.m (optional)
Software for generating mock sequences by simulating evolution (optional) | https://github.com/irouzine/Strong-linkage-in-sex | recomb_2022.m

Step-by-step method details

Binarization of genomic data

Timing: 1.0 × L × N / 10^7 s

Run our program ACTGtranslate.m (key resources table) that inputs N sequences of length L from the MEGA output files for all timepoints t1, t2, and t3 obtained in preliminary step 3. The program will automatically carry out the steps, as follows:

  • 1.

    Initialize parameters.

%% parameters
% b1 - number of the first sequence to be read
% b2 - number of the last sequence to be read
% tr1 - deletion threshold
% fcut - monomorphous threshold
b1 = 1;
b2 = 3;
tr1 = 0.05;
fcut = 0.05;
am = (b2-b1+1); %number of sequences
gen = {}; %container for raw info
data = {}; %container for result info

  • 2.

    Take data from .fas files.

Note: Steps 2–6 are performed 3 times, for each time period, where k=1:3 is the counter.

%take data from files
%(fastaread requires the Bioinformatics Toolbox)
for k = 1:3
 filepath = fullfile('data',['period',num2str(k)]);
 oldpath = cd(filepath);
 filelist{k} = dir('**/*.fas');
 gen{k} = fastaread(filelist{k}.name,'BlockRead',[b1 b2],'IgnoreGaps',false);

  • 3.
    Find the length of the sequence.
    l = length(gen{k}(1).Sequence);
    %l - length of the sequence
    • a.
      Find the sites with deletions and insertions in sequences (Figure 1) and memorize them.
      Note: The aim of the protocol is to measure the selection coefficient si defined for sites that evolve due to point mutations. Genomic regions with frequent insertions and deletions cannot be described by this parameter and, hence, cannot be treated by this method.
      %data cleaning
      banlistSeq = []; %sequences to be removed
      banlistSit = []; %sites to be marked 'E'
      erram = zeros(am,1);
      potential = [];
      for i0 = 1:l
       del = 0;
       sch = 1;
       for j0 = 1:am
        %skip sequences that are already banned
        if ~ismember(j0,banlistSeq)
         seq = gen{k}(j0).Sequence;
         if seq(i0) == '-' || seq(i0) == 'N'
          potential = [potential j0];
          del = del+1;
         end
        else
         sch = sch+1;
        end
       end
       if del ~= 0
        if del/(am-sum(erram)) >= tr1
         %gaps are frequent at this site: ban the site
         banlistSit = [banlistSit i0];
         potential = [];
        else
         %gaps are rare at this site: ban the sequences that carry them
         erram(potential,1) = erram(potential,1)+1;
         banlistSeq = [banlistSeq potential];
         banlistSeq = sort(banlistSeq);
         potential = [];
        end
       end
      end
    • b.
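      Delete all sequences with deletions (gaps found in a fraction of sequences below tr1). Mark with ‘E’ the sites where some sequences have insertions (gaps found in a fraction of sequences of at least tr1).

if ~isempty(banlistSeq)
 gen{k}(banlistSeq) = [];
end
nseq = numel(gen{k}); %number of sequences left after the removal
if ~isempty(banlistSit)
 for sch3 = 1:nseq
  gen{k}(sch3).Sequence(banlistSit) = 'E';
 end
end
data{k} = zeros(nseq,l);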

  • 4.

    Find the common consensus nucleotide or amino acid sequence for all the time points.

%finding the common consensus
%(as written, the consensus is taken over the sequences of period k)
cons = '';
letters = 'ACTGE';
for i = 1:l
 q = zeros(1,5); %counts of A, C, T, G, and E at site i
 for j = 1:numel(gen{k})
  seq = gen{k}(j).Sequence;
  q = q + (seq(i) == letters);
 end
 [~,I] = max(q);
 cons = [cons letters(I)];
end

  • 5.

    For each time point and each genome, replace the consensus variant at each nucleotide or amino acid position (below termed “site”) by 0 and any other variant by 1. The digit 2 marks a site that will be excluded later.

%binarization
for i2 = 1:l
 for j2 = 1:numel(gen{k})
  seq = gen{k}(j2).Sequence;
  if seq(i2) ~= 'E'
   if seq(i2) == cons(i2)
    data{k}(j2,i2) = 0;
   else
    data{k}(j2,i2) = 1;
   end
  else
   data{k}(j2,i2) = 2;
  end
 end
end

  • 6.

    Find “legitimate” sites that do not have insertions or deletions and whose diversity is above the threshold fcut.

Note: Homozygous (monomorphic) sites do not allow the measurement of selection coefficients. Weakly heterozygous sites do, but with a large statistical error. For example, for 100 sequences sampled per geographic area, a threshold fcut of a few percent is recommended.

%finding the 'right' site numbers
rightsites{k} = [];
for msch = 1:l
 if data{k}(1,msch) ~= 2
  fm = mean(data{k}(:,msch)); %frequency of digit 1 at this site
  if fm > fcut && fm < 1-fcut
   rightsites{k} = [rightsites{k} msch];
  end
 end
end
 cd(oldpath);
end

  • 7.

    Find the intersection between the legitimate sites of the three time periods.

%finding the intersection between 'right' sites of the different time periods
C1 = intersect(rightsites{1},rightsites{2});
C2 = intersect(C1,rightsites{3});

  • 8.

    Exclude all sites outside this intersection, including those marked “2”, and memorize the numbers of the legitimate sites.

%removing 'bad' sites and remembering the numbers of the 'right' ones
for kk = 1:3
 [~,l] = size(data{kk});
 C3 = 1:l;
 C3(C2) = [];
 data{kk}(:,C3) = [];
end
data{4} = C2;

  • 9.

    Generate the file BinData.mat containing the binary genomic sequences for the renumbered legitimate sites at the three time points, and display the resulting binary matrices (see an example in Table 1).

%saving the data
save('BinData.mat','data');
disp(data)

Figure 1. An example of MEGA11 output with an insertion and a deletion
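The gap-handling rule of step 3a can be illustrated on a toy alignment. The following is only a minimal sketch in MATLAB/Octave, not the code of ACTGtranslate.m: the alignment `aln`, the variable names `keepSeq` and `banSite`, and the enlarged threshold `tr1 = 0.3` (chosen so that both branches are triggered with six sequences) are assumptions made for illustration.

```matlab
% Toy alignment: 6 sequences x 5 sites (hypothetical data).
aln = ['ACGTA';
       'A-GTA';
       'A-GTA';
       'ACGTN';
       'ACGTA';
       'ACGTA'];
tr1 = 0.3;                          % toy threshold; the protocol listing uses 0.05

isgap   = (aln == '-') | (aln == 'N');
keepSeq = true(size(aln, 1), 1);    % sequences not removed so far
banSite = false(1, size(aln, 2));   % sites to be marked 'E'

for i = 1:size(aln, 2)
    hit = isgap(:, i) & keepSeq;    % gaps among the remaining sequences
    if any(hit)
        if sum(hit) / sum(keepSeq) >= tr1
            banSite(i) = true;      % frequent gap: exclude the site
        else
            keepSeq(hit) = false;   % rare gap: remove those sequences
        end
    end
end

cleaned = aln(keepSeq, :);          % drop removed sequences
cleaned(:, banSite) = 'E';          % mark excluded sites
disp(cleaned)
```

In this toy case, site 2 (gaps in 2 of 6 sequences) is marked, and sequence 4 (a single N at site 5) is removed.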

Table 1. An example of a binary matrix of population DNA

Sequence\Site 1 2 3 4 5 6 7 8 9 10 11 12 13
1 1 0 1 1 1 0 1 0 1 1 1 1 1
2 1 0 1 1 1 0 1 0 1 1 1 1 1
3 1 0 1 1 1 0 1 0 1 1 1 1 1
4 1 0 1 1 1 0 1 0 1 1 1 1 1
5 1 0 1 1 1 0 1 0 1 1 1 1 1
6 1 0 1 1 1 0 1 0 1 1 1 1 1
7 1 0 1 1 1 0 1 0 1 1 1 1 1
8 1 0 1 1 1 0 1 0 1 1 1 1 1
9 1 0 1 1 1 0 1 0 1 1 1 1 1
10 1 0 1 1 1 1 1 1 1 1 1 0 1
11 1 0 1 1 1 1 1 1 1 1 1 0 1
12 1 0 1 1 1 1 1 1 1 1 1 0 1

Parameters: L = 100, N = 1000. Only the first 13 sites and 12 individual sequences are shown.
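Steps 4–6 (consensus, binarization, and the diversity filter) can be condensed into a few vectorized lines. The sketch below is written under stated assumptions: the alignment `aln` is hypothetical and gap-free, the names `counts`, `bin`, and `legit` are illustrative, and ties in the consensus are broken by the letter order 'ACGT', which may differ from the tie-breaking in ACTGtranslate.m.

```matlab
% Toy alignment: 6 sequences x 5 sites (hypothetical data, one time point).
aln = ['ACGTA';
       'ACGTA';
       'ATGTA';
       'ATGCA';
       'ACGTA';
       'ATGTA'];
fcut = 0.05;                        % diversity threshold, as in ACTGtranslate.m
letters = 'ACGT';
[nseq, nsite] = size(aln);

% Count each nucleotide per site; the most frequent one is the consensus.
counts = zeros(4, nsite);
for a = 1:4
    counts(a, :) = sum(aln == letters(a), 1);
end
[~, idx] = max(counts, [], 1);
cons = letters(idx);

% 0 = consensus variant, 1 = any other variant.
bin = double(aln ~= repmat(cons, nseq, 1));

% Keep sites whose frequency of "1" lies strictly between fcut and 1 - fcut.
f = mean(bin, 1);
legit = find(f > fcut & f < 1 - fcut);

disp(cons); disp(bin); disp(legit);
```

Here only sites 2 and 4 are polymorphic, so `legit` contains these two site numbers; the remaining sites are monomorphic and are dropped.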

Inference of selection coefficients

Timing: 8.0 × L × N / 10^7 s

This major step serves to infer the selection coefficients of polymorphous loci from a DNA sample. Below we use an example with N mock binary sequences of length L generated by a number, run, of Monte Carlo simulation runs using recomb_2022.m (key resources table) for a fixed set of si. The same procedure can be run with N real-life sequences of length L obtained from several independent populations and combined in one matrix (Table 1).

Note: Monte Carlo simulation of sequences is required only if you want to test the accuracy of the protocol. It takes much more time than the main protocol: 30 × L × N × run × tf / (6 × 10^8) min.

Run MATLAB program s_measure.m (key resources table) that inputs binary sequences from BinData.mat. The program carries out the steps, as follows:

  • 10.

    Input data from BinData.mat.

load('BinData.mat','data');

genome1 = data{1,1};

genome2 = data{1,2};

genome3 = data{1,3};

order = data{1,4};

  • 11.

    Input parameter description:

tr1 – deletion threshold in ACTGtranslate.m.

fcut – monomorphous threshold in ACTGtranslate.m.

genome1 (2, 3) – binary matrix representing the genomic sample at time 1 (2, 3).

order – numbers of the legitimate sites in the original alignment (the fourth cell of data in BinData.mat).

C – initial value of C in Equation 1.

appr – method of curve approximation:

‘poly’ – by basic polynomials.

‘spline’ – by cubic splines.

‘pchip’ – by Piecewise Cubic Hermite Interpolating Polynomial (PCHIP).

apprR – additional parameter for the ‘poly’ approximation: the degree of the polynomial.

generate – flag used in s_measure.m; when it is true, genome1–3 are treated as cell arrays holding one binary matrix per Monte Carlo run.

tsec1, tsec2, tsec3 – times of the first, second, and third sequence samples.

r – recombination probability per genome.

s0 – the width of the uniform distribution of the selection coefficient.

ac*s0, bc*s0 – borders of the uniform distribution of s.

M – crossover number.

L – number of loci.

N – population size.

tf – end time of evolution.

f0 – initial value of f.

run – number of Monte Carlo evolution runs, also used as the seed for the first run.

run2 – the seed for generating the random distribution of s.

mu – mutation probability per site.

eps – the accuracy of C.

step – step in C.

  • 12.

    Run the program s_measure.m with the loaded arguments. It first initializes the variables.

function sdis=s_measure(genome1,genome2,genome3,order,C,appr,apprR,generate)
global tsec1 tsec2 tsec3 r s0 M L N tf f0 run run2 mu ac bc
scon = {};
if generate
 [~,l] = size(genome1{1,1});
 %l - number of sites
 [R,~] = size(genome1);
 %R - number of runs
 f1 = zeros(R,l); % array for fi at time 1 (for R runs)
 f2 = zeros(R,l); % array for fi at time 2 (for R runs)
 f3 = zeros(R,l); % array for fi at time 3 (for R runs)
end
eps = 10^-3; % tolerance on the y coordinate of the triangle center, at which the loop stops
yccord = 1; % initial value of the y coordinate of the triangle center
step = 10^-3; % increment of C
findc = true;

  • 13.

    Calculate the frequency of digit “1” for each site i, denoted fi(t), at each time point t = t1, t2, and t3 (Table 2).

% calculation of fi
if generate
 % simulated data: average over sequences within each run, then over runs
 for rs = 1:R
  f1(rs,:) = mean(genome1{rs,1});
  f2(rs,:) = mean(genome2{rs,1});
  f3(rs,:) = mean(genome3{rs,1});
 end
 f1f = mean(f1);
 f2f = mean(f2);
 f3f = mean(f3);
else
 % real data: average over all sequences in the matrix
 f1f = mean(genome1);
 f2f = mean(genome2);
 f3f = mean(genome3);
end

  • 14.

    At each time point t, estimate the relative shifted value of the selection coefficient at site i, denoted as the product β(t)si, from the equation (Barlukova and Rouzine, 2021)

β(t)si = log[fi(t)] + C (Equation 1)

%calculation of si
%(as coded, bsi = -log(fi) - C, i.e., the sign convention is opposite to Equation 1 as printed)
bsi1 = -1*log(f1f)-C;
bsi2 = -1*log(f2f)-C;
bsi3 = -1*log(f3f)-C;

Note: The value of C is found as follows. First, the program takes the initial value of C from the input arguments of the function. To determine the actual value of C, the same program s_measure.m carries out the following steps:

  • 15.

    Rank genomic sites in the descending order of the estimated values of si.

  • 16.

    Calculate for each site i its new number mi, where i is the label of the site in the genome, and mi is its number after the ranking in si.

%sorting and calculating

[B1,I1] = sort(bsi1,'descend');

[B2,I2] = sort(bsi2,'descend');

[B3,I3] = sort(bsi3,'descend');

  • 17.

    Obtain a monotonic ranked curve si(mi) for each time point (Figure 2A).

Note: The three curves form a small triangle. The curves will not separate fully, if the system is in a steady state where the method does not work (Figure 2B). The code for plotting is not shown here, see file s_measure.m.
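Steps 13–17 for a single time point can be traced on a handful of numbers. This is a minimal sketch only: the frequencies in `f` are hypothetical, `C` is left at a starting value of 0, and the expression for `bsi` follows the sign convention of the code listing in step 14.

```matlab
% Hypothetical frequencies of digit "1" at five sites, one time point.
f = [0.20 0.05 0.50 0.10 0.35];
C = 0;                          % starting value of the constant

% Relative selection coefficients, as evaluated in the step 14 listing.
bsi = -log(f) - C;

% Rank sites in descending order; I(m) is the label of the site with rank m.
[B, I] = sort(bsi, 'descend');

% Inverse permutation: m(i) is the rank mi of site i.
m = zeros(size(I));
m(I) = 1:numel(I);

disp([1:numel(f); m; bsi]);     % site label, rank, and estimate
```

Plotting `B` against `1:numel(B)` gives the ranked curve si(mi) for this time point; the rarest variant (site 2, f = 0.05) receives rank 1.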

  • 18.

    Find the center of mass of this triangle, denoted (mc,sc).

  • 19.

    Adjust C to obtain si=0 at that center and repeat the calculation with Equation 1 given in step 14.

  • 20.

    Repeat the loop described in steps 18–26 until si=0 is obtained at the center of the triangle within accuracy eps.

while abs(yccord) > eps
%calculation of si
bsi1 = -1*log(f1f)-C;
bsi2 = -1*log(f2f)-C;
bsi3 = -1*log(f3f)-C;
%sorting and calculating
[B1,I1] = sort(bsi1,'descend');
[B2,I2] = sort(bsi2,'descend');
[B3,I3] = sort(bsi3,'descend');
xdots = {};
ydots = {};
xdotss = {};
ydotss = {};

  • 21.
    Fit the three curves with one of four methods, three of which (a, b, c) are standard functions in MATLAB.
    • a.
      Polynomial.
      switch appr
       case 'poly'
      %polynomial approximation
      p1 = polyfit([1:1:length(I1)],B1,apprR);
      p2 = polyfit([1:1:length(I2)],B2,apprR);
      p3 = polyfit([1:1:length(I3)],B3,apprR);
    • b.
      Cubic splines.
       case 'spline'
      %spline approximation
      sp1 = spline([1:1:length(I1)],B1);
      sp2 = spline([1:1:length(I2)],B2);
      sp3 = spline([1:1:length(I3)],B3);
    • c.
      Piecewise Cubic Hermite Interpolating Polynomial (PCHIP).
       case 'pchip'
      %pchip approximation
      pp1 = pchip([1:1:length(I1)],B1);
      pp2 = pchip([1:1:length(I2)],B2);
      pp3 = pchip([1:1:length(I3)],B3);
    • d.
      Test case (only for plotting the curves, when the other algorithms do not work).

 case "test"

findc=false;

yccord = 0;

C = 0;

end

  • 22.
    Find intersections of the three pairs of three curves by one of the methods chosen by input parameter appr.
    • a.
      Polynomial.
      inter1 = p1 - p2;
      inter2 = p1 - p3;
      inter3 = p2 - p3;
      xdots{1,1} = roots(inter1);
      xdots{1,2} = roots(inter2);
      xdots{1,3} = roots(inter3);
    • b.
      Cubic spline.
      sinter1 = @(x) ppval(sp1,x)-ppval(sp2,x);
      sinter2 = @(x) ppval(sp1,x)-ppval(sp3,x);
      sinter3 = @(x) ppval(sp2,x)-ppval(sp3,x);
      xdots{1,1} = fzero(sinter1,mean([1:1:length(I1)]));
      xdots{1,2} = fzero(sinter2,mean([1:1:length(I2)]));
      xdots{1,3} = fzero(sinter3,mean([1:1:length(I3)]));
    • c.
      PCHIP.

pinter1 = @(x) ppval(pp1,x)-ppval(pp2,x);

pinter2 = @(x) ppval(pp1,x)-ppval(pp3,x);

pinter3 = @(x) ppval(pp2,x)-ppval(pp3,x);

xdots{1,1} = fzero(pinter1,mean([1:1:length(I1)]));

xdots{1,2} = fzero(pinter2,mean([1:1:length(I2)]));

xdots{1,3} = fzero(pinter3,mean([1:1:length(I3)]));

  • 23.
    Find y-coordinates of the intersection dots and memorize y-coordinates of the first and last point of the approximation curve by one of the methods chosen by input parameter appr.
    • a.
      Polynomial.
      ydots{1,1} = polyval(p1,xdots{1,1});
      ydots{1,2} = polyval(p1,xdots{1,2});
      ydots{1,3} = polyval(p2,xdots{1,3});
      ybord{1,1} = polyval(p1,1);
      ybord{1,2} = polyval(p2,1);
      ybord{1,3} = polyval(p3,1);
      ybord{2,1} = polyval(p1,length(I1));
      ybord{2,2} = polyval(p2,length(I2));
      ybord{2,3} = polyval(p3,length(I3));
    • b.
      Spline.
      ydots{1,1} = ppval(sp1,xdots{1,1});
      ydots{1,2} = ppval(sp1,xdots{1,2});
      ydots{1,3} = ppval(sp2,xdots{1,3});
      ybord{1,1} = ppval(sp1,1);
      ybord{1,2} = ppval(sp2,1);
      ybord{1,3} = ppval(sp3,1);
      ybord{2,1} = ppval(sp1,length(I1));
      ybord{2,2} = ppval(sp2,length(I2));
      ybord{2,3} = ppval(sp3,length(I3));
    • c.
      PCHIP.

ydots{1,1} = ppval(pp1,xdots{1,1});

ydots{1,2} = ppval(pp1,xdots{1,2});

ydots{1,3} = ppval(pp2,xdots{1,3});

ybord{1,1} = ppval(pp1,1);

ybord{1,2} = ppval(pp2,1);

ybord{1,3} = ppval(pp3,1);

ybord{2,1} = ppval(pp1,length(I1));

ybord{2,2} = ppval(pp2,length(I2));

ybord{2,3} = ppval(pp3,length(I3));

  • 24.

    Check the accuracy of the intersection points by verifying whether they have real values and lie on the curves.

if findc
%coordinates check
Iall = {I1;I2;I3};
for times = 1:length(xdots)
 ncount1 = 0;
 for param1 = 1:length(xdots{1,times})
  IallLen(times) = length(Iall{times,1});
  xd = xdots{1,times}(param1);
  yd = ydots{1,times}(param1);
  %keep real intersection points that lie within the range of the curves
  if xd <= IallLen(times) && xd >= 1 && imag(xd) == 0 ...
    && yd <= ybord{1,times} && yd >= ybord{2,times}
   ncount1 = ncount1 + 1;
   xdotss{times,ncount1} = xd;
   ydotss{times,ncount1} = yd;
  end
 end
end

  • 25.

    Form the triangle from the intersection points and find its center.

%coordinates of triangle tops

xdcord = [];

ydcord = [];

[shr,dl] = size(xdotss);

for h =1:shr

 for c = 1:dl

  xdcord = [xdcord xdotss{h,c}];

  ydcord = [ydcord ydotss{h,c}];

 end

end

%center finding

polyin = polyshape({xdcord},{ydcord});

[xccord,yccord] = centroid(polyin);

  • 26.

    Modify C and repeat the cycle (step 18) until obtaining si=0 at the center within accuracy eps.

if yccord>0

 C = C+ step;

end

if yccord<0

 C = C - step;

end

end

end
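The search for C in steps 18–26 can be reproduced in miniature with straight lines in place of the fitted curves. The sketch below rests on several assumptions: the three lines `p1`, `p2`, `p3` (in polyfit form) are hypothetical, the tolerance is named `tol` rather than eps, and the triangle center is taken as the mean of its three vertices, which coincides with the centroid of a triangle and avoids the polyshape call used in the listing.

```matlab
% Three hypothetical ranked curves approximated by straight lines
% y = slope*m + intercept (polyfit form), one per time point.
p1 = [-0.010 0.80];
p2 = [-0.020 1.25];
p3 = [-0.030 1.80];
C    = 0;        % constant subtracted from all three curves
step = 1e-3;     % increment of C per iteration
tol  = 1e-3;     % target accuracy (called eps in s_measure.m)
yc   = Inf;
iter = 0;

while abs(yc) > tol && iter < 1e5
    % Pairwise intersections; subtracting C does not move them along m.
    mx = [roots(p1 - p2), roots(p1 - p3), roots(p2 - p3)];
    my = [polyval(p1, mx(1)), polyval(p1, mx(2)), polyval(p2, mx(3))] - C;

    % Center of the triangle = mean of its vertices.
    mc = mean(mx);
    yc = mean(my);

    % Shift C in the direction that moves the center toward y = 0.
    if yc > tol
        C = C + step;
    elseif yc < -tol
        C = C - step;
    end
    iter = iter + 1;
end

fprintf('C = %.3f, center at (%.1f, %.4f)\n', C, mc, yc);
```

Because C changes by a fixed step, the number of iterations grows with the distance between the initial and final values of C; a starting value close to the expected one shortens the loop.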

  • 27.

    After the accuracy is reached in step 20, the final estimates of si are obtained.

Note: In our example of simulated sequences, these estimates are plotted against their actual values (Figure 3) for different numbers of runs used for averaging (100 and 1,000) and for different time points. The code for plotting is not shown; see the file s_measure.m.

Table 2. Example of locus allele frequencies calculated in step 13

Site 1 2 3 4 5 6 7 8 9 10 11 12 13
fi 0.912 0.827 0.963 0.936 0.950 0.964 0.943 0.927 0.908 0.868 0.920 0.816 0.935

Parameters: L = 100, N = 1000. Only the first 13 sites are shown. The example from Table 1 is used.

Figure 2. Ranked curve si(mi) for three time points

(A) Successful separation of curves, with a proper focal point, when far from a steady state.

(B) Failure of the method close to a steady state. Parameters: (A) s0 = 0.05, ac = -1, bc = 1, L = 100, N = 1000, tf = 60, f0 = 0.1, runs = 100, μL = 0.1, t1 = 20, t2 = 40, t3 = 60; (B) t1 = 500, t2 = 750, t3 = 1000, tf = 1000, and the others as in (A).

Figure 3. Estimated values of selection coefficient as a function of their actual values

(A) 100 runs.

(B) 1000 runs. Parameter values are as in Figure 2.

  • 28.

    Re-order the ranked sites back, from rank mi to site label i, and plot the relative values of the selection coefficient, β(t)si, against their actual positions in the genome, i (Figure 4).

Figure 4. Estimated relative values of selection coefficient in the genome

X-axis: Site number in genome. Y axis: selection coefficient. (A, B and C) correspond to the three sampling times and differ, mostly, in the scaling factor common for all sites. Parameter values are as in Figure 3.
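The re-ordering of step 28 amounts to undoing the sort permutation and pairing each estimate with its original site number. A minimal sketch, assuming hypothetical values for `order` (the site numbers stored in the fourth cell of BinData.mat) and for the final estimates `bsi`:

```matlab
order = [12 47 88 103 150];        % hypothetical original site numbers
bsi   = [0.4 -0.2 1.1 0.0 -0.7];   % hypothetical final estimates, in site order
[B, I] = sort(bsi, 'descend');     % ranked curve, as in steps 15-17

% Undo the ranking: entry I(m) of the restored vector receives B(m).
restored = zeros(size(B));
restored(I) = B;

% Pair each estimate with its position in the original alignment.
for i = 1:numel(order)
    fprintf('site %4d: beta*s = %+.2f\n', order(i), restored(i));
end
```

A call such as `stem(order, restored)` would then give a plot of the kind described for Figure 4.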

Expected outcomes

The desired outcome is the value of the selection coefficient of mutation at each heterozygous locus in the studied populations, expressed in relative units (Figure 4). To find the single scaling factor that sets the units, an additional experiment or additional data are required. Failure is indicated by the three curves in Figure 2 lacking clear separation with a single intersection point, as in Figure 2B.

Quantification and statistical analysis

To estimate the standard deviation of the estimates of si, use either:

  • random resampling of a half of the available sequences, or

  • bootstrapping, i.e., random resampling from the same dataset with replacement.

The first and the second method will yield the upper and the lower estimate, respectively, of the 95% statistical error.
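Both resampling schemes can be sketched as follows. This is an illustration under simplifying assumptions, not part of the published code: the binary matrix `bin` is hypothetical, the estimate is taken as −log fi without re-fitting the constant C for each resample, and the names `nboot`, `est_boot`, and `est_half` are illustrative. In a real analysis, the full procedure of s_measure.m would be repeated on each resampled dataset, and sites whose resampled frequency reaches 0 would need separate handling.

```matlab
% Hypothetical binary sample: 40 sequences x 3 sites (1 = non-consensus).
bin = [ones(40,1), [ones(30,1); zeros(10,1)], [ones(20,1); zeros(20,1)]];
nboot = 200;                          % number of resampling replicates
[nseq, nsite] = size(bin);
est_boot = zeros(nboot, nsite);
est_half = zeros(nboot, nsite);

for b = 1:nboot
    % Bootstrap: draw nseq sequences with replacement.
    pick = randi(nseq, nseq, 1);
    est_boot(b, :) = -log(mean(bin(pick, :), 1));

    % Half-sampling: draw half of the sequences without replacement.
    perm = randperm(nseq);
    half = perm(1:floor(nseq/2));
    est_half(b, :) = -log(mean(bin(half, :), 1));
end

sd_boot = std(est_boot, 0, 1);        % per-site standard deviation, bootstrap
sd_half = std(est_half, 0, 1);        % per-site standard deviation, half-sampling
disp([sd_boot; sd_half]);
```

The first site is monomorphic in this toy sample, so both standard deviations are zero there; the polymorphic sites show a nonzero spread.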

Limitations

For the method to work, the system has to satisfy several requirements, based on the assumptions of the model in (Barlukova and Rouzine, 2021), as follows:

The method applies to adapting populations only. The sign of adapting populations is the clear separation of the three curves with a single intersection point (Figure 2A). If a population is no longer adapting and is near a steady state, all curves collapse onto one regardless of the time spacing (Figure 2B).

Selection must be directional and constant (or, at least, change slowly on a time scale of the order of the inverse selection coefficient). This condition can be checked, again, by observing the clear separation and the single intersection of the curves (Figure 2A).

Multiple samples from almost-independent replicate populations, as seen in their phylogeny, must be available for averaging of allelic frequencies. Otherwise, the curves in Figure 2 will be ragged and stochastic-looking, and clear separation (Figure 2A) will not be obtained.

Epistasis is not included explicitly, because it is assumed to be incorporated in the renormalized values of selection coefficients. This is a good approximation on sufficiently short time scales. In the long term, genomes must be described as having many epistatic pairs, and the effective values of selection coefficients change slowly. The inference of epistasis is addressed elsewhere (Pedruzzi and Rouzine, 2021). The quality of this approximation can be assessed by increasing the time points together and observing a slow shift in the final estimates of selection coefficients (Figure 4).

The aim of the protocol is to measure the selection coefficient si defined for sites that evolve due to point mutations. Genomic regions with frequent insertions and deletions cannot be described by this parameter and, hence, cannot be treated by this method. Homozygous sites do not allow the measurement of the selection coefficient either.

Troubleshooting

Problem 1

The ranked-s curves after step 12 at different time points are not distinct enough, do not order as they are supposed to, and/or do not intersect well within a single small area as they do in Figure 2.

Potential solution

  • Try to increase the time window. One can choose any time points t1, t2, and t3, as long as they are sufficiently far apart. The criterion is that the resulting curves in Figure 2A are clearly separated and have a single intersection. The specific choice of time points affects only the common factor β multiplying the estimates of selection coefficient, but it does not affect their relative values. If the curves do not separate even for widely spaced time points, the system is in a steady state and the method is not applicable.

  • Try to eliminate the sites that create the bad behavior of the curves and work with the remaining sites.

  • If neither method works, the system may be in a steady state or under rapidly changing selection and is not amenable to this method.

Problem 2

The standard deviation of the final estimates is unacceptably large for some sites.

Potential solution

  • Filter noisy sites out by increasing the cutoff threshold for heterozygous sites, fcut, in file ACTGtranslate.m (step 1).

Problem 3

DNA sequences do not come from nearly-independent populations and form a monophyletic tree.

Potential solution

Seek additional data or change the organism of study.

Problem 4

Your MATLAB cannot open the .m files due to the file format.

Potential solution

Use the text versions of the files in the database ending with .txt. Change their names to the names of the corresponding .m files.

Problem 5

MATLAB returns syntax errors.

Potential solution

Get a newer version of MATLAB. The program has been tested in MATLAB 2017, 2020, and 2022.

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Igor M Rouzine (igor.rouzine@iephb.ru).

Materials availability

This study did not generate new unique reagents.

Acknowledgments

The study was carried out within the framework of the state assignment of the Federal Agency for Scientific Organizations (FASO Russia: topic no. AAAA-А18-118012290142-9).

Author contributions

I.V.L.: Software, numerical simulation, calculation, visualization, wrote manuscript. I.M.R.: Concept, administration, supervision, developed protocol, wrote manuscript.

Declaration of interests

The authors declare no competing interests.

Contributor Information

Igor V. Likhachev, Email: reirose2002@gmail.com.

Igor M. Rouzine, Email: igor.rouzine@iephb.ru.

Data and code availability

The computer code that carries out the protocol is available at:

Database: https://www.dropbox.com/sh/ptuspj468b88at8/AAD4EsEXghCU46zBC1Ai_-7Ka?dl=0.

The questions about the programs should be addressed to the technical contact, Igor V Likhachev, reirose2002@gmail.com.

References

  1. Barlukova A., Rouzine I.M. The evolutionary origin of the universal distribution of mutation fitness effect. PLoS Comput. Biol. 2021;17:e1008822. doi: 10.1371/journal.pcbi.1008822.
  2. Pedruzzi G., Rouzine I.M. An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza. PLoS Pathog. 2021;17:e1009669. doi: 10.1371/journal.ppat.1009669.



Articles from STAR Protocols are provided here courtesy of Elsevier
