Measurement of selection coefficients from genomic samples of adapting populations by computer modeling

Igor V Likhachev; Igor M Rouzine

2023 Mar 4;4(1):101821. doi: 10.1016/j.xpro.2022.101821

PMCID: PMC9999197 PMID: 36871222

Summary

Existing protocols for measuring the selection coefficients of loci neglect linkage effects between loci. This protocol is free from that limitation. It takes as input a set of DNA sequences sampled at three time points, removes conserved sites, and estimates selection coefficients. To test the accuracy, the user can generate mock data by computer simulation of evolution. The main limitation is the need for sequence samples isolated from 30–100 populations adapting in parallel.

For complete details on the use and execution of this protocol, please refer to Barlukova and Rouzine (2021).

Subject areas: Sequence analysis, Genetics, Genomics, Computer sciences, Evolutionary biology

Highlights

• Measures selection coefficients for adapting loci in relative units
• Inputs DNA sequences and excludes conserved sites and insertions/deletions
• Estimates selection coefficients from these data
• If desired, tests the accuracy of the protocol by using simulated data

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.

Before you begin

DNA, RNA, or protein sequences

1. Obtain a database of DNA sequences of a pathogen or organism at three time points, t₁, t₂, and t₃. If you have RNA or protein sequences, translate them using MEGA or any other standard software.
2. Align and trim the sequences using MEGA software.
3. Output the sequences to MEGA files corresponding to times t₁, t₂, and t₃.

Note: One can choose any three time points, as long as they are sufficiently far apart (see troubleshooting below).

Software

4. Install MATLAB™ version 2017 or later, or GNU Octave.
5. Install MEGA.
6. Download and install the present software (key resources table).

CRITICAL: Make sure that the genomic sequences are isolated from at least 30–100 independently evolving populations. Verify this with any phylogenetic method (for example, one implemented in MEGA software): the resulting tree must be polyphyletic.

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

Multiple DNA sequences of any organism from at least 30–100 independently adapting populations	Any public DNA database	N/A

Software and algorithms

MATLAB™ 2017 or later, or GNU Octave	MathWorks	N/A
MEGA software	MEGA	N/A
Software developed for the present work	https://www.dropbox.com/sh/ptuspj468b88at8/AAD4EsEXghCU46zBC1Ai_-7Ka?dl=0	ACTGtranslate.m BinData.mat s_measure.m main.m (optional)
Software for generating mock sequences by simulating evolution (optional)	https://github.com/irouzine/Strong-linkage-in-sex	recomb_2022.m

Step-by-step method details

Binarization of genomic data

Timing: $1.0 \times L \times N / 10^{7}$ s

Run our program ACTGtranslate.m (key resources table), which reads N sequences of length L from the MEGA output files for the time points t₁, t₂, and t₃ obtained in preliminary step 3. The program automatically carries out the following steps:

1. Initialize parameters.

%% parameters

% b1 - number of the first sequence to be read

% b2 - number of the last sequence to be read

% tr1 - deletion threshold

% fcut - monomorphic-site threshold

b1 = 1;

b2 = 3;

tr1 = 0.05;

fcut = 0.05;

am = (b2-b1+1); %amount of the sequences

gen = {}; %container for raw info

data = {}; %container for result info

2. Take data from .fas files.

Note: Steps 2–6 are performed 3 times, for each time period, where k=1:3 is the counter.

%take data from files

for k = 1:3

filepath=strcat('data\period',num2str(k));

oldpath=cd(filepath);

filelist{k}= dir('**/*.fas');

gen{k} = fastaread(filelist{k}.name,'BlockRead', [b1 b2],'IgnoreGaps',false);

3. Find the length of the sequence.

l = length(gen{k}(1).Sequence);

%l - length of the sequence
- a.
  Find the sites with deletions and insertions in the sequences (Figure 1) and memorize them.
  Note: The aim of the protocol is to measure the selection coefficient s_i defined for sites that evolve by point mutations. Genomic regions with frequent insertions and deletions cannot be described by this parameter and, hence, cannot be treated by this method.
  
  %data cleaning
  
  banlistSeq = [];
  
  banlistSit = [];
  
  erram = zeros(am,1);
  
  potential = [];
  
  for i0 = 1:l
  
    del = 0;
  
    sch = 1;
  
    for j0 = 1:am
  
    if ~ismember(j0, banlistSeq) % skip sequences that are already excluded
  
     seq = gen{k}(j0).Sequence;
  
     if seq(i0) =='-' || seq(i0) =='N'
  
      potential = [potential j0];
  
      del = del+1;
  
     end
  
    else
  
     sch=sch+1;
  
    end
  
  end
  
  if del ~=0
  
    if del/(am-sum(erram)) >=tr1
  
     banlistSit = [banlistSit i0];
  
     potential = [];
  
    else
  
     erram(potential,1) = erram(potential,1)+1;
  
     banlistSeq = [banlistSeq potential];
  
     banlistSeq = sort(banlistSeq);
  
     potential = [];
  
    end
  
  end
  
  end
- b.
  Delete all sequences with deletions. Mark the sites where some sequences have insertions with ‘E’.

if ~isempty(banlistSeq)

gen{k}(banlistSeq) = [];

end

if ~isempty(banlistSit)

for sch3 = 1:am

gen{k}(sch3).Sequence(banlistSit) = 'E';

end % for sch3

end % if ~isempty(banlistSit)

data{k} = zeros(am,l);
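The site-level part of this filter can also be written in vectorized form. The following is a minimal sketch, not the authors' code: it assumes that the aligned sequences have already been stacked into a character matrix S (a hypothetical variable, one row per sequence), and it only flags sites, leaving out the branch of the loop above that removes whole sequences.

```matlab
% Toy alignment: 4 sequences x 4 sites ('-' = gap, 'N' = unknown base).
S = ['AC-T'; 'ACGT'; 'A-GT'; 'ACNT'];
tr1 = 0.3;                           % deletion threshold (0.05 in step 1)

isGap    = (S == '-') | (S == 'N');  % logical mask of gaps and unknowns
gapFrac  = mean(isGap, 1);           % fraction of flagged sequences per site
banSites = find(gapFrac >= tr1);     % sites that would be marked 'E'

disp(gapFrac)
disp(banSites)
```

With the threshold of 0.05 used in step 1, site 2 of this toy alignment (one gap in four sequences) would be flagged as well.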

4. Find the common consensus nucleotide or amino acid sequence for all the time points.

%finding the common consensus

cons = '';

for i = 1:l

qA=0;

qC=0;

qT=0;

qG=0;

qE=0;

for j = 1:am

seq = gen{k}(j).Sequence;

switch seq(i)

case 'A'

qA = qA+1;

case 'C'

qC = qC+1;

case 'T'

qT = qT+1;

case 'G'

qG = qG+1;

case 'E'

qE = qE+1;

end % switch seq(i)

end % for j

[~,I] = max([qA qC qT qG qE]);

switch I

case 1

cons = [cons 'A'];

case 2

cons = [cons 'C'];

case 3

cons = [cons 'T'];

case 4

cons = [cons 'G'];

case 5

cons = [cons 'E'];

end % switch I

end % for i

5. For each time point and each genome, replace the consensus variant at each nucleotide or amino acid position (below termed “site”) by 0 and any other variant by 1. Digit 2 marks a site that will be excluded later.

%binarization

for i2 = 1:l

for j2 = 1:am

seq = gen{k}(j2).Sequence;

if seq(i2)~='E'

if seq(i2) == cons(i2)

data{k}(j2,i2)=0;

else

data{k}(j2,i2)=1;

end

else

data{k}(j2,i2)=2;

end % if seq(i2)~='E'

end % for j2

end % for i2

6. Find “legitimate” sites that do not have insertions or deletions and whose diversity is above the threshold fcut.

Note: Homozygous sites do not allow the measurement of selection coefficients. Weakly heterozygous sites do, but with a large statistical error. For example, for 100 sequences per geographic area, a threshold fcut of a few percent is recommended.

%finding the 'right' site numbers

rightsites{k}=[];

for msch = 1:l

if data{k}(1,msch)~=2

if ~(mean(data{k}(:,msch))>=1-fcut || mean(data{k}(:,msch))<=fcut)

rightsites{k} = [rightsites{k} msch];

end

end % if data{k}(1,msch)~=2

end % for msch

cd(oldpath);

end
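Steps 4–6 can be condensed into a few vectorized lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: S is a hypothetical character matrix of gap-free aligned sequences, and the consensus is taken with mode, which resolves ties by the smallest character code, whereas the loop above resolves them in the order A, C, T, G.

```matlab
% Toy alignment with no gaps: 5 sequences x 4 sites.
S = ['ACGT'; 'ACGA'; 'ACTA'; 'TCGA'; 'ACTA'];
fcut = 0.05;                                   % diversity threshold from step 1

cons = char(mode(double(S), 1));               % most frequent character per site
B = double(S ~= repmat(cons, size(S, 1), 1));  % 0 = consensus, 1 = any other variant
f = mean(B, 1);                                % frequency of digit 1 per site
rightsites = find(~(f >= 1 - fcut | f <= fcut));   % same condition as in step 6

disp(cons)
disp(rightsites)
```

Site 2 is monomorphic in this toy sample and is therefore dropped, while sites 1, 3, and 4 pass the filter.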

7. Find the intersection between the legitimate sites of the three time periods.

%finding the intersection between 'right' sites of the different time

%periods

C1=intersect(rightsites{1},rightsites{2});

C2=intersect(C1,rightsites{3});

8. Exclude illegitimate sites marked “2” and memorize the numbers of the legitimate sites.

%removing 'bad' sites and remembering the numbers of the 'right' ones

for kk=1:k

[~,l]=size(data{kk});

C3 = 1:l;

C3(C2) = [];

data{kk}(:,C3) = [];

end

data{k+1} = C2;
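As a small illustration of steps 7 and 8, the sketch below intersects three toy site lists and keeps only the shared columns. The site lists and the matrix X are made up for the example; indexing the kept columns directly gives the same result as deleting the complementary columns, as the listing above does.

```matlab
% Toy lists of legitimate sites for the three time periods.
rightsites = {[2 5 7 9], [2 3 7 9], [1 2 7 8 9]};
C2 = intersect(intersect(rightsites{1}, rightsites{2}), rightsites{3});

X = reshape(1:30, 3, 10);     % stand-in for one binary matrix (3 sequences x 10 sites)
Xkept = X(:, C2);             % keep only the sites shared by all periods

Xdel = X;                     % the deletion route used in step 8
bad = 1:size(X, 2);
bad(C2) = [];
Xdel(:, bad) = [];

disp(C2)
disp(isequal(Xkept, Xdel))
```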

9. Generate file BinData.mat containing the genomic binary sequences for the renumbered legitimate sites at three time points, and display the resulting binary matrices (see an example in Table 1).

%saving the data

save(strcat('BinData','.mat'),'data');

disp(data)

Figure 1. An example of MEGA11 output with an insertion and a deletion

Table 1. An example of a binary matrix of population DNA

	1	3	4	5	6	7	8	9	10	11	12	13
1	1	1	1	1	0	1	0	1	1	1	1	1
2	1	1	1	1	0	1	0	1	1	1	1	1
3	1	1	1	1	0	1	0	1	1	1	1	1
4	1	1	1	1	0	1	0	1	1	1	1	1
5	1	1	1	1	0	1	0	1	1	1	1	1
6	1	1	1	1	0	1	0	1	1	1	1	1
7	1	1	1	1	0	1	0	1	1	1	1	1
8	1	1	1	1	0	1	0	1	1	1	1	1
9	1	1	1	1	0	1	0	1	1	1	1	1
10	1	1	1	1	1	1	1	1	1	1	0	1
11	1	1	1	1	1	1	1	1	1	1	0	1
12	1	1	1	1	1	1	1	1	1	1	0	1

Parameters: L = 100, N = 1000. Only the first 13 sites and 12 individual sequences are shown.

Inference of selection coefficients

Timing: $8.0 \times L \times N / 10^{7}$ s

This major step infers the selection coefficients of polymorphic loci from a DNA sample. Below we use an example with N mock binary sequences of length L generated in a number of Monte Carlo simulation runs given by the parameter run, using recomb_2022.m (key resources table), for a fixed set of $s_{i}$. The same procedure can be run with N real-life sequences of length L obtained from several independent populations and combined in one matrix (Table 1).

Note: Monte Carlo simulation of sequences is required only if you want to test the accuracy of the protocol. It takes much more time than the main protocol: $30 \times L \times N \times \mathit{run} \times t_{f} / (6 \times 10^{8})$ min.

Run the MATLAB program s_measure.m (key resources table), which reads binary sequences from BinData.mat. The program carries out the following steps:

10. Input data from BinData.mat.

load('BinData.mat','data');

genome1 = data{1,1};

genome2 = data{1,2};

genome3 = data{1,3};

order = data{1,4};

11. Input parameter description:

tr1 – deletion threshold in ACTGtranslate.m.

fcut – monomorphic-site threshold in ACTGtranslate.m.

genome1 (2, 3) – binary matrix representing genomic samples at times 1, 2, or 3.

C – initial value of C in Equation 1.

appr – method of curve approximation:

‘poly’ – by basic polynomials.

‘spline’ – by cubic splines.

‘pchip’ – by Piecewise Cubic Hermite Interpolating Polynomial (PCHIP).

apprR – additional parameter for the ‘poly’ approximation: the degree of the polynomial.

tsec1, tsec2, tsec3 – times of the first, second, and third sequence samples.

r – recombination probability per genome.

s0 – the width of the uniform distribution of selection coefficients.

ac*s0, bc*s0 – borders of the uniform distribution of s.

M – crossover number.

L – number of loci.

N – population size.

tf – end time of evolution.

f0 – initial value of f.

run – number of Monte Carlo evolution runs, also used as the seed for the first run.

run2 – the seed for generating the random distribution of s.

mu – mutation probability per site.

eps – the accuracy of C.

step – step in C.

12. Run program s_measure.m with the loaded arguments. It initializes the variables.

function sdis=s_measure(genome1,genome2,genome3,order,C,appr,apprR,generate)

global tsec1 tsec2 tsec3 r s0 M L N tf f0 run run2 mu ac bc

scon = {};

if gen

[~,l] = size(genome1{1,1});

%l - quantity of the alleles

[R,~] = size(genome1);

%R - quantity of runs

f1 = zeros(R,l); % array for fi at 1 time (for R runs)

f2 = zeros(R,l); % array for fi at 2 time (for R runs)

f3 = zeros(R,l); % array for fi at 3 time (for R runs)

end

eps = 10^-3; % difference between zero and value of the y coordinate of the triangle center, when the loop stops

yccord = 1; % basic value of the y coordinate of the triangle center

step =10^-3; % value of the step

findc = true;

13. Calculate the frequency of digit “1” for each site $i$, denoted $f_{i}(t)$, at each time point $t = t_{1}, t_{2},$ and $t_{3}$ (Table 2).

% calculation of fi

if gen

for rs = 1:R

f1(rs,:)=mean(genome1{rs,1});

f2(rs,:)=mean(genome2{rs,1});

f3(rs,:)=mean(genome3{rs,1});

end

f1f = mean(f1);

f2f = mean(f2);

f3f = mean(f3);

else

f1f=mean(genome1);

f2f=mean(genome2);

f3f=mean(genome3);

end

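14. At each time point $t$, estimate the relative shifted value of the selection coefficient at site $i$, denoted as the product $\beta(t) s_{i}$, from Equation 1 (Barlukova and Rouzine, 2021):

$$\beta(t)\, s_{i} = -\log [f_{i}(t)] + C \qquad \text{(Equation 1)}$$

Note that in the code the constant is subtracted, so the variable C in s_measure.m corresponds to $-C$ in Equation 1.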

%calculation of si

bsi1 = -1*log(f1f)-C;

bsi2 = -1*log(f2f)-C;

bsi3 = -1*log(f3f)-C;
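As a worked illustration of this formula, the sketch below applies it to the first three allele frequencies listed in Table 2, with the offset set to zero. The choice C = 0 is only a starting value assumed for the example; the program adjusts C later.

```matlab
f = [0.912 0.827 0.963];   % first three frequencies from Table 2
C = 0;                     % assumed starting offset (subtracted, as in the code)
bs = -log(f) - C;          % relative shifted selection coefficients
disp(bs)
```

The site with the lowest frequency of digit 1 (site 2) receives the largest value, and the site with the highest frequency (site 3) the smallest.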

Note: The value of $C$ is found as follows. The initial value of $C$ is taken from the input arguments of the function. To determine the final value of $C$, the same program s_measure.m carries out the following steps:

15. Rank the genomic sites in descending order of the estimated values of $s_{i}$.
16. For each site $i$, calculate its new number $m_{i}$, where $i$ is the label of the site in the genome and $m_{i}$ is its number after ranking by $s_{i}$.

%sorting and calculating

[B1,I1] = sort(bsi1,'descend');

[B2,I2] = sort(bsi2,'descend');

[B3,I3] = sort(bsi3,'descend');
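The relation between the site label $i$ and its rank $m_{i}$ can be made explicit with the index vector returned by sort. The following sketch uses made-up values; the variable names m and bsiBack are introduced here for illustration and are not taken from s_measure.m.

```matlab
bsi = [0.10 0.50 0.30 0.20];     % toy values of beta*s for four sites
[B, I] = sort(bsi, 'descend');   % B: ranked values, I(m): label of the site with rank m

m = zeros(1, numel(bsi));
m(I) = 1:numel(bsi);             % m(i): rank of site i (inverse permutation of I)

bsiBack = B(m);                  % re-ordering the ranked values back to genomic order
disp(m)
```

The same inverse permutation can be used whenever ranked values have to be mapped back to genomic positions.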

17. Obtain a monotonic ranked curve $s_{i}(m_{i})$ for each time point (Figure 2A).

Note: The three curves form a small triangle. The curves will not separate fully if the system is in a steady state, where the method does not work (Figure 2B). The code for plotting is not shown here; see file s_measure.m.

18. Find the center of mass of this triangle, denoted $(m_{c}, s_{c})$.
19. Adjust $C$ to obtain $s_{i} = 0$ at that center and repeat the calculation with Equation 1 given above in step 14.
20. Repeat the loop described in steps 18–26 until $s_{i} = 0$ at the center of the triangle within accuracy eps.

while abs(yccord) > eps

%calculation of si

bsi1 = -1*log(f1f)-C;

bsi2 = -1*log(f2f)-C;

bsi3 = -1*log(f3f)-C;

%sorting and calculating

[B1,I1] = sort(bsi1,'descend');

[B2,I2] = sort(bsi2,'descend');

[B3,I3] = sort(bsi3,'descend');

xdots = {};

ydots = {};

xdotss = {};

ydotss = {};

21. Fit the three curves with one of four methods, three of which (a, b, c) are standard functions in MATLAB.
- a.
  Polynomial.
  switch appr
  
  case 'poly'
  
  %polynomial approximation
  
  p1 = polyfit([1:1:length(I1)],B1,apprR);
  
  p2 = polyfit([1:1:length(I2)],B2,apprR);
  
  p3 = polyfit([1:1:length(I3)],B3,apprR);
- b.
  Cubic splines.
  case 'spline'
  
  %spline approximation
  
  sp1 = spline([1:1:length(I1)],B1);
  
  sp2 = spline([1:1:length(I2)],B2);
  
  sp3 = spline([1:1:length(I3)],B3);
- c.
  Piecewise Cubic Hermite Interpolating Polynomial (PCHIP).
  case 'pchip'
  
  %pchip approximation
  
  pp1 = pchip([1:1:length(I1)],B1);
  
  pp2 = pchip([1:1:length(I2)],B2);
  
  pp3 = pchip([1:1:length(I3)],B3);
- d.
  Test case (only for plotting the curves, when the other algorithms do not work).

case "test"

findc=false;

yccord = 0;

C = 0;

end

22. Find the intersections of the three pairs of curves by the method chosen with the input parameter appr.
- a.
  Polynomial.
  inter1 = p1 - p2;
  
  inter2 = p1 - p3;
  
  inter3 = p2 - p3;
  
  xdots{1,1} = roots(inter1);
  
  xdots{1,2} = roots(inter2);
  
  xdots{1,3} = roots(inter3);
- b.
  Cubic spline.
  sinter1 = @(x) ppval(sp1,x)-ppval(sp2,x);
  
  sinter2 = @(x) ppval(sp1,x)-ppval(sp3,x);
  
  sinter3 = @(x) ppval(sp2,x)-ppval(sp3,x);
  
  xdots{1,1} = fzero(sinter1,mean([1:1:length(I1)]));
  
  xdots{1,2} = fzero(sinter2,mean([1:1:length(I2)]));
  
  xdots{1,3} = fzero(sinter3,mean([1:1:length(I3)]));
- c.
  PCHIP.

pinter1 = @(x) ppval(pp1,x)-ppval(pp2,x);

pinter2 = @(x) ppval(pp1,x)-ppval(pp3,x);

pinter3 = @(x) ppval(pp2,x)-ppval(pp3,x);

xdots{1,1} = fzero(pinter1,mean([1:1:length(I1)]));

xdots{1,2} = fzero(pinter2,mean([1:1:length(I2)]));

xdots{1,3} = fzero(pinter3,mean([1:1:length(I3)]));
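For the polynomial case, the intersection search can be tried on a toy pair of curves. This is a minimal sketch with made-up linear data; it relies on both fits having the same degree, so that the coefficient vectors can be subtracted directly, and it applies the same admissibility conditions (real root inside the ranked range) that the protocol checks later.

```matlab
x  = 1:5;                 % ranks
y1 = 3 - x;               % toy ranked curve at the first time point
y2 = 5 - 2*x;             % toy ranked curve at the second time point

p1 = polyfit(x, y1, 1);   % degree 1 plays the role of apprR
p2 = polyfit(x, y2, 1);

xs = roots(p1 - p2);      % abscissa where the two fitted curves cross
ys = polyval(p1, xs);     % ordinate of the crossing

ok = (imag(xs) == 0) & (xs >= 1) & (xs <= numel(x));   % keep real, in-range roots
disp([xs(ok) ys(ok)])
```

For higher polynomial degrees, roots returns several candidates, which is why the filtering step matters.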

23. Find the y-coordinates of the intersection points and memorize the y-coordinates of the first and last points of each approximation curve, using the method chosen with the input parameter appr.
- a.
  Polynomial.
  ydots{1,1} = polyval(p1,xdots{1,1});
  
  ydots{1,2} = polyval(p1,xdots{1,2});
  
  ydots{1,3} = polyval(p2,xdots{1,3});
  
  ybord{1,1} = polyval(p1,1);
  
  ybord{1,2} = polyval(p2,1);
  
  ybord{1,3} = polyval(p3,1);
  
  ybord{2,1} = polyval(p1,length(I1));
  
  ybord{2,2} = polyval(p2,length(I2));
  
  ybord{2,3} = polyval(p3,length(I3));
- b.
  Spline.
  ydots{1,1} = ppval(sp1,xdots{1,1});
  
  ydots{1,2} = ppval(sp1,xdots{1,2});
  
  ydots{1,3} = ppval(sp2,xdots{1,3});
  
  ybord{1,1} = ppval(sp1,1);
  
  ybord{1,2} = ppval(sp2,1);
  
  ybord{1,3} = ppval(sp3,1);
  
  ybord{2,1} = ppval(sp1,length(I1));
  
  ybord{2,2} = ppval(sp2,length(I2));
  
  ybord{2,3} = ppval(sp3,length(I3));
- c.
  PCHIP.

ydots{1,1} = ppval(pp1,xdots{1,1});

ydots{1,2} = ppval(pp1,xdots{1,2});

ydots{1,3} = ppval(pp2,xdots{1,3});

ybord{1,1} = ppval(pp1,1);

ybord{1,2} = ppval(pp2,1);

ybord{1,3} = ppval(pp3,1);

ybord{2,1} = ppval(pp1,length(I1));

ybord{2,2} = ppval(pp2,length(I2));

ybord{2,3} = ppval(pp3,length(I3));

24. Check the accuracy of the intersection points by verifying that they have real values and lie on the curves.

if findc

%coordinates check

Iall={I1;I2;I3};

for times = 1: length(xdots)

ncount1 = 0;

for param1= 1: length(xdots{1,times})

IallLen(times) = length(Iall{times,1});

if xdots{1,times}(param1)<= IallLen(times) && xdots{1,times}(param1)>=1 && imag(xdots{1,times}(param1))==0 && ydots{1,times}(param1)<=ybord{1,times} && ydots{1,times}(param1)>=ybord{2,times}

ncount1 = ncount1 +1;

xdotss{times,ncount1} = xdots{1,times}(param1);

ydotss{times,ncount1} = ydots{1,times}(param1);

end

end % for param1

end % for times

25. Form the triangle from the intersection points and find its center.

%coordinates of triangle tops

xdcord = [];

ydcord = [];

[shr,dl] = size(xdotss);

for h =1:shr

for c = 1:dl

xdcord = [xdcord xdotss{h,c}];

ydcord = [ydcord ydotss{h,c}];

end % for c

end % for h

%center finding

polyin = polyshape({xdcord},{ydcord});

[xccord,yccord] = centroid(polyin);
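When exactly three intersection points are found, the centroid of the triangle coincides with the arithmetic mean of its vertices. The sketch below uses this fact as a possible fallback for environments in which polyshape is not available; it is an alternative suggested here, not part of s_measure.m, and it does not apply when more than three points pass the check in step 24.

```matlab
xv = [0 3 0];     % toy x-coordinates of the three intersection points
yv = [0 0 3];     % toy y-coordinates of the three intersection points

xc = mean(xv);    % centroid of a triangle = mean of its vertices
yc = mean(yv);
disp([xc yc])
```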

26. Modify C and repeat the cycle (step 18) until $s_{i} = 0$ at the center within accuracy eps.

if yccord>0

C = C+ step;

end

if yccord<0

C = C - step;

end
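The fixed-step search for C can be followed on a toy model. The sketch below assumes that the ordinate of the triangle center shifts one-for-one with C, which is what one expects when the same constant is subtracted from all three curves; the starting ordinate y0 is invented, and tol is used in place of eps to avoid shadowing the built-in constant.

```matlab
step = 1e-3;          % step in C, as in step 12
tol  = 1e-3;          % required accuracy (called eps in s_measure.m)
y0   = 0.0375;        % toy ordinate of the triangle center at C = 0

C = 0;
y = y0 - C;           % assumed response of the center to C
nIter = 0;
while abs(y) > tol
    if y > 0
        C = C + step; % center above zero: increase C
    else
        C = C - step; % center below zero: decrease C
    end
    y = y0 - C;
    nIter = nIter + 1;
end
disp([C y nIter])
```

Because the step equals the tolerance, the loop cannot jump over the accepted interval; the number of iterations grows linearly with the initial distance from zero.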

27. After the accuracy is reached in step 20, the final estimates of $s_{i}$ are obtained.

Note: In our example of simulated sequences, these estimates are plotted against their actual values (Figure 3) for different numbers of runs used for averaging (100 and 1,000) and for different time points. The code for plotting is not shown; see the file s_measure.m.

Table 2. Example of locus allele frequencies calculated in step 13

1	2	3	4	5	6	7	8	9	10	11	12	13
0.912	0.827	0.963	0.936	0.950	0.964	0.943	0.927	0.908	0.868	0.920	0.816	0.935

Parameters: L = 100, N = 1000. Only the first 13 sites are shown. The example from Table 1 is used.

Figure 2. Ranked curve $s_{i}(m_{i})$ for three time points

(A) Successful separation of curves, with a proper focal point, when far from a steady state.

(B) Failure of the method close to a steady state. Parameters: (A) s₀ = 0.05, ac = -1, bc = 1, L = 100, N = 1000, t_f = 60, f₀ = 0.1, runs = 100, μL = 0.1, t₁ = 20, t₂ = 40, t₃ = 60; (B) t₁ = 500, t₂ = 750, t₃ = 1000, t_f = 1000, and the others as in (A).

Figure 3. Estimated values of selection coefficient as a function of their actual values

(A) 100 runs.

(B) 1000 runs. Parameter values are as in Figure 2.

28. Re-order the ranked sites back from their rank $m_{i}$ to their genomic label $i$, and plot the relative values of the selection coefficient, $\beta(t) s_{i}$, against their actual amino acid positions, $i$ (Figure 4).

Figure 4. Estimated relative values of selection coefficient in the genome

X-axis: site number in the genome. Y-axis: selection coefficient. Panels (A), (B), and (C) correspond to the three sampling times and differ mostly in the scaling factor common to all sites. Parameter values are as in Figure 3.

Expected outcomes

The desired outcome is the value of the selection coefficient of mutation at each heterozygous locus in the studied populations, expressed in relative units (Figure 4). To find the single scaling factor that sets the units, an additional experiment or additional data are required. Failure is indicated by the three curves in Figure 2 lacking clear separation with a single intersection point, as in Figure 2B.

Quantification and statistical analysis

To estimate the standard deviation of the estimates of $s_{i}$, use either of the following:

• random resampling of one half of the available sequences;
• bootstrapping, i.e., random resampling from the same dataset with replacement.

The first and the second method yield the upper and the lower estimate of the 95% statistical error, respectively.
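Both resampling schemes can be sketched as follows. This is an illustration on random toy data, not the authors' procedure: the matrix G, the number of resamples nBoot, and the floor 1/(2N) that guards against taking the logarithm of zero are all assumptions made for the example, and the offset C is omitted because a constant does not affect the standard deviation.

```matlab
N = 200; L = 5;
G = double(rand(N, L) < 0.3);     % toy binary sample: N sequences x L sites
nBoot = 500;                      % number of resamples (assumed)

estBoot = zeros(nBoot, L);
estHalf = zeros(nBoot, L);
for b = 1:nBoot
    idx = randi(N, N, 1);                        % bootstrap: with replacement
    f = max(mean(G(idx, :), 1), 1/(2*N));        % floor avoids log(0) (assumed guard)
    estBoot(b, :) = -log(f);

    idx = randperm(N, N/2);                      % half-sample: without replacement
    f = max(mean(G(idx, :), 1), 1/(2*N));
    estHalf(b, :) = -log(f);
end

sdBoot = std(estBoot, 0, 1);      % per-site standard deviation, bootstrap
sdHalf = std(estHalf, 0, 1);      % per-site standard deviation, half-sampling
disp([sdBoot; sdHalf])
```

With real data, the rows of G would be the binarized sequences from BinData.mat for one time point.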

Limitations

For the method to work, the system has to satisfy several requirements, based on the assumptions of the model in (Barlukova and Rouzine, 2021), as follows:

The method applies to adapting populations only. The sign of adapting populations is the clear separation of the three curves with a single intersection point (Figure 2A). If a population is no longer adapting and is near a steady state, all curves collapse onto one regardless of the time spacing (Figure 2B).

Selection is directional and constant (or, at least, changes slowly on a time scale of the order of the inverse selection coefficient). This condition can be checked, again, by observing clear separation and a single intersection of the curves (Figure 2A).

Multiple samples from almost-independent replicate populations, as seen in their phylogeny, must be available for averaging of allelic frequencies. Otherwise, the curves in Figure 2 will be ragged and stochastic-looking, and clear separation (Figure 2A) will not be obtained.

Epistasis is not included explicitly, because it is assumed to be incorporated in the renormalized values of selection coefficients. This is a good approximation on sufficiently short time scales. In the long term, genomes must be described as having many epistatic pairs, and the effective values of selection coefficients change slowly. The inference of epistasis is addressed elsewhere (Pedruzzi and Rouzine, 2021). The quality of this approximation can be assessed by increasing all time points together and observing a slow shift in the final estimates of selection coefficients (Figure 4).

The aim of the protocol is to measure the selection coefficient s_i defined for sites that evolve by point mutations. Genomic regions with frequent insertions and deletions cannot be described by this parameter and, hence, cannot be treated by this method. Homozygous sites do not allow the measurement of the selection coefficient either.

Troubleshooting

Problem 1

The ranked-s curves after step 12 at different time points are not distinct enough, are not ordered as they are supposed to be, and/or do not intersect well within a single small area as they do in Figure 2.

Potential solution

• Try to increase the time window. One can choose any time points t₁, t₂, and t₃, as long as they are sufficiently far apart. The criterion is that the resulting curves in Figure 2A are clearly separated and have a single intersection. The specific choice of time points affects only the common factor β multiplying the estimates of selection coefficient, but it does not affect their relative values. If the curves do not separate even for widely spaced time points, the system is in a steady state and the method is not applicable.
• Try to eliminate the sites that create the bad behavior of the curves and work with the remaining sites.
• If neither method works, the system may be in a steady state or under rapidly changing selection and is not amenable to this method.

Problem 2

The standard deviation of the final estimates is unacceptably large for some sites.

Potential solution

• Filter noisy sites out by increasing the cutoff threshold for heterozygous sites, fcut, in file ACTGtranslate.m (step 1).

Problem 3

DNA sequences do not come from nearly-independent populations and form a monophyletic tree.

Potential solution

Seek additional data or change the organism of study.

Problem 4

Your MATLAB cannot open the .m files because of the file format.

Potential solution

Use the text versions of the files in the database ending with .txt. Change their names to the names of the corresponding .m files.

Problem 5

MATLAB returns syntax errors.

Potential solution

Get a newer version of MATLAB. The program has been tested in MATLAB 2017, 2020, and 2022.

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Igor M Rouzine (igor.rouzine@iephb.ru).

Materials availability

This study did not generate new unique reagents.

Acknowledgments

The study was carried out within the framework of the state assignment of the Federal Agency for Scientific Organizations (FASO Russia: topic no. AAAA-А18-118012290142-9).

Author contributions

I.V.L.: Software, numerical simulation, calculation, visualization, wrote manuscript. I.M.R.: Concept, administration, supervision, developed protocol, wrote manuscript.

Declaration of interests

The authors declare no competing interests.

Contributor Information

Igor V. Likhachev, Email: igor.rouzine@iephb.ru.

Igor M. Rouzine, Email: reirose2002@gmail.com.

Data and code availability

The computer code that carries out the protocol is available at:

Database: https://www.dropbox.com/sh/ptuspj468b88at8/AAD4EsEXghCU46zBC1Ai_-7Ka?dl=0.

The questions about the programs should be addressed to the technical contact, Igor V Likhachev, reirose2002@gmail.com.

References

Barlukova A., Rouzine I.M. The evolutionary origin of the universal distribution of mutation fitness effect. PLoS Comput. Biol. 2021;17:e1008822. doi: 10.1371/journal.pcbi.1008822.
Pedruzzi G., Rouzine I.M. An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza. PLoS Pathog. 2021;17:e1009669. doi: 10.1371/journal.ppat.1009669.
