| Step | Description | Stage |
|---|---|---|
| 1 | Check if the input file is a zip or a gzip archive. If so, extract the raw file | Formatting |
| 2 | Pick and reorder the columns to the internal format, based on the corresponding *.json config file | Formatting |
| 3 | Cut the ‘chr’ prefix in the chromosome column, if present | Formatting |
| 4 | Calculate the weighted average of the EAF columns into a single EAF column, if multiple were specified | Formatting |
| 5 | Check if a chromosome entry is an integer from 1 to 23, or X, Y, M. If an entry is not, mark it as invalid | Validation |
| 6 | Check if a base pair position entry is a non-negative integer. If an entry is not, mark it as invalid | Validation |
| 7 | Check if an rsID entry is a non-negative integer with the ‘rs’ prefix. If an entry is not, mark it as invalid | Validation |
| 8 | Check if an effect allele or other allele entry is either a dash or composed of the letters A, T, C, G. If an entry is not, mark it as invalid | Validation |
| 9 | Check if a p-value or an EAF entry is a real value between 0 and 1 inclusive. If an entry is not, mark it as invalid | Validation |
| 10 | Check if a standard error or a beta entry is a real value. If an entry is not, mark it as invalid | Validation |
| 11 | Calculate statistics and save the report about the correctness of the data | Validation |
| 12 | Analyze the report. If no resolvable issues were found, finish the execution | Analysis and preparation |
| 13 | If resolvable issues were found, prepare the restoration algorithm depending on the issues | Analysis and preparation |
| 14 | Perform the liftover to build 38 if needed | Analysis and preparation |
| 15 | Sort the sumstats file either by ChrBP or by rsID, depending on the restoration algorithm | Analysis and preparation |
| 16 | If a chromosome or a BP entry was marked as invalid, and the sumstats file is sorted by rsID, restore both entries by a lookup in dbSNP for the matching rsID | Restoration |
| 17 | If an rsID entry was marked as invalid, and the sumstats file is sorted by ChrBP, restore the rsID entry by a lookup in dbSNP for the matching Chr and BP | Restoration |
| 18 | If only one of the EA and OA entries is invalid, restore the invalid allele as the most likely allele among the known ones by a lookup in dbSNP, matching either on rsID and the valid allele (if the sumstats file is sorted by rsID) or on Chr, BP, and the valid allele (if it is sorted by ChrBP) | Restoration |
| 19 | If an EAF entry was marked as invalid, restore the EAF by a lookup in dbSNP, matching either on rsID and effect allele (if the sumstats file is sorted by rsID) or on Chr, BP, and effect allele (if it is sorted by ChrBP) | Restoration |
| 20 | If a standard error, beta, or p-value entry was marked as invalid, and the other two entries are valid, restore the invalid entry using the formula s = β/z, where s is the standard error, β is beta, and z is the z-score that corresponds to the p-value in the two-tailed test [10] | Restoration |
| 21 | Go back to step #5 (Validation stage) | Restoration |
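The two per-entry formatting operations (steps 3 and 4) could be sketched roughly as follows. The function names are hypothetical, and the table does not state which weights are used for the EAF average, so the weights here (for example, per-cohort sample sizes) are purely an assumption.

```python
from typing import Sequence


def strip_chr_prefix(chromosome: str) -> str:
    """Step 3: drop a leading 'chr' from a chromosome entry, if present."""
    if chromosome.startswith("chr"):
        return chromosome[len("chr"):]
    return chromosome


def weighted_eaf(eafs: Sequence[float], weights: Sequence[float]) -> float:
    """Step 4: collapse several EAF values into one weighted average.

    The choice of weights is an assumption of this sketch, not
    something the table specifies.
    """
    if not eafs or len(eafs) != len(weights):
        raise ValueError("eafs and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(f * w for f, w in zip(eafs, weights)) / total_weight
```

An entry that has no ‘chr’ prefix passes through `strip_chr_prefix` unchanged, so the function can be applied to the whole column without a prior check.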
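The per-entry checks of the Validation stage (steps 5 to 10) might look like the minimal sketch below. All names are hypothetical. It also assumes that entries arrive as strings, that alleles are upper-case, that the rsID prefix is the usual dbSNP `rs`, and that "real value" means a finite number accepted by Python's `float`; none of these details is fixed by the table.

```python
import math
import re
from typing import Optional

# Step 5: integers 1..23 plus X, Y, M.
VALID_CHROMOSOMES = {str(n) for n in range(1, 24)} | {"X", "Y", "M"}
# Step 7: 'rs' followed by digits (prefix assumed to be the dbSNP 'rs').
RSID_PATTERN = re.compile(r"rs\d+")
# Step 8: a single dash, or a non-empty run of A, T, C, G.
ALLELE_PATTERN = re.compile(r"-|[ATCG]+")


def is_valid_chromosome(entry: str) -> bool:
    return entry in VALID_CHROMOSOMES


def is_valid_position(entry: str) -> bool:
    # Step 6: non-negative integer, ASCII digits only.
    return entry.isascii() and entry.isdigit()


def is_valid_rsid(entry: str) -> bool:
    return RSID_PATTERN.fullmatch(entry) is not None


def is_valid_allele(entry: str) -> bool:
    return ALLELE_PATTERN.fullmatch(entry) is not None


def parse_real(entry: str) -> Optional[float]:
    """Return the entry as a finite float, or None if it is not one."""
    try:
        value = float(entry)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def is_valid_probability(entry: str) -> bool:
    # Step 9: p-value or EAF, a real value in [0, 1].
    value = parse_real(entry)
    return value is not None and 0.0 <= value <= 1.0


def is_valid_real(entry: str) -> bool:
    # Step 10: standard error or beta, any real value.
    return parse_real(entry) is not None
```

Keeping each check as a separate predicate makes it straightforward to count invalid entries per column when the report of step 11 is assembled.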
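The relation s = β/z used in step 20 can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the tool's implementation: the function names are hypothetical, the p-value is assumed to lie strictly between 0 and 1, and z is taken as the non-negative magnitude implied by a two-tailed p-value, so the absolute value of β is used to keep the standard error positive. Because a two-tailed p-value carries no sign, only the magnitude of β could be recovered from s and p, and that case is left out here.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_from_p(p: float) -> float:
    """Magnitude of the z-score for a two-tailed p-value, 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # The lower-tail quantile of p/2 is negative; negate it to get |z|.
    return -_STD_NORMAL.inv_cdf(p / 2.0)


def restore_standard_error(beta: float, p: float) -> float:
    """s = |beta| / z, with z derived from the two-tailed p-value."""
    return abs(beta) / z_from_p(p)


def restore_p_value(beta: float, se: float) -> float:
    """Two-tailed p-value for z = beta / s, i.e. 2 * (1 - Phi(|z|))."""
    z = abs(beta / se)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) and stays accurate in the tail.
    return math.erfc(z / math.sqrt(2.0))
```

Computing the quantile from p/2 rather than from 1 − p/2 avoids the rounding that would otherwise turn very small p-values, which are common in GWAS, into an argument of exactly 1.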