Table 2 | Key parameters for CoGAPS/PyCoGAPS and guidance on setting their values
| Parameter | Description | Guide to Setting |
|---|---|---|
| path | Path to data | Make sure data is log-normalized if providing a path rather than a data object |
| result_file | Name of result .h5ad file to output | Give this a descriptive name based on your data and run, such as PDACresult_50kiterations.h5ad |
| Standard parameters | | |
| nPatterns | Number of patterns CoGAPS will learn | The optimal number of patterns varies with the data and may require several runs at different values to compare the learned features. We recommend starting with a value that reflects the number of experimental conditions, cell types and/or biological processes expected in your data, plus any technical batches present |
| nIterations | Number of iterations of each phase of the algorithm | A higher number of iterations (e.g., 50,000) is recommended because it leads to better convergence. However, more iterations greatly increase runtime, so we encourage users to try several values, observe the tradeoff and choose an appropriate one |
| useSparseOptimization | Speeds up performance with sparse data | Set to true if using sparse data, i.e., if roughly >80% of data is zero |
| Run parameters | | |
| nThreads | Maximum number of threads to run on. Allows the underlying algorithm to run on multiple threads and has no effect on the mathematics of the algorithm | The best number of threads depends on many factors, such as hardware and data size. The most reliable approach is to try different values and compare the estimated run time. This setting is separate from the distributed CoGAPS parallelization mechanism, which sets up parallel computation in a different way |
| transposeData | Whether to transpose the data matrix before running CoGAPS | Set to true if the data are stored in samples × genes format (CoGAPS defaults to genes × samples format) |
| Distributed parameters | | |
| distributed | Whether to run distributed CoGAPS, and in which mode | Recommended in most cases for single-cell analysis. Set to ‘genome-wide’ for parallelization across genes, or ‘single-cell’ for parallelization across cells |
| nSets | Number of sets to break data into | For distributed with ‘genome-wide’, choose a value that leaves at least 2,000 genes per set. For distributed with ‘single-cell’, choose a value that leaves each set with sufficient representation of all cell types in the data |
| minNS | Minimum number of individual set contributions a cluster must contain | Be cautious about setting this value too high: the added robustness may come at the cost of missing rare phenomena or rare cells |
| maxNS | Maximum number of individual set contributions a cluster can contain | Modifying this parameter is only important for highly correlated processes |
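
Two of the rules of thumb in the table are simple arithmetic and can be checked before a run. The sketch below is a minimal illustration, assuming a dense genes × samples NumPy array. The helper names (`zero_fraction`, `suggest_sparse_optimization`, `max_genome_wide_sets`) are hypothetical and are not part of the CoGAPS or PyCoGAPS API. Only the thresholds (roughly >80% zeros, at least 2,000 genes per set) come from the table.

```python
import numpy as np


def zero_fraction(matrix):
    """Return the fraction of entries in a dense array that are exactly zero."""
    matrix = np.asarray(matrix)
    return float(np.count_nonzero(matrix == 0)) / matrix.size


def suggest_sparse_optimization(matrix, threshold=0.8):
    """Hypothetical helper: True if the zero fraction exceeds the ~80% guideline."""
    return zero_fraction(matrix) > threshold


def max_genome_wide_sets(n_genes, min_genes_per_set=2000):
    """Hypothetical helper: largest nSets that keeps >= min_genes_per_set genes per set.

    A result of 1 means the data should not be split across genes.
    """
    return max(1, n_genes // min_genes_per_set)


# Toy genes x samples matrix: 10 of the 12 entries are zero.
counts = np.array([
    [0, 0, 0, 5],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])

print(suggest_sparse_optimization(counts))  # True (10/12 is about 0.83, above 0.8)
print(max_genome_wide_sets(20000))          # 10 (20,000 genes / 2,000 per set)
```

These values are only starting points. The resulting booleans and counts would still need to be passed to `useSparseOptimization` and `nSets` through whichever interface the run uses, and the ‘single-cell’ guidance on cell-type representation cannot be reduced to a single formula like this.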