1. Problem understanding |
1. Assessment of the characteristics of secondary input data |
|
• Content: what the data represents in the real world, source of data, the context in which it was collected |
|
• Estimated volume: number of records and size of expected files |
|
• Expected data file format |
|
2. Assessment of the characteristics of the research |
|
• Population and period under study |
|
• Inclusion and exclusion criteria for selection |
|
• Study design and analysis unit |
|
• Variables involved in the main research questions, objectives, and hypotheses |
|
3. Assessment of the characteristics of the output data |
|
• Estimated output data volume: number of records or file size |
|
• The desired format for delivery of output data |
|
4. Checking the availability of input data and variable dictionaries |
|
5. Evaluation of the ethical aspects and of the technical feasibility of adapting the data for the research
2. Resource planning |
6. Sizing of human resources |
|
7. Sizing of computational resources (hardware and software platform) |
|
• Volume and format of input data |
|
• Support for the operations required to adjust the input data |
|
• Estimated volume and format of output data |
|
• Performance and data volume limits for eligible computing resources |
3. Data understanding |
8. Obtaining secondary data files and variable dictionaries |
|
9. Understanding the variable dictionaries related to the input data and creating the research variables dictionary for each type of file |
|
10. Inventory of data files: name and extension, size in bytes, and number of records |
|
11. Assessment of the existence of a unique record identifier (primary key) in each data file |
|
12. Inventory of the variables contained in the data files: name, type, and size |
|
13. Exploratory data analysis for completeness |
|
14. Elaboration of the data extraction plan for the research |
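
Steps 10 to 13 of the data understanding phase can be sketched in a few lines of code. The example below is a minimal illustration, not a prescribed implementation: it assumes the secondary data arrive as UTF-8 delimited text files with a header row, and the function name `inventory_file` and its `key_field` parameter are hypothetical. It reports the file name, size in bytes, number of records, the variables found, the longest observed value per variable, whether the candidate key is unique, and the share of non-empty values per variable.

```python
import csv
import os


def inventory_file(path, key_field):
    """Summarise one delimited data file (assumed: UTF-8, header row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fields = reader.fieldnames or []
        filled = {name: 0 for name in fields}   # non-empty values per variable
        max_len = {name: 0 for name in fields}  # longest observed value per variable
        seen_keys = set()
        duplicate_keys = 0
        records = 0
        for row in reader:
            records += 1
            key = row.get(key_field)
            if key in seen_keys:
                duplicate_keys += 1
            seen_keys.add(key)
            for name in fields:
                value = (row.get(name) or "").strip()
                if value:
                    filled[name] += 1
                    max_len[name] = max(max_len[name], len(value))
    # Completeness = proportion of records with a non-empty value.
    completeness = {
        name: (filled[name] / records if records else 0.0) for name in fields
    }
    return {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "records": records,
        "variables": fields,
        "max_length": max_len,
        "key_is_unique": duplicate_keys == 0,
        "completeness": completeness,
    }
```

Running the function once per file and collecting the results gives the file and variable inventories in one pass; variable types would still need to be taken from the variable dictionaries, since a delimited text file stores every value as a string.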
4. Data preparation |
15. Execution of the data extraction plan |
|
16. Exploratory data analysis to detect invalid content and assess the homogeneity of data filling |
|
17. Data cleaning and transformation to generate research variables |
|
18. Updating the research variables dictionary
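
The detection of invalid content in step 16 can be illustrated as a check of each value against the domain given in the variable dictionary. The sketch below is only an assumption about how such a check might look: the `RULES` mapping, the variable names `sex` and `age`, and their valid ranges are invented for illustration and would be replaced by the rules from the actual dictionaries.

```python
from collections import Counter

# Hypothetical rules: variable name -> function returning True for a valid value.
RULES = {
    "sex": lambda v: v in {"F", "M"},
    "age": lambda v: v.isdecimal() and 0 <= int(v) <= 120,
}


def profile_invalid(rows, rules):
    """Count missing and invalid values per variable for an iterable of dict records."""
    missing, invalid = Counter(), Counter()
    for row in rows:
        for name, is_valid in rules.items():
            value = (row.get(name) or "").strip()
            if not value:
                missing[name] += 1      # empty or absent value
            elif not is_valid(value):
                invalid[name] += 1      # present, but outside the declared domain
    return {"missing": dict(missing), "invalid": dict(invalid)}
```

Applying the same function separately to each period or data source and comparing the counts is one simple way to look at how homogeneously the variables were filled in.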
5. Data validation |
19. Exploratory analysis of the transformed data for comparison with the original data |
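
One possible form of the comparison in step 19 is to check that the frequency of each category after transformation matches what the original codes imply. The sketch below assumes a simple recoding described by a `mapping` from original codes to research categories; the function name `validate_recode` and the example codes are hypothetical.

```python
from collections import Counter


def validate_recode(original_values, transformed_values, mapping):
    """Return categories whose transformed frequency differs from the expected one.

    The result maps each discrepant category to (expected, observed) counts;
    an empty dict means the two distributions agree.
    """
    # Original codes absent from the mapping are expected to become None.
    expected = Counter(mapping.get(value) for value in original_values)
    observed = Counter(transformed_values)
    categories = set(expected) | set(observed)
    return {
        category: (expected[category], observed[category])
        for category in categories
        if expected[category] != observed[category]
    }
```

For example, with original codes `["1", "2", "2", "9"]` and the mapping `{"1": "M", "2": "F"}`, a transformed column of `["M", "F", "F", None]` yields an empty result, while any lost or miscoded record shows up as a pair of differing counts.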
6. Data distribution |
20. Exporting the database to the specified format(s) |
|
21. Reduction of the database to contain only the research variables |
|
22. Distribution of the database and dictionary of research variables |
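
The reduction and export steps of the distribution phase can be sketched as follows. This is a minimal example that assumes CSV is the requested delivery format and that records are held as dictionaries; the function name `export_research_variables` and its parameters are hypothetical.

```python
import csv


def export_research_variables(rows, research_variables, out_path):
    """Write only the research variables to a CSV file; return the record count."""
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops any variable not listed in research_variables.
        writer = csv.DictWriter(
            handle, fieldnames=research_variables, extrasaction="ignore"
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            written += 1
    return written
```

Comparing the returned record count with the count obtained after data validation gives a quick check that no records were lost during export.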