Table 2. Operational challenges in each step of the data preparation process and the corresponding coping actions.
Step | Operational challenges | Coping actions |
---|---|---|
1. Problem understanding | 1. Unavailability of the complete set of data files for immediate access. 2. No defined procedure for accessing the variable dictionary. 3. Interdisciplinary communication within the team. 4. Establishing consensus on decisions and definitions. | 1. Holding meetings with the institution providing the data. 2. Recording decisions and definitions. 3. Obtaining a data sample to assess the technical feasibility of adapting the data to the specific use. |
2. Resource planning | 5. Need to optimize cost and preparation time for a large data volume (~10 GB) in more than one format. | 4. Prioritizing available human and computational resources, and planning the acquisition of complementary computational resources to minimize staff training time. |
3. Data understanding | 6. Need to improve understanding of the variables. 7. Variables whose format changed over time. 8. Multiple files with different structures. 9. No unique record identifiers. 10. File structure differing from the data dictionary description. 11. Variables filled with codes from other information systems. | 5. Consulting other information sources and exchanging information in periodic meetings. 6. Storing the data in database tables, using text fields. 7. Inserting unique record identifiers to make records logically accessible (see the first sketch after the table). 8. Logging and reusing SQL commands (queries) when possible. 9. Building variable dictionaries with standardized names. 10. Elaborating a data extraction and combination plan: reducing the number of tables; standardizing data structures; adding the source to the primary key of the tables; and identifying the variables used to filter the records of interest. |
4. Data preparation | 12. Multiple tables. 13. Multiple values denoting null content. 14. Different fill formats in date variables. 15. Invalid values. 16. Different fill formats in numeric variables. 17. Variables with mixed content. 18. Variables with multiple contents. 19. Duplicates in variables with multiple contents. 20. No rules for the cross-consistency of related variables. 21. Variables filled with codes dependent on an external database. 22. No direct reference to the external databases used. 23. No single variable for data file integration. | 11. Reducing the original data (multiple tables) to two tables (table union). 12. Extracting the records of interest after combining the data. 13. Standardizing null content and recounting the nulls (see the second sketch after the table). 14. Elaborating and executing cross-consistency rules for the variables. 15. Standardizing variable formats. 16. Searching official databases to decode variables dependent on external codes. 17. Incorporating the description of the external codes into the research database. 18. Separating variables with multiple contents into new single-content variables. 19. Integrating the data using a set of variables common to the tables. |
5. Data validation | 24. No single report with the same scope in the original data source for comparison with the prepared data. | 20. Validating the transformed data against the expected data volume and the frequency distribution of each variable along a time dimension (see the third sketch after the table). |
6. Data distribution | 25. Large data volume (approximately 10 GB). 26. Need to deliver the output data in more than one format. 27. Need for storage and backup of working files. | 21. Using a statistical package to incorporate the variable dictionary into the data. 22. Using converter software to export the data and variable dictionary. 23. Creating a private cloud for data distribution, with users at different access levels. |
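
Coping actions 7, 10, and 11 revolve around making heterogeneous raw files addressable within a single relational structure: a surrogate record identifier, a source tag carried in the primary key, and a table union. The SQL sketch below illustrates one way this could look; the table and column names (`staging_records`, `raw_file_a`, `raw_file_b`, `col_a`, `col_b`) are illustrative assumptions, not names from the original study.

```sql
-- Minimal sketch: give every record a unique identifier and keep its source
-- in the primary key (coping actions 7 and 10), then union the raw files
-- into one staging table (coping action 11). All variables are kept as text,
-- matching coping action 6.
CREATE TABLE staging_records (
    record_id   BIGINT GENERATED ALWAYS AS IDENTITY,  -- unique record identifier
    source_file TEXT NOT NULL,                        -- data source, part of the key
    col_a       TEXT,                                 -- hypothetical variable
    col_b       TEXT,                                 -- hypothetical variable
    PRIMARY KEY (record_id, source_file)
);

INSERT INTO staging_records (source_file, col_a, col_b)
SELECT 'raw_file_a', col_a, col_b FROM raw_file_a
UNION ALL
SELECT 'raw_file_b', col_a, col_b FROM raw_file_b;
```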
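Coping actions 13 and 14 can be expressed as short, reusable queries, in line with coping action 8 (logging and reusing SQL commands). The sketch below runs against the hypothetical `staging_records` table above; the null sentinel codes and the date-ordering consistency rule are illustrative assumptions.

```sql
-- Minimal sketch: map the several values used to denote "no content" to a
-- single NULL representation, then recount the nulls (coping action 13).
UPDATE staging_records
SET col_a = NULL
WHERE col_a IN ('', '-', 'NA', '99');  -- assumed sentinel codes for empty content

SELECT COUNT(*) - COUNT(col_a) AS null_count
FROM staging_records;

-- One example cross-consistency rule (coping action 14): flag records whose
-- related variables disagree, e.g. an end date earlier than a start date.
SELECT record_id, source_file
FROM staging_records
WHERE CAST(col_a AS DATE) > CAST(col_b AS DATE);
```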
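Coping action 20 validates the prepared data by comparing its volume and per-variable frequency distributions along a time dimension. A minimal SQL sketch of such a check follows, assuming a hypothetical `event_date` variable stored as text in `staging_records`.

```sql
-- Minimal sketch: record counts per year and per category of one variable,
-- to be compared against the expected data volume (coping action 20).
SELECT EXTRACT(YEAR FROM CAST(event_date AS DATE)) AS event_year,
       col_a,
       COUNT(*) AS n_records
FROM staging_records
GROUP BY 1, 2
ORDER BY 1, 2;
```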