Front Public Health. 2021 Jun 15;9:642163. doi: 10.3389/fpubh.2021.642163

Table 2. Operational challenges identified in the steps for ensuring the adequacy of structured secondary health data for specific use in quantitative research.

Step 1. Problem understanding

Operational challenges:
1. Unavailability of the complete set of data files for immediate access
2. Lack of a defined way to access the variable dictionary
3. Interdisciplinary communication in the team
4. Establishment of consensus in decisions and definitions

Coping actions:
1. Meetings with the institution providing the data
2. Recording of decisions and definitions
3. Obtaining a sample of the data to assess the technical feasibility of making the data adequate for the specific use

Step 2. Resource planning

Operational challenges:
5. Need to optimize the cost and preparation time for a large volume of data (~10 GB) in more than one format

Coping actions:
4. Prioritizing the use of available human and computational resources, and planning the acquisition of complementary computational resources to minimize the training time for human resources

Step 3. Data understanding

Operational challenges:
6. Need to improve the understanding of the variables
7. Variables that changed format over time
8. Multiple files with different structures
9. No unique record identifiers
10. File structure differing from the data dictionary description
11. Variables filled with codes from other information systems

Coping actions:
5. Consultation of other sources of information and exchange of information in periodic meetings
6. Storage of the data in database tables, using text fields
7. Insertion of a unique identifier to make records logically accessible
8. Logging and reuse of SQL commands (queries) when possible
9. Building variable dictionaries with standardized names
10. Elaboration of a data extraction and combination plan: reduction of the number of tables; standardization of data structures; addition of the source to the primary key of the tables; and identification of the variables used to filter the records of interest (actions 6, 7, and 10 are sketched in SQL after this step)
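
Coping actions 6, 7, and 10 translate directly into data-definition commands. A minimal sketch, assuming a PostgreSQL database; the table, column, and file names are hypothetical stand-ins for the original health data files:

```sql
-- Hypothetical staging table (PostgreSQL dialect). Every raw field is stored
-- as text so that malformed dates and numbers load without error (action 6);
-- a surrogate key makes each record logically accessible (action 7); and the
-- source file is carried into the primary key (action 10).
CREATE TABLE staging_admissions_2019 (
    record_id      BIGSERIAL,       -- inserted unique record identifier
    source_file    TEXT NOT NULL,   -- record origin, part of the primary key
    admission_date TEXT,            -- kept as text; parsed only in preparation
    icd_code       TEXT,            -- code from an external classification
    municipality   TEXT,
    PRIMARY KEY (source_file, record_id)
);

-- Records are loaded with their source tagged, e.g.:
INSERT INTO staging_admissions_2019
    (source_file, admission_date, icd_code, municipality)
VALUES ('admissions_2019.csv', '05/03/2019', 'J18.9', '355030');
```

Keeping every field as text at this stage defers parsing to the preparation step, so no record is rejected at load time.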

Step 4. Data preparation

Operational challenges:
12. Multiple tables
13. Multiple values used to denote null content
14. Different filling formats in date variables
15. Invalid values
16. Different filling formats in numeric variables
17. Variables with mixed content
18. Variables with multiple contents
19. Duplicates in variables with multiple contents
20. No rules for the cross-consistency of related variables
21. Variables filled with codes that depend on an external database
22. No direct reference to the external databases used
23. No single variable for data file integration

Coping actions:
11. Reduction of the original data (multiple tables) to two tables (table union); actions 11-13 and 15 are sketched in SQL after this step
12. Extraction of the records of interest after combining the data
13. Standardization of null content and recount of nulls
14. Elaboration and execution of cross-consistency rules for the variables
15. Standardization of variable formats
16. Search for official databases to decode variables that depend on external codes
17. Incorporation of the descriptions of the external codes into the research database
18. Separation of variables with multiple contents into new, single-content variables
19. Data integration using a set of variables common to the tables
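
A minimal sketch of actions 11-13 and 15 under the same hypothetical schema, assuming a second staging table (staging_admissions_2020) loaded the same way: the staged tables are combined by union, the records of interest are extracted, the markers used for missing data are mapped to real NULLs, and two assumed date fillings are standardized. The null markers, date formats, and filter value are illustrative, not the study's actual rules.

```sql
CREATE TABLE prepared_admissions AS
SELECT
    source_file,
    record_id,
    -- 13: map the markers used for missing data ('', '-', 'NA') to real NULLs
    NULLIF(NULLIF(NULLIF(TRIM(icd_code), ''), '-'), 'NA') AS icd_code,
    -- 15: standardize two date fillings into a single date type
    CASE
        WHEN admission_date ~ '^\d{2}/\d{2}/\d{4}$'
            THEN to_date(admission_date, 'DD/MM/YYYY')
        WHEN admission_date ~ '^\d{8}$'
            THEN to_date(admission_date, 'YYYYMMDD')
        ELSE NULL          -- invalid values become NULL and are recounted
    END AS admission_date
FROM (
    SELECT * FROM staging_admissions_2019
    UNION ALL              -- 11: reduce multiple tables by table union
    SELECT * FROM staging_admissions_2020
) AS unified
WHERE municipality = '355030';   -- 12: extract the records of interest
```

Variables with multiple contents (action 18) can be decomposed in the same dialect with regexp_split_to_table before the final integration.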

Step 5. Data validation

Operational challenges:
24. No single report with the same scope in the original data source for comparison with the prepared data

Coping actions:
20. Validation of the transformed data based on the expected data volume and the frequency distribution of each variable according to a time dimension (sketched in SQL below)
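
With no report of the same scope available for comparison, action 20 reduces to a repeatable query over the prepared data. A sketch with the hypothetical names used above:

```sql
-- Monthly volume and fill rate of one variable in the prepared data
-- (PostgreSQL; names are hypothetical). Comparing this distribution with the
-- volumes expected by the data provider flags losses or distortions
-- introduced during preparation.
SELECT
    date_trunc('month', admission_date) AS month,
    COUNT(*)                            AS n_records,
    COUNT(icd_code)                     AS n_icd_filled  -- NULLs not counted
FROM prepared_admissions
GROUP BY 1
ORDER BY 1;
```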

Step 6. Data distribution

Operational challenges:
25. Large data volume (approximately 10 GB)
26. Need to deliver the output data in more than one format
27. Need for storage and backup of the work files

Coping actions:
21. Use of a statistical package to incorporate the variable dictionary into the data
22. Use of converter software to export the data and the variable dictionary (a brief export sketch follows)
23. Creation of a private cloud for data distribution, with users at different access levels
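
For delivery in more than one format, a plain export from the database is one possible starting point, which converter software or a statistical package can then enrich with the variable dictionary. A PostgreSQL sketch with a placeholder path:

```sql
-- Export the prepared table to CSV (PostgreSQL COPY; the server-side path is
-- a placeholder). psql's \copy is the client-side equivalent.
COPY prepared_admissions TO '/exports/prepared_admissions.csv'
    WITH (FORMAT csv, HEADER true);
```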