Data in Brief. 2026 Feb 3;65:112484. doi: 10.1016/j.dib.2026.112484

A comprehensive dataset of customer behavior in Latin American Fintech: 12-month transactional and demographic data for churn analysis

Luis Eduardo Muñoz-Guerrero, Yony Fernando Ceballos, Luis David Trejos-Rojas
PMCID: PMC12950484  PMID: 41778223

Abstract

This article introduces COFINFAD (Colombian Fintech Financial Analytics Dataset), a single-company, 12-month dataset containing comprehensive behavioral information on 48,723 customers (the complete active customer base) of a Colombian fintech company. The dataset spans January 4 to December 29, 2023, and captures 3,159,157 transactions alongside demographic profiles, product usage patterns, customer satisfaction metrics, and digital app engagement behaviors. Data collection involved API integration with the company’s CRM system and transaction databases, together with customer surveys (6,965 respondents, 14.3% response rate). Customer privacy was protected through multi-layered anonymization (pseudonymization, generalization, k-anonymity verification). The dataset comprises 57 variables, including transaction frequency and value, product portfolio composition (savings, credit cards, loans, investments, insurance), customer satisfaction scores, Net Promoter Score (NPS), app login frequency, feature usage patterns, and support interaction metrics; monetary values are expressed in Colombian Pesos (COP). COFINFAD is distinguished by its integrated multi-dimensional design, which combines transactional, demographic, satisfaction, and digital engagement data, a combination that, to our knowledge, is unavailable in existing public fintech datasets. The dataset is openly accessible via Mendeley Data (DOI: 10.17632/mhb4zn3258.1) under a CC BY 4.0 license and adheres to the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.

Keywords: Financial inclusion, Case study data, FAIR data principles, Fintech, Churn prediction


Specifications Table

Subject: Computer Sciences
Specific subject area: Customer behavior analysis and churn prediction in fintech; digital financial inclusion in emerging markets
Type of data: Tables, CSV files
Data collection: API integration with CRM system and transaction databases (Python 3.9, pandas 1.3.5), customer surveys (Qualtrics), mobile analytics SDK (Firebase Analytics 9.1)
Data source location:
  Country: Colombia
  City/State: Medellín (Antioquia), Bogotá (Cundinamarca), Cali (Valle del Cauca), Barranquilla (Atlántico)
Data accessibility:
  Repository name: COFINFAD: Colombian Fintech Financial Analytics
  Data identification number: 10.17632/mhb4zn3258.1
  Direct URL to data: https://data.mendeley.com/datasets/mhb4zn3258/1
  Instructions for accessing these data: The data are accessible directly via the repository URL provided above, without any special access requirements
Related research article: None

1. Value of the Data

  • Multi-dimensional behavioral and attitudinal integration: COFINFAD combines granular transactional data (about 3.16M transactions), comprehensive demographics, customer satisfaction metrics (satisfaction_score, NPS, feedback_sentiment), and digital engagement behaviors (app_logins_frequency, feature_usage_diversity).

  • Unique Granularity: To our knowledge, no other publicly available dataset, including LendingClub and the Multimodal Banking Dataset (MBD), integrates observed behaviors with self-reported satisfaction at this granularity.

  • Research Potential: Researchers can develop churn prediction models incorporating attitudinal variables, analyze satisfaction-behavior relationships, study multi-product adoption dynamics, and test customer engagement hypotheses in digital banking.

  • Emerging market financial services representation: This dataset addresses critical data scarcity for Latin American financial inclusion research. Colombia represents the third-largest fintech ecosystem in a region experiencing 112% growth (2018-2021). The data provide granular income-level information and digital engagement behaviors from a developing-economy context that is underrepresented in existing financial datasets.

  • Complete population study: The 48,723 customers represent the complete active customer base for 2023, which makes this a population study and removes sampling bias at the customer level. The 57 variables mainly capture observed customer actions rather than self-reported intentions; the survey-based satisfaction measures are the exception.

  • Single-company methodological coherence: The focused single-company design provides uniform policies, UI/UX design, customer communication protocols, and regulatory environment across all observations, eliminating confounding variables from heterogeneous fintech environments.

2. Background

This dataset was compiled to provide empirical data for researchers analyzing fintech customer behavior in emerging markets, addressing the scarcity of real-world behavioral data from Latin America. Through collaboration with a Colombian fintech company, we curated and de-identified this comprehensive dataset for public access.

Colombia represents the third-largest fintech ecosystem in Latin America, with 394 companies operating as of Q1 2024. The single-company design provides a “revelatory case” with rare access to granular behavioral data typically inaccessible to researchers, and a “longitudinal case” enabling 12-month temporal analysis. This focused approach offers three benefits: contextual coherence, which removes confounding variables from heterogeneous fintech environments; behavioral granularity, capturing 57 observed features of actual customer actions; and longitudinal depth, which tracks behavioral evolution over time.

The dataset follows the precedent of focused single-organization datasets providing unique depth, such as MIMIC-III healthcare data [1-4]. Related datasets exist for banking and financial behavior, but they typically lack one or more of the core dimensions needed for churn research in fintech: (i) true transaction sequences over time, (ii) multi-product portfolio indicators, (iii) behavioral app analytics, and (iv) attitudinal measures such as satisfaction and NPS [5-10]. For example, LendingClub-style public loan datasets are primarily focused on credit risk and applications rather than customer retention dynamics, while synthetic transaction datasets such as PaySim reproduce statistical properties but do not capture real customer-product relationships [11-17]. Recent banking datasets based on event sequences provide valuable multimodal signals, yet they are often limited in geographic scope and may not include satisfaction or support interactions. These limitations motivate COFINFAD’s integrated design for a Latin American fintech setting (Figs. 1-4) [18, 19].

Fig. 1. Churn Rate by Product Usage. The figure displays the relationship between the number of active products held by a customer and the observed churn rate.

Fig. 2. Geographical Distribution of Customers. The figure displays customer distribution across Colombia, showing concentrations in Bogotá (CO-CUN), Medellín (CO-ANT), and Cali (CO-VAC).

Fig. 3. Distribution of Customer Tenure. The figure shows the distribution of customer tenure, which ranges from 4.7 to 12.0 months and is concentrated toward the longer tenures.

Fig. 4. Correlation Matrix of Key Variables. The figure displays correlations between key variables, including customer lifetime value, transaction frequency, app login frequency, and satisfaction scores.

3. Data Description

The dataset contains information on 48,723 customers of a Colombian fintech company, collected over a 12-month period from January 4, 2023, to December 29, 2023. These customers represent the complete active customer base for calendar year 2023, making this a population study rather than a sample.

The dataset comprises 57 variables categorized into demographics, product usage, transaction records, customer feedback, digital engagement, and derived metrics.

Data are organized in two CSV files with a relational structure:

  1. customer_data.csv: Contains customer-level information including demographics, product portfolio, satisfaction metrics, and aggregated behavioral features (one row per customer, 48,723 rows).

  2. transactions_data.csv: Contains transaction-level data for all customers over the 12-month period (one row per transaction, 3,159,157 rows).

The two files are linked via the ‘customer_id’ foreign key with 1:N cardinality. The primary key in ‘customer_data.csv’ is ‘customer_id’. ‘transactions_data.csv’ does not include an explicit transaction identifier; each row corresponds to one transaction event. Table 1 summarizes selected key variables, and Table A1 provides the complete dictionary of the dataset variables.
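The relational structure described above can be checked before any analysis. The following is a minimal sketch using only the Python standard library; the function name `check_link`, the toy rows, and the transaction type labels are illustrative assumptions and are not taken from the released files.

```python
from collections import Counter


def check_link(customers, transactions):
    """Check the 1:N link between customer rows and transaction rows."""
    ids = [row["customer_id"] for row in customers]
    # Primary-key check: customer_id should be unique in the customer table.
    duplicate_ids = sorted(cid for cid, n in Counter(ids).items() if n > 1)
    known = set(ids)
    # Foreign-key check: every transaction should point at a known customer.
    orphan_rows = sum(1 for t in transactions if t["customer_id"] not in known)
    # 1:N cardinality: count the transaction rows attached to each customer.
    per_customer = Counter(
        t["customer_id"] for t in transactions if t["customer_id"] in known
    )
    return {
        "duplicate_ids": duplicate_ids,
        "orphan_rows": orphan_rows,
        "tx_per_customer": dict(per_customer),
    }


# Toy rows standing in for the two CSV files (all values are made up).
customers = [
    {"customer_id": "1", "age": "34"},
    {"customer_id": "2", "age": "51"},
]
transactions = [
    {"customer_id": "1", "date": "2023-01-04", "amount": "120000.0", "type": "transfer"},
    {"customer_id": "1", "date": "2023-01-09", "amount": "45000.0", "type": "payment"},
    {"customer_id": "2", "date": "2023-02-11", "amount": "980000.0", "type": "transfer"},
]

report = check_link(customers, transactions)
print(report)
```

With the real files, the two row lists could be produced with `csv.DictReader`, which yields one dictionary per CSV row keyed by the header names.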

Table 1. Dataset variables with valid ranges, units, and examples.

| No. | Variable Name | Definition | Data Type | Valid Range/Values |
|---|---|---|---|---|
| 1 | customer_id | Unique identifier | Integer | 1-48723 |
| 2 | age | Customer’s age | Integer | 18-65 |
| 5 | income_bracket | Income category | String | Low, Medium, High, Very High |
| 17 | active_products | Number of active products | Integer | 0-5 |
| 18 | app_logins_frequency | App logins per month | Integer | 0-100 |
| 25 | tx_count | Total transactions | Integer | 10-77465 |
| 33 | satisfaction_score | Overall satisfaction | Integer | 1-6 |
| 34 | nps_score | Net Promoter Score | Integer | 0-10 |
| 53 | churn_probability | Predicted churn probability | Float | 0.0-1.0 |
| 56 | amount | Transaction amount | Float | >0 (COP) |

Note: Selected key variables are shown for brevity; the full dictionary is available in the repository.

Missing Data: In ‘customer_data.csv’ (N=48,723), only three variables contain missing values: ‘complaint_topics’ (49.97%), ‘credit_utilization_ratio’ (37.48%), and ‘feature_requests’ (33.07%). These values are missing where the corresponding product or interaction is not applicable. All transaction-level variables are complete.
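The missing-value shares reported above can be recomputed from the customer file. The sketch below assumes that missing cells appear as empty strings and that Boolean columns are stored as the text `True`/`False`; both are assumptions about the CSV encoding. Pairing ‘credit_utilization_ratio’ with ‘credit_card’ to test whether missingness is structural is likewise an assumption, and the helper names and toy rows are hypothetical.

```python
def missing_share(rows, columns):
    """Return the percentage of empty cells per column, rounded to 2 decimals."""
    total = len(rows)
    shares = {}
    for col in columns:
        empty = sum(1 for row in rows if (row.get(col) or "").strip() == "")
        shares[col] = round(100.0 * empty / total, 2) if total else 0.0
    return shares


def unexplained_missing(rows, value_col, flag_col):
    """Rows where value_col is empty although flag_col says the product is held."""
    return [
        row
        for row in rows
        if (row.get(value_col) or "").strip() == "" and row.get(flag_col) == "True"
    ]


# Toy customer rows (made-up values; empty string = missing, by assumption).
rows = [
    {"customer_id": "1", "credit_card": "False", "credit_utilization_ratio": ""},
    {"customer_id": "2", "credit_card": "True", "credit_utilization_ratio": "0.42"},
    {"customer_id": "3", "credit_card": "True", "credit_utilization_ratio": "0.10"},
    {"customer_id": "4", "credit_card": "False", "credit_utilization_ratio": ""},
]

shares = missing_share(rows, ["credit_card", "credit_utilization_ratio"])
suspicious = unexplained_missing(rows, "credit_utilization_ratio", "credit_card")
print(shares, len(suspicious))
```

An empty `suspicious` list would be consistent with the statement that values are missing only when the product is not applicable; a non-empty list would call for a closer look at those rows.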

| Variable | Mean | Median | Std Dev | Min | Max |
|---|---|---|---|---|---|
| Age (years) | 44.5 | 45.0 | 12.3 | 18.0 | 65.0 |
| Monthly Transactions | 64.8 | 18.0 | 611.9 | 1.0 | 77,465.0 |
| Avg Transaction Value (COP) | 3,564,851 | 1,763,107 | 3,959,442 | 318,215 | 51,670,302 |
| Customer Tenure (months) | 11.4 | 11.6 | 0.7 | 4.7 | 12.0 |
| App Login Freq (per month) | 22.4 | 22.0 | 11.8 | 0.0 | 100.0 |
| Satisfaction Score (1-6) | 4.2 | 4.0 | 0.6 | 1.0 | 6.0 |
| NPS Score (0-10) | 7.8 | 8.0 | 1.9 | 0.0 | 10.0 |
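Summary statistics of this kind can be reproduced for any numeric column. The sketch below is one possible approach; it uses the sample standard deviation (`statistics.stdev`), which is an assumption because the article does not state which estimator was used, and the input values are made up. It also illustrates why a heavily right-skewed variable, such as the transaction counts above, has a mean far above its median.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return mean, median, sample standard deviation, min, and max."""
    return {
        "mean": round(mean(values), 1),
        "median": round(median(values), 1),
        "std": round(stdev(values), 1),
        "min": min(values),
        "max": max(values),
    }


# Made-up monthly transaction counts with one very active customer.
monthly_tx = [12, 15, 18, 18, 20, 25, 900]
summary = describe(monthly_tx)
print(summary)
```

A single extreme value pulls the mean and standard deviation up while leaving the median unchanged, so medians or log-transformed values may be more informative for such columns.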

4. Experimental Design, Materials and Methods

Data were gathered from three integrated sources:

  1. API Integration: Python scripts (Python 3.9) using the requests library performed daily automated API requests to the fintech company’s CRM system (PostgreSQL 13.2) and transaction databases (MongoDB 5.0). These requests retrieved customer profiles, transaction records, and product usage metrics. The ETL (Extract, Transform, Load) pipeline used pandas 1.3.5 for data transformation and validation.

  2. Customer Surveys: Quarterly surveys were administered through the company’s mobile application (Qualtrics platform). Customers received in-app prompts to complete 5-question surveys that evaluated satisfaction levels (1-6 scale) and likelihood of recommendation (NPS score, 0-10 scale) and collected qualitative feedback. Survey distribution occurred in February, May, August, and November 2023.

  3. App Usage Tracking: Real-time monitoring of app usage was enabled through integration of the Firebase Analytics SDK (version 9.1) into the mobile application. The SDK tracked feature utilization, login frequency, session duration, and in-app navigation patterns.
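The surveys record each respondent's likelihood of recommendation on a 0-10 scale. Under the conventional Net Promoter Score definition, ratings of 9-10 count as promoters and ratings of 0-6 as detractors, and the aggregate score is the percentage of promoters minus the percentage of detractors. The sketch below applies that conventional definition; the article does not state whether the authors computed an aggregate score, and the ratings are made up.

```python
def net_promoter_score(ratings):
    """Aggregate 0-10 recommendation ratings into an NPS between -100 and 100."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 to 6
    return 100.0 * (promoters - detractors) / len(ratings)


# Made-up ratings: 4 promoters, 3 passives, 3 detractors.
ratings = [10, 9, 8, 7, 6, 3, 9, 10, 8, 5]
nps = net_promoter_score(ratings)
print(nps)
```

Customers without a survey response should be excluded before aggregation rather than treated as a rating of zero.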

Comprehensive validation checks ensured data quality and integrity. Completeness validation verified that all mandatory fields were populated; range validation checked numeric variables against expected ranges (e.g., age 18-65); and consistency validation checked cross-field logic. Overall, 99.7% of records passed all validation checks, and records failing validation (n=1,461) were excluded. Outlier detection flagged transaction amounts more than 3 standard deviations from the mean; erroneous entries were removed (n=347). Exact duplicate records were also removed.
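The three kinds of checks described above (range validation, 3-standard-deviation outlier flagging, and exact-duplicate removal) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it uses the population standard deviation over all amounts, whereas the article does not say which estimator or grouping was used, and the function names and data are hypothetical.

```python
from statistics import mean, pstdev


def out_of_range(values, low, high):
    """Return the values that fall outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def flag_outliers(amounts, k=3.0):
    """Return indices of amounts more than k standard deviations from the mean."""
    mu = mean(amounts)
    sigma = pstdev(amounts)  # population standard deviation (an assumption)
    if sigma == 0:
        return []
    return [i for i, a in enumerate(amounts) if abs(a - mu) > k * sigma]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up inputs for each check.
bad_ages = out_of_range([18, 42, 65, 17, 70], 18, 65)

amounts = [1_500_000.0 + 10_000.0 * i for i in range(20)] + [60_000_000.0]
outlier_idx = flag_outliers(amounts)

rows = [
    {"customer_id": "1", "date": "2023-03-01", "amount": "50000.0"},
    {"customer_id": "1", "date": "2023-03-01", "amount": "50000.0"},
    {"customer_id": "2", "date": "2023-03-02", "amount": "75000.0"},
]
unique_rows = drop_exact_duplicates(rows)
print(bad_ages, outlier_idx, len(unique_rows))
```

Flagging is kept separate from removal here so that flagged amounts can be reviewed before any record is dropped.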

To ensure customer privacy while preserving analytical utility, a multi-layered anonymization approach was implemented following GDPR principles and Colombian data protection regulations (Law 1581 of 2012):

  1. Direct Identifier Removal and Pseudonymization: All direct identifiers (names, IDs, emails) were removed. Customer records were assigned cryptographic pseudonyms using SHA-256 hashing.

  2. Quasi-Identifier Generalization: Age was binned into 5-year groups; locations aggregated to city/department level; income categorized into brackets; timestamps converted to date-only. Rare attribute values (<1%) were suppressed.

  3. K-Anonymity Verification: Analysis using ARX Data Anonymization Tool 3.9.0 confirmed that each combination of quasi-identifiers appears in at least 10 records (k=10).
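The three anonymization layers listed above can be illustrated with a short standard-library sketch. It is not the authors' implementation and does not use ARX. The salt, the band boundaries (multiples of five), and the choice of quasi-identifier columns are assumptions made for the example; the article states only that SHA-256 hashing, 5-year age groups, and a k=10 check were used. A secret salt is added here because unsalted hashes of short identifiers can be reversed by enumerating candidate inputs.

```python
import hashlib
from collections import Counter


def pseudonymize(identifier, salt):
    """Salted SHA-256 pseudonym (hex digest) for a direct identifier."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()


def age_band(age, width=5):
    """Generalize an exact age into a fixed-width band such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_class(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())


# Made-up records: (age, location, income bracket).
raw = [
    (31, "Medellín", "Medium"),
    (33, "Medellín", "Medium"),
    (34, "Medellín", "Medium"),
    (47, "Bogotá", "High"),
    (45, "Bogotá", "High"),
]
rows = [
    {"age_band": age_band(age), "location": loc, "income_bracket": inc}
    for age, loc, inc in raw
]

token = pseudonymize("customer-0001", salt="example-secret-salt")
k = smallest_class(rows, ["age_band", "location", "income_bracket"])
print(token[:12], k)
```

A table satisfies k-anonymity for a given k when `smallest_class` returns at least k; the toy table above would therefore fail a k=10 requirement and would need further generalization or suppression.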

The dataset adheres to FAIR principles. It is Findable via a persistent DOI (10.17632/mhb4zn3258.1) and rich metadata. It is Accessible via open HTTPS protocol on Mendeley Data. It is Interoperable using standard CSV format (RFC 4180) and ISO standards for currency and dates. It is Reusable under the CC BY 4.0 license with complete provenance documentation.

We verified the released CSV artifacts against the data dictionary. ‘customer_data.csv’ contains 54 customer-level columns and ‘transactions_data.csv’ contains 4 transaction-level columns (the ‘customer_id’ join key plus ‘date’, ‘amount’, and ‘type’), which gives 57 distinct variables because ‘customer_id’ appears in both files. All variable names match the CSV headers exactly.
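A header check of this kind is easy to repeat after downloading the files. The sketch below compares the first row of a CSV with an expected column list; the expected names come from the data dictionary (variables 55-57 plus the join key), while their order, the helper name `check_header`, and the sample row are assumptions.

```python
import csv
import io


def check_header(csv_text, expected_columns):
    """Compare the first CSV row with the expected column names."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "missing": [c for c in expected_columns if c not in header],
        "unexpected": [c for c in header if c not in expected_columns],
        "n_columns": len(header),
    }


# Expected transaction-level columns (order is an assumption).
TRANSACTION_COLUMNS = ["customer_id", "date", "amount", "type"]

# Made-up sample standing in for the first lines of transactions_data.csv.
sample = "customer_id,date,amount,type\n1,2023-01-04,120000.0,transfer\n"
result = check_header(sample, TRANSACTION_COLUMNS)
print(result)
```

For a file on disk, the same check can be run on the text of its first line, so the full 3,159,157-row file does not need to be loaded.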

Limitations

The dataset represents a single Colombian fintech company observed over 12 months (2023 only), which limits temporal scope and generalizability. The 48,723 customers represent the complete active customer base for 2023. Urban areas are over-represented (75% from Bogotá, Medellín, Cali), so rural regions may be under-covered. Income data are self-reported categorical brackets with a 12% non-response rate. The dataset inherently excludes individuals without smartphone access. Satisfaction-related variables are subject to non-response bias (14.3% survey response rate). The single-year timeframe may not capture long-term trends or multi-year economic cycles. All monetary values are in Colombian Pesos; exchange rate fluctuations may affect international comparisons.

Ethics Statement

This research was conducted in full compliance with Colombian data protection laws (Law 1581 of 2012) and international ethical guidelines. The fintech company provided formal authorization to use anonymized customer data for research purposes. The dataset was rigorously anonymized using a multi-layered approach: (a) pseudonymization via SHA-256 hashing, (b) quasi-identifier generalization, and (c) k-anonymity verification (k≥10) validated with ARX Data Anonymization Tool 3.9.0.

The research protocol received approval from the fintech company’s internal ethics board (approval #ETH-2023-047, dated January 15, 2023) and an independent ethical review committee (approval #IND-ETH-2023-112, dated February 2, 2023). For the customer survey component, participants provided informed consent via an in-app consent screen explaining the survey purpose, its voluntary nature, data use, and privacy protections. Raw, non-anonymized data are stored encrypted on secure servers within Colombia, with access restricted to authorized research staff. The authors affirm compliance with the ethical publication requirements of Data in Brief.

CRediT Author Statement

Luis Eduardo Muñoz-Guerrero: Conceptualization, Data curation, Formal analysis, Writing - original draft; Yony Fernando Ceballos: Methodology, Software, Validation, Formal analysis, Writing - review & editing; Luis David Trejos-Rojas: Project administration, Resources, Supervision, Writing - review & editing, Funding acquisition.

Acknowledgements

We thank the Colombian fintech company for providing data access and supporting this research. One author (LDTR) serves as external advisor to the company; this relationship was disclosed to all co-authors and managed through institutional conflict-of-interest protocols. We acknowledge the technical support of Universidad Tecnológica de Pereira’s research computing infrastructure for data processing and anonymization pipeline development. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Luis David Trejos-Rojas serves as an external advisor to the fintech company providing the data, with compensation for advisory services unrelated to this research. This relationship was disclosed and managed per institutional conflict-of-interest policies. Luis Eduardo Muñoz-Guerrero and Yony Fernando Ceballos have no financial relationships with the company.

Contributor Information

Luis Eduardo Muñoz-Guerrero, Email: lemunozg@utp.edu.co.

Yony Fernando Ceballos, Email: yony.ceballos@udea.edu.co.

Luis David Trejos-Rojas, Email: luisdavid.trejosrojas@gmail.com.

Appendix

Table A1.

The following table provides the complete data dictionary for the COFINFAD dataset. The variables are divided between the customer-level aggregations (‘customer_data.csv’, Variables 1–54) and the granular transaction records (‘transactions_data.csv’, Variables 55–57).

| No. | Variable Name | Definition | Data Type |
|---|---|---|---|
| 1 | customer_id | Unique identifier for each customer | Integer |
| 2 | age | Customer's age in years | Integer |
| 3 | gender | Customer's gender | String |
| 4 | location | Customer's city and state | String |
| 5 | income_bracket | Customer's income category | String |
| 6 | occupation | Customer's job or profession | String |
| 7 | education_level | Highest level of education attained by the customer | String |
| 8 | marital_status | Customer's marital status | String |
| 9 | household_size | Number of people in the customer's household | Integer |
| 10 | acquisition_channel | How the customer was acquired (e.g., organic, referral) | String |
| 11 | customer_segment | Category assigned to the customer based on their behavior | String |
| 12 | savings_account | Whether the customer has a savings account (True/False) | Boolean |
| 13 | credit_card | Whether the customer has a credit card (True/False) | Boolean |
| 14 | personal_loan | Whether the customer has a personal loan (True/False) | Boolean |
| 15 | investment_account | Whether the customer has an investment account (True/False) | Boolean |
| 16 | insurance_product | Whether the customer has an insurance product (True/False) | Boolean |
| 17 | active_products | Number of active financial products the customer has | Integer |
| 18 | app_logins_frequency | Number of times the customer logs into the app per month | Integer |
| 19 | feature_usage_diversity | Number of unique features used by the customer in the app | Integer |
| 20 | bill_payment_user | Whether the customer uses bill payment feature (True/False) | Boolean |
| 21 | auto_savings_enabled | Whether the customer has enabled auto-savings feature (True/False) | Boolean |
| 22 | credit_utilization_ratio | Ratio of credit used to credit available | Float |
| 23 | international_transactions | Number of international transactions made by the customer | Integer |
| 24 | failed_transactions | Number of failed transactions for the customer | Integer |
| 25 | tx_count | Total number of transactions made by the customer | Integer |
| 26 | avg_tx_value | Average value of customer's transactions | Float |
| 27 | total_tx_volume | Total value of all transactions made by the customer | Float |
| 28 | first_tx | Date of the customer's first transaction | Date |
| 29 | last_tx | Date of the customer's most recent transaction | Date |
| 30 | base_satisfaction | Base satisfaction score for the customer | Float |
| 31 | tx_satisfaction | Satisfaction score based on transaction history | Float |
| 32 | product_satisfaction | Satisfaction score based on product usage | Float |
| 33 | satisfaction_score | Overall customer satisfaction score | Integer |
| 34 | nps_score | Net Promoter Score for the customer | Integer |
| 35 | last_survey_date | Date when the customer last completed a survey | Date |
| 36 | support_tickets_count | Number of support tickets opened by the customer | Integer |
| 37 | resolved_tickets_ratio | Ratio of resolved support tickets to total tickets | Float |
| 38 | app_store_rating | Customer's rating of the app in the app store | Float |
| 39 | feedback_sentiment | Sentiment analysis of customer's feedback | String |
| 40 | feature_requests | Features requested by the customer | String |
| 41 | complaint_topics | Main topics of customer's complaints | String |
| 42 | clv_segment | Customer Lifetime Value segment | String |
| 43 | monthly_transaction_count | Average number of transactions per month | Float |
| 44 | average_transaction_value | Average value of customer's transactions | Float |
| 45 | total_transaction_volume | Total value of all transactions made by the customer | Float |
| 46 | transaction_frequency | Number of transactions per day | Float |
| 47 | last_transaction_date | Date of the customer's most recent transaction | Date |
| 48 | preferred_transaction_type | Most frequent type of transaction for the customer | String |
| 49 | first_transaction_date | Date of the customer's first transaction | Date |
| 50 | weekend_transaction_ratio | Ratio of transactions made on weekends | Float |
| 51 | avg_daily_transactions | Average number of transactions per day | Float |
| 52 | customer_tenure | Length of time as a customer in months | Float |
| 53 | churn_probability | Predicted probability of customer churn | Float |
| 54 | customer_lifetime_value | Estimated total value of the customer to the business | Float |
| 55 | date | Date of a specific transaction | Date |
| 56 | amount | Amount of a specific transaction | Float |
| 57 | type | Type of a specific transaction | String |

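Data Availability

COFINFAD: Colombian Fintech Financial Analytics (Mendeley Data): https://data.mendeley.com/datasets/mhb4zn3258/1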

References

  1. Johnson A.E., et al. MIMIC-III, a freely accessible critical care database. Sci. Data. 2016;3. doi: 10.1038/sdata.2016.35.
  2. Yin R.K. Case Study Research: Design and Methods. 5th ed. Sage Publications; Thousand Oaks, CA: 2014.
  3. Flyvbjerg B. Five misunderstandings about case-study research. Qual. Inq. 2006;12(2):219–245.
  4. Berkmen P., et al. Fintech in Latin America and the Caribbean: stocktaking. IMF Working Paper WP/2021/221. 2021.
  5. Ozili P.K. Financial inclusion research around the world: a review. Forum Soc. Econ. 2021;50(4):457–479.
  6. Mollaev D., et al. Multimodal banking dataset: understanding client needs through event sequences. arXiv preprint arXiv:2409.17587. 2024.
  7. Yeh I.C., Lien C.H. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl. 2009;36(2):2473–2480.
  8. Tran H.D., Le N., Nguyen V.-H. Customer churn prediction in the banking sector using machine learning-based classification models. Interdiscip. J. Inf. Knowl. Manag. 2023;18:87–105.
  9. Li X., et al. Bank customer segmentation and marketing strategies based on improved DBSCAN algorithm. Appl. Sci. 2025;15(6): Article 3138.
  10. Met I., et al. Product recommendation system with machine learning algorithms for SME banking. Int. J. Intell. Syst. 2024: Article 5585575.
  11. Brito J.B.G., et al. A framework to improve churn prediction performance in retail banking. Financ. Innov. 2024;10: Article 17.
  12. Wilkinson M.D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016;3.
  13. Barve Y., Gupta D., Mudgal K. Big data in human behavior research: a contextual turn. J. Big Data. 2025;12: Article 17.
  14. Tripsas M., Gavetti G. Capabilities, cognition, and inertia: evidence from digital imaging. Strateg. Manag. J. 2000;21(10-11):1147–1161.
  15. Dyer W.G., Wilkins A.L. Better stories, not better constructs, to generate better theory: a rejoinder to Eisenhardt. Acad. Manag. Rev. 1991;16(3):613–619.
  16. Siggelkow N. Persuasion with case studies. Acad. Manag. J. 2007;50(1):20–24.
  17. Lopez-Rojas E.A., Elmir A., Axelsson S. PaySim: A financial mobile money simulator for fraud detection. Proceedings of the 28th European Modeling and Simulation Symposium. 2016.
  18. Inter-American Development Bank (IDB), Finnovista. Fintech in Latin America and the Caribbean: a consolidated ecosystem for recovery. 2022.
  19. Mastercard, Finnovista, and partners. Fintech radar Colombia Report (VII Edition). 2024.

Articles from Data in Brief are provided here courtesy of Elsevier
