| Algorithm 1: Dataset splitting based on geographic region hashing (hash bucketing) |
| Input: region: string // Geographic region name |
| Output: subset: {train, val, test} |
| 1: // Step 1: Generate normalized hash value |
| 2: hash_str ← MD5(region) // Compute MD5 hash of region |
| 3: hash_hex ← Substring(hash_str, 0, 8) // Extract first 8 hexadecimal characters |
| 4: hash_int ← HexToInt(hash_hex) // Convert hexadecimal substring to integer |
| 5: hash_mod ← hash_int MOD 10,000 // Reduce to an integer in [0, 9,999] |
| 6: hash_ratio ← hash_mod/10,000.0 // Normalize to [0, 1) |
| 7: // Step 2: Assign subset based on ratio |
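| 8: if hash_ratio < 0.8 then |
| 9: subset ← “train” // hash_ratio in [0, 0.8): about 80% of regions go to the training set |
| 10: else if hash_ratio < 0.9 then |
| 11: subset ← “val” // hash_ratio in [0.8, 0.9): about 10% of regions go to the validation set |
| 12: else |
| 13: subset ← “test” // hash_ratio in [0.9, 1): about 10% of regions go to the test set |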
| 14: end if |
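
As an illustration, Algorithm 1 can be sketched in Python roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `assign_subset` and the UTF-8 encoding of the region name are assumptions, since the algorithm does not say how the string is converted to bytes before hashing.

```python
import hashlib


def assign_subset(region: str) -> str:
    """Map a region name to "train", "val", or "test" via hash bucketing."""
    # Step 1: generate a normalized hash value.
    # Assumption: the region name is encoded as UTF-8 before hashing.
    hash_str = hashlib.md5(region.encode("utf-8")).hexdigest()
    hash_hex = hash_str[:8]            # first 8 hexadecimal characters
    hash_int = int(hash_hex, 16)       # hexadecimal substring -> integer
    hash_mod = hash_int % 10_000       # integer in [0, 9999]
    hash_ratio = hash_mod / 10_000.0   # normalized to [0, 1)

    # Step 2: assign the subset from the ratio.
    if hash_ratio < 0.8:
        return "train"
    elif hash_ratio < 0.9:
        return "val"
    else:
        return "test"
```

Because the assignment depends only on the region name, every sample from the same region lands in the same subset, and the result is reproducible across runs. The 80/10/10 proportions hold in expectation over regions, so the realized split can deviate from them when there are few regions or when regions differ widely in sample count.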