Abstract
Objectives: The study sought to investigate how Spanish names are handled by national and international databases and to identify mistakes that can undermine the usefulness of these databases for locating and retrieving works by Spanish authors.
Methods: We sampled 172 authors affiliated with the University of Granada Medical School between 1987 and 1996 and analyzed the variations in how each of their names was indexed in Science Citation Index (SCI), MEDLINE, and Índice Médico Español (IME). The number and types of variants that appeared for each author's name were recorded and compared across databases to identify inconsistencies in indexing practices. We analyzed the relationship between variability (number of variants of an author's name) and productivity (number of items the name was associated with as an author), the consequences for retrieval of information, and the most frequent indexing structures used for Spanish names.
Results: The proportion of authors who appeared under more than one name was 48.1% in SCI, 50.7% in MEDLINE, and 69.0% in IME. Productivity correlated directly with variability: more than 50% of the authors listed on five to ten items appeared under more than one name in any given database, and close to 100% of the authors listed on more than ten items appeared under two or more variants. Productivity correlated inversely with retrievability: as the number of variants for a name increased, the number of items retrieved under each variant decreased. For the most highly productive authors, the number of items retrieved under each variant tended toward one. The most frequent indexing methods varied between databases. In MEDLINE and IME, names were indexed correctly as “first surname second surname, first name initial middle name initial” (if present) in 41.7% and 49.5% of the records, respectively. However, in SCI, the most frequent methods were “first surname, first name initial middle name initial” (if present; 48.0% of the records) and “first surname and second surname run together, first name initial” (18.3%).
Conclusions: Retrievability on the basis of author's name was poor in all three databases. Each database uses accurate indexing methods, but these methods fail to result in consistency or coherence for specific entries. The likely causes of inconsistency are: (1) use by authors of variants of their names during their publication careers, (2) lack of authority control in all three databases, (3) the use of an inappropriate indexing method for Spanish names in SCI, (4) authors' inconsistent behaviors, and (5) possible editorial interventions by some journals. We offer some suggestions as to how to avert the proliferation of author name variants in the databases.
INTRODUCTION
Because of their capacity to handle massive amounts of information, bibliographic databases (DB) have become an indispensable tool for retrieving scientific information and for performing bibliometric studies [1]. Their internal quality is a prime factor for ensuring efficient science communication and reliable, valid bibliometric analyses. Errors and inconsistencies in bibliographic records lead to the loss of relevant information in searches and interfere with access to documents [2]. In addition, errors can affect the operations of identification, selection, extraction, classification, ordering, and tabulation of data. Inaccuracies in bibliometric studies as a result of incorrect information supplied by DBs have been widely reported [3–9], and some authors have pointed out that these errors pose limitations serious enough to lead users to question the validity of bibliometric indicators [10–12].
The quality of bibliographic DBs has been widely questioned. Several studies have revealed deficient standardization measures for some fields in the registers and a high frequency of errors in the information contained therein [13–16]. It is therefore not surprising that calls to improve the DBs appear periodically. Some studies have identified features that can help to define quality and have suggested indicators and procedures that make it possible to measure and improve quality [17, 18].
The author name is one of the fields that has been widely shown to form the basis of many users' searches for information retrieval. However, this field has often been found to be inaccurately recorded. Even in the online catalogs of prestigious libraries, some studies have found that variability and inconsistencies in this field are frequent [19–21], despite rigorous authority control measures based on national and international standards. Although similar studies have yet to be undertaken for commercial DBs, the problem in these repositories is likely to be even worse. Many studies have denounced the lack of uniformity in author names and have tried to recommend appropriate strategies to ensure that searches on this field are successful [22–31].
Evaluations of the quality of references in scientific articles have revealed that author names are one of the largest sources of error [32]. Some of the mistakes can be attributed to inconsistencies in the DBs that researchers use to locate and reference their citations.
One of the reasons for the variability in personal names in bibliographic information systems is related to the diversity of structures that arise from historical and cultural traditions involved in naming persons in different countries [33]. In particular, this variability may be exacerbated when personal names are handled by linguistic systems different from the one in which they originated.
For Spanish authors indexed in international DBs—in which English is the dominant language—the problem is obvious given the complexity and different potential combinations of first and middle names and surnames (appendix). Although different countries and linguistic communities use cataloging rules and international guidelines to keep the problems related to rules for dealing with author names in bibliographic online catalogs to a minimum [34–36], the rules and standards are often not applied consistently by the DBs. Moreover, important differences exist between the structures of English (simple or compound name and patronymic surname) and Spanish names (simple or compound name, patronymic and matronymic surnames), which further aggravate the problem but await detailed study. Articles (in Spanish) by Gómez [37] and López-Cózar [38] are among the first to call attention to the mishandling of Spanish names in English-language DBs.
In this article, the authors take a closer look at the origins and consequences of errors in Spanish names in national and international bibliographic DBs, in an attempt to shed light on five issues:
How much variability in the structure of Spanish names is there in each of these DBs, and what are the possible causes of this variability?
What are the consequences of this variability for information retrieval based on Spanish personal names?
How consistent are the DBs in their approach to indexing Spanish names?
How accurately are Spanish authors' names handled in Science Citation Index (SCI), MEDLINE, and Índice Médico Español (IME)?
What criteria can be used by the DBs to improve accuracy and consistency, and what practices can authors and journals use to improve consistency and reliability?
METHODS
Sample
From a reference population of Spanish authors indexed in SCI, MEDLINE, and IME during 1987 to 1996, we extracted a sample of 172 authors affiliated during this period with the School of Medicine at the University of Granada, Granada, Spain. The choice of a nonrandom sample was based on methodological reasons. We needed to check all variants of the same name indexed in the DBs against the author's full, correct name. This information would have been difficult and time consuming to obtain, if we had used a random sample of all Spanish university authors associated with items in these DBs. The following criteria were used to ensure that the sample was representative and unbiased and that the results could be assumed to have external validity:
The sample contained all possible name structures at proportions that represented their true frequency in Spain (appendix). Table 1 shows the different name structures and their frequencies in our sample. The most frequent name structure in Spain consists of a first name (which may in turn consist of one or more parts; for the sake of convenience we have used the term “middle name” here to designate the second part of compound first names) and a patronymic surname followed by a matronymic surname. In our sample, the legal names of 83.8% of the authors reflected this structure (52.9% plus 30.9%).
The number of authors associated with more than one item was large enough to permit the study of possible variants in their name. Of the 172 authors in our sample, 162 (94%) had published more than one article (appendix).
The number of authors indexed in more than one of the three DBs was large enough to permit comparisons across DBs. In our sample, 82.6% of the authors appeared in two of the DBs, and 63.4% appeared in all three (Table 2).
The sample was free of biases. Although a sample of authors from the same institution might have been biased by uniform publication practices, this possible bias was ruled out by the fact that the School of Medicine of the University of Granada has no institutional publication guidelines or recommendations. The results of the study also failed to reveal any uniform trends in publication.
We chose the DBs in this study to investigate whether variability in name structure was widespread or limited to certain DBs and to compare different repositories in terms of information retrievability when searches were based on author names. The three DBs in this study were also the most representative sources of bibliographic information in the biomedical research community in Spain, both for national (IME) and international coverage (SCI, MEDLINE). An exploratory analysis of the DBs that Spanish biomedical researchers used for literature searching showed that 62% used MEDLINE, whereas 19% used IME [39]. A recent study also noted the importance of SCI for the biomedical literature and the effectiveness of the joint use of this DB and MEDLINE by medical researchers to retrieve the largest possible number of references [40].
The IME, produced by the Centro de Documentación e Información Biomédica (Center for Biomedical Documentation and Information) at the University of Valencia and the Consejo Superior de Investigaciones Científicas (CSIC, Higher Council of Scientific Research), is the most accurate and complete national DB for tracking medical publications in Spain and for increasing the national and international dissemination of this research [41, 42].
MEDLINE is considered the best bibliographic information system in biomedicine and is the most widely used [43], although other DBs, such as EMBASE/Excerpta Medica, cover a larger number of journals [44]. In addition, MEDLINE has been used as a source of bibliometric information to study Spanish scientific production in biomedicine. According to Pestaña [45], during the period from 1990 to 1994, the number of articles by Spanish authors indexed in MEDLINE was double the number indexed in SCI. This difference is not surprising given that during that period MEDLINE indexed thirty-six Spanish journals, whereas SCI and the Social Science Citation Index (SSCI) together covered only twelve.
SCI is a unique bibliographic information system. It processes the references contained in each source document it indexes, thereby making it possible to retrieve all related documents through the references cited. It is therefore the only system that can be used to measure the dissemination and use of scientific information via citation analysis and is hence an indispensable tool for bibliometric studies.
Aside from the reasons given above, the three DBs were considered particularly useful in the present study because of the differences in the scope of their coverage (SCI indexes journals from all areas of science, whereas MEDLINE and IME cover only health sciences) and in the main languages of the source items (English in MEDLINE and SCI, Spanish in IME). These differences allowed us to compare the degree to which bibliographic authority control was conditioned by the main language of the DB.
Data collection
The full correct name of each author was taken from a list of medical school teaching staff generated by the personnel database maintained by the School of Medicine, University of Granada. All names on this list were considered potential authors. The procedure to identify publications associated with each author in each DB was as follows.
For SCI, we searched on the address field year by year from 1987 to 1996 to retrieve all records that included the word “Granada.” These records were examined one by one to collect only those items whose authors were affiliated with the School of Medicine at the University of Granada. This strategy was successful thanks to the detailed information in the affiliation field for each author of each item.
For IME and MEDLINE, the searches were done in a slightly different way, because these databases recorded address information only for the first author in the byline, and searches on the affiliation field would therefore not have been effective. We performed four searches for each name on the list of potential authors: (1) first and second surnames and first name initial, (2) first and second surnames, (3) first surname only and first name initial, and (4) second surname only and first name initial. For authors with complex names, we ran searches with all possible name structures (first surname only, second surname only, or both; one or more initials for first and middle names; particles; see Table 1). Each record thus retrieved was checked to ensure that the author's name was a variant of the target author's name, rather than a different individual with a similar name.
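The four searches run for each potential author can be sketched as follows. This is a minimal illustration only: the function name and the plain "surname(s) initial" layout are assumptions, since the actual query syntax used in IME and MEDLINE is not given here. The example name is the one used later in this article.

```python
def search_strings(first_name, first_surname, second_surname):
    """Build the four author-field search strings described in the text.

    The space-separated layout is an assumption for illustration; the real
    query syntax of IME and MEDLINE is not stated in the article.
    """
    initial = first_name[0].upper()
    return [
        f"{first_surname} {second_surname} {initial}",  # (1) both surnames + first name initial
        f"{first_surname} {second_surname}",            # (2) both surnames only
        f"{first_surname} {initial}",                   # (3) first surname + first name initial
        f"{second_surname} {initial}",                  # (4) second surname + first name initial
    ]


queries = search_strings("Ramon", "Galvez", "Vargas")
for query in queries:
    print(query)
```

Authors with compound first names or particles would need additional combinations, as noted above; the sketch covers only the basic case.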
All records obtained with these search strategies were loaded into a ProCite database (v. 4.0) for further manipulation. The search and indexing functions of this program were used to identify all variants of each name in records from each of the three DBs. We tabulated the results (appendix), and this information was used for further quantitative analyses for each author. For each author, we noted the full correct name in bold characters, then added the correct standard structure according to the current Spanish Cataloging Rules (RCE) [46] and the International Federation of Library Associations and Institutions (IFLA) guidelines [47]. All standard structures took the form of one of the two basic structures explained in Figure 1 or of one of the modifications.
The occurrences in each DB of each of the variants found for each author were identified. Variants were matched with their standard structure by searching for author entries with the same first name or names, first (and second, if given in the record) surname, and particle or particles, regardless of the order in which these components of the name appeared in the DB record. If the same variant was suspected to refer to two different authors, the variant was assigned to the first name in the alphabetical listing of potential authors. Variants that were obviously the result of a typographical error or the omission of a hyphen from the surname were assigned to the appropriate author with little difficulty. After these cases were resolved, some doubt remained in a few cases as to which author a variant heading belonged to.
Absolute and relative frequency indices were calculated for each variant. Pearson's correlation coefficient was calculated to determine whether variability in the author's name correlated with productivity.
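As an illustration of the correlation step, the sketch below computes Pearson's coefficient from paired counts of publications and name variants. The data values are hypothetical and are not taken from the study; the function name is likewise an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical example: items per author and variants per author
productivity = [1, 2, 3, 5, 9, 12]
variants = [1, 1, 2, 2, 3, 5]
r = pearson_r(productivity, variants)
print(round(r, 3))
```

A value of r close to 1 would indicate that authors with more items tend to appear under more variants.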
RESULTS
Overall quantification of variability in author names
To judge the magnitude, in absolute and relative terms, of the problem with Spanish author names in the three DBs analyzed here, we ranked authors by the number of variants in the names under which each was indexed. Table 3 shows that about half of the authors appear under two or more variants in SCI (48.1%) and MEDLINE (50.7%) and that this proportion is an even higher 69.0% in IME. The most frequent total number of variants was between two and four. Five authors had five variants in MEDLINE, and two authors had five variants in IME. In each of these DBs, one author appeared under seven different variants.
Relationship between productivity and variability
To obtain a more detailed view of the problem, we analyzed the relationship between variability in author names and productivity. For authors with only one publication, variability is zero. As the number of publications increased, so did the number of variants under which authors were indexed (Figure 2). The correlation between these variables was significant: among authors with two or three publications, 25% to 40% appeared under more than one name. Among those with five to ten publications (approximately the median number for our sample), more than 50% were indexed under more than one name. For example, among those with nine items in the DBs, more than 70% were indexed under several different names. For highly productive authors with more than ten publications, it was exceptional for any name to appear systematically with the same structure in all records; in other words, almost 100% of those in this subgroup were indexed under several different names.
When the results for different DBs were examined separately (Figure 3), we again found that as the number of publications increased, the percentage of authors with no variants decreased. Uniformity in the structure of the names within the same DB was lost.
The implication of increased productivity for the retrievability of information was that as the number of publications increased, the effectiveness of searches on the author field diminished. In other words, the number of items likely to be located by a search with any one of the possible variants of an author's name tended to decrease. This problem affected 50% of the authors with more than one item in SCI and 70% of all such authors in IME. Of all potentially locatable articles by a given author, between 50% and 70% of the records used some variant of the author's name. This inconsistency was especially notable in IME and was less marked in MEDLINE. In the latter DB, the decrease in the number of articles that a search on any variant was likely to locate was exponential, although this relationship was less marked than in SCI or IME.
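The effect on retrieval can be made concrete with a small calculation. The sketch below is illustrative only: the function name, the measures it reports, and the counts are assumptions, and the headings are example forms quoted elsewhere in this article.

```python
from collections import Counter


def retrieval_profile(headings):
    """Summarize how one author's items are spread across name variants.

    `headings` holds one index heading per item attributed to the author.
    """
    counts = Counter(headings)
    total = len(headings)
    return {
        "items": total,
        "variants": len(counts),
        # Average number of items found by a search on one variant
        "mean_items_per_variant": total / len(counts),
        # Share of the author's items found by the single best search
        "best_single_search_share": max(counts.values()) / total,
    }


# Hypothetical author with six items indexed under three variants
example = ["Galvez Vargas, R"] * 3 + ["Galvez, R"] * 2 + ["Galvezvargas, R"]
profile = retrieval_profile(example)
print(profile)
```

In this hypothetical case even the most successful single search locates only half of the author's items; as variants multiply, the mean number of items per variant falls toward one.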
A separate study of each DB revealed other phenomena that deserved comment and were most noticeable in SCI. Figure 4 plots the percentage of authors with no variants against productivity. In the interval between two and five publications, the loss of uniformity was already notable: among authors associated with five items, only 20% appeared under the same name in all records. However, among highly productive authors with more than ten publications, two groups were clearly distinguishable: those with extremely high variability, to the extent that a different variant appears for each publication, and those with no variability, whose name appears in the same manner in all records.
Types and frequencies of variants
The frequencies of different name structures were counted to determine which variants were used most often in the DBs and to compare the three DBs in terms of accuracy and degree of variability. Table 4 shows the most frequent variants used for indexing in the different DBs. In MEDLINE and IME, the structure used most often was “first surname second surname, first name initial” (for example, Galvez Vargas, R), which accounted for 32.9% and 36.7% of the index entries, respectively. The second-most frequent structure was “first surname, first name initial” (for example, Galvez, R) at 25.2% and 21.5%, and the third-most often used was “first surname, first name initial middle name initial” (for example, Caballero, AM) at 12.7% and 9.2%. In fourth place in MEDLINE and IME records was the structure “first surname second surname, first name initial middle name initial” (for example, Caballero Plasencia, AM) with 8.8% and 12.8% of all entries, respectively.
However, the situation in SCI was clearly different. The name structure that accounted for the largest proportion of entries (35.8%) was “first surname, first name initial” (for example, Galvez, R). The second-most frequent structure (18.3%) was a single string of characters consisting of the first surname run in with the second surname, followed by the first name initial (for example, Galvezvargas, R). The third-most common structure, accounting for 12.2% of the entries in SCI, was “first surname, first name initial middle name initial” (for example, Caballero, AM), and the fourth most common structure (7.1%) was “second surname, first name initial first surname initial” (for example, Vargas, RG for the author named Ramon [first name] Galvez [first surname] Vargas [second surname]).
In all three DBs, the remaining variants each accounted for less than 5% of the entries, and many variants in name structure occurred only one or two times each.
DISCUSSION
The large proportion of authors whose name appears in more than one manner and the direct correlation between productivity and variability suggest that the origin of the problem is in the lack of consistency on the part of authors themselves in signing their research articles. This behavior leads to serious inconsistencies in the DBs, whose managers and technicians apparently do nothing to curtail or correct the problem.
One of the principles used to improve the quality of bibliographic databases is to establish a single structure for each name that appears associated with different source documents. Our findings strongly suggest that the DBs we sampled do not apply any type of control measure to keep variants of the same name from proliferating. As a result, the quality of information retrieval based on searches by author is poor. The consequences for users of these DBs are clear: if users wish to locate all items published by a certain author by searching on the author field, they must perform at least two separate searches for half of the authors (at least, for the population of authors affiliated with the University of Granada medical school) and often many more, depending on how many variants there are of the authors' names. With browsing techniques, users would need to discover by trial and error which variants have been used for entries, among the many possible combinations (some clearly incorrect and unlikely to be guessed by a typical DB user) of first and second surnames and first and middle name initials.
We were surprised to find the highest rate of variability in the Spanish database, IME. In theory, familiarity with Spanish name structures on the part of the persons who manage this DB should have led to less variability in IME than in the two international DBs created and managed in the United States. The withdrawal in 1990 of the public funding on which IME depended for its maintenance and day-to-day operations may explain the drop in the quality of bibliographic control and may account for the poor performance we observed for items published between 1987 and 1996. In addition, IME enters information into its records just as it appears in the source documents and produces indexes only of Spanish publications. This process may be construed as evidence that Spanish authors—at least those who publish in biomedicine—are less careful about using the same name for all their publications when they submit articles to Spanish journals than when they submit them to “international” journals. Studies designed to investigate authors' behavior in signing different articles will be needed to shed light on this hypothesis.
Our analysis of the relationship between productivity and variability revealed a direct correlation between these two variables. In MEDLINE, the increase in variants as a given author published more items was less than in the other two DBs, a finding that might be related to the use of measures to control author names or at least to invert the order of first names and surnames correctly to respect the original name structure in Spanish.
Hence, greater productivity implies lower effectiveness of information retrieval, with a tendency for the number of items theoretically retrievable to approach one for highly productive authors. The consequences of this tendency are obvious: as an author produces more publications, the likelihood that all of them will be located by a single search with a single variant of the author's name—regardless of whether the variant tried is the correct one according to Spanish language usage or one of the several possible incorrect permutations—decreases sharply, as illustrated in Figure 2.
For comparatively unproductive authors who have published only a few articles or perhaps only one, retrievability is better, but only if the search or browsing session is based on the same name structure used to index the item or items in the DB. Regardless of productivity, then, retrievability cannot be assured even for Spanish authors indexed under only one name, which may or may not be correct according to Spanish usage. As several studies have already pointed out [48–50], information retrieval based on searching by author name will become optimal only when authority control measures are used to standardize entries, ensure their consistency across records (unification), and guarantee that “see” and “see also” cross-references are appropriately linked.
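The role of “see” references can be illustrated with a toy authority file. This is a sketch under stated assumptions, not a description of how any of the three DBs works: the record identifiers are hypothetical, and the variant headings are the example forms quoted in this article.

```python
# "See" references: each variant heading points to one authorized heading.
SEE = {
    "Galvez, R": "Galvez Vargas, R",
    "Galvezvargas, R": "Galvez Vargas, R",
    "Vargas, RG": "Galvez Vargas, R",
}

# Hypothetical records: (record identifier, heading as indexed)
RECORDS = [
    ("rec-1", "Galvez Vargas, R"),
    ("rec-2", "Galvez, R"),
    ("rec-3", "Galvezvargas, R"),
    ("rec-4", "Vargas, RG"),
]


def authorized(heading):
    """Follow the 'see' reference, if any, to the authorized heading."""
    return SEE.get(heading, heading)


def search(heading, use_authority):
    """Return record identifiers found by an author search."""
    if use_authority:
        target = authorized(heading)
        return [rid for rid, h in RECORDS if authorized(h) == target]
    # Without authority control only exact heading matches are found
    return [rid for rid, h in RECORDS if h == heading]


print(search("Galvez, R", use_authority=False))
print(search("Galvez, R", use_authority=True))
```

With the authority file in place, a search on any variant retrieves all four records; without it, each variant retrieves only the records indexed under that exact form.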
Despite the trend toward loss of reliability in retrieval as the number of publications increases, Figure 4, which is based on results obtained with SCI, shows that high productivity is not always associated with high variability in author name structures. Some authors have standardized their name structures throughout their publishing career by signing all their articles with the same “pen name.” (For some examples, see authors 11, 60, 116, and 167 in the appendix.) These authors, along with those for whom only one item is indexed, account for the subgroup of authors with no variants. At the opposite extreme are authors with complex names who have not adopted a permanent pen name; under these circumstances many variants can arise. (For examples, see authors 18, 56, 68, 77, 85, and 149 in the appendix.) Interestingly, some highly productive authors in the present sample appear to have used a pen name more systematically for some journal submissions than for others. This transitory consistency is reflected in the numbers of variants found in SCI for some cases. (For examples, see authors 20, 92, 128, 132, 150, and 172 in the appendix.) This may reflect the fact that most source items indexed by SCI are from journals published in English-speaking countries, and many journals may impose English-language conventions on the structure of foreign authors' names. In any case, it appears that some Spanish authors with publications in international journals consider the country where the journal is published when they place their name on the title page of the manuscript.
An analysis of the most frequent variants in each DB suggests some answers to some of the questions raised above. In overall terms, four main variants account for a large percentage of occurrences of author names (Table 4). These four variants are all derived from the two name structures that are currently the most common in Spain and that together represent the full legal names of 83.8% of the authors in our sample (Table 1). However, to understand these data, it is necessary to backtrack and deduce how these variants came about as a result of the indexing rules or criteria used by each DB.
In IME and MEDLINE, the most frequent and fourth-most frequent variants are the correct, standardized forms of authors' names according to national (RCE) and international (IFLA) cataloging rules; the only departure from these guidelines is that in both cases initials are used for the first and middle names. IME uses the RCE criteria (summarized in Figure 1), and MEDLINE, produced by the National Library of Medicine (NLM), uses the NLM Cataloging System [51]. This latter system is compliant with the Program for Cooperative Cataloging (PCC) developed by the Library of Congress and the Name Authority Cooperative Organization (NACO) [52], whose base standards are the criteria established by the second edition of the Anglo-American Cataloging Rules (AACR2) [53] for the formation of headings for persons and the specifications of the MARC format for type of personal name entry element [54].
The second-most and third-most frequently seen variants in IME and MEDLINE, as well as the remaining variants shown in Table 4, can be derived from the base standards cited above, although these variants do not represent correct names. We can therefore deduce that the indexing criteria used by IME and MEDLINE follow Spanish linguistic practices, although they seemingly fail to apply mechanisms such as checking against an authority file to ensure that the same form of the author's name is indexed consistently and continuously throughout the life of the DB. Such systematization measures would greatly improve the retrievability of information by ensuring that all works linked to the same author are located in a single search.
In SCI, however, the most frequent variants are the result of the application of a specific indexing criterion used by this DB: “the general rule is that the final name presented is taken as the surname—this applies to all languages. All other names presented are processed as initials” [55]. This general rule is compatible with the basic criterion of the AACR2 for the standard structure (name and surname) of names in English. On the basis of this general rule, the Institute for Scientific Information (ISI), which produces SCI, always considers the last part of the name given in the source document to be the only indexable surname for that author and thus uses this part as the entry element. The remaining parts of the name are reduced to initials; for example, José María Bermúdez García becomes García, JMB. The system used by ISI makes one exception for all languages: particles that link the first name with the surname are treated as part of the surname: “Particles are included as part of the surname. There is a list of accepted particles that is applied to all languages” [56]. For example, Juan Luis del Arbol becomes Delarbol, JL. A specific rule used by SCI for Spanish names further confirms that these names are often mutilated by its indexing policy: “Compound names joined by ‘y’ or ‘e’ are split so that the last name presented is processed as the surname, and the conjunction is taken as an initial” [57]. For example, María González y Rodriguez becomes Rodriguez, MGY.
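The rules quoted above can be expressed as a short routine. This is a minimal sketch of the rules as described in this article, not ISI's actual software: the function name is hypothetical, and the particle list is an assumption, because ISI's list of accepted particles is not reproduced here.

```python
# Assumed particle list for illustration; ISI's real list is not given.
PARTICLES = {"de", "del", "la", "las", "los"}


def sci_style_entry(name_as_signed):
    """Apply the quoted SCI rules to a name as it appears in the byline."""
    parts = name_as_signed.split()
    surname = parts[-1]   # general rule: the final name is the surname
    rest = parts[:-1]
    # Exception: particles directly before the surname are absorbed into it
    prefix = []
    while rest and rest[-1].lower() in PARTICLES:
        prefix.insert(0, rest.pop())
    surname = "".join(prefix + [surname]).capitalize()
    # All other names become initials; a conjunction such as "y" needs no
    # special handling because it is simply reduced to its initial
    initials = "".join(part[0].upper() for part in rest)
    return f"{surname}, {initials}"


for signed in (
    "José María Bermúdez García",
    "Juan Luis del Arbol",
    "María González y Rodriguez",
):
    print(signed, "->", sci_style_entry(signed))
```

Applied to the three examples in the text, the routine yields García, JMB; Delarbol, JL; and Rodriguez, MGY, which shows how the first (patronymic) surname is reduced to an initial whenever both surnames are given.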
According to these indexing criteria, and considering that the most common structure of Spanish names is “first name (middle name if present) first surname second surname,” the most common variants in SCI would be expected to be “second surname, first name initial first surname initial” (e.g., Angeles Ruiz Extremera becomes Extremera, AR) and “second surname, first name initial (middle name initial) first surname initial” (e.g., María Estrella Ruiz Requena would be expected to be indexed in SCI as Requena, MER). However, these variants actually occupy the fourth and sixth positions in descending order of frequency.
We found that the variants produced by authors who presumably adapted their pen name to English-language conventions were more frequent in this DB than the “standard” entry structure derived from applying SCI's indexing criteria to Spanish names published according to normal Spanish-language conventions. For example, authors whose name appeared as “first name first surname” were indexed under entries structured as “first surname, first name initial,” the most frequent variant in SCI. Authors who signed their articles as “first name middle name first surname” accordingly were indexed under “first surname, first name initial middle name initial”—the third most frequent variant in SCI. If authors joined their two surnames with a hyphen (first name, first surname-second surname), they were indexed as first surname run together with second surname, first name initial—the second most frequent variant in SCI.
Why are these variants found in SCI but not in MEDLINE? Both DBs are produced in the United States and may be assumed to adapt Spanish names in a similar manner. The explanation may lie in the fact that most of the items by Spanish authors in MEDLINE (about 70%) are from the thirty-four Spanish journals that this DB indexes, whereas nearly all items by Spanish authors in SCI are from journals published in English-speaking countries. This difference appears to be the result of two factors. First, author names in MEDLINE are spared any attempt to adapt them to English linguistic conventions and are indexed correctly. Second, for articles in journals covered in SCI, authors may have been more careful to adapt their names to English linguistic conventions, either spontaneously or to comply with the journal's instructions to authors.
CONCLUSIONS
Causes of variability
Although the data we obtained for author name variants should be verified against the way authors' names appear in the source documents, the present results have important implications, discussed below.
We suggest several main causes of variability in author name index entries in the three DBs we compared. One cause is likely to originate with the authors themselves, who may sign different articles in different ways. We have found that variability tends to increase with productivity. Another cause is the difference in indexing procedures at different DBs—even though the indexing mechanisms are apparently applied consistently within each DB. The procedures used by MEDLINE and IME are appropriate for Spanish names and respect Spanish language conventions, whereas those used by SCI clearly violate these conventions by imposing English name structures. Despite the consistency in the rules developed and applied by each DB, the lack of effective authority control measures means that variants created by the authors themselves are not corrected. Indexing procedures that are inappropriate for Spanish names multiply the effect of inconsistency on the part of the authors. A minority of authors, aware of the problems their names cause in international bibliographic DBs (possibly due to style manuals or the instructions to authors provided by some journals), take measures to counteract the proliferation of variants, although these preventive measures have little effect on the overall accuracy of Spanish author entries in the DB.
Recommendations to reduce variability
To avert the problem at its origin, recommendations need to be aimed at authors and journals. In contrast, attempts to correct the problem during the final phase of information transfer, while the DB record is being prepared or after the source document has been indexed in the DB, should be aimed at DB managers and users.
Authors should be encouraged to sign their articles with the same pen name throughout their publishing careers. Our findings suggest that the structures most likely to reduce variability in Spanish names to a minimum in international DBs are “first name first surname” and “first name middle name first surname.” In other words, the pen name should imitate English language conventions and, if possible, omit particles. Speaking realistically, it seems unlikely that SCI and other large DBs will change their indexing rules in the near future to accommodate the linguistic conventions of Spanish or other languages that differ from English in the way personal names are given. Whether the first and middle names are given in full or as initials would not affect the structure of the DB entry, as all three DBs studied here reduce these names to initials.
This suggestion, however, raises some sensitive issues. First, it assumes that no solution is to be expected from the DBs themselves. Second, Spanish authors may be little inclined to amputate or otherwise mutilate their legal names, a behavior that may well involve some trauma to their linguistic identity. Third, the use of the first name only (with no middle name) and the first surname only would increase problems of homonymy caused by several authors sharing the same pen name, and confusion would be further increased by the use of the initial only for the first name in the index entry and consequent loss of information on the author's first name. This problem has been pointed out in some style manuals and has been discussed in depth by Silva [58]. Assuming that DBs may one day attempt to palliate the problem of homonymy, it seems advisable for authors to provide first and middle names in full in their manuscripts.
Another solution that some authors have resorted to is to join the two surnames with a hyphen so that they are treated as a single indexing element. This hyphenation may more effectively reduce the problems with author identification, particularly in SCI. However, because this DB removes hyphens and combines the surnames, it creates an indexing term that is entirely inaccurate for Spanish authors, as the spurious surname thus created is no longer the author's real surname. For example, Francisco Pérez-Blanco becomes Pérezblanco, F in SCI. Combining the two surnames is not necessary in MEDLINE, because its indexing mechanism leaves the original structure of Spanish names unaltered.
Journals can also do their part to standardize elements used in the process of information transfer. Editorial staff could take measures to ensure that authors appear consistently under the same name in the published byline. They could also take steps to ensure that an author's name appears in the same way in all parts of an issue (i.e., in the table of contents as well as in the article byline) and in the summaries and indexes. The journal's instructions to authors could require specifically that authors take special care to use the same pen name consistently for all their manuscripts.
Because the databases process such huge amounts of information, it seems unlikely that they would be willing or able to play a part in solving the problem of variability and inconsistency in author entries, although these problems undermine their usefulness as information-retrieval tools. This, however, does not exempt them from taking responsibility for their share of the problem. The three DBs we compared in this study should consider applying control procedures that would ensure that a given author is consistently indexed under the same name. SCI should adapt its indexing procedures to the different conventions of languages other than English. The use of indexing rules based on English linguistic conventions for authors from all countries results in the systematic distortion of non-English names.
Meanwhile, to increase the efficacy of information-retrieval processes and improve the accuracy of bibliometric analyses, DB users should take some precautions. For Spanish authors who may be indexed under more than one name, Table 4 provides a list of most of the possible variants that should be used in a search. Based on the most frequent name structure for Spanish authors (“first name middle name first surname second surname”), searches for items associated with the author Antonio María Caballero Plasencia should be run for the potential variants Caballero Plasencia A, Caballero A, Caballero AM, Caballero Plasencia AM, and Caballero M in MEDLINE and IME and for the potential variants Caballero A, Caballero AM, Caballeroplasencia A, Caballeroplasencia AM, and Plasencia AMC in SCI. Ideally, these variants should be tried in the order in which they are given here. Search strategies based on these recommendations can be assumed to retrieve about 85% of all items associated with a given author. As searching continues with the minority variants listed in Table 4, the proportion of potential items associated with the author will approach 100%.
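The search strategy just described can be expressed as a small helper that expands a full Spanish name into the variants named in the text, in the recommended order. This is a minimal sketch, assuming the name has already been split into its parts; the function name `search_variants` and its parameters are illustrative, and the minority variants of Table 4 are not covered.

```python
from typing import Dict, List, Optional


def search_variants(
    first: str, middle: Optional[str], surname1: str, surname2: str
) -> Dict[str, List[str]]:
    """List the main search variants for a Spanish author, per database."""
    f = first[0].upper()
    m = middle[0].upper() if middle else ""
    fm = f + m
    # SCI runs hyphenated surnames together into a single element.
    fused = (surname1 + surname2).capitalize()

    medline_ime = [
        f"{surname1} {surname2} {f}",
        f"{surname1} {f}",
        f"{surname1} {fm}",
        f"{surname1} {surname2} {fm}",
    ]
    if m:
        # Variant indexed under the middle-name initial only.
        medline_ime.append(f"{surname1} {m}")

    sci = [
        f"{surname1} {f}",
        f"{surname1} {fm}",
        f"{fused} {f}",
        f"{fused} {fm}",
        # Second surname as entry element, first surname as an initial.
        f"{surname2} {fm}{surname1[0].upper()}",
    ]

    # Remove duplicates (they arise when there is no middle name)
    # while keeping the recommended search order.
    return {
        "MEDLINE/IME": list(dict.fromkeys(medline_ime)),
        "SCI": list(dict.fromkeys(sci)),
    }


if __name__ == "__main__":
    variants = search_variants("Antonio", "María", "Caballero", "Plasencia")
    for database, names in variants.items():
        print(database, "->", "; ".join(names))
```

For Antonio María Caballero Plasencia the helper returns the same ten variants given above. For an author with no middle name the lists shrink, because the variants that depend on the middle initial collapse into the others.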
Acknowledgments
We thank Elvira Ruiz de Osma Delatas of the Departamento de Biblioteconomía y Documentación, University of Granada, for locating the database records used in this analysis; Antonio Campos, dean of the School of Medicine at the University of Granada at the time the data were collected, for providing us with a list of all potential authors affiliated with the School of Medicine; and K. Shashok for translating the original manuscript into English.
APPENDIX
Full correct name, standard structure, and frequency of different variants that appeared during the ten-year period from 1987 to 1996 in each database for each author affiliated with the School of Medicine, University of Granada
Footnotes
† Mistyped variants.
* Doubtful variants for this author.
Contributor Information
R. Ruiz-Pérez, Email: rruiz@ugr.es.
E. Delgado López-Cózar, Email: edelgado@ugr.es.
E. Jiménez-Contreras, Email: evaristo@ugr.es.
REFERENCES
- Williams ME, Lannom L. Lack of standardization of the journal title data element in data bases. J Am Soc Inf Sci. 1981 May; 32(3):229–33.
- Hawkins DT. Unconventional uses of on-line information retrieval systems: on-line bibliometric studies. J Am Soc Inf Sci. 1977 Jan; 28(1):13–8.
- Smith LC. Citation analysis. Libr Trends. 1981 Summer; 30(1):83–106.
- Galbán C, Vázquez M. Las bases de datos como fuentes de información para estudios bibliométricos. Bol Anabad. 1988;38(1–2):369–81.
- Macroberts MH, Macroberts B. Problems of citation analysis: a critical review. J Am Soc Inf Sci. 1989 Sep; 40(5):342–9.
- Moed HF, Viriens M. Possible inaccuracies occurring in citation analysis. J Inf Sci. 1989;15(2):95–117.
- Rice RE, Borgman CL, Bednarski D, and Hart PJ. Journal-to-journal citation data: issues of validity and reliability. Scientometrics. 1989 Mar; 15(3–4):257–82.
- Lardy JP, Herzhaft L. Bibliometric treatments according to bibliographic errors and data heterogeneity: the end-user point of view. In: Online Information 92, Proceedings of the 16th International Online Information Meeting; London, U.K.; 8–10 Dec 1992. Oxford, NJ: Learned Information, 1992: 547–56.
- Sancho R. Indicadores bibliométricos utilizados en la evaluación de la ciencia. Rev Esp Doc Cient. 1990;13(3–4):842–65.
- López Piñero JM, Terrada ML. Los indicadores bibliométricos y la evaluación de la actividad médico-científica: (III) los indicadores de producción, circulación y dispersión, consumo de la información y repercusión. Med Clín (Barc). 1992 Feb 1; 98(4):142–8.
- Jeannin PH. L'evaluation quantitative de la recherche en sciences sociales et humaines. In: Revue de sciences sociales et humaines. Actes du séminaire “La communication et l'information scientifiques entre spécialistes” (1991–1992). Toulouse: IUT, Université de Toulouse III, 1992:42.
- Pulido M, González JC, and Sanz F. Errores en las referencias bibliográficas: un estudio retrospectivo en Medicina Clínica (1962–1992). Med Clín (Barc). 1995 Feb 11; 104(5):170–4.
- Vázquez M, Galbán C. Lack of standardisation in the corporate source field of different databases. In: Proceedings of the 10th International Online Information Meeting; London, U.K.; 2–4 Dec 1986:335–52.
- Rice RE, Borgman CL, Bednarski D, and Hart PJ. Journal-to-journal citation data: issues of validity and reliability. Scientometrics. 1989 Mar; 15(3–4):257–82.
- Hudnut SK. Should journal references be standardized? In: Proceedings of the 12th National Online Meeting 1991. Medford NJ: Learned Information, 1991:149–55.
- Rittberger M, Rittberger W. Measuring quality in the production of databases. J Inf Sci. 1997;23(1):25–37.
- Spinak E. Errores ortográficos en el ingreso en bases de datos. Rev Esp Doc Cient. 1995;18(3):307–19.
- Bell J, Speer S. Bibliographic verification for interlibrary loan: is it necessary? Coll Res Libr. 1988 Nov; 49(6):494–500.
- Fuller EE. Variation in personal names in works represented in the catalog. Cat Class Quart. 1989;9(3):75–95.
- Weintraub TS. Personal name variations: implications for authority control in computerized catalogs. Libr Resour Tech Serv. 1991 Apr; 35(2):217–28.
- Jones EA. Consistency in choice and form of main entry: a comparison of Library of Congress and British Library monograph cataloging. Libr Resour Tech Serv. 1992 Apr; 36(2):209–23.
- Piternick AB. What's in a name? use of names and titles in subject searching. Database. 1985 Dec; 8(4):22–8.
- Piternick AB. Name of an author! Indexer. 1992 Oct; 18(2):95–100.
- Kotiaho JS, Tomkins JL, and Simmons LW. Unfamiliar citations breed mistakes. Nature. 1999 Jul 22; 400(6742):307.
- Corrochano LM. Spanish practice. Nature. 1996 Nov 14; 384(6605):106.
- Sellick JTC. Multiple author. Nature. 1996 Oct 17; 383(6601):569.
- Pilachowski DM, Everett D. What's in a name? looking for people online-social sciences. Database. 1985;8(3):47–65.
- Snow B. Caduceus: people in medicine names online. Online. 1986 Sep; 10(5):122–7.
- Meneghini R. Systematization of academic and scientific affiliation, or how to prevent data on your publications from being lost in the national and international data base. Braz J Med Biol Res. 1995 Jun; 28(6):617–9.
- D'Auria D. Six characters in search of an author [editorial]. Occup Med (Oxford). 1997 May; 47(4):195.
- Shore ML. Variation between personal name headings and title page usage. Cat Class Quart. 1984 Summer; 4(4):1–11.
- Sweetland JH. Errors in bibliographic citations: a continuing problem. The Libr Quart. 1989 Oct; 59(4):291–304.
- Borgman CL, Siegfried SL. Getty's synoname and its cousins: a survey of applications of personal name-matching algorithms. J Am Soc Inf Sci. 1992 Aug; 43(7):459–76.
- AACR2 1998. Anglo-American cataloguing rules. 2d ed., 1998 revision. Ottawa, ON, Canada: Canadian Library Association; London, U.K.: Library Association Publishing; Chicago, IL: American Library Association.
- RCE 1995. Reglas de Catalogación Españolas. Madrid, Spain: Dirección General del Libro, Archivos y Bibliotecas, 1995:431–454.
- International Federation of Library Associations and Institutions. Names of persons: national usages for entries in catalogues. London, U.K.: IFLA International Office for UBC, 1977:39–41.
- Gómez I, Ccma L, Morillo F, and Camí J. Medicina Clínica (1992–1993) vista a través del Science Citation Index. Med Clín (Barc). 1997 Oct 18; 109(13):497–505.
- López-Cózar E. Incidencia de la normalización de las revistas científicas en la transferencia y evaluación de la información científica. Rev Neurol. 1997 Dec; 25(148):1942–6.
- Salvadó Pérez L, Molina Troya J. ¿MEDLINE e Índice Médico Español son mutuamente excluyentes? Med Clín (Barc). 1997 Jan 18; 108(2):79.
- Brown CM. Complementary use of the SciSearch database for improved biomedical information searching. Bull Med Libr Assoc. 1998 Jan; 86(1):63–7.
- Terrada ML, Cueva A, Mota A, Osca MJ, Aleixandre R, Cebrian M, Gimeno E, Almero A, and Cussac N. La base de datos IME y el repertorio Índice Médico Español (1965–1992). In: Congreso y Conferencia FID XLVI 199; Madrid, Spain, 1992;5:210–6.
- Cueva A, Terrada ML. La documentación médica española. el Índice Médico Español y el estudio de la actividad científica. Cuad Salud. 1991;13:121–6.
- Pulido M. Index Medicus: cobertura y manejo. Med Clín (Barc). 1987 Mar 28; 88(12):500–4.
- Jorda M. Documentación biomédica: estructura y funcionamiento de las bases de datos bibliográficas. Med Clín (Barc). 1991 Sep 7; 97(7):265–71.
- Pestaña A. El MEDLINE como fuente de información bibliométrica de la producción Española en biomedicina y ciencias médicas. comparación con el Science Citation Index. Med Clín (Barc). 1997;109(13):506–11.
- RCE 1995. Reglas de Catalogación Españolas. Madrid, Spain: Dirección General del Libro, Archivos y Bibliotecas, 1995:431–454.
- International Federation of Library Associations and Institutions. Names of persons: national usages for entries in catalogues. London, U.K.: IFLA International Office for UBC, 1977:39–41.
- O'Neill ET, Vizine-Goetz D. Quality control in online database. Annu Rev Inform Sci Tech. 1988;23:125–47.
- Oddy P. Authority control in the local, national, and international environment. In: Standards for the international exchange of bibliographic information. London, U.K.: Library Association Publishing, 1991:66–72.
- Jacso P. Content evaluation of databases. Annu Rev Inform Sci Tech. 1997;32(Chap 5):231–67.
- NLM Cataloging Section. Cataloging system. [Web document]. [rev. 16 Jan 2001; cited 18 Jun 2002]. <http://www.nlm.nih.gov/tsd/cataloging/topics.html#CatSys>.
- Program for Cooperative Cataloging. NACO. [Web document]. [rev. 16 Jan 2001; cited 18 Jun 2002]. <http://www.loc.gov/catdir/pcc/naco.html>.
- AACR2 1998. Anglo-American cataloguing rules. 2d ed., 1998 revision. Ottawa, ON, Canada: Canadian Library Association; London, U.K.: Library Association Publishing; Chicago, IL: American Library Association.
- MARC 21 Concise Bibliographic. Main entry personal name. [Web document]. [rev. 16 Jan 2001; cited 18 Jun 2002]. <http://lcweb.loc.gov/marc/bibliographic/ecbdmain.html#mrcb100>.
- Williams RM. (ISI Europe, England). Indexing rules. Email (rwilliams@smtpgwy.isinet.com) sent to: rruiz@ugr.es. 19 May 1998.
- Williams RM. (ISI Europe, England). Indexing rules. Email (rwilliams@smtpgwy.isinet.com) sent to: rruiz@ugr.es. 19 May 1998.
- Williams RM. (ISI Europe, England). Indexing rules. Email (rwilliams@smtpgwy.isinet.com) sent to: rruiz@ugr.es. 19 May 1998.
- Silva GA. Nombres de pila completos: las iniciales no bastan. Med Clín (Barc). 1992 Oct 10; 99(11):435.