Medical record linkage in health information systems by approximate string matching and clustering. (41/196)

BACKGROUND: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity. METHODS: The proposed method has three steps. The first step is to standardise and to index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pairs of records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. RESULTS: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. CONCLUSION: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
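
The three steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (surname, given name, birth year), the blocking key (first letter of the surname plus birth year), the 0.9 threshold and the function names are all assumptions made for illustration. Plain Jaro-Winkler stands in for the "Porter-Jaro-Winkler" variant, and connected components of the match graph stand in for the agglomerative and partitioning methods named in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s, t):
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [c for c, hit in zip(s, s_hit) if hit]
    t_m = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def standardise(name):
    """Step 1a: upper-case and keep letters only."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def link_records(records, threshold=0.9):
    """Return clusters of record indices believed to share one identity."""
    # Step 1b: blocking, so that only records sharing a key are compared.
    blocks = defaultdict(list)
    for idx, (surname, _given, year) in enumerate(records):
        blocks[(standardise(surname)[:1], year)].append(idx)

    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: pairwise similarity (mean over the two name fields).
    for members in blocks.values():
        for a, b in combinations(members, 2):
            score = (
                jaro_winkler(standardise(records[a][0]), standardise(records[b][0]))
                + jaro_winkler(standardise(records[a][1]), standardise(records[b][1]))
            ) / 2
            if score >= threshold:
                parent[find(a)] = find(b)  # Step 3: merge into one cluster

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical identities, invented for this example only.
records = [
    ("Dupont", "Marie", 1970),
    ("Dupond", "Marie", 1970),
    ("DUPONT", "Maria", 1970),
    ("Durand", "Paul", 1970),
    ("Martin", "Luc", 1982),
]
print(link_records(records))  # [[0, 1, 2], [3], [4]]
```

Blocking keeps the number of pairwise comparisons small, at the cost of missing duplicates whose blocking fields themselves differ; a real system would typically combine several blocking keys.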

BioThesaurus: a web-based thesaurus of protein and gene names. (42/196)

BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY: BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus

PSE: a tool for browsing a large amount of MEDLINE/PubMed abstracts with gene names and common words as the keywords. (43/196)

BACKGROUND: MEDLINE/PubMed (hereinafter called PubMed) is one of the most important literature databases for the biological and medical sciences, but it is impossible to read all related records due to the sheer size of the repository. We usually have to repeatedly enter keywords in a trial-and-error manner to extract useful records. Software which can reduce such a laborious task is therefore required. RESULTS: We developed web-based software, the PubMed Sentence Extractor (PSE), which parses a large number of PubMed abstracts, extracts and displays the sentences in which gene names co-occur with other keywords, and shows some information from EntrezGene records. The results link to whole abstracts and to other resources such as Online Mendelian Inheritance in Man and Reference Sequence. PSE evaluates the presence of keywords at the sentence level, whereas the popular PubMed search operates at the record level. The relationship between two keywords, a gene name and a common word, is therefore captured more accurately by PSE than by PubMed. In addition, PSE shows the list of keywords and considers synonyms and variations of gene names. Through these functions, PSE would reduce the task of searching through records for gene information. CONCLUSION: We developed PSE in order to extract useful records efficiently from PubMed. This system has four advantages over a simple PubMed search: the reduction in the amount of collected literature, the display of keyword lists, the consideration of synonyms and variations of gene names, and the links to external databases. We believe PSE is helpful in collecting the necessary literature efficiently in order to find research targets. PSE is freely available under the GPL licence as additional files to this manuscript.
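
The difference between record-level and sentence-level matching described in this abstract can be sketched in a few lines of Python. This is only an illustration of the idea, not PSE's code: the naive sentence splitter, the function names and the example abstract are assumptions, and gene synonyms are simply passed in as a list.

```python
import re


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def _word_pattern(terms):
    """Case-insensitive whole-word pattern matching any of the given terms."""
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def record_level_hit(abstract, gene_names, keyword):
    """True if a gene name and the keyword both occur anywhere in the abstract."""
    return bool(_word_pattern(gene_names).search(abstract)
                and _word_pattern([keyword]).search(abstract))


def cooccurrence_sentences(abstract, gene_names, keyword):
    """Sentences that contain both a gene name (or synonym) and the keyword."""
    genes = _word_pattern(gene_names)
    word = _word_pattern([keyword])
    return [s for s in split_sentences(abstract)
            if genes.search(s) and word.search(s)]


# Invented example text, not taken from any PubMed record.
abstract = (
    "TP53 is mutated in many tumours. "
    "Apoptosis was measured in all samples. "
    "Loss of p53 reduced apoptosis in vitro."
)
synonyms = ["TP53", "p53"]

print(record_level_hit(abstract, synonyms, "apoptosis"))
# True
print(cooccurrence_sentences(abstract, synonyms, "apoptosis"))
# ['Loss of p53 reduced apoptosis in vitro.']
```

A record-level search accepts the abstract as soon as both terms appear somewhere in it, whereas the sentence-level filter returns only the sentence in which they actually co-occur.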

Methodology to identify Iranian immigrants for epidemiological studies. (44/196)

Determining ethnic differences in cancer patterns using administrative databases is often a methodological challenge because information on ethnicity or place of birth is commonly lacking. This paper describes the approach we used to identify Iranians residing in British Columbia (BC), Canada, who were registered within the BC Cancer Registry. A listing of common Iranian surnames and given names was generated from two sources: a residential telephone book (with a high density of Iranians) and a provincial breast cancer screening program (which allowed for the selection of women born in Iran). Surnames and given names were reviewed manually and the Iranian names were identified and coded as 'highly probable' or 'probable' Iranian. A name directory was then created and linked with the BC Cancer Registry to identify Iranian cancer cases. Using this method, 1729 surnames and 737 given names were selected from the telephone book, and 1881 surnames and 757 given names from the screening program. The majority of these names were coded as 'highly probable' Iranian (98% and 96% for surnames and given names, respectively). 12% of surnames and 10% of given names were common to both sources. A listing of the most common Iranian surnames and given names is provided. In conclusion, in the absence of other ethnicity data, surnames and given names can be very helpful to identify persons of specific ethnicities when these ethnic groups have distinctive names.

Genetic signatures of coancestry within surnames. (45/196)

Surnames are cultural markers of shared ancestry within human populations. The Y chromosome, like many surnames, is paternally inherited, so men sharing surnames might be expected to share similar Y chromosomes as a signature of coancestry. Such a relationship could be used to connect branches of family trees, to validate population genetic studies based on isonymy, and to predict surname from crime-scene samples in forensics. However, the link may be weak or absent due to multiple independent founders for many names, adoptions, name changes and nonpaternities, and mutation of Y haplotypes. Here, rather than focusing on a single name, we take a general approach by seeking evidence for a link in a sample of 150 randomly ascertained pairs of males who each share a British surname. We show that sharing a surname significantly elevates the probability of sharing a Y-chromosomal haplotype and that this probability increases as surname frequency decreases. Within our sample, we estimate that up to 24% of pairs share recent ancestry and that a large surname-based forensic database might contribute to the intelligence-led investigation of up to approximately 70 rapes and murders per year in the UK. This approach would be applicable to any society that uses patrilineal surnames of reasonable time-depth.

How good is probabilistic record linkage to reconstruct reproductive histories? Results from the Aberdeen Children of the 1950s study. (46/196)

BACKGROUND: Probabilistic record linkage is widely used in epidemiology, but studies of its validity are rare. Our aim was to validate its use to identify births to a cohort of women drawn from a large cohort of people born in Scotland in the early 1950s. METHODS: The Children of the 1950s cohort includes 5868 females born in Aberdeen 1950-56 who were in primary schools in the city in 1962. In 2001 a postal questionnaire was sent to the cohort members resident in the UK requesting information on offspring. Probabilistic record linkage (based on surname, maiden name, initials, date of birth and postcode) was used to link the females in the cohort to birth records held by the Scottish Maternity Record System (SMR 2). RESULTS: We attempted to mail a total of 5540 women; 3752 (68%) returned a completed questionnaire. Of these, 86% reported having had at least one birth. Linkage to SMR 2 was attempted for 5634 women; one or more maternity records were found for 3743. There were 2604 women who reported at least one birth in the questionnaire and who were linked to one or more SMR 2 records. When judged against the questionnaire information, the linkage correctly identified 4930 births and missed 601 others. These mostly occurred outside of Scotland (147) or prior to full coverage by SMR 2 (454). There were 134 births incorrectly linked to SMR 2. CONCLUSION: Probabilistic record linkage to routine maternity records applied to a population-based cohort, using name, date of birth and place of residence, can have high specificity, and as such may be reliably used in epidemiological research.
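
As a worked check on the counts reported in this abstract, the short Python sketch below derives two summary measures from them, taking the questionnaire as the reference standard. The function name and the choice of measures are assumptions made for illustration. The abstract itself reports neither figure, and specificity in the strict sense cannot be computed from these counts because no true-negative count is given.

```python
def linkage_metrics(correct_links, missed, false_links):
    """Sensitivity and positive predictive value from linkage counts.

    sensitivity = correct / (correct + missed)
    ppv         = correct / (correct + false links)
    """
    sensitivity = correct_links / (correct_links + missed)
    ppv = correct_links / (correct_links + false_links)
    return sensitivity, ppv


# Counts taken from the RESULTS above: 4930 births correctly identified,
# 601 missed, 134 incorrectly linked.
sensitivity, ppv = linkage_metrics(4930, 601, 134)
print(f"sensitivity = {sensitivity:.3f}")  # sensitivity = 0.891
print(f"PPV         = {ppv:.3f}")          # PPV         = 0.974
```

On these numbers roughly 89% of questionnaire-reported births were found by the linkage, and about 97% of linked births were correct.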

Do surgeons wish to become doctors? (47/196)

OBJECTIVES: To gauge opinion among otolaryngologists about their wish to retain the title Mr, Miss, Ms or Mrs or to adopt the title of doctor. DESIGN: An e-mail questionnaire sent to all members of ENT-UK (The British Association of Otolaryngologists-Head and Neck Surgeons) who had registered an e-mail address with the ENT-UK secretariat. SETTING: The specialty group of otolaryngologists in the UK. PARTICIPANTS: 723 recipients of e-mails, who were members or fellows of a surgical Royal College and, by convention in the UK, entitled to adopt the title Mr, Miss, Ms or Mrs. RESULTS: 304 recipients of the e-mail questionnaire responded. 39% were not aware of any proposals to change the convention so that surgeons would be addressed as 'doctor' in the future. Overall, 61.8% were in favour of retaining the current convention and retaining the title Mr or a female equivalent. Applying the null hypothesis that most surgeons would not like to change a title, the chi-squared test produced a highly significant P value of 0.0002. Of female respondents, however, only 43% supported retention of the current convention. Using Fisher's exact test to compare female and male respondents, the two-sided P value was highly significant at 0.006, with female respondents favouring the title of doctor. CONCLUSIONS: A large proportion of ENT surgeons in the UK responded to the questionnaire. Many were unaware of proposals to change the current convention of address for surgeons. A significant number of those responding were in favour of retaining the current convention. The small proportion of female respondents indicated a preference for being addressed as 'doctor'.

Admixture dynamics in Hispanics: a shift in the nuclear genetic ancestry of a South American population isolate. (48/196)

Although it is well established that Hispanics generally have a mixed Native American, African, and European ancestry, the dynamics of admixture at the foundation of Hispanic populations are heterogeneous and poorly documented. Genetic analyses are potentially very informative for probing the early demographic history of these populations. Here we evaluate the genetic structure and admixture dynamics of a province in northwest Colombia (Antioquia), which prior analyses indicate was founded mostly by Spanish men and native women. We examined surname, Y chromosome, and mtDNA diversity in a geographically structured sample of the region and obtained admixture estimates with highly informative autosomal and X chromosome markers. We found evidence of reduced surname diversity and support for the introduction of several common surnames by single founders, consistent with the isolation of Antioquia after the colonial period. Y chromosome and mtDNA data indicate little population substructure among founder Antioquian municipalities. Interestingly, despite a nearly complete Native American mtDNA background, Antioquia has a markedly predominant European ancestry at the autosomal and X chromosome level, which suggests that, after foundation, continuing admixture with Spanish men (but not with native women) increased the European nuclear ancestry of Antioquia. This scenario is consistent with historical information and with results from population genetics theory.