As the volume of scholarly data grows rapidly, scholarly digital libraries, which supply vast amounts of such data through convenient online interfaces, have become increasingly popular and important tools for researchers. However, because of the limitations of the naming conventions widely practiced in academic fields, it is often difficult to correctly identify the authors of a large number of publications when those authors have common names. In particular, conventions such as abbreviating first and middle names make it even harder to identify and distinguish authors whose names share the same representation (i.e., spelling).
Several disambiguation methods have been proposed to tackle this problem, but most of them require inputs that are rarely available in practice, such as the number of same-named authors, a training set, or rich information about the papers. Based on the assumption that coauthors are likely to write more than one paper together, we propose an autonomous approach that groups papers by the same author using only the most commonly available information, the author lists.
We employ several techniques to achieve this goal. First, we represent the input set of papers as a data matrix and reduce the dimension of the matrix to find groups of coauthors who frequently appear together. Second, we devise a relative correlation distance measure suited to the reduced space and apply it to density-based clustering, which is used to cluster papers with similar coauthors. Finally, we adopt a concept of summarization to represent a cluster of papers as a single vector.
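The three steps above can be illustrated with a minimal sketch. It is not the method itself: the abstract does not specify the reduction technique, the relative correlation distance, or the summarization rule, so the sketch assumes a truncated SVD for the reduction, substitutes plain cosine distance for the distance measure, uses a bare-bones DBSCAN-style procedure for the density-based clustering, and takes the mean vector of each cluster as its summary. All function names, parameter values (`k`, `eps`, `min_pts`), and the toy author lists are hypothetical.

```python
import numpy as np

def coauthor_matrix(papers):
    """Binary paper-by-coauthor matrix: entry (i, j) is 1 if coauthor j is on paper i."""
    names = sorted({a for p in papers for a in p})
    col = {a: j for j, a in enumerate(names)}
    X = np.zeros((len(papers), len(names)))
    for i, p in enumerate(papers):
        for a in p:
            X[i, col[a]] = 1.0
    return X

def reduce_dim(X, k):
    """Project papers onto the top-k singular directions (assumed reduction step)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def cosine_distance(Z, tol=1e-9):
    """Pairwise cosine distance; a stand-in for the relative correlation distance."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    safe = np.where(norms > tol, norms, 1.0)
    Zn = np.where(norms > tol, Z / safe, 0.0)  # near-zero rows match nothing
    D = 1.0 - np.clip(Zn @ Zn.T, -1.0, 1.0)
    np.fill_diagonal(D, 0.0)
    return D

def dbscan(D, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(D)
    labels, visited = [-1] * n, [False] * n
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point; may still be claimed as a border point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # expand through core points only
        cluster += 1
    return labels

# Toy input: coauthor lists of six papers that share one ambiguous author name
# (the ambiguous name itself is omitted from each list).
papers = [
    ["A. Lee", "B. Kim"],
    ["A. Lee", "B. Kim", "C. Park"],
    ["A. Lee", "C. Park"],
    ["X. Chen", "Y. Wang"],
    ["X. Chen", "Y. Wang"],
    ["Q. Roy", "R. Sen"],
]

Z = reduce_dim(coauthor_matrix(papers), k=2)
labels = dbscan(cosine_distance(Z), eps=0.5, min_pts=2)

# Summarize each cluster as a single vector (here, the mean of its members).
summaries = {c: Z[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
             for c in set(labels) if c != -1}
print(labels)
```

On this toy input, the first three papers fall into one cluster, the next two into another, and the last paper, whose coauthors appear nowhere else, is left as noise. The choice of `k=2` is deliberate here because there are two recurring coauthor groups; in practice the number of groups is unknown, which is the situation the autonomous approach is meant to handle.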
We evaluate our method using publication records for 11 ambiguous names and show that our approach achieves better disambiguation while maintaining high cluster purity compared with four other density-based clustering methods.