How accurate is the data in dblp?

Unfortunately, no metric or study exists that answers this question with scientific rigor. However, we put a lot of effort into a process designed to make the data in dblp as reliable as possible. To see how, please have a look at our data acquisition workflow:

dblp always indexes the tables of contents of complete proceedings or journal volumes in bulk. Usually, we obtain the necessary metadata for each volume directly from the publisher of the volume or the organizer of the event. In a smaller number of cases, metadata is submitted to dblp by volunteers from the community. Once we have obtained the data, an editor from the dblp team applies a rigorous data cleaning process. This process is supported by some simple algorithms that check the consistency of the data, but it is mainly carried out by hand. This manual curation process has four major goals:

Only after a full data cleaning pass has been applied are the new records added to the dblp data set. However, the data cleaning process does not end there. Over the next few days, special helper scripts iteratively monitor the newly added data for any suspicious signs of inconsistency. For example, we often observe a ripple effect: On the first day, a newly added publication helps to uncover previously unrecognized homonymous or synonymous data records. Once this information has been fixed, further inconsistencies in other records become evident on the next day, and so on. Such an iterative effect may last for several days.
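To illustrate the kind of check such a helper script might perform, here is a minimal Python sketch that flags possible synonymous author records, i.e. different spellings that may refer to the same person. It is not dblp's actual tooling: the function names `normalize_name` and `find_synonym_candidates` and the normalization heuristic (dropping diacritics, case, periods, and hyphens) are assumptions made purely for illustration.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Reduce an author name to a crude comparison key (hypothetical heuristic)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, treat periods and hyphens as spaces, collapse whitespace.
    tokens = stripped.lower().replace(".", " ").replace("-", " ").split()
    return " ".join(tokens)


def find_synonym_candidates(author_names):
    """Group distinct spellings that share the same normalized key.

    Returns only the groups with more than one spelling; these are
    candidates for review by a human editor, not confirmed synonyms.
    """
    groups = defaultdict(set)
    for name in author_names:
        groups[normalize_name(name)].add(name)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


if __name__ == "__main__":
    names = ["José García", "Jose Garcia", "Alice Smith", "J. Garcia"]
    for key, variants in find_synonym_candidates(names).items():
        print(key, "->", variants)
```

A check like this only produces candidates. Deciding whether two spellings really belong to the same person, or whether one name hides several people (a homonym), still requires the manual curation described above.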

In the end, the data listed in dblp should be at least as accurate as the data provided by the publishers; consider this a lower bound, if you will. Beyond that, we spend a lot of our limited resources on removing as many mistakes, misassignments, and inconsistencies as humanly possible.

maintained by Schloss Dagstuhl LZI at University of Trier