What do I find in dblp.xml?

The dblp XML format is modeled after the BibTeX *.bib file format. The format is defined in the DTD file in the same directory. Please understand that (by design) our DTD is not very strict: it places no restriction on element order or multiplicity, and it even allows nonsensical combinations (e.g., ‹school› tags in ‹article› elements, or ‹editor› and ‹author› elements in the same record) that you will never find in the actual dblp data set. Our priority was to keep the definition clean and simple, not to model every aspect of the publication landscape.

In general, our XML is a shallow but very long list of XML records. The root element has several million child elements, but usually no element is deeper than level three. An excerpt of the XML file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">


<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<journal>Commun. ACM</journal>


<inproceedings key="conf/focs/Yao82a" mdate="2011-10-19">
<title>Theory and Applications of Trapdoor Functions (Extended Abstract)</title>
<author>Andrew Chi-Chih Yao</author>


<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
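
Because the file is a shallow but very long list, it is usually more practical to stream it than to load it as one tree. The following is a minimal sketch of that idea in Python, not an official dblp tool. The names `SAMPLE` and `iter_records` are illustrative, and the sample reuses two records from the excerpt above with a ‹dblp› root and closing tags added so that it is well-formed. It also assumes that no named entities occur: the real file uses entities declared in the DTD, which a parser that does not load the DTD will reject.

```python
import io
import xml.etree.ElementTree as ET

# Well-formed stand-in for the excerpt; root and closing tags are added here.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<journal>Commun. ACM</journal>
</article>
<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
</www>
</dblp>
"""


def iter_records(source):
    """Yield (tag, key, mdate) for every child of the root element."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    depth = 0                # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 0:   # a level-1 record has just been closed
                yield elem.tag, elem.get("key"), elem.get("mdate")
                root.clear() # discard finished records to bound memory use


for record in iter_records(io.BytesIO(SAMPLE)):
    print(record)
```

Clearing the root after each record keeps memory use roughly constant, which matters when the root has several million children.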


Level 1: data records

The children of the root element represent the individual data records that are stored in dblp. In general, there are two types of records: publication records and person records.

Publication records are inspired by the BibTeX syntax and are given by one of the following elements:

Please note that while the BibTeX type of a record does define certain categories of dblp data records, these record categories are slightly different from the publication types that are used throughout the dblp website.
Please note that while there is a record type for proceedings volumes, there is no record type for journal volumes. Consequently, the dblp XML file contains no data entities for whole journal volumes or series. This is a (sometimes unfortunate) heritage of the BibTeX data model.

Person records are described separately here.

All records share a number of common attributes:

Level 2: bibliographic metadata

Record elements do not contain any text, but they contain a number of child elements to specify the record's bibliographic metadata entries. See the Wikipedia page on BibTeX to learn which data entries are meaningful in which record type.
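
As a rough illustration of this two-level layout, the sketch below turns one record into a plain Python dictionary. The helper name `record_to_dict` and the output shape are assumptions made for this example. Every field is collected into a list because the DTD does not restrict how often a child element may occur (a record may have several ‹author› elements, for instance).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# One record from the excerpt, with its closing tag added.
RECORD = """<inproceedings key="conf/focs/Yao82a" mdate="2011-10-19">
<title>Theory and Applications of Trapdoor Functions (Extended Abstract)</title>
<author>Andrew Chi-Chih Yao</author>
</inproceedings>"""


def record_to_dict(elem):
    """Collect the attributes and child elements of a single record."""
    fields = defaultdict(list)
    for child in elem:
        # itertext() also gathers text inside nested markup, if any.
        fields[child.tag].append("".join(child.itertext()))
    return {
        "type": elem.tag,
        "key": elem.get("key"),
        "mdate": elem.get("mdate"),
        "fields": dict(fields),
    }


record = record_to_dict(ET.fromstring(RECORD))
print(record["type"], record["key"], record["fields"]["author"])
```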

Note that, in contrast to BibTeX, there are no key elements, since the key is already an attribute of the record node. There is also a custom url element that specifies a local hyperlink relative to the dblp website's home page.
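
Since the url element holds a relative link, it has to be resolved against the home page before it can be followed. The sketch below shows one way to do this with the Python standard library. Both the base address `https://dblp.org/` and the example url value are assumptions made for illustration; substitute the home page of the dblp site you are actually using.

```python
from urllib.parse import urljoin

# Assumed home page of the dblp website; adjust if you use a mirror.
DBLP_HOME = "https://dblp.org/"


def absolute_url(local_url, base=DBLP_HOME):
    """Resolve the content of a url element against the home page."""
    return urljoin(base, local_url)


# Hypothetical url element content, used only to show the mechanics.
print(absolute_url("db/journals/cacm/cacm53.html#Gentry10"))
```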

Level 3: optional HTML markup

In the XML file, only title and booktitle elements may contain HTML markup, and only a select few markup elements are allowed:

In theory, the elements of this level may be nested arbitrarily deep to describe complex structures like formulas, e.g. ‹i›x‹sub›y‹sup›2‹/sup›‹/sub›‹/i› to describe an italic x with the subscript y². However, such cases are very rare.
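
When only the plain title text is needed, nested markup can be flattened regardless of its depth. The following sketch uses a made-up title containing the nested formula from above; `plain_text` and `markup_depth` are illustrative helper names, not part of any dblp tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical title element containing the nested formula markup.
TITLE = ET.fromstring(
    "<title>On <i>x<sub>y<sup>2</sup></sub></i> bounds</title>"
)


def plain_text(elem):
    """Concatenate all text inside elem, dropping the markup tags."""
    return "".join(elem.itertext())


def markup_depth(elem):
    """Return how many levels of markup are nested inside elem."""
    return max((1 + markup_depth(child) for child in elem), default=0)


print(plain_text(TITLE))
print(markup_depth(TITLE))
```

Note that flattening loses the sub- and superscript structure, so it suits indexing or search better than display.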


The dblp XML file is encoded in plain ASCII. Additional ISO/IEC 8859-1 (latin-1) characters are defined as named entities in the DTD and used whenever necessary.

At the moment, most parts of dblp are restricted to ISO-8859-1 (Latin-1) characters, i.e. the first 256 Unicode code points. With the exception of the ‹author› and ‹editor› elements, where you will still find only Latin-1 characters, you may find numerical entities outside of this range. For example, ‹title› elements may contain Greek letters like ε, or the ‹note› elements of a person record may contain a Chinese name in the original Unicode spelling. All characters beyond the Latin-1 range are given as numerical entities.
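
The encoding scheme described here can be sketched in a few lines of Python. This is an illustration only: it assumes that the named entities declared in the DTD follow the usual HTML names for Latin-1 characters (such as `&uuml;`), which should be checked against the DTD itself. The function name `to_dblp_ascii` and the sample string are made up for the example.

```python
import html
from html.entities import codepoint2name


def to_dblp_ascii(text):
    """Encode text as plain ASCII in the style described above."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "&<>":
            out.append(f"&{codepoint2name[cp]};")  # XML-special characters
        elif cp < 128:
            out.append(ch)                         # plain ASCII stays as-is
        elif cp <= 255 and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # Latin-1: named entity (assumed names)
        else:
            out.append(f"&#{cp};")                 # everything else: numeric entity
    return "".join(out)


encoded = to_dblp_ascii("Müller ε")
print(encoded)                 # M&uuml;ller &#949;
print(html.unescape(encoded))  # Müller ε
```

In practice, an XML parser that loads the DTD resolves these entities on its own; `html.unescape` is used here only to show the round trip.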

More information on the XML structure of the dblp records and several design decisions can be found in the following paper:

maintained by Schloss Dagstuhl LZI at University of Trier