The dblp XML format is modeled after the BibTeX *.bib file format. The format is defined in the DTD file in the same directory. Note that, by design, our DTD is not very strict: it places no restrictions on element order or multiplicity, and it even allows nonsensical combinations of child elements (e.g., school elements inside article elements, or editor and author elements in the same record) that you will never find in the actual dblp data set. Our priority was to keep the definition clean and simple, not to model every aspect of the publication landscape.
In general, our XML is a shallow but very long list of XML records. The root element has several million child elements, but usually no element is deeper than level three. An excerpt of the XML file looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
[...]
<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<pages>97-105</pages>
<year>2010</year>
<volume>53</volume>
<journal>Commun. ACM</journal>
<number>3</number>
<ee>http://doi.acm.org/10.1145/1666420.1666444</ee>
<url>db/journals/cacm/cacm53.html#Gentry10</url>
</article>
[...]
<inproceedings key="conf/focs/Yao82a" mdate="2011-10-19">
<title>Theory and Applications of Trapdoor Functions (Extended Abstract)</title>
<author>Andrew Chi-Chih Yao</author>
<pages>80-91</pages>
<crossref>conf/focs/FOCS23</crossref>
<year>1982</year>
<booktitle>FOCS</booktitle>
<url>db/conf/focs/focs82.html#Yao82a</url>
<ee>http://doi.ieeecomputersociety.org/10.1109/SFCS.1982.45</ee>
</inproceedings>
[...]
<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
<url>http://www.wisdom.weizmann.ac.il/~oded/</url>
</www>
[...]
</dblp>
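Because the file is a shallow but very long list of records, it is usually read as a stream rather than loaded as one tree. The following is a minimal sketch using Python's standard library, not an official dblp tool. The function name iter_records and the small in-memory sample are illustrative assumptions. The sample contains no named entities; the real file relies on entities declared in the DTD, which this standard-library parser does not load on its own, so parsing the full file needs additional entity handling.

```python
import io
import xml.etree.ElementTree as ET

# Record element names as listed in this document.
RECORD_TAGS = {
    "article", "inproceedings", "proceedings", "book",
    "incollection", "phdthesis", "mastersthesis", "www",
}

def iter_records(source):
    """Yield each record element (child of the root) one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in RECORD_TAGS:
            yield elem
            # Detach finished records from the root so memory stays bounded.
            root.clear()

# Tiny illustrative sample (hypothetical file content, ASCII only).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<year>2010</year>
</article>
<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
</www>
</dblp>
"""

keys = [rec.get("key") for rec in iter_records(io.BytesIO(SAMPLE))]
years = [rec.findtext("year") for rec in iter_records(io.BytesIO(SAMPLE))]
```

Clearing the root after each record is a design choice: without it, the parser keeps every processed record attached to the root, which defeats the purpose of streaming on a file with several million records.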
Level 1: data records
The children of the root element represent the individual data records that are stored in dblp. In general, there are two types of records: publication records and person records.
Publication records are inspired by the BibTeX syntax and are given by one of the following elements:
- article – An article from a journal or magazine.
- inproceedings – A paper in a conference or workshop proceedings.
- proceedings – The proceedings volume of a conference or workshop.
- book – An authored monograph or an edited collection of articles.
- incollection – A part or chapter in a monograph.
- phdthesis – A PhD thesis.
- mastersthesis – A Master's thesis. There are only very few Master's theses in dblp.
- www – A web page. There are only very few web pages in dblp. See also the notes on person records.
Person records are described separately here.
All records share a number of common attributes:
- key – The unique dblp key of this record.
- mdate – The date on which this record was last modified.
- publtype – An optional attribute that specifies whether a publication record is an informal publication, an encyclopedia entry, an editorial, etc.
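As a small illustration of the common attributes, the sketch below reads them from a single record element. The helper name record_attributes is hypothetical, and parsing mdate as an ISO date is an assumption based on the YYYY-MM-DD values shown in the excerpt.

```python
import datetime
import xml.etree.ElementTree as ET

def record_attributes(elem):
    """Return (key, mdate, publtype) for one record element."""
    key = elem.get("key")
    # Assumption: mdate is always formatted as YYYY-MM-DD, as in the excerpt.
    mdate = datetime.date.fromisoformat(elem.get("mdate"))
    # publtype is optional, so this is None when the attribute is absent.
    publtype = elem.get("publtype")
    return key, mdate, publtype

record = ET.fromstring(
    '<article key="journals/cacm/Gentry10" mdate="2010-04-26"/>'
)
attrs = record_attributes(record)
```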
Level 2: bibliographic metadata
Record elements do not contain any text, but they contain a number of child elements to specify the record's bibliographic metadata entries. See the Wikipedia page on BibTeX to learn which data entries are meaningful in which record type.
Note that, in contrast to BibTeX, there are no key elements, since the key is already an attribute of the record node. Also, there is a custom url element that specifies a local hyperlink relative to the dblp website's homepage.
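A relative url value can be turned into an absolute link by joining it with the homepage address. The sketch below assumes https://dblp.org/ as that base; the base address and the helper name resolve_url are assumptions for illustration and are not stated in this document.

```python
from urllib.parse import urljoin

# Assumption: the dblp homepage used as the base for relative url values.
DBLP_HOME = "https://dblp.org/"

def resolve_url(url_text, base=DBLP_HOME):
    """Resolve a url element's text against the assumed homepage."""
    # urljoin leaves absolute URLs (as found in www records) unchanged.
    return urljoin(base, url_text)

local_link = resolve_url("db/journals/cacm/cacm53.html#Gentry10")
external_link = resolve_url("http://www.wisdom.weizmann.ac.il/~oded/")
```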
Level 3: optional HTML markup
In the XML file, only title and booktitle elements may contain HTML markup, and only a few selected markup elements are allowed:
- ref – a pseudo-HTML markup to denote local hyperlinks within the dblp website (relative to the dblp website's homepage); requires the attribute href
- sup – superscript text
- sub – subscript text
- i – italics
- tt – monospace
In theory, the elements of this level may be nested arbitrarily deep to describe complex structures like formulas, e.g.
<i>x<sub>y<sup>2</sup></sub></i> to describe an italic x with the subscript y². However, such cases are very rare.
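Because titles may contain nested markup, reading only the direct text of a title element can drop part of the title. A minimal sketch of flattening a title to plain text follows; the helper name plain_title and the second sample title are hypothetical.

```python
import xml.etree.ElementTree as ET

def plain_title(title_elem):
    """Concatenate all text inside a title, ignoring the markup elements."""
    # itertext() walks nested elements in document order, including tails.
    return "".join(title_elem.itertext())

nested = ET.fromstring("<title><i>x<sub>y<sup>2</sup></sub></i></title>")
simple = ET.fromstring("<title>Computing <i>k</i>-means.</title>")

nested_text = plain_title(nested)
simple_text = plain_title(simple)
```

Flattening loses the subscript and superscript structure, so it suits indexing or searching rather than faithful display.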
The dblp XML file is encoded in plain ASCII. Additional ISO/IEC 8859-1 (latin-1) characters are defined as named entities in the DTD and used whenever necessary.
At the moment, most parts of dblp are restricted to ISO-8859-1 (latin-1) characters, i.e. the first 256 Unicode code points (U+0000 to U+00FF). With the exception of the author and editor elements, where you will still find only latin-1 characters, you may find numerical entities outside of this range. For example, title elements may contain Greek letters like ε, and the note elements of a person record may contain a Chinese name in its original Unicode spelling. All characters above U+00FF are given as numerical entities.
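Numerical entities are resolved by any conforming XML parser, so no DTD is needed for them. The sketch below shows a decimal reference being decoded and a hypothetical helper, beyond_latin1, that lists the characters of a string lying above U+00FF. The sample title is invented for illustration.

```python
import xml.etree.ElementTree as ET

def beyond_latin1(text):
    """Return the characters of text that lie above U+00FF."""
    return [ch for ch in text if ord(ch) > 0xFF]

# &#949; is the decimal reference for U+03B5 (Greek small letter epsilon).
title = ET.fromstring("<title>An &#949;-approximation scheme.</title>")

decoded = title.text
non_latin1 = beyond_latin1(decoded)
```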
More information on the XML structure of the dblp records and several design decisions can be found in the following paper: