A Critical Guide to the UniProtKB Flat-file Format

This Critical Guide briefly presents the need for biological databases and for a standard format for storing and organising biological data. Web-based interfaces have made databases more user-friendly, but knowing the underlying file format gives a deeper understanding of how to navigate and mine the information they contain, helping both humans and machines get the most out of them. This Guide explores the file format that underpins one of today’s most popular protein sequence databases – UniProtKB.

Specifically, this Guide introduces the concept of database ‘flat-files’ and examines features of the UniProtKB flat-file format. After reading this Guide, readers will be able to: i) identify key fields within UniProtKB/Swiss-Prot and UniProtKB/TrEMBL flat-files; ii) explain what these fields mean, what information they contain and what that information is used for; iii) analyse the information in different fields and infer structural and functional features of a sequence; iv) examine and investigate the provenance of annotations; and v) compare annotations at different time-points and evaluate the likely impact of annotation changes.