PubMed: a success or a tragedy?
When celebrating the 20 million articles last July (see my previous post), I missed this excellent review of the famous platform.
- A central index freely available globally: Many biomedical scientists probably take PubMed for granted, but try to imagine biology and medicine without it – we would struggle to find anything.
- Twenty million citations: That’s a lot of data and it’s growing at a rate of about one paper per minute (on average).
- More than a billion searches in 2009: That’s an average of 3.5 million searches per day, or about 40 searches per second.
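Taking the post’s figures at face value, a quick bit of arithmetic shows how the per-second and per-year rates follow from the headline numbers:

```python
# Sanity-checking the headline figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600

# ~3.5 million searches per day ...
searches_per_second = 3.5e6 / SECONDS_PER_DAY
# ... adds up to well over a billion searches per year.
searches_per_year = 3.5e6 * 365

# One new paper per minute is roughly half a million new records a year.
papers_per_year = MINUTES_PER_YEAR

print(f"{searches_per_second:.1f} searches/second")   # ~40.5
print(f"{searches_per_year:.2e} searches/year")       # ~1.28e+09
print(f"{papers_per_year:,} papers/year")             # 525,600
```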
- PubMed is too big and full of noise: Theodore Sturgeon’s law states that 90% of everything is rubbish. If correct, that would make around 18 million PubMed records worthless junk, but that won’t stop them cluttering up the database and your search results, making it harder to find what you want when you need it. Many of the papers indexed by PubMed are “salami-sliced” by publication-hungry scientists into the least publishable unit and have little or no real scientific value. Cameron Neylon calls this the discovery deficit; however you describe it, finding the information you need in PubMed can be frustratingly difficult (sometimes impossible), despite the redesigns. There is simply too much in PubMed to keep up with.
- PubMed is too small: Some people argue that an overly conservative indexing and editorial policy prevents PubMed from including much biomedically relevant literature published in physics, chemistry, mathematics, engineering and computer science journals. Currently, much of this literature is excluded from the database. What we really need is PubSCIENCE (covering the non-medical sciences), but that idea was tragically axed back in 2002.
- Identity crisis, ambiguous authors: PubMed has no reliable way to tell apart authors who share a name, so a search for a common name mixes together papers by many different people, and there is no unique author identifier to disambiguate them.
- Identity crisis, missing document identifiers: There are over forty million unique document identifiers in the form of DOIs. They are a useful way to uniquely identify papers on the Web and to link directly to their full content wherever it was originally published. But you might have trouble using DOIs in PubMed. Sometimes DOIs are left out of records altogether (see some random examples here). When they are included, they can get buried and are not very accessible: this record, for example, has a DOI, but you won’t find it anywhere in the default page served by PubMed, which means you can’t easily click through to the full text the DOI would take you to. In short, PubMed is not as well integrated with other databases as it could and should be.
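To make the DOI problem concrete, here is a minimal sketch of pulling a DOI out of a PubMed efetch XML record; the PMID and DOI values below are invented for illustration, though the `ArticleIdList` element is where PubMed’s XML carries article identifiers when they are present at all:

```python
import xml.etree.ElementTree as ET

# A pared-down PubMed efetch record. The PMID and DOI are made up for
# illustration, but the ArticleIdList layout follows the real schema.
SAMPLE_RECORD = """
<PubmedArticle>
  <MedlineCitation><PMID>12345678</PMID></MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">12345678</ArticleId>
      <ArticleId IdType="doi">10.1000/example.doi</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
"""

def extract_doi(record_xml):
    """Return the DOI from a PubMed record, or None if it was left out."""
    root = ET.fromstring(record_xml)
    node = root.find(".//ArticleIdList/ArticleId[@IdType='doi']")
    return node.text if node is not None else None

doi = extract_doi(SAMPLE_RECORD)
if doi:
    # A DOI resolves on the web via the dx.doi.org proxy.
    print(f"http://dx.doi.org/{doi}")
```

When the `doi` ArticleId is simply missing from a record, as it sometimes is, the lookup returns nothing and the click-through to the publisher’s full text is lost.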
- Mostly abstracts only: PubMed has 20 million freely available abstracts rather than 20 million full-text papers. Imagine how the rate of scientific discovery and invention might increase (and its cost decrease) if it were PubMed Central, not just PubMed, that had 20 million records. Alas, PubMed Central is currently closer to the 2 million mark than the 20 million mark, but it is growing rapidly thanks to deposition mandates and open access publishing.
- Ranking results: by default, PubMed ranks search results by date, but if Google did the same, very few people would bother to use it. Ranking results by relevance, using an algorithm more like PageRank, would be far more useful to many users, as demonstrated by Pierre Lindenbaum.
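For illustration, a toy sketch of relevance ranking (plain TF-IDF scoring, far simpler than PageRank, over an invented three-abstract corpus) shows how results could be ordered by how well they match the query rather than by date:

```python
import math
from collections import Counter

# A toy corpus standing in for PubMed abstracts (text invented for illustration).
ABSTRACTS = {
    "pmid:1": "zinc finger proteins and gene regulation",
    "pmid:2": "gene expression in yeast",
    "pmid:3": "clinical trial of a new statin",
}

def tokenize(text):
    return text.lower().split()

def rank_by_relevance(query, docs):
    """Order documents by summed TF-IDF score of the query terms."""
    n = len(docs)
    # Document frequency: in how many abstracts does each term appear?
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        # Rarer terms (low df) contribute more to the score.
        scores[doc_id] = sum(
            tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tokenize(query)
        )
    return sorted(docs, key=lambda d: scores[d], reverse=True)

print(rank_by_relevance("gene regulation", ABSTRACTS))
# ['pmid:1', 'pmid:2', 'pmid:3']
```

The abstract matching both query terms comes first, the partial match second, and the irrelevant one last, regardless of publication date.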
- Text mining and ontologies: We’ve still a long way to go before fully exploiting the possibilities offered by text-mining and ontologies to allow PubMed users to semantically search and browse the data. MeSH is just the beginning but that’s another story…
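As a hint of what ontology-aware search could offer, here is a toy sketch of MeSH-style query expansion; the hierarchy fragment below is invented and vastly smaller than the real MeSH tree, but the idea is the same: a search on a broad heading also matches records indexed under its narrower descendants.

```python
# A tiny invented fragment of a MeSH-like hierarchy, mapping each
# heading to its narrower child headings.
MESH_CHILDREN = {
    "Neoplasms": ["Carcinoma", "Sarcoma"],
    "Carcinoma": ["Adenocarcinoma"],
    "Sarcoma": [],
    "Adenocarcinoma": [],
}

def expand_term(term, tree):
    """Return the term plus all of its descendants in the hierarchy."""
    terms = [term]
    for child in tree.get(term, []):
        terms.extend(expand_term(child, tree))
    return terms

# A query for the broad heading is expanded to cover the narrower ones too.
print(expand_term("Neoplasms", MESH_CHILDREN))
# ['Neoplasms', 'Carcinoma', 'Adenocarcinoma', 'Sarcoma']
```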
PubMed represents a substantial fourteen years of work that continues to bring significant benefits to scientists around the world. There is plenty of room for improvement, but it’s hard to imagine Life® without PubMed®.
Duncan Hull. Twenty million papers in PubMed: a triumph or a tragedy? O’Really, posted online on July 27, 2010.