When I was a young research associate, having just started down a career in molecular biology and facing the insurmountable task of figuring out what to work on, I received some of the best research advice of my life from an equally young and unequivocally wise postdoc. As I sat at the lab bench one afternoon, complaining that I just couldn't think of anything original to work toward, he advised me to start reading as many papers as I could and to write down every question and idea those articles prompted. Then, he told me, I should search PubMed for each of those ideas to see whether someone else had already published on it. To be sure, he continued, at the early stages of the search I would find that my ideas had been published 10 or so years earlier. But as I became more nuanced and cognizant of the literature, that gap would steadily shrink until I was asking questions that were being answered in the last few months of publications. And finally, after all of that, I would have an idea or a question to which I could not find the answer. That question would be my first testable hypothesis, something to build a graduate or postgraduate life around.

IBM Watson - Hypothesis Generation

At the time, I found that answer daunting. Exactly how much did I really have to read? And when would I find the time to read all those papers when I was also expected to get work done in the lab? To many scientists, the answers to those two questions are no mystery: a LOT, and every available waking minute. But realistically, there is virtually no way a researcher in today's world can keep up with literature that is being produced at an ever-expanding rate. Roughly 1.35 million articles were estimated to be published annually as of 2008, and there were estimated to be more than 50 million journal articles in existence as of 2009! Small wonder, then, that as scientists we have become increasingly specialized in increasingly narrow fields. There is simply too much to take in and no realistic way for our brains to adequately absorb it all.

Even specialized fields can be inundated with such a surplus of journal articles that any review process becomes overwhelming. Article-rich areas, such as biomedical research, often comprise tens of thousands of published studies and reports. For perspective, in a field with just 10,000 published articles, a grad student would need to read nine papers a day, every day, for three years straight! And for a heavily studied protein like p53, there are an estimated 70,000 published papers, which would require a scientist to read almost 20 papers a day for 10 years! Sifting through that many papers is realistically something only a computer could adequately accomplish.
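For the skeptical, the arithmetic behind those figures is easy to check. Here is a quick back-of-the-envelope calculation; the field sizes are the estimates quoted above, and the rest is simple division:

```python
# Back-of-the-envelope check of the reading workload quoted above.
def papers_per_day(total_papers: int, years: float) -> float:
    """Papers to read daily to cover a field of `total_papers` in `years`."""
    return total_papers / (years * 365)

print(f"{papers_per_day(10_000, 3):.1f} papers/day for 3 years")    # ~9.1
print(f"{papers_per_day(70_000, 10):.1f} papers/day for 10 years")  # ~19.2
```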

Instead, most of us cherry-pick the papers we read, skipping over (or perhaps never even noticing) papers that we think provide less value or that may be harder to read for whatever reason. That behavior, as necessary as it is, may lead us down research paths that have already been shown to produce negative or false-positive results.

These problems are not new, nor have they gone unnoticed. Developers and scientists have been working for decades on better, more discerning search algorithms to parse out the relevant research from the bulk, to separate the wheat from the chaff, so to speak. Increasingly, these search algorithms are also being used to find novel research opportunities: avenues that may simply have been overlooked or lost in the shuffle of constant research results. Generally speaking, these systems have been collectively called "Automated Hypothesis Generators." To one degree or another, they work about as well as any Google search might, generating a list of papers to sift through and assigning each paper some sort of relevance score that hopefully corresponds to the original search.
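To make that concrete, here is a minimal sketch of the kind of relevance ranking such a search system performs, using TF-IDF and cosine similarity as a stand-in for whatever far more sophisticated scoring a production engine uses. The abstracts below are invented placeholders, not real papers:

```python
# A toy relevance ranker in the spirit of classic literature search:
# score each abstract against a query with TF-IDF and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "CHK2 phosphorylates p53 in response to DNA damage.",
    "A survey of yeast metabolic pathways under heat stress.",
    "ATM kinase activity stabilizes p53 after ionizing radiation.",
]
query = "kinases that phosphorylate p53"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(abstracts)
query_vector = vectorizer.transform([query])

# Rank abstracts from most to least similar to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, abstract in sorted(zip(scores, abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

A ranker like this is only as good as the word overlap between the query and the papers, which is exactly the limitation described next.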

Search queries are just queries, though. They do not typically possess any understanding of the subtleties of our quixotic languages. They can spit out paper after paper that matches a word or even a phrase, but they can't answer a simple question: "What [insert gene or protein of interest here] should I look for next?" There is, however, one computer that has shown it can do just that: IBM's Watson.

Most of us will probably remember Watson from its debut on Jeopardy! several years ago, when the computer managed to beat two of the all-time best Jeopardy! champions, Brad Rutter and Ken Jennings, a feat long believed impossible. It was an amazing achievement, not just because a computer outperformed humans at trivia, but because Watson could reason through spoken or written language and understand its context well enough to generate correct answers.

Watson achieves this through a variety of technologies, including natural language processing (NLP), machine learning, and IBM's DeepQA open-domain question-answering architecture. These "programs" are backed by 2,880 processor cores and 16 terabytes of RAM (which still isn't enough to merit a spot on the Top500 supercomputer list) and can reportedly process the equivalent of a million books per second. But it's the human-question/computer-answer interaction that really makes Watson stand out from other powerhouse computer systems. Such a system holds potentially vast opportunities for numerous fields of study, and researchers led by Olivier Lichtarge at the Baylor College of Medicine wanted to see exactly how useful Watson could be to the scientific world for hypothesis generation.

When you're a computer, scientific literature has an advantage over many other forms of writing. The dry, bare-bones, strictly analytical style of a medical study is the bane of sleep-deprived grad students everywhere, but it is also the perfect read for a computer that traditionally struggles with things like obscure allusions or tone. Better still, the abstracts of those papers, written as terse and tidy summaries, drill down to the bare essentials a computer actually needs in order to comprehend the primary points of an article.

Lichtarge's group decided to test Watson with a selection of papers on p53 kinases. The p53 protein sits at the center of the body's defenses against cancer: it responds to genomic damage, such as that found in precancerous cells, and dispatches hundreds of other proteins to correct the error or, if that fails, to force the affected cell to kill itself. There are more than 500 known kinases, and currently only 33 of those are known to modify p53. Finding novel, testable p53 kinases would be a boon to cancer research and an excellent proof of concept of Watson's power and resourcefulness. Lichtarge's team trimmed its list of papers to those covering 259 kinases, 23 of which were known p53 kinases. Next, the team restricted Watson to papers written before 2003, when only 10 of those p53 kinases had been discovered; 9 more were identified over the following decade.
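Conceptually, the retrospective design is simple, even though Watson's actual text-mining machinery is far more elaborate. The sketch below, with invented kinase names and one-line stand-ins for decades of abstracts, shows the shape of the experiment: profile each kinase from pre-2003 text only, rank candidates by their similarity to the already-known p53 kinases, and see whether later discoveries rank highly:

```python
# Sketch of the retrospective test: profile each kinase from pre-2003
# abstracts only, score candidates against known p53 kinases, and check
# whether post-2003 discoveries rank near the top. All names and text
# below are invented placeholders, not data from the actual study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# kinase -> concatenated text of its pre-2003 abstracts (stand-ins)
profiles = {
    "KINASE_A": "phosphorylates p53 after DNA damage checkpoint arrest",
    "KINASE_B": "regulates the DNA damage checkpoint and genome stability",
    "KINASE_C": "controls glucose metabolism in muscle tissue",
}
known_p53_kinases = ["KINASE_A"]       # already known before the cutoff
candidates = ["KINASE_B", "KINASE_C"]  # to be ranked

names = list(profiles)
vectors = TfidfVectorizer().fit_transform(profiles[n] for n in names)
row = {n: i for i, n in enumerate(names)}
known_rows = vectors[[row[n] for n in known_p53_kinases]]

# Score each candidate by its mean similarity to the known p53 kinases;
# a high-scoring candidate later confirmed in the lab counts as a hit.
for cand in candidates:
    score = cosine_similarity(vectors[row[cand]], known_rows).mean()
    print(f"{cand}: {score:.2f}")
```

In the real study the similarity was, of course, built from far richer features than raw word overlap, but the leave-out-the-future evaluation logic is essentially this.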

Watson was able to crunch those pre-2003 abstracts and accurately predict 7 of the 9 "undiscovered" p53 kinases! Imagine what the medical field would look like if scientists didn't have to spend a decade of research to discover just one novel protein, followed by another decade of clinical trials to determine its effectiveness. Imagine how many of those novel proteins end up as dead ends in clinical trials, and all the time and money wasted investigating them. Watson could effectively eliminate that guessing, shrink the lead time from discovery to production, and cut the false positives that invariably drive up drug costs!

The future of our research and discovery is upon us! Over the last 70 years, computers have progressively helped us save, catalog, categorize, and locate the information that the human race has meticulously documented over the last 5,000 years. But our ability to mentally codify and comprehend our own ever-increasing accrual of knowledge is failing to keep pace. With computers such as Watson, we are standing on the cusp of an age of enlightenment in which machines correlate our abundance of knowledge and usher in an exponential era of discovery. It is an era I am excited to watch unfold.

As that most famous detective himself remarked, “It is one of those instances where the reasoner can produce an effect which seems remarkable to his neighbor, because the latter has missed the one little point which is the basis of the deduction.” Elementary indeed, Watson.

Spangler, S., Wilkins, A. D., Bachman, B. J., Nagarajan, M., Dayaram, T., Haas, P., ... & Lichtarge, O. (2014, August). Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1877-1886). ACM.

Björk, B.-C., Roos, A., & Lauri, M. (2008). Global annual volume of peer reviewed scholarly articles and the share available via different open access options. In Sustainability in the Age of Web 2.0: Proceedings of the 12th International Conference on Electronic Publishing, Toronto, Canada.

Jinha, A. E. (2010). Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing, 23(3), 258-263.
