
by the R&D Informatics team

1. Introduction

A. Why integrate data in the first place?

The pharmaceutical industry has always been “data driven.” This is down to the business being staffed by scientists. Our training leads us to form a hypothesis and then run a set of experiments to prove or disprove it. Each experiment generates a bunch of data which is analyzed and, should we find something interesting, published. Being scientists, we try to take the same approach to our business decisions: we gather the data, analyze it, make the decision and, should the decision be yes, we prepare some kind of internal document to prove our hypothesis right. The resulting data landscape in Pharma has an organic feel to it. Drug programs accrete mini-ecosystems of experimental results, reports and analyses that vary in size depending on how far along the pipeline they get. The issue is that each piece of information was generated for its specific “experiment,” and generally shelved as soon as the decision point was passed. When we look at the opportunities that a translational view might bring, e.g., to allow us to revisit an old drug program in the light of new knowledge, or to figure out where in the program we should have known about the issue that we are now seeing in the clinic, we find ourselves having to connect all these data silos together.

B. Typical Issues

i. Not knowing what data you have

It sounds like a simple matter, but in reality it can be a major task to identify all the data that you might want to integrate. Many integration projects have high-level goals like “consolidate all our biomarker content” without really appreciating what that actually means. Projects can stall very early on while analysts try to figure out where the data actually resides. Not all of that data will necessarily be in-house. Many essential data sources are only available from commercial vendors or are in the public domain. Some sources, commercial and internal, have specific requirements about who can access the data and for what. Traceability therefore becomes very important, and more so as organizations become more collaborative. Not only do you need to know the source for traceability – you may need to back that data out at some point.

ii. Not understanding the use

Another common “gotcha” is where the team are unclear about what success actually looks like. The Big Data literature evangelizes the idea that answers will just appear automagically out of a big enough data lake – all you need is an army of data scientists to “fish” in it. Most real-world integration projects aren’t so ambitious, but there is still a world of difference between aggregating data and putting search and/or dashboards on top, and building a data repository with some kind of programmatic access, e.g., via an API, that users can interrogate. Doing the first when you need the second will probably end with you redoing the whole project from scratch; the other way around is a waste of time, money and resources.

iii. Not realizing how ugly the data really is

How many projects have started with a small sample of data, had great early success and then unexpectedly hit treacle when the “small” job of populating the model started? Quite a lot, in my experience. Every connection, constraint or validation in your data model adds value to the data but simultaneously creates overhead and exceptions, especially when you start dealing with old data. A modicum of data forensics at the start of a project can pay dividends later on.

2. Lightweight or Heavyweight Integration

There is a natural inclination to address the opportunity by integrating everything. That is a dangerous road! If you don’t integrate data for a specific purpose or a set of related purposes, you find yourself very quickly between the Scylla of a huge unstructured data mart that merely provides a bigger data graveyard than you started with and the Charybdis of an over-engineered, hard-to-change data repository that eventually everyone works around because it's just too hard to work with.

How close do you want to steer to Scylla or Charybdis? Many informaticians live in the world of lightweight integration – just enough data ingestion, cleaning, mapping and analysis to get the job done. There are some great tools out there for this kind of integration: TIBCO’s Spotfire and Biovia’s Pipeline Pilot come to mind. Lightweight integration is generally easier and more goal-directed, hence it can usually get funding from the project team.

Heavyweight integration is more of an enterprise project. It's going to involve more stakeholders, attract more expectations and require some kind of central funding to get done. Corporate IT are usually the sponsors because they recognize that data in one place is more leverageable and easier to secure. By their nature, these projects are expensive and take time to deliver, but a well-designed heavyweight integration becomes an asset that can be leveraged in many ways.

I generally favor middleweight integration. If you focus on a defined subset of all the data, you can build something that is genuinely useful in a reasonable timeframe. If you build out your data model in the right way, you can build other systems for each function and then interconnect them to achieve the benefits of broader data integration.

3. The Anatomy of a Data Integration Application

Pretty much any data integration application has the same logical architecture. You need some kind of mechanism for bringing the data in and doing all the cleaning and enrichment processes you need for the application. You need someplace to hold the data and you need a set of capabilities to make the data findable, present/report on it and, in many cases, allow your users to interact with the data programmatically.

Figure 1: Anatomy of a Data Integration Application
(Source: Clarivate Analytics Informatics Practice)
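To make those three layers concrete, here is a minimal sketch in Python. The names and the in-memory store are purely illustrative stand-ins for whatever ingestion, storage and access components you actually choose.

```python
# Purely illustrative skeleton of the three logical layers described above.

def ingest(raw_records):
    """Bring the data in: clean and enrich each raw record (assumed field names)."""
    for rec in raw_records:
        yield {"id": rec["id"].strip(), "value": rec.get("value"), "source": rec.get("source", "unknown")}

class Store:
    """Hold the data and make it findable."""
    def __init__(self):
        self._records = {}

    def put(self, rec):
        self._records[rec["id"]] = rec

    def find(self, **criteria):
        return [r for r in self._records.values()
                if all(r.get(k) == v for k, v in criteria.items())]

# Reporting, dashboards and programmatic (API) access all sit on top of Store.find().
store = Store()
for record in ingest([{"id": " cmpd-1 ", "value": 42, "source": "assay_db"}]):
    store.put(record)
```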

4. Bringing the Content In

A. Connectors/ETL

The Extract, Transform and Load (ETL) process that pulls the data from wherever you find it can be hard. It goes back to what I was saying earlier about ugly data. Each data source has to be analyzed to figure out its structure and how the data is populated over time. In a lot of cases, the analyst has to figure out what to do with missing values, and there is always a trade-off: do you bring in absolutely everything, or cherry-pick the data that looks the most valuable? That’s where a “non-destructive” data integration makes sense. If you leave the original silos untouched and merely bring in what you think you need, then the option of going back and fetching more data is still there. There was a time when most integration projects had, as a primary goal, the retirement of the systems being integrated. In the world of the cloud, this is no longer a priority. Provided the source system isn’t actually broken, it is better to leave it alone.
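As a sketch of what that “non-destructive” cherry-picking can look like, the snippet below pulls a handful of columns from a hypothetical legacy assay database and carries a pointer back to the source row so the data can be traced, or backed out, later. The database path, table and column names are assumptions, not a real schema.

```python
import sqlite3

# Hypothetical example: cherry-pick a few columns from a legacy assay database,
# but keep a pointer back to the source row so we can trace it or re-fetch later.
SOURCE_DB = "legacy_assays.db"          # assumed path
EXTRACT_SQL = """
    SELECT assay_id, compound_id, readout, units
    FROM assay_results
    WHERE readout IS NOT NULL           -- decide up front how to treat missing values
"""

def extract():
    with sqlite3.connect(SOURCE_DB) as conn:
        for assay_id, compound_id, readout, units in conn.execute(EXTRACT_SQL):
            yield {
                "compound": compound_id,
                "readout": float(readout),
                "units": units or "unknown",
                # provenance: which system and which record this came from
                "source_system": "legacy_assays",
                "source_key": assay_id,
            }
```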

It’s a good idea to do all the data munging on load, or in an intermediate staging-to-live process. That’s where technologies like Hadoop and Spark can really fly: they enable us to process a lot of data, and to run the ingestion many times to get it right and/or to harvest additional data elements from the sources. A distributed ingestion pipeline also enables us to keep the data up to date so our data-integration resource stays current.
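Here is a minimal PySpark sketch of that munging-on-load step, assuming the source arrives as a CSV extract with hypothetical column names; the whole job can be re-run as often as needed while you refine the rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("staging_ingest").getOrCreate()

# Hypothetical source extract dropped as CSV; the job is re-runnable as often as needed.
raw = spark.read.csv("/staging/biomarker_extract.csv", header=True)

clean = (
    raw
    .withColumn("measured_on", F.to_date("measured_on", "yyyy-MM-dd"))   # normalize dates
    .withColumn("value", F.col("value").cast("double"))                  # enforce types
    .fillna({"units": "unknown"})                                        # explicit missing-value policy
    .dropDuplicates(["sample_id", "biomarker", "measured_on"])           # de-duplicate on load
)

clean.write.mode("overwrite").parquet("/warehouse/biomarker_clean")
```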

B. The role of text mining

Much of the data we want is unstructured. Fortunately, text-mining technologies have matured hugely in recent years. There are several really good open-source capabilities, plus a selection of commercial packages that go even further, in particular those targeted at the Life Sciences sector, like SciBite’s Termite and Linguamatics I2E. Text mining is essential if all you have is pure unstructured text in documents, but it can still add value even with structured data: very often, new concepts emerge over time that the database didn’t have a mechanism to cope with, and users have helpfully recorded that information in notes and comments.
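Even a crude dictionary-based tagger illustrates the idea of pulling entities out of free-text comment fields. The synonym table below is a hypothetical stand-in for what a tool like Termite or I2E would do far more thoroughly.

```python
import re

# Hypothetical in-house synonym dictionary: surface form -> canonical entity ID.
SYNONYMS = {
    "her2": "GENE:ERBB2",
    "erbb2": "GENE:ERBB2",
    "neu": "GENE:ERBB2",
    "trastuzumab": "DRUG:trastuzumab",
}
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SYNONYMS)) + r")\b", re.IGNORECASE)

def tag_entities(text):
    """Return the canonical entities mentioned in a free-text comment."""
    return sorted({SYNONYMS[m.group(1).lower()] for m in PATTERN.finditer(text)})

print(tag_entities("Patient was HER2 positive; responded to trastuzumab."))
# ['DRUG:trastuzumab', 'GENE:ERBB2']
```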

C. Curation

Most data-integration projects tend to assume that all the data is there somewhere, or that it can be generated by text mining. It is worth considering whether that is really true for your project. Annotations can be missing, patchy, use out-of-date terminology or be just plain wrong. If you want your data to be clean you should at least consider some kind of exception processing for these cases, and the best solution will always be a human who knows the database and the domain. You can set up an internal curation team or use crowdsourcing. If you choose the crowdsourcing route, it's still a good idea to maintain some kind of supervisory curation function – a “data Tsar” or similar – to ensure that the quality of the curation meets at least some minimum standard.
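A sketch of that kind of exception processing: records whose annotations do not resolve against the controlled vocabulary are routed to a curation queue instead of being loaded silently. The vocabulary and field names here are invented for illustration.

```python
# Hypothetical controlled vocabulary of approved indication terms.
APPROVED_TERMS = {"breast carcinoma", "gastric carcinoma", "non-small cell lung carcinoma"}

def route(records):
    """Split records into clean ones and exceptions for a human curator."""
    clean, exceptions = [], []
    for rec in records:
        term = (rec.get("indication") or "").strip().lower()
        if term in APPROVED_TERMS:
            clean.append(rec)
        else:
            # Out-of-vocabulary, obsolete or missing term: send it to the curation queue.
            exceptions.append({**rec, "curation_reason": f"unrecognized indication: {term!r}"})
    return clean, exceptions
```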

5. Database Technologies

Where to put the data? I’m not a fan of the “NoSQL” moniker because a) defining something by what it’s not seems odd, and b) experience has taught us that a standardized, vendor-non-specific Data Description/Data Manipulation language is actually very desirable. But, whatever you choose to call them, these new database technologies offer tremendous benefits to data integrators. Relational databases still have their place – in many ways they remain the Rolls Royce of data integration – but they can be very expensive to build and maintain. Column stores may well be the inheritors of the RDBMS crown: they offer a lot of what I like about RDBMSs while retaining the flexibility, speed and agility of key-value stores. Key-value stores and their relatives, the document databases, make it really easy to get data in but don’t always help to make it useful. Nevertheless, you can bring content into a key-value store and then populate a richer data format from that intermediate store. That’s a model that works very well if traceability matters to you.
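A toy illustration of that intermediate-store pattern: the source document is landed verbatim under a key, and a richer, validated record is projected from it, so the original is always available for traceability. Plain dictionaries stand in for a real key-value store here, and the field names are assumptions.

```python
import json

# Plain dicts stand in for the key-value / document store in this sketch.
raw_store = {}      # key -> original source document, untouched
rich_store = {}     # key -> validated, enriched record

def land(source_key, document):
    """Land the source document verbatim, then project a richer record from it."""
    raw_store[source_key] = json.dumps(document)   # keep the original for traceability
    rich_store[source_key] = {
        "compound": document["compound_id"].upper(),
        "target": document.get("target") or None,
        "provenance": source_key,
    }

land("legacy_assays/123", {"compound_id": "cpd-0042", "target": "ERBB2"})
```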

For most of the data-integration problems that I have faced in the translational space, graph databases and triple stores offer the best blend of capabilities. They can handle most of the entities and relationships that we are interested in, they conform to a standard data representation (RDF) that can be ported to other technologies and, in the case of triple stores, they have a standard SQL-like query language (SPARQL). Where they really fly is in their ability to make connections that their designer never thought of. Each new entity and relationship adds another item to the knowledge graph, and that knowledge graph can reveal interesting connections that enable new inferences to be made. Provided you stay in the middleweight zone, the challenges of performance and scalability won’t trip you up.
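As a small illustration of why the graph model suits this kind of work, the rdflib sketch below loads a few triples drawn from hypothetical separate silos and then uses a SPARQL query to walk a drug-to-target-to-disease chain that no single source ever stated as one record. The namespace and entity names are invented.

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/pharma/")   # hypothetical namespace
g = Graph()

# A few triples from different silos, keyed to the same entities.
g.add((EX.trastuzumab, EX.inhibits, EX.ERBB2))
g.add((EX.ERBB2, EX.implicatedIn, EX.breast_carcinoma))
g.add((EX.ERBB2, EX.label, Literal("HER2")))

# Follow the chain drug -> target -> disease, even though no one loaded it as one record.
results = g.query("""
    PREFIX ex: <http://example.org/pharma/>
    SELECT ?drug ?disease WHERE {
        ?drug ex:inhibits ?target .
        ?target ex:implicatedIn ?disease .
    }
""")
for drug, disease in results:
    print(drug, "->", disease)
```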

Don’t feel you have to restrict yourself to any one database technology, though. If you use an intermediate store for your extracted, enriched content then publishing that out to, say, a graph database for exploring distant relationships and to a column store for powering your visual analytics makes perfect sense. I would, however, resist the urge to use every toy in the box. Best to pick a set of technology components that you can build expertise in and have some kind of governance process for bringing in new ones as you go along.

6. Connecting the Dots

A. Entities we care about

Translational research is about joining the dots. But which dots? At the extreme end, every noun phrase in a database is an entity but most of us only care about a defined set of entities that can be managed and act as connection points for data. As a general rule, the fewer managed entities you have, the easier your job is going to be.

B. Fuzzy equivalency

The next question is: when can we say x is the same as y? In chemistry the structure is, in theory, the unique identifier, but in the real world things are messier. Is a structure with undefined stereochemistry the same as a defined stereoisomer? What about salts and hydrates? In biology, if a database column called “target” contains Her2, is that the same as a Her2 found in a column called “biomarker”? Do they both resolve directly to the protein, or to the encoding gene? The answers will differ depending on the application. Any new data-integration project needs to figure out which entities it cares about, and any translational application needs to figure out the rules that decide whether two entities are the same.
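One way to keep those equivalence rules honest is to make them explicit in code. The sketch below folds protein- and gene-level mentions into one canonical gene identifier before any join is attempted; the synonym list and the choice of gene-level resolution are illustrative assumptions, not a recommendation.

```python
# Illustrative resolution rules: whatever string a source column holds, collapse it
# to the canonical gene-level identifier this application has chosen.
GENE_LEVEL = {
    "her2": "HGNC:3430",       # ERBB2 gene
    "erbb2": "HGNC:3430",
    "p185": "HGNC:3430",       # protein-level mention, deliberately folded into the gene
}

def resolve(raw_value):
    """Apply this application's equivalence rules; None means 'needs curation'."""
    return GENE_LEVEL.get(raw_value.strip().lower())

def same_entity(a, b):
    ra, rb = resolve(a), resolve(b)
    return ra is not None and ra == rb

assert same_entity("Her2", "ERBB2")   # "target" column value vs "biomarker" column value
```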

C. Ontologies

I’m using the broad definition of “ontology” here, meaning any kind of taxonomy, controlled vocabulary, etc. Life Sciences is unusual: most fields of study have too few ontologies, whereas we have far too many! When you are doing data integration you quickly discover that things that appear to be the same really aren’t, and vice versa. A typical translational analysis will want to connect data from discovery (probably indexed to MeSH or SNOMED), toxicology (probably indexed to MedDRA) and clinical (probably indexed to ICD-9 or ICD-10). It will probably want to bring in data from Orphanet or OMIM, and it will probably also want to refer to genes and proteins in UniProt or EntrezGene.

In order to manage this Tower of Babel you will need to map between these different ontologies – a task made extra difficult because there are often many-to-many relationships between entities in any two ontologies. You’ll have to pick the display term you want to use and manage the synonyms so your users can find what they want. The first step in this process is usually to select the standard ontology that is closest to what you are trying to do and/or the most familiar to your users, but many data-integration projects find that none of the standard ontologies quite represents the world view of their organization. So you end up adding terms, synonyms, even entire branches to the original tree. That’s fine, except when the underlying ontology changes or the person who knew how to manage the ontology leaves. It is worth asking whether this is something you need to address, and providing the infrastructure to keep the ontology you build current.
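A tiny sketch of what such a cross-ontology mapping table looks like in practice: the mapping returns a list because of the many-to-many problem, and unmapped terms are surfaced rather than silently dropped. The terms and codes shown are purely illustrative.

```python
# Illustrative mapping table: MedDRA-style term -> candidate ICD-10 codes.
# Note the many-to-many shape: one term can map to several codes, and vice versa.
MEDDRA_TO_ICD10 = {
    "hepatotoxicity": ["K71.9"],
    "myocardial infarction": ["I21.9", "I22.9"],
}

def map_term(meddra_term):
    codes = MEDDRA_TO_ICD10.get(meddra_term.strip().lower())
    if codes is None:
        # Surface the gap instead of silently dropping the record.
        raise KeyError(f"no ICD-10 mapping for term {meddra_term!r}")
    return codes
```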

Entities matter. There is nothing more disheartening than when the first demo of the new system gets into the hands of your key stakeholders and they either a) can’t find something they know is in the data, just because you don’t have the synonym, or b) see some kind of “dumb” connection that undermines their confidence in the rest.

7. A Resource Fit for Purpose

In this paper I have advocated medium-weight, goal-oriented data integration. Which raises the question – what do your users actually want? Do they want a set of custom web-based interfaces that they can go to? Do they need this data integrated into some package they use every day? Or are your customers data-savvy and eager to work with the data, either in a controlled way through dashboards or in an uncontrolled way through, for example, analytics in R or some other programming language?

Data integration requires findability. If your users can’t find what they are looking for, they will assume it's just not there. Fortunately, the technology to make searching possible is very mature. But tuning a search engine is non-trivial. End users have low patience thresholds and won’t want to page through results to find the one they want, even if they did a lousy search – they expect Google-like relevance. Just wrapping your data in a search engine won’t deliver that: you need to leverage your ontologies and think about what your users actually need in order to build a solution that will be well received.
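Leveraging the ontologies at query time can be as simple as expanding the user's term with its synonym ring before hitting the index, so a search for “HER2” still finds documents that only say “ERBB2”. The synonym rings and the in-memory index below are illustrative stand-ins for a real search engine.

```python
# Hypothetical synonym rings drawn from the ontology layer.
SYNONYM_RINGS = {
    "her2": {"her2", "erbb2", "neu"},
    "erbb2": {"her2", "erbb2", "neu"},
}

DOCUMENTS = {
    "doc1": "ERBB2 amplification observed in the resistant cell line",
    "doc2": "No change in EGFR expression",
}

def search(query):
    """Expand the query with ontology synonyms before matching."""
    terms = SYNONYM_RINGS.get(query.lower(), {query.lower()})
    return [doc_id for doc_id, text in DOCUMENTS.items()
            if any(term in text.lower() for term in terms)]

print(search("HER2"))   # ['doc1'], even though that document never says "HER2"
```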

The more flexibility you want to offer for different uses of the data, the more you need to lean towards some kind of self-service option, e.g., an API; otherwise your resources will quickly be used up building endless “apps” on top of the data. Making your data programmatically useful may require publishing some subset of it in an entirely different data model, often flattening the data out into a more analytical format. This effort can quickly pay off because your users will be able to get their jobs done without bothering you. The only caveat is that the result of that analysis is yet another piece of information that your system will probably want to persist. The more you want your users to go this route, the more you should consider managing a copy of their code in a linked repository so you can reproduce those electronic experiments.
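The flattening step often amounts to little more than turning nested, integrated records into a rectangular table that analysts can pull into R or pandas. A short sketch with pandas and invented field names:

```python
import pandas as pd

# Hypothetical integrated records, nested the way the repository holds them.
records = [
    {"compound": "CPD-0042", "assays": [{"biomarker": "ERBB2", "value": 1.2},
                                        {"biomarker": "EGFR", "value": 0.4}]},
    {"compound": "CPD-0043", "assays": [{"biomarker": "ERBB2", "value": 2.1}]},
]

# Flatten to one row per compound/biomarker measurement - the shape analysts expect.
flat = pd.json_normalize(records, record_path="assays", meta=["compound"])
print(flat)
#   biomarker  value  compound
# 0     ERBB2    1.2  CPD-0042
# 1      EGFR    0.4  CPD-0042
# 2     ERBB2    2.1  CPD-0043
```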

8. Conclusions

Data integration is hard, and especially hard in the Life Sciences. We suffer less from the Volume and Velocity vectors of the Big Data equation but we have the Variety issue in spades. Having clear goals for your project, going for middleweight data integration, and actively managing the entities you care about can help make your project a success. Good luck!