Home Big Data How Gilead used Amazon Redshift to rapidly and cost-effectively load third-party medical claims information

How Gilead used Amazon Redshift to rapidly and cost-effectively load third-party medical claims information

How Gilead used Amazon Redshift to rapidly and cost-effectively load third-party medical claims information


This publish was co-written with Rajiv Arora, Director of Knowledge Science Platform at Gilead Life Sciences.

Gilead Sciences, Inc. is a biopharmaceutical firm dedicated to advancing modern medicines to forestall and deal with life-threatening illnesses, together with HIV, viral hepatitis, irritation, and most cancers. A pacesetter in virology, Gilead traditionally relied on these medication for progress however now via strategic investments, Gilead is increasing and growing their focus in oncology, having acquired Kite and Immunomedics to spice up their publicity to cell remedy and non-cell remedy, making it the first progress engine. As a result of Gilead is increasing into biologics and huge molecule therapies, and has an bold purpose of launching 10 modern therapies by 2030, there may be heavy emphasis on utilizing information with AI and machine studying (ML) to speed up the drug discovery pipeline.

Amazon Redshift Serverless is a completely managed cloud information warehouse that lets you seamlessly create your information warehouse with no infrastructure administration required. You pay just for the compute assets and storage that you simply use. Redshift Serverless measures information warehouse capability in Redshift Processing Models (RPUs), that are a part of the compute assets. All the information saved in your warehouse, comparable to tables, views, and customers, make up a namespace in Redshift Serverless.

One of many advantages of Redshift Serverless is that you simply don’t must dimension your information warehouse in your peak workload. The height workload contains loading periodic giant datasets in multi-terabyte vary. You’ll be able to set a base RPU from 8 as much as 512 and Redshift Serverless will routinely scale the RPUs to satisfy your workload calls for. This makes it simple to handle your information warehouse in an economical method.

On this publish, we share how Gilead collaborated with AWS to revamp their information ingestion course of. They used Redshift Serverless as their information producer to load third-party medical claims information in a quick and cost-effective manner, decreasing load occasions from days to hours.

Gilead use case

Gilead masses a wide range of information from a whole lot of sources to their R&D information setting. They not too long ago wanted to do a month-to-month load of 140 TB of uncompressed healthcare claims information in underneath 24 hours after receiving it to supply analysts and information scientists with up-to-date data on a affected person’s healthcare journey. This information quantity is predicted to extend month-to-month and is absolutely refreshed every month. The three-node RA3 16XL provisioned cluster that had beforehand been internet hosting their warehouse was taking round 12 hours to ingest this information to Amazon Redshift, and Gilead was trying to optimize the info ingestion course of in a extra dynamic method. Working with Amazon Redshift specialists from AWS, Gilead selected Redshift Serverless as a strategy to cost-effectively load this information after which use Redshift information sharing to share the ultimate dataset to 2 extra Redshift information warehouses for end-user queries.

Loading information is a key course of for any analytical system, together with Amazon Redshift. When loading very giant datasets, it’s necessary to not solely load the info as rapidly as doable but additionally in a manner that optimizes the consumption queries.

Gilead’s healthcare claims information took 40 hours to load, which meant delays in utilizing the info for downstream processes. The groups sought enhancements, concentrating on a most 24-hour SLA for the load. They achieved the load in 8 hours, an 80% discount in time to make information obtainable.

Answer overview

After collaborating, the Gilead and AWS groups selected a two-step course of to load the info to Amazon Redshift. First, the info was loaded with out a distkey and sortkey, which let the load course of use the total parallel assets of the cluster. Then we used a deep copy to redistribute this information and add the specified distribution and type traits.

The answer makes use of Redshift Serverless. The crew needed to ingest information to satisfy the required SLA, and the next approaches had been benchmarked:

  • COPY command – The COPY command makes use of the Amazon Redshift massively parallel processing (MPP) structure to learn and cargo information in parallel from information on Amazon Easy Storage Service (Amazon S3)
  • Knowledge lake analytics Amazon Redshift Spectrum is used to question information instantly from information on Amazon S3 by choosing a subset of columns and avoiding the intermediate step of copying information to staging desk

Preliminary Answer strategy: Single COPY command

The crew decided it could be simpler to use the distribution and type keys in a post-copy step. The info was loaded first utilizing computerized distribution of information. This took roughly 12 hours to finish. The crew created open and closed claims tables with outlined dist keys and with 20% of the columns to alleviate the necessity to question the bigger desk. With this success, we discovered that we will nonetheless enhance the large copy, as detailed within the following sections.

Proposed Answer strategy 1: Parallel COPY command

Primarily based on the preliminary resolution strategy above, the crew examined yearly parallel copy instructions as illustrated within the following diagram.

Yearly Parallel Copy Commands

Under are the findings and learnings from this strategy:

  • Ingesting information for 4 years utilizing parallel copy confirmed a 25% efficiency enchancment over the one copy command.
  • In comparison with Preliminary resolution strategy, the place we had been taking 12 hours to ingest the info, we additional optimized this runtime by 67% by segregating the info ingestion into separate yearly staging tables and working parallel copy instructions.
  • After the info was loaded into staging yearly tables, we created the open and closed declare tables with an auto distkey with the subset of columns required for bigger reporting teams. It took a further 1 hour to create.

The crew used a manifest file to be sure that the COPY command masses the entire required information for the respective 12 months for ingesting.

Proposed Answer strategy 2: Knowledge Lake analytics

The crew used this strategy with Redshift Spectrum to load solely the required columns to Redshift Serverless, which prevented loading information into a number of yearly tables and on to a single desk. The next diagram illustrates this strategy.

Using Spectrum Approach

The workflow consists of the next steps:

  1. Crawl the information utilizing AWS Glue.
  2. Create a information lake exterior schema and desk in Redshift Serverless.
  3. Create two separate claims desk for open and closed claims as a result of open claims are most regularly consumed and are 20% of the columns and 100% of the info.
  4. Create open and closed tables with selective columns wanted for optimum efficiency optimization throughout consumption as a substitute of all columns within the unique third-party dataset. The info quantity distribution is as follows:
    • Whole variety of open claims data = 50 billion
    • Whole variety of closed claims data = 200 billion
    • Total, whole variety of data = 250 billion
  5. Distribute open and closed tables with a customer-identified distkey.
  6. Configure information ingestion into open and closed claims tables mixed utilizing Redshift Serverless with 512 RPUs. This took 1.5 hours, which is additional improved by 70% in comparison with situation 1. We selected 512 RPUs with a view to load information within the quickest manner doable.

On this technique, information ingestion was streamlined by solely loading important fields from the medical claims dataset and by splitting the desk into open and closed claims. Open claims information is most regularly accessed and constitutes solely 20% of columns so by splitting the tables. The crew not solely improved the ingestion efficiency but additionally consumption.

Amazon Redshift not too long ago launched computerized mounting of AWS Glue Knowledge Catalog, making it simpler to run information lake analytics with out manually creating exterior schemas. You’ll be able to question information lake tables instantly from Amazon Redshift Question Editor v2 or your favourite SQL editors.

Suggestions and finest practices

Take into account the next suggestions when loading large-scale information in Amazon Redshift.

  • Use Redshift Serverless with most 512 RPUs to effectively and rapidly load information
  • Relying on consumption use case and question sample, undertake both of the next approaches:
    • When consumption queries require solely chosen fields from the dataset and most regularly entry a subset of information, use information lake queries to load solely the related columns from Amazon S3 into Amazon Redshift
    • When consumption queries require all fields, use COPY instructions with a manifest file to ingest information in parallel into a number of logically separated tables and create a database view with UNION ALL of all tables
  • Keep away from utilizing varchar(max) whereas creating tables and create VARCHAR columns with the precise dimension

Remaining Structure

The next diagram exhibits the high-level remaining structure that was applied.

Final Architecture


With the scalability of Redshift Serverless, information sharing to decouple ingestion from consumption workloads, and information lake analytics to ingest information, Gilead made their 140 TB dataset obtainable to their analysts inside hours of it being delivered. The modern structure of utilizing a serverless ingestion information warehouse, a serverless consumption information warehouse for energy customers, and their unique 3-node provisioned cluster for normal queries provides Gilead isolation to make sure information masses don’t have an effect on their customers. The structure gives scalability to serve rare giant queries with their serverless shopper together with the advantage of a fixed-cost and fixed-performance possibility of their provisioned cluster for his or her commonplace person queries. Because of the month-to-month schedule of the info load and the variable want for giant queries by shoppers, Redshift Serverless proved to be an economical possibility in comparison with merely growing the provisioned cluster to serve every of those use instances.

This break up producer/shopper mannequin of utilizing Redshift serverless can carry advantages to many workloads which have comparable efficiency traits to Gilead’s warehouse. Prospects often run giant information masses sometimes, and people processes compete with person queries. With this sample, you may depend on your queries to carry out constantly no matter whether or not new information is being loaded to the system. This strikes a stability between minimizing value whereas sustaining efficiency and frees the system directors to load information with out affecting customers.

In regards to the Authors

Rajiv Arora is a Director of Scientific Knowledge Science at Gilead Sciences with over 20 years of expertise within the business. He’s answerable for the multi-modal information platform for the event group and helps all statistical and predictive analytical infrastructure for RWE and Superior Analytical capabilities.

Ritesh Kumar Sinha is an Analytics Specialist Options Architect based mostly out of San Francisco. He has helped prospects construct scalable information warehousing and massive information options for over 16 years. He likes to design and construct environment friendly end-to-end options on AWS. In his spare time, he loves studying, strolling, and doing yoga.

Raks KhareRaks Khare is an Analytics Specialist Options Architect at AWS based mostly out of Pennsylvania. He helps prospects architect information analytics options at scale on the AWS platform.

Brent Sturdy is a Senior Options Architect within the Healthcare and Life Sciences crew at AWS. He has greater than 15 years of expertise within the business, specializing in information and analytics and DevOps. At AWS, he works carefully with giant Life Sciences prospects to assist them ship new and modern therapies.

Phil Bates is a Senior Analytics Specialist Options Architect at AWS with over 25 years of information warehouse expertise.



Please enter your comment!
Please enter your name here