Genomics UI – Simplifying Genomic Data Analysis with Microsoft Azure
The University of Guelph’s Genomics Facility needed a way for school researchers to get access to and share data so they can quickly and efficiently process it to do the analysis they need.
faster completion of complex analytics processes
that is simple to navigate
Background and Challenge Story
Our client, the University of Guelph, is a public research university with almost 30,000 students located in Guelph, Ontario. Their Genomics Facility is a specialized laboratory offering molecular biology technology support services and training in DNA sequencing, genotyping, and gene expression analysis.
The University’s Genomics Facility needed a way for school researchers to get access to and share data so they can quickly and efficiently process it to do the analysis they need. Genomic sequencing systems are capable of producing very large quantities of data (they can generate 16 terabytes of data in a 48-hour period) that need to be analyzed, requiring a large amount of processing resources. U of G’s current solution consisted of storing, sharing, and analyzing large genomic files locally with fixed-scale infrastructure, which was delaying researchers’ access to analysis.
Challenge #1 – Scalable (On-Demand) Analytics Capacity
Microsoft Azure offers the University access to a highly scalable infrastructure to reduce the time required to process complex genomic data analysis. The powerful capabilities offered by Azure demand knowledgeable experts to provision compute and storage resources in a secure manner. The University of Guelph set out to find a way of exposing the power and scale of Azure’s cloud services to faculty and students with little or no cloud infrastructure knowledge or training.
Challenge #2 – Cost Governance
With the virtually limitless capacity available from Azure, the University required simple process ‘gates’ to ensure jobs were reviewed and approved prior to submission. Approval workflow would ensure only those jobs that were authorized would be submitted to Azure for execution; thereby preventing users from either accidentally or maliciously submitting large jobs that could drive thousands or tens of thousands of dollars in worthless cloud consumption costs.
Microsoft engaged Adastra as an expert in Azure-based data analysis and analytic services for the higher education market.
Adastra is proud to have been recognized as the winner of Microsoft’s 2022 Data Platform Modernization, Financial Services, and Modern Marketing Impact Awards. These awards are evidence of Adastra’s ability to deliver value-driven data solutions to complex business challenges across sectors.
The University of Guelph, Adastra, and Microsoft collaborated to design and implement an application for their researchers to find and analyze their genomic data. Using this application, the underlying complexity of the Azure services powering the rapid execution of genomic analytical workflows would be abstracted from the user. An intuitive User Experience would allow researchers to easily access their sequencer output data (FASTQ files). This same application would offer researchers and students the opportunity to select and configure workflow tools such as RNA-Seq for gene expression studies. Figure 1 provides a sample interface from the application where users are able to navigate to their FASTQ files.
Figure 1 – User Interface – Selecting Sequence Data for Analysis from Azure Storage
Cromwell Workflow Engine and WDL Files
The Cromwell engine is an open-source scientific workflow Management System. Workflows are defined in .WDL files, which pass instructions to Cromwell (e.g., how much memory is required, where your data is etc.) to carry out a workflow.
The application was built to guide users through the process of capturing user-defined parameters for configuring .WDL files. The application offers a very simple and intuitive way for researchers to generate a .WDL file, launch the associated workflow, and pull up the desired data.
Microsoft Power Apps
Adastra used Microsoft Power Apps to build a multi-user application with three tiers of users: admin, research group admin and research group member. This allowed the University to impose financial and access controls depending on the needs of each type of user.
- Admin – Admin users can create new user accounts and configure different aspects of the system.
- Research group admin – A research group admin can organize a group of users in the system that can all share data and can set a threshold of the predicted costs of a workflow. Their authorization will be required before workflows over a certain threshold will execute, allowing research group admins to ensure they don’t exceed budgets when performing genomic analysis.
- Research group member – This account tier is most likely assigned to a student who wants to perform genomic analysis. Research group members can pull data required for analysis but don’t have administrative controls.
University of Guelph Genomics Facility students work in a lab to prepare a genomic sample for analysis, then the principal investigator of the project takes the samples from all their students and runs them through a genomic analyzer. They then put the raw data in Azure data storage and create a new project in the system. An output will be assigned to each research group member that’s involved in the genomic sequencing run, allowing research students to gain access easily to their genomic samples and analyses as well as related analyses of their peers.
Power Apps and Dataverse are the primary services for delivering the User Experience. The intuitive user experience abstracts the complexity of the underlying Azure services, coordinating a scalable and flexible analytic service based on the Cromwell workflow engine. Some of the services used include:
- User Authentication and Authorization – Azure Active Directory provides a service-based method for the application to manage users. The application further defines a hierarchy of users and their respective roles and entitlements.
- Azure Storage – Used to source input (FASTQ files) for workflow execution and for storing the results of each workflow (output).
- SharePoint – Used to store application data (Reference Data) that can be modified by Administrators to define application behavior and specify input options for workflow execution.
- Efficient large-scale data processing
- UI that is simple to navigate
- Financial and user controls
- Can easily be made an educational tool
Adastra’s solution is a big first step to equipping the University of Guelph Genomics Facility with a user-friendly tool for sharing large amounts of complex data with researchers quickly, and without overloading the system. The Genomics UI solution offers opportunities to give researchers and students the ability to complete complex analytics processes faster and with fewer technical skills.
The application is simple to navigate for researchers in the Facility with a science background but without extensive IT knowledge. Multiple user tiers give Principal Investigators control of cloud service consumption costs, providing administrative users an easy way to track the scope and costs of jobs.
Overall, Adastra’s solution provides opportunities for improving the experience for students at the University of Guelph. Future iterations of the application are planned that would include educational content to further transform the Genomics UI from a simple workflow execution tool into a robust learning system for Bioinformatics. As the University continues to evolve the solution, this novel resource will doubtlessly deliver an educational experience for students that is unique and effective in evolving to changing needs in this fast-paced and vital area of scientific discovery.