Challenges when dealing with metagenomics data

Whilst the BASE and MM datasets have great potential for broad national and international use by academic researchers, industry and government agencies, further work is required to make the data usable across the full range of that intended audience.

Currently, metagenomics data produced by the BASE and MM initiatives are available to consortium members via the BPA Data Portal. However, the continental-scale datasets are large* and expanding rapidly as more collection sites are included, so functionality needs to be developed that lets users generate data subsets of interest. For example:
  • samples that are from a specific geographical area, or
  • samples that include DNA markers associated with specific taxonomic groupings (e.g. a species or a genus).
This image shows how researchers currently interact (see green lines) with the metagenomics data in the BPA Data Repository (indicated by the blue box):



Currently, interacting with the data and gleaning information from it requires downloading the files and analysing them offline. This largely restricts use to groups with skilled informatics capacity and precludes many individuals and organisations in environmental research and management from deriving insights from these important data resources.
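
To give a sense of what this offline analysis involves, the following is a minimal Python sketch of the two kinds of subset mentioned above: selecting sites inside a geographical bounding box, then keeping only the rows of an abundance table that belong to a given taxonomic group. The file layout, the column names (site_id, latitude, longitude, taxonomy), the function names and the sample values are all assumptions made up for illustration; they are not the actual BASE or MM file formats.

```python
import csv
import io


def sites_in_bounding_box(metadata_rows, lat_min, lat_max, lon_min, lon_max):
    """Return the IDs of sites whose coordinates fall inside the bounding box."""
    selected = []
    for row in metadata_rows:
        lat = float(row["latitude"])
        lon = float(row["longitude"])
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            selected.append(row["site_id"])
    return selected


def subset_abundance(table_rows, site_ids, taxon_prefix):
    """Yield (taxonomy, {site: count}) for rows matching the taxon prefix.

    Rows are consumed one at a time, so the full table never has to be
    held in memory, and only the requested site columns are kept.
    """
    for row in table_rows:
        if row["taxonomy"].startswith(taxon_prefix):
            yield row["taxonomy"], {site: int(row[site]) for site in site_ids}


# Tiny made-up inputs standing in for the real, much larger files.
# Column names and values are hypothetical.
METADATA_CSV = """site_id,latitude,longitude
S1,-35.3,149.1
S2,-12.5,130.8
S3,-34.9,138.6
"""

ABUNDANCE_CSV = """taxonomy,S1,S2,S3
Bacteria;Proteobacteria;Rhizobium,12,0,7
Bacteria;Firmicutes;Bacillus,3,9,1
Bacteria;Proteobacteria;Pseudomonas,0,4,5
"""

# Step 1: pick the sites inside a south-eastern bounding box.
metadata = csv.DictReader(io.StringIO(METADATA_CSV))
southern_sites = sites_in_bounding_box(metadata, -40.0, -30.0, 135.0, 150.0)

# Step 2: stream the abundance table, keeping one taxonomic group
# and only the columns for the selected sites.
abundance = csv.DictReader(io.StringIO(ABUNDANCE_CSV))
subset = list(subset_abundance(abundance, southern_sites, "Bacteria;Proteobacteria"))

for taxonomy, counts in subset:
    print(taxonomy, counts)
```

Even a small script like this assumes the user can program, knows how the files are laid out and has somewhere to store the downloads, which is the barrier described above.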



_______________

* For example, in addition to the raw sequence data files covering each amplicon used (e.g. bacterial 16S rRNA gene) for each site (~5 MB fastq files), the data tables for the whole-continent soil survey, which contain analysed data on species/genus occurrence and abundance, are ~9 GB in size and contain over 2 million rows (representing different operational taxonomic units) and over 10,000 columns (representing different individual soil collection sites). This is far too big to open using Excel!
