The NMDC Data Portal User Guide — NMDC (2024)

Introduction

The pilot NMDC Data Portal (https://data.microbiomedata.org) provides a resource for consistently processed multi-omics data that is integrated to enable search, access, analysis, and download. Open-source bioinformatics workflows are used to process raw multi-omics data and produce interoperable and reusable annotated data from metagenome, metatranscriptome, metaproteome, metabolome, and natural organic matter characterizations. The NMDC Data Portal offers several search and navigation components, and data can be downloaded through the graphical user interface using ORCID authentication, with associated download metrics, or retrieved through available RESTful APIs. All multi-omics data are available under a Creative Commons 4.0 license, which enables public use with attribution, as outlined in the NMDC Data Use Policy (https://microbiomedata.org/nmdc-data-use-policy). This first iteration of the NMDC Data Portal was released in March 2021 and will continue to expand its data hostings and functionality on a quarterly basis. Associated release notes and updated user guides will accompany each quarterly release.

A short video tutorial showing how to navigate the portal is available on YouTube (https://www.youtube.com/watch?v=KJQDrCnJRho).

User-Centered Design Process

The NMDC is a resource designed together with and for the scientific community. We have engaged in extensive user research through interviews and direct collaboration with the scientific community, which has informed the design, development, and display of data through the NMDC Data Portal. This methodology (1) enables the scientific community to provide feedback, supports iterative and continuous improvement of our systems, and ensures that our systems enable a high level of scientific productivity. Feedback collected from the scientific community during early iterations of the Data Portal can be linked to the features and design directions found in the current release. Our community-centered design approach ensures that the NMDC can evolve with the needs of the microbiome research community, and will also be important for uncovering creative design solutions, clarifying expectations, reducing redesign, and, perhaps most importantly, enabling shared ownership (2) of the NMDC. We hope that this inclusive approach will enable us to expand our engagements with the microbiome research community and the utility of the NMDC Data Portal.

Available Studies & Data

For the October 2021 NMDC Data Portal release, the data hostings include 7 studies, 638 biosamples, and 5 data types from a breadth of environmental microbiomes, spanning river sediments, subsurface shale carbon reservoirs, plant-microbe associations, and temperate and tropical soils. Specifics are as follows:

Studies

Because the NMDC Data Portal is a pilot infrastructure, incoming projects for which study information and curated environmental metadata become available are first validated and loaded with a flag (Omics data coming soon) before processed instrumentation data are integrated into the portal.

Standards

The NMDC team works closely with several standards groups and organizations. We have adopted the Genomic Standards Consortium (GSC) Minimum Information about any (x) Sequence (MIxS) templates (3). These provide a standard data dictionary of sample descriptors (e.g., location, biome, altitude, depth) organized into seventeen environmental packages (https://www.gensc.org/pages/standards-intro.html) for sequence data. The NMDC team has mapped fields used to describe samples in the GOLD database to MIxS version 5 (v5) elements. In addition, we are adopting the MIxS standards for sequence data types (e.g., sequencing method, PCR primers and conditions) and are leveraging standards and controlled vocabularies developed by the Proteomics Standards Initiative (4), the National Cancer Institute’s Proteomic Data Commons (https://pdc.cancer.gov/data-dictionary/dictionary.html), and the Metabolomics Standards Initiative (5) for mass spectrometry data types (e.g., ionization mode, mass resolution, scan rate).

MIxS environmental packages

The GSC has developed standards for describing genomic and metagenomic sequences, and the environment from which a biological sample originates. These “Minimum Information about any (x) Sequence” (MIxS) packages provide standardized sample descriptors (e.g., location, environment, elevation, altitude, depth) for 17 different sample environments.

Environment Ontology (EnvO)

EnvO is a community-led ontology that represents environmental entities such as biomes, environmental features, and environmental materials. These EnvO entities are the recommended values for several of the mandatory terms in the MIxS packages, often referred to as the “MIxS triad”.

Genomes OnLine Database (GOLD)

GOLD is an open-access repository of genome, metagenome, and metatranscriptome sequencing projects with their associated metadata. Biosamples (defined as the physical material collected from an environment) are described using a five-level ecosystem classification path that goes from ecosystem down to the type of environmental material that describes the sample.

Omics Data

A suite of omics processing data can be generated from available biosamples, and associating these data through a common sample source enables researchers to probe function. The NMDC data schema offers an approach to link omics processing runs to their source biosample (for example, multiple organic matter characterizations can be generated from a single sample through extraction with various solvents, e.g., chloroform, methanol, and water fractionation). The omics data currently available through the portal are outlined below.

Metagenomes.

Illumina-sequenced shotgun metagenome data undergo pre-processing, error correction, assembly, structural and functional annotation, and binning leveraging the JGI’s production pipelines (6), along with an additional read-based taxonomic analysis component. Standardized outputs from the read QC, read-based analysis, assembly, annotation, and binning are available for search and download for 123 metagenomes on the NMDC Data Portal.

Metatranscriptomes.

Illumina-sequenced shotgun reads from cDNA libraries undergo pre-processing and error correction in the same way as described above in the metagenome workflow, with additional steps to filter ribosomal reads. High-quality reads are then assembled into transcripts using MEGAHIT (7) and annotated using the annotation module described in the metagenome workflow. The high-quality reads are mapped back to the annotated transcripts using HISAT2 (8), then processed to calculate the number of reads mapped per feature using featureCounts (9) and RPKM values per feature using edgeR (10). Results from read QC, assembly, and annotation are available for search and download for 45 metatranscriptomes on the NMDC Data Portal.
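The RPKM step can be illustrated with the standard textbook formula (reads per kilobase of feature per million mapped reads). This is a minimal sketch and not the edgeR implementation used by the workflow; the function name, feature names, and counts below are hypothetical.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical per-feature counts: name -> (mapped reads, feature length in bp)
feature_counts = {"transcript_a": (500, 1000), "transcript_b": (500, 2000)}
total_mapped = 1_000_000

rpkm_values = {
    name: rpkm(count, length, total_mapped)
    for name, (count, length) in feature_counts.items()
}
```

With equal read counts, the feature that is twice as long receives half the RPKM value, which is the length normalization the metric is meant to provide.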

Metaproteomes.

Data-dependent mass spectrometry raw data files are first converted to mzML using MSConvert (11). Peptide identification is achieved using MSGF+ (12) and the associated metagenomic information in the FASTA file. Peptide identification false discovery rate is controlled using a decoy database approach. Intensity information is extracted using MASIC (13) and combined with protein information. Protein annotation information is obtained from the associated metagenome annotation output. Standardized outputs for quality control, and peptide and protein-level quantitative data, are available for search and download for 38 metaproteomes on the NMDC Data Portal.
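The general idea behind a decoy database approach can be sketched as follows: the false discovery rate at a score threshold is estimated as the number of decoy matches divided by the number of target matches passing that threshold. This is an illustrative sketch only, not the workflow's code; it assumes a hypothetical score where higher is better, and the example matches are made up.

```python
from typing import List, Tuple


def decoy_fdr(matches: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR as decoy hits / target hits among matches scoring >= threshold.

    Each match is (score, is_decoy); a higher score is assumed to be better.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so there is nothing to be falsely discovered
    return decoys / targets


# Hypothetical peptide-spectrum matches: (score, is_decoy)
example_matches = [
    (9.0, False), (8.5, False), (8.0, True),
    (7.5, False), (7.0, False), (6.0, True),
]

fdr_strict = decoy_fdr(example_matches, threshold=7.0)
fdr_loose = decoy_fdr(example_matches, threshold=6.0)
```

Lowering the threshold admits more decoy matches, so the estimated FDR rises; in practice the threshold is chosen to keep the estimate below a target rate.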

Metabolomes.

The gas chromatography-mass spectrometry (GC-MS) based metabolomics workflow (metaMS), developed by leveraging EMSL’s CoreMS mass spectrometry software framework, allows target and semi-target data analysis of metabolomics data (14). The raw data is parsed into the CoreMS data structure and undergoes all the steps of signal processing (signal noise reduction, m/z based chromatogram peak deconvolution, abundance threshold calculation, peak picking) and molecular identification, including the molecular search using a metabolites standard compound library, spectral similarity calculation, and similarity score calculation (15), all in a single step. The putative metabolite annotation data is available to download for 34 metabolomes on the NMDC Data Portal. Data-dependent LC-MS based workflows are currently under development. Additionally, it should be noted that all available data derives from exploratory, untargeted analysis and is semi-quantitative.
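As an illustration of what a spectral similarity calculation can look like, the sketch below computes cosine similarity between two centroided spectra. Cosine similarity is one common measure and is used here as an assumption; the guide does not state which measures metaMS applies. The spectra and function name are hypothetical.

```python
import math
from typing import Dict


def cosine_similarity(spectrum_a: Dict[int, float], spectrum_b: Dict[int, float]) -> float:
    """Cosine similarity between two spectra given as {nominal m/z: intensity}."""
    shared_mz = set(spectrum_a) & set(spectrum_b)
    dot = sum(spectrum_a[mz] * spectrum_b[mz] for mz in shared_mz)
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as matching nothing
    return dot / (norm_a * norm_b)


# Hypothetical query and library spectra
query = {73: 100.0, 147: 50.0}
library_hit = {73: 100.0, 147: 50.0}
library_miss = {55: 80.0, 91: 20.0}

score_hit = cosine_similarity(query, library_hit)
score_miss = cosine_similarity(query, library_miss)
```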

Natural Organic Matter Characterization (NOM).

Direct Infusion Fourier Transform mass spectrometry (DI FT-MS) data undergoes signal processing and molecular formula assignment leveraging EMSL’s CoreMS framework (14). Raw time domain data is transformed into the m/z domain using the Fourier Transform and the Ledford equation (16). Data is denoised, followed by peak picking and recalibration using an external reference list of known compounds, and then searched against a dynamically generated molecular formula library with a defined molecular search space. The confidence scores for all the molecular formula candidates are calculated based on the mass accuracy and fine isotopic structure, and the candidate with the highest score is assigned as the best match. The molecular formula characterization table is available to download for 946 natural organic matter characterizations on the NMDC Data Portal.
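Two of the quantities mentioned above can be sketched briefly. The Ledford equation is commonly written as m/z = A/f + B/f², where f is the measured frequency and A and B are instrument calibration constants; mass accuracy is usually expressed as an error in parts per million (ppm). The sketch below assumes that form of the equation, and the calibration constants and masses are hypothetical values chosen only to keep the arithmetic easy to follow. It is not CoreMS code.

```python
def ledford_mz(frequency_hz: float, a: float, b: float) -> float:
    """Convert a frequency to m/z using the form m/z = A/f + B/f^2."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return a / frequency_hz + b / frequency_hz ** 2


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Mass error in parts per million relative to the theoretical m/z."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical calibration constants and frequency
mz_first_order = ledford_mz(2.0e5, a=1.0e8, b=0.0)
mz_corrected = ledford_mz(2.0e5, a=1.0e8, b=-4.0e8)

# Hypothetical measured vs. theoretical m/z for a candidate formula
error_ppm = ppm_error(500.0005, 500.0)
```

A candidate formula whose theoretical m/z lies within a small ppm window of the measured peak scores better on mass accuracy than one further away.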

Portal Functionality

Faceted search and access

Search by investigator name

NMDC-linked data can be filtered by the associated principal investigator by selecting ‘PI Name’ from the left query term bar. This selection will display studies and samples associated with that PI, and selecting the arrow on the right side of the study name will open up more information about that study and that principal investigator.

Search by omics processing information

Samples can be queried by various omics processing information terms including instrument name, omics type (processing runs sorted by omics type can also be queried using the bar plot on the main portal page), and processing institution.

Search by KEGG Orthology (KO)

Under ‘Function’ on the query term bar, users can search by KEGG Orthology (KO) terms to limit the query to samples with datasets that include at least one of the listed KO terms. Users may list multiple KO terms, but note that adding multiple terms limits the search to datasets that include at least one of those KO terms, not all of them.
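The "at least one" behavior described above is a logical OR across the listed terms. A minimal sketch of that matching rule follows; the dataset names and KO assignments are hypothetical, and this is not the portal's query code.

```python
from typing import Dict, List, Set


def matches_any_ko(dataset_kos: Set[str], query_kos: Set[str]) -> bool:
    """True if the dataset contains at least one of the queried KO terms."""
    return bool(dataset_kos & query_kos)


def filter_datasets(datasets: Dict[str, Set[str]], query_kos: Set[str]) -> List[str]:
    """Return the names of datasets matching ANY queried KO term, sorted by name."""
    return sorted(name for name, kos in datasets.items() if matches_any_ko(kos, query_kos))


# Hypothetical datasets and their annotated KO terms
datasets = {
    "dataset_1": {"K00001", "K00002"},
    "dataset_2": {"K00002"},
    "dataset_3": {"K00003"},
}

any_match = filter_datasets(datasets, {"K00001", "K00002"})
```

Here `dataset_2` is returned even though it lacks `K00001`; a search requiring all listed terms would return only `dataset_1`.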

Search by environmental descriptors

The query term bar also includes several environmental descriptor fields for filtering by where the samples were isolated. Users can filter by sample isolation depth, collection date, latitude and longitude (also available through the interactive map on the omics main page), and geographic location name.

Search by ecosystem classifications

Samples can also be queried by ecosystem classifications using GOLD and/or ENVO terms. Selecting GOLD classification in the query term bar opens up a hierarchy that can be navigated through to select ecosystem classification(s) of interest. Users can select everything under a certain classification at any point, or can continue navigating to more specific classifications. The Sankey diagram on the ‘Environment’ page provides an interactive visualization of the GOLD classification system.

Similarly, ENVO terms can be used to query the portal, and these are broken down into environmental biome, feature, and material categories. ENVO is another effective classification system that can be used to describe the environments where samples were collected.
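Selecting "everything under a certain classification" in the GOLD hierarchy amounts to matching a prefix of each sample's ecosystem path. The sketch below illustrates that idea under the assumption that a path is an ordered sequence of level names; the paths shown are illustrative values, not taken from the portal.

```python
from typing import Dict, List, Sequence


def under_classification(sample_path: Sequence[str], selected: Sequence[str]) -> bool:
    """True if the sample's ecosystem path starts with the selected classification."""
    return tuple(sample_path[: len(selected)]) == tuple(selected)


def select_samples(paths: Dict[str, Sequence[str]], selected: Sequence[str]) -> List[str]:
    """Names of samples whose path falls under the selected classification."""
    return sorted(name for name, path in paths.items() if under_classification(path, selected))


# Illustrative five-level paths (hypothetical samples)
sample_paths = {
    "sample_a": ("Environmental", "Aquatic", "Freshwater", "River", "Sediment"),
    "sample_b": ("Environmental", "Terrestrial", "Soil", "Unclassified", "Unclassified"),
    "sample_c": ("Host-associated", "Plants", "Roots", "Rhizosphere", "Soil"),
}

broad = select_samples(sample_paths, ("Environmental",))
narrow = select_samples(sample_paths, ("Environmental", "Aquatic", "Freshwater"))
```

Navigating deeper in the hierarchy corresponds to a longer prefix and therefore a narrower set of samples.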

Interactive visualizations

Omics Page

Barplot

The barplot on the omics page displays the number of omics processing runs (not number of samples) for each data type available: organic matter, metagenomic, metatranscriptomic, proteomic, and metabolomic. Selecting the bar of a data type will limit the search to just that data type.

Geographic map

The geographic map on the omics page allows for samples to be queried by the geographic location from which they were isolated. The map displays the geographical location (latitude, longitude) of the sample collection sites as clusters with colors corresponding to the number of samples from that area. The map can be zoomed in and out of, and clusters can be selected to focus on that specific area. After zooming and moving around the map to a region of interest, selecting the ‘Search this region’ button will limit the search to the current map bounds.
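Limiting a search to the current map bounds is conceptually a bounding-box test on each sample's latitude and longitude. The following is a minimal sketch under that assumption; the class, coordinates, and sample names are hypothetical, and the sketch ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Bounds:
    """Map bounds in decimal degrees (assumes west <= east)."""
    south: float
    north: float
    west: float
    east: float


def in_bounds(latitude: float, longitude: float, bounds: Bounds) -> bool:
    """True if the coordinate lies inside the bounds, edges included."""
    return (bounds.south <= latitude <= bounds.north
            and bounds.west <= longitude <= bounds.east)


def search_region(samples: Dict[str, Tuple[float, float]], bounds: Bounds) -> List[str]:
    """Names of samples whose (latitude, longitude) falls within the bounds."""
    return sorted(name for name, (lat, lon) in samples.items() if in_bounds(lat, lon, bounds))


# Hypothetical sample coordinates: name -> (latitude, longitude)
sample_sites = {
    "sample_a": (46.3, -119.3),
    "sample_b": (9.1, -79.8),
    "sample_c": (39.0, -107.0),
}

region = Bounds(south=35.0, north=50.0, west=-125.0, east=-100.0)
in_region = search_region(sample_sites, region)
```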

Temporal slider

Samples can also be queried by a sample collection date range by dragging the dots below the temporal slider on the omics page. Sample collection dates are grouped by month.
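Filtering by a date range and grouping the results by month can be sketched as below. This is an illustration of the idea only; the function names and collection dates are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Iterable


def month_key(collected: date) -> str:
    """Label a date by its calendar month, e.g. '2019-06'."""
    return f"{collected.year:04d}-{collected.month:02d}"


def count_by_month(dates: Iterable[date], start: date, end: date) -> Counter:
    """Count collection dates per month, keeping only dates within [start, end]."""
    return Counter(month_key(d) for d in dates if start <= d <= end)


# Hypothetical sample collection dates
collection_dates = [
    date(2019, 6, 3),
    date(2019, 6, 21),
    date(2019, 7, 1),
    date(2020, 1, 15),
]

monthly = count_by_month(collection_dates, start=date(2019, 6, 1), end=date(2019, 12, 31))
```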

Upset plot

The upset plot on the omics page displays the number of samples that have various combinations of associated omics data. The axis at the top of the plot refers to the different omics types (MG: metagenomic, MT: metatranscriptomic, MP: metaproteomic, MB: metabolomic, NOM: natural organic matter) and the dots and lines in the graph below represent the combinations of the omics data types. The numbers and bars on the right side represent the number of samples searchable in the NMDC Data Portal with each corresponding combination of omics data types. This plot will update as query terms are added.
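The counts shown in an upset plot come from tallying how many samples share each exact combination of omics types. A minimal sketch of that tally follows; the sample names and their omics types are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def combination_counts(sample_omics: Dict[str, Set[str]]) -> Counter:
    """Count samples per exact combination of omics types; skip samples with none."""
    return Counter(frozenset(types) for types in sample_omics.values() if types)


# Hypothetical samples and the omics types available for each
sample_omics = {
    "sample_1": {"MG", "MT"},
    "sample_2": {"MG", "MT"},
    "sample_3": {"MG"},
    "sample_4": {"MG", "NOM", "MB"},
    "sample_5": set(),
}

combos = combination_counts(sample_omics)
```

Each key is one row of dots in the plot and each value is the length of the matching bar; a sample with only MG data is counted separately from samples with both MG and MT data.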

Environment Page

Sankey diagram

On the environment page, the Sankey diagram displays the environments that NMDC-linked samples were isolated from. This visualization is based on the GOLD ecosystem classification path, and the diagram is fully interactive, so environments of interest can be chosen at descending levels of specificity. This will then limit your search to samples that came from that selected environment.

Download

Individual file

Various output data files are available from samples findable through the NMDC that have been run through the NMDC standardized workflows. Output files from each omic type are sorted by the specific workflow (e.g., Metagenome Assembly, Annotation) that was run and are each available for download when the sample of interest is selected. Users must log in with an ORCID account before downloading data.

Bulk download

In addition to the ability to download single output files from samples run through the NMDC standardized workflows, the NMDC portal allows users to perform bulk downloads on workflow output files. Once samples of interest are down-selected through query terms, output files from each NMDC standardized workflow run on those samples are available as bulk downloads. Users must be logged in with an ORCID account before downloading data.

References

  1. Abras C, Maloney-Krichmar D, Preece J. 2004. User-Centered Design. _In_ Bainbridge W (ed), Encyclopedia of Human-Computer Interaction. Sage Publications, Thousand Oaks.

  2. Preece J, Rogers Y, Sharp H. 2002. Interaction design: Beyond human-computer interaction. John Wiley & Sons, New York, NY.

  3. Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G, Vaughan R, Hunter C, Park J, Morrison N, Rocca-Serra P, Sterk P, Arumugam M, Bailey M, Baumgartner L, Birren BW, Blaser MJ, Bonazzi V, Booth T, Bork P, Bushman FD, Buttigieg PL, Chain PSG, Charlson E, Costello EK, Huot-Creasy H, Dawyndt P, DeSantis T, Fierer N, Fuhrman JA, Gallery RE, Gevers D, Gibbs RA, Gil IS, Gonzalez A, Gordon JI, Guralnick R, Hankeln W, Highlander S, Hugenholtz P, Jansson J, Kau AL, Kelley ST, Kennedy J, Knights D, Koren O, et al. 2011. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. _Nature Biotechnol._ 29:415-420.

  4. Taylor CF, Paton NW, Lilley KS, Binz P-A, Julian RK, Jones AR, Zhu W, Apweiler R, Aebersold R, Deutsch EW, Dunn MJ, Heck AJR, Leitner A, Macht M, Mann M, Martens L, Neubert TA, Patterson SD, Ping P, Seymour SL, Souda P, Tsugita A, Vandekerckhove J, Vondriska TM, Whitelegge JP, Wilkins MR, Xenarios I, Yates JR, Hermjakob H. 2007. The minimum information about a proteomics experiment (MIAPE). _Nature Biotechnol._ 25:887-893.

  5. Sansone S-A, Fan T, Goodacre R, Griffin JL, Hardy NW, Kaddurah-Daouk R, Kristal BS, Lindon J, Mendes P, Morrison N, Nikolau B, Robertson D, Sumner LW, Taylor C, van der Werf M, van Ommen B, Fiehn O, Members MSIB. 2007. The Metabolomics Standards Initiative. _Nature Biotechnol._ 25:846-848.

  6. Clum A, Huntemann M, Bushnell B, Foster B, Foster B, Roux S, Hajek PP, Varghese N, Mukherjee S, Reddy TBK, Daum C, Yoshinaga Y, O’Malley R, Seshadri R, Kyrpides NC, Eloe-Fadrosh EA, Chen I-MA, Copeland A, Ivanova NN, Segata N. 2021. DOE JGI Metagenome Workflow. _mSystems_ 6:e00804-20.

  7. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. _Bioinformatics_ 31:1674-1676.

  8. Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. 2019. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. _Nature Biotechnol._ 37:907-915.

  9. Liao Y, Smyth GK, Shi W. 2014. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. _Bioinformatics_ 30:923-30.

  10. Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. _Bioinformatics_ 26:139-140.

  11. Chambers MC, Maclean B, Burke R, Amodei D, Ruderman DL, Neumann S, Gatto L, Fischer B, Pratt B, Egertson J, Hoff K, Kessner D, Tasman N, Shulman N, Frewen B, Baker TA, Brusniak MY, Paulse C, Creasy D, Flashner L, Kani K, Moulding C, Seymour SL, Nuwaysir LM, Lefebvre B, Kuhlmann F, Roark J, Rainer P, Detlev S, Hemenway T, Huhmer A, Langridge J, Connolly B, Chadick T, Holly K, Eckels J, Deutsch EW, Moritz RL, Katz JE, Agus DB, MacCoss M, Tabb DL, Mallick P. 2012. A cross-platform toolkit for mass spectrometry and proteomics. _Nature Biotechnol._ 30:918-20.

  12. Kim S, Gupta N, Pevzner PA. 2008. Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases. _J Proteome Res._ 7:3354-3363.

  13. Monroe ME, Shaw JL, Daly DS, Adkins JN, Smith RD. 2008. MASIC: A software program for fast quantitation and flexible visualization of chromatographic profiles from detected LC-MS(/MS) features. _Comp. Biol. Chemistry_ 32:215-217.

  14. Corilo YE, Kew WR, McCue LA. 2021. EMSL-Computing/CoreMS: CoreMS 1.0.0 (v1.0.0). Zenodo. 10.5281/zenodo.4641552.

  15. Hiller K, Hangebrauk J, Jäger C, Spura J, Schreiber K, Schomburg D. 2009. MetaboliteDetector: comprehensive analysis tool for targeted and nontargeted GC/MS based metabolome analysis. _Anal Chem_ 81:3429-39.

  16. Marshall AG, Hendrickson CL, Jackson GS. 1998. Fourier transform ion cyclotron resonance mass spectrometry: a primer. _Mass Spectrom Rev_ 17:1-35.
