1. About MDLink

MDLink is designed to explore the biological relevance of understudied metabolites and their potential links with diseases, enabling evidence-based prioritization of metabolic biomarkers for further investigation. MDLink consists of three branches, each grounded in a specific biological assumption. Together they predict potential interacting proteins/genes for metabolites of interest, identify the pathways involved, and infer potential disease associations linked to these metabolites.

The default database integrates multiple data sources across four main categories:

Interaction Networks: STRING and STITCH for protein-protein and metabolite-protein interactions.
Protein and Metabolite Information: STRING, STITCH, and PubChem for comprehensive annotations of proteins and metabolites.
Disease-Related Genes: Disease Ontology (DO), DisGeNET, and NCG.
Biological Annotations: Gene Ontology (including Molecular Function, Cellular Component, and Biological Process), KEGG, WikiPathways, and Reactome pathways.

The database encompasses approximately xxxx metabolites, 21,300 proteins, xxx protein-protein interactions, xxx metabolite-protein interactions, xxx diseases, xxx GO terms, xxx KEGG pathways, xxx WikiPathways, and xxx Reactome pathways.

2. About the Search Page

On the Search page, users have the option to analyze a single metabolite or perform batch analysis of multiple metabolites. Metabolites of interest can be explored through individual or combined analyses across three branches: the target proteins branch, the structurally similar metabolites branch, and the co-abundant metabolites branch. The results for each selected branch will be displayed and summarized separately.

Upon completion of each step, corresponding summaries and results are automatically generated, presented as downloadable figures and interactive tables. Figures include scaling controls in the lower-right corner, while tables support horizontal scrolling, column-specific sorting, and keyword search functionality.

Icon Functions:
[icon]: provides explanatory notes.
#1 #2 #3: supply example datasets.
[icon]: enables data download.

2.1 Analysis of a specific metabolite

Step 0 select module

In Step 0, select the "Analysis of a specific metabolite" module to start the analysis.

Step 1 input metabolite

To begin analyzing a specific metabolite, users can search by either a metabolite name or an ID. All synonyms and identifiers are supported, including common names and multiple identification systems for metabolites (e.g., KEGG IDs, HMDB IDs, CAS numbers, ChEBI IDs, and all aliases from PubChem). For example, any of the following terms can be used to search for arachidonic acid:

Common name: arachidonic acid
PubChem CID: CID444899
KEGG compound ID: C00219
ChEBI: CHEBI:137828
HMDB: HMDB0001043
CAS-RN: 506-32-1
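
As a rough illustration of how such mixed inputs could be told apart, the following Python sketch guesses the identifier type from the shape of the query string. The patterns and the function name `guess_identifier_type` are assumptions made for this example; they are not MDLink's actual matching logic, and anything that matches no pattern is simply treated as a common name.

```python
import re

# Hypothetical patterns for illustration only; order matters because
# "CID..." must be tested before the KEGG "C" + 5 digits pattern.
ID_PATTERNS = [
    ("PubChem CID", re.compile(r"^CID\d+$", re.IGNORECASE)),
    ("KEGG compound ID", re.compile(r"^C\d{5}$")),
    ("ChEBI", re.compile(r"^CHEBI:\d+$", re.IGNORECASE)),
    ("HMDB", re.compile(r"^HMDB\d{5,7}$", re.IGNORECASE)),
    ("CAS-RN", re.compile(r"^\d{2,7}-\d{2}-\d$")),
]


def guess_identifier_type(query: str) -> str:
    """Return a label for the first pattern the query matches."""
    text = query.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.match(text):
            return label
    # Fall back to treating the query as a name or synonym.
    return "Common name"


for term in ["arachidonic acid", "CID444899", "C00219",
             "CHEBI:137828", "HMDB0001043", "506-32-1"]:
    print(f"{term} -> {guess_identifier_type(term)}")
```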

After the metabolite is entered, a brief summary of the query result for this metabolite is shown (arachidonic acid is used as the example).

Step 2 input target disease

The target disease is used to obtain the proteins associated with that disease. An over-representation analysis (ORA) is then performed to determine whether these disease-related proteins are statistically over-represented among the proteins predicted to interact with the queried metabolite. When a disease is selected from the list using its radio button (e.g., DOID:8778), the corresponding disease-related proteins are passed to the ORA automatically.

Users can also manually enter a list of disease-related proteins in the second option, using protein names or NCBI Entrez Gene IDs.

Alternatively, if there is no specific disease to be studied, users can opt to skip this section using the third option, in which case disease-related proteins ORA will not be performed in the later analyses.
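
The statistical test behind the ORA is not specified on this page. ORA is commonly implemented as a one-sided hypergeometric test, and the sketch below assumes that form. The function name `ora_p_value`, the background size, and the gene labels are all hypothetical and serve only to show the calculation.

```python
from math import comb


def ora_p_value(background_size, disease_genes, predicted_genes):
    """One-sided hypergeometric test: P(overlap >= observed).

    background_size: number of genes/proteins in the assumed universe
    disease_genes:   genes linked to the target disease
    predicted_genes: genes predicted to interact with the metabolite
    """
    disease = set(disease_genes)
    predicted = set(predicted_genes)
    observed = len(disease & predicted)
    n_disease, n_predicted = len(disease), len(predicted)
    total = comb(background_size, n_predicted)
    upper = min(n_disease, n_predicted)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_disease, k) * comb(background_size - n_disease, n_predicted - k)
        for k in range(observed, upper + 1)
    )
    return observed, tail / total


# Toy universe of 20 genes with made-up labels.
overlap, p = ora_p_value(20, {"G1", "G2", "G3", "G4"}, {"G1", "G2", "G9"})
print(overlap, p)
```

The overlap counted here corresponds to the intersection shown in a Venn diagram of the two protein sets.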

Step 3 set parameters for the analytic branches

3.1 The target proteins branch

By selecting the first analytic module in Step 3, users can use the target proteins branch.

Within this module, users can either manually input target proteins associated with the query metabolite (no confidence score required) or retrieve target proteins from the default database, in which case a minimum confidence score must be set (in the "Define the Target Proteins" panel). The resulting target proteins are then used to retrieve potential interacting proteins/genes that pass the confidence score threshold (set in the "Interaction Network" panel). The confidence score ranges from 0 to 1, with a default threshold of 0.7.

The matched target proteins, their potential interacting genes/proteins, and the associated interaction relationships will be presented in tables. Furthermore, the interaction data is available for download in edges-and-nodes format, enabling visualization in network analysis tools such as Cytoscape.
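
To make the thresholding and the edges-and-nodes layout concrete, here is a minimal Python sketch. It assumes interactions are available as (source, target, score) triples and that an interaction is kept when its score is at least the threshold; the column names `source`, `target`, `score`, and `id` are illustrative choices, not MDLink's actual download format, and the scores are made up.

```python
import csv
import io


def filter_edges(edges, min_score=0.7):
    """Keep interactions whose confidence score reaches the threshold."""
    return [(s, t, score) for s, t, score in edges if score >= min_score]


def to_edge_and_node_tables(edges):
    """Return tab-separated edge and node tables as two strings."""
    edge_buf, node_buf = io.StringIO(), io.StringIO()
    edge_writer = csv.writer(edge_buf, delimiter="\t", lineterminator="\n")
    edge_writer.writerow(["source", "target", "score"])
    edge_writer.writerows(edges)
    # Every protein that appears in at least one kept edge becomes a node.
    nodes = sorted({name for s, t, _ in edges for name in (s, t)})
    node_writer = csv.writer(node_buf, delimiter="\t", lineterminator="\n")
    node_writer.writerow(["id"])
    node_writer.writerows([name] for name in nodes)
    return edge_buf.getvalue(), node_buf.getvalue()


# Made-up scores for illustration.
example_edges = [
    ("PTGS2", "ALOX5", 0.92),
    ("PTGS2", "TP53", 0.41),
    ("ALOX5", "ALOX12", 0.70),
]
edge_text, node_text = to_edge_and_node_tables(filter_edges(example_edges))
print(edge_text)
print(node_text)
```

Tables of this kind can be loaded into Cytoscape through its table-based network import.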

The results of the disease-related proteins ORA will be displayed (using Crohn's disease and arachidonic acid as an example). The Venn diagram shows the overlap between the predicted metabolite-related proteins and the disease-related proteins. Download options for both the figure and table are provided.

The predicted interacting proteins/genes will be used to perform term enrichment analysis in Step 4. Users can select standard databases (e.g., Gene Ontology, KEGG Pathways, WikiPathways, Reactome Pathways, Disease Ontology) for enrichment analysis. If the provided databases do not meet specific needs, users can upload custom databases by submitting tab-separated TERM2GENE and TERM2NAME files (examples provided in #1). These files must adhere to specific formats: TERM2GENE pairs Term IDs with Gene IDs, while TERM2NAME maps Term IDs to descriptive names.
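
The sketch below shows one way the two custom files could look and be checked before upload. It assumes two tab-separated columns per line and no header row, which this page does not confirm; the term IDs, term names, and helper names (`read_two_columns`, `load_custom_database`) are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up content; only the two-column, tab-separated layout matters.
TERM2GENE_TEXT = "SET:0001\tPTGS1\nSET:0001\tPTGS2\nSET:0002\tALOX5\n"
TERM2NAME_TEXT = "SET:0001\tExample gene set A\nSET:0002\tExample gene set B\n"


def read_two_columns(handle):
    """Read (first, second) pairs from a tab-separated file, skipping blanks."""
    rows = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        rows.append((row[0].strip(), row[1].strip()))
    return rows


def load_custom_database(term2gene_handle, term2name_handle):
    """Return term -> genes, term -> name, and terms that lack a name."""
    term2gene = defaultdict(set)
    for term, gene in read_two_columns(term2gene_handle):
        term2gene[term].add(gene)
    term2name = dict(read_two_columns(term2name_handle))
    unnamed = sorted(set(term2gene) - set(term2name))
    return dict(term2gene), term2name, unnamed


term2gene, term2name, unnamed = load_custom_database(
    io.StringIO(TERM2GENE_TEXT), io.StringIO(TERM2NAME_TEXT)
)
print(term2gene, term2name, unnamed)
```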

The results of the enrichment analysis will be displayed in a table with a global search box at the top-right and filters under the header for domain-specific browsing. These tools enable users to focus on specific terms and view statistical details. To visualize terms of interest, tick the square icon next to each term name. The selected terms are presented as a composite diagram comprising a Sankey plot and a dot plot. The Sankey plot on the left shows connections between metabolite-targeted genes, interacting genes/proteins, and selected terms. The dot plot on the right illustrates the number and proportion of predicted genes/proteins involved in the enriched terms, along with the adjusted p-values indicating statistical significance. The plot can be zoomed in or out using the controls in the lower-right corner.
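
This page does not state which multiple-testing correction produces the adjusted p-values. The Benjamini-Hochberg procedure is a common choice in enrichment analysis, so the following sketch assumes it purely for illustration; the function name and input values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03]))
```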

Users can also use custom databases separately for enrichment analysis.

3.2 The structurally-similar metabolites branch

By selecting the second analytic module in Step 3, users can use the structurally similar metabolites branch.

Within this module, users can either manually input structurally similar metabolites associated with the query metabolite or retrieve structurally similar metabolites (with a Tanimoto score >= 90%) from the default database. The resulting structurally similar metabolites are then used to retrieve potential interacting proteins/genes that pass the confidence score threshold (set in the "Interaction Network" panel). The confidence score ranges from 0 to 1, with a default threshold of 0.7.
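
For readers unfamiliar with the Tanimoto score, the sketch below computes it for two fingerprints represented as sets of "on" bit positions: the size of the intersection divided by the size of the union. The fingerprint type used by the default database is not stated here, so the bit sets are toy values and the function names are hypothetical.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto coefficient of two sets of 'on' fingerprint bits."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    if not union:
        # Two empty fingerprints carry no information; report no similarity.
        return 0.0
    return len(a & b) / len(union)


def is_structurally_similar(fingerprint_a, fingerprint_b, cutoff=0.90):
    """Apply the >= 90% cutoff described above."""
    return tanimoto(fingerprint_a, fingerprint_b) >= cutoff


query_bits = set(range(20))       # toy fingerprint with 20 bits set
candidate_bits = set(range(19))   # shares 19 of those bits
print(tanimoto(query_bits, candidate_bits))
print(is_structurally_similar(query_bits, candidate_bits))
```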

The matched structurally similar metabolites, their potential interacting genes/proteins, and the associated interaction relationships will be presented in tables. Furthermore, the interaction data is available for download in edges-and-nodes format, enabling visualization in network analysis tools such as Cytoscape.

The results of the disease-related proteins ORA will be displayed (using Crohn's disease and arachidonic acid as an example). The Venn diagram shows the overlap between the predicted structurally similar metabolite-related proteins and the disease-related proteins. Download options for both the figure and table are provided.

The predicted interaction proteins/genes can be used to perform term enrichment analysis in Step 4. Users can select standard databases (e.g., Gene Ontology, KEGG Pathways, WikiPathways, Reactome Pathways, Disease Ontology) for enrichment analysis. If the provided databases do not meet specific needs, users have the option to upload custom databases by submitting tab-separated TERM2GENE and TERM2NAME files. These files must adhere to specific formats: TERM2GENE requires Term IDs paired with Gene IDs, while TERM2NAME maps Term IDs to descriptive names.

The results of the enrichment analysis will be displayed in a table with a global search box at the top-right and filters under the header for domain-specific browsing. These tools enable users to focus on specific terms and view statistical details. To visualize terms of interest, tick the square icon next to each term name. The selected terms are presented as a composite diagram comprising a Sankey plot and a dot plot. The Sankey plot on the left shows connections between structurally similar metabolites, interacting genes/proteins, and selected terms. The dot plot on the right illustrates the number and proportion of predicted genes/proteins involved in the enriched terms, along with the adjusted p-values indicating statistical significance. The plot can be zoomed in or out using the controls in the lower-right corner.

3.3 The co-abundant metabolites branch

By selecting the third analytic module in Step 3, users can use the co-abundant metabolites branch.

Within this module, users have three options to obtain metabolites that co-vary with the queried metabolite: (i) upload a metabolic abundance table to identify the co-abundant metabolites associated with the queried metabolite using Weighted Gene Co-expression Network Analysis (WGCNA), (ii) directly upload precomputed WGCNA results containing metabolites and modules, or (iii) manually input metabolites that co-vary with the queried metabolite. The resulting co-abundant metabolites, including the queried metabolite, are then used to retrieve potential interacting proteins/genes that pass the confidence score threshold (set in the "Interaction Network" panel). The confidence score ranges from 0 to 1, with a default threshold of 0.7.

When identifying abundance-correlated metabolites (i.e., a consistent abundance module) using WGCNA, it is important to use the correct input file format and to set the analysis parameters properly. The input file should be in TXT or TSV format, with metabolites as columns and samples as rows. An example abundance table is provided as a format reference.
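
A minimal sketch of reading such a table is given below. It assumes a tab-separated file whose first row holds the metabolite names and whose first column holds the sample IDs; that header layout, the function name `read_abundance_table`, and the abundance values are assumptions for illustration only, so the provided example table remains the authoritative format reference.

```python
import csv
import io

# Made-up abundances: samples as rows, metabolites as columns.
EXAMPLE_TABLE = (
    "sample\tarachidonic acid\turobilin\n"
    "S1\t10.5\t3.2\n"
    "S2\t11.0\t2.9\n"
    "S3\t9.8\t3.5\n"
)


def read_abundance_table(handle):
    """Return (sample IDs, {metabolite: [abundance per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    metabolites = header[1:]
    if len(set(metabolites)) != len(metabolites):
        raise ValueError("metabolite column names must be unique")
    samples = []
    columns = {name: [] for name in metabolites}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields")
        samples.append(row[0])
        for name, value in zip(metabolites, row[1:]):
            columns[name].append(float(value))
    return samples, columns


samples, columns = read_abundance_table(io.StringIO(EXAMPLE_TABLE))
print(samples)
print(columns)
```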

The WGCNA process can be divided into five steps, as outlined in the WGCNA tutorial: (a) choosing the soft-thresholding power; (b) calculating co-expression similarity and adjacency; (c) calculating the topological overlap matrix (TOM); (d) clustering using the TOM; and (e) merging modules whose expression profiles are very similar. Several important parameters can be customized: the network type (signed, unsigned, or signed hybrid) in steps (a) and (b), the correlation method (Spearman or Pearson) in step (b), the minimum module size (default: 5) in step (d), and the module merging threshold (i.e., 1-TOM dissimilarity, default TOM dissimilarity cutoff: 0.25) in step (e).
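
To show how the network type changes step (b), the sketch below applies the standard WGCNA adjacency definitions to a single correlation value: |cor|^power for unsigned, ((1 + cor) / 2)^power for signed, and cor^power for positive correlations (zero otherwise) for signed hybrid. It is a plain-Python illustration rather than the WGCNA R package itself; the function names and the soft-thresholding power used are illustrative, and the abundance values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def adjacency(cor, power, network_type="unsigned"):
    """Soft-thresholded adjacency for one pairwise correlation."""
    if network_type == "unsigned":
        return abs(cor) ** power
    if network_type == "signed":
        return ((1 + cor) / 2) ** power
    if network_type == "signed hybrid":
        return cor ** power if cor > 0 else 0.0
    raise ValueError(f"unknown network type: {network_type}")


# Two made-up metabolite abundance profiles across four samples.
cor = pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
for kind in ("unsigned", "signed", "signed hybrid"):
    print(kind, adjacency(cor, power=6, network_type=kind))
```

A negative correlation illustrates the difference most clearly: it contributes a strong connection in an unsigned network but a weak or zero connection in the signed variants.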

The results of WGCNA will be presented in a table.

The identified co-abundant metabolites, their potential interacting genes/proteins, and the associated interaction relationships will be presented in tables. Furthermore, the interaction data is available for download in edges-and-nodes format, enabling visualization in network analysis tools such as Cytoscape.

The results of the disease-related proteins ORA will be displayed (using Crohn's disease and arachidonic acid as an example). The Venn diagram shows the overlap between the predicted co-abundant metabolite-related proteins and the disease-related proteins. Download options for both the figure and table are provided.

The results of the enrichment analysis will be displayed in a table with a global search box at the top-right and filters under the header for domain-specific browsing. These tools enable users to focus on specific terms and view statistical details. To visualize terms of interest, tick the square icon next to each term name. The selected terms are presented as a composite diagram comprising a Sankey plot and a dot plot. The Sankey plot on the left shows connections between co-abundant metabolites, interacting genes/proteins, and selected terms. The dot plot on the right illustrates the number and proportion of predicted genes/proteins involved in the enriched terms, along with the adjusted p-values indicating statistical significance. The plot can be zoomed in or out using the controls in the lower-right corner.

3.4 The user defined branch

By selecting the fourth analytic module in Step 3, users can use the user defined branch.

This feature allows users to manually input gene names linked to metabolites. The entered genes will be incorporated as independent entities into subsequent network construction and pathway enrichment analyses.

The user defined metabolite-related proteins/genes and their interaction relationships will be presented in tables. Furthermore, the interaction data is available for download in edges-and-nodes format, enabling visualization in network analysis tools such as Cytoscape.

The results of the disease-related proteins ORA will be displayed. In this example, a warning is shown because there is no overlap between the user-defined genes/proteins and the disease-related proteins. The same warning appears if disease specification was skipped in Step 2, because the ORA cannot be conducted in that case.

The matched proteins/genes can be used to perform term enrichment analysis in Step 4. Users can select standard databases (e.g., Gene Ontology, KEGG Pathways, WikiPathways, Reactome Pathways, Disease Ontology) for enrichment analysis. If the provided databases do not meet specific needs, users have the option to upload custom databases by submitting tab-separated TERM2GENE and TERM2NAME files. These files must adhere to specific formats: TERM2GENE requires Term IDs paired with Gene IDs, while TERM2NAME maps Term IDs to descriptive names.

The results of the enrichment analysis will be displayed in a table with a global search box at the top-right and filters under the header for domain-specific browsing. These tools enable users to focus on specific terms and view statistical details. To visualize terms of interest, tick the square icon next to each term name. The selected terms are presented as a composite diagram comprising a Sankey plot and a dot plot. The Sankey plot on the left shows connections between the user-defined metabolite-related proteins/genes, interacting genes/proteins, and selected terms. The dot plot on the right illustrates the number and proportion of genes/proteins involved in the enriched terms, along with the adjusted p-values indicating statistical significance. The plot can be zoomed in or out using the controls in the lower-right corner.

2.2 Batch analysis of multiple metabolites

Step 0 select module

In Step 0, select the "Batch analysis of multiple metabolites" module to start the analysis.

Step 1 input metabolites/upload data

To perform analyses on multiple metabolites, the first step involves inputting the relevant metabolite information. We provide two options for data upload:

(i) input a list of metabolites for downstream analysis; mixed naming conventions and identification systems are allowed within the same list;

(ii) alternatively, use the significantly differentially abundant metabolites identified by the Biomarker discovery module as input for downstream analysis.

To define the metabolite biomarkers, a metabolic abundance table and a corresponding metadata file containing sample grouping information are required. The metabolite abundance table supports multiple naming conventions and identification systems, which can be used concurrently. As a reference, an example dataset including metabolomic profiles and metadata from individuals with Crohn's disease and non-IBD controls (PMID: 30531976) is available for download.

Users can upload their own files by selecting the “Browse” button to locate the desired files and clicking “Upload” to initiate the upload process.

The uploaded data will undergo transformation and scaling. Choose the appropriate methods, and then click “Proceed” to continue.
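
The transformation and scaling methods offered by the server are not listed on this page. As one plausible combination, the sketch below applies a log2 transformation with a pseudocount followed by autoscaling (mean-centering and division by the standard deviation) to the values of a single metabolite. Both choices, the pseudocount, and the function names are assumptions made for illustration.

```python
from math import log2
from statistics import mean, stdev


def log_transform(values, pseudocount=1.0):
    """log2(x + pseudocount), so that zero abundances stay defined."""
    return [log2(v + pseudocount) for v in values]


def autoscale(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    center, spread = mean(values), stdev(values)
    if spread == 0:
        # A constant metabolite carries no variation to scale.
        return [0.0 for _ in values]
    return [(v - center) / spread for v in values]


raw = [0.0, 1.0, 3.0]            # made-up abundances of one metabolite
transformed = log_transform(raw)
print(transformed)
print(autoscale(transformed))
```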

Next, choose the grouping information from the uploaded metadata and set parameters for differential abundance analysis, then click “Run” to move forward.

Now, set the statistical thresholds used to identify significantly differentially abundant metabolites, then click “Filter” to continue.

Statistical results for all metabolites will be presented in a table and are available for download. Only metabolites identified as significantly differentially abundant will be carried forward into subsequent analysis. Comprehensive summaries will be provided to describe the analytical methodology, including data structure, preprocessing techniques, and the statistical criteria used to identify differentially abundant metabolites.

Step 2 input target disease

The target disease is used to obtain the proteins associated with that disease. An over-representation analysis (ORA) is then performed to determine whether these disease-related proteins are statistically over-represented among the proteins predicted to interact with each queried metabolite. When a disease is selected from the list using its radio button (e.g., DOID:8778), the corresponding disease-related proteins are loaded into the ORA automatically.

Users can also manually enter a list of disease-related proteins in the second option, using protein names or NCBI Entrez Gene IDs.

Step 3 set parameters for the three branches

The analysis scope depends on the input selection made in Step 1:

If the input is a list of metabolite names or IDs, the server will conduct analyses on all metabolites;

If the input is derived from the Biomarker Discovery module, only those metabolites identified as significantly differentially abundant are subjected to further analysis.

Set the parameters for the three analytic branches:

The target proteins branch: Target proteins will be retrieved from the default database using a user-defined minimum confidence score.

The structurally similar metabolites branch: Metabolites with a Tanimoto score >= 90% in the default database are considered structurally similar metabolites.

The co-abundant metabolites branch: This branch provides two options to obtain metabolites that co-vary with the queried metabolite, plus an option to skip the step: (i) upload a metabolic abundance table to identify the co-abundant metabolites using Weighted Gene Co-expression Network Analysis (WGCNA), (ii) directly upload precomputed WGCNA results containing metabolites and modules, or (iii) skip this branch by selecting the “no co-abundant metabolites” option. For details on the WGCNA settings, see section 3.3 The co-abundant metabolites branch in the Analysis of a specific metabolite module.

The resulting target proteins, structurally similar metabolites, and co-abundant metabolites will each be used to retrieve potential interacting proteins/genes that pass the confidence score threshold (set in the "Interaction Network" panel). The confidence score ranges from 0 to 1, with a default threshold of 0.7.
Note: This step may take some time to complete.

If the co-abundant metabolites branch is not skipped, the results of WGCNA analysis will be presented in a tabular format.

When the significantly differentially abundant metabolites identified by the Biomarker discovery module are used as input in Step 1, a circular plot summarizing key features of these biomarkers will be displayed. The layers of the plot, from inner to outer, are described as follows:

layers 1 to 3 (blue/green/red dots) indicate whether the predicted interacting proteins of each biomarker are significantly enriched among Crohn's disease (CD)-related genes/proteins in the target proteins branch, the structurally similar metabolites branch, and the co-abundant metabolites branch, respectively;

layers 4 to 6 (heatmap) represent the statistical metrics of the biomarkers, which are Area Under the Curve (AUC), absolute value of log2 fold change, and -log10(P-value);

the outer layer represents the Variable Importance in Projection (VIP) score of each biomarker.
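
For orientation, the sketch below computes the three heatmap metrics for a single biomarker from two groups of samples. It assumes the AUC is the rank-based (Mann-Whitney) estimate and that the fold change is the ratio of group means on untransformed abundances; the server's exact definitions are not given on this page, and the function names and values are hypothetical.

```python
from math import log2, log10
from statistics import mean


def auc(cases, controls):
    """Rank-based AUC: P(case > control), counting ties as one half."""
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def abs_log2_fold_change(cases, controls):
    """Absolute log2 ratio of the group means (positive abundances assumed)."""
    return abs(log2(mean(cases) / mean(controls)))


def neg_log10_p(p_value):
    """-log10(P-value); larger values mean stronger evidence."""
    return -log10(p_value)


# Made-up abundances for one metabolite in two groups.
cases, controls = [4.0, 4.0, 5.0], [1.0, 1.0, 2.0]
print(auc(cases, controls))
print(abs_log2_fold_change(cases, controls))
print(neg_log10_p(0.01))
```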

A summary of disease-related gene statistics for each metabolite, based on ORA results from the three analytic branches, will be shown in a table. To view detailed results and perform further analysis for a specific metabolite, click the circle icon next to the corresponding metabolite name.

Users can search for a specific metabolite using the global search box positioned at the top-right of the interface. Upon locating the desired metabolite (e.g., urobilin), clicking the radio button next to its name will enable further analysis.

The target proteins branch for urobilin will show a warning: “The potential interaction genes/proteins were not found”. This indicates that no target proteins for urobilin are present in the default database.

The results of the structurally similar metabolites branch will be presented, with details as previously described (see "Analysis of a specific metabolite" module).

Similarly, the results of the co-abundant metabolites branch will be presented, with details as previously described (see "Analysis of a specific metabolite" module).

MDLink

extrapolation of potential Metabolite-Disease Associations by Mining biomedical knowledge


