If you are viewing the HTML, the notebook can be found here.
This notebook demonstrates the following:

- querying the PDC GraphQL API for a quantitative protein expression matrix
- retrieving the matching clinical metadata for the same study
- clustering samples with seaborn and annotating the clustermap with clinical attributes

The results are intended to help identify clusters of samples (tumors) displaying similar patterns of protein expression.
These are the required imports. Install them using pip, if needed.
import requests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
Next, set up the query parameters.
The first one is pdc_study_id. This example uses the ccRCC whole proteome study. PDC study identifiers are available on the portal.
pdc_study_id = 'PDC000127'
The next one is the data type. A table of data_types is available here. A description of how these values are computed can be found here. In brief, these values are log2-transformed ratio data, where the aliquot is the numerator and a pooled reference sample is the denominator. Ratios are calculated following normalization in the CPTAC Common Data Analysis Pipeline (CDAP).
data_type = 'log2_ratio' # Retrieves CDAP iTRAQ or TMT data
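As a toy illustration (the numbers below are made up, not PDC output), a log2 ratio of 2.0 means the protein is four-fold more abundant in the aliquot than in the pooled reference:
import numpy as np

# Toy numbers only, to show what a CDAP-style log2 ratio encodes
aliquot_abundance = 2000.0    # normalized abundance in the sample aliquot
reference_abundance = 500.0   # normalized abundance in the pooled reference
print(np.log2(aliquot_abundance / reference_abundance))  # 2.0, i.e., four-fold up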
Finally, we are ready to build the query string using pdc_study_id and data_type.
quant_data_query = '''
{
    quantDataMatrix(
        pdc_study_id: "''' + pdc_study_id + '''" data_type: "''' + data_type + '''" acceptDUA: true
    )
}'''
def query_pdc(query):
    # PDC API endpoint
    url = 'https://pdc.cancer.gov/graphql'
    # Send the POST graphql query
    print('Sending query.')
    pdc_response = requests.post(url, json={'query': query})
    # Check the results
    if pdc_response.ok:
        # Decode the JSON response
        return pdc_response.json()
    else:
        # Response not OK, raise the HTTP error
        return pdc_response.raise_for_status()
Now, we are ready to make the first API call.
decoded = query_pdc(quant_data_query)
# You may want to check the response for errors here
matrix = decoded['data']['quantDataMatrix']
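One way to do that check: a GraphQL server can answer with HTTP 200 and still report problems under the standard errors key, so a minimal, optional guard (our addition, not part of the original notebook) looks like this:
# Optional guard: surface GraphQL-level errors that arrive with HTTP 200
if 'errors' in decoded:
    raise RuntimeError(decoded['errors'])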
Next, load the data into a dataframe.
# Aliquots are first row, gene names are first column
ga = pd.DataFrame(matrix[1:], columns=matrix[0]).set_index('Gene/Aliquot')
oldnames = list(ga.columns)
# Columns come back as aliquot_id:aliquot_submitter_id; keep only the latter
newnames = [x.split(':')[1] for x in oldnames]
ga.rename(columns=dict(zip(oldnames, newnames)), inplace=True)
ga = ga.sort_index(axis=1)
print(ga.shape)
The first number above is the number of genes and the second is the number of samples (i.e., aliquots) in the returned data set.
Since the expression values are returned as strings, we need to convert those to floats and deal with missing data.
for col in ga.keys():
    ga[col] = pd.to_numeric(ga[col], errors='coerce')
The clustermap function in the Seaborn package does not allow NaN values, so we must fill them with a mask value that does not interfere much with the clustering and is unlikely to collide with a real measurement. No imputation is used; by choosing a value close to 0, we are treating these points as unchanged relative to the reference. Better solutions should be used to deal with missing data points in a real analysis.
mask_na = 0.000666
ga = ga.fillna(mask_na)
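For example, one common alternative (a sketch only; the 70% threshold and median imputation are arbitrary choices for illustration, not a PDC recommendation) is to drop sparsely quantified genes and impute the remainder:
# Sketch: restore the NaNs, keep genes quantified in at least 70% of
# aliquots, then fill what remains with the per-gene median
with_nans = ga.mask(ga == mask_na)
keep = with_nans.notna().mean(axis=1) >= 0.7
imputed = with_nans[keep].apply(lambda row: row.fillna(row.median()), axis=1)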
Next, build the clinical data query. This API needs only a pdc_study_id as input. We will retrieve 5 fields and use 2 of them in our clustergram.
metadata_query = '''
{
    clinicalMetadata(pdc_study_id: "''' + pdc_study_id + '''" acceptDUA: true) {
        aliquot_submitter_id
        morphology
        primary_diagnosis
        tumor_grade
        tumor_stage
    }
}
'''
Next, let's query for the clinical metadata. This will be used to annotate the columns (aliquots) so we can look for correlations between clinical attributes and sample clusters. The data are loaded into a dataframe after the query returns.
decoded = query_pdc(metadata_query)
print(len(decoded))
clin_matrix = decoded['data']['clinicalMetadata']
# clinicalMetadata returns a list of records, so pandas can infer the columns
metadata = pd.DataFrame(clin_matrix).set_index('aliquot_submitter_id')
metadata = metadata.sort_index(axis=0)
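Because seaborn matches column annotations (the col_colors built below) to the clustermap columns by index label, it is worth a quick, optional check that the metadata rows and expression columns line up:
# Optional sanity check: annotations are matched to aliquots by label
print(ga.columns.equals(metadata.index))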
We can then set up a color mapping function for the clinical annotations.
def get_colors(df, name, color) -> pd.Series:
    # Map each unique value of the named column to a shade of the given color
    s = df[name]
    su = s.unique()
    colors = sns.light_palette(color, len(su))
    lut = dict(zip(su, colors))
    return s.map(lut)
Next, call get_colors() to map the tumor_stage and primary_diagnosis attributes. Other attributes are also available.
stage_col_colors = get_colors(metadata, 'tumor_stage', 'red')
diagnosis_col_colors = get_colors(metadata, 'primary_diagnosis', 'green')
And, finally, generate the large clustermap using seaborn.clustermap.
sns.clustermap(ga, metric='euclidean', method='complete', cmap='seismic', mask=ga == mask_na, center=0.,
figsize=(12.5, 50), col_colors=[stage_col_colors, diagnosis_col_colors])
plt.show()
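If you want to keep the figure on disk, seaborn.clustermap returns a ClusterGrid object with a savefig method; a minimal sketch (the filename and dpi are arbitrary choices):
g = sns.clustermap(ga, metric='euclidean', method='complete', cmap='seismic', mask=ga == mask_na, center=0.,
                   figsize=(12.5, 50), col_colors=[stage_col_colors, diagnosis_col_colors])
g.savefig('pdc_clustermap.png', dpi=150)  # arbitrary filename and resolution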
You could also convert the log2 ratio data to a standard statistic, like the z-score. This puts every gene on a comparable scale and reduces the influence of extreme values.
# z-score each gene across aliquots; ddof=1 uses the sample standard deviation
zdf = ga.T.apply(zscore, ddof=1)
zdf = zdf.T
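A quick, optional check that the transform behaved: each gene row should now have mean close to 0 and sample standard deviation close to 1.
# Optional check: per-gene mean ~0 and sample std ~1 after z-scoring
print(zdf.mean(axis=1).abs().max())
print((zdf.std(axis=1, ddof=1) - 1).abs().max())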
And examine clustering according to that transformation.
sns.clustermap(zdf, metric='euclidean', method='complete', cmap='seismic', mask=ga == mask_na, center=0.,
col_colors=[stage_col_colors, diagnosis_col_colors], figsize=(12.5, 50))
plt.show()
Hopefully, the clustermaps have been drawn. Columns were originally labeled aliquot_id:aliquot_submitter_id, according to the PDC data model; the renaming step above kept only the aliquot_submitter_id portion. Additional work pre-processing the data (removing outliers, correcting batch effects, etc.) should always be invested in any large-scale analysis. This ends this example notebook.
We hope that you found this tutorial useful. There is also an accompanying video tutorial on the PDC site, in case you found this notebook without having seen it. Please submit any questions or requests to: PDCHelpDesk@mail.nih.gov