multitraitclustering package

Submodules

multitraitclustering.checks module

multitraitclustering.checks.df_check(var, var_name)

Raise a TypeError if var is not a pandas DataFrame.

Parameters:
  • var – The variable to be checked

  • var_name – The name of the variable for the error string.

Return: None

multitraitclustering.checks.float_check(var, var_name)

Raise a TypeError if var is not a float.

Parameters:
  • var – The variable to be checked

  • var_name – The name of the variable for the error string.

Return: None

multitraitclustering.checks.int_check(var, var_name)

Raise a TypeError if var is not an int.

Parameters:
  • var – The variable to be checked

  • var_name – The name of the variable for the error string.

Return: None

multitraitclustering.checks.str_check(var, var_name)

Raise a TypeError if var is not a string.

Parameters:
  • var – The variable to be checked

  • var_name – The name of the variable for the error string.

Return: None
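
The four checks above share one pattern: test the type, then raise a TypeError that names the offending variable. The following is a minimal sketch of that pattern, not the package's source; the wording of the error message is an assumption.

```python
def int_check(var, var_name):
    """Raise TypeError if var is not an int; return None otherwise."""
    if not isinstance(var, int):
        # The message format is illustrative only.
        raise TypeError(
            f"{var_name} should be an int, got {type(var).__name__}"
        )


# Passing the variable name lets the caller see which argument failed.
int_check(4, "nclust")
```

Calling `int_check(4.0, "nclust")` in this sketch raises a TypeError whose message contains `nclust`.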

multitraitclustering.clustering_methods module

Author: Hayley Wragg
Created: 7th February 2025
Description:

This module implements various clustering methods for multi-trait clustering analysis. It includes functions for:

  • K-means clustering

  • K-medoids clustering

  • DBSCAN clustering

  • Gaussian Mixture Model (GMM) clustering

  • Birch clustering

  • Mini-batch K-means clustering

Each function takes association data, distance data, and a results dataframe as input, performs the specified clustering algorithm, and returns a dictionary containing the updated results dataframe with cluster assignments and a dictionary of clustering parameters.

The module also includes extensive input validation to ensure data integrity and parameter correctness. It leverages helper functions from the checks, data_processing, and multi_trait_clustering submodules for data validation, distance calculations, and string formatting.

multitraitclustering.clustering_methods.birch(assoc_df, dist_df, res_df, thresh=0.5, branch_fac=50, dist_met='Euclidean')

Birch Clustering

Parameters:
  • assoc_df (pd.Dataframe) – Association with the exposure

  • dist_df (pd.Dataframe) – Distances between all data-points

  • res_df (pd.Dataframe) – Dataframe - association data and cluster results

  • thresh (float, optional) – radius of subcluster obtained by merging a sample and closest subcluster. Default 0.5.

  • branch_fac (int, optional) – Max no. of CF subclusters in each node. Default 50.

  • dist_met (str, optional) – Distance metric, either “Euclidean” (default) or “CosineDistance”.

Raises:
  • TypeError – assoc_df not a dataframe

  • TypeError – dist_df not a dataframe

  • TypeError – res_df not a dataframe

  • TypeError – thresh not a float

  • TypeError – branch_fac not an integer

  • TypeError – dist_met not a string

  • ValueError – dist_met not “Euclidean” or “CosineDistance”

  • ValueError – assoc_df does not have the same number of rows as res_df

  • ValueError – dist_df doesn’t have the same number of rows as columns.

  • ValueError – dist_df doesn’t have the same number of rows as assoc_df.

Returns:

Dictionary containing:

  • “results” – results dataframe with appended new clusters

  • “cluster_dict” – dictionary of the parameters for clustering

Return type:

dict

multitraitclustering.clustering_methods.dbscan(assoc_df, dist_df, res_df, min_s=5, eps=0.5, db_alg='auto', dist_met='Euclidean')

Compute the clusters found using the DBSCAN algorithm

Parameters:
  • assoc_df (pd.Dataframe) – Dataframe of association scores

  • dist_df (pd.Dataframe) – Distances between all data-points

  • res_df (pd.Dataframe) – Dataframe - association data and cluster results

  • min_s (int, optional) – No. of samples (or total weight) in neighborhood for a point to be considered core point. Default 5.

  • eps (float, optional) – Max distance between two samples for one to be considered as in neighborhood of other. Default 0.5

  • db_alg (str, optional) – Implementation algorithm. Defaults to “auto”.

  • dist_met (str, optional) – Distance Metric, “Euclidean”(default), or “CosineSimilarity”.

Raises:
  • TypeError – assoc_df not a dataframe

  • TypeError – dist_df not a dataframe

  • TypeError – res_df not a dataframe

  • TypeError – min_s not an integer

  • TypeError – eps not a float

  • TypeError – db_alg not a string

  • ValueError – db_alg not one of: auto, ball_tree, kd_tree, brute

  • ValueError – dist_met is not “Euclidean” or “CosineSimilarity”

  • ValueError – the dimensions of the dataframes don’t match.

Returns:

Dictionary containing:

  • “results” – results dataframe with appended new clusters

  • “cluster_dict” – dictionary of the parameters for this clustering

Return type:

dict

multitraitclustering.clustering_methods.gmm(assoc_df, dist_df, res_df, n_comps=4, rand_st=0, max_iter=100, cov_type='diag', in_pars='random_from_data', dist_met='CosineDistance')

Gaussian Mixture Model clustering

Parameters:
  • assoc_df (pd.Dataframe) – Association with the exposure

  • dist_df (pd.Dataframe) – Distances between all data-points

  • res_df (pd.Dataframe) – Dataframe - association data and cluster results

  • n_comps (int, optional) – Number of components. Defaults to 4.

  • rand_st (int, optional) – Random seed. Defaults to 0.

  • max_iter (int, optional) – Number of EM iterations. Defaults to 100.

  • cov_type (str, optional) – Covariance type. Defaults to “diag”.

  • in_pars (str, optional) – Method for initialising the weights. Default “random_from_data”.

  • dist_met (str, optional) – Distance metric, either “Euclidean” or “CosineDistance” (default).

Raises:
  • TypeError – assoc_df not a dataframe

  • TypeError – dist_df not a dataframe

  • TypeError – res_df not a dataframe

  • TypeError – n_comps not an integer

  • TypeError – rand_st not an integer

  • TypeError – max_iter not an integer

  • TypeError – cov_type not a string

  • TypeError – dist_met not a string

  • TypeError – in_pars not a string

  • ValueError – cov_type not one of: “full”, “tied”, “diag”, “spherical”

  • ValueError – in_pars not one of: “kmeans”, “k-means++”, “random”, “random_from_data”

  • ValueError – dist_met not one of “CosineSimilarity” or “Euclidean”

  • ValueError – dataframes are not compatible dimensions

Returns:

Dictionary containing:

  • “results” – results dataframe with appended new clusters

  • “cluster_dict” – dictionary of the parameters for this clustering

Return type:

dict

multitraitclustering.clustering_methods.kmeans(assoc_df, dist_df, res_df, nclust=4, rand_st=240, n_in=50, init_km='k-means++', iter_max=300, kmeans_alg='lloyd', dist_met='Euclidean')

Compute the clusters found using the K-means algorithm

Parameters:
  • assoc_df (pd.Dataframe) – Association with the exposure

  • dist_df (pd.Dataframe) – Distances between all data-points

  • res_df (pd.Dataframe) – Dataframe - association data and cluster results

  • nclust (int, optional) – Number of desired clusters. Defaults to 4.

  • rand_st (int, optional) – Random number initialisation. Defaults to 240.

  • n_in (int, optional) – Number of random initialisations. Defaults to 50.

  • init_km (str, optional) – Method for initialising the clusters. Defaults to “k-means++”.

  • iter_max (int, optional) – Max no. iterations if no cluster convergence. Defaults to 300.

  • kmeans_alg (str, optional) – Implementation algorithm. Default “lloyd”.

  • dist_met (str, optional) – Distance metric, “Euclidean” (default), or “CosineSimilarity”

Raises:
  • TypeError – assoc_df is not a dataframe

  • TypeError – dist_df is not a dataframe

  • TypeError – res_df is not a dataframe

  • TypeError – nclust is not an integer

  • TypeError – rand_st is not an integer

  • TypeError – init_km is not an integer

  • TypeError – init_km is not a string

  • TypeError – iter_max is not an integer

  • TypeError – kmeans_alg is not a string

  • TypeError – dist_met is not a string

  • ValueError – n_in is not an integer or auto

  • ValueError – init_km is not one of: k-means++, random or an array

  • ValueError – kmeans_alg is not one of: “lloyd”, “elkan”

  • ValueError – dist_met is not “Euclidean” or “CosineSimilarity”

Returns:

Dictionary containing:

  • “results” – results dataframe with appended new clusters

  • “cluster_dict” – dictionary of parameters for clustering iteration

Return type:

dict

multitraitclustering.clustering_methods.kmeans_minibatch(assoc_df, dist_df, res_df, nclust=4, rand_st=240, n_in=50, batch_size=30, iter_max=300, dist_met='Euclidean')

Kmeans Minibatch clustering

Parameters:
  • assoc_df (pd.Dataframe) – Dataframe for exposure data. rows are the snps, columns are the trait axes.

  • dist_df (pd.Dataframe) – Dataframe with the distance between all points. rows and columns are snps, values are the distances between the pairs.

  • res_df (pd.Dataframe) – Dataframe with the clustering results. rows are snps, columns clustering methods

  • nclust (int) – No. clusters to be assigned. Default 4

  • rand_st (int) – Random seed. Defaults to 240

  • n_in (int) – No. random initialisations. Default 50.

  • batch_size (int) – No. points in each batch. Default 30

  • iter_max (int) – Max no. iterations. Default 300.

  • dist_met (str) – Distance Metric. Default “Euclidean”. Must be either “Euclidean”, or “CosineDistance”

Raises:
  • TypeError – assoc_df not a dataframe

  • TypeError – dist_df not a dataframe

  • TypeError – res_df not a dataframe

  • TypeError – nclust not an integer

  • TypeError – batch_size not an integer

  • TypeError – iter_max not an integer

  • TypeError – dist_met not a string

  • ValueError – dist_met not one of “CosineSimilarity” or “Euclidean”

  • ValueError – the dataframes do not have compatible dimensions

Returns:

Dictionary containing:

  • “results” – results dataframe with appended new clusters

  • “cluster_dict” – dictionary of the parameters for clustering

Return type:

dict

multitraitclustering.clustering_methods.kmedoids(assoc_df, dist_df, res_df, nclust=4, rand_st=240, init_kmed='k-medoids++', iter_max=300, kmedoids_alg='alternate', dist_met='Euclidean')

Compute the clusters found using the K-medoids algorithm

Parameters:
  • assoc_df (pd.Dataframe) – Association with the exposure

  • dist_df (pd.Dataframe) – Distances between all data-points

  • res_df (pd.Dataframe) – Dataframe - association data and cluster results

  • nclust (int, optional) – Number of desired clusters. Defaults to 4.

  • rand_st (int, optional) – Random number initialisation. Defaults to 240.

  • init_kmed (str, optional) – Method for initialising the clusters. Default “k-medoids++”.

  • iter_max (int, optional) – Max no. iterations if no cluster convergence. Defaults to 300.

  • kmedoids_alg (str, optional) – Implementation algorithm. Defaults to “alternate”.

  • dist_met (str, optional) – Distance Metric “Euclidean”(default), or “CosineSimilarity”.

Raises:
  • TypeError – If inputs are of the incorrect type

  • ValueError – If init_km is not one of: k-means++, random or array

  • ValueError – If kmeans_alg is not one of: “lloyd”, “elkan”

  • ValueError – If dist_met is not “Euclidean” or “CosineSimilarity”

Returns:

Dictionary containing:

  • “results” – The results dataframe with appended new clusters

  • “cluster_dict” – The dictionary of the parameters for clustering

Return type:

dict

multitraitclustering.data_processing module

Author: Hayley Wragg
Created: 6th February 2025
Description:

This module provides functions for processing and comparing clustering results.

Functions:

  • compare_results_list_to_external(clust_df, external_df, external_lab) – Compare clustering results to an external method and return a dictionary of DataFrames corresponding to each clustering method.

  • centroid_distance(cents, data, membership, metric="euc") – Calculate the distance between points in a cluster and the centroid.

  • calc_medoids(data, data_dist, membership) – Calculate the coordinates of the medoids for each cluster.

  • overlap_score(comp_percent_df) – Compute overlap score from percentage overlaps between cluster pairings.

  • overlap_pairs(comp_percent_df, meth_lab, meth_sec_lab="paper") – Find cluster labels for best match pairs between clustering methods.

  • calc_per_from_comp(comp_vals) – Calculate percentage overlap between clusters from the no. of points in the intersection.

multitraitclustering.data_processing.calc_medoids(data, data_dist, membership)

Calculate the co-ordinates of the medoids for each cluster.

The medoid is the point with the minimal total distance to the other points in the cluster. It is calculated by finding, for each point, the total distance to the other points in its cluster, then returning the point whose total is the minimum. Unlike a centroid, the medoid is always a point that belongs to the cluster.

Parameters:
  • data (pd.Dataframe) – The original data used to create the clusters. rows are the points, columns are the axes.

  • data_dist (pd.Dataframe) – The distance between all pairs of SNPs. rows and columns are the SNPs and cell-values are the distance.

  • membership (pd.Series) – A column of the cluster results. The index gives the SNPs and the values are the cluster labels assigned to them.

Returns:

Each row corresponds to the medoid for the corresponding cluster. The columns correspond to the data axes.

Return type:

medoids_df (pd.Dataframe)
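
The medoid rule described above can be sketched with plain dictionaries. This is an illustration of the technique rather than the package's implementation: the name `medoid_labels` and the toy distances are hypothetical, and the sketch returns the label of each medoid, whereas calc_medoids returns its coordinates.

```python
def medoid_labels(dist, membership):
    """Return {cluster label: label of the medoid point}.

    dist: dict of dicts, dist[a][b] is the distance between points a and b.
    membership: dict mapping each point label to its cluster label.
    """
    # Group the point labels by cluster.
    clusters = {}
    for point, label in membership.items():
        clusters.setdefault(label, []).append(point)

    medoids = {}
    for label, points in clusters.items():
        # Total distance from each candidate to the points in its cluster.
        totals = {p: sum(dist[p][q] for q in points) for p in points}
        # The medoid is the candidate with the smallest total.
        medoids[label] = min(totals, key=totals.get)
    return medoids


# Toy symmetric distance table for four SNPs (made-up values).
dist = {
    "snp1": {"snp1": 0.0, "snp2": 1.0, "snp3": 3.0, "snp4": 9.0},
    "snp2": {"snp1": 1.0, "snp2": 0.0, "snp3": 1.5, "snp4": 8.0},
    "snp3": {"snp1": 3.0, "snp2": 1.5, "snp3": 0.0, "snp4": 7.0},
    "snp4": {"snp1": 9.0, "snp2": 8.0, "snp3": 7.0, "snp4": 0.0},
}
membership = {"snp1": 0, "snp2": 0, "snp3": 0, "snp4": 1}
medoids = medoid_labels(dist, membership)  # {0: "snp2", 1: "snp4"}
```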

multitraitclustering.data_processing.calc_per_from_comp(comp_vals)

Calculate the percentage overlap between clusters from the number of points in the intersection.

The percentage is calculated by taking the number of points in the intersection of the clustering methods (the cell values) and dividing by the number of points in the union (the sum of the number of points in the full column and full row).

Parameters:
  • comp_vals (pd.Dataframe) – Rows are the clusters for the first clustering method, columns are the clusters for the second clustering method, and cell values are the number of points in both.

Returns:

Rows are the clusters for the first clustering method, columns are the clusters for the second clustering method, and cell values are the percentage of points in both.

Return type:

comp_out (pd.Dataframe)
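
A worked sketch of the intersection-over-union calculation on a small table of counts. Two points here are assumptions rather than documented behaviour: the union is taken as row total plus column total minus the intersection, so that shared points are counted once, and the result is scaled to the range 0 to 100. The name `percent_overlap` is hypothetical.

```python
def percent_overlap(counts):
    """Convert a grid of intersection counts into percentage overlaps."""
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    out = []
    for i, row in enumerate(counts):
        out_row = []
        for j, both in enumerate(row):
            # Assumed union: points in either cluster, counted once.
            union = row_tot[i] + col_tot[j] - both
            out_row.append(100.0 * both / union if union else 0.0)
        out.append(out_row)
    return out


# Rows: clusters of method 1. Columns: clusters of method 2.
counts = [[3, 1], [0, 4]]
percents = percent_overlap(counts)  # [[75.0, 12.5], [0.0, 80.0]]
```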

multitraitclustering.data_processing.centroid_distance(cents, data, membership, metric='euc')

Calculate the distance between points in a cluster and the centroid

Parameters:
  • cents (pd.Dataframe) – rows correspond to the cluster labels, the columns are the axes labels for the data-space and the values form the position of the centre

  • data (pd.Dataframe) – rows are snps giving the individual data-points, the columns are the axes labels for the data-space and the values form the positions of the data-point

  • membership (pd.Series) – indexes are snps, values are the cluster labels

  • metric (str, optional) – String to indicate which metric to use. Defaults to “euc” for the Euclidean distance.

Returns:

The distance between each data-point and the cluster centre for the cluster it is assigned to.

Return type:

distance_df (pd.Dataframe)
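
For the Euclidean case, the calculation amounts to looking up each point's assigned centre and measuring the straight-line distance to it. A minimal sketch with plain dictionaries in place of DataFrames; the name `distance_to_centroid` and the toy coordinates are hypothetical.

```python
import math


def distance_to_centroid(cents, data, membership):
    """Return {point label: Euclidean distance to its assigned centre}."""
    return {
        point: math.dist(coords, cents[membership[point]])
        for point, coords in data.items()
    }


cents = {0: (0.0, 0.0), 1: (10.0, 10.0)}            # cluster label -> centre
data = {"snp1": (3.0, 4.0), "snp2": (10.0, 12.0)}   # snp -> position
membership = {"snp1": 0, "snp2": 1}                 # snp -> cluster label
distances = distance_to_centroid(cents, data, membership)
# snp1 is 5.0 from centre 0, snp2 is 2.0 from centre 1
```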

multitraitclustering.data_processing.compare_results_list_to_external(clust_df, external_df, external_lab)

Compare clustering results to an external method.

Take all clustering results in clust_df and compare to clusters in external_df given by external_lab. Computes the comparison as the percentage of the number of points in the intersection over the union of cluster pairs. Return a dictionary of Dataframes corresponding to each clustering method.

Parameters:
  • clust_df (pd.DataFrame) – Main dataframe containing all the computed clustering results.

  • external_df (pd.DataFrame) – External dataframe to compare the cluster membership to.

  • external_lab (str, int, float) – Label for the column in external_df to use for comparison results.

Raises:
  • TypeError – clust_df is not a dataframe

  • TypeError – external_df is not a dataframe

  • TypeError – external_lab is not a string

  • ValueError – external_lab is not an available column in external_df

  • ValueError – no overlapping indices between clust_df and external_df

Returns: Dictionary with terms
  • comp_dfs, dictionary of Dataframes with comparison percentage grids

  • cluster_matchings, dictionary of best match cluster pairs Dataframes

  • overlap_score, dictionary of the overlap scores

multitraitclustering.data_processing.long_df_from_p_cnum_arr(a_mat, row_lab_dict, col_lab_dict, score_lab, row_lab='pathway', col_lab='ClusterNumber')

Convert a matrix to a long format DataFrame.

Parameters:
  • a_mat (np.ndarray) – A matrix of values to convert to a long format DataFrame.

  • row_lab_dict (dict) – A dictionary mapping row indices to row labels.

  • col_lab_dict (dict) – A dictionary mapping column indices to column labels.

  • score_lab (str) – The name to use for the column containing the values in the matrix.

  • row_lab (str, optional) – The name to use for the column containing the row labels. Defaults to “pathway”.

  • col_lab (str, optional) – The name to use for the column containing the column labels. Defaults to “ClusterNumber”.

Returns:

A long format DataFrame with columns for row labels, column labels, and values.

Return type:

pd.DataFrame
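
The wide-to-long conversion can be sketched as one record per matrix cell. The sketch below uses nested lists and returns a list of dictionaries instead of a DataFrame; `long_rows`, the score label and the toy labels are hypothetical.

```python
def long_rows(a_mat, row_lab_dict, col_lab_dict, score_lab,
              row_lab="pathway", col_lab="ClusterNumber"):
    """Return one record per cell: row label, column label, value."""
    return [
        {row_lab: row_lab_dict[i], col_lab: col_lab_dict[j], score_lab: value}
        for i, row in enumerate(a_mat)
        for j, value in enumerate(row)
    ]


a_mat = [[0.1, 0.2], [0.3, 0.4]]
records = long_rows(
    a_mat,
    row_lab_dict={0: "pathway_a", 1: "pathway_b"},
    col_lab_dict={0: 1, 1: 2},
    score_lab="score",
)
# A 2 x 2 matrix gives 4 records, one per (pathway, cluster) pair.
```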

multitraitclustering.data_processing.overlap_pairs(comp_percent_df, meth_lab, meth_sec_lab='paper')

Find cluster labels for best match pairs between clustering methods

Parameters:
  • comp_percent_df (pd.Dataframe) – Rows are clusters for first method. Columns are clusters for second method. cell value is no. of points in intersection of the cluster pairs divided by the no. of points in the union.

  • meth_lab (string) – Label clustering method

  • meth_sec_lab (string) – Label second cluster method, default = “paper”

Returns:

  • Col 1 – clusters for the first clustering method

  • Col 2 – clusters for the second clustering method which best match the clusters from the first

  • Col 3 – number of points in the intersection between the clustering methods divided by the number in the union

Return type:

clust_match_df (pd.Dataframe)

multitraitclustering.data_processing.overlap_score(comp_percent_df)

Compute overlap score from percentage overlaps between cluster pairings.

Find the best matching between the cluster methods by finding the largest percentage overlap for each column. This gives the clusters in method 1 which best match the clusters in method 2.

Parameters:

comp_percent_df (pd.Dataframe) – Rows are clusters for the first method. Columns are clusters for the second method. Each cell value is the number of points in the intersection of the cluster pairs divided by number of points in the union

Returns:

The mean of the overlap for the best matches.

Return type:

overlap_score (float)
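
As described above, the score is the mean of the per-column maxima. A minimal sketch on a grid of percentages, using nested lists in place of a DataFrame; `mean_best_overlap` is a hypothetical name and the values are made up.

```python
def mean_best_overlap(percents):
    """Mean of the largest overlap in each column of the grid."""
    # zip(*percents) iterates over columns (clusters of the second method).
    best = [max(col) for col in zip(*percents)]
    return sum(best) / len(best)


# Rows: clusters of method 1. Columns: clusters of method 2.
percents = [[75.0, 12.5], [0.0, 80.0]]
score = mean_best_overlap(percents)  # (75.0 + 80.0) / 2 = 77.5
```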

multitraitclustering.data_setup module

multitraitclustering.data_setup.calc_sum_sq(c_num, data)

Calculate the sum of the squared distances to the cluster centre for the cluster c_num.

\[
s = \sum_{\forall p_i \in \text{cluster } n} \| p_i - c_n \|^2,
\]

where c_n is the cluster centre for cluster n.

Parameters:
  • c_num (string) – cluster label

  • data (pd.Dataframe) – Rows correspond to snps. Contains the columns clust_num, indicating the cluster label for each snp, and clust_dist, the distance between each snp and the cluster centre.

Returns:

s (float): the sum of the squares of the distances between the snps and the cluster centre for cluster c_num.
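
A sketch of the same sum using a list of records in place of the DataFrame. The column names clust_num and clust_dist follow the description above; the function name `cluster_sum_sq` and the data are hypothetical.

```python
def cluster_sum_sq(c_num, rows):
    """Sum of squared centre distances over the points in cluster c_num."""
    return sum(r["clust_dist"] ** 2 for r in rows if r["clust_num"] == c_num)


rows = [
    {"clust_num": "a", "clust_dist": 1.0},
    {"clust_num": "a", "clust_dist": 2.0},
    {"clust_num": "b", "clust_dist": 3.0},
]
s_a = cluster_sum_sq("a", rows)  # 1.0**2 + 2.0**2 = 5.0
s_b = cluster_sum_sq("b", rows)  # 3.0**2 = 9.0
```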

multitraitclustering.data_setup.compare_df1_to_df2(clust_df1, clust_df2, lab1, lab2)

The function compare_df1_to_df2 compares the cluster assignments between two clustering methods. It creates a numpy array res_arr which represents the number of points in the intersection between clusters.

In res_arr the rows correspond to the clusters in clust_df1 and the columns the clusters in clust_df2. The value in each cell of the array is the number of points in cluster i for the first method and cluster j for the second method.

Parameters:
  • clust_df1 (pd.Dataframe) – The column lab1 of clust_df1 gives the cluster label assigned to each snp.

  • clust_df2 (pd.Dataframe) – The column lab2 of clust_df2 gives the cluster label assigned to each snp.

  • lab1 (string) – The name of the column of clust_df1 which gives the clusters assigned to each snp.

  • lab2 (string) – The name of the column of clust_df2 which gives the clusters assigned to each snp.

Returns:

Array with values corresponding to the number of points in the intersection of the clusters between clustering methods.

Return type:

res_arr (numpy array)
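
The intersection counts form a contingency table between the two sets of labels. A minimal sketch, assuming each clustering is given as a dictionary from snp to cluster label and that rows and columns follow sorted label order; both are assumptions, as is the name `count_intersections`.

```python
def count_intersections(labels1, labels2):
    """Count points in cluster i of method 1 and cluster j of method 2."""
    rows = sorted(set(labels1.values()))
    cols = sorted(set(labels2.values()))
    counts = [[0] * len(cols) for _ in rows]
    # Only snps present in both clusterings contribute.
    for snp in labels1.keys() & labels2.keys():
        i = rows.index(labels1[snp])
        j = cols.index(labels2[snp])
        counts[i][j] += 1
    return counts


labels1 = {"snp1": 0, "snp2": 0, "snp3": 1, "snp4": 1}
labels2 = {"snp1": "a", "snp2": "b", "snp3": "b", "snp4": "b"}
counts = count_intersections(labels1, labels2)  # [[1, 1], [0, 2]]
```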

multitraitclustering.data_setup.compute_pca(eff_df, n_components=2)

Compute the principal components from the association data.

Parameters:
  • eff_df (pd.Dataframe) – The association dataframe, rows are snps, columns are traits, cell-values are association values.

  • n_components (int) – The number of principal components to reduce to. Defaults to 2, which finds the two most dominant principal components for the vector space.

Returns:

Rows are the traits, columns correspond to the principal component.

Return type:

pd_df (pd.Dataframe)

multitraitclustering.data_setup.load_association_data(path_dir, eff_fname='/transformed_eff_dfs.csv', exp_fname='/transformed_stdBeta_EXP.csv')

Load the beta association scores and standard errors. Any transformations for dealing with negative values or normalisation should already have been performed on this data.

Parameters:
  • path_dir (string) – Location of the data files.

  • eff_fname (string) – Name of the csv file for the association scores. default: “/transformed_eff_dfs.csv”

  • exp_fname (string) – Name of the csv file for the exposure association values. default: “/transformed_stdBeta_EXP.csv”

Returns:

Dictionary containing the dataframe for the association values and the standard errors, out = {“eff_df”: eff_df, “exp_df”: stdBeta_EXP_df}, where:

  • eff_df – The association dataframe; rows are snps, columns are traits, cell-values are association values.

  • stdBeta_EXP_df – The exposure dataframe; rows are snps, with a column for the exposure trait, cell-values are association values.

Return type:

out (dict)

multitraitclustering.data_setup.mat_dist(df, met='euclidean')

Creates a matrix corresponding to the distances between all points in the dataframe.

Parameters:
  • df (pd.Dataframe) – Rows correspond to data-points

  • met (str, optional) – Label for distance metric. Defaults to “euclidean”.

Returns:

Array of distances between data-points.

Return type:

dist_matrix (np.ndarray)
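
For the Euclidean metric, the distance matrix holds the straight-line distance between every pair of rows. A standard-library sketch of that idea; `pairwise_distances` is a hypothetical name and the package itself works on a DataFrame.

```python
import math


def pairwise_distances(points):
    """Return a square matrix of Euclidean distances between all points."""
    return [[math.dist(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist_matrix = pairwise_distances(points)
# The matrix is symmetric with zeros on the diagonal.
```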

multitraitclustering.multi_trait_clustering module

Author: Hayley Wragg
Created: 10th December 2024
Description:

This module provides functions for performing multi-trait clustering analysis. It includes functions for:

  • Calculating the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for evaluating clustering results.

  • Applying various clustering methods to association data.

  • Generating descriptive strings for clustering methods based on their parameters.

multitraitclustering.multi_trait_clustering.cluster_all_methods(exp_df, assoc_df)

Apply all clustering methods to association data and return clusters

Parameters:
  • exp_df (pd.Dataframe) – Association score with the exposure

  • assoc_df (pd.Dataframe) – Association score with traits. Normalised.

Returns:

Dictionary containing:

  • “clust_pars_dict” – dictionary of cluster parameters, with keys matching the cluster labels

  • “clust_results” – Dataframe with rows corresponding to snps and columns to the cluster methods

Return type:

out_dict (dict)

Raises:
  • TypeError – If exp_df or assoc_df are not pandas DataFrames.

  • KeyError – If the ‘EXP’ column is not present in exp_df.

  • ValueError – If the number of rows in assoc_df and exp_df do not match.

  • ValueError – If the keys between the clustering method parameter labels and clustering results do not match.

multitraitclustering.multi_trait_clustering.get_aic(clust_mem_dist, dimension=2)

Calculate the Akaike Information Criterion (AIC)

Parameters:
  • clust_mem_dist (pd.Dataframe) – Distance from each point to its cluster centroid. Rows are data-points; has columns clust_dist and clust_num.

  • dimension (int) – The dimension of each data-point. Default 2.

\[
\begin{aligned}
k &= \text{number of clusters} \\
n &= \text{number of points} \\
sse &= \text{variance of the cluster distances} \\
\text{log\_likelihood} &= -\frac{n}{2} \log\left(\frac{sse}{n}\right) \\
aic &= 2 k (\text{dimension} + 1) - 2\, \text{log\_likelihood}
\end{aligned}
\]

Returns:

aic (float): value for the AIC
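
The AIC expression can be evaluated directly once k, n and sse are known. The sketch below takes those three quantities as arguments, because how the package derives them from clust_mem_dist is not shown here; `aic_from_summary` and the example numbers are hypothetical.

```python
import math


def aic_from_summary(k, n, sse, dimension=2):
    """AIC = 2 * k * (dimension + 1) - 2 * log_likelihood."""
    log_likelihood = -n / 2 * math.log(sse / n)
    return 2 * k * (dimension + 1) - 2 * log_likelihood


# Two clusters, ten points, sse of 5.0 (made-up numbers).
aic = aic_from_summary(k=2, n=10, sse=5.0)
# The penalty term is 2 * 2 * 3 = 12; -2 * log_likelihood is 10 * log(0.5).
```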

multitraitclustering.multi_trait_clustering.get_bic(clust_mem_dist, n_params=-1, dimension=2)

Calculate the Bayesian Information Criterion (BIC)

Parameters:
  • clust_mem_dist (pd.Dataframe) – Distance from each point to its cluster centroid. Indexed with the data-point label; has columns clust_dist and clust_num.

  • n_params (int) – Number of parameters estimated. If -1, use the number of clusters. Default -1.

  • dimension (int) – The dimension of each data-point. Default 2.

\[
\begin{aligned}
k &= \text{number of clusters} \\
n &= \text{number of points} \\
sse &= \text{variance of the cluster distances} \\
\text{log\_likelihood} &= -\frac{n}{2} \log\left(\frac{sse}{n}\right) \\
bic &= \log(n)\, k\, (\text{dimension} + 1) - 2\, \text{log\_likelihood}
\end{aligned}
\]

Returns:

bic (float): value for BIC
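
The BIC differs from the AIC only in its penalty term, which scales with log(n) rather than 2. A sketch that takes k, n and sse as arguments, since their derivation from clust_mem_dist is not shown here; `bic_from_summary` and the example numbers are hypothetical.

```python
import math


def bic_from_summary(k, n, sse, dimension=2):
    """BIC = log(n) * k * (dimension + 1) - 2 * log_likelihood."""
    log_likelihood = -n / 2 * math.log(sse / n)
    return math.log(n) * k * (dimension + 1) - 2 * log_likelihood


# Two clusters, ten points, sse of 5.0 (made-up numbers).
bic = bic_from_summary(k=2, n=10, sse=5.0)
# The penalty term is 6 * log(10); -2 * log_likelihood is 10 * log(0.5).
```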

multitraitclustering.multi_trait_clustering.method_string(meth_str, alg_str, dist_str, num)

Create a string describing the clustering method and its parameters.

The string is built as method + algorithm type + distance type + number, in title-case words.

multitraitclustering.plotting_funcs module

Author: Hayley Wragg
Created: 6th February 2025
Description:

This module provides a collection of functions for visualizing clustering results using Altair. It includes functions for creating scatter plots of clusters, generating multiple scatter plots with a fixed x-axis, comparing clustering methods using heatmaps, and visualizing cluster pathways. The module relies on the Altair library to create visualizations and on Pandas for data manipulation.

multitraitclustering.plotting_funcs.chart_cluster_compare(data_array, xlabels, ylabels, x_lab, y_lab, z_lab, text_precision='.0f')

Heatmap of the comparison percentage overlap of two clustering methods

Parameters:
  • data_array (numpy array) – Comparison data - long form - x_lab (clust_no. for method 1), y_lab (clust_no for method 2), z_lab - no. in intersection/ no. in union

  • xlabels (list of strings or array of numbers) – Cluster labels for the columns

  • ylabels (list of strings or array of numbers) – Cluster labels for the rows

  • x_lab (string) – Label to get the x data from

  • y_lab (string) – Label to get the y data from

  • z_lab (string) – Label for the column containing the overlap percentage

  • text_precision (str, optional) – Precision for printing numeric variables. Defaults to “.0f”.

Returns:

An Altair chart object representing the heatmap.

Return type:

alt.Chart

multitraitclustering.plotting_funcs.chart_cluster_pathway(data_array, x_lab, y_lab, z_lab, title_str, text_precision='.0f')

Generates a heatmap-like chart using Altair to visualize a cluster pathway. The chart displays the relationship between three variables (x, y, and z) from the input array. It uses rectangles to represent the combinations of x and y, with the color of the rectangle indicating the value of z. Text annotations are added to each rectangle to display the z value.

Parameters:
  • data_array – A pandas DataFrame or similar data structure that can be processed by Altair. It should contain columns corresponding to x_lab, y_lab, and z_lab.

  • x_lab (str) – The name of the column in data_array to be used for the x-axis.

  • y_lab (str) – The name of the column in data_array to be used for the y-axis.

  • z_lab (str) – The name of the column in data_array to be used for color and annotation.

  • title_str (str) – The title of the chart.

  • text_precision – format string to control precision of the text annotation. Defaults to “.0f” (no decimal places).

multitraitclustering.plotting_funcs.chart_clusters(data, title, color_var, tooltip, col1='pc_1', col2='pc_2')

Plots the data as a scatter plot coloured by the cluster labels.

The default is to use columns labeled pc_1 and pc_2 indicating the dominant principal components. However for smaller dimensions you may wish to use explicit axes. Set these with col1 = ‘x_label’, col2 = ‘y_label’.

Parameters:
  • data – Data including the clusters and original source.

  • title – plot title.

  • color_var – label for the variable to group colours by.

  • tooltip – list of labels for data to show when hovered.

  • col1 – column label for the x-axis, defaults to “pc_1”.

  • col2 – column label for the y axis, defaults to “pc_2”.

Returns:

An altair chart object.

multitraitclustering.plotting_funcs.chart_clusters_multi(data, title, color_var, tooltip, xcol=None, col_list=None)

Generates multiple scatter plots with a fixed x-axis and varying y-axes, colored by cluster labels.

This function iterates through a list of columns, creating a scatter plot for each column against a fixed x-axis. The plots are colored according to a specified variable, typically cluster assignments.

Parameters:
  • data (pd.DataFrame) – Data including the clusters and original source.

  • title (str) – Plot title.

  • color_var (str) – Label for the variable to group colors by.

  • tooltip (list) – List of labels for data to show when hovered.

  • xcol (str, optional) – Label for fixed x column. Defaults to first column in the data.

  • col_list (list, optional) – List of columns for the y-axis. Defaults to [].

Returns:

A dictionary where keys are column names from col_list and values are corresponding Altair chart objects.

Return type:

dict

multitraitclustering.plotting_funcs.overlap_bars(df, xlab, ylab, group_lab, sep_lab, title)

Generates an Altair bar chart with overlapping bars, faceted by group.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing the data to plot.

  • xlab (str) – col in DataFrame for the x-axis, categorical data.

  • ylab (str) – col in DataFrame for the y-axis, quantitative data.

  • group_lab (str) – col in DataFrame for chart groups, categorical data.

  • sep_lab (str) – col in DataFrame to separate bars within groups, categorical data.

  • title (str) – The title of the chart.

Returns:

An Altair chart object representing the overlapping bar chart. The x-axis labels are hidden. The column headers are rotated and aligned for better readability.

Return type:

alt.Chart

multitraitclustering.plotting_funcs.pathway_bars(df, xlab, ylab, group_lab, max_val, title)

Generates a bar chart visualizing pathway data.

Parameters:
  • df – DataFrame containing the data for the chart. Must contain columns corresponding to xlab, ylab, and group_lab.

  • xlab (str) – Name of the column to use for the x-axis (pathway names).

  • ylab (str) – Name of the column to use for the y-axis (pathway values).

  • group_lab (str) – Name of the column to use for grouping the bars into separate subplots.

  • max_val (float) – Max value, y-vals greater than or equal to this value will be colored green.

  • title (str) – Title of the chart.

Returns:

An Altair bar chart object. The chart is interactive, allowing for zooming and panning.

Return type:

alt.Chart

multitraitclustering.string_funcs module

multitraitclustering.string_funcs.format_strings(test_str)

Replace digits with words and remove punctuation and spaces

Parameters:

test_str (string) – string to be formatted

Returns:

formatted string

Return type:

test_str (string)

multitraitclustering.string_funcs.num_to_word(num_orig)

Convert integer numbers into their word equivalents.

Parameters:

num_orig (int) – Number to be converted

Returns:

Space free string describing the number in title case.

Return type:

num_word (string)
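
A sketch of the number-to-word conversion described above, producing a space-free string in title case. It is an illustration only: it covers 0 to 99, and the exact output format of the package (for example “FortyTwo”) is an assumption.

```python
ONES = [
    "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
    "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen",
    "Sixteen", "Seventeen", "Eighteen", "Nineteen",
]
TENS = ["", "", "Twenty", "Thirty", "Forty", "Fifty",
        "Sixty", "Seventy", "Eighty", "Ninety"]


def num_to_word_sketch(num):
    """Return a space-free, title-case word for an integer from 0 to 99."""
    if not 0 <= num < 100:
        raise ValueError("this sketch only covers 0 to 99")
    if num < 20:
        return ONES[num]
    tens, ones = divmod(num, 10)
    # Append the units word only when the units digit is non-zero.
    return TENS[tens] + (ONES[ones] if ones else "")


word = num_to_word_sketch(42)
```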

Module contents