diff --git a/CHANGELOG.md b/CHANGELOG.md index ac7c3710..7b1d0aee 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,219 +9,217 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added -* Add minimum marker count `colocalization_min_marker_count` parameter to calculate colocalization score. -* Add `density_scatter_plot` function to make two-marker abundance scatter plots with pseudo-density coloring. -* Add `wpmds` option in `pmds_layout` to compute edge weighted layouts. This is now set as the default layout algorithm. -* Add `dsb_normalization` function for normalization of marker abundance. -* Add a `Fraction of Outlier Cells` metric to the QC report. -* Add a `Panel Version` metadata field to the QC report. +- Add minimum marker count `colocalization_min_marker_count` parameter to calculate colocalization score. +- Add `density_scatter_plot` function to make two-marker abundance scatter plots with pseudo-density coloring. +- Add `wpmds` option in `pmds_layout` to compute edge weighted layouts. This is now set as the default layout algorithm. +- Add `dsb_normalization` function for normalization of marker abundance. +- Add a `Fraction of Outlier Cells` metric to the QC report. +- Add a `Panel Version` metadata field to the QC report. ### Changed -* The default transformation for the calculation of the colocalization score is now `rate-diff` instead of `log1p`. -* Rename `edge_rank_plot` function to `molecule_rank_plot`. +- The default value for `normalize_counts` in `local_g` is now `False` instead of `True`. +- The default transformation for the calculation of the colocalization score is now `rate-diff` instead of `log1p`. +- Rename `edge_rank_plot` function to `molecule_rank_plot`. ### Fixed -* Fix a bug where `a_pixels_per_b_pixel` summary statistics where equal to the `b_pixels_per_a_pixel` statistics. -* `collapse` will return exit code 137 when one of the child processes is killed by the system (e.g. because it is - to much memory). This allows e.g. Nextflow to retry the process with more memory automatically. -* Hide the `Sample Description` metadata field in the QC report when no value is available. -* Fix an issue where boolean parameters were formatted as integers in the Parameters section of the QC report. -* A bug in aggregating files with precomputed layouts, where the lazy-loading of the layouts was not working correctly. +- Fix a bug in `compute_transition_probabilities` when `k>1` where the stochastic matrix was not correctly row-normalized. +- Fix a bug in `local_g` when `use_weights=False` where the adjacency matrix was not correctly expended if `k>1`. +- Fix a bug where `a_pixels_per_b_pixel` summary statistics where equal to the `b_pixels_per_a_pixel` statistics. +- `collapse` will return exit code 137 when one of the child processes is killed by the system (e.g. because it is + to much memory). This allows e.g. Nextflow to retry the process with more memory automatically. +- Hide the `Sample Description` metadata field in the QC report when no value is available. +- Fix an issue where boolean parameters were formatted as integers in the Parameters section of the QC report. +- A bug in aggregating files with precomputed layouts, where the lazy-loading of the layouts was not working correctly. ### Removed -* Remove the `Pixel Version` metadata field from the QC report. +- Remove the `Pixel Version` metadata field from the QC report. ## [0.17.1] - 2024-05-27 ### Fixed -* Poor performance when writing many small layouts to pxl file (~45x speed-up). This should almost only - impact test scenarios, since most real components should be large enough for this not to be an issue. +- Poor performance when writing many small layouts to pxl file (~45x speed-up). This should almost only + impact test scenarios, since most real components should be large enough for this not to be an issue. ## [0.17.0] - 2024-05-23 ### Added -* Add `rate_diff_transformation` function with `rate-diff` alias as an alternative option for transforming marker counts before colocalization calculation. -* Add `local_g` function to compute spatial autocorrelation of marker counts per node. -* Add `compute_transition_probabilities` function to compute transition probabilities for k-step random walks for node pairs in a graph. -* Add QC plot showing UMIs per UPIA vs Tau. -* Add plot functions showing edge rank and cell counts. -* Add 2D and 3D graph plot functions. -* Add heatmap plot functions showing colocalization and differential colocalization. -* Add volcano plot (value difference vs log p-value) function for differential colocalization. -* Add a function to calculate the differential colocalization between two conditions. -* Performance improvements and reduced bundle size in QC report. -* Improved console output in verbose mode. -* Improved logging from multiprocessing jobs. -* Improved runtime for graph creation. -* Added PMDS layout algorithm. -* Add `--sample_name` option to `single-cell amplicon` to overwrite the name derived from the input filename. -* Add `--skip-input-checks` option to `single-cell amplicon` to make input filename checks warnings instead of errors. -* `PixelDataset` instances are now written to disk without creating intermediate files on-disk. -* A nice string representation for the `Graph` class, to let you know how many nodes and edges there are in the current graph object instance. -* Metric to collect molecules (edges) in cells with outlier distributions of antibodies (aggregates). -* Provide typed interfaces for all per-stage report files using pydantic. -* Centralize pixelator intermediate file lookup and access. -* Add a `precomputed_layouts` property to `PixelDataset` to allow for loading precomputed layouts. -* Add `pixelator single-cell layout` stage to pixelator, which allows users to compute layouts for a PXL file that can then be used to visualize the graph in 2D or 3D downstream. -* Add minimum marker count `polarization_min_marker_count` parameter to calculate Polarity Score. -* Add "log1p" as an alternative for `PolarizationNormalizationTypes`. -* Add `convert_indices_to_integers` option when creating graphs. -* Add a feature flag module to aid in the development of new features. +- Add `rate_diff_transformation` function with `rate-diff` alias as an alternative option for transforming marker counts before colocalization calculation. +- Add `local_g` function to compute spatial autocorrelation of marker counts per node. +- Add `compute_transition_probabilities` function to compute transition probabilities for k-step random walks for node pairs in a graph. +- Add QC plot showing UMIs per UPIA vs Tau. +- Add plot functions showing edge rank and cell counts. +- Add 2D and 3D graph plot functions. +- Add heatmap plot functions showing colocalization and differential colocalization. +- Add volcano plot (value difference vs log p-value) function for differential colocalization. +- Add a function to calculate the differential colocalization between two conditions. +- Performance improvements and reduced bundle size in QC report. +- Improved console output in verbose mode. +- Improved logging from multiprocessing jobs. +- Improved runtime for graph creation. +- Added PMDS layout algorithm. +- Add `--sample_name` option to `single-cell amplicon` to overwrite the name derived from the input filename. +- Add `--skip-input-checks` option to `single-cell amplicon` to make input filename checks warnings instead of errors. +- `PixelDataset` instances are now written to disk without creating intermediate files on-disk. +- A nice string representation for the `Graph` class, to let you know how many nodes and edges there are in the current graph object instance. +- Metric to collect molecules (edges) in cells with outlier distributions of antibodies (aggregates). +- Provide typed interfaces for all per-stage report files using pydantic. +- Centralize pixelator intermediate file lookup and access. +- Add a `precomputed_layouts` property to `PixelDataset` to allow for loading precomputed layouts. +- Add `pixelator single-cell layout` stage to pixelator, which allows users to compute layouts for a PXL file that can then be used to visualize the graph in 2D or 3D downstream. +- Add minimum marker count `polarization_min_marker_count` parameter to calculate Polarity Score. +- Add "log1p" as an alternative for `PolarizationNormalizationTypes`. +- Add `convert_indices_to_integers` option when creating graphs. +- Add a feature flag module to aid in the development of new features. ### Changed -* Change name and description of `Avg. Reads per Cell` and `Avg. Reads Usable per Cell` in QC report. -* The output name of the `.pxl` file from the `annotate` step is now `*.annotated.dataset.pxl`. -* The output name of the `.pxl` file from the `analysis` step is now `*.analysis.dataset.pxl`. -* The term `edges` in `metrics` and `adata` is now replaced with `molecules`. -* Renaming of variables in per-stage JSON reports. -* Changed name of TCRb to TCRVb5 antibody in human-immunology-panel file and bumped to version 0.5.0. -* Renaming of component metrics in adata. -* Use MPX graph compatible permutation strategy when calculating Moran's I related statistics. -* Marker filtering is now done after count transformation in polarization score calculation. -* Use the input read count at the annotate stage for the `fraction_antibody_reads_in_outliers` metric denominator instead of the total raw input reads. -* Use common analysis engine to orchestrate running different "per component" analyses, like polarization and colocalization analysis (yielding a roughly 3x speed-up over the previous approach). -* The default transformation for the calculation of the polarity score is now `log1p` instead of `clr`. +- Change name and description of `Avg. Reads per Cell` and `Avg. Reads Usable per Cell` in QC report. +- The output name of the `.pxl` file from the `annotate` step is now `*.annotated.dataset.pxl`. +- The output name of the `.pxl` file from the `analysis` step is now `*.analysis.dataset.pxl`. +- The term `edges` in `metrics` and `adata` is now replaced with `molecules`. +- Renaming of variables in per-stage JSON reports. +- Changed name of TCRb to TCRVb5 antibody in human-immunology-panel file and bumped to version 0.5.0. +- Renaming of component metrics in adata. +- Use MPX graph compatible permutation strategy when calculating Moran's I related statistics. +- Marker filtering is now done after count transformation in polarization score calculation. +- Use the input read count at the annotate stage for the `fraction_antibody_reads_in_outliers` metric denominator instead of the total raw input reads. +- Use common analysis engine to orchestrate running different "per component" analyses, like polarization and colocalization analysis (yielding a roughly 3x speed-up over the previous approach). +- The default transformation for the calculation of the polarity score is now `log1p` instead of `clr`. ### Fixed -* Fix a bug in how discarded UMIs are calculated and reported. -* Fix deflated counts in the edgelist after collapse. -* Fix a bug where an `r1` or `r2` in the directory part of a read file would break file name sanity checks. -* Fix a bug where the wrong `r1` or `r2` in the filename would be removed when multiple matches are present. -* Logging would cause deadlocks in multiprocessing scenarios, this has been resolved by switching to a server/client-based logging system. -* Fix a bug in the amplicon stage where read suffixes were not correctly recognized. -* Ensure deterministic results from `pmds_layout` (given a set seed). -* Fix an issue with the `fraction_antibody_reads_usable_per_cell` metric where the denominator read count was not correctly averaged with the cell count. +- Fix a bug in how discarded UMIs are calculated and reported. +- Fix deflated counts in the edgelist after collapse. +- Fix a bug where an `r1` or `r2` in the directory part of a read file would break file name sanity checks. +- Fix a bug where the wrong `r1` or `r2` in the filename would be removed when multiple matches are present. +- Logging would cause deadlocks in multiprocessing scenarios, this has been resolved by switching to a server/client-based logging system. +- Fix a bug in the amplicon stage where read suffixes were not correctly recognized. +- Ensure deterministic results from `pmds_layout` (given a set seed). +- Fix an issue with the `fraction_antibody_reads_usable_per_cell` metric where the denominator read count was not correctly averaged with the cell count. ### Removed -* Remove multi-sample processing from all `single-cell` subcommands. -* Remove `--input1_pattern` and `--input2_pattern` from `single-cell amplicon` command. -* Self-correlations, e.g. CD8 vs CD8 are no longer part of the colocalization results, as these values will always be undefined. -* Remove `umi_unique_count` and `upi_unique_count` from `edgelist`. -* Remove `umi` and `median_umi_degree` from `component` metrics. -* Remove `normalized_rel` and `denoised` from `obsm` in `anndata`. -* Remove the `denoise` function. -* Remove cell type selector in QC report for UMAP colored by molecule count plots. -* Remove `clr` as a transformation option in `pixelator analysis`. +- Remove multi-sample processing from all `single-cell` subcommands. +- Remove `--input1_pattern` and `--input2_pattern` from `single-cell amplicon` command. +- Self-correlations, e.g. CD8 vs CD8 are no longer part of the colocalization results, as these values will always be undefined. +- Remove `umi_unique_count` and `upi_unique_count` from `edgelist`. +- Remove `umi` and `median_umi_degree` from `component` metrics. +- Remove `normalized_rel` and `denoised` from `obsm` in `anndata`. +- Remove the `denoise` function. +- Remove cell type selector in QC report for UMAP colored by molecule count plots. +- Remove `clr` as a transformation option in `pixelator analysis`. ## [0.16.2] - 2024-03-19 ### Fixed -* Uninitialized value for `--polarization-n-permutations` +- Uninitialized value for `--polarization-n-permutations` ## [0.16.1] - 2024-01-12 ### Fixed -* Bug in README shield formatting - +- Bug in README shield formatting ## [0.16.0] - 2024-01-12 This release introduces two major change in pixelator: - 1) the Graph backend has been switched from using igraph to using networkx - 2) the license has been changed from GLP2.0 to MIT + +1. the Graph backend has been switched from using igraph to using networkx +2. the license has been changed from GLP2.0 to MIT ### Added -* Experimental 3D heatmap plotting feature. -* Optional caching of layouts to speed up computations in some scenarios. -* `experimental` mark that can be added to functions that are not yet production ready. -* The underlying graph instance e.g. a networkx `Graph` instance is exposed as a property called `raw` from the pixelator `Graph` class. -* Monte Carlo permutations supported when calculating Moran's I (`morans_z_sim`) in `polarization_scores`. +- Experimental 3D heatmap plotting feature. +- Optional caching of layouts to speed up computations in some scenarios. +- `experimental` mark that can be added to functions that are not yet production ready. +- The underlying graph instance e.g. a networkx `Graph` instance is exposed as a property called `raw` from the pixelator `Graph` class. +- Monte Carlo permutations supported when calculating Moran's I (`morans_z_sim`) in `polarization_scores`. ### Changed -* The default (and only) graph backend in pixelator is now based on networkx. -* `mean_reads` and `median_reads` in adata.obs to `mean_reads_per_molecule` and `median_reads_per_molecule` respectively. -* Drop support for python 3.8 and 3.9. -* Change output format of `collapse` from csv to parquet. -* Change input and output format of `graph` from csv to parquet. -* Change input format of `annotate` from csv to parquet. -* Rename the report to "qc report" -* Add a Reads per Molecule frequency figure to the sequencing section of the qc report. -* Remove placeholder warning of missing data for not yet implemented features. -* Change "Median antibody molecules per cell" to "Average antibody molecules per cell" in the qc report. -* Refactoring of the graph backend implementations module. -* Speeding up the `amplicon` step by roughly 3x. +- The default (and only) graph backend in pixelator is now based on networkx. +- `mean_reads` and `median_reads` in adata.obs to `mean_reads_per_molecule` and `median_reads_per_molecule` respectively. +- Drop support for python 3.8 and 3.9. +- Change output format of `collapse` from csv to parquet. +- Change input and output format of `graph` from csv to parquet. +- Change input format of `annotate` from csv to parquet. +- Rename the report to "qc report" +- Add a Reads per Molecule frequency figure to the sequencing section of the qc report. +- Remove placeholder warning of missing data for not yet implemented features. +- Change "Median antibody molecules per cell" to "Average antibody molecules per cell" in the qc report. +- Refactoring of the graph backend implementations module. +- Speeding up the `amplicon` step by roughly 3x. ### Fixed -* Nicer error messages when there are no components valid for computing colocalization. -* A bunch of warnings. +- Nicer error messages when there are no components valid for computing colocalization. +- A bunch of warnings. ### Removed -* `graph` no longer outputs the raw edge list. -* igraph has been dropped as a graph backend for pixelator. - +- `graph` no longer outputs the raw edge list. +- igraph has been dropped as a graph backend for pixelator. ## [0.15.2] - 2023-10-23 ### Fixed -* Fixed broken pixeldataset aggregation for more than two samples. -* Fixed a bug in graph generation caused by accidentally writing the index to the parquet file. - For backwards compatibility, if there is a column named `index` in the edgelist, this - will be removed and the user will get a warning indicating that this has happened. - +- Fixed broken pixeldataset aggregation for more than two samples. +- Fixed a bug in graph generation caused by accidentally writing the index to the parquet file. + For backwards compatibility, if there is a column named `index` in the edgelist, this + will be removed and the user will get a warning indicating that this has happened. ## [0.15.1] - 2023-10-18 ### Fixed -* Fixed a bug in filtering pixeldataset causing it to return the wrong types. -* Fixed a bug in graph layout generation due to incorrect data frame concatenation. - +- Fixed a bug in filtering pixeldataset causing it to return the wrong types. +- Fixed a bug in graph layout generation due to incorrect data frame concatenation. ## [0.15.0] - 2023-10-16 ### Added -* Add support for Python 3.11. -* Add early enablement work for a networkx backend for the graph stage. +- Add support for Python 3.11. +- Add early enablement work for a networkx backend for the graph stage. ### Fixed -* Fix report color axis in report figures not updating when selecting markers or cell types. -* Remove placeholder links in report tooltips. -* Fix a bug where aggregating data did not add the correct sample, and unique component columns. - +- Fix report color axis in report figures not updating when selecting markers or cell types. +- Remove placeholder links in report tooltips. +- Fix a bug where aggregating data did not add the correct sample, and unique component columns. ## [0.14.0] - 2023-10-05 ### Added -* Lazy option for edge list loading (`pixeldataset.edgelist_lazy`), which returns a - `polars` `LazyFrame` that can be used to operate on the edge list without reading - all of it into memory. -* Option (`ignore_edgelists`) to skip the edge lists when aggregating files. This defaults - to `False`. +- Lazy option for edge list loading (`pixeldataset.edgelist_lazy`), which returns a + `polars` `LazyFrame` that can be used to operate on the edge list without reading + all of it into memory. +- Option (`ignore_edgelists`) to skip the edge lists when aggregating files. This defaults + to `False`. ### Changed -* Types on the edge list in memory will utilize the `pandas` `category` type for string, and - `uint16` for numeric values to lower the memory consumption when working with the - edge list -* Remove `--pbs1` and `--pbs2` commandline arguments to `pixelator single-cell adapterqc`. -* Restructure report figures. -* Improve metric names and tooltips in the report. -* Synchronize zoom level between the scatter plots in cell annotations section of the report. -* Add report placeholder for missing cell annotation data -* Add `Fraction of discarded UMIs` and `Avg. Reads per Molecule` metrics to the report. +- Types on the edge list in memory will utilize the `pandas` `category` type for string, and + `uint16` for numeric values to lower the memory consumption when working with the + edge list +- Remove `--pbs1` and `--pbs2` commandline arguments to `pixelator single-cell adapterqc`. +- Restructure report figures. +- Improve metric names and tooltips in the report. +- Synchronize zoom level between the scatter plots in cell annotations section of the report. +- Add report placeholder for missing cell annotation data +- Add `Fraction of discarded UMIs` and `Avg. Reads per Molecule` metrics to the report. ### Fixed -* Fix an issue where pixelator --version would return 0.0.0 when installing in editable mode. - +- Fix an issue where pixelator --version would return 0.0.0 when installing in editable mode. ## [0.13.1] - 2023-09-15 @@ -229,21 +227,20 @@ This release introduces two major change in pixelator: ### Changed -* Unpin igraph dependency to allow for newer versions of igraph to be used. -* Cleanup README and point to the external documentation site. -* Change PyPi package name to pixelgen-pixelator. +- Unpin igraph dependency to allow for newer versions of igraph to be used. +- Cleanup README and point to the external documentation site. +- Change PyPi package name to pixelgen-pixelator. ### Fixed -* Fix an issue where `--keep-workdirs` option for pytest was not available when running pytest without - restricting the testdir to `tests/integration`. -* Fix an issue where pixelator --version would return 0.0.0. +- Fix an issue where `--keep-workdirs` option for pytest was not available when running pytest without + restricting the testdir to `tests/integration`. +- Fix an issue where pixelator --version would return 0.0.0. ### Removed -* `clr` and `relative` transformation options for the colocalization computations in `analysis` - +- `clr` and `relative` transformation options for the colocalization computations in `analysis` ## [0.13.0] - 2023-09-13 -* First public release of pixelator. +- First public release of pixelator. diff --git a/src/pixelator/graph/node_metrics.py b/src/pixelator/graph/node_metrics.py index d6495376..30a6e92c 100644 --- a/src/pixelator/graph/node_metrics.py +++ b/src/pixelator/graph/node_metrics.py @@ -52,7 +52,9 @@ def compute_transition_probabilities( # Set diagonal of W to 0 if k > 1 for gi to avoid self-loops W_out.setdiag(values=0) # Renormalize transition probabilities to sum to 1 - W_out = W_out / W_out.sum(axis=0)[:, None] + row_sums = np.array(W_out.sum(axis=1)).flatten() + inv_row_sums = np.reciprocal(row_sums, where=row_sums != 0) + W_out = W_out.multiply(inv_row_sums[:, np.newaxis]) return W_out @@ -63,7 +65,7 @@ def local_g( counts: pd.DataFrame, k: int = 1, use_weights: bool = True, - normalize_counts: bool = True, + normalize_counts: bool = False, W: sp.sparse.csr_array | None = None, method: Literal["gi", "gstari"] = "gi", ) -> pd.DataFrame: @@ -97,7 +99,7 @@ def local_g( :param k: The number of steps in the k-step random walk. Default is 1. :param use_weights: Whether to use weights in the computation. When turned off, all edge weights will be qeual to 1. Default is True. - :param normalize_counts: Whether to normalize counts to proportions. Default is True. + :param normalize_counts: Whether to normalize counts to proportions. Default is False. :param W: A sparse matrix of custom edge weights. This will override the automated computation of edge weights. `W` must have the same dimensions as A. Note that weights can be defined for any pair of nodes, not only the pairs represented by edges in `A`. Default is None. @@ -156,6 +158,17 @@ def local_g( W = W.T else: W = A + if k > 1: + # Expand local neighborhood using matrix powers + W = sp.sparse.linalg.matrix_power(W, k) + # Set all positive elements to 1 + W.data = np.ones_like(W.data) + if method == "gstari": + # Set diagonal of A to 1 for gstari which expects self-loops + W = W + sp.sparse.eye(n_nodes) + else: + # Set diagonal of W to 0 for gi to avoid self-loops + W.setdiag(values=0) # Compute lag matrix lag_mat = W @ counts diff --git a/tests/graph/test_node_metrics.py b/tests/graph/test_node_metrics.py index a6ec4f61..1be36c0e 100644 --- a/tests/graph/test_node_metrics.py +++ b/tests/graph/test_node_metrics.py @@ -307,6 +307,80 @@ def test_local_g_normalize_method_gstari(pentagram_graph): ) +def test_local_g_k4(pentagram_graph): + # Create a sparse adjacency matrix + A = pentagram_graph.get_adjacency_sparse() + # Create a 5x5 DataFrame of marker counts + counts = pd.DataFrame( + [ + [1, 2, 3, 4, 5], + [6, 7, 8, 9, 10], + [11, 12, 13, 14, 15], + [16, 17, 18, 19, 20], + [21, 22, 23, 24, 25], + ], + columns=["A", "B", "C", "D", "E"], + ) + counts.index.name = "node" + counts.columns.name = "markers" + + # Compute local g-scores + gi_scores = local_g( + A, counts, k=4, use_weights=True, normalize_counts=True, method="gstari" + ) + + # Expected local g-scores + expected_gi_scores = pd.DataFrame.from_dict( + { + 0: { + "A": -1.6919339037036998, + "B": -1.6919339037037184, + "C": 0.0, + "D": 1.6919339037036492, + "E": 1.691933903703703, + }, + 1: { + "A": -0.6945173598360008, + "B": -0.6945173598360104, + "C": 0.0, + "D": 0.6945173598359781, + "E": 0.6945173598359913, + }, + 2: { + "A": -0.30065880545156753, + "B": -0.3006588054515778, + "C": 0.0, + "D": 0.30065880545157325, + "E": 0.3006588054516006, + }, + 3: { + "A": 1.4593826070876255, + "B": 1.459382607087601, + "C": 0.0, + "D": -1.4593826070876004, + "E": -1.4593826070876499, + }, + 4: { + "A": 1.2277274619036243, + "B": 1.2277274619036218, + "C": 0.0, + "D": -1.2277274619036032, + "E": -1.2277274619036265, + }, + }, + orient="index", + ) + + expected_gi_scores.index.name = "node" + expected_gi_scores.columns.name = "markers" + + # Compare the computed and expected local g-scores + assert isinstance(gi_scores, pd.DataFrame) + assert_frame_equal( + gi_scores.sort_index(), expected_gi_scores.sort_index(), check_column_type=False + ) + + # Run the tests if __name__ == "__main__": pytest.main()