Index problem in run_enact() #13

Open
Nick-Eagles opened this issue Nov 27, 2024 · 2 comments

Comments

@Nick-Eagles

Hello,

Thanks for developing this useful software. I'm attempting to run the complete ENACT pipeline from within Python. When invoking the run_enact() method of an ENACT object, many steps seem to complete, but the analysis ultimately halts with a complaint that appears to reference a mismatch in cell IDs. I'll attach the full configuration and traceback below.

Best,
-Nick

Configuration as printed when running run_enact():

analysis_name: H1-W369TJK_D1_9090
 run_synthetic: False
 cache_dir: /dcs04/lieber/lcolladotor/Habenula_R01_LIBD4270/Habenula_Visium/processed-data/09_HD_cell_level/enact/H1-W369TJK_D1_9090
 wsi_path: /dcs04/lieber/lcolladotor/Habenula_R01_LIBD4270/Habenula_Visium/raw-data/images/vis-hd/H1-W369TJK_D1_9090.tif
 visiumhd_h5_path: /dcs04/lieber/lcolladotor/Habenula_R01_LIBD4270/Habenula_Visium/processed-data/01_spaceranger/H1-W369TJK_D1_9090/outs/binned_outputs/square_002um/filtered_feature_bc_matrix.h5
 tissue_positions_path: /dcs04/lieber/lcolladotor/Habenula_R01_LIBD4270/Habenula_Visium/processed-data/01_spaceranger/H1-W369TJK_D1_9090/outs/binned_outputs/square_002um/spatial/tissue_positions.parquet
 segmentation: True
 bin_to_geodataframes: True
 bin_to_cell_assignment: True
 cell_type_annotation: True
 seg_method: stardist
 patch_size: 4000
 bin_representation: polygon
 bin_to_cell_method: weighted_by_area
 cell_annotation_method: celltypist
 cell_typist_model: Developing_Human_Brain.pkl
 use_hvg: True
 n_hvg: 1000
 n_clusters: 4
 chunks_to_run: []
 cell_markers: {}

Traceback:

Traceback (most recent call last):
  File "/dcs04/lieber/lcolladotor/Habenula_R01_LIBD4270/Habenula_Visium/code/09_HD_cell_level/enact/01_run_enact.py", line 28, in <module>
    so_hd.run_enact()
  File "/jhpce/shared/libd/core/visium_hd/1.0/hd_env/lib/python3.9/site-packages/enact/pipeline.py", line 1040, in run_enact
    self.package_results()
  File "/jhpce/shared/libd/core/visium_hd/1.0/hd_env/lib/python3.9/site-packages/enact/pipeline.py", line 981, in package_results
    adata = pack_obj.df_to_adata(results_df, cell_by_gene_df)
  File "/jhpce/shared/libd/core/visium_hd/1.0/hd_env/lib/python3.9/site-packages/enact/package_results.py", line 111, in df_to_adata
    adata.obsm["spatial"] = results_df[spatial_cols].astype(int)
  File "/jhpce/shared/libd/core/visium_hd/1.0/hd_env/lib/python3.9/site-packages/anndata/_core/aligned_mapping.py", line 199, in __setitem__
    value = self._validate_value(value, key)
  File "/jhpce/shared/libd/core/visium_hd/1.0/hd_env/lib/python3.9/site-packages/anndata/_core/aligned_mapping.py", line 264, in _validate_value
    raise ValueError(msg) from None
ValueError: value.index does not match parent’s obs names:
Index are different

Index values are different (100.0 %)
[left]:  Index(['ID_10822', 'ID_10867', 'ID_10934', 'ID_11145', 'ID_11195', 'ID_11507',
       'ID_11785', 'ID_12211', 'ID_4692', 'ID_4693',
       ...
       'ID_9269', 'ID_9270', 'ID_9272', 'ID_9273', 'ID_9275', 'ID_9276',
       'ID_9277', 'ID_9278', 'ID_9279', 'ID_9282'],
      dtype='object', name='id', length=59071)
[right]: Index(['ID_1', 'ID_10', 'ID_100', 'ID_1000', 'ID_1002', 'ID_1003', 'ID_1004',
       'ID_1006', 'ID_1007', 'ID_1008',
       ...
       'ID_60482', 'ID_60485', 'ID_60493', 'ID_60497', 'ID_60498', 'ID_60499',
       'ID_60501', 'ID_60502', 'ID_60507', 'ID_60508'],
      dtype='object', name='id', length=59071)
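
One way to narrow down an error like the one above is to check whether the two indexes hold the same IDs in a different order or genuinely different IDs. The sketch below uses only pandas; `compare_id_indexes` is a hypothetical helper (not part of ENACT or anndata), and the toy IDs stand in for `results_df.index` and `adata.obs_names`.

```python
import pandas as pd


def compare_id_indexes(left: pd.Index, right: pd.Index) -> dict:
    """Summarize how two ID indexes differ (hypothetical helper, not part of ENACT)."""
    left_ids, right_ids = set(left), set(right)
    return {
        "same_length": len(left) == len(right),
        # Position-by-position equality, which is what the failing check requires.
        "same_order": bool(left.equals(right)),
        # Same IDs regardless of order: reordering alone would fix the mismatch.
        "same_ids": left_ids == right_ids,
        "only_in_left": sorted(left_ids - right_ids),
        "only_in_right": sorted(right_ids - left_ids),
    }


# Toy stand-ins for results_df.index and adata.obs_names (made-up values).
results_ids = pd.Index(["ID_3", "ID_1", "ID_2"], name="id")
obs_names = pd.Index(["ID_1", "ID_2", "ID_3"], name="id")

report = compare_id_indexes(results_ids, obs_names)
print(report)
```

If `same_ids` is true while `same_order` is false, reordering one side should be enough; otherwise the `only_in_left` and `only_in_right` lists show which cells are missing from each side.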
@XinchaoWu99

XinchaoWu99 commented Dec 3, 2024

Same problem and possible solution
Hi, I encountered the same problem:

ValueError Traceback (most recent call last)

Cell In[2], line 20
2 sample = "Visium_HD_060424_5X" # Visium_HD_060424_5X Visium_HD_060424-WT
4 so_hd = ENACT(
5 cache_dir=f"{data_path}/test_cache",
6 wsi_path=f"{data_path}/2024_05_22_04_5xfad.tif", # "2024_05_22_04_5xfad.tif" "2024_05_22_02_ctrl.tif"
(...)
17 cell_type_annotation=True,
18 )
---> 20 so_hd.run_enact()

File ~/.local/lib/python3.9/site-packages/enact/pipeline.py:1040, in ENACT.run_enact(self)
1038 if self.cell_type_annotation:
1039 self.run_cell_type_annotation()
-> 1040 self.package_results()
1042 else:
1043 # Generating synthetic data
1044 if self.analysis_name in ["xenium", "xenium_nuclei"]:

File ~/.local/lib/python3.9/site-packages/enact/pipeline.py:981, in ENACT.package_results(self)
977 cell_by_gene_df = pack_obj.merge_cellassign_output_files()
978 results_df = pd.read_csv(
979 os.path.join(self.cellannotation_results_dir, "merged_results.csv")
980 )
--> 981 adata = pack_obj.df_to_adata(results_df, cell_by_gene_df)
982 pack_obj.save_adata(adata)
983 pack_obj.create_tmap_file()

File ~/.local/lib/python3.9/site-packages/enact/package_results.py:111, in PackageResults.df_to_adata(self, results_df, cell_by_gene_df)
109 adata = anndata.AnnData(cell_by_gene_df.set_index("id"))
110 # adata = anndata.AnnData(results_df[stat_columns].astype(int))
--> 111 adata.obsm["spatial"] = results_df[spatial_cols].astype(int)
112 adata.obsm["stats"] = results_df[stat_columns].astype(int)
113 # This column is the output of cell type inference pipeline

File ~/.local/lib/python3.9/site-packages/anndata/_core/aligned_mapping.py:199, in AlignedActualMixin.__setitem__(self, key, value)
198 def __setitem__(self, key: str, value: V):
--> 199 value = self._validate_value(value, key)
200 self._data[key] = value

File ~/.local/lib/python3.9/site-packages/anndata/_core/aligned_mapping.py:264, in AxisArraysBase._validate_value(self, val, key)
262 except AssertionError as e:
263 msg = f"value.index does not match parent’s {self.dim} names:\n{e}"
--> 264 raise ValueError(msg) from None
265 else:
266 msg = "Index.equals and pd.testing.assert_index_equal disagree"

ValueError: value.index does not match parent’s obs names:
Index are different

Index values are different (100.0 %)
[left]: Index(['ID_1', 'ID_10', 'ID_100', 'ID_1000', 'ID_1001', 'ID_1002', 'ID_1003',
'ID_1004', 'ID_1005', 'ID_1006',
...
'ID_50670', 'ID_50671', 'ID_50672', 'ID_50673', 'ID_50674', 'ID_50675',
'ID_50676', 'ID_50677', 'ID_50678', 'ID_50679'],
dtype='object', name='id', length=21796)
[right]: Index(['ID_10003', 'ID_10005', 'ID_10007', 'ID_10010', 'ID_10012', 'ID_10021',
'ID_10023', 'ID_10026', 'ID_10035', 'ID_10037',
...
'ID_49667', 'ID_49668', 'ID_49670', 'ID_49671', 'ID_49672', 'ID_49675',
'ID_49676', 'ID_49678', 'ID_49684', 'ID_49685'],
dtype='object', name='id', length=21796)

I think there is a bug in PackageResults.df_to_adata(): it sets adata.obsm directly from results_df, whose index may not match adata's obs names, and that triggers the index error above. I revised the function as follows to get it working:

def df_to_adata(self, results_df, cell_by_gene_df):
    """Converts pd.DataFrame object with pipeline results to AnnData

    Args:
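        results_df (pd.DataFrame): Merged cell annotation results with an "id" column
        cell_by_gene_df (pd.DataFrame): Cell-by-gene counts with an "id" column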

    Returns:
        anndata.AnnData: Anndata with pipeline outputs
    """
    file_columns = results_df.columns
    spatial_cols = ["cell_x", "cell_y"]
    stat_columns = ["num_shared_bins", "num_unique_bins", "num_transcripts"]
    results_df.loc[:, "id"] = results_df["id"].astype(str)
    results_df = results_df.set_index("id")
    results_df["num_transcripts"] = results_df["num_transcripts"].fillna(0)
    results_df["cell_type"] = results_df["cell_type"].str.lower()
    # adata = anndata.AnnData(cell_by_gene_df.set_index("id").astype(int))
    adata = anndata.AnnData(cell_by_gene_df.set_index("id"))
    adata.obs = adata.obs.merge(results_df, on="id").drop_duplicates(keep='first')
    # adata = anndata.AnnData(results_df[stat_columns].astype(int))
    # adata.obsm["spatial"] = results_df[spatial_cols].astype(int)
    adata.obsm["spatial"] = adata.obs[spatial_cols].astype(int)
    # adata.obsm["stats"] = results_df[stat_columns].astype(int)
    adata.obsm["stats"] = adata.obs[stat_columns].astype(int)
    # This column is the output of cell type inference pipeline
    # adata.obs["cell_type"] = results_df[["cell_type"]].astype("category")
    adata.obs["cell_type"] = adata.obs["cell_type"].astype("category")
    # adata.obs["patch_id"] = results_df[["chunk_name"]]
    adata.obs["patch_id"] = adata.obs["chunk_name"]
    adata.obs = adata.obs[["cell_type", "patch_id"]]
    return adata


This works for me.
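
The core idea of the fix, aligning the results rows to the count matrix's IDs before assigning into `obsm`, can also be sketched with plain pandas. This is an illustration on made-up data using `reindex` rather than the `merge` above, and it is not the maintainers' implementation; only the column names `id`, `cell_x` and `cell_y` come from the snippet.

```python
import pandas as pd

# Toy stand-ins; the column names follow the snippet above, the values are made up.
cell_by_gene_df = pd.DataFrame(
    {"id": ["ID_1", "ID_2", "ID_3"], "GeneA": [5, 0, 2], "GeneB": [1, 3, 0]}
)
results_df = pd.DataFrame(
    {
        "id": ["ID_3", "ID_1", "ID_2"],
        "cell_x": [30.7, 10.2, 20.5],
        "cell_y": [3.1, 1.9, 2.4],
    }
)

counts = cell_by_gene_df.set_index("id")
# Reorder the results rows so they follow the count matrix's row order.
aligned = results_df.set_index("id").reindex(counts.index)

# IDs absent from results_df come back as all-NaN rows; fail early on those.
missing = aligned.index[aligned.isna().all(axis=1)]
if len(missing) > 0:
    raise ValueError(f"{len(missing)} cell IDs have no results row")

spatial = aligned[["cell_x", "cell_y"]].astype(int)
# The same kind of index check that failed in the traceback now passes.
pd.testing.assert_index_equal(spatial.index, counts.index)
print(spatial)
```

Compared with a merge, `reindex` keeps exactly one row per ID in the count matrix and makes missing IDs explicit, at the cost of needing the NaN check before the integer cast.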

@stc120121

Thanks for catching this and suggesting a fix! We’ve implemented a very similar solution, and it’ll be rolled out soon with some other updates.
