整合全基因组关联研究和空间转录组数据定位复杂疾病相关细胞的空间分布 (运行中)

Xing Abao Lv3

全基因组关联研究GWAS是一种广泛应用于检测遗传变异与复杂疾病(性状)关联的研究方法。尽管 GWAS 已经成功鉴定了大量与复杂疾病(性状)相关的遗传变异,但这些变异通过人体组织中哪些特定位置的细胞影响疾病的发生和发展,仍然是当前人类遗传学研究的未解之谜。西湖大学杨剑团队 2025 年 3 月在 Nature 上发表了Spatially resolved mapping of cells associated with human complex traits 的研究论文。该研究开发了一种新的分析方法gsMap,通过整合全基因组关联研究和空间转录组数据,描绘了人类复杂疾病(性状)相关细胞在组织中的空间分布。

通过对小鼠胚胎、大脑、猕猴大脑皮层以及人类GWAS数据进行整合分析,gsMap成功揭示了不同疾病相关细胞的空间分布模式。例如,研究发现精神分裂症相关的谷氨酸能神经元主要富集于海马背侧,并上调钙信号通路基因,而抑郁症相关神经元主要分布在内侧前额叶皮层,并富含精神药物靶点基因。该研究不仅提供了一种空间解析的遗传学工具,还为精神疾病的病理研究和潜在治疗靶点识别提供了新思路。

gsMap利用图神经网络整合细胞的表达谱和空间位置信息,在二维空间上为组织的每个细胞识别出一组标签基因;以这些基因为桥梁,通过人类遗传方法,将全基因组关联研究GWAS和空间转录组数据Spatial Transcriptomics, ST相结合,构建疾病与细胞的关联,从而在单细胞水平上描绘疾病相关细胞在组织中的空间分布。

算法原理

**核心思想:**利用 GWAS 数据确定与复杂性状相关的基因,并将其表达模式映射到空间转录组数据中的细胞,以评估某个空间区域的细胞是否与某种复杂性状相关。

  • 计算基因特异性得分 (GSS):使用图神经网络对空间转录组数据进行处理,识别基因表达模式相似的空间区域 (spot)。计算每个 spot 的 基因特异性得分,表示该区域特定基因的表达水平排名;
  • 将 GSS 分配给 SNP:结合 GWAS 数据,分析 SNP 与 GSS 之间的关系,确定某个空间区域的 SNP 是否富集于某种复杂性状的遗传变异;
  • SNP 遗传变异富集分析:使用分层连锁不平衡评分回归 (S-LDSC) 方法,评估具有高 GSS 的 SNP 是否富集于特定性状的遗传变异,计算富集 P 值;
  • 空间区域的统计检验:通过 Cauchy 组合检验计算某个空间区域整体上是否与目标性状相关。

安装程序

1
2
3
4
5
6
7
# 通过 pip 直接安装
/tools/Python-3.10.8/bin/pip install gsMap -i https://pypi.tuna.tsinghua.edu.cn/simple

# 通过源码安装:
git clone https://github.com/JianYang-Lab/gsMap
cd gsMap
/tools/Python-3.10.8/bin/pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -e .

检查安装:

1
/tools/Python-3.10.8/bin/gsmap -v

小鼠胚胎 (快速模式)

快速模式提供了一种简化且高效的方式来运行整个 gsMap 分析流程。它通过利用 1000G EUR 参考基因组和 Gencode v46 蛋白编码基因预先计算的权重,减少了运行时间和配置复杂度。

如果希望自定义 GTF 文件、参考基因组或参数,请参考分步指南Step by Step guide

准备工作

首先按照安装程序部署gsMap软件包,然后下载相应的依赖,本示例需要以下资源文件:

  • GTF 文件,用于提供基因在基因组上的坐标;
  • LD 参考基因组,快速模式下,已提供基于 1000G_EUR_Phase3的预构建 LD 分数 SNP-基因矩阵 (pre-built LD score snp-by-gene matrix based on 1000G_EUR_Phase3);
  • SNP 权重文件,用于调整 SNP-性状关联统计之间的相关性;
  • 同源基因转换文件,可选,用于不同物种间基因名称的映射。

下载所有所需的文件

1
2
3
wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_resource.tar.gz

tar -xvzf gsMap_resource.tar.gz

下载示例数据

1
2
3
wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_example_data.tar.gz

tar -xvzf gsMap_example_data.tar.gz

文件结构

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$ tree -L 3
.
├── gsMap_example_data
│   ├── GWAS
│   │   ├── BCX2_MCHC_EA_GWAMA.sumstats.gz
│   │   ├── GIANT_EUR_Height_2022_Nature.sumstats.gz
│   │   ├── gwas_config.yaml
│   │   └── IQ_NG_2018.sumstats.gz
│   └── ST
│   ├── E16.5_E1S1.MOSTA.h5ad
│   └── E16.5_E2S11.MOSTA.h5ad
├── gsMap_example_data.tar.gz
├── gsMap_resource
│   ├── genome_annotation
│   │   ├── enhancer
│   │   └── gtf
│   ├── homologs
│   │   ├── human_macaque_marmoset_mouse_homologs.txt.gz
│   │   ├── macaque_human_homologs.txt
│   │   └── mouse_human_homologs.txt
│   ├── LD_Reference_Panel
│   │   └── 1000G_EUR_Phase3_plink
│   ├── LDSC_resource
│   │   ├── hapmap3_snps
│   │   └── weights_hm3_no_hla
│   └── quick_mode
│   ├── baseline
│   ├── SNP_gene_pair
│   └── snp_gene_weight_matrix.h5ad
└── gsMap_resource.tar.gz

16 directories, 12 files

用快速模式运行

Execution: Required memory: 80G (120K cells)

1
2
3
4
5
6
7
8
9
10
11
# 2025-05-01 22:54 测试通过
nohup /tools/Python-3.10.8/bin/gsmap quick_mode \
--workdir '/home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo' \
--homolog_file '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt' \
--sample_name 'E16.5_E1S1.MOSTA' \
--gsMap_resource_dir '/home/hello/wkdir/gsMap/gsMap_resource' \
--hdf5_path '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad' \
--annotation 'annotation' \
--data_layer 'count' \
--sumstats_file '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz' \
--trait_name 'IQ' &
1
2
3
4
5
6
7
8
9
10
# 参数说明
--workdir: 输出文件保存的工作目录
--homolog_file: 同源基因文件,用于不同物种基因名称转为人类基因名
--sample_name: 样本名称,如 E16.5_E1S1.MOSTA
--gsMap_resource_dir: gsMap 依赖资源所在目录
--hdf5_path: 空间转录组(ST)输入 HDF5 文件路径
--annotation: 输入 HDF5 文件 adata.obs 中注释列的名称
--data_layer: 表达矩阵的层,如 count
--sumstats_file: GWAS 总结统计文件路径
--trait_name: 性状名称,如 IQ
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-01 22:55:00,244] INFO | gsMap - Running quick_mode...
[2025-05-01 22:55:00,244] INFO | gsMap - Started at: 2025-05-01 22:55:00
Using the following arguments for RunAllModeConfig:
{ 'annotation': 'annotation',
'data_layer': 'count',
'gM_slices': None,
'gsMap_resource_dir': '/home/hello/wkdir/gsMap/gsMap_resource',
'hdf5_path': '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad',
'homolog_file': '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt',
'latent_representation': None,
'max_processes': 10,
'num_neighbour': 21,
'num_neighbour_spatial': 101,
'pearson_residuals': False,
'sample_name': 'E16.5_E1S1.MOSTA',
'sumstats_config_file': None,
'sumstats_file': '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz',
'trait_name': 'IQ',
'workdir': '/home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo'}
[2025-05-01 22:55:04,709] INFO | gsMap.pipeline - Starting pipeline with configuration: RunAllModeConfig(workdir='/home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo', sample_name='E16.5_E1S1.MOSTA', gsMap_
resource_dir='/home/hello/wkdir/gsMap/gsMap_resource', hdf5_path='/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad', annotation='annotation', data_layer='count', n_comps=300, pearson_residual
s=False, gM_slices=None, latent_representation=None, num_neighbour=21, num_neighbour_spatial=101, trait_name='IQ', sumstats_file='/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz', sumstat
s_config_file=None, homolog_file='/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt', max_processes=10)
[2025-05-01 22:55:04,709] INFO | gsMap.pipeline - No latent representation provided. Will run the Find_latent_representations step.
[2025-05-01 22:55:04,709] INFO | gsMap - ------Find latent representations for annotation...
[2025-05-01 22:55:04,709] INFO | gsMap.pipeline - Step 1: Finding latent representations
[2025-05-01 22:55:04,711] INFO | gsMap.find_latent_representation - Using CPU for computations.
[2025-05-01 22:55:04,711] INFO | gsMap.find_latent_representation - Loading ST data of E16.5_E1S1.MOSTA...
[2025-05-01 22:55:11,604] INFO | gsMap.find_latent_representation - The ST data contains 121767 cells, 28204 genes.
[2025-05-01 22:55:11,851] INFO | gsMap.find_latent_representation - Preprocessing data...
[2025-05-01 22:55:15,088] INFO | gsMap.find_latent_representation - Using data layer: count...
[2025-05-01 22:55:15,717] INFO | gsMap.find_latent_representation - Dealing with count data...
[2025-05-01 22:55:59,059] INFO | gsMap.find_latent_representation - Finding latent representations for whole ST data...
[2025-05-01 22:55:59,268] INFO | gsMap.GNN.train - Start training...
------Calculating spatial graph...
The graph contains 1217670 edges, 121767 cells.
10.00 neighbors per cell on average.
......
[2025-05-02 00:52:53,349] INFO | gsMap.spatial_ldsc - Output saved to /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/spatial_ldsc/E16.5_E1S1.MOSTA_IQ.csv.gz for IQ
[2025-05-02 00:52:53,350] INFO | gsMap.spatial_ldsc - ------Spatial LDSC for E16.5_E1S1.MOSTA finished!
[2025-05-02 00:52:53,819] INFO | gsMap.pipeline - Step 4 completed in 1h 2m.
[2025-05-02 00:52:53,819] INFO | gsMap.pipeline - Step 6: Running Cauchy combination test
[2025-05-02 00:52:53,820] INFO | gsMap.cauchy_combination_test - ------Loading LDSC results for sample E16.5_E1S1.MOSTA...
[2025-05-02 00:52:53,981] INFO | gsMap.cauchy_combination_test - ------Loading ST data for sample E16.5_E1S1.MOSTA...
[2025-05-02 00:52:57,696] INFO | gsMap.cauchy_combination_test - Removed 84/8803 outliers (median + 3IQR) for Connective tissue.
[2025-05-02 00:52:57,776] INFO | gsMap.cauchy_combination_test - Removed 8/5850 outliers (median + 3IQR) for Cartilage primordium.
[2025-05-02 00:52:57,799] INFO | gsMap.cauchy_combination_test - Removed 8/2007 outliers (median + 3IQR) for Submandibular gland.
[2025-05-02 00:52:57,813] INFO | gsMap.cauchy_combination_test - Removed 5/3078 outliers (median + 3IQR) for Jaw and tooth.
[2025-05-02 00:52:57,846] INFO | gsMap.cauchy_combination_test - Removed 17/5602 outliers (median + 3IQR) for Cartilage.
[2025-05-02 00:52:57,868] INFO | gsMap.cauchy_combination_test - Removed 11/3760 outliers (median + 3IQR) for Lung.
[2025-05-02 00:52:57,901] INFO | gsMap.cauchy_combination_test - Removed 30/6656 outliers (median + 3IQR) for Meninges.
[2025-05-02 00:52:57,976] INFO | gsMap.cauchy_combination_test - Removed 1/457 outliers (median + 3IQR) for Inner ear.
[2025-05-02 00:52:58,075] INFO | gsMap.cauchy_combination_test - Removed 2/2783 outliers (median + 3IQR) for Mucosal epithelium.
[2025-05-02 00:52:58,092] INFO | gsMap.cauchy_combination_test - Removed 14/383 outliers (median + 3IQR) for Smooth muscle.
[2025-05-02 00:52:58,102] INFO | gsMap.cauchy_combination_test - Removed 30/3723 outliers (median + 3IQR) for Heart.
[2025-05-02 00:52:58,262] INFO | gsMap.cauchy_combination_test - Removed 1/17374 outliers (median + 3IQR) for Brain.
[2025-05-02 00:52:58,854] INFO | gsMap.cauchy_combination_test - Removed 3/1727 outliers (median + 3IQR) for Choroid plexus.
[2025-05-02 00:52:58,874] INFO | gsMap.cauchy_combination_test - Cauchy combination results saved at /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/cauchy_combination/E16.5_E1S1.MOSTA_
IQ.Cauchy.csv.gz.
[2025-05-02 00:52:58,878] INFO | gsMap.pipeline - Step 5 completed in 0h 0m.
[2025-05-02 00:52:58,878] INFO | gsMap.pipeline - Running final report generation for trait: IQ
[2025-05-02 00:52:58,879] INFO | gsMap.report - Running gsMap Diagnosis Module
[2025-05-02 00:53:02,320] INFO | gsMap.diagnosis - Creating gsMap plot...
[2025-05-02 00:53:06,729] INFO | gsMap.diagnosis - gsMap plot created and saved in /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/gsMap_plot.
[2025-05-02 00:53:06,732] INFO | gsMap.diagnosis - Loading and processing GWAS data...
[2025-05-02 00:53:29,899] INFO | gsMap.diagnosis - Gene diagnostic information not found. Calculating gene diagnostic information...
[2025-05-02 00:53:29,899] INFO | gsMap.diagnosis - Loading ST data and LDSC results...
[2025-05-02 00:53:49,771] INFO | gsMap.diagnosis - Calculating correlation between gene marker scores and trait logp-values...
[2025-05-02 00:58:09,518] INFO | gsMap.diagnosis - Gene diagnostic information saved to /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/E16.5_E1S1.MOSTA_IQ_Gene_Diagnostic_Inf
o.csv.
[2025-05-02 00:58:12,332] INFO | gsMap.diagnosis - To reduce the number of SNPs to plot, only 100000 SNPs with the smallest P-values are plotted.
[2025-05-02 00:58:13,180] INFO | gsMap.diagnosis - Creating Manhattan plot with 100000 SNPs
[2025-05-02 00:58:13,181] INFO | gsMap.diagnosis - Chromosome column values: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22]
/tools/Python-3.10.8/lib/python3.10/site-packages/gsMap/utils/manhattan_plot.py:164: UserWarning:

Some genes don't contain any SNP to highlight: : {'JAKMIP2', 'NSG2', 'CDK5R1', 'ACTL6B', 'TAGLN3', 'APC2', 'CRMP1', 'APBA2', 'RUFY3', 'DCX', 'ZCCHC18', 'SOGA3', 'CNIH2', 'ELAVL3', 'MLLT11', 'MAP2', 'NSG1', 'SNR
PN', 'KIF5C', 'STMN4', 'PAK3'}

[2025-05-02 00:58:13,665] INFO | gsMap.diagnosis - Diagnostic Manhattan Plot saved to /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/manhattan_plot/E16.5_E1S1.MOSTA_IQ_Diagno
stic_Manhattan_Plot.html.
[2025-05-02 00:58:20,823] INFO | gsMap.diagnosis - Loading gene diagnostic information from /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/E16.5_E1S1.MOSTA_IQ_Gene_Diagnostic
_Info.csv...
[2025-05-02 00:58:20,835] INFO | gsMap.diagnosis - Generating GSS & Expression distribution plot for top 50 correlated genes...
[2025-05-02 01:01:33,605] INFO | gsMap.report - gsMap Diagnosis running successfully
[2025-05-02 01:01:33,618] INFO | gsMap.report - Cauchy combination already done for trait IQ. Results saved at /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/cauchy_combination/E16.5_E
1S1.MOSTA_IQ.Cauchy.csv.gz. Skipping...
[2025-05-02 01:01:33,683] INFO | gsMap.report - Report generated successfully! Saved at /home/hello/wkdir/gsMap/example_quick_mode/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/E16.5_E1S1.MOSTA_IQ_gsMap_Report.html.
[2025-05-02 01:01:33,683] INFO | gsMap.report - Copy the report directory to your local PC and open the HTML report file in a web browser to view the report.
[2025-05-02 01:01:33,683] INFO | gsMap.pipeline - Pipeline completed successfully.
[2025-05-02 01:01:33,683] INFO | gsMap - Finished running quick_mode at: 2025-05-02 01:01:33.
[2025-05-02 01:01:33,986] INFO | gsMap - Resource usage summary:
[2025-05-02 01:01:33,986] INFO | gsMap - • Wall clock time: 2.11 hours
[2025-05-02 01:01:33,986] INFO | gsMap - • CPU time: 14.80 hours
[2025-05-02 01:01:33,986] INFO | gsMap - • Average CPU utilization: 712.3%
[2025-05-02 01:01:33,986] INFO | gsMap - • Peak memory usage: 56.46 GB
33097.77user 20170.86system 2:06:35elapsed 701%CPU (0avgtext+0avgdata 59205924maxresident)k
872inputs+25417672outputs (3major+2555973742minor)pagefaults 0swaps

如果想一次分析多个性状,可以用配置文件(--sumstats_config_file)代替单个 GWAS 统计文件:

1
2
3
4
5
6
7
8
9
gsmap quick_mode \
--workdir './example_quick_mode/Mouse_Embryo' \
--homolog_file 'gsMap_resource/homologs/mouse_human_homologs.txt' \
--sample_name 'E16.5_E1S1.MOSTA' \
--gsMap_resource_dir 'gsMap_resource' \
--hdf5_path 'gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad' \
--annotation 'annotation' \
--data_layer 'count' \
--sumstats_config_file 'gsMap_example_data/GWAS/gwas_config.yaml'

其中gwas_config.yaml文件内容如下:

1
2
3
Height: gsMap_example_data/GWAS/GIANT_EUR_Height_2022_Nature.sumstats.gz
IQ: gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz
SCZ: gsMap_example_data/GWAS/PGC3_SCZ_wave3_public_INFO80.sumstats.gz

结果说明

所有输出文件会保存在指定的--workdir中,包含中间文件(如潜变量、基因 marker 分数、LD 分数),这些中间文件在同一 sample 多性状分析时会自动复用。report文件夹中会生成网页报告,包括空间细胞-性状关联的可视化和诊断图,可复制到本地机器后用浏览器打开。

典型输出结构

1
2
3
4
5
6
7
8
9
10
tree -L 3

example_quick_mode/Mouse_Embryo
├── E16.5_E1S1.MOSTA
│ ├── find_latent_representations # Contains latent representations in h5ad format
│ ├── latent_to_gene # Gene marker scores
│ ├── generate_ldscore # LD scores
│ ├── spatial_ldsc # Spatial cell-trait association results
│ ├── cauchy_combination # Region-level or cell type-level association results
│ └── report # Web report with visualizations and diagnostics

小鼠胚胎 (分步模式)

这个示例用于一步一步的运行gsMap,允许用户自定义参数和资源,从而在分析过程中获得更大的灵活性和控制力。此模式适合需要对流程进行详细定制的用户。

准备工作

首先按照安装程序部署gsMap软件包,然后下载相应的依赖,本示例需要以下资源文件:

  • GTF 文件,用于提供基因在基因组上的坐标;

  • LD 参考基因组 (plink bfile),用于计算 LD 分数;

  • SNP 权重文件,用于调整 SNP-性状关联统计之间的相关性;

  • 同源基因转换文件,可选,用于不同物种间基因名称的映射;

  • 增强子-基因映射文件,可选,用于基于增强子注释将 SNP 链接到基因。

下载所有所需的文件

1
2
3
wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_resource.tar.gz

tar -xvzf gsMap_resource.tar.gz

下载示例数据

1
2
3
wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_example_data.tar.gz

tar -xvzf gsMap_example_data.tar.gz

文件结构

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$ tree -L 3
.
├── gsMap_example_data
│   ├── GWAS
│   │   ├── BCX2_MCHC_EA_GWAMA.sumstats.gz
│   │   ├── GIANT_EUR_Height_2022_Nature.sumstats.gz
│   │   ├── gwas_config.yaml
│   │   └── IQ_NG_2018.sumstats.gz
│   └── ST
│   ├── E16.5_E1S1.MOSTA.h5ad
│   └── E16.5_E2S11.MOSTA.h5ad
├── gsMap_example_data.tar.gz
├── gsMap_resource
│   ├── genome_annotation
│   │   ├── enhancer
│   │   └── gtf
│   ├── homologs
│   │   ├── human_macaque_marmoset_mouse_homologs.txt.gz
│   │   ├── macaque_human_homologs.txt
│   │   └── mouse_human_homologs.txt
│   ├── LD_Reference_Panel
│   │   └── 1000G_EUR_Phase3_plink
│   ├── LDSC_resource
│   │   ├── hapmap3_snps
│   │   └── weights_hm3_no_hla
│   └── quick_mode
│   ├── baseline
│   ├── SNP_gene_pair
│   └── snp_gene_weight_matrix.h5ad
└── gsMap_resource.tar.gz

16 directories, 12 files

如果你希望使用自己的参考文件,请确保 GTF 文件和 LD 参考基因组的基因组版本(如 hg37 或 hg38)一致。

运行程序

计算潜在表达 (可选)

Execution: required memory: ~60G (120K cells)

计算潜在表达 (find latent representations),目的是为每一个点学习潜在表示,此步骤学到的latent embedding会被保存在 AnnData 对象的 obsm 字段,键名为latent_GVAE

--workdir参数指定 gsMap 的工作目录,所有输出文件都会保存在这里

1
2
3
4
5
6
7
# 2025-05-01 23:16 测试通过
nohup /tools/Python-3.10.8/bin/gsmap run_find_latent_representations \
--workdir '/home/hello/wkdir/gsMap/example/Mouse_Embryo' \
--sample_name 'E16.5_E1S1.MOSTA' \
--input_hdf5_path '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad' \
--annotation 'annotation' \
--data_layer 'count' &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-01 23:17:12,013] INFO | gsMap - Running run_find_latent_representations...
[2025-05-01 23:17:12,013] INFO | gsMap - Started at: 2025-05-01 23:17:12
Using the following arguments for FindLatentRepresentationsConfig:
{ 'annotation': 'annotation',
'convergence_threshold': 0.0001,
'data_layer': 'count',
'epochs': 300,
'feat_hidden1': 256,
'feat_hidden2': 128,
'gat_hidden1': 64,
'gat_hidden2': 30,
'gat_lr': 0.001,
'hierarchically': False,
'input_hdf5_path': '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad',
'n_comps': 300,
'n_neighbors': 11,
'p_drop': 0.1,
'pearson_residuals': False,
'sample_name': 'E16.5_E1S1.MOSTA',
'weighted_adj': False,
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-01 23:17:17,633] INFO | gsMap - ------Find latent representations for annotation...
[2025-05-01 23:17:17,635] INFO | gsMap.find_latent_representation - Using CPU for computations.
[2025-05-01 23:17:17,635] INFO | gsMap.find_latent_representation - Loading ST data of E16.5_E1S1.MOSTA...
[2025-05-01 23:17:26,649] INFO | gsMap.find_latent_representation - The ST data contains 121767 cells, 28204 genes.
[2025-05-01 23:17:26,918] INFO | gsMap.find_latent_representation - Preprocessing data...
[2025-05-01 23:17:28,485] INFO | gsMap.find_latent_representation - Using data layer: count...
[2025-05-01 23:17:29,050] INFO | gsMap.find_latent_representation - Dealing with count data...
[2025-05-01 23:18:45,467] INFO | gsMap.find_latent_representation - Finding latent representations for whole ST data...
[2025-05-01 23:18:45,789] INFO | gsMap.GNN.train - Start training...
------Calculating spatial graph...
The graph contains 1217670 edges, 121767 cells.
10.00 neighbors per cell on average.
......
[2025-05-01 23:46:31,610] INFO | gsMap.GNN.train - Convergence reached. Training stopped.
[2025-05-01 23:46:32,875] INFO | gsMap.find_latent_representation - Adding latent representations...
[2025-05-01 23:46:32,876] INFO | gsMap.find_latent_representation - Saving ST data...
[2025-05-01 23:46:38,504] INFO | gsMap - Finished running run_find_latent_representations at: 2025-05-01 23:46:38.
[2025-05-01 23:46:39,006] INFO | gsMap - Resource usage summary:
[2025-05-01 23:46:39,006] INFO | gsMap - • Wall clock time: 29.45 minutes
[2025-05-01 23:46:39,006] INFO | gsMap - • CPU time: 8.93 hours
[2025-05-01 23:46:39,006] INFO | gsMap - • Average CPU utilization: 1781.8%
[2025-05-01 23:46:39,006] INFO | gsMap - • Peak memory usage: 14.89 GB
24116.70user 8037.98system 29:28.68elapsed 1818%CPU (0avgtext+0avgdata 15669660maxresident)k
0inputs+11561200outputs (0major+828733267minor)pagefaults 0swaps

生成基因特异性得分

Execution: required memory: ~45G (120K cells)

生成基因特异性得分 (generate gene specificity scores),基于 --latent_representation 指定的潜在表达,为每个点识别其同质点 (homogeneous spots),随后通过聚合其同质点信息,为每个点生成基因特异性得分 (Gene Specificity Scores, GSS)。

**注意:**如果空间转录组数据不是来自人类物种,但希望将人类的 GWAS 数据映射到上面,请提供一个同源基因转换文件 (homologous transformation file) 以实现基因名称转换。文件第一列应为空间转录组数据物种的基因名,第二列为 GWAS 数据物种的基因名。

1
2
3
4
5
6
7
8
9
# 2025-05-02 00:02 测试通过
nohup /tools/Python-3.10.8/bin/gsmap run_latent_to_gene \
--workdir '/home/hello/wkdir/gsMap/example/Mouse_Embryo' \
--sample_name 'E16.5_E1S1.MOSTA' \
--annotation 'annotation' \
--latent_representation 'latent_GVAE' \
--num_neighbour 51 \
--num_neighbour_spatial 201 \
--homolog_file '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt' &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-02 00:05:43,439] INFO | gsMap - Running run_latent_to_gene...
[2025-05-02 00:05:43,439] INFO | gsMap - Started at: 2025-05-02 00:05:43
Using the following arguments for LatentToGeneConfig:
{ 'annotation': 'annotation',
'gM_slices': None,
'homolog_file': '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt',
'input_hdf5_path': None,
'latent_representation': 'latent_GVAE',
'no_expression_fraction': False,
'num_neighbour': 51,
'num_neighbour_spatial': 201,
'sample_name': 'E16.5_E1S1.MOSTA',
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-02 00:05:46,453] INFO | gsMap - Using the provided latent representation: latent_GVAE
[2025-05-02 00:05:46,453] INFO | gsMap - User provided homolog file to map gene names to human: /home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt
[2025-05-02 00:05:46,453] INFO | gsMap - Homolog file provided and will map gene name from column1:MOUSE_GENE_SYM to column2:HUMAN_GENE_SYM
[2025-05-02 00:05:46,453] INFO | gsMap.latent_to_gene - ------Loading the spatial data...
[2025-05-02 00:05:51,201] INFO | gsMap.latent_to_gene - Loaded spatial data with 121767 cells and 28204 genes.
[2025-05-02 00:05:51,201] INFO | gsMap.latent_to_gene - ------Cell annotations are provided as annotation...
[2025-05-02 00:05:51,355] INFO | gsMap.latent_to_gene - Removed null annotations. Cells retained: 121767 (initial: 121767).
[2025-05-02 00:05:51,356] INFO | gsMap.latent_to_gene - ------Transforming the MOUSE_GENE_SYM to HUMAN_GENE_SYM...
[2025-05-02 00:05:51,826] INFO | gsMap.latent_to_gene - 16331 genes retained after homolog transformation.
/tools/Python-3.10.8/lib/python3.10/site-packages/gsMap/latent_to_gene.py:187: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var[species_col_name] = adata.var_names.values
[2025-05-02 00:06:08,426] INFO | gsMap.latent_to_gene - 16331 genes retained after removing duplicates.
[2025-05-02 00:06:08,427] INFO | gsMap.latent_to_gene - Using cell annotations for 121767 cells.
[2025-05-02 00:06:08,427] INFO | gsMap.latent_to_gene - ------Building the spatial graph...
[2025-05-02 00:06:08,427] INFO | gsMap.latent_to_gene - ------Building spatial graph based on spatial coordinates...
[2025-05-02 00:06:08,427] INFO | gsMap.latent_to_gene - Cell annotations are provided...
[2025-05-02 00:06:08,748] INFO | gsMap.latent_to_gene - Cavity: 11287 cells
[2025-05-02 00:06:08,892] INFO | gsMap.latent_to_gene - Epidermis: 6304 cells
[2025-05-02 00:06:09,141] INFO | gsMap.latent_to_gene - Connective tissue: 8803 cells
[2025-05-02 00:06:09,421] INFO | gsMap.latent_to_gene - Muscle: 9853 cells
[2025-05-02 00:06:09,513] INFO | gsMap.latent_to_gene - Adipose tissue: 3822 cells
[2025-05-02 00:06:09,664] INFO | gsMap.latent_to_gene - Cartilage primordium: 5850 cells
[2025-05-02 00:06:09,722] INFO | gsMap.latent_to_gene - Submandibular gland: 2007 cells
[2025-05-02 00:06:09,810] INFO | gsMap.latent_to_gene - Jaw and tooth: 3078 cells
[2025-05-02 00:06:09,889] INFO | gsMap.latent_to_gene - Bone: 3400 cells
[2025-05-02 00:06:10,039] INFO | gsMap.latent_to_gene - Cartilage: 5602 cells
[2025-05-02 00:06:10,144] INFO | gsMap.latent_to_gene - Lung: 3760 cells
[2025-05-02 00:06:10,204] INFO | gsMap.latent_to_gene - Kidney: 2158 cells
......
[2025-05-02 00:53:57,511] INFO | gsMap.latent_to_gene - Marker scores computed.
[2025-05-02 00:54:19,638] INFO | gsMap.latent_to_gene - Removed mitochondrial genes. Remaining genes: 16331.
[2025-05-02 00:54:19,639] INFO | gsMap.latent_to_gene - ------Saving marker scores ...
[2025-05-02 00:54:46,445] INFO | gsMap.latent_to_gene - Marker scores saved to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/latent_to_gene/E16.5_E1S1.MOSTA_gene_marker_score.feather.
[2025-05-02 00:54:59,915] INFO | gsMap.latent_to_gene - Modified adata object saved to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/find_latent_representations/E16.5_E1S1.MOSTA_add_latent.h5ad.
[2025-05-02 00:55:00,290] INFO | gsMap - Finished running run_latent_to_gene at: 2025-05-02 00:55:00.
[2025-05-02 00:55:00,793] INFO | gsMap - Resource usage summary:
[2025-05-02 00:55:00,793] INFO | gsMap - • Wall clock time: 49.29 minutes
[2025-05-02 00:55:00,793] INFO | gsMap - • CPU time: 49.96 minutes
[2025-05-02 00:55:00,793] INFO | gsMap - • Average CPU utilization: 101.4%
[2025-05-02 00:55:00,793] INFO | gsMap - • Peak memory usage: 24.69 GB
2918.17user 80.75system 49:18.44elapsed 101%CPU (0avgtext+0avgdata 25893600maxresident)k
448inputs+13578264outputs (1major+18754742minor)pagefaults 0swaps

生成 LD 分数

Execution: required memory: ~40G

生成 LD 分数 (generate ldscore),目的将基因特异性得分 (GSS) 分配给 SNP,并计算分层 LD 分数 (stratified LD score)。

SNP 到基因的关联有三种策略可选 (Three SNP to gene linking strategies are available)

如果你在此步骤或下一步遇到内存不足的问题,可以将 --spots_per_chunk 参数设置为更小的数值。通常,当 --spots_per_chunk 设为 1000 时,大约需要 40GB 内存。

使用转录起始位点

该策略通过转录起始位点(TSS) 将 GSS 分配给 SNP;--gene_window_size参数定义了基因体两侧的窗口大小;如果某个 SNP 落在多个基因的窗口内,则使用距离最近基因的 GSS 。

1
2
3
4
5
6
7
8
9
10
11
12
# 2025-05-02 09:35 测试通过
nohup bash -c 'for CHROM in {1..22}
do
/tools/Python-3.10.8/bin/gsmap run_generate_ldscore \
--workdir "/home/hello/wkdir/gsMap/example/Mouse_Embryo" \
--sample_name "E16.5_E1S1.MOSTA" \
--chrom $CHROM \
--bfile_root "/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC" \
--keep_snp_root "/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm" \
--gtf_annotation_file "/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf" \
--gene_window_size 50000
done' > run_ldscore.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-02 09:40:36,632] INFO | gsMap - Running run_generate_ldscore...
[2025-05-02 09:40:36,632] INFO | gsMap - Started at: 2025-05-02 09:40:36
Using the following arguments for GenerateLDScoreConfig:
{ 'additional_baseline_annotation': None,
'bfile_root': '/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC',
'chrom': '1',
'enhancer_annotation_file': None,
'gene_window_enhancer_priority': None,
'gene_window_size': 50000,
'gtf_annotation_file': '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf',
'keep_snp_root': '/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm',
'ld_unit': 'CM',
'ld_wind': 1,
'sample_name': 'E16.5_E1S1.MOSTA',
'snp_multiple_enhancer_strategy': 'max_mkscore',
'spots_per_chunk': 1000,
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-02 09:40:37,139] INFO | gsMap - Only gene window annotation will be used to calculate LD score. SNP within +-50000 bp of gene body will be used.
[2025-05-02 09:40:37,139] INFO | gsMap - ------Baseline annotation is not provided. Default baseline annotation will be used.
[2025-05-02 09:40:57,231] INFO | gsMap.generate_ldscore - Loading GTF data from /home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf
[2025-05-02 09:41:52,806] INFO | gsMap.generate_ldscore - Found 16140 common genes between GTF and marker scores
[2025-05-02 09:41:55,043] INFO | gsMap.generate_ldscore - Processing chromosome 1
[2025-05-02 09:41:55,529] INFO | gsMap.utils.plink_ldscore_tool - Loading Plink genotype data from /home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC.1.bed
[2025-05-02 09:41:55,571] INFO | gsMap.utils.plink_ldscore_tool - Calculating MAF and QC for all SNPs
[2025-05-02 09:41:59,389] INFO | gsMap.utils.plink_ldscore_tool - All SNPs passed the basic quality check
[2025-05-02 09:42:00,187] INFO | gsMap.utils.plink_ldscore_tool - Loaded genotype data with 779354 SNPs and 489 individuals
[2025-05-02 09:42:00,222] INFO | gsMap.utils.plink_ldscore_tool - 454682 SNPs with MAF > f0.05
[2025-05-02 09:42:00,222] INFO | gsMap.generate_ldscore - Creating SNP-gene mappings for chromosome 1
[2025-05-02 09:42:00,784] INFO | gsMap.generate_ldscore - Getting SNP-gene pairs from GTF. SNPs in multiple genes will be assigned to the nearest gene (by TSS)
[2025-05-02 09:42:03,095] INFO | gsMap.generate_ldscore - Found 536664 SNP-gene pairs from GTF
[2025-05-02 09:42:03,643] INFO | gsMap.generate_ldscore - Kept 98690 SNPs after filtering with /home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm.1.snp
[2025-05-02 09:42:03,643] INFO | gsMap.generate_ldscore - These filtered SNPs will be used to calculate w_ld
......
[2025-05-02 12:58:39,434] INFO | gsMap.generate_ldscore - Generating w_ld for chr22
[2025-05-02 12:58:39,516] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 17261 SNPs remain
[2025-05-02 12:58:42,065] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 141123 SNPs remain
[2025-05-02 12:58:42,230] INFO | gsMap.generate_ldscore - Saved w_ld for chr22 to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/w_ld/weights.22.l2.ldscore.gz
[2025-05-02 12:58:42,281] INFO | gsMap.generate_ldscore - SNP-gene weight matrix shape: (17261, 365)
[2025-05-02 12:58:42,281] INFO | gsMap.generate_ldscore - Calculating baseline LD scores for chr22
[2025-05-02 12:58:42,433] INFO | gsMap.generate_ldscore - Calculating annotation LD scores for chr22
[2025-05-02 12:59:55,794] INFO | gsMap.generate_ldscore - LD score generation completed for E16.5_E1S1.MOSTA
[2025-05-02 12:59:55,878] INFO | gsMap - Finished running run_generate_ldscore at: 2025-05-02 12:59:55.
[2025-05-02 12:59:55,936] INFO | gsMap - Resource usage summary:
[2025-05-02 12:59:55,936] INFO | gsMap - • Wall clock time: 3.21 minutes
[2025-05-02 12:59:55,936] INFO | gsMap - • CPU time: 54.30 minutes
[2025-05-02 12:59:55,936] INFO | gsMap - • Average CPU utilization: 1802.3%
[2025-05-02 12:59:55,936] INFO | gsMap - • Peak memory usage: 22.16 GB

使用增强子-基因关联

该策略利用增强子-基因关联将 GSS 分配给 SNP 。当一个 SNP 对应多个增强子时,SNP 的 GSS 由--snp_multiple_enhancer_strategy参数决定。默认设置为max_mkscore,即为 SNP 分配其对应增强子中的最大 GSS ;另一种选择是nearest_TSS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 2025-05-02 21:28 测试通过,运行了 3-4 个小时
nohup bash -c 'for CHROM in {1..22}
do
/tools/Python-3.10.8/bin/gsmap run_generate_ldscore \
--workdir "/home/hello/wkdir/gsMap/example/Mouse_Embryo" \
--sample_name 'E16.5_E1S1.MOSTA' \
--chrom $CHROM \
--bfile_root "/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC" \
--keep_snp_root "/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm" \
--gtf_annotation_file '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf' \
--enhancer_annotation_file '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/enhancer/by_tissue/ALL/ABC_roadmap_merged.bed' \
--snp_multiple_enhancer_strategy 'max_mkscore' \
--gene_window_enhancer_priority 'enhancer_only'
done' > run_ldscore.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-02 21:27:37,220] INFO | gsMap - Running run_generate_ldscore...
[2025-05-02 21:27:37,220] INFO | gsMap - Started at: 2025-05-02 21:27:37
Using the following arguments for GenerateLDScoreConfig:
{ 'additional_baseline_annotation': None,
'bfile_root': '/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC',
'chrom': '1',
'enhancer_annotation_file': '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/enhancer/by_tissue/ALL/ABC_roadmap_merged.bed',
'gene_window_enhancer_priority': 'enhancer_only',
'gene_window_size': 50000,
'gtf_annotation_file': '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf',
'keep_snp_root': '/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm',
'ld_unit': 'CM',
'ld_wind': 1,
'sample_name': 'E16.5_E1S1.MOSTA',
'snp_multiple_enhancer_strategy': 'max_mkscore',
'spots_per_chunk': 1000,
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-02 21:27:37,727] INFO | gsMap - Only enhancer annotation will be used to calculate LD score.
[2025-05-02 21:27:37,727] INFO | gsMap - ------Baseline annotation is not provided. Default baseline annotation will be used.
[2025-05-02 21:27:56,196] INFO | gsMap.generate_ldscore - Loading GTF data from /home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf
[2025-05-02 21:28:51,593] INFO | gsMap.generate_ldscore - Found 16140 common genes between GTF and marker scores
[2025-05-02 21:29:00,951] INFO | gsMap.generate_ldscore - Processing chromosome 1
[2025-05-02 21:29:01,416] INFO | gsMap.utils.plink_ldscore_tool - Loading Plink genotype data from /home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC.1.bed
[2025-05-02 21:29:01,477] INFO | gsMap.utils.plink_ldscore_tool - Calculating MAF and QC for all SNPs
[2025-05-02 21:29:05,204] INFO | gsMap.utils.plink_ldscore_tool - All SNPs passed the basic quality check
[2025-05-02 21:29:05,957] INFO | gsMap.utils.plink_ldscore_tool - Loaded genotype data with 779354 SNPs and 489 individuals
[2025-05-02 21:29:05,988] INFO | gsMap.utils.plink_ldscore_tool - 454682 SNPs with MAF > f0.05
[2025-05-02 21:29:05,988] INFO | gsMap.generate_ldscore - Creating SNP-gene mappings for chromosome 1
[2025-05-02 21:29:06,823] INFO | gsMap.generate_ldscore - SNPs in multiple enhancers will be assigned to the gene with highest marker score
[2025-05-02 21:29:07,474] INFO | gsMap.generate_ldscore - Found 115905 SNP-gene pairs from enhancers
[2025-05-02 21:29:07,933] INFO | gsMap.generate_ldscore - Kept 98690 SNPs after filtering with /home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm.1.snp
[2025-05-02 21:29:07,933] INFO | gsMap.generate_ldscore - These filtered SNPs will be used to calculate w_ld
......
[2025-05-03 00:49:50,400] INFO | gsMap.generate_ldscore - Generating w_ld for chr22
[2025-05-03 00:49:50,438] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 17261 SNPs remain
[2025-05-03 00:49:53,406] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 141123 SNPs remain
[2025-05-03 00:49:53,585] INFO | gsMap.generate_ldscore - Saved w_ld for chr22 to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/w_ld/weights.22.l2.ldscore.gz
[2025-05-03 00:49:53,634] INFO | gsMap.generate_ldscore - SNP-gene weight matrix shape: (17261, 356)
[2025-05-03 00:49:53,634] INFO | gsMap.generate_ldscore - Calculating baseline LD scores for chr22
[2025-05-03 00:49:53,755] INFO | gsMap.generate_ldscore - Calculating annotation LD scores for chr22
[2025-05-03 00:51:08,741] INFO | gsMap.generate_ldscore - LD score generation completed for E16.5_E1S1.MOSTA
[2025-05-03 00:51:08,785] INFO | gsMap - Finished running run_generate_ldscore at: 2025-05-03 00:51:08.
[2025-05-03 00:51:08,843] INFO | gsMap - Resource usage summary:
[2025-05-03 00:51:08,843] INFO | gsMap - • Wall clock time: 3.34 minutes
[2025-05-03 00:51:08,843] INFO | gsMap - • CPU time: 56.30 minutes
[2025-05-03 00:51:08,843] INFO | gsMap - • Average CPU utilization: 1795.9%
[2025-05-03 00:51:08,843] INFO | gsMap - • Peak memory usage: 23.98 GB

同时使用转录起始位点和增强子-基因关联

该策略同时利用 TSS 和增强子-基因关联将 GSS 分配给 SNP 。如果某个 SNP 既落在基因的 TSS 窗口内,又通过增强子关联到另一个基因,--gene_window_enhancer_priority参数将决定 SNP 最终分配给哪个基因;可选项为gene_window_firstenhancer_first

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# 2025-05-03 09:49 测试通过,大概运行了 5-6 个小时
nohup bash -c 'for CHROM in {1..22}
do
/tools/Python-3.10.8/bin/gsmap run_generate_ldscore \
--workdir "/home/hello/wkdir/gsMap/example/Mouse_Embryo" \
--sample_name 'E16.5_E1S1.MOSTA' \
--chrom $CHROM \
--bfile_root "/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC" \
--keep_snp_root "/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm" \
--gtf_annotation_file '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf' \
--gene_window_size 50000 \
--enhancer_annotation_file '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/enhancer/by_tissue/ALL/ABC_roadmap_merged.bed' \
--snp_multiple_enhancer_strategy 'max_mkscore' \
--gene_window_enhancer_priority 'gene_window_first'
done' > run_ldscore.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-03 09:50:23,931] INFO | gsMap - Running run_generate_ldscore...
[2025-05-03 09:50:23,931] INFO | gsMap - Started at: 2025-05-03 09:50:23
Using the following arguments for GenerateLDScoreConfig:
{ 'additional_baseline_annotation': None,
'bfile_root': '/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC',
'chrom': '1',
'enhancer_annotation_file': '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/enhancer/by_tissue/ALL/ABC_roadmap_merged.bed',
'gene_window_enhancer_priority': 'gene_window_first',
'gene_window_size': 50000,
'gtf_annotation_file': '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf',
'keep_snp_root': '/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm',
'ld_unit': 'CM',
'ld_wind': 1,
'sample_name': 'E16.5_E1S1.MOSTA',
'snp_multiple_enhancer_strategy': 'max_mkscore',
'spots_per_chunk': 1000,
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-03 09:50:24,455] INFO | gsMap - Both gene_window and enhancer annotation will be used to calculate LD score.
[2025-05-03 09:50:24,455] INFO | gsMap - SNP within +-50000 bp of gene body will be used and enhancer annotation will be used to calculate LD score. If a snp maps to multiple enhancers, the strategy to choose b
y your select strategy: max_mkscore.
[2025-05-03 09:50:24,455] INFO | gsMap - ------Baseline annotation is not provided. Default baseline annotation will be used.
[2025-05-03 09:50:42,056] INFO | gsMap.generate_ldscore - Loading GTF data from /home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf
[2025-05-03 09:51:44,159] INFO | gsMap.generate_ldscore - Found 16140 common genes between GTF and marker scores
[2025-05-03 09:51:50,745] INFO | gsMap.generate_ldscore - Processing chromosome 1
[2025-05-03 09:51:51,214] INFO | gsMap.utils.plink_ldscore_tool - Loading Plink genotype data from /home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC.1.bed
[2025-05-03 09:51:51,305] INFO | gsMap.utils.plink_ldscore_tool - Calculating MAF and QC for all SNPs
[2025-05-03 09:51:55,082] INFO | gsMap.utils.plink_ldscore_tool - All SNPs passed the basic quality check
[2025-05-03 09:51:55,857] INFO | gsMap.utils.plink_ldscore_tool - Loaded genotype data with 779354 SNPs and 489 individuals
[2025-05-03 09:51:55,890] INFO | gsMap.utils.plink_ldscore_tool - 454682 SNPs with MAF > f0.05
[2025-05-03 09:51:55,890] INFO | gsMap.generate_ldscore - Creating SNP-gene mappings for chromosome 1
[2025-05-03 09:51:56,556] INFO | gsMap.generate_ldscore - Getting SNP-gene pairs from GTF. SNPs in multiple genes will be assigned to the nearest gene (by TSS)
[2025-05-03 09:51:58,800] INFO | gsMap.generate_ldscore - Found 536664 SNP-gene pairs from GTF
[2025-05-03 09:51:59,092] INFO | gsMap.generate_ldscore - SNPs in multiple enhancers will be assigned to the gene with highest marker score
[2025-05-03 09:51:59,717] INFO | gsMap.generate_ldscore - Found 115905 SNP-gene pairs from enhancers
[2025-05-03 09:51:59,840] INFO | gsMap.generate_ldscore - Filled 242690 SNPs with no GTF mapping using enhancer mappings
[2025-05-03 09:52:00,350] INFO | gsMap.generate_ldscore - Kept 98690 SNPs after filtering with /home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm.1.snp
[2025-05-03 09:52:00,350] INFO | gsMap.generate_ldscore - These filtered SNPs will be used to calculate w_ld
......
[2025-05-03 13:14:01,872] INFO | gsMap.generate_ldscore - Generating w_ld for chr22
[2025-05-03 13:14:01,914] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 17261 SNPs remain
[2025-05-03 13:14:04,451] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 141123 SNPs remain
[2025-05-03 13:14:04,608] INFO | gsMap.generate_ldscore - Saved w_ld for chr22 to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/w_ld/weights.22.l2.ldscore.gz
[2025-05-03 13:14:04,660] INFO | gsMap.generate_ldscore - SNP-gene weight matrix shape: (17261, 365)
[2025-05-03 13:14:04,660] INFO | gsMap.generate_ldscore - Calculating baseline LD scores for chr22
[2025-05-03 13:14:04,802] INFO | gsMap.generate_ldscore - Calculating annotation LD scores for chr22
[2025-05-03 13:15:17,581] INFO | gsMap.generate_ldscore - LD score generation completed for E16.5_E1S1.MOSTA
[2025-05-03 13:15:17,625] INFO | gsMap - Finished running run_generate_ldscore at: 2025-05-03 13:15:17.
[2025-05-03 13:15:17,681] INFO | gsMap - Resource usage summary:
[2025-05-03 13:15:17,681] INFO | gsMap - • Wall clock time: 3.29 minutes
[2025-05-03 13:15:17,681] INFO | gsMap - • CPU time: 55.56 minutes
[2025-05-03 13:15:17,681] INFO | gsMap - • Average CPU utilization: 1788.8%
[2025-05-03 13:15:17,681] INFO | gsMap - • Peak memory usage: 23.98 GB

空间 LDSC

Execution: required memory: ~40G

空间 LDSC (spatial ldsc),运行空间 LDSC (Linkage Disequilibrium Score Regression),以将 spots 与性状 (traits) 相关联。

1
2
3
4
5
6
7
8
# 2025-05-04 01:13 测试通过,大概运行了 1 个小时
nohup /tools/Python-3.10.8/bin/gsmap run_spatial_ldsc \
--workdir '/home/hello/wkdir/gsMap/example/Mouse_Embryo' \
--sample_name 'E16.5_E1S1.MOSTA' \
--trait_name 'IQ' \
--sumstats_file '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz' \
--w_file '/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/weights_hm3_no_hla/weights.' \
--num_processes 4 > run_spatial_ldsc.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-04 01:16:20,390] INFO | gsMap - Running run_spatial_ldsc...
[2025-05-04 01:16:20,390] INFO | gsMap - Started at: 2025-05-04 01:16:20
Using the following arguments for SpatialLDSCConfig:
{ 'chisq_max': None,
'n_blocks': 200,
'num_processes': 4,
'sample_name': 'E16.5_E1S1.MOSTA',
'sumstats_file': '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz',
'trait_name': 'IQ',
'use_additional_baseline_annotation': True,
'w_file': '/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/weights_hm3_no_hla/weights.',
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-04 01:16:22,670] INFO | gsMap - Using provided weights file: /home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/weights_hm3_no_hla/weights.
[2025-05-04 01:16:22,670] INFO | gsMap.spatial_ldsc - ------Running Spatial LDSC for E16.5_E1S1.MOSTA...
[2025-05-04 01:16:22,670] INFO | gsMap.utils.regression_read - Reading LD score annotations from /home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/weights_hm3_no_hla/weights.[1-22].l2.ldscore...
[2025-05-04 01:16:24,039] INFO | gsMap.utils.regression_read - Loaded 1242190 SNPs from LD weight files
[2025-05-04 01:16:24,049] INFO | gsMap.utils.regression_read - Reading LD score annotations from /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/baseline/baseline.[1-22].l2.ldscor
......
[2025-05-04 02:08:15,722] INFO | gsMap.utils.regression_read - Reading LD score annotations from /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/E16.5_E1S1.MOSTA_chunk121/E16.5_E1
[2025-05-04 02:08:19,226] INFO | gsMap.utils.regression_read - Loaded 1195563 SNPs from LD score files
[2025-05-04 02:08:42,215] INFO | gsMap.utils.regression_read - Reading LD score annotations from /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/E16.5_E1S1.MOSTA_chunk122/E16.5_E1
[2025-05-04 02:08:44,086] INFO | gsMap.utils.regression_read - Loaded 1195563 SNPs from LD score files
[2025-05-04 02:09:03,547] INFO | gsMap.spatial_ldsc - Output saved to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/spatial_ldsc/E16.5_E1S1.MOSTA_IQ.csv.gz for IQ
[2025-05-04 02:09:03,550] INFO | gsMap.spatial_ldsc - ------Spatial LDSC for E16.5_E1S1.MOSTA finished!
[2025-05-04 02:09:03,776] INFO | gsMap - Finished running run_spatial_ldsc at: 2025-05-04 02:09:03.
[2025-05-04 02:09:04,277] INFO | gsMap - Resource usage summary:
[2025-05-04 02:09:04,277] INFO | gsMap - • Wall clock time: 52.73 minutes
[2025-05-04 02:09:04,277] INFO | gsMap - • CPU time: 2.60 hours
[2025-05-04 02:09:04,277] INFO | gsMap - • Average CPU utilization: 296.5%
[2025-05-04 02:09:04,277] INFO | gsMap - • Peak memory usage: 12.81 GB

柯西组合 (可选)

Execution: required memory: ~12G

柯西组合 (cauchy combination),聚合特定空间区域 (细胞类型) 内单个 spot 的 P 值,以评估这些区域 (细胞类型) 与性状的关联性。

1
2
3
4
5
6
# 2025-05-04 09:50 测试通过,大概运行了十几秒
nohup /tools/Python-3.10.8/bin/gsmap run_cauchy_combination \
--workdir '/home/hello/wkdir/gsMap/example/Mouse_Embryo' \
--sample_name 'E16.5_E1S1.MOSTA' \
--trait_name 'IQ' \
--annotation 'annotation' > run_cauchy_combination.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-04 09:51:54,371] INFO | gsMap - Running run_cauchy_combination...
[2025-05-04 09:51:54,371] INFO | gsMap - Started at: 2025-05-04 09:51:54
Using the following arguments for CauchyCombinationConfig:
{ 'annotation': 'annotation',
'output_file': None,
'sample_name': 'E16.5_E1S1.MOSTA',
'sample_name_list': None,
'trait_name': 'IQ',
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-04 09:51:57,261] INFO | gsMap.cauchy_combination_test - ------Loading LDSC results for sample E16.5_E1S1.MOSTA...
[2025-05-04 09:51:57,433] INFO | gsMap.cauchy_combination_test - ------Loading ST data for sample E16.5_E1S1.MOSTA...
[2025-05-04 09:52:07,604] INFO | gsMap.cauchy_combination_test - Removed 2/9853 outliers (median + 3IQR) for Muscle.
[2025-05-04 09:52:07,654] INFO | gsMap.cauchy_combination_test - Removed 3/5850 outliers (median + 3IQR) for Cartilage primordium.
[2025-05-04 09:52:07,689] INFO | gsMap.cauchy_combination_test - Removed 27/3078 outliers (median + 3IQR) for Jaw and tooth.
[2025-05-04 09:52:07,856] INFO | gsMap.cauchy_combination_test - Removed 2/457 outliers (median + 3IQR) for Inner ear.
[2025-05-04 09:52:07,865] INFO | gsMap.cauchy_combination_test - Removed 2/194 outliers (median + 3IQR) for Adrenal gland.
[2025-05-04 09:52:07,963] INFO | gsMap.cauchy_combination_test - Removed 14/2783 outliers (median + 3IQR) for Mucosal epithelium.
[2025-05-04 09:52:07,978] INFO | gsMap.cauchy_combination_test - Removed 10/383 outliers (median + 3IQR) for Smooth muscle.
[2025-05-04 09:52:08,130] INFO | gsMap.cauchy_combination_test - Removed 1/17374 outliers (median + 3IQR) for Brain.
[2025-05-04 09:52:08,609] INFO | gsMap.cauchy_combination_test - Removed 1/1727 outliers (median + 3IQR) for Choroid plexus.
[2025-05-04 09:52:08,626] INFO | gsMap.cauchy_combination_test - Cauchy combination results saved at /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/cauchy_combination/E16.5_E1S1.MOSTA_IQ.Cauchy.csv.gz.
[2025-05-04 09:52:08,632] INFO | gsMap - Finished running run_cauchy_combination at: 2025-05-04 09:52:08.
[2025-05-04 09:52:08,754] INFO | gsMap - Resource usage summary:
[2025-05-04 09:52:08,754] INFO | gsMap - • Wall clock time: 14.39 seconds
[2025-05-04 09:52:08,754] INFO | gsMap - • CPU time: 17.80 seconds
[2025-05-04 09:52:08,754] INFO | gsMap - • Average CPU utilization: 126.2%
[2025-05-04 09:52:08,754] INFO | gsMap - • Peak memory usage: 6.56 GB

报告生成 (可选)

Execution: required memory: ~60G

报告生成 (),生成 gsMap 报告,包括映射结果的可视化以及诊断表。

默认用于可视化的基因是 GSS (基因集评分, Gene Set Score) 与性状-细胞关联的 -log10p 值相关性最高的 top 50 基因。如果需要选择特定基因进行可视化,可以使用--selected_genes参数。

1
2
3
4
5
6
7
8
# 2025-05-04 10:01 测试通过,大概运行了十几分钟
nohup /tools/Python-3.10.8/bin/gsmap run_report \
--workdir '/home/hello/wkdir/gsMap/example/Mouse_Embryo' \
--sample_name 'E16.5_E1S1.MOSTA' \
--trait_name 'IQ' \
--annotation 'annotation' \
--sumstats_file '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz' \
--top_corr_genes 50 > run_report.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-04 10:03:10,119] INFO | gsMap - Running run_report...
[2025-05-04 10:03:10,120] INFO | gsMap - Started at: 2025-05-04 10:03:10
Using the following arguments for ReportConfig:
{ 'annotation': 'annotation',
'fig_height': None,
'fig_style': 'light',
'fig_width': None,
'point_size': None,
'sample_name': 'E16.5_E1S1.MOSTA',
'selected_genes': None,
'sumstats_file': '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz',
'top_corr_genes': 50,
'trait_name': 'IQ',
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-04 10:03:13,104] INFO | gsMap.report - Running gsMap Diagnosis Module
[2025-05-04 10:03:16,441] INFO | gsMap.diagnosis - Creating gsMap plot...
[2025-05-04 10:03:21,063] INFO | gsMap.diagnosis - gsMap plot created and saved in /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/gsMap_plot.
[2025-05-04 10:03:21,065] INFO | gsMap.diagnosis - Loading and processing GWAS data...
[2025-05-04 10:03:47,023] INFO | gsMap.diagnosis - Gene diagnostic information not found. Calculating gene diagnostic information...
[2025-05-04 10:03:47,023] INFO | gsMap.diagnosis - Loading ST data and LDSC results...
[2025-05-04 10:04:19,471] INFO | gsMap.diagnosis - Calculating correlation between gene marker scores and trait logp-values...
[2025-05-04 10:08:20,965] INFO | gsMap.diagnosis - Gene diagnostic information saved to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/E16.5_E1S1.MOSTA_IQ_Gene_Diagnostic_Info.csv.
[2025-05-04 10:08:23,889] INFO | gsMap.diagnosis - To reduce the number of SNPs to plot, only 100000 SNPs with the smallest P-values are plotted.
[2025-05-04 10:08:24,799] INFO | gsMap.diagnosis - Creating Manhattan plot with 100000 SNPs
[2025-05-04 10:08:24,800] INFO | gsMap.diagnosis - Chromosome column values: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22]
/tools/Python-3.10.8/lib/python3.10/site-packages/gsMap/utils/manhattan_plot.py:164: UserWarning:
Some genes don't contain any SNP to highlight: : {'STMN4', 'GPRIN1', 'KIF5C', 'NSG1', 'CRMP1', 'APBA2', 'MLLT11', 'SNRPN', 'MAP2', 'ACTL6B', 'SOGA3', 'NSG2', 'ELAVL3', 'PAK3', 'JAKMIP2', 'ZCCHC18', 'CDK5R1', 'CNIH2', 'RUFY3', 'DCX'}
[2025-05-04 10:08:25,342] INFO | gsMap.diagnosis - Diagnostic Manhattan Plot saved to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/manhattan_plot/E16.5_E1S1.MOSTA_IQ_Diagnostic_Manhat
tan_Plot.html.
[2025-05-04 10:08:31,114] INFO | gsMap.diagnosis - Loading gene diagnostic information from /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/E16.5_E1S1.MOSTA_IQ_Gene_Diagnostic_Info.csv...
[2025-05-04 10:08:31,126] INFO | gsMap.diagnosis - Generating GSS & Expression distribution plot for top 50 correlated genes...
[2025-05-04 10:11:48,892] INFO | gsMap.report - gsMap Diagnosis running successfully
[2025-05-04 10:11:48,906] INFO | gsMap.report - Cauchy combination already done for trait IQ. Results saved at /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/cauchy_combination/E16.5_E1S1.MOSTA_IQ.Cauchy.csv.gz. Skipping...
[2025-05-04 10:11:48,996] INFO | gsMap.report - Report generated successfully! Saved at /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/report/IQ/E16.5_E1S1.MOSTA_IQ_gsMap_Report.html.
[2025-05-04 10:11:48,996] INFO | gsMap.report - Copy the report directory to your local PC and open the HTML report file in a web browser to view the report.
[2025-05-04 10:11:48,999] INFO | gsMap - Finished running run_report at: 2025-05-04 10:11:48.
[2025-05-04 10:11:49,117] INFO | gsMap - Resource usage summary:
[2025-05-04 10:11:49,117] INFO | gsMap - • Wall clock time: 8.65 minutes
[2025-05-04 10:11:49,117] INFO | gsMap - • CPU time: 7.64 minutes
[2025-05-04 10:11:49,117] INFO | gsMap - • Average CPU utilization: 107.5%
[2025-05-04 10:11:49,117] INFO | gsMap - • Peak memory usage: 45.23 GB

进阶用法

使用自定义的潜在表示

Using Customized Latent Representations

1
2
3
4
5
6
7
8
9
# 未测试
# This could be any key in the obsm field of the AnnData object.
latent_customized = 'latent_customized'
gsmap run_latent_to_gene \
--input_hdf5_path 'sample1.h5ad' \
--workdir './workdir' \
--sample_name 'sample1' \
--annotation 'annotation' \
--latent_representation $latent_customized

条件分析

Execution: required memory: ~50G

条件分析 (conditional analysis),通过调整其他功能注释或细胞类型级别的注释来进行条件分析。

这一步骤是对第三步生成 LD 分数generate ldscore的拓展,通过在基线模型中添加额外的功能注释来进行条件分析的。

可以通过参数--additional_baseline_annotation 指定额外注释的目录。

Download the additional annotations:

1
2
3
wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_additional_annotation.tar.gz

tar -xvzf gsMap_additional_annotation.tar.gz


The format of the additional annotation files is such that each line represents a SNP, with columns indicating the annotation values for that SNP. These values can be either binary or continuous.

1
2
3
4
5
6
7
8
9
10
$ zless -S gsMap_additional_annotation/baseline.1.annot.gz
CHR BP SNP CM Coding_UCSC Coding_UCSC.extend.500 Conserved_LindbladToh Conserved_LindbladToh.extend.500 CTCF_Hoffman CTCF_Hoffman.extend.500 DGF_ENCODE DGF_ENCODE.extend>
1 11008 rs575272151 0.0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0>
1 11012 rs544419019 0.0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0>
1 13110 rs540538026 0.0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1>
1 13116 rs62635286 0.0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1>
1 13118 rs200579949 0.0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1>
1 13273 rs531730856 0.0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0>
1 13550 rs554008981 0.0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0>
1 14464 rs546169444 0.0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0>
1
2
3
4
5
6
7
8
9
10
11
12
13
# 2025-05-04 13:37 测试通过,运行了 4-5 个小时
nohup bash -c 'for CHROM in {1..22}
do
/tools/Python-3.10.8/bin/gsmap run_generate_ldscore \
--workdir "/home/hello/wkdir/gsMap/example/Mouse_Embryo" \
--sample_name 'E16.5_E1S1.MOSTA' \
--chrom $CHROM \
--bfile_root "/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC" \
--keep_snp_root "/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm" \
--gtf_annotation_file '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf' \
--gene_window_size 50000 \
--additional_baseline_annotation '/home/hello/wkdir/gsMap/gsMap_additional_annotation'
done' > run_ldscore.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-04 13:41:21,710] INFO | gsMap - Running run_generate_ldscore...
[2025-05-04 13:41:21,710] INFO | gsMap - Started at: 2025-05-04 13:41:21
Using the following arguments for GenerateLDScoreConfig:
{ 'additional_baseline_annotation': '/home/hello/wkdir/gsMap/gsMap_additional_annotation',
'bfile_root': '/home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC',
'chrom': '1',
'enhancer_annotation_file': None,
'gene_window_enhancer_priority': None,
'gene_window_size': 50000,
'gtf_annotation_file': '/home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf',
'keep_snp_root': '/home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm',
'ld_unit': 'CM',
'ld_wind': 1,
'sample_name': 'E16.5_E1S1.MOSTA',
'snp_multiple_enhancer_strategy': 'max_mkscore',
'spots_per_chunk': 1000,
'workdir': '/home/hello/wkdir/gsMap/example/Mouse_Embryo'}
[2025-05-04 13:41:22,243] INFO | gsMap - Only gene window annotation will be used to calculate LD score. SNP within +-50000 bp of gene body will be used.
[2025-05-04 13:41:22,243] INFO | gsMap - ------Baseline annotation is provided. Additional baseline annotation will be used with the default baseline annotation.
[2025-05-04 13:41:22,243] INFO | gsMap - ------Baseline annotation directory: /home/hello/wkdir/gsMap/gsMap_additional_annotation
[2025-05-04 13:41:40,746] INFO | gsMap.generate_ldscore - Loading GTF data from /home/hello/wkdir/gsMap/gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf
[2025-05-04 13:42:36,687] INFO | gsMap.generate_ldscore - Found 16140 common genes between GTF and marker scores
[2025-05-04 13:42:38,701] INFO | gsMap.generate_ldscore - Processing chromosome 1
[2025-05-04 13:42:39,195] INFO | gsMap.utils.plink_ldscore_tool - Loading Plink genotype data from /home/hello/wkdir/gsMap/gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC.1.bed
[2025-05-04 13:42:39,250] INFO | gsMap.utils.plink_ldscore_tool - Calculating MAF and QC for all SNPs
[2025-05-04 13:42:42,983] INFO | gsMap.utils.plink_ldscore_tool - All SNPs passed the basic quality check
[2025-05-04 13:42:43,800] INFO | gsMap.utils.plink_ldscore_tool - Loaded genotype data with 779354 SNPs and 489 individuals
[2025-05-04 13:42:43,833] INFO | gsMap.utils.plink_ldscore_tool - 454682 SNPs with MAF > f0.05
[2025-05-04 13:42:43,834] INFO | gsMap.generate_ldscore - Creating SNP-gene mappings for chromosome 1
[2025-05-04 13:42:44,407] INFO | gsMap.generate_ldscore - Getting SNP-gene pairs from GTF. SNPs in multiple genes will be assigned to the nearest gene (by TSS)
[2025-05-04 13:42:46,709] INFO | gsMap.generate_ldscore - Found 536664 SNP-gene pairs from GTF
[2025-05-04 13:42:47,356] INFO | gsMap.generate_ldscore - Kept 98690 SNPs after filtering with /home/hello/wkdir/gsMap/gsMap_resource/LDSC_resource/hapmap3_snps/hm.1.snp
[2025-05-04 13:42:47,357] INFO | gsMap.generate_ldscore - These filtered SNPs will be used to calculate w_ld
......
[2025-05-04 17:11:21,962] INFO | gsMap.generate_ldscore - Generating w_ld for chr22
[2025-05-04 17:11:22,014] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 17261 SNPs remain
[2025-05-04 17:11:24,752] INFO | gsMap.utils.plink_ldscore_tool - After keep_snps filtering, 141123 SNPs remain
[2025-05-04 17:11:24,912] INFO | gsMap.generate_ldscore - Saved w_ld for chr22 to /home/hello/wkdir/gsMap/example/Mouse_Embryo/E16.5_E1S1.MOSTA/generate_ldscore/w_ld/weights.22.l2.ldscore.gz
[2025-05-04 17:11:24,959] INFO | gsMap.generate_ldscore - SNP-gene weight matrix shape: (17261, 365)
[2025-05-04 17:11:24,959] INFO | gsMap.generate_ldscore - Calculating baseline LD scores for chr22
[2025-05-04 17:11:25,125] INFO | gsMap.generate_ldscore - Calculating annotation LD scores for chr22
[2025-05-04 17:12:47,965] INFO | gsMap.generate_ldscore - LD score generation completed for E16.5_E1S1.MOSTA
[2025-05-04 17:12:48,022] INFO | gsMap - Finished running run_generate_ldscore at: 2025-05-04 17:12:48.
[2025-05-04 17:12:48,271] INFO | gsMap - Resource usage summary:
[2025-05-04 17:12:48,271] INFO | gsMap - • Wall clock time: 3.52 minutes
[2025-05-04 17:12:48,272] INFO | gsMap - • CPU time: 56.87 minutes
[2025-05-04 17:12:48,272] INFO | gsMap - • Average CPU utilization: 1722.7%
[2025-05-04 17:12:48,272] INFO | gsMap - • Peak memory usage: 22.96 GB

多样本同时分析

当有多个生物学重复样本可用时,可以通过计算样本间基因排名的统一切片均值 (slice mean) 来获得更一致且可比较的结果。然后基于这个切片均值排名计算基因空间评分 (GSS),这种方法确保不同样本之间的结果更加一致和可比。

计算切片均值

为了生成切片均值 (Calculate the Slice Mean),可以使用 create_slice_mean 命令。该命令会输出一个包含每个基因切片均值排名的 Parquet 文件。--sample_name_list 参数用于指定样本的名称,--h5ad_list 参数用于提供每个样本的 AnnData 对象路径。

1
2
3
4
5
6
7
# 2025-05-05 20:32 测试通过,运行了 4-5 分钟
nohup /tools/Python-3.10.8/bin/gsmap create_slice_mean \
--sample_name_list 'E16.5_E1S1.MOSTA' 'E16.5_E2S11.MOSTA' \
--h5ad_list '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad' '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E2S11.MOSTA.h5ad' \
--slice_mean_output_file '/home/hello/wkdir/gsMap/workdir/sample_slice_mean.parquet' \
--data_layer 'count' \
--homolog_file '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt' > create_slice_mean.log 2>&1 &
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# 运行日志
___ ___
| \/ |
__ _ ___ | . . | __ _ _ __
/ _` / __|| |\/| |/ _` | '_ \
| (_| \__ \| | | | (_| | |_) |
\__, |___/\_| |_/\__,_| .__/
__/ | | |
|___/ |_|
Version: 1.73.4
================================================================================
[2025-05-05 20:32:52,435] INFO | gsMap - Running create_slice_mean...
[2025-05-05 20:32:52,435] INFO | gsMap - Started at: 2025-05-05 20:32:52
Using the following arguments for CreateSliceMeanConfig:
{ 'data_layer': 'count',
'h5ad_list': [ '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad',
'/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E2S11.MOSTA.h5ad'],
'h5ad_yaml': None,
'homolog_file': '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt',
'sample_name_list': ['E16.5_E1S1.MOSTA', 'E16.5_E2S11.MOSTA'],
'slice_mean_output_file': '/home/hello/wkdir/gsMap/workdir/sample_slice_mean.parquet'}
[2025-05-05 20:32:55,402] INFO | gsMap - Reading sample name list and h5ad list
[2025-05-05 20:32:55,402] INFO | gsMap - Input h5ad files: {'E16.5_E1S1.MOSTA': '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad', 'E16.5_E2S11.MOSTA': '/home/hello/wkdir/gsMap/gsMap_example
_data/ST/E16.5_E2S11.MOSTA.h5ad'}
[2025-05-05 20:32:55,402] INFO | gsMap - User provided homolog file to map gene names to human: /home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt
[2025-05-05 20:32:55,403] INFO | gsMap - Homolog file provided and will map gene name from column1:MOUSE_GENE_SYM to column2:HUMAN_GENE_SYM
[2025-05-05 20:33:09,393] INFO | gsMap.create_slice_mean - Found 16260 common genes across all files.
......
[2025-05-05 20:36:59,606] INFO | gsMap.create_slice_mean - Final slice mean and expression fraction saved to /home/hello/wkdir/gsMap/workdir/sample_slice_mean.parquet
[2025-05-05 20:36:59,607] INFO | gsMap - Finished running create_slice_mean at: 2025-05-05 20:36:59.
[2025-05-05 20:36:59,980] INFO | gsMap - Resource usage summary:
[2025-05-05 20:36:59,980] INFO | gsMap - • Wall clock time: 4.13 minutes
[2025-05-05 20:36:59,980] INFO | gsMap - • CPU time: 4.26 minutes
[2025-05-05 20:36:59,980] INFO | gsMap - • Average CPU utilization: 103.3%
[2025-05-05 20:36:59,980] INFO | gsMap - • Peak memory usage: 19.79 GB

使用切片均值 (快速模式)

quick_mode 命令允许您使用切片均值在单一步骤中运行整个流程,--gM_slices 参数指定了在上一步生成的切片均值文件的路径。

1
2
3
4
5
6
7
8
9
10
11
12
13
# 2025-05-05 21:52 测试通过,运行了  分钟
# For the 'E16.5_E1S1.MOSTA' sample (the same applies to 'E16.5_E2S11.MOSTA')
nohup /tools/Python-3.10.8/bin/gsmap quick_mode \
--workdir '/home/hello/wkdir/gsMap/workdir' \
--homolog_file '/home/hello/wkdir/gsMap/gsMap_resource/homologs/mouse_human_homologs.txt' \
--sample_name 'E16.5_E1S1.MOSTA' \
--gsMap_resource_dir 'gsMap_resource' \
--hdf5_path '/home/hello/wkdir/gsMap/gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad' \
--annotation 'annotation' \
--data_layer 'count' \
--sumstats_file '/home/hello/wkdir/gsMap/gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz' \
--trait_name 'IQ' \
--gM_slices '/home/hello/wkdir/gsMap/workdir/sample_slice_mean.parquet' > create_slice_mean.log 2>&1 &

示例命令:

https://yanglab.westlake.edu.cn/gsmap/document/software

  • Title: 整合全基因组关联研究和空间转录组数据定位复杂疾病相关细胞的空间分布 (运行中)
  • Author: Xing Abao
  • Created at : 2025-05-01 17:57:22
  • Updated at : 2025-05-07 08:45:12
  • Link: https://bioinformatics.vip/2025/05/01/post-GWAS/整合全基因组关联研究和空间转录组数据定位复杂疾病相关细胞的空间分布/
  • License: This work is licensed under CC BY-NC-SA 4.0.
Comments