admix.Dataset

We introduce the central data structures used in this package.

dset.geno: genotype (n_snp, n_indiv, n_ploidy)
dset.lanc: local ancestry (n_snp, n_indiv, n_ploidy)
dset.snp: information about SNPs (n_snp, n_snp_feature)
dset.indiv: information about individuals (n_indiv, n_indiv_feature)
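To make the array layout above concrete, here is a minimal NumPy sketch of the two 3-D arrays. It does not use admix itself. The toy sizes, the random values, and the 0/1 allele coding are assumptions made for illustration only.

```python
import numpy as np

# Toy dimensions (hypothetical; the example data set is far larger)
n_snp, n_indiv, n_ploidy = 4, 3, 2
rng = np.random.default_rng(0)

# geno[s, i, h]: allele (assumed coded 0/1) at SNP s on haplotype h of individual i
geno = rng.integers(0, 2, size=(n_snp, n_indiv, n_ploidy)).astype(np.float32)
# lanc[s, i, h]: ancestry label (0 or 1) of that same haplotype at that same SNP
lanc = rng.integers(0, 2, size=(n_snp, n_indiv, n_ploidy)).astype(np.int8)

# Summing over the ploidy axis gives an unphased dosage per SNP and individual
dosage = geno.sum(axis=2)
print(geno.shape, lanc.shape, dosage.shape)
```

Because geno and lanc share the same shape, entry [s, i, h] in one array always describes the same haplotype position as entry [s, i, h] in the other.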

Central to the Python API is the admix.Dataset class, which supports a range of convenient operations for manipulating large on-disk data sets.

[1]:

import admix

# load example data
dset = admix.io.read_dataset("example_data/CEU-YRI")

[2]:

# overview of data set
dset

[2]:

admix.Dataset object with n_snp x n_indiv = 15357 x 10000, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'

[3]:

# SNP attributes, CHROM (chromosomes), POS (positions), REF (reference allele), ALT (alternative allele), etc.
# if ancestry-specific allele frequencies have been precomputed, they appear as extra columns such as FREQ1, FREQ2
dset.snp

[3]:

	CHROM	POS	REF	ALT	QUAL	FILTER	INFO
snp
22:16406147:A:G	22	16406147	A	G	.	.	.
22:16551808:T:C	22	16551808	T	C	.	.	.
22:16573830:T:C	22	16573830	T	C	.	.	.
22:16575525:T:C	22	16575525	T	C	.	.	.
22:16576248:G:T	22	16576248	G	T	.	.	.
...	...	...	...	...	...	...	...
22:50739662:G:A	22	50739662	G	A	.	.	.
22:50743331:A:G	22	50743331	A	G	.	.	.
22:50772964:T:C	22	50772964	T	C	.	.	.
22:50774447:A:C	22	50774447	A	C	.	.	.
22:50780578:G:A	22	50780578	G	A	.	.	.

15357 rows × 7 columns

[4]:

# individual attributes
dset.indiv

[4]:


indiv
Sample_1
Sample_2
Sample_3
Sample_4
Sample_5
...
Sample_9996
Sample_9997
Sample_9998
Sample_9999
Sample_10000

10000 rows × 0 columns

[5]:

# phased genotype (n_snp, n_indiv, 2)
dset.geno

[5]:

	Array	Chunk
Bytes	1.14 GiB	78.12 MiB
Shape	(15357, 10000, 2)	(1024, 10000, 2)
Dask graph	15 chunks in 31 graph layers
Data type	float32 numpy.ndarray

[6]:

# local ancestry (n_snp, n_indiv, 2)
dset.lanc

[6]:

	Array	Chunk
Bytes	292.91 MiB	19.53 MiB
Shape	(15357, 10000, 2)	(1024, 10000, 2)
Dask graph	15 chunks in 31 graph layers
Data type	int8 numpy.ndarray
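The sizes in the two array summaries above follow directly from the shapes and data types. As a quick check, this sketch redoes the arithmetic in plain Python; the helper name `nbytes` is made up for this example.

```python
import math

n_snp, n_indiv, n_ploidy = 15357, 10000, 2
chunk_snp = 1024  # chunk length along the SNP axis, read off the summaries above

def nbytes(shape, itemsize):
    """Bytes needed for a dense array of this shape and item size."""
    return math.prod(shape) * itemsize

geno_gib = nbytes((n_snp, n_indiv, n_ploidy), 4) / 2**30          # float32: 4 bytes
lanc_mib = nbytes((n_snp, n_indiv, n_ploidy), 1) / 2**20          # int8: 1 byte
geno_chunk_mib = nbytes((chunk_snp, n_indiv, n_ploidy), 4) / 2**20
lanc_chunk_mib = nbytes((chunk_snp, n_indiv, n_ploidy), 1) / 2**20
n_chunks = math.ceil(n_snp / chunk_snp)                           # last chunk is partial

print(f"geno: {geno_gib:.2f} GiB in {n_chunks} chunks of up to {geno_chunk_mib:.2f} MiB")
print(f"lanc: {lanc_mib:.2f} MiB in {n_chunks} chunks of up to {lanc_chunk_mib:.2f} MiB")
```

Chunking along the SNP axis means only one block of at most 1024 SNPs has to be in memory at a time, which is what makes block-wise operations on an on-disk data set practical.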

[7]:

# subset the first 50 SNPs
dset[0:50, :]

[7]:

admix.Dataset object with n_snp x n_indiv = 50 x 10000, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'

[8]:

# subset the first 50 individuals
dset[:, 0:50]

[8]:

admix.Dataset object with n_snp x n_indiv = 15357 x 50, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'

[9]:

# subset the first 50 SNPs and first 50 individuals
dset[0:50, 0:50]

[9]:

admix.Dataset object with n_snp x n_indiv = 50 x 50, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
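The indexing convention in the cells above is SNPs first, individuals second, with every component subset consistently. The following NumPy sketch mimics that behaviour on toy arrays. It is not how admix.Dataset is implemented, and the helper `subset` and all toy names are hypothetical.

```python
import numpy as np

n_snp, n_indiv = 6, 5
geno = np.zeros((n_snp, n_indiv, 2), dtype=np.float32)
lanc = np.zeros((n_snp, n_indiv, 2), dtype=np.int8)
snp_ids = np.array([f"snp_{k}" for k in range(n_snp)])
indiv_ids = np.array([f"Sample_{k + 1}" for k in range(n_indiv)])

def subset(snp_slice, indiv_slice):
    """Apply one (SNP, individual) slice pair to every component.

    Only slices are handled here; NumPy pairs up integer-list indices
    element-wise, which would need separate treatment.
    """
    return (
        geno[snp_slice, indiv_slice],
        lanc[snp_slice, indiv_slice],
        snp_ids[snp_slice],
        indiv_ids[indiv_slice],
    )

sub_geno, sub_lanc, sub_snps, sub_indivs = subset(slice(0, 3), slice(0, 2))
print(sub_geno.shape, sub_lanc.shape, sub_snps.tolist(), sub_indivs.tolist())
```

The ploidy axis is left untouched, so a subset is still a 3-D array of haplotypes.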

[10]:

# count alleles on each ancestry background (n_snp, n_indiv, n_anc)
dset.allele_per_anc()

[10]:

	Array	Chunk
Bytes	2.29 GiB	156.25 MiB
Shape	(15357, 10000, 2)	(1024, 10000, 2)
Dask graph	15 chunks in 63 graph layers
Data type	float64 numpy.ndarray
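One way to read the result above: for each SNP and individual, the allele count is split by the local ancestry of the haplotype carrying it. The sketch below computes that quantity with NumPy on a tiny hand-made example. It is an illustration under the assumption of 0/1 allele coding, not the library's actual implementation, and `allele_per_anc_sketch` is a made-up name.

```python
import numpy as np

def allele_per_anc_sketch(geno, lanc, n_anc=2):
    """Per-ancestry allele counts, shape (n_snp, n_indiv, n_anc)."""
    out = np.zeros(geno.shape[:2] + (n_anc,), dtype=np.float64)
    for a in range(n_anc):
        # keep only alleles sitting on haplotypes of ancestry a, then sum over ploidy
        out[:, :, a] = (geno * (lanc == a)).sum(axis=2)
    return out

# 2 SNPs x 2 individuals x 2 haplotypes
geno = np.array([[[1, 0], [1, 1]],
                 [[0, 1], [0, 0]]], dtype=np.float32)
lanc = np.array([[[0, 1], [1, 1]],
                 [[0, 0], [1, 0]]], dtype=np.int8)

apa = allele_per_anc_sketch(geno, lanc)
print(apa)
```

Summing the result over the ancestry axis recovers the ordinary dosage, which is a handy sanity check.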

[11]:

# calculate allele frequencies on each ancestry background (n_snp, n_anc)
dset.af_per_anc()

admix.data.af_per_anc: 100%|███████████████████████████████████████████████████████████████████████████████| 15/15 [00:11<00:00,  1.35it/s]

[11]:

array([[0.28433613, 0.01600149],
       [0.44607595, 0.46056075],
       [0.25962513, 0.01133815],
       ...,
       [0.1228952 , 0.01448099],
       [0.34757477, 0.55327383],
       [0.17893943, 0.36445915]])
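Each row of the array above corresponds to one SNP and each column to one ancestry. A plausible definition, assumed here rather than taken from the library source, is the number of alleles observed on haplotypes of a given ancestry divided by the number of such haplotypes. A minimal NumPy sketch with a made-up function name:

```python
import numpy as np

def af_per_anc_sketch(geno, lanc, n_anc=2):
    """Ancestry-specific allele frequencies, shape (n_snp, n_anc)."""
    af = np.full((geno.shape[0], n_anc), np.nan)
    for a in range(n_anc):
        mask = lanc == a
        n_hap = mask.sum(axis=(1, 2))           # haplotypes of ancestry a per SNP
        n_alt = (geno * mask).sum(axis=(1, 2))  # alleles carried on those haplotypes
        ok = n_hap > 0                          # leave NaN where ancestry a is absent
        af[ok, a] = n_alt[ok] / n_hap[ok]
    return af

# 2 SNPs x 2 individuals x 2 haplotypes
geno = np.array([[[1, 0], [1, 1]],
                 [[0, 1], [0, 0]]], dtype=np.float32)
lanc = np.array([[[0, 1], [1, 1]],
                 [[0, 0], [1, 0]]], dtype=np.int8)

af = af_per_anc_sketch(geno, lanc)
print(af)
```

On a chunked on-disk array the same sums can be accumulated block by block along the SNP axis, which is consistent with the 15-step progress bar shown above.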