We introduce the central data structures used in this package.
dset.geno: genotype (n_snp, n_indiv, n_ploidy)
dset.lanc: local ancestry (n_snp, n_indiv, n_ploidy)
dset.snp: information about SNPs (n_snp, n_snp_feature)
dset.indiv: information about individuals (n_indiv, n_indiv_feature)
The core of the Python API is the admix.Dataset class, which supports convenient operations for manipulating large on-disk data sets.
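To make the four shapes listed above concrete, here is a minimal sketch that builds toy NumPy arrays with the same axis layout. It does not use admix at all; the sizes, the random values, and the `dosage` variable are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy sizes (assumptions for illustration; the example data is far larger)
n_snp, n_indiv, n_ploidy, n_anc = 5, 4, 2, 2

rng = np.random.default_rng(0)

# geno[s, i, h]: allele (0 or 1) at SNP s on haplotype h of individual i
geno = rng.integers(0, 2, size=(n_snp, n_indiv, n_ploidy)).astype(np.float32)

# lanc[s, i, h]: ancestry label (0 .. n_anc - 1) of that same haplotype position
lanc = rng.integers(0, n_anc, size=(n_snp, n_indiv, n_ploidy)).astype(np.int8)

# Genotype and local ancestry are aligned entry by entry
assert geno.shape == lanc.shape == (n_snp, n_indiv, n_ploidy)

# Summing over the ploidy axis gives an unphased dosage per individual
dosage = geno.sum(axis=2)
print(dosage.shape)  # (5, 4)
```

The point of the layout is that `geno` and `lanc` can be combined elementwise, which is what the per-ancestry summaries later in this section rely on.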
[1]:
import admix

# load example data
dset = admix.io.read_dataset("example_data/CEU-YRI")
[2]:
# overview of data set
dset
[2]:
admix.Dataset object with n_snp x n_indiv = 15357 x 10000, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[3]:
# SNP attributes, CHROM (chromosomes), POS (positions), REF (reference allele), ALT (alternative allele), etc.
# we have also precomputed FREQ1, FREQ2 as ancestry-specific allele frequencies
dset.snp
[3]:
snp              CHROM  POS       REF  ALT  QUAL  FILTER  INFO
22:16406147:A:G  22     16406147  A    G    .     .       .
22:16551808:T:C  22     16551808  T    C    .     .       .
22:16573830:T:C  22     16573830  T    C    .     .       .
22:16575525:T:C  22     16575525  T    C    .     .       .
22:16576248:G:T  22     16576248  G    T    .     .       .
...              ...    ...       ...  ...  ...   ...     ...
22:50739662:G:A  22     50739662  G    A    .     .       .
22:50743331:A:G  22     50743331  A    G    .     .       .
22:50772964:T:C  22     50772964  T    C    .     .       .
22:50774447:A:C  22     50774447  A    C    .     .       .
22:50780578:G:A  22     50780578  G    A    .     .       .
15357 rows × 7 columns
[4]:
# individual attributes
dset.indiv
[4]:
indiv
Sample_1
Sample_2
Sample_3
Sample_4
Sample_5
...
Sample_9996
Sample_9997
Sample_9998
Sample_9999
Sample_10000
10000 rows × 0 columns
[5]:
# phased genotype (n_snp, n_indiv, 2)
dset.geno
[5]:
            Array              Chunk
Bytes       1.14 GiB           78.12 MiB
Shape       (15357, 10000, 2)  (1024, 10000, 2)
Dask graph  15 chunks in 31 graph layers
Data type   float32 numpy.ndarray
[6]:
# local ancestry (n_snp, n_indiv, 2)
dset.lanc
[6]:
            Array              Chunk
Bytes       292.91 MiB         19.53 MiB
Shape       (15357, 10000, 2)  (1024, 10000, 2)
Dask graph  15 chunks in 31 graph layers
Data type   int8 numpy.ndarray
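The sizes and chunk counts reported for dset.geno and dset.lanc follow directly from the array shape, the chunk shape, and the item size of each data type. The short calculation below reproduces them as a sanity check. It uses only the numbers printed above; the helper `nbytes` is a hypothetical name introduced here.

```python
import math

shape = (15357, 10000, 2)  # (n_snp, n_indiv, n_ploidy), from the output above
chunk = (1024, 10000, 2)   # chunk shape, from the output above


def nbytes(dims, itemsize):
    """Bytes needed for a dense array with the given dimensions and item size."""
    return math.prod(dims) * itemsize


geno_gib = nbytes(shape, 4) / 2**30        # float32 uses 4 bytes per entry
geno_chunk_mib = nbytes(chunk, 4) / 2**20
lanc_mib = nbytes(shape, 1) / 2**20        # int8 uses 1 byte per entry
lanc_chunk_mib = nbytes(chunk, 1) / 2**20

# Chunks split only the SNP axis, so the count is a ceiling division
n_chunks = math.ceil(shape[0] / chunk[0])

print(f"geno: {geno_gib:.2f} GiB total, {geno_chunk_mib:.2f} MiB per chunk")
print(f"lanc: {lanc_mib:.2f} MiB total, {lanc_chunk_mib:.2f} MiB per chunk")
print(n_chunks)  # 15
```

Because each chunk spans all individuals but only 1024 SNPs, operations that work SNP by SNP can be evaluated one chunk at a time without loading the full array.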
[7]:
# subset the first 50 SNPs
dset[0:50, :]
[7]:
admix.Dataset object with n_snp x n_indiv = 50 x 10000, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[8]:
# subset the first 50 individuals
dset[:, 0:50]
[8]:
admix.Dataset object with n_snp x n_indiv = 15357 x 50, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[9]:
# subset the first 50 SNPs and first 50 individuals
dset[0:50, 0:50]
[9]:
admix.Dataset object with n_snp x n_indiv = 50 x 50, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
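The subsetting calls above use the convention dset[snp_selection, indiv_selection]. For this to be safe, the genotype array, the local ancestry array, and the SNP and individual tables all have to be sliced together. The sketch below illustrates that idea with a hypothetical `ToyDataset` class; it is an assumption-laden stand-in and not how admix.Dataset is implemented.

```python
import numpy as np


class ToyDataset:
    """Hypothetical stand-in for illustration only; not the admix.Dataset code."""

    def __init__(self, geno, lanc, snp_ids, indiv_ids):
        self.geno = geno
        self.lanc = lanc
        self.snp_ids = list(snp_ids)
        self.indiv_ids = list(indiv_ids)

    def __getitem__(self, key):
        # Expects a pair of slices: toy[snp_slice, indiv_slice]
        snp_sel, indiv_sel = key
        return ToyDataset(
            self.geno[snp_sel, indiv_sel],  # ploidy axis is kept whole
            self.lanc[snp_sel, indiv_sel],
            self.snp_ids[snp_sel],
            self.indiv_ids[indiv_sel],
        )


n_snp, n_indiv = 100, 80
toy = ToyDataset(
    geno=np.zeros((n_snp, n_indiv, 2), dtype=np.float32),
    lanc=np.zeros((n_snp, n_indiv, 2), dtype=np.int8),
    snp_ids=[f"snp_{i}" for i in range(n_snp)],
    indiv_ids=[f"Sample_{i + 1}" for i in range(n_indiv)],
)

sub = toy[0:50, 0:50]
print(sub.geno.shape, len(sub.snp_ids), len(sub.indiv_ids))  # (50, 50, 2) 50 50
```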
[10]:
# calculate allele counts per ancestry background
dset.allele_per_anc()
[10]:
            Array              Chunk
Bytes       2.29 GiB           156.25 MiB
Shape       (15357, 10000, 2)  (1024, 10000, 2)
Dask graph  15 chunks in 63 graph layers
Data type   float64 numpy.ndarray
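Conceptually, an allele count per ancestry background can be obtained by masking the genotype with the local ancestry and summing over the two haplotypes. The NumPy sketch below shows one way to do this on a tiny example. It assumes the last axis of the result indexes ancestry (consistent with n_anc=2 above); the function name `allele_per_anc_sketch` is hypothetical and this is not the library's own code.

```python
import numpy as np


def allele_per_anc_sketch(geno, lanc, n_anc):
    """Count alleles carried on each ancestry background.

    geno, lanc: arrays of shape (n_snp, n_indiv, n_ploidy)
    returns:    array of shape (n_snp, n_indiv, n_anc)
    """
    out = np.zeros(geno.shape[:2] + (n_anc,), dtype=np.float64)
    for k in range(n_anc):
        # Keep only alleles on haplotypes with ancestry k, then sum haplotypes
        out[:, :, k] = (geno * (lanc == k)).sum(axis=2)
    return out


# One SNP, two individuals, two haplotypes each
geno = np.array([[[1, 1], [0, 1]]], dtype=np.float32)
lanc = np.array([[[0, 1], [1, 1]]], dtype=np.int8)

apa = allele_per_anc_sketch(geno, lanc, n_anc=2)
print(apa[0].tolist())  # [[1.0, 1.0], [0.0, 1.0]]
```

The first individual carries one allele on each ancestry background; the second carries a single allele, on a haplotype of the second ancestry.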
[11]:
# calculate allele frequencies per ancestry background
dset.af_per_anc()
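An ancestry-specific allele frequency can be thought of as the number of alleles observed on haplotypes of a given ancestry, divided by the number of haplotypes with that ancestry at the SNP. The sketch below works this out on toy arrays. The function name `af_per_anc_sketch`, the (n_snp, n_anc) output shape, and the use of NaN when an ancestry is absent are all assumptions made for illustration, not a description of what admix returns.

```python
import numpy as np


def af_per_anc_sketch(geno, lanc, n_anc):
    """Allele frequency per SNP within each ancestry background.

    returns an array of shape (n_snp, n_anc); entries are NaN where no
    haplotype carries ancestry k at that SNP.
    """
    af = np.full((geno.shape[0], n_anc), np.nan)
    for k in range(n_anc):
        mask = lanc == k
        n_hap = mask.sum(axis=(1, 2))           # haplotypes with ancestry k
        n_alt = (geno * mask).sum(axis=(1, 2))  # alleles on those haplotypes
        has = n_hap > 0
        af[has, k] = n_alt[has] / n_hap[has]
    return af


# Two SNPs, two individuals, two haplotypes each
geno = np.array([[[1, 1], [0, 1]], [[0, 0], [1, 0]]], dtype=np.float32)
lanc = np.array([[[0, 1], [1, 1]], [[0, 0], [0, 0]]], dtype=np.int8)

af = af_per_anc_sketch(geno, lanc, n_anc=2)
print(af)
```

At the first SNP, one haplotype has ancestry 0 and carries the allele (frequency 1), while two of the three ancestry-1 haplotypes carry it (frequency 2/3). At the second SNP every haplotype has ancestry 0, so the ancestry-1 frequency is undefined.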