admix.Dataset
We introduce the central data structures used in this package.
- dset.geno: genotype (n_snp, n_indiv, n_ploidy)
- dset.lanc: local ancestry (n_snp, n_indiv, n_ploidy)
- dset.snp: information about SNPs (n_snp, n_snp_feature)
- dset.indiv: information about individuals (n_indiv, n_indiv_feature)
Central to the Python API is the admix.Dataset class, which supports various convenient operations for manipulating large on-disk data sets.
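To make the shapes listed above concrete, here is a minimal sketch that builds toy arrays and tables with the same layout using only NumPy and pandas. It does not use admix.Dataset itself; the sizes, the random values, and the names snp_0 and Sample_1 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy dimensions (assumed for illustration; real data sets are much larger).
n_snp, n_indiv, n_ploidy, n_anc = 5, 4, 2, 2
rng = np.random.default_rng(0)

# Phased genotype: 0/1 allele for each SNP, individual, and haplotype.
geno = rng.integers(0, 2, size=(n_snp, n_indiv, n_ploidy))
# Local ancestry: ancestry label in {0, ..., n_anc - 1} for each haplotype.
lanc = rng.integers(0, n_anc, size=(n_snp, n_indiv, n_ploidy))

# Per-SNP table with one row per SNP (n_snp, n_snp_feature).
snp = pd.DataFrame(
    {"CHROM": np.full(n_snp, 22), "POS": np.arange(1, n_snp + 1)},
    index=[f"snp_{i}" for i in range(n_snp)],
)
# Per-individual table with one row per individual and no features yet.
indiv = pd.DataFrame(index=[f"Sample_{i + 1}" for i in range(n_indiv)])

print(geno.shape, lanc.shape, snp.shape, indiv.shape)
```

The first axis of geno and lanc lines up with the rows of snp, and the second axis lines up with the rows of indiv.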
[1]:
import admix
# load example data
dset = admix.io.read_dataset("example_data/CEU-YRI")
[2]:
# overview of data set
dset
[2]:
admix.Dataset object with n_snp x n_indiv = 15357 x 10000, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[3]:
# SNP attributes, CHROM (chromosomes), POS (positions), REF (reference allele), ALT (alternative allele), etc.
# we have also precomputed FREQ1, FREQ2 as ancestry-specific allele frequencies
dset.snp
[3]:
snp | CHROM | POS | REF | ALT | QUAL | FILTER | INFO |
---|---|---|---|---|---|---|---|
22:16406147:A:G | 22 | 16406147 | A | G | . | . | . |
22:16551808:T:C | 22 | 16551808 | T | C | . | . | . |
22:16573830:T:C | 22 | 16573830 | T | C | . | . | . |
22:16575525:T:C | 22 | 16575525 | T | C | . | . | . |
22:16576248:G:T | 22 | 16576248 | G | T | . | . | . |
... | ... | ... | ... | ... | ... | ... | ... |
22:50739662:G:A | 22 | 50739662 | G | A | . | . | . |
22:50743331:A:G | 22 | 50743331 | A | G | . | . | . |
22:50772964:T:C | 22 | 50772964 | T | C | . | . | . |
22:50774447:A:C | 22 | 50774447 | A | C | . | . | . |
22:50780578:G:A | 22 | 50780578 | G | A | . | . | . |
15357 rows × 7 columns
[4]:
# individual attributes
dset.indiv
[4]:
indiv |
---|
Sample_1 |
Sample_2 |
Sample_3 |
Sample_4 |
Sample_5 |
... |
Sample_9996 |
Sample_9997 |
Sample_9998 |
Sample_9999 |
Sample_10000 |
10000 rows × 0 columns
[5]:
# phased genotype (n_snp, n_indiv, 2)
dset.geno
[6]:
# local ancestry (n_snp, n_indiv, 2)
dset.lanc
[7]:
# subset the first 50 SNPs
dset[0:50, :]
[7]:
admix.Dataset object with n_snp x n_indiv = 50 x 10000, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[8]:
# subset the first 50 individuals
dset[:, 0:50]
[8]:
admix.Dataset object with n_snp x n_indiv = 15357 x 50, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[9]:
# subset the first 50 SNPs and first 50 individuals
dset[0:50, 0:50]
[9]:
admix.Dataset object with n_snp x n_indiv = 50 x 50, n_anc=2
snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
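The subsetting cells above index SNPs on the first axis and individuals on the second. As a rough illustration of how such two-axis indexing can keep the arrays and tables aligned, here is a sketch with a hypothetical ToyDataset class. It is not the admix.Dataset implementation, and it only handles a pair of slices.

```python
import numpy as np
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in for illustration; not the admix.Dataset class."""

    def __init__(self, geno, lanc, snp, indiv):
        self.geno, self.lanc, self.snp, self.indiv = geno, lanc, snp, indiv

    def __getitem__(self, key):
        # key is a (snp_slice, indiv_slice) pair, as in toy[0:50, :].
        snp_idx, indiv_idx = key
        return ToyDataset(
            self.geno[snp_idx][:, indiv_idx],  # subset axis 0, then axis 1
            self.lanc[snp_idx][:, indiv_idx],
            self.snp.iloc[snp_idx],            # rows follow the SNP axis
            self.indiv.iloc[indiv_idx],        # rows follow the individual axis
        )


toy = ToyDataset(
    geno=np.zeros((100, 20, 2), dtype=int),
    lanc=np.zeros((100, 20, 2), dtype=int),
    snp=pd.DataFrame(index=range(100)),
    indiv=pd.DataFrame(index=range(20)),
)
sub = toy[0:50, 0:10]
print(sub.geno.shape, sub.snp.shape[0], sub.indiv.shape[0])
```

Applying the same index to every member is what keeps the SNP and individual tables consistent with the genotype and local ancestry arrays after subsetting.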
[10]:
# calculate allele counts per ancestry background
dset.allele_per_anc()
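As a sketch of what an allele count per ancestry background can look like, the NumPy example below counts, for each SNP and individual, the alleles carried on haplotypes of each ancestry. This is one plausible reading of the operation, written under the assumption that geno holds 0/1 alleles and lanc holds ancestry labels; it is not taken from the package source, and the toy arrays are made up.

```python
import numpy as np

# Toy data with shape (n_snp=2, n_indiv=2, n_ploidy=2); values are made up.
geno = np.array([[[1, 0], [1, 1]],
                 [[0, 0], [1, 0]]])
lanc = np.array([[[0, 1], [0, 0]],
                 [[1, 1], [1, 0]]])
n_anc = 2

# For each ancestry a, keep only alleles on haplotypes labeled a,
# then sum over the two haplotypes of each individual.
allele_per_anc = np.stack(
    [(geno * (lanc == a)).sum(axis=2) for a in range(n_anc)], axis=2
)

print(allele_per_anc.shape)  # (n_snp, n_indiv, n_anc)
print(allele_per_anc)
```

Summing the result over the last axis recovers the usual per-individual allele count, since every haplotype belongs to exactly one ancestry.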
[11]:
# calculate allele frequencies per ancestry background
dset.af_per_anc()
admix.data.af_per_anc: 100%|███████████████████████████████████████████████████████████████████████████████| 15/15 [00:11<00:00, 1.35it/s]
[11]:
array([[0.28433613, 0.01600149],
[0.44607595, 0.46056075],
[0.25962513, 0.01133815],
...,
[0.1228952 , 0.01448099],
[0.34757477, 0.55327383],
[0.17893943, 0.36445915]])
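The output above has one row per SNP and one column per ancestry. A minimal NumPy sketch of how ancestry-specific allele frequencies can be computed from phased genotypes and local ancestry follows. It assumes the frequency for ancestry a is the number of alleles on haplotypes labeled a divided by the number of such haplotypes; this is an illustrative definition on made-up toy data, not the package's code.

```python
import numpy as np

# Toy data with shape (n_snp=2, n_indiv=2, n_ploidy=2); values are made up.
geno = np.array([[[1, 0], [1, 1]],
                 [[0, 0], [1, 0]]])
lanc = np.array([[[0, 1], [0, 0]],
                 [[1, 1], [1, 0]]])
n_anc = 2

af = np.empty((geno.shape[0], n_anc))
for a in range(n_anc):
    mask = lanc == a  # haplotypes carrying ancestry a
    # alleles on ancestry-a haplotypes / number of ancestry-a haplotypes, per SNP
    af[:, a] = (geno * mask).sum(axis=(1, 2)) / mask.sum(axis=(1, 2))

print(af)
```

If no haplotype carries a given ancestry at a SNP, the denominator is zero and the result is undefined, so real data would need a guard for that case.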