See ../README for high-level documentation of the entire EIGENSOFT package.
This file contains documentation of the programs convertf and mergeit.
convertf converts between the 5 different file formats we support.
mergeit merges two data sets into a third, which has the union of
the individuals and the intersection of the SNPs in the first two.
Here "file format" simultaneously refers to the formats of three distinct files:
genotype file: contains genotype data for each individual at each SNP
snp file: contains information about each SNP
indiv file: contains information about each individual
Below, we document all 5 formats:
ANCESTRYMAP
EIGENSTRAT
PED
PACKEDPED
PACKEDANCESTRYMAP
and we explain how to use convertf to get from one format to another.
Maximum file size on 32-bit machines:
EIGENSOFT will recognize a machine as 32-bit if sizeof(long) = 4 bytes
(as opposed to 8 bytes for 64-bit machines). For 32-bit machines,
EIGENSOFT does not allow more than 8 billion genotypes, and will produce
an error message if used to produce an output file larger than 2GB.
If running convertf on 32-bit machines on data sets with 2 billion to 8 billion
genotypes, then PACKEDPED or PACKEDANCESTRYMAP output format should be used.
Maximum file size on 64-bit machines:
No explicit limits, but extremely large files may cause problems -- ask your
systems administrator.
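The limits above follow from simple arithmetic. The sketch below is only a
rough estimate, not what EIGENSOFT computes: the function names are
hypothetical, file headers are ignored, and it assumes that each SNP starts
on a byte boundary in the packed formats (2 bits per genotype) and that
EIGENSTRAT uses 1 character per genotype plus a newline per SNP.

```python
# Rough estimates only: headers and any per-record padding are ignored.

def num_genotypes(n_snps, n_indiv):
    return n_snps * n_indiv

def packed_bytes(n_snps, n_indiv):
    # 2 bits per genotype; assumes each SNP starts on a byte boundary.
    return n_snps * ((n_indiv + 3) // 4)

def eigenstrat_bytes(n_snps, n_indiv):
    # 1 character per individual plus a newline on each SNP line.
    return n_snps * (n_indiv + 1)
```

For instance, 1,000,000 SNPs typed in 5,000 individuals give 5 billion
genotypes: about 1.25 billion bytes packed, but about 5 billion bytes in
EIGENSTRAT format, which is why a packed output format is needed in that range.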
------------------------------------------------------------------------------
LIST OF FORMATS
ANCESTRYMAP format:
genotype file: see example.ancestrymapgeno in this directory
snp file: see example.snp
indiv file: see example.ind
Note that
The genotype file contains 1 line per valid genotype. There are 3 columns:
1st column is SNP name
2nd column is sample ID
3rd column is number of reference alleles (0 or 1 or 2)
Missing genotypes are encoded by the absence of an entry in the genotype file.
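As an illustration of the genotype file layout just described, here is a
minimal reader sketch. It is not part of EIGENSOFT; the function names are
hypothetical and it assumes whitespace-separated columns.

```python
def read_ancestrymap_geno(lines):
    """Map (snp_name, sample_id) -> number of reference alleles (0, 1 or 2)."""
    genotypes = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        snp, sample, count = fields[0], fields[1], int(fields[2])
        genotypes[(snp, sample)] = count
    return genotypes

def lookup(genotypes, snp, sample):
    # A missing genotype has no line in the file, so None is returned.
    return genotypes.get((snp, sample))
```

Because missing genotypes are simply absent, a lookup that finds no entry is
the only way to tell that a genotype is missing.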
The snp file contains 1 line per SNP. There are 6 columns (last 2 optional):
1st column is SNP name
2nd column is chromosome. X chromosome is encoded as 23.
Also, Y is encoded as 24, mtDNA is encoded as 90, and XY is encoded as 91.
Note: SNPs with illegal chromosome values, such as 0, will be removed
3rd column is genetic position (in Morgans). If unknown, ok to set to 0.0.
4th column is physical position (in bases)
Optional 5th and 6th columns are reference and variant alleles.
For monomorphic SNPs, the variant allele can be encoded as X (unknown).
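The snp file columns above could be read along these lines. This is a sketch
under stated assumptions, not EIGENSOFT code: the names Snp, SPECIAL_CHROMS
and parse_snp_line are hypothetical, columns are assumed to be
whitespace-separated, and the chromosome is assumed to be numeric.

```python
from typing import NamedTuple, Optional

# Chromosome codes listed above for the non-autosomal chromosomes.
SPECIAL_CHROMS = {23: "X", 24: "Y", 90: "mtDNA", 91: "XY"}

class Snp(NamedTuple):
    name: str
    chrom: int
    genetic_pos: float   # Morgans
    physical_pos: int    # bases
    ref: Optional[str] = None
    var: Optional[str] = None

def parse_snp_line(line):
    f = line.split()
    if len(f) not in (4, 6):
        raise ValueError("expected 4 or 6 columns, got %d" % len(f))
    # The reference and variant alleles are optional.
    ref, var = (f[4], f[5]) if len(f) == 6 else (None, None)
    return Snp(f[0], int(f[1]), float(f[2]), int(f[3]), ref, var)
```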
The indiv file contains 1 line per individual. There are 3 columns:
1st column is sample ID. Length is limited to 39 characters, including
the family name if that will be concatenated.
2nd column is gender (M or F). If unknown, ok to set to U for Unknown.
3rd column is a label which might refer to Case or Control status, or
might be a population group label. If this entry is set to "Ignore",
then that individual and all genotype data from that individual will be
removed from the data set in all convertf output.
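A minimal sketch of reading the indiv file and dropping individuals labeled
"Ignore" follows. The function name is hypothetical, and matching the label
exactly (case-sensitively) is an assumption rather than documented behavior.

```python
def read_indiv(lines):
    """Return (sample_id, gender, label) tuples, dropping 'Ignore' rows."""
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id, gender, label = fields[:3]
        if len(sample_id) > 39:
            raise ValueError("sample ID longer than 39 characters: " + sample_id)
        if gender not in ("M", "F", "U"):
            raise ValueError("gender must be M, F or U: " + gender)
        if label == "Ignore":
            continue  # this individual is removed from all output
        kept.append((sample_id, gender, label))
    return kept
```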
The name "ANCESTRYMAP format" is used for historical reasons only. This
software is completely independent of our 2004 ANCESTRYMAP software.
EIGENSTRAT format: used by eigenstrat program
genotype file: see example.eigenstratgeno
snp file: see example.snp (same as above)
indiv file: see example.ind (same as above)
Note that
The genotype file contains 1 line per SNP.
Each line contains 1 character per individual:
0 means zero copies of reference allele.
1 means one copy of reference allele.
2 means two copies of reference allele.
9 means missing data.
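One line of an EIGENSTRAT genotype file can be decoded as sketched below.
The function name is hypothetical, and representing missing data as None is
a choice made for this illustration.

```python
def parse_eigenstrat_line(line):
    """Return one entry per individual: 0, 1, 2, or None for missing (9)."""
    counts = []
    for ch in line.strip():
        if ch not in "0129":
            raise ValueError("unexpected genotype character: %r" % ch)
        counts.append(None if ch == "9" else int(ch))
    return counts
```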
The program ind2pheno.perl in this directory will convert from
example.ind to the example.pheno file needed by the EIGENSTRAT software.
The syntax is "./ind2pheno.perl example.ind example.pheno".
PED format:
genotype file: see example.ped *** file name MUST end in .ped ***
snp file: see example.pedsnp *** file name MUST end in .pedsnp ***
convertf also supports .map suffix for this input file name
indiv file: see example.pedind *** file name MUST end in .pedind ***
convertf also supports the full .ped file (example.ped)
for this input file
Note that
Mandatory suffix names enable our software to recognize this file format.
The indiv file contains the first 6 or 7 columns of the genotype file.
The genotype file is 1 line per individual. Each line contains 6 or 7 columns
of information about the individual, plus two genotype columns for
each SNP in the order the SNPs are specified in the snp file.
Genotype format MUST be either 0ACGT or 01234, where 0 means missing data.
The first 6 or 7 columns of the genotype file are:
1st column is family ID.
2nd column is sample ID.
3rd and 4th columns are sample IDs of parents.
5th column is gender (male is 1, female is 2)
6th column is case/control status (1 is control, 2 is case) OR
quantitative trait value OR population group label.
7th column (this column is optional) is always set to 1.
[Note: this release *changed* to output .ped files in 6-column format,
not in 7-column format. Also see sevencolumnped parameter below.]
convertf does not support pedigree information, so 1st, 3rd, 4th columns are
ignored in convertf input and set to arbitrary values in convertf output.
In the two genotype columns for each SNP, missing data is represented by 0.
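The layout of a .ped line can be illustrated with the following sketch. The
function names are hypothetical, whitespace-separated columns are assumed,
and n_info would be 7 for the 7-column variant.

```python
def parse_ped_line(line, n_info=6):
    """Split a .ped line into its info columns and per-SNP allele pairs."""
    fields = line.split()
    info, alleles = fields[:n_info], fields[n_info:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two genotype columns per SNP")
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return info, pairs

def ref_count(pair, ref):
    """Number of copies of allele `ref`, or None if the genotype is missing."""
    if "0" in pair:
        return None  # 0 means missing data
    return sum(1 for allele in pair if allele == ref)
```

Counting copies of a chosen reference allele gives the 0/1/2 coding used by
the other formats; which allele is the reference is taken from the snp file.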
The snp file contains 1 line per SNP. There are 6 columns (last 2 optional):
1st column is chromosome. Use X for X chromosome.
Note: SNPs with illegal chromosome values, such as 0, will be removed
2nd column is SNP name
3rd column is genetic position (in Morgans)
4th column is physical position (in bases)
Optional 5th and 6th columns are reference and variant alleles.
For monomorphic SNPs, the variant allele can be encoded as X.
The indiv file contains the first 6 or 7 columns of the genotype file.
The PED format is used by the PLINK package of Shaun Purcell.
See http://pngu.mgh.harvard.edu/~purcell/plink/.
PACKEDPED format:
genotype file: see example.bed *** file name MUST end in .bed ***
snp file: see example.pedsnp *** file name MUST end in .pedsnp ***
convertf also supports .map or .bim suffix for this input file
indiv file: see example.pedind *** file name MUST end in .pedind ***
convertf also supports a .ped file (example.ped)
for this input file
Note that
Mandatory suffix names enable our software to recognize this file format.
example.bed is a packed binary file (2 bits per genotype).
The PACKEDPED format is used by the PLINK package of Shaun Purcell.
See http://pngu.mgh.harvard.edu/~purcell/plink/.
For input in PACKEDPED format, snp file MUST be in genomewide order.
For input in PACKEDPED format, genotype file MUST be in SNP-major order
(the PLINK default: see PLINK documentation for details.)
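For orientation, a SNP-major .bed file can be unpacked as sketched below.
The magic bytes, the 2-bit codes and the bit order come from the PLINK
documentation rather than from this README, so verify them there; the
function and constant names are hypothetical.

```python
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # third byte 0x01 means SNP-major

# 2-bit genotype codes as described in the PLINK documentation.
BED_CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}

def decode_bed(data, n_indiv):
    """Yield one list of n_indiv genotype labels per SNP."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK .bed file")
    bytes_per_snp = (n_indiv + 3) // 4  # each SNP starts on a new byte
    body = data[3:]
    for start in range(0, len(body), bytes_per_snp):
        block = body[start:start + bytes_per_snp]
        # The first individual sits in the two low-order bits of each byte.
        yield [BED_CODES[(block[i // 4] >> (2 * (i % 4))) & 0b11]
               for i in range(n_indiv)]
```

The number of individuals is not stored in the .bed file itself, which is
one reason the indiv file must accompany it.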
PACKEDANCESTRYMAP format:
genotype file: see example.packedancestrymapgeno
snp file: see example.snp (same as above)
indiv file: see example.ind (same as above)
Note that
example.packedancestrymapgeno is a packed binary file (2 bits per genotype).
----------------------------------------------------------------------------
DOCUMENTATION of convertf program:
The syntax of convertf is "../bin/convertf -p parfile". We illustrate how
parfile works via a toy example: (see example.perl in this directory)
par.ANCESTRYMAP.EIGENSTRAT converts ANCESTRYMAP to EIGENSTRAT format
par.EIGENSTRAT.PED converts EIGENSTRAT to PED format
par.PED.EIGENSTRAT converts PED to EIGENSTRAT format
par.PED.PACKEDPED converts PED to PACKEDPED format
par.PACKEDPED.PACKEDANCESTRYMAP converts PACKEDPED to PACKEDANCESTRYMAP
par.PACKEDANCESTRYMAP.ANCESTRYMAP converts PACKEDANCESTRYMAP to ANCESTRYMAP
Note that the choice of which allele is the reference allele may be arbitrary,
and thus converting to a new format and back again may change the choice of
reference allele.
DESCRIPTION OF EACH PARAMETER in parfile for convertf program:
genotypename: input genotype file
snpname: input snp file
indivname: input indiv file
outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
(Default is PACKEDANCESTRYMAP.)
genooutfilename: output genotype file
snpoutfilename: output snp file
indoutfilename: output indiv file
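A parfile with the parameters above can be generated from a script, as in
the sketch below. It assumes the "name: value" line syntax used elsewhere in
this file. The input names are the example files in this directory; the
output names, the parfile name and the helper functions are hypothetical.

```python
def make_parfile(params):
    """Render parameters as one 'name: value' line each."""
    return "".join("%s: %s\n" % (name, value) for name, value in params.items())

def write_parfile(path, params):
    with open(path, "w") as handle:
        handle.write(make_parfile(params))

params = {
    "genotypename": "example.ancestrymapgeno",
    "snpname": "example.snp",
    "indivname": "example.ind",
    "outputformat": "EIGENSTRAT",
    "genooutfilename": "out.eigenstratgeno",  # hypothetical output names
    "snpoutfilename": "out.snp",
    "indoutfilename": "out.ind",
}
# write_parfile("par.example", params), then: ../bin/convertf -p par.example
```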
OPTIONAL PARAMETERS:
familynames: only relevant if input format is PED or PACKEDPED.
If set to YES, then family ID will be concatenated to sample ID.
This supports different individuals with different family ID but
same sample ID. The default for this parameter is YES.
noxdata: if set to YES, all SNPs on X chr are removed from the data set.
The default for this parameter is NO.
nomalexhet: if set to YES, any het genotypes on X chr for males are changed
to missing data. The default for this parameter is NO.
badsnpname: specifies a list of SNPs which should be removed from the data set.
Same format as example.snp.
newsnpname: additional SNP file with reordered SNPs. For runs in which the
SNPs should be in a different order in the output.
newindivname: additional individual file with reordered samples. For runs
in which the individuals should be in a different order in the output.
outputgroup: Only relevant if outputformat is PED or PACKEDPED.
This parameter specifies what the 6th column of information about each
individual should be in the output. If outputgroup is set to NO (the default),
the 6th column will be set to 1 for each Control and 2 for each Case, as
specified in the input indiv file.
[Individuals specified with some other label, such as a population group
label, will be assumed to be controls and the 6th column will be set to 1.]
If outputgroup is set to YES, the 6th column will be set to
the exact label specified in the input indiv file.
[This functionality preserves population group labels.]
chrom: Only output SNPs on this chromosome.
lopos: Only output SNPs with physical position >= this value.
hipos: Only output SNPs with physical position <= this value.
sevencolumnped: Only relevant if outputformat is PED or PACKEDPED.
If set to YES, then 7-column .ped format will be used,
instead of 6-column .ped format which is now the default.
checksizemode: If set to YES (the default), check that output file size will
be less than 2GB. If set to NO, do not perform this check.
maxmissfracsnp: Remove any SNP with a fraction of missing data greater than
this. Default is 1.0.
maxmissfracind: Remove any individual with a fraction of missing data greater
than this. Default is 1.0.
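The effect of maxmissfracsnp can be pictured on EIGENSTRAT-style rows (one
string per SNP, 9 meaning missing). This is an illustrative sketch with
hypothetical function names, not convertf's implementation; maxmissfracind
would apply the same test to columns instead of rows.

```python
def missing_fraction(genotypes):
    """Fraction of entries equal to '9' in an EIGENSTRAT-style string."""
    return genotypes.count("9") / len(genotypes)

def filter_snps(rows, maxmissfracsnp=1.0):
    # A SNP is removed only if its missing fraction is greater than the cutoff.
    return [row for row in rows if missing_fraction(row) <= maxmissfracsnp]
```

With the default of 1.0 nothing is removed, since a fraction can never
exceed 1.0.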
hashcheck: If set to YES and the input genotype file is in PACKEDANCESTRYMAP
format, check the hash stored inside the file to make sure that individual
and SNP files have not changed since the file was made. If they have, then
exit in error. The default value for this parameter is YES. Note: Caution
should be exercised in turning off hashcheck, as misapplication,
e.g., reordering a SNP file, may silently produce bad data.
It is recommended that if a dataset fails the hash check (for instance
because input sample names have been changed), convertf be run with
hashcheck: NO and the output used for further processing. The output file
will pass hashcheck.
phasedmode: YES for phased input (default NO)
Note: mixed phased and unphased data is not supported.
Note: optional command line parameter -f also turns on phased mode
xregionname: Name of file which describes regions of the genome to be
excluded from the computation. Each line of the file should be in the
format <chromosome #> <begin-physical-position> <end-physical-position>.
The excluded region is the closed interval defined by the physical positions.
(We recommend excluding the long-range LD regions listed in Table 1 of
Price et al. 2008 Am J Hum Genet.)
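The closed-interval rule for xregionname can be sketched as follows. The
function names are hypothetical, and numeric chromosome values and
whitespace-separated columns are assumed.

```python
def parse_xregions(lines):
    """Parse '<chromosome> <begin> <end>' lines into integer triples."""
    regions = []
    for line in lines:
        fields = line.split()
        if fields:
            chrom, begin, end = (int(x) for x in fields[:3])
            regions.append((chrom, begin, end))
    return regions

def is_excluded(chrom, pos, regions):
    # Closed interval: both end points are themselves excluded.
    return any(c == chrom and begin <= pos <= end for c, begin, end in regions)
```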
hwfilt: Filter parameter for Hardy-Weinberg filter. The (real-valued)
number of standard deviations beyond which the filter is applied.
(If not specified, then no Hardy-Weinberg filter is applied.)
Caution: hwfilt should not be used for admixed populations.
numchrom: The number of autosomes in the data set. The X-chromosome is
assumed to be numchrom+1 and the Y-chromosome is numchrom+2.
The default value for numchrom is 22.
deletesnpoutname: optional output file in which all deleted SNPs are listed
along with the reasons for their deletion. This file can be used
as a badsnp file in subsequent runs.
polarize: sample_id
It is expected that sample_id will be a (pseudo)-homozygous reference
sequence such as panTro2 (a chimpanzee reference). Variant and reference
alleles are "flipped" if necessary so that sample_id has a count of 2.
If sample_id is heterozygous or missing at a SNP, that SNP will be set to ignore.
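The polarize rule can be sketched on a single EIGENSTRAT-style row. This is
an illustration, not convertf's code: the function name is hypothetical, and
a real implementation would also swap the reference and variant alleles in
the snp file whenever the counts are flipped.

```python
def polarize_snp(genotypes, ref_index):
    """Flip one row so that the sample at ref_index has a count of 2.

    Returns the (possibly flipped) row, or None if the SNP should be ignored.
    """
    flip = {"0": "2", "1": "1", "2": "0", "9": "9"}
    ref = genotypes[ref_index]
    if ref == "2":
        return genotypes  # already polarized
    if ref == "0":
        return "".join(flip[g] for g in genotypes)
    return None  # heterozygous or missing in the polarizing sample
```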
pordercheck: NO (default YES)
If in packed format the SNPs are not ordered by (chromosome number, genetic
position, physical position), the default is to fail hard. If pordercheck: NO,
convertf will try to fix. It is strongly recommended that packed files are
NOT made out of order. For instance, make PACKEDANCESTRYMAP files using
convertf and this issue will not arise.
flipstrandname: fname
fname should consist of a list of SNP IDs (1 per line). The alleles for
these SNPs will be complemented (moved to the other strand). This can be
useful as preparation for mergeit.
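Complementing alleles means swapping A with T and C with G. A minimal
sketch follows; the names are hypothetical, and leaving codes such as X
unchanged is an assumption.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flip_strand(ref, var):
    """Complement both alleles; unknown codes such as X are left unchanged."""
    return COMPLEMENT.get(ref, ref), COMPLEMENT.get(var, var)
```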
zerodistance: YES (default NO)
If YES, genetic distance will be forced to zero.
malexhet: if set to YES, any het genotypes on X chr for males are changed
to missing data. The default for this parameter is NO.
(NEW)
poplistname: yourfile
yourfile should consist of a list of populations (1/line) where populations
are the labels in the last column of the input .ind file (eigenstrat or
packedped format). Only the samples with listed labels will be output.
----------------------------------------------------------------------------
DOCUMENTATION of mergeit program:
The mergeit program merges two data sets into a third, which has the union of
the individuals and the intersection of the SNPs in the first two. If
SNP positions differ between the two data sets, then SNP positions from
the first data set will be produced in the merged data.
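The shape of the merged data set can be sketched with plain lists. The
function name is hypothetical, the output ordering shown here is an
assumption, and sample IDs are assumed not to be duplicated across the two
inputs.

```python
def merged_layout(snps1, inds1, snps2, inds2):
    """Return (snps, individuals) of the merged data set.

    SNPs: those present in both inputs, in the order of the first input.
    Individuals: all of the first input followed by all of the second.
    """
    in_second = set(snps2)
    snps = [s for s in snps1 if s in in_second]
    individuals = list(inds1) + list(inds2)
    return snps, individuals
```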
mergeit accounts for the possibility that the choice of reference and variant
alleles may differ between the two data sets (e.g. A/C vs. C/A), and also
accounts for the possibility that the strand may differ between the two
data sets (e.g. A/C vs. T/G), and genotype values are flipped (0 to 2, 2 to 0)
in one of the two data sets if appropriate. See documentation of docheck
and strandcheck parameters below.
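The allele and strand reconciliation described above might look roughly
like the sketch below. This is not mergeit's actual implementation: the
function name is hypothetical, flipping the second data set (rather than the
first) is an assumption, and so is dropping SNPs whose alleles cannot be
matched.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reconcile(alleles1, alleles2, strandcheck=True):
    """Return (keep, flip) for a SNP given (ref, var) from two data sets.

    keep: whether the SNP is retained.
    flip: whether counts in the second data set change from g to 2 - g.
    """
    ref1, var1 = alleles1
    ref2, var2 = alleles2
    if strandcheck:
        if COMPLEMENT.get(ref1) == var1:
            return False, False  # A/T or C/G: strand cannot be determined
        if {ref2, var2} != {ref1, var1}:
            # Try reading the second data set on the other strand.
            ref2, var2 = COMPLEMENT.get(ref2, ref2), COMPLEMENT.get(var2, var2)
    if (ref2, var2) == (ref1, var1):
        return True, False
    if (ref2, var2) == (var1, ref1):
        return True, True   # same alleles, opposite choice of reference
    return False, False     # alleles cannot be matched (assumed: drop)
```

For example, A/C against C/A keeps the SNP with flipped counts, while A/C
against T/G keeps it unflipped once the strand has been complemented.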
The syntax of mergeit is "../bin/mergeit -p parfile".
DESCRIPTION OF EACH PARAMETER in parfile for mergeit program:
geno1: first input genotype file
snp1: first input snp file
ind1: first input indiv file
geno2: second input genotype file
snp2: second input snp file
ind2: second input indiv file
genooutfilename: output genotype file
snpoutfilename: output snp file
indoutfilename: output indiv file
OPTIONAL PARAMETERS:
outputformat: output file format (default is PACKEDANCESTRYMAP)
docheck: If set to YES, then check that reference and variant alleles
are the same in both data sets -- if they are different
(e.g. A/C vs. C/A), then flip genotype data appropriately.
The default for this parameter is YES.
strandcheck: If set to YES, then check that the allele strand is the same
in both data sets -- if they are different (e.g. A/C vs. T/G), then flip
genotype data appropriately. (Note that if strandcheck is set to YES, then
all A/T and C/G SNPs will be removed because it is impossible to know whether
the allele strand is the same in both data sets. On the other hand, if
strandcheck is set to NO, then A/T and C/G SNPs will be retained since it is
assumed that both data sets are on the same strand.)
The default for this parameter is YES.
hashcheck: If set to YES and the input genotype file is in PACKEDANCESTRYMAP
format, check the hash stored inside the file to make sure that individual
and SNP files have not changed since the file was made. If they have, then
exit in error. The default value for this parameter is YES.
(NEW)
malexhet: if set to YES, any het genotypes on X chr for males are changed
to missing data. The default for this parameter is YES.
(Dangerous bend): The corresponding parameter for convertf has default NO.
(NEW)
allowdups: YES
Default is NO. By default, if duplicate individuals are present, mergeit will fail hard.
If allowdups: YES is set, duplicate individuals in the second data set are ignored. No
attempt is made to combine genotypes for the same sample_id in the 2 data sets.
------------------------------------------------------------------------------
Questions?
See http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm
or email Samuela Pollack, spollack@hsph.harvard.edu
SOFTWARE COPYRIGHT NOTICE AGREEMENT
This software and its documentation are copyright (2010) by Harvard University
and The Broad Institute. All rights are reserved. This software is supplied
without any warranty or guaranteed support whatsoever. Neither Harvard
University nor The Broad Institute can be responsible for its use, misuse, or
functionality. The software may be freely copied for non-commercial purposes,
provided this copyright notice is retained.