2. Getting started with PyPop#

2.1. Introduction#

You may use PyPop to analyze many different kinds of data, including allele-level genotype data (as in Listing 2.1), allele-level frequency data (as in Listing 2.6), microsatellite data, SNP data, and nucleotide and amino acid sequence data.

As mentioned in the installation chapter, a minimal working example of a configuration file (.ini), and a population file (.pop), can be found by clicking the respective links.

There are three ways to run PyPop:

interactive mode (where the program will prompt you to directly type the input it needs);
command-line (or “batch”) mode (where you supply all the command line options the program needs); and
library (or “programmatic”) mode (where you write a Python program that uses the API documented in the API Reference).

2.2. Running a population analysis#

For the simplest application of PyPop, where you wish to analyze a single population, interactive mode is the easiest to use. We will describe this mode, then describe command-line mode, and finally library mode.

Note

The following assumes you have already installed PyPop, done any post-install adjustments needed for your platform, and verified that you can run the main commands (see the Examples section).

Interactive mode (`pypop-interactive`)#

To run PyPop in interactive mode, with a minimal “GUI”, on Windows or MacOS, you can directly click on the pypop-interactive file in the directory where the scripts were installed (see post-install adjustments).

You can also type pypop-interactive after starting a console application on any platform (on MacOS and GNU/Linux, this is normally the Terminal program; on Windows, it’s Command prompt).

In most cases, this will launch a console with the following:

PyPop: Python for Population Genomics (1.0.0)
[Python 3.10.9 | Linux.x86_64-x86_64 | x86_64]
Copyright (C) 2003-2006 Regents of the University of California
Copyright (C) 2007-2023 PyPop team.
This is free software.  There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You may redistribute copies of PyPop under the terms of the GNU
General Public License.  For more information about these
matters, see the file named COPYING.

Select both an '.ini' configuration file and a '.pop' file via the
system file dialog.

Following this:

the system file dialog will appear prompting you to select an .ini configuration file.
a second system file dialog will prompt you for a .pop data file.

after both files are selected the console will display the processing of the file:

PyPop is processing sample.pop ...
PyPop run complete!
XML output(s) can be found in: ['sample-out.xml']
Plain text output(s) can be found in: ['sample-out.txt']
Press Enter to continue...

when the run is completed, the last line will prompt you to press Enter to leave the console window (highlighted above).

If the system file GUI dialog does not appear (e.g. if you are running on a terminal without a display), PyPop will fall back to text-mode entry for the files, where you need to type the full (either relative or absolute) paths to the files. The output should resemble:

PyPop: Python for Population Genomics (1.0.0)
[Python 3.10.9 | Linux.x86_64-x86_64 | x86_64]
Copyright (C) 2003-2006 Regents of the University of California
Copyright (C) 2007-2023 PyPop team.
This is free software.  There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You may redistribute copies of PyPop under the terms of the GNU
General Public License.  For more information about these
matters, see the file named COPYING.

To accept the default in brackets for each filename, simply press
return for each prompt.

Please enter config filename [config.ini]: sample.ini
Please enter population filename [no default]: sample.pop
PyPop is processing sample.pop ...
PyPop run complete!
XML output(s) can be found in: ['sample-out.xml']
Plain text output(s) can be found in: ['sample-out.txt']
Press Enter to continue...

Note

Some messages with the prefix “LOG:” may appear during the console operation. They are informational only and do not indicate improper operation of the program.

In both cases you should substitute the names of your own configuration (e.g., config.ini) and population file (e.g., Guatemalan.pop) for sample.ini and sample.pop (highlighted above). The formats for these files are described in the sections on the data file and configuration file, below.

Command-line mode (`pypop`)#

To run PyPop in the more common command-line (or “batch”) mode, you can run PyPop from the console (as noted above, on Windows: open Command prompt, aka a “DOS shell”; on MacOS or GNU/Linux: open the Terminal application). Change to a directory where your .pop file is located, and type the command:

pypop Guatemalan.pop

Command-line mode assumes two things: that you have a file called config.ini in your current folder, and that your population file is also in the current folder; otherwise you will need to supply the full path to the file. You can specify a particular configuration file for PyPop to use by supplying the -c option as follows:

pypop -c newconfig.ini Guatemalan.pop

Output to different directory#

You may also redirect the output to a different directory (which must already exist) by using the -o option:

pypop -c newconfig.ini -o altdir Guatemalan.pop

Supplying multiple `.pop` files#

If you have multiple .pop files with the same overall format (i.e. containing the same loci as, or a subset of, those listed in the .ini file), you can process them in one pypop invocation using a single .ini file. You either supply them directly on the command-line:

pypop -c config.ini Guatemalan.pop NorthAmerican.pop

or you use the --filelist command-line option to pass in a file containing a list of files, i.e.:

pypop -c config.ini --filelist popfilelist.txt

where the text file popfilelist.txt contains a list of the .pop files to be processed on separate lines, e.g.:

Guatemalan.pop
NorthAmerican.pop

Changed in version 1.3.0: New behavior: all files within FILELIST will be resolved relative to the parent directory of FILELIST, not to the current working directory (the old behavior).

This ensures that files can be more straightforwardly located independently of where pypop is run from. For example, if your current working directory looked like the following:

data/popfilelist.txt
data/file1.pop
data/file2.pop

You would run pypop like:

pypop -c config.ini --filelist data/popfilelist.txt

and the contents of popfilelist.txt should be:

file1.pop
file2.pop

Full absolute paths will be processed as-is, and will not be treated as if they are relative to the filelist.
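
The resolution rule described above can be illustrated with a short Python sketch. This is not PyPop's own code: the helper name `resolve_filelist_entries` is hypothetical, and the sketch only mirrors the documented behavior (relative entries are anchored at the directory containing the filelist, absolute entries are left untouched).

```python
from pathlib import Path


def resolve_filelist_entries(filelist_path, entries):
    """Resolve each entry relative to the directory that contains the filelist."""
    base = Path(filelist_path).parent
    resolved = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        path = Path(entry)
        # Absolute paths are kept as-is; relative ones are anchored at the
        # filelist's parent directory, not the current working directory.
        resolved.append(path if path.is_absolute() else base / path)
    return resolved


paths = resolve_filelist_entries("data/popfilelist.txt", ["file1.pop", "file2.pop"])
print([p.as_posix() for p in paths])  # ['data/file1.pop', 'data/file2.pop']
```

Running pypop from a different working directory changes nothing in this scheme as long as the path given to --filelist still points at the same file.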

Please see pypop usage for the full list of command-line options.

Library (programmatic) mode#

It is also possible to use PyPop as a library (i.e. programmatically) by writing a Python program that uses the Application Programming Interface documented in the API Reference (API) directly to analyze a population file. While the initial use-case for PyPop was as a standalone command-line script, it is being upgraded for better use via a programmatic interface, and much functionality is exposed as Python modules and classes. Examples of programmatic use can be found in PyPop API examples.

What happens when you run PyPop?#

The most common types of analysis will involve editing your config.ini file to suit your data (see the configuration file), followed by running PyPop in either the interactive or command-line mode described above. If your input configuration file is configfilename and your population file name is popfilename.txt, the initial output will be generated quickly, but the PyPop execution will not be finished until the text output file named popfilename-out.txt has been created. A successful run will produce two output files: popfilename-out.xml and popfilename-out.txt. A third output file, popfilename-filter.xml, will be created if you are using the Anthony Nolan HLA filter option for HLA data to check your input for valid/known HLA alleles.

The popfilename-out.xml file is the primary output created by PyPop and the human-readable popfilename-out.txt file is a summary of the complete XML output. The XML output can be further transformed into plain text TSV files, either directly via pypop if invoked on multiple input files (using the --enable-tsv option, see pypop usage), or via the popmeta tool that aggregates results from different pypop runs (see Aggregating results from multiple runs (popmeta)).

A typical PyPop run might take anywhere from a few minutes to a few hours, depending on how large your data set is and who else is using the system at the same time. Note that performing the allPairwiseLDWithPermu test may take several days if you have highly polymorphic loci in your data set.

2.3. Aggregating results from multiple runs (`popmeta`)#

The popmeta script can aggregate results from a number of output XML files from individual populations into a set of tab-separated (TSV) files containing summary statistics via customized XSLT (eXtensible Stylesheet Language for Transformations) stylesheets. These TSV files can be directly imported into a spreadsheet or statistical software (e.g., R, SAS). In addition, there is some preliminary support for export into formats used by other population genetic software (e.g., PHYLIP).

Here is an example of a popmeta run, following on from the XML outputs generated in similar fashion in the previous pypop runs:

popmeta -o altdir Guatemalan-out.xml NorthAmerican-out.xml

This will generate a number of .tsv files, in the output directory altdir, of the form 1-locus-allele.tsv, 1-locus-summary.tsv, etc.

You can also supply a prefix via the --prefix-tsv command-line option so that all .tsv files are given that prefix, e.g.,

popmeta -o altdir --prefix-tsv myoutput Guatemalan-out.xml NorthAmerican-out.xml

This will result in files with a prefix, e.g. myoutput-1-locus-allele.tsv.

Note

It’s highly recommended to use the -o option to save the output in a separate subdirectory, as the output .tsv files have fixed names, and will overwrite any files in the local directory with the same name. See popmeta usage for the full list of options.

Note that a similar effect can be achieved directly from a pypop run (assuming that the configuration file can be used for both .pop population files), by invoking pypop with the --enable-tsv option:

pypop -c newconfig.ini -o altdir Guatemalan.pop NorthAmerican.pop --enable-tsv
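
If you drive pypop from a script, the command lines shown in this chapter can be assembled programmatically. The following is a minimal sketch, assuming that pypop is installed and on your PATH; the helper `build_pypop_command` is hypothetical and uses only the -c, -o and --enable-tsv options described above.

```python
import shlex


def build_pypop_command(pop_files, config=None, outdir=None, enable_tsv=False):
    """Assemble a pypop argument list from options described in this chapter."""
    cmd = ["pypop"]
    if config:
        cmd += ["-c", config]
    if outdir:
        cmd += ["-o", outdir]  # the output directory must already exist
    cmd += list(pop_files)
    if enable_tsv:
        cmd.append("--enable-tsv")
    return cmd


cmd = build_pypop_command(
    ["Guatemalan.pop", "NorthAmerican.pop"],
    config="newconfig.ini",
    outdir="altdir",
    enable_tsv=True,
)
print(shlex.join(cmd))
# To actually run the command (requires a working PyPop installation):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.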

2.4. The data file#

Sample files#

Data can be input either as genotypes, or in an allele count format, depending on the format of your data.

Data files are tab-delimited

These population files are plain text files, such as you might save out of the Notepad application on Windows (or Emacs). The columns are all tab-delimited, so you can include spaces in your labels. If you have your data in a spreadsheet application, such as Excel or LibreOffice, export the file as tab-delimited text in order to use it as a PyPop data file.

Depending on how you are viewing this documentation, in some of the examples below, the columns may not appear to align with their headers, but that is purely due to how tabstops are rendered. If you copy-and-paste the data into a text editor you should be able to see that the columns are tab separated.

As you will see in the following examples, population files begin with header information. In the simplest case, the first line contains the column headers for the genotype, allele count, or sequence information from the population. If the file contains a population data-block, then the first line consists of headers identifying the data on the second line, and the third line contains the column headers for the genotype or allele count information.

Genotype sample files#

For genotype data, each locus corresponds to two columns in the population file. The locus name must be repeated, with a suffix such as _1, _2 (the default) or _a, _b, and must match the format defined in the config.ini (see validSampleFields). Although PyPop needs this distinction to be made, phase is NOT assumed, and if known it is ignored.

Listing 2.7 shows the relevant lines for the configuration to read in the data shown in Listing 2.1 and Listing 2.2.

Listing 2.1 Multi-locus allele-level genotype data#

a_1	a_2	c_1	c_2	b_1	b_2
****	****	01:02	02:10:06	13:01	18:01:02
01	02:01	03:07	06:05	14:01	39:02:01
10	03:01:02	07:12	01:02	15:20	13:01
01	02:18	08:04	12:02	35:09:01	40:05
01	02:01	15:07	03:07	51:01:03	14:01
10	32:04	18:01	01:02	78:02:01	13:01
01:02	32:04	15:07	06:05	51:01:03	39:02:01

This is an example of the simplest kind of data file.
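
To make the two-columns-per-locus convention concrete, here is a small Python sketch that reads the first rows of Listing 2.1 and pairs up the columns belonging to each locus. It is an illustration only, not how PyPop itself parses files; the helper `pair_locus_columns` is hypothetical and assumes the default _1/_2 suffixes.

```python
import csv
import io

# First three lines of Listing 2.1 (tab-delimited).
POP_TEXT = (
    "a_1\ta_2\tc_1\tc_2\tb_1\tb_2\n"
    "****\t****\t01:02\t02:10:06\t13:01\t18:01:02\n"
    "01\t02:01\t03:07\t06:05\t14:01\t39:02:01\n"
)


def pair_locus_columns(header, suffixes=("_1", "_2")):
    """Map each locus name to the indices of its two allele columns."""
    first, second = suffixes
    loci = {}
    for index, name in enumerate(header):
        if name.endswith(first):
            loci.setdefault(name[: -len(first)], [None, None])[0] = index
        elif name.endswith(second):
            loci.setdefault(name[: -len(second)], [None, None])[1] = index
    return loci


rows = list(csv.reader(io.StringIO(POP_TEXT), delimiter="\t"))
header, samples = rows[0], rows[1:]
loci = pair_locus_columns(header)
print(loci)  # {'a': [0, 1], 'c': [2, 3], 'b': [4, 5]}
for locus, (i, j) in loci.items():
    print(locus, [(row[i], row[j]) for row in samples])
```

Columns without one of the two suffixes (such as identifying fields) are simply ignored by this sketch.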

Listing 2.2 Multi-locus allele-level HLA genotype data with sample information#

populat	id	a_1	a_2	c_1	c_2	b_1	b_2
UchiTelle	UT900-23	****	****	01:02	02:10:06	13:01	18:01:02
UchiTelle	UT900-24	01:01	02:01	03:07	06:05	14:01	39:02:01
UchiTelle	UT900-25	02:10	03:01:02	07:12	01:02	15:20	13:01
UchiTelle	UT900-26	01:01	02:18	08:04	12:02	35:09:01	40:05
UchiTelle	UT910-01	25:01	02:01	15:07	03:07	51:01:03	14:01
UchiTelle	UT910-02	02:10	32:04	18:01	01:02	78:02:01	13:01
UchiTelle	UT910-03	03:01:02	32:04	15:07	06:05	51:01:03	39:02:01

This example shows a data file which has non-allele data in some columns; here we have population (populat) and sample identifiers (id).

Listing 2.3 Multi-locus allele-level HLA genotype data with sample and header information#

labcode	method	ethnic	contin	collect	latit	longit
USAFEL	12th Workshop SSOP	Telle	NW Asia	Targen Village	41 deg 12 min N	94 deg 7 min E
populat	id	a_1	a_2	c_1	c_2	b_1	b_2
UchiTelle	UT900-23	****	****	01:02	02:10:06	13:01	18:01:02
UchiTelle	UT900-24	01:01	02:01	03:07	06:05	14:01	39:02:01
UchiTelle	UT900-25	02:10	03:01:02	07:12	01:02	15:20	13:01
UchiTelle	UT900-26	01:01	02:18	08:04	12:02	35:09:01	40:05
UchiTelle	UT910-01	25:01	02:01	15:07	03:07	51:01:03	14:01
UchiTelle	UT910-02	02:10	32:04	18:01	01:02	78:02:01	13:01
UchiTelle	UT910-03	03:01:02	32:04	15:07	06:05	51:01:03	39:02:01

This is an example of a data file which is identical to Listing 2.2, but which includes population level information.
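
The three-line header layout can be sketched in Python as follows, using the first rows of Listing 2.3. This is not PyPop's parser: the function `read_pop_text` is hypothetical, and it assumes the caller already knows whether a population data-block is present (in PyPop this is governed by the configuration file).

```python
import csv
import io

# Header block and first two samples from Listing 2.3 (tab-delimited).
POP_TEXT = (
    "labcode\tmethod\tethnic\tcontin\tcollect\tlatit\tlongit\n"
    "USAFEL\t12th Workshop SSOP\tTelle\tNW Asia\tTargen Village\t"
    "41 deg 12 min N\t94 deg 7 min E\n"
    "populat\tid\ta_1\ta_2\tc_1\tc_2\tb_1\tb_2\n"
    "UchiTelle\tUT900-23\t****\t****\t01:02\t02:10:06\t13:01\t18:01:02\n"
    "UchiTelle\tUT900-24\t01:01\t02:01\t03:07\t06:05\t14:01\t39:02:01\n"
)


def read_pop_text(text, has_pop_block):
    """Split a population file into population-level info and sample rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    pop_info = {}
    if has_pop_block:
        # Line 1 holds the field names, line 2 the population-level values.
        pop_info = dict(zip(rows[0], rows[1]))
        rows = rows[2:]
    sample_header, sample_rows = rows[0], rows[1:]
    samples = [dict(zip(sample_header, row)) for row in sample_rows]
    return pop_info, samples


pop_info, samples = read_pop_text(POP_TEXT, has_pop_block=True)
print(pop_info["contin"])                    # NW Asia
print(samples[0]["id"], samples[0]["c_2"])   # UT900-23 02:10:06
```

Because the file is tab-delimited, values containing spaces (such as "12th Workshop SSOP") stay in a single field.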

Listing 2.4 Multi-locus allele-level HLA genotype and microsatellite genotype data with header information#

labcode	ethnic	complex
USAFEL	****	0
populat	id	drb1_1	drb1_2	dqb1_1	dqb1_2	d6s2222_1	d6s2222_2
UchiTelle	HJK_2	01	03:01	02:01	05:01	249	249
UchiTelle	HJK_1	03:01	03:01	02:01	02:01	249	249
UchiTelle	HJK_3	01	03:01	02:01	05:01	249	249
UchiTelle	HJK_4	01	03:01	02:01	05:01	249	249
UchiTelle	MYU_2	02	04:01	03:02	06:02	247	249
UchiTelle	MYU_1	03:01	03:01	02:01	02:01	247	249
UchiTelle	MYU_3	03:01	04:01	02:01	03:02	249	249
UchiTelle	MYU_4	03:01	04:01	02:01	03:02	247	249

This example mixes different kinds of data: HLA allele data (from DRB1 and DQB1 loci) with microsatellite data (locus D6S2222).

Listing 2.5 Sequence genotype data with header information#

labcode	file
BLOGGS	C_New
popName	ID	TGFB1cdn10(1)	TGFB1cdn10(2)	TGFBhapl(1)	TGFBhapl(2)
Urboro	XQ-1	C	T	CG	TG
Urboro	XQ-2	C	C	CG	CG
Urboro	XQ-5	C	T	CG	TG
Urboro	XQ-21	C	T	CG	TG
Urboro	XQ-7	C	T	CG	TG
Urboro	XQ-20	C	T	CG	TG
Urboro	XQ-6	T	T	TG	TG
Urboro	XQ-8	C	T	CG	TG
Urboro	XQ-9	T	T	TG	TG
Urboro	XQ-10	C	T	CG	TG

This example includes nucleotide sequence data: the TGFB1cdn10 locus consists of one nucleotide, while the TGFBhapl locus is actually haplotype data, but PyPop simply treats each combination as a separate “allele” for subsequent analysis.

Allele count sample files#

PyPop can also process allele count data, of the kind shown in Listing 2.6. Like genotype sample files, allele count sample files may also include a header (see validPopFields). However, you cannot mix allele count data and genotype data together in the one sample file.

Listing 2.6 Allele count data without a header block (a corresponding .ini file is found in Listing 2.8)#

Note

Currently each .pop file can only contain allele count data for one locus. In order to process multiple loci for one population you must create a separate .pop file for each locus.

Missing data#

Untyped or missing data may be represented in a variety of ways. The default value for untyped or missing data is a series of four asterisks (****) as specified by the config.ini. You may not “represent” untyped data by leaving a column blank, nor may you represent a homozygote by leaving the second column blank. All cells for which you have data must include data, and all cells for which you do not have data must also be filled in, using a missing data value.

For individuals who were not typed at all loci, the data in loci for which they are typed will be used on all single-locus analyses for that individual and locus, so that you see the value of the number of individuals (n) vary from locus to locus in the output. These individuals’ data will also be used for multi-locus analyses. Only the loci that contain no missing data will be included in any multi-locus analysis.

If an individual is only partially typed at a locus, it will be treated as if it were completely untyped, and data for that individual for that locus will be dropped from ALL analyses.
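
The effect of these rules on the per-locus sample size (n) can be illustrated with a short Python sketch. The data below are hypothetical (in particular the partially typed third individual), and the function `typed_counts` is not part of PyPop; it only applies the rule stated above that a partially typed genotype is treated as completely untyped.

```python
UNTYPED = "****"  # default untypedAllele value

# Hypothetical genotypes: one tuple of two alleles per locus and individual.
genotypes = [
    {"a": ("****", "****"), "b": ("13:01", "18:01:02")},   # untyped at a
    {"a": ("01:01", "02:01"), "b": ("14:01", "39:02:01")},  # fully typed
    {"a": ("02:10", "****"), "b": ("15:20", "13:01")},      # partially typed at a
]


def typed_counts(genotypes, untyped=UNTYPED):
    """Count, per locus, the individuals with both alleles typed."""
    counts = {}
    for individual in genotypes:
        for locus, (first, second) in individual.items():
            counts.setdefault(locus, 0)
            # A genotype counts only if neither allele is the untyped marker.
            if first != untyped and second != untyped:
                counts[locus] += 1
    return counts


print(typed_counts(genotypes))  # {'a': 1, 'b': 3}
```

Here n differs between the two loci, which is the locus-to-locus variation you would expect to see in the single-locus output.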

Warning

Do not leave trailing blank lines at the end of your data file, as this currently causes PyPop to terminate with an error message that takes experience to diagnose.

For haplotype estimation and linkage disequilibrium calculations (i.e., the emhaplofreq part of the program) you are currently restricted to a maximum of seven loci per haplotype request. For haplotype estimation there is a limit of 5000 for the number of individuals (n). [1]

2.5. The configuration file#

The sets of population genetic analyses that are run on your population data file and the manner in which the data file is interpreted by PyPop is controlled by a configuration file, the default name for which is config.ini. This is another plain text file consisting of comments (which are lines that start with a semi-colon), sections (which are lines with labels in square brackets), and options (which are lines specifying settings relevant to that section in the option=value format).

Note

If any option runs over one line (such as validSampleFields) then the second and subsequent lines must be indented by exactly one space.

A simple configuration file#

Here we present a simple .ini file corresponding to Listing 2.1 (note that comment lines have been omitted from the example below for clarity). After this we review the sections that are highlighted in the example, starting with general settings, followed by how to specify data formats and then Analysis options.

Descriptions of more advanced options for the previously described sections and additional filtering sections are contained in Advanced options and Advanced filtering sections, respectively.

Listing 2.7 Minimal .ini file for genotype data#

[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2

[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompson]
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact]
numReplicates=10000

[Emhaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=0
;;numPermuInitCond=5
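
As a quick way to check that an .ini file has the structure you intend, you can load it with Python's standard `configparser` module. This is a sketch for inspection only: whether PyPop reads every detail of the file in exactly the same way is an assumption, and the file content is embedded as a string here so the example is self-contained.

```python
import configparser

# Content of Listing 2.7; continuation lines start with a single space.
INI_TEXT = """\
[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2

[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompson]
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact]
numReplicates=10000

[Emhaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=0
;;numPermuInitCond=5
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# The multi-line option is returned as one newline-joined value.
fields = config["ParseGenotypeFile"]["validSampleFields"].split()
print(config.sections())
print(fields)  # ['*a_1', '*a_2', '*c_1', '*c_2', '*b_1', '*b_2']
# Lines starting with a semi-colon are comments, so this option is absent.
print(config.has_option("Emhaplofreq", "numPermuInitCond"))  # False
```

If a field is missing from the printed list, the most likely cause is a continuation line that does not start with a space.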

`[General]` settings#

This section contains variables that control the overall behavior of PyPop. Additional variables are described in [General] advanced options.

debug=0

This setting enables verbose debugging messages. Setting it to 1 will generate output that can be useful in diagnosing problems. PyPop developers may ask you to enable it when reporting on problems on the issue tracker.

Specifying data formats#

There are two possible formats: [ParseGenotypeFile] and [ParseAlleleCountFile]

`[ParseGenotypeFile]`#

If your data is genotype data, you will want a section labeled: [ParseGenotypeFile] (as shown in the Minimal .ini file for genotype data).

alleleDesignator

This option is used to tell PyPop what is allele data and what isn’t. You must use this symbol in the validSampleFields option. In general, you won’t need to change it. [Default: * ]

untypedAllele

This option is used to tell PyPop what symbol you have used in your data files to represent untyped or unknown data fields. These fields MAY NOT BE LEFT BLANK. You must use something consistent that cannot be confused with real data here. [Default: **** ]

validSampleFields

This option should contain the names of the loci immediately preceding your genotype data (if it has three header lines, this information will be on the third line, otherwise it will be the first line of the file). [There is no default, this option must always be present]

The format is as follows, for each sample field (which may either be an identifying field for the sample such as populat, or contain allele data) create a new line where:
- The first line (validSampleFields=) consists of the name of your sample field (if it contains allele data, the name of the field should be preceded by the character designated in the alleleDesignator option above).
- All subsequent lines after the first must be preceded by one space (again if it contains allele data, the name of the field should be preceded by the character designated in the alleleDesignator option above).
Here is an example:
```
validSampleFields=*a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2    # Note initial space at start of line.
```
Here is an example that includes identifying (non-allele data) information such as sample id (id) and population name (populat):
```
validSampleFields=populat
 id
 *a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2
```

`[ParseAlleleCountFile]`#

If your data is not genotype data, but rather, data of the allele-name count format, then you will want to use the [ParseAlleleCountFile] section INSTEAD of the [ParseGenotypeFile] section. The alleleDesignator and untypedAllele options work identically to that described for [ParseGenotypeFile].

validSampleFields

This option should contain either a single locus name or a colon-separated list of all loci that will be in the data files you intend to analyze using a specific .ini file. The colon-separated list allows you to avoid changing the .ini file when running over a collection of data files containing different loci. e.g.,
```
validSampleFields=A:B:C:DQA1:DQB1:DRB1:DPB1:DPA1
 count
```
Note that each .pop file must contain only one locus (see the note in Listing 2.6). Listing multiple loci simply permits the same .ini file to be reused for each data file.

Below is a minimal .ini file that can be used to process the sample file with a single locus in Listing 2.6.

Listing 2.8 Minimal .ini file for allele count data in Listing 2.6#

[ParseAlleleCountFile]
validSampleFields=dqa1
 count

[HomozygosityEWSlatkinExact]
numReplicates=10000

Note also that only analyses that can be performed on allele count data (i.e. that don’t require full genotype information) can be enabled. This means that other than the automatically generated single locus statistics, only [HomozygosityEWSlatkinExact] can be enabled. Hardy-Weinberg and haplotype and LD analyses are not available.

Analysis options#

These sections describe the primary analysis options that can be enabled for PyPop, as they are used in the simple example, above.

`[HardyWeinberg]`#

Hardy-Weinberg analysis is enabled by the presence of this section.

lumpBelow

This option value represents a cut-off value. Alleles with an expected value equal to or less than lumpBelow will be lumped together into a single category for the purpose of calculating the degrees of freedom and overall p-value for the chi-squared Hardy-Weinberg test.
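
The pooling rule can be pictured with the following Python sketch. It only illustrates the wording above (categories whose expected value is at or below the cut-off are combined into one); it is not PyPop's implementation, and both the function `lump_categories` and the expected values are hypothetical.

```python
def lump_categories(expected, lump_below):
    """Pool every category whose expected value is <= lump_below."""
    kept = {}
    lumped_total = 0.0
    for name, value in expected.items():
        if value <= lump_below:
            lumped_total += value  # goes into the single pooled category
        else:
            kept[name] = value
    if lumped_total > 0:
        kept["lumped"] = lumped_total
    return kept


# Hypothetical expected values for four categories.
expected = {"01:01": 12.5, "02:01": 7.0, "03:01": 5.0, "24:02": 1.5}
print(lump_categories(expected, lump_below=5))
# {'01:01': 12.5, '02:01': 7.0, 'lumped': 6.5}
```

Note that a value exactly equal to the cut-off (5.0 here) is pooled, matching the "equal to or less than" wording.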

`[HardyWeinbergGuoThompson]`#

When this section is present, an implementation of the Hardy-Weinberg exact test is run using the original Guo and Thompson (1992) code, using a Monte-Carlo Markov chain (MCMC). In addition, two measures (Chen and Diff) of the goodness of fit of individual genotypes are reported under this option (Chen et al., 1999). By default this section is not enabled. This is a different implementation to the Arlequin version listed in Advanced options, below.

dememorizationSteps

Number of steps to “burn in” the Markov chain before statistics are collected. [Default: 2000 ]

samplingNum

Number of Markov chain samples. [Default: 1000 ]

samplingSize

Markov chain sample size. [Default: 1000 ]

Note that the total number of steps in the Monte-Carlo Markov chain is the product of samplingNum and samplingSize, so the default values described above would contain 1,000,000 (= 1000 x 1000) steps in the MCMC chain.

The default values for options described above have proved to be optimal for us and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.

`[HomozygosityEWSlatkinExact]`#

The presence of this section enables Slatkin’s (1994) implementation of the Ewens-Watterson exact test of neutrality.

numReplicates

The default values have proved to be optimal for us. There is no reason to change them unless you are particularly curious. If you change the default values and have problems, please let us know.

`[Emhaplofreq]`#

The presence of this section enables haplotype frequency estimation and calculation of linkage disequilibrium (LD) measures. Please note that PyPop assumes that the genotype data is unphased when estimating haplotype frequencies and LD measures.

lociToEstHaplo

In this option you can list the multi-locus haplotypes for which you wish the program to estimate haplotype frequencies and calculate the LD. It should be a comma-separated list of colon-joined loci, e.g.,
```
lociToEstHaplo=a:b:drb1,a:b:c,drb1:dqa1:dpb1,drb1:dqb1:dpb1
```
allPairwiseLD

Set this to 1 (one) if you want the program to calculate all pairwise LD for your data, otherwise set this to 0 (zero).

allPairwiseLDWithPermu

Set this to a positive integer greater than 1 if you need to determine the significance of the pairwise LD measures enabled by the previous option. The number you use is the number of permutations that will be run to ascertain the significance (this should be at least 1000). Note that this is done via permutation testing performed after the pairwise LD test for all pairs of loci, and that this test can take DAYS if your data is highly polymorphic.

numPermuInitCond

Set this to change the number of initial conditions used per permutation. [Default: 5 ]. (Note: this parameter is only used if allPairwiseLDWithPermu is set and nonzero).
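
The lociToEstHaplo format described above (a comma-separated list of colon-joined loci) can be checked with a few lines of Python before starting a long run. This is a sketch rather than PyPop's own parser; the function name is hypothetical, and the limit of seven loci per haplotype request is taken from the warning in the section on missing data.

```python
MAX_LOCI_PER_HAPLOTYPE = 7  # limit stated in this chapter


def parse_loci_to_est_haplo(value, max_loci=MAX_LOCI_PER_HAPLOTYPE):
    """Split 'a:b,c:d:e' into [['a', 'b'], ['c', 'd', 'e']] and check sizes."""
    groups = []
    for group in value.split(","):
        loci = [locus.strip() for locus in group.split(":") if locus.strip()]
        if not loci:
            continue  # tolerate a stray trailing comma
        if len(loci) > max_loci:
            raise ValueError(f"{group!r} requests more than {max_loci} loci")
        groups.append(loci)
    return groups


groups = parse_loci_to_est_haplo("a:b:drb1,a:b:c,drb1:dqa1:dpb1,drb1:dqb1:dpb1")
for loci in groups:
    print(loci)
```

Each inner list corresponds to one haplotype request.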

Advanced options#

The following section describes additional options to previously described sections. Most of the time these options can be omitted and PyPop will choose defaults; however, these advanced options do offer greater control over the application. In particular, customization will be required for data that has sample identifiers as in Listing 2.2 or a header data block as in Listing 2.3, in which case both validSampleFields (described above) and validPopFields (described below) will need to be modified.

Deprecated since version 1.0.0: The sections [Arlequin] and [HardyWeinbergGuoThompsonArlequin] related to the Arlequin program are deprecated, as they are currently unmaintained.

`[General]` advanced options#

txtOutFilename and xmlOutFilename

If you wish to specify particular names for the output files, which you want to remain identical over several runs, you can set these two items to particular values. The default is to have the program select the output filenames, which can be controlled by the next variable. [Default: not used]

outFilePrefixType

This option can either be omitted entirely (in which case the default will be filename) or be set in several ways. The default is set as filename, which will result in three output files named original-filename-minus-suffix-out.xml, original-filename-minus-suffix-out.txt, and original-filename-minus-suffix-filter.xml. [Default: filename ]

If you set the value to date instead of filename, you’ll get the date incorporated in the filename as follows: original-filename-minus-suffix-YYYY-nn-dd-HH-MM-SS-out.xml (and likewise for the .txt file), e.g., USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml (where Y, n, d, H, M, S refer to year, month, day, hour, minute and second, respectively).

xslFilename

This option specifies where to find the XSLT file to use for transforming PyPop’s xml output into human-readable form. Most users will not normally need to set this option, and the default is the system-installed text.xsl file.
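
The naming schemes described under outFilePrefixType can be mimicked in Python, which is handy when a script needs to know in advance where the results will be written. This sketch follows the patterns quoted above; the function `output_names` is hypothetical, and details such as which clock PyPop uses for the timestamp are assumptions.

```python
from datetime import datetime
from pathlib import Path


def output_names(pop_file, prefix_type="filename", now=None):
    """Return the expected XML and text output names for a population file."""
    stem = Path(pop_file).stem  # original filename minus its suffix
    if prefix_type == "date":
        now = now or datetime.now()
        stem = f"{stem}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    elif prefix_type != "filename":
        raise ValueError(f"unsupported prefix type: {prefix_type!r}")
    return [f"{stem}-out.xml", f"{stem}-out.txt"]


print(output_names("Guatemalan.pop"))
# ['Guatemalan-out.xml', 'Guatemalan-out.txt']
print(output_names("USAFEL-UchiTelle.pop", "date", datetime(2003, 9, 21, 1, 29, 35)))
# ['USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml',
#  'USAFEL-UchiTelle-2003-09-21-01-29-35-out.txt']
```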

`[ParseGenotypeFile]` and `[ParseAlleleCountFile]` common options#

popNameDesignator

There is a special designator to mark the population name field, which is usually the first field in the data block. [Default: + ]

If you are analyzing data that contains a population name for each sample, then the first entry in your validSampleFields section should have a prefixed +, as below:
```
validSampleFields=+populat
 *a_1
 *a_2
 ...
```

validPopFields

If you are analyzing data with an initial two-line population header block, as in Listing 2.3 (Multi-locus allele-level HLA genotype data with sample and header information), then you will need to set this option. In this case, it should contain the field names in the first line of the header information of your file. [Default: required when a population data-block is present in data file], e.g.:
```
validPopFields=labcode
 method
 ethnic
 country
 latit
 longit
```
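
The role of the designator prefixes in validSampleFields can be summarized with a small Python sketch. It is an illustration of the conventions described above, not PyPop's code; the function `classify_sample_fields` is hypothetical and uses the documented defaults for alleleDesignator (*) and popNameDesignator (+).

```python
def classify_sample_fields(fields, allele_designator="*", pop_name_designator="+"):
    """Split field names into population name, other identifiers, and allele fields."""
    pop_name_field = None
    other_fields = []
    allele_fields = []
    for field in fields:
        if field.startswith(allele_designator):
            allele_fields.append(field[len(allele_designator):])
        elif field.startswith(pop_name_designator):
            pop_name_field = field[len(pop_name_designator):]
        else:
            other_fields.append(field)  # e.g. a sample identifier
    return pop_name_field, other_fields, allele_fields


fields = ["+populat", "id", "*a_1", "*a_2", "*c_1", "*c_2", "*b_1", "*b_2"]
print(classify_sample_fields(fields))
# ('populat', ['id'], ['a_1', 'a_2', 'c_1', 'c_2', 'b_1', 'b_2'])
```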

`[ParseGenotypeFile]` advanced option#

fieldPairDesignator

This option allows you to override the coding for the headers for each pair of alleles at each locus; it must match the entry in the config file under validSampleFields and the entries in your population data file. If you want to use something other than _1 and _2, change this option. For instance, to use letters and parentheses, set it as follows: fieldPairDesignator=(a):(b) [Default: _1:_2 ]

`[Emhaplofreq]` advanced options#

permutationPrintFlag

Warning

If permutationPrintFlag is enabled it can drastically increase the size of the output XML file on the order of the product of the number of possible pairwise comparisons and permutations. Machines with lower RAM and disk space may have difficulty coping with this.

Determines whether the likelihood ratio for each permutation will be logged to the XML output file; this is disabled by default. [Default: 0 (i.e. OFF)].
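
To get a feel for the growth mentioned in the warning, you can estimate the number of logged values as the number of locus pairs multiplied by the number of permutations. This is a back-of-the-envelope sketch under that assumption; the function name is hypothetical and the real size of the XML output depends on details not covered here.

```python
from math import comb


def permutation_records(num_loci, num_permutations):
    """Locus pairs (k choose 2) times permutations per pair."""
    return comb(num_loci, 2) * num_permutations


# Six loci give 15 pairs; with 1000 permutations that is 15000 values.
print(permutation_records(6, 1000))  # 15000
```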

Deprecated since version 1.0.0: currently unmaintained and untested.

[Arlequin] extra section

This section sets characteristics of the Arlequin application if it has been installed (it must be installed separately from PyPop as we cannot distribute it). The options in this section are only used when a test requiring Arlequin, such as its implementation of Guo and Thompson’s (1992) Hardy-Weinberg exact test, is invoked (see below).

arlequinExec

This option specifies where to find the Arlequin executable on your system. The default assumes it is on your system path. [Default: arlecore.exe ]

[HardyWeinbergGuoThompsonArlequin] extra section

When this section is present, Arlequin’s implementation of the Hardy-Weinberg exact test is run, using a Monte-Carlo Markov Chain implementation. By default this section is not enabled.

markovChainStepsHW

Length of the Markov chain, in steps. [Default: 2500000 ]

markovChainDememorisationStepsHW

Number of steps to “burn in” the Markov chain before statistics are collected. [Default: 5000 ]

The default values for options described above have proved to be optimal for us and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.

Advanced filtering sections#

This section describes additional advanced sections that can be used for applying filtering to both the input and output of the population data.

`[Filters]` extra section#

When this section is present, it allows you to specify successive filters to the data.

filtersToApply

Here you specify which filters you want applied to the data and the order in which you want them applied. The format is FILTER[:FILTER]*, i.e. a series of colon-delimited filters. Currently there are four predefined filters: AnthonyNolan, Sequence, DigitBinning, and CustomBinning. If you specify one or more of these filters, you will get the default behavior of the filter. If you wish to modify the default behavior, you should add a section with the same name as the specified filter(s). See next section for more on this. Please note that, while you are allowed to specify any ordering for the filters, some orderings may not make sense. For example, the ordering Sequence:AnthonyNolan would not make sense (because as far as PyPop is concerned, your alleles are now amino acid residues). However, the reverse ordering, AnthonyNolan:Sequence, would be logical and perhaps even advisable.

makeNewPopFile

This option creates intermediate population files (in the .pop format) before or after any filtering step. This allows you to save and inspect the output, and to run the output through another PyPop process with potentially different parameters. The format of the argument is {all-loci|separate-loci}:<NUMBER>, where separate-loci generates a separate .pop file for each locus, and all-loci generates a single .pop file with all loci. <NUMBER> is an integer representing the step in the filtering process at which the output files should be generated: 0 represents the original input data, i.e. the step before any filter runs, 1 the data after the first filter is run, and so on.

For example, with the following stanza in an .ini file:
```
[Filters]
filtersToApply=Sequence
makeNewPopFile=all-loci:1

[Sequence]
directory=tests/data/anthonynolan/msf-2.18.0/
```
applied to an input file MyPopulation.pop with a single HLA locus A:
```
A_1       A_2
0101      0201
0210      03012
0101      0218
```
This would apply the Sequence filter, translating the original HLA A locus into a set of new loci, one for each polymorphic amino acid residue position within HLA A. Because makeNewPopFile is set to all-loci:1, the data at that point (after the first filter) is then written to a new file, MyPopulation-filtered.pop. For example, showing only the first two polymorphic positions, at residues 9 (new locus A_9) and 44 (new locus A_44), MyPopulation-filtered.pop might look something like this:
```
A_9_1   A_9_2    A_44_1    A_44_2  ...
F       F        K         R       ...
Y       F        R         R       ...
F       F        K         R       ...
```
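To make the two option formats concrete, here is a small sketch that reads the stanza above and splits each value into its parts. The helper functions `parse_filters_to_apply` and `parse_make_new_pop_file` are hypothetical names written for this illustration; they are not part of PyPop and only mirror the formats described in this section.

```python
from configparser import ConfigParser

# The four predefined filter names listed above.
KNOWN_FILTERS = {"AnthonyNolan", "Sequence", "DigitBinning", "CustomBinning"}


def parse_filters_to_apply(value):
    """Split a colon-delimited FILTER[:FILTER]* value, preserving order."""
    filters = [name for name in value.split(":") if name]
    unknown = [name for name in filters if name not in KNOWN_FILTERS]
    if unknown:
        raise ValueError(f"unknown filter(s): {unknown}")
    return filters


def parse_make_new_pop_file(value):
    """Split '{all-loci|separate-loci}:<NUMBER>' into (mode, step)."""
    mode, _, step = value.partition(":")
    if mode not in ("all-loci", "separate-loci"):
        raise ValueError(f"unexpected mode: {mode!r}")
    return mode, int(step)


config = ConfigParser()
config.read_string("""
[Filters]
filtersToApply=Sequence
makeNewPopFile=all-loci:1
""")

filters = parse_filters_to_apply(config["Filters"]["filtersToApply"])
mode, step = parse_make_new_pop_file(config["Filters"]["makeNewPopFile"])
print(filters, mode, step)
```

With a value such as AnthonyNolan:Sequence, the first helper would return the two filter names in the order in which they are to be applied.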

2.6. PyPop API examples#

Here is a short example of how you can use the API (documented in the PyPop API Reference) directly in your own Python program. This program reads a short .pop data file consisting of one locus with seven individuals, and rather than reading from a configuration file, it creates a configuration object with the file format details and enables a single [HardyWeinberg] analysis. It then performs the equivalent of the popmeta script and generates output TSV files.

The Main class generates analysis results as XML, and then Meta processes this XML to generate .tsv file output suitable for further analysis. Here is the process, step-by-step:

(You can cut and paste the following code snippets directly into an interactive Python session.)

First create the ConfigParser instance from a dictionary. Note that untypedAllele and alleleDesignator are specified explicitly even though they are the same as the defaults, because they must always match the input file:

```
from configparser import ConfigParser

config = ConfigParser()
config.read_dict({
    "ParseGenotypeFile": {
        "untypedAllele": "****",
        "alleleDesignator": "*",
        "validSampleFields": "*a_1\n*a_2",
    },
    "HardyWeinberg": {"lumpBelow": "5"},
})
```

Next, create a test .pop text file (note the inline tab characters, written as \t). You could replace this with your own input file, or generate pop_contents from an existing data structure in your program:

```
pop_contents = '''a_1\ta_2
****\t****
01:01\t02:01
02:10\t03:01:02
01:01\t02:18
25:01\t02:01
02:10\t32:04
03:01:02\t32:04'''
with open("my.pop", "w") as f:
    f.write(pop_contents)
```

Now create the Main instance, using the config object to run the analysis of data in my.pop:

```
from PyPop.popanalysis import Main
application = Main(config=config, fileName="my.pop", version="fake")
```

The analysis runs to completion and produces the following default logging output to the console:

```
LOG: no XSL file, skipping text output
LOG: Data file has no header data block
```

You can query the Main instance for the name of the generated output XML file (here, my-out.xml):
```
>>> application.getXmlOutPath()
'my-out.xml'
```

Lastly, pass this file to the Meta class to generate output TSV files (as described in Aggregating results from multiple runs (popmeta)):

```
from PyPop.popaggregate import Meta

outXML = application.getXmlOutPath()
Meta(TSV_output=True, xml_files=[outXML])
```

The generated TSV files are listed in the console output:

```
./1-locus-hardyweinberg.tsv
./1-locus-summary.tsv
./1-locus-allele.tsv
./1-locus-genotype.tsv
```

These .tsv files can then be read into another data structure (e.g. a pandas DataFrame) for further analysis.
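As a rough sketch of that last step, the following reads a tab-separated file into a list of dictionaries using only the standard library. The `read_tsv` helper is a hypothetical name, and the demo file with columns `colA` and `colB` is a made-up stand-in so that the snippet runs on its own; it does not reflect the actual columns PyPop writes. If pandas is installed, `pandas.read_csv(path, sep="\t")` should give you a DataFrame directly.

```python
import csv
import os
import tempfile


def read_tsv(path):
    """Return the rows of a TSV file as dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


# Stand-in file with made-up columns (NOT PyPop's real output columns),
# used only so that this example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w", newline="") as f:
    f.write("colA\tcolB\nx\t1\ny\t2\n")

rows = read_tsv(demo_path)
print(rows[0])
```

In a real session you would pass one of the generated files instead, for example `read_tsv("1-locus-summary.tsv")`.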

2.7. Command-line interfaces#

Described below is the usage for both pypop and popmeta, including a full list of the current command-line options and arguments for PyPop version 1.4.1. Note that you can also view this full list of options from the program itself by supplying the --help option, i.e. pypop --help, or popmeta --help, respectively.

`pypop` usage#

Process and run population genetics statistics on one or more POPFILEs. Expects to find a configuration file called config.ini in the current directory.

```
usage: pypop [-h]
             [--citation [{apalike,bibtex,endnote,ris,codemeta,cff,schema.org,zenodo}]]
             [-o OUTPUTDIR] [-V] [-d]
             [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
             [--log-file LOGFILE] [-c CONFIG] [-m] [-x XSLFILE] [-t]
             [--enable-ihwg] [--enable-phylip] [-p PREFIX_TSV] [-i]
             [-f FILELIST]
             [POPFILE ...]
```

Options for pypop#

--citation

Possible choices: apalike, bibtex, endnote, ris, codemeta, cff, schema.org, zenodo

generate citation to PyPop for this version of PyPop

Default: 'apalike'

-o, --outputdir

put output in directory OUTPUTDIR

-V, --version

show program’s version number and exit

-d, --debug

enable debugging output (sets log level to DEBUG and overrides config file setting)

--log-level

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL

set log level (overrides -d); one of: DEBUG, INFO, WARNING, ERROR, CRITICAL

Added in version 1.4.0.

--log-file

write logs to LOGFILE instead of stdout

Added in version 1.4.0.

-c, --config

select config file

Default: 'config.ini'

-m, --testmode

run PyPop in test mode for unit testing

-x, --xsl

override the default XSLT translation with XSLFILE

TSV output options#

Note that the --enable-* and --prefix-tsv options are only valid if --enable-tsv/-t is also supplied.

-t, --enable-tsv

generate TSV output files (aka run popmeta)

--enable-ihwg

enable 13th IHWG workshop populationdata default headers

--enable-phylip

enable generation of PHYLIP .phy files

-p, --prefix-tsv

append PREFIX_TSV to the output TSV files

Mutually exclusive input options#

-i, --interactive

run in interactive mode, prompting user for file names

-f, --filelist

file containing list of files (one per line) to process. Files are resolved relative to FILELIST, unless absolute (mutually exclusive with supplying POPFILE)

POPFILE

input population (.pop) file(s)

Default: []
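If you drive pypop from another Python script, one way to keep these rules straight is to build the argument list in a small helper before handing it to `subprocess.run`. The sketch below is only an illustration: `build_pypop_command` is a hypothetical function, it covers just a handful of the options listed above, and it checks two of the documented constraints (--prefix-tsv requires --enable-tsv, and FILELIST cannot be combined with POPFILE arguments).

```python
import shlex


def build_pypop_command(popfiles=(), filelist=None, config="config.ini",
                        outputdir=None, enable_tsv=False, prefix_tsv=None):
    """Assemble a pypop argument list from a few documented options."""
    if filelist is not None and popfiles:
        raise ValueError("-f/--filelist is mutually exclusive with POPFILE")
    if prefix_tsv is not None and not enable_tsv:
        raise ValueError("-p/--prefix-tsv is only valid with -t/--enable-tsv")

    cmd = ["pypop", "-c", config]
    if outputdir is not None:
        cmd += ["-o", outputdir]
    if enable_tsv:
        cmd.append("-t")
    if prefix_tsv is not None:
        cmd += ["-p", prefix_tsv]
    if filelist is not None:
        cmd += ["-f", filelist]
    cmd += list(popfiles)
    return cmd


cmd = build_pypop_command(["my.pop"], outputdir="results", enable_tsv=True)
print(shlex.join(cmd))  # pypop -c config.ini -o results -t my.pop
```

Once PyPop is installed, the resulting list could be executed with `subprocess.run(cmd, check=True)`.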

`popmeta` usage#

Processes XMLFILEs and generates ‘meta’-analyses. XMLFILEs are expected to be the XML output files taken from runs of pypop. Will skip any XML files that are not well-formed XML.

```
usage: popmeta [-h]
               [--citation [{apalike,bibtex,endnote,ris,codemeta,cff,schema.org,zenodo}]]
               [-o OUTPUTDIR] [-V] [-d]
               [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
               [--log-file LOGFILE] [-p PREFIX_TSV] [--disable-tsv]
               [--output-meta] [-x XSLDIR] [--enable-ihwg]
               [--enable-phylip | -b FACTOR]
               XMLFILE [XMLFILE ...]
```

Positional Arguments#

XMLFILE

XML (.xml) file(s) generated by pypop runs

Default: []

Options for popmeta#

--citation

Possible choices: apalike, bibtex, endnote, ris, codemeta, cff, schema.org, zenodo

generate citation to PyPop for this version of PyPop

Default: 'apalike'

-o, --outputdir

put output in directory OUTPUTDIR

-V, --version

show program’s version number and exit

-d, --debug

enable debugging output (sets log level to DEBUG and overrides config file setting)

--log-level

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL

set log level (overrides -d); one of: DEBUG, INFO, WARNING, ERROR, CRITICAL

Added in version 1.4.0.

--log-file

write logs to LOGFILE instead of stdout

Added in version 1.4.0.

-p, --prefix-tsv

append PREFIX_TSV to the output TSV files

--disable-tsv

disable generation of .tsv files

--output-meta

dump the meta output file to stdout, ignore xslt file

-x, --xsldir

use specified directory to find meta XSLT

--enable-ihwg

enable 13th IHWG workshop populationdata default headers

Mutually exclusive popmeta options#

--enable-phylip

enable generation of PHYLIP .phy files

-b, --batchsize

process in batches of size total/FACTOR rather than all at once, by default do separately (batchsize=0)

Default: 0
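To illustrate what "mutually exclusive" means in practice, the following stand-in parser declares the same two options in an `argparse` mutually exclusive group. This is a sketch only: the parser name `popmeta-sketch` is made up, and nothing here is taken from PyPop's actual source, which may be organized differently.

```python
import argparse

# Stand-in parser that mimics only the two mutually exclusive options.
parser = argparse.ArgumentParser(prog="popmeta-sketch")
group = parser.add_mutually_exclusive_group()
group.add_argument("--enable-phylip", action="store_true")
group.add_argument("-b", "--batchsize", type=int, default=0, metavar="FACTOR")
parser.add_argument("xmlfiles", nargs="+", metavar="XMLFILE")

# Supplying either option on its own is accepted.
args = parser.parse_args(["-b", "2", "a-out.xml", "b-out.xml"])
print(args.batchsize, args.xmlfiles)

# Supplying both makes argparse report a usage error and exit.
try:
    parser.parse_args(["--enable-phylip", "-b", "2", "a-out.xml"])
except SystemExit:
    print("rejected: --enable-phylip and -b cannot be combined")
```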
