Methods

Note to all users: the software code running this new version of ALSGene still has some issues and instabilities that we are in the process of fixing – we appreciate your patience during this process! If you encounter a problem or have suggestions on how to improve the site, please feel free to contact us and we will get back to you as soon as possible.

1. Overview

The goal of the ALSGene database is to serve as a comprehensive, regularly updated, publicly available collection of published genetic association studies assessing ALS (amyotrophic lateral sclerosis) risk. To this end, ALSGene provides a detailed qualitative overview of all identified eligible ALS genetic association studies. Furthermore, it presents quantitative assessments of the cumulative evidence for association by calculating and providing up-to-date meta-analysis results on eligible polymorphisms.

The following paragraphs provide a brief overview of the methods underlying curation and analysis of data available on ALSGene. More detailed descriptions of inclusion criteria, literature searches, data-management procedures, and statistical analyses presented on ALSGene and on similar databases from our group for other diseases can be found in the following publications: Lill et al, Amyotroph Lateral Scler, 2011; Lill et al, PLoS Genet, 2012; and Lill and Bertram, Hum Mutat, 2012.

2. Database organization and methods

Literature searches, inclusion criteria, and data extraction

Eligible publications are identified through systematic searches of scientific literature databases, as well as the tables of contents of journals in genetics, neurology, and psychiatry. Only studies that are i) published in peer-reviewed journals and ii) available in English are considered for inclusion in the database. In particular, this precludes the inclusion of data presented only in abstract form (e.g. at scientific meetings) and of non-English publications.

Data extracted from eligible publications summarize key characteristics of the investigated study populations as well as genetic data on tested polymorphisms (e.g. polymorphism names, genotype distributions in cases and controls, and/or published additive odds ratios [ORs] and 95% confidence intervals [CIs]). Polymorphisms are automatically assigned to the nearest gene within an interval of 50 kb (note that genes assigned using this approach are not necessarily the most compelling functional candidates). Allele names are displayed with reference to the plus (forward) strand using build 19 (GRCh37/hg19) of the human genome assembly.

For polymorphisms with genotype data or association results in at least four case-control datasets, continuously updated fixed-effect meta-analyses are presented (see next section). Note that data obtained from family-based studies are not included in the meta-analyses, as ORs cannot be readily calculated from overall genotype distributions. Furthermore, owing to the specific characteristics of human mitochondrial inheritance (e.g. its multicopy nature and the high frequency of somatic mutation events) and the innate heterogeneity in the design of mitochondrial association studies, data from these studies are likewise not subject to meta-analysis. However, family-based as well as mitochondrial association studies are still listed on the ALSGene website, including their qualitative study characteristics.
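The nearest-gene assignment mentioned in this section can be illustrated with a minimal sketch. It is not the code running on ALSGene: the `Gene` record, the function names, and the example coordinates are hypothetical, and only the 50 kb window is taken from the text.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

WINDOW_BP = 50_000  # 50 kb interval stated in the text


class Gene(NamedTuple):
    """Hypothetical gene annotation record (coordinates are 1-based, inclusive)."""
    name: str
    chrom: str
    start: int
    end: int


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance in bp from a position to a gene; 0 if the position lies inside it."""
    if pos < gene.start:
        return gene.start - pos
    if pos > gene.end:
        return pos - gene.end
    return 0


def nearest_gene(chrom: str, pos: int, genes: Sequence[Gene],
                 window: int = WINDOW_BP) -> Optional[Tuple[str, int]]:
    """Return (gene name, distance) for the closest gene within `window`, else None."""
    candidates = [(distance_to_gene(pos, g), g.name)
                  for g in genes if g.chrom == chrom]
    candidates = [c for c in candidates if c[0] <= window]
    if not candidates:
        return None
    dist, name = min(candidates)  # smallest distance wins; ties broken by name
    return name, dist


# Made-up annotation records, for illustration only.
GENES = [
    Gene("GENE_A", "chr1", 100_000, 120_000),
    Gene("GENE_B", "chr1", 200_000, 260_000),
]
```

A variant with no gene inside the window yields `None`, which corresponds to the case where only the SNP identifier can be reported.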

Meta-analyses

For all polymorphisms with minor allele frequencies in healthy controls ≥ 1% for which case-control genotype data or dataset-specific ORs and 95% CIs are available in at least four independent datasets, summary ORs and 95% CIs are calculated based on the fixed-effect model. This procedure is performed including all studies irrespective of ethnicity (denoted by "All studies" on the meta-analysis figures, a.k.a. forest plots), and repeated after exclusion of candidate-gene studies in which a violation of Hardy-Weinberg Equilibrium (HWE) was detected at P < 0.05 in controls ("All excl HWE deviations"). In addition, meta-analyses stratified by ethnic descent are performed whenever data are available from at least three independent case-control datasets within each such stratum. Results of the meta-analyses are displayed for each polymorphism as forest plots. Overlapping samples (of which usually only the largest with available data is included), studies with missing data, or datasets violating HWE in controls are typically indicated on these graphs.
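As an illustration of the fixed-effect model referred to above, the following sketch pools study-specific ORs and 95% CIs with inverse-variance weights and reports the I2 heterogeneity statistic. It is a generic textbook formulation written under stated assumptions, not the scripts used on ALSGene: the function names are hypothetical, standard errors are recovered from the CIs assuming symmetry on the log scale, and I2 is computed from Cochran's Q.

```python
import math
from typing import Dict, List, Tuple

Z_95 = 1.959964   # two-sided 95% quantile of the standard normal distribution
MIN_DATASETS = 4  # minimum number of independent datasets stated in the text


def se_from_ci(lower: float, upper: float) -> float:
    """Standard error of log(OR), assuming a 95% CI symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def fixed_effect(studies: List[Tuple[float, float, float]]) -> Dict[str, float]:
    """Pool (OR, CI lower, CI upper) triples using inverse-variance weights."""
    if len(studies) < MIN_DATASETS:
        raise ValueError("at least four datasets are required")
    log_ors = [math.log(odds_ratio) for odds_ratio, _, _ in studies]
    weights = [1.0 / se_from_ci(lo, hi) ** 2 for _, lo, hi in studies]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total
    se = math.sqrt(1.0 / total)
    # Cochran's Q and the derived I2 (percentage of variation beyond chance)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Two-sided P-value from the normal approximation
    p_value = math.erfc(abs(pooled / se) / math.sqrt(2.0))
    return {
        "or": math.exp(pooled),
        "ci_lower": math.exp(pooled - Z_95 * se),
        "ci_upper": math.exp(pooled + Z_95 * se),
        "p": p_value,
        "i2": i2,
    }
```

Calling `fixed_effect` on a list of at least four `(OR, lower, upper)` triples returns the summary OR, its 95% CI, a P-value, and I2; stratified analyses would call the same function on the subset of datasets belonging to one stratum.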

Inclusion of genome-wide association studies (GWAS)

GWAS summaries: Data from GWAS and other large-scale studies are summarized in a dedicated section on ALSGene. This section distinguishes between "GWAS" (currently defined as a study investigating 10,000 or more independent polymorphisms) and "Other large-scale association studies" (investigating 500 or more polymorphisms). The latter category also includes re-analyses of previous GWAS using different analysis strategies, e.g. meta-analyses or pathway-based approaches. Besides summarizing the main characteristics of each large-scale study, tables in this section provide a hyperlinked list of "featured genes", i.e. loci highlighted by the primary authors as the main outcome of their study after having completed all analyses. Note that the criteria used to define "featured genes" may vary across publications.

Inclusion of partial or full GWAS datasets: The extent to which data from large-scale genotyping studies are included in ALSGene is based on the availability of the data. Whenever possible, individual-level data are obtained (e.g. via dbGaP), and study-specific effect size estimates are determined after data-cleaning and adjustment for age, sex, and population stratification. Alternatively, we use ORs and 95% CIs derived from additive transmission models supplied by the original investigators. If neither individual-level nor summary-level data are available, we include as much of the data reported in the primary publications as possible; typically, this information is limited to a set of "featured genes" or other loci of special interest highlighted by the authors. Generally, preference is given to additive ORs adjusted for population stratification as reported in the primary publications. If these are not available or not eligible, crude ORs are calculated from the provided genotype/allele summary data.
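One common way to obtain a crude OR from genotype or allele summary data is the allelic (allele 1 vs. allele 2) odds ratio with a Woolf-type confidence interval. The sketch below shows this calculation as an assumption about what "crude OR" may look like in practice; the function names are hypothetical and the text does not specify the exact formula used on ALSGene.

```python
import math
from typing import Tuple

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def allele_counts(n11: int, n12: int, n22: int) -> Tuple[int, int]:
    """Convert genotype counts (1/1, 1/2, 2/2) into counts of allele 1 and allele 2."""
    return 2 * n11 + n12, n12 + 2 * n22


def crude_allelic_or(case_a1: int, case_a2: int,
                     ctrl_a1: int, ctrl_a2: int) -> Tuple[float, float, float]:
    """Return (OR, CI lower, CI upper) for allele 1 vs. allele 2."""
    if min(case_a1, case_a2, ctrl_a1, ctrl_a2) <= 0:
        raise ValueError("all four allele counts must be positive")
    odds_ratio = (case_a1 * ctrl_a2) / (case_a2 * ctrl_a1)
    # Woolf's standard error of log(OR) from the 2x2 allele table
    se = math.sqrt(1 / case_a1 + 1 / case_a2 + 1 / ctrl_a1 + 1 / ctrl_a2)
    lower = math.exp(math.log(odds_ratio) - Z_95 * se)
    upper = math.exp(math.log(odds_ratio) + Z_95 * se)
    return odds_ratio, lower, upper
```

For example, genotype counts are first passed through `allele_counts` separately for cases and controls, and the four resulting allele counts are then passed to `crude_allelic_or`.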

Display of meta-analysis results including GWAS data

For the current version of ALSGene, GWAS summary data derived from 8 populations of European descent published in the work of Shatunov et al, 2010, have kindly been provided by the authors for inclusion in ALSGene.

Under some circumstances, large-scale genetic data can be abused to re-identify individuals participating in association studies. For this to succeed, a number of conditions must be met (see e.g. Homer et al, 2008, and references therein for an introduction to the topic). To prevent users from abusing data posted on ALSGene for this purpose, we have limited the number of more detailed meta-analysis results derived from GWAS data to a maximum of 10,000 results per GWAS. This number typically includes detailed results for polymorphisms that are most strongly associated with ALS risk based on the data available on ALSGene as well as for polymorphisms that have been reported in previous GWAS and non-GWAS studies. The remaining results, i.e. those derived from GWAS-only meta-analyses not falling into the "top 10,000" category, are displayed at substantially reduced precision, e.g. by binning P-values in categories minimally encompassing two orders of magnitude (see below), and by only providing the direction of effect (i.e. “>1” for risk effect estimates and “<1” for protective effect estimates of allele 1 [usually the minor allele in Caucasian populations] vs. reference allele 2 as indicated in the meta-analysis tables). To the best of our knowledge, this strategy effectively prevents attempts to re-identify individuals from individual GWAS combined in our meta-analyses. We will continuously review the scientific literature on the topic and will adjust our data sharing strategy accordingly should this be deemed necessary by us or the members of the scientific advisory board.

The following P-value bins have been defined for display of GWAS-only meta-analysis results not included in the “top 10,000” category:

“≥0.05” corresponds to P-values ≤1 and ≥0.05
“<0.05” corresponds to P-values <0.05 and ≥1×10⁻⁴
“<1×10⁻⁴” corresponds to P-values <1×10⁻⁴ and ≥1×10⁻⁶
“<1×10⁻⁶” corresponds to P-values <1×10⁻⁶ and ≥5×10⁻⁸
“<5×10⁻⁸” corresponds to P-values <5×10⁻⁸
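The binning rules listed above, together with the reduced display of the effect direction, can be expressed as a small sketch. The function names are hypothetical, the returned labels are plain-ASCII stand-ins for the labels shown on the website, and the handling of an OR of exactly 1 is an assumption, since the text does not cover that case.

```python
def bin_p_value(p: float) -> str:
    """Map an exact P-value to the reduced-precision bin described in the text."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P-value must lie between 0 and 1")
    if p >= 0.05:
        return ">=0.05"
    if p >= 1e-4:
        return "<0.05"
    if p >= 1e-6:
        return "<1e-4"
    if p >= 5e-8:
        return "<1e-6"
    return "<5e-8"


def effect_direction(odds_ratio: float) -> str:
    """Report only the direction of effect of allele 1 vs. reference allele 2."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    if odds_ratio > 1:
        return ">1"   # risk effect estimate
    if odds_ratio < 1:
        return "<1"   # protective effect estimate
    return "=1"       # assumption: not specified in the text
```

Because each bin boundary belongs to the less significant bin (e.g. a P-value of exactly 1e-4 is reported as "<0.05"), the comparisons use `>=` throughout.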

3. How to search the database

The universal search field located on ALSGene's homepage and in the main menu can be used to search for official gene names and their alias names, NCBI's official SNP identifiers ("rs-numbers"), chromosomal locations based on the current human genome build (GRCh37/hg19), PubMed IDs and (co-)author names of publications of interest. Chromosomal locations can be entered as entire chromosomes (e.g. "chr22"), chromosomal intervals (e.g. "chr1:10000-12000") or single base pair positions (e.g. "chr1:10001").
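To make the three accepted chromosomal location formats concrete, the following sketch parses such queries into a chromosome and an optional base-pair range. It is an illustration only: the `Region` type, the function name, and the set of accepted chromosome names (1 to 22, X, Y) are assumptions and do not describe the actual search implementation.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical parsed query; start/end are None for a whole chromosome."""
    chrom: str
    start: Optional[int]
    end: Optional[int]


# chrN, chrN:pos, or chrN:start-end (chromosome set is an assumption)
_LOCATION = re.compile(
    r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)(?::(\d+)(?:-(\d+))?)?$",
    re.IGNORECASE,
)


def parse_location(query: str) -> Optional[Region]:
    """Return a Region for a chromosomal query, or None if it is not one."""
    match = _LOCATION.match(query.strip())
    if match is None:
        return None
    chrom = "chr" + match.group(1).upper()
    if match.group(2) is None:
        return Region(chrom, None, None)      # entire chromosome
    start = int(match.group(2))
    end = int(match.group(3)) if match.group(3) else start  # single position
    if end < start:
        return None                           # reject inverted intervals
    return Region(chrom, start, end)
```

Queries that do not match, such as gene names or rs-numbers, return `None` and would be handled by other branches of a search routine.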

4. The "Top Results" list

In an effort to facilitate the identification of the most promising meta-analysis results available in ALSGene, a continuously updated list displaying the most strongly associated variants/loci (Top Results) is available on the website. This list includes genes/loci that contain at least one variant showing a statistically significant (P < 5×10⁻⁴) summary OR in the meta-analyses of all studies or of those stratified for ethnicity (e.g. "Caucasian"). While we believe that the top results list represents an up-to-date summary of particularly promising ALS candidate genes, we cannot exclude the possibility that some/many of these "Top Results" may still represent false-positive findings.

In the "Top Results" list, loci are ranked based on statistical significance (P-value). For loci with more than one polymorphism showing statistically significant association, ranking is based on the most significant meta-analysis result per locus. Of note, SNPs that map into intergenic regions are annotated to the nearest gene (the distance is indicated in brackets behind the gene name) within an interval of +/- 50 kb; in cases where this region does not contain any known gene, only the SNP identifier ("rs-number") is listed.
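The ranking rule described above (one entry per locus, represented by its most significant variant, ordered by P-value) can be sketched as follows. The function name and the tuple layout are hypothetical; the significance threshold is the one quoted for the "Top Results" list.

```python
from typing import Dict, List, Tuple

THRESHOLD = 5e-4  # significance threshold quoted for the "Top Results" list

Result = Tuple[str, str, float]  # (locus, variant identifier, meta-analysis P-value)


def top_results(results: List[Result], threshold: float = THRESHOLD) -> List[Result]:
    """Keep the most significant variant per locus and rank loci by that P-value."""
    best: Dict[str, Result] = {}
    for locus, variant, p in results:
        if p >= threshold:
            continue  # not significant, so it cannot place a locus on the list
        if locus not in best or p < best[locus][2]:
            best[locus] = (locus, variant, p)
    return sorted(best.values(), key=lambda row: row[2])


# Made-up meta-analysis results, for illustration only.
EXAMPLE = [
    ("GENE_A", "variant_1", 1e-5),
    ("GENE_A", "variant_2", 1e-7),
    ("GENE_B", "variant_3", 2e-6),
    ("GENE_C", "variant_4", 0.01),
]
```

With the example data, `GENE_A` is ranked first on the strength of `variant_2`, and `GENE_C` is omitted because its only variant does not pass the threshold.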

5. Mendelian ALS genes and C9ORF72

Note that the ALSGene database does not include or display reports of rare causal mutations leading to ALS of classic Mendelian inheritance (such as disease-causing mutations in SOD1). Instead, it focuses on genetic association studies assessing common polymorphisms, here defined as those with a minor-allele frequency of ≥1% in at least one of the studied control datasets and/or the 1000 Genomes database (see above). For a collection of Mendelian ALS genes, please consult the ALSoD database, curated by the Institute of Psychiatry at King's College London, and the ALS mutation database, curated by the University of Tokyo. Note that bona fide genetic association studies on common polymorphisms in any of the established Mendelian ALS genes will still be included and meta-analyzed for ALSGene.

In this context, it should be noted that the association signal observed on chromosome 9p21.2 appears to be elicited entirely by the presence of a causal hexanucleotide repeat expansion in C9ORF72 showing Mendelian segregation with disease (see e.g. publications by DeJesus-Hernandez et al, 2011, and Renton et al, 2011). GWAS datasets included in ALSGene show genome-wide significant (P < 5×10⁻⁸) association between polymorphisms in this region and ALS risk owing to an effect called "synthetic association". However, since the underlying genetic cause of this association signal can be attributed entirely to the pathological hexanucleotide repeat expansion in C9ORF72, association studies assessing common polymorphisms in or near this region are not included or listed in ALSGene beyond available GWAS data.

6. UCSC genome-browser

In addition to being summarized directly on ALSGene, all meta-analysis P-values are also displayed on a customized UCSC genome browser track that allows users to compare the location of the region of interest with the many other features available on this genome browser. This custom track can be accessed by searching the ALSGene database for a genetic term (a gene name, e.g. “C9ORF72”; an rs-number, e.g. “rs3849942”; or a chromosomal region, e.g. “chr21:33031935-33041243” using hg19 coordinates) and then clicking the “Browse” button on the ALSGene homepage. Furthermore, there are additional instances on ALSGene (e.g. chromosomal annotations for rs-numbers or genetic regions) which are cross-linked to the custom track on the genome browser.

7. The “MyMeta” function

Users can also create customized meta-analyses using the same meta-analysis scripts implemented throughout ALSGene. For instance, this functionality allows users to add their own genetic data (i.e. genotype summary data or additive odds ratios and standard errors, e.g. from an as yet unpublished genetic association study) to the meta-analyses already existing on ALSGene, thereby updating the overarching summary result. Alternatively, users can exclude specific studies already existing on ALSGene and recalculate the summary OR. Meta-analyses performed in the "MyMeta" application are calculated following the same approach as for all other meta-analyses on ALSGene, i.e. they are based on a fixed-effect model assuming additive effect estimates. Please note that these meta-analyses are generated "on the fly" and user data submitted to ALSGene will not be stored or tracked internally at any time. We explicitly encourage users to utilize the resulting meta-analysis forest plots as part of their own presentations and/or publications provided they adhere to the citing rules (see How to cite us).
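The effect of adding or excluding a study can be sketched as a simple recalculation of the summary OR. This is a hypothetical illustration rather than the "MyMeta" code itself: the function names, the dictionary layout, and the example values are assumptions, and each study is represented by its log odds ratio and standard error.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

Study = Tuple[float, float]  # (log odds ratio, standard error of the log odds ratio)
Z_95 = 1.959964              # two-sided 95% quantile of the standard normal


def summary_or(studies: Dict[str, Study]) -> Tuple[float, float, float]:
    """Fixed-effect (inverse-variance) summary OR with its 95% CI."""
    if not studies:
        raise ValueError("no studies left to pool")
    weights = {name: 1.0 / se ** 2 for name, (_, se) in studies.items()}
    total = sum(weights.values())
    pooled = sum(weights[name] * beta for name, (beta, _) in studies.items()) / total
    se = math.sqrt(1.0 / total)
    return (math.exp(pooled),
            math.exp(pooled - Z_95 * se),
            math.exp(pooled + Z_95 * se))


def my_meta(existing: Dict[str, Study],
            added: Optional[Dict[str, Study]] = None,
            excluded: Iterable[str] = ()) -> Tuple[float, float, float]:
    """Recompute the summary OR after dropping and/or adding studies."""
    dropped = set(excluded)
    selected = {name: s for name, s in existing.items() if name not in dropped}
    selected.update(added or {})  # user-supplied data are used only for this call
    return summary_or(selected)


# Made-up study results, for illustration only.
EXISTING = {
    "study_1": (math.log(1.2), 0.1),
    "study_2": (math.log(1.2), 0.1),
    "study_3": (math.log(2.0), 0.1),
}
```

For instance, `my_meta(EXISTING, excluded=["study_3"])` recalculates the summary without the third study, while `my_meta(EXISTING, added={"my_study": (math.log(2.0), 0.1)})` adds a user-supplied estimate to the existing ones.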

8. Abbreviations/conventions used throughout ALSGene

1000G = data derived from the 1000 Genomes project

A = Asian ethnicity

A vs. B (allele contrast) = indicates direction of effect, where A, usually the minor allele, is the allele to which the reported odds ratio (protective or risk effect) refers and B is the reference allele with an odds ratio of 1

C = Caucasian ethnicity

CEU = Utah residents with ancestry from northern and western Europe (one HapMap population used as part of the 1000 Genomes project)

CHB = Han Chinese in Beijing, China (one HapMap population used as part of the 1000 Genomes project)

CI = confidence interval

D = African descent

hg19 = human genome build 19 (GRCh37/hg19, most current genome build)

I² = amount of heterogeneity between study-specific results that is beyond chance

JPT = Japanese in Tokyo, Japan (one HapMap population used as part of the 1000 Genomes project)

M = Other/mixed

OR = odds ratio

SNP = single-nucleotide polymorphism

References

DeJesus-Hernandez M, Mackenzie IR, Boeve BF, Boxer AL, Baker M, et al. Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS. Neuron. 2011;72(2):245-56.

Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4(8):e1000167.

Lill CM, Abel O, Bertram L, Al-Chalabi A. Keeping up with genetic discoveries in amyotrophic lateral sclerosis: the ALSoD and ALSGene databases. Amyotroph Lateral Scler. 2011;12(4):238-49.

Lill CM, Bertram L. Developing the "next generation" of genetic association databases for complex diseases. Hum Mutat. 2012;33(9):1366-72.

Lill CM, Roehr JT, McQueen MB, Kavvoura FK, Bagade S, et al. Comprehensive research synopsis and systematic meta-analyses in Parkinson's disease genetics: The PDGene database. PLoS Genet. 2012;8(3):e1002548.

Renton AE, Majounie E, Waite A, Simón-Sánchez J, Rollinson S, et al. A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD. Neuron. 2011;72(2):257-68.

Shatunov A, Mok K, Newhouse S, Weale ME, Smith B, et al. Chromosome 9p21 in sporadic amyotrophic lateral sclerosis in the UK and seven other countries: a genome-wide association study. Lancet Neurol. 2010;9(10):986-94.