Categorizing proteins has a long history, primarily for understanding taxonomy and genetic relatedness through biosystematics. More specialized groupings help understand related structure and function and the evolution of that relatedness across organisms. With increasing exposure to novel foods and feeds through greater global distribution and through the use of biotechnology (GM crops), it became necessary to better understand similarities of allergenic proteins across organisms and how newly discovered allergens might be related to those with which we were already familiar. As a result, growing attention was given in particular to organizing allergenic proteins into structurally and functionally related protein groups, all contained within an accessible database. An effort in this area was initiated in 2016 by the ILSI Health and Environmental Sciences Institute (HESI) Protein Allergenicity Technical Committee (PATC) to develop a new allergen sequence data resource for public access: the COMprehensive Protein Allergen REsource (COMPARE) database.
A unique aspect of allergens is the phenomenon of clinically relevant cross-reactivity, in which allergic responses occur when IgE antibodies (those responsible for the allergic cascade) recognize proteins other than the original sensitizing allergen due to shared structural features. Most allergens belong to a handful of structural protein families (Ferreira et al., 2004). Structural similarities at both the primary (amino acid sequence) and tertiary (folding) levels confer cross-reactivity to proteins from diverse sources. Cross-reactive proteins share tertiary structures, but proteins with shared tertiary structures do not necessarily cross-react. Changes in primary structure by amino acid substitutions may not interfere with protein folding but may completely disrupt the antibody binding epitope. At 25% amino acid shared identity, protein folding may be similar, but clinically relevant cross-reactivity is quite rare below 50% identity and often requires greater than 70% identity (Aalberse, 2000).
A key to understanding cross-reactivity from a sequence-based perspective is identifying a minimum degree of similarity between a query protein and a known allergen that can prompt further investigation. To that end, there were attempts at conveying a minimum similarity level based on what was known of some allergens in the late 1990s and 2000s. Global guidelines on a minimum similarity have evolved in several ways: 1) the suggestion by Metcalfe et al. in 1996 that identical segments of 6 or 8 contiguous amino acids would suffice to determine a protein’s allergenicity, 2) FAO/WHO’s recommendation to use the threshold of greater than 35% identity for 80 amino acid sliding window segments aligning to an allergen (Codex Alimentarius, 2003; FAO/WHO, 2001), and 3) the proposed use of the E-value of the FASTA or BLAST full-length alignment to predict overall structural similarities to a known allergen (Silvanovich et al., 2009; Cressman and Ladics, 2009; Mirsky et al., 2013; Song et al., 2014, 2015). These three criteria are used in the current in silico process of assessing a protein’s potential allergenicity, and their validity has been discussed in various reviews (König et al., 2004; Goodman et al., 2005; Goodman et al., 2008; Goodman and Tetteh, 2011; Vaughan et al., 2013; Song, 2015). While the scientific merits of these criteria are still being evaluated and discussed, they all rely on a common source: a database containing protein sequences of all currently recognized allergens. The need for such a database was recognized while better hazard assessment strategies for novel proteins were being considered (Gendel, 1998), and it remains a critical piece in interpreting allergen bioinformatic data.
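As a rough illustration of the first two criteria, the Python sketch below finds identical contiguous segments shared by two sequences, enumerates 80-residue sliding windows, and applies the "greater than 35%" cutoff to an identity count. It is not the FAO/WHO or Codex procedure itself: in practice each window is aligned to the allergen set with FASTA or BLAST. The function names and the use of exact k-mer matching are assumptions made only for this example.

```python
def shared_kmers(query, allergen, k=8):
    """Return the identical contiguous segments of length k present in both sequences."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return query_kmers & allergen_kmers


def sliding_windows(query, size=80):
    """Yield (start, segment) for every window of `size` residues.

    A sequence no longer than `size` yields one window holding the whole sequence.
    """
    if len(query) <= size:
        yield 0, query
        return
    for start in range(len(query) - size + 1):
        yield start, query[start:start + size]


def exceeds_identity_threshold(identical, window=80, threshold_pct=35):
    """True when identical residues exceed threshold_pct percent of the window length."""
    # Integer arithmetic avoids floating-point surprises at exactly 35%.
    return identical * 100 > threshold_pct * window
```

With an 80-residue window, 28 identical residues is exactly 35% and does not exceed the threshold, whereas 29 does.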
It is important to identify cross-reactivity potential for novel proteins as previously sensitized individuals may be at risk for allergic responses to a food or protein source not known to cause allergy, such as a food which contains a transgenic protein from another source organism. A compilation of allergens with known amino acid sequences allows for a screening process whereby proteins with unknown allergenic potential can be compared to known allergens as a group. The sequence comparison is crucial for the bioinformatic analysis of potential allergenicity, in silico, before a novel food is made commercially available. The goal is to identify proteins that are structurally similar enough to a known allergen to warrant further experimental investigation into their potential to be cross-reactive.
A well-curated allergen database provides two main functions. First, it provides an up-to-date repository that reflects allergen discovery and current knowledge of organisms (animals, plants, etc.) that produce allergens. Second, the database supports allergy science by two distinct, but related processes: 1) the database allows for a comparative process to help identify potential similarity among two or more proteins and allergens using tools of bioinformatics, and 2) the database also allows identification of the source organism for any listed allergen, and the informatics sequence comparison can assess the level of taxonomic relatedness among the organisms producing the allergens/proteins.
Maintaining the up-to-date status of allergen databases has become more challenging as genomic sequencing technology has become widespread. The number of sequences to be filtered when searching for new potential allergens has grown exponentially, making it imperative that modern data handling methods be utilized to sort and identify information relevant for food safety. The COMPARE database project aims to implement a precise, accurate and consistent mechanism to identify protein sequences that are known or putative allergens via:
- The development of a cutting-edge, high-throughput, automated sequence sorting algorithm;
- The identification and collection of scientific literature and publications related to identified individual sequences;
- The coordinated review of the collected information by an external peer review panel of recognized allergy experts from the public sector, who will provide the final decision on sequence inclusion in the database;
- A public release of the database, updated annually by repeating steps 1-3 to capture newly identified allergens. The update will be under the review of the external peer review panel that will reconvene, as needed, for this purpose;
- Independent quality control and documentation practices that will ensure the integrity of the database and transparency of the process used to populate the database.
In addition to implementing a process that ensures independent external evaluations by public-sector experts, transparency is critical to the quality and scientific integrity of this tool. As such, all database design and search algorithm decisions will be publicly documented on the COMPARE database website.
There are many constituencies that may be interested in this type of database: product developers providing safety information on novel proteins; regulatory agencies responsible for food and feed safety assessments; medical personnel in the allergy field; the public, who may be interested in identifying sources of allergens. For example, prior to placing GM crops on the market an extensive safety evaluation using a weight of evidence approach is performed by the developers. Part of this safety assessment process aims to avoid the transfer of putative allergens into food. Bioinformatics tools such as the FASTA algorithm are utilized to evaluate the degree of similarity between novel proteins introduced into a crop and known allergens. These analyses are performed in early product development phases and serve as a tool to guide further evaluation if significant sequence similarity is identified. Access to a transparent and consensus-based allergen database is a key aspect for supporting public safety.
For regulatory agencies, reliable resources to assess the safety of novel proteins expressed in foods are essential for confidence in the safety determination and the ability to communicate those findings to the public. The internationally recognized standard for safety assessment of foods derived from biotechnology is the Codex Alimentarius (2009), which recommends that novel proteins expressed in foods be analyzed for any amino acid sequence similarities to all known allergens. The rapid expansion of annotated protein sequences has led to a problem in accessing and using these public database resources for producing an allergen screen. The COMPARE database is one of a handful of compiled allergen sequence databases which facilitates a focused allergy safety analysis; in contrast, comparative sequence analyses that are more broadly focused on biosystematics or general taxonomy may use the entire database of known protein sequences (e.g., GenBank, SwissProt). Using an automated approach for the initial scan of potential allergens allows for a transparent and fully documented process, and subsequent verification of the clinical relevance of these sequences by a panel of scientific experts in allergology provides validation that the entries in COMPARE are known or putative allergens, allowing the database to be a reliable resource (e.g., for regulatory purposes). Overall, the transparent and peer-reviewed method used to generate the COMPARE database makes it easier for regulatory agencies to communicate the output of any screens to the public.
Ferreira, F, Hawranek, T, Gruber, P, Wopfner, N and Mari, A. (2004). Allergic cross-reactivity: from gene to the clinic. Allergy, 59: 243–267.
Aalberse, RC. (2000). Structural Biology of Allergens. Journal of Allergy and Clinical Immunology, 106: 228–238.
Codex Alimentarius Commission. (2003). Alinorm 03/34: Joint FAO/WHO Food Standard Programme, Codex Alimentarius Commission, Twenty-Fifth Session, Rome, 30 June– 5 July, 2003. Appendix III, Guideline for the conduct of food safety assessment of food derived from recombinant-DNA plants and Appendix IV, Annex on the assessment of possible allergenicity, pp. 47–60.
Metcalfe, DD, et al. (1996). Assessment of the allergenic potential of foods derived from genetically modified crop plants. Crit. Rev. Food Sci. Nutr. 36(S), 165–186.
FAO/WHO. (2001). Evaluation of allergenicity of genetically modified foods. Report of a joint FAO/WHO expert consultation on allergenicity of foods derived from biotechnology. Food and Agriculture Organization of the United Nations (FAO), Rome.
Silvanovich, A, Bannon, G, McClain, S. (2009). The use of E-scores to determine the quality of protein alignments. Regulatory Toxicology and Pharmacology 54: S26–S31.
Cressman, R, and Ladics, GS. (2009). Further evaluation of the utility of "sliding window" FASTA in predicting cross-reactivity with allergenic proteins. Reg. Toxicol. Pharmacol., 54:S20-S25.
Mirsky, HP, Cressman, RF, and Ladics, GS. (2013). Comparative assessment of multiple criteria for the in silico prediction of allergenic cross-reactivity. Reg. Toxicol. Pharmacol., 67:232-239.
Song, P, Herman, R, Kumpatla, S. (2014). Evaluation of global sequence comparison and one-to-one FASTA local alignment in regulatory allergenicity assessment of transgenic proteins in food crops. Food and Chemical Toxicology. Volume 71, September 2014, Pages 142–148.
Song, P. (2015). Bioinformatics application in regulatory assessment for potential allergenicity of transgenic proteins in food crops. In: Genetically modified organisms in food, production, safety, regulation and public health. Editors: Ronald Ross Watson and Victor R. Preedy. Academic Press, Elsevier Inc.
König, A, Cockburn, A, Crevel, RW, Debruyne, E, Grafstroem, R, Hammerling, U, Kimber, I, Knudsen, I, Kuiper, HA, Peijnenburg, AA, Penninks, AH, Poulsen, M, Schauzu, M, Wal, JM. (2004). Assessment of the safety of foods derived from genetically modified (GM) crops. Food Chem Toxicol. 42(7):1047-88.
Goodman, RE, Hefle, SL, Taylor, SL, van Ree, R. (2005). Assessing genetically modified crops to minimize the risk of increased food allergy: a review. Int Arch Allergy Immunol. 137(2):153-66.
Goodman, RE, Vieths, S, Sampson, HA, Hill, D, Ebisawa, M, Taylor, SL, van Ree, R. (2008). Allergenicity assessment of genetically modified crops--what makes sense? Nat Biotechnol. 26(1):73-81.
Goodman, RE and AO Tetteh. (2011). Suggested improvements for the allergenicity assessment of genetically modified plants used in foods. Curr Allergy Asthma Rep. 11(4):317-24.
Vaughan, K, Peters, B, Larche, M, Pomes, A, Broide, D, Sette, A. (2013). Strategies to query and display allergy-derived epitope data from the immune epitope database. Int Arch Allergy Immunol. 160(4):334-454.
Song, P, Herman, R, Kumpatla, S. (2015). 1:1 FASTA update: Using the power of E-values in FASTA to detect potential allergen cross-reactivity. Toxicology Reports, 12:1145–1148.
Gendel, SM. (1998). The use of amino acid sequence alignments to assess potential allergenicity of proteins used in genetically modified foods. Adv Food Nutr Res, 42, 45-62.
- A sustainable solution using cutting-edge, high-throughput, automated sequence sorting algorithms to screen an exponentially growing number of protein sequences for those that are potential allergens. Expertise in bioinformatics is employed.
- Oversight and coordination by the ILSI Health and Environmental Sciences Institute (HESI), an independent and objective non-profit organization dedicated to advancing the understanding of scientific issues related to human health and the environment. HESI is recognized over the world as a leader in developing and communicating science-based solutions.
- Steering by a multi-sector scientific committee, in which regulatory agency representation (EPA and FDA) is viewed as critical.
- Transparent criteria used by independent allergy experts to facilitate consensus on inclusion of allergens in the database.
- Inclusion of only sequences with documented allergenicity, maintaining a definitive list of sequences with clear allergenic status.
- Documentation of the search and filtering methods used to build each version of the database.
A process was developed to separate out the majority of the sequences [referred to as “entries” in the National Center for Biotechnology Information (NCBI) GenBank database(s)] that are unlikely to be allergens, and to retain entries with some presumption of being potential allergens for review by the expert panel. This is because only a minute proportion of proteins have been identified as allergens, with almost 60% of these belonging to four protein families (Taylor and Hefle, 2001; Breiteneder and Radauer, 2004). As of the April 2, 2017 release, the NCBI non-redundant protein sequences (NR) database contains 118,550,801 sequences. In contrast, the 2017 version of the COMPARE allergen database contains 1970 protein sequences, or 0.0017% of the sequences in NCBI NR. The COMPARE process included two steps, described in detail in subsequent paragraphs. First, a keyword-based, automatic screening filter was executed that examined text-based descriptive features of the entries that strongly indicate that they are extremely unlikely to be valid allergen sequences. Second, manual examinations were performed to further refine the set of entries obtained through filtering.
The automatic screening filter was developed using data obtained from pilot searches of NCBI as accessed through the main query window located at: https://www.ncbi.nlm.nih.gov/ using the following Boolean search:
“allerg* AND [time period: from Jan 1, 2014 to Dec 31, 2014 or Jan 1, 2015 to Dec 31, 2015] AND [species: animals, plants, fungi and Protists]”.
The pilot searches resulted in the identification of 186,715 protein sequence entries, which were downloaded in GenBank GenPept format. The GenPept record is a complete entry of all annotated protein sequences. The initial record search term “allerg*” is very broad and can be found anywhere in the record. For example, if the sequence submitter was from an “Institute of Allergy”, this would be identified by the search term “allerg*” even if the protein sequence in the record was not itself an allergen. During development of the keywords and their order in the process, keywords for filtering were selected through an iterative observation-based process, using contextual clues such as the sequence’s source organism, whether the sequence was submitted through an automated annotation pipeline (e.g., genome sequencing projects), whether “allerg*” appeared in a protein definition description, and feature lines such as ‘/note = “allergenic/antifungal thaumatin-like proteins”’ (see Figure 1).
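To make the point about where “allerg*” can appear more concrete, the following Python sketch reports which top-level sections of a GenPept-style flat-file record contain a match. The sample record is invented for illustration (accession, organism, and author are not real), and the parsing relies only on the flat-file convention that top-level keywords start in the first column. It is a simplified sketch, not the COMPARE filter.

```python
import re

# Invented GenPept-style record, for illustration only.
SAMPLE_RECORD = """\
LOCUS       TOY00001                 160 aa            linear   PLN 01-JAN-2015
DEFINITION  hypothetical protein [Example plantus].
SOURCE      Example plantus
  ORGANISM  Example plantus
REFERENCE   1  (residues 1 to 160)
  AUTHORS   Doe,J.
  JOURNAL   Submitted (01-JAN-2015) Institute of Allergy, Example University
FEATURES             Location/Qualifiers
     Region          5..150
                     /note="allergenic/antifungal thaumatin-like proteins"
"""


def sections_matching(record_text, pattern=r"allerg\w*"):
    """Map each top-level section name to the lines within it that match `pattern`."""
    hits = {}
    section = None
    for line in record_text.splitlines():
        # Top-level keywords start in column 0; indented lines belong to the current section.
        if line and not line[0].isspace():
            section = line.split()[0]
        if section and re.search(pattern, line, flags=re.IGNORECASE):
            hits.setdefault(section, []).append(line.strip())
    return hits
```

In this toy record the term matches in the REFERENCE section (the submitter's institute) and in a FEATURES note, but not in the DEFINITION line, which is the kind of contextual distinction the keyword rules exploit.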
At each step, a moderate number of entries (about 100) were randomly selected and evaluated, and between two and five keyword-based rules were identified, each capable of distinguishing between potential allergen entries (those that have at least a minimal chance of being determined to be allergens by the expert panel) and non-allergen entries (those that are extremely unlikely to be determined to be allergens) – this was part of the manual review that made up the development of the final rules-based algorithm. These rules were evaluated and compared in their efficiency to eliminate non-allergen entries and the potential for overly aggressive exclusion that might lead to undesirable omission of potential allergen entries. Of the two to five rules evaluated, the rule with the highest efficiency and lowest potential for overly aggressive exclusion was selected; when more than one rule had comparably high efficiency, preference was given to the one with lower chance of omitting potential allergen entries. Then, the procedure continued to the next step, where another batch of ~100 entries was randomly selected and evaluated and another set of rules was identified and compared. The procedure continued until the number of retained entries was reduced to a manageable size (400-600).
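The rule comparison described above could be sketched in Python as follows. The scoring definitions (efficiency as the fraction of non-allergen entries a rule excludes, and omissions as the count of potential allergen entries it would wrongly exclude), the `tolerance` used to decide that two efficiencies are "comparable", and all names are assumptions for illustration; the source does not specify numerical definitions.

```python
def score_rule(rule, sample):
    """Return (efficiency, omitted) for one exclusion rule on a manually labeled sample.

    sample: list of (entry, is_potential_allergen) pairs.
    efficiency: fraction of non-allergen entries the rule excludes.
    omitted: number of potential allergen entries the rule would wrongly exclude.
    """
    non_allergens = [entry for entry, keep in sample if not keep]
    potential = [entry for entry, keep in sample if keep]
    excluded = sum(1 for entry in non_allergens if rule(entry))
    omitted = sum(1 for entry in potential if rule(entry))
    efficiency = excluded / len(non_allergens) if non_allergens else 0.0
    return efficiency, omitted


def select_rule(rules, sample, tolerance=0.05):
    """Pick the most efficient rule; among comparably efficient rules prefer fewer omissions."""
    scores = {name: score_rule(rule, sample) for name, rule in rules.items()}
    best = max(eff for eff, _ in scores.values())
    comparable = [name for name, (eff, _) in scores.items() if eff >= best - tolerance]
    # Fewer omissions wins first; higher efficiency breaks any remaining tie.
    return min(comparable, key=lambda name: (scores[name][1], -scores[name][0]))
```

Each rule here is a function that returns True when an entry should be excluded, so the same predicates can later be chained into the filter itself.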
The final automatic keyword-based filter includes 13 steps and 28 elements (i.e., decision steps/points) (Figure 1). Some of the rules included in the final filter are based on the source species of the entries, e.g., at elements 2 and 4, where entries with source species "Homo sapiens" or "Arabidopsis thaliana" were excluded, because human and Arabidopsis proteins are extremely unlikely to induce allergic reactions. Other rules identify features indicative of high-throughput studies where evidence of allergic reaction would not be presented. For example, at element 6, the presence of keyword “BioProject” in the DBLink field is a strong indication that the entries resulted from a high-throughput study (which could be one of several types, including genome sequencing and assembly, metagenome, transcriptome sequencing and expression, targeted locus sequencing, etc.). Interestingly, most of the rules selected at later stages of the filter (on and following element 12) were defined based on a few frequently occurring single-line annotation texts. For example, when the line “allergen of white birch (Betula verrucosa), Bet v 1, and” appears verbatim in the “Region” section under the “Features” field in a particular entry (see element 12, Figure 1), we have good confidence that this entry does not include specific evidence adequate for the expert panel to assess allergenic status; this line was added to the entry via an automatic annotation pipeline in a genome sequencing project. Bet v 1 and a number of its homologues are well-known allergens and are already represented in allergen databases (such as AllergenOnline) to an extent that an inadvertent omission due to this filter would be unlikely to materially add to an already robust list of known allergens.
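A filter of this kind can be pictured as an ordered cascade of exclusion rules. The Python sketch below includes only the three rules named in the text and runs on simplified dictionary records; the field names (`organism`, `dblink`, `region_notes`), the rule labels, and the record layout are hypothetical, and the real filter has 28 elements whose details are given in Figure 1.

```python
EXCLUDED_SPECIES = {"Homo sapiens", "Arabidopsis thaliana"}
PIPELINE_LINE = "allergen of white birch (Betula verrucosa), Bet v 1, and"

# Ordered rules; each predicate returns True when the entry should be excluded.
RULES = [
    ("excluded source species", lambda e: e.get("organism") in EXCLUDED_SPECIES),
    ("BioProject in DBLINK", lambda e: "BioProject" in e.get("dblink", "")),
    ("pipeline Bet v 1 annotation", lambda e: PIPELINE_LINE in e.get("region_notes", "")),
]


def apply_filter(entries):
    """Run entries through the rules in order; return (retained, removed_by_rule)."""
    retained = []
    removed = {name: [] for name, _ in RULES}
    for entry in entries:
        for name, rule in RULES:
            if rule(entry):
                # Record which rule removed the entry so the run can be documented.
                removed[name].append(entry["accession"])
                break
        else:
            retained.append(entry["accession"])
    return retained, removed
```

Recording the rule responsible for each removal is one simple way to support the documentation and later review that the process calls for.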
The final automatic keyword-based filter was tested by entering GenPept formatted sequences from another allergen database, AllergenOnline (AOL) version 16 (http://www.allergenonline.com/). The results indicated that almost all the sequences containing “allerg*” in their annotation were classified as potential candidates, as expected. This testing also found that some sequences in AOL version 16 (GenPept format) do not contain “allerg*” in their annotation. Manual examination revealed that those sequences are typically associated with allergens such as profilin, tropomyosin, etc., indicating that some allergen candidates would not be captured by searching the NCBI non-redundant protein sequences for “allerg*”. To avoid excluding an allergen candidate without “allerg*” in its annotation in the database updating process, additional searches with specific keywords are required to supplement the “allerg*” search of the NCBI non-redundant protein sequences. These supplemental searches were performed for the 2017 COMPARE database build; the specific keywords used are listed in Table 1 of “Construction of the 2017 database”.
Taylor, SL, and Hefle, SL. (2001). Will genetically modified foods be allergenic? Journal of Allergy and Clinical Immunology, 107 (5): 765-771.
Breiteneder, H and Radauer, C. (2004). A Classification of Plant Food Allergens. Journal of Allergy and Clinical Immunology, 113 (5): 821 - 830.
Process for evaluating candidate entries
Once candidate entries are identified, they are provided to a peer review panel along with the literature associated with each candidate in NCBI in order to support the evaluation of allergenicity. The peer review panel is an international group of academic and clinical allergy experts renowned for their research and expertise in areas such as the nature of protein allergenicity and molecular characterization of allergens, allergenic cross-reactivity, immunologic mechanisms of allergy development, immunotherapy, allergen diagnostics and component-resolved diagnosis. The panel serves as an independent decision-making body, which reviews the published literature associated with the candidate allergen and determines whether the candidate has enough supporting evidence of allergenicity to be included in the database. The identified candidate and all allergy-relevant available literature associated with it are delivered to the peer review panel via organization and participation tracking software developed and hosted by JIFSAN for the purpose of facilitating and documenting the review. Each candidate with its associated literature is independently reviewed by two panelists. A disparate (or uncertain) categorization of the candidates by the two reviewers triggers a discussion by the panel on whether the evidence is sufficiently robust for categorizing the candidate as an allergen (see “Criteria for inclusion or exclusion of candidate proteins in the database”).
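The dual-review rule can be expressed in a few lines of Python. The three category labels used here ("include", "exclude", "uncertain") are hypothetical stand-ins; the source does not list the exact categories recorded by the review software.

```python
VALID_CATEGORIES = {"include", "exclude", "uncertain"}  # hypothetical labels


def needs_panel_discussion(review_a, review_b):
    """True when two independent reviews disagree or either reviewer is uncertain."""
    for review in (review_a, review_b):
        if review not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {review!r}")
    return review_a != review_b or "uncertain" in (review_a, review_b)
```

Under this sketch, only candidates that both reviewers confidently place in the same category bypass the full-panel discussion.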
The peer review panel has developed a set of criteria as guidance to determine the quality of the data presented in the literature, upon which a decision to include or exclude a candidate in/from the database can be based. These criteria are consistent with those widely adopted for inclusion of sequences in other allergen databases (Breiteneder and Chapman, 2014; Goodman et al., 2014, 2016). They were developed to be straightforward, with the understanding that in some instances discussion of data quality among panel members will be necessary to reach consensus. The minimum criterion adopted for inclusion of a candidate is peer-reviewed evidence of IgE binding, in either published literature or other peer-reviewed documentation. This criterion is considered to be highly conservative, with no requirement for demonstration of IgE functionality (e.g., cross-linking for degranulation of IgE-bearing cells).
The evaluation of evidence for IgE binding includes consideration of the experimental approach, the quality of sera, and characterization of the IgE binding itself. For the experimental approach, whether the assay is well-established or validated is a key consideration. If an assay is not well-established, an appropriate and well-described design is necessary. For all assays, the inclusion of appropriate negative controls is a prerequisite. Assessment of serum quality considers whether sera are from patients with proven sensitization and/or reported allergy to the source of the candidate protein sequence. If not, sera from patients proven allergic to a source likely to be cross-reactive with the source of the candidate entry are a viable option. For example, if the source of the candidate sequence is from a legume for which no allergic patient sera are available, sera from patients allergic to known allergenic legumes such as peanut, soy or lupine would be appropriate. Ideally, sera from patients with a convincing history of allergy, particularly challenge-proven allergy, to the source of the candidate entry will add to quality of the data, but for demonstration of IgE binding this is not a requirement.
When considering the characterization of the IgE binding itself, both the attributes of the patient sera and the quality of the protein are taken into account. If there is reason to believe that sera with high total IgE levels were used (e.g., from patients with atopic dermatitis or with parasite infections), then appropriate high total IgE control sera need to be included. If there is indication that IgE binding may exclusively be directed toward glycan moieties on the tested protein, i.e. cross-reactive carbohydrate determinants (CCD) and/or galactose-alpha-1,3-galactose (αGAL), other proof of specific IgE binding to the protein backbone needs to be provided. For example, inhibition studies would be necessary to exclude the possibility that IgE is directed only toward glycan moieties on the protein. Exclusive IgE binding to CCD is not considered to be a reason to include a candidate sequence since such binding has little to do with the protein sequence (Holzweber et al., 2013). If IgE binding is demonstrated using natural purified proteins, the purity and the impact of potential contamination with traces of known allergens need to be addressed. Moreover, natural glycoproteins and recombinant glycoproteins expressed in eukaryotic expression systems may carry CCD or αGAL, and IgE binding to these may produce false positive results, as discussed above. Demonstration of IgE binding to E. coli derived recombinant proteins, free from contamination with allergens and lacking glycosylation, helps reduce the risk of unjustified designations of candidate sequences as allergens.
The criteria for exclusion of a candidate sequence were also developed. These include lack of appropriate negative control sera, serious doubts about sufficient protein purity, and probable IgE binding to carbohydrate determinants in the absence of evidence that pre-absorption or inhibition with a homologous non-glycosylated peptide or polypeptide decreases IgE binding to the protein with appropriate controls. Alternatively, the demonstration of full inhibition of IgE binding by relevant carbohydrate structures (CCD/αGAL) would indicate a lack of relevance for the protein sequence and merit exclusion of that sequence from the database, unless new evidence implicating the amino acid sequence itself were to become available. Candidate sequences rejected by the panel for lack of evidence are retained and revisited once new evidence becomes available.
In some instances, sequences are tagged as allergens because they are homologues of known allergens. Therefore they will be retrieved in the search and filtering, but will not be included in the database without evidence of allergenicity associated with the specific protein sequence. All sequences must have experimental evidence of IgE binding to be designated as allergens; homology alone does not qualify a sequence for inclusion. Sequences meeting the minimum criteria for database inclusion are treated as one group, without distinguishing any sequences as “putative” or “proven” allergens based on the level of available evidence.
A robust search for newly identified allergens needs to cast a broader net than a year’s worth of NCBI entries for two main reasons. First, an allergen may have been identified but not submitted to NCBI. This is likely to occur when the allergen discovery is extremely recent. Second, there may be older entries in NCBI for which new evidence of allergenicity may have become available. A solution to both of these situations is a thorough search of the scientific literature each year to capture new evidence. In addition, organizations with a focus on allergens such as the International Union of Immunological Societies (IUIS) can also be excellent sources for newly identified allergens. Candidates identified via these means are submitted to the peer review panel for evaluation.
Breiteneder, H, and Chapman, MD. (2014). Chapter 3: Allergen Nomenclature. In: Allergens and Allergen Immunotherapy: Subcutaneous, Sublingual and Oral, 5th Edition. Editors: Richard F. Lockey and Dennis K. Ledford. CRC Press. 37–50.
Goodman, R, van Ree, R, Vieths, S, et al. (2014). Criteria used to categorise proteins as allergens for inclusion in allergenonline.org: a curated database for risk assessment. Clinical and Translational Allergy, 4(Suppl 2):P12. doi:10.1186/2045-7022-4-S2-P12.
Goodman, RE, Ebisawa, M, Ferreira, F, Sampson, HA, van Ree, R, Vieths, S, Baumert, JL, Bohle, B, Lalithambika, S, Wise, J, and Taylor, SL. (2016), AllergenOnline: A peer-reviewed, curated allergen database to assess novel food proteins for potential cross-reactivity. Mol. Nutr. Food Res., 60: 1183–1198. doi:10.1002/mnfr.201500769.
Holzweber, F, Svehla, E, Fellner, W, Dalik, T, Stubler, S, Hemmer, W, Altmann, F. (2013), Inhibition of IgE binding to cross-reactive carbohydrate determinants enhances diagnostic selectivity. Allergy, 68: 1269–1277.
Figure 3: COMPARE 2017 process
Construction of the first version of the database (2017) occurred in two main steps. The first was identifying potential sequences for inclusion in the database by keyword filtering approaches using a transparent, validated filtering algorithm (see “Pilot searches and automatic screening filter development”). The second step was the consideration of these sequences and their associated published literature by the peer review panel, using criteria they have developed to determine whether a sequence should be included in the database (see “Peer review of candidate sequences”).
Candidate entries for the 2017 COMPARE database were obtained from the National Center for Biotechnology Information (NCBI) as accessed through the main query window located at: https://www.ncbi.nlm.nih.gov/ on May 14, 2016 using the following Boolean search:
“allerg* AND [time period: from May 30, 2015 to May 14, 2016] AND [species: animals, plants, fungi and Protists]”.
The search resulted in the identification of 55,641 protein sequence entries. In addition to the search using “allerg*”, secondary searches of the NCBI non-redundant protein database using specific keywords such as “profilin”, “tropomyosin”, etc. (see Table 1) were performed to ensure that protein sequences without “allerg*” in the sequence annotation but which are potentially members of known allergen families would not be excluded. The search was done by replacing “allerg*” in the above-mentioned Boolean search with each of the defined specific keywords, resulting in the identification of 15,704 sequences. All the candidate sequences derived from the above searches were downloaded in GenBank GenPept format and subjected to a keyword-based filtering process.
Table 1. Specific keywords for supplemental searches
|Defined keywords used in searches to identify protein sequences without “allerg*”|
|2S albumin||lipid transfer protein|
|beta conglycinin||proteinase inhibitor|
|calcium binding protein||serine protease|
For the 2017 COMPARE database, the great majority of the entries from the “allerg*” search were removed at step 6, having been identified as products of automated annotation pipelines in genome sequencing studies (Figure 2). Entries resulting from automated annotation pipelines do not include qualified evidence of protein expression or lines of evidence related to allergy. Following the automatic keyword-based filtering, 568 of the 55,641 entries from the “allerg*” search (about 1%) were retained, so the filter as a whole removed about 99% of them. From the other specific keyword searches retrieving 15,704 sequences, filtering using the same steps through element 10 resulted in 65 sequences. Each of the entries was then manually examined. For some of the entries, confident determination could be reached that they did not contain adequate evidence for the expert panel to reach a conclusive determination: some of the entries lacked valid references, others included references describing experiments that are not germane to allergic reactions (e.g., reporting only crystal structures of the proteins and no clinical evidence). The manual removal of these entries was documented for later review, if needed. Redundant (identical) sequences were removed as well, resulting in 251 candidate entries eligible for review by the peer review panel. Of these, 43 were qualified by the panel to be included in the database (see “Peer review of candidate entries” for an overview of this evaluation process). Once redundancies were removed, the final number of allergens for inclusion in the database was 14. For the 2017 COMPARE database, these newly identified allergen sequences were integrated with the entirety of version 16 (year 2016) of the AllergenOnline.org database, a set of 1956 allergen sequences identified through the efforts of the Food Allergy Research and Resource Program (FARRP) (Goodman et al., 2016). An overview of the 2017 COMPARE database formation process is shown in Figure 3.
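The counts reported for the 2017 build can be tallied with a short Python check. All numbers are taken from the text; the variable names are chosen only for this example.

```python
# Counts reported for the 2017 COMPARE build.
allerg_hits, allerg_retained = 55_641, 568      # "allerg*" search, before/after filter
keyword_hits, keyword_retained = 15_704, 65     # supplemental keyword searches
manually_examined = allerg_retained + keyword_retained
eligible_for_review = 251                       # after manual removal and de-duplication
accepted_by_panel = 43
new_allergens = 14                              # after removing redundancies
allergenonline_v16 = 1_956
ncbi_nr_total = 118_550_801                     # NCBI NR, April 2, 2017 release

compare_2017_total = allergenonline_v16 + new_allergens
pct_allerg_retained = 100 * allerg_retained / allerg_hits
pct_keyword_retained = 100 * keyword_retained / keyword_hits
pct_of_nr = 100 * compare_2017_total / ncbi_nr_total

print(f"entries examined manually: {manually_examined}")
print(f"'allerg*' entries retained by filter: {pct_allerg_retained:.2f}%")
print(f"keyword-search entries retained by filter: {pct_keyword_retained:.2f}%")
print(f"COMPARE 2017 sequences: {compare_2017_total}")
print(f"share of NCBI NR: {pct_of_nr:.4f}%")
```

The total of 1970 sequences and the 0.0017% share of NCBI NR agree with the figures quoted earlier for the 2017 release.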
A search of the literature to identify allergens not yet entered into NCBI (or entered earlier than the search period) is under way (see “Alternative sourcing of newly identified allergens”). A version 17.1 of the COMPARE database will be released if additional allergens are added following this process.
Goodman, RE, Ebisawa, M, Ferreira, F, Sampson, HA, van Ree, R, Vieths, S, Baumert, JL, Bohle, B, Lalithambika, S, Wise, J, and Taylor, SL. (2016). AllergenOnline: A peer-reviewed, curated allergen database to assess novel food proteins for potential cross-reactivity. Mol. Nutr. Food Res., 60: 1183–1198. doi:10.1002/mnfr.201500769
A sequence database used to identify potential food safety hazards retains its utility only if it contains the most recently identified allergens. Keeping the COMPARE database current will involve an annual download from NCBI that captures any sequences (entries) added to NCBI since the prior search. It is expected that the COMPARE keyword algorithm will continue to evolve each year, improving the sensitivity and accuracy of the search. Annual updates of the database will follow the same process of filtering and peer review of entries. An annual update cycle allows new allergens to be communicated in a timely fashion while still leaving room for a thorough, high-quality review process. Newly selected candidates are integrated with existing allergens in November and December of the current year to support a release of the updated database in January of the following year. Searches must be conducted early enough in the year to allow sufficient time for a high-quality review. Consequently, relevant entries appearing in NCBI later in the year may not be captured in that year’s process, but would enter review in the following year.
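One way to restrict an annual download to entries added since the prior search is a date-limited query against NCBI's E-utilities ESearch service. The sketch below only builds the request URL and does not contact NCBI. It assumes that ESearch is a suitable access route (the text does not state how the download is performed), and the function name and the dates in the usage example are hypothetical.

```python
from datetime import date
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_update_query(term: str, last_search: date, this_search: date) -> str:
    """Build an ESearch URL for protein entries dated between two searches."""
    params = {
        "db": "protein",
        "term": term,
        "datetype": "pdat",  # publication date; "mdat" would use modification date
        "mindate": last_search.strftime("%Y/%m/%d"),
        "maxdate": this_search.strftime("%Y/%m/%d"),
        "rettype": "count",  # ask only for the number of hits
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical search window, for illustration only.
    print(build_update_query("allerg*", date(2016, 1, 1), date(2016, 12, 31)))
```

Whether publication date or modification date is the better filter depends on whether revised entries should re-enter review, which is a curation decision rather than a technical one.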
In the future, a more dynamic interface will be implemented to allow online searching through a public web portal. The current database is available as a static file that can be downloaded and searched, which requires some specific expertise and the implementation of appropriate tools on the part of the user. In the future, however, users will be able to compare a selected protein sequence to the database using a built-in feature on the website. The comparison software of choice has traditionally been the FASTA algorithm (Pearson and Lipman, 1988), which is recommended for assessing similarity between protein sequences. A FASTA comparison utility on the website would let users assess similarity in real time without downloading the database and installing software on separate computers or servers. Additionally, a sort and search function for the database listing will be available. This utility will support easy identification of organisms, allergen names, and other terms that describe an allergen or group of allergens listed in COMPARE.
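Until such a portal exists, a user working with the downloaded file can approximate the sort and search function locally. The following is a minimal sketch, assuming the static file is in FASTA format (a ">" header line followed by sequence lines); the function names are hypothetical, and the two records in the usage example are invented placeholders rather than COMPARE entries. It searches header text only and is not a substitute for a FASTA sequence comparison.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-format lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def search_headers(records, keyword):
    """Return records whose header contains `keyword`, sorted by header."""
    needle = keyword.lower()
    return sorted((h, s) for h, s in records if needle in h.lower())


if __name__ == "__main__":
    # Invented placeholder records, for illustration only.
    example = [
        ">entry_B placeholder protein [Organism two]",
        "MKTAYIAK",
        "QRQISFVK",
        ">entry_A placeholder protein [Organism one]",
        "MSTNPKPQ",
    ]
    for header, sequence in search_headers(read_fasta(example), "organism"):
        print(header, len(sequence))
```

To search a real download, pass an open file object to `read_fasta` in place of the list of strings.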
The COMPARE database builds upon an approximately 10-year history of similar efforts to review and catalog allergens at the University of Nebraska’s Food Allergy Research and Resource Program (FARRP; www.allergenonline.org). The COMPARE process supports inclusion of those allergens identified in the most recent one-year update period, with the new sequences added to the 2016 list of sequences originally sponsored through FARRP. Because of differences in allergen screening processes and the dramatic increase in the size of the NCBI database, the COMPARE processes will be applied not only to one year’s sequences but also to a comprehensive re-screen of all possible protein allergen sequences. This will ultimately result in a re-build of the full listing of sequences in the 2017 version of COMPARE. The goal is to apply a standardized and well-documented review to all potential allergens using the COMPARE criteria for inclusion.
Standardized nomenclature for allergens in COMPARE
The primary source of entries in the COMPARE database, and of the associated annotations, is the publicly available NCBI protein database (https://www.ncbi.nlm.nih.gov/protein). Part of that annotation is the text description of a protein’s name(s). For allergens, the officially recognized naming convention is key to understanding related proteins across species. Standardized naming conventions, as promoted by the International Union of Immunological Societies (IUIS; http://www.allergen.org/index.php), therefore help reduce the use of non-standard allergen names and descriptions. However, research on allergens, and the naming that accompanied it, predates IUIS, and many allergens were named according to various non-standard conventions. As a result, some protein names do not indicate that the proteins are allergens, and in some cases an allergen has two or more legacy names, which complicates identifying the source organism and the most recent accepted name. A refinement of allergen nomenclature will be undertaken for allergen entries with multiple names, or with designations such as “un-named”, “unknown”, or “putative”. The COMPARE management team expects the editing of allergen names to current standards to be an ongoing quality improvement process. The goal will be to alleviate confusion over multiple names and to make clearer the link between the publications that support a protein’s inclusion in the database as an allergen and the sequences themselves.
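Entries that are candidates for this nomenclature refinement could be flagged automatically before manual editing. The sketch below is illustrative only: the function name is hypothetical, the input is assumed to be the list of names recorded for one entry, and the matching rule (more than one distinct name, or a name containing one of the designations quoted above) is an assumption, not the COMPARE procedure.

```python
# Designations quoted in the text as triggers for nomenclature review.
PLACEHOLDER_TERMS = ("un-named", "unknown", "putative")


def needs_name_review(names):
    """Return True if an entry has several names or a placeholder designation."""
    # Normalize case and whitespace so trivial variants count as one name.
    distinct = {name.strip().lower() for name in names if name.strip()}
    if len(distinct) > 1:
        return True
    return any(term in name for name in distinct for term in PLACEHOLDER_TERMS)


if __name__ == "__main__":
    # Invented placeholder names, for illustration only.
    print(needs_name_review(["putative allergen"]))
    print(needs_name_review(["Name A", "Legacy name B"]))
    print(needs_name_review(["Name A", "name a "]))
```

A flag of this kind only selects entries for attention; choosing the accepted name still requires checking the IUIS listing and the supporting publications.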
A note on sequence identification. Historically, NCBI has used unique numeration to track entries and versioning of entries. Entries currently carry both accession numbers and GI sequence identification numbers (GenInfo Identifiers). Future versions of the database will use only the alpha-numeric based accession numbers due to NCBI’s phase-out of GI identifiers that began in March 2017 (NCBI-GenBank release notes, October 15, 2016).
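For users who hold older files, the accession can be separated from the GI number in the legacy NCBI FASTA header style, in which the header begins with fields of the form gi|number|database|accession|. The sketch below assumes that header layout; the function name is hypothetical and the identifiers in the usage example are placeholders. Headers that do not match the legacy pattern are treated as already starting with an accession.

```python
import re

# Legacy NCBI header prefix: gi|<digits>|<db tag>|<accession[.version]>|
LEGACY_HEADER = re.compile(r"^>?gi\|(\d+)\|[A-Za-z]+\|([A-Za-z0-9_]+(?:\.\d+)?)\|")


def split_identifiers(defline):
    """Return (gi_number, accession) parsed from a FASTA header line.

    gi_number is None when the header carries no GI field; accession is None
    when the header is empty.
    """
    match = LEGACY_HEADER.match(defline)
    if match:
        return int(match.group(1)), match.group(2)
    tokens = defline.lstrip(">").split(maxsplit=1)
    return None, (tokens[0] if tokens else None)


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(split_identifiers(">gi|12345|gb|AAA00000.1| example protein"))
    print(split_identifiers(">AAA00000.1 example protein"))
```

Keying records on the accession and version, rather than on the GI number, keeps a local copy usable as the GI field disappears from newer downloads.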
Pearson, WR, and Lipman, DJ. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85: 2444–2448.