COMPARE Process
Categorizing proteins has a long history, primarily for understanding taxonomy and genetic relatedness through biosystematics. More specialized groupings help understand related structure and function and the evolution of that relatedness across organisms. With increasing exposure to novel foods and feeds through greater global distribution and through the use of biotechnology (GM crops), it became necessary to better understand similarities of allergenic proteins across organisms and how newly discovered allergens might be related to those with which we were already familiar. As a result, growing attention was given in particular to organizing allergenic proteins into structurally and functionally related protein groups, all contained within an accessible database. An effort in this area was initiated in 2016 by the ILSI Health and Environmental Sciences Institute (HESI) Protein Allergenicity Technical Committee (PATC) to develop a new allergen sequence data resource for public access: the COMprehensive Protein Allergen REsource (COMPARE) database.
A unique aspect of allergens is the phenomenon of clinically relevant cross-reactivity, in which allergic responses occur when IgE antibodies (those responsible for the allergic cascade) recognize proteins other than the original sensitizing allergen due to shared structural features. Most allergens belong to a handful of structural protein families (Ferreira et al., 2004). Structural similarities at both the primary (amino acid sequence) and tertiary (folding) levels confer cross-reactivity to proteins from diverse sources. Cross-reactive proteins share tertiary structures, but proteins with shared tertiary structures do not necessarily cross-react: amino acid substitutions in the primary structure may leave protein folding intact yet completely disrupt the antibody-binding epitope. At 25% shared amino acid identity, protein folding may be similar, but clinically relevant cross-reactivity is quite rare below 50% identity and often requires greater than 70% identity (Aalberse, 2000).
A key to understanding cross-reactivity from a sequence-based perspective is identifying a minimum degree of similarity between a query protein and a known allergen that should prompt further investigation. To that end, several attempts were made in the late 1990s and 2000s to define a minimum similarity level based on what was then known of some allergens. Global guidelines on a minimum similarity have evolved in several ways: 1) the suggestion by Metcalfe et al. (1996) that identical segments of 6 or 8 contiguous amino acids would suffice to determine a protein’s allergenicity, 2) the FAO/WHO recommendation to use a threshold of greater than 35% identity over 80-amino-acid sliding window segments aligning to an allergen (FAO/WHO, 2001; Codex Alimentarius, 2003), and 3) the proposed use of the E-value of the full-length FASTA or BLAST alignment to predict overall structural similarity to a known allergen (Silvanovich et al., 2009; Cressman and Ladics, 2009; Mirsky et al., 2013; Song et al., 2014, 2015). These three criteria are used in the current in silico process of assessing a protein’s potential allergenicity, and their validity has been discussed in various reviews (König et al., 2004; Goodman et al., 2005; Goodman et al., 2008; Goodman and Tetteh, 2011; Vaughan et al., 2013; Song, 2015). While the scientific merits of these criteria are still being evaluated and discussed, they all rely on a common source: a database containing the protein sequences of all currently recognized allergens. The need for such a database was recognized while better hazard assessment strategies for novel proteins were being considered (Gendel, 1998), and it remains a critical piece in interpreting allergen bioinformatic data.
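As a rough illustration of the first two criteria, the Python sketch below finds identical contiguous segments shared by a query and an allergen sequence, and checks an identity count against the greater-than-35%-over-80-residues threshold. It is a minimal sketch under stated assumptions, not the method used to build or search COMPARE: the function names `shared_segments` and `exceeds_window_threshold` and the toy sequences are hypothetical, and in a real assessment the number of identical residues in a window would come from a FASTA or BLAST alignment rather than being supplied by hand.

```python
from typing import Dict, List, Tuple


def shared_segments(query: str, allergen: str, k: int = 8) -> List[Tuple[int, int, str]]:
    """Return (query_pos, allergen_pos, segment) for every identical k-residue segment.

    Positions are 0-based. k = 6 or k = 8 corresponds to the contiguous-match
    criterion described above; the function name is illustrative only.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")

    # Index every k-residue segment of the allergen by its sequence.
    index: Dict[str, List[int]] = {}
    for j in range(len(allergen) - k + 1):
        index.setdefault(allergen[j:j + k], []).append(j)

    # Look up each k-residue segment of the query in that index.
    hits: List[Tuple[int, int, str]] = []
    for i in range(len(query) - k + 1):
        segment = query[i:i + k]
        for j in index.get(segment, []):
            hits.append((i, j, segment))
    return hits


def exceeds_window_threshold(identical: int, window: int = 80, threshold: float = 35.0) -> bool:
    """True if `identical` matching residues in a `window`-residue segment exceed `threshold` percent.

    Assumption: `identical` is taken from an alignment produced elsewhere
    (for example by FASTA); this helper only applies the threshold.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    return 100.0 * identical / window > threshold


if __name__ == "__main__":
    # Toy sequences, invented for illustration; they are not real allergens.
    query = "MKTAYIAKQRQISFVK"
    allergen = "GGTAYIAKQRQIPP"
    for q_pos, a_pos, segment in shared_segments(query, allergen, k=8):
        print(f"query[{q_pos}] matches allergen[{a_pos}]: {segment}")
    print(exceeds_window_threshold(29))  # 29 identical residues out of 80
```

Indexing the allergen's segments in a dictionary keeps the lookup fast when the same query is screened against many allergen sequences. The threshold helper uses a strict comparison because the recommendation is worded as "greater than 35%".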
It is important to identify cross-reactivity potential for novel proteins, as previously sensitized individuals may be at risk for allergic responses to a food or protein source not known to cause allergy, such as a food that contains a transgenic protein from another source organism. A compilation of allergens with known amino acid sequences allows for a screening process whereby proteins with unknown allergenic potential can be compared to known allergens as a group. This in silico sequence comparison is crucial for the bioinformatic analysis of potential allergenicity before a novel food is made commercially available. The goal is to identify proteins that share enough structural similarity with a known allergen to warrant further experimental investigation into their potential to be cross-reactive.
A well-curated allergen database provides two main functions. First, it provides an up-to-date repository that reflects allergen discovery and current knowledge of the organisms (animals, plants, etc.) that produce allergens. Second, the database supports allergy science through two distinct but related processes: 1) it allows a comparative process, using the tools of bioinformatics, to help identify potential similarity among two or more proteins and allergens, and 2) it allows identification of the source organism for any listed allergen, so that the sequence comparison can also assess the level of taxonomic relatedness among the organisms producing the allergens/proteins.
Maintaining the up-to-date status of allergen databases has become more challenging as genomic sequencing technology has become widespread. The number of sequences to be filtered when searching for new potential allergens has grown exponentially, making it imperative that modern data handling methods be utilized to sort and identify information relevant for food safety. The COMPARE database project aims to implement a precise, accurate and consistent mechanism to identify protein sequences that are known or putative allergens via:
- The development of a cutting-edge, high-throughput, automated sequence sorting algorithm;
- The identification and collection of scientific literature and publications related to identified individual sequences;
- The coordinated review of the collected information by an external peer review panel of recognized allergy experts from the public sector, who will provide the final decision on sequence inclusion in the database;
- A public release of the database, updated annually by repeating the three steps above to capture newly identified allergens. Each update will be reviewed by the external peer review panel, which will reconvene as needed for this purpose;
- Independent quality control and documentation practices that will ensure the integrity of the database and transparency of the process used to populate the database.
In addition to implementing a process that ensures independent external evaluations by public-sector experts, transparency is critical to the quality and scientific integrity of this tool. As such, all database design and search algorithm decisions will be publicly documented on the COMPARE database website.
There are many constituencies that may be interested in this type of database: product developers providing safety information on novel proteins; regulatory agencies responsible for food and feed safety assessments; medical personnel in the allergy field; and members of the public who may be interested in identifying sources of allergens. For example, prior to placing GM crops on the market, developers perform an extensive safety evaluation using a weight-of-evidence approach. Part of this safety assessment aims to avoid the transfer of putative allergens into food. Bioinformatics tools such as the FASTA algorithm are used to evaluate the degree of similarity between novel proteins introduced into a crop and known allergens. These analyses are performed in early product development phases and serve to guide further evaluation if significant sequence similarity is identified. Access to a transparent and consensus-based allergen database is a key aspect of supporting public safety.
For regulatory agencies, reliable resources to assess the safety of novel proteins expressed in foods are essential for confidence in the safety determination and the ability to communicate those findings to the public. The internationally recognized standard for safety assessment of foods derived from biotechnology is the Codex Alimentarius (2009), which recommends that novel proteins expressed in foods be analyzed for any amino acid sequence similarities to all known allergens. The rapid expansion of annotated protein sequences has made it difficult to access and use general public sequence databases to produce an allergen screen. The COMPARE database is one of a handful of compiled allergen sequence databases that facilitate a focused allergy safety analysis; in contrast, comparative sequence analyses that are more broadly focused on biosystematics or general taxonomy may use the entire database of known protein sequences (e.g., NCBI Protein, UniProt). Using an automated approach for the initial scan of potential allergens allows for a transparent and fully documented process. Subsequent verification of the clinical relevance of these sequences by a panel of scientific experts in allergology validates that the entries in COMPARE are known or putative allergens, allowing the database to be a reliable resource (e.g., for regulatory purposes). Overall, the transparent and peer-reviewed method used to generate the COMPARE database makes it easier for regulatory agencies to communicate the output of any screens to the public.
References:
Ferreira, F, Hawranek, T, Gruber, P, Wopfner, N and Mari, A. (2004). Allergic cross-reactivity: from gene to the clinic. Allergy, 59: 243–267.
Aalberse, RC. (2000). Structural Biology of Allergens. Journal of Allergy and Clinical Immunology, 106: 228-38.
Codex Alimentarius Commission. (2003). Alinorm 03/34: Joint FAO/WHO Food Standard Programme, Codex Alimentarius Commission, Twenty-Fifth Session, Rome, 30 June– 5 July, 2003. Appendix III, Guideline for the conduct of food safety assessment of food derived from recombinant-DNA plants and Appendix IV, Annex on the assessment of possible allergenicity, pp. 47–60.
Metcalfe, DD, et al. (1996). Assessment of the allergenic potential of foods derived from genetically modified crop plants. Crit. Rev. Food Sci. Nutr. 36(S), 165–186.
FAO/WHO. (2001). Evaluation of allergenicity of genetically modified foods. Report of a joint FAO/WHO expert consultation on allergenicity of foods derived from biotechnology. Food and Agriculture Organization of the United Nations (FAO), Rome.
Silvanovich, A, Bannon, G, McClain, S. (2009). The use of E-scores to determine the quality of protein alignments. Regulatory Toxicology and Pharmacology 54: S26–S31.
Cressman, R, and Ladics, GS. (2009). Further evaluation of the utility of "sliding window" FASTA in predicting cross-reactivity with allergenic proteins. Reg. Toxicol. Pharmacol., 54:S20-S25.
Mirsky, HP, Cressman, RF, and Ladics, GS. (2013). Comparative assessment of multiple criteria for the in silico prediction of allergenic cross-reactivity. Reg. Toxicol. Pharmacol., 67:232-239.
Song, P, Herman, R, Kumpatla, S. (2014). Evaluation of global sequence comparison and one-to-one FASTA local alignment in regulatory allergenicity assessment of transgenic proteins in food crops. Food and Chemical Toxicology, 71: 142–148.
Song, P. (2015). Bioinformatics application in regulatory assessment for potential allergenicity of transgenic proteins in food crops. In: Genetically modified organisms in food, production, safety, regulation and public health. Editors: Ronald Ross Watson and Victor R. Preedy. Academic Press, Elsevier Inc.
König, A, Cockburn, A, Crevel, RW, Debruyne, E, Grafstroem, R, Hammerling, U, Kimber, I, Knudsen, I, Kuiper, HA, Peijnenburg, AA, Penninks, AH, Poulsen, M, Schauzu, M, Wal, JM. (2004). Assessment of the safety of foods derived from genetically modified (GM) crops. Food Chem Toxicol. 42(7):1047-88.
Goodman, RE, Hefle, SL, Taylor, SL, van Ree, R. (2005). Assessing genetically modified crops to minimize the risk of increased food allergy: a review. Int Arch Allergy Immunol. 137(2):153-66.
Goodman, RE, Vieths, S, Sampson, HA, Hill, D, Ebisawa, M, Taylor, SL, van Ree, R. (2008). Allergenicity assessment of genetically modified crops--what makes sense? Nat Biotechnol. 26(1):73-81.
Goodman, RE and AO Tetteh. (2011). Suggested improvements for the allergenicity assessment of genetically modified plants used in foods. Curr Allergy Asthma Rep. 11(4):317-24.
Vaughan, K, Peters, B, Larche, M, Pomes, A, Broide, D, Sette, A. (2013). Strategies to query and display allergy-derived epitope data from the immune epitope database. Int Arch Allergy Immunol. 160(4):334-45.
Song, P, Herman, R, Kumpatla, S. (2015). 1:1 FASTA update: Using the power of E-values in FASTA to detect potential allergen cross-reactivity. Toxicology Reports, 2:1145–1148.
Gendel, SM. (1998). The use of amino acid sequence alignments to assess potential allergenicity of proteins used in genetically modified foods. Adv Food Nutr Res, 42, 45-62.
Key features of the COMPARE database project:
- A sustainable solution, built on bioinformatics expertise, that uses cutting-edge, high-throughput, automated sequence sorting algorithms to screen an exponentially growing number of protein sequences for those that are potential allergens.
- Oversight and coordination by the Health and Environmental Sciences Institute (HESI), an independent and objective non-profit organization dedicated to advancing the understanding of scientific issues related to human health and the environment. HESI is recognized worldwide as a leader in developing and communicating science-based solutions.
- Steering by a multi-sector scientific committee; regulatory agency representation (EPA and FDA) is viewed as critical.
- Transparent criteria used by independent allergy experts to facilitate consensus on inclusion of allergens in the database.
- Inclusion of only sequences with documented allergenicity, maintaining a definitive list of sequences with clear allergenic status.
- Documentation of the search and filtering methods used to build each version of the database.
The construction of the COMPARE database and the process of identifying allergenic protein sequences are described in the paper below:
The COMPARE Database: A Public Resource for Allergen Identification, Adapted for Continuous Improvement. van Ree et al. Frontiers in Allergy. (August 2021)