COMPARE Process
Categorizing proteins has a long history, primarily for understanding taxonomy and genetic relatedness through biosystematics. More specialized groupings have helped elucidate related structure and function and the evolution of those relationships across organisms. With increasing exposure to novel foods and feeds due to greater global distribution and advancements in biotechnology (e.g., genetically modified crops), the need to better understand the similarities of allergenic proteins across organisms became evident. Additionally, there was a growing need to identify how newly discovered allergens relate to those already known. As a result, attention increasingly focused on organizing allergenic proteins into structurally and functionally related groups within an accessible database. In 2016, the ILSI Health and Environmental Sciences Institute (HESI) Protein Allergenicity Technical Committee (PATC) initiated an effort to develop a new allergen sequence data resource for public access: the COMprehensive Protein Allergen REsource (COMPARE) database.
A unique aspect of allergens is the phenomenon of clinically relevant cross-reactivity, in which allergic responses occur when IgE antibodies (responsible for triggering the allergic cascade) recognize proteins other than the original sensitizing allergen due to shared structural features. Most allergens belong to a limited number of structural protein families (Ferreira et al., 2004). Structural similarities at both the primary (amino acid sequence) and tertiary (folding) levels confer cross-reactivity among proteins from diverse sources. While cross-reactive proteins share tertiary structures, proteins with shared tertiary structures do not necessarily cross-react. Changes in primary structure through amino acid substitutions may not affect protein folding but can completely disrupt the antibody-binding epitope. At 25% amino acid sequence identity, protein folding may be similar, but clinically relevant cross-reactivity is rare below 50% identity and often requires greater than 70% identity (Aalberse, 2000).
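The identity figures above can be made concrete with a small sketch. It assumes two sequences that have already been aligned (gaps written as `-`) and computes percent identity over the columns where neither sequence has a gap; that denominator is only one of several conventions in use, and the function names and band labels are illustrative rather than part of any COMPARE tooling. The bands simply restate the 50% and 70% figures cited from Aalberse (2000) and are not a clinical decision rule.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over ungapped columns of two pre-aligned sequences.

    Assumption: both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def identity_band(pid: float) -> str:
    """Map a percent identity onto the bands described in the text (illustrative)."""
    if pid < 50.0:
        return "below 50%: clinically relevant cross-reactivity is rare"
    if pid <= 70.0:
        return "50-70%: intermediate"
    return "above 70%: range in which cross-reactivity is more often observed"


if __name__ == "__main__":
    pid = percent_identity("ACDE", "ACDF")
    print(pid, identity_band(pid))
```

Note that a high percent identity is a prompt for further investigation only; as the paragraph above explains, a single substitution inside an epitope can abolish IgE binding even when overall identity is high.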
A key to understanding cross-reactivity from a sequence-based perspective is identifying the minimum degree of similarity between a query protein and a known allergen that warrants further investigation. Efforts to establish a minimum similarity threshold began in the late 1990s and continued into the 2000s. Global guidelines for these thresholds have evolved in several ways:
- Metcalfe et al., 1996: Suggested that identical segments of 6 or 8 contiguous amino acids could suffice to determine a protein’s allergenicity.
- FAO/WHO 2001; Codex Alimentarius 2003: Recommended using a threshold of greater than 35% identity over an 80-amino-acid sliding window segment aligned to a known allergen.
- Silvanovich et al., 2009; Cressman and Ladics, 2009; Mirsky et al., 2013; Song et al., 2014, 2015: Proposed the use of FASTA/BLAST E-value scores from full-length sequence alignments to predict structural similarities to known allergens.
These three criteria are employed in the current in silico processes for assessing a protein’s potential allergenicity, and their validity has been discussed extensively in various reviews (König et al., 2004; Goodman et al., 2005; Goodman et al., 2008; Goodman and Tetteh, 2011; Vaughan et al., 2013; Song, 2015). While the scientific merits of these criteria continue to be evaluated, they all rely on a common foundation: a comprehensive database containing protein sequences of all currently recognized allergens. The need for such a database was first recognized during efforts to improve hazard assessment strategies for novel proteins (Gendel, 1998), and it remains a critical tool for interpreting allergen bioinformatic data.
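The first two criteria listed above lend themselves to a short illustration. The sketch below checks for shared contiguous segments of identical residues (6 or 8 long) and for more than 35% identity within an 80-residue window. One important assumption: the window comparison here is ungapped and position-by-position, which is a deliberate simplification. Assessments of this kind normally align each window with a tool such as FASTA, which allows gaps, so this code will not reproduce those results. All function names are hypothetical.

```python
def shared_kmers(query: str, allergen: str, k: int = 8) -> set[str]:
    """Return every contiguous k-residue segment that occurs identically in both sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return query_kmers & allergen_kmers


def best_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Highest percent identity between any ungapped window-length segment pair.

    Simplification: segments are compared residue by residue without gaps.
    Returns 0.0 when either sequence is shorter than the window.
    """
    best_matches = 0
    for i in range(len(query) - window + 1):
        q_seg = query[i:i + window]
        for j in range(len(allergen) - window + 1):
            a_seg = allergen[j:j + window]
            matches = sum(1 for x, y in zip(q_seg, a_seg) if x == y)
            best_matches = max(best_matches, matches)
    return 100.0 * best_matches / window


def exceeds_window_threshold(query: str, allergen: str,
                             window: int = 80, threshold: float = 35.0) -> bool:
    """True when some window pair is strictly above the identity threshold."""
    return best_window_identity(query, allergen, window) > threshold


if __name__ == "__main__":
    print(shared_kmers("ACDEFGHIKL", "XXCDEFGHIKXX", k=8))
    print(best_window_identity("ACDE", "ACDF", window=4))
```

The nested loop is quadratic in sequence length, which is acceptable for a single pair of proteins but not for screening against a full allergen database; that is one practical reason dedicated alignment tools are used instead.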
It is important to identify cross-reactivity potential for novel proteins because previously sensitized individuals may be at risk for allergic responses to a food or protein source not known to cause allergy, such as a food that contains a transgenic protein from another source organism. A compilation of allergens with known amino acid sequences allows for a screening process whereby proteins with unknown allergenic potential can be compared to known allergens as a group. This in silico sequence comparison is crucial for the bioinformatic analysis of potential allergenicity before a novel food is made commercially available. The goal is to identify proteins that are structurally similar enough to a known allergen to warrant further experimental investigation into their potential for cross-reactivity.
A well-curated allergen database provides two main functions. First, it provides an up-to-date repository that reflects allergen discovery and current knowledge of the organisms (animals, plants, etc.) that produce allergens. Second, the database supports allergy science through two distinct but related processes: 1) it allows a comparative process, using bioinformatics tools, to help identify potential similarity between a protein and known allergens, and 2) it allows identification of the source organism for any listed allergen, so that the sequence comparison can also assess the level of taxonomic relatedness among the organisms producing the allergens/proteins.
Maintaining the up-to-date status of allergen databases has become more challenging as genomic sequencing technology has become widespread. The number of sequences to be filtered when searching for new potential allergens has grown exponentially, making it imperative that modern data handling methods be utilized to sort and identify information relevant for food safety. The COMPARE database project aims to implement a precise, accurate and consistent mechanism to identify protein sequences that are known or putative allergens via:
- The development of a cutting-edge, high-throughput, automated sequence sorting algorithm;
- The identification and collection of scientific literature and publications related to identified individual sequences;
- The coordinated review of the collected information by an external peer review panel of recognized allergy experts from the public sector, who will provide the final decision on sequence inclusion in the database;
- A public release of the database, updated annually by repeating the first three steps above to capture newly identified allergens. Each update will be reviewed by the external peer review panel, which will reconvene as needed for this purpose;
- Independent quality control and documentation practices that will ensure the integrity of the database and transparency of the process used to populate the database.
In addition to implementing a process that ensures independent external evaluations by public-sector experts, transparency is critical to the quality and scientific integrity of this tool. As such, all database design and search algorithm decisions will be publicly documented on the COMPARE database website.
Many constituencies may be interested in this type of database: product developers providing safety information on novel proteins; regulatory agencies responsible for food and feed safety assessments; medical personnel in the allergy field; and members of the public interested in identifying sources of allergens. For example, before GM crops are placed on the market, developers perform an extensive safety evaluation using a weight-of-evidence approach. Part of this safety assessment aims to avoid the transfer of putative allergens into food. Bioinformatics tools such as the FASTA algorithm are used to evaluate the degree of similarity between novel proteins introduced into a crop and known allergens. These analyses are performed in early product development phases and guide further evaluation if significant sequence similarity is identified. Access to a transparent, consensus-based allergen database is therefore key to supporting public safety.
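As a rough illustration of how such a screen might be triaged, the sketch below parses alignment hits in the 12-column tab-separated layout that BLAST+ writes with `-outfmt 6` (query ID, subject ID, percent identity, alignment length, and so on, with the E-value in the eleventh column) and flags hits for follow-up. The `Hit` class, the function names, and in particular the `max_evalue` cutoff of 1e-5 are assumptions made for this example; the text above does not prescribe an E-value cutoff, and the 35%/80-residue rule is applied here to reported alignment statistics rather than to a true sliding-window search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query_id: str
    allergen_id: str
    percent_identity: float
    alignment_length: int
    evalue: float


def parse_tabular_hits(text: str) -> list[Hit]:
    """Parse 12-column tab-separated alignment output (BLAST+ '-outfmt 6' layout)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
        hits.append(Hit(
            query_id=fields[0],
            allergen_id=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            evalue=float(fields[10]),
        ))
    return hits


def flag_hits(hits: list[Hit], max_evalue: float = 1e-5,
              min_identity: float = 35.0, min_length: int = 80) -> list[Hit]:
    """Keep hits meeting either illustrative criterion.

    max_evalue is a hypothetical cutoff chosen for this example only.
    """
    return [
        h for h in hits
        if h.evalue <= max_evalue
        or (h.percent_identity > min_identity and h.alignment_length >= min_length)
    ]


if __name__ == "__main__":
    example = (
        "q1\tallergenA\t42.5\t80\t46\t0\t1\t80\t10\t89\t1e-12\t65.1\n"
        "q1\tallergenB\t22.0\t40\t31\t1\t5\t44\t3\t42\t3.2\t18.9\n"
    )
    for hit in flag_hits(parse_tabular_hits(example)):
        print(hit.allergen_id, hit.percent_identity, hit.evalue)
```

A flagged hit in this sketch means only that the similarity merits a closer look; it is not a determination of allergenicity.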
For regulatory agencies, reliable resources for assessing the safety of novel proteins expressed in foods are essential, both for confidence in the safety determination and for communicating those findings to the public. The internationally recognized standard for safety assessment of foods derived from biotechnology is the Codex Alimentarius (2009), which recommends that novel proteins expressed in foods be analyzed for amino acid sequence similarity to all known allergens. The rapid expansion of annotated protein sequences has made it difficult to access and use general public sequence databases for an allergen screen. The COMPARE database is one of a handful of compiled allergen sequence databases that facilitate a focused allergy safety analysis; in contrast, comparative sequence analyses that are more broadly focused on biosystematics or general taxonomy may use an entire database of known protein sequences (e.g., NCBI Protein, UniProt). Using an automated approach for the initial scan of potential allergens allows for a transparent and fully documented process. Subsequent verification of the clinical relevance of these sequences by a panel of scientific experts in allergology validates that the entries in COMPARE are known or putative allergens, making the database a reliable resource (e.g., for regulatory purposes). Overall, the transparent and peer-reviewed method used to generate the COMPARE database makes it easier for regulatory agencies to communicate the output of any screens to the public.
References:
Ferreira, F, Hawranek, T, Gruber, P, Wopfner, N and Mari, A. (2004). Allergic cross-reactivity: from gene to the clinic. Allergy, 59: 243–267.
Aalberse, RC. (2000). Structural Biology of Allergens. Journal of Allergy and Clinical Immunology, 106: 228-38.
Codex Alimentarius Commission. (2003). Alinorm 03/34: Joint FAO/WHO Food Standard Programme, Codex Alimentarius Commission, Twenty-Fifth Session, Rome, 30 June– 5 July, 2003. Appendix III, Guideline for the conduct of food safety assessment of food derived from recombinant-DNA plants and Appendix IV, Annex on the assessment of possible allergenicity, pp. 47–60.
Metcalfe, DD, et al. (1996). Assessment of the allergenic potential of foods derived from genetically modified crop plants. Crit. Rev. Food Sci. Nutr. 36(S), 165–186.
FAO/WHO. (2001). Evaluation of allergenicity of genetically modified foods. Report of a joint FAO/WHO expert consultation on allergenicity of foods derived from biotechnology. Food and Agriculture Organization of the United Nations (FAO), Rome.
Silvanovich, A, Bannon, G, McClain, S. (2009). The use of E-scores to determine the quality of protein alignments. Regulatory Toxicology and Pharmacology 54: S26–S31.
Cressman, R, and Ladics, GS. (2009). Further evaluation of the utility of "sliding window" FASTA in predicting cross-reactivity with allergenic proteins. Reg. Toxicol. Pharmacol., 54:S20-S25.
Mirsky, HP, Cressman, RF, and Ladics, GS. (2013). Comparative assessment of multiple criteria for the in silico prediction of allergenic cross-reactivity. Reg. Toxicol. Pharmacol., 67:232-239.
Song, P, Herman, R, Kumpatla, S. (2014). Evaluation of global sequence comparison and one-to-one FASTA local alignment in regulatory allergenicity assessment of transgenic proteins in food crops. Food and Chemical Toxicology. Volume 71, September 2014, Pages 142–148.
Song, P. (2015). Bioinformatics application in regulatory assessment for potential allergenicity of transgenic proteins in food crops. In: Genetically modified organisms in food, production, safety, regulation and public health. Editors: Ronald Ross Watson and Victor R. Preedy. Academic Press, Elsevier Inc.
König, A, Cockburn, A, Crevel, RW, Debruyne, E, Grafstroem, R, Hammerling, U, Kimber, I, Knudsen, I, Kuiper, HA, Peijnenburg, AA, Penninks, AH, Poulsen, M, Schauzu, M, Wal, JM. (2004). Assessment of the safety of foods derived from genetically modified (GM) crops. Food Chem Toxicol. 42(7):1047-88.
Goodman, RE, Hefle, SL, Taylor, SL, van Ree, R. (2005). Assessing genetically modified crops to minimize the risk of increased food allergy: a review. Int Arch Allergy Immunol. 137(2):153-66.
Goodman, RE, Vieths, S, Sampson, HA, Hill, D, Ebisawa, M, Taylor, SL, van Ree, R. (2008). Allergenicity assessment of genetically modified crops--what makes sense? Nat Biotechnol. 26(1):73-81.
Goodman, RE and AO Tetteh. (2011). Suggested improvements for the allergenicity assessment of genetically modified plants used in foods. Curr Allergy Asthma Rep. 11(4):317-24.
Vaughan, K, Peters, B, Larche, M, Pomes, A, Broide, D, Sette, A. (2013). Strategies to query and display allergy-derived epitope data from the immune epitope database. Int Arch Allergy Immunol. 160(4):334-345.
Song, P, Herman, R, Kumpatla, S. (2015). 1:1 FASTA update: Using the power of E-values in FASTA to detect potential allergen cross-reactivity. Toxicology Reports, 2:1145–1148.
Gendel, SM. (1998). The use of amino acid sequence alignments to assess potential allergenicity of proteins used in genetically modified foods. Adv Food Nutr Res, 42, 45-62.
- A sustainable solution using cutting-edge, high-throughput, automated sequence sorting algorithms to screen an exponentially growing number of protein sequences for those that are potential allergens. Expertise in bioinformatics is employed.
- Oversight and coordination by the Health and Environmental Sciences Institute (HESI), an independent and objective non-profit organization dedicated to advancing the understanding of scientific issues related to human health and the environment. HESI is recognized worldwide as a leader in developing and communicating science-based solutions.
- Steering by a multi-sector scientific committee with regulatory agency representation (EPA and FDA), which is viewed as critical.
- Transparent criteria used by independent allergy experts to facilitate consensus on inclusion of allergens in the database.
- Inclusion of only sequences with documented allergenicity, maintaining a definitive list of sequences with clear allergenic status.
- Documentation of the search and filtering methods used to build each version of the database.
The construction of the COMPARE database and the process of identifying allergenic protein sequences are described in the following paper:
The COMPARE Database: A Public Resource for Allergen Identification, Adapted for Continuous Improvement. van Ree et al. Frontiers in Allergy. (August 2021)