Introduction - What is GWAS?

GWAS (Genome-wide association studies) is a common study when you want to check genetic variability in a genomic scale. GWAS will mainly focus on the association between single-nucleotide polymorthisms (SNPs) and traits such as herbicide resistance or some sort of disease. GWAS can be applied to any organisms and species where you want to study variation between different phenotype. GWAS studies investigate the entire genome, in contrast to methods that specifically test a small number of pre-specified genetic regions.

In other words, GWAS is not made to identify gene-specific candidate genes. GWAS studies can be a valuable approach in weed science to help us to identify traits and weed characteristics that could help us better understand how to control it and even how to use some other traits for breeding purposes. GWAS is also a hypothesis-free approach since you are looking for clues in the entire genome to support your potential hypothesis about your trait of interest.

GWAS data is commonly analyzed using a Manhattan plot which is a type of scatter-plot used to display significant SNPs. In a Manhattan plot the x-axis will display the genomic coordinates per chromosome and the y-axis will display the association p-value for each single-nucleotide. This post will focus in the design of a Manhattan plot using ggplot.

Here are the packages that will use for creating the Manhattan plot:

library(tidyverse) # tidyverse packages
library(RColorBrewer) # complement to ggplot
library(ggrepel) # complement to ggplot
library(kableExtra) # table layout

How can we conduct a GWAS?

A GWAS experiment have many steps. Since this post will focused in how create a nice visualization plot for your GWAS data, I will cover only the basics. For this example I will use a adapted available data from a Alzheimer’s disease study. Why not use a plant study? I will discuss this later, for a GWAS study a reference genome is required so we will use a human data due to the easy access to the genome (there are ways to go around this, but for the sake of this post lets just assume the genome of reference issue).

To conduct a GWAS here are some basic steps and key points to consider before visualization of the data:

  1. Case/control set up

As most of our weed science studies, the first step is to select individuals (or plants) that does not have the trait of interest. In this data we have 188 controls and 176 cases. There are other approaches that you can use for your GWAS study.

  1. Reference genome

As mentioned previously, for this analysis we used the reference human genome GRCh38.p13 that can be found at NCBI. We will use this genome to calculate the odds-ratio and p-values for each SNP from both cases.

  1. Linkage-disequilibrium and Population structure

Before calculation of association is important to check for linkage-disequilibrium, population structure or any other confounding variable that can affect your results. Population structure is defined by the organization of genetic variation and is driven by the combined effects of evolutionary processes that include recombination, mutation, genetic drift, demographic history, and natural selection. Linkage-disequilibrium refers to the non-random association of alleles at different loci in a given population. It is important to account for this factors when designing your GWAS studies

Analysis

Statistical analysis

The initial GWAS statistical analysis needs to be done prior plotting the data. This initial steps will not be done here due to the time to run the programs and also because this analysis was not conducted in R. For this analysis, here are some options you can use:

  • Plink: I decided to use this software based on Linux system. This software is well known for GWAS studies and I recommend it due to the speed of processing data on it.

  • Tassel: Tassel is commonly used in plant breeding studies, specially in Maize. You can also used it in R!

After running the data in Plink here are the data-set results obtained from plink:

Click here to download the data: https://drive.google.com/file/d/1OPbKktDUN4izxhcrcA1EtgAdhvWThDF8/view?usp=sharing

# load data
plink.result <- read_table2("/Users/maxwelco/Documents/OpenWeedSci/static/analysis1.assoc.logistic") # change to your root directory

head(plink.result) # get first 5 lines
## # A tibble: 6 x 10
##     CHR SNP           BP A1    TEST  NMISS      OR   STAT      P X10  
##   <dbl> <chr>      <dbl> <chr> <chr> <dbl>   <dbl>  <dbl>  <dbl> <chr>
## 1     1 rs3094315 792429 C     ADD     339 1.47     1.76  0.0791 <NA> 
## 2     1 rs3094315 792429 C     COV1    339 0.417   -0.420 0.674  <NA> 
## 3     1 rs3094315 792429 C     COV2    339 0.00531 -2.46  0.0137 <NA> 
## 4     1 rs3094315 792429 C     COV3    339 1.82     0.293 0.770  <NA> 
## 5     1 rs3094315 792429 C     COV4    339 0.446   -0.396 0.692  <NA> 
## 6     1 rs3094315 792429 C     COV5    339 2.71     0.490 0.624  <NA>
glimpse(plink.result) # check data stucture
## Rows: 1,485,344
## Columns: 10
## $ CHR   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ SNP   <chr> "rs3094315", "rs3094315", "rs3094315", "rs3094315", "rs3094315"…
## $ BP    <dbl> 792429, 792429, 792429, 792429, 792429, 792429, 819185, 819185,…
## $ A1    <chr> "C", "C", "C", "C", "C", "C", "G", "G", "G", "G", "G", "G", "G"…
## $ TEST  <chr> "ADD", "COV1", "COV2", "COV3", "COV4", "COV5", "ADD", "COV1", "…
## $ NMISS <dbl> 339, 339, 339, 339, 339, 339, 341, 341, 341, 341, 341, 341, 341…
## $ OR    <dbl> 1.474000, 0.417400, 0.005308, 1.820000, 0.446000, 2.712000, 1.4…
## $ STAT  <dbl> 1.75600, -0.42040, -2.46500, 0.29290, -0.39620, 0.49010, 1.4240…
## $ P     <dbl> 0.07913, 0.67420, 0.01369, 0.76960, 0.69200, 0.62410, 0.15440, …
## $ X10   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

As you can see, this is a very big data-set where we have almost 1.5 million observations and 10 variables:

  • CHR = Chromosome
  • SNP = Single-nucleotide polymorphism
  • BP = Base-pair coordinate
  • A1 = Allele 1
  • TEST = Test identifier
  • NMISS = Number of observations (non-missing genotype, phenotype, and covariates)
  • OD = odds(allele 1 | case) / odds(allele 1 | control)
  • STAT = T-statistic
  • P = Asymptotic p-value for t-statistic

Now lets do some data cleaning to transform some variables to factor and remove the last column that does not mean anything. We are also filtering the data only for TEST = ADD to only plot the data points corresponding to SNP effect:

df_clean <- plink.result %>% 
  filter(TEST == "ADD") %>% 
  select(-X10) #  Remove all rows that don’t correspond to testing the SNP effect 

fac <- c("CHR","SNP", "A1", "TEST") # select columns to be factor

df_clean[fac] <- lapply(df_clean[fac], factor) # transform columns to factor

glimpse(df_clean) # check new data stucture
## Rows: 246,663
## Columns: 9
## $ CHR   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ SNP   <fct> rs3094315, rs4040617, rs4075116, rs6603781, rs3766180, rs660379…
## $ BP    <dbl> 792429, 819185, 1043552, 1198554, 1563420, 1586208, 1596068, 16…
## $ A1    <fct> C, G, G, T, G, G, A, T, C, G, C, A, A, A, A, A, A, C, C, A, A, …
## $ TEST  <fct> ADD, ADD, ADD, ADD, ADD, ADD, ADD, ADD, ADD, ADD, ADD, ADD, ADD…
## $ NMISS <dbl> 339, 341, 341, 336, 342, 341, 341, 338, 342, 342, 328, 342, 342…
## $ OR    <dbl> 1.4740, 1.4160, 0.9771, 1.4910, 0.8282, 0.8004, 0.8130, 0.7624,…
## $ STAT  <dbl> 1.756000, 1.424000, -0.139800, 1.755000, -1.098000, -1.321000, …
## $ P     <dbl> 0.07913, 0.15440, 0.88880, 0.07933, 0.27230, 0.18660, 0.22470, …

Now we have a smaller (still big) data-set with almost 250,000 observations and we can go forward and create our Manhattan plot.

Manhattan plot

To create the Manhattan plot, I will create a function that takes two arguments (df = dataframe and threshold of SNP significance) was created following three steps:

  1. Compute the cumulative position of SNP (BPcum) and add it to the data set. This was done by grouping the filtered data by chromosome and summarizing the Max BP for each chromosome. After that, using mutate I calculated total length by subtracting the cumulative chromosome length by the chromosome length and them use left join to add this into the data. A final step was arranging by chromosome and base pairs and create another column named BPcum to get the cumulative position of SNPs.

  2. The second step was to prepare the x-axis to be used in the plot. By centering each chromosome to the middle point of the BPcum.

  3. The final step is to create the plot. To do this I used ggplot where I specify the x as BPcum and y as the -log10 of the P-values. After that, I created a geom_point layer by mapping color to Chromosome; color was defined by using the scale_color_manual function where each chromosome was map in two colors. To use the previous created x-axis we defined it using the function scale_x_continuous. Other aesthetic values were defined using the theme function and geom_rapel from the package ggrapel was used to annotate the significant SNPs with another geom_point function to highlight those points as well. Dashed line represent the defined threshold.

Lets implement our plot function:

manh_plot <- function(df, threshold) {
  
  ### 1. Compute the cumulative position of SNP ### 
  plot_data <- df %>%   
  # Compute chromosome size
  group_by(CHR) %>% 
  summarise(chr_len=as.numeric(max(BP))) %>% 
  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  select(-chr_len) %>%
  # Add this info to the initial dataset
  left_join(df_clean, ., by=c("CHR"="CHR")) %>%
  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate( BPcum=as.numeric(BP+tot))
  
  ### 2. Generate x-axis ###
  axisdf <- plot_data %>% 
  group_by(CHR) %>% 
  summarize(center=(max(BPcum) + min(BPcum)) / 2 )
  
  ### 3. create plot ###
  plot <- ggplot(plot_data, aes(x=BPcum, y=-log10(P))) + 
  #specify the y and x values
  geom_point( aes(color=as.factor(CHR)), alpha=0.8, size=1.3) + 
  # create scatterplot colored by chromosome
  scale_color_manual(values = rep(c("#E2709A", "#CB4577", 
                                    "#BD215B", "#970F42", 
                                    "#75002B"), 22)) + 
  # set a colour pattern 
  scale_x_continuous(label = axisdf$CHR, breaks= axisdf$center) + 
  # scale the x-axis
  scale_y_continuous(expand = c(0, 0)) + 
  # remove space between plot area and x axis
  ylim(0,20) +
  theme_light() +
  theme(legend.position="none",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.line = element_line(color = "black")) +
  xlab("Chromosome") + 
  # add x label
  geom_label_repel( data=plot_data %>% filter(P < threshold), # add annotation value
                    aes(label=SNP), size=3) + # add annotation
    geom_point(data= plot_data %>% filter(P < threshold), # add annotation value
               color="orange", size=2) + # Add highlighted points 
  geom_hline(yintercept = -log10(threshold), linetype="dashed") # threshold line

  return(plot) # return the final plot
}

Now lets run our function! For our function we will set our threshold for SNP negative log p-value significance of \(10^{-10}\). This value is defined by you based on your data.

plot <- manh_plot(df_clean, threshold = 10^-10) # run function
plot

Great! It works looks like we have two major SNPs in our dataset. It seems that those SNPs are located in the chromossome 19. Lets take a look by zooming on it using the ggplot function coord_cartesian().

However, before doing it we need to identify the coordinates limits of each chromosome in the our genomic data. To do that lets use some tidyverse functions such as group by and summarize:

# Identify max and min values for BPcum ##
kable(df_clean %>%
  # Compute chromosome size
  group_by(CHR) %>% 
  summarise(chr_len=as.numeric(max(BP))) %>% 
  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  select(-chr_len) %>%
  # Add this info to the initial dataset
  left_join(df_clean, ., by=c("CHR"="CHR")) %>%
  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate( BPcum=as.numeric(BP+tot)) %>% 
  group_by(CHR) %>% 
  summarize(`Upper limit` = max(BPcum),
            `Lower limit` = min(BPcum))) %>%
  kable_styling()
CHR Upper limit Lower limit
1 245330435 792429
2 488060817 245421949
3 687382885 488109420
4 878688928 687452420
5 1059314367 878770877
6 1230137976 1059424999
7 1388743029 1230279298
8 1535007247 1388932576
9 1673311023 1535189376
10 1808634455 1673459969
11 1943078135 1808831222
12 2075451837 1943140015
13 2189515198 2093562099
14 2295868223 2208917893
15 2396055457 2314314645
16 2484724313 2396138960
17 2563329787 2484731201
18 2639445341 2563480674
19 2703230392 2639657374
20 2765607350 2703316517
21 2812509590 2779179019
22 2861956667 2827773044
23 3016436346 2864653419
25 3019123947 3016531021

It seems that our chromosome is on the 2639657374 and 2703230392 limit lets plot it!

## plot only chromossome 19 according to the above values
plot +
  coord_cartesian(xlim = c(2639657374, 2703230392))

Conclusion

With this codes you can create your own (and customizable) Manhattan plot! That are many other analysis that you can do in a GWAS studies; it is a vast and amazing research approach that weed scientist can (and should) use!