r – Page 2 – R-posts.com

New Package to Process TVDI index and Filter Golay Savitzky Raster

Description

Use MODIS image to calculate TVDI index
Make multiple Raster images at the same time
Can be used to calculate large image files
UI interface calculates TVDI index
UI interface exports Golay Savitzky filter images
The functions in the TVDI package
- Golay_Raster
- Golay_GUI (may be failed if you don’t have GTK+)
- Mean_Raster
- Mask_Multi_Raster
- IQR_Raster
- TVDI_process
- TVDI_Largefiles_process
- TVDI_GUI (may be failed if you don’t have GTK+)

How to Download and Install

Download and Install from Github

install.packages("devtools")
library("devtools")
install_github("nguyenduclam/TVDIpk")
library("TVDIpk")

Install from Cran (waiting for update in Cran)

install.packages("TVDIpk")

Note that GTK+ library is not already installed on your system, installation may fail. In that case, please install and load the gWidgetsRGtk2 library beforehand:

install.packages("gWidgetsRGtk2")
library("gWidgetsRGtk2")

If none of the above works, download package at this link to download this package type tar.gz
Github link: https://github.com/nguyenduclam/TVDIpk

How to use Pakages

Golay UI
- ```
Golay_GUI()
```
TVDI UI
- ```
TVDI_GUI()
```

References

A New Release of rIP (v1.2.0) for Detecting Fraud in Online Surveys

We are excited to announce the latest major release of rIP (v1.2.0), which is an R package that detects fraud in online surveys by tracing, scoring, and visualizing IP addresses. Essentially, rIP takes an array of IP addresses, which are always captured in online surveys (e.g., MTurk), and the keys for the services the user wishes to use (IP Hub, IP Intel, and Proxycheck), and passes these to all respective APIs. The output is a dataframe with the IP addresses, country, internet service provider (ISP), labels for non-US IP Addresses, whether a virtual private server (VPS) was used, and then recommendations for blocking the IP address. Users also have the option to visualize the distributions, as discussed below in the updates to v1.2.0.

Especially important in this is the variable “block”, which gives a score indicating whether the IP address is likely from a “server farm” and should be excluded from the data. It is coded 0 if the IP is residential/unclassified (i.e. safe IP), 1 if the IP is non-residential IP (hostping provider, proxy, etc. – should likely be excluded, though the decision to do so is left to the researcher), and 2 for non-residential and residential IPs (more stringent, may flag innocent respondents).

Including some great contributions from Bob Rudis, some of the key feature updates included in v1.2.0 of rIP are:

Added discrete API endpoints for the three IP services so users can use this as a general purpose utility package as well as for the task-specific functionality currently provided. Each endpoint is tied to an environment variable for the secret info (API key or contact info). This is documented in each function.
On-load computed package global .RIP_UA which is an httr user_agent object, given the best practice to use an identifiable user agent when making API calls so the service provider can track usage and also follow up with any issues they see.
A new plotting option that, when set to “TRUE”, produces a barplot of the IP addresses checked with color generated via the amerika package.
Users can now supply any number of IP service keys they wish to use (1, 2, or all 3), and the function will ping only the preferred IP check services (formerly, the package required all three keys or none to be entered).
For those interested in reading more and citing the package in published work, check out our recently published software paper in the Journal of Open Source Software.

Here is a quick demo of the package with some fake (auto-generated) IP addresses:

# Install and load rIP, v1.2.0
install.packages("rIP")
library(rIP)

# Store personal keys (only "IP Hub" used here)
ip_hub_key = "MzI2MTpkOVpld3pZTVg1VmdTV3ZPenpzMmhodkJmdEpIMkRMZQ=="

ipsample = data.frame(rbind(c(1, "15.79.157.16"), c(2, "1.223.176.227"), c(3, "72.167.36.25"), c(4, "5.195.165.176"),
                             c(5, "27.28.25.206"), c(6, "106.118.241.121"), c(7, "231.86.14.33"), c(8, "42.56.9.80"), c(9, "41.42.62.229"),
                             c(10, "183.124.243.176")))
names(ipsample) = c("number", "IPAddress")

# Call the getIPinfo function to check the IPs
getIPinfo(ipsample, "IPAddress", iphub_key = ip_hub_key, plots = TRUE)

Running the code above will generate the following plot, as well as the dataframe mentioned above.

Note that to use the package, users must have valid personal keys for any IP service they wish to call via the function. These can be obtained for free at the corresponding IP check services.

Finally, we welcome contributions and reports of bugs either by opening an issue ticket or a pull request at the corresponding Github repository. Thanks and enjoy the package!

How to get the homology of a antibody using R

What is homology?

Proteins are conserved bio molecules present in all organisms. Some proteins are conserved in many similar species, making them homologues of each other, meaning that the sequence is not the same but there is a degree of similarity.

This is measured as homology in percentage.

Histone H1 homology in mammals (https://en.wikipedia.org/wiki/Sequence_homology)

What is R

R is a high level programming language with a focus on statistical analysis and machine learning. R Project

R is widely used in bioinformatics and has a huge resources library of packages, most of them developed by the top Universities to help with hard to understand data, like genomic and proteomic data.

Why this matters?

This matters because antibodies will bind to specific proteins, but many proteins have similar structure and sequence. So when performing an assay you want to know what specific protein the antibody binds to.

This is also relevant when choosing an antibody that works in many species, specially when doing research in less used species, like veterinary studies.

Getting Started

You need the sequence of the epitope used to produce the antibody, I need to know where this product will bind in the target protein. If the sequence is very conserved in mammalians for example, this will be homologue in other species; eg. actins, membrane proteins, etc.

Obtaining the data

Data for sequence can be provided by the manufacturer, in some cases it will be in the terms of region of the protein, eg. 162–199aa in C-terminal of RAB22A human.

In these cases I will need to retrieve the protein sequence from UNIPROT and then cut it according to the information I have on the immunogen, this can be done using a program sequentially, like R stringi package or using an excel sheet with MID function.

Prepare environment in R (libraries)

I will use Bioconductor packages:

#Install packages: rBLAST, Biostrings, annotate 
install.packages("devtools")
library(devtools)
install_github("mhahsler/rBLAST")
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
biocLite("annotate")

These libraries will help us retrieve the homology data, and then I will use simple commands to clean and present the data in a nice way.

Cleaning data

Cleaning the data is usually 80% of the work of any data scientist.

I want the homology in simple terms of aa, or in bioinformatic terms FASTA; eg.

ATTAAGCTTATTGC

You can have terminal places, that need to be removed, usually in terms of ‘-C’, I can remove this using a simple gsub function.

dataframe$sequenceImunogen <- gsub("\\-C", "", dataframe$sequenceImunogen)

For this case the sequence is in a column called sequenceImunogen that is why I have to make a loop so the protein blast goes through all the sequences in the data frame.

The function

I will use the function blastSequences to blast the protein and obtain the homology of this specific sequence to other species. The parameters of the function can be found in the literature. I used program = “blastp” so I perform a Protein Blast, I use a hitListSize of 50 and a timeout of 200 seconds, this is from previous experience, as I will do this for multiple sequences I prefer the data to be as = data.frame.

x <- blastSequences(x = Sequence, timeout = 200, as = "data.frame", program = "blastp", hitListSize = 50)

Making a loop in R

When creating a loop in R I always put a Start variable with the Sys.time()constant, so I know how long it takes to run this loop.

Start <- Sys.time()

I have use this in a couple of hundred sequences and the result toke a few days, so keep that in mind when you run.

I first create a function that formats my results in terms of percentage of homology and give it the term percentage.

#Create a percent function
percent <- function(x, digits = 1, format = "f", ...) {
paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
           }

Cleaning the result data

Once the blastP results come, the 1st step is to clean unwanted information so it is not confusing, I used the base functions subset, grepl, gsub to get only information with results verified, and remove annoying characters. The objective is to have a nice table result with the best homologues.

#Remove what I do not want
x <- subset(x, grepl("_A", x$Hit_accession) == FALSE)
x <- subset(x, grepl("XP_", x$Hit_accession) == FALSE)
x <- subset(x, grepl("unnamed", x$Hit_def) == FALSE)

#Removing characters that do not 
x$Species <- sub(".*\\[(.*)\\].*", "\\1", x$Hit_def, perl=TRUE)
x$Species <- gsub("/].*","", x$Species)
x$Hit_def <- gsub("[[:punct:]].*", "", x$Hit_def)

Retrieving the results above 80% identical

After I get all the results and clean a bit of the results, I want to subset the homologues to above 80%. To do this I count the total characters of the sequence using nchar function and divide the total number of hit results by this number of character, ie. for a 100 aa sequence if I hit 80 I get a 80% homology.

x$Hsp_identity <- as.numeric(x$Hsp_identity)
n <- nchar(dataframe$sequenceImunogen[i])
x$percentage <- as.numeric(x$Hsp_identity/n)
x<- x[order(-x$percentage), ]
x<- x[x$percentage > 0.8, ]
x$percentage <- percent(x$percentage)

I then clean the result to show as percentage.

Format nice output

I want this to be able to be presented in HTML, so I use HTML tags to create a visually pleasing table.

Because this is to be online I want to add hyperlinks to the results so my customers can see the result and analyse the data.

x$link <- paste0("https://www.ncbi.nlm.nih.gov/protein/",x$Hit_accession)
    x$final <- paste(x$Hit_def, x$Species, x$percentage, sep = " - ")
    x <- subset(x,!duplicated(x$final))
    x$TOTAL <- paste0("<a href='",x$link,"'target='_blank'>",x$final,"</a>")

I Remove also any duplicated information from our table, and paste all the result into a new column, with all results separated by
(line break in HTML), to do this I use the function paste, with the option collapse = “<br>”

This will create hopefully a table like following:

All the code

All the code can be seen bellow: Use TryCatch for error management, sometimes the functions times out and you need to rerun, you get a prompt asking to skip or run again, a simple Y is enough.

Start <- Sys.time()
    for (i in 1:nrow(dataframe)) {
          tryCatch({
          percent <- function(x, digits = 1, format = "f", ...) { paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
                    }
            x <- blastSequences(x = dataframe$sequenceImunogen[i], timeout = 200, as = "data.frame", program = "blastp", hitListSize = 50)

#Remove what I do not want
      x <- subset(x, grepl("_A", x$Hit_accession) == FALSE)
      x <- subset(x, grepl("XP_", x$Hit_accession) == FALSE)
      x <- subset(x, grepl("unnamed", x$Hit_def) == FALSE)

#Removing characters that do not 
      x$Species <- sub(".\\[(.)\\].*", "\\1", x$Hit_def, perl=TRUE)
      x$Species <- gsub("/].*","", x$Species)
      x$Hit_def <- gsub("[[:punct:]].*", "", x$Hit_def)
      x$Hsp_identity <- as.numeric(x$Hsp_identity)
      n <- nchar(masterir2$sequenceImunogen[i])
      x$percentage <- as.numeric(x$Hsp_identity/n)
      x<- x[order(-x$percentage), ]
      x<- x[x$percentage > 0.8, ]
      x$percentage <- percent(x$percentage)

      x$link <- paste0("https://www.ncbi.nlm.nih.gov/protein/",x$Hit_accession)
      x$final <- paste(x$Hit_def, x$Species, x$percentage, sep = " - ")
      x <- subset(x,!duplicated(x$final))
      x$TOTAL <- paste0("<a href='",x$link,"'target='_blank'>",x$final,"</a>")

  dataframe$Homology[i] <- paste(unlist(x$TOTAL), collapse ="</br>")                
}, error=function(e{cat("ERROR :",conditionMessage(e), "\n")})
    }
end <- Sys.time()
print(end - Start)

Conclusion

I can use R and the Bioinformatics packages in Bioconductor to retrieve the homology of known epitopes, this help us provide more valuable information to researchers.

Homology is important in the niche world of antibodies, there is a limited amount of money for validation and a limited amount of species that the antibodies can be tested on. With this simple program a homology can be check in a click of a button, instead of using

What next?

This can be incorporated into a small web application that can provide the homology for a specific selected product with a click off a button, saving time to search and blast on the NCBI website.

I am very new to the world of R and coding, so any comment is appreciated.