Web data acquisition: parsing json objects with tidyjson (Part 3)

Part 2 presented the libraries and the structure of the POST request needed to collect the example flight data in json format. Although the process generated and transferred a proper response locally, the data collected were neither in a structure suitable for data analysis nor immediately readable: they appeared as one long string of nested information, separated according to the JavaScript Object Notation syntax. To visualize the deeply nested json object and make it human readable for further processing, the json content can be copied and pasted into a common online parser. Such a tool lets you select each node of the tree and inspect the data structure down to the variables and data of interest for the statistical analysis. The bulk of the relevant information for the analysis of flight prices is hidden in the tripOption node (in this example only 50 flight solutions were requested). Looking deeper into the object, several other elements are provided, such as the distance in miles, the segments, the duration, the carrier, etc.

The R parser that transforms the json structure into a usable dataframe requires the dplyr library for the pipe operator (%>%), which streamlines the code and makes the parser more readable. The library actually wrangling through the lines, however, is tidyjson, with its powerful functions (a toy sketch follows the list):
  • enter_object: enters and dives into a data object;
  • gather_array: stacks a JSON array;
  • spread_values: creates new columns from values, assigning a specific type (e.g. jstring, jnumber).
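Before the full parser, here is a minimal sketch of these three verbs on a toy json string; the string and its names are invented for illustration and are not part of the flight data:

library(dplyr)    # pipe operator %>%
library(tidyjson)

'{"kind": "demo", "items": [{"price": 10}, {"price": 20}]}' %>%
  spread_values(kind = jstring("kind")) %>%   # new column from a top-level value
  enter_object("items") %>%                   # dive into the "items" node
  gather_array %>%                            # stack the array: one row per element
  spread_values(price = jnumber("price"))     # new numeric column for each element
# result: one row per array element, with columns kind, array.index and price

The same verbs, chained together, drive the full parser applied to the flight data: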
library(dplyr)    # for pipe operator %>% and other dplyr functions
library(tidyjson) # https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html

data_items <- datajson %>%
  spread_values(kind = jstring("kind")) %>%
  spread_values(trips.kind = jstring("trips", "kind")) %>%
  spread_values(trips.rid = jstring("trips", "requestId")) %>%
  enter_object("trips", "tripOption") %>%
  gather_array %>%
  spread_values(
    id = jstring("id"),
    saleTotal = jstring("saleTotal")) %>%
  enter_object("slice") %>%
  gather_array %>%
  spread_values(slice.kind = jstring("kind")) %>%
  spread_values(slice.duration = jstring("duration")) %>%
  enter_object("segment") %>%
  gather_array %>%
  spread_values(
    segment.kind = jstring("kind"),
    segment.duration = jnumber("duration"),
    segment.id = jstring("id"),
    segment.cabin = jstring("cabin")) %>%
  enter_object("leg") %>%
  gather_array %>%
  spread_values(
    segment.leg.aircraft = jstring("aircraft"),
    segment.leg.origin = jstring("origin"),
    segment.leg.destination = jstring("destination"),
    segment.leg.mileage = jnumber("mileage")) %>%
  select(kind, trips.kind, trips.rid, saleTotal, id,
         slice.kind, slice.duration,
         segment.kind, segment.duration, segment.id, segment.cabin,
         segment.leg.aircraft, segment.leg.origin,
         segment.leg.destination, segment.leg.mileage)

head(data_items)

                    kind             trips.kind              trips.rid saleTotal
1 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38
2 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38
3 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20
4 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20
5 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60
6 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60
                         id           slice.kind slice.duration
1 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo            510
2 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo            510
3 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo            490
4 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo            490
5 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo            355
6 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo            355
            segment.kind segment.duration       segment.id segment.cabin
1 qpxexpress#segmentInfo              160 GixYrGFgbbe34NsI         COACH
2 qpxexpress#segmentInfo              235 Gj1XVe-oYbTCLT5V         COACH
3 qpxexpress#segmentInfo              190 Grt369Z0shJhZOUX         COACH
4 qpxexpress#segmentInfo              155 GRvrptyoeTfrSqg8         COACH
5 qpxexpress#segmentInfo              100 GXzd3e5z7g-5CCjJ         COACH
6 qpxexpress#segmentInfo              105 G8axcks1R8zJWKrN         COACH
  segment.leg.aircraft segment.leg.origin segment.leg.destination segment.leg.mileage
1                  320                FCO                     IST                 859
2                  77W                IST                     LHR                1561
3                  73H                FCO                     ARN                1256
4                  73G                ARN                     LHR                 908
5                  319                FCO                     STR                 497
6                  319                STR                     LHR                 469

Data are now in an R-friendly structure, although not yet ready for analysis. As can be observed from the first rows, each record holds the information of a single segment of a flight solution. A further aggregation step, using some SQL, is needed to end up with a dataframe of flight data suitable for statistical analysis. Next up: the aggregation, some data analysis and data visualization to complete the journey through web data acquisition using R.
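As a preview of that aggregation step, here is a minimal sketch using dplyr rather than the SQL the next post will use; the grouping and the aggregated columns are illustrative choices, not the original query:

# hedged preview: collapse the segment-level rows into one row per flight solution
flights <- data_items %>%
  group_by(id, saleTotal) %>%
  summarise(segments = n(),
            duration = sum(segment.duration),
            mileage  = sum(segment.leg.mileage))
head(flights)

#R #rstats #maRche #json #curl #tidyjson #Rbloggers

This post is also shared in www.r-bloggers.com and LinkedIn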

Web data acquisition: the structure of RCurl request (Part 2)

The acquisition of data in json structure presented in part 1 clearly showed the functioning of the client-server connection and the possibility to collect the data of interest. However, the json output appears as a set of raw data in a json string that needs to be structured and stored in a suitable form for data processing and statistical analysis.

For this reason, it makes sense to develop the entire process in #R, so that the data are queried, collected, parsed, structured and made usable within a single environment, the same one used in the "last mile" of the process, i.e. data analysis. The curl library adopted in the command line process described in the previous post has its alter ego in the RCurl library. Together with jsonlite, for the 'R-JSON translation', these are the necessary packages for building the request, as presented in the following code.

[splus]
# before loading the libraries remember to install them: install.packages('library name here')
library(RCurl)
library(jsonlite)

# save the url of the request in an object (same as -X POST in the curl request)
url <- 'https://www.googleapis.com/qpxExpress/v1/trips/search?key={SERVER_KEY}&alt=json'

# headers (same as -H)
headers <- list('Accept' = 'application/json',
                'Content-Type' = 'application/json',
                'charset' = 'UTF-8')

# R structure of the input for the request (same as -d + JSON)
x = list(
  request = list(
    slice = list(
      list(origin = 'FCO', destination = 'LHR', date = '2017-06-30')),
    passengers = list(adultCount = 1, infantInLapCount = 0, infantInSeatCount = 0,
                      childCount = 0, seniorCount = 0),
    solutions = 500,
    refundable = FALSE))

# url, headers and x are the parameters used to send the request
# and save the output data in the datajson object:
# postForm is the RCurl function sending the request with the POST method;
# toJSON is the jsonlite function converting the R structure of the request into JSON input
datajson <- postForm(url, .opts = list(postfields = toJSON(x), httpheader = headers))
datajson
[/splus]

A few seconds after the POST request is sent and the response collected, all the information related to the flights with origin FCO (Fiumicino, Rome) and destination LHR (London Heathrow) will be hosted in the datajson object, just as in the command line procedure. The json string holds, and hides, all the observations and variables of interest for the statistical analysis, including the most important one, i.e. the flight prices. The next post will explain how to parse the json object and structure the information in a dataframe suitable for analysis, using the powerful library #tidyjson.
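As a sanity check (not in the original post), it can be useful to print the JSON body that toJSON builds from the R list x; note that with auto_unbox = TRUE jsonlite writes scalars (1, 500, false) instead of one-element arrays, matching the -d payload of the command line request in Part 1:

[splus]
# inspect the JSON request body generated from x
cat(toJSON(x, pretty = TRUE, auto_unbox = TRUE))
[/splus]

#R #rstats #maRche #json #curl #qpxexpress #Rbloggers

This post is also shared in www.r-bloggers.com and LinkedIn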

Web data acquisition: understanding RCurl from the command line (Part 1)

After the short presentation here, let's start using R seriously, i.e. with everyday data.

Being a frequent flyer, I often search the web to book flights and organise my trips. Being a data analyst, it is natural to look at price data with interest, especially at the cheapest, the most expensive and the "average" flights. Being a PhD student in software engineering, my (study) programme includes work on the structure of applications, client-server data flows and the web languages running the processes. Nothing could combine these three better than Google's flight API, QPX Express. I discovered it by chance, delving into the first results of my flight searches on Google, and it is perfect for some web scraping and analysis.

As described in the manual, QPX Express is part of the Google suite of publicly available REST APIs, designed for easy integration, fast response and high reliability. QPX Express searches combinations of airline schedules, fares and seat availability, and includes search parameters such as origin, destination and dates, as well as travel features such as maximum price and earliest departure. The API returns up to 500 itineraries for each request, including all the data elements needed for customer display and a booking process. Moreover, it clearly shows the structure of the data in #json, both in input (JSON request) and output (Unformatted Response). For a simple test, use the Demo with a Google account.

Before starting to use the QPX Express API, you need to get a Google API key. To get a key, follow these steps:
1. Go to the Google Developers Console
2. Create or select a project
3. Click Continue to enable the API and any related services
4. On the Credentials page, get an API key
With a single API key it is possible to make 50 free requests per day (after that, the cost per query is $0.035).

The first thing I tried, before writing the #R code for data acquisition, manipulation and analysis, was to test the client-server connection through a command line instruction using the cURL library (you can download it here; add the certificate to the same directory, otherwise the server will deny you access). Let's play a bit with the command line (Windows 10) before running the request:
# open the command prompt and the default folder should be system32 of the OS
C:\WINDOWS\system32>

# set a convenient current directory (cd) for storing data (\ enters a level of the path, .. exits it)
cd C:\Users\Roberto\Desktop

# your new position should be in 
C:\Users\Roberto\Desktop>


# I prefer using a partition of the HD dedicated to data instead of a messy desktop
cd /D D:\

# create a new working directory (quote names that contain spaces, otherwise two folders will be created)
mkdir "flight data"
# set the current directory to the newly created folder
cd "flight data"
# make sub-folders of the cd
mkdir "curl test 1"
mkdir "curl test 2"
# check the cd, its features and its new sub-folders
cd

dir

# finally, check the proper installation of curl using help
# the list of the function parameters should appear with some description

curl --help
After the creation of the working directories, cURL can be used to test the connection to QPX Express and collect the data in json, using the cURL parameters for HTTP requests and the JSON request structure the API proposes:
# Parameters
# -X POST: parameter for POST request -> see: https://en.wikipedia.org/wiki/POST_(HTTP)
# -H: Pass custom header LINE to server (H) -> see: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
# -d: HTTP POST data, i.e. the input data in JSON structure 
# >: "send to" destination file

curl -X POST "https://www.googleapis.com/qpxExpress/v1/trips/search?key={SERVER_KEY}&alt=json" -H "Content-Type: application/json; charset=UTF-8" -d "{ \"request\": { \"slice\": [ { \"origin\": \"{ORIGIN_CODE}\", \"destination\": \"{DESTINATION_CODE}\", \"date\": \"{DATE_YYYY-MM-DD}\" } ], \"passengers\": { \"adultCount\": 1, \"infantInLapCount\": 0, \"infantInSeatCount\": 0, \"childCount\": 0, \"seniorCount\": 0 }, \"solutions\": 500, \"refundable\": false }}" > "{RESULT_FILE}.json"
# Mac syntax
curl -X POST 'https://www.googleapis.com/qpxExpress/v1/trips/search?key={SERVER_KEY}&alt=json' -H 'Content-Type: application/json; charset=UTF-8' -d '{ "request": { "slice": [ { "origin": "{ORIGIN_CODE}", "destination": "{DESTINATION_CODE}", "date": "{DATE_YYYY-MM-DD}" } ], "passengers": { "adultCount": 1, "infantInLapCount": 0, "infantInSeatCount": 0, "childCount": 0, "seniorCount": 0 }, "solutions": 500, "refundable": false }}' > '{RESULT_FILE}.json'
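For readability, this is the same -d request body, unescaped and pretty-printed (placeholders left exactly as in the commands above):

{
  "request": {
    "slice": [
      {
        "origin": "{ORIGIN_CODE}",
        "destination": "{DESTINATION_CODE}",
        "date": "{DATE_YYYY-MM-DD}"
      }
    ],
    "passengers": {
      "adultCount": 1,
      "infantInLapCount": 0,
      "infantInSeatCount": 0,
      "childCount": 0,
      "seniorCount": 0
    },
    "solutions": 500,
    "refundable": false
  }
}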
After a second, the prompt should show the download stats, and a new json file should appear in the current directory "flight data", next to the previously created subfolders. The connection works! Next up: a post with the code to translate the same commands into R, to comfortably manipulate the data acquired from the API.

#R #rstats #maRche #json #curl #qpxexpress #Rbloggers

This post is also shared in www.r-bloggers.com and LinkedIn

ggedit 0.1.1: Shiny module to interactively edit ggplots within Shiny applications

ggedit is a package that lets users interactively edit the layer and theme aesthetics of ggplot objects. In a previous post we showed how to use it in a collaborative workflow with standard R scripts. More importantly, we highlighted that, after editing, ggedit returns updated plots, layers, scales and themes to the user, both as self-contained objects and as scripts you can paste directly into your code.

Installation

devtools::install_github('metrumresearchgroup/ggedit',subdir='ggedit')

version 0.1.1 Updates

  • ggEdit Shiny module: use ggedit as part of any Shiny application.
  • gggsave: generalized ggsave to write multiple outputs of ggplot to a single file and/or multiple files from a single call. Plots can be saved to various graphic devices. 

ggEdit Shiny module

This post demonstrates a new way to use ggedit: Shiny modules. A Shiny module is a chunk of Shiny code that can be reused many times in the same application, yet is generic enough to be applied in any Shiny app (in the simplest terms, think of it as a Shiny function). By making ggedit a Shiny module, we can now replace any renderPlot() call that takes a ggplot and outputs to plotOutput() in the UI with an interactive ggedit layout. The analogy between the ggEdit module and a standard renderPlot call can be seen in the table below.
         Standard Shiny            Shiny Module
Server   output$id=renderPlot(p)   reactiveOutput=callModule(ggEdit,id,reactive(p))
UI       plotOutput(id)            ggEditUI(id)
We can see that there are a few differences in the calls. To call a module you run the Shiny function callModule, in this case with ggEdit, then a character id for the elements the module will create in the Shiny environment, and finally the arguments the module expects, here a reactive object that outputs a ggplot or a list of ggplots. This is coupled with ggEditUI; together they create a ggedit environment for editing the plots inside a regular Shiny app. In addition to the UI output, the user also gets a reactive output containing all the objects the regular ggedit package returns (plots, layers, scales, themes), in both object and script form. This has great advantages if you want to let users edit plots while keeping track of what they are changing.

A realistic example would be clients (in industry or academia) who are shown a set of default plots, with the appropriate data, and are then given the opportunity to customize them to their specifications. Once they finish editing, the script is automatically saved to the server, updating the client's portfolio with their preferred aesthetics. No more email chains about changing a blue point to an aqua star!

Below is a small example of a static ggplot using renderPlot/plotOutput, and of how to call the same plot and a list of plots using ggEdit/ggEditUI. We added a small reactive text output so you can see the real-time changes of the aesthetic editing being returned to the server. A short demo video: https://www.youtube.com/watch?v=pJ1kbd_OVwg

Source Code for example

library(ggedit)   # also attaches ggplot2
library(shiny)
library(shinyAce) # aceEditor, used below to display the returned scripts

server = function(input, output, session) {
  p1 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_point()
  p2 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_line() + geom_point()
  p3 = list(p1 = p1, p2 = p2)
  output$p = renderPlot({p1})

  # ggEdit module: one callModule per plot (or list of plots) to edit
  outp1 = callModule(ggEdit, 'pOut1', obj = reactive(list(p1 = p1)))
  outp2 = callModule(ggEdit, 'pOut2', obj = reactive(p3))

  # show the updated layer call of p1 in real time
  output$x1 = renderUI({
    layerTxt = outp1()$UpdatedLayerCalls$p1[[1]]
    aceEditor(outputId = 'layerAce', value = layerTxt,
              mode = 'r', theme = 'chrome',
              height = '100px', fontSize = 12, wordWrap = TRUE)
  })

  # show the updated theme call of p1 in real time
  output$x2 = renderUI({
    themeTxt = outp1()$UpdatedThemeCalls$p1
    aceEditor(outputId = 'themeAce', value = themeTxt,
              mode = 'r', theme = 'chrome',
              height = '100px', fontSize = 12, wordWrap = TRUE)
  })
}

ui = fluidPage(
  conditionalPanel("input.tbPanel=='tab2'",
                   sidebarPanel(uiOutput('x1'), uiOutput('x2'))),
  mainPanel(
    tabsetPanel(id = 'tbPanel',
                tabPanel('renderPlot/plotOutput', value = 'tab1', plotOutput('p')),
                tabPanel('ggEdit/ggEditUI', value = 'tab2', ggEditUI('pOut1')),
                tabPanel('ggEdit/ggEditUI with lists of plots', value = 'tab3', ggEditUI('pOut2'))
    )))

shinyApp(ui, server)

gggsave

ggsave is the device-writing function of the ggplot2 package. One limitation is that it writes only one figure at a time. gggsave is a wrapper around ggsave that accepts a list of ggplots and passes arguments to the base graphics devices, creating multiple outputs automatically without the need for loops.
library(ggedit)
# pList: a list of ggplot objects, e.g. pList <- list(p1 = p1, p2 = p2)

# single file output to pdf
gggsave('Rplots.pdf', plot = pList)

# multiple file output to pdf
gggsave('Rplots.pdf', plot = pList, onefile = FALSE)

# multiple file output to png
gggsave('Rplots.png', plot = pList)


Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings.


To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website.


Contact: For questions and comments, feel free to email me at [email protected] or open an issue on GitHub.