The primary programming language used in the Data Science course at Erasmus University Rotterdam was ‘R’. So when the topic ‘XML, XQuery and XSL’ was tackled, there was a need for a tool that could work with R. BaseX, an open-source XML database, offered the functionality needed for the exercises, but it had no R support. However, a server protocol was available in which all communication was fully described. This protocol has now been used to develop RBaseX. Version 1.1.1 of this client is available at https://cran.r-project.org/web/packages/RBaseX/index.html. All following examples are based on the minutes of the Dutch Parliament, which are freely available in XML format.
BaseX is written in Java and is shipped as .jar, .zip, .exe or .war. On Windows and Linux, it works out of the box. All clients are based on the client/server architecture, so a BaseX database server must be started first.
RBaseX can be used in two modes:
- Standard mode is used for connecting to a server and sending commands
- Query mode is used for defining queries, binding variables and iterative evaluation
Standard mode
RBaseX is developed using the R6 package for object-oriented programming. All interactions between the client and BaseX begin with the creation of a new instance of BasexClient. The following command creates a new BasexClient instance and stores it in the variable Session. This Session is used in all the examples.
library(RBaseX)
Session <- BasexClient$new("localhost", 1984L, username = "admin", password = "admin")
With the Command method, the Create statement can be used to create new databases. If a database with that name already exists, it is overwritten. Existing databases can be opened with the Open statement. BaseX also has a convenience Check statement, which combines Open and Create: if a database with that name already exists, it is opened; otherwise it is created (and opened).
Session$Command("Check Parliament")
After the client has sent a command to the server, the server responds by sending the outcome of the command as a raw vector (byte array). This outcome is converted to a list. In most cases this list has the attributes $result and, if appropriate, $info. To indicate whether the command was successful, a single status byte is added to the raw vector: 0x00 means success, 0x01 indicates an error. The value of this status byte is stored as a private variable in the Session instance and can be accessed as the $success attribute.
The public methods get_success() and set_success() are not described in the server protocol. Neither are the methods set_intercept, get_intercept and restore_intercept. I added them because I needed to catch errors. When the Intercept variable has the value FALSE and an error in your code is detected, execution stops. When the Intercept variable has the value TRUE, you can catch the error.
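The role of the status byte can be illustrated without a running server. The following is a minimal sketch, not the RBaseX implementation: parse_response is a hypothetical helper, and the sample bytes assume a response that consists of zero-terminated strings followed by one status byte. The info text in the sample is a made-up placeholder.

```r
# Hypothetical helper (not part of RBaseX): split a raw response into its
# zero-terminated strings and read the trailing status byte.
parse_response <- function(response) {
  stopifnot(is.raw(response), length(response) >= 1)
  n <- length(response)
  status <- response[n]        # last byte: 0x00 = success, 0x01 = error
  payload <- response[-n]      # everything before the status byte
  ends <- which(payload == as.raw(0))         # positions of the terminators
  starts <- c(1L, ends + 1L)[seq_along(ends)] # first byte of each string
  parts <- vapply(seq_along(ends), function(i) {
    if (ends[i] > starts[i]) {
      rawToChar(payload[starts[i]:(ends[i] - 1L)])
    } else {
      ""                       # empty string between two terminators
    }
  }, character(1))
  list(parts = parts, success = identical(status, as.raw(0)))
}

# Assumed sample: a result string, an info string, then the status byte.
ok <- parse_response(c(charToRaw("740"), as.raw(0),
                       charToRaw("info text"), as.raw(0),
                       as.raw(0)))
failed <- parse_response(c(as.raw(0), charToRaw("oops"), as.raw(0), as.raw(1)))
```

Here ok$success is TRUE and failed$success is FALSE, which is the information the $success attribute exposes.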
Scraping the site of the Parliament is surprisingly easy. A search for all the minutes of a given date returns a page that contains an ID and the number of documents. The two can be combined to construct a link to all the minutes in XML format. The files can be stored directly in the database. The following store function uses the BaseX Add command.
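Session$set_intercept(TRUE)
# minutes() is not defined in this text; presumably it is the scraping helper
# that returns the relative links to the XML files for one date.
store <- function(Period) {
  # seq_along() also behaves correctly when Period is empty
  for (i in seq_along(Period)) {
    Minutes <- minutes(Period[i])
    if (!identical(Minutes, list())) {
      Source <- paste("https://zoek.officielebekendmakingen.nl/", Minutes, sep="")
      # Add every document to the database under the path "Debates"
      mapply(function(source) Session$Add("Debates", source), Source)
    }
  }
}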
Thanks to set_intercept, errors due to missing files are handled neatly. Downloading and storing the 740 files in the database took 3.5 minutes.
Querying
‘Simple’ XQuery statements can be handled as normal commands. So you can use the following statement to return the number of records that were inserted:
(Count <- Session$Command("xquery count(collection('Parliament'))")$result)
## [[1]]
## [1] "740"
I use the BaseX GUI to develop queries. Queries that have been tested in the GUI are copied into the R environment. The following XQuery code returns, for every meeting, the debate ID, the names of the speakers, the order in which they spoke and the party they represent. result2frame is a utility function which converts the results to a data frame.
library(glue)
library(knitr)
Query_Expr <-paste(
'import module namespace functx = "http://www.functx.com";',
'for $Debat in collection("Parliament")',
' let $debate-id := fn:analyze-string(',
' $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',
' for $Speaker at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
' let $Spreker := $Speaker/spreker/naam/achternaam/text()',
' let $Pol := $Speaker/spreker/politiek/text()',
' order by $debate-id, $CountInner',
'return($debate-id, $CountInner, $Spreker, ($Pol, "n.v.t")[1])')
result <- Session$Command(as.character(glue("xquery {Query_Expr}")))
result_frame <- result2frame(result$result, 4)
names(result_frame) <- c('Debate ID','Speech', 'Speaker', 'Party')
kable(head(result_frame,3))
Debate ID | Speech | Speaker | Party |
---|---|---|---|
20202021-102-10 | 1 | voorzitter | n.v.t |
20202021-102-10 | 2 | voorzitter | n.v.t |
20202021-102-2 | 1 | voorzitter | n.v.t |
Because the debate ID is sorted as a string, item 102-10 is listed before item 102-2. The following XQuery function pads the item number to three digits so that the IDs sort in numerical order.
import module namespace functx = "http://www.functx.com";
declare function local:order_id($Meeting as xs:string) {
  let $debate-id := fn:analyze-string($Meeting, "(\d{8}-\d*)-(\d*)")
  let $date := $debate-id//fn:match/*:group[@nr="1"]/text()
  let $item-nr := functx:pad-integer-to-length($debate-id//fn:match/*:group[@nr="2"]/number(), 3)
  return fn:string-join(($date, $item-nr), "-")
};
After incorporating this function in the query, the ordering is correct.
Debate ID | Speech | Speaker | Party |
---|---|---|---|
20202021-102-2 | 1 | voorzitter | n.v.t |
20202021-102-2 | 2 | Bromet | GroenLinks |
20202021-102-2 | 3 | voorzitter | n.v.t |
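The ordering problem with debate IDs can also be reproduced in plain R. The sketch below is only an illustration: pad_id is a hypothetical helper that zero-pads the item number with sprintf, and it is not part of RBaseX or of the query above.

```r
# Hypothetical helper: zero-pad the item number of a debate ID to three digits.
pad_id <- function(id) {
  pattern <- "^([0-9]{8}-[0-9]+)-([0-9]+)$"
  date <- sub(pattern, "\\1", id)              # e.g. "20202021-102"
  item <- as.integer(sub(pattern, "\\2", id))  # e.g. 2 or 10
  sprintf("%s-%03d", date, item)
}

ids <- c("20202021-102-2", "20202021-102-10")

# Plain string sorting puts item 10 before item 2
# (method = "radix" sorts byte-wise, independent of the locale).
sorted_raw <- sort(ids, method = "radix")

# Sorting on the padded key restores the numerical order.
sorted_padded <- ids[order(pad_id(ids), method = "radix")]
```

Only the sort key is padded; the IDs themselves stay unchanged.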
Query mode
ExecuteQuery
Session$Query() does not execute a query expression. It creates a new instance of QueryClass and adds this instance to the Session variable. It is this QueryClass instance that has an ExecuteQuery() method. This is demonstrated in the following example, which extracts the text from the minutes.
Query_Expr <-paste(
'import module namespace functx = "http://www.functx.com";',
'for $Debat at $CountOuter in collection("Parliament")',
' where $CountOuter <=2',
' let $debate-id := fn:analyze-string(',
' $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',
' for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
' let $Spreker := $Speach/spreker/naam/achternaam/text()',
' let $Pol := $Speach/spreker/politiek/text()',
' order by $debate-id, $CountInner',
' for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
' let $tekst := fn:string-join(fn:data($par//al/text()), "\n")',
' return($debate-id, $Spreker, ($Pol, "n.v.t")[1], $CountPar, $tekst)'
)
QueryExe <- Session$Query(Query_Expr)
result <- QueryExe$queryObject$ExecuteQuery()
result_frame <- result2frame(result, 5)
names(result_frame) <- c('Debate ID','Speaker', 'Party', 'Order', 'Text')
kable(head(result_frame,1))
Debate ID | Speaker | Party | Order | Text |
---|---|---|---|---|
20202021-102-2 | voorzitter | n.v.t | 1 | Allereerst hebben we het traditionele mondelinge vragenuur. … |
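result2frame is used above without showing what it does. As a rough idea of the reshaping involved, the sketch below turns a flat sequence of values into a data frame with a fixed number of columns. flat_to_frame is a hypothetical stand-in written for this illustration; the actual result2frame in RBaseX may work differently. The sample values are taken from the speaker table shown earlier.

```r
# Hypothetical stand-in for result2frame: reshape a flat list of values
# into a data frame with n_col columns, filling row by row.
flat_to_frame <- function(items, n_col) {
  values <- unlist(items, use.names = FALSE)
  stopifnot(length(values) %% n_col == 0)  # must fill complete rows
  as.data.frame(matrix(values, ncol = n_col, byrow = TRUE),
                stringsAsFactors = FALSE)
}

# A query returning four values per row yields one flat sequence.
flat <- list("20202021-102-2", "1", "voorzitter", "n.v.t",
             "20202021-102-2", "2", "Bromet", "GroenLinks")
frame <- flat_to_frame(flat, 4)
names(frame) <- c("Debate ID", "Speech", "Speaker", "Party")
```

All columns come back as character vectors, so numeric columns such as the speech order would still need as.integer().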
Binding
An XQuery expression may use external variables. Before they can be used, they have to be bound to the queryObject. In the following example, Bind() is used to bind $Regex to a regular expression.
library(dplyr)
library(knitr)
Query_Bind <- Session$Query(paste(
'declare variable $Regex external;',
'for $Debat at $CountOuter in collection("Parliament")',
' where $CountOuter <= 2',
' let $debate-id := fn:analyze-string(',
' $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',
' for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
' let $Spreker := $Speach/spreker/naam/achternaam/text()',
' for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
' let $tekst := fn:string-join(fn:data($par//al/text()), "\n")',
' return($debate-id, $Spreker, $tekst)'))
Query_Bind$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")
## $Binding
## character(0)
##
## $success
## [1] TRUE
BindResult <- Query_Bind$queryObject$ExecuteQuery()
ResultFrame <- result2frame(BindResult, 3)
names(ResultFrame) <- c('Debate ID','Speaker', 'Text')
kable(head(ResultFrame,1))
Debate ID | Speaker | Text |
---|---|---|
20202021-102 | voorzitter | Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven weer geschiedenis. … |
Iteration
Sometimes there is a need to iterate over the results of a query. When using the more()-next() construct, BaseX prefixes every item with a byte that represents the Type ID or XDM metadata. For every item, RBaseX creates a list whose elements are that byte and the value of the item. In the following example I’m only interested in every third item. The example shows how to iterate over all the items and also how to use the wrapper functions that are available for all the commands.
Query_Iter <- Session$Query(paste(
'declare variable $Regex external;',
'for $Debat at $CountOuter in collection("Parliament")',
' where $CountOuter <= 2',
' let $debate-id := fn:analyze-string(',
' $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',
' for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
' let $Spreker := $Speach/spreker/naam/achternaam/text()',
' for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
' let $tekst := fn:string-join(fn:data($par//al/text()), "\n")',
' return($debate-id, $Spreker, $tekst)'))
Query_Iter$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")
## $Binding
## character(0)
##
## $success
## [1] TRUE
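IterCnt <- 0
IterResult <- c()
# Fetch at most 11 items and keep every third one
while (IterCnt <= 10 && More(Query_Iter)) {
  NextItem <- Next(Query_Iter)
  IterCnt <- IterCnt + 1
  if (IterCnt %% 3 == 0) {
    IterResult <- c(IterResult, NextItem)
    # NextItem[[1]] holds the type byte and the value; show the value
    str(NextItem[[1]][[2]])
  }
}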
## chr "Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven we"| __truncated__
## chr "Voorzitter. Het was altijd al een eer om hier te staan. Vandaag is dat het in het bijzonder. We voelen ons alle"| __truncated__
## chr "Dank u wel. Het woord is nu aan de minister. Gaat uw gang."
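The more/next pattern can be tried out without a server. The sketch below is an assumption-laden mock, not the RBaseX implementation: make_mock_query is a hypothetical stand-in for a query object, its items are placeholder strings, and the type byte is an arbitrary 0x00. It only mimics the item shape described above, a list holding the byte and the value.

```r
# Hypothetical stand-in for a query object with more/next semantics.
# "next" is a reserved word in R, so the mock method is called nxt.
make_mock_query <- function(values, type_byte = as.raw(0)) {
  pos <- 0L
  list(
    more = function() pos < length(values),
    nxt = function() {
      pos <<- pos + 1L
      # Same shape as described above: a list with the byte and the value
      list(list(type_byte, values[[pos]]))
    }
  )
}

mock <- make_mock_query(as.list(paste("item", 1:7)))
kept <- character(0)
count <- 0L
while (mock$more()) {
  item <- mock$nxt()
  count <- count + 1L
  # Keep only every third value
  if (count %% 3L == 0L) kept <- c(kept, item[[1]][[2]])
}
```

After the loop, kept contains the third and sixth values.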