RBaseX, a BaseX client written in R

The primary programming language used in the Data Science course at Erasmus University Rotterdam was ‘R’. So when the topic ‘XML, XQuery and XSL’ was tackled, there was a need for a tool that could work with R. BaseX, an open-source XML database, offered the functionality needed for the exercises, but there was no R support. However, a server protocol was available in which all communication is fully described. This protocol has now been used to develop RBaseX. Version 1.1.1 of this client is available at https://cran.r-project.org/web/packages/RBaseX/index.html. All following examples are based on the minutes of the Dutch Parliament, which are freely available in XML format.
BaseX is written in Java. It is shipped as .jar, .zip, .exe or .war. On Windows and Linux, it works out of the box. All clients are based on the client/server architecture, so a BaseX database server must be started first.
RBaseX can be used in two modes:
  • Standard mode is used for connecting to a server and sending commands
  • Query mode is used for defining queries, binding variables and iterative evaluation

Standard mode

RBaseX is developed using the R6 package for object-oriented programming. All interactions between the client and BaseX begin with the creation of a new instance of BasexClient.
The following command creates a new BasexClient instance and stores it in the variable Session. This Session is used in all the examples.
library(RBaseX)
Session <- BasexClient$new("localhost", 1984L, username = "admin", password = "admin")
Through the Command() method, the BaseX Create statement can be used to create new databases. If a database with that name already exists, it is overwritten.
Existing databases can be opened with the Open statement. BaseX also has a convenience statement, Check, which combines Open and Create: if a database with that name already exists, it is opened; otherwise it is created (and opened).
Session$Command("Check Parliament")
After the client has sent a command to the server, the server responds by sending the outcome of the command as a raw vector (byte array). This outcome is converted to a list. In most cases this list has the elements $result and, if appropriate, $info. To indicate whether the process was successful, a single status byte is appended to the raw vector: 0x00 means success, 0x01 indicates an error. The value of this status byte is stored as a private variable in the Session instance and can be accessed as the $success attribute.
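The reply handling described above can be sketched in base R. This is a simplified illustration, not RBaseX's actual implementation: parse_response is a hypothetical helper, it assumes that the result and info strings are each terminated by a 0x00 byte (as in the BaseX server protocol) and it ignores the protocol's byte escaping. The reply itself is made up.

```r
# Hypothetical helper (not part of RBaseX): split a raw reply into its parts.
# Assumed layout: <result> 0x00 <info> 0x00 <status byte>
parse_response <- function(reply) {
  stopifnot(is.raw(reply), length(reply) >= 1)
  status <- reply[length(reply)]              # trailing status byte
  body   <- reply[-length(reply)]
  ends   <- which(body == as.raw(0x00))       # every string ends with 0x00
  starts <- c(1L, head(ends, -1) + 1L)[seq_along(ends)]
  parts  <- vapply(seq_along(ends), function(i) {
    rawToChar(body[seq_len(ends[i] - starts[i]) + starts[i] - 1L])
  }, character(1))
  list(result = parts[1], info = parts[2], success = status == as.raw(0x00))
}

# Made-up reply for illustration: result "740", an info text, status 0x00
reply <- c(charToRaw("740"), as.raw(0x00),
           charToRaw("Query executed."), as.raw(0x00),
           as.raw(0x00))
parsed <- parse_response(reply)
str(parsed)
```

Keeping the status in its own list element mirrors the idea of a separate $success value: the caller can test it without inspecting the result text.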
The public methods get_success() and set_success() are not described in the server protocol. Neither are set_intercept(), get_intercept() and restore_intercept(). I added them because I needed to catch errors. When the Intercept variable is FALSE and an error is detected in your code, execution stops. When it is TRUE, you can catch the error.
Scraping the site of the Parliament is surprisingly easy. A search for all the minutes of a given date returns a page that contains an ID and the number of documents. The two can be combined to construct a link to all the minutes in XML format. The files can be stored directly in the database. The following store function uses the BaseX Add command.
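# Catch errors (e.g. missing files) instead of stopping execution
Session$set_intercept(TRUE)
store <- function(Period) {
  for (i in seq_along(Period)) {
    # minutes() is a helper that is not shown in this post; it is assumed
    # to return the document paths found for one date
    Minutes <- minutes(Period[i])
    if (!identical(Minutes, list())) {
      Source <- paste0("https://zoek.officielebekendmakingen.nl/", Minutes)
      # Add every XML file to the database under the path "Debates"
      mapply(function(source) Session$Add("Debates", source), Source)
    }
  }
}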
Thanks to set_intercept(TRUE), errors caused by missing files are handled neatly. Downloading and storing the 740 files in the database took 3.5 minutes.
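How RBaseX implements the intercept is not shown here. As a plain-R analogue of the same idea, the sketch below wraps each add call in tryCatch() so that one failing file does not abort the whole run. add_all and fake_add are hypothetical names; fake_add only simulates a missing file and does not talk to a server.

```r
# Hypothetical helper: try to add every source, collect failures instead of stopping.
add_all <- function(sources, add_fun) {
  ok <- vapply(sources, function(src) {
    tryCatch({
      add_fun(src)
      TRUE
    }, error = function(e) {
      message("Skipped ", src, ": ", conditionMessage(e))
      FALSE
    })
  }, logical(1), USE.NAMES = FALSE)
  list(added = sources[ok], failed = sources[!ok])
}

# Stand-in for Session$Add(); fails for one made-up file name.
fake_add <- function(src) {
  if (grepl("missing", src)) stop("resource not found")
  invisible(src)
}

outcome <- add_all(c("h-1.xml", "h-missing.xml", "h-3.xml"), fake_add)
outcome$failed
```

Returning the failed sources makes it easy to retry them later or to report which minutes are absent from the database.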

Querying

‘Simple’ XQuery-statements can be handled as normal commands. So you can use the following statement to return the number of records that were inserted:
(Count <- Session$Command("xquery count(collection('Parliament'))")$result)
## [[1]]
## [1] "740"
I use the BaseX-GUI to develop queries. Queries that have been tested in the GUI are copied into the R-environment.
The following XQuery code returns, for every meeting, the debate ID, the names of the speakers, the order in which they spoke and the party they represent. result2frame() is a utility function which converts the results to a data frame.
library(glue)
library(knitr)
  
Query_Expr <-paste(
'import module namespace  functx = "http://www.functx.com";', 
'for $Debat in collection("Parliament")',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',
'  for $Speaker at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speaker/spreker/naam/achternaam/text()',
'    let $Pol := $Speaker/spreker/politiek/text()',
'    order by $debate-id, $CountInner',
'return($debate-id, $CountInner, $Spreker, ($Pol, "n.v.t")[1])')

result <- Session$Command(as.character(glue("xquery {Query_Expr}")))
result_frame <- result2frame(result$result, 4)
names(result_frame) <- c('Debate ID','Speach', 'Speaker', 'Party')

kable(head(result_frame,3))
Debate ID        Speach  Speaker     Party
20202021-102-10  1       voorzitter  n.v.t
20202021-102-10  2       voorzitter  n.v.t
20202021-102-2   1       voorzitter  n.v.t
As you can see, ordering by debate-id does not give the correct order: the IDs are compared as strings, so 20202021-102-10 sorts before 20202021-102-2. This is corrected if the digits following the second hyphen are padded with zeros. Since some meetings have more than 100 items, I pad to length 3. It is easy to write a function for this transformation and incorporate it in the query expression.
import module namespace functx = "http://www.functx.com";
declare function local:order_id
  ( $Meeting as xs:string)
{ let $debate-id := fn:analyze-string( $Meeting, "(\d{8}-\d*)-(\d*)")
  let $date := $debate-id//fn:match/*:group[@nr="1"]/text()
  let $item-nr := functx:pad-integer-to-length($debate-id//fn:match/*:group[@nr="2"]/number(),3)
  return fn:string-join(($date, $item-nr), "-")
};
After incorporating this function in the query, ordering is correct.
Debate ID       Speach  Speaker     Party
20202021-102-2  1       voorzitter  n.v.t
20202021-102-2  2       Bromet      GroenLinks
20202021-102-2  3       voorzitter  n.v.t
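The effect of the padding can also be checked in plain R, without a server. The sketch below is an R counterpart of the idea behind local:order_id, not a translation used by the package; pad_debate_id is a hypothetical helper and the third ID is made up for the comparison.

```r
# Hypothetical R counterpart of local:order_id: zero-pad the item number.
pad_debate_id <- function(id, width = 3) {
  parts <- regmatches(id, regexec("^([0-9]{8}-[0-9]*)-([0-9]*)$", id))
  vapply(parts, function(p) {
    paste0(p[2], "-", formatC(as.integer(p[3]), width = width, flag = "0"))
  }, character(1))
}

ids <- c("20202021-102-10", "20202021-102-2", "20202021-102-3")

sort(ids, method = "radix")             # plain string order
padded <- pad_debate_id(ids)
ids[order(padded, method = "radix")]    # order after padding
```

With plain string comparison the item ending in -10 comes first; after padding, the items come out as -2, -3, -10.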

Query mode

ExecuteQuery

Session$Query() does not execute a query expression. It creates a new instance of the QueryClass and adds this instance to the Session variable. It is this QueryClass instance that has an ExecuteQuery() method. The following example demonstrates this by extracting the text from the minutes.
Query_Expr <-paste(
'import module namespace  functx = "http://www.functx.com";', 
'for $Debat at $CountOuter in collection("Parliament")',
'    where $CountOuter <=2',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, "(\\d{8}-\\d*-\\d*)")//fn:match/*:group[@nr="1"]/text()',  
'  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speach/spreker/naam/achternaam/text()',
'    let $Pol := $Speach/spreker/politiek/text()',
'    order by $debate-id, $CountInner',
'    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
'       let $tekst := fn:string-join(fn:data($par//al/text()), "\n")',
'    return($debate-id, $Spreker, ($Pol, "n.v.t")[1], $CountPar, $tekst)'
)

QueryExe <- Session$Query(Query_Expr)
result <- QueryExe$queryObject$ExecuteQuery()
result_frame <- result2frame(result, 5)
names(result_frame) <- c('Debate ID','Speaker', 'Party', 'Order', 'Text')

kable(head(result_frame,1))
Debate ID       Speaker     Party  Order  Text
20202021-102-2  voorzitter  n.v.t  1      Allereerst hebben we het traditionele mondelinge vragenuur. …
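result2frame(result, 5) above turns the flat sequence returned by the query into a data frame with five columns. The package's own implementation is not shown in this post; the base R sketch below only illustrates the reshaping idea with a hypothetical helper, flat_to_frame, and made-up values.

```r
# Hypothetical helper: fill a data frame row by row from a flat sequence.
flat_to_frame <- function(values, ncol) {
  values <- unlist(values, use.names = FALSE)
  stopifnot(length(values) %% ncol == 0)    # need complete rows
  as.data.frame(matrix(values, ncol = ncol, byrow = TRUE),
                stringsAsFactors = FALSE)
}

# Made-up flat result: two rows of (id, speaker, party)
flat <- list("20202021-102-2", "voorzitter", "n.v.t",
             "20202021-102-2", "Bromet", "GroenLinks")
frame <- flat_to_frame(flat, 3)
names(frame) <- c("Debate ID", "Speaker", "Party")
frame
```

The column count has to match the number of values in the query's return clause, which is why it is passed explicitly.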

Binding

An XQuery expression may use external variables. Before they can be used, they have to be bound to the queryObject. In the following example, Bind() is used to bind $Regex to a regular expression.
library(dplyr)
library(knitr)
Query_Bind <- Session$Query(paste(
'declare variable $Regex external;',
'for $Debat at $CountOuter in collection("Parliament")',
'    where $CountOuter <= 2',
'  let $debate-id := fn:analyze-string(',
'    $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',  
'  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
'    let $Spreker := $Speach/spreker/naam/achternaam/text()',
'    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
'       let $tekst := fn:string-join(fn:data($par//al/text()), "\n")',
'    return($debate-id, $Spreker, $tekst)'))

Query_Bind$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")
## $Binding
## character(0)
## 
## $success
## [1] TRUE
BindResult <- Query_Bind$queryObject$ExecuteQuery()
ResultFrame <- result2frame(BindResult, 3)
names(ResultFrame) <- c('Debate ID','Speaker', 'Text')

kable(head(ResultFrame,1))
Debate ID     Speaker     Text
20202021-102  voorzitter  Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven weer geschiedenis. …
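The doubled backslashes in the bound value are R string escapes; the string that is sent contains single backslashes. A quick check in plain R, which needs no server, shows the pattern that is actually bound and what its first group captures. The sample ID is taken from the earlier output; using perl = TRUE is my choice for this R-side check and says nothing about how BaseX evaluates the expression.

```r
Regex <- "(\\d{8}-\\d*)-\\d*"
cat(Regex, "\n")            # the pattern as it is sent, with single backslashes
nchar(Regex)                # number of characters after R has resolved the escapes

id <- "20202021-102-2"
groups <- regmatches(id, regexec(Regex, id, perl = TRUE))[[1]]
groups[2]                   # first capture group
```

Because the group stops before the second hyphen, the item number is dropped, which is why the Debate ID column above shows 20202021-102.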


Iteration

Sometimes there is a need to iterate over the results of a query. When using the More()/Next() construct, BaseX prefixes every item with a byte that represents the type ID or XDM metadata. For every item, RBaseX creates a list whose elements are the byte and the value of the item.
In the following example I’m only interested in every third item. This example shows how to iterate over all the items and also shows how to use the wrapper functions which are available for all the commands.
Query_Iter <- Session$Query(paste(
  'declare variable $Regex external;',
  'for $Debat at $CountOuter in collection("Parliament")',
  '    where $CountOuter <= 2',
  '  let $debate-id := fn:analyze-string(',
  '    $Debat/officiele-publicatie/metadata/meta/@content, $Regex)//fn:match/*:group[@nr="1"]/text()',  
  '  for $Speach at $CountInner in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt',
  '    let $Spreker := $Speach/spreker/naam/achternaam/text()',
  '    for $par at $CountPar in $Debat/officiele-publicatie/handelingen/agendapunt/spreekbeurt/tekst',
  '       let $tekst := fn:string-join(fn:data($par//al/text()), "\n")',
  '    return($debate-id, $Spreker, $tekst)'))

Query_Iter$queryObject$Bind("$Regex", "(\\d{8}-\\d*)-\\d*")
## $Binding
## character(0)
## 
## $success
## [1] TRUE
IterCnt <- 0
IterResult <- c()
while (IterCnt <= 10 && More(Query_Iter)) {
  NextItem <- Next(Query_Iter)
  IterCnt <- IterCnt + 1
  if (IterCnt %% 3 == 0) {
    IterResult <- c(IterResult, NextItem)
    (str(NextItem[[1]][[2]]))
  }
}
##  chr "Allereerst hebben we het traditionele mondelinge vragenuur. Dat is ook weer voor het eerst, dus we schrijven we"| __truncated__
##  chr "Voorzitter. Het was altijd al een eer om hier te staan. Vandaag is dat het in het bijzonder. We voelen ons alle"| __truncated__
##  chr "Dank u wel. Het woord is nu aan de minister. Gaat uw gang."
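The item structure described above can be mimicked without a server. In this sketch, items is a made-up stand-in for what repeated Next() calls would collect: one two-element list per item, holding a type byte and a value. The byte 0x00 is a placeholder, not a real BaseX type ID, and the values are abbreviated.

```r
# Made-up items: list(<type byte>, <value>) for each result item
values <- c("20202021-102", "voorzitter", "Allereerst ...",
            "20202021-102", "Bromet", "Voorzitter. ...")
items <- lapply(values, function(v) list(as.raw(0x00), v))

# Keep every third item (the text column) and pull out the values
third <- items[seq_along(items) %% 3 == 0]
texts <- vapply(third, function(item) item[[2]], character(1))
texts
```

Selecting by position works here because the query returns exactly three values per row, so every third item is the text.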

BaseX modules

BaseX provides much more functionality. A complete overview can be found at https://docs.basex.org/wiki/Main_Page.

Published by

Ben Engbers

I was trained as a biologist but have never worked as one. From 1984 until my retirement in 2021, I worked in the IT sector. In the last years I worked daily as a data scientist with R for a government organisation. Now I work on projects that I enjoy and did not have time for before.
