HebRew (using Hebrew in R)

Adi Sarid

8 years ago

Interested in publishing a one-time post on R-bloggers.com? Press here to learn how.

Adi Sarid (Tel Aviv university and Sarid Research Institute LTD.)

July-2017

Background

A while back I participated in an R workshop, in the annual convention of the Israeli Association for Statistics. I had the pleasure of talking with Tal Galili and Jonathan Rosenblatt which indicated that a lot of Israeli R users run into difficulties with Hebrew with R. My firm opinion is that its best to keep everything in English, but sometimes you simply don’t have a choice. For example, I had to prepare a number of R shiny dashboards to Hebrew speaking clients, hence Hebrew was the way to go, in a kind of English-Hebrew “Mishmash” (mix). I happened to run into a lot of such difficulties in the past, so here are a few pointers to get you started when working in R with Hebrew. This post deals with Reading and writing files which contain Hebrew characters. Note, there is also a bit to talk about in the context of Shiny apps which contain Hebrew and using right-to-left in shiny apps, and using Hebrew variable names. Both work with some care, but I won’t cover them here. If you have any other questions you’d like to see answered, feel free to contact me adisarid@gmail.com.

Reading and writing files with Hebrew characters

R can read and write files in many formats. The common formats for small to medium data sets include the comma separated values (*.csv), and excel files (*.xlsx, *.xls). Each such read/write action is facilitated using some kind of “encoding”. Encoding, in simple terms, is a definition of a character set which help you operating system to interpret and represent the character as it should (לדוגמה, תווים בעברית). There are a number of relevant character sets (encodings) when Hebrew is concerned:

UTF-8
ISO 8859-8
Windows-1255

When you try to read files in which there are Hebrew characters, I usually recommend trying to read them in that order- UTF-8 is commonly used by a lot of applications, since it covers a lot of languages.

Using csv files with Hebrew characters

Here’s an example for something that can go wrong, and a possible solution. In this case I’ve prepared a csv file which encoded with UTF-8. When using R’s standard read.csv function, this is what happens:

sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
sample.data

##    ן...... X......        X............
## 1 ׳¨׳•׳ ׳™      25             ׳—׳™׳₪׳”
## 2 ׳׳•׳˜׳™      77         ׳”׳¨׳¦׳׳™׳”
## 3   ׳“׳ ׳™      13 ׳×׳-׳׳‘׳™׳‘ ׳™׳₪׳•
## 4 ׳¨׳¢׳•׳×      30  ׳§׳¨׳™׳× ׳©׳׳•׳ ׳”
## 5   ׳“׳ ׳”      44        ׳‘׳™׳× ׳©׳׳

Oh boy, that’s probably not what the file’s author had in mind. Let’s try to instruct read.csv to use a different encoding.

sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
                        encoding = "UTF-8")
sample.data

##   X.U.FEFF.שם גיל      מגורים
## 1        רוני  25        חיפה
## 2        מוטי  77      הרצליה
## 3         דני  13 תל-אביב יפו
## 4        רעות  30  קרית שמונה
## 5         דנה  44     בית שאן

A bit better isn’t it? However, not perfect. We can read the Hebrew, but there is a weird thing in the header “X.U.FEFF”. A better way to read and write files (much more than just encoding aspects – it’s quicker reading large files) is using the readr package which is part of the tidyverse. On a side note, if you haven’t already, install.packages(tidyverse), it’s a must. It includes readr but a lot more goodies (read on). Now, for some tools you get with readr:

library(readr)
locale("he")

## <locale>
## Numbers:  123,456.78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   יום ראשון (יום א׳), יום שני (יום ב׳), יום שלישי (יום ג׳), יום
##         רביעי (יום ד׳), יום חמישי (יום ה׳), יום שישי (יום ו׳), יום
##         שבת (שבת)
## Months: ינואר (ינו׳), פברואר (פבר׳), מרץ (מרץ), אפריל (אפר׳), מאי (מאי),
##         יוני (יוני), יולי (יולי), אוגוסט (אוג׳), ספטמבר (ספט׳),
##         אוקטובר (אוק׳), נובמבר (נוב׳), דצמבר (דצמ׳)
## AM/PM:  לפנה״צ/אחה״צ

guess_encoding("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")

## # A tibble: 2 × 2
##   encoding confidence
##      <chr>      <dbl>
## 1    UTF-8       1.00
## 2   KOI8-R       0.98

First we used locale() which knows the date format and default encoding for the language (UTF-8 in this case). On it’s own locale() does nothing than output the specs of the locale, but when used in conjuction with read_csv it tells read_csv everything it needs to know. Also note the use of guess_encoding which reads the first “few” lines of a file (10,000 is the default) which helps us, well… guess the encoding of a file. You can see that readr is pretty confident we need the UTF-8 here (and 98% confident we need a Korean encoding, but first option wins here…)

sample.data <- read_csv(file = "http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
                        locale = locale(date_names = "he", encoding = "UTF-8"))

## Parsed with column specification:
## cols(
##   שם = col_character(),
##   גיל = col_integer(),
##   מגורים = col_character()
## )

sample.data

## # A tibble: 5 × 3
##      שם   גיל      מגורים
##   <chr> <int>       <chr>
## 1  רוני    25        חיפה
## 2  מוטי    77      הרצליה
## 3   דני    13 תל-אביב יפו
## 4  רעות    30  קרית שמונה
## 5   דנה    44     בית שאן

Awesome isn’t it? Note that the resulting sample.data is a tibble and not a data.frame (read about tibbles). The package readr has tons of functions features to help us with reading (writing) and controlling the encoding, so I definitely recommend it. By the way, try using read_csv without setting the locale parameter and see what happens.

What about files saved by Excel?

Excel files are not the best choice for storing datasets, but the format is extremely common for obvious reasons.

CSV files which were saved by excel

In the past, I had run to a lot of difficulties trying to load CSV files which were saved by excel into R. Excel seems to save them in either “Windows-1255” or “ISO-8859-8”, instead of “UTF-8”. The default read by read_csv might yield something like “” instead of “שלום”. In other cases you might get a “multibyte error”. Just make sure you check the “Windows-1255” or “ISO-8859-8” encodings if the standard UTF-8 doesn’t work well (i.e., use read_csv(file, locale = locale(encoding = "ISO-8859-8"))).

Reading directly from excel

Also, if the original is in Excel, you might want to consider reading it directly from the excel file (skipping CSVs entirely). There are a number of packages for reading excel files and I recommend using readxl, specifically read_xlsx or read_xls will do the trick (depending on file format). You don’t even have to specify the encoding, if there are Hebrew characters they will be read as they should be.

Summary

For reading csv files with Hebrew characters, it’s very convenient to use readr. The package has a lot of utilities for language encoding and localization like guess_encoding and locale. If the original data is in excel, you might want to try skipping the csv and read the data directly from the excel format using the readxl package. Somtimes reading files envolves a lot of trial and error – but eventually it will work.

Don’t give up!