Adi Sarid (Tel Aviv university and Sarid Research Institute LTD.)
July-2017
Background
A while back I participated in an R workshop at the annual convention of the Israeli Association for Statistics. I had the pleasure of talking with Tal Galili and Jonathan Rosenblatt, who indicated that a lot of Israeli R users run into difficulties working with Hebrew in R. My firm opinion is that it's best to keep everything in English, but sometimes you simply don't have a choice. For example, I had to prepare a number of R Shiny dashboards for Hebrew-speaking clients, so Hebrew was the way to go, in a kind of English-Hebrew "Mishmash" (mix). I have run into a lot of such difficulties in the past, so here are a few pointers to get you started when working in R with Hebrew. This post deals with reading and writing files which contain Hebrew characters. Note that there is also a bit to say about Shiny apps which contain Hebrew, right-to-left layout in Shiny apps, and Hebrew variable names. Both work with some care, but I won't cover them here. If you have any other questions you'd like to see answered, feel free to contact me at adisarid@gmail.com.
Reading and writing files with Hebrew characters
R can read and write files in many formats. The common formats for small to medium data sets include comma separated values (*.csv) and Excel files (*.xlsx, *.xls). Each such read/write action is facilitated using some kind of "encoding". Encoding, in simple terms, is a definition of a character set which helps your operating system interpret and represent characters as they should be (for example, Hebrew characters). There are a number of relevant character sets (encodings) where Hebrew is concerned:
- UTF-8
- ISO 8859-8
- Windows-1255
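Base R can also convert text between these character sets; here's a minimal, purely illustrative sketch:
# iconvlist() shows every encoding name your platform knows about,
# so you can check how "Windows-1255" or "ISO-8859-8" is spelled locally.
grep("1255|8859-8", iconvlist(), value = TRUE)
# iconv() converts a character vector from one character set to another:
iconv("שלום", from = "UTF-8", to = "ISO-8859-8")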
Using csv files with Hebrew characters
Here's an example of something that can go wrong, and a possible solution. In this case I've prepared a csv file which is encoded with UTF-8. When using R's standard read.csv function, this is what happens:
sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
sample.data
## ן...... X...... X............
## 1 ׳¨׳•׳ ׳™ 25 ׳—׳™׳₪׳”
## 2 ׳׳•׳˜׳™ 77 ׳”׳¨׳¦׳׳™׳”
## 3 ׳“׳ ׳™ 13 ׳×׳-׳׳‘׳™׳‘ ׳™׳₪׳•
## 4 ׳¨׳¢׳•׳× 30 ׳§׳¨׳™׳× ׳©׳׳•׳ ׳”
## 5 ׳“׳ ׳” 44 ׳‘׳™׳× ׳©׳׳
Oh boy, that's probably not what the file's author had in mind. Let's try to instruct read.csv to use a different encoding.
sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
encoding = "UTF-8")
sample.data
## X.U.FEFF.שם גיל מגורים
## 1 רוני 25 חיפה
## 2 מוטי 77 הרצליה
## 3 דני 13 תל-אביב יפו
## 4 רעות 30 קרית שמונה
## 5 דנה 44 בית שאן
A bit better, isn't it? However, not perfect. We can read the Hebrew, but there is a weird thing in the header: "X.U.FEFF" (that's the UTF-8 byte order mark, or BOM, leaking into the column name). A better way to read and write files (much more than just encoding aspects – it's also quicker at reading large files) is the readr package, which is part of the tidyverse. On a side note, if you haven't already, run install.packages("tidyverse"), it's a must. It includes readr and a lot more goodies (read on).
Now, for some tools you get with readr:
library(readr)
locale("he")
## <locale>
## Numbers:  123,456.78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   יום ראשון (יום א׳), יום שני (יום ב׳), יום שלישי (יום ג׳), יום
##         רביעי (יום ד׳), יום חמישי (יום ה׳), יום שישי (יום ו׳), יום
##         שבת (שבת)
## Months: ינואר (ינו׳), פברואר (פבר׳), מרץ (מרץ), אפריל (אפר׳), מאי (מאי),
##         יוני (יוני), יולי (יולי), אוגוסט (אוג׳), ספטמבר (ספט׳),
##         אוקטובר (אוק׳), נובמבר (נוב׳), דצמבר (דצמ׳)
## AM/PM:  לפנה״צ/אחה״צ
guess_encoding("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
## # A tibble: 2 × 2
##   encoding confidence
##      <chr>      <dbl>
## 1    UTF-8       1.00
## 2   KOI8-R       0.98
First we used locale(), which knows the date format and default encoding for the language (UTF-8 in this case). On its own, locale() does nothing more than output the specs of the locale, but when used in conjunction with read_csv it tells read_csv everything it needs to know. Also note the use of guess_encoding, which reads the first "few" lines of a file (10,000 is the default) and helps us, well… guess the encoding of the file. You can see that readr is pretty confident we need UTF-8 here (and 98% confident it could be KOI8-R, a Russian encoding, but the first option wins here…).
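If you'd rather not hard-code the encoding, one possible approach (a minimal sketch) is to take the top guess returned by guess_encoding and feed it straight into locale():
# guess_encoding returns its candidates sorted by confidence, so the first
# row holds the most likely encoding; pass it on to read_csv via locale().
best_guess <- guess_encoding("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")$encoding[1]
sample.data <- read_csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
                        locale = locale(encoding = best_guess))
And here is the read with the Hebrew locale spelled out explicitly: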
sample.data <- read_csv(file = "http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
locale = locale(date_names = "he", encoding = "UTF-8"))
## Parsed with column specification:
## cols(
##   שם = col_character(),
##   גיל = col_integer(),
##   מגורים = col_character()
## )
sample.data
## # A tibble: 5 × 3
##   שם    גיל   מגורים
##   <chr> <int> <chr>
## 1 רוני  25    חיפה
## 2 מוטי  77    הרצליה
## 3 דני   13    תל-אביב יפו
## 4 רעות  30    קרית שמונה
## 5 דנה   44    בית שאן
Awesome, isn't it? Note that the resulting sample.data is a tibble and not a data.frame (read about tibbles).
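If some older function insists on a plain data.frame, converting back is a one-liner:
# Convert the tibble to a classic data.frame if needed.
sample.df <- as.data.frame(sample.data)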
The readr package has tons of functions and features to help us with reading (and writing) files and controlling the encoding, so I definitely recommend it. By the way, try using read_csv without setting the locale parameter and see what happens.
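On the writing side, one readr function worth knowing about (a small sketch, assuming you want the resulting csv to open nicely in Excel) is write_excel_csv, which writes UTF-8 with a byte order mark so that Excel displays the Hebrew correctly:
library(readr)
# write_csv writes plain UTF-8; write_excel_csv adds a BOM, which Excel
# uses to detect the encoding when opening the csv directly.
write_excel_csv(sample.data, "hebrew_sample_out.csv")  # hypothetical output path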
What about files saved by Excel?
Excel files are not the best choice for storing datasets, but the format is extremely common for obvious reasons.
CSV files which were saved by Excel
In the past, I ran into a lot of difficulties trying to load CSV files which were saved by Excel into R. Excel seems to save them in either "Windows-1255" or "ISO-8859-8" instead of "UTF-8". The default read by read_csv might yield something like "" instead of "שלום". In other cases you might get a "multibyte error". Just make sure you check the "Windows-1255" or "ISO-8859-8" encodings if the standard UTF-8 doesn't work well (i.e., use read_csv(file, locale = locale(encoding = "ISO-8859-8"))).
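A small sketch of that trial-and-error process (the file name here is hypothetical):
library(readr)
# Try the Excel-flavoured encodings one by one until the Hebrew renders properly.
read_csv("exported_from_excel.csv", locale = locale(encoding = "Windows-1255"))
read_csv("exported_from_excel.csv", locale = locale(encoding = "ISO-8859-8"))
# guess_encoding can also shortlist the candidates for you:
guess_encoding("exported_from_excel.csv")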
Reading directly from Excel
Also, if the original is in Excel, you might want to consider reading it directly from the Excel file (skipping CSVs entirely). There are a number of packages for reading Excel files, and I recommend readxl; specifically, read_xlsx or read_xls will do the trick (depending on the file format). You don't even have to specify the encoding: if there are Hebrew characters, they will be read as they should be.
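A minimal sketch (the file name is hypothetical; readxl reads local files, so download the file first if it lives at a URL):
library(readxl)
# read_xlsx handles .xlsx files; read_xls handles the older .xls format.
hebrew_data <- read_xlsx("hebrew_sample.xlsx", sheet = 1)
hebrew_data
# Hebrew column names and values come through without any encoding arguments.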
Summary
For reading csv files with Hebrew characters, it's very convenient to use readr. The package has a lot of utilities for language encoding and localization, like guess_encoding and locale.
If the original data is in Excel, you might want to try skipping the csv and reading the data directly from the Excel format using the readxl package.
Sometimes reading files involves a lot of trial and error – but eventually it will work. Don't give up!