Chapter 1 Preparing Data
Before we create create graphs or look at summaries, we often need to tidy up the data a bit. ArabBarometR
provides a suite of functions to ease this process. In particular, it is useful to create a character columns: with the names of countries rather than their numeric codes; for commonly used demographics (gender, age, income, education, and settlement); and dummied versions of various variables.
1.1 TL;DR
Just use the add_demographics()
function.
The function add_demographics()
preforms all the functions below: recode_country()
, gender_demographic()
, age_demographic()
, income_demographic()
, education_demographic()
, and settlement_demographic()
. You do not need to use the functions from the following sections if you use add_demographics()
.
The input for add_demographics()
is an Arab Barometer data frame. The output is an Arab Barometer data frame with the additional columns Country
, gender
, age
, income
, education
, and settlement
.
1.2 recode_country()
The input of the recode_country()
function is an Arab Barometer data frame. The function takes the COUNTRY
variable from an Arab Barometer data frame and creates a Country
variable. The original variable COUNTRY
should be a labeled numeric value where the numeric value is the country code for the country in which the interview was conducted, and the label is the name of the country. The output variable Country
will be a character vector of country names. This function was originally created by Patrick, and updated by MaryClare.
1.3 Demographic functions
1.3.1 Basics
There are five functions that can be used to create the standard demographic breakdowns Arab Barometer produces graphs for:
gender_demographic()
: Based onQ1002
, outputs variablegender
age_demographic()
: Based onQ1001
2, outputs variableage
income_demographic()
: Based onQ1016
, outputs variableincome
education_demographic()
: Based onQ1003
, outputs variableeducation
settlement_demographic()
: Based onQ13
, outputs variablesettlement
Each of the aforementioned functions generates a dichotomous character variable. The input for each function is an Arab Barometer data frame.
1.4 Advanced
Of the five basic demographic creation functions, the user has the ability to manipulate the output of two of them: age_demographic()
and education_demographic()
. This is because, unlike the other demographics, age is a continuous variable and education has multiple levels. Therefore, the point at which the variables bifurcate can be changed.
1.5 age_demographic()
First let’s look at the age_demographic()
function. As seen above, the default output is either "18-29"
or "30+"
. By setting the parameter .range
to N, the user can change the returned values to "18-(N-1)"
and "N+"
. For example, if the user wanted to compare opinions of those over 60 with the opinions of those under 60, the user could set the parameter .range
to 60, so the function age_demographic()
returns "18-59"
and "60+"
.
##
## 18-59 60+
## 2702 296
1.6 education_demographic()
The education_demographic()
function creates a dichotomous character column in the data frame titled education
. You can enter a cutoff by entering a numeric value between 1 and 7 or a character string from value labels of the column Q1003 for the .edu_lvl
parameter. The default for this parameter is "Secondary"
. The new column education
has the values "Max Secondary"
or "Higher"
. If you change the .edu_lvl
parameter, the “education” column will have the values "Max .edu_lvl"
or "Higher"
. If you enter a numeric value, the function will use the value label associated with that number.
1.6.1 Examples:
# English:
## Default parameters:
survey1 %>%
education_demographic() %>% # Creates an "education" column
select(education) %>% # Select just the "education" column
table() # Create table of the "education" column values
## education
## Above Secondary Max Secondary
## 1666 1328
## Changing the cutoff with a character string:
survey1 %>%
education_demographic("BA") %>%
select(education) %>%
table()
## education
## Above BA Max BA
## 324 2670
## Changing the cutoff with a numeric:
survey1 %>%
education_demographic(3) %>%
select(education) %>%
table()
## education
## Above Preparatory/Basic Max Preparatory/Basic
## 2423 571
1.7 dummy_all()
The dummy_all()
function takes one or more factor variables and creates a dummy column for each factor level. The new columns are given the original variable name then appended with an underscore and the factor level being dummied.
In the example below, you can see when dummy_all()
is used on the variable b
from the data frame df
, two new columns are created: b_1
and b_2
.
By default, dummy_all()
does not create a dummy column for NA’s. The can be changed, however, by specifying no_NA = FALSE
.
The function dummy_all()
takes as input the data frame variables are being selected from and the variable you want to create dummy columns for. More details can be found by typing help("dummy_all")
into the R Console.
1.7.1 Examples:
# Sample Data:
a <- c(1,3,2,4,2,3,4,2,2)
b <- c(1,2,1,1,2,1,2,2,2)
c <- c(2,5,1,1,5,NA,2,NA,5)
df <- data.frame(a,b,c)
# Original data frame:
names(df)
## [1] "a" "b" "c"
# Creating dummy columns for one variable:
df_1 <- dummy_all(df, # Data frame
b # Variable to dummy
)
names(df_1)
## [1] "a" "b" "c" "b_1" "b_2"
# Creating dummy columns for multiple variables:
df_2 <- df %>% # Data frame
dummy_all(a, # Variable 1 to dummy
b # Variable 2 to dummy
)
names(df_2)
## [1] "a" "b" "c" "a_1" "a_2" "a_3" "a_4" "b_1" "b_2"
# Creating NA dummy column:
df_3 <- df %>% # Data frame
dummy_all(c, # Variable to dummy
no_NA = FALSE # Include dummy for NA occurrences
)
1.8 dummy_select()
For some variables, we want to create dummy columns for all factor levels, but for other we only want to dummy out a specific level. The dummy_select()
variable allows the user to select the factor level to create a dummy column for. The function creates a new variable named in the same style as dummy_all()
; the variable name, underscore, the factor level.
The dummy_select()
column also allows the user to replace the current column with the new dummy column by including the option inplace = TRUE
.
You can use the function for more than one variable at a time, but not more than one factor level at a time. For example, if two columns both have 1 as a factor level and you want to create a dummy for when the factor is 1, you can include both columns in dummy_select()
. However, you cannot create a dummy for when the factor(s) are 1 and 2 in the same command. You would have to run the command once for level 1 and once for level 2.
The following example uses the same sample data as before. More details can be found by typing help("dummy_select")
into your R Console.
1.8.1 Examples:
# Sample Data:
a <- c(1,3,2,4,2,3,4,2,2)
b <- c(1,2,1,1,2,1,2,2,2)
c <- c(2,5,1,1,5,NA,2,NA,5)
df <- data.frame(a,b,c)
names(df)
## [1] "a" "b" "c"
# Adding new column:
df_1 <- dummy_select(df, # Data frame
b, # Variable to dummy
dummy_lvl = 1 # Level to dummy
)
names(df_1)
## [1] "a" "b" "c" "b_1"
# Adding multiple new column:
df_2 <- df %>% # Data frame
dummy_select(b, # Variable 1 to dummy
c, # Variable 2 to dummy
dummy_lvl = 1 # Level to dummy
)
names(df_2)
## [1] "a" "b" "c" "b_1" "c_1"
# Replacing column:
df_3 <- dummy_select(df, # Data frame
a, # Variable to dummy
dummy_lvl = 4, # Level to dummy
inplace = TRUE # Replace old variable
)
table(df$a)
##
## 1 2 3 4
## 1 4 2 2
##
## 0 1
## 7 2
An update to the function
age_demographic()
has been requested and is on the docket. The update requests that the function takes account ages missing fromQ1001
if they are supplied byQ1001YEAR
orQ1001APPROX
.↩︎