Chapter 1 Preparing Data

Before we create create graphs or look at summaries, we often need to tidy up the data a bit. ArabBarometR provides a suite of functions to ease this process. In particular, it is useful to create a character columns: with the names of countries rather than their numeric codes; for commonly used demographics (gender, age, income, education, and settlement); and dummied versions of various variables.

1.1 TL;DR

Just use the add_demographics() function.

The function add_demographics() preforms all the functions below: recode_country(), gender_demographic(), age_demographic(), income_demographic(), education_demographic(), and settlement_demographic(). You do not need to use the functions from the following sections if you use add_demographics().

The input for add_demographics() is an Arab Barometer data frame. The output is an Arab Barometer data frame with the additional columns Country, gender, age, income, education, and settlement.

1.1.0.1 Example:

survey1 <- add_demographics(survey1)
table(survey1$Country)

## 
## Algeria Lebanon Morocco 
##     998    1000    1002

table(survey1$gender)

## 
## Female   Male 
##   1495   1505

table(survey1$age)

## 
## 18-29   30+ 
##  1161  1837

table(survey1$income)

## 
##    Can cover expenses Cannot cover expenses 
##                  1565                  1406

table(survey1$education)

## 
## Above Secondary   Max Secondary 
##            1666            1328

table(survey1$settlement)

## 
## Rural Urban 
##  1492  1508

1.2 `recode_country()`

The input of the recode_country() function is an Arab Barometer data frame. The function takes the COUNTRY variable from an Arab Barometer data frame and creates a Country variable. The original variable COUNTRY should be a labeled numeric value where the numeric value is the country code for the country in which the interview was conducted, and the label is the name of the country. The output variable Country will be a character vector of country names. This function was originally created by Patrick, and updated by MaryClare.

1.2.1 Example:

table(survey1$COUNTRY)

## 
##    1   10   13 
##  998 1000 1002

survey1 <- recode_country(survey1)
table(survey1$Country)

## 
## Algeria Lebanon Morocco 
##     998    1000    1002

1.3 Demographic functions

1.3.1 Basics

There are five functions that can be used to create the standard demographic breakdowns Arab Barometer produces graphs for:

gender_demographic() : Based on Q1002, outputs variable gender
age_demographic() : Based on Q1001², outputs variable age
income_demographic() : Based on Q1016, outputs variable income
education_demographic() : Based on Q1003, outputs variable education
settlement_demographic() : Based on Q13, outputs variable settlement

Each of the aforementioned functions generates a dichotomous character variable. The input for each function is an Arab Barometer data frame.

1.3.1.1 Example:

# Gender
survey1 <- gender_demographic(survey1)
table(survey1$gender)

## 
## Female   Male 
##   1495   1505

# Age
survey1 <- age_demographic(survey1)
table(survey1$age)

## 
## 18-29   30+ 
##  1161  1837

# Income
survey1 <- income_demographic(survey1)
table(survey1$income)

## 
##    Can cover expenses Cannot cover expenses 
##                  1565                  1406

# Education
survey1 <- education_demographic(survey1)
table(survey1$education)

## 
## Above Secondary   Max Secondary 
##            1666            1328

# Settlement
survey1 <- settlement_demographic(survey1)
table(survey1$settlement)

## 
## Rural Urban 
##  1492  1508

1.4 Advanced

Of the five basic demographic creation functions, the user has the ability to manipulate the output of two of them: age_demographic() and education_demographic(). This is because, unlike the other demographics, age is a continuous variable and education has multiple levels. Therefore, the point at which the variables bifurcate can be changed.

1.5 `age_demographic()`

First let’s look at the age_demographic() function. As seen above, the default output is either "18-29" or "30+". By setting the parameter .range to N, the user can change the returned values to "18-(N-1)" and "N+". For example, if the user wanted to compare opinions of those over 60 with the opinions of those under 60, the user could set the parameter .range to 60, so the function age_demographic() returns "18-59" and "60+".

survey1 <- age_demographic(survey1,
                           .range = 60)
table(survey1$age)

## 
## 18-59   60+ 
##  2702   296

1.6 `education_demographic()`

The education_demographic() function creates a dichotomous character column in the data frame titled education. You can enter a cutoff by entering a numeric value between 1 and 7 or a character string from value labels of the column Q1003 for the .edu_lvl parameter. The default for this parameter is "Secondary". The new column education has the values "Max Secondary" or "Higher". If you change the .edu_lvl parameter, the “education” column will have the values "Max .edu_lvl" or "Higher". If you enter a numeric value, the function will use the value label associated with that number.

1.6.1 Examples:

# English:
## Default parameters:
survey1 %>%
  education_demographic() %>% # Creates an "education" column
  select(education) %>% # Select just the "education" column
  table() # Create table of the "education" column values

## education
## Above Secondary   Max Secondary 
##            1666            1328

## Changing the cutoff with a character string:
survey1 %>%
  education_demographic("BA") %>%
  select(education) %>%
  table()

## education
## Above BA   Max BA 
##      324     2670

## Changing the cutoff with a numeric:
survey1 %>%
  education_demographic(3) %>%
  select(education) %>%
  table()

## education
## Above Preparatory/Basic   Max Preparatory/Basic 
##                    2423                     571

1.7 `dummy_all()`

The dummy_all() function takes one or more factor variables and creates a dummy column for each factor level. The new columns are given the original variable name then appended with an underscore and the factor level being dummied.

In the example below, you can see when dummy_all() is used on the variable b from the data frame df, two new columns are created: b_1 and b_2.

By default, dummy_all() does not create a dummy column for NA’s. The can be changed, however, by specifying no_NA = FALSE.

The function dummy_all() takes as input the data frame variables are being selected from and the variable you want to create dummy columns for. More details can be found by typing help("dummy_all") into the R Console.

1.7.1 Examples:

# Sample Data:
a <- c(1,3,2,4,2,3,4,2,2)
b <- c(1,2,1,1,2,1,2,2,2)
c <- c(2,5,1,1,5,NA,2,NA,5)
df <- data.frame(a,b,c)

# Original data frame:
names(df)

## [1] "a" "b" "c"

# Creating dummy columns for one variable:
df_1 <- dummy_all(df, # Data frame
                  b   # Variable to dummy
                  )
names(df_1)

## [1] "a"   "b"   "c"   "b_1" "b_2"

# Creating dummy columns for multiple variables:
df_2 <- df %>% # Data frame
  dummy_all(a, # Variable 1 to dummy
            b  # Variable 2 to dummy
            )
names(df_2)

## [1] "a"   "b"   "c"   "a_1" "a_2" "a_3" "a_4" "b_1" "b_2"

# Creating NA dummy column:
df_3 <- df %>%            # Data frame
  dummy_all(c,            # Variable to dummy
            no_NA = FALSE # Include dummy for NA occurrences
            )

1.8 `dummy_select()`

For some variables, we want to create dummy columns for all factor levels, but for other we only want to dummy out a specific level. The dummy_select() variable allows the user to select the factor level to create a dummy column for. The function creates a new variable named in the same style as dummy_all(); the variable name, underscore, the factor level.

The dummy_select() column also allows the user to replace the current column with the new dummy column by including the option inplace = TRUE.

You can use the function for more than one variable at a time, but not more than one factor level at a time. For example, if two columns both have 1 as a factor level and you want to create a dummy for when the factor is 1, you can include both columns in dummy_select(). However, you cannot create a dummy for when the factor(s) are 1 and 2 in the same command. You would have to run the command once for level 1 and once for level 2.

The following example uses the same sample data as before. More details can be found by typing help("dummy_select") into your R Console.

1.8.1 Examples:

# Sample Data:
a <- c(1,3,2,4,2,3,4,2,2)
b <- c(1,2,1,1,2,1,2,2,2)
c <- c(2,5,1,1,5,NA,2,NA,5)
df <- data.frame(a,b,c)

names(df)

## [1] "a" "b" "c"

# Adding new column:
df_1 <- dummy_select(df,           # Data frame
                     b,            # Variable to dummy
                     dummy_lvl = 1 # Level to dummy
                     )
names(df_1)

## [1] "a"   "b"   "c"   "b_1"

# Adding multiple new column:
df_2 <- df %>%               # Data frame
  dummy_select(b,            # Variable 1 to dummy
               c,            # Variable 2 to dummy
               dummy_lvl = 1 # Level to dummy
               )
names(df_2)

## [1] "a"   "b"   "c"   "b_1" "c_1"

# Replacing column:
df_3 <- dummy_select(df,           # Data frame
                     a,            # Variable to dummy
                    dummy_lvl = 4, # Level to dummy
                    inplace = TRUE # Replace old variable
                    )
table(df$a)

## 
## 1 2 3 4 
## 1 4 2 2

table(df_3$a)

## 
## 0 1 
## 7 2

1.9 In Practice

The functions presented so far can be used in combination to recreate the standard cleaning we use to prep the Arab Barometer data frames for graph creation. Remember: You can copy and paste this code exactly.

1.9.1 Example:

# English:
survey1 = survey1 |> 
  
  # Adding Demographic Categories:
  add_demographics() |> 
  # Creating dummy variables for all levels of Q2061A:
  dummy_all(Q2061A)

An update to the function age_demographic() has been requested and is on the docket. The update requests that the function takes account ages missing from Q1001 if they are supplied by Q1001YEAR or Q1001APPROX.↩︎