Data structures are fundamental for organizing and managing data efficiently in any programming language. A data structure is a way of arranging data so that it can be accessed and modified effectively. The main goal is to reduce the space and time costs of common computational tasks.
In a language like R, variables are the basic containers for data. A variable is a name bound to an object in memory: when you assign a value, R allocates memory for the object and binds the name to it.
In R, data structures are the primary objects that users interact with and manipulate. They provide an organized way to store data, making data manipulation, analysis, and other operations more efficient. R supports several types of data structures, each designed to handle data in specific formats or forms. This part introduces the most basic and widely used data structures in R, starting with vectors.
Recap: What Is a Vector?
A vector is a one-dimensional array that holds elements of the same data type. This homogeneity is a key property: if you combine values of different types, R automatically coerces them to a single common type.
Types of Vectors
R has several atomic vector types:
- Logical (TRUE, FALSE)
- Integer (1L, 2L) — note the L suffix forces integer type
- Double (numeric) (e.g., 3.14, 5.0) — default for numbers with decimals
- Character (strings)
- Complex (complex numbers like 1 + 2i)
You can check the type of any vector with:
typeof(vec)
and its class with
class(vec)
Typically, vectors are also considered objects of class numeric, integer, etc., for method dispatch.
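The difference between the two functions is easiest to see side by side. The sketch below is only an illustration; the variable names (x_dbl, x_int and the result names) are assumptions introduced here.

```r
# Illustrative values; the names are assumptions made for this sketch
x_dbl <- c(1, 2.5)
x_int <- c(1L, 2L)

type_dbl  <- typeof(x_dbl)   # "double": the internal storage type
class_dbl <- class(x_dbl)    # "numeric": the class used for method dispatch
type_int  <- typeof(x_int)   # "integer"
class_int <- class(x_int)    # "integer"

print(c(type_dbl, class_dbl, type_int, class_int))
```

For doubles the two answers differ, which is why is.numeric() and typeof() can appear to disagree at first glance.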
Creating Vectors
Using c()
The most common way:
v <- c(1, 2, 3, 4)
Using vector()
Creates a vector of the specified type and length, filled with that type's default value (0 for numeric, FALSE for logical, "" for character):
v <- vector("numeric", length = 5)
print(v) # prints 0 0 0 0 0
Using seq()
Generates regular sequences:
seq(1, 10, by = 2) # 1 3 5 7 9
Using rep()
Repeats values:
rep(1:3, times = 3) # 1 2 3 1 2 3 1 2 3
rep(1:3, each = 3) # 1 1 1 2 2 2 3 3 3
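A few related helpers from base R are often used alongside seq() and rep(). The following is a small sketch; the variable names are illustrative only.

```r
# Sequence helpers from base R (variable names are illustrative)
idx  <- seq_len(5)                          # 1 2 3 4 5
pos  <- seq_along(c("a", "b", "c"))         # 1 2 3 (one index per element)
grid <- seq(0, 1, length.out = 5)           # 0.00 0.25 0.50 0.75 1.00
reps <- rep(c("x", "y"), times = c(2, 3))   # "x" "x" "y" "y" "y"

print(idx)
print(grid)
print(reps)
```

seq_len() and seq_along() are generally safer than 1:n in loops, because they return an empty sequence when n is 0.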
Coercion Rules and Mixing Types
When combining elements of different types in a vector, R coerces them to the most general type according to the hierarchy:
logical < integer < double < complex < character
For example:
mixed <- c(TRUE, 1L, 3.14, "hello")
typeof(mixed) # "character"
print(mixed) # "TRUE" "1" "3.14" "hello"
This implicit coercion can sometimes cause bugs, so be mindful.
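When the conversion should be deliberate rather than implicit, the as.*() family can be used. The sketch below shows a few such conversions; the variable names are assumptions made for illustration.

```r
# Explicit coercion with the as.*() functions (names are illustrative)
n1 <- as.integer("42")              # 42L
n2 <- as.numeric(c("1.5", "2"))     # 1.5 2.0
s1 <- as.character(c(1L, 2L))       # "1" "2"
l1 <- as.logical(c(0, 1, 2))        # FALSE TRUE TRUE (any non-zero number is TRUE)

# Logical values are coerced to 0/1 in arithmetic
true_count <- sum(c(TRUE, FALSE, TRUE))   # 2

# A string that is not a number becomes NA (with a warning, suppressed here)
bad <- suppressWarnings(as.numeric("abc"))
print(is.na(bad))                   # TRUE
```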
Vector Indexing — More Details
Positive indexing
Access elements by position (starting from 1):
v <- c(10, 20, 30, 40)
v[2] # 20
v[c(1, 4)] # 10 40
Negative indexing
Excludes elements:
v[-3] # all except the 3rd element: 10 20 40
Logical indexing
Use a logical vector to pick elements:
v[c(TRUE, FALSE, TRUE, FALSE)] # 10 30
The logical vector should normally be the same length as the vector being indexed; a shorter one is recycled to match.
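As a brief sketch of how recycling applies to a logical index, and of how comparisons produce such an index directly (variable names are illustrative):

```r
v <- c(10, 20, 30, 40)

# A length-2 mask is recycled to TRUE FALSE TRUE FALSE
every_other <- v[c(TRUE, FALSE)]     # 10 30

# Comparisons return logical vectors that can be used as an index
mid_values <- v[v > 15 & v < 40]     # 20 30

print(every_other)
print(mid_values)
```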
Named vectors
You can assign names to elements and index by name:
v <- c(a = 10, b = 20, c = 30)
v["b"] # 20
Using which()
Find indices that satisfy a condition:
v <- c(5, 10, 15, 20)
which(v > 10) # 3 4
v[which(v > 10)] # 15 20
Vector Arithmetic in Depth
Element-wise operations
Standard arithmetic operations (+, -, *, /, ^) are element-wise:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b # 5 7 9
a * b # 4 10 18
Recycling rule
If vectors differ in length, the shorter one is recycled:
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b # 11 22 13 24 (b recycled as 10,20,10,20)
Be cautious: if the longer vector's length is not a multiple of the shorter vector's length, R still recycles but issues a warning.
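The most common case of recycling is a single value applied to every element. A minimal sketch, with an illustrative vector named prices:

```r
prices <- c(10, 20, 30)

# A length-1 vector is recycled across the whole vector
doubled  <- prices * 2                # 20 40 60

# mean(prices) is a single number, so it is recycled as well
centered <- prices - mean(prices)     # -10 0 10

print(doubled)
print(centered)
```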
Vectorized Functions and Operations
One of R’s strengths is vectorization — functions that operate on vectors element-wise efficiently.
Mathematical functions:
- sqrt(v) — square root
- log(v) — natural logarithm
- exp(v) — exponential function (e raised to each element)
- abs(v) — absolute value
Example:
v <- c(-1, 0, 1, 4)
abs(v) # 1 0 1 4
sqrt(abs(v)) # 1 0 1 2
Summary functions:
- sum(v) — sum of all elements
- prod(v) — product of all elements
- mean(v) — average
- median(v) — median
- min(v), max(v) — minimum and maximum
Logical Vectors and Boolean Operations
Logical vectors store TRUE or FALSE.
You can create them by comparisons:
v <- c(2, 5, 8, 1)
v > 4 # FALSE TRUE TRUE FALSE
Logical operations work element-wise:
- & (and)
- | (or)
- ! (not)
Example:
a <- c(TRUE, FALSE, TRUE)
b <- c(FALSE, FALSE, TRUE)
a & b # FALSE FALSE TRUE
a | b # TRUE FALSE TRUE
!a # FALSE TRUE FALSE
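Because TRUE and FALSE behave as 1 and 0 in arithmetic, logical vectors are a convenient way to count or measure how often a condition holds. A short sketch, with illustrative names:

```r
v <- c(2, 5, 8, 1)

n_big     <- sum(v > 4)     # 2: number of elements greater than 4
share_big <- mean(v > 4)    # 0.5: proportion of elements greater than 4

# && and || compare single values and are intended for if() conditions
ok <- all(v > 0) && length(v) == 4   # TRUE

print(c(n_big, share_big))
print(ok)
```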
Naming Vector Elements
Naming vector elements can make your code clearer:
temps <- c(30, 35, 28, 25)
names(temps) <- c("Mon", "Tue", "Wed", "Thu")
temps["Tue"] # 35
You can also name elements on creation:
temps <- c(Mon = 30, Tue = 35, Wed = 28, Thu = 25)
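Names also combine well with conditions. The following sketch reuses the temps vector from above; the result names are assumptions for illustration.

```r
temps <- c(Mon = 30, Tue = 35, Wed = 28, Thu = 25)

day_names <- names(temps)                # "Mon" "Tue" "Wed" "Thu"
warm_days <- names(temps)[temps > 29]    # "Mon" "Tue"
hottest   <- names(which.max(temps))     # "Tue"
two_days  <- temps[c("Mon", "Wed")]      # named vector: Mon 30, Wed 28

print(warm_days)
print(hottest)
print(two_days)
```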
Modifying Vector Elements
Assign new values using indices:
v <- c(1, 2, 3)
v[2] <- 10
print(v) # 1 10 3
Assigning with logical indices:
v[v > 2] <- 0
print(v) # 1 0 0 (both 10 and 3 are greater than 2)
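Vectors can also grow. The sketch below shows three common ways to add elements; it starts from a fresh vector so it can be run on its own.

```r
v <- c(1, 2, 3)

v <- c(v, 4)                    # append at the end: 1 2 3 4
v[6] <- 6                       # assigning past the end fills the gap with NA
print(v)                        # 1 2 3 4 NA 6

v <- append(v, 0, after = 0)    # insert at the front
print(v)                        # 0 1 2 3 4 NA 6
```

Growing a vector one element at a time inside a long loop is usually slow; creating it at its final length first tends to work better.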
Useful Vector Functions
- length(v) — returns the number of elements
- unique(v) — returns unique elements
- duplicated(v) — returns a logical vector indicating duplicates
- rev(v) — reverses the vector
- any(v) — returns TRUE if any element is TRUE (useful with logical vectors)
- all(v) — returns TRUE if all elements are TRUE
Example:
v <- c(1, 2, 2, 3, 4)
unique(v) # 1 2 3 4
duplicated(v) # FALSE FALSE TRUE FALSE FALSE
rev(v) # 4 3 2 2 1
any(v > 3) # TRUE
all(v > 0) # TRUE
Coercion and Type Checking Functions
- is.numeric(v)
- is.integer(v)
- is.character(v)
- is.logical(v)
Example:
v <- c(1, 2, 3)
is.numeric(v) # TRUE
is.integer(v) # FALSE, because the default numeric type is double
v2 <- c(1L, 2L)
is.integer(v2) # TRUE
Missing Values (NA) in Vectors
NA represents missing or undefined values in R.
Example:
v <- c(1, NA, 3)
sum(v) # returns NA by default
sum(v, na.rm = TRUE) # 4, removes NA
You can test for NA values with is.na():
is.na(v) # FALSE TRUE FALSE
Missing values propagate through most operations unless handled explicitly.
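A few one-line cases make the propagation rule concrete. This is a small sketch with illustrative result names:

```r
v <- c(1, NA, 3)

m_all  <- mean(v)                  # NA: the missing value propagates
m_obs  <- mean(v, na.rm = TRUE)    # 2
cmp    <- NA > 1                   # NA: the comparison cannot be decided
and_f  <- NA & FALSE               # FALSE: the result is FALSE either way
or_t   <- NA | TRUE                # TRUE: the result is TRUE either way
has_na <- anyNA(v)                 # TRUE
n_na   <- sum(is.na(v))            # 1

print(c(m_all, m_obs))
print(c(cmp, and_f, or_t))
```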
Factors vs. Character Vectors
Factors are special vectors that represent categorical data with fixed levels.
v <- factor(c("low", "medium", "high", "medium"))
levels(v) # "high" "low" "medium"
typeof(v) # "integer"
class(v) # "factor"
They store the underlying data as integers but print as categories. Factors are essential for statistical modeling.
Subsetting with which() and Logical Conditions
To extract elements meeting conditions, use:
v <- c(3, 6, 9, 12)
v[v > 5] # 6 9 12
v[which(v > 5)] # same as above
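which() returns indices; direct logical indexing extracts the elements. The two differ when the vector contains NA: logical indexing returns NA for those positions, while which() drops them.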
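The two approaches give the same result on complete data but behave differently once a missing value is present. A minimal sketch:

```r
v <- c(3, NA, 9, 12)

by_logical <- v[v > 5]          # NA 9 12: the NA comparison yields an NA element
by_which   <- v[which(v > 5)]   # 9 12: which() keeps only positions that are TRUE

print(by_logical)
print(by_which)
```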
Combining Vectors
You can concatenate vectors with c():
a <- c(1, 2, 3)
b <- c(4, 5)
combined <- c(a, b) # 1 2 3 4 5 (a name other than c avoids confusion with the c() function)
Vector Recycling — Detailed Example
Recycling is a powerful feature, but it must be used with care.
v1 <- c(1, 2, 3, 4, 5, 6)
v2 <- c(10, 20)
v1 + v2 # 11 22 13 24 15 26 (v2 recycled as 10 20 10 20 10 20)
If lengths don’t align evenly, R warns:
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 20, 30)
v1 + v2
# Warning: longer object length is not a multiple of shorter object length
Sorting and Ordering Vectors
- sort() returns a sorted vector.
- order() returns the indices that would sort the vector.
Example:
v <- c(7, 2, 9, 4)
sort(v) # 2 4 7 9
order(v) # 2 4 1 3 (positions of sorted elements)
v[order(v)] # 2 4 7 9
You can sort descending with:
sort(v, decreasing = TRUE) # 9 7 4 2
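The main reason to prefer order() over sort() is that its indices can reorder a different vector. The sketch below uses two made-up vectors, people and ages, purely for illustration.

```r
# Hypothetical parallel vectors
people <- c("Ann", "Bob", "Cy")
ages   <- c(31, 25, 40)

# Reorder one vector by the values of another
youngest_first <- people[order(ages)]                      # "Bob" "Ann" "Cy"
oldest_first   <- people[order(ages, decreasing = TRUE)]   # "Cy" "Ann" "Bob"

print(youngest_first)
print(oldest_first)
```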
Additional Useful Vector Functions
- sample(v, size) — randomly sample elements from a vector
- match(x, table) — find positions of elements of x in table
- setdiff(x, y) — elements in x but not in y
- intersect(x, y) — common elements of x and y
- union(x, y) — all unique elements from x and y combined
Example:
x <- c(1, 2, 3)
y <- c(3, 4, 5)
setdiff(x, y) # 1 2
intersect(x, y) # 3
union(x, y) # 1 2 3 4 5
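The example above covers the set functions; match() and sample() can be sketched in the same way. The vector names below are assumptions, and %in% is shown as a closely related base R operator.

```r
x      <- c("b", "d", "a")
lookup <- c("a", "b", "c")

pos     <- match(x, lookup)   # 2 NA 1: position of each x in lookup, NA if absent
present <- x %in% lookup      # TRUE FALSE TRUE

# sample() is random; set.seed() makes a run repeatable
set.seed(42)
picked <- sample(lookup, size = 2)

print(pos)
print(present)
print(picked)
```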
Working with Large Vectors
For big data, vectors can become huge. Use functions like:
- length() to check size
- head() and tail() to view subsets
- summary() for quick stats
Example:
v <- rnorm(1000000)
length(v) # 1,000,000
head(v) # first 6 elements
summary(v) # min, max, median, quartiles
Lists
Unlike vectors, lists are non-homogeneous data structures, meaning they can contain elements of different types. Lists can hold numbers, characters, vectors, other lists, matrices, and even functions.
Lists are created using the list() function.
Example:
list1 <- list("Sam", "Green", c(8, 2, 67), TRUE, 51.99, 11.78, FALSE)
print(list1)
Output:
[[1]]
[1] "Sam"
[[2]]
[1] "Green"
[[3]]
[1] 8 2 67
[[4]]
[1] TRUE
[[5]]
[1] 51.99
[[6]]
[1] 11.78
[[7]]
[1] FALSE
Accessing Elements in a List
Elements of a list can be accessed by index. Single brackets return a sub-list that contains the selected element; double brackets return the element itself.
Example:
list2 <- list(matrix(c(3, 9, 5, 1, -2, 8), nrow = 2), c("Jan", "Feb", "Mar"), list(3, 4, 5))
print(list2[1])
print(list2[2])
print(list2[3])
Output (each call returns a one-element list, so every result is labeled [[1]]):
[[1]]
     [,1] [,2] [,3]
[1,]    3    5   -2
[2,]    9    1    8
[[1]]
[1] "Jan" "Feb" "Mar"
[[1]]
[[1]][[1]]
[1] 3
[[1]][[2]]
[1] 4
[[1]][[3]]
[1] 5
Adding and Deleting List Elements
You can add elements to the end of a list by assigning a new value to the next index, and remove elements by assigning NULL.
Example:
list2[4] <- "HELLO"
print(list2[4])
list2[4] <- NULL
print(list2[4])
Output:
[[1]]
[1] "HELLO"
[[1]]
NULL
Updating Elements of a List
To update an element, assign a new value to the specific index.
Example:
list2[3] <- "Element Updated"
print(list2[3])
Output:
[[1]]
[1] "Element Updated"
Lists in R Programming
Lists are an important and versatile data structure in R. Unlike vectors, which are homogeneous and contain elements of the same data type, lists are non-homogeneous and can store elements of different types together. These elements can include numbers, characters, vectors, other lists, matrices, functions, or any other objects. This flexibility makes lists very useful when you want to store complex or mixed data types in a single structure.
Creating Lists
Lists are created using the list() function. You can pass any number of elements to this function, each of any type. For example:
list1 <- list("Sam", "Green", c(8, 2, 67), TRUE, 51.99, 11.78, FALSE)
print(list1)
The output will show each element of the list in its position:
[[1]]
[1] "Sam"
[[2]]
[1] "Green"
[[3]]
[1] 8 2 67
[[4]]
[1] TRUE
[[5]]
[1] 51.99
[[6]]
[1] 11.78
[[7]]
[1] FALSE
Accessing List Elements
You can access elements in a list by using their indices, enclosed in square brackets. For example, if you have a list:
list2 <- list(matrix(c(3, 9, 5, 1, -2, 8), nrow = 2), c("Jan", "Feb", "Mar"), list(3, 4, 5))
Accessing the first, second, and third elements can be done as follows:
print(list2[1]) # Returns a list containing the first element (a matrix)
print(list2[2]) # Returns a list containing the second element (a vector of month names)
print(list2[3]) # Returns a list containing the third element (a nested list)
Single brackets always return a list. To extract an element itself, or to reach inside the nested list, use double square brackets:
print(list2[[3]][[1]]) # Outputs 3
print(list2[[3]][[2]]) # Outputs 4
print(list2[[3]][[3]]) # Outputs 5
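The distinction between single and double brackets can be checked directly with class() and length(). The sketch below rebuilds list2 so that it runs on its own; the result names are illustrative.

```r
list2 <- list(matrix(c(3, 9, 5, 1, -2, 8), nrow = 2),
              c("Jan", "Feb", "Mar"),
              list(3, 4, 5))

sub_class  <- class(list2[2])     # "list": single brackets keep the list wrapper
elem_class <- class(list2[[2]])   # "character": double brackets return the element
sub_len    <- length(list2[2])    # 1

third_month <- list2[[2]][3]      # "Mar"
cell        <- list2[[1]][2, 3]   # 8: row 2, column 3 of the matrix

print(c(sub_class, elem_class))
print(third_month)
print(cell)
```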
Adding and Deleting List Elements
You can add elements to a list by assigning a new value at the next index:
list2[4] <- "HELLO"
print(list2[4])
This will add a new element at position 4. To delete an element, assign NULL to that index:
list2[4] <- NULL
print(list2[4]) # The list has three elements again, so index 4 yields a list holding NULL
Updating List Elements
To update an existing element in the list, simply assign a new value to the desired index:
list2[3] <- "Element Updated"
print(list2[3])
This replaces the third element in the list with the string “Element Updated”.
Matrices in R Programming
Matrices are two-dimensional data structures in R that hold elements of the same data type. They can be considered as vectors with a dimension attribute. The elements in a matrix are arranged in rows and columns.
Creating Matrices
You can create a matrix using the matrix() function. The primary arguments include the data elements as a vector, the number of rows, the number of columns, whether to fill the matrix by row or by column, and optional dimension names.
The syntax is:
matrix(data, nrow, ncol, byrow, dimnames)
- data is a vector containing the elements.
- nrow specifies the number of rows.
- ncol specifies the number of columns.
- byrow is a logical value indicating whether to fill the matrix by rows (TRUE) or by columns (FALSE, default).
- dimnames is a list containing optional row and column names.
Example of creating a 3×3 matrix filled by rows:
M1 <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = TRUE)
print(M1)
Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
Similarly, to create a matrix filled by columns (the default):
M2 <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = FALSE)
print(M2)
Output:
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Adding Row and Column Names
You can name the rows and columns by passing a list to the dimnames argument:
row_names <- c("row1", "row2", "row3") # named row_names to avoid masking the rownames() function
col_names <- c("col1", "col2", "col3")
M3 <- matrix(c(1:9), nrow = 3, byrow = TRUE, dimnames = list(row_names, col_names))
print(M3)
Output:
     col1 col2 col3
row1    1    2    3
row2    4    5    6
row3    7    8    9
Accessing Matrix Elements
To access elements of a matrix, you specify the row and column indices within square brackets. The syntax is:
matrixName[row, column]
For example, using matrix M3:
print(M3[1, 1]) # First row, first column
print(M3[3, 3]) # Third row, third column
print(M3[2, 3]) # Second row, third column
The output will be:
[1] 1
[1] 9
[1] 6
You can also extract entire rows or columns by leaving the other index blank:
print(M3[1, ]) # All elements in first row
print(M3[, 2]) # All elements in second column
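Beyond indexing, a handful of base R functions operate on whole rows or columns. The sketch below uses a small illustrative matrix M rather than M3.

```r
M <- matrix(1:6, nrow = 2)      # columns are (1,2), (3,4), (5,6)

dims       <- dim(M)              # 2 3
Mt         <- t(M)                # transpose: 3 rows, 2 columns
row_totals <- rowSums(M)          # 9 12
col_means  <- colMeans(M)         # 1.5 3.5 5.5
row_max    <- apply(M, 1, max)    # 5 6 (1 = apply over rows)

print(dims)
print(row_totals)
print(col_means)
print(row_max)
```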
Factors in R
Factors are a special data structure used for fields that take a limited number of unique values, often called categorical data. Factors are useful in statistical modeling and data analysis because they represent categories and their levels efficiently.
Creating Factors
You create factors from vectors using the factor() function. For example:
data <- c("Male", "Female", "Male", "Child", "Child", "Male", "Female", "Female")
print(data)
factor.data <- factor(data)
print(factor.data)
Output:
[1] "Male"   "Female" "Male"   "Child"  "Child"  "Male"   "Female" "Female"
[1] Male   Female Male   Child  Child  Male   Female Female
Levels: Child Female Male
The unique values in the vector become the factor levels.
Using Factors in Data Frames
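In versions of R before 4.0.0, data.frame() and read.csv() converted text columns to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the conversion has to be requested explicitly. For instance, consider a data frame emp.finaldata with a factor column empdept: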
print(is.factor(emp.finaldata$empdept))
print(emp.finaldata$empdept)
Output might be:
[1] TRUE
[1] Sales Marketing HR R&D IT Operations Finance
Levels: HR, Marketing, R&D, Sales, Finance, IT Operations
This shows that the empdept column is treated as a factor, which can be beneficial for grouping and analysis.
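Since emp.finaldata is not defined in this text, the following sketch uses a small made-up data frame (staff) to show the two ways a text column can end up as a factor. All names in it are assumptions for illustration.

```r
# Hypothetical data; request factor conversion explicitly at creation
staff <- data.frame(dept = c("Sales", "HR", "Sales"), stringsAsFactors = TRUE)
was_factor  <- is.factor(staff$dept)    # TRUE
dept_levels <- levels(staff$dept)       # "HR" "Sales"

# Alternatively, keep the column as text and convert it afterwards
staff_chr <- data.frame(dept = c("Sales", "HR", "Sales"), stringsAsFactors = FALSE)
staff_chr$dept <- factor(staff_chr$dept)
dept_counts <- table(staff_chr$dept)    # HR: 1, Sales: 2

print(dept_levels)
print(dept_counts)
```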
Data Frames in R Programming
Data frames are one of the most widely used data structures in R. They are essentially tables or 2D data structures where each column can be of a different data type (numeric, character, factor, etc.). Data frames are used to store tabular data similar to spreadsheets or SQL tables.
Creating a Data Frame
You can create a data frame using the data.frame() function:
empdata <- data.frame(
  empid = c(1001, 1002, 1003, 1004),
  empname = c("Alice", "Bob", "Charlie", "David"),
  empdept = c("Sales", "Marketing", "HR", "IT"),
  empsalary = c(50000, 55000, 45000, 60000)
)
print(empdata)
Output:
  empid empname   empdept empsalary
1  1001   Alice     Sales     50000
2  1002     Bob Marketing     55000
3  1003 Charlie        HR     45000
4  1004   David        IT     60000
Accessing Data Frame Elements
You can access elements of a data frame in several ways:
- By column name:
print(empdata$empname) # Prints the empname column
- By row and column indices:
print(empdata[2, 3]) # Row 2, column 3: "Marketing"
print(empdata[ , 2]) # All rows, column 2 (empname)
print(empdata[1, ]) # Row 1, all columns
Adding Rows and Columns
- To add a new column:
empdata$empbonus <- c(5000, 4000, 3000, 6000)
print(empdata)
- To add a new row (using rbind()):
newrow <- data.frame(empid = 1005, empname = "Eve", empdept = "Finance", empsalary = 52000, empbonus = 4500)
empdata <- rbind(empdata, newrow)
print(empdata)
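Row selection by condition and sorting follow the same indexing rules as vectors. The sketch below rebuilds the original four-row empdata so it can be run independently; the result names are illustrative.

```r
empdata <- data.frame(
  empid = c(1001, 1002, 1003, 1004),
  empname = c("Alice", "Bob", "Charlie", "David"),
  empdept = c("Sales", "Marketing", "HR", "IT"),
  empsalary = c(50000, 55000, 45000, 60000),
  stringsAsFactors = FALSE
)

# Keep rows where the condition is TRUE; the empty column index keeps all columns
high_paid <- empdata[empdata$empsalary > 50000, ]
print(high_paid$empname)                  # "Bob" "David"

# Sort by salary and pull out a single column
by_salary <- empdata[order(empdata$empsalary), "empname"]
print(by_salary)                          # "Charlie" "Alice" "Bob" "David"

avg_salary <- mean(empdata$empsalary)     # 52500
print(avg_salary)
```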
Arrays in R Programming
Arrays are similar to matrices but can have more than two dimensions. They hold elements of the same data type.
Creating Arrays
You use the array() function to create an array by specifying the data and dimensions:
# Create a 3D array with dimensions 2 x 3 x 4
arr <- array(1:24, dim = c(2, 3, 4))
print(arr)
Accessing Array Elements
You specify indices for each dimension in square brackets:
print(arr[1, 2, 3]) # Element at 1st row, 2nd column, 3rd matrix
print(arr[, , 2]) # The entire 2nd matrix
Arrays are useful for working with multi-dimensional data, like image processing, scientific datasets, etc.
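As a sketch of how the three indices map onto the data, and of summarizing along one dimension with apply(), consider the same 2 x 3 x 4 array (result names are illustrative):

```r
arr <- array(1:24, dim = c(2, 3, 4))

dims <- dim(arr)           # 2 3 4
one  <- arr[1, 2, 3]       # 15: values fill rows first, then columns, then slices

# Sum each 2 x 3 slice; 3 means "apply over the third dimension"
slice_sums <- apply(arr, 3, sum)   # 21 57 93 129

print(dims)
print(one)
print(slice_sums)
```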
Handling Missing Data in R
Missing data is common in real-world datasets, and R provides ways to detect and manage missing values.
Representing Missing Data
In R, missing data is represented by the special value NA.
Detecting Missing Values
Use the is.na() function to check for missing values:
data <- c(10, 20, NA, 40, NA)
print(is.na(data))
Output:
[1] FALSE FALSE TRUE FALSE TRUE
Removing Missing Values
You can remove missing values using the na.omit() function or by using logical indexing:
clean_data <- na.omit(data)
print(clean_data)
Output (na.omit() also records which positions were dropped, as attributes):
[1] 10 20 40
attr(,"na.action")
[1] 3 5
attr(,"class")
[1] "omit"
Alternatively:
clean_data <- data[!is.na(data)]
print(clean_data)
Replacing Missing Values
You can replace missing values with a specific value:
data[is.na(data)] <- 0
print(data)
Output:
[1] 10 20 0 40 0
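Zero is not always a sensible replacement. One common alternative, sketched below under the assumption that the mean of the observed values is an acceptable stand-in, uses a fresh illustrative vector named readings.

```r
readings <- c(10, 20, NA, 40, NA)

n_missing <- sum(is.na(readings))            # 2
obs_mean  <- mean(readings, na.rm = TRUE)    # 70 / 3, about 23.33

# Replace each NA with the mean of the observed values
filled <- ifelse(is.na(readings), obs_mean, readings)

print(n_missing)
print(filled)
```

Whether mean replacement is appropriate depends on the analysis; it is shown here only as one option.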
Lists in R Programming
Lists are very flexible data structures in R. Unlike vectors or arrays, lists can hold elements of different types and structures — including vectors, matrices, other lists, data frames, and even functions.
Creating Lists
mylist <- list(
  name = "John",
  age = 28,
  scores = c(85, 90, 88),
  passed = TRUE
)
print(mylist)
Accessing List Elements
You can access list elements by name or position:
print(mylist$name) # "John"
print(mylist[[2]]) # 28 (age)
print(mylist$scores) # 85 90 88
To access an element inside a list element:
print(mylist$scores[2]) # 90
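Named lists can also be inspected and extended by name. The sketch below rebuilds mylist so it runs on its own; the added element average is an assumption made for illustration.

```r
mylist <- list(name = "John", age = 28, scores = c(85, 90, 88), passed = TRUE)

field_names <- names(mylist)              # "name" "age" "scores" "passed"
age         <- mylist[["age"]]            # 28: double brackets also accept a name

# Assigning to a new name adds an element
mylist$average <- mean(mylist$scores)     # 263 / 3, about 87.67

sizes <- lengths(mylist)                  # length of each element, by name

print(field_names)
print(sizes)
```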
Factors in R Programming
Factors are used to handle categorical data — data that has a fixed number of possible values (levels). They are stored as integers with labels.
Creating Factors
colors <- c("red", "blue", "red", "green", "blue", "blue")
color_factor <- factor(colors)
print(color_factor)
Output:
[1] red   blue  red   green blue  blue
Levels: blue green red
Why Factors?
- Factors help with statistical modeling.
- They can use less memory than character vectors when the same values repeat many times.
- Levels can be ordered or unordered.
Accessing Levels
levels(color_factor) # "blue" "green" "red"
You can specify the order of levels:
ordered_factor <- factor(colors, levels = c("red", "green", "blue"), ordered = TRUE)
print(ordered_factor) # the levels line reads: Levels: red < green < blue
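Ordering the levels is what makes comparisons meaningful. The following sketch uses a made-up ordered factor named sizes to show this; all names are illustrative.

```r
# Hypothetical ordered factor with levels low < medium < high
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"), ordered = TRUE)

below_high <- sizes < "high"      # TRUE FALSE TRUE TRUE
codes      <- as.integer(sizes)   # 1 3 2 1: the underlying integer codes
counts     <- table(sizes)        # low 2, medium 1, high 1
largest    <- as.character(max(sizes))   # "high"

print(below_high)
print(codes)
print(counts)
```

On an unordered factor, comparisons such as < are not meaningful and return NA with a warning.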
Basic Data Manipulation with dplyr
dplyr is a package that makes data manipulation easy and intuitive.
Installing and Loading dplyr
install.packages("dplyr") # Run once
library(dplyr)
Sample Data Frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 28),
  salary = c(50000, 60000, 55000, 65000, 48000)
)
Common dplyr Functions
- filter() — Filter rows based on condition:
young_employees <- filter(df, age < 30)
print(young_employees)
- select() — Select specific columns:
selected_data <- select(df, name, salary)
print(selected_data)
- mutate() — Add new columns or modify existing ones:
df <- mutate(df, bonus = salary * 0.1)
print(df)
- arrange() — Sort rows:
df_sorted <- arrange(df, desc(salary))
print(df_sorted)
- summarize() — Summarize data (usually with grouping):
avg_salary <- summarize(df, avg_salary = mean(salary))
print(avg_salary)
- group_by() — Group data before summarizing:
df2 <- data.frame(
  dept = c("Sales", "Sales", "HR", "HR", "IT"),
  salary = c(50000, 55000, 45000, 47000, 60000)
)
dept_avg <- df2 %>%
  group_by(dept) %>%
  summarize(avg_salary = mean(salary))
print(dept_avg)
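For comparison, the same grouped average can be sketched in base R without any packages. This is an alternative shown for illustration, not a description of how dplyr works internally; the result names are assumptions.

```r
df2 <- data.frame(
  dept = c("Sales", "Sales", "HR", "HR", "IT"),
  salary = c(50000, 55000, 45000, 47000, 60000)
)

# tapply() splits salary by dept and applies mean() to each group
avg_by_dept <- tapply(df2$salary, df2$dept, mean)
print(avg_by_dept)    # HR 46000, IT 60000, Sales 52500

# aggregate() returns the same figures as a data frame
agg <- aggregate(salary ~ dept, data = df2, FUN = mean)
print(agg)
```

The dplyr version tends to read more clearly as pipelines grow, while the base R version avoids an extra dependency.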
Final Thoughts
1. Practice is Key
- R is best learned by doing. Try solving real problems or working on small projects.
- Use datasets from sources like Kaggle or R’s built-in datasets (mtcars, iris).
2. Understand the Ecosystem
- R is much more than base functions. Explore popular packages like:
- tidyverse (includes dplyr, ggplot2, tidyr, etc.) for data manipulation and visualization.
- data.table for fast data handling.
- shiny for interactive web apps.
3. Learn to Debug and Read Error Messages
- Errors are part of coding. Carefully read error messages—they often tell you exactly what’s wrong.
- Use traceback(), debug(), and browser() functions to troubleshoot.
4. Write Clean and Reproducible Code
- Comment your code.
- Use meaningful variable names.
- Organize your scripts logically.
- Consider using R Markdown for combining code, output, and explanations in one document.
5. Keep Exploring Statistical and Visualization Tools
- R’s strength is in statistics and graphics.
- Learn to use ggplot2 for advanced plotting.
- Explore modeling functions like lm(), glm(), or machine learning packages.
6. Join the Community
- R has a welcoming and active community.
- Participate in forums like Stack Overflow, RStudio Community, or local user groups.
- Follow blogs, Twitter accounts, or YouTube channels about R.