An Introduction to Data Structures in R

Data structures are fundamental for organizing and managing data efficiently in any programming language. A data structure is a way of arranging data so that it can be accessed and modified effectively. The main goal is to keep the space and time costs of common computational tasks low.

When working with programming languages like R, variables serve as the basic containers for storing different types of data. Creating a variable allocates memory for its value and binds the variable's name to that memory location.

In R, data structures are the primary objects that users interact with and manipulate. They provide an organized way to store data, making data manipulation, analysis, and other operations more efficient. R supports several types of data structures, each designed to handle data in specific formats or forms. This part introduces the most basic and widely used data structures in R, starting with vectors.

Recap: What Is a Vector?

A vector is a one-dimensional sequence of elements that all share the same data type. This homogeneity is a key property: if you combine values of different types, R automatically coerces them to a single common type.

Types of Vectors

R has several atomic vector types:

  • Logical (TRUE, FALSE)
  • Integer (1L, 2L) — note the L suffix forces integer type
  • Double (numeric) (e.g., 3.14, 5.0) — default for numbers with decimals
  • Character (strings)
  • Complex (complex numbers like 1 + 2i)

You can check the type of any vector with:

typeof(vec)

and its class with

class(vec)

The two can differ: typeof() reports how the values are stored, while class() reports the name used for method dispatch. A double vector, for example, has type double but class numeric.
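
The difference between the two functions can be sketched as follows. The variable names are illustrative only and do not come from the original text.

```r
# typeof() reports the storage type; class() reports what method dispatch sees
dbl <- c(1.5, 2, 3)
int <- c(1L, 2L, 3L)

typeof(dbl)  # "double"
class(dbl)   # "numeric"
typeof(int)  # "integer"
class(int)   # "integer"

# The two can disagree for richer objects built on vectors, such as dates
today <- Sys.Date()
typeof(today)  # "double"
class(today)   # "Date"
```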

Creating Vectors

Using c()

The most common way:

v <- c(1, 2, 3, 4)

Using vector()

Creates a vector of a specified type and length, filled with that type's default value (0 for numeric):

v <- vector("numeric", length = 5)

print(v)  # prints 0 0 0 0 0
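
As a small sketch of the same idea for other types, the default fill value depends on the type requested. The variable names below are illustrative.

```r
chr <- vector("character", length = 2)  # "" ""
lgl <- vector("logical", length = 3)    # FALSE FALSE FALSE
num <- numeric(3)                       # shorthand for vector("numeric", 3): 0 0 0
empty <- character(0)                   # a genuinely empty (zero-length) vector
length(empty)                           # 0
```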

Using seq()

Generates regular sequences:

seq(1, 10, by = 2)  # 1 3 5 7 9

Using rep()

Repeats values:

rep(1:3, times = 3)  # 1 2 3 1 2 3 1 2 3

rep(1:3, each = 3)   # 1 1 1 2 2 2 3 3 3

Coercion Rules and Mixing Types

When combining elements of different types in a vector, R coerces them to the most general type according to the hierarchy:

logical < integer < double < complex < character

For example:

mixed <- c(TRUE, 1L, 3.14, "hello")

typeof(mixed)  # "character"

print(mixed)   # "TRUE" "1" "3.14" "hello"

This implicit coercion can sometimes cause bugs, so be mindful.
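
A short sketch of where coercion helps and where it bites. The variable names are illustrative, and suppressWarnings() is used only to keep the example quiet.

```r
# Logicals coerce to 1 and 0, which makes counting TRUE values easy
flags <- c(TRUE, FALSE, TRUE)
n_true <- sum(flags)            # 2

# Mixing integer and double promotes everything to double
promoted <- c(1L, 2.5)
typeof(promoted)                # "double"

# Explicit coercion: text that cannot be parsed becomes NA (with a warning)
raw <- c("10", "20", "abc")
nums <- suppressWarnings(as.numeric(raw))
nums                            # 10 20 NA
```

Checking for unexpected NA values after an explicit conversion is one way to catch this kind of bug early.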

Vector Indexing — More Details

Positive indexing

Access elements by position (starting from 1):

v <- c(10, 20, 30, 40)

v[2]  # 20

v[c(1, 4)]  # 10 40

Negative indexing

Excludes elements:

v[-3]  # all except the 3rd element: 10 20 40

Logical indexing

Use a logical vector to pick elements:

v[c(TRUE, FALSE, TRUE, FALSE)]  # 10 30

The logical vector should have the same length as the vector being indexed; a shorter one is recycled to match.

Named vectors

You can assign names to elements and index by name:

v <- c(a = 10, b = 20, c = 30)

v["b"]  # 20

Using which()

Find indices that satisfy a condition:

v <- c(5, 10, 15, 20)

which(v > 10)  # 3 4

v[which(v > 10)]  # 15 20

Vector Arithmetic in Depth

Element-wise operations

Standard arithmetic operations (+, -, *, /, ^) are element-wise:

a <- c(1, 2, 3)

b <- c(4, 5, 6)

a + b  # 5 7 9

a * b  # 4 10 18

Recycling rule

If vectors differ in length, the shorter one is recycled:

a <- c(1, 2, 3, 4)

b <- c(10, 20)

a + b  # 11 22 13 24  (b recycled as 10,20,10,20)

Be cautious: if the longer vector's length is not a multiple of the shorter vector's length, R still recycles the shorter vector but issues a warning.

Vectorized Functions and Operations

One of R’s strengths is vectorization — functions that operate on vectors element-wise efficiently.

Mathematical functions:

  • sqrt(v) — square root
  • log(v) — natural logarithm
  • exp(v) — exponential function (e raised to each element)
  • abs(v) — absolute value

Example:

v <- c(-1, 0, 1, 4)

abs(v)  # 1 0 1 4

sqrt(abs(v))  # 1 0 1 2

Summary functions:

  • sum(v) — sum of all elements
  • prod(v) — product of all elements
  • mean(v) — average
  • median(v) — median
  • min(v), max(v) — minimum and maximum
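
The summary functions listed above can be tried on a small vector. This is a minimal sketch with made-up values; range() is a base R function not listed above that returns the minimum and maximum together.

```r
v <- c(2, 4, 4, 10)

total <- sum(v)      # 20
product <- prod(v)   # 320
avg <- mean(v)       # 5
mid <- median(v)     # 4
lo <- min(v)         # 2
hi <- max(v)         # 10
range(v)             # 2 10
```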

Logical Vectors and Boolean Operations

Logical vectors store TRUE or FALSE.

You can create them by comparisons:

v <- c(2, 5, 8, 1)

v > 4  # FALSE TRUE TRUE FALSE

Logical operations work element-wise:

  • & (and)
  • | (or)
  • ! (not)

Example:

a <- c(TRUE, FALSE, TRUE)

b <- c(FALSE, FALSE, TRUE)

a & b  # FALSE FALSE TRUE

a | b  # TRUE FALSE TRUE

!a     # FALSE TRUE FALSE

Naming Vector Elements

Naming vector elements can make your code clearer:

temps <- c(30, 35, 28, 25)

names(temps) <- c("Mon", "Tue", "Wed", "Thu")

temps["Tue"]  # 35

You can also name elements on creation:

temps <- c(Mon = 30, Tue = 35, Wed = 28, Thu = 25)

Modifying Vector Elements

Assign new values using indices:

v <- c(1, 2, 3)

v[2] <- 10

print(v)  # 1 10 3

Assigning with logical indices:

v[v > 2] <- 0

print(v)  # 1 0 0  (both 10 and 3 are greater than 2)

Useful Vector Functions

  • length(v) — returns the number of elements
  • unique(v) — returns unique elements
  • duplicated(v) — returns a logical vector indicating duplicates
  • rev(v) — reverses the vector
  • any(v) — returns TRUE if any element is TRUE (useful with logical vectors)
  • all(v) — returns TRUE if all elements are TRUE

Example:

v <- c(1, 2, 2, 3, 4)

unique(v)       # 1 2 3 4

duplicated(v)   # FALSE FALSE TRUE FALSE FALSE

rev(v)          # 4 3 2 2 1

any(v > 3)      # TRUE

all(v > 0)      # TRUE

Coercion and Type Checking Functions

  • is.numeric(v)
  • is.integer(v)
  • is.character(v)
  • is.logical(v)

Example:

v <- c(1, 2, 3)

is.numeric(v)  # TRUE

is.integer(v)  # FALSE, because the default numeric type is double

v2 <- c(1L, 2L)

is.integer(v2) # TRUE

Missing Values (NA) in Vectors

NA represents missing or undefined values in R.

Example:

v <- c(1, NA, 3)

sum(v)  # returns NA by default

sum(v, na.rm = TRUE)  # 4, removes NA

You can test for NA values with is.na():

is.na(v)  # FALSE TRUE FALSE

Missing values propagate through most operations unless handled explicitly.
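
The propagation described above can be seen in a few lines. This is a small sketch with made-up values.

```r
v <- c(4, NA, 6)

mean(v)                              # NA, the missing value propagates
clean_mean <- mean(v, na.rm = TRUE)  # 5
shifted <- v + 1                     # 5 NA 7

# Comparing with NA never yields TRUE, so use is.na() instead of ==
v == NA                              # NA NA NA
n_missing <- sum(is.na(v))           # 1
```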

Factors vs. Character Vectors

Factors are special vectors that represent categorical data with fixed levels.

v <- factor(c("low", "medium", "high", "medium"))

levels(v)  # "high" "low" "medium"

typeof(v)  # "integer"

class(v)   # "factor"

They store the underlying data as integers but print as categories. Factors are essential for statistical modeling.
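
To make the integer storage visible, the codes can be pulled out with as.integer(). This is a brief sketch; the variable name sizes is illustrative.

```r
sizes <- factor(c("low", "medium", "high", "medium"))

# Each code is the position of the value in levels(sizes): high, low, medium
codes <- as.integer(sizes)      # 2 3 1 3
labels <- as.character(sizes)   # back to plain strings
counts <- table(sizes)          # number of observations per level
```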

Subsetting with which() and Logical Conditions

To extract elements meeting conditions, use:

v <- c(3, 6, 9, 12)

v[v > 5]        # 6 9 12

v[which(v > 5)] # same as above

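which() returns the indices of the TRUE values, while direct logical indexing extracts the elements in one step. The two give the same result here; they differ only when the condition produces NA.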
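
One practical difference between the two approaches shows up with missing values. A small sketch with made-up data:

```r
v <- c(3, NA, 9, 12)

by_logical <- v[v > 5]        # NA 9 12, the NA comparison yields an NA element
by_which <- v[which(v > 5)]   # 9 12, which() ignores NA
idx <- which(v > 5)           # 3 4
```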

Combining Vectors

You can concatenate vectors with c():

a <- c(1, 2, 3)

b <- c(4, 5)

combined <- c(a, b)  # 1 2 3 4 5

Vector Recycling — Detailed Example

Recycling is a powerful feature, but it must be used with care.

v1 <- c(1, 2, 3, 4, 5, 6)

v2 <- c(10, 20)

v1 + v2  # 11 22 13 24 15 26  (v2 recycled as 10 20 10 20 10 20)

If lengths don’t align evenly, R warns:

v1 <- c(1, 2, 3, 4, 5)

v2 <- c(10, 20, 30)

v1 + v2  # 11 22 33 14 25  (v2 recycled as 10 20 30 10 20)

# Warning: longer object length is not a multiple of shorter object length

Sorting and Ordering Vectors

  • sort() returns a sorted vector.
  • order() returns the indices that would sort the vector.

Example:

v <- c(7, 2, 9, 4)

sort(v)          # 2 4 7 9

order(v)         # 2 4 1 3 (positions of sorted elements)

v[order(v)]      # 2 4 7 9

You can sort descending with:

sort(v, decreasing = TRUE)
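
order() is most useful when one vector needs to be arranged by the values of another. A minimal sketch, using a hypothetical named vector of scores:

```r
scores <- c(Ann = 82, Bob = 95, Cy = 78)

ranked <- sort(scores, decreasing = TRUE)    # Bob 95, Ann 82, Cy 78
idx <- order(scores, decreasing = TRUE)      # 2 1 3
top_names <- names(scores)[idx]              # "Bob" "Ann" "Cy"
```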

Additional Useful Vector Functions

  • sample(v, size) — randomly sample elements from a vector
  • match(x, table) — find positions of elements of x in table
  • setdiff(x, y) — elements in x but not in y
  • intersect(x, y) — common elements of x and y
  • union(x, y) — all unique elements from x and y combined

Example:

x <- c(1, 2, 3)

y <- c(3, 4, 5)

setdiff(x, y)     # 1 2

intersect(x, y)   # 3

union(x, y)       # 1 2 3 4 5
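
The example above leaves out match() and sample(). The sketch below covers them with illustrative names; %in% is a related base R operator that is not in the list above. The sampled values are not shown because they depend on the seed and the R version.

```r
x <- c("b", "d", "a")
lookup <- c("a", "b", "c")

pos <- match(x, lookup)     # 2 NA 1, NA marks values missing from lookup
found <- x %in% lookup      # TRUE FALSE TRUE

set.seed(42)                # fix the seed so the draw is repeatable
draw <- sample(1:10, size = 3)
length(draw)                # 3 values drawn without replacement
```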

Working with Large Vectors

For big data, vectors can become huge. Use functions like:

  • length() to check size
  • head() and tail() to view subsets
  • summary() for quick stats

Example:

v <- rnorm(1000000)

length(v)  # 1,000,000

head(v)    # first 6 elements

summary(v) # min, max, median, quartiles

Lists

Unlike vectors, lists are non-homogeneous data structures, meaning they can contain elements of different types. Lists can hold numbers, characters, vectors, other lists, matrices, and even functions.

Lists are created using the list() function.

Example:

list1 <- list("Sam", "Green", c(8, 2, 67), TRUE, 51.99, 11.78, FALSE)

print(list1)

Output:

[[1]]

[1] "Sam"

[[2]]

[1] "Green"

[[3]]

[1] 8 2 67

[[4]]

[1] TRUE

[[5]]

[1] 51.99

[[6]]

[1] 11.78

[[7]]

[1] FALSE

Accessing Elements in a List

Elements of a list can be accessed by using their indices.

Example:

list2 <- list(matrix(c(3, 9, 5, 1, -2, 8), nrow = 2), c("Jan", "Feb", "Mar"), list(3, 4, 5))

print(list2[1])

print(list2[2])

print(list2[3])

Output:

[[1]]

     [,1] [,2] [,3]

[1,]    3    5   -2

[2,]    9    1    8

[[2]]

[1] "Jan" "Feb" "Mar"

[[3]]

[[3]][[1]]

[1] 3

[[3]][[2]]

[1] 4

[[3]][[3]]

[1] 5

Adding and Deleting List Elements

You can add elements to the end of a list by assigning a new value to the next index, and remove elements by assigning NULL.

Example:

list2[4] <- "HELLO"

print(list2[4])

list2[4] <- NULL

print(list2[4])

Output:

[[1]]

[1] "HELLO"

[[1]]

NULL

Updating Elements of a List

To update an element, assign a new value to the specific index.

Example:

list2[3] <- "Element Updated"

print(list2[3])

Output:

[[1]]

[1] "Element Updated"

Lists in R Programming

Lists are an important and versatile data structure in R. Unlike vectors, which are homogeneous and contain elements of the same data type, lists are non-homogeneous and can store elements of different types together. These elements can include numbers, characters, vectors, other lists, matrices, functions, or any other objects. This flexibility makes lists very useful when you want to store complex or mixed data types in a single structure.

Creating Lists

Lists are created using the list() function. You can pass any number of elements to this function, each of any type. For example:

list1 <- list("Sam", "Green", c(8, 2, 67), TRUE, 51.99, 11.78, FALSE)

print(list1)

The output will show each element of the list in its position:

[[1]]

[1] "Sam"

[[2]]

[1] "Green"

[[3]]

[1] 8 2 67

[[4]]

[1] TRUE

[[5]]

[1] 51.99

[[6]]

[1] 11.78

[[7]]

[1] FALSE

Accessing List Elements

You can access list items by index. Single square brackets return a sub-list containing the selected items, while double square brackets return the item itself. For example, if you have a list:

list2 <- list(matrix(c(3, 9, 5, 1, -2, 8), nrow = 2), c("Jan", "Feb", "Mar"), list(3, 4, 5))

Accessing the first, second, and third elements can be done as follows:

print(list2[1])  # A one-element list holding the first element (a matrix)

print(list2[2])  # A one-element list holding the month names

print(list2[3])  # A one-element list holding the nested list

If you want to access elements inside the nested list within the list, use double square brackets:

print(list2[[3]][[1]])  # Outputs 3

print(list2[[3]][[2]])  # Outputs 4

print(list2[[3]][[3]])  # Outputs 5
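
The distinction between single and double brackets is easy to check directly. The following is a minimal sketch on a smaller, hypothetical list named lst:

```r
lst <- list(matrix(1:4, nrow = 2), c("Jan", "Feb"), list(3, 4))

sub <- lst[1]        # single brackets: still a list, of length 1
item <- lst[[1]]     # double brackets: the matrix itself

is.list(sub)         # TRUE
is.matrix(item)      # TRUE
first_month <- lst[[2]][1]    # "Jan"
nested_val <- lst[[3]][[2]]   # 4
```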

Adding and Deleting List Elements

You can add elements to a list by assigning a new value at the next index:

list2[4] <- "HELLO"

print(list2[4])

This will add a new element at position 4. To delete an element, assign NULL to that index:

list2[4] <- NULL

print(list2[4])  # Prints a one-element list holding NULL, since position 4 no longer exists

Updating List Elements

To update an existing element in the list, simply assign a new value to the desired index:

list2[3] <- "Element Updated"

print(list2[3])

This replaces the third element in the list with the string “Element Updated”.

Matrices in R Programming

Matrices are two-dimensional data structures in R that hold elements of the same data type. They can be considered as vectors with a dimension attribute. The elements in a matrix are arranged in rows and columns.
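
The statement that a matrix is a vector with a dimension attribute can be checked by setting that attribute by hand. This is an illustrative sketch, not the usual way to build matrices.

```r
v <- 1:6
is.matrix(v)          # FALSE, a plain vector

dim(v) <- c(2, 3)     # attach a dimension attribute: 2 rows, 3 columns
is.matrix(v)          # TRUE
v[2, 3]               # 6, values fill column by column
attributes(v)         # only $dim: 2 3
```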

Creating Matrices

You can create a matrix using the matrix() function. The primary arguments include the data elements as a vector, the number of rows, the number of columns, whether to fill the matrix by row or by column, and optional dimension names.

The syntax is:

matrix(data, nrow, ncol, byrow, dimnames)

  • data is a vector containing the elements.
  • nrow specifies the number of rows.
  • ncol specifies the number of columns.
  • byrow is a logical value indicating whether to fill the matrix by rows (TRUE) or by columns (FALSE, default).
  • dimnames is a list containing optional row and column names.

Example of creating a 3×3 matrix filled by rows:

M1 <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = TRUE)

print(M1)

Output:

     [,1] [,2] [,3]

[1,]    1    2    3

[2,]    4    5    6

[3,]    7    8    9

Similarly, creating a matrix filled with columns:

M2 <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = FALSE)

print(M2)

Output:

     [,1] [,2] [,3]

[1,]    1    4    7

[2,]    2    5    8

[3,]    3    6    9

Adding Row and Column Names

You can name the rows and columns by passing a list to the dimnames argument:

row_names <- c("row1", "row2", "row3")

col_names <- c("col1", "col2", "col3")

M3 <- matrix(c(1:9), nrow = 3, byrow = TRUE, dimnames = list(row_names, col_names))

print(M3)

Output:

     col1 col2 col3

row1    1    2    3

row2    4    5    6

row3    7    8    9

Accessing Matrix Elements

To access elements of a matrix, you specify the row and column indices within square brackets. The syntax is:

matrixName[row, column]

For example, using matrix M3:

print(M3[1, 1])  # First row, first column

print(M3[3, 3])  # Third row, third column

print(M3[2, 3])  # Second row, third column

The output will be:

[1] 1

[1] 9

[1] 6

You can also extract entire rows or columns by leaving the other index blank:

print(M3[1, ])  # All elements in first row

print(M3[, 2])  # All elements in second column
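
Extracting a single row or column returns a plain vector by default. The sketch below uses a hypothetical matrix M and the drop = FALSE argument of the [ operator to keep the result two-dimensional; it also shows logical indexing on a matrix.

```r
M <- matrix(1:9, nrow = 3, byrow = TRUE)

row_vec <- M[1, ]                  # 1 2 3, a plain vector
row_mat <- M[1, , drop = FALSE]    # still a matrix, with 1 row and 3 columns
dim(row_mat)                       # 1 3

# Logical indexing returns matching elements in column-major order
big <- M[M > 5]                    # 7 8 6 9
```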

Factors in R

Factors are a special data structure used for fields that take a limited number of unique values, often called categorical data. Factors are useful in statistical modeling and data analysis because they represent categories and their levels efficiently.

Creating Factors

You create factors from vectors using the factor() function. For example:

data <- c("Male", "Female", "Male", "Child", "Child", "Male", "Female", "Female")

print(data)

factor.data <- factor(data)

print(factor.data)

Output:

[1] "Male"   "Female" "Male"   "Child"  "Child"  "Male"   "Female" "Female"

[1] Male   Female Male   Child  Child  Male   Female Female

Levels: Child Female Male

The unique values in the vector become the factor levels.

Using Factors in Data Frames

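Whether text columns in a data frame become factors depends on the R version: before R 4.0.0, data.frame() converted them to factors by default, while newer versions keep them as character vectors unless you pass stringsAsFactors = TRUE or call factor() yourself. For instance, consider a data frame emp.finaldata whose empdept column is a factor: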

print(is.factor(emp.finaldata$empdept))

print(emp.finaldata$empdept)

Output might be:

[1] TRUE

[1] Sales      Marketing  HR         R&D        IT         Operations Finance

Levels: Finance HR IT Marketing Operations R&D Sales

This shows that the empdept column is treated as a factor, which can be beneficial for grouping and analysis.
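
Since emp.finaldata is not defined in this article, here is a self-contained sketch with a small, made-up data frame named staff. It requests factor columns explicitly so that it behaves the same on old and new R versions.

```r
staff <- data.frame(
  empname = c("Alice", "Bob", "Charlie"),
  empdept = c("Sales", "HR", "Sales"),
  stringsAsFactors = TRUE   # convert text columns to factors explicitly
)

is.factor(staff$empdept)            # TRUE
dept_levels <- levels(staff$empdept)  # "HR" "Sales"
dept_counts <- table(staff$empdept)   # HR 1, Sales 2

# Alternative: build with character columns and convert just one of them
staff2 <- data.frame(empdept = c("Sales", "HR"), stringsAsFactors = FALSE)
staff2$empdept <- factor(staff2$empdept)
```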

Data Frames in R Programming

Data frames are one of the most widely used data structures in R. They are essentially tables or 2D data structures where each column can be of a different data type (numeric, character, factor, etc.). Data frames are used to store tabular data similar to spreadsheets or SQL tables.

Creating a Data Frame

You can create a data frame using the data.frame() function:

empdata <- data.frame(

  empid = c(1001, 1002, 1003, 1004),

  empname = c("Alice", "Bob", "Charlie", "David"),

  empdept = c("Sales", "Marketing", "HR", "IT"),

  empsalary = c(50000, 55000, 45000, 60000)

)

print(empdata)

Output:

  empid empname   empdept empsalary

1  1001   Alice     Sales     50000

2  1002     Bob Marketing     55000

3  1003 Charlie        HR     45000

4  1004   David        IT     60000

Accessing Data Frame Elements

You can access elements of a data frame in several ways:

  • By column name:

print(empdata$empname)  # Prints the empname column

  • By row and column indices:

print(empdata[2, 3])  # Row 2, column 3: "Marketing"

print(empdata[ , 2])  # All rows, column 2 (empname)

print(empdata[1, ])   # Row 1, all columns
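
Rows can also be selected with a logical condition in the row position, in the same way as logical indexing on vectors. A minimal sketch on a reduced, hypothetical copy of the employee data named emp:

```r
emp <- data.frame(
  empname = c("Alice", "Bob", "Charlie", "David"),
  empsalary = c(50000, 55000, 45000, 60000),
  stringsAsFactors = FALSE
)

high_paid <- emp[emp$empsalary > 50000, ]                  # rows for Bob and David
high_paid_names <- emp[emp$empsalary > 50000, "empname"]   # "Bob" "David"
salaries <- emp[["empsalary"]]                             # same as emp$empsalary
nrow(emp)   # 4
ncol(emp)   # 2
```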

Adding Rows and Columns

  • To add a new column:

empdata$empbonus <- c(5000, 4000, 3000, 6000)

print(empdata)

  • To add a new row (using rbind()):

newrow <- data.frame(empid = 1005, empname = "Eve", empdept = "Finance", empsalary = 52000, empbonus = 4500)

empdata <- rbind(empdata, newrow)

print(empdata)

Arrays in R Programming

Arrays are similar to matrices but can have more than two dimensions. They hold elements of the same data type.

Creating Arrays

You use the array() function to create an array by specifying the data and dimensions:

# Create a 3D array with dimensions 2 x 3 x 4

arr <- array(1:24, dim = c(2, 3, 4))

print(arr)

Accessing Array Elements

You specify indices for each dimension in square brackets:

print(arr[1, 2, 3])  # Element at 1st row, 2nd column, 3rd matrix

print(arr[, , 2])    # The entire 2nd matrix

Arrays are useful for working with multi-dimensional data, like image processing, scientific datasets, etc.
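
As a small sketch of working across dimensions, apply() can summarize each 2 x 3 slice of the array built above. Using apply() here is one possible approach, not something the original text prescribes.

```r
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)                 # 2 3 4
one_value <- arr[1, 2, 3]   # 15

# Sum every 2 x 3 matrix along the third dimension
slice_sums <- apply(arr, 3, sum)   # 21 57 93 129
```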

Handling Missing Data in R

Missing data is common in real-world datasets, and R provides ways to detect and manage missing values.

Representing Missing Data

In R, missing data is represented by the special value NA.

Detecting Missing Values

Use the is.na() function to check for missing values:

data <- c(10, 20, NA, 40, NA)

print(is.na(data))

Output:

[1] FALSE FALSE  TRUE FALSE  TRUE

Removing Missing Values

You can remove missing values using the na.omit() function or by using logical indexing:

clean_data <- na.omit(data)

print(clean_data)

Output:

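[1] 10 20 40

attr(,"na.action")

[1] 3 5

attr(,"class")

[1] "omit"

The attr lines record which positions were dropped (here 3 and 5).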

Alternatively:

clean_data <- data[!is.na(data)]

print(clean_data)

Replacing Missing Values

You can replace missing values with a specific value:

data[is.na(data)] <- 0

print(data)

Output:

[1] 10 20  0 40  0
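
Replacing with 0 can distort later statistics, so a common alternative is to fill with the mean of the observed values. The sketch below assumes that choice suits the data; the names readings and fill are illustrative.

```r
readings <- c(10, 20, NA, 40, NA)

# Mean of the non-missing values: (10 + 20 + 40) / 3
fill <- mean(readings, na.rm = TRUE)

readings[is.na(readings)] <- fill
anyNA(readings)   # FALSE, no missing values remain
```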

Lists in R Programming

As covered earlier, lists can hold elements of different types and structures, including vectors, matrices, other lists, data frames, and even functions. Elements can also be named, which makes them accessible by name as well as by position.

Creating Lists

mylist <- list(

  name = "John",

  age = 28,

  scores = c(85, 90, 88),

  passed = TRUE

)

print(mylist)

Accessing List Elements

You can access list elements by name or position:

print(mylist$name)    # "John"

print(mylist[[2]])    # 28 (age)

print(mylist$scores)  # 85 90 88

To access an element inside a list element:

print(mylist$scores[2])  # 90
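
Named lists can be extended and trimmed by name in the same way. A minimal sketch with a hypothetical list called person (the city value is made up):

```r
person <- list(name = "John", scores = c(85, 90, 88))

person$city <- "Oslo"        # add an element by name
person[["age"]] <- 28        # double brackets with a name work too
field_names <- names(person) # "name" "scores" "city" "age"

person$city <- NULL          # remove an element by name
length(person)               # 3
types <- sapply(person, class)  # class of each remaining element
```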

Factors in R Programming

Factors are used to handle categorical data — data that has a fixed number of possible values (levels). They are stored as integers with labels.

Creating Factors

colors <- c("red", "blue", "red", "green", "blue", "blue")

color_factor <- factor(colors)

print(color_factor)

Output:

[1] red   blue  red   green blue  blue

Levels: blue green red

Why Factors?

  • Factors help with statistical modeling.
  • They save memory compared to storing character vectors.
  • Levels can be ordered or unordered.

Accessing Levels

levels(color_factor)  # "blue" "green" "red"

You can specify the order of levels:

ordered_factor <- factor(colors, levels = c("red", "green", "blue"), ordered = TRUE)

print(ordered_factor)
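
Ordering matters because ordered factors support comparisons. The sketch below uses a hypothetical sizes factor, where the order low, medium, high has an obvious meaning.

```r
sizes <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

below_high <- sizes < "high"           # TRUE FALSE TRUE TRUE
largest <- as.character(max(sizes))    # "high"
codes <- as.integer(sizes)             # 1 3 2 1, positions in the level order
counts <- table(sizes)                 # low 2, medium 1, high 1
```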

Basic Data Manipulation with dplyr

dplyr is a package that makes data manipulation easy and intuitive.

Installing and Loading dplyr

install.packages("dplyr")  # Run once

library(dplyr)

Sample Data Frame

df <- data.frame(

  name = c("Alice", "Bob", "Charlie", "David", "Eve"),

  age = c(25, 30, 35, 40, 28),

  salary = c(50000, 60000, 55000, 65000, 48000)

)

Common dplyr Functions

  • filter() — Filter rows based on condition:

young_employees <- filter(df, age < 30)

print(young_employees)

  • select() — Select specific columns:

selected_data <- select(df, name, salary)

print(selected_data)

  • mutate() — Add new columns or modify existing ones:

df <- mutate(df, bonus = salary * 0.1)

print(df)

  • arrange() — Sort rows:

df_sorted <- arrange(df, desc(salary))

print(df_sorted)

  • summarize() — Summarize data (usually with grouping):

avg_salary <- summarize(df, avg_salary = mean(salary))

print(avg_salary)

  • group_by() — Group data before summarizing:

df2 <- data.frame(

  dept = c("Sales", "Sales", "HR", "HR", "IT"),

  salary = c(50000, 55000, 45000, 47000, 60000)

)

dept_avg <- df2 %>%

  group_by(dept) %>%

  summarize(avg_salary = mean(salary))

print(dept_avg)
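
The verbs above are usually chained into one pipeline with %>%. The following sketch assumes dplyr is installed and uses a made-up data frame named staff; the 10% bonus rule mirrors the mutate() example.

```r
library(dplyr)

staff <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  dept = c("Sales", "Sales", "HR", "HR", "IT"),
  salary = c(50000, 55000, 45000, 47000, 60000)
)

result <- staff %>%
  filter(salary > 45000) %>%             # drop Charlie
  mutate(bonus = salary * 0.1) %>%       # add a bonus column
  group_by(dept) %>%                     # one group per department
  summarize(avg_bonus = mean(bonus), n = n()) %>%
  arrange(desc(avg_bonus))               # highest average bonus first

print(result)  # IT 6000 (n = 1), Sales 5250 (n = 2), HR 4700 (n = 1)
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations happen.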

Final Thoughts 

1. Practice is Key

  • R is best learned by doing. Try solving real problems or working on small projects.
  • Use datasets from sources like Kaggle or R’s built-in datasets (mtcars, iris).

2. Understand the Ecosystem

  • R is much more than base functions. Explore popular packages like:
    • tidyverse (includes dplyr, ggplot2, tidyr, etc.) for data manipulation and visualization.
    • data.table for fast data handling.
    • shiny for interactive web apps.

3. Learn to Debug and Read Error Messages

  • Errors are part of coding. Carefully read error messages—they often tell you exactly what’s wrong.
  • Use traceback(), debug(), and browser() functions to troubleshoot.

4. Write Clean and Reproducible Code

  • Comment your code.
  • Use meaningful variable names.
  • Organize your scripts logically.
  • Consider using R Markdown for combining code, output, and explanations in one document.

5. Keep Exploring Statistical and Visualization Tools

  • R’s strength is in statistics and graphics.
  • Learn to use ggplot2 for advanced plotting.
  • Explore modeling functions like lm(), glm(), or machine learning packages.

6. Join the Community

  • R has a welcoming and active community.
  • Participate in forums like Stack Overflow, RStudio Community, or local user groups.
  • Follow blogs, Twitter accounts, or YouTube channels about R.