Populating Multiple Columns in R Dataframe Using dplyr for Matching Values
R Multiple Dataframe Column Matches to Populate Column
This post discusses how to populate multiple columns in one dataframe based on matching values with another dataframe using the dplyr library in R.
Introduction
In this example, we have two dataframes: df1 and df2. The structure of these dataframes is shown below:
df1 <- structure(list(MAPS_code = c("SARI", "SABO", "SABO", "SABO",
"ISLA", "TROP"), Location_code = c("LCP-", "LCP-", "LCP-", "LCP-", "LCP-",
"LCP-"), Contact = c("Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall",
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall"), Lat = c(NA, NA, NA,
NA, NA, "51.23"), Long = c(NA, NA, NA, NA, NA, "-109.26")), row.names = c(NA, 6L), class = "data.frame")
df2 <- structure(list(MAPS_code = c("SAFR", "SAGA", "ELPU", "ISLA",
"SABO", "SATE", "QUST", "SARI", "SABO", "SABO"), Location_code = c("LCP-", "LCP-", "LCP-", "LCP-", "LCP-",
"LCP-", "LCP-", "LCP-", "LCP-", "LCP-"), Contact = c("Tom Jones", "John Smith", "Jane Doe",
"Chase Mendenhall", "Chase Mendenhall", "Tom Johnson", "Jane Brown", "Chase Mendenhall",
"Chase Mendenhall", "Chase Mendenhall"), Lat = c(8.827778, 9.876543, 10.234567, 8.835833, 8.801111,
8.765432, 8.123456, 8.86789, 8.80111, 8.80111), Long = c(-82.92417, -82.12345, -82.34567, -82.96306,
-82.91722, -82.76543, -82.90123, -82.87653, -82.91722, -82.91722)), class = "data.frame", row.names = c(NA, -10L))
We want to populate the missing Lat and Long values in df1 with values from df2, for the rows where MAPS_code, Location_code, and Contact match.
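Before joining, it can help to check whether the key columns identify rows uniquely in df2, because repeated keys make a join return more rows than df1 has. The following is a minimal sketch that uses only the key columns of df2 shown above; the names df2_keys and dup_keys are illustrative and not part of the original example.

```r
library(dplyr)

# Key columns of df2, copied from the data above (df2_keys is an illustrative name)
df2_keys <- data.frame(
  MAPS_code = c("SAFR", "SAGA", "ELPU", "ISLA", "SABO",
                "SATE", "QUST", "SARI", "SABO", "SABO"),
  Location_code = "LCP-",
  Contact = c("Tom Jones", "John Smith", "Jane Doe", "Chase Mendenhall",
              "Chase Mendenhall", "Tom Johnson", "Jane Brown",
              "Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall")
)

# Count rows per key combination and keep combinations that occur more than once
dup_keys <- df2_keys %>%
  count(MAPS_code, Location_code, Contact) %>%
  filter(n > 1)

dup_keys
```

With this data the SABO key occurs three times, so df2 should be reduced to one row per key (for example with distinct()) if each row of df1 is meant to receive exactly one match.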
Solution
Here is a step-by-step guide to achieving this:
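# Load necessary libraries
library(dplyr)
# Rename 'Location_code' column to 'Location' and convert Lat/Long from character
# to numeric so their type matches df2
df1 <- df1 %>%
  rename(Location = Location_code) %>%
  mutate(across(c(Lat, Long), as.numeric))
# Keep one row per key in df2 so the join cannot duplicate rows of df1
df2_unique <- df2 %>%
  distinct(MAPS_code, Contact, Location_code, .keep_all = TRUE)
# Left join df1 with df2 on MAPS_code, Contact and Location columns;
# df2's Lat and Long get the suffix '.df2'
joined <- df1 %>%
  left_join(df2_unique,
            by = c('MAPS_code', 'Contact', 'Location' = 'Location_code'),
            suffix = c('', '.df2'))
# Take the first non-NA value for Lat and Long, preferring values already in df1
joined <- joined %>%
  mutate(Lat = coalesce(Lat, Lat.df2),
         Long = coalesce(Long, Long.df2))
# Drop the suffixed helper columns, keeping only the original column names
result <- joined %>%
  select(!contains('.'))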
Explanation
Here’s a brief explanation of each step:
- We first rename the Location_code column in df1 to Location.
- Then we perform a left join of df1 with df2 on the MAPS_code, Contact, and Location columns. This ensures that all rows from df1 are included, even if there’s no matching row in df2.
- Next, we take the first non-NA value for the Lat and Long columns using the coalesce() function, so values missing in df1 are filled from df2.
- Finally, we keep only columns whose names do not contain a period (i.e., the original column names) using the !contains('.') filter, which drops the suffixed duplicates created by the join.
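To make the coalesce() step concrete, here is a small sketch on plain vectors. The vector names are illustrative; the values are taken from the Lat columns of the example data.

```r
library(dplyr)

# Values already present in the first dataframe (NA where missing)
lat_own <- c(NA, NA, 51.23)
# Candidate values brought in by a join (NA where nothing matched)
lat_joined <- c(8.86789, 8.801111, NA)

# For each position, coalesce() returns the first non-NA value
lat_filled <- coalesce(lat_own, lat_joined)
lat_filled
```

Because lat_own comes first, a value that already exists in df1 is never overwritten; only the gaps are filled.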
The dataframe returned by the solution has the desired populated values in its Lat and Long columns:
structure(list(MAPS_code = c("SARI", "SABO", "SABO", "SABO",
"ISLA", "TROP"), Location = c("LCP-", "LCP-", "LCP-", "LCP-", "LCP-",
"LCP-"), Contact = c("Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall",
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall"), Lat = c(8.86789, 8.801111, 8.801111,
8.801111, 8.835833, 51.230000), Long = c(-82.87653, -82.91722, -82.91722,
-82.91722, -82.96306, -109.26000)), row.names = c(NA, 6L), class = "data.frame")
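As a possible alternative, dplyr also provides rows_patch(), which fills only the NA values of one dataframe from another by key. The sketch below assumes dplyr 1.1.0 or later (for the unmatched argument and for repeated keys in the first dataframe) and uses small stand-in dataframes named x and y, which are illustrative reductions of df1 and df2 rather than the full data.

```r
library(dplyr)

# Stand-in for df1: numeric Lat/Long with missing values (x is an illustrative name)
x <- data.frame(
  MAPS_code = c("SARI", "SABO", "SABO", "TROP"),
  Contact = "Chase Mendenhall",
  Lat = c(NA, NA, NA, 51.23),
  Long = c(NA, NA, NA, -109.26)
)

# Stand-in for df2: one row per key, since rows_patch() requires unique keys in y
y <- data.frame(
  MAPS_code = c("SAFR", "SABO", "SARI"),
  Contact = c("Tom Jones", "Chase Mendenhall", "Chase Mendenhall"),
  Lat = c(8.827778, 8.801111, 8.86789),
  Long = c(-82.92417, -82.91722, -82.87653)
)

# Fill NA values in x from y by key; keys in y that are absent from x are ignored
patched <- rows_patch(x, y, by = c("MAPS_code", "Contact"), unmatched = "ignore")
patched
```

This avoids the suffixed helper columns of the join approach, at the cost of requiring that the second dataframe has unique keys and contains only columns that also exist in the first.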
This solution demonstrates how to populate multiple columns in one dataframe based on matching values with another dataframe using the dplyr library.
Last modified on 2024-09-07