Intra-import deduplication: preventing duplicate records during CSV import
Intra-import deduplication: preventing duplicate records during CSV import
You have a contacts CSV with 500 rows. Every row has a "Company" column. Three hundred of those rows say "Acme Corp". You run the import. Your companies table now has 300 rows for Acme Corp. Three hundred identical records, each created because the import processed each row in isolation without remembering what it had already done.
This is intra-import duplication -- the most common data quality problem in CSV imports that involve relationships. It is not about duplicates that already exist in the database before the import starts. Intra-import duplication happens within a single import run, when two or more rows reference the same related entity and the system creates it fresh each time.
If you have not already read about why duplicate handling ends up in every real-world import, see The hidden cost of building your own CSV importer for the full picture of what this piece fits into.
The fix is a normalized-key cache that sits in memory for the duration of the import. The first row that references "Acme Corp" creates the company record and stores its ID. The remaining 299 rows pull that ID from the cache. One company record, not three hundred.
#Why firstOrCreate alone does not solve this
The obvious response is to use Eloquent's firstOrCreate. If the record exists, return it. If not, create it. Problem solved.
foreach ($rows as $row) {
$company = Company::firstOrCreate(
['name' => $row['company']],
);
Contact::create([
'first_name' => $row['first_name'],
'email' => $row['email'],
'company_id' => $company->id,
]);
}
This prevents true duplicates at the database level, but it comes with two costs that get worse as your file grows.
First, every row hits the database with a SELECT query. For 500 rows referencing "Acme Corp", you run 500 SELECT queries against the companies table. The first one finds nothing and inserts. The next 499 find the existing row and return it. That is 499 queries that return the exact same result, each adding latency and load for no reason.
Second, firstOrCreate relies on exact string matching in the WHERE clause. If row 1 has "Acme Corp" and row 47 has " Acme Corp" (leading space) or row 200 has "acme corp" (lowercase), firstOrCreate treats them as different values. You get three company records instead of one. The database does not know these are the same entity because nobody told it how to normalize the comparison.
An in-memory cache with normalized keys solves both problems at once.
#The normalized-key cache pattern
The pattern has three parts: a key format, a normalization function, and a lookup-or-create flow.
#Key format
Cache keys follow the structure {linkKey}:{normalized_value}. The linkKey is a prefix that identifies which relationship type the entry belongs to. The normalized value is the cleaned version of the raw CSV string.
For a company relationship, the key might be company:acme corp. For a tag relationship, tag:laravel. The prefix prevents collisions when a single import resolves multiple relationship types -- you do not want a company named "Laravel" to collide with a tag named "Laravel".
#Normalization
Normalization strips the cosmetic differences that make string matching fail. The minimum set of transformations is:
- Trim leading and trailing whitespace
- Collapse internal runs of whitespace to a single space
- Convert to lowercase
private function normalizeValue(string $value): string
{
$value = trim($value);
$value = preg_replace('/\s+/', ' ', $value);
return strtolower($value);
}
This ensures that "Acme Corp", " Acme Corp ", "acme corp", and "ACME CORP" all produce the same normalized key: acme corp. The raw value from the first occurrence is what gets stored in the database -- normalization only affects cache key lookups, not the persisted data.
Depending on your domain, you might extend normalization further -- stripping punctuation to catch "Acme Corp." matching "Acme Corp", or transliterating accented characters. But trim + collapse + lowercase covers the vast majority of real-world CSV inconsistencies.
#The cache itself
The cache is a plain PHP array, scoped to the import job instance. It does not use Redis or any external store. It exists for the lifetime of the job and is garbage collected when the job finishes.
class EntityLinkResolver
{
private array $resolved = [];
public function resolve(
string $linkKey,
string $rawValue,
string $model,
string $matchBy,
string $behavior,
): ?int {
$normalized = $this->normalizeValue($rawValue);
$cacheKey = "{$linkKey}:{$normalized}";
if (isset($this->resolved[$cacheKey])) {
return $this->resolved[$cacheKey];
}
$record = $model::where($matchBy, $rawValue)->first();
if ($record) {
$this->resolved[$cacheKey] = $record->id;
return $record->id;
}
if ($behavior !== 'match_or_create') {
return null;
}
$record = $model::create([$matchBy => $rawValue]);
$this->resolved[$cacheKey] = $record->id;
return $record->id;
}
private function normalizeValue(string $value): string
{
$value = trim($value);
$value = preg_replace('/\s+/', ' ', $value);
return strtolower($value);
}
}
Walk through what happens with 500 rows referencing "Acme Corp":
- Row 1: Cache miss. SELECT query finds no match. INSERT creates company with ID 42. Cache stores
company:acme corp => 42. - Row 2: Cache hit. Returns 42 immediately. No database query.
- Row 3 through 500: Cache hit. Returns 42 immediately. No database query.
Total queries for the company relationship: 1 SELECT + 1 INSERT. Without the cache: 500 SELECTs + 1 INSERT. For a file with 50 unique companies referenced across 10,000 rows, the savings are even starker: 50 SELECTs + 50 INSERTs versus 10,000 SELECTs + 50 INSERTs.
#Transaction safety
The cache must stay consistent with the database. If a company record is created and cached, but the subsequent contact insert fails and the transaction rolls back, the cache still contains the now-nonexistent company ID. The next row that references that company will use the stale ID, producing a foreign key constraint violation or an orphaned reference.
The solution is straightforward: wrap the create-and-cache operation in a transaction, and clear the cache entry on failure.
use Illuminate\Support\Facades\DB;
public function resolveWithTransaction(
string $linkKey,
string $rawValue,
string $model,
string $matchBy,
): ?int {
$normalized = $this->normalizeValue($rawValue);
$cacheKey = "{$linkKey}:{$normalized}";
if (isset($this->resolved[$cacheKey])) {
return $this->resolved[$cacheKey];
}
try {
return DB::transaction(function () use ($model, $matchBy, $rawValue, $cacheKey) {
$record = $model::lockForUpdate()->where($matchBy, $rawValue)->first();
if (! $record) {
$record = $model::create([$matchBy => $rawValue]);
}
$this->resolved[$cacheKey] = $record->id;
return $record->id;
});
} catch (\Throwable $e) {
unset($this->resolved[$cacheKey]);
throw $e;
}
}
The lockForUpdate() prevents a race condition where two queue workers both attempt to create the same company. The first acquires the lock and creates. The second waits, then finds the existing record. Without the lock, both could pass the first() check simultaneously and both attempt the insert.
The catch block ensures that if anything inside the transaction fails, the cache entry is removed. The next row retries from scratch rather than using a poisoned cache entry.
#Cross-column deduplication
The examples so far assume a single column references each entity. But real-world CSVs often reference the same entity through different columns.
Consider a contacts CSV where "Acme Corp" appears in the "Company" column on some rows, and "acme.com" appears in the "Company Domain" column on others. Both point to the same company record, but through different matching fields: one by name, one by domain.
If the import resolves each column independently, it creates two entries in the cache -- company:acme corp and company_domain:acme.com -- and potentially two company records. The name-based lookup finds nothing (Acme Corp does not exist yet), creates the record, and caches it under the name key. Later, the domain-based lookup finds nothing under its key, creates a second record, and caches it under the domain key. Same company, two records.
Unified resolution solves this by treating both columns as inputs to the same entity resolution pipeline. When the importer defines multiple matchable fields for a relationship, the resolver checks all of them against the same pool of resolved entities:
- Row 1 has
company: "Acme Corp". The resolver searches by name, finds nothing, creates company ID 42, and caches it undercompany:acme corp. - Row 2 has
company_domain: "acme.com". The resolver searches by domain, finds company ID 42 (which was just created with "Acme Corp" as its name). It caches the result undercompany_domain:acme.com, pointing to the same ID 42.
Now both cache keys resolve to ID 42. Subsequent rows that reference "Acme Corp" by name or "acme.com" by domain all hit the cache and link to the same record.
This is why match priority matters. When a single row provides both a company name and a company domain, the resolver checks the higher-priority field first. A domain match (priority 80) is more reliable than a name match (priority 10). If the domain matches an existing record, the name column is ignored for matching purposes even if the name differs slightly. This prevents the resolver from creating a new record based on a name variation when the domain already identified the entity unambiguously.
The interplay between multiple matching fields and the deduplication cache is covered in detail in Importing relational data from CSV files in Laravel.
#When the cache is not enough
The in-memory cache works for a single import job processing rows sequentially. It does not cover two scenarios.
Pre-existing duplicates. If your companies table already has three rows for "Acme Corp" before the import starts, the cache cannot fix that. The import system's match resolution step -- where it queries for existing records and presents results to the user -- is where pre-existing duplicates surface. The user sees "Acme Corp matched 3 records" and decides which one to link to.
Parallel job processing. If your import runs across multiple queue workers simultaneously, each worker has its own in-memory cache. Worker A creates "Acme Corp" and caches ID 42. Worker B does not see that cache entry and creates a second one. The lockForUpdate pattern from the transaction safety section mitigates this at the database level, but for strict deduplication, sequential chunk processing within a single job is the safer architecture. In multi-tenant setups, per-worker cache isolation interacts with tenant scoping in ways worth understanding -- Multi-tenant CSV imports in Laravel covers the job isolation patterns in depth.
#Measuring the impact
The performance gain scales with how relational your data is. An import where every row references 2-3 related entities with high repetition -- 10,000 order rows referencing 50 products and 200 customers -- goes from 20,000+ redundant queries to a few hundred.
The correctness impact is binary. Without the cache and normalization, "Acme Corp" and "acme corp" are different companies. With it, they are the same. You either deduplicate within the import or you do not.
The normalized-key cache is one piece of a broader relationship resolution pipeline. Further reading:
- Importing relational data from CSV files in Laravel -- the full picture of how CSV columns map to database relationships, including BelongsTo lookups and multi-value fields
- For the decision framework behind when to create versus link related records, a dedicated post covers this topic in depth
- The complete guide to CSV imports in Laravel -- the complete landscape of import approaches in Laravel
If you are building an import system that needs deduplication, relationship resolution, and inline validation handled for you, take a look at tapix.dev to see how it works out of the box.