Intra-import deduplication: preventing duplicate records during CSV import | Tapix                  [ ![Tapix](/img/tapix-logo-light.svg) ![Tapix](/img/tapix-logo-dark.svg) ](https://tapix.dev) [Features](https://tapix.dev#features) [Pricing](https://tapix.dev#pricing) [Docs](https://docs.tapix.dev) [Blog](https://tapix.dev/blog)

   Try Demo  [ Get Tapix from $99](https://tapix.dev#pricing)

  [Features](https://tapix.dev#features) [Pricing](https://tapix.dev#pricing) [Docs](https://docs.tapix.dev) [Blog](https://tapix.dev/blog)   Try Demo  [ Get Tapix from $99](https://tapix.dev#pricing)

   ![Tapix](https://tapix.dev/img/tapix-logo-light.svg)

 Best PracticesIntra-import deduplication: preventing duplicate records during CSV import
==========================================================================

 tapix.dev/blog

  [    Back to blog ](https://tapix.dev/blog) [ Best Practices ](https://tapix.dev/blog/category/best-practices)

Intra-import deduplication: preventing duplicate records during CSV import
==========================================================================

 Manch Minasyan ·  May 26, 2026  · 10 min read

 You have a contacts CSV with 500 rows. Every row has a "Company" column. Three hundred of those rows say "Acme Corp". You run the import. Your `companies` table now has 300 rows for Acme Corp. Three hundred identical records, each created because the import processed each row in isolation without remembering what it had already done.

This is intra-import duplication -- the most common data quality problem in CSV imports that involve relationships. It is not about duplicates that already exist in the database before the import starts. Intra-import duplication happens within a single import run, when two or more rows reference the same related entity and the system creates it fresh each time.

If you have not already read about why duplicate handling ends up in every real-world import, see [The hidden cost of building your own CSV importer](/blog/hidden-cost-building-csv-importer) for the full picture of what this piece fits into.

The fix is a normalized-key cache that sits in memory for the duration of the import. The first row that references "Acme Corp" creates the company record and stores its ID. The remaining 299 rows pull that ID from the cache. One company record, not three hundred.

[\#](#why-firstorcreate-alone-does-not-solve-this "Permalink")Why `firstOrCreate` alone does not solve this
-----------------------------------------------------------------------------------------------------------

The obvious response is to use Eloquent's `firstOrCreate`. If the record exists, return it. If not, create it. Problem solved.

```
foreach ($rows as $row) {
    $company = Company::firstOrCreate(
        ['name' => $row['company']],
    );

    Contact::create([
        'first_name'  => $row['first_name'],
        'email'       => $row['email'],
        'company_id'  => $company->id,
    ]);
}

```

This prevents true duplicates at the database level, but it comes with two costs that get worse as your file grows.

First, every row hits the database with a SELECT query. For 500 rows referencing "Acme Corp", you run 500 SELECT queries against the `companies` table. The first one finds nothing and inserts. The next 499 find the existing row and return it. That is 499 queries that return the exact same result, each adding latency and load for no reason.

Second, `firstOrCreate` relies on exact string matching in the WHERE clause. If row 1 has "Acme Corp" and row 47 has " Acme Corp" (leading space) or row 200 has "acme corp" (lowercase), `firstOrCreate` treats them as different values. You get three company records instead of one. The database does not know these are the same entity because nobody told it how to normalize the comparison.

An in-memory cache with normalized keys solves both problems at once.

[\#](#the-normalized-key-cache-pattern "Permalink")The normalized-key cache pattern
-----------------------------------------------------------------------------------

The pattern has three parts: a key format, a normalization function, and a lookup-or-create flow.

### [\#](#key-format "Permalink")Key format

Cache keys follow the structure `{linkKey}:{normalized_value}`. The `linkKey` is a prefix that identifies which relationship type the entry belongs to. The normalized value is the cleaned version of the raw CSV string.

For a company relationship, the key might be `company:acme corp`. For a tag relationship, `tag:laravel`. The prefix prevents collisions when a single import resolves multiple relationship types -- you do not want a company named "Laravel" to collide with a tag named "Laravel".

### [\#](#normalization "Permalink")Normalization

Normalization strips the cosmetic differences that make string matching fail. The minimum set of transformations is:

1. Trim leading and trailing whitespace
2. Collapse internal runs of whitespace to a single space
3. Convert to lowercase

```
private function normalizeValue(string $value): string
{
    $value = trim($value);
    $value = preg_replace('/\s+/', ' ', $value);

    return strtolower($value);
}

```

This ensures that "Acme Corp", " Acme Corp ", "acme corp", and "ACME CORP" all produce the same normalized key: `acme corp`. The raw value from the first occurrence is what gets stored in the database -- normalization only affects cache key lookups, not the persisted data.

Depending on your domain, you might extend normalization further -- stripping punctuation to catch "Acme Corp." matching "Acme Corp", or transliterating accented characters. But trim + collapse + lowercase covers the vast majority of real-world CSV inconsistencies.

### [\#](#the-cache-itself "Permalink")The cache itself

The cache is a plain PHP array, scoped to the import job instance. It does not use Redis or any external store. It exists for the lifetime of the job and is garbage collected when the job finishes.

```
class EntityLinkResolver
{
    private array $resolved = [];

    public function resolve(
        string $linkKey,
        string $rawValue,
        string $model,
        string $matchBy,
        string $behavior,
    ): ?int {
        $normalized = $this->normalizeValue($rawValue);
        $cacheKey = "{$linkKey}:{$normalized}";

        if (isset($this->resolved[$cacheKey])) {
            return $this->resolved[$cacheKey];
        }

        $record = $model::where($matchBy, $rawValue)->first();

        if ($record) {
            $this->resolved[$cacheKey] = $record->id;

            return $record->id;
        }

        if ($behavior !== 'match_or_create') {
            return null;
        }

        $record = $model::create([$matchBy => $rawValue]);
        $this->resolved[$cacheKey] = $record->id;

        return $record->id;
    }

    private function normalizeValue(string $value): string
    {
        $value = trim($value);
        $value = preg_replace('/\s+/', ' ', $value);

        return strtolower($value);
    }
}

```

Walk through what happens with 500 rows referencing "Acme Corp":

- **Row 1**: Cache miss. SELECT query finds no match. INSERT creates company with ID 42. Cache stores `company:acme corp => 42`.
- **Row 2**: Cache hit. Returns 42 immediately. No database query.
- **Row 3 through 500**: Cache hit. Returns 42 immediately. No database query.

Total queries for the company relationship: 1 SELECT + 1 INSERT. Without the cache: 500 SELECTs + 1 INSERT. For a file with 50 unique companies referenced across 10,000 rows, the savings are even starker: 50 SELECTs + 50 INSERTs versus 10,000 SELECTs + 50 INSERTs.

[\#](#transaction-safety "Permalink")Transaction safety
-------------------------------------------------------

The cache must stay consistent with the database. If a company record is created and cached, but the subsequent contact insert fails and the transaction rolls back, the cache still contains the now-nonexistent company ID. The next row that references that company will use the stale ID, producing a foreign key constraint violation or an orphaned reference.

The solution is straightforward: wrap the create-and-cache operation in a transaction, and clear the cache entry on failure.

```
use Illuminate\Support\Facades\DB;

public function resolveWithTransaction(
    string $linkKey,
    string $rawValue,
    string $model,
    string $matchBy,
): ?int {
    $normalized = $this->normalizeValue($rawValue);
    $cacheKey = "{$linkKey}:{$normalized}";

    if (isset($this->resolved[$cacheKey])) {
        return $this->resolved[$cacheKey];
    }

    try {
        return DB::transaction(function () use ($model, $matchBy, $rawValue, $cacheKey) {
            $record = $model::lockForUpdate()->where($matchBy, $rawValue)->first();

            if (! $record) {
                $record = $model::create([$matchBy => $rawValue]);
            }

            $this->resolved[$cacheKey] = $record->id;

            return $record->id;
        });
    } catch (\Throwable $e) {
        unset($this->resolved[$cacheKey]);

        throw $e;
    }
}

```

The `lockForUpdate()` prevents a race condition where two queue workers both attempt to create the same company. The first acquires the lock and creates. The second waits, then finds the existing record. Without the lock, both could pass the `first()` check simultaneously and both attempt the insert.

The catch block ensures that if anything inside the transaction fails, the cache entry is removed. The next row retries from scratch rather than using a poisoned cache entry.

[\#](#cross-column-deduplication "Permalink")Cross-column deduplication
-----------------------------------------------------------------------

The examples so far assume a single column references each entity. But real-world CSVs often reference the same entity through different columns.

Consider a contacts CSV where "Acme Corp" appears in the "Company" column on some rows, and "acme.com" appears in the "Company Domain" column on others. Both point to the same company record, but through different matching fields: one by name, one by domain.

If the import resolves each column independently, it creates two entries in the cache -- `company:acme corp` and `company_domain:acme.com` -- and potentially two company records. The name-based lookup finds nothing (Acme Corp does not exist yet), creates the record, and caches it under the name key. Later, the domain-based lookup finds nothing under its key, creates a second record, and caches it under the domain key. Same company, two records.

Unified resolution solves this by treating both columns as inputs to the same entity resolution pipeline. When the importer defines multiple matchable fields for a relationship, the resolver checks all of them against the same pool of resolved entities:

1. Row 1 has `company: "Acme Corp"`. The resolver searches by name, finds nothing, creates company ID 42, and caches it under `company:acme corp`.
2. Row 2 has `company_domain: "acme.com"`. The resolver searches by domain, finds company ID 42 (which was just created with "Acme Corp" as its name). It caches the result under `company_domain:acme.com`, pointing to the same ID 42.

Now both cache keys resolve to ID 42. Subsequent rows that reference "Acme Corp" by name or "acme.com" by domain all hit the cache and link to the same record.

This is why match priority matters. When a single row provides both a company name and a company domain, the resolver checks the higher-priority field first. A domain match (priority 80) is more reliable than a name match (priority 10). If the domain matches an existing record, the name column is ignored for matching purposes even if the name differs slightly. This prevents the resolver from creating a new record based on a name variation when the domain already identified the entity unambiguously.

The interplay between multiple matching fields and the deduplication cache is covered in detail in [Importing relational data from CSV files in Laravel](/blog/importing-relational-data-csv-laravel).

[\#](#when-the-cache-is-not-enough "Permalink")When the cache is not enough
---------------------------------------------------------------------------

The in-memory cache works for a single import job processing rows sequentially. It does not cover two scenarios.

**Pre-existing duplicates.** If your `companies` table already has three rows for "Acme Corp" before the import starts, the cache cannot fix that. The import system's match resolution step -- where it queries for existing records and presents results to the user -- is where pre-existing duplicates surface. The user sees "Acme Corp matched 3 records" and decides which one to link to.

**Parallel job processing.** If your import runs across multiple queue workers simultaneously, each worker has its own in-memory cache. Worker A creates "Acme Corp" and caches ID 42. Worker B does not see that cache entry and creates a second one. The `lockForUpdate` pattern from the transaction safety section mitigates this at the database level, but for strict deduplication, sequential chunk processing within a single job is the safer architecture. In multi-tenant setups, per-worker cache isolation interacts with tenant scoping in ways worth understanding -- [Multi-tenant CSV imports in Laravel](/blog/multi-tenant-csv-imports-laravel) covers the job isolation patterns in depth.

[\#](#measuring-the-impact "Permalink")Measuring the impact
-----------------------------------------------------------

The performance gain scales with how relational your data is. An import where every row references 2-3 related entities with high repetition -- 10,000 order rows referencing 50 products and 200 customers -- goes from 20,000+ redundant queries to a few hundred.

The correctness impact is binary. Without the cache and normalization, "Acme Corp" and "acme corp" are different companies. With it, they are the same. You either deduplicate within the import or you do not.

---

The normalized-key cache is one piece of a broader relationship resolution pipeline. Further reading:

- [Importing relational data from CSV files in Laravel](/blog/importing-relational-data-csv-laravel) -- the full picture of how CSV columns map to database relationships, including BelongsTo lookups and multi-value fields
- For the decision framework behind when to create versus link related records, a dedicated post covers this topic in depth
- [The complete guide to CSV imports in Laravel](/blog/complete-guide-csv-imports-laravel) -- the complete landscape of import approaches in Laravel

If you are building an import system that needs deduplication, relationship resolution, and inline validation handled for you, take a look at [tapix.dev](/) to see how it works out of the box.

 ### Enjoyed this post?

Get notified when we publish new articles about Laravel imports and data handling.

  Email address   Subscribe

Almost there — confirm your subscription via email.

 Related posts
-------------

 [  Best Practices   Jul 7, 2026

 Designing AI-friendly import flows in Laravel
-----------------------------------------------

Developers are asking how to trigger imports from scripts and feed CSV data through APIs. Here's how to design import flows that work for both humans and machines.

 ](https://tapix.dev/blog/designing-ai-friendly-import-flows) [  Best Practices   Jun 23, 2026

 CSV import lifecycle hooks: running custom logic before, during, and after import
-----------------------------------------------------------------------------------

Six lifecycle hooks let you inject custom logic at every stage of an import -- from preparing caches to setting owners to sending notifications.

 ](https://tapix.dev/blog/csv-import-lifecycle-hooks) [  Best Practices   Jun 5, 2026

 CSV import error recovery: from silent failures to user-friendly correction
-----------------------------------------------------------------------------

The spectrum from silent failure to inline correction. Here's how to store, display, and resolve import errors without losing data.

 ](https://tapix.dev/blog/csv-import-error-recovery)

   [ ![Tapix](/img/tapix-logo-light.svg) ![Tapix](/img/tapix-logo-dark.svg) ](https://tapix.dev)CSV and Excel import wizard for Laravel.

  Product [Pricing](https://tapix.dev#pricing) [Docs](https://docs.tapix.dev) [Blog](https://tapix.dev/blog) [Contact](mailto:hello@tapix.dev)

 Compare [vs Laravel Excel](https://tapix.dev/vs/laravel-excel) [vs Filament Import](https://tapix.dev/vs/filament-import)

 Legal [Privacy](https://tapix.dev/privacy-policy) [Terms](https://tapix.dev/terms-of-service)

© 2026 Tapix. All rights reserved.