Auto-detecting CSV column types in Laravel
Auto-detecting CSV column types in Laravel
Every value in a CSV file arrives as a string. The number 1234.56, the date 2026-04-14, the boolean true, and the literal word banana all look the same to fgetcsv: a sequence of characters delimited by commas. Your import system has to figure out what each column actually represents, because the database on the other end cares very much about the difference between an integer, a timestamp, and a varchar.
Most import implementations skip this problem entirely. They hardcode the expected type for each column in the importer definition and blow up at runtime when the data does not match. That works when the developer controls the file format. It fails the moment end users upload their own spreadsheets, where a "Price" column contains dollar signs, a "Date" column uses European day-first ordering, and a "Status" column contains yes, no, 1, 0, true, and active all in the same file.
This post covers how to detect column types from sample values, handle the ambiguity inherent in numeric and date formats, and build a type inference engine that gives your import system a reasonable starting point before the user ever touches the mapping screen.
#The FieldType spectrum
A CSV import system needs a type vocabulary that sits between "everything is a string" and a full database schema. The type must be specific enough to drive validation and parsing, but general enough to handle the messy reality of human-authored data.
Tapix uses 11 field types, defined as a PHP enum:
enum FieldType: string
{
case Text = 'text';
case Number = 'number';
case Date = 'date';
case DateTime = 'datetime';
case Boolean = 'boolean';
case Choice = 'choice';
case MultiChoice = 'multi_choice';
case Email = 'email';
case Phone = 'phone';
case Url = 'url';
case Currency = 'currency';
}
Each type carries three responsibilities. First, validation rules: an Email type applies the email rule, a Phone type applies phone format matching. Second, parsing behavior: Number and Currency strip thousands separators and normalize decimal points; Date and DateTime parse through format-specific patterns. Third, UI representation: each type maps to an icon (heroicon-o-envelope for Email, heroicon-o-banknotes for Currency, heroicon-o-calendar for Date) that appears in the mapping interface so users can visually confirm the detected type.
The enum also provides grouping methods that simplify branching logic downstream. isNumeric() returns true for both Number and Currency. isDate() covers both Date and DateTime. isMultiValue() identifies types that accept comma-separated lists. These groups matter because validation and parsing are shared within a group -- a Currency value parses through the same NumberFormat::parse() pipeline as a plain Number, just with an extra step to strip the currency symbol first.
#Sample value analysis
Type detection starts with pattern matching against sample values from the column. The order of checks matters: more specific types must be tested before general ones. A valid email address also passes a text check, and a currency value like $1,234.56 would match a number pattern if you stripped the dollar sign first.
The detection priority, from most specific to least:
- Email -- validated with Laravel's
emailrule. If"john@example.com"passes, it is an email. - URL -- validated with Laravel's
urlrule. Catcheshttps://example.combut not bare domain names. - Phone -- regex pattern: starts with optional
+, contains at least 7 digits, allows spaces/dashes/parentheses."+1 (555) 123-4567"matches. - Date -- validated with Laravel's
daterule. Catches ISO dates (2026-04-14), natural language dates, and most delimiter-separated formats. - Currency -- regex for values with a leading or trailing currency symbol (
$, EUR, GBP, JPY)."$1,234.56"matches;"1234.56"does not. - Number -- after stripping commas and spaces, the value passes
is_numeric(). Catches integers, floats, and formatted numbers like"1,234". - Text -- the fallback. If nothing else matched, it is text.
Boolean and Choice types are not inferred from values alone -- they require context from the field definition (a predefined list of options, or explicit boolean typing). The inferencer does not guess that a column containing "yes" and "no" is boolean, because those same values could be legitimate text data. Type inference provides a suggestion; the importer definition provides the authority.
private function detectType(string $value): string
{
if ($this->isEmail($value)) {
return FieldType::Email->value;
}
if ($this->isUrl($value)) {
return FieldType::Url->value;
}
if ($this->isPhone($value)) {
return FieldType::Phone->value;
}
if ($this->isDate($value)) {
return FieldType::Date->value;
}
if ($this->isCurrency($value)) {
return FieldType::Currency->value;
}
if ($this->isNumber($value)) {
return FieldType::Number->value;
}
return FieldType::Text->value;
}
Notice Currency is checked before Number. The string "$42.99" would pass the number check after stripping the dollar sign, but the currency symbol is meaningful -- it tells the parser to expect a specific format and potentially preserve the currency context for display.
#Number format detection
Numbers look simple until you leave the US locale. The value "1,234.56" is unambiguous in English-speaking countries: comma is the thousands separator, period is the decimal point. But "1.234,56" means the same thing in Germany, France, and most of continental Europe: period separates thousands, comma is the decimal.
Tapix handles this with a NumberFormat enum that represents the two conventions:
- POINT format: decimal separator is
., thousands separator is,(example:1,000.00) - COMMA format: decimal separator is
,, thousands separator is.(example:1.000,00)
The parsing logic strips the "other" separator, normalizes the decimal to a period (which PHP and every database engine expect), removes currency symbols, and returns a float:
public function parse(string $value): ?float
{
$value = trim($value);
if ($value === '') {
return null;
}
$decimalSeparator = match ($this) {
self::POINT => '.',
self::COMMA => ',',
};
$otherSeparator = match ($this) {
self::POINT => ',',
self::COMMA => '.',
};
$value = str_replace(' ', '', $value);
$value = preg_replace('/^[$\x{20AC}\x{00A3}\x{00A5}]\s*|\s*[$\x{20AC}\x{00A3}\x{00A5}]$/u', '', $value);
$value = str_replace($otherSeparator, '', $value);
if ($decimalSeparator === ',') {
$value = str_replace(',', '.', $value);
}
if (! is_numeric($value)) {
return null;
}
return (float) $value;
}
The ambiguity problem surfaces with values like "1,234". Is this the integer one thousand two hundred thirty-four (POINT format, comma as thousands separator), or the decimal 1.234 (COMMA format, comma as decimal)? There is no way to resolve this from a single value. You need either user input (asking which format applies during the mapping step) or enough sample data to make a statistical guess.
In practice, the safest approach is to default to the user's locale and let them override. Tapix stores the selected NumberFormat on each ColumnData mapping, so the same import can have one column using POINT format and another using COMMA format -- which is exactly what happens when a European company exports a file that mixes US dollar amounts with local pricing.
#Date format detection
Dates are worse than numbers. The string "04/05/2026" has three valid interpretations depending on locale:
- ISO (YYYY-MM-DD): not applicable here, but
2026-04-05is always unambiguous - American (MM/DD/YYYY): April 5th, 2026
- European (DD/MM/YYYY): May 4th, 2026
ISO format (2026-04-05) is the only format that can be inferred from the value itself. The year comes first, followed by month, followed by day. There is exactly one parse format and no ambiguity.
For European and American formats, each DateFormat enum case carries multiple parse patterns to handle real-world variations. European dates might appear as 15/05/2024, 15-05-2024, or 15.05.2024. American dates use 05/15/2024 or 05-15-2024. The parser tries each format in sequence, returning the first successful Carbon parse:
public function parse(string $value, bool $withTime = false): ?Carbon
{
$value = trim($value);
if ($value === '') {
return null;
}
foreach ($this->getParseFormats($withTime) as $format) {
try {
$date = Date::createFromFormat($format, $value);
if ($date instanceof Carbon) {
return $date;
}
} catch (\Exception) {
continue;
}
}
return null;
}
European date-only parsing covers six format patterns (d/m/Y, d-m-Y, d.m.Y, j/n/Y, j-n-Y, j.n.Y). American covers four (m/d/Y, m-d-Y, n/j/Y, n-j/Y). DateTime formats double that count, adding time-first and time-last variants with both H:i and H:i:s precision. The European DateTime list alone has 12 patterns.
This is why date format must be selected by the user, not guessed by the system. You can auto-detect that a column contains dates (the date validation rule catches most formats), but you cannot auto-detect whether 01/02/2026 means January 2nd or February 1st. The mapping step presents the three format options with examples, and the user picks the one that matches their data.
One heuristic that does work: if any value in the column has a day value above 12, the ambiguity resolves itself. A value of 15/04/2026 can only be European (day-first), because there is no 15th month. Scanning sample values for this signal can auto-select the format when the data cooperates, falling back to user selection when it does not.
#Building a type inference engine
Individual value detection is one piece. A full inference engine scans multiple sample values from a column and produces a type recommendation with a confidence score.
The algorithm:
- Sample N values from the column. Sampling matters for large files -- you do not need to scan all 50,000 rows. The first 100-200 non-empty values are sufficient for statistical confidence.
- Vote per value: Run each sample through the detection chain. Every value casts a vote for the type it matched.
- Count votes: The type with the most votes is the candidate. Calculate confidence as the ratio of winning votes to total votes.
- Apply a threshold: If confidence is below 50%, the column is too mixed to infer a type. Fall back to Text.
- Exclude Text: If the winning type is Text, that is not useful information -- it just means nothing specific was detected. Return no inference rather than suggesting Text.
The confidence score drives the UX. A column where 95 of 100 samples are valid email addresses gets a high-confidence Email inference, and the mapping screen can auto-select the type. A column where 60 of 100 samples look like numbers but the rest are text (maybe a "Notes" column with occasional numeric values) gets a low-confidence inference, and the system leaves the type decision to the user.
The 50% threshold is deliberately conservative. A column with "$42.99", "N/A", "$18.00", "pending" is not a Currency column -- it is a mixed column that happens to have some currency values. The inference engine should report uncertainty, not force a bad guess that causes validation failures downstream.
In practice, most real-world CSV columns are homogeneous. An "Email" column is 98% valid emails with a few blanks and maybe one typo. A "Hire Date" column is 100% dates in the same format. The inference engine handles these easy cases automatically, which covers 80% of columns, and flags the ambiguous 20% for human review.
#Where type inference fits in the import pipeline
Type inference is not a standalone feature. It feeds into the mapping step, where the user sees each CSV column paired with a suggested field and type. The inference result becomes the default selection, not the final answer.
The user uploads a CSV, the system parses headers and samples the first N rows. For each column, the ColumnMapper tries to match the header text against field definitions using guess() aliases. If a match is found, the field's declared type takes precedence -- the developer who wrote the importer knows that "salary" is Currency, regardless of what the sample values look like. For unmapped columns, the inference result provides the suggested type in the mapping UI.
Format selection (POINT vs COMMA for numbers, ISO vs European vs American for dates) happens in the same step. Numeric columns expand to show a format picker defaulted to the user's locale. Date columns show the three format options with example values from the actual data.
By the time the system reaches validation, every mapped column has an explicit type and format stored on its ColumnData object. Validation and parsing use those explicit settings, not inference guesses. The per-column validation jobs that use this type information run in parallel -- the queue architecture behind that is covered in Queue-powered imports: processing 100K rows in Laravel. For what that validation step surfaces to the user and how inline correction works, see Handling CSV validation errors before they hit your database. The inference engine reduces the number of decisions the user has to make. It does not make decisions for them.
#Further reading
For the complete import pipeline that type inference feeds into, start with The complete guide to CSV imports in Laravel. The mapping step where inferred types become user-confirmed settings is covered in CSV column mapping UX patterns.
Tapix handles type inference, format detection, and the mapping UI out of the box. The 4-step import wizard auto-detects column types from sample values, lets users confirm or override the suggestion, and carries the format configuration through validation and execution. If you are building CSV imports that need to handle real-world data from non-technical users, take a look.