Hello @Dani_S ,
You could use this file ConverterFactoryInputExtensions_refactored.txt as a reference. Please read and make the changes as per your requirements.
Your overall approach is solid. However, I would like to point out a few issues that could cause problems:
- Your current
ClassifyJsonHeaderuses simple substring searches on truncated headers. This might break when JSON is minified or has unusual whitespace. Also, the logic also doesn’t check whether enough bytes have been read to make a reliable determination. - When voting on JSON entries in an archive, tied results simply return false. Since you’re already doing the work to read those entries, implementing a simple tiebreaker (such as alphabetical ordering) would make detection more robust.
- You have format-to-extension mappings in both
_s_extensionToConverterand_s_archiveRequirements. Maintaining two separate dictionaries increases the risk of inconsistencies over time. - Some paths log and return false, others catch and swallow exceptions, and some let errors bubble up. Choose one pattern and apply it consistently throughout.
- The method doesn’t check whether the file actually exists before trying to process it. Also consider what happens with zero‑byte files, files without extensions, or symlinks.
Suggested refactoring
I recommend consolidating your format definitions into a single descriptor class that holds file extensions, archive requirements, and other metadata:
private static readonly Dictionary<string, FormatDescriptor> _s_formats = new Dictionary<string, FormatDescriptor>(StringComparer.OrdinalIgnoreCase)
{
{ "GeoJson", new FormatDescriptor("GeoJson", new[] { ".geojson" }, new[] { ".geojson" }) },
{ "EsriJson", new FormatDescriptor("EsriJson", new[] { ".esrijson" }, new[] { ".esrijson" }) },
{ "GeoJsonSeq", new FormatDescriptor("GeoJsonSeq", new[] { ".jsonl", ".ndjson" }, Array.Empty<string>()) },
{ "TopoJson", new FormatDescriptor("TopoJson", new[] { ".topojson" }, Array.Empty<string>()) },
{ "Kml", new FormatDescriptor("Kml", new[] { ".kml" }, new[] { ".kml" }) },
{ "Kmz", new FormatDescriptor("Kmz", new[] { ".kmz" }, new[] { ".kml" }) }, // archive requirement is inner .kml
{ "Shapefile", new FormatDescriptor("Shapefile", new[] { ".shp" }, new[] { ".shp", ".shx", ".dbf" }) },
{ "Osm", new FormatDescriptor("Osm", new[] { ".osm" }, new[] { ".osm" }) },
{ "Gpx", new FormatDescriptor("Gpx", new[] { ".gpx" }, new[] { ".gpx" }) },
{ "Gml", new FormatDescriptor("Gml", new[] { ".gml" }, new[] { ".gml" }) },
{ "Gdb", new FormatDescriptor("Gdb", new[] { ".gdb" }, new[] { ".gdb" }) },
{ "MapInfoInterchange", new FormatDescriptor("MapInfoInterchange", new[] { ".mif" }, new[] { ".mif" }) },
{ "MapInfoTab", new FormatDescriptor("MapInfoTab", new[] { ".tab", ".map", ".dat", ".id" }, new[] { ".tab", ".dat", ".map", ".id" }) },
{ "Csv", new FormatDescriptor("Csv", new[] { ".csv" }, new[] { ".csv" }) },
{ "GeoPackage", new FormatDescriptor("GeoPackage", new[] { ".gpkg" }, new[] { ".gpkg" }) },
};
private class FormatDescriptor
{
public string Name { get; }
public string[] FileExtensions { get; } // extensions that identify this format for single files
public string[] ArchiveRequirements { get; } // extensions that MUST be present in an archive
public FormatDescriptor(string name, string[] fileExts, string[] archiveReqs)
{
Name = name;
FileExtensions = fileExts ?? Array.Empty<string>();
ArchiveRequirements = archiveReqs ?? Array.Empty<string>();
}
}
Reduce your header read limit from 64 KB to something smaller, such as 8 KB. Your classification logic doesn’t need that much data, and you’ll save memory when processing multiple archive entries.
Instead of returning false on tied votes, you could do something like this:
var maxVotes = votes.Values.Max();
var winners = votes.Where(kv => kv.Value == maxVotes)
.Select(kv => kv.Key)
.OrderBy(k => k) // tiebreaker
.ToArray();
var winner = winners.First();
detectReason = winners.Length > 1
? $"{winner} selected from tie ({string.Join(", ", winners)})"
: $"{winner} won with {maxVotes} votes";
Split the main method into TryDetectArchiveFormat and TryDetectSingleFileFormat to separate concerns and make unit testing easier.
Testing recommendations
You should add tests for:
- Empty or zero‑byte files
- Files without extensions
- Archives with all entries of unknown JSON type
- Archives with tied JSON votes (to verify the tiebreaker)
- Corrupted or truncated JSON files
- Minified JSON (no whitespace)