GIS - TryCreateForInput

Question


Dani_S 4,836

Hi,

1. I am using a GIS API.

2. There are 15 formats:

EsriJson, GeoJson, GeoJsonSeq, Kml/Kmz, Shapefile, OsmXml, Gpx, Gml, FileGdb, TopoJson, MapInfoInterchange, MapInfoTab, Csv, GeoPackage.

3. This is my main function, TryCreateForInput. It takes gisInputFilePath (the path to a single GIS file or archive to inspect), resolves an IConverter from the IConverterFactory, and reports detectReason, a human-friendly reason describing how the converter was selected (useful for logging).

Since it is critical to my app and it is difficult to handle all cases, could you please look at the code and tell me whether it follows good practice? If not, could you make the fixes? The class does not depend on other classes.

4. The summary of the main function:

    /// <summary>
    /// Inspect the input path (file or archive) and attempt to resolve a converter from the factory.
    /// </summary>
    /// <param name="factory">Factory used to resolve a converter key to an <see cref="IConverter"/> instance.</param>
    /// <param name="gisInputFilePath">Path to a single GIS file or archive to inspect.</param>
    /// <param name="converter">Out parameter populated with the resolved converter when the method returns true.</param>
    /// <param name="detectReason">Human friendly reason describing how the converter was selected (useful for logging).</param>
    /// <returns>True when a converter was resolved; false when detection failed or result was ambiguous.</returns>
    /// <remarks>
    /// Behaviour details
    /// - Returns false and sets <paramref name="detectReason"/> when:
    ///   - The input path is invalid or cannot be inspected.
    ///   - Archive inspection is ambiguous (for example tied JSON votes).
    ///   - No matching converter mapping or requirement rules are found.
    /// - Detection steps (high-level):
    ///   1. If the input looks like an archive (see <see cref="ConverterUtils.IsArchiveFile"/>):
    ///      a. Use <see cref="ConverterUtils.TryListArchiveEntries"/> to obtain entry names (no extraction).
    ///      b. Build a set of discovered extensions / markers and apply fast "wins" (explicit .geojson/.esrijson).
    ///      c. Apply the KMZ guard (outer .kmz or top-level doc.kml => Kmz).
    ///      d. If archive contains only generic .json entries, open each .json entry and perform bounded header reads
    ///         (see <c>ReadEntryHeadUtf8</c>) and classify via <c>ClassifyJsonHeader</c>; then apply majority voting.
    ///      e. Otherwise apply strict requirement matching against <see cref="_s_archiveRequirements"/>.
    ///   2. For single-file inputs:
    ///      a. Use explicit extension mapping for .geojson and .esrijson.
    ///      b. For generic .json files invoke <c>JsonFormatDetector.DetectFromFile</c> (if available) then fall back
    ///         to a bounded header read + <c>ClassifyJsonHeader</c>.
    ///      c. Map the detected JSON format to a converter key (GeoJson / EsriJson / GeoJsonSeq / TopoJson).
    ///
    /// Safety and IO
    /// - This method avoids extracting archive contents. When entry-stream reads are required they are bounded
    ///   to <c>HeaderReadLimit</c> bytes and performed via streaming to minimize memory usage.
    /// - Unexpected exceptions are caught; the method logs details and returns false with a detect reason describing the problem.
    /// </remarks>

5. Attached files:

ConverterFactoryInputExtensions.txt

JsonFormatDetector.txt

ConverterFactoryInputExtensionsTests.txt

Thanks in advance,

1 answer


Answer 1

Hello @Dani_S ,

You could use this file ConverterFactoryInputExtensions_refactored.txt as a reference. Please read and make the changes as per your requirements.

Your overall approach is solid. However, I would like to point out a few issues that could cause problems:

Your current ClassifyJsonHeader uses simple substring searches on truncated headers. This can break when the JSON is minified or has unusual whitespace. The logic also doesn’t check whether enough bytes have been read to make a reliable determination.
When voting on JSON entries in an archive, tied results simply return false. Since you’re already doing the work to read those entries, implementing a simple tiebreaker (such as alphabetical ordering) would make detection more robust.
You have format-to-extension mappings in both _s_extensionToConverter and _s_archiveRequirements. Maintaining two separate dictionaries increases the risk of inconsistencies over time.
Some paths log and return false, others catch and swallow exceptions, and some let errors bubble up. Choose one pattern and apply it consistently throughout.
The method doesn’t check whether the file actually exists before trying to process it. Also consider what happens with zero‑byte files, files without extensions, or symlinks.
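
To illustrate the first point, here is a minimal sketch of a whitespace-independent classifier that tokenizes the header with System.Text.Json instead of searching for substrings. The names JsonKind and JsonHeaderSketch are hypothetical, not part of your code, and the marker set (top-level "type", "spatialReference", "geometryType") is an assumption you should adjust to your data. GeoJsonSeq is deliberately left out.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical result type for this sketch only.
public enum JsonKind { Unknown, GeoJson, EsriJson, TopoJson }

public static class JsonHeaderSketch
{
    // Classifies a possibly truncated UTF-8 JSON header by reading tokens,
    // so minification and unusual whitespace do not change the result.
    public static JsonKind Classify(ReadOnlySpan<byte> header)
    {
        // Utf8JsonReader does not accept a UTF-8 byte order mark; skip it if present.
        if (header.Length >= 3 && header[0] == 0xEF && header[1] == 0xBB && header[2] == 0xBF)
            header = header.Slice(3);
        if (header.IsEmpty) return JsonKind.Unknown;

        // isFinalBlock: false lets Read() return false at the truncation point
        // instead of throwing, which suits a bounded header read.
        var reader = new Utf8JsonReader(header, isFinalBlock: false, state: default);
        try
        {
            while (reader.Read())
            {
                // Only property names of the top-level object are considered.
                if (reader.TokenType != JsonTokenType.PropertyName || reader.CurrentDepth != 1)
                    continue;

                // Assumed EsriJson markers.
                if (reader.ValueTextEquals("spatialReference") || reader.ValueTextEquals("geometryType"))
                    return JsonKind.EsriJson;

                if (reader.ValueTextEquals("type"))
                {
                    if (!reader.Read()) break;                       // value was cut off
                    if (reader.TokenType != JsonTokenType.String) continue;
                    if (reader.ValueTextEquals("Topology")) return JsonKind.TopoJson;
                    if (reader.ValueTextEquals("FeatureCollection") || reader.ValueTextEquals("Feature"))
                        return JsonKind.GeoJson;
                }
            }
        }
        catch (JsonException)
        {
            // Malformed JSON before any decisive marker was seen.
        }
        return JsonKind.Unknown;
    }
}
```

Returning Unknown when the header runs out before a marker appears makes the "not enough bytes" case explicit, so the caller can report it in detectReason rather than guess.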

Suggested refactoring

I recommend consolidating your format definitions into a single descriptor class that holds file extensions, archive requirements, and other metadata:


private static readonly Dictionary<string, FormatDescriptor> _s_formats = new Dictionary<string, FormatDescriptor>(StringComparer.OrdinalIgnoreCase)
{
    { "GeoJson", new FormatDescriptor("GeoJson", new[] { ".geojson" }, new[] { ".geojson" }) },
    { "EsriJson", new FormatDescriptor("EsriJson", new[] { ".esrijson" }, new[] { ".esrijson" }) },
    { "GeoJsonSeq", new FormatDescriptor("GeoJsonSeq", new[] { ".jsonl", ".ndjson" }, Array.Empty<string>()) },
    { "TopoJson", new FormatDescriptor("TopoJson", new[] { ".topojson" }, Array.Empty<string>()) },
    { "Kml", new FormatDescriptor("Kml", new[] { ".kml" }, new[] { ".kml" }) },
    { "Kmz", new FormatDescriptor("Kmz", new[] { ".kmz" }, new[] { ".kml" }) }, // archive requirement is inner .kml
    { "Shapefile", new FormatDescriptor("Shapefile", new[] { ".shp" }, new[] { ".shp", ".shx", ".dbf" }) },
    { "Osm", new FormatDescriptor("Osm", new[] { ".osm" }, new[] { ".osm" }) },
    { "Gpx", new FormatDescriptor("Gpx", new[] { ".gpx" }, new[] { ".gpx" }) },
    { "Gml", new FormatDescriptor("Gml", new[] { ".gml" }, new[] { ".gml" }) },
    { "Gdb", new FormatDescriptor("Gdb", new[] { ".gdb" }, new[] { ".gdb" }) },
    { "MapInfoInterchange", new FormatDescriptor("MapInfoInterchange", new[] { ".mif" }, new[] { ".mif" }) },
    { "MapInfoTab", new FormatDescriptor("MapInfoTab", new[] { ".tab", ".map", ".dat", ".id" }, new[] { ".tab", ".dat", ".map", ".id" }) },
    { "Csv", new FormatDescriptor("Csv", new[] { ".csv" }, new[] { ".csv" }) },
    { "GeoPackage", new FormatDescriptor("GeoPackage", new[] { ".gpkg" }, new[] { ".gpkg" }) },
};

private class FormatDescriptor
{
    public string Name { get; }
    public string[] FileExtensions { get; } // extensions that identify this format for single files
    public string[] ArchiveRequirements { get; } // extensions that MUST be present in an archive
    public FormatDescriptor(string name, string[] fileExts, string[] archiveReqs)
    {
        Name = name;
        FileExtensions = fileExts ?? Array.Empty<string>();
        ArchiveRequirements = archiveReqs ?? Array.Empty<string>();
    }
}
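
As a rough sketch of how both lookups could be derived from the descriptors, so that only one table has to be maintained: the helper names BuildExtensionIndex and MatchArchive are hypothetical, and the descriptor class is repeated here (as a public type) only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Same shape as the descriptor above, repeated so the sketch compiles on its own.
public sealed class FormatDescriptor
{
    public string Name { get; }
    public string[] FileExtensions { get; }
    public string[] ArchiveRequirements { get; }
    public FormatDescriptor(string name, string[] fileExts, string[] archiveReqs)
    {
        Name = name;
        FileExtensions = fileExts ?? Array.Empty<string>();
        ArchiveRequirements = archiveReqs ?? Array.Empty<string>();
    }
}

public static class FormatLookupSketch
{
    // Builds the single-file lookup (extension -> format name) from the descriptors.
    // Fails fast if two formats claim the same extension.
    public static Dictionary<string, string> BuildExtensionIndex(IEnumerable<FormatDescriptor> formats)
    {
        var index = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var format in formats)
        {
            foreach (var ext in format.FileExtensions)
            {
                if (index.ContainsKey(ext))
                    throw new InvalidOperationException(
                        $"Extension {ext} is claimed by both {index[ext]} and {format.Name}.");
                index[ext] = format.Name;
            }
        }
        return index;
    }

    // Returns every format whose required extensions are all present among the entry names.
    public static List<string> MatchArchive(IEnumerable<FormatDescriptor> formats, IEnumerable<string> entryNames)
    {
        var present = new HashSet<string>(
            entryNames.Select(n => Path.GetExtension(n)),
            StringComparer.OrdinalIgnoreCase);

        return formats
            .Where(f => f.ArchiveRequirements.Length > 0
                        && f.ArchiveRequirements.All(r => present.Contains(r)))
            .Select(f => f.Name)
            .ToList();
    }
}
```

MatchArchive can return more than one name. With the table above, an archive containing a .kml entry satisfies both Kml and Kmz, so a guard such as your existing KMZ check still has to decide, or the result should be reported as ambiguous.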

Reduce your header read limit from 64 KB to something smaller, such as 8 KB. Your classification logic doesn’t need that much data, and you’ll save memory when processing multiple archive entries.
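
A bounded read along these lines might look like the sketch below. ReadHead is a hypothetical helper name (your code calls its equivalent ReadEntryHeadUtf8), and the 8 KB default is only the value suggested above.

```csharp
using System;
using System.IO;

public static class HeaderReadSketch
{
    // Reads at most 'limit' bytes from the start of the stream and returns
    // exactly the bytes that were available (fewer if the stream is shorter).
    public static byte[] ReadHead(Stream stream, int limit = 8 * 1024)
    {
        if (stream == null) throw new ArgumentNullException(nameof(stream));
        if (limit <= 0) throw new ArgumentOutOfRangeException(nameof(limit));

        var buffer = new byte[limit];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so loop until
        // the limit is reached or the stream ends (Read returns 0).
        while (total < limit)
        {
            int read = stream.Read(buffer, total, limit - total);
            if (read == 0) break;
            total += read;
        }

        Array.Resize(ref buffer, total);
        return buffer;
    }
}
```

The same helper works for a file stream or for the stream returned by ZipArchiveEntry.Open(), so archive entries never need to be extracted to disk.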

Instead of returning false on tied votes, you could do something like this:

var maxVotes = votes.Values.Max();
var winners = votes.Where(kv => kv.Value == maxVotes)
                   .Select(kv => kv.Key)
                   .OrderBy(k => k)  // tiebreaker
                   .ToArray();

var winner = winners.First();
detectReason = winners.Length > 1 
    ? $"{winner} selected from tie ({string.Join(", ", winners)})"
    : $"{winner} won with {maxVotes} votes";

Split the main method into TryDetectArchiveFormat and TryDetectSingleFileFormat to separate concerns and make unit testing easier.
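
One possible shape for that split, with the path checks done once up front, is sketched below. Everything here is illustrative: DetectionSketch, IsArchive and the tiny extension switch are hypothetical stand-ins for your own ConverterUtils and mapping tables, and the archive branch is only a placeholder.

```csharp
using System;
using System.IO;

public static class DetectionSketch
{
    // Entry point: validates the path once, then dispatches to one of two detectors.
    public static bool TryDetect(string path, out string format, out string reason)
    {
        format = null;
        if (string.IsNullOrWhiteSpace(path)) { reason = "Input path is empty."; return false; }

        try
        {
            var info = new FileInfo(path);   // note: directory inputs (e.g. a .gdb folder) are not handled here
            if (!info.Exists) { reason = "File not found: " + path; return false; }
            if (info.Length == 0) { reason = "File is empty (zero bytes)."; return false; }
            if (info.Extension.Length == 0) { reason = "File has no extension."; return false; }

            if (IsArchive(info.Extension))
                return TryDetectArchiveFormat(path, out format, out reason);
            return TryDetectSingleFileFormat(info.Extension, out format, out reason);
        }
        catch (Exception ex) when (ex is IOException || ex is UnauthorizedAccessException
                                   || ex is ArgumentException || ex is NotSupportedException)
        {
            // One consistent failure pattern: never throw, always explain.
            reason = "Could not inspect input: " + ex.Message;
            return false;
        }
    }

    // Assumption for the sketch: archives are recognised by extension only.
    private static bool IsArchive(string ext) =>
        ext.Equals(".zip", StringComparison.OrdinalIgnoreCase) ||
        ext.Equals(".kmz", StringComparison.OrdinalIgnoreCase);

    // Placeholder: the real version would list entries and apply requirement matching.
    private static bool TryDetectArchiveFormat(string path, out string format, out string reason)
    {
        format = null;
        reason = "Archive detection is not implemented in this sketch.";
        return false;
    }

    // Minimal single-file mapping covering three formats as an example.
    private static bool TryDetectSingleFileFormat(string ext, out string format, out string reason)
    {
        switch (ext.ToLowerInvariant())
        {
            case ".shp": format = "Shapefile"; break;
            case ".csv": format = "Csv"; break;
            case ".gml": format = "Gml"; break;
            default:
                format = null;
                reason = "No mapping for extension " + ext;
                return false;
        }
        reason = "Mapped extension " + ext + " to " + format;
        return true;
    }
}
```

Because each detector has the same Try-pattern signature, the archive and single-file paths can be unit tested independently with small fixture files.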

Testing recommendations

You should add tests for:

Empty or zero‑byte files
Files without extensions
Archives with all entries of unknown JSON type
Archives with tied JSON votes (to verify the tiebreaker)
Corrupted or truncated JSON files
Minified JSON (no whitespace)

Dani_S 4,836 Reputation points

2025-12-12T09:40:32.28+00:00

Dear Michael,

Thank you very much for your help and the code.

I have implemented only these converters so far: Shapefile, Csv, and Gml, so not all checks are ready.

Before I run your tests, I have some questions:

1. GeoJson - this is what you wrote:

{ "GeoJson", new FormatDescriptor("GeoJson", new[] { ".geojson" }, new[] { ".geojson" }) },

GeoJson files can also have a .json extension. Is that handled separately? How can I fix it?

2. EsriJson - this is what you wrote:

{ "EsriJson", new FormatDescriptor("EsriJson", new[] { ".esrijson" }, new[] { ".esrijson" }) },

EsriJson files can also have a .json extension. Is that handled separately? How can I fix it?

3. GeoJsonSeq - this is what you wrote:

{ "GeoJsonSeq", new FormatDescriptor("GeoJsonSeq", new[] { ".jsonl", ".ndjson" }, Array.Empty<string>()) },

GeoJsonSeq uses only the .json extension. How can I fix it?

Does it also support ".jsonl" and ".ndjson"?

4. TopoJson - this is what you wrote:

{ "TopoJson", new FormatDescriptor("TopoJson", new[] { ".topojson" }, Array.Empty<string>()) },

TopoJson uses only the .json extension. How can I fix it?

Does it also support ".topojson"?

5. Shapefile - this is what you wrote:

{ "Shapefile", new FormatDescriptor("Shapefile", new[] { ".shp" }, new[] { ".shp", ".shx", ".dbf" }) },

For a single .shp file, is it the converter's responsibility to report that mandatory files (".shp", ".shx", ".dbf") are missing? If not, how can I fix it?

6. Gdb - A Gdb can be represented as a dataset or as a single table. For a dataset, the entire .gdb folder is treated as a single file. For a single table, .gdbtable and .gdbtablx are packed into an archive file. In the single-table case, is the extension also .gdb? If not, your code does not handle it. Can you please show how to fix it?

If your fixes need a lot of changes, please provide a new modified file.

Thanks in advance,

Your quick answer will be appreciated.
Dani_S 4,836 Reputation points

2025-12-12T11:47:51.9533333+00:00

Dear Michael,

Thank you very much for your help and the code.

1. "For archive input: the archive requirements check will ensure both .gdbtable and .gdbtablx are present before selecting GDB format"

- Do you mean at the converter level?

- Must the archive containing .gdbtable and .gdbtablx have the .gdb extension, or can it have any extension?

2. I added your tests; 2 did not pass.

Can you look at them?

1.

Explicit .esrijson extension maps to EsriJson converter

Source: ConverterFactoryInputExtensionsTests.cs line 142

Duration: 12 ms

Message:

Assert.True() Failure

Expected: True

Actual: False

Stack Trace:

ConverterFactoryInputExtensionsTests.EsriJson_Extension_Mapped() line 149

RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)

MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

2. Explicit .geojson extension maps to GeoJson converter

Source: ConverterFactoryInputExtensionsTests.cs line 121

Duration: 3 ms

Message:

Assert.Contains() Failure: Sub-string not found

String: "detected json format: GeoJson (JsonFormat"···

Not found: "Mapped extension"

Stack Trace:

ConverterFactoryInputExtensionsTests.GeoJson_Extension_Mapped() line 131

RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)

MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

3.The tests:

ConverterFactoryInputExtensionsTests.txt
