Nicola Candoni

How StreetGap Processes 1.5 Million Italian Street Names

2026-02-02 · data-engineering, open-data, llm, ai, osm

A technical walkthrough of the data ingestion, classification, and AI-assisted pipeline that powers StreetGap — from raw OpenStreetMap data to gender attribution at scale.

StreetGap analyzes who Italian streets are named after, with a focus on gender representation. The output looks like a map and a few charts. The engineering underneath is a multi-stage classification pipeline that takes raw OpenStreetMap data and transforms it into a structured, confidence-scored dataset of over a million street attributions.

This post describes the technical process: how data comes in, how streets get classified, and where AI fits in (and where it doesn’t).


The data source: OpenStreetMap

All street geometries and names come from OpenStreetMap (OSM). OSM is a global, community-edited geographic database released under the Open Database License (ODbL). For Italy, it’s remarkably complete — every comune, every vicolo, most interiors.

Streets in OSM are represented as way elements with a name tag. Some streets also carry additional tags like name:etymology or name:etymology:wikidata — a structured pointer to the Wikidata entity the street is named after. These are rare but extremely useful: when present, they bypass the entire classification process.

The raw Italian dataset contains roughly 1.5 million named ways. After deduplication (same name, same municipality), the working dataset is smaller but still large.
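The deduplication rule (same name, same municipality) can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the `dedupe_streets` helper, the `(name, municipality)` tuple layout, and the case and whitespace normalization are all assumptions.

```python
def dedupe_streets(ways: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse ways that share a name within the same municipality.

    Each way is a hypothetical (street_name, municipality) pair. The key is
    case- and whitespace-insensitive; the first spelling seen is kept.
    """
    seen: set[tuple[str, str]] = set()
    unique: list[tuple[str, str]] = []
    for name, municipality in ways:
        key = (" ".join(name.split()).casefold(), municipality.casefold())
        if key not in seen:
            seen.add(key)
            unique.append((name, municipality))
    return unique


ways = [
    ("Via Roma", "Udine"),
    ("Via  roma", "Udine"),   # another segment of the same street
    ("Via Roma", "Trieste"),  # same name, different municipality: kept
]
print(dedupe_streets(ways))
```

Keying on the municipality matters: the same name in two comuni counts as two separate dedications.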

Extracting the data is straightforward with osmium and a country extract from Geofabrik:

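# Download Italy extract
wget https://download.geofabrik.de/europe/italy-latest.osm.pbf

# Keep only ways that carry a name tag (referenced nodes are kept for geometry)
osmium tags-filter italy-latest.osm.pbf w/name \
  --output=named-ways.osm.pbf

# Export the named ways to GeoJSON as linestrings
osmium export named-ways.osm.pbf \
  --geometry-types=linestring \
  --output=streets.geojson \
  --output-format=geojson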

From there, the data loads into a PostgreSQL database with PostGIS for spatial queries. Street lengths, administrative boundaries (comune, provincia, regione), and OSM metadata all live in the same schema.


Stage 1: Pre-classification — filtering non-person names

Most street names in Italy are not dedications to people. “Via Roma”, “Corso Italia”, “Via della Pace”, “Via XX Settembre” — these are geographic references, dates, concepts, or generic descriptors. Trying to classify these as person-dedications would generate noise.

The first stage is a rule-based filter that identifies and removes these cases before any expensive processing happens.

The patterns are:

Geographic prefixes: Via del Monte, Via della Valle, Lago di Como, Via Po — these reference places or natural features. A list of known geographic qualifiers (del, della, dei, delle, followed by a capitalized noun) catches most cases.

Date patterns: Via XX Settembre, Piazza IV Novembre, Via 25 Aprile. Italian streets often commemorate historical dates, written as a day (in Roman or Arabic numerals) followed by a month name. A regex catches these:

import re

# Group 1: day of the month, either a Roman numeral (I to XXXI; the pattern
# is slightly permissive above that) or Arabic digits. Group 2: month name.
DATE_PATTERN = re.compile(
    r"\b((?=[IVX])X{0,3}(?:IX|IV|V?I{0,3})|[12]?\d|3[01])\b"
    r"\s+(gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|"
    r"settembre|ottobre|novembre|dicembre)\b",
    re.IGNORECASE
)

Concept vocabulary: “Libertà”, “Pace”, “Unità”, “Costituzione”, “Repubblica” — a curated list of ~200 common concept words used in Italian street names.

Institutional names: “Carabinieri”, “Polizia”, “Vigili del Fuoco”, “Partigiani” — proper nouns referring to groups rather than individuals.

After this stage, roughly 60% of streets are marked as non-person dedications and exit the pipeline. The remaining 40% proceed to person-attribution.


Stage 2: Name resolution

Many Italian streets use only a surname or an abbreviated form: “Via Verdi”, “Via G. Garibaldi”, “Via Dott. Rossi”. Before attributing gender, you need to know who “Verdi” refers to.

This is a lookup problem with ambiguity. “Via Verdi” could be Giuseppe Verdi (composer) or a different Verdi. The resolution strategy is:

  1. Exact Wikidata lookup by full name — if the street already has name:etymology:wikidata, use it directly.
  2. Surname + municipality context — search Wikidata for people with that surname, filtered by Italian relevance. Rank by sitelinks (a proxy for prominence). For “Via Garibaldi”, Giuseppe Garibaldi dominates.
  3. First initial expansion — “G. Garibaldi” narrows the candidate set immediately; names not starting with G are excluded.
  4. Fallback: accept ambiguity — if resolution is uncertain, the street remains unresolved and gets a lower confidence score.
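Before any lookup, the street name has to be split into its parts. A minimal sketch of that step, assuming a hypothetical `parse_dedication` helper and small illustrative lists of street types and honorifics (the real lists are not given in this post):

```python
import re

# Illustrative lists only; a real pipeline would need far more entries.
STREET_TYPES = {"via", "viale", "piazza", "piazzale", "corso", "vicolo", "largo"}
HONORIFICS = {"dott.", "prof.", "ing.", "avv.", "gen.", "don"}
INITIAL = re.compile(r"^([A-Za-z])\.$")  # a single letter plus a period, e.g. "G."


def parse_dedication(street_name: str) -> dict:
    """Split a street name into given name, first initial, and surname."""
    words = street_name.split()
    if words and words[0].lower() in STREET_TYPES:
        words = words[1:]
    words = [w for w in words if w.lower() not in HONORIFICS]

    first_initial = None
    if words:
        match = INITIAL.match(words[0])
        if match:
            first_initial = match.group(1).upper()
            words = words[1:]

    # Naive assumption: the last remaining word is the surname.
    given = " ".join(words[:-1]) or None
    surname = words[-1] if words else None
    return {"given": given, "first_initial": first_initial, "surname": surname}


print(parse_dedication("Via G. Garibaldi"))
print(parse_dedication("Via Dott. Rossi"))
```

Treating the last word as the surname breaks on multi-word surnames such as “De Gasperi”, so a production version would need a list of surname particles.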

The Wikidata SPARQL endpoint handles lookups:

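import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def lookup_wikidata_person(surname: str, first_initial: str | None = None) -> list[dict]:
    """Return up to 10 Wikidata humans with this family name, most sitelinks first."""
    # Escape backslashes and quotes so the surname cannot break out of the literal
    safe = surname.replace("\\", "\\\\").replace('"', '\\"')
    query = """
    SELECT ?item ?itemLabel ?genderLabel ?sitelinks WHERE {
      ?item wdt:P31 wd:Q5 .          # instance of: human
      ?item wdt:P734 ?familyName .   # family name
      ?familyName rdfs:label ?fnLabel .
      # Only Italian-language family-name labels can match the @it literal
      FILTER(LCASE(?fnLabel) = LCASE("%s"@it))
      OPTIONAL { ?item wdt:P21 ?gender . }
      OPTIONAL { ?item wikibase:sitelinks ?sitelinks . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "it,en" . }
    }
    ORDER BY DESC(?sitelinks)
    LIMIT 50
    """ % safe

    response = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "StreetGap/1.0"},
        timeout=60,
    )
    response.raise_for_status()
    candidates = response.json()["results"]["bindings"]

    if first_initial:
        # "G. Garibaldi": keep only people whose label starts with the initial
        initial = first_initial.strip(". ").upper()
        candidates = [
            c for c in candidates
            if c.get("itemLabel", {}).get("value", "").upper().startswith(initial)
        ]
    return candidates[:10]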

The sitelinks ordering is critical — it surfaces the most encyclopedically prominent person with that name, which is almost always who the street refers to.


Stage 3: Gender attribution

With a resolved person (or at least a full name), gender attribution follows a priority chain:

Rule layer (fast, deterministic)

Wikidata P21: If the resolved entity has a sex/gender property (P21), use it. This covers the majority of resolved names.

Linguistic markers: Italian grammatical gender is often embedded in the name itself.

  • Titles: San → male, Santa → female, Re → male, Regina → female
  • Name endings: -a suffix in given names is a strong female signal; -o, -e are weaker male signals
  • Known gendered name lists: curated lists of Italian given names with high-confidence gender assignments

MALE_TITLES = {"san", "beato", "re", "principe", "duca", "frate", "padre", "fra"}
FEMALE_TITLES = {"santa", "beata", "regina", "principessa", "duchessa", "suor", "madre"}

def gender_from_title(name: str) -> str | None:
    # `name` is the dedication without the street type:
    # "San Francesco", not "Via San Francesco"
    words = name.lower().split()
    if not words:
        return None  # empty or whitespace-only input
    first_word = words[0]
    if first_word in MALE_TITLES:
        return "male"
    if first_word in FEMALE_TITLES:
        return "female"
    return None

Given name lookup: A lookup table of ~15,000 Italian given names with gender labels. Built from ISTAT name frequency data and Wikidata extracts. If the first name of the resolved person is in the table, this provides a high-confidence signal.

AI layer (slow, for ambiguous cases)

When the rule layer produces no signal or conflicting signals, two LLMs are queried independently and their outputs compared. If they agree, the attribution is accepted. If they disagree, the case is flagged for manual review.

The prompt is minimal — any unnecessary context inflates cost (see: Italian tokenization overhead):

ATTRIBUTION_PROMPT = """Given an Italian street name, identify the person it commemorates and their sex assigned at birth.

Street name: {street_name}
Municipality: {municipality}

Respond with JSON only:
{{"person": "Full Name", "sex": "male|female|unknown", "confidence": "high|medium|low"}}"""

Two models run in parallel — currently a combination of Claude and GPT-4o. Disagreements surface cases that are genuinely ambiguous (historical figures with unusual names, foreign names, names that changed meaning over time).

The AI layer handles roughly 15% of streets. It’s not the dominant path — the rule layer covers most cases — but it closes a gap that regex cannot.


Stage 4: Confidence scoring

Each attribution gets a confidence score based on how it was produced:

| Method | Confidence |
| --- | --- |
| OSM name:etymology:wikidata tag | Very high |
| Wikidata P21 via resolved name | High |
| Title rule (San/Santa, etc.) | High |
| Given name lookup (unambiguous) | High |
| Given name lookup (ambiguous) | Medium |
| LLM consensus (both models agree) | Medium |
| LLM with disagreement + manual override | High |
| LLM single model, no cross-check | Low |
| Heuristics only | Low |

An attribution is considered approved when:

  • It was manually confirmed, or
  • It passes automated checks with high confidence

The automated checks include:

  • Street name contains both a recognized given name and a surname
  • Resolved person has a Wikidata entry marked as instance of: human
  • Resolved person has an Italian Wikipedia article consistent with the attribution

Streets that don’t pass these checks remain classified but unapproved, and are excluded from aggregate statistics displayed on the site.
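The scoring table and the approval rule can be expressed as a lookup plus a predicate. This is a sketch under assumptions: the method keys, the `Attribution` fields, and the choice to require all three automated checks are illustrative; the published methodology is the authority on the real thresholds.

```python
from dataclasses import dataclass

# Hypothetical method keys mapped to the confidence levels from the table.
METHOD_CONFIDENCE = {
    "etymology_tag": "very_high",
    "wikidata_p21": "high",
    "title_rule": "high",
    "given_name_unambiguous": "high",
    "given_name_ambiguous": "medium",
    "llm_consensus": "medium",
    "llm_manual_override": "high",
    "llm_single": "low",
    "heuristics": "low",
}


@dataclass
class Attribution:
    method: str
    manually_confirmed: bool = False
    has_given_and_surname: bool = False
    is_wikidata_human: bool = False
    has_consistent_itwiki_article: bool = False


def is_approved(a: Attribution) -> bool:
    """Approved if manually confirmed, or high confidence with all checks passing."""
    if a.manually_confirmed:
        return True
    confident = METHOD_CONFIDENCE.get(a.method, "low") in {"high", "very_high"}
    # Assumption: every automated check must pass, not just some of them.
    checks = (
        a.has_given_and_surname
        and a.is_wikidata_human
        and a.has_consistent_itwiki_article
    )
    return confident and checks


auto_ok = Attribution("wikidata_p21", has_given_and_surname=True,
                      is_wikidata_human=True, has_consistent_itwiki_article=True)
llm_only = Attribution("llm_consensus", has_given_and_surname=True,
                       is_wikidata_human=True, has_consistent_itwiki_article=True)
print(is_approved(auto_ok), is_approved(llm_only))
```

Under these assumptions an LLM-consensus attribution is never auto-approved, since its confidence is medium; it needs manual confirmation to enter the aggregate statistics.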


Stage 5: Length computation

Gender representation measured by street count alone is misleading — a short vicolo and a 10 km corso count the same. The pipeline also computes the total length of each classified street from its PostGIS geometry:

SELECT
    name,
    gender,
    ST_Length(ST_Transform(geometry, 32632)) / 1000.0 AS length_km
FROM streets
WHERE gender IS NOT NULL
  AND geometry IS NOT NULL
  AND ST_GeometryType(geometry) = 'ST_LineString';

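EPSG:32632 is UTM zone 32N, a reasonable single projection for Italy: it gives lengths in meters with low distortion over most of the country. Eastern regions such as Puglia fall in zone 33N, so lengths there come out slightly overstated (by less than 1%). Areas (piazze, large squares mapped as polygons) are excluded from length statistics because their perimeter-to-area relationship makes length comparisons unreliable.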
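To see why the two measures can diverge, here is a small sketch that computes each gender's share by count and by length from `(gender, length_km)` rows like those the query returns. The `gender_shares` helper is hypothetical and the rows are toy values chosen for the example, not results from the dataset.

```python
from collections import defaultdict


def gender_shares(rows: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Return each gender's share of streets by count and by total length."""
    counts: dict[str, int] = defaultdict(int)
    lengths: dict[str, float] = defaultdict(float)
    for gender, length_km in rows:
        counts[gender] += 1
        lengths[gender] += length_km

    total_count = sum(counts.values())
    total_length = sum(lengths.values())
    if total_count == 0 or total_length == 0:
        return {}  # nothing to compare
    return {
        gender: {
            "by_count": counts[gender] / total_count,
            "by_length": lengths[gender] / total_length,
        }
        for gender in counts
    }


# Toy data: two long streets versus two short ones.
rows = [("male", 10.0), ("male", 1.0), ("female", 0.5), ("female", 0.5)]
print(gender_shares(rows))
```

In this toy case the split is even by count, while the by-length shares are heavily skewed, which is the distortion the length computation is meant to expose.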


What the pipeline doesn’t solve

Variant forms of the same name: “Via Maria SS. Assunta”, “Via Madonna delle Grazie”, and “Via Beata Vergine” all arguably refer to the same person. Deduplication across these variants is partially addressed by normalization rules, but it’s incomplete. The feminine total is likely undercounted because of this.

Typos and OCR errors: OSM data for smaller municipalities sometimes contains corrupted street names — “Piazza Giuspepe Garbibaldi” is a real example found during development. The LLM layer catches some of these (and the corrected version gets contributed back to OSM), but systematic typo correction is not yet in place.

Groups and institutions: “Via Martiri della Resistenza” doesn’t point to a single person. “Piazza Caduti di tutte le Guerre” refers to unnamed individuals. These are filtered early, but edge cases slip through.

Non-binary and historical gender complexity: The pipeline uses a binary male/female classification because that’s what the available structured data (Wikidata P21, ISTAT) supports at scale. This is a methodological limitation, not a design choice.


Numbers

Running this pipeline on the full Italian OSM extract:

  • ~1.5M named ways ingested
  • ~60% filtered as non-person dedications
  • ~85% of remaining streets classified with at least medium confidence
  • ~70% with approved (high-confidence) attributions
  • LLM API calls: ~200k total across the initial classification run
  • Total LLM cost for initial processing: under €200 (batching + caching help significantly)

The pipeline runs incrementally — OSM diffs are applied weekly and only changed or new streets are re-processed.
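The selection of streets to re-process can be sketched as a comparison between two snapshots. The assumptions here are that streets are keyed by OSM way id and that a change in the name is what triggers re-classification; the `streets_to_reprocess` helper is hypothetical, and the way ids are made up.

```python
def streets_to_reprocess(previous: dict[int, str], current: dict[int, str]) -> set[int]:
    """Return way ids that are new or whose name changed since the last run.

    Both arguments map an OSM way id to its name tag.
    """
    return {
        way_id
        for way_id, name in current.items()
        if previous.get(way_id) != name
    }


previous = {101: "Via Roma", 102: "Via Verdi"}
current = {
    101: "Via Roma",             # unchanged: skipped
    102: "Via Giuseppe Verdi",   # renamed: re-processed
    103: "Via Grazia Deledda",   # new: processed
}
print(sorted(streets_to_reprocess(previous, current)))
```

Ways that disappear from the current snapshot are not returned here; removing their attributions would be a separate cleanup step.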


The boring parts matter most

The rule layer — the deterministic filters, the name lookup tables, the title detection — handles 85% of the work and costs almost nothing to run. Investing in that layer before touching AI was the right call.

The AI layer is useful for closing the long tail. But it’s most effective when the inputs are already well-structured: clean names, resolved surnames, known municipalities. Feeding raw ambiguous strings to an LLM and hoping for the best produces unreliable results. The rule layer earns its keep by making the AI layer’s job tractable.

The full methodology, including all confidence thresholds and approval criteria, is published on streetgap.it.
