Skip to content

Conversation

Copy link

Copilot AI commented Dec 19, 2025

O que esse PR faz?

Implementa base canônica para normalização de dados de localização (Country, State, City) com rastreamento explícito de status de processamento e limpeza de dados brutos.

Status field com 5 valores:

  • RAW - dado bruto, sem processamento (default)
  • CLEANED - pré-limpo
  • MATCHED - associado a registro canônico
  • VERIFIED - validado oficialmente
  • REJECTED - inválido ou irresolvível

Data cleaning via clean_data() methods:

  • Remove HTML tags (Django strip_tags)
  • Normaliza whitespace
  • Retorna None para valores None
# Uso básico
cleaned = City.clean_data("<p>São Paulo</p>")  # Returns: "São Paulo"

city = City.create(user, name="São Paulo", status="CLEANED")
state = State.create_or_update(user, name="SP", acronym="SP", status="VERIFIED")

API consistente em todos os modelos: create(), get_or_create(), create_or_update() aceitam parâmetro status.

Onde a revisão poderia começar?

  1. location/models.py - função utility clean_text_data() (linhas ~20-38)
  2. location/models.py - status field nos modelos City, State, Country
  3. location/tests.py - test classes CityStatusTest, StateStatusTest, CountryStatusTest
  4. location/migrations/0004_add_status_field.py - migração

Como este poderia ser testado manualmente?

from django.contrib.auth import get_user_model
from location.models import City, State, Country

user = get_user_model().objects.first()

# Teste 1: HTML cleaning
dirty = "<p>Test City</p>"
clean = City.clean_data(dirty)  # Retorna "Test City"

# Teste 2: Status field
city = City.create(user=user, name="Test", status="CLEANED")
assert city.status == "CLEANED"

# Teste 3: Status update
state = State.create_or_update(user=user, name="Test", status="RAW")
state = State.create_or_update(user=user, name="Test", status="VERIFIED")
assert state.status == "VERIFIED"

Algum cenário de contexto que queira dar?

Backward compatible: campo nullable com default "RAW" - registros existentes recebem o default na migração.

Shared utility: clean_text_data() elimina duplicação de código entre City, State e Country.

Migration safety: hardcoded choices na migração previnem quebra se LOCATION_STATUS mudar no futuro.

21 novos testes adicionados. CodeQL scan: 0 alertas.

Quais são tickets relevantes?

Issue sobre normalização de dados de localização (Country, State, City).

Referências

Original prompt

This section details on the original issue you should resolve

<issue_title>Normalização dos dados de Localização</issue_title>
<issue_description>## 🎯 Nova funcionalidade: Base Canônica de Localidades (Country, State, City)

Objetivo

Criar uma base canônica, verificada e referenciável de localidades (Country, State, City), garantindo consistência, deduplicação e confiabilidade dos dados usados pelos modelos de negócio.


🏗️ Arquitetura de Dados (Ajuste fino)

1. Dados Canônicos (Reference Data)

Dados oficiais e confiáveis.

  • Country
  • State
  • City

Fontes:


2. Dados Sujos (Raw)

Exemplos:

  • nomes incorretos
  • HTML embutido
  • dados incompletos

3. Dados de Negócio

Entidades como:

  • Instituições
  • Organizações
  • Autores
  • Outros modelos correlacionados

📌 Regra:
Esses dados só podem apontar para localidades verificadas (canônicas).


🔖 Usar Status explícito de processamento nos modelos de localidade

RAW       -> dado bruto, sem tratamento
CLEANED   -> dado pré-limpo
MATCHED   -> associado a um registro canônico
VERIFIED  -> validado oficialmente
REJECTED  -> inválido ou impossível de resolver

🌍 Country

Etapa 1 - Pré-limpeza

Antes de qualquer matching:

  • remover HTML (<i>, <b>, etc)
  • remover e-mails
  • remover strings numéricas

normalizar: lower trim unaccent

Etapa 2 – Deduplicação interna

Objetivo: eliminar variações semânticas e erros comuns antes do matching externo.

Exemplo de entradas:

Brazil
Brasil
Brazi1
Brazil.

Resultado normalizado:
brazil

Etapa 3 - matching com GeoNames

  • match por nome
  • match por iso
  • fuzzy como fallback
    ```cleaned_country -> country_iso2``

Etapa 4 - Tradução

  1. Armazene apenas inglês
  2. Traduza apenas na camada de apresentação
  3. Autocomplete: traduz input -> inglês -> busca -> exibe -> idioma do usuário.

🗺️ State

❗ Um estado nunca pode ser resolvido sem país

Etapa 1- Pré-limpeza

Antes de qualquer matching:

  • remover HTML (<'i>, <'b>, etc)
  • remover emails
  • remover strings númericas
  • normalizar: lower + trim + unaccent?

Etapa 2 - Agrupamento por nome limpo

<i>Córdoba<
Cordoba
CÓRDOBA

vira:

cordoba

Etapa 3 - Seleção do registro "vencedor"

  • Durante a limpeza do nome, optar pelo dado mais completo (País resolvido + acronimo válido)
    4.1 NÃO confie só no nome do estado
    Ex:
    Córdoba (Argentina)
    Córdoba (Espanha)
    San José (vários países)
  • Nunca criar estado sem país

City

Etapa 1- Pré-limpeza

Antes de qualquer matching:

  • remover HTML (<'i>, <'b>, etc)
  • remover emails
  • remover strings númericas
  • normalizar: lower + trim + unaccent?

Criação de novos dados (prevenção definitiva)

Regra na criação de organização ou outro modelo que correlaciona locations:

  • Atribuir apenas localização completa.
  • Autocomplete só com verified
  • Não permitir criação de localidades manualmente. Solicitar?</issue_description>

Comments on the Issue (you are @copilot in this section)


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 6 commits December 19, 2025 13:53
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Copilot AI changed the title [WIP] Add canonical base for location data consistency Add status field and data cleaning for location normalization Dec 19, 2025
Comment on lines +15 to +16
| **MATCHED** | Matched to a canonical record from reference databases. |
| **VERIFIED** | Officially validated against authoritative sources. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samuelveigarangel a diferença parece sutil entre MATCHED e VERIFIED. Na prática qual é a diferença, o ganho desta distinção? Para que precisa do status 'CLEANED'?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A ideida do MATCHED era ter o pais, estado ou cidade matched com o official. O cleaned era pra setar a localidade que foi realizado tratamento (remoção de caracteres especiais). Mas o correto, é ter apenas se é official ou não. Fazer o tratamento pode dar conflito com outras localidades. A ideia é ter os officiais e tentar criar a localidade a partir dos dados que ja existe no sistema.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a canonical data architecture for location models (City, State, Country) by adding a status field to track data processing stages and introducing data cleaning utilities.

Changes:

  • Added status field with 5 values (RAW, CLEANED, MATCHED, VERIFIED, REJECTED) to City, State, and Country models
  • Implemented clean_data() class methods for each model to remove HTML tags and normalize whitespace
  • Updated all create/update methods to accept optional status parameter
  • Added comprehensive test suite with 21 new tests covering status field functionality and data cleaning
  • Created migration 0004_add_status_field.py with hardcoded choices for stability
  • Added detailed documentation in STATUS_FIELD.md

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
location/choices.py Defines LOCATION_STATUS tuple with 5 status values for tracking data processing stages
location/models.py Adds status field to City/State/Country models, implements clean_text_data utility and clean_data methods, updates create/get_or_create/create_or_update methods to accept status parameter
location/migrations/0004_add_status_field.py Database migration adding nullable status field with default "RAW" to all three location models
location/tests.py Adds 21 new tests covering default status behavior, status setting, data cleaning (HTML removal, space normalization, None handling), and status updates
location/STATUS_FIELD.md Comprehensive documentation of status field purpose, values, cleaning methods, usage examples, workflow, and future enhancements

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

city.save()
return city
except IntegrityError:
return cls.get(name)
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an IntegrityError occurs, the existing city is returned via cls.get(name). However, this means that if a status other than "RAW" was requested during creation, it will be silently ignored and the existing city's status will be returned instead. This could lead to unexpected behavior where the caller expects a city with a specific status but gets one with a different status. Consider either updating the status of the existing city or documenting this behavior clearly.

Suggested change
return cls.get(name)
# A city with this name already exists (likely due to a race condition).
# Retrieve the existing city and, if appropriate, update its status to
# reflect the requested status instead of silently ignoring it.
existing_city = cls.get(name)
if status is not None and status != "" and existing_city.status != status:
# Only "upgrade" status from an initial/blank state to the requested one.
if existing_city.status in (None, "", "RAW"):
existing_city.status = status
existing_city.save(update_fields=["status"])
return existing_city

Copilot uses AI. Check for mistakes.
Comment on lines 237 to 243
obj = cls()
obj.name = name
obj.acronym = acronym
obj.status = status
obj.creator = user
obj.save()
return obj
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an IntegrityError occurs in State.create, the existing state is returned via cls.get(name, acronym). Similar to City.create, if a status other than "RAW" was requested during creation, it will be silently ignored and the existing state's status will be returned instead. This could lead to unexpected behavior where the caller expects a state with a specific status but gets one with a different status.

Copilot uses AI. Check for mistakes.
choices=LOCATION_STATUS,
default="RAW",
blank=True,
null=True,
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status field is defined with both blank=True and null=True, but also has a default value of "RAW". This is redundant - when a field has a default value, it doesn't need to be nullable since the default will be used. The nullable setting (null=True) allows storing NULL in the database, which conflicts with the intent of always having a status. Consider removing null=True to enforce that all records always have a status value.

Suggested change
null=True,

Copilot uses AI. Check for mistakes.
Comment on lines +413 to 415
null=True,
)

Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status field is defined with both blank=True and null=True, but also has a default value of "RAW". This is redundant - when a field has a default value, it doesn't need to be nullable since the default will be used. The nullable setting (null=True) allows storing NULL in the database, which conflicts with the intent of always having a status. Consider removing null=True to enforce that all records always have a status value.

Suggested change
null=True,
)
)

Copilot uses AI. Check for mistakes.
Comment on lines +29 to +45
migrations.AddField(
model_name='state',
name='status',
field=models.CharField(
blank=True,
choices=[
('RAW', 'RAW'),
('CLEANED', 'CLEANED'),
('MATCHED', 'MATCHED'),
('VERIFIED', 'VERIFIED'),
('REJECTED', 'REJECTED')
],
default='RAW',
max_length=10,
null=True,
verbose_name='Status'
),
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.

Copilot uses AI. Check for mistakes.
choices=LOCATION_STATUS,
default="RAW",
blank=True,
null=True,
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status field is defined with both blank=True and null=True, but also has a default value of "RAW". This is redundant - when a field has a default value, it doesn't need to be nullable since the default will be used. The nullable setting (null=True) allows storing NULL in the database, which conflicts with the intent of always having a status. Consider removing null=True to enforce that all records always have a status value.

Suggested change
null=True,

Copilot uses AI. Check for mistakes.
Comment on lines +11 to +27
migrations.AddField(
model_name='city',
name='status',
field=models.CharField(
blank=True,
choices=[
('RAW', 'RAW'),
('CLEANED', 'CLEANED'),
('MATCHED', 'MATCHED'),
('VERIFIED', 'VERIFIED'),
('REJECTED', 'REJECTED')
],
default='RAW',
max_length=10,
null=True,
verbose_name='Status'
),
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.

Copilot uses AI. Check for mistakes.
Comment on lines +47 to +63
migrations.AddField(
model_name='country',
name='status',
field=models.CharField(
blank=True,
choices=[
('RAW', 'RAW'),
('CLEANED', 'CLEANED'),
('MATCHED', 'MATCHED'),
('VERIFIED', 'VERIFIED'),
('REJECTED', 'REJECTED')
],
default='RAW',
max_length=10,
null=True,
verbose_name='Status'
),
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.

Copilot uses AI. Check for mistakes.
Comment on lines +242 to +246
def test_city_get_or_create_with_status(self):
"""Test that get_or_create accepts status parameter"""
user, created = User.objects.get_or_create(username="adm")
city = models.City.get_or_create(user=user, name="Test City", status="CLEANED")
self.assertEqual(city.status, "CLEANED")
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test coverage for the scenario where City.create() encounters an IntegrityError (when a city with the same name already exists). In this case, the existing city is returned via cls.get(name), which means the status parameter passed to create() is silently ignored. This behavior should be tested to ensure it's documented and expected, or the code should be modified to handle this case differently.

Copilot uses AI. Check for mistakes.
Comment on lines +299 to +303
def test_state_get_or_create_with_status(self):
"""Test that get_or_create accepts status parameter"""
user, created = User.objects.get_or_create(username="adm")
state = models.State.get_or_create(user=user, name="Test State", acronym="TS", status="MATCHED")
self.assertEqual(state.status, "MATCHED")
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test coverage for the scenario where State.create() encounters an IntegrityError (when a state with the same name and acronym already exists). In this case, the existing state is returned via cls.get(name, acronym), which means the status parameter passed to create() is silently ignored. This behavior should be tested to ensure it's documented and expected.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Normalização dos dados de Localização

3 participants