-
Notifications
You must be signed in to change notification settings - Fork 10
Add status field and data cleaning for location normalization #1251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
| | **MATCHED** | Matched to a canonical record from reference databases. | | ||
| | **VERIFIED** | Officially validated against authoritative sources. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samuelveigarangel a diferença parece sutil entre MATCHED e VERIFIED. Na prática qual é a diferença, o ganho desta distinção? Para que precisa do status 'CLEANED'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A ideida do MATCHED era ter o pais, estado ou cidade matched com o official. O cleaned era pra setar a localidade que foi realizado tratamento (remoção de caracteres especiais). Mas o correto, é ter apenas se é official ou não. Fazer o tratamento pode dar conflito com outras localidades. A ideia é ter os officiais e tentar criar a localidade a partir dos dados que ja existe no sistema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a canonical data architecture for location models (City, State, Country) by adding a status field to track data processing stages and introducing data cleaning utilities.
Changes:
- Added
statusfield with 5 values (RAW, CLEANED, MATCHED, VERIFIED, REJECTED) to City, State, and Country models - Implemented
clean_data()class methods for each model to remove HTML tags and normalize whitespace - Updated all create/update methods to accept optional
statusparameter - Added comprehensive test suite with 21 new tests covering status field functionality and data cleaning
- Created migration
0004_add_status_field.pywith hardcoded choices for stability - Added detailed documentation in
STATUS_FIELD.md
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| location/choices.py | Defines LOCATION_STATUS tuple with 5 status values for tracking data processing stages |
| location/models.py | Adds status field to City/State/Country models, implements clean_text_data utility and clean_data methods, updates create/get_or_create/create_or_update methods to accept status parameter |
| location/migrations/0004_add_status_field.py | Database migration adding nullable status field with default "RAW" to all three location models |
| location/tests.py | Adds 21 new tests covering default status behavior, status setting, data cleaning (HTML removal, space normalization, None handling), and status updates |
| location/STATUS_FIELD.md | Comprehensive documentation of status field purpose, values, cleaning methods, usage examples, workflow, and future enhancements |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| city.save() | ||
| return city | ||
| except IntegrityError: | ||
| return cls.get(name) |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When an IntegrityError occurs, the existing city is returned via cls.get(name). However, this means that if a status other than "RAW" was requested during creation, it will be silently ignored and the existing city's status will be returned instead. This could lead to unexpected behavior where the caller expects a city with a specific status but gets one with a different status. Consider either updating the status of the existing city or documenting this behavior clearly.
| return cls.get(name) | |
| # A city with this name already exists (likely due to a race condition). | |
| # Retrieve the existing city and, if appropriate, update its status to | |
| # reflect the requested status instead of silently ignoring it. | |
| existing_city = cls.get(name) | |
| if status is not None and status != "" and existing_city.status != status: | |
| # Only "upgrade" status from an initial/blank state to the requested one. | |
| if existing_city.status in (None, "", "RAW"): | |
| existing_city.status = status | |
| existing_city.save(update_fields=["status"]) | |
| return existing_city |
| obj = cls() | ||
| obj.name = name | ||
| obj.acronym = acronym | ||
| obj.status = status | ||
| obj.creator = user | ||
| obj.save() | ||
| return obj |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When an IntegrityError occurs in State.create, the existing state is returned via cls.get(name, acronym). Similar to City.create, if a status other than "RAW" was requested during creation, it will be silently ignored and the existing state's status will be returned instead. This could lead to unexpected behavior where the caller expects a state with a specific status but gets one with a different status.
| choices=LOCATION_STATUS, | ||
| default="RAW", | ||
| blank=True, | ||
| null=True, |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status field is defined with both blank=True and null=True, but also has a default value of "RAW". This is redundant - when a field has a default value, it doesn't need to be nullable since the default will be used. The nullable setting (null=True) allows storing NULL in the database, which conflicts with the intent of always having a status. Consider removing null=True to enforce that all records always have a status value.
| null=True, |
| null=True, | ||
| ) | ||
|
|
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status field is defined with both blank=True and null=True, but also has a default value of "RAW". This is redundant - when a field has a default value, it doesn't need to be nullable since the default will be used. The nullable setting (null=True) allows storing NULL in the database, which conflicts with the intent of always having a status. Consider removing null=True to enforce that all records always have a status value.
| null=True, | |
| ) | |
| ) |
| migrations.AddField( | ||
| model_name='state', | ||
| name='status', | ||
| field=models.CharField( | ||
| blank=True, | ||
| choices=[ | ||
| ('RAW', 'RAW'), | ||
| ('CLEANED', 'CLEANED'), | ||
| ('MATCHED', 'MATCHED'), | ||
| ('VERIFIED', 'VERIFIED'), | ||
| ('REJECTED', 'REJECTED') | ||
| ], | ||
| default='RAW', | ||
| max_length=10, | ||
| null=True, | ||
| verbose_name='Status' | ||
| ), |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.
| choices=LOCATION_STATUS, | ||
| default="RAW", | ||
| blank=True, | ||
| null=True, |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status field is defined with both blank=True and null=True, but also has a default value of "RAW". This is redundant - when a field has a default value, it doesn't need to be nullable since the default will be used. The nullable setting (null=True) allows storing NULL in the database, which conflicts with the intent of always having a status. Consider removing null=True to enforce that all records always have a status value.
| null=True, |
| migrations.AddField( | ||
| model_name='city', | ||
| name='status', | ||
| field=models.CharField( | ||
| blank=True, | ||
| choices=[ | ||
| ('RAW', 'RAW'), | ||
| ('CLEANED', 'CLEANED'), | ||
| ('MATCHED', 'MATCHED'), | ||
| ('VERIFIED', 'VERIFIED'), | ||
| ('REJECTED', 'REJECTED') | ||
| ], | ||
| default='RAW', | ||
| max_length=10, | ||
| null=True, | ||
| verbose_name='Status' | ||
| ), |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.
| migrations.AddField( | ||
| model_name='country', | ||
| name='status', | ||
| field=models.CharField( | ||
| blank=True, | ||
| choices=[ | ||
| ('RAW', 'RAW'), | ||
| ('CLEANED', 'CLEANED'), | ||
| ('MATCHED', 'MATCHED'), | ||
| ('VERIFIED', 'VERIFIED'), | ||
| ('REJECTED', 'REJECTED') | ||
| ], | ||
| default='RAW', | ||
| max_length=10, | ||
| null=True, | ||
| verbose_name='Status' | ||
| ), |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.
| def test_city_get_or_create_with_status(self): | ||
| """Test that get_or_create accepts status parameter""" | ||
| user, created = User.objects.get_or_create(username="adm") | ||
| city = models.City.get_or_create(user=user, name="Test City", status="CLEANED") | ||
| self.assertEqual(city.status, "CLEANED") |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no test coverage for the scenario where City.create() encounters an IntegrityError (when a city with the same name already exists). In this case, the existing city is returned via cls.get(name), which means the status parameter passed to create() is silently ignored. This behavior should be tested to ensure it's documented and expected, or the code should be modified to handle this case differently.
| def test_state_get_or_create_with_status(self): | ||
| """Test that get_or_create accepts status parameter""" | ||
| user, created = User.objects.get_or_create(username="adm") | ||
| state = models.State.get_or_create(user=user, name="Test State", acronym="TS", status="MATCHED") | ||
| self.assertEqual(state.status, "MATCHED") |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no test coverage for the scenario where State.create() encounters an IntegrityError (when a state with the same name and acronym already exists). In this case, the existing state is returned via cls.get(name, acronym), which means the status parameter passed to create() is silently ignored. This behavior should be tested to ensure it's documented and expected.
O que esse PR faz?
Implementa base canônica para normalização de dados de localização (Country, State, City) com rastreamento explícito de status de processamento e limpeza de dados brutos.
Status field com 5 valores:
RAW- dado bruto, sem processamento (default)CLEANED- pré-limpoMATCHED- associado a registro canônicoVERIFIED- validado oficialmenteREJECTED- inválido ou irresolvívelData cleaning via
clean_data()methods:strip_tags)API consistente em todos os modelos:
create(),get_or_create(),create_or_update()aceitam parâmetrostatus.Onde a revisão poderia começar?
location/models.py- função utilityclean_text_data()(linhas ~20-38)location/models.py- status field nos modelos City, State, Countrylocation/tests.py- test classesCityStatusTest,StateStatusTest,CountryStatusTestlocation/migrations/0004_add_status_field.py- migraçãoComo este poderia ser testado manualmente?
Algum cenário de contexto que queira dar?
Backward compatible: campo nullable com default "RAW" - registros existentes recebem o default na migração.
Shared utility:
clean_text_data()elimina duplicação de código entre City, State e Country.Migration safety: hardcoded choices na migração previnem quebra se LOCATION_STATUS mudar no futuro.
21 novos testes adicionados. CodeQL scan: 0 alertas.
Quais são tickets relevantes?
Issue sobre normalização de dados de localização (Country, State, City).
Referências
strip_tags: https://docs.djangoproject.com/en/stable/ref/utils/#django.utils.html.strip_tagsOriginal prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.