136 changes: 136 additions & 0 deletions location/STATUS_FIELD.md
@@ -0,0 +1,136 @@
# Location Data Normalization - Status Field

## Overview

The location models (Country, State, City) now include a `status` field that records how far each record has progressed through processing. Making that state explicit supports a canonical-data workflow: every record shows whether it is still raw input or has been matched and validated.

## Status Values

The `status` field can have one of the following values:

| Status | Description |
|--------|-------------|
| **RAW** | Raw data, no processing. Default value for new records. |
| **CLEANED** | Pre-cleaned data. HTML removed, spaces normalized. |
| **MATCHED** | Matched to a canonical record from reference databases. |
| **VERIFIED** | Officially validated against authoritative sources. |
Comment on lines +15 to +16

Member

@samuelveigarangel the difference between MATCHED and VERIFIED seems subtle. In practice, what is the difference, and what do we gain from this distinction? Why is the 'CLEANED' status needed?

Collaborator

The idea behind MATCHED was to have the country, state, or city matched against the official record. CLEANED was meant to flag a location that had gone through cleanup (removal of special characters). But the correct approach is to track only whether a record is official or not. Applying the cleanup can create conflicts with other locations. The idea is to have the official records and try to create the location from the data that already exists in the system.

| **REJECTED** | Invalid or unresolvable data that cannot be matched. |

## Data Cleaning

Each model now includes a `clean_data()` class method for pre-cleaning operations:

### City.clean_data(name)
Removes HTML tags and normalizes spaces in city names.

```python
cleaned_name = City.clean_data("<p>São Paulo</p>")
# Returns: "São Paulo"
```

### State.clean_data(name, acronym)
Removes HTML tags and normalizes spaces in state names and acronyms.

```python
cleaned_name, cleaned_acronym = State.clean_data("<b>São Paulo</b>", "<i>SP</i>")
# Returns: ("São Paulo", "SP")
```

### Country.clean_data(name, acronym, acron3)
Removes HTML tags and normalizes spaces in country names and acronyms.

```python
cleaned_name, cleaned_acronym, cleaned_acron3 = Country.clean_data(
    "<strong>Brazil</strong>",
    "<em>BR</em>",
    "<span>BRA</span>",
)
# Returns: ("Brazil", "BR", "BRA")
```
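
The `clean_data()` implementations themselves are not shown in this document. As a rough sketch of the behavior described above, assuming a regex-based approach and hypothetical helper names (`clean_text`, `clean_state`), the cleaning step might look like this:

```python
import re

# Hypothetical helpers; the real clean_data() methods may be implemented differently.
_TAG_RE = re.compile(r"<[^>]+>")
_SPACE_RE = re.compile(r"\s+")


def clean_text(value):
    """Strip HTML tags and collapse runs of whitespace; pass None through."""
    if value is None:
        return None
    # Replace each tag with a space so adjacent words do not get glued together.
    without_tags = _TAG_RE.sub(" ", value)
    return _SPACE_RE.sub(" ", without_tags).strip()


def clean_state(name, acronym):
    """Mirror the State.clean_data(name, acronym) signature shown above."""
    return clean_text(name), clean_text(acronym)


print(clean_text("<p>São  Paulo</p>"))                # São Paulo
print(clean_state("<b>São Paulo</b>", "<i>SP</i>"))   # ('São Paulo', 'SP')
```

A regex is enough for simple, well-formed tags like the ones in these examples; malformed markup would likely need a real HTML parser.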

## Usage Example

### Creating records with status

```python
from django.contrib.auth import get_user_model

from location.models import City  # assumed import path for the location app

User = get_user_model()
user = User.objects.first()

# Create a city with RAW status (default)
city = City.create(user=user, name="São Paulo")
# city.status == "RAW"

# Create a city with VERIFIED status
verified_city = City.create(user=user, name="Rio de Janeiro", status="VERIFIED")
# verified_city.status == "VERIFIED"
```

### Cleaning data before creation

```python
# Dirty data from external source
dirty_name = "<p>São Paulo City</p>"

# Clean the data
cleaned_name = City.clean_data(dirty_name)
# cleaned_name == "São Paulo City"

# Create with CLEANED status
city = City.create(user=user, name=cleaned_name, status="CLEANED")
```

## Workflow

The typical workflow for location data is:

1. **RAW** → Data is initially created/imported in raw form
2. **CLEANED** → HTML is removed, spaces normalized
3. **MATCHED** → Data is matched to canonical reference (e.g., GeoNames)
4. **VERIFIED** → Data is validated against authoritative source
5. **REJECTED** → Alternative outcome for data that cannot be matched or verified (not a step that follows VERIFIED)
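
The workflow above lists the statuses but does not define which moves between them are legal. The sketch below shows one way to encode it, assuming a strictly linear pipeline in which any non-final status may also move to REJECTED; the `ALLOWED_TRANSITIONS` table and both function names are illustrative and not part of this change.

```python
# Assumed transition rules; the document does not specify them.
ALLOWED_TRANSITIONS = {
    "RAW": {"CLEANED", "REJECTED"},
    "CLEANED": {"MATCHED", "REJECTED"},
    "MATCHED": {"VERIFIED", "REJECTED"},
    "VERIFIED": set(),   # treated as final in this sketch
    "REJECTED": set(),   # treated as final in this sketch
}


def can_transition(current, target):
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def advance(current, target):
    """Return the new status, or raise ValueError on an illegal move."""
    if not can_transition(current, target):
        raise ValueError(f"Cannot move from {current} to {target}")
    return target


print(advance("RAW", "CLEANED"))          # CLEANED
print(can_transition("RAW", "VERIFIED"))  # False
```

Keeping the rules in a single table makes them easy to review and to change if, for example, the CLEANED step is dropped.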

## Database Migration

The status field was added via migration `0004_add_status_field.py`:
- Adds nullable `status` field to City, State, and Country models
- Default value is "RAW"
- Max length: 10 characters
- Choices: RAW, CLEANED, MATCHED, VERIFIED, REJECTED
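
A quick sanity check on these settings: every choice value has to fit within `max_length`. The snippet below copies the values from `location/choices.py` and the limit from the migration; the function name `longest_status_length` is only for illustration.

```python
# Values copied from location/choices.py in this change.
LOCATION_STATUS = (
    ("RAW", "RAW"),
    ("CLEANED", "CLEANED"),
    ("MATCHED", "MATCHED"),
    ("VERIFIED", "VERIFIED"),
    ("REJECTED", "REJECTED"),
)
MAX_LENGTH = 10  # max_length used by the migration


def longest_status_length(choices):
    """Return the length of the longest stored value among the choices."""
    return max(len(value) for value, _label in choices)


longest = longest_status_length(LOCATION_STATUS)
print(longest, longest <= MAX_LENGTH)  # 8 True
```

The longest values (`VERIFIED`, `REJECTED`) use 8 of the 10 available characters, so there is little room for longer status names without another migration.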

## Testing

Tests have been added to verify that:
- Default status is RAW
- Status can be set to any valid value
- `clean_data()` methods remove HTML
- `clean_data()` methods normalize spaces
- `clean_data()` methods handle None values

Run tests with:
```bash
python manage.py test location
```

## Reference Data Sources

The canonical location data should be sourced from:

1. **Countries States Cities Database**
   - GitHub: https://github.com/dr5hn/countries-states-cities-database
   - Comprehensive database of countries, states, and cities

2. **GeoNames**
   - Website: https://www.geonames.org/
   - Geographical database covering all countries
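
How records get matched against these sources is not specified here. As a minimal sketch, assuming matching on an accent-insensitive, case-insensitive key (the names `normalize_key`, `match_canonical`, and `CANONICAL_CITIES` are hypothetical), a lookup could work like this:

```python
import unicodedata


def normalize_key(name):
    """Build a comparison key: strip accents, collapse spaces, ignore case."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.split()).casefold()


def match_canonical(name, canonical_names):
    """Return the canonical spelling for `name`, or None if there is no match."""
    index = {normalize_key(c): c for c in canonical_names}
    return index.get(normalize_key(name))


# Hypothetical sample; real data would come from one of the sources above.
CANONICAL_CITIES = ["São Paulo", "Rio de Janeiro", "Brasília"]

print(match_canonical("sao  paulo", CANONICAL_CITIES))  # São Paulo
print(match_canonical("Atlantis", CANONICAL_CITIES))    # None
```

Normalized keys can collide, for instance when two cities in different states share a name, so a real matcher would probably need to scope the lookup by state and country.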

## Future Enhancements

Potential improvements for the location normalization system:

1. Add methods to transition between states
2. Implement automatic matching against reference databases
3. Add validation rules for each status transition
4. Create admin views to bulk-update status
5. Add logging/audit trail for status changes
9 changes: 9 additions & 0 deletions location/choices.py
@@ -6,3 +6,12 @@
    ("Sudeste", "Sudeste"),
    ("Sul", "Sul"),
)

# Processing status for canonical location data
LOCATION_STATUS = (
    ("RAW", "RAW"),  # Raw data, no processing
    ("CLEANED", "CLEANED"),  # Pre-cleaned data
    ("MATCHED", "MATCHED"),  # Matched to canonical record
    ("VERIFIED", "VERIFIED"),  # Officially validated
    ("REJECTED", "REJECTED"),  # Invalid or unresolvable
)
65 changes: 65 additions & 0 deletions location/migrations/0004_add_status_field.py
@@ -0,0 +1,65 @@
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('location', '0003_alter_city_options_alter_country_options_and_more'),
    ]

    operations = [
        migrations.AddField(
            model_name='city',
            name='status',
            field=models.CharField(
                blank=True,
                choices=[
                    ('RAW', 'RAW'),
                    ('CLEANED', 'CLEANED'),
                    ('MATCHED', 'MATCHED'),
                    ('VERIFIED', 'VERIFIED'),
                    ('REJECTED', 'REJECTED')
                ],
                default='RAW',
                max_length=10,
                null=True,
                verbose_name='Status'
            ),
Comment on lines +11 to +27

Copilot AI Jan 30, 2026

The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.

        ),
        migrations.AddField(
            model_name='state',
            name='status',
            field=models.CharField(
                blank=True,
                choices=[
                    ('RAW', 'RAW'),
                    ('CLEANED', 'CLEANED'),
                    ('MATCHED', 'MATCHED'),
                    ('VERIFIED', 'VERIFIED'),
                    ('REJECTED', 'REJECTED')
                ],
                default='RAW',
                max_length=10,
                null=True,
                verbose_name='Status'
            ),
Comment on lines +29 to +45

Copilot AI Jan 30, 2026

The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.

        ),
        migrations.AddField(
            model_name='country',
            name='status',
            field=models.CharField(
                blank=True,
                choices=[
                    ('RAW', 'RAW'),
                    ('CLEANED', 'CLEANED'),
                    ('MATCHED', 'MATCHED'),
                    ('VERIFIED', 'VERIFIED'),
                    ('REJECTED', 'REJECTED')
                ],
                default='RAW',
                max_length=10,
                null=True,
                verbose_name='Status'
            ),
Comment on lines +47 to +63

Copilot AI Jan 30, 2026

The migration defines the status field as both nullable (null=True) and having a default value ('RAW'). This is redundant - when a field has a default, it doesn't need to be nullable. During migration, existing records will get the default 'RAW', but the nullable setting allows future records to have NULL status, which conflicts with the intent of always tracking status. Consider making the field non-nullable (null=False) to enforce data integrity.

        ),
    ]