Data Quality Assessment of an Airport Database for Argentina
1. Introduction
In the aviation industry, data quality is crucial because accuracy and reliability directly affect safety, operational efficiency, and strategic planning. This report assesses the data quality of a database of airports and heliports in Argentina, focusing on geographic accuracy, completeness, and consistency.
2. Dataset Description
The dataset under evaluation is open-source and contains information on airports and heliports in Argentina. The fields included are:
- Airport name
- IATA/ICAO code
- Geographic latitude and longitude
- Elevation
- Operational status
3. Methodology
To evaluate data quality, the following steps will be taken:
Data Loading and Exploration:
- Load the dataset into the analysis environment.
- Perform an initial exploration to understand the structure and content of the data.
Data Quality Assessment:
- Completeness: Identify and address missing or null values.
- Uniqueness: Check for the presence of duplicate records in the database.
- Accuracy: Verify the correctness of the data through cross-validation with external sources, if possible.
- Consistency: Ensure that the data is coherent throughout the dataset. This includes checking for values that do not follow the expected format or are contradictory.
- Validity: Check that values fall within expected ranges and adhere to predefined formats.
Results Analysis:
- Analyze the data quality metrics obtained and create visualizations to clearly represent the findings.
- Identify patterns and areas that need improvement.
Recommendations:
- Propose solutions to enhance data quality based on identified issues. This may include suggestions for data cleaning, validation processes, and better data management practices.
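The uniqueness and validity checks listed above can be sketched in pandas as follows. This is a minimal illustration on made-up rows, not the report's own code: the column names follow the dataset, while the four-character pattern for `gps_code` is an assumed format and the coordinate limits are simply the ranges any latitude and longitude can take.

```python
import pandas as pd

# Toy rows shaped like the dataset's columns; all values are illustrative only.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "name": ["A Airport", "B Airport", "B Airport", "C Heliport", "D Airport"],
        "latitude_deg": [-34.5, -91.0, -91.0, -52.7, -40.0],
        "longitude_deg": [-58.4, -60.0, -60.0, -68.0, -62.0],
        "gps_code": ["SAEZ", "SA2X", "SA2X", None, "sa1"],
    }
)

# Uniqueness: rows that repeat an earlier row exactly, and repeated ids.
duplicate_rows = int(df.duplicated().sum())
duplicate_ids = int(df["id"].duplicated().sum())

# Validity: coordinates outside the ranges a latitude/longitude can take.
bad_coords = ~df["latitude_deg"].between(-90, 90) | ~df["longitude_deg"].between(-180, 180)

# Validity: assumed format of four uppercase letters or digits.
# Missing codes are a completeness issue, so they are skipped here.
codes = df["gps_code"].dropna()
bad_codes = codes[~codes.str.fullmatch(r"[A-Z0-9]{4}")]

print(duplicate_rows, duplicate_ids, int(bad_coords.sum()), bad_codes.tolist())
```

Keeping each dimension as its own boolean mask or count makes it easy to report the checks separately later.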
1. Data Loading and Exploration
This section gives an overview of the dataset's structure and content: column names, data types, and basic summary statistics.
We start with the first five records, which show how the data is structured and what kind of information each field holds.
| | id | type | name | iso_country | latitude_deg | longitude_deg | elevation_ft | gps_code |
---|---|---|---|---|---|---|---|---|
11695 | 35333 | small_airport | Cullen Airport | AR | -52.885740 | -68.414956 | 132.0 | NaN |
11696 | 35334 | small_airport | Estancia Los Cerros Airport | AR | -54.343000 | -67.837532 | 1914.0 | NaN |
11697 | 35335 | small_airport | Rio Bellavista Airport | AR | -53.982700 | -68.523598 | 201.0 | NaN |
11698 | 35398 | small_airport | Merlo Airport | AR | -32.358200 | -65.017403 | 796.0 | NaN |
11699 | 35399 | small_airport | Bragado Airport | AR | -35.145811 | -60.480294 | 196.0 | SA2X |
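A loading step that yields a table like the one above might look like the sketch below. The report does not name the source file, so the CSV is inlined here and would be a path (for example a hypothetical `airports.csv`) with the real data. The filter on `iso_country` is an inference from the non-contiguous index, and the `UY` row is invented to show the filter working.

```python
import io
import pandas as pd

# Inline stand-in for the source file (the real file name is not given in the report).
csv_text = """id,type,name,iso_country,latitude_deg,longitude_deg,elevation_ft,gps_code
35333,small_airport,Cullen Airport,AR,-52.885740,-68.414956,132,
35399,small_airport,Bragado Airport,AR,-35.145811,-60.480294,196,SA2X
99999,small_airport,Example Field,UY,-34.0,-56.0,50,
"""

airports = pd.read_csv(io.StringIO(csv_text))

# Keep only Argentine records and the eight columns shown in the report.
columns = ["id", "type", "name", "iso_country", "latitude_deg",
           "longitude_deg", "elevation_ft", "gps_code"]
df = airports.loc[airports["iso_country"] == "AR", columns]

print(df.head())      # first records
df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for numeric columns
```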
In the following output, we can observe the basic structure of the DataFrame. It consists of 942 entries and 8 columns. Here's a breakdown of the columns:
- id: A unique identifier for each entry, with no missing values.
- type: The type of airport, represented as a string, with no missing values.
- name: The name of the airport, represented as a string, with no missing values.
- iso_country: The country code in ISO format, with no missing values.
- latitude_deg: The latitude coordinate of the airport, represented as a float, with no missing values.
- longitude_deg: The longitude coordinate of the airport, represented as a float, with no missing values.
- elevation_ft: The elevation of the airport in feet, with some missing values (34 entries).
- gps_code: The GPS code of the airport, which has a significant number of missing values (only 225 non-null entries).
The column data types range from integers and floats to strings (object), and we can immediately see that some columns, such as elevation_ft and gps_code, contain missing data, which will be important to address in further data quality checks.
    <class 'pandas.core.frame.DataFrame'>
    Index: 942 entries, 11695 to 56103
    Data columns (total 8 columns):
     #   Column         Non-Null Count  Dtype
    ---  ------         --------------  -----
     0   id             942 non-null    int64
     1   type           942 non-null    object
     2   name           942 non-null    object
     3   iso_country    942 non-null    object
     4   latitude_deg   942 non-null    float64
     5   longitude_deg  942 non-null    float64
     6   elevation_ft   908 non-null    float64
     7   gps_code       225 non-null    object
    dtypes: float64(3), int64(1), object(4)
    memory usage: 66.2+ KB
In the following summary statistics for the latitude_deg, longitude_deg, and elevation_ft columns, we can assess whether the values are sensible within the context of airports and helipads:
Latitude and Longitude: The latitude values range from approximately -54.84 to -22.12, and the longitude values range from -72.89 to -53.67. These coordinates fall within the geographical boundaries of Argentina, indicating that the recorded locations are plausible.
Elevation: The elevation values range from 6 feet to 13,000 feet. The minimum is plausible for small airports and helipads, but the maximum stands out: it is far above the median (316 feet) and the 75th percentile (903 feet), which suggests a possible data entry error or a non-standard record.
Overall, the statistics indicate that the latitude and longitude values are reasonable, while the elevation values warrant further investigation to confirm their accuracy.
| | latitude_deg | longitude_deg | elevation_ft |
---|---|---|---|
count | 942.000000 | 942.000000 | 908.000000 |
mean | -35.390905 | -62.908536 | 801.696035 |
std | 6.671947 | 4.095665 | 1296.457558 |
min | -54.843300 | -72.885820 | 6.000000 |
25% | -38.005901 | -65.493250 | 131.750000 |
50% | -34.444700 | -62.180665 | 316.000000 |
75% | -31.508499 | -59.450899 | 903.000000 |
max | -22.123510 | -53.673332 | 13000.000000 |
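One way to make "unusually high" concrete is Tukey's interquartile-range rule. This is a technique swapped in for illustration, not something the report itself applies. With the quartiles from the table, the IQR is 903 − 131.75 = 771.25 ft, so the upper fence is 903 + 1.5 × 771.25 = 2059.875 ft. Because elevations are strongly right-skewed, values above the fence are candidates for review rather than confirmed errors. The helper name `upper_fence` and the sample elevations are assumptions.

```python
import pandas as pd

def upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Tukey upper fence: values above q3 + k * IQR are flagged for review."""
    return q3 + k * (q3 - q1)

# Quartiles of elevation_ft as reported in the summary table.
fence = upper_fence(131.75, 903.0)
print(fence)  # 2059.875

# On a DataFrame the same rule would use the column's own quantiles.
# The elevations below are illustrative, not rows from the dataset.
elev = pd.Series([6.0, 132.0, 316.0, 903.0, 13000.0], name="elevation_ft")
q1, q3 = elev.quantile([0.25, 0.75])
flagged = elev[elev > upper_fence(q1, q3)]
print(flagged.tolist())
```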
2. Data Quality Assessment
Data Completeness
We begin the assessment with data completeness. As noted in the dataset overview, only two columns contain missing values: elevation_ft is missing in 34 of 942 records (about 3.6%) and gps_code in 717 of 942 (about 76.1%). All other columns are complete.
    id                 0
    type               0
    name               0
    iso_country        0
    latitude_deg       0
    longitude_deg      0
    elevation_ft      34
    gps_code         717
    dtype: int64
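Counts like these are easier to compare as shares of the total. A minimal sketch, assuming a DataFrame `df` with the same column names; the rows here are illustrative, not the real data:

```python
import pandas as pd

# Illustrative frame with the same two incomplete columns.
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "elevation_ft": [132.0, None, 201.0, 796.0],
        "gps_code": [None, None, None, "SA2X"],
    }
)

# Missing values per column, as a count and as a percentage of all rows.
missing = df.isna().sum()
summary = pd.DataFrame(
    {"missing": missing, "missing_pct": (missing / len(df) * 100).round(1)}
)
print(summary)
```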
Data Accuracy
To ensure the accuracy of the data, I undertook the following steps:
Visualization and Preliminary Check: I plotted the data on various maps to visually inspect its consistency and ensure it aligns with known geographical and contextual information. This visual inspection helps identify any obvious discrepancies or anomalies in the dataset.
Cross-Verification: The next step involves verifying the data with additional sources. This includes comparing the dataset against other reliable datasets or authoritative sources to confirm its accuracy and consistency.
By using these methods, I aim to enhance the reliability of the data and ensure that any potential issues are identified and addressed.
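The cross-verification step could be sketched as a join against a reference table followed by a distance check. Everything about the reference side is hypothetical: the `ref` frame, its column names, its coordinates, and the 1 km tolerance are invented for illustration. Joining on `gps_code` is also an assumption, and since most records lack that code, a real comparison would likely need another key such as the name.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Two records with coordinates as they appear in the dataset.
df = pd.DataFrame({
    "gps_code": ["SAWH", "SAWO"],
    "latitude_deg": [-54.8433, -54.8227],
    "longitude_deg": [-68.2958, -68.3043],
})

# Hypothetical reference table: these coordinates are invented for the sketch.
ref = pd.DataFrame({
    "gps_code": ["SAWH", "SAWO"],
    "ref_lat": [-54.8433, -54.9227],
    "ref_lon": [-68.2958, -68.3043],
})

merged = df.merge(ref, on="gps_code", how="left")
merged["offset_km"] = haversine_km(
    merged["latitude_deg"], merged["longitude_deg"],
    merged["ref_lat"], merged["ref_lon"],
)

# Records whose position differs from the reference by more than the tolerance.
suspect = merged[merged["offset_km"] > 1.0]
print(suspect[["gps_code", "offset_km"]])
```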
Visualization and Preliminary Check
During the preliminary visual inspection, some points appear to be located in the sea. We investigate these anomalies further to confirm whether their coordinates are accurate.
Next, we will examine a list of airports and helipads that appear to be located not on land but over bodies of water. This discrepancy raises questions about the accuracy of their geographical data.
| | index | id | type | name | iso_country | latitude_deg | longitude_deg | elevation_ft | gps_code | geometry | index_right | featurecla | scalerank | min_zoom |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11740 | 38721 | heliport | Aries Heliport | AR | -52.683100 | -68.041900 | NaN | NaN | POINT (-68.0419 -52.6831) | 0.0 | Ocean | 0.0 | 0.0 |
1 | 11800 | 38781 | heliport | Club Nautico San Isidro Heliport | AR | -34.461400 | -58.500300 | 6.0 | NaN | POINT (-58.5003 -34.4614) | 0.0 | Ocean | 0.0 | 0.0 |
2 | 11880 | 38861 | heliport | Heliplataforma Am I Heliport | AR | -52.519200 | -68.385800 | 98.0 | NaN | POINT (-68.3858 -52.5192) | 0.0 | Ocean | 0.0 | 0.0 |
3 | 11881 | 38862 | heliport | Heliplataforma Carina/Total Fina ELF | AR | -52.757200 | -67.219400 | 30.0 | NaN | POINT (-67.2194 -52.7572) | 0.0 | Ocean | 0.0 | 0.0 |
4 | 11882 | 38863 | heliport | Heliplataforma/Am-2 Heliport | AR | -52.548900 | -68.312500 | 134.0 | NaN | POINT (-68.3125 -52.5489) | 0.0 | Ocean | 0.0 | 0.0 |
5 | 11883 | 38864 | heliport | Heliplataforma/Am3 Heliport | AR | -52.522800 | -68.280800 | 134.0 | NaN | POINT (-68.2808 -52.5228) | 0.0 | Ocean | 0.0 | 0.0 |
6 | 11884 | 38865 | heliport | Heliplataforma/Am5 Heliport | AR | -52.570600 | -68.253300 | 98.0 | NaN | POINT (-68.2533 -52.5706) | 0.0 | Ocean | 0.0 | 0.0 |
7 | 11885 | 38866 | heliport | Heliplataforma/Rio Cullen-Hidra Norte Heliport | AR | -52.820600 | -68.219200 | 65.0 | NaN | POINT (-68.2192 -52.8206) | 0.0 | Ocean | 0.0 | 0.0 |
8 | 12168 | 42896 | heliport | Heliplataforma Barcaza Yagana Heliport | AR | -52.522500 | -68.280600 | 49.0 | NaN | POINT (-68.2806 -52.5225) | 0.0 | Ocean | 0.0 | 0.0 |
9 | 12169 | 42897 | heliport | Heliplataforma Buque Skandi Patagonia Heliport | AR | -52.000000 | -67.000000 | NaN | NaN | POINT (-67 -52) | 0.0 | Ocean | 0.0 | 0.0 |
10 | 12170 | 42898 | heliport | Heliplataforma Equipo Modular M-10 Heliport | AR | -52.548900 | -68.311900 | 108.0 | NaN | POINT (-68.3119 -52.5489) | 0.0 | Ocean | 0.0 | 0.0 |
11 | 12185 | 45157 | heliport | Rio Cullen II Heliplatform | AR | -52.836400 | -68.178900 | 127.0 | NaN | POINT (-68.1789 -52.8364) | 0.0 | Ocean | 0.0 | 0.0 |
12 | 12358 | 340027 | heliport | Heliplataforma Oceanic Champion | AR | -52.833333 | -66.416667 | 66.0 | NaN | POINT (-66.41667 -52.83333) | 0.0 | Ocean | 0.0 | 0.0 |
13 | 12359 | 340028 | heliport | Heliplataforma Móvil Ocean Scepter | AR | -46.466667 | -67.433333 | 505.0 | NaN | POINT (-67.43333 -46.46667) | 0.0 | Ocean | 0.0 | 0.0 |
14 | 12361 | 340030 | heliport | Heliplataforma Swiber PJW3000 | AR | -53.076495 | -67.964004 | 118.0 | NaN | POINT (-67.964 -53.0765) | 0.0 | Ocean | 0.0 | 0.0 |
15 | 12403 | 348475 | small_airport | Haras Wassermann Airport | AR | -40.596220 | -62.192030 | 10.0 | NaN | POINT (-62.19203 -40.59622) | 0.0 | Ocean | 0.0 | 0.0 |
16 | 12405 | 348477 | heliport | Puerto Belgrano Naval Base Heliport | AR | -38.888750 | -62.108640 | 7.0 | NaN | POINT (-62.10864 -38.88875) | 0.0 | Ocean | 0.0 | 0.0 |
17 | 12430 | 5837 | small_airport | Ushuaia Aeroclub Airport | AR | -54.822700 | -68.304300 | 19.0 | SAWO | POINT (-68.3043 -54.8227) | 0.0 | Ocean | 0.0 | 0.0 |
18 | 55500 | 16 | small_airport | Isla Martin Garcia Airport | AR | -34.182100 | -58.246900 | 6.0 | SAAK | POINT (-58.2469 -34.1821) | 0.0 | Ocean | 0.0 | 0.0 |
19 | 55525 | 40 | small_airport | Comandante Luis Piedrabuena Airport | AR | -49.995100 | -68.953100 | 78.0 | SA33 | POINT (-68.9531 -49.9951) | 0.0 | Ocean | 0.0 | 0.0 |
20 | 55671 | 5835 | medium_airport | Malvinas Argentinas Airport | AR | -54.843300 | -68.295800 | 102.0 | SAWH | POINT (-68.2958 -54.8433) | 0.0 | Ocean | 0.0 | 0.0 |
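The `geometry` and `index_right` columns suggest the list came from a spatial join between the airport points and an ocean polygon layer. That is an inference, as is the guess that `featurecla`, `scalerank`, and `min_zoom` come from a Natural Earth ocean layer. A minimal geopandas sketch of such a join follows, with a crude rectangle standing in for the real ocean polygons. The `predicate` argument assumes a recent geopandas release (older versions used `op`).

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Two records with coordinates as they appear in the tables above.
df = pd.DataFrame({
    "name": ["Merlo Airport", "Aries Heliport"],
    "latitude_deg": [-32.3582, -52.6831],
    "longitude_deg": [-65.017403, -68.0419],
})
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude_deg"], df["latitude_deg"]),  # x = lon, y = lat
    crs="EPSG:4326",
)

# Stand-in water polygon: a rough rectangle, NOT a real coastline.
# With real data this would be an ocean layer loaded with gpd.read_file(...).
water = gpd.GeoDataFrame(
    {"featurecla": ["Ocean"]},
    geometry=[box(-68.5, -53.5, -66.0, -52.0)],
    crs="EPSG:4326",
)

# The inner join keeps only the points that fall inside a water polygon.
in_water = gpd.sjoin(points, water, how="inner", predicate="within")
print(in_water[["name", "featurecla"]])
```

The result depends on how detailed the coastline polygons are, so a coarse layer can flag coastal facilities that are in fact on land. Names such as "Heliplataforma" also suggest offshore platforms, for which a position at sea may be correct. The output is best read as a list of candidates for review.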
4. Conclusions
Summarize the key findings and highlight critical areas needing improvement. Also, discuss the impact of data quality on the use of the database for real-world applications.
5. Future Improvements
Propose ideas for future analyses or improvements in the data quality process, including potential approaches for additional data or advanced techniques that could be applied.
EPSG:4326
GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433],AXIS["Longitude",EAST],AXIS["Latitude",NORTH]]