[JOSS REVIEW] Cannot run P.cleanit() #8
Thanks @chrisleaman for reporting this. I still need to figure out this error and why I get duplicates, but I think it is related to the fact that some `labels_k` are present in the polygons used to correct the data but not in the actual profile data.
Ok, I found out why some points got duplicated when the number of k was 10 for all the surveys to be clustered with KMeans. It was a tricky one, but luckily you spotted it.

Label correction polygons can overlap, because they only operate on a survey basis (raw date and location) and only target those points within their boundaries that have the specified target `label_k`. What happened was that I digitised label correction polygons by looking at their … When using the correct number of k in each survey, the point above is still there in the overlap area, but its `label_k` is not 6, therefore these polygons do not operate on it and no error arises.

Here is what I did to correct this error. This is the function that takes care of it and checks the aforementioned rule:

```python
from itertools import combinations as comb

import pandas as pd
import geopandas as gpd
from geopandas import overlay
# coords_to_points is a sandpyper helper that converts stored coordinates
# into shapely Point geometries


def check_overlaps_poly_label(label_corrections, profiles, crs):
    """
    Check whether overlapping areas of label correction polygons targeting the same
    label_k in the same survey, but assigning different new classes, contain points
    that would be affected by those polygons.

    Args:
        label_corrections (gpd.GeoDataFrame): GeoDataFrame of the label correction polygons.
        profiles (gpd.GeoDataFrame): GeoDataFrame of the extracted elevation profiles.
        crs (dict, int): Either an EPSG code (int) or a dictionary. If a dictionary, it
            must store location codes as keys and CRS information as values, in
            dictionary form (example: {'init': 'epsg:4326'}).
    """
    for loc in label_corrections.location.unique():
        for raw_date in label_corrections.query(f"location=='{loc}'").raw_date.unique():
            for target_label_k in label_corrections.query(
                    f"location=='{loc}' and raw_date=={raw_date}").target_label_k.unique():

                date_labelk_subset = label_corrections.query(
                    f"location=='{loc}' and raw_date=={raw_date} and target_label_k=={int(target_label_k)}")

                # if more than one polygon targets the same label k, check if they overlap
                if len(date_labelk_subset) > 1:
                    # check every pair of polygons for overlaps
                    for i, z in comb(range(len(date_labelk_subset)), 2):
                        intersection_gdf = overlay(date_labelk_subset.iloc[[i]],
                                                   date_labelk_subset.iloc[[z]],
                                                   how='intersection')
                        if not intersection_gdf.empty:
                            # check if the overlapping polygons assign different new classes
                            if any(intersection_gdf.new_class_1 != intersection_gdf.new_class_2):
                                # if the overlap areas assign different classes, check whether
                                # the overlap contains points. If it does, raise an error, as
                                # this does not make sense and the polygons must be corrected
                                # by the user.
                                pts = profiles.query(f"location=='{loc}' and raw_date=={raw_date}")

                                # NOTE: GeoDataFrame is a subclass of DataFrame, so it must be
                                # checked first, otherwise this branch would never be reached
                                if isinstance(pts, gpd.GeoDataFrame):
                                    pts_gdf = pts
                                elif isinstance(pts, pd.DataFrame):
                                    pts['coordinates'] = pts.coordinates.apply(coords_to_points)
                                    if isinstance(crs, dict):
                                        pts_gdf = gpd.GeoDataFrame(pts, geometry='coordinates', crs=crs[loc])
                                    elif isinstance(crs, int):
                                        crs_adhoc = {'init': f'epsg:{crs}'}
                                        pts_gdf = gpd.GeoDataFrame(pts, geometry='coordinates', crs=crs_adhoc)
                                else:
                                    raise ValueError(
                                        f"profiles must be either a pandas DataFrame or a GeoPandas "
                                        f"GeoDataFrame. Found {type(profiles)} type instead.")

                                fully_contains = [intersection_gdf.geometry.contains(mask_geom)[0]
                                                  for mask_geom in pts_gdf.geometry]
                                if True in fully_contains:
                                    idx_true = [i for i, x in enumerate(fully_contains) if x]
                                    raise ValueError(
                                        f"There are {len(idx_true)} points in the overlap area of two label "
                                        f"correction polygons (location: {loc}, raw_date: {raw_date}, "
                                        f"target_label_k = {target_label_k}) which assign two different classes: "
                                        f"{intersection_gdf.loc[:, 'new_class_1'][0], intersection_gdf.loc[:, 'new_class_2'][0]}. "
                                        "This doesn't make sense, please correct your label correction polygons. "
                                        "You can have overlapping polygons which act on the same target label k, "
                                        "but if they overlap points with such target_label_k, then they MUST "
                                        "assign the same new class.")

    print("Check label correction polygons overlap inconsistencies terminated successfully")
```

Now users can create label correction polygons which target the same label k without stressing too much about being super precise and avoiding overlaps. If the overlap area of two polygons which target the same label k but assign different classes contains a point with that target k, this function will raise a ValueError:

```
ValueError: There are 1 points in the overlap area of two label correction polygons (location: leo, raw_date: 20190731, target_label_k = 6) which assign two different classes: ('sand', 'veg'). This doesn't make sense, please correct your label correction polygons. You can have overlapping polygons which act on the same target label k, but if they overlap points with such target_label_k, then they MUST assign the same new class.
```

I am now updating the doc with this warning, then I will place this check directly in the cleanit method.
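The rule the function enforces can be illustrated without GeoPandas. The sketch below is a hypothetical, stdlib-only analogue (not sandpyper code): axis-aligned rectangles stand in for correction polygons, and `itertools.combinations` performs the same pairwise check, flagging any point that falls inside two overlapping "polygons" that assign different classes:

```python
from itertools import combinations


def rects_overlap(a, b):
    """Overlap test for axis-aligned rectangles given as (xmin, ymin, xmax, ymax)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])


def point_in_rect(p, r):
    """True if point (x, y) lies inside rectangle r (boundaries included)."""
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def find_conflicts(polys, points):
    """polys: list of (rect, new_class) pairs that all target the same label_k.
    Returns (point, class_a, class_b) tuples for points sitting inside an overlap
    whose two polygons assign different classes -- the situation the check forbids."""
    conflicts = []
    for (rect_a, cls_a), (rect_b, cls_b) in combinations(polys, 2):
        if rects_overlap(rect_a, rect_b) and cls_a != cls_b:
            for p in points:
                if point_in_rect(p, rect_a) and point_in_rect(p, rect_b):
                    conflicts.append((p, cls_a, cls_b))
    return conflicts


polys = [((0, 0, 2, 2), "sand"), ((1, 1, 3, 3), "veg")]
print(find_conflicts(polys, [(1.5, 1.5)]))  # → [((1.5, 1.5), 'sand', 'veg')]
print(find_conflicts(polys, [(0.5, 0.5)]))  # → [] (point is outside the overlap)
```

Two overlapping polygons that assign the same class would pass this check, mirroring the rule stated in the error message.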
Looks like a tricky error to find! Glad you managed to sort it out - a warning to the user, like you suggested, should work well.
Tricky but important. Now I am struggling to run the function in the GitHub environment while testing, although it works perfectly locally. As soon as I find what is causing this error I will update you here. Cheers
If you agree a warning is enough (for now), I added a big note in the docs, in the label correction file section of the data cleaning chapter. The function I created works perfectly in my local Jupyter notebook setup, but for some reason it doesn't work when running in the GitHub VM while testing the package. So I will take the time to make this check work in the near future.
Hi @npucino, I agree with your approach - a warning is fine for now if you're struggling to get it to work on GitHub 👍
Hi, thanks @chrisleaman. I see that you fetched the commits related to the issues; I am sorry I didn't do it myself.
Here is the commit when I added the Note message: 12e5965 Cheers |
Comments are for openjournals/joss-reviews#3666 (comment).
Following `2 - Profiles extraction, unsupervised sand labelling and cleaning.ipynb`, I get a `MergeError: Merge keys are not unique in right dataset; not a one-to-one merge` error when running `P.cleanit`.

sandpyper/examples/2 - Profiles extraction, unsupervised sand labelling and cleaning.ipynb
Lines 3396 to 3399 in ce542c6
Looks like the validation when merging the dataframe is failing, potentially because of multiple matches in the dataframe?
sandpyper/sandpyper/common.py
Lines 2268 to 2269 in ce542c6
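For context on what that validation does: pandas' `merge` with `validate="one_to_one"` raises `MergeError` whenever the merge keys are duplicated on either side. The following is a minimal, hypothetical reproduction with made-up frames (not sandpyper's actual data), where the right-hand side contains a duplicated key like the duplicated points described above:

```python
import pandas as pd

# hypothetical frames: the right-hand side has a duplicated merge key (point_id 2)
left = pd.DataFrame({"point_id": [1, 2], "z": [0.5, 0.7]})
right = pd.DataFrame({"point_id": [1, 2, 2], "label_k": [6, 6, 6]})

err = None
try:
    # the one-to-one validation fails because point_id 2 matches two right rows
    left.merge(right, on="point_id", validate="one_to_one")
except pd.errors.MergeError as exc:
    err = exc

print(err)  # Merge keys are not unique in right dataset; not a one-to-one merge
```

Dropping the duplicates (or relaxing `validate` to `"one_to_many"`) makes the merge succeed, which is consistent with the duplicated-label hypothesis.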
I saved the `to_clean_classified` and `to_update_finetune` dataframes before the error is thrown: