replace string by pattern ? #14

Open
moodymudskipper opened this issue Oct 3, 2019 · 4 comments

Comments

@moodymudskipper
Owner

moodymudskipper commented Oct 3, 2019

This isn't really an ungluing feature; it's more about aggregating by pattern.

Say I have some file names like these, but in much larger numbers and with more patterns:

c(
  "John report January.doc",
  "Brian report March.doc",
  "Summary 2018.xls",
  "Summary 2017.xls",
  "unstructured isolated file name.doc")

In order to count or aggregate, it would be nice to be able to give the patterns
"{name} report {month}.doc" and "Summary {year}.xls" as input, and get as output:

c(
  "{name} report {month}.doc",
  "{name} report {month}.doc",
  "Summary {year}.xls",
  "Summary {year}.xls",
  "unstructured isolated file name.doc")

Maybe the default should be to output:

c(
  "{name} report {month}.doc",
  "{name} report {month}.doc",
  "Summary {year}.xls",
  "Summary {year}.xls",
  NA)

And then keeping the original string when unmatched would be an option?
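
A minimal base R sketch of the idea, not an unglue API: `pattern_to_regex()`, `to_pattern()` and the `keep_unmatched` argument are hypothetical names, and the patterns are written to match the sample file names. Each `{placeholder}` is treated as a lazy wildcard, and the first matching pattern wins.

```r
# Hypothetical helper: turn a "{placeholder}" pattern into an anchored PCRE regex.
# Assumes the literal parts of the pattern never contain the sequence "\E".
pattern_to_regex <- function(pattern) {
  # Literal pieces between placeholders (empty pieces at the ends are kept)
  m <- gregexpr("\\{[^}]*\\}", pattern)
  literals <- regmatches(pattern, m, invert = TRUE)[[1]]
  # Quote each literal piece with \Q...\E and join the pieces with a lazy wildcard
  paste0("^", paste0("\\Q", literals, "\\E", collapse = "(.*?)"), "$")
}

# Hypothetical function: replace each string by the first pattern it matches.
to_pattern <- function(x, patterns, keep_unmatched = FALSE) {
  out <- if (keep_unmatched) x else rep(NA_character_, length(x))
  matched <- rep(FALSE, length(x))
  for (p in patterns) {
    # Only strings not already claimed by an earlier pattern can be assigned
    hit <- !matched & grepl(pattern_to_regex(p), x, perl = TRUE)
    out[hit] <- p
    matched <- matched | hit
  }
  out
}

files <- c(
  "John report January.doc",
  "Brian report March.doc",
  "Summary 2018.xls",
  "Summary 2017.xls",
  "unstructured isolated file name.doc")
patterns <- c("{name} report {month}.doc", "Summary {year}.xls")

to_pattern(files, patterns)
to_pattern(files, patterns, keep_unmatched = TRUE)
```

Counting is then a matter of `table(to_pattern(files, patterns), useNA = "ifany")`.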

I don't have a really good name idea...

Maybe something like unglue_simplify(), unglue_generalize(), or unglue_to_pattern()?

@moodymudskipper
Owner Author

This would allow a nice iterative workflow to create the patterns from a list of messy names, using count() or wrapping our own function around it.

@moodymudskipper
Owner Author

It won't be for this version, but I think what I want is the perfect function to help someone build patterns when they don't know the data.

It could be named unglue_summary.

I imagine a table like this, showing how many messages were matched by each of several patterns.

Here, for example, pattern B was matched 6 times, but in one of those instances the message was also matched by A, which has priority, so the total attributed to pattern B is 5.

             n  A  B  C
A pattern1   4  4
B pattern2   5  1  6
C pattern3   3  2  0  5

Then we'd display the number of unmatched messages and the first 10 of them, making it easy for the user to iterate on their pattern vector.
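
A rough base R sketch of how such a summary could be computed. It rests on assumptions: `unglue_summary_sketch()` and `pattern_to_regex()` are hypothetical names, not unglue functions, and the table follows one reading of the example above. The diagonal holds the total matches of a pattern, and the cells to its left count the strings that pattern matched but that an earlier pattern claimed first.

```r
# Hypothetical helper: "{placeholder}" pattern -> anchored PCRE regex.
# Assumes the literal parts of the pattern never contain the sequence "\E".
pattern_to_regex <- function(pattern) {
  m <- gregexpr("\\{[^}]*\\}", pattern)
  literals <- regmatches(pattern, m, invert = TRUE)[[1]]
  paste0("^", paste0("\\Q", literals, "\\E", collapse = "(.*?)"), "$")
}

unglue_summary_sketch <- function(x, patterns, n_unmatched = 10) {
  k <- length(patterns)
  # One logical column per pattern: does string i match pattern j?
  hits <- vapply(
    patterns,
    function(p) grepl(pattern_to_regex(p), x, perl = TRUE),
    logical(length(x)))
  hits <- matrix(hits, nrow = length(x), dimnames = list(NULL, patterns))
  # Index of the first (highest priority) pattern matching each string
  first <- apply(hits, 1, function(r) if (any(r)) unname(which(r)[1]) else NA_integer_)
  # Lower triangle: matched by row pattern but claimed by column pattern
  tab <- matrix(NA_integer_, k, k, dimnames = list(patterns, patterns))
  for (i in seq_len(k)) {
    for (j in seq_len(i)) {
      tab[i, j] <- if (i == j) sum(hits[, i]) else sum(hits[, i] & first == j, na.rm = TRUE)
    }
  }
  # Strings attributed to each pattern once priority is applied
  n <- vapply(seq_len(k), function(i) sum(first == i, na.rm = TRUE), integer(1))
  unmatched <- x[is.na(first)]
  list(
    n = setNames(n, patterns),
    table = tab,
    n_unmatched = length(unmatched),
    unmatched = head(unmatched, n_unmatched))
}

files <- c(
  "John report January.doc",
  "Brian report March.doc",
  "Summary 2018.xls",
  "Summary 2017.xls",
  "unstructured isolated file name.doc",
  "notes.txt")
# The second pattern deliberately overlaps with the first to show priority
patterns <- c("{name} report {month}.doc", "{stem}.doc", "Summary {year}.xls")

unglue_summary_sketch(files, patterns)
```

The intermediate `hits` matrix is also the "one logical column per pattern" view, so it could be returned on its own if that turns out to be the more useful building block.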

@moodymudskipper
Owner Author

This could actually be a parameter, so we could use it in unglue_unnest(), unglue_vec()...

We could have a default return_pattern = FALSE.

Better maybe: this could be a parameter of unglue_detect(), logical = TRUE; if FALSE, the output would be as above.

This wouldn't be type stable though.

Another parameter of unglue_detect() could give us a logical column per pattern, making the kind of table above easier to get.

@moodymudskipper
Owner Author

These are cool features, but the names are not good. Since there's no demand, we'll leave it as "nice to have".
