Add domain counting script #91
base: main
Conversation
There are linting errors that need to be fixed too.
    return suburls

def merge_counters(counter_list):
This can be replaced with:

import operator as op
from functools import reduce

def merge_counters(counter_list):
    return reduce(op.add, counter_list)
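For reference, a runnable check of the suggested rewrite, assuming the inputs are collections.Counter objects as in the diff (the sample data is hypothetical):

import operator as op
from functools import reduce
from collections import Counter

def merge_counters(counter_list):
    # Counter supports +, so reduce folds the whole list into one Counter
    return reduce(op.add, counter_list)

merged = merge_counters([Counter({"a": 1}), Counter({"a": 2, "b": 1})])
print(merged)  # Counter({'a': 3, 'b': 1})

One caveat: reduce raises a TypeError on an empty list, whereas sum(counter_list, Counter()) would return an empty Counter.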
from datasets import Dataset, load_dataset

def process_file(file_path):
I understand not using the dolma processors here, as you are creating these counter objects, but can we add a comment explaining that this is why we aren't using them?
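One possible wording for such a comment (hypothetical, just to illustrate the request):

def process_file(file_path):
    # NOTE: we intentionally bypass the dolma processors here because this
    # function builds per-file Counter objects that are merged later, which
    # doesn't fit the dolma processing interface.
    ...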
except json.JSONDecodeError:
    continue

gc.collect()  # Explicitly trigger garbage collection to free memory
If this had to get added, I feel like there is something going wrong in your parallelism implementation. Can you add some comments about why this was needed?
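If the underlying issue is worker memory growth, one alternative worth considering is recycling worker processes instead of calling gc.collect() by hand; ProcessPoolExecutor supports this on Python 3.11+. A sketch with stand-in stubs for the names in the diff:

from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def process_file(file_path):  # stand-in for the function in this PR
    return Counter({file_path: 1})

all_files = ["a.jsonl", "b.jsonl"]  # stand-in inputs

if __name__ == "__main__":
    # Python 3.11+: each worker is restarted after 50 tasks, releasing any
    # memory it accumulated, so no explicit gc.collect() is needed.
    with ProcessPoolExecutor(max_workers=4, max_tasks_per_child=50) as executor:
        counters = list(executor.map(process_file, all_files))
    print(sum(counters, Counter()))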
suburls = [domain, ]
current_path = domain

for part in path_parts[:-1]:
Why are you indexing with [:-1] here? In the path_parts line you have .strip('/'), so it seems like this function is OK with the final part of the path being a directory (and therefore it should be part of this suburl list), but it gets filtered out?
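A small illustration of the concern (the URL path here is hypothetical):

path_parts = "news/world/2024/".strip('/').split('/')  # ['news', 'world', '2024']
print(path_parts[:-1])  # ['news', 'world'] -- the final directory '2024' never becomes a suburl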
signal.signal(signal.SIGINT, signal_handler)

with tqdm(total=len(all_files), desc="Total files", position=0) as file_progress:
    with ProcessPoolExecutor(max_workers=40) as executor:  # Adjust max_workers based on your system
This max_workers value should be a CLI argument if you want people to adjust it themselves.
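A sketch of how that could look with argparse (the flag name and default are assumptions):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max-workers", type=int, default=40,
                    help="number of worker processes for ProcessPoolExecutor")
args = parser.parse_args()
# later: ProcessPoolExecutor(max_workers=args.max_workers)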
for suburl, count in sorted_suburls[:10]:
    print(f"{suburl}: {count}")

sorted_text_lengths = sorted(final_text_length_counter.items(), key=lambda item: item[1], reverse=True)
Same as above
for suburl, count in sorted_suburls[:10]:
    print(f"{suburl}: {count}")

sorted_text_lengths = sorted(final_text_length_counter.items(), key=lambda item: item[1], reverse=True)
Something like

import operator as op
... = sorted(..., key=op.itemgetter(1), ...)

is generally preferred over a lambda to get an item like this.
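For example (hypothetical data):

import operator as op

counts = {"example.com/a": 3, "example.com/b": 7}
print(sorted(counts.items(), key=op.itemgetter(1), reverse=True))
# [('example.com/b', 7), ('example.com/a', 3)]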
parent_data = url_dict[parent_url]
if (data['total_sample_count'] == parent_data['total_sample_count'] and
        data['total_text_length'] == parent_data['total_text_length']):
    # If counts match, mark the higher-level URL for removal
The term higher-level is confusing; it makes it seem like you would be removing the parent URL.
# # Run the processing
# root_folder = './data/dolma-cccc/data/CC-MAIN-2024-18'  # Update with your root folder path

token = "YOUR_HF_TOKEN"  # Replace with your own hf token if you want to submit the dataset
This should be a CLI argument.
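Similarly, a sketch with the token as a CLI argument (the flag name is assumed); an environment-variable fallback is often used so the token never lands in shell history:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--hf-token", default=os.environ.get("HF_TOKEN"),
                    help="Hugging Face token; only needed when pushing the dataset")
args = parser.parse_args()
# later: pass args.hf_token wherever the hard-coded token was used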
return filtered_data

# download the dolma-cccc dataset for analysis
snapshot_download(repo_id="allenai/dolma-cccc", local_dir='data/dolma-cccc', repo_type="dataset")
This should all be wrapped into a main function.
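A minimal sketch of the requested structure, assuming the module-level calls from the diff move inside main():

from huggingface_hub import snapshot_download

def main():
    # download the dolma-cccc dataset for analysis (moved from module level)
    snapshot_download(repo_id="allenai/dolma-cccc",
                      local_dir="data/dolma-cccc", repo_type="dataset")
    # ... rest of the processing pipeline ...

if __name__ == "__main__":
    main()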
Script for extracting sub-URLs and counting their frequency and total text length in the CCCC dataset.