Fix socket error: other side closed #20

Merged
merged 14 commits into from
Mar 1, 2024

Conversation

rael346
Contributor

@rael346 rael346 commented Feb 25, 2024

Closes #17

This PR fixes the socket closed error by:

  • Refactoring the scraper pipeline to run each phase for every major in one go (instead of running every phase for one major at a time)
    • This means every major's raw HTML is fetched in the Classify stage before Tokenize starts
    • Similarly, every major is tokenized before the Parse phase starts
  • Adding retry logic for fetching HTML
    • The refactor above already fixes 99% of the errors. The retry is mainly there to ensure no error of this type can happen
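For reviewers who want the shape of the two changes without reading the diff, here is a minimal sketch. All names (`sleep`, `withRetry`, `runPhase`, `runPipeline`) and the attempt count and delay values are hypothetical and not taken from the repository; running a phase concurrently with `Promise.all` is also an assumption, and the stand-in Tokenize and Parse steps only split and count text.

```typescript
// Hypothetical names throughout; this is an illustration, not the repo's code.

/** Resolves after `ms` milliseconds. */
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Runs `task` up to `attempts` times, waiting `delayMs * attemptNumber`
 * between failed attempts, and rethrows the last error if all attempts fail.
 */
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  delayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) await sleep(delayMs * attempt);
    }
  }
  throw lastError;
}

/** Applies one phase to every input; resolves once all inputs are done. */
async function runPhase<In, Out>(
  inputs: In[],
  phase: (input: In) => Promise<Out>,
): Promise<Out[]> {
  return Promise.all(inputs.map((input) => phase(input)));
}

/** Phase-by-phase pipeline: each phase finishes for all majors first. */
async function runPipeline(
  urls: string[],
  fetchHtml: (url: string) => Promise<string>, // caller-supplied fetcher
): Promise<number[]> {
  // Classify stand-in: fetch every major's HTML, with retries.
  const pages = await runPhase(urls, (url) => withRetry(() => fetchHtml(url)));
  // Tokenize stand-in: starts only after every fetch has finished.
  const tokens = await runPhase(pages, async (html) => html.split(/\s+/));
  // Parse stand-in: here it just counts the tokens per major.
  return runPhase(tokens, async (list) => list.length);
}
```

Because no later phase starts until all fetches have settled, network work is confined to one stage, which is the property the refactor relies on.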

Besides that, there are a few improvements to the pipeline:

  • Improved the CLI with stats for each phase
  • For business majors' concentrations specifically, Tokenize now reads the concentration HTML from the local copy instead of fetching it
  • A number of refactors to remove unused code and make the file structure more consistent (moving graduate types into the general types folder, etc.)
    • @AlpacaFur FYI, some of this might break the imports on the tooling branch
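As a rough illustration of what per-phase CLI stats could look like, here is a small sketch. The description above does not say which stats are reported, so the success and failure counts, the output format, and the names `PhaseStats`, `summarize`, and `formatStats` are all assumptions.

```typescript
// Hypothetical shape; the actual CLI output in this PR may differ.
interface PhaseStats {
  phase: string;
  succeeded: number;
  failed: number;
}

/** Tallies the settled results of one phase into a summary record. */
function summarize(
  phase: string,
  results: PromiseSettledResult<unknown>[],
): PhaseStats {
  const succeeded = results.filter((r) => r.status === "fulfilled").length;
  return { phase, succeeded, failed: results.length - succeeded };
}

/** Formats one line of CLI output for a phase. */
function formatStats(stats: PhaseStats): string {
  const total = stats.succeeded + stats.failed;
  return `${stats.phase}: ${stats.succeeded}/${total} succeeded, ${stats.failed} failed`;
}
```

The `results` array is what `Promise.allSettled` returns, so a phase that runs every major at once can report failures without aborting the whole run.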

@rael346 rael346 added the Bug Something isn't working label Feb 25, 2024
@rael346 rael346 requested a review from AlpacaFur February 25, 2024 23:13
@rael346 rael346 self-assigned this Feb 25, 2024
Member

@AlpacaFur AlpacaFur left a comment

Looks awesome, thank you for all your hard work!!

src/main.ts (review thread resolved)
src/utils.ts (review thread resolved)
@rael346 rael346 merged commit e53d7d5 into main Mar 1, 2024
@rael346 rael346 deleted the fix-socket-drop branch March 1, 2024 02:21
Labels
Bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix socketerror: other side closed error when fetching major's html
2 participants