From 32107acdaf97396eea17300df4b818a773942d7e Mon Sep 17 00:00:00 2001 From: mytruong Date: Tue, 11 Feb 2025 17:37:49 +0700 Subject: [PATCH] fix: usecases formatting --- .../ai-powered-monthly-project-reports.md | 50 +++++++------- Use Cases/ai-ruby-travel-assistant-chatbot.md | 3 +- Use Cases/binance-transfer-matching.md | 24 ++++--- Use Cases/bitcoin-alt-performance-tracking.md | 2 + ...atbot-agent-for-project-management-tool.md | 8 +-- ...building-data-pipeline-ogif-transcriber.md | 65 +++++++++++++------ ...d-monitoring-setup-for-trading-platform.md | 15 +++-- ...rypto-market-outperform-chart-rendering.md | 5 +- Use Cases/data-archive-and-recovery.md | 50 ++++++-------- ...database-hardening-for-trading-platform.md | 27 +++----- ...nhancing-cryptocurrency-transfer-logger.md | 50 +++++--------- ...lement-binance-future-pnl-analysis-page.md | 14 ++-- ...migrate-normal-table-to-timescale-table.md | 6 +- ...ist-history-using-data-snapshot-pattern.md | 21 ++---- ...ting_trading_pnl_data_pipeline_approach.md | 27 +------- 15 files changed, 162 insertions(+), 205 deletions(-) diff --git a/Use Cases/ai-powered-monthly-project-reports.md b/Use Cases/ai-powered-monthly-project-reports.md index aac6ed01..bc194d1a 100644 --- a/Use Cases/ai-powered-monthly-project-reports.md +++ b/Use Cases/ai-powered-monthly-project-reports.md @@ -5,28 +5,31 @@ date: "2024-11-14" description: "An in-depth look at Dwarves' monthly Project Reports system - a lean, efficient system that transforms communication data into actionable intelligence for Operations teams. This case study explores how we orchestrate multiple data streams into comprehensive project insights while maintaining enterprise-grade security and cost efficiency." tags: - "data-engineering" -- "project-management" +- "ai-agents" +- "llmops" - "case-study" title: "Project reports system: a case study" --- -At Dwarves, we've developed a monthly Project Reports system - a lean, efficient system that transforms our communication data into actionable intelligence for our Operations team. This system orchestrates multiple data streams into comprehensive project insights while maintaining enterprise-grade security and cost efficiency. +At Dwarves, we've developed a Monthly Project Reports system that transforms communication data into actionable intelligence. This lean system orchestrates multiple data streams into comprehensive project insights while maintaining enterprise-grade security and cost efficiency. ## The need for orchestrated intelligence +Our engineering teams exchange thousands of Discord messages daily across projects, capturing critical technical discussions, architectural decisions, and implementation details. However, while Discord excels at real-time communication, valuable insights often remain buried in chat histories, making it difficult to: -Our engineering teams generate thousands of Discord messages daily across multiple projects. These messages contain critical technical discussions, architectural decisions, and implementation details that traditionally remained trapped in chat histories. While Discord excels as a communication platform, its real-time nature makes it challenging to track project progress against client requirements or ensure alignment between ongoing discussions and formal documentation. +1. Track project progress against client requirements. +2. Align ongoing discussions with formal documentation. +3. Extract actionable insights from technical conversations. 
-This challenge sparked the development of our Project Reports system. Like a skilled conductor bringing order to complex musical pieces, our system coordinates multiple data streams into clear, actionable project intelligence +This challenge led us to develop the Project Reports system - an intelligent orchestration layer that transforms scattered communication data into structured project intelligence. Our system processes multiple data streams, extracting key insights and patterns to generate comprehensive project visibility. ## The foundation: Data architecture - Our architecture follows a simple yet powerful approach to data management, emphasizing efficiency and practicality over complexity. We've built our system on three core principles: -1. **Lean Storage**: S3 serves as our primary data lake and warehouse, using Parquet and CSV files to optimize for both cost and performance -2. **Efficient Processing**: DuckDB and Polars provide high-performance querying without the overhead of traditional data warehouses -3. **Secure Access**: Modal orchestrates our serverless functions, ensuring secure and efficient data processing +1. **Lean storage**: S3 serves as our primary data lake and warehouse, using Parquet and CSV files to optimize for both cost and performance +2. **Efficient processing**: DuckDB and Polars provide high-performance querying without the overhead of traditional data warehouses +3. **Secure access**: Modal orchestrates our serverless functions, ensuring secure and efficient data processing -### Data Flow Overview +### Data flow overview ```mermaid graph TB @@ -93,7 +96,7 @@ graph TB The system begins with raw data collection from various sources, primarily Discord at present, with planned expansion to Git, JIRA, Google Docs, and Notion. This data moves through our S3-based landing and gold zones, where it undergoes quality checks and transformations before feeding into our platform and AI engineering layers. -### Detailed Processing Pipeline +### Detailed processing pipeline ```mermaid graph LR @@ -157,12 +160,12 @@ graph LR Our processing pipeline emphasizes efficiency and security: -1. **Collection Layer**: Weekly scheduled collectors gather data from various sources -2. **Processing Pipeline**: Data undergoes PII scrubbing, validation, and schema enforcement -3. **Storage Layer**: Processed data is stored in S3 using Parquet and CSV formats -4. **Query Layer**: DuckDB and Polars engines provide fast, efficient data analysis +1. **Collection layer**: Weekly scheduled collectors gather data from various sources +2. **Processing pipeline**: Data undergoes PII scrubbing, validation, and schema enforcement +3. **Storage layer**: Processed data is stored in S3 using Parquet and CSV formats +4. **Query layer**: DuckDB and Polars engines provide fast, efficient data analysis -## Dify - Operational Intelligence through Low-code Workflows +## Dify - Operational intelligence through low-code workflows We use Dify to transform our raw data streams into intelligent insights through low-code workflows. This process bridges the gap between our data collection pipeline and the operational insights needed by our team. @@ -213,7 +216,7 @@ The workflow system easily integrates with our existing data pipeline, pulling f - **Maintainable Intelligence** Templates and workflows are version-controlled and documented, making it easy for team members to understand and modify the intelligence generation process. This ensures our reporting system can evolve with our organizational needs. 
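As a concrete (and deliberately simplified) illustration of how such a workflow step can pull from the gold zone, the sketch below queries the processed Parquet files directly with DuckDB. This is not our production code: the bucket, prefix, and column names are placeholders, and S3 credentials are assumed to be configured through DuckDB's S3 settings or the environment.

```python
# Illustrative sketch only: bucket, prefix, and column names are hypothetical.
import duckdb

def monthly_message_summary(project: str, month: str):
    con = duckdb.connect()            # in-memory DuckDB session
    con.execute("INSTALL httpfs")     # enables reading s3:// paths
    con.execute("LOAD httpfs")        # S3 credentials/region come from DuckDB settings or env
    path = f"s3://example-gold-zone/discord/{project}/{month}/*.parquet"
    return con.execute(
        f"""
        SELECT channel,
               COUNT(*)                  AS messages,
               COUNT(DISTINCT author_id) AS contributors
        FROM read_parquet('{path}')
        GROUP BY channel
        ORDER BY messages DESC
        """
    ).fetchall()
```

A Polars `scan_parquet` call over the same files would serve the same purpose; the point is that the gold zone can be queried in place, without standing up a separate warehouse.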
-## Operational Impact +## Operational impact The Project Reports system serves as the foundation for our Operations team's project oversight. It provides: @@ -221,9 +224,8 @@ The Project Reports system serves as the foundation for our Operations team's pr - **Data-Driven Decision Making**: By analyzing communication patterns and project discussions, we can make informed decisions about resource allocation and project timelines. - **Automated Reporting**: The system generates comprehensive monthly reports, reducing manual effort and ensuring consistent project tracking across the organization. -## Technical Implementation - -### Secure Data Collection +## Technical implementation +### Secure data collection The cornerstone of our system is a robust collection pipeline built on Modal. Our collection process runs weekly, automatically processing Discord messages through a sophisticated filtering system that preserves critical technical discussions while ensuring security and privacy. @@ -242,8 +244,7 @@ def weekly_discord_collection(): Through Modal's serverless architecture, we've implemented separate landing zones for different project data, ensuring granular access control and comprehensive audit trails. Each message undergoes content filtering and PII scrubbing before being transformed into optimized Parquet format, providing both storage efficiency and query performance. -### Query Interface - +### Query interface The system provides a flexible API for accessing processed data: ```python @@ -262,12 +263,10 @@ def query_messages(item: QueryRequest, token: str = Depends(verify_token)) -> Di ``` -## Measured Impact - +## Measured impact The implementation of Project Reports has fundamentally transformed our project management approach. Our operations team now have greater visibility into project progress, with tracking and early issue identification becoming the norm rather than the exception. The automated documentation of key decisions has significantly reduced meeting overhead, while the correlation between discussions and deliverables ensures nothing falls through the cracks. -## Future Development - +## Future development We're expanding the system's capabilities in several key areas: - **Additional Data Sources**: Integration with Git metrics, JIRA tickets, and documentation platforms will provide a more comprehensive view of project health. @@ -277,7 +276,6 @@ We're expanding the system's capabilities in several key areas: We also don’t plan to be vendor-locked using entirely Modal. The foundations we’ve layed out to create our landing zones and data lake makes it very easy to swap in-and-out query and API architectures. ## Conclusion - At Dwarves, our Project Reports system demonstrates the power of thoughtful data engineering in transforming raw communication into strategic project intelligence. By combining secure data collection, efficient processing, and AI-powered analysis, we've created a system that doesn't just track progress – it actively contributes to project success. The system continues to coordinate our project data streams with precision and purpose, ensuring that every piece of information contributes to a clear picture of project health. Through this systematic approach, we're setting new standards for data-driven project management in software development, one report at a time. 
diff --git a/Use Cases/ai-ruby-travel-assistant-chatbot.md b/Use Cases/ai-ruby-travel-assistant-chatbot.md index 2fe7b7f2..0a9493e2 100644 --- a/Use Cases/ai-ruby-travel-assistant-chatbot.md +++ b/Use Cases/ai-ruby-travel-assistant-chatbot.md @@ -4,9 +4,8 @@ authors: date: "2024-11-21" description: "A case study exploring how we built an AI-powered travel assistant using Ruby and AWS Bedrock, demonstrating how choosing the right tools over popular choices led to a more robust and maintainable solution. This study examines our approach to integrating AI capabilities within existing Ruby infrastructure while maintaining enterprise security standards." tags: -- "ruby" +- "ai-agents" - "ai-engineering" -- "ai" - "case-study" title: "AI-powered Ruby travel assistant" --- diff --git a/Use Cases/binance-transfer-matching.md b/Use Cases/binance-transfer-matching.md index ef3e763b..79ad9f98 100644 --- a/Use Cases/binance-transfer-matching.md +++ b/Use Cases/binance-transfer-matching.md @@ -2,10 +2,10 @@ title: "Building better Binance transfer tracking" date: 2024-11-18 tags: - - data - - sql - - binance -description: A deep dive into building a robust transfer tracking system for Binance accounts, transforming disconnected transaction logs into meaningful fund flow narratives through SQL and data analysis + - "data-engineering" + - fintech + - defi +description: A deep dive into building a robust transfer tracking syste m for Binance accounts, transforming disconnected transaction logs into meaningful fund flow narratives through SQL and data analysis authors: - bievh --- @@ -16,8 +16,8 @@ Everything worked well at the beginning, motivating the clients to increase the This emergency lets us begin record every transfers between accounts in the system, then notify to the clients continuously. -### Limitations of Binance income history ---- +## Limitations of Binance income history + To record every transfers, we need the help of Binance APIs, specifically is [Get Income History (USER_DATA)](https://developers.binance.com/docs/derivatives/usds-margined-futures/account/rest-api/Get-Income-History). Once calling to this endpoint with proper parameters, we can retrieve the following `JSON` response. ```JSON @@ -58,9 +58,8 @@ To me, it looks bad. Ignore the wrong destination balance because of another iss If you pay attention to the `JSON` response of Binance API, an idea can be raised in your mind that "*Hmm, it looks easy to get the better version of logging by just only matching the transaction ID aka tranId field value*". Yes, it is the first thing that popped into my mind. Unfortunately, once the transfer happens between two accounts, different transaction IDs are produced on each account side. -### Our approach to transfer history mapping ---- -#### Current implementation +## Our approach to transfer history mapping +### Current implementation It can make you a bit of your time at the beginning when looking at the response of Binance API and ask yourself "Why does Binance give us a bad API response?". Bit it is not a dilemma. And Binance API is not as bad as when I mentioned it. This API serves things enough for its demand in the Biance. And more general means can serve more use cases at all. Enough to explain, now, we get to the important part: matching transfers to make the transfer history logging becomes more robust. I think we have more than two ways to do it. But because this issue comes from a data aspect, we will use a database solution to make it better. 
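Before looking at the SQL, here is a toy, in-memory illustration of the pairing rule we are after: an outgoing transfer is matched with the incoming transfer of the same absolute amount that lands within a small time window. The record shape below is a simplified, hypothetical stand-in for the income-history response; the actual approach described next stays inside PostgreSQL.

```python
# Toy illustration of the pairing rule (not the production approach, which is
# implemented in SQL below): match an outgoing transfer with the incoming
# transfer of the same absolute amount inside a small time window.
from dataclasses import dataclass

@dataclass
class Transfer:
    account: str
    tran_id: int      # differs on each side of the same transfer
    amount: float     # negative = outgoing, positive = incoming
    time_ms: int      # epoch milliseconds

def match_transfers(records: list[Transfer], window_ms: int = 2_000):
    outgoing = sorted((r for r in records if r.amount < 0), key=lambda r: r.time_ms)
    incoming = sorted((r for r in records if r.amount > 0), key=lambda r: r.time_ms)
    pairs, used = [], set()
    for out in outgoing:
        for i, inc in enumerate(incoming):
            if i in used:
                continue
            same_amount = abs(inc.amount + out.amount) < 1e-9
            close_in_time = abs(inc.time_ms - out.time_ms) <= window_ms
            if same_amount and close_in_time:
                pairs.append((out, inc))   # (source side, destination side)
                used.add(i)
                break
    return pairs
```

The SQL in the next section expresses the same neighbour lookup with `LEAD`/`LAG`, which avoids pulling the rows into application code at all.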
@@ -105,8 +104,7 @@ The flow chart above shows how the current system produced transfer tracking log - From `Future Incomes`, we simply query transfer information such as amount, time, and its sign. - Using the time of transfer, query `Balance snapshots` to detect balance before and after it is changed by the transfer. -#### How to make it better? - +### How to make it better? To do it better, we need to match the transfers together to know the source and destination of the fund. To match the transfers together, we need to specify what is the transfer before and after it (**with the assumption that transfers of the same fund on the send and receive side happen in a small gap of time, and two transfers can't happen in the same time**). We are lucky that Postgresql provides us with two convenient window functions, LEAD and LAG. LEAD is used to access a row following the current row at a specific physical offset. On the other hand, LAG helps with previous row access. With simple syntax and better performance, it is our choice to do transfer paring. ```sql @@ -219,8 +217,8 @@ flowchart TD ``` *Figure 4: Upgraded process to build transfer history* -### Conclusions +## Conclusions From the problem to the idea and finally is the implementation, nothing is too difficult. Every normal software developer can do it even better. But to do the huge thing, we first should begin from the smaller and make it done subtly and carefully. From this small problem, I learned some things: - **The answer may lie in the question itself.** Instead of blaming Binance API for being so bad, we can take a sympathetic look at it, and see if there is anything we can get out of it. - **One small change can make everything better.** When comparing the original transfer tracking log, and the version after upgrading with some small changes in the DB query, there is a huge difference when seeing the new one. This reminds uss that impactful solutions don't always require complex architectures – sometimes they just need careful refinement of existing approaches. -- **Data challenges are often best addressed through data-driven solutions**. Rather than seeking fixes elsewhere, the key is to leverage the inherent patterns and structure within the data itself. +- **Data challenges are often best addressed through data-driven solutions**. Rather than seeking fixes elsewhere, the key is to leverage the inherent patterns and structure within the data itself. \ No newline at end of file diff --git a/Use Cases/bitcoin-alt-performance-tracking.md b/Use Cases/bitcoin-alt-performance-tracking.md index e0ee586d..57464e13 100644 --- a/Use Cases/bitcoin-alt-performance-tracking.md +++ b/Use Cases/bitcoin-alt-performance-tracking.md @@ -3,6 +3,8 @@ title: "Tracking Bitcoin-Altcoin Performance Indicators in BTC Hedging Strategy" date: 2025-01-02 tags: - data + - fintech + - blockchain - crypto description: "This article provides an overview of the importance of tracking Bitcoin-Altcoin performance indicators in a trading strategy known as Hedge, and explains how to visualize this data effectively. 
It also demonstrates how to render a chart for this strategy using Matplotlib and Seaborn" authors: diff --git a/Use Cases/building-chatbot-agent-for-project-management-tool.md b/Use Cases/building-chatbot-agent-for-project-management-tool.md index 4af83d29..cb7ee2be 100644 --- a/Use Cases/building-chatbot-agent-for-project-management-tool.md +++ b/Use Cases/building-chatbot-agent-for-project-management-tool.md @@ -5,8 +5,9 @@ authors: date: '2024-11-21' description: 'A technical case study detailing the implementation of an AI chatbot agent in a project management platform. Learn how the team leveraged LangChain, LangGraph, and GPT-4 to build a multi-agent system using the supervisor-worker pattern. ' tags: - - 'ai' - - 'project-management' + - 'ai-agents' + - 'aiops' + - 'langchain' - 'case-study' title: 'Building chatbot agent to streamline project management' --- @@ -18,7 +19,6 @@ The challenge was to natively integrate a generative AI chatbot that could assis Implementing the chatbot agent involved key technical domains such as developing an interface to communicate with external AI platforms like OpenAI, creating an agentic system to interpret and execute user requests, and setting up usage monitoring to control AI token consumption and track chatbot performance. ## System requirements - ### Business requirements - Chatbot should be able to answer general questions about project management, such as writing project proposals or epic planning. @@ -87,7 +87,6 @@ Implementing the chatbot agent involved key technical domains such as developing The data flows from the user to the Supervisor, which routes the request to the appropriate worker agent. The worker agent processes the request, interacting with the necessary tools and the database, and generates a response. The response is then returned to the Supervisor and finally to the user. ## Technical implementation - ### Core workflows ```mermaid @@ -147,7 +146,6 @@ To address the need for displaying custom UI elements instead of text-only respo - **MongoDB**: NoSQL database for storing chat history, token usage, and other relevant data, offering flexibility and scalability. ## Lessons learned - ### What worked well 1. Implementing the supervisor-worker pattern using LangGraph allowed us to build a scalable and extensible multi-agent AI system that could handle increasing functionalities without compromising performance. diff --git a/Use Cases/building-data-pipeline-ogif-transcriber.md b/Use Cases/building-data-pipeline-ogif-transcriber.md index c5244c7e..d37ca466 100644 --- a/Use Cases/building-data-pipeline-ogif-transcriber.md +++ b/Use Cases/building-data-pipeline-ogif-transcriber.md @@ -1,19 +1,20 @@ --- authors: - 'thanh' - - 'quang' + le new content discovery and analytics capabilities.- 'quang' date: '2024-11-21' description: 'A technical case study of creating an automated system that downloads videos, processes audio, and generates transcripts using AI services like Groq and OpenAI.' tags: - - 'data-engineering' - - 'project-management' + - 'aiops' + - 'ai-agents' + - llm - 'case-study' title: 'Building data pipeline for OGIF transcriber' --- -At Dwarves, we needed an automated way to transcribe and summarize the recordings of our weekly OGIF events for our Brainery knowledge hub. The key challenge was to build a scalable data pipeline that could efficiently process YouTube videos, extract audio, transcribe the content using AI models, and store the results for downstream analysis and search. 
+At Dwarves, we faced the challenge of efficiently transcribing and summarizing our weekly OGIF event recordings for our Brainery knowledge hub. This required developing a scalable data pipeline capable of processing YouTube videos, extracting audio, and leveraging AI models for transcription. -The pipeline needed to handle diverse video formats and lengths, support high-volume concurrent requests, and integrate with our existing data storage and access patterns. The solution would democratize access to valuable OGIF content, reduce manual transcription effort, and enable new content discovery and analytics capabilities. +Our solution needed to handle diverse video formats, support concurrent processing, and integrate seamlessly with existing infrastructure. The goal: democratize access to OGIF content while enabling powerful search and analytics capabilities. ## Data pipeline design @@ -162,16 +163,16 @@ The API acts as the intermediary between User and backend components. The Databa The workflow ensures efficient job processing through asynchronous processing and job locking. The separation of Downloader and Transcriber allows for parallel processing and scalability. The optional typo correction step with OpenAI enhances transcription quality. -## Performance and scaling +## Performance benchmarks The system is designed to handle the following benchmarks: -- 100 simultaneous transcription jobs -- Videos from 5 minutes to 2 hours in length -- Processing time under 5 minutes per video -- Transcription accuracy of 90% or higher -- API response times under 500ms -- Job completion within 15 minutes +- Process 100 simultaneous transcription jobs +- Handle videos from 5 minutes to 2 hours +- Complete processing within 5 minutes per video +- Maintain 90%+ transcription accuracy +- Ensure sub-500ms API response times +- Complete jobs within 15 minutes To ensure the pipeline could handle the expected scale and provide timely results, several optimizations were implemented: @@ -190,16 +191,40 @@ Robust error handling and monitoring were critical to ensure pipeline reliabilit The transcription service leverages the following technologies and tools: -- **Core**: Python 3.9+, Flask framework, Asynchronous processing with Python threading -- **AI/ML**: Groq AI for transcription, OpenAI for text refinement -- **Data**: PostgreSQL database, AWS S3 storage -- **Infrastructure**: Docker, Gunicorn HTTP server, Token-based authentication -- **CI/CD**: GitHub Actions - -Key libraries used include `psycopg2` for Postgres, `Boto3` for S3, and `yt-dlp` + `pydub` for video downloading and audio processing. +**Core platform** +- Python 3.9+ +- Flask/FastAPI for RESTful APIs +- Celery + Redis for task queue +- Gunicorn for WSGI server + +**AI/ML services** +- Groq AI for transcription +- OpenAI GPT-4 for text refinement +- Custom rate limiting and retry logic + +**Data & storage** +- PostgreSQL for persistent storage +- Redis for caching/queues +- AWS S3 for file storage +- `Boto3` for AWS operations + +**Media processing** +- `yt-dlp` for video downloading +- `FFmpeg` for video manipulation +- `pydub` for audio processing + +**Infrastructure & devOps** +- Docker + Docker Compose +- GitHub Actions for CI/CD +- Prometheus + Grafana monitoring +- Nginx reverse proxy + +**Security & documentation** +- JWT authentication +- SSL/TLS encryption +- OpenAPI/Swagger documentation ## Lessons learned - Key successes of the project include: 1. 
Modular decoupling of downloader, transcriber and API logic improved scalability and maintainability. Issues could be identified and fixed quickly in each module. diff --git a/Use Cases/centralized-monitoring-setup-for-trading-platform.md b/Use Cases/centralized-monitoring-setup-for-trading-platform.md index afce43c8..6dd91add 100644 --- a/Use Cases/centralized-monitoring-setup-for-trading-platform.md +++ b/Use Cases/centralized-monitoring-setup-for-trading-platform.md @@ -5,13 +5,14 @@ authors: date: '2024-11-21' description: 'A technical case study for implementing centralized monitoring for a trading platform using Grafana and Prometheus, focusing on real-time alerts, data integrity, and resource optimization to prevent financial losses.' tags: - - 'platform' - - 'monitoring' + - 'devops' + - 'fintech' + - 'blockchain' - 'case-study' -title: 'Setup centralized monitoring system for Nghenhan trading platform' +title: 'Setup centralized monitoring system for Hedge Foundation trading platform' --- -Nghenhan is a privately-owned trading platform used by a select group of traders. Given the high-stakes nature of trading and the significant financial implications of system failures, it is crucial for Nghenhan to implement a robust, centralized monitoring system. This system must ensure platform reliability, minimize downtime, and prevent data loss to protect traders from substantial monetary losses. +Hedge Foundtion, a private trading platform serving select traders, required a robust centralized monitoring system to ensure platform reliability and prevent financial losses. Given the high-stakes nature of trading operations, the system needed to provide real-time alerts, maintain data integrity, and optimize resource allocation to protect traders from potential monetary losses. ## Understanding the unique challenges @@ -41,7 +42,7 @@ Integrating Grafana and Prometheus provides Nghenhan with a powerful centralized - **Grafana**: Grafana fetches data from Prometheus to create visualizations and dashboards. It also allows users to set up alerts and explore historical data. - **Notification Receivers**: The notification receivers are the endpoints or channels where alerts are sent, such as email, Discord, or custom webhooks. The admin can also receive notifications and take appropriate actions based on the alerts. -### **Prometheus as Data Collector** +### Prometheus as Data collector Prometheus serves as the primary data collection and monitoring tool, scraping metrics from various services and recording health and performance information. The setup involves configuring Prometheus to gather data on key metrics, including: @@ -75,7 +76,7 @@ due to multiple issues, such as resource limits, configuration errors, or depend ![](assets/nghenhan-service-back-off-restarting.webp) -### **Grafana as Data Visualizer for insightful observations** +### Grafana as Data Visualizer for insightful observations Grafana complements Prometheus by providing robust data visualization capabilities. With Grafana, Nghenhan can create dynamic dashboards that display real-time data on service performance. These dashboards include: @@ -105,4 +106,4 @@ Grafana’s capabilities allow us to analyze historical data to understand syste ## Conclusion -To sum up, Nghenhan's decision to adopt a centralized monitoring system powered by Grafana and Prometheus is a testament to its dedication to providing a reliable and efficient trading platform. 
By focusing on real-time monitoring and ensuring data synchronization, Nghenhan can proactively identify and resolve potential issues, minimizing downtime and financial losses for its users. This monitoring system not only bolsters Nghenhan's operational capabilities but also serves as a foundation for future growth. +To sum up, Nghenhan's decision to adopt a centralized monitoring system powered by Grafana and Prometheus is a testament to its dedication to providing a reliable and efficient trading platform. By focusing on real-time monitoring and ensuring data synchronization, Nghenhan can proactively identify and resolve potential issues, minimizing downtime and financial losses for its users. This monitoring system not only bolsters Nghenhan's operational capabilities but also serves as a foundation for future growth. \ No newline at end of file diff --git a/Use Cases/crypto-market-outperform-chart-rendering.md b/Use Cases/crypto-market-outperform-chart-rendering.md index 540188c2..f8469449 100644 --- a/Use Cases/crypto-market-outperform-chart-rendering.md +++ b/Use Cases/crypto-market-outperform-chart-rendering.md @@ -1,10 +1,11 @@ --- -title: "Visualizing crypto market outperform BTC-Alt indicators with Golang" +title: "Visualizing crypto market performance: BTC-Alt dynamic indicators in Golang" date: 2024-11-18 tags: - data + - blockchain + - fintech - crypto - - golang description: "Implementing a Golang-based visualization for crypto market performance indicators, focusing on Bitcoin vs Altcoin dynamics and trading strategy effectiveness through interactive charts and data analysis" authors: - bievh diff --git a/Use Cases/data-archive-and-recovery.md b/Use Cases/data-archive-and-recovery.md index 1cea1d5f..32f3499c 100644 --- a/Use Cases/data-archive-and-recovery.md +++ b/Use Cases/data-archive-and-recovery.md @@ -1,18 +1,17 @@ --- -title: "Setup data recovery with archive strategy" +title: "Building a data archive and recovery strategy for high-volume trading system" date: 2024-12-13 tags: - - data-archive - - data-recovery - - data-safeguarding -description: "A guide to implementing data archival and recovery strategies for high-volume transactional application" + - 'data-engineering' + - fintech + - blockchain + - finance +description: "A guide to implementing data archival and recovery strategies for high-volume transactional application." authors: - bievh --- -### Data safeguarding strategies ---- - +## Data safeguarding strategies Data is an important part of software development and one of the most valuable assets for any organization, especially in economics and finance. Along with the growth of business models, a large amount of data is generated diversely. Keeping data safe is critical for business. Data is lost or becomes wrong, which can cause irreversible loss. For example, in the banking system or stock marketplace, if one transaction record is missed, it can lead to a chain of consecutive wrong behaviors. It can even cause money losses for individual users as well as organizations. In software development, we have many strategies to safeguard the data. Depending on each use case, some strategies can be listed here: @@ -26,9 +25,8 @@ Each strategy aligns with specific stages of the data lifecycle and can be combi This overview provides some strategies to protect the data safely. We will continue delving into a more specific problem in the rest of this post, and explore the way we deal with it. 
-### Problem with storing large amounts of transactional data that are not accessed frequently ---- -#### What problem are we solving? +## Problem with storing large amounts of transactional data that are not accessed frequently +### What problem are we solving? Imagine you are developing a financial application that produces tens of thousands of transactions per day by users because of their cryptocurrency trades. These transactions are firstly stored in the data lake as raw records. Once a trade round (normally 3 months of trading) is over, the raw transactions in this period will be used to produce the final reports and persist to the database. After this time, these records will not be used anymore in both the trading process and summary calculation except for data auditing or report recovery in the future. @@ -36,8 +34,7 @@ Once the project continues running, the amount of data becomes bigger. It requir This situation lets us think about the data archive which is a strategy helping to offload unused data into long-term storage at minimal cost. -#### Data archive, why do we need it? - +### Data archive, why do we need it? We first take a look at **data backup** which is the cyclic process of duplicating the entire or a part of data, wrapping it in a stable format then storing it in a secure place. This process is scheduled periodically to make sure we always have at least a copy of production data readies to restore at any time one issue happens. ![alt text](assets/data-backup-and-restore.png) @@ -54,8 +51,7 @@ While backup comes from production data hotfix problems, data archive focuses on By implementing the **data archive** strategy, we can decrease the live database's pressure significantly and optimize storage costs while still ensuring long-term data availability. -#### Recovery using archived data - +### Recovery using archived data Archived data is normally not used for production data hotfix or rollback application state. Instead, it is used to recover critical data such as data snapshots, market reports, or even legal matters like audits. This progress is often executed manually. This means that the data is archived automatically each time it persists after an operational phase is completed. This period is determined depending on your application. It can be monthly, yearly, or each trading round in the trading application. However, once the recovery is required to execute, it should be run manually by the administrator. @@ -69,9 +65,7 @@ This approach has some advantages: - Often results in cleaner data since it goes through current business rules - Can be useful for audit purposes -### Implementing archive-based recovery strategy for trading application ---- - +## Implementing archive-based recovery strategy for trading application Back to the first example that was used at the second part to raise the problem. Assume we have a high-frequency cryptocurrency trading platforms that produce 50,000 transactions per day, which accumulates to approximately 4.5 million transactions in a single trading cycle of three months. At an average size of 2KB per transaction, this translates to nearly 9GB of raw data every cycle. 
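A quick back-of-envelope check of those figures:

```python
# Sizing check for the figures quoted above.
tx_per_day = 50_000
days_per_cycle = 90                                  # one trading cycle ≈ 3 months
avg_tx_size_kb = 2

tx_per_cycle = tx_per_day * days_per_cycle           # 4,500,000 transactions
raw_gb = tx_per_cycle * avg_tx_size_kb / 1_000_000   # ≈ 9 GB of raw records per cycle
```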
To deal with this situation, we can design a simple archive and recovery strategy as following: @@ -102,18 +96,16 @@ flowchart TD ``` *Figure 3: diagram to visualize the workflow of a archive and recovery strategy implementation to resolve the problem with the data of high-frequency cryptocurrency trading platforms* -- Archiving Workflow: - - After each trading cycle (e.g., 3 months), transactional records are processed and moved to cloud-based storage. These records are compressed and encrypted for security and cost optimization. - - Metadata for these archived transactions is maintained in a lightweight database for quick lookup. +**Archiving Workflow**: +- After each trading cycle (e.g., 3 months), transactional records are processed and moved to cloud-based storage. These records are compressed and encrypted for security and cost optimization. +- Metadata for these archived transactions is maintained in a lightweight database for quick lookup. -- Recovery Workflow: - - When data is required, administrators search the metadata for the relevant archive. - - The archived records are retrieved and reprocessed using a dedicated compute environment to generate the required reports or validate metrics. - - If needed, the processed data can be restored to a separate database instance for further analysis without affecting the production environment. - -### Conclusion ---- +**Recovery Workflow**: +- When data is required, administrators search the metadata for the relevant archive. +- The archived records are retrieved and reprocessed using a dedicated compute environment to generate the required reports or validate metrics. +- If needed, the processed data can be restored to a separate database instance for further analysis without affecting the production environment. +## Conclusion From this discussion, we have seen how archive and recovery strategy can address specific challenges such as efficiently handling large volumes of rarely accessed data. Implementing a robust archive and recovery system provides several benefits, including long-term data availability, cost-effective storage, and support for audits or legal requirements. This strategy is particularly valuable for industries like finance, healthcare, and e-commerce, where data integrity and accessibility are critical. -This knowledge is essential for system architects, database administrators, and developers who manage large-scale applications with growing data needs. Understanding and implementing this strategy equips teams to handle data growth effectively, ensuring their systems remain reliable, secure, and future-ready. +This knowledge is essential for system architects, database administrators, and developers who manage large-scale applications with growing data needs. Understanding and implementing this strategy equips teams to handle data growth effectively, ensuring their systems remain reliable, secure, and future-ready. \ No newline at end of file diff --git a/Use Cases/database-hardening-for-trading-platform.md b/Use Cases/database-hardening-for-trading-platform.md index a58b8f05..c5e97f95 100644 --- a/Use Cases/database-hardening-for-trading-platform.md +++ b/Use Cases/database-hardening-for-trading-platform.md @@ -4,18 +4,17 @@ authors: date: '2025-01-02' description: 'Discover how a trading platform mitigated database access risks, enhanced security, and ensured data integrity through role-based access control, network isolation, MFA, and robust logging. 
Learn about the strategies and tools, like Teleport, that transformed operational efficiency and reinforced client trust.' tags: - - 'security' + - 'blockchain' + - 'fintech' - 'database' - 'case-study' title: 'Database hardening for a trading platform' --- ## Introduction - Database vulnerabilities are a silent threat in trading platforms. They lurk in unrestricted access controls, posing risks of data breaches, operational disruptions, and loss of client trust. This case study examines how we identified these risks and implemented a structured, practical approach to mitigate them. By integrating tools like Teleport, enforcing strict access controls, and embedding detailed logging mechanisms, we significantly enhanced our security posture and operational resilience. ## Problem statement - Every trading platform depends on its database to handle sensitive operations—from storing client funds to managing trade records. Yet, our initial access controls had critical gaps: **Unrestricted access to sensitive data** @@ -56,11 +55,9 @@ These activities must be conducted under strict safeguards to prevent "oops" mom | **Operational cost** | Increased expenses for data recovery, incident response, and breach mitigation | Lack of log trails and recovery mechanism | ## Proposed approach - Addressing these risks required a phased approach. Each step introduced a new layer of security, designed to mitigate specific vulnerabilities. -### **Role-based access control** - +### Role-based access control Unrestricted developer access was the root cause of several risks. To address this: - Enforce least-privilege principles: Developers accessed only the data essential to their roles. @@ -69,35 +66,30 @@ Unrestricted developer access was the root cause of several risks. To address th - **Write permissions**: Granted only with explicit, time-limited approval. - Provide standby databases: Developers used a read-only copy of the production database for debugging. -### **Network isolation** - +### Network isolation Open access points created opportunities for unauthorized interactions with the database. To minimize exposure: - Restricted database access to approved endpoints or IP addresses. - Mandated VPN usage or secure proxy connections for all database interactions. -### **Multi-factor authentication** - +### Multi-factor authentication Insufficient authentication measures left accounts vulnerable to compromise. Implementing MFA added an extra layer of security by requiring developers to verify their identities using multiple factors before accessing the database. -### **Data masking** - +### Data masking To further protect sensitive data, even when accessed by authorized personnel, we implemented data masking: - **Selective masking**: Sensitive data like client Personally Identifiable Information (PII) or financial details were masked or obfuscated. - **Granular control**: Masking rules were applied based on user roles and specific data fields. - **Dynamic masking**: Data was masked in real-time during queries, ensuring that sensitive information was never exposed in its raw form. -### **Database observability and audit logging** - +### Database observability and audit logging Lack of visibility into database interactions hindered accountability. To address this, we: - **Implemented robust logging**: Tracked every database interaction, including queries, data changes, and administrative actions. 
- **Set up alerts**: Suspicious activities, such as bulk deletions or schema modifications, triggered instant notifications. - **Made logs tamper-proof**: Ensured secure storage to prevent alterations. -### **Break glass access** - +### Break glass access In emergencies, developers needed immediate access to resolve critical issues. However, such access carried risks if not carefully managed. We implemented a "break-glass" process: - **Multi-party approval**: Emergency access required sign-offs from multiple stakeholders. @@ -105,7 +97,6 @@ In emergencies, developers needed immediate access to resolve critical issues. H - **Comprehensive logging**: Every action during emergency access was logged for accountability. ## Technical implementation - ### System architecture We used [**Teleport**](https://goteleport.com/) as the central platform for managing access controls and monitoring database interactions. The architecture featured: @@ -130,7 +121,6 @@ We used [**Teleport**](https://goteleport.com/) as the central platform for mana 4. Alerts were sent to the security team for any suspicious activities. ### Masking data - We hide some sensitive information in our tables to keep data safe. Most of these fields stay hidden forever. However, a few can be accessed with special permissions when needed. Right now, we use [postgresql-anonymizer](https://postgresql-anonymizer.readthedocs.io/en/latest/) for data masking and follow this process: 1. **Identify the table**: Find out which table you need access to. @@ -139,7 +129,6 @@ We hide some sensitive information in our tables to keep data safe. Most of thes For example, if you need to see hidden fields in the `deposits` table, request the `unmasked_deposits` role. ### Request a new role for extensive access - If there is a special request for an action beyond the permissions of the existing role, the requester must follow this protocol to perform the action: ```mermaid diff --git a/Use Cases/enhancing-cryptocurrency-transfer-logger.md b/Use Cases/enhancing-cryptocurrency-transfer-logger.md index d1f81e33..f1ed099c 100644 --- a/Use Cases/enhancing-cryptocurrency-transfer-logger.md +++ b/Use Cases/enhancing-cryptocurrency-transfer-logger.md @@ -2,14 +2,15 @@ title: "Transfer mapping: enhancing loggers for better transparency" date: 2024-11-18 tags: - - data + - 'data-engineering' - blockchain + - fintech description: A comprehensive guide on improving cryptocurrency transfer logging systems to provide better transparency and traceability for users and developers. authors: - bievh --- -### **What is a logger, and why does it matter?** +## What is a logger, and why does it matter? A **logger** is a fundamental component of modern software systems, designed to record system events, user actions, and issues in real-time. It’s like the memory of an application, enabling both users and administrators to trace activities. Loggers serve two main purposes: @@ -18,25 +19,19 @@ A **logger** is a fundamental component of modern software systems, designed to Without a logger, understanding the flow of actions or diagnosing issues would be like navigating a dark room without a flashlight. ---- - -### **What makes an effective logger?** - +## What makes an effective logger? An effective logger goes beyond simply storing data. It organizes and presents information in a way that’s **useful and easy to understand**. 
To be effective, a logger must have the following qualities: -- **Consistency in format** - Logs should maintain a uniform structure across the application, much like a well-organized manual where every chapter follows the same layout. This consistency makes it easier to identify patterns and quickly interpret information. +### Consistency in format +Logs should maintain a uniform structure across the application, much like a well-organized manual where every chapter follows the same layout. This consistency makes it easier to identify patterns and quickly interpret information. -- **Clarity and self-documentation** - Logs should be *self-explanatory*, requiring little to no additional context to understand their meaning. For instance, a good log entry is like a well-written headline: concise, clear, and informative. +### Clarity and self-documentation +Logs should be *self-explanatory*, requiring little to no additional context to understand their meaning. For instance, a good log entry is like a well-written headline: concise, clear, and informative. -- **Purposefulness and informativeness** - Every log entry should serve a purpose. For example, instead of simply stating, "Transfer completed," a log should provide actionable insights, such as the accounts involved, the amount transferred, and timestamps. - ---- - -### **Context: transfer logs in cryptocurrency applications** +### Purposefulness and informativeness +Every log entry should serve a purpose. For example, instead of simply stating, "Transfer completed," a log should provide actionable insights, such as the accounts involved, the amount transferred, and timestamps. +## Context: transfer logs in cryptocurrency applications Imagine a **cryptocurrency trading application** that enables users to manage multiple accounts on one platform. One of its core features is handling **transfers**, which can be categorized as follows: - *Deposits*: Funds added to an account from an external source. @@ -58,19 +53,14 @@ Account_B | +1000 USDT | 2024-01-01 10:00:01 From this log, users cannot deduce that the two entries are part of the same transfer. This ambiguity can cause confusion, especially in financial applications where clarity and transparency are paramount. ---- - -### **Why is this problematic?** - +## Why is this problematic? The lack of clear relationships between log entries creates the following issues: 1. **User confusion**: Without context, users may struggle to understand the flow of their funds. 2. **Reduced trust**: Ambiguous logs can erode user confidence, especially in financial systems. 3. **Limited debugging capability**: Developers and support teams cannot efficiently diagnose issues or trace transactions without meaningful, connected data. ---- - -### **Why does this happen? A look at the current system** +## Why does this happen? A look at the current system The existing system focuses on individual transactions, treating withdrawals and deposits as **isolated events**. The process is outlined below: @@ -111,9 +101,7 @@ flowchart LR This method records events but fails to link related transactions. For example, a withdrawal from one account and a deposit into another might appear as two separate, unrelated logs. ---- - -### **A solution: enhanced logging system** +## A solution: enhanced logging system To resolve these limitations, we propose an **enhanced logging system** that links related transactions and provides a clear view of asset movement. 
The process is illustrated below: @@ -146,9 +134,7 @@ flowchart TD TWB --> FR ``` ---- - -### **Key steps in the enhanced system** +## Key steps in the enhanced system 1. **Data sources (input)** 2. **Future incomes**: Primary source of transfer records containing the raw transaction data including amounts, timestamps, and account IDs @@ -233,9 +219,7 @@ GREATEST(0, ( - Calculates both before and after states for each transfer - Uses window functions for running totals within groups ---- - -### **Benefits of the enhanced logger** +## Benefits of the enhanced logger 1. **Enhanced clarity** 2. Logs clearly link related transactions. @@ -259,4 +243,4 @@ The proposed enhancements transform the logging system from disconnected entries - **Clarity**: Clear relationships between transactions. - **Purposefulness**: Meaningful, actionable data for users and developers. -This system not only addresses current logging limitations but also sets a solid foundation for future improvements in transaction tracking and user notifications. +This system not only addresses current logging limitations but also sets a solid foundation for future improvements in transaction tracking and user notifications. \ No newline at end of file diff --git a/Use Cases/implement-binance-future-pnl-analysis-page.md b/Use Cases/implement-binance-future-pnl-analysis-page.md index d72f2b1d..6718ca80 100644 --- a/Use Cases/implement-binance-future-pnl-analysis-page.md +++ b/Use Cases/implement-binance-future-pnl-analysis-page.md @@ -1,11 +1,11 @@ --- -title: "Implement Binance Futures PNL Analysis page by Phoenix LiveView" +title: "Implement Binance Futures PNL analysis page by Phoenix LiveView" date: 2025-01-15 tags: - - phoenix-live-view - - binance + - blockchain + - fintech - future-pnl-calculation - - data + - phoenix-live-view description: "Implementing Binance Futures PNL Analysis page with Phoenix LiveView to optimize development efficiency. This approach reduces the need for separate frontend and backend resources while enabling faster real-time data updates through WebSocket connections and server-side rendering." 
authors: - minhth @@ -13,7 +13,6 @@ authors: As Binance doesn't allow Master Account see MSA account Future PNL Analysis, so we decide to clone Binance Future PNL Analysis page with Phoenix Live View to show all Account Future PNL ## Why we use Phoenix Live View for Binance Future PNL Analysis page - ### Real-Time Data Handling - Phoenix Live View has built-in Websocket management so we can update data realtime with price or position update - Efficient handling of continuous data streams from Binance @@ -35,7 +34,6 @@ As Binance doesn't allow Master Account see MSA account Future PNL Analysis, so - Simplified state management ## How to optimize query with timescale - ### Data Source Base on [Binance Docs](https://www.binance.com/en/support/faq/how-are-pnl-calculated-on-binance-futures-and-options-pnl-analysis-dbb171c4db1e4626863ec8bc545be46a) we have compound data from 2 timescale tables: `ts_user_trades` and `ts_future_incomes` @@ -124,7 +122,7 @@ Because timescale table will be spitted into multiple chunks so if we use normal - With this cronjob it will help us no need to recalculate from 2 tables every request so it will be fast and don't make database pressure -## Screenshots +## User interface implementation ![Overview](assets/analysis-page/overview.jpg) *Figure 1: Future PNL Analysis Overview Tab* @@ -135,4 +133,4 @@ Because timescale table will be spitted into multiple chunks so if we use normal *Figure 3: Future PNL Analysis Symbol Tab* ![Symbol Analysis](assets/analysis-page/funding-and-transaction.png) -*Figure 4: Future PNL Funding and Transaction Tab* +*Figure 4: Future PNL Funding and Transaction Tab* \ No newline at end of file diff --git a/Use Cases/migrate-normal-table-to-timescale-table.md b/Use Cases/migrate-normal-table-to-timescale-table.md index 1b1013ba..ce8493af 100644 --- a/Use Cases/migrate-normal-table-to-timescale-table.md +++ b/Use Cases/migrate-normal-table-to-timescale-table.md @@ -2,10 +2,10 @@ title: "Migrate regular tables into TimescaleDB hypertables to improve query performance" date: 2025-01-15 tags: - - postgresql + - 'data-engineering' + - blockchain + - fintech - timescaledb - - database - - hedge-foundation description: "How do we migrate normal table to timescale table to optimized data storage" authors: - minhth diff --git a/Use Cases/persist-history-using-data-snapshot-pattern.md b/Use Cases/persist-history-using-data-snapshot-pattern.md index 906be601..6a17d10a 100644 --- a/Use Cases/persist-history-using-data-snapshot-pattern.md +++ b/Use Cases/persist-history-using-data-snapshot-pattern.md @@ -1,8 +1,10 @@ --- -title: "Implementing data snapshot pattern to persist historical data" +title: "Implementing data snapshot pattern to persist historical data" date: 2024-12-11 -tags: - - data-persistence +tags: + - 'data-engineering' + - fintech + - blockchain - snapshot-pattern description: "A technical exploration of implementing the data snapshot pattern for efficient historical data persistence" authors: @@ -21,11 +23,8 @@ The common point is the requirement of reporting each time a phase is completed. The "report" and "summary" are the "historical data" that is mentioned in the title. This post is the strange yet familiar journey to remind the name of the techniques we use to "persist historical data" in our transactional system. ### Data persistence, why do we need to persist our historical data? ---- - The longevity of data after the application or process that created it is done or crashed is the data persistence. 
Once data is persisted, each time the application is opened again, the same data is retrieved from the data storage, giving a seamless experience to the user no matter how much time has passed. - ![alt text](assets/binance-order-history.png) *Figure 1: Order history on the Binance, a popular trading marketplace. Each transaction is affected by the asset's price. If it is not persisted, it may wrong in the next query* @@ -36,12 +35,10 @@ But the marketplace is more than just normal transactions. Imagine, I am not a n The idea of long-time reports brings us to another question. Unlike a nascent market like cryptocurrencies, the stock market has hundreds of years of history. How much computing power is enough for us to analyze the market trend over 50 years? This question is another aspect that we will answer later. ### Snapshot pattern, data snapshot ---- - A snapshot aka memento is normally mentioned as a complete copy of a dataset or system at a specific point in its life cycle. As a "photograph" of the data that represents its exact state at that moment. It is typically used in rolling back to the previous object's state once something goes wrong; running tests or analysis data in the production-like environment; or backup and preserving data for compliance or audit purposes. Data snapshot means the applying of snapshot pattern in the data processing. Depending on the use case, we choose proper strategy type of snapshot. ![alt text](assets/aws-ebs-snapshot-function.png) -*Figure 2: The snapshot function in the EBS that is the virtual storage in AWS. This function grants us the ability to take a full snapshot of entire data and store it in the Amazon Simple Storage Service to be used to recover this state at any time* +*Figure 2: AWS EBS snapshot mechanism showing how complete data states are captured and stored in S3 for recovery purposes* These are 3 popular types of data snapshots. Firstly, the **full snapshot** copies all data in the system at a specific time. Because of its data integrity characteristic, it is often used in backup and restore data before major upgrades, compliance audits that require the complete system state, or data warehouse periodic loads. @@ -61,8 +58,6 @@ Imagine our system has implemented all 3 types of snapshot strategies: full snap After Tuesday, a problem happens in the system that requires us to recover data. We can choose to use the incremental Tuesday snapshot to recover Tuesday's data or use the differential Tuesday snapshot to recover data for both Monday and Tuesday. It depends on the use case and your strategy. ## Use snapshot pattern to persist historical data ---- - Back to the problem at the beginning of this post, we don't want to re-calculate our finance report which is affected by multiple factors over time, each time a user makes a new request. So we need to persist the summary to historical data. In simple words, it is the progress of collecting transactional data, aggregating and calculating the report for each period. Finally, we store these reports in the database as snapshot records. You may think that this progress looks familiar to your experiences when developing applications in your career. And it does not relate to any strategy that is mentioned above. If you feel it, you are right but wrong. Firstly, it is actually a normal practice when developing this type of application. We implement this feature as a feasible part of our database. But rarely think about it seriously. 
Second, the next story is one of the cases in which we use an **incremental snapshot** to persist our historical data. Let's go to our hypothetical problem.
@@ -100,10 +95,8 @@ Instead of recalculating the sum of profit and loss from a massive volume of tra
We can consider that we are incidentally using DB-level **memoization**. That is an optimization technique used in programming where the results of expensive function calls are stored in a cache. When the function is called again with the same inputs, the result is retrieved from the cache instead of being recomputed.

### Conclusions
----
-
The snapshot pattern is deceptively simple, which is why it is often overlooked during development. We often focus on solving immediate problems and implementing features without considering the long-term implications of recalculation and data inconsistency. The simplicity of this approach masks its power to address complex issues like historical data accuracy and computational efficiency. It ensures accurate and consistent data, reduces computational overhead, and improves user experience by delivering faster query responses. Moreover, by persisting historical data, businesses gain a robust foundation for long-term analytics, such as trend analysis and strategic decision-making. It highlights the importance of viewing data as a long-term asset, requiring strategies like snapshots to ensure its reliability and usability over time.

-Whether you are building financial applications, e-commerce platforms, or any system requiring accurate historical records, understanding and applying the snapshot pattern can elevate your application's performance and reliability.
+Whether you are building financial applications, e-commerce platforms, or any system requiring accurate historical records, understanding and applying the snapshot pattern can elevate your application's performance and reliability.
\ No newline at end of file
diff --git a/Use Cases/reconstructing_trading_pnl_data_pipeline_approach.md b/Use Cases/reconstructing_trading_pnl_data_pipeline_approach.md
index d84130af..2851b9d6 100644
--- a/Use Cases/reconstructing_trading_pnl_data_pipeline_approach.md
+++ b/Use Cases/reconstructing_trading_pnl_data_pipeline_approach.md
@@ -2,7 +2,8 @@
title: "Reconstructing historical trading PnL: a data pipeline approach"
date: 2024-11-18
tags:
-  - data
+  - 'data-engineering'
+  - fintech
  - blockchain
description: A detailed look at how we rebuilt historical trading PnL data through an efficient data pipeline approach, transforming a complex problem into a maintainable solution.
authors:
- minhth
---

## Executive summary
-
Recovering historical trading profit and loss (PnL) data is a critical challenge for finance and cryptocurrency platforms. When historical records are unavailable, users cannot validate past trading strategies, assess long-term performance, or reconcile discrepancies. This blog details how I tackled this problem by transforming a technically daunting challenge into a robust, maintainable data pipeline solution.

----
-
## Background and context
-
### What is trading PnL?

In trading, **Profit and loss (PnL)** represents financial outcomes:

For instance, when you close a Bitcoin position at a higher price than you entered, your realized PnL reflects the profit after fees. If the position is still open, unrealized PnL tracks potential outcomes as prices fluctuate.
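To make that distinction concrete, here is a minimal, editor-added Python sketch (not code from the original pipeline; the prices, quantities, and fee rate are invented values) showing how realized and unrealized PnL are typically derived for a single long futures position:

```python
# Hypothetical illustration of realized vs. unrealized PnL for one long position.
# All numbers are invented; real calculations must follow the exchange's own formulas.

entry_price = 60_000.0   # price at which the BTC position was opened (USDT)
quantity = 0.5           # position size in BTC
fee_rate = 0.0004        # assumed taker fee per fill (0.04%)

# Part of the position is closed at a higher price -> realized PnL.
closed_qty = 0.2
exit_price = 63_000.0
gross_realized = (exit_price - entry_price) * closed_qty
# Simplification: fees here cover only the closed portion's open and close fills.
fees = (entry_price * closed_qty + exit_price * closed_qty) * fee_rate
realized_pnl = gross_realized - fees

# The remainder stays open -> unrealized PnL tracks the current mark price.
open_qty = quantity - closed_qty
mark_price = 61_500.0
unrealized_pnl = (mark_price - entry_price) * open_qty

print(f"Realized PnL:   {realized_pnl:.2f} USDT")
print(f"Unrealized PnL: {unrealized_pnl:.2f} USDT")
```

For a short position the price terms flip, and in a fuller treatment funding payments and other costs would also feed into the realized figure.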
### Why does historical PnL matter?
-
Historical PnL data provides traders with:

1. **Performance insights**: Understanding which strategies worked and which didn't.
2. **Compliance and reporting**: Regulatory or internal needs often require accurate historical data.
3. **Strategy validation**: Testing new algorithms against past market conditions relies on accurate PnL records.

-### The problem at Hand
-
+### The problem at hand
While I was designing a trading PnL chart for my platform, a significant gap emerged: historical PnL data for certain periods was missing. The existing system calculated PnL in real time but didn't store intermediate data, making reconstruction impossible without extensive changes to the codebase.

----
-
## The challenge

> "How can we reconstruct historical trading PnL data efficiently when the original records no longer exist?"

This question encapsulated two core issues:

Moreover, the system's reliance on multiple data sources (trades, market prices, fees) and the sheer volume of transactions compounded the problem.

----
-
## Technical requirements

To reconstruct PnL, the following were essential:

- **Fee details**: Trading commissions, funding rates, and other costs affecting PnL.
- **Efficient processing**: Handling massive datasets without overloading system resources.

----
-
## System analysis
-
### From complex code to data flows

Instead of delving into intricate application logic, I reimagined the system as a series of **data flows**, where data is ingested, transformed, and stored across multiple layers. Below is the existing flow:

@@ -101,7 +89,6 @@ flowchart LR
- **Outputs**: Processed data powers analytics and reporting tools.

### Reconstructing the flow
-
From the above data flow, we can easily determine which parts we should reproduce to find the old PnLs.
- First, data comes from Binance
- Second, data passes through ETS before processing

@@ -154,8 +141,6 @@ flowchart LR
    C -->|Results| PNL
```

----
-
## Implementation

The reconstruction process involves five major steps:

@@ -200,8 +185,6 @@ flowchart TD
    C --> A
```

----
-
## Outstanding challenges

**Volume of data**

@@ -213,8 +196,6 @@ Minute-level Kline data is essential for accuracy, but retrieving and processing
**PnL accuracy**

Realized PnL, in particular, is tied to a trade set: the set's total PnL is obtained by accumulating closed-trade PnL and fees over time. So if we retrieve the list of user trades in random order, we may produce the wrong PnL and render our report meaningless.

----
-
## Optimization strategies
- **Time-series state reconstruction**: Our trading events naturally fall into their proper timeline. Each trade, fee, and price change finds its place in the chronological sequence, so the system can reconstruct a trading position's PnL at any moment.
- **Map-reduce**: As mentioned above, a single account takes a long time to process, so forcing all our data through a single sequential pass is impractical; a real benchmark took about 5 hours to recover one trade set. By mapping the data by trading pair, it can be processed in parallel and finished in minutes (a minimal sketch of this grouping appears after the validation list below).
@@ -240,8 +221,6 @@ To validate the reconstruction process:
- **Cross-validation**: Compare reconstructed PnL with existing records in the database.
- **Visual analysis**: Render reconstructed data onto charts to ensure trends align with expected strategies.
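As an editor-added illustration of the **Time-series state reconstruction** and **Map-reduce** points above (a hedged sketch, not the platform's actual code: the trade tuples, field layout, and worker count are all assumptions), grouping trades by trading pair lets each pair's cumulative PnL curve be rebuilt chronologically and in parallel:

```python
from collections import defaultdict
from multiprocessing import Pool

# Hypothetical trade records: (symbol, timestamp, closed_pnl, fee).
# In the real pipeline these would come from the recovered Binance trade history.
TRADES = [
    ("BTCUSDT", 1700000000, 12.5, 0.4),
    ("ETHUSDT", 1700000050, -3.2, 0.1),
    ("BTCUSDT", 1700000100, -4.0, 0.3),
    ("ETHUSDT", 1700000200, 7.8, 0.2),
]

def rebuild_pair_pnl(item):
    """Map step: rebuild one trading pair's cumulative PnL in chronological order."""
    symbol, rows = item
    rows = sorted(rows)                      # sort by timestamp (time-series reconstruction)
    running, curve = 0.0, []
    for ts, closed_pnl, fee in rows:
        running += closed_pnl - fee          # accumulate closed-trade PnL minus fees
        curve.append((ts, round(running, 2)))
    return symbol, curve

if __name__ == "__main__":
    # Group trades by trading pair so each pair is an independent unit of work.
    by_pair = defaultdict(list)
    for symbol, ts, closed_pnl, fee in TRADES:
        by_pair[symbol].append((ts, closed_pnl, fee))

    # Map: process pairs in parallel. Reduce: merge per-pair curves into one total.
    with Pool(processes=2) as pool:
        results = pool.map(rebuild_pair_pnl, by_pair.items())

    for symbol, curve in results:
        print(symbol, curve)
    total = sum(curve[-1][1] for _, curve in results if curve)
    print("Account-level PnL:", round(total, 2))
```

In the real pipeline the same idea applies at a much larger scale: each mapped group is handed to a separate worker, and the reduce step merges the per-pair curves into the account-level report that the charts consume.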
----
-
## Conclusion
This case study highlights the power of a **data-centric approach** in solving financial system problems. By treating the challenge as a structured data pipeline problem, we avoided risky codebase modifications and developed a robust, scalable solution. Techniques such as parallel processing, time-series reconstruction, and efficient data retrieval were key to solving the problem within system constraints.