iaBIH · AbuAttieh · Jun 3, 2024
diff --git a/README.md b/README.md
@@ -5,30 +5,7 @@ without input datasets. The method leverages available statistics to create attr
 
 ![](https://github.com/iaBIH/synth-md/blob/main/resources/RDsStats.png)
 
-The tool is used to generate [synthetic datasets](https://github.com/iaBIH/synth-md/edit/main/rare_disease_datasets) with 187,709 patients for three popular rare diseases i.e. [Sickle Cell Disease](https://en.wikipedia.org/wiki/Sickle_cell_disease#:~:text=Sickle%20cell%20disease%20(SCD)%20is,like%20shape%20under%20certain%20circumstances.) (SCD, ORPHA code: 232), [Cystic Fibrosis](https://en.wikipedia.org/wiki/Cystic_fibrosis) (CF, ORPHA code: 586), and [Duchenne Muscular Dystrophy](https://en.wikipedia.org/wiki/Duchenne_muscular_dystrophy) (DMD, ORPHA code: 98896). 
-Each dataset has 10+ attributes including patient personal information and clinical parameters. 
-
-![](https://github.com/iaBIH/synth-md/blob/main/resources/SampleData.png)
-
-The synthetic data follow the input census and disease statistics with high accuracy. 
-
-<p float="left">
-<img src="https://github.com/iaBIH/synth-md/blob/main/resources/result_Gender.png" width="400">
-<img src="https://github.com/iaBIH/synth-md/blob/main/resources/result_Race.png" width="400">
-<img src="https://github.com/iaBIH/synth-md/blob/main/resources/result_Age.png" width="400">
-</p>
-
-### Citation: 
-
-This tool and the datasets are described in this paper: [Synthetic Datasets for Software Development in Rare Disease Research]() (to be published). 
-
-### Features:
-
- * No need for input datasets (only basic statistics are needed).
- * Very fast, can generate thousands of synthetic patient data in a few seconds.
- * Can generate synthetic data for a single disease or multiple diseases at the same time.
- * Adding a new disease is simply done by modifying a JSON file (see the sections below).
- * The synthetic data follow the input census and disease statistics with high accuracy. 
+This repository contains the code for generating three [synthetic datasets](https://github.com/iaBIH/synth-md/edit/main/rare_disease_datasets), one for each of three popular rare diseases: [Sickle Cell Disease](https://en.wikipedia.org/wiki/Sickle_cell_disease) (SCD), [Cystic Fibrosis](https://en.wikipedia.org/wiki/Cystic_fibrosis) (CF), and [Duchenne Muscular Dystrophy](https://en.wikipedia.org/wiki/Duchenne_muscular_dystrophy) (DMD). The datasets contain demographic data and selected clinical parameters.
 
 ## Repository Structure:
 
@@ -63,38 +40,40 @@ This tool and the datasets are described in this paper: [Synthetic Datasets for
                    ├── MDutils.py: Utilities 
                    └── synthMD.py: Setup                
 
-## Installation: 
+## Installation Guide
+
+Follow these steps to set up SynthMD and start generating synthetic datasets.
 
-  1. To import the U.S.A census data, one needs to get API key from here: https://api.census.gov/data/key_signup.html after that the key will be submitted to the email and needs activation. Some census variables may need updating. Check the census website for details and modify the MDimport file if necessary (or open a new issue and we will update them).
+### 1. Obtain Your Census API Key
+To import US Census data, you'll need an API key:
+- Visit the [Census API Signup Page](https://api.census.gov/data/key_signup.html) to request an API key.
+- You will receive the API key by e-mail; it must be activated before use.
+- Some census variables may need updating. Check the Census website for details and modify the `MDimport` file if necessary.
 
-  2. Download and install SynthMD: 
+### 2. Download and Install SynthMD
 
               git clone https://github.com/iaBIH/synth-md.git
               cd synth-md
               pip install . --user 
 
+### 3. Insert Your Census API Key
+Replace `None` with your Census API key in this [line](https://github.com/iaBIH/synth-md/blob/73abf642d45b895a608644c3728bc1730dd8d770/example.py#L5) of the example script.
+Note that the key must be inserted as a string, i.e. enclosed in quotes.
 
-## Example:
+### 4. Execute the Code
+Run the example script to start the data generation process.
 
-   The provided [example](https://github.com/iaBIH/synth-md/blob/main/example.py) file shows how to use the tool: 
+             python example.py
 
-   1. Get a census API from https://api.census.gov/data/key_signup.html
-   2. Replace 'None' by your census API in this [line](https://github.com/iaBIH/synth-md/blob/73abf642d45b895a608644c3728bc1730dd8d770/example.py#L5) in the example:
-
-              censusAPIKey= None 
+### File Locations
+- **Downloaded US Census files:** `datasets` folder.
+- **Generated synthetic datasets:** `output` folder.
 
-   3. Run the following lines in your terminal
+_You can find three generated example files here: [Example files](https://github.com/iaBIH/synth-md/blob/main/output)._
 
-             cd synth-md
-             python example.py
-
-      The downloaded files from census will be saved in datasets folder. The generated synthetic datasets will be saved in output folder.
+## Extending the Code
 
-## Generating synthetic data for a new disease:
-
-  To add a new disease using its statistics related to the U.S.A, 
-  modify the file [RDsDataUSA.json](https://github.com/iaBIH/synth-md/blob/main/config/RDsDataUSA.json) and create a new disease similar to the ones available 
-  e.g. copy/paste one of the diseases and change the values: 
+To generate data for a new rare disease, modify the file [RDsDataUSA.json](https://github.com/iaBIH/synth-md/blob/main/config/RDsDataUSA.json) and add a new disease configuration similar to the ones already included:
 
                   {
                   "RDID": 4,
@@ -156,30 +135,27 @@ This tool and the datasets are described in this paper: [Synthetic Datasets for
                   }
 
 
-  To add a new disease for a different country/area, you should create a new config file similar to 
-  [config/configUSA.json](https://github.com/iaBIH/synth-md/blob/main/config/configUSA.json) and use it. In the new config file, you should provide census data with the 
-  same format as the one provided for the U.S.A:
+  To add statistics for a new geography, create a new config file similar to 
+  [config/configUSA.json](https://github.com/iaBIH/synth-md/blob/main/config/configUSA.json). The new config must provide census data in the same format as the files used for the U.S.A.:
 
-     - states-race_ext.csv: race information 
-     - states-age-sex: age and sex information for male, female and both
+- **states-race_ext.csv:** race information
+- **states-age-sex:** age and sex information for male, female and both
 
-  Modify [example.py](https://github.com/iaBIH/synth-md/blob/main/example.py) and disable import, preparation (the evaluation part is optional) e.g. 
+Then modify [example.py](https://github.com/iaBIH/synth-md/blob/main/example.py) to disable the import and preparation steps (the evaluation step is optional), e.g.:
 
         doImport     = 0
         doPrepare    = 0
         doCreate     = 1 
         doEvaluation = 1    
 
-  After that, you can use RDcreate.py to create the synthetic data as shown in the section above.  
-  You can also automate the process by importing the census data directly but you will have to modify the files
-  RDimport and RDprepare.
+After that, you can use MDcreate.py to create synthetic data using the newly provided statistics.
 
+## Citation: 
 
+This tool and the datasets are described in this paper: [Synthetic Datasets for Software Development in Rare Disease Research]() (to be published). 
 
 ## License
 
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).
 
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
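Review note on step 3 of the installation guide, which requires the Census API key to be a string. The following is a minimal sketch of one way to avoid hard-coding the key, assuming it is exported in an environment variable. The variable name `CENSUS_API_KEY` and the helper `load_census_key` are hypothetical and not part of SynthMD; only the name `censusAPIKey` comes from `example.py`.

```python
import os


def load_census_key(env=None, var_name="CENSUS_API_KEY"):
    """Return the Census API key as a non-empty string, or raise ValueError.

    Hypothetical helper: `var_name` is an assumed environment variable name.
    """
    env = os.environ if env is None else env
    key = env.get(var_name)
    # Reject a missing key (None) as well as an empty or blank string.
    if not isinstance(key, str) or not key.strip():
        raise ValueError(f"Set {var_name} to your Census API key (a string).")
    return key.strip()


# In example.py this could replace the hard-coded value, e.g.:
# censusAPIKey = load_census_key()
```

Failing early with a clear message may be easier to diagnose than a rejected request during the census import.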
diff --git a/synthMD/MDcharts.py b/synthMD/MDcharts.py
@@ -3,20 +3,23 @@
 import matplotlib.patheffects as PathEffects
 from synthMD import MDutils 
 
-# saving chart data to csv file for external processing if needed
+# Function to save chart data to a CSV file for external processing
 def saveChartData(inputData,figPath):
 
+    # Create default values if none are provided
     if inputData[2] is None:
        inputData[2] =list(range(len(inputData[0])))
     if inputData[3] is None:
        inputData[3] =list(range(len(inputData[0])))
-
+
+    # Determine if the input data includes state labels
     if len(inputData[0]) == len(inputData[-1]):
        data = [[x,y,xT,xTLbl] for x,y,xT,xTLbl in zip(*inputData)]
     else:
        X,Y,Xticks,statesLabels = inputData
        data = [[x,y,xT] for x,y,xT in zip(X,Y,Xticks)] 
 
+    # Create a CSV file path
     fnmPath = figPath[:-4] + ".csv"
     with open(fnmPath, 'w') as f:
         for row in data:
@@ -25,21 +28,25 @@ def saveChartData(inputData,figPath):
 
     f.close()
 
+# Function to plot data as a bar chart of counts or percentages
 def plotData(Y, figTitle=None, XticksLabelsLst=None, isPercentageOutput=None, doShow=None, chartFnmPath=None, szW=20, szH=10):
 
+        # Set default values if none are provided
         doShow = 1 if doShow is None else doShow     
         isPercentageOutput=0 if isPercentageOutput is None else isPercentageOutput
 
         figTitle = "chart" if figTitle is None else figTitle
-
+
+        # Convert data to percentages if requested
         Y = [ x/sum(Y)*100 for x in Y] if isPercentageOutput else Y
         Ylabel= 'percentage %' if isPercentageOutput else 'counts'
 
         X = list(range(len(Y)))
         Xticks = list(range(len(X)))        
         XticksLabelsLst = X if XticksLabelsLst is None else XticksLabelsLst   
         XlabelRotation= 60 if len(X)>50 else 0     
-
+
+        # Set up the figure
         plt.clf()
         plt.gcf().set_size_inches(szW,szH)
         plt.title(figTitle, fontsize=20)
@@ -49,21 +56,24 @@ def plotData(Y, figTitle=None, XticksLabelsLst=None, isPercentageOutput=None, do
         plt.margins(x=0, y=0)
         plt.xticks(Xticks, fontsize=8, rotation=XlabelRotation, labels=XticksLabelsLst)
         plt.bar(X,Y)
-
+
+        # Save the chart to a file if a path is provided
         if not chartFnmPath is None:            
            plt.savefig(chartFnmPath)           
            saveChartData([X,Y,Xticks,XticksLabelsLst],chartFnmPath)
 
+        # Display the chart if requested
         if doShow:
            plt.show()        
 
         plt.close()
 
-# plotting a map and data as a color map
-# input is a list of lists: data = [ [dataName,Value],... ]
+# Function to plot a map with data as a color map
+# Input is a dict mapping names to values: data = { dataName: value, ... }
 # dataName will be mapped to mapName in the map data
 def plotMap(input_data, cfg=None):
 
+    # Default configuration if none is provided
     cfg = cfg if not cfg is None else { "shapefile_path": "datasets/usa/map/cb_2018_us_state_20m.shp",
                                         "xylim": [-130, -60, 20, 55],
                                         "fontsize": 6,
@@ -84,7 +94,7 @@ def plotMap(input_data, cfg=None):
     # Load the shapefile using geopandas
     mapData = gpd.read_file(shapefile_path)
 
-    # names could be: state, city, zipcode, ...
+    # Build the name/value columns from the input data (names could be: state, city, zipcode, ...)
     data = {
     "state": [name for name in input_data.keys()],
     "vals": list(input_data.values())
@@ -126,34 +136,37 @@ def plotMap(input_data, cfg=None):
     # Set the title
     ax.set_title(cfg["mapTitle"])
 
+    # Save the map to a file if requested
     if cfg["doSave"]:
        plt.savefig(cfg["outputFnmPath"])
        saveChartData([[name for name in input_data.keys()],list(input_data.values()),None,None],cfg["outputFnmPath"])
 
+    # Display the map if requested
     if cfg["doShow"]:      
        # Show the plot
        plt.show()    
 
-        
+# Function to get frequency of data from a list
 def getFreqFromList(data, isPercentageOutput=None):
 
     isPercentageOutput = isPercentageOutput if not isPercentageOutput is None else 0
-    # get frequency of data 
+    # Get frequency of data 
     result = []
-    # get unique values 
+    # Get unique values 
     labels = sorted(list(set(data)))
     freq = []
     for lbl in labels:     
         count = len( [y for y in data if y == lbl ] ) 
         freq.append(count )
 
     if isPercentageOutput:
-       # get percentage instead of count
+       # Get percentage instead of count
        freq = [ (x/sum(freq)*100) for x in freq]  
 
     result = [labels, freq]    
     return result
 
+# Function to plot patient charts
 def plotPatientsCharts(p, dataLabels, dataArray, chartFolderPath, rdSName,statesLabels,  rd_datasset_size, isPercentageOutput):
 
             statesIDs, statesSName, statesLName =  MDutils.getUSAstateNames()
@@ -168,6 +181,7 @@ def plotPatientsCharts(p, dataLabels, dataArray, chartFolderPath, rdSName,states
             szH = 6
             Ylabel= 'percentage %' if isPercentageOutput else 'counts'
 
+            # Set up the figure
             plt.clf()
             plt.gcf().set_size_inches(szW,szH)
             plt.title(rdSName+" : "+pltTitle, fontsize=20)
@@ -219,20 +233,21 @@ def plotPatientsCharts(p, dataLabels, dataArray, chartFolderPath, rdSName,states
                 plt.savefig(chartFnmPath)
 
             else: 
-                #clinical parameters
+                # Clinical parameters
                 stepSize = 0.01
                 X = np.arange((np.min(L)), (np.max(L)), stepSize) if (np.max(L) - np.min(L))  < 100 else np.arange((np.min(L)), (np.max(L)), 1)   
                 n, bins, _ = plt.hist(L, bins=len(X))
                 bin_centers = (bins[:-1] + bins[1:]) / 2
                 saveChartData([bin_centers,n,None,None],chartFnmPath)
                 plt.savefig(chartFnmPath)
 
+# Function to plot death charts
 def plotDeathCharts(p, dataArray, sexLabels, racelabels, isPercentageOutput, maxUSAAge, rdSName, statesLabels, rd_datasset_size, chartFolderPath, szW,szH):
 
         raceNamesLst = racelabels[1]  
         sexLst       = sexLabels[0]
 
-        ##death per: age, state, sex, race         
+        # Death counts per age, state, sex, race
         Y1 = [len([x for x in dataArray if (x[8] not in (None, 0)) and (x[1]==a)])  for a in range(maxUSAAge+1)]
         Y2 = [len([x for x in dataArray if (x[8] not in (None, 0)) and (x[2]==a)])  for a in MDutils.getUSAstateNames()[2]]       
         Y3 = [len([x for x in dataArray if (x[8] not in (None, 0)) and (x[4]==a)])  for a in sexLst]
@@ -267,15 +282,16 @@ def plotDeathCharts(p, dataArray, sexLabels, racelabels, isPercentageOutput, max
             saveChartData([X,Y,Xticks,XticksLabels],chartFnmPath)
             plt.savefig(chartFnmPath)
             p = p + 1
-
+
+# Function to plot rare disease data
 def plotRareDiseaseData(fnm, sexLabels, racelabels, isPercentageOutput=None):
         print("=======================================================================")
         print("        RD CREATE CHARTS ")
         print("=======================================================================")
         startTm = time.time()
         isPercentageOutput = isPercentageOutput if not isPercentageOutput is None else 0 
 
-        # get disease name and output path from the file name
+        # Get disease name and output path from the file name
         chartFolderPath, csvFnm = os.path.split(fnm) # os.path.dirname(fnm)
         rdSName =  csvFnm.split("_")[0]
 
@@ -292,10 +308,11 @@ def plotRareDiseaseData(fnm, sexLabels, racelabels, isPercentageOutput=None):
         #   0,     1,     2,       3,         4,      5,       6,           7,          8,         9,     10  ]
         # "idx", "age",	"state", "zipCode",	"sex",	"race",	"birthDate", "diagDate", "deathDate", "CP1",  "CP2"]
 
+        # Excluded labels    
         excludedLabels =  ["idx","zipCode"]
         chartsIdx= [ j  for j in range(len(dataLabels)) if not dataLabels[j] in excludedLabels] 
 
-        # figure size 
+        # Figure size 
         szW = 10; szH = 6
         for p in chartsIdx:                    
             plotPatientsCharts(p, dataLabels, dataArray, chartFolderPath, rdSName,statesLabels,  rd_datasset_size, isPercentageOutput)
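Review note on `getFreqFromList`: the loop in that function rescans `data` once per unique label. The following is a minimal sketch of an equivalent single-pass version using `collections.Counter`, assuming the same `[labels, freq]` return shape. The name `get_freq_from_list` and its snake_case parameter are illustrative only and not part of this PR.

```python
from collections import Counter


def get_freq_from_list(data, is_percentage_output=False):
    """Return [labels, freq] with labels in sorted order."""
    counts = Counter(data)                   # count every value in one pass
    labels = sorted(counts)                  # unique values, sorted
    freq = [counts[lbl] for lbl in labels]   # counts aligned with labels
    if is_percentage_output and freq:
        total = sum(freq)
        # Report each label's share of the total instead of its raw count.
        freq = [c / total * 100 for c in freq]
    return [labels, freq]


labels, freq = get_freq_from_list(["F", "M", "F", "F"])
pct_labels, pct = get_freq_from_list(["F", "M", "F", "F"], is_percentage_output=True)
```

For long patient lists with many distinct values, such as ages or states, this avoids the repeated list comprehensions while keeping the output format unchanged.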