Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Report] Data Flow Interruption with Function Parameters and Variable Arguments in Python #17753

Open
gravingPro opened this issue Oct 14, 2024 · 7 comments
Labels
question Further information is requested

Comments

@gravingPro
Copy link

I've encountered issues in CodeQL regarding data flow interruption. Here are the details:

1. Function Parameter Passing Interruption

In the code below:

def read_sql(sql):
    spark.sql()  # sink custom

def process(func, args): 
    func(*args) 

sql = request.json['data']  # Source
process(func=read_sql, args=sql)

CodeQL fails to detect that the tainted variable sql is passed into read_sql when using the process function to handle the function call and its argument. This shows an interruption in data flow tracking during function parameter passing and subsequent invocation with variable arguments.

2. *args and **kwargs Interruption

The problem with *args (variable positional arguments) and **kwargs (variable keyword arguments) is that when used in a way that impacts data flow, CodeQL can't track accurately. In the given example, using *args in the process function leads to incorrect recognition of the data flow for sql. This issue extends to similar scenarios involving these constructs.

Moreover, these problems also occur in functions related to multithreading and multiprocessing like threading.Thread, mulitprocess.Process, concurrent.futures.ThreadPoolExecutor, and concurrent.futures.ProcessPoolExecutor.

I hope this description helps in identifying and resolving these problems. Looking forward to a timely fix or further guidance on handling such complex data flow tracking scenarios.

Best regards

@gravingPro gravingPro added the question Further information is requested label Oct 14, 2024
@rvermeulen
Copy link
Contributor

Hi @gravingPro,

Thanks for the bug report.
We will inform the Python team and get back to you on possible further guidance.

@rvermeulen
Copy link
Contributor

Hi @gravingPro,

A quick follow-up question. How is your custom sink defined?
In read_sql the argument sql is currently unused.

@gravingPro
Copy link
Author

Hi @gravingPro,

A quick follow-up question. How is your custom sink defined? In read_sql the argument sql is currently unused.

It's just a simple example. Any sink can be used here, no matter sql injection or ssrf.

@rvermeulen
Copy link
Contributor

Hi @gravingPro,

Is read_sql your sink, because the comment suggests spark.sql() # sink custom is the sink? If spark.sql() is the sink, then the flow will stop at read_sql's parameter sql that is unused. That is, it will never reach your sink so I would like to exclude that possibility.

@rvermeulen rvermeulen added the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Oct 17, 2024
@gravingPro
Copy link
Author

Hi @gravingPro,

Is read_sql your sink, because the comment suggests spark.sql() # sink custom is the sink? If spark.sql() is the sink, then the flow will stop at read_sql's parameter sql that is unused. That is, it will never reach your sink so I would like to exclude that possibility.

Oh, i made a mistake there. I missed the taint data. The correct codes are:

def read_sql(sql_data):
    spark.sql(sql_data)  # A simple sink for  example. Any other official sink can be appied here.

def process(func, args): 
    func(*args) 

sql = request.json['data']  # Source
process(func=read_sql, args=sql)

@gravingPro
Copy link
Author

As said in description, it's found in the projects using multiprocess functions as "threading.Thread", "concurrent.futures.Executor" and "multiprocessing.Process".
The CodeQL cannot get the path.

# example
import threading
import time

def print_greeting(greeting):
    sink(greeting)

taint_data = 
thread = threading.Thread(target=print_greeting, args=(taint_data, ))
thread.start()

We have inspected the underlying code of the threading. And finally found the self._target(*self._args, **self._kwargs) in run function which can not be correctly extracted by CodeQL.

class Thread(BaseThread):
    # ...

    def run(self):
        """A virtual method to be overridden by the subclass."""
        if self._target is not None:
            self._target(*self._args, **self._kwargs)

@rvermeulen rvermeulen removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Oct 18, 2024
@rvermeulen
Copy link
Contributor

@gravingPro Thanks for clarifying your example.
We will provide an update when we know more!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants