
Use DeepSmith to generate programs in other languages #47

Open
JWesleySM opened this issue Jul 12, 2019 · 6 comments
JWesleySM commented Jul 12, 2019

Hi @ChrisCummins,

I'm trying to use your neural network to generate kernels in other languages. I'm currently following the approach described in this comment on #30. Is that the best way to do it?

I tried to create a corpus following this: https://github.com/ChrisCummins/phd/tree/master/datasets/github/scrape_repos.
All the commands ran: I scraped and cloned the repos. But when I run the importer, it gives me:

Importing 0 $LANG repos ...

Is that right? Am I missing something?

Another question: given a tar.gz file of source code files, how do I create a CLgen model, and how do I specify a config file for it? I tried following this file, changing working_dir, local_tar_archive, and the sampler, but when I run bazel run //deeplearning/clgen -- --config=/path/to/the/config/file, it gives me the error:

clgen.py:176] [Errno 2] No such file or directory: '/tmp/clgen_corpus_lh_2nyrr/corpus/some_file.c' (FileNotFoundError)

@ChrisCummins (Owner)

Hi Jose,

Sorry for the slow response! I was away when you sent this.

So, you reported a couple of problems:

"Importing 0 $LANG repos ..." when trying to scrape files

I suspect there may be something janky in your config file. Could you please paste the contents of the "clone list" you're using here?

"No such file or directory" with CLgen local_tar_archive

You're absolutely right about setting the path in local_tar_archive. I can't tell what's causing that error without a little more context. Could you please re-run the command with the --clgen_debug flag set? E.g.

$ bazel run //deeplearning/clgen -- --clgen_debug --config=/path/to/the/config/file

Cheers,
Chris

@JWesleySM (Author)

Hi Chris,

No worries about the slow response.

The first error, "Importing 0 $LANG repos ..." when trying to scrape files, I have already fixed. Now the problem seems to be with the corpus exporter. When I run:

bazel run //datasets/github/scrape_repos:export_corpus -- \
  --clone_list $PWD/clone_list.pbtxt \
  --export_path /tmp/phd/datasets/github/scrape_repos/corpuses/java

it builds perfectly, but it doesn't export any files to the export directory. By debugging the code, I found that on line 84 of export_corpus.py the condition

if index_path.is_dir():

is always false.

Running the command with --clgen_debug returns this:

INFO: Analysed target //deeplearning/clgen:clgen (0 packages loaded).
INFO: Found 1 target...
Target //deeplearning/clgen:clgen up-to-date:
bazel-phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen
INFO: Elapsed time: 0.586s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen --clgen_debug '--config=/home/jwesley/phd/deeplearning/clgen/tests/data/INFO: Build completed successfully, 1 total action

Traceback (most recent call last):
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 294, in <module>
    app.run(main)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 274, in run
    _run_main(main, argv)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 290, in main
    RunWithErrorHandling(DoFlagsAction)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 200, in RunWithErrorHandling
    return function_to_run(*args, **kwargs)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 244, in DoFlagsAction
    instance = Instance(config)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 100, in __init__
    self.model: models.Model = models.Model(config.model)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/models/models.py", line 67, in __init__
    self.corpus = corpuses.Corpus(config.corpus)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py", line 113, in __init__
    self.content_id = ResolveContentId(self.config, hc)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py", line 361, in ResolveContentId
    path_prefix=FLAGS.clgen_local_path_prefix))
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py", line 416, in GetHashOfArchiveContents
    return checksumdir.dirhash(d, 'sha1')
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__checksumdir_1_0_5/checksumdir/__init__.py", line 40, in dirhash
    hash_func) for f in files if not
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__checksumdir_1_0_5/checksumdir/__init__.py", line 41, in <listcomp>
    f.startswith('.') and not re.search(r'/\.', f)])
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__checksumdir_1_0_5/checksumdir/__init__.py", line 48, in _filehash
    with open(filepath, 'rb') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/clgen_corpus_zn4ptcvg/corpus/vphn.c'
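For context on this failure mode: checksumdir's dirhash walks the corpus directory and then opens every file it listed, so a FileNotFoundError here usually means an entry disappeared between the listing and the open; a dangling symlink is the classic culprit. Below is a rough sketch of a more defensive directory hash. It is purely illustrative and is not checksumdir's actual implementation; the function name is made up.

```python
import hashlib
import os

def dirhash_skip_missing(dirname, algo="sha1"):
    """Hash a directory's file contents, skipping entries that vanish
    mid-walk (e.g. dangling symlinks) instead of raising FileNotFoundError."""
    digests = []
    for root, _, files in sorted(os.walk(dirname)):
        for name in sorted(files):
            path = os.path.join(root, name)
            if not os.path.isfile(path):  # dangling symlink or removed file
                continue
            h = hashlib.new(algo)
            with open(path, "rb") as fp:
                for chunk in iter(lambda: fp.read(65536), b""):
                    h.update(chunk)
            digests.append(h.hexdigest())
    # Combine the per-file digests into a single directory digest.
    top = hashlib.new(algo)
    top.update("".join(sorted(digests)).encode())
    return top.hexdigest()
```

Running `find <corpus_dir> -xtype l` before hashing is an even quicker way to spot broken symlinks in an extracted corpus.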


JWesleySM commented Jul 23, 2019

My config file looks like this:

# A tiny corpus of C files and a correspondingly small model.
# It should take a few minutes to train on a reasonably powerful GPU.
# File: //deeplearning/deepsmith/proto/clgen.proto
# Proto: clgen.Instance
working_dir: "/tmp/phd/deeplearning/clgen/c"
model {
  corpus {
    local_tar_archive: "$PWD/c_corpus.tar.bz2"
    ascii_character_atomizer: true
    contentfile_separator: "\n\n"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:ClangPreprocess"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:Compile"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:NormalizeIdentifiers"
    preprocessor: "deeplearning.clgen.preprocessors.common:StripDuplicateEmptyLines"
    preprocessor: "deeplearning.clgen.preprocessors.common:StripTrailingWhitespace"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:ClangFormat"
    preprocessor: "deeplearning.clgen.preprocessors.common:MinimumLineCount3"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:Compile"
  }
  architecture {
    backend: TENSORFLOW
    neuron_type: LSTM
    neurons_per_layer: 128
    num_layers: 2
    post_layer_dropout_micros: 0  # = 0.0 real value
  }
  training {
    num_epochs: 32
    sequence_length: 64
    batch_size: 64
    shuffle_corpus_contentfiles_between_epochs: false
    adam_optimizer {
      initial_learning_rate_micros: 2000  # = 0.002 real value
      learning_rate_decay_per_epoch_micros: 50000  # = 0.05 real value
      beta_1_micros: 900000  # = 0.9 real value
      beta_2_micros: 999000  # = 0.999 real value
      normalized_gradient_clip_micros: 5000000  # = 5.0 real value
    }
  }
}
sampler {
  start_text: "void A("
  batch_size: 1
  temperature_micros: 800000  # = 0.8 real value
  termination_criteria {
    symtok {
      depth_increase_token: "{"
      depth_decrease_token: "}"
    }
  }
  termination_criteria {
    maxlen {
      maximum_tokens_in_sample: 500
    }
  }
}
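For reference, the *_micros integer fields in these configs encode real numbers scaled by one million, as the inline comments indicate (e.g. temperature_micros: 800000 means 0.8). A small sketch of that convention in Python; the helper names are mine, not part of CLgen:

```python
def from_micros(micros: int) -> float:
    """Decode a *_micros proto field into its real value (micros / 1e6)."""
    return micros / 1_000_000

def to_micros(value: float) -> int:
    """Encode a real value as a *_micros integer field."""
    return round(value * 1_000_000)

# Examples matching the fields in the config above:
print(from_micros(800_000))  # temperature_micros -> 0.8
print(to_micros(0.999))      # beta_2 -> 999000
```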

@ChrisCummins (Owner)

It builds perfectly, but it doesn't export any file to the export directory.

Ah! My mistake. It seems I accidentally committed some debugging code :) This region of the code should be uncommented:

https://github.com/ChrisCummins/phd/blob/master/datasets/github/scrape_repos/export_corpus.py#L72-L79

and this section should be commented:

https://github.com/ChrisCummins/phd/blob/master/datasets/github/scrape_repos/export_corpus.py#L81-L86

That should fix the corpus exporting :)

For the CLgen corpus error, I tried your config and couldn't reproduce your error. Can you check that the corpus archive contains nothing but the text files you want to train on?

Here are the steps I took to try to reproduce:

# Create a single file "corpus"
$ cat <<EOF > main.c
int main() {
  int x = 1;
  int y = 2;
  return x + y;
}
EOF
# Create the corpus tarball
$ tar cjvf c_corpus.tar.bz2 main.c
# Remove any cached files from previous failed runs
$ rm -rf /tmp/phd/deeplearning/clgen
# Run CLgen on the config.
$ bazel run //deeplearning/clgen -- --config=$PWD/config.pbtxt
...
I0724 17:45:08.671088 4568839616 preprocessed.py:188] Preprocessing 1 of 1 content files
...
I0724 17:45:09.225190 4568839616 encoded.py:226] Encoding 1 of 1 preprocessed files
...
I0724 17:45:09.271282 4568839616 encoded.py:173] Encoded corpus: 53 tokens, 1 files.
E0724 17:45:09.278481 4568839616 clgen.py:246] Not enough data. Use a smaller sequence_length and batch_size (UserError)

As you can see, the above commands won't train a model (you need more than a single file to train on), but hopefully it's enough to confirm that we're both running the same commands.
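To get past the "Not enough data" error in a smoke test, the single-file corpus can be replaced with a tarball of many trivial files. A sketch, with the helper and file contents entirely made up; how many files count as "enough" depends on CLgen's sequence_length and batch_size:

```python
import io
import tarfile

def make_toy_corpus(path="c_corpus.tar.bz2", n_files=50):
    """Write a .tar.bz2 of trivial C files, as a stand-in corpus for
    smoke-testing the pipeline (not for training a useful model)."""
    with tarfile.open(path, "w:bz2") as tar:
        for i in range(n_files):
            # Each file is a slightly different trivial function.
            src = f"int f{i}(int a, int b) {{\n  int c = a + {i} * b;\n  return c;\n}}\n"
            data = src.encode()
            info = tarfile.TarInfo(name=f"corpus/file_{i}.c")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

make_toy_corpus()
```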

Cheers,
Chris


JWesleySM commented Jul 25, 2019

After uncommenting and commenting the lines you suggested in export_corpus.py, I now get this error:

INFO: Analysed target //datasets/github/scrape_repos:export_corpus (0 packages loaded).
INFO: Found 1 target...
Target //datasets/github/scrape_repos:export_corpus up-to-date:
  bazel-phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus
INFO: Elapsed time: 0.278s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus --clone_list /home/jwesley/phd/datasets/github/scrape_repos/clone_list.pbtxt --export_path /tmp/phd/datasets/github/scrape_repos/corpuINFO: Build completed successfully, 1 total action

Traceback (most recent call last):
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/datasets/github/scrape_repos/export_corpus.py", line 94, in <module>
    app.run(main)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 274, in run
    _run_main(main, argv)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/datasets/github/scrape_repos/export_corpus.py", line 77, in main
    db = contentfiles.ContentFiles(d)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/datasets/github/scrape_repos/contentfiles.py", line 152, in __init__
    super(ContentFiles, self).__init__(url, Base)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/labm8/sqlutil.py", line 250, in __init__
    self.engine = CreateEngine(url, must_exist=must_exist)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/labm8/sqlutil.py", line 129, in CreateEngine
    if url.startswith('mysql://'):
AttributeError: 'PosixPath' object has no attribute 'startswith'

I thought this was caused by the Python version, but I'm using 3.6.8.

About the error with CLgen: I sort of fixed it, and I can run now; it trains a model!
I deleted the file that was not being found, plus some other files that were causing trouble, and it worked.
However, the generated kernels are really weird; some of them are even empty, or look like this:

void A(bnp (%c3),%k5");
asm volatile("kadnd 0x12345678(%rax,%rcx,1), %xnm0");
asm volatile("kmovf %bnd,%k3");
asm volatile("bndsd (%rax,%rcx,1), %bnd0");
asm volatile("sha1msg2 0x12(%rax,%rcx,8)", %xm1,%k0");
asm volatile("vcvtqudp $0x12,%zmm26,%k7}

or this:

void A(void))

rd (unn - = , ( c- b) ' |  u (r (b 6) <:
    return - 0)  < < 30))

  *+++5;
 c * 1;
 a  (*af4 ( < '))
  < 1;
  = + 12;
    = <a ( = ' ' | - = < 1+) < '= '0;

}

My tar file contains only source (.c) files I extracted, with a shell script, from the repositories I cloned using this. Do you have any idea why the kernels look like that? Is there a problem with my corpus?

@ChrisCummins (Owner)

The export_corpus crash is because this line:

https://github.com/ChrisCummins/phd/blob/master/datasets/github/scrape_repos/export_corpus.py#L76

should be:

    db = contentfiles.ContentFiles(f'sqlite:///{d}')

Sorry about that :)
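The underlying issue is that CreateEngine expects a URL string (it calls url.startswith('mysql://')), but it was handed a pathlib.Path, which has no startswith method. A minimal sketch of the conversion; the helper name and example path are mine:

```python
import pathlib

def sqlite_url(db_path) -> str:
    """Build the 'sqlite:///...' URL string that SQLAlchemy-style
    engine constructors expect, from a filesystem path."""
    return f"sqlite:///{pathlib.Path(db_path)}"

# A Path cannot be used directly where a URL string is expected:
assert not hasattr(pathlib.Path("/tmp/example.db"), "startswith")

# An absolute path yields the four-slash absolute sqlite URL form:
print(sqlite_url("/tmp/example/contentfiles.db"))  # sqlite:////tmp/example/contentfiles.db
```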

Your progress is promising! But clearly the model has not learned anything useful yet. There's a whole bunch of potential issues to narrow down; some starting points to consider:

  1. How big is your corpus? How much variance is there in the code in the corpus?
  2. What is the final training loss of your model? What was the starting training loss? You can find the training loss in the model's log files, located at <path_to_clgen_cache>/models/<model_id>/logs/.
  3. What are the parameters of your model? The tiny.pbtxt config you based yours on is, well, tiny. You'll want to train a much larger network.
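On question 1, a quick way to put numbers on corpus size and variance is to scan the tarball directly. A rough sketch; the helper is mine, and its whitespace-token count is only a crude proxy, not CLgen's atomizer:

```python
import re
import tarfile

def corpus_stats(tar_path):
    """Report rough size/variance numbers for a source tarball:
    file count, total characters, and unique whitespace-delimited tokens."""
    n_files = 0
    n_chars = 0
    vocab = set()
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode("utf-8", errors="ignore")
            n_files += 1
            n_chars += len(text)
            vocab.update(re.findall(r"\S+", text))
    return n_files, n_chars, len(vocab)
```

As a rule of thumb, a character-level LSTM needs megabytes of varied text before its samples stop looking like noise.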

It'll be quicker and easier to go through them on a call. Shoot me an email and we can have a Google Hangouts chat.

Of course, there could also be something broken in CLgen's model training/sampling logic. I'm currently working on a private fork which has a handful of improvements, but it isn't ready for release yet.

Cheers,
Chris
