
Use DeepSmith to generate programs in other languages #47

Open
JWesleySM opened this issue Jul 12, 2019 · 6 comments
JWesleySM commented Jul 12, 2019

Hi @ChrisCummins,

I'm trying to use your neural network to generate kernels in other languages. I'm currently following the approach described in this comment on #30. Is that the best way to do it?

I tried to create a corpus following this: https://github.com/ChrisCummins/phd/tree/master/datasets/github/scrape_repos.
All the commands ran: I scraped and cloned the repos. But when I run the importer, it gives me:

Importing 0 $LANG repos ...

Is that right? Am I missing something?

Another question: given a tar.gz file of source code files, how do I create a CLgen model, and how do I specify a config file for it? I tried following this file, changing working_dir, local_tar_archive, and the sampler, but when I run bazel run //deeplearning/clgen -- --config=/path/to/the/config/file, it gives me the error:

clgen.py:176] [Errno 2] No such file or directory: '/tmp/clgen_corpus_lh_2nyrr/corpus/some_file.c' (FileNotFoundError)

@ChrisCummins (Owner)

Hi Jose,

Sorry for the slow response! I was away when you sent this.

So, you reported a couple of problems:

"Importing 0 $LANG repos ..." when trying to scrape files

I suspect there may be something janky in your config file. Could you please paste the contents of the "clone list" you're using here?

"No such file or directory" with CLgen local_tar_archive

You're absolutely right about setting the path in local_tar_archive. I can't tell what's causing that error without a little more context. Could you please re-run the command with the --clgen_debug flag set? E.g.

$ bazel run //deeplearning/clgen -- --clgen_debug --config=/path/to/the/config/file

Cheers,
Chris

@JWesleySM (Author)

Hi Chris,

No worries about the slow response.

The first error, "Importing 0 $LANG repos ..." when trying to scrape files, I have already fixed. Now the problem seems to be with the corpus exporter. When I run:

bazel run //datasets/github/scrape_repos:export_corpus -- \
  --clone_list $PWD/clone_list.pbtxt \
  --export_path /tmp/phd/datasets/github/scrape_repos/corpuses/java

it builds perfectly, but it doesn't export any files to the export directory. By debugging the code, I found that on line 84 of export_corpus.py the condition

if index_path.is_dir():

is always false.

Running the command with --clgen_debug returns this:

INFO: Analysed target //deeplearning/clgen:clgen (0 packages loaded).
INFO: Found 1 target...
Target //deeplearning/clgen:clgen up-to-date:
bazel-phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen
INFO: Elapsed time: 0.586s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen --clgen_debug '--config=/home/jwesley/phd/deeplearning/clgen/tests/data/INFO: Build completed successfully, 1 total action

Traceback (most recent call last):
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 294, in <module>
    app.run(main)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 274, in run
    _run_main(main, argv)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 290, in main
    RunWithErrorHandling(DoFlagsAction)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 200, in RunWithErrorHandling
    return function_to_run(*args, **kwargs)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 244, in DoFlagsAction
    instance = Instance(config)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/clgen.py", line 100, in __init__
    self.model: models.Model = models.Model(config.model)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/models/models.py", line 67, in __init__
    self.corpus = corpuses.Corpus(config.corpus)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py", line 113, in __init__
    self.content_id = ResolveContentId(self.config, hc)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py", line 361, in ResolveContentId
    path_prefix=FLAGS.clgen_local_path_prefix))
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py", line 416, in GetHashOfArchiveContents
    return checksumdir.dirhash(d, 'sha1')
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__checksumdir_1_0_5/checksumdir/__init__.py", line 40, in dirhash
    hash_func) for f in files if not
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__checksumdir_1_0_5/checksumdir/__init__.py", line 41, in <listcomp>
    f.startswith('.') and not re.search(r'/\.', f)])
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/deeplearning/clgen/clgen.runfiles/pypi__checksumdir_1_0_5/checksumdir/__init__.py", line 48, in _filehash
    with open(filepath, 'rb') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/clgen_corpus_zn4ptcvg/corpus/vphn.c'
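For context on this failure mode: checksumdir's dirhash walks the corpus directory and then opens every file it listed, so a FileNotFoundError here usually means an entry disappeared between the listing and the open; a dangling symlink is the classic culprit. Below is a rough sketch of a more defensive directory hash. It is purely illustrative and is not checksumdir's actual implementation; the function name is made up.

```python
import hashlib
import os

def dirhash_skip_missing(dirname, algo="sha1"):
    """Hash a directory's file contents, skipping entries that vanish
    mid-walk (e.g. dangling symlinks) instead of raising FileNotFoundError."""
    digests = []
    for root, _, files in sorted(os.walk(dirname)):
        for name in sorted(files):
            path = os.path.join(root, name)
            if not os.path.isfile(path):  # dangling symlink or removed file
                continue
            h = hashlib.new(algo)
            with open(path, "rb") as fp:
                for chunk in iter(lambda: fp.read(65536), b""):
                    h.update(chunk)
            digests.append(h.hexdigest())
    # Combine the per-file digests into a single directory digest.
    top = hashlib.new(algo)
    top.update("".join(sorted(digests)).encode())
    return top.hexdigest()
```

Running `find <corpus_dir> -xtype l` before hashing is an even quicker way to spot broken symlinks in an extracted corpus.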


JWesleySM commented Jul 23, 2019

My config file looks like this:

# A tiny corpus of C files and a correspondingly small model.
# It should take a few minutes to train on a reasonably powerful GPU.
# File: //deeplearning/deepsmith/proto/clgen.proto
# Proto: clgen.Instance
working_dir: "/tmp/phd/deeplearning/clgen/c"
model {
  corpus {
    local_tar_archive: "$PWD/c_corpus.tar.bz2"
    ascii_character_atomizer: true
    contentfile_separator: "\n\n"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:ClangPreprocess"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:Compile"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:NormalizeIdentifiers"
    preprocessor: "deeplearning.clgen.preprocessors.common:StripDuplicateEmptyLines"
    preprocessor: "deeplearning.clgen.preprocessors.common:StripTrailingWhitespace"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:ClangFormat"
    preprocessor: "deeplearning.clgen.preprocessors.common:MinimumLineCount3"
    preprocessor: "deeplearning.clgen.preprocessors.cxx:Compile"
  }
  architecture {
    backend: TENSORFLOW
    neuron_type: LSTM
    neurons_per_layer: 128
    num_layers: 2
    post_layer_dropout_micros: 0  # = 0.0 real value
  }
  training {
    num_epochs: 32
    sequence_length: 64
    batch_size: 64
    shuffle_corpus_contentfiles_between_epochs: false
    adam_optimizer {
      initial_learning_rate_micros: 2000  # = 0.002 real value
      learning_rate_decay_per_epoch_micros: 50000  # = 0.05 real value
      beta_1_micros: 900000  # = 0.9 real value
      beta_2_micros: 999000  # = 0.999 real value
      normalized_gradient_clip_micros: 5000000  # = 5.0 real value
    }
  }
}
sampler {
  start_text: "void A("
  batch_size: 1
  temperature_micros: 800000  # = 0.8 real value
  termination_criteria {
    symtok {
      depth_increase_token: "{"
      depth_decrease_token: "}"
    }
  }
  termination_criteria {
    maxlen {
      maximum_tokens_in_sample: 500
    }
  }
}
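For reference, the *_micros integer fields in these configs encode real numbers scaled by one million, as the inline comments indicate (e.g. temperature_micros: 800000 means 0.8). A small sketch of that convention in Python; the helper names are mine, not part of CLgen:

```python
def from_micros(micros: int) -> float:
    """Decode a *_micros proto field into its real value (micros / 1e6)."""
    return micros / 1_000_000

def to_micros(value: float) -> int:
    """Encode a real value as a *_micros integer field."""
    return round(value * 1_000_000)

# Examples matching the fields in the config above:
print(from_micros(800_000))  # temperature_micros -> 0.8
print(to_micros(0.999))      # beta_2 -> 999000
```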

@ChrisCummins (Owner)

It builds perfectly, but it doesn't export any file to the export directory.

Ah! My mistake. It seems I accidentally committed some debugging code :) This region of the code should be uncommented:

https://github.com/ChrisCummins/phd/blob/master/datasets/github/scrape_repos/export_corpus.py#L72-L79

and this section should be commented:

https://github.com/ChrisCummins/phd/blob/master/datasets/github/scrape_repos/export_corpus.py#L81-L86

That should fix the corpus exporting :)

For the CLgen corpus error, I tried your config and couldn't reproduce your error. Can you check that the corpus archive contains nothing but the text files you want to train on?

Here are the steps I took to try to reproduce:

# Create a single file "corpus"
$ cat <<EOF > main.c
int main() {
  int x = 1;
  int y = 2;
  return x + y;
}
EOF
# Create the corpus tarball
$ tar cjvf c_corpus.tar.bz2 main.c
# Remove any cached files from previous failed runs
$ rm -rf /tmp/phd/deeplearning/clgen
# Run CLgen on the config.
$ bazel run //deeplearning/clgen -- --config=$PWD/config.pbtxt
...
I0724 17:45:08.671088 4568839616 preprocessed.py:188] Preprocessing 1 of 1 content files
...
I0724 17:45:09.225190 4568839616 encoded.py:226] Encoding 1 of 1 preprocessed files
...
I0724 17:45:09.271282 4568839616 encoded.py:173] Encoded corpus: 53 tokens, 1 files.
E0724 17:45:09.278481 4568839616 clgen.py:246] Not enough data. Use a smaller sequence_length and batch_size (UserError)

As you can see, the above commands won't train a model (you need more than a single file to train on), but hopefully it's enough to confirm that we're both running the same commands.
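To get past the "Not enough data" error in a smoke test, the single-file corpus can be replaced with a tarball of many trivial files. A sketch, with the helper and file contents entirely made up; how many files count as "enough" depends on CLgen's sequence_length and batch_size:

```python
import io
import tarfile

def make_toy_corpus(path="c_corpus.tar.bz2", n_files=50):
    """Write a .tar.bz2 of trivial C files, as a stand-in corpus for
    smoke-testing the pipeline (not for training a useful model)."""
    with tarfile.open(path, "w:bz2") as tar:
        for i in range(n_files):
            # Each file is a slightly different trivial function.
            src = f"int f{i}(int a, int b) {{\n  int c = a + {i} * b;\n  return c;\n}}\n"
            data = src.encode()
            info = tarfile.TarInfo(name=f"corpus/file_{i}.c")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

make_toy_corpus()
```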

Cheers,
Chris


JWesleySM commented Jul 25, 2019

After uncommenting and commenting the lines you suggested in export_corpus.py, I now get this error:

INFO: Analysed target //datasets/github/scrape_repos:export_corpus (0 packages loaded).
INFO: Found 1 target...
Target //datasets/github/scrape_repos:export_corpus up-to-date:
  bazel-phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus
INFO: Elapsed time: 0.278s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus --clone_list /home/jwesley/phd/datasets/github/scrape_repos/clone_list.pbtxt --export_path /tmp/phd/datasets/github/scrape_repos/corpuINFO: Build completed successfully, 1 total action

Traceback (most recent call last):
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/datasets/github/scrape_repos/export_corpus.py", line 94, in <module>
    app.run(main)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 274, in run
    _run_main(main, argv)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/pypi__absl_py_0_1_10/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/datasets/github/scrape_repos/export_corpus.py", line 77, in main
    db = contentfiles.ContentFiles(d)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/datasets/github/scrape_repos/contentfiles.py", line 152, in __init__
    super(ContentFiles, self).__init__(url, Base)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/labm8/sqlutil.py", line 250, in __init__
    self.engine = CreateEngine(url, must_exist=must_exist)
  File "/home/jwesley/.cache/bazel/_bazel_jwesley/4a1188fa51f277d88b59f17a8b59eb11/execroot/phd/bazel-out/k8-py3-opt/bin/datasets/github/scrape_repos/export_corpus.runfiles/phd/labm8/sqlutil.py", line 129, in CreateEngine
    if url.startswith('mysql://'):
AttributeError: 'PosixPath' object has no attribute 'startswith'

I thought this was caused by the Python version, but I'm using 3.6.8.

About the error with CLgen: I sort of fixed it, and I can run now; it trains a model!
I deleted the file that was not being found, plus some other files that were causing trouble, and it worked.
However, the generated kernels are really weird; some of them are even empty, or look like this:

void A(bnp (%c3),%k5");
asm volatile("kadnd 0x12345678(%rax,%rcx,1), %xnm0");
asm volatile("kmovf %bnd,%k3");
asm volatile("bndsd (%rax,%rcx,1), %bnd0");
asm volatile("sha1msg2 0x12(%rax,%rcx,8)", %xm1,%k0");
asm volatile("vcvtqudp $0x12,%zmm26,%k7}

or this:

void A(void))

rd (unn - = , ( c- b) ' |  u (r (b 6) <:
    return - 0)  < < 30))

  *+++5;
 c * 1;
 a  (*af4 ( < '))
  < 1;
  = + 12;
    = <a ( = ' ' | - = < 1+) < '= '0;

}

My tar file contains only source (.c) files I extracted, with a shell script, from the repositories I cloned using this. Do you have any idea why the kernels look like that? Is there a problem with my corpus?

@ChrisCummins (Owner)

The export_corpus crash is because this line:

https://github.com/ChrisCummins/phd/blob/master/datasets/github/scrape_repos/export_corpus.py#L76

should be:

    db = contentfiles.ContentFiles(f'sqlite:///{d}')

Sorry about that :)
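The underlying issue is that CreateEngine expects a URL string (it calls url.startswith('mysql://')), but it was handed a pathlib.Path, which has no startswith method. A minimal sketch of the conversion; the helper name and example path are mine:

```python
import pathlib

def sqlite_url(db_path) -> str:
    """Build the 'sqlite:///...' URL string that SQLAlchemy-style
    engine constructors expect, from a filesystem path."""
    return f"sqlite:///{pathlib.Path(db_path)}"

# A Path cannot be used directly where a URL string is expected:
assert not hasattr(pathlib.Path("/tmp/example.db"), "startswith")

# An absolute path yields the four-slash absolute sqlite URL form:
print(sqlite_url("/tmp/example/contentfiles.db"))  # sqlite:////tmp/example/contentfiles.db
```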

Your progress is promising! But clearly the model has not learned anything useful yet. There's a whole bunch of potential issues to narrow down; some starting points to consider:

  1. How big is your corpus? How much variance is there in the code in the corpus?
  2. What is the final training loss of your model? What was the starting training loss? You can find the training loss in the model's log files, located at <path_to_clgen_cache>/models/<model_id>/logs/.
  3. What are the parameters of your model? The tiny.pbtxt config you based yours on is, well, tiny. You'll want to train a much larger network.
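On question 1, a quick way to put numbers on corpus size and variance is to scan the tarball directly. A rough sketch; the helper is mine, and its whitespace-token count is only a crude proxy, not CLgen's atomizer:

```python
import re
import tarfile

def corpus_stats(tar_path):
    """Report rough size/variance numbers for a source tarball:
    file count, total characters, and unique whitespace-delimited tokens."""
    n_files = 0
    n_chars = 0
    vocab = set()
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode("utf-8", errors="ignore")
            n_files += 1
            n_chars += len(text)
            vocab.update(re.findall(r"\S+", text))
    return n_files, n_chars, len(vocab)
```

As a rule of thumb, a character-level LSTM needs megabytes of varied text before its samples stop looking like noise.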

It'll be quicker and easier to go through them on a call. Shoot me an email and we can have a Google Hangouts chat.

Of course, there could also be something broken in CLgen's model training/sampling logic. I'm currently working on a private fork which has a handful of improvements, but it isn't ready for release yet.

Cheers,
Chris
