#!/usr/bin/env python
"""
This script downloads WARC files from commoncrawl.org's news crawl and extracts articles from these files. You can
define filter criteria (see the YOUR CONFIG section); an article that does not meet them is discarded. Currently, the
script stores the extracted articles in JSON files, but this behaviour can be adapted to your needs in the method
on_valid_article_extracted. To speed up the crawling and extraction process, the script supports multiprocessing. You can
control the number of processes with the parameter my_number_of_extraction_processes.
You can also crawl and extract articles programmatically, i.e., from within
your own code, by using the class CommonCrawlCrawler or the function
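The docstring above describes a filter step followed by a callback that stores each accepted article as JSON. A minimal sketch of that pattern follows, under stated assumptions: only the name on_valid_article_extracted comes from the script; `valid_hosts`, `output_dir`, `passes_filter`, `process`, and the article fields (`id`, `source_domain`, `title`) are hypothetical and stand in for whatever the real YOUR CONFIG section and article objects provide. It does not download WARC files or use multiprocessing.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical settings; the real script defines its own in YOUR CONFIG.
valid_hosts = ["example.com"]          # an empty list would accept every host
output_dir = Path(tempfile.mkdtemp())  # throwaway directory for this sketch


def passes_filter(article: dict) -> bool:
    """Return True only if the article meets every configured criterion."""
    if valid_hosts and article.get("source_domain") not in valid_hosts:
        return False
    # Assumed extra criterion: the article must have a non-empty title.
    return bool(article.get("title"))


def on_valid_article_extracted(article: dict) -> Path:
    """Store one accepted article as a JSON file and return the file path."""
    out_path = output_dir / f"{article['id']}.json"
    out_path.write_text(
        json.dumps(article, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


def process(articles):
    """Discard articles that fail the filter; pass the rest to the callback."""
    return [on_valid_article_extracted(a) for a in articles if passes_filter(a)]


# Made-up sample records, for illustration only.
sample = [
    {"id": "a1", "source_domain": "example.com", "title": "Kept"},
    {"id": "a2", "source_domain": "other.org", "title": "Discarded"},
]
written = process(sample)
print([p.name for p in written])
```

Replacing the body of on_valid_article_extracted (for example, to write to a database instead of a file) changes where articles end up without touching the filter logic, which is presumably why the script exposes it as the adaptation point.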
3.716041911135552
====================
iter = 1
0.0001 news relu 1.0 3.716041911135552
Train on 1024 samples
Epoch 1/5
1024/1024 [==============================] - 10s 10ms/sample - loss: 0.5194
Epoch 2/5
1024/1024 [==============================] - 10s 9ms/sample - loss: 0.3417
Epoch 3/5
3.716041911135552
====================
iter = 1
0.0001 news relu 1.0 3.716041911135552
WARNING:tensorflow:From /home/sarahm/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/sarahm/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
4.89017726799057
Learning Rate 0.0001
========================================
iter= 1
news GRU 1.0 4.89017726799057
Tensor("attention_64/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Tensor("model_128/attention_64/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Train on 1024 samples
Epoch 1/5
1024/1024 [==============================] - 43s 42ms/sample - loss: 0.5893
4.89017726799057
Learning Rate 0.0001
========================================
iter= 1
news GRU 1.0 4.89017726799057
WARNING:tensorflow:From /home/sarahm/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Tensor("attention/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Tensor("model/attention/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
4.89017726799057
========================================
iter= 1
news GRU 1.0 4.89017726799057
Tensor("attention_41/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Tensor("model_82/attention_41/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Train on 1024 samples
Epoch 1/5
1024/1024 [==============================] - 40s 39ms/sample - loss: 0.3400
Epoch 2/5
4.89017726799057
========================================
iter= 1
news GRU 1.0 4.89017726799057
Tensor("attention_25/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Tensor("model_50/attention_25/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Train on 1024 samples
Epoch 1/5
1024/1024 [==============================] - 38s 37ms/sample - loss: 0.1309
Epoch 2/5
4.89017726799057
========================================
iter= 1
news GRU 1.0 4.89017726799057
Tensor("attention_9/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Tensor("model_18/attention_9/ExpandDims:0", shape=(?, 1, 64), dtype=float32)
Train on 1024 samples
Epoch 1/5
1024/1024 [==============================] - 34s 33ms/sample - loss: 0.1623
Epoch 2/5