@salgo60
Last active July 14, 2020 19:47
Try to read SWEPUB data (10 GB)
# Try to get ORCID from SWEPUB
# see https://kundo.se/org/swepub/d/api-for-amnesklassificering/#c3571837
from tqdm import tqdm
import pandas as pd
import json
import time
start_time = time.time()
filename = "data/swepub-duplicated-2020-07-05.jsonl"
filestore = "data/swepub-duplicated-2020-07-05.pd"
# df = pd.read_json(filename, lines=True) gives --> interrupted by signal 9: SIGKILL
# chunksize 5 --> 301180 iterations and exit code 137 (interrupted by signal 9: SIGKILL)
# chunksize 10000 --> 150 iterations and exit code 137 after 4 hours
df_chunk = pd.read_json(filename, lines=True, chunksize=10000)
chunk_list = []
for i, chunk in enumerate(df_chunk):
    print(i)
    chunk_list.append(chunk)
print("--- %s seconds ---" % (time.time() - start_time))
# concat the list into dataframe
df_concat = pd.concat(chunk_list)
print("--- %s seconds ---" % (time.time() - start_time))
df_concat.info()
df_concat.to_pickle(filestore)
print("--- %s seconds ---" % (time.time() - start_time))
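The script above keeps every chunk in `chunk_list` and then concatenates them, so the whole file still has to fit in memory at once. A minimal sketch of one alternative, assuming only a few top-level fields are needed: read the JSON Lines file one record at a time and write a slimmed copy. The helper name `slim_jsonl` and the choice of keys are hypothetical; which fields actually carry the ORCID is not shown in this gist.

```python
import json


def slim_jsonl(src_path, dst_path, keep_keys):
    """Copy a JSON Lines file record by record, keeping selected top-level keys.

    Only one record is held in memory at a time.
    Returns the number of records written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep only the requested keys that are present in this record.
            slim = {key: record[key] for key in keep_keys if key in record}
            dst.write(json.dumps(slim, ensure_ascii=False) + "\n")
            count += 1
    return count
```

It could be called as `slim_jsonl(filename, "data/swepub-slim.jsonl", ["@id", "identifiedBy", "instanceOf"])`, where the output path is made up for illustration. The smaller file may then be small enough for `pd.read_json(..., lines=True)`.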
salgo60 commented Jul 14, 2020

Part of the output:
...
...
149
150
--- 13953.98006105423 seconds ---
/Users/magnus/Documents/GitHub/open-data-examples/ReadSwepub.py:22: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

print("--- %s seconds ---" % (time.time() - start_time))
--- 14524.747683048248 seconds ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1505902 entries, 0 to 1505901
Data columns (total 15 columns):
@context                  1505902 non-null object
@id                       1505902 non-null object
@type                     1505902 non-null object
carrierType               975900 non-null object
editionStatement          22059 non-null object
extent                    170729 non-null object
hasSeries                 117001 non-null object
identifiedBy              1505902 non-null object
indirectlyIdentifiedBy    160 non-null object
instanceOf                1505902 non-null object
meta                      1505902 non-null object
partOf                    1193782 non-null object
provisionActivity         6695 non-null object
publication               1481484 non-null object
usageAndAccessPolicy      248495 non-null object
dtypes: object(15)
memory usage: 172.3+ MB

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
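The `+` in `memory usage: 172.3+ MB` means pandas only counted the pointers in the object columns, not the Python objects they refer to, so the real footprint is larger. Exit code 137 is 128 + 9 (SIGKILL), which is commonly what a process gets when the system runs out of memory, though the output does not confirm that cause. A minimal sketch comparing the shallow and deep estimates on a small made-up frame (the helper `memory_report` and the sample data are hypothetical):

```python
import pandas as pd


def memory_report(df):
    """Return (shallow_bytes, deep_bytes) for a DataFrame.

    The deep figure adds the size of each object in object columns, but it
    does not follow nested containers, so nested dicts are still undercounted.
    """
    shallow = int(df.memory_usage(deep=False).sum())
    deep = int(df.memory_usage(deep=True).sum())
    return shallow, deep


# Made-up sample with one string column and one column of dicts.
sample = pd.DataFrame({
    "@id": ["https://example.org/%d" % n for n in range(1000)],
    "meta": [{"n": n} for n in range(1000)],
})
shallow, deep = memory_report(sample)
print("shallow: %d bytes, deep: %d bytes" % (shallow, deep))
```

On the real frame, `df_concat.info(memory_usage="deep")` would give the corresponding deep figure, at the cost of walking every object once.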
