Last active July 14, 2020 19:47
Try to read SWEPUB data (10 GB)
# Try to get ORCID from SWEPUB
# see https://kundo.se/org/swepub/d/api-for-amnesklassificering/#c3571837
from tqdm import tqdm
import pandas as pd
import json
import time
start_time = time.time()
filename = "data/swepub-duplicated-2020-07-05.jsonl"
filestore = "data/swepub-duplicated-2020-07-05.pd"
# df = pd.read_json(filename, lines=True) gives --> interrupted by signal 9: SIGKILL
# chunk 5 --> 301180 iterations and exit code 137 (interrupted by signal 9: SIGKILL)
# chunk 10000 --> 150 iterations and exit code 137 after 4 hours
df_chunk = pd.read_json(filename, lines=True, chunksize=10000)
chunk_list = []
i = 0
for i, chunk in enumerate(df_chunk):
    print(i)
    chunk_list.append(chunk)
print("--- %s seconds ---" % (time.time() - start_time))
# concat the list into dataframe
df_concat = pd.concat(chunk_list)
print("--- %s seconds ---" % (time.time() - start_time))
df_concat.info()
df_concat.to_pickle(filestore)
print("--- %s seconds ---" % (time.time() - start_time))
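The script above keeps every column of every chunk in memory and then concatenates them, so the whole dataset has to fit in RAM at once. Since the stated goal is only to get ORCID values, a possible alternative is to stream the file one line at a time and keep just the identifiers. The following is a minimal sketch, not the approach used in this gist. It assumes that SWEPUB marks ORCID identifiers as nested objects of the form `{"@type": "ORCID", "value": "..."}`, which is not confirmed by the output shown here. The helper names `find_typed_values` and `iter_orcids` are hypothetical.

```python
import json
from typing import Any, Iterator, Tuple


def find_typed_values(node: Any, wanted_type: str) -> Iterator[Any]:
    """Yield the "value" of every nested dict whose "@type" equals wanted_type."""
    if isinstance(node, dict):
        if node.get("@type") == wanted_type and "value" in node:
            yield node["value"]
        for child in node.values():
            yield from find_typed_values(child, wanted_type)
    elif isinstance(node, list):
        for child in node:
            yield from find_typed_values(child, wanted_type)


def iter_orcids(path: str) -> Iterator[Tuple[Any, Any]]:
    """Stream a JSON Lines file and yield (record "@id", ORCID) pairs.

    Only one record is held in memory at a time.
    ASSUMPTION: ORCIDs appear as {"@type": "ORCID", "value": ...} objects.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for orcid in find_typed_values(record, "ORCID"):
                yield record.get("@id"), orcid


# Example use (path taken from the script above):
# for record_id, orcid in iter_orcids("data/swepub-duplicated-2020-07-05.jsonl"):
#     print(record_id, orcid)
```

Because `iter_orcids` is a generator, the pairs can be written to a CSV file or collected into a much smaller DataFrame as they arrive, instead of building one 1.5-million-row frame of nested objects.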
Part of the output of ReadSwepub.py:
...
...
149
150
--- 13953.98006105423 seconds ---
/Users/magnus/Documents/GitHub/open-data-examples/ReadSwepub.py:22: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
print("--- %s seconds ---" % (time.time() - start_time))
--- 14524.747683048248 seconds ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1505902 entries, 0 to 1505901
Data columns (total 15 columns):
@context                  1505902 non-null object
@id                       1505902 non-null object
@type                     1505902 non-null object
carrierType               975900 non-null object
editionStatement          22059 non-null object
extent                    170729 non-null object
hasSeries                 117001 non-null object
identifiedBy              1505902 non-null object
indirectlyIdentifiedBy    160 non-null object
instanceOf                1505902 non-null object
meta                      1505902 non-null object
partOf                    1193782 non-null object
provisionActivity         6695 non-null object
publication               1481484 non-null object
usageAndAccessPolicy      248495 non-null object
dtypes: object(15)
memory usage: 172.3+ MB
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
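A note on the `memory usage: 172.3+ MB` line in the output: the `+` means pandas counted only the object columns' pointers, not the nested dicts, lists and strings they point to, so the real footprint is much larger. The arithmetic below is a small sketch that reproduces the reported figure from the row and column counts in the `info()` output, assuming 8-byte pointers (64-bit Python) and ignoring the small `RangeIndex`.

```python
# Figures copied from the df_concat.info() output.
ROWS = 1_505_902
COLUMNS = 15
POINTER_BYTES = 8  # ASSUMPTION: 64-bit Python, one pointer per object cell

# Shallow size: one pointer per cell, nothing for the objects themselves.
shallow_bytes = ROWS * COLUMNS * POINTER_BYTES
shallow_mib = shallow_bytes / 2**20  # pandas reports in units of 1024

print(f"{shallow_mib:.1f} MiB")  # prints "172.3 MiB"
```

To measure the objects as well, pandas offers `df_concat.info(memory_usage="deep")`, at the cost of extra time on a frame this size. The process was killed after `info()` had printed, which is consistent with memory running out during `to_pickle`, although the output itself does not show the cause.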