Merge and extract tgz files from Google Takeout

I recently found that some clowny gist was the top result for 'google takeout multiple tgz'. It used two bash scripts to extract all the tgz files and then merge them together. Don't do that. Use brace expansion, cat the TGZs, and extract:

$ cat takeout-20201023T123551Z-{001..011}.tgz | tar xzivf -

You don't even need to use brace expansion. Globbing sorts the matches lexicographically, and because the part numbers are zero-padded, that is the same as numeric order:

$ cat takeout-20201023T123551Z-*.tgz | tar xzivf -
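To see why the zero padding matters, here is a small sketch. The file names are made up for the demo, and the C locale is pinned so the sort is byte-wise; other locales may collate differently.

```shell
#!/usr/bin/env bash
# Sketch: glob results come back sorted as strings, not as numbers.
# File names are hypothetical; LC_ALL=C pins byte-wise collation.
set -eu
export LC_ALL=C

glob_dir=$(mktemp -d)
touch "$glob_dir"/padded-001.tgz "$glob_dir"/padded-002.tgz "$glob_dir"/padded-010.tgz
touch "$glob_dir"/bare-1.tgz "$glob_dir"/bare-2.tgz "$glob_dir"/bare-10.tgz

# Zero-padded names: string order matches numeric order.
padded_order=$(cd "$glob_dir" && echo padded-*.tgz)
# Unpadded names: bare-10 sorts before bare-2.
bare_order=$(cd "$glob_dir" && echo bare-*.tgz)

echo "$padded_order"   # padded-001.tgz padded-002.tgz padded-010.tgz
echo "$bare_order"     # bare-1.tgz bare-10.tgz bare-2.tgz
```

Takeout pads its part numbers (001, 002, ...), so the plain glob is safe here.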

tar has been around forever; it wasn't designed to need custom scripts to deal with multipart archives. Since it's extracting the combined archive, there's no 'mess of partial directories' to be merged. It just works, as intended.
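If you want to convince yourself before committing to a multi-hour extraction, here is a toy reproduction. It is a sketch that assumes GNU tar and that each part is a complete gzip-compressed tar archive, which is what the need for the i (ignore zeros) flag suggests. All file and directory names are made up.

```shell
#!/usr/bin/env bash
# Sketch: build two independent .tgz parts, then extract them in one pass.
# Assumes GNU tar; every name below is hypothetical.
set -eu

demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src/Takeout/Mail" "$demo_dir/src/Takeout/Drive" "$demo_dir/out"
echo "mail data"  > "$demo_dir/src/Takeout/Mail/a.txt"
echo "drive data" > "$demo_dir/src/Takeout/Drive/b.txt"

# Each part is a complete archive with its own end-of-archive zero blocks.
tar -czf "$demo_dir/part-001.tgz" -C "$demo_dir/src" Takeout/Mail
tar -czf "$demo_dir/part-002.tgz" -C "$demo_dir/src" Takeout/Drive

# Concatenated gzip members decompress as one stream; the i flag tells tar
# to keep reading past the zero blocks that end the first part.
cat "$demo_dir"/part-*.tgz | tar -xzif - -C "$demo_dir/out"

# Both parts land in the same Takeout/ tree.
find "$demo_dir/out" -type f | sort
```

Both files end up under a single Takeout directory, with nothing to merge afterwards.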

An additional tip, courtesy of Dmitriy Otstavnov (@bvc3at): if you have pv available, you can track the progress of the extraction:

> pv takeout-* | tar xzif -
 190GiB 2:37:54 [18.9MiB/s] [==============>                                   ] 30% ETA 5:03:49
@chrishop

chrishop commented Aug 3, 2022

You're now the top result for "how to join .tgz google takeout"
Anyway it's just what I needed, cheers!

@chabala
Author

chabala commented Aug 3, 2022

You're now the top result for "how to join .tgz google takeout"

Mission accomplished then. 😆 Had to displace the bad results with better results.

@justinhartman

I mean, I get that you have a better solution for tgz archives, but the original author of the clowny gist created a solution that a) worked with zips and b) worked for whatever use case he had and felt he wanted to share it. I'm not sure it's fair to dismiss his result just because you don't like it. NB: I use your solution, I just don't like the dismissiveness toward someone creating and sharing a solution, no matter how inefficient it appears.

@chabala
Author

chabala commented Aug 12, 2022

  1. The scripts are also pointless for zip files, which have a similar one liner. At the time, zip downloads from takeout were limited to 2GB per file, versus 50GB per tgz file, so using zip was already a poor choice.
  2. If someone makes a group of scripts that replicate the basic function of unix commands, badly, and people find them and use them because they're the best search result, the world becomes a dumber place. That deserves some slight ridicule. The original gist author removed the gist; only forks of it persist.

@ariccio

ariccio commented Aug 27, 2022

Protip for macOS users: use GNU tar (gtar), not the built-in bsdtar. This just gave me a few days of headaches! It also appears that the built-in Archive Utility won't correctly extract these files if they've first been concatenated with cat.

@bvc3at

bvc3at commented Sep 18, 2022

Using pv instead of cat can help tracking progress of extracting archives:

> pv takeout-* | tar xzif -
 190GiB 2:37:54 [18.9MiB/s] [==============>                                   ] 30% ETA 5:03:49

@chabala
Author

chabala commented Sep 18, 2022

This is a useful enhancement, though I note I didn't have pv installed by default in Ubuntu, so it's perhaps less portable. I'll add it regardless.
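One way around the portability concern is to fall back to cat when pv is missing. This is only a sketch; the function name stream_parts is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: stream files through pv when it is installed, otherwise plain cat.
# stream_parts is a hypothetical helper name, not part of any tool.
stream_parts() {
    if command -v pv >/dev/null 2>&1; then
        pv "$@"     # progress bar goes to stderr, data to stdout
    else
        cat "$@"    # same data, no progress display
    fi
}

# Usage:
#   stream_parts takeout-*.tgz | tar xzif -
```

Either branch writes the same bytes to stdout, so the tar side of the pipe doesn't change.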

@sagz

sagz commented Oct 20, 2022

On MacOS (Mojave or Ventura+), there's a small mod:

pv takeout-* | gtar -xzif -

(if you don't have pv, then install with Homebrew: brew install pv. Also, MacOS uses bsdtar which doesn't support -i ignore zeroes, so use gnutar; installable with brew install gnu-tar)
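If the same command has to run on both Linux and macOS, a small helper can pick whichever binary is GNU tar. This is a sketch: the function name pick_gnu_tar is made up, and it assumes GNU tar identifies itself with the string "GNU tar" in its --version output.

```shell
#!/usr/bin/env bash
# Sketch: print the name of a GNU tar binary, preferring gtar (Homebrew's name).
# pick_gnu_tar is a hypothetical helper name.
pick_gnu_tar() {
    local candidate
    for candidate in gtar tar; do
        if command -v "$candidate" >/dev/null 2>&1 &&
           "$candidate" --version 2>/dev/null | grep -q 'GNU tar'; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no GNU tar found; on macOS try: brew install gnu-tar" >&2
    return 1
}

# Usage:
#   cat takeout-*.tgz | "$(pick_gnu_tar)" -xzif -
```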

@Jinion7

Jinion7 commented Nov 14, 2022

On MacOS (Mojave or Ventura+), there's a small mod:

pv takeout-* | gtar -xzif -

(if you don't have pv, then install with Homebrew: brew install pv. Also, MacOS uses bsdtar which doesn't support -i ignore zeroes, so use gnutar; installable with brew install gnu-tar)

This worked for me seamlessly. Thank you!
(MacBook Pro M1 Max Ventura 13.0.1)

@chasealanbrown

chasealanbrown commented Feb 1, 2023

This works quite well for several tar.gz files that belong to the same backup and are simply split into smaller chunks.

However, I am interested in the case wherein you have several tar.gz files which are from different time points - i.e. an example file chrome_history.json may be partially the same, but with some parts missing or added.

Is there an effective way to leverage rsync to merge the files (append only without remove) in a script?
I have started on some bash/python scripts to cram it together poorly, but I can't shake the feeling that there's a better tool for this.

I realize this is far more complex, and I currently use meld to merge manually, but it would be nice to have an automated solution, such as constructing a union-merge if a diff shows a threshold value of contiguous content match.

Unfortunately, rsync or rsnapshot won't work either, for multiple reasons - they replace changed files with the entire new file (i.e. doubling or worse the storage requirements). Also, it's unlikely that the files are tracked in a way that's amenable to comparing two different directories.

Unfortunately, not striving for this more complex solution leaves pretty awful solutions.
Another good incremental backup solution is borg, so perhaps something like adding all the files with datetimes appended to the filenames could be done.
borg is more appropriate here since de-duplication seems to work on the block level; however, it brings up new problems in read/write capability, since it does not appear that borgfs is a filesystem that was intended to be mounted and used often in production with lots of read/write.
Even if borg solves some of the duplication / space saving problems though, the organization of the information is completely lost by splitting the data across multiple datetime-tagged files.

The use of diff --line-format %L $file1 $file2 seems to be very useful when constructing a merged file for known file matches, so some of the task is just finding file-matches quickly.
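For reference, a toy run of that diff invocation. This is a sketch that assumes GNU diff, with made-up file names and contents. It produces a line-level union, which only makes sense for files where a line-by-line union is meaningful.

```shell
#!/usr/bin/env bash
# Sketch: line-level union of two versions of a file with GNU diff.
# File names and contents are hypothetical.
set -eu

merge_dir=$(mktemp -d)
printf '%s\n' one two three  > "$merge_dir/old.txt"
printf '%s\n' two three four > "$merge_dir/new.txt"

# %L prints each line's content, whether it is only in old, only in new, or in both.
# diff exits 1 when the inputs differ, so only exit codes above 1 count as failure.
status=0
diff --line-format=%L "$merge_dir/old.txt" "$merge_dir/new.txt" \
    > "$merge_dir/merged.txt" || status=$?
[ "$status" -le 1 ]

cat "$merge_dir/merged.txt"   # one, two, three, four (one per line)
```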

Does anyone have any better suggestions?

@chabala
Author

chabala commented Feb 1, 2023

@chasealanbrown

Does anyone have any better suggestions?

Please do not use gist comments as a forum for off topic questions.
