@disulfidebond
Last active February 10, 2020 17:26
stata description and overview

Overview

This gist is designed to provide a brief overview of how Stata is configured, as well as some basic commands.

A concise way to describe Stata would be that Stata is to R as a multitool is to a team of contractors. Stata was designed with a different purpose in mind than Python or R and their libraries.

Its strengths include:

  • Users can either type commands in a command-line interface (CLI) or point and click in a graphical user interface (GUI) to perform the same task
  • Highly customizable charts and graphs can be generated with a single mouse click
  • Although admittedly a bit quirky at first glance, Stata is fairly easy to pick up
  • Documentation is surprisingly informative and readily available via Google searches and within the App

Its weaknesses include:

  • It is possible to design pipelines and workflows within Stata, but this process is neither straightforward nor simple
  • Traditionally, you can only view/load/manipulate one dataset at a time, although you can create subsets from the initial dataset (Stata 16 and later can hold additional datasets in memory as frames)
  • Exporting analyzed data (but not created charts and graphs) can be tricky, and occasionally relies on copy-paste

To put it another way, Stata is a perfect tool at the end of a pipeline or workflow, but decidedly less ideal at the initial or intermediate steps of that workflow.

General Workflow

The general template for work in Stata is:

  1. Load dataset into memory
  2. Get statistics on column/row/whatever within the dataset
  3. Create groupings for the dataset
  4. Perform statistical analyses on the dataset
  5. Visualize the dataset (or the analyses from the data) with charts and figures
  6. (Optional) Write a subset of the dataset to disk using defined parameters
  7. Return to Step 1 for a different dataset, or return to Step 1 for the dataset that you created in Step 6, or return to Step 2 for additional analyses on the current dataset
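
As a rough sketch, the template above could look like the following do-file. It assumes the auto dataset that ships with Stata (loaded with sysuse) rather than any dataset from this gist, and the variable name grp and the file name foreign_cars.dta are made up for illustration.

    * 1. Load a dataset into memory (auto ships with Stata)
    sysuse auto, clear
    * 2. Get statistics on a few variables
    summarize price mpg weight
    * 3. Create a grouping from the combinations of foreign and rep78
    egen grp = group(foreign rep78), label
    * 4. Perform a statistical analysis (one-way ANOVA of mpg across groups)
    anova mpg grp
    * 5. Visualize the data
    graph box mpg, over(grp)
    * 6. (Optional) Keep a subset and write it to disk (hypothetical file name)
    keep if foreign == 1
    save "foreign_cars.dta", replace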

Start to Finish Examples

Sources: https://youtu.be/c5btifh3EPE

https://stats.idre.ucla.edu/stata/faq/how-can-i-see-the-number-of-missing-values-and-patterns-of-missing-values-in-my-data-file/

https://www.stata.com/support/ssc-installation/

https://www.stata.com/help.cgi?summarize

https://youtu.be/YMt5K68ZvjQ

Install a missing package in Stata

A missing package usually shows up as an 'unrecognized command' error (return code r(199)). Packages can be downloaded and installed manually, or installed over the internet; in either case, use caution and install only from sources you trust.

  • Use the Boston College Statistical Software Components (SSC) repository

     . ssc install PACKAGE
    
  • Install from the internet, where URL is the location that hosts the package

     . net install PACKAGE, from(URL)
    
  • Install via the GUI

     Search for the package name under Help, and then click to install the package
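
One common way to avoid the error in a do-file is to check for a command before using it and install it only if it is missing. The sketch below assumes the package is hosted on SSC and uses mdesc purely as an example name.

    * which exits with a nonzero return code if the command cannot be found
    capture which mdesc
    if _rc != 0 {
        ssc install mdesc
    }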
    

Save the current dataset

Select 'Save' under the File menu to save your current dataset, or the subset of the dataset that you've created.
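
The same thing can be done from the command line with the save command, which is handy in a do-file. A minimal sketch, where the file name mysubset.dta is a placeholder:

    * Write the dataset currently in memory to disk, overwriting any existing file
    save "mysubset.dta", replace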

Import a dataset and get summary statistics on it

The commands below were run in order, and all of them use the dataset from stats.idre.ucla.edu

  1. Load the dataset with the use command; the clear option discards whatever is currently in memory first

    use https://stats.idre.ucla.edu/stat/stata/notes/hsb1, clear
    
  2. Get summary statistics (observations, mean, SD, min, max) for the variables read, write, math, science, and socst with the summarize command

    summarize read write math science socst
    
  3. Get n, mean, SD, and quartiles for the same variables with univar, a user-written command that must be installed first (see above)

    univar read write math science socst
    
  4. Show statistics on missing values for the whole dataset or for specified variables using mdesc or tabmiss, both of which are user-written commands. In either case, if no variables are listed, statistics for every variable in the dataset are shown

    mdesc read write
    tabmiss read write
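
If the user-written commands above are not available, built-in commands can produce similar output. The sketch below assumes the hsb1 dataset from step 1 is still in memory; tabstat and misstable ship with Stata, though their output is laid out differently from univar and mdesc.

    * n, mean, SD, and quartiles for each variable, one variable per row
    tabstat read write math science socst, statistics(n mean sd p25 p50 p75) columns(statistics)
    * Counts of missing values for the listed variables
    misstable summarize read write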
    

Group column names and then generate a boxplot

  1. Load the dataset

    use https://stats.idre.ucla.edu/stat/stata/notes/hsb1, clear
    
  2. Create a grouping variable, important, that numbers each unique combination of read, write, and math values

    egen important = group(read write math), label
    
  3. (Optional) Open the Data Editor to view or edit the new variable important

    edit important
    
  4. Sort by important and, within each group, get the n, mean, SD, min, and max of math, science, and read

    by important, sort: summarize math science read
    
  5. Run a one-way ANOVA that tests whether the mean science score differs across the groups in important

    anova science important
    
  6. Generate a boxplot of science for each group, labeling outliers with their values from the 'id' column

    graph box science, over(important) marker(1, mlabel(id))
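
Once the boxplot is on screen, it can also be written to disk without the mouse. A minimal sketch, assuming the graph from step 6 is the most recent one drawn; the file name science_boxplot.png is a placeholder.

    * Export the current graph; the format is inferred from the file extension
    graph export "science_boxplot.png", replace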
    

Output to a CSV file

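This is a bit wonky. The examples here use estout, a user-written package (install it with ssc install estout) that exports stored results rather than the raw dataset. First, the syntax:

   . estout using "PATH/TO/OUTPUT/FILE", replace cells(LIST OF CELLS YOU WANT IN THE FILE) delimiter(",")
  • The directories in the path to the output file must exist. With the replace option, an existing file at that path is overwritten.

  • cells() takes a space-separated list of the stored statistics to export. If you omit cells(), estout exports only the point estimates (b). Without delimiter(","), the output is tab-separated.

  • Note that the names in cells() refer to stored results, not to dataset variables. To export summary statistics, store them first, for example with estpost summarize (part of the same package), and then list statistics such as mean and sd as the cells to export.

  • Concrete example that will create the file 'examplecsv.csv' with one row per variable (math, read, write) and one column per statistic:

     . estpost summarize math read write
     . estout using "C:\Users\jrcaskey\Documents\examplecsv.csv", replace cells("mean sd min max") delimiter(",")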
    
  • When in doubt, create a table, select the table with the mouse, right-click, select 'copy as table', then paste it into an Excel Spreadsheet.
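
If the goal is to write the dataset's columns themselves to a CSV file, the built-in export delimited command, available in Stata 13 and later, may be simpler. A sketch, assuming a dataset containing math, read, and write is in memory and using a placeholder file name:

    * Write only the listed variables to a comma-separated file with a header row
    export delimited math read write using "examplecsv.csv", replace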
