disulfidebond/stata_overview.md

## stata_overview.md

      
Overview

This gist is designed to provide a brief overview of how Stata is configured, as well as some basic commands.
A concise way to describe Stata would be that Stata is to R as a multitool is to a team of contractors. Stata has been designed with a different purpose in mind than Python or R libraries.
Its strengths include:

- You can either type commands in a command-line interface (CLI) or point and click in a graphical user interface (GUI) to run the same command
- Highly customizable charts and graphs can be generated with a single mouse click
- Although admittedly a bit quirky at first glance, Stata is fairly easy to pick up
- Documentation is surprisingly informative and readily available via Google searches and within the app

Its weaknesses include:

- It is possible to design pipelines and workflows within Stata, but this process is neither straightforward nor simple
- You can only view/load/manipulate one dataset at a time, but you can create subsets from the initial dataset
- Exporting analyzed data (but not created charts and graphs) can be tricky, and occasionally relies on copy-paste

To put it another way, Stata is a perfect tool at the end of a pipeline or workflow, but decidedly less ideal at an initial or intermediate step of that workflow.

General Workflow

The general template for work in Stata is:

1. Load dataset into memory
2. Get statistics on the columns or rows of interest within the dataset
3. Create groupings for the dataset
4. Perform statistical analyses on the dataset
5. Visualize the dataset (or the analyses from the data) with charts and figures
6. (Optional) Write a subset of the dataset to disk using defined parameters
7. Return to Step 1 for a different dataset, or return to Step 1 for the dataset that you created in Step 6, or return to Step 2 for additional analyses on the current dataset
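
The steps above can be strung together in a do-file, which is the usual way to script Stata. The sketch below is illustrative rather than definitive. It assumes the hsb1 dataset used in the examples later in this gist; the 50-point cutoff, the high_math column, and the output file name are all hypothetical.

```stata
* workflow_sketch.do: illustrative only; the cutoff and file names are hypothetical

* Step 1: load the dataset into memory, discarding whatever was loaded before
use https://stats.idre.ucla.edu/stat/stata/notes/hsb1, clear

* Step 2: get statistics on a few columns
summarize read write math science

* Step 3: create a grouping (1 = math score of 50 or more, 0 = below 50)
generate byte high_math = math >= 50 if !missing(math)

* Step 4: run a statistical analysis across the two groups
ttest science, by(high_math)

* Step 5: visualize the groups
graph box science, over(high_math)

* Step 6 (optional): keep a subset and write it to disk
keep if high_math == 1
save "high_math_subset.dta", replace
```

Running `do workflow_sketch.do` executes the steps in order. Because `keep` drops rows from memory, Step 7 would start again with `use`.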

Start to Finish Examples

Sources:
- https://youtu.be/c5btifh3EPE
- https://stats.idre.ucla.edu/stata/faq/how-can-i-see-the-number-of-missing-values-and-patterns-of-missing-values-in-my-data-file/
- https://www.stata.com/support/ssc-installation/
- https://www.stata.com/help.cgi?summarize
- https://youtu.be/YMt5K68ZvjQ

Install a missing package in Stata

A missing package usually shows up as the error 'command ... is unrecognized'. Packages can be downloaded and installed manually, or installed directly over the internet; use caution either way.


Use Boston College Statistical Software Components (SSC) repository
 . ssc install PACKAGE


Install from the internet, giving the URL that hosts the package
 net install PACKAGE, from(URL)


Install via GUI
 Search for the package name under Help, and then click to install the package
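
A do-file can check for a package before using it, so that the 'unrecognized' error never appears. This is a minimal sketch, assuming the estout package (used later in this gist) is the one that is needed and that the machine can reach the SSC repository.

```stata
* Install estout from SSC only if Stata cannot already find it
capture which estout
if _rc != 0 {
    ssc install estout
}

* Show where the installed command lives
which estout
```

The `capture` prefix suppresses the error from `which` and leaves the return code in `_rc`, which is nonzero when the command is not found.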


Save the current dataset

Select 'Save' under the File menu to save your current dataset, or the subset of the dataset that you've created.
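
The same can be done from the command line with the save command. A minimal sketch, where my_subset.dta is a hypothetical file name:

```stata
* Save the data in memory to a file; replace overwrites an existing file of that name
save "my_subset.dta", replace
```

Without a directory in the file name, the file is written to the current working directory, which `pwd` displays.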

Import a dataset and get summary statistics on it

The commands below are run in order, and all of them use the dataset from stats.idre.ucla.edu


Load dataset using the use command, and clear the current contents of memory using the clear command
use https://stats.idre.ucla.edu/stat/stata/notes/hsb1, clear


Get summary statistics from column names (read, write, math, science, socst in this example) with the summarize command
summarize read write math science socst


Get n, mean, SD, and quartiles for the listed columns with univar (a user-written command that must be installed first)
univar read write math science socst


Show statistics for missing values for the dataset or for specified columns using mdesc or tabmiss (both user-written commands that must be installed first). In either case, if no columns are listed, statistics for the entire dataset are shown
mdesc read write
tabmiss read write
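
If installing extra packages is not an option, built-in commands cover similar ground. The sketch below is an alternative to univar, mdesc, and tabmiss rather than a description of how they work, and it assumes the hsb1 dataset is already loaded.

```stata
* Percentiles (including quartiles), plus mean and SD, from the built-in summarize
summarize read write, detail

* A compact table of n, mean, SD, and quartiles
tabstat read write math science socst, statistics(n mean sd p25 p50 p75)

* Built-in report of missing values; only columns that have missing values are listed
misstable summarize read write
```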


Group column names and then generate a boxplot


Load the dataset
use https://stats.idre.ucla.edu/stat/stata/notes/hsb1, clear


Create a grouping column named important that numbers each distinct combination of values in the read, write, and math columns
egen important = group(read write math), label


(Optional) Open the important column in the Data Editor to inspect or edit it
edit important


Within each group of important, get the number of observations, mean, SD, min, and max of math, science, and read (the sort option sorts the data by important first)
by important, sort: summarize math science read


Run a one-way ANOVA of the science score across the groups in important
anova science important


Generate a boxplot of science for each group, using values from the 'id' column to label outliers
graph box science, over(important) marker(1, mlabel(id))
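
To keep the chart without using the mouse, the graph that is currently displayed can be written to disk with graph export. A minimal sketch, assuming the boxplot above has just been drawn; science_boxplot.png is a hypothetical file name.

```stata
* Write the graph currently displayed to a PNG file, overwriting any existing file
graph export "science_boxplot.png", replace
```

The output format is chosen from the file extension.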


Output to a CSV file

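This is a bit wonky. estout is a user-written command (available from SSC) that writes stored estimation results to a file, so an estimation command has to be run first. The syntax is below; the entries in cells() are separated by spaces, and delimiter(",") requests commas instead of the default tabs:
   . estout using "PATH/TO/OUTPUT/FILE", replace delimiter(",") cells(list of cells you want in CSV file)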


The directories in the path to the output file must already exist, and the replace option overwrites the file if it is already there.


If you omit cells(), estout exports only the point estimates (the b element) of the stored results; after a command such as mean, those are the averages.


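Note that Stata stores values that you generate as new columns in the dataset. cells() names stored results (for example b or se) rather than dataset columns, so a generated column such as 'sum_stats' has to be summarized or used in an estimation command before estout can export anything about it.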


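Concrete example that creates the file 'examplecsv.csv' with one row each for math, read, and write and one column per statistic, using estpost (part of the estout package) to store the summary statistics first:
 . estpost summarize math read write
 . estout using "C:\Users\jrcaskey\Documents\examplecsv.csv", replace delimiter(",") cells("mean sd min max")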


When in doubt, create a table, select the table with the mouse, right-click, select 'copy as table', then paste it into an Excel spreadsheet.
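
When the goal is to export the data itself rather than estimation results, the built-in export delimited command (Stata 13 and later) is an alternative to estout. The sketch below assumes the hsb1 dataset is loaded; the output file names and the mean_* column names are hypothetical.

```stata
* Write three columns of the dataset in memory to a CSV file
export delimited read write math using "examplecsv_columns.csv", replace

* Write the column means instead: collapse replaces the data in memory,
* so preserve/restore puts the original dataset back afterwards
preserve
collapse (mean) mean_read=read mean_write=write mean_math=math
export delimited using "examplecsv_means.csv", replace
restore
```

The `preserve`/`restore` pair matters here because Stata holds only one dataset in memory at a time.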