@lihaoyi
Last active June 24, 2021 01:09
Scala Syntax Popularity, from Least Popular to Most Popular Syntactic Forms

These numbers are taken from the FastParse test suite, which runs over the following libraries:

  • fastparse
  • scalaJs
  • scalaz
  • shapeless
  • akka
  • lift
  • play
  • PredictionIO
  • spark
  • sbt
  • cats
  • finagle
  • kafka
  • breeze
  • spire
  • saddle
  • scala

The suite counts how often each parser rule succeeds. The run covers more than 15,000 files and more than 12,000,000 lines of code.
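
To make "how often each rule succeeds" concrete, here is a minimal sketch of per-rule success counting. It does not use FastParse or reproduce how its test suite collects these numbers: the `Parser` type, the `rule` wrapper, and the rule names `DigitChar` and `Number` are all hypothetical, chosen only for illustration.

```scala
import scala.collection.mutable

object RuleCounting {
  // Toy parser type (hypothetical, not FastParse's API): given the input and a
  // start index, return the index just past the match, or None on failure.
  type Parser = (String, Int) => Option[Int]

  // Success tally per rule name; unseen rules read as 0.
  val counts: mutable.Map[String, Int] =
    mutable.Map.empty[String, Int].withDefaultValue(0)

  // Wrap a parser so that every successful match bumps the counter for `name`.
  def rule(name: String)(p: Parser): Parser = (in, i) => {
    val res = p(in, i)
    if (res.isDefined) counts(name) += 1
    res
  }

  // Match a single character satisfying `pred`.
  def charWhere(pred: Char => Boolean): Parser =
    (in, i) => if (i < in.length && pred(in(i))) Some(i + 1) else None

  // Match `p` one or more times.
  def rep1(p: Parser): Parser = (in, i) => {
    var pos = i
    var next = p(in, pos)
    while (next.isDefined) { pos = next.get; next = p(in, pos) }
    if (pos > i) Some(pos) else None
  }

  // Two illustrative rules: a single digit, and a run of digits.
  val DigitChar: Parser = rule("DigitChar")(charWhere(_.isDigit))
  val Number: Parser = rule("Number")(rep1(DigitChar))
}
```

Calling `RuleCounting.Number("123", 0)` returns `Some(3)` and leaves `counts` at 3 for `DigitChar` and 1 for `Number`. In a scheme like this, small rules that are attempted at many positions accumulate far larger counts than rules for whole constructs.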

The raw numbers are below:

Rule Count
Ideographic 0
PI 0
EmptyElemTagPEnd 0
CDStart 2
CData 2
CDSect 2
CDEnd 2
CharRef 3
ScalaPatterns 4
Patterns 4
ContentP 7
ETagP 7
XmlPattern 7
ElemPattern 7
TagPHeader 7
STagPEnd 7
EntityRef 24
Reference 25
OctalEscape 73
ClsAnnot 89
DoWhile 109
do 109
<% 130
CharA 160
Assign 168
ClassQualifier 171
ImplicitLambda 179
Digit 189
" "
PkgObj 312
Implicit 316
EmptyElemTagEnd 378
EarlyDefTmpl 385
forSome 392
ExistentialClause 392
PkgBlock 399
432
# 806
macro 807
XmlExpr 1099
finally 1129
Finally 1130
SelfType 1243
sealed 1597
>: 1620
Refinement 1629
Return 1639
return 1640
ScalaExpr 1715
PostFix 1754
yield 1774
Binding 1810
UnicodeEscape 1881
FloatType 1890
"*" 1908
Exp 1909
Symbol 1945
Eq 1963
Attribute 1964
AttValue 1964
_* 2067
Enumerator 2370
Content 2396
ETag 2396
STagEnd 2397
{ 2419
super 2547
XmlContent 2634
Content1 2744
abstract 2769
Element 2774
TagHeader 2775
Catch 2799
catch 2799
CharData 3075
HexNum 3207
BacktickId 3233
While 3549
while 3657
ThisPath 3713
ThisSuper 3714
Try 3782
try 3782
Guard 4210
lazy 4585
TripleTail 4616
TripleChars 4616
throw 4963
Throw 4963
protected 5736
Enumerators 5935
For 5935
for 5936
TypeDef 6196
<: 6568
6677
ClsArgMod 6797
PatLiteral 7126
Name 7169
BaseChar 7170
XNameStart 7170
HexDigit 7521
<- 7565
Generator 7998
final 8851
PlainIdNoDollar 9224
TQ 9231
Selectors 9559
EscapedChars 9771
TopPkgSeq 9842
AccessQualifier 10200
with 10811
AscriptionType 10969
type 11060
TypePattern 11661
TypePat 11681
TraitDef 11894
trait 11896
TypeArgList 12882
Ascription 12988
QualId 13092
Annot 13194
CharQ 13375
match 13379
package 13574
var 13900
CompilationUnit 15064
@ 15097
TopStatSeq 15434
AllArgs 16338
Thingy 16563
ClsArgs 16609
ExprPrefix 18739
ObjDef 19601
object 19615
this 19755
Variant 19860
NameStartChar 20087
implicit 20323
NameChar 20354
FunTypeArgs 20711
override 21124
else 21729
Else 21730
Selector 23101
Bool 23596
Thing 24398
CaseClauses 24502
private 24904
Thing2 26236
Float 26332
MatchAscriptionSuffix 26371
LambdaRhs 26849
ClsDef 27830
class 27885
LocalMod 27970
AccessMod 30636
ClsArg 30743
extends 33124
LetterDigitDollarUnderscore 33174
If 33797
TupleEx 36982
ExtractorArgs 36986
if 37993
ParenedLambda 39642
TmplBody 49989
Char1 51177
DefTmpl 51319
New 53019
new 53026
Import 54099
import 54100
ImportExpr 54280
CtxBounds 55627
TypeArg 55634
Parened 56529
Char 57771
CaseClause 59454
Mod 79023
TopStat 83166
NamedTmpl 85746
Constrs 85747
AnonTmpl 86155
Constr 94889
Args 97834
=> 99875
FunArgs 114242
BlockExpr 119049
_ 121207
Pattern 123107
TypeOrBindPattern 131584
val 139110
ValVarDef 146190
FunArg 146440
Tmpl 148466
FunSig 148928
FunDef 149123
def 149142
Int 154846
} 171254
SingleChars 171605
String 176462
case 177082
"}" 183623
BlockEnd 184599
Block 184996
TypeArgs 190898
"{" 202803
OpChar 203886
Types 205562
Operator 219160
InfixSuffix 228631
"[" 234697
"]" 234701
Extractor 238699
TmplStat 240386
VarId 255910
InfixPattern 266774
BindPattern 266951
SimplePattern 272459
Dcl 301426
= 319508
BlockDef 321937
: 323664
Body 343983
DecNum 358460
ExprLiteral 393978
Literal 400798
MultilineComment 417655
SameLineCharChunks 422960
LineComment 428650
BlockStat 434312
"." 487694
"," 497062
PostDotCheck 527584
ArgList 532846
ParenArgList 542139
Exprs 553102
PostfixType 607743
InfixType 608010
Type 612167
Unbounded 615180
CompoundType 628819
Letter 648719
TypeBounds 670383
AnnotType 723032
SimpleType 737524
BasicType 738579
O 779453
")" 827263
"(" 828441
Comment 845933
Prelude 875773
W 1076937
Semis 1132011
Path 1204887
AlphabetKeywords 1248607
SymbolicKeywords 1484209
PostfixLambda 1538313
PostfixExpr 1545698
PostfixSuffix 1566877
SmallerExprOrLambda 1579317
Expr 1627600
SimpleExpr 1767224
PrefixExpr 1771497
ExprSuffix 1799842
Upper 1938985
IdPath 2189408
StableId 2195822
OneNLMax 2345703
ConsumeComments 2353451
Semi 2530044
CommentChunk 2646017
Keywords 2716775
NotNewline 5355498
VarId0 5860421
Lower 5890725
IdUnderscoreChunk 6689840
IdRest 7467894
PlainId 7710709
WS 12040535
Newline 12456005
Id 12580572
WSChars 42329645
WL 62297135
@propensive

Interesting! I don't know what half the things actually represent, but I'd be intrigued to see a list of the ratio of features used by each library as a proportion of these totals, i.e. for each library list the features they use significantly more/less than average. @jrudolph did some similar research for a talk at Scala Days 2014.
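
One possible reading of that per-library comparison, as a sketch: divide each rule's share of a library's successes by the same rule's share across all libraries combined, so values above 1 mean "used more than average" and values below 1 mean "used less". The gist only publishes aggregate counts, so the library names and numbers in `sample` are made up for illustration.

```scala
object RelativeUsage {
  // counts: library -> (rule -> success count).
  // Result: library -> (rule -> ratio), where ratio is the rule's share of that
  // library's successes divided by the rule's share across all libraries.
  def relativeUsage(
      counts: Map[String, Map[String, Long]]
  ): Map[String, Map[String, Double]] = {
    // Sum each rule's count over every library.
    val overall: Map[String, Long] =
      counts.values.flatten.groupBy(_._1).map { case (rule, kvs) =>
        rule -> kvs.map(_._2).sum
      }
    val overallTotal = overall.values.sum.toDouble

    counts.map { case (lib, rules) =>
      val libTotal = rules.values.sum.toDouble
      lib -> rules.map { case (rule, n) =>
        rule -> (n / libTotal) / (overall(rule) / overallTotal)
      }
    }
  }

  // Hypothetical per-library counts (NOT taken from the gist).
  val sample: Map[String, Map[String, Long]] = Map(
    "libA" -> Map("forSome" -> 3L, "yield" -> 1L),
    "libB" -> Map("forSome" -> 1L, "yield" -> 3L)
  )
}
```

With the made-up `sample`, `relativeUsage` reports `forSome` at 1.5 for `libA` and 0.5 for `libB`. A library with no successes at all would produce NaN ratios, which a real analysis would need to handle.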

@malcolmgreaves

Cool stuff! Is there a mapping from rule name to definition? For example, what does "WL" mean? It's the most common thing here.

And, if you do another run, throw in the rapture libraries 😃. They contain a lot of interesting Scala code.

Wondering what the breakdown of that library's idioms vs. "mainstream" is. Maybe we could do some topic modeling or clustering on these rules to see what the "neighborhoods" of code structure are in Scala. For example, maybe there's a Scalaz/cats/rapture space that is distinct from a Spark/"Scala as Java" space.
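
As a rough starting point for that kind of clustering, one could compare libraries by the cosine similarity of their rule-count profiles. This is only a sketch of one common similarity measure, not something the gist computes; the `cosine` helper and the idea of representing each library as a rule-to-count map are assumptions.

```scala
object RuleSimilarity {
  // Cosine similarity between two rule-count profiles.
  // A rule missing from one profile is treated as a count of 0.
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.iterator.map { case (rule, x) => x * b.getOrElse(rule, 0.0) }.sum
    def norm(m: Map[String, Double]): Double =
      math.sqrt(m.values.map(x => x * x).sum)
    val denom = norm(a) * norm(b)
    // Define the similarity of an all-zero profile to anything as 0.
    if (denom == 0.0) 0.0 else dot / denom
  }
}
```

Profiles that use the same rules in the same proportions score close to 1, and profiles with no rules in common score 0. Pairwise scores like these could then feed a standard clustering method.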
