XML Streaming library for Scala (xs4s)
Capabilities
xs4s enables the processing of large (multi-gigabyte) XML files in Scala, for example .xml.gz files straight from Wikipedia (example below), without running out of memory.
In terms of specific features, xs4s offers:
- Scala-friendly utilities around the javax.xml.stream.events API.
- A mapping from StAX to scala.xml.Elem and other Scala XML classes.
- An alternative method of parsing XML to scala.xml.XML.load(), for example assert(xs4s.XML.loadString("<test/>") == <test/>).
- An integration with FS2 and ZIO for pure FP streaming.
Release notes / change log
Version | Date | Changes |
---|---|---|
v0.9.1 | 2021-07-27 | Scala 3.0.1 support; FS2 v3 (cats-effect 3) support; fs2 performance improvement |
v0.8.7 | 2021-03-16 | Scala 3.0.0-RC1 support |
v0.8.5 | 2020-12-07 | Latest ScalaTest |
v0.8.0 | 2020-07-02 | ZIO support; latest FS2 |
v0.7.0 | 2020-05-16 | FS2 support; latest libraries; Scala 2.13 support |
v0.5 | 2017-12-07 | Cross-compile to both Scala 2.12 and 2.11 |
v0.4 | 2017-10-05 | Update to Scala 2.12 |
v0.3 | 2016-09-19 | Upgrades to many examples and slimming down API |
v0.2 | 2016-04-05 | Simplify code |
v0.1 | 2015-02-03 | Initial release |
How it does it
xs4s uses the standard Java XML streaming API (StAX), with Woodstox (https://github.com/FasterXML/woodstox) as the back-end. As events arrive it gradually forms a partial tree; when a user-supplied function (the "query") matches, it materialises that partial tree into a full tree, which it returns to the user.
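The idea described above can be sketched with nothing but the JDK's StAX API. This is not xs4s's actual implementation: the names PartialTreeSketch, events and elementTexts are hypothetical, and to stay dependency-free the sketch materialises only the text of each matching element rather than a scala.xml.Elem. It assumes matching by local element name, similar in spirit to a "filter by name" query.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import javax.xml.stream.events.XMLEvent

object PartialTreeSketch {

  /** Exposes a StAX reader as a lazy Scala Iterator of events. */
  def events(reader: XMLEventReader): Iterator[XMLEvent] =
    new Iterator[XMLEvent] {
      def hasNext: Boolean = reader.hasNext
      def next(): XMLEvent = reader.nextEvent()
    }

  /** Lazily yields the concatenated text of every element whose local name is `name`. */
  def elementTexts(xml: String, name: String): Iterator[String] = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    var depth = 0                  // nesting depth inside a matching element; 0 means outside
    val buffer = new StringBuilder // the "partial tree": only what the current match needs

    def step(event: XMLEvent): Option[String] =
      if (event.isStartElement) {
        // Start capturing at a matching element, and track nested elements inside it.
        if (depth > 0 || event.asStartElement.getName.getLocalPart == name) depth += 1
        None
      } else if (event.isCharacters) {
        if (depth > 0) buffer.append(event.asCharacters.getData)
        None
      } else if (event.isEndElement && depth > 0) {
        depth -= 1
        if (depth == 0) {
          // The matching element is complete: emit it and drop the buffered state.
          val text = buffer.toString
          buffer.clear()
          Some(text)
        } else None
      } else None

    events(reader).flatMap(event => step(event))
  }
}
```

Because the result is an Iterator, only the element currently being captured is held in memory; everything outside a match is discarded as soon as it is read. For example, PartialTreeSketch.elementTexts("<page><anchor>Scala</anchor></page>", "anchor").toList yields a single-element list.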
Getting started
Add the following to your build.sbt (compatible with Scala 3.0.1, Scala 2.13 and 2.12 series):
libraryDependencies += "com.scalawilliam" %% "xs4s-core" % "0.9.1"
// for cats-effect 2
libraryDependencies += "com.scalawilliam" %% "xs4s-fs2" % "0.9.1"
// for cats-effect 3
libraryDependencies += "com.scalawilliam" %% "xs4s-fs2v3" % "0.9.1"
libraryDependencies += "com.scalawilliam" %% "xs4s-zio" % "0.9.1"
Examples
FS2 Streaming
Then, you can implement functions such as the following (BriefFS2Example - note the explicit types are for clarity):
/**
  * @param byteStream Could be, for example, fs2.io.readInputStream(inputStream)
  */
def extractAnchorTexts(byteStream: Stream[IO, Byte]): Stream[IO, String] = {
  /** extract all elements called 'anchor' **/
  val anchorElementExtractor: XmlElementExtractor[Elem] =
    XmlElementExtractor.filterElementsByName("anchor")

  /** Turn into XMLEvent */
  val xmlEventStream: Stream[IO, XMLEvent] =
    byteStream.through(byteStreamToXmlEventStream())

  /** Collect all the anchors as [[scala.xml.Elem]] */
  val anchorElements: Stream[IO, Elem] =
    xmlEventStream.through(anchorElementExtractor.toFs2PipeThrowError)

  /** And finally extract the text contents for each Elem */
  anchorElements.map(_.text)
}
ZIO Streaming
Then, you can implement functions such as the following (BriefZIOExample - note the explicit types are for clarity):
/**
  * @param byteStream Could be, for example, zio.stream.Stream.fromInputStream(inputStream)
  */
def extractAnchorTexts[R <: Blocking](byteStream: ZStream[R, IOException, Byte]):
  ZStream[R, Throwable, String] = {
  /** extract all elements called 'anchor' **/
  val anchorElementExtractor: XmlElementExtractor[Elem] =
    XmlElementExtractor.filterElementsByName("anchor")

  /** Turn into XMLEvent */
  val xmlEventStream: ZStream[R, Throwable, XMLEvent] =
    byteStream.via(byteStreamToXmlEventStream()(_))

  /** Collect all the anchors as [[scala.xml.Elem]] */
  val anchorElements: ZStream[R, Throwable, Elem] =
    xmlEventStream.via(anchorElementExtractor.toZIOPipeThrowError)

  /** And finally extract the text contents for each Elem */
  anchorElements.map(_.text)
}
Plain Iterator streaming
Alternatively, there is a plain-Scala API, useful where you interact with legacy Java code or are not yet comfortable with pure FP (BriefPlainScalaExample):
def extractAnchorTexts(sourceFile: File): Unit = {
  val anchorElementExtractor: XmlElementExtractor[Elem] =
    XmlElementExtractor.filterElementsByName("anchor")
  val xmlEventReader = XMLStream.fromFile(sourceFile)
  try {
    val elements: Iterator[Elem] =
      xmlEventReader.extractWith(anchorElementExtractor)
    val text: Iterator[String] = elements.map(_.text)
    text.foreach(println)
  } finally xmlEventReader.close()
}
Advanced Wikipedia example
This example counts the popularity of Wikipedia anchors from their abstract documentation.
It does several things at once:
- Reads from a streaming URL
- Passes the bytes through a GZip decoder
- Parses the XML
- Runs a map-reduce over the Wikipedia data
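The shape of this pipeline can be sketched with the JDK alone. This is an illustration, not the code of the example apps: GzipXmlCountSketch, gzip and countAnchors are hypothetical names, an in-memory byte array stands in for the streaming URL, and the "reduce" step is assumed to be a simple count of how often each anchor text occurs.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import javax.xml.stream.XMLInputFactory
import scala.collection.mutable

object GzipXmlCountSketch {

  /** Compresses a string in memory, so the sketch needs no network or file access. */
  def gzip(text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bytes)
    try out.write(text.getBytes(StandardCharsets.UTF_8)) finally out.close()
    bytes.toByteArray
  }

  /** Decompresses, parses and counts in one pass, without loading the whole document. */
  def countAnchors(gzipped: InputStream): Map[String, Int] = {
    // GZip decoding and XML parsing are both layered directly on the input stream.
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new GZIPInputStream(gzipped))
    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
    val current = new StringBuilder
    var inAnchor = false
    try {
      while (reader.hasNext) {
        val event = reader.nextEvent()
        if (event.isStartElement && event.asStartElement.getName.getLocalPart == "anchor") {
          inAnchor = true
        } else if (inAnchor && event.isCharacters) {
          // "map": extract the text of the current anchor
          current.append(event.asCharacters.getData)
        } else if (inAnchor && event.isEndElement &&
                   event.asEndElement.getName.getLocalPart == "anchor") {
          // "reduce": fold the finished anchor into the running counts
          counts(current.toString) += 1
          current.clear()
          inAnchor = false
        }
      }
    } finally reader.close()
    counts.toMap
  }
}
```

To try it, pass new java.io.ByteArrayInputStream(GzipXmlCountSketch.gzip(xml)) to countAnchors. Swapping that stream for one opened from a URL or a file would leave the rest unchanged, since every stage consumes its input incrementally.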
The main example is in FindMostPopularWikipediaKeywordsFs2App or FindMostPopularWikipediaKeywordsZIOApp.
There is also a plain Scala example (using Iterator) in FindMostPopularWikipediaKeywordsPlainScalaApp.
$ git clone https://github.com/ScalaWilliam/xs4s.git
$ cd xs4s
$ sbt "examples/runMain xs4s.example.FindMostPopularWikipediaKeywordsFs2App"
$ sbt "examples/runMain xs4s.example.FindMostPopularWikipediaKeywordsZIOApp"
$ sbt "examples/runMain xs4s.example.FindMostPopularWikipediaKeywordsPlainScalaApp"
This can consume files ranging from 100MB to 4GB without running out of memory, and it does so quickly. It converts XML streams into Scala XML trees on demand, which you can then query.