Skip to main content
Version: v4.28

Parse nested data

Logs are often a mess. There's no other way to put it. Here's an example that we found in the wild just recently:

2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png

Let's take this apart. First, we notice that this looks like Syslog. Let's throw the Syslog parser at it.

load_file "sample.txt"
read_syslog
{
  facility: null,
  severity: null,
  timestamp: 2023-06-09T05:51:36.433Z,
  hostname: "mrs",
  app_name: "www",
  process_id: "929",
  content: "I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png",
}

The content field appears to contain the actual log message. It has four sections:

  • I103 looks like some identifier.
  • 1703182696 looks like a Unix timestamp in seconds.
  • request looks like it specified an action.
  • baddr=194.1.1.1 bport=58630 … looks like a list of key-value pairs.

Let's start by taking the content field apart:

load_file "sample.txt"
read_syslog
content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")
{
  facility: null,
  severity: null,
  timestamp: 2023-06-09T05:51:36.433Z,
  hostname: "mrs",
  app_name: "www",
  process_id: "929",
  content: {
    id: "I103",
    ts: 1703182696,
    action: "request",
    params: "baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png",
  },
}

Let's further take apart the content.params:

load_file "sample.txt"
read_syslog
content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")
content.params = content.params.parse_kv()
{
  facility: null,
  severity: null,
  timestamp: 2023-06-09T05:51:36.433Z,
  hostname: "mrs",
  app_name: "www",
  process_id: "929",
  content: {
    id: "I103",
    ts: 1703182696,
    action: "request",
    params: {
      baddr: 194.1.1.1,
      bport: 58630,
      c_ctype: "image/png",
      caddr: 194.1.1.1,
      cin: 860,
      cout: 898,
      cport: 58630,
      duration: 0.0,
      forwarded_for: "192.1.1.1, 191.1.1.1",
      laddr: 127.128.11.145,
      lport: 1,
      method: "GET",
      paddr: 172.0.0.1,
      pport: 80,
      proto: 6,
      relay_name: "wwwserver",
      response_code: 200,
      rnum: 6489895,
      s_ctype: "image/png",
      saddr: 172.0.0.1,
      sin: 898,
      sout: 819,
      sport: 80,
      status: "OK",
      url: "http://test.de/images/icons/144x144.png",
    },
  },
}

Let's fix up a few types:

load_file "sample.txt"
read_syslog
content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")
content.params = content.params.parse_kv()
// Interpret the timestamp as a seconds since Unix epoch.
content.ts = from_epoch(content.ts * 1s)
// Interpret the duration parameter as seconds.
content.params.duration = content.params.duration * 1s
// Split the forwarded_for field into an array of IP addresses.
content.params.forwarded_for = content.params.forwarded_for.split(", ").map(x, x.ip())
{
  facility: null,
  severity: null,
  timestamp: 2023-06-09T05:51:36.433Z,
  hostname: "mrs",
  app_name: "www",
  process_id: "929",
  content: {
    id: "I103",
    ts: 2023-12-21T18:18:16Z,
    action: "request",
    params: {
      baddr: 194.1.1.1,
      bport: 58630,
      c_ctype: "image/png",
      caddr: 194.1.1.1,
      cin: 860,
      cout: 898,
      cport: 58630,
      duration: 0ns,
      forwarded_for: [
        192.1.1.1,
        191.1.1.1,
      ],
      laddr: 127.128.11.145,
      lport: 1,
      method: "GET",
      paddr: 172.0.0.1,
      pport: 80,
      proto: 6,
      relay_name: "wwwserver",
      response_code: 200,
      rnum: 6489895,
      s_ctype: "image/png",
      saddr: 172.0.0.1,
      sin: 898,
      sout: 819,
      sport: 80,
      status: "OK",
      url: "http://test.de/images/icons/144x144.png",
    },
  },
}

Let's now reshape the data, keeping only the fields we're interested in:

load_file "sample.txt"
read_syslog
content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")
content.params = content.params.parse_kv()
// Interpret the timestamp as a seconds since Unix epoch.
content.ts = from_epoch(content.ts * 1s)
// Interpret the duration parameter as seconds.
content.params.duration = content.params.duration * 1s
// Split the forwarded_for field into an array of IP addresses.
content.params.forwarded_for = content.params.forwarded_for.split(", ").map(x, x.ip())
this = {
  // We keep both timestamps, labelling the one from the message `time` and the
  // one from Syslog itself `forwarded_time`:
  time: content.ts,
  forwarded_time: timestamp,
  // Copy the other id and the action from the Syslog message:
  id: content.id,
  action: content.action,
  // Lastly, take all the params:
  ...content.params,
}
{
  time: 2023-12-21T18:18:16Z,
  forwarded_time: 2023-06-09T05:51:36.433Z,
  id: "I103",
  action: "request",
  baddr: 194.1.1.1,
  bport: 58630,
  c_ctype: "image/png",
  caddr: 194.1.1.1,
  cin: 860,
  cout: 898,
  cport: 58630,
  duration: 0ns,
  forwarded_for: [
    192.1.1.1,
    191.1.1.1,
  ],
  laddr: 127.128.11.145,
  lport: 1,
  method: "GET",
  paddr: 172.0.0.1,
  pport: 80,
  proto: 6,
  relay_name: "wwwserver",
  response_code: 200,
  rnum: 6489895,
  s_ctype: "image/png",
  saddr: 172.0.0.1,
  sin: 898,
  sout: 819,
  sport: 80,
  status: "OK",
  url: "http://test.de/images/icons/144x144.png",
}

Now we just need to switch from reading from a file to accepting Syslog over UDP directly, and we're done. That's as simple as replacing load_file "…" with load_udp "…" or load_tcp "…".