mongoexport - Leaf Level - JSON to CSV conversion - egrep not working with multiple patterns using "|" pipe or with -f option

AKS

Why egrep is not giving me all the matching entries?

This is my simple JSON blob:

[nukaNUKA@dev-machine csv]$ cat jsonfile.json
{"number": 303,"projectName": "giga","queueId":8881,"result":"SUCCESS"}

This is my pattern file (so that I don't scare the editor):

[nukaNUKA@dev-machine csv]$ cat egrep-pattern.txt
\"number\":.*\"projectName
\"projectName\":.*,\"queueId
\"queueId\":.*,\"result
\"result\":\".*$

This is egrep/grep command for individual searches, which works!:

[nukaNUKA@dev-machine csv]$ egrep -o "\"number\":.*\"projectName" jsonfile.json
"number": 303,"projectName
[nukaNUKA@dev-machine csv]$ egrep -o "\"projectName\":.*,\"queueId" jsonfile.json
"projectName": "giga","queueId
[nukaNUKA@dev-machine csv]$ egrep -o "\"queueId\":.*,\"result" jsonfile.json
"queueId":8881,"result
[nukaNUKA@dev-machine csv]$ egrep -o "\"result\":\".*$" jsonfile.json
"result":"SUCCESS"}

So, wth this didn't work? I don't wear glasses, yes.

[nukaNUKA@dev-machine csv]$ egrep -o "\"number\":.*\"projectName|\"projectName\":.*,\"queueId|\"queueId\":.*,\"result|\"result\":\".*$" jsonfile.json
"number": 303,"projectName
"queueId":8881,"result
[nukaNUKA@dev-machine csv]$ egrep -o -f egrep-pattern.txt jsonfile.json
"number": 303,"projectName
"queueId":8881,"result
[nukaNUKA@dev-machine csv]$

I have a complex nested JSON blob and because everything is unstructured, it seems like, I can't use JQ or JSONV or anything other Python script (as the data that I'm looking for is stored in arrays containing 1 dictionary entries (key=value) with same key names for what I'm looking for (ex: { "parameters": [ { "name": "jobname", "value": "shenzi" }, { "name": "pipelineVersion", "value": "1.2.3.4" }, ...so on..., ... ]) and the index for jobname and pipelineVersion or similar parameter names is not at the same index[X] location in every JSON entry I have.

Worst case, I can add conditional checks to see if the key at every index matches, jobname etc and then I get those fields what I looking for, but then, there are hundreds of such fields that I want to grab. I don't want to hard code those if possible.

I thought as my JSON entry is per line, I can simply write a cool patterns (ugly I know) but at least then I don't need to worry about the conditional code or just use BASH/sed/tr/cut power to get what I need but it seems like egrep -f or -o ... didn't work as shown above.

Sample JSON blob object (from one Jenkins job). There are different Jenkins build job's JSON blob entries (each having different JSON structures, parameters etc) in a single JenkinsJobsBuild collection in MongoDB. See attached for sample JSON blob object.

{
  "_id": {
    "$oid": "5120349es967yhsdfs907c4f"
  },
  "actions": [
    {
      "causes": [
        {
          "shortDescription": "Started by an SCM change"
        }
      ]
    },
    {

    },
    {
      "oneClickDeployPossible": false,
      "oneClickDeployReady": false,
      "oneClickDeployValid": false
    },
    {

    },
    {

    },
    {

    },
    {
      "cspec": "element * ...\/MyProject_latest_int\/LATESTnelement * ...\/MyProject_integration\/LATESTnelement \/vobs\/some_vob\/gigi \/main\/myproject_integration\/MyProject_Slot_0_maint_int\/LATESTnelement * ...\/myproject_integration\/LATESTnelement \/vobs\/some_vob \/main\/LATEST",
      "latestBlsOnConfiguredStream": null,
      "stream": null
    },
    {

    },
    {
      "parameters": [
        {
          "name": "CLEARCASE_VIEWTAG",
          "value": "jenkins_MyProject_latest"
        },
        {
          "name": "BUILD_DEBUG",
          "value": false
        },
        {
          "name": "CLEAN_BUILD",
          "value": true
        },
        {
          "name": "BASEVERSION",
          "value": "7.4.1"
        },
        {
          "name": "ARTIFACTID",
          "value": "lowercaseprojectname"
        },
        {
          "name": "SYSTEM",
          "value": "myprojectSystem"
        },
        {
          "name": "LOT",
          "value": "02"
        },
        {
          "name": "PIPENUMBER",
          "value": "7.4.1.303"
        }
      ]
    },
    {

    },
    {

    },
    {
      "parameters": [
        {
          "name": "DESCRIPTION_SETTER_DESCRIPTION",
          "value": "lowercaseprojectname_V7.4.1.303"
        }
      ]
    },
    {

    },
    {

    },
    {

    },
    {

    }
  ],
  "artifacts": [

  ],
  "building": false,
  "builtOn": "servername",
  "changeSet": {
    "items": [
      {
        "affectedPaths": [
          "vobs\/some_vob\/myproject\/apps\/app1\/Java\/test\/src\/com\/giga\/highlevelproject\/myproject\/schedule\/validation\/SomeActivityTest.java"
        ],
        "author": {
          "absoluteUrl": "http:\/\/11.22.33.44:8080\/user\/hitj1620",
          "fullName": "name1, name2 A"
        },
        "commitId": null,
        "date": {
          "$numberLong": "1489439532000"
        },
        "dateStr": "13\/03\/2017 21:12:12",
        "elements": [
          {
            "action": "create version",
            "editType": "edit",
            "file": "vobs\/some_vob\/myproject\/apps\/app1\/Java\/test\/src\/com\/giga\/highlevelproject\/myproject\/schedule\/validation\/SomeActivityTest.java",
            "operation": "checkin",
            "version": "\/main\/MyProject_latest_int\/2"
          }
        ],
        "msg": "",
        "timestamp": -1,
        "user": "user111"
      }
    ],
    "kind": null
  },
  "culprits": [
    {
      "absoluteUrl": "http:\/\/11.22.33.44:8080\/user\/nuka1620",
      "fullName": "nuka, Chuck"
    }
  ],
  "description": "lowercaseprojectname_V7.4.1.303",
  "displayName": "#303",
  "duration": 525758,
  "estimatedDuration": 306374,
  "executor": null,
  "fullDisplayName": "MyProject \u00bb MyProject-build #303",
  "highlevelproject_metrics_source_url": "http:\/\/11.22.33.44:8080\/job\/MyProject\/job\/MyProject-build\/303\/\/api\/json",
  "id": "303",
  "keepLog": false,
  "number": 303,
  "projectName": "MyProject-build",
  "queueId": 8201,
  "result": "SUCCESS",
  "timeToRepair": null,
  "timestamp": {
    "$numberLong": "1489439650307"
  },
  "url": "http:\/\/11.22.33.44:8080\/job\/MyProject\/job\/MyProject-build\/303\/"
}
Jonathan Leffler

When the regexes are in a file, you don't have to escape double quotes; you don't have to fight to get your double quotes past the shell.

"number":.*"projectName
"projectName":.*,"queueId
"queueId":.*,"result
"result":".*$

When that's fixed, I get:

$ egrep -o -f egrep-pattern.txt jsonfile.json 
"number": 303,"projectName
"queueId":8881,"result
$

The trouble now is, I think, that you've consumed the projectName with the first pattern, so the others don't get a chance to match it. Change the patterns to read up to a comma and you can get better results:

"number":[^,]*
"projectName":[^,]*
"queueId":[^,]*
"result":".*$

yields:

"number": 303
"projectName": "giga"
"queueId":8881
"result":"SUCESS"}

You could try to be more delicate, but you rapidly reach a point where a JSON-aware tool becomes more sensible. Commas in a string value would mess up the modified regexes, for example. (So, if the project name was "Giga, if not Tera", you'd have problems.)


Matching more general JSON name:value notation

As long as you're looking for simple "key":"quoted value" objects, you can use the following grep -E (aka egrep) command:

grep -Eoe '"[^"]+":"((\\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"])*"' data

Given the JSON-like data (in the file called data):

{"key1":"value","key2":"value2 with \"quoted\" text","key3":"value3 with \\ and \/ and \f and \uA32D embedded"}

that script produces:

"key1":"value"
"key2":"value2 with \"quoted\" text"
"key3":"value3 with \\ and \/ and \f and \uA32D embedded"

You can upgrade it to handle almost any valid JSON "key":value by using:

grep -Eoe '"[^"]+":(("((\\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"])*")|true|false|null|(-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?))' data

With a new data file containing:

{"key1":"value","key2":"value2 with \"quoted\" text"}
{"key3":"value3 with \\ and \/ and \f and \uA32D embedded"}
{"key4":false,"key5":true,"key6":null,"key7":7,"key8":0,"key9":0.123E-23}
{"key10":10,"key11":3.14159,"key12":0.876,"key13":-543.123}

the script produces:

"key1":"value"
"key2":"value2 with \"quoted\" text"
"key3":"value3 with \\ and \/ and \f and \uA32D embedded"
"key4":false
"key5":true
"key6":null
"key7":7
"key8":0
"key9":0.123E-23
"key10":10
"key11":3.14159
"key12":0.876
"key13":-543.123

You can follow the railroad diagrams in the outline JSON specification at http://json.org to see how I created the regex.

It could be enhanced by the judicious addition of [[:space:]]* in places where spaces are permitted but not required — before the key string, before the colon, after the colon (you could add it after the value too, but you probably don't want that).

Another simplification that I've taken is that the key doesn't allow for the various escape characters that the value string does. You could repeat that.

And, of course, this only works for 'leaf' name:value pairs; if a value is itself an object {…} or an array […], this doesn't handle the value as a whole.

However, this just goes to emphasize that it gets very messy very quickly and you would be better off using a special-purpose JSON query tool. One such tool is jq, as mentioned in a comment to the main query.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related